Do your regional websites talk funny?
By Chen Hui Jing / @hj_chen
Hello everybody. Welcome to CityJS. Thank you for being here on this lovely Friday morning. I'm not sure why the organisers put me in the opening slot because there's really not a lot of JavaScript going on in my talk. So maybe that's why.
All of you clearly paid money to see the hard-hitting hardcore JavaScript rockstars later in the day, but you need to warm up first. That's what opening acts are for. Like how Paramore opened for Taylor Swift, right?
😏 Singlish-o-meter 🤔
Because this is the Singapore edition of CityJS, I feel obliged to stay true to the culture. For the benefit of the audience who are unfamiliar with English spoken in Singapore, there is a range. Best explained in a graphical form. FYI, this is a legitimate HTML range input. Come ask me how it's built later.
So today, I will try to keep the level somewhere in the middle, between the Queen's English and extreme Singlish. In general, the “R” sound sometimes becomes “L”, so ready may sound like leddy. Also, the “ver” sound may often just be omitted. Like, government will become gahmen. Or nevermind will become nehmind. Also, grammar is optional.
Other than that, it should be mostly comprehensible. If you feel a lack of confidence, please sit next to a friendly Singlish-speaking attendee. They will decipher all of this for you.
🇲🇾
👾
🏀
🚲
🖌️
👟
💻
🖊️
🎙
🐈‍⬛
🧗
🏳️‍🌈
My name is Hui Jing. It's a Chinese name, so our family name comes first. And these are my representative emojis.
I've been gainfully employed as a web developer for more than a decade. Of course not with the same employer, but regardless of who it is, an employer who pays your CPF on time is a good employer.
So right now, the Interledger Foundation is paying my CPF. They're very punctual, I love it. We are a non-profit organisation working on building a payments network based on the TCP/IP protocol, just like the internet. The goal is to make paying someone as easy as sending an email.
Anyway, today you'll hear me talk to you about stuff mostly related to i18n, which is short for internationalisation. Maybe you'll learn something, maybe you won't. It's my hope you will be at least mildly entertained, and if you fall asleep, just try not to snore too loudly.
Internationalisation is the design and development of a product, application or document that enables easy localisation for target audiences that vary in culture, region, or language.
There is no canonical set-in-stone definition of i18n, but the W3C does offer this guidance. It is designing and developing in a way that enables easy localisation for specific audiences. Localisation covers a broad range of customisations including but not limited to: numerals, time and date formats, currencies, symbols, icons and colours, even textual and graphical references.
This implies that i18n is a lot more than simply translating the content on a site to different languages. There are nuances to the presentation of that content which will affect the experience of a native speaker using your site.
I've had the opportunity to work on a number of different projects that required internationalisation. Even though internationalisation is more than just translation, getting the translation part correct is really important. For this talk, because of time constraints, I picked a couple things that I encountered personally.
The “Billion” problem
The Chinese numeral system (also used by Japanese and Korean) has specific words for large numbers as per the traditional Chinese grouping of 10,000.
Number | Simp. Chinese | Trad. Chinese | Japanese | Korean
10 | 十 (shí) | 十 | 十 (juu) | 십 (ship)
100 | 百 (bǎi) | 百 | 百 (hyaku) | 백 (baek)
1000 | 千 (qiān) | 千 | 千 (sen) | 천 (cheon)
10,000 | 万 (wàn) | 萬 | 万 (man) | 만 (man)
100,000,000 | 亿 (yì) | 億 | 億 (ichioku) | 억 (eok)
1,000,000,000,000* | 兆 (zhào) | 兆 | 兆 (icchou) | 조 (jo)
Let's talk about numbers. Specifically large numbers. I am Chinese, so I can only talk about this numeral system here. But there are also other languages that have specific words for powers of the base. Like India has lakh for 100 thousand and crore for 10 million.
For us, we have a word for 10,000, 万, and after that each new word is another 4 zeroes up. So the next word is for 100 million, which is 亿. 1 billion is 9 zeroes, which falls in between, so we don't have a specific word for that. We say 十亿, which is ten 亿.
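To make the grouping concrete, here is a minimal sketch in plain JavaScript. The helper name toMyriadUnit and the unit table are made up for illustration, and it only knows the two Simplified Chinese units 万 and 亿 from the table.

```javascript
// Hypothetical unit table: Simplified Chinese units that step up by 10,000.
const MYRIAD_UNITS = [
  { value: 1e8, symbol: '亿' }, // 100,000,000
  { value: 1e4, symbol: '万' }, // 10,000
];

// Express a whole number as a multiple of the largest unit that fits.
function toMyriadUnit(n) {
  for (const { value, symbol } of MYRIAD_UNITS) {
    if (n >= value) return `${n / value}${symbol}`;
  }
  return String(n);
}

console.log(toMyriadUnit(10_000));          // 1万
console.log(toMyriadUnit(1_000_000_000));   // 10亿 (one billion is "ten 亿")
console.log(toMyriadUnit(200_000_000_000)); // 2000亿
```

The point is that "billion" never shows up as a unit. Anything in the billions is counted in 亿.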
Translation strings
import { useTranslation } from "react-i18next";

function HeaderComponent() {
  // `t` looks up a key in the 'common' namespace for the active language.
  const { t } = useTranslation('common');
  return <h1>{t('welcome.title')}</h1>;
}
{
  "welcome": {
    "title": "Welcome to the app!"
  }
}

{
  "welcome": {
    "title": "Laipni lūdzam lietotnē!"
  }
}
Translation strings are a way for us to mark strings to be extracted from the source code to be passed over to translators to translate.
Most languages and frameworks follow a similar pattern, where the translation strings are all kept in separate data files, could be JSON, could be YAML. Then those strings are used in your application code via t functions.
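If you have never looked inside a t function, this is roughly the idea. It is a toy sketch, not how i18next is actually implemented. The names resources, lookup and createT are made up for illustration, and the strings are the English and Latvian ones from the slide.

```javascript
// Translation data, normally loaded from one JSON or YAML file per language.
const resources = {
  en: { welcome: { title: 'Welcome to the app!' } },
  lv: { welcome: { title: 'Laipni lūdzam lietotnē!' } },
};

// Walk a dotted key such as 'welcome.title' through the nested object.
function lookup(locale, key) {
  return key.split('.').reduce(
    (node, part) => (node == null ? undefined : node[part]),
    resources[locale],
  );
}

// Build a t function for one locale, falling back to English, then to the key.
function createT(locale, fallbackLocale = 'en') {
  return (key) => lookup(locale, key) ?? lookup(fallbackLocale, key) ?? key;
}

const t = createT('lv');
console.log(t('welcome.title')); // Laipni lūdzam lietotnē!
console.log(t('missing.key'));   // missing.key
```

Returning the key when nothing matches is one common choice, because a missing translation then shows up visibly on the page instead of as a blank.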
Interpolation
/* src/translations/en.json */
{
  "welcome": {
    "title": "Welcome to {{framework}}"
  }
}

/* src/translations/de.json */
{
  "welcome": {
    "title": "Willkommen bei {{framework}}"
  }
}
There will inevitably be instances where it makes sense to pass variables into these translation strings, especially for things like data, where the values need to stay consistent across languages.
You'd think this would be a relatively straight-forward affair, and I think most of the time it is. But this does depend on exactly what you're passing into the string, and how your translation flow hands this off to the folks doing the translation.
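Mechanically, interpolation is just placeholder substitution. Here is a minimal sketch for the {{name}} style used in the files above. The interpolate function is a made-up stand-in, not the library's own implementation.

```javascript
// Replace each {{name}} placeholder with the matching value.
// Unknown placeholders are left untouched so they are easy to spot.
function interpolate(template, values) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(values, name)
      ? String(values[name])
      : match);
}

const values = { framework: 'React' };
console.log(interpolate('Welcome to {{framework}}', values));     // Welcome to React
console.log(interpolate('Willkommen bei {{framework}}', values)); // Willkommen bei React
```

Notice that the code has no idea what the words around the placeholder mean. It drops the value in wherever the translator left the placeholder, which is exactly why what you pass in matters.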
What translators see
Not this:
But this:
Over %{total_stores} businesses in %{total_countries} countries around the world have made over $%{total_gmv_billions} billion USD in sales using Shopify
For example, in my previous place of employment, we did it like this. There was a set of stats that should be consistent throughout the entire site and across all languages, so those were substituted into the translation strings like so.
The source string is what the different translation teams see on the other end of the system, and they return the translations accordingly.
This was not a problem for the most part. In fact, there was only 1 instance where a problem arose, because the number returned for %{total_gmv_billions} was 200. I don't know who made the decision, but I'm betting the person only spoke English? I don't know, man.
The end result
This was what we got back from the translators. For those of you who don't speak Korean or Chinese, let me explain the problem. The following translations have expressed the value of 200 billion as 200 ten hundred-million.
The problem
Saying that the GMV is 200 ten hundred-million dollars is quite a glaring grammatical error, if you speak the language.
It's definitely not the translators' fault that the end result is like this, because if I put myself in their shoes and I see a variable for X billion, I would intuitively think it's like, oh, maybe 5 or 6 billion. They didn't know the number was 200.
But regardless, a mistake is a mistake. Users don't care why, they only see that we made a rookie error.
But why didn't Japan have this problem? 🤯
The Japanese team had worked around this issue previously by editing the locale files that reference our total GMV numbers like so:
%{total_gmv_billions}0億米ドル
Our Japan team spotted this error and made the adjustment of replacing 十 with a 0.
It was actually a Korean language reviewer who was meant to review some other content that flagged this to us. They were like, I know you told us only to look at the designated section, but this is too glaring to ignore.
When I got the ticket, I was like, the other CJK locales must have this also. But only the Chinese site got. Japan was fine. Because back then, our Japan team was quite stacked with native Japanese speakers. And they manually went into the translation file to replace the character for 十 (juu) with a 0. Perfect.
See solution, steal it
This will be how it looks for each respective language:
%{total_gmv_billions}0억 달러(USD)
%{total_gmv_billions}0 亿美元
%{total_gmv_billions}0 億美元
So of course I just stole it for the fix. Another thing that tends to get overlooked is spaces between words or characters. Again, I can only talk about the CJK languages because I don't know about others.
For Chinese, there is no space between characters, but there is usually a space between a Chinese character and a non-Chinese character. Korean has its own spacing rules, called 띄어쓰기 (ttuieosseugi).
This is important when dealing with translation strings and interpolation because sometimes the spaces literally get lost in translation.
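Here is a small sketch that renders the strings from the slide with the real value, so you can see both the fix and the spacing. The fill helper and the locale keys are made up for illustration. The three fixed templates are copied from the slide, and the broken one is my reconstruction of what "billion" looked like when it was translated literally.

```javascript
// Fill %{name} placeholders, the style used in the strings above.
function fill(template, values) {
  return template.replace(/%\{(\w+)\}/g, (match, name) =>
    name in values ? String(values[name]) : match);
}

const stats = { total_gmv_billions: 200 };

// Assumed reconstruction of the broken string: "billion" translated as 十亿.
const broken = '%{total_gmv_billions}十亿美元';

// Fixed templates, copied from the slide. Note the spaces inside each string.
const fixed = {
  ko: '%{total_gmv_billions}0억 달러(USD)',
  'zh-Hans': '%{total_gmv_billions}0 亿美元',
  'zh-Hant': '%{total_gmv_billions}0 億美元',
};

console.log(fill(broken, stats)); // 200十亿美元
for (const [locale, template] of Object.entries(fixed)) {
  console.log(locale, fill(template, stats));
}
// ko 2000억 달러(USD)
// zh-Hans 2000 亿美元
// zh-Hant 2000 億美元
```

Comparing the rendered output against an exact expected string, spaces included, is one cheap way to notice when a space goes missing after a round of translation.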
Lessons and best practices
Interpolate with caution
Do Not Manually Construct Sentences or Manipulate Text in Code
Let i18n Libraries Handle the Hard Stuff
Lessons From Linguistics: i18n Best Practices for Front-End Developers by Lucas Huang
Interpolation is super useful but needs to be done very deliberately. Other than what I described earlier, things like pluralization can also become an issue. There are six plural categories identified in the Unicode Common Locale Data Repository (CLDR): zero, one, two, few, many and other. Some languages also have gendered nouns.
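JavaScript has the plural categories built in through Intl.PluralRules, so you don't have to write the rules yourself. A minimal sketch for English, where the storeMessages table and the formatStores function are made up for illustration:

```javascript
// Hypothetical English message table, keyed by plural category.
const storeMessages = {
  one: '{{count}} store',
  other: '{{count}} stores',
};

function formatStores(count) {
  // Intl.PluralRules maps a number to zero | one | two | few | many | other.
  const category = new Intl.PluralRules('en').select(count);
  const template = storeMessages[category] ?? storeMessages.other;
  return template.replace('{{count}}', String(count));
}

console.log(formatStores(1)); // 1 store
console.log(formatStores(5)); // 5 stores
console.log(formatStores(0)); // 0 stores

// Other locales use more categories. Arabic, for instance, uses all six.
console.log(new Intl.PluralRules('ar').resolvedOptions().pluralCategories);
```

English only needs one and other, which is why a count === 1 check in code feels fine until the string gets translated.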
The point is, hard-coding word order or trying to manipulate text at the code level tends to make translation very difficult to get right. The folks that built and maintain the i18n libraries have done a lot of the tricky work for us, so it's probably a good idea to use those tried and tested packages. For JavaScript, we have i18next, which exists in many popular framework flavours.
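Besides libraries like i18next, the built-in Intl API also knows about locale-specific number units. This sketch shows one possible way to avoid hand-writing "billion" into a string at all, not what we actually shipped. The compact helper is a made-up name, and the exact output depends on the locale data bundled with your runtime, so I'm not listing it here.

```javascript
// Format a number in compact notation, letting the locale data pick the unit.
function compact(locale, value) {
  return new Intl.NumberFormat(locale, { notation: 'compact' }).format(value);
}

const gmv = 200_000_000_000; // 200 billion, the figure from the earlier story

for (const locale of ['en', 'zh-Hans', 'zh-Hant', 'ja', 'ko']) {
  console.log(locale, compact(locale, gmv));
}
```

Run it and compare the CJK output with the English one. The unit changes per locale without anyone having to remember the 10,000 grouping.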
The next story involves Vietnamese characters. Some of you might have heard of the acronym CJK, which I mentioned earlier, but sometimes the acronym is CJKV, which includes Vietnamese as well. That's because historically, the Vietnamese writing system used Chinese characters alongside original Vietnamese characters.
The Vietnamese alphabet is spread across several non-contiguous Unicode ranges:
Basic Latin {U+0000..U+007F}
Latin-1 Supplement {U+0080..U+00FF}
Latin Extended-A, -B {U+0100..U+024F}
Latin Extended Additional {U+1E00..U+1EFF}
Combining Diacritical Marks {U+0300..U+036F}
The Vietnamese đồng currency symbol is ₫ (U+20AB)
Unicode & Vietnamese Legacy Character Encodings
Vietnamese is interesting, because the current writing system uses the 29-letter Latin-script based Vietnamese alphabet. It is tricky because the letters of that alphabet sit in several non-contiguous Unicode ranges. Because designing fonts is a lot of work, most fonts are not going to support every single Unicode range in the universe. Even fonts for Latin-based scripts probably only cover up to Latin-1 Supplement, unless they explicitly want to support specific languages.
Missing Vietnamese glyphs
So when choosing web fonts for your project, make sure to check the font's language support, otherwise, you might end up with the situation where missing glyphs look very obvious on your site. This situation occurs because when the browser cannot find a corresponding glyph in your chosen font to represent a character, it will use a fallback font that has that glyph.
If your fallback font doesn't really match well to your chosen font, then it will become like this. And I am aware that some people cannot tell if there is a problem here. But the point is I can tell, it's glaring to me, and I think this shouldn't happen. It looks unprofessional.
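One rough way to sanity-check coverage is to compare the characters in your content against the Unicode ranges a font claims to support. This is only a sketch. The uncoveredChars helper is made up, and LATIN_1_ONLY stands in for a hypothetical font that stops at Latin-1 Supplement. For a real font, get the ranges from the font itself or from its documentation.

```javascript
// Coverage of a hypothetical font: Basic Latin and Latin-1 Supplement only.
const LATIN_1_ONLY = [[0x0000, 0x00ff]];

// The ranges from the Vietnamese list above, plus the đồng sign.
const VIETNAMESE_RANGES = [
  [0x0000, 0x007f], // Basic Latin
  [0x0080, 0x00ff], // Latin-1 Supplement
  [0x0100, 0x024f], // Latin Extended-A, -B
  [0x1e00, 0x1eff], // Latin Extended Additional
  [0x0300, 0x036f], // Combining Diacritical Marks
  [0x20ab, 0x20ab], // ₫
];

// List the characters in `text` that fall outside every given range.
function uncoveredChars(text, ranges) {
  const missing = new Set();
  for (const char of text.normalize('NFC')) {
    const codePoint = char.codePointAt(0);
    const covered = ranges.some(([start, end]) => codePoint >= start && codePoint <= end);
    if (!covered) missing.add(char);
  }
  return [...missing];
}

console.log(uncoveredChars('Cảm ơn', LATIN_1_ONLY));      // [ 'ả', 'ơ' ]
console.log(uncoveredChars('Cảm ơn', VIETNAMESE_RANGES)); // []
```

Those two characters are the ones that would get picked up by the fallback font, which is how you end up with mismatched glyphs in the middle of a word.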
The :lang() pseudo-class
:lang(vi) {
  font-family: -apple-system, BlinkMacSystemFont, sans-serif;
}
Some products have commissioned fonts, and those are really quite expensive, so you're probably going to want to use them in as many places as possible. And that's totally justifiable. So we can use an alternative font stack only for those languages that your font doesn't support. CSS lets you do that.
Basic web development best practice is that you must declare a lang attribute. If you want to know why, come ask me later, I explain to you. Assuming you (or most likely your framework) dutifully set the lang attribute, you can use this pseudo-class to target only Vietnamese content. And provide an appropriate alternative font stack that fits your branding and all that.
Issue with Apple SD Gothic Neo
:lang(ko) {
  font-family: 'Work Sans', -apple-system, BlinkMacSystemFont, sans-serif;
}
Since we've talked about commissioned fonts, this one is a bit of an edge case, I would think. It has to do with Apple's choice of default font for Korean. Most CJK fonts will also have a Latin-based character set simply because we cannot escape from English on the web. That's just how it is. But anyway, Apple uses Apple SD Gothic Neo for Korean.
I discovered that it does not have support for the Latin-1 Supplement range, which means that we end up with a blasphemous situation where Beyoncé would not render correctly. Sacrilegious. For cases like this, specifying the Latin-based font first would fix it, even if that font doesn't have Korean language support, because then the default Korean fallback font will kick in for the Hangul characters. And everything will render properly. Talk to your designers about crafting a nice font stack.
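To see why the order of the font stack matters, here is a toy simulation of per-character fallback. The coverage tables are deliberately simplified assumptions, not the real coverage of these fonts, and hasGlyph and fontFor are made-up helpers. Real browsers do something far more involved.

```javascript
// Simplified, assumed coverage tables, for illustration only.
const FONT_COVERAGE = {
  // Assumed: Latin ranges up to Latin Extended-B, no Hangul.
  'Work Sans': [[0x0000, 0x024f]],
  // Assumed: Basic Latin plus Hangul syllables, no Latin-1 Supplement.
  'Apple SD Gothic Neo': [[0x0000, 0x007f], [0xac00, 0xd7a3]],
};

function hasGlyph(fontName, char) {
  const codePoint = char.codePointAt(0);
  return FONT_COVERAGE[fontName].some(([start, end]) => codePoint >= start && codePoint <= end);
}

// Use the first font in the stack that has a glyph for this character.
function fontFor(char, stack) {
  return stack.find((name) => hasGlyph(name, char)) ?? 'system fallback';
}

const eAcute = '\u00e9'; // é, as in Beyoncé
const hangul = '\uc698'; // 욘

const koreanOnly = ['Apple SD Gothic Neo'];
const latinFirst = ['Work Sans', 'Apple SD Gothic Neo'];

console.log(fontFor(eAcute, koreanOnly)); // system fallback
console.log(fontFor(eAcute, latinFirst)); // Work Sans
console.log(fontFor(hangul, latinFirst)); // Apple SD Gothic Neo
```

With the Latin font first, the é stays in the same font as the rest of the word, and the Hangul still lands on the Korean font.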
谢谢
감사합니다
ありがとう ございます
Cảm ơn
🙇🏻