Recently, Clikthrough launched a new interactive, custom video player solution for Euro RSCG and their client, Sony, to support the launch of Sony’s new PlayStation game, MAG. The original project scope included a customized video player with support for 12 interactive videos. Once we got into the project, we were also asked to support 15 languages and closed captioning. We are proud to say that Clikthrough now supports over 20 different languages, and our player is the only one on the market that supports interactivity in this many languages. For other companies and web development professionals interested in this subject, we are sharing what we learned as we worked through this project.
What is Internationalization?
Internationalization (often shortened to i18n, for the 18 letters between the first “i” and the last “n”) is a work in progress for a lot of web technology companies. Very few sites have a formal approach to supporting internationalization.
Key Terms:
I18n – internationalization – covers changing the language to match that of the user. Keep in mind that to overhaul an existing system you’ll need to replace navigation, error messages, tooltips and other helpers, and (the big one) filter or translate all content.
L10n – localization – identifies the exact language and cultural settings for a user. Each locale includes that region’s formatting of:
• Dates
• Times
• Numbers
• Currency (both value and representation/format)
Locale – A locale is best thought of as a region (usually within a country). For instance, in Switzerland there are different regions within the country where German, French and Italian are the commonly spoken “local” languages.
ISO – The ISO (International Organization for Standardization) is the world’s largest developer and publisher of International Standards. Throughout this post, we make heavy reference to the ISO and the standards that it has published regarding internationalization.
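To make the localization term above concrete, here is a minimal Python sketch of how dates and numbers change with the locale. The CONVENTIONS table and the function names are our own hypothetical illustration, not part of the Clikthrough player or of any library; a real system would take these rules from a locale database rather than a hand-written table, and currency is left out for brevity.

```python
from datetime import date

# Hypothetical, hand-written convention table for illustration only.
# A production system would load these rules from a locale database.
CONVENTIONS = {
    "en-US": {"date": "%m/%d/%Y", "thousands": ",", "decimal": "."},
    "de-DE": {"date": "%d.%m.%Y", "thousands": ".", "decimal": ","},
    "ja-JP": {"date": "%Y/%m/%d", "thousands": ",", "decimal": "."},
}


def format_date(value, locale_code):
    """Format a date using the date pattern of the given locale."""
    return value.strftime(CONVENTIONS[locale_code]["date"])


def format_number(value, locale_code):
    """Format a number with two decimals using the locale's separators."""
    conv = CONVENTIONS[locale_code]
    # Format with US-style separators first, then swap in the locale's own.
    whole, fraction = f"{value:,.2f}".split(".")
    return whole.replace(",", conv["thousands"]) + conv["decimal"] + fraction


if __name__ == "__main__":
    launch = date(2010, 1, 26)
    for code in CONVENTIONS:
        print(code, format_date(launch, code), format_number(1234567.891, code))
```

The same date and the same number come out differently for each locale, which is why the locale has to be known before anything is rendered for the user.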
Confusion in the Implementation of ISO Codes:
As web users, we see many internationalized websites with language/locale codes in their URLs. In some cases the codes are added to the URL (in the path, the query string or a sub domain), and in other cases they are stored in the session. Some examples are:
-http://translate.google.com/translate_t?ie=UTF-8&text=test&sl=en&tl=zh-TW – uses correct ISO code format w/ a dash
-http://www.mag.com/ja_JP/mag.html - correct capitalization, using _ (underscore)
-http://de.wikipedia.org/ - they’ve included language code as a sub domain (de – Deutsch, German)
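-http://zh.wikipedia.org/wiki/ - language code only, as a sub domain (zh – Chinese); “zh” is the valid ISO 639 code for Chinese, but on its own it does not say which variant is meant (Simplified Chinese as used in mainland China would be zh-CN)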
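The three places a code shows up in the examples above (query string, path and sub domain) can be checked with a short Python sketch. This is only a heuristic under our own assumptions: the function name locale_hints, the default query key “tl” (taken from the Google Translate example) and the two-letter pattern are hypothetical, and any two-letter path segment or host label would be reported as a hint.

```python
import re
from urllib.parse import parse_qs, urlparse

# Two letters, optionally followed by "-" or "_" and two more letters.
# This is a deliberately narrow, assumed pattern for illustration.
TAG = re.compile(r"^[A-Za-z]{2}(?:[-_][A-Za-z]{2})?$")


def locale_hints(url, query_keys=("tl",)):
    """Return the places in a URL where something shaped like a code appears."""
    parts = urlparse(url)
    hints = {}

    # Sub domain: the first host label, when the host has more than two labels.
    labels = parts.hostname.split(".") if parts.hostname else []
    if len(labels) > 2 and TAG.match(labels[0]):
        hints["subdomain"] = labels[0]

    # Path: the first non-empty path segment.
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments and TAG.match(segments[0]):
        hints["path"] = segments[0]

    # Query string: the first configured key that holds a code-shaped value.
    query = parse_qs(parts.query)
    for key in query_keys:
        if key in query and TAG.match(query[key][0]):
            hints["query"] = query[key][0]
            break

    return hints


if __name__ == "__main__":
    print(locale_hints("http://www.mag.com/ja_JP/mag.html"))
    print(locale_hints("http://de.wikipedia.org/"))
```

The sketch only locates a candidate code; it does not check that the code is a registered language or country.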
Confusion in the Definition of ISO Codes:
Wikipedia (http://en.wikipedia.org/wiki/Locale) defines the correct format as [language[_territory][.codeset][@modifier]]. For example, Australian English using the UTF-8 encoding is en_AU.UTF-8.
IBM (http://publib.boulder.ibm.com/infocenter/zos/v1r9/index.jsp?topic=/com.ibm.zos.r9.cbcpx01/locnamc.htm) defines the convention as “<Language>-<Territory>.<Codeset>”, yet the language/territory separator in all of its examples is an underscore (_). It also lists language codes with the first character capitalized (which is against the standards), and the documentation contains an error in the language/territory separator: one entry is listed as “Li-LT” (using a dash).
W3C (http://www.w3.org/TR/REC-html40/struct/dirlang.html#langcodes) states that the language attribute’s value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes. RFC 1766 (published in March 1995) defines and explains the language codes that must be used in HTML documents. Briefly, language codes consist of a primary code and a possibly empty series of subcodes:
language-code = primary-code ( “-” subcode )*
Here are some sample language codes: “en” is English and “en-US” is the U.S. version of English.
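The grammar above translates almost directly into a regular expression. The Python sketch below assumes the RFC 1766 rule that the primary code and each subcode are one to eight ASCII letters; the function name split_language_code is our own, and the check is purely about shape, not about whether a code is actually registered.

```python
import re

# language-code = primary-code ( "-" subcode )*
# Assumption: each part is 1 to 8 ASCII letters, as in RFC 1766.
LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def split_language_code(code):
    """Split a language code into its primary code and a list of subcodes."""
    if not LANGUAGE_CODE.fullmatch(code):
        raise ValueError(f"not a well-formed language code: {code!r}")
    primary, *subcodes = code.split("-")
    return primary, subcodes


if __name__ == "__main__":
    print(split_language_code("en"))
    print(split_language_code("en-US"))
```

Note that a code written with an underscore, such as en_US, is rejected by this grammar, which is exactly the mismatch discussed in this post.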
The Actual ISO/RFC Standards:
The original language code standard was RFC 1766. That standard was superseded in January 2001 by RFC 3066 (http://tools.ietf.org/html/rfc3066). RFC 3066 was superseded in September 2006 by RFC 4646 & 4647.
Popular vs Standards-Based Implementations:
Here in 2010, you would think that this issue would be resolved, given that the standards have been in place for over 10 years. Yet there still seems to be confusion over the standard way of writing localization/language codes. In all the language code specs dating from the 1990s, a “-” (dash) separates the language and the territory (country) code. However, in many modern implementations seen on the web, people use a “_” (underscore) to separate the language and country codes. The simple fix for us was to accept either separator in all externally facing interfaces/APIs (Application Programming Interfaces).
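Accepting either separator can be as small as one normalization function at the edge of the system. The Python sketch below is an illustration of the idea rather than the code used in the Clikthrough player; the function name normalize_locale_code is hypothetical, and it assumes codes are limited to a two-letter language with an optional two-letter territory.

```python
import re

# Two-letter language, then optionally "-" or "_" and a two-letter territory.
_CODE = re.compile(r"([A-Za-z]{2})(?:[-_]([A-Za-z]{2}))?")


def normalize_locale_code(code):
    """Accept 'ja_JP', 'ja-jp' or 'JA-JP' and return the canonical 'ja-JP'."""
    match = _CODE.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"unrecognized locale code: {code!r}")
    language, territory = match.groups()
    if territory is None:
        return language.lower()
    return f"{language.lower()}-{territory.upper()}"


if __name__ == "__main__":
    for raw in ("ja_JP", "EN-us", "de"):
        print(raw, "->", normalize_locale_code(raw))
```

Normalizing once on input means everything behind the interface only ever sees one spelling of each code.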
Setting Things Straight:
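RFC 1766 and its successors (RFC 3066, then RFC 4646 & 4647) separate the language and the territory with a dash. The optional codeset and modifier parts come from the locale format quoted earlier rather than from the RFCs. Putting the two together, the code is written as follows: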
[language[-territory][.codeset][@modifiers]]
In the format above, the language is two lowercase characters as described in ISO 639. For example, en for English, ja for Japanese, and zh for Chinese.
Further, the territory (country) is two uppercase characters as described in ISO 3166. For example, GB for United Kingdom (Great Britain), KR for Republic of Korea (South Korea), CN for China.
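As a rough illustration, the format and the two casing rules above can be captured in a small Python parser. The LocaleName type and the parse_locale_name function are hypothetical names of our own, and the character sets allowed for the codeset and modifier are assumptions; the sketch also tolerates an underscore in place of the dash.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str             # two lowercase letters (ISO 639)
    territory: Optional[str]  # two uppercase letters (ISO 3166)
    codeset: Optional[str]    # for example "UTF-8"
    modifier: Optional[str]   # for example "euro"


# [language[-territory][.codeset][@modifiers]]
_LOCALE = re.compile(
    r"(?P<language>[A-Za-z]{2})"
    r"(?:[-_](?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?"
)


def parse_locale_name(name):
    """Parse a locale name and apply the casing rules described above."""
    match = _LOCALE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized locale name: {name!r}")
    territory = match.group("territory")
    return LocaleName(
        language=match.group("language").lower(),
        territory=territory.upper() if territory else None,
        codeset=match.group("codeset"),
        modifier=match.group("modifier"),
    )


if __name__ == "__main__":
    print(parse_locale_name("en_AU.UTF-8"))
    print(parse_locale_name("ja"))
```

Every part after the language is optional, so a bare “ja” parses just as well as a fully qualified name.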
As you can see, there are a lot of ways to go wrong with internationalization, and language and locale codes are just the first part. Stay tuned for more posts on internationalization here in the future. To see an example of the internationalization work we have done, check out the Clikthrough player deployed on the www.MAG.com site. Clikthrough is the only interactive video player on the market that supports over 20 languages.