USING UNICODE WITH PHP: Translation, localization, and 100% less mojibake, guaranteed or your users won’t come back!
The whole world uses the internet
Why is internationalization important? [Charts: content language of websites; percentage of Internet users by language]
Worse than no internationalization? Mojibake
Unicode is the solution! Well, kind of: 1. Different encodings 2. OSes have different default implementations 3. All software encodings have to match or convert. Unicode idea == simple. Unicode implementation == hard.
Back to Basics WHAT IS UNICODE?
U·ni·code /ˈyo͞oniˌkōd/ noun, COMPUTING: 1. an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.
In the Beginning, there was ASCII
Code Pages In which things get really weird…
Representing characters differently
ASCII: one character to bits in memory. A -> 0100 0001. Direct.
Unicode: one character to code point. A -> U+0041. Abstract.
But how do we represent this in memory?
Encoding Madness • UTF: Unicode Transformation Format • Maps a code point to a byte sequence
What is a character? å can be two code points (A + COMBINING RING, U+0061 U+030A) or one (A-RING, U+00E5). How long is the string? 1. In bytes? 2. In code units? 3. In code points? 4. In graphemes?
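The four questions above give four different answers for the same text. A minimal sketch, assuming the mbstring and intl extensions are loaded and the input is UTF-8; `measure_string` is a hypothetical helper name, and "code units" here means UTF-16 code units:

```php
<?php
// Hypothetical helper: measure one UTF-8 string four different ways.
function measure_string($utf8)
{
    return array(
        // Raw bytes of the UTF-8 encoding.
        'bytes'       => strlen($utf8),
        // UTF-16 code units: convert, then count 2-byte units.
        'code_units'  => (int) (strlen(mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8')) / 2),
        // Unicode code points.
        'code_points' => mb_strlen($utf8, 'UTF-8'),
        // User-perceived characters (grapheme clusters), via intl.
        'graphemes'   => grapheme_strlen($utf8),
    );
}

$precomposed = "\xC3\xA5";          // U+00E5, A-RING as one code point
$combining   = "a\xCC\x8A";         // U+0061 + U+030A COMBINING RING ABOVE
$emoji       = "\xF0\x9F\x98\x80";  // U+1F600, outside the 16-bit range
```

Both spellings of å are one grapheme, but `measure_string($combining)` reports 3 bytes and 2 code points against 2 bytes and 1 code point for `$precomposed`. The emoji is 1 code point but 2 UTF-16 code units.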
Crash course in Computer Memory • Big endian systems store the most significant byte of a number first (at the lowest address), in decreasing significance. • Little endian systems store the least significant byte first, in increasing significance.
Big Endian? Little Endian? You’re hurting my brain
Hello -> U+0048 U+0065 U+006C U+006C U+006F
00 48 00 65 00 6C 00 6C 00 6F - Big Endian
48 00 65 00 6C 00 6C 00 6F 00 - Little Endian
But.. it’s the same encoding (UTF-16) both times… Now I have a headache!
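The two byte orders are easy to see for yourself. A small sketch, assuming the iconv extension is available; `hex_bytes` is a hypothetical helper for printing:

```php
<?php
// Hypothetical helper: space-separated uppercase hex dump of raw bytes.
function hex_bytes($bytes)
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

$text = 'Hello';

// Same code points, same encoding family (UTF-16), different byte order.
echo hex_bytes(iconv('UTF-8', 'UTF-16BE', $text)), "\n"; // 00 48 00 65 00 6C 00 6C 00 6F
echo hex_bytes(iconv('UTF-8', 'UTF-16LE', $text)), "\n"; // 48 00 65 00 6C 00 6C 00 6F 00
```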
UTF-8 to the rescue!
Hello in ANSI -> 48 65 6C 6C 6F
Hello in UTF8 -> 48 65 6C 6C 6F
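ASCII text comes out byte-for-byte identical because UTF-8 uses one byte for code points below U+0080 and two to four bytes for everything else. A rough sketch of that mapping follows; `encode_code_point_utf8` is a hypothetical name, and the function skips validation (surrogates, out-of-range values) that a real encoder would need:

```php
<?php
// Hypothetical, unvalidated encoder: one Unicode code point to UTF-8 bytes.
function encode_code_point_utf8($cp)
{
    if ($cp < 0x80) {           // 1 byte: 0xxxxxxx (plain ASCII)
        return chr($cp);
    }
    if ($cp < 0x800) {          // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encode_code_point_utf8(0x48)), "\n";   // 48      (H)
echo bin2hex(encode_code_point_utf8(0xDE)), "\n";   // c39e    (Þ)
echo bin2hex(encode_code_point_utf8(0x20AC)), "\n"; // e282ac  (€)
```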
Moral of the story • Unicode is a standard, not an implementation • Text is never plain • Every string has an encoding: from a file, from a db, from an HTTP POST or GET (or PUT or file upload…) • No encoding? Start praying to the Mojibake gods… • If you do web, use UTF-8
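One way to act on "every string has an encoding" is to check input at the boundary instead of hoping. A sketch, assuming PCRE was built with UTF-8 support and iconv is loaded; `to_utf8` is a hypothetical helper, and the ISO-8859-1 fallback is only an illustrative guess (a real app should know what its clients send):

```php
<?php
// Hypothetical helper: pass valid UTF-8 through, convert anything else
// from an ASSUMED legacy encoding (the default is a guess, not a rule).
function to_utf8($input, $assumed_encoding = 'ISO-8859-1')
{
    // With the /u modifier, preg_match returns 1 for a well-formed UTF-8
    // subject and false when the subject is not valid UTF-8.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }
    return iconv($assumed_encoding, 'UTF-8', $input);
}

echo bin2hex(to_utf8("caf\xC3\xA9")), "\n"; // 636166c3a9 (already UTF-8)
echo bin2hex(to_utf8("caf\xE9")), "\n";     // 636166c3a9 (was Latin-1)
```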
Mojibake on rye with swiss. WHY DO YOU NEED UNICODE?
Helgi Þormar Þorbjörnsson
More than just UTF8 BEYOND STRINGS
I18N and L10N • Internationalization (i18n): adaptation of products for potential use virtually everywhere • Localization (l10n): addition of special features for use in a specific locale
Date and Time Formats • fr_FR: 30 juin 2009 • de_DE: 30.06.2009 • en_US: Jun 30, 2009 • And don’t forget the time zones!
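With the intl extension, the same timestamp can be rendered per locale without touching setlocale. A minimal sketch; `format_date_for_locale` is a hypothetical wrapper, and the exact output text depends on the ICU data your PHP was built against:

```php
<?php
// Hypothetical wrapper around intl's IntlDateFormatter (date only, no time).
function format_date_for_locale($locale, $timestamp, $style)
{
    $formatter = new IntlDateFormatter(
        $locale,
        $style,                     // e.g. IntlDateFormatter::LONG or ::MEDIUM
        IntlDateFormatter::NONE,    // no time portion
        'UTC'                       // be explicit about the time zone
    );
    return $formatter->format($timestamp);
}

// Noon UTC on 30 June 2009.
$when = gmmktime(12, 0, 0, 6, 30, 2009);
```

Calling `format_date_for_locale('fr_FR', $when, IntlDateFormatter::LONG)` should give a French long date along the lines of the slide, and the MEDIUM style for de_DE and en_US should give the other two shapes.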
Collation (Sorting) • The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k" • Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d" • Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e" • Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts after "Z", at the end of the alphabet
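The Lithuanian and Danish cases can be tried with intl's Collator. A sketch, assuming the intl extension and full ICU locale data are present; `sort_for_locale` is a hypothetical helper:

```php
<?php
// Hypothetical helper: return a copy of $words sorted by a locale's rules.
function sort_for_locale(array $words, $locale)
{
    $collator = new Collator($locale);
    $collator->sort($words);    // sorts in place and reindexes the array
    return $words;
}

$letters = array('k', 'y', 'i');
// English rules:    i, k, y
// Lithuanian rules: i, y, k  ("y" sorts with "i", ahead of "k")

$danish = array('z', "\xC3\xA5", 'a');  // "\xC3\xA5" is å in UTF-8
// English rules: a, å, z
// Danish rules:  a, z, å
```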
String Translation • Translation is never one to one, especially when inserting items like numbers • Some languages have different grammars and formats for the strangest things • Usually translated strings are separated into “messages” and stored, then mapped depending on the locale • Large amounts of text need even more – different tables in a database, files in directories, or more
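Numbers are where one-to-one translation falls apart first, since plural rules differ per language. A sketch using intl's MessageFormatter with ICU plural syntax; the message table and `files_message` are made up for illustration, and a real app would load messages from its own storage:

```php
<?php
// Hypothetical message lookup: one ICU pattern per locale.
function files_message($locale, $count)
{
    $patterns = array(
        'en_US' => '{0, plural, one {# file was uploaded} other {# files were uploaded}}',
        'fr_FR' => '{0, plural, one {# fichier envoyé} other {# fichiers envoyés}}',
    );
    // The locale decides which plural category a number falls into.
    return MessageFormatter::formatMessage($locale, $patterns[$locale], array($count));
}
```

For example, `files_message('en_US', 1)` selects the "one" branch and `files_message('en_US', 3)` the "other" branch. French treats 0 as singular, so `files_message('fr_FR', 0)` also selects "one", which no amount of string concatenation would get right.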
Layout and Design • Reading order: right to left, left to right, top to bottom • Word order • Cultural taboos (human images, for example)
3.5 extensions for triple the pain! HOW TO UNICODE WITH PHP
Upgrade to at least 5.3 • No, really, I’m entirely serious • If you’re not on 5.3 you’re not ready for Unicode • At all • You have far bigger issues to deal with, like no security updates • (oh, and the extensions and APIs you need either don’t exist or won’t work right)
Install the bare minimum • intl extension (bundled since PHP 5.3) • mbstring (if you need zend_multibyte support or on-the-fly conversion; for almost anything else it can do, intl does it better) • iconv extension (optional but excellent for dealing with files) • PCRE MUST have UTF-8 support (CHECK!)
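The checklist above can be verified from a script. A small sketch; `unicode_readiness` is a hypothetical name, and the PCRE probe simply tries a /u pattern against a two-byte character:

```php
<?php
// Hypothetical checklist: true means the requirement is met.
function unicode_readiness()
{
    return array(
        'php_5_3_or_newer' => version_compare(PHP_VERSION, '5.3.0', '>='),
        'intl'             => extension_loaded('intl'),
        'mbstring'         => extension_loaded('mbstring'),
        'iconv'            => extension_loaded('iconv'),
        // With UTF-8 support, "." matches the 2-byte å as ONE character.
        'pcre_utf8'        => @preg_match('/^.$/u', "\xC3\xA5") === 1,
    );
}

foreach (unicode_readiness() as $check => $ok) {
    echo str_pad($check, 18), $ok ? 'ok' : 'MISSING', "\n";
}
```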
PHP strings 101
C strings and encoding • char: 1 byte (usually 8 bits) • char *: a pointer to an array of chars stored in memory • Can handle code page encodings, although you generally need special APIs for dealing with multibyte code pages • Usually null terminated… well, unless it’s a binary string • Unix cleverly supports UTF-8 through its APIs • Windows … does not
Introducing a new type • wchar_t: C90 standard (horribly ambiguous) • Windows set it at 16 bits, and defined A and W versions of everything • Unix set it at 32 bits • C11 and C++11 add char16_t and char32_t to fix the craziness • Non-portable, and API support is sketchy • Libraries to fix this exist • Few are cross-platform • Except for ICU, which just rocks
Why do we care? • PHP talks ONLY to ANSI APIs on Windows • PHP functions assume ASCII or binary encodings (except for a few special ones) • Although most functions are now marked “binary safe” and don’t flip out on null bytes within a string, some still assume a null-terminated string • String handling functions treat strings as a sequence of single-byte characters
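The single-byte assumption is easy to trip over with any non-ASCII name. A sketch comparing the byte-oriented functions with their mbstring counterparts, assuming mbstring is loaded; `compare_byte_and_multibyte` is a hypothetical helper:

```php
<?php
// Hypothetical helper: byte-oriented vs. encoding-aware results side by side.
function compare_byte_and_multibyte($utf8)
{
    return array(
        'strlen'     => strlen($utf8),                      // counts bytes
        'mb_strlen'  => mb_strlen($utf8, 'UTF-8'),          // counts code points
        'substr_hex' => bin2hex(substr($utf8, 0, 1)),       // first BYTE, maybe half a character
        'mb_substr'  => mb_substr($utf8, 0, 1, 'UTF-8'),    // first code point
        'mb_upper'   => mb_strtoupper($utf8, 'UTF-8'),      // case mapping beyond ASCII
    );
}

// "Þorbjörnsson" written with byte escapes so the file encoding does not matter.
$name = "\xC3\x9Eorbj\xC3\xB6rnsson";
```

For `$name`, strlen reports 14 while mb_strlen reports 12, and `substr($name, 0, 1)` returns the lone byte C3, which is not valid UTF-8 on its own.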
C locales or how to make servers cry • setlocale is per process • I will repeat that: setlocale sets PER PROCESS • Locales are slightly different on different OSes • Windows does not support UTF-8 properly
What setlocale will break • gettext extension • strtoupper • strtolower • number_format • money_format • ucfirst • ucwords • strftime
INTL to the rescue! • Wrapper around the excellent ICU library • Standardized locales, set default locale per script • Number formatting • Currency formatting • Message formatting (replaces gettext) • Calendars, dates, timezones and time • Transliterator • Spoofchecker • Resource Bundles • Converters • IDN support • Graphemes • Collation • Iterators
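As an illustration of the number formatting item, here is a sketch that formats one value for two locales in the same request without calling setlocale, so no process-wide state changes. `format_number_for_locale` is a hypothetical wrapper around intl's NumberFormatter:

```php
<?php
// Hypothetical wrapper: the locale is an argument, not global process state.
function format_number_for_locale($locale, $number)
{
    $formatter = new NumberFormatter($locale, NumberFormatter::DECIMAL);
    return $formatter->format($number);
}

// With standard ICU data:
//   format_number_for_locale('en_US', 1234567.891) gives 1,234,567.891
//   format_number_for_locale('de_DE', 1234567.891) gives 1.234.567,891
```

Because each formatter carries its own locale, two users with different locales can be served by the same process without stepping on each other.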
Some intl caveats • New stuff is only in newer PHP versions • All strings in and out must be UTF-8, except for UConverter • Intl doesn’t yet support zend_multibyte • Intl doesn’t support HTTP input/output conversion • Intl doesn’t support function “overloading”
mbstring • Enables zend_multibyte support • Supports transparent HTTP input and output encoding conversion • Provides some wrappers for functionality such as strtoupper (including overloading the PHP function version…)
Iconv • Primarily for charset conversion • Output buffer handler • MIME encoding functionality • Conversion • Some string helpers: iconv_strlen, iconv_substr, iconv_strpos, iconv_strrpos • Stream filter:
stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP');
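The stream filter is the piece most people have not seen in action. A runnable sketch, assuming the iconv extension; it uses ISO-8859-1 to UTF-8 rather than the Japanese encodings on the slide so the bytes are easy to check, and `read_file_as_utf8` is a hypothetical helper:

```php
<?php
// Hypothetical helper: read a file in a known legacy encoding as UTF-8.
function read_file_as_utf8($path, $source_encoding)
{
    $fp = fopen($path, 'rb');   // binary mode: no newline translation
    // Filter name format is convert.iconv.<from>/<to>; apply on read only.
    stream_filter_append($fp, 'convert.iconv.' . $source_encoding . '/UTF-8', STREAM_FILTER_READ);
    $contents = stream_get_contents($fp);
    fclose($fp);
    return $contents;
}

// Demo: write "café" in Latin-1 (é is the single byte E9), read it back.
$path = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($path, "caf\xE9\n");
echo bin2hex(read_file_as_utf8($path, 'ISO-8859-1')), "\n"; // 636166c3a90a
unlink($path);
```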
What do you mean mysql is giving me garbage? BEYOND THE CODE
Databases: table/schema encoding and connection • MySQL: set the charset right on the table AND set the charset right on the connection (NOT with SET NAMES, it does not do enough) AND don’t use the old mysql extension, use mysqli or PDO • PostgreSQL: pg_set_client_encoding • Oracle: passed in the connect call • SQLite(3): make sure it was compiled with Unicode support and the intl extension is available • sqlsrv/pdo_sqlsrv: CharacterSet in the connection options
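For MySQL, setting the charset "on the connection" means the DSN for PDO or set_charset for mysqli. A sketch follows; `build_mysql_dsn` is a hypothetical helper, the host, database and credentials are placeholders, and utf8mb4 as the default is this example's assumption rather than something the slides specify:

```php
<?php
// Hypothetical helper: PDO MySQL DSN that carries the connection charset.
// The charset DSN parameter is honored by pdo_mysql from PHP 5.3.6 on.
function build_mysql_dsn($host, $database, $charset = 'utf8mb4')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

echo build_mysql_dsn('localhost', 'app'), "\n";
// mysql:host=localhost;dbname=app;charset=utf8mb4

// Usage, not executed here (placeholder credentials):
//   $pdo = new PDO(build_mysql_dsn('localhost', 'app'), 'user', 'secret');
// mysqli equivalent:
//   $mysqli = new mysqli('localhost', 'user', 'secret', 'app');
//   $mysqli->set_charset('utf8mb4');
```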
Other gotchas • Plain text is not plain text, files will have encodings • Files will be loaded as binary if you add the b flag to fopen (here’s a hint, always use the b flag) • You can convert files on the fly with the iconv filter • You cannot use Unicode file names with PHP on Windows at all (no, not even UTF-8), unless you find a 3rd party PHP extension • Beware of sending anything but ASCII to exec, proc_open and other command line calls
The best and worst in PHP apps CASE STUDIES
Applications • WordPress: gettext (sigh) • Drupal: gettext files but NOT the gettext API
My Little Project • Get everything needed into intl from mbstring and iconv so you need only 1 solution: stream filter from iconv, output handler from iconv, zend_multibyte support from mbstring, HTTP input and output conversion from mbstring • Some simplified APIs to make “overloading” doable