
PHP and UTF-8 Howto
PHP and UTF-8 Howto - Experiences from WebCollab

Writing the UTF-8 version of WebCollab was not straightforward. There is not much good information on PHP with Unicode, and a lot of bad information. Some web sites even said it was impossible. This page documents how we successfully made WebCollab work with UTF-8.

PHP mbstring library

PHP has an optional library specifically for handling multi-byte strings, known as mbstring. This library makes using UTF-8 much easier. Despite what a lot of websites say, you should use mbstring. However, not all web hosting providers enable mbstring in their PHP installation. For most of the mbstring functions there are standalone PHP code equivalents that can be found on the web. For WebCollab, we chose not to take this approach. For a working example of a PHP UTF-8 application, visit the demo website for WebCollab.

HTTP Headers

First we must correctly set the HTTP headers to instruct the browser to use UTF-8:
header( 'Content-Type: text/html; charset=UTF-8' );
Then, to make doubly sure the browser uses UTF-8, we send a meta tag in the HTML head:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

PHP Internal Encoding

By default PHP uses 'ISO-8859-1' for its internal encoding (PHP 5.6 and later default to UTF-8). Set it to UTF-8 explicitly:
mb_internal_encoding( 'UTF-8' );
This fixes a lot of internal PHP problems.

HTTP Form Submission

Although not specifically mandated by the W3C, almost all web browsers will submit an HTTP form in the same character set as the page was served in. Put another way, if you deliver your pages in UTF-8, then submitted responses will also be in UTF-8. There is no need to try to detect the character set of a submitted HTTP form. Our experience has been that trying to determine the submitted character set results in more 'false positive' errors than simply accepting that it is UTF-8 and then validating the bytes.

Character Validation

Overly long UTF-8 sequences and UTF-16 surrogates are a serious security threat.
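To make the threat concrete, the sketch below shows how a decoder that does not reject such forms turns the two bytes C0 AF into a plain '/' and the three bytes ED A0 80 into the UTF-16 surrogate U+D800. The function name naive_utf8_decode() is hypothetical and exists only for this illustration; it is not part of WebCollab or mbstring.

```php
<?php
// Hypothetical, deliberately naive decoder for ONE encoded character.
// It trusts the byte count and never checks for overly long forms
// or surrogates, which is exactly the mistake being illustrated.
function naive_utf8_decode(string $bytes): int
{
    $n = strlen($bytes);
    if ($n === 2) {
        // 110xxxxx 10yyyyyy
        return ((ord($bytes[0]) & 0x1F) << 6) | (ord($bytes[1]) & 0x3F);
    }
    if ($n === 3) {
        // 1110xxxx 10yyyyyy 10zzzzzz
        return ((ord($bytes[0]) & 0x0F) << 12)
             | ((ord($bytes[1]) & 0x3F) << 6)
             |  (ord($bytes[2]) & 0x3F);
    }
    return ord($bytes[0]); // single byte
}

// Overly long encoding of '/' (U+002F): contains no 0x2F byte at all.
$overlong_slash = "\xC0\xAF";
// UTF-8 style encoding of the UTF-16 surrogate U+D800, which is not a character.
$surrogate = "\xED\xA0\x80";

printf("U+%04X\n", naive_utf8_decode($overlong_slash)); // U+002F
printf("U+%04X\n", naive_utf8_decode($surrogate));      // U+D800
```

A filter that scans the raw input for the byte 0x2F would not see the slash in the first string, which is why such sequences have to be rejected before any further processing.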
Validation of input data is very important. Here is an algorithm, derived from http://www.w3.org/International/questions/qa-forms-utf-8.html, that handles validation.
preg_match_all('/([\x09\x0a\x0d\x20-\x7e]'. // ASCII characters
'|[\xc2-\xdf][\x80-\xbf]'. // 2-byte (except overly longs)
'|\xe0[\xa0-\xbf][\x80-\xbf]'. // 3 byte (except overly longs)
'|[\xe1-\xec\xee\xef][\x80-\xbf]{2}'. // 3 byte (except overly longs)
'|\xed[\x80-\x9f][\x80-\xbf])+/', // 3 byte (except UTF-16 surrogates)
$input, $clean_pieces );
$clean_output = join('?', $clean_pieces[0] );
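For experimentation, the fragment above can be wrapped in a function. This is a minimal sketch; the name utf8_clean_pieces() is hypothetical and not taken from WebCollab. It also makes one behaviour of the join('?') approach visible: invalid bytes between two clean runs become a single '?', while invalid bytes at the very start or end of the input are dropped without a marker.

```php
<?php
// Hypothetical wrapper around the preg_match_all() approach shown above.
function utf8_clean_pieces(string $input): string
{
    $pattern = '/([\x09\x0a\x0d\x20-\x7e]' .         // ASCII characters
        '|[\xc2-\xdf][\x80-\xbf]' .                  // 2-byte (except overly longs)
        '|\xe0[\xa0-\xbf][\x80-\xbf]' .              // 3-byte (except overly longs)
        '|[\xe1-\xec\xee\xef][\x80-\xbf]{2}' .       // 3-byte, general case
        '|\xed[\x80-\x9f][\x80-\xbf])+/';            // 3-byte (except UTF-16 surrogates)

    // preg_match_all() returns false when PCRE gives up, e.g. on very long input.
    if (preg_match_all($pattern, $input, $clean_pieces) === false) {
        throw new RuntimeException('PCRE failed while validating input');
    }

    // Glue the clean runs back together, marking each gap with '?'.
    return join('?', $clean_pieces[0]);
}

echo utf8_clean_pieces("abc\xFFdef"), "\n";   // stray byte in the middle
echo utf8_clean_pieces("a\xC0\xAFb"), "\n";   // overly long '/' in the middle
echo utf8_clean_pieces("\xFFabc"), "\n";      // stray byte at the start
```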
Several points to note here:
This algorithm does, however, have a serious drawback in PHP: preg_match_all() seems to choke on very long strings. In WebCollab, we use mb_substr() to break the string into pieces of 1000 characters each. Each piece is validated individually, and the pieces are then recombined. An equivalent algorithm using preg_replace() is given below. This will be used in the next release of WebCollab. This method does not crash on very long strings, but is slightly slower.
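$body = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]'. // control characters (tab, LF and CR are kept)
'|[\x00-\x7F][\x80-\xBF]+'. // ASCII followed by stray continuation bytes
'|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'. // overly long 2-byte leads, and 4-byte leads and above
'|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'. // 2-byte lead with the wrong number of continuation bytes
'|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S', // 3-byte lead with the wrong number of continuation bytes
'?', $body );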
$body = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
'|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body );
In the above algorithm, the first preg_replace() only allows well-formed Unicode (it rejects overly long 2-byte sequences, as well as characters at or above U+10000, which need 4 bytes). The second preg_replace() removes overly long 3-byte sequences and UTF-16 surrogates.

Alternate Methods of Validation
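One alternative, offered here as an assumption rather than as the method WebCollab uses, is to let mbstring itself test whether a string is well-formed UTF-8 with mb_check_encoding(). The helper name is_clean_utf8() below is hypothetical. Unlike the regular expressions above, this check accepts 4-byte sequences, and it only tests the encoding, so control characters would still need separate handling. It is therefore not a drop-in replacement.

```php
<?php
// Hypothetical helper: true when $input is well-formed UTF-8 according to mbstring.
// Requires the mbstring extension.
function is_clean_utf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

// Reject-or-accept usage, instead of replacing bad bytes with '?'.
$samples = [
    'valid text'      => "h\xC3\xA9llo",
    'overly long "/"' => "\xC0\xAF",
    'surrogate'       => "\xED\xA0\x80",
    'stray byte'      => "abc\xFF",
];

foreach ($samples as $label => $bytes) {
    echo $label, ': ', is_clean_utf8($bytes) ? 'accepted' : 'rejected', "\n";
}
```

Rejecting the whole input is stricter than the replacement approach, which may or may not suit a given form; that trade-off is a design choice for the application.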
MySQL Database

You need to use MySQL 4.1 or later. Earlier versions do not have Unicode support. When creating a database for PHP and UTF-8, use the command:
CREATE DATABASE database_name DEFAULT CHARACTER SET utf8;
There is no '-' (dash) in 'utf8' for MySQL. All tables and character columns built after this will default to the UTF-8 character set. If you have an existing database converted to UTF-8, or create individual tables with UTF-8 columns, we have found that you must also set the database itself to UTF-8 to avoid problems:
ALTER DATABASE database_name DEFAULT CHARACTER SET utf8;
When connecting to MySQL with PHP, you should tell MySQL what character set to expect by using two commands:
mysql_query( "SET NAMES utf8", $database_connection );
mysql_query( "SET CHARACTER SET utf8", $database_connection );
MySQL will then expect input data to be in UTF-8, and will output results in UTF-8. (The mysql_* functions shown here were removed in PHP 7; the mysqli extension provides mysqli_set_charset() for the same purpose.) It is possible to use a connection character set that differs from the back end database character set. MySQL will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL Database

PostgreSQL has had good UTF-8 support for a considerable time. You should create databases with UTF-8 encoding:
CREATE DATABASE database_name WITH ENCODING 'UTF8';
After connecting, PHP has a built-in function for setting the client encoding:
pg_set_client_encoding( $database_connection, 'UTF8' );
Note that this function returns 0 on success and -1 on error. You can also use SQL commands:
SET CLIENT_ENCODING TO 'UTF8';
Or you can use the standard SQL syntax SET NAMES:
SET NAMES 'UTF8';
It is possible to use a connection character set that differs from the back end database character set. The PostgreSQL client will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'. PostgreSQL checks the validity of UTF-8 on input, and will abort with an error message if an invalid byte is found.
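The '?' substitution described for both databases can be observed without a database connection. The sketch below uses mbstring's own converter as a stand-in, which is an assumption made purely for illustration: it is not the code path MySQL or PostgreSQL use, but it shows the same kind of loss when a character has no equivalent in the target character set.

```php
<?php
// Make the substitution character explicit ('?' is also mbstring's default).
mb_substitute_character(0x3F);

// "café €" in UTF-8: 'é' exists in ISO-8859-1, the euro sign does not.
$utf8 = "caf\xC3\xA9 \xE2\x82\xAC";

// Stand-in for a connection whose character set is narrower than the data.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Converting back cannot restore the character that was lost.
$round_trip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n";     // hex dump of the ISO-8859-1 bytes
echo $round_trip, "\n";          // the euro sign is now '?'
```

Keeping the connection and the database both in UTF-8, as recommended above, avoids this loss entirely.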
Links

UTF-8 Sampler
UTF-8 and Unicode FAQ
UTF-8 Test Page

Comments, Criticisms and Suggestions

andrewsimpson at users dot sourceforge dot net