PHP and UTF-8 Howto - Experiences from WebCollab

Writing the UTF-8 version of WebCollab was not straightforward. There is not much good information on PHP with Unicode, and a lot of bad information. Some web sites even said it was impossible.

This page documents how we successfully made WebCollab work with UTF-8.

PHP mbstring library

PHP has an optional library specifically for handling multi-byte strings, known as mbstring. This library makes using UTF-8 much easier. Despite what a lot of websites say, you should use mbstring.

However, not all web hosting providers enable mbstring in their PHP installations.

For most of the mbstring functions there are pure PHP equivalents that can be found on the web. For WebCollab, we chose not to take this approach.

For a working example of a PHP UTF-8 application, visit the demo website for WebCollab.

HTTP Headers

Firstly we must correctly set the HTTP headers to instruct the browser to use UTF-8:

header( 'Content-Type: text/html; charset=UTF-8' );

Then to make doubly sure the browser uses UTF-8, we send a meta tag in the HTML head:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

PHP Internal Encoding

By default PHP uses 'ISO-8859-1' for its internal encoding. Change this to UTF-8:

mb_internal_encoding( 'UTF-8' );

This fixes a lot of internal PHP problems.
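
The header and internal encoding steps can be sketched together as below. This is a minimal illustration, not WebCollab's actual code: the helper name bootstrap_utf8() is hypothetical, and the headers_sent() guard is an added precaution because header() only works before any output has been sent. The last lines show the practical effect of the internal encoding on a string function.

```php
<?php
// Hypothetical bootstrap helper; the name is illustrative, not from WebCollab.
function bootstrap_utf8(): void
{
    // Headers can only be sent before any output has been written.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    // Make mbstring functions assume UTF-8 when no encoding is passed.
    mb_internal_encoding('UTF-8');
}

bootstrap_utf8();

$word = "h\xC3\xA9llo";          // "hello" with e-acute: that letter is two bytes in UTF-8
$byte_count = strlen($word);     // strlen() counts bytes
$char_count = mb_strlen($word);  // mb_strlen() counts characters
```

Here $byte_count is 6 while $char_count is 5, which is the kind of difference that matters when truncating or measuring user input.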

HTTP Form Submission

Although not specifically mandated by the W3C, almost all web browsers will submit an HTTP form in the same character set as the page was served up in. Put another way, if you deliver your pages in UTF-8, then submitted responses will also be in UTF-8.

There is no need to try and verify the character set in a submitted HTTP form. Our experience has been that trying to accurately determine the submitted character set will result in more 'false positive' errors than just accepting that it is correct.

Character Validation

Overly long UTF-8 sequences and UTF-16 surrogates are a serious security threat, so validation of input data is very important. Here is an algorithm derived from http://www.w3.org/International/questions/qa-forms-utf-8.html that handles validation.

preg_match_all('/([\x09\x0a\x0d\x20-\x7e]'.           // ASCII characters
                '|[\xc2-\xdf][\x80-\xbf]'.            // 2-byte (except overly longs)
                '|\xe0[\xa0-\xbf][\x80-\xbf]'.        // 3-byte (except overly longs)
                '|[\xe1-\xec\xee\xef][\x80-\xbf]{2}'. // 3-byte (straight)
                '|\xed[\x80-\x9f][\x80-\xbf])+/',     // 3-byte (except UTF-16 surrogates)
                  $input, $clean_pieces );

$clean_output = join('?', $clean_pieces[0] );

Several points to note here:

  • The characters are limited to those below U+10000 (U+FFFF is the largest code point that fits in 3 bytes), because this is the limitation in MySQL and PostgreSQL. MySQL will silently reject characters at U+10000 and above, while PostgreSQL will give an error message (according to a Debian bug report, PostgreSQL 8.1 seems to have removed the U+10000 limit). Characters at U+10000 and above mostly belong to uncommon language scripts.
  • The PCRE function preg_match_all() is used because the PCRE functions are generally faster than the mbstring ereg equivalents, and this is byte-by-byte matching, for which mbstring is not necessary.
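
To make the behaviour of the preg_match_all() pattern concrete, here is a small sketch that wraps it in a function and runs it on a few byte strings. The function name utf8_keep_valid() and the sample inputs are hypothetical and chosen only for illustration; the pattern itself is the one shown above.

```php
<?php
// Hypothetical wrapper around the validation pattern; not WebCollab's code.
function utf8_keep_valid(string $input): string
{
    // Collect every run of acceptable characters (printable ASCII plus
    // tab/LF/CR, and well-formed 2-byte and 3-byte UTF-8 sequences).
    preg_match_all('/([\x09\x0a\x0d\x20-\x7e]'
        . '|[\xc2-\xdf][\x80-\xbf]'
        . '|\xe0[\xa0-\xbf][\x80-\xbf]'
        . '|[\xe1-\xec\xee\xef][\x80-\xbf]{2}'
        . '|\xed[\x80-\x9f][\x80-\xbf])+/',
        $input, $clean_pieces);

    // Glue the valid runs back together, marking each gap with '?'.
    return implode('?', $clean_pieces[0]);
}

$valid     = utf8_keep_valid("caf\xC3\xA9");          // well-formed 2-byte sequence
$overlong  = utf8_keep_valid("abc\xC0\xAFdef");       // overly long encoding of '/'
$surrogate = utf8_keep_valid("a\xED\xA0\x80b");       // UTF-16 surrogate
$four_byte = utf8_keep_valid("x\xF0\x9F\x98\x80y");   // character above U+FFFF
$leading   = utf8_keep_valid("\xFFabc");              // invalid byte at the start
```

The valid string comes back unchanged, and the next three become "abc?def", "a?b" and "x?y". Because the '?' is only placed between valid runs, a run of several invalid bytes collapses to a single '?', and invalid bytes at the very start or end of the input are dropped without a marker (the last example returns "abc").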

This algorithm does, however, have a serious drawback in PHP: preg_match_all() seems to choke on very long strings. In WebCollab, we use mb_substr() to break the string up into pieces of 1000 characters each. Each piece is validated individually, then the pieces are recombined.

An equivalent algorithm using preg_replace() is given below. This will be used in the next release of WebCollab. This method does not crash on very long strings, but is slightly slower.

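// First pass: replace anything that is not well-formed 1 to 3 byte UTF-8
$body = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]'.        // control characters (tab, LF, CR are kept)
                     '|[\x00-\x7F][\x80-\xBF]+'.                 // continuation bytes after ASCII
                     '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.    // invalid and 4-byte lead bytes
                     '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'. // 2-byte lead, wrong length
                     '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S', // 3-byte lead, wrong length
                     '?', $body );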

$body = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
                     '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body );

In the above algorithm the first preg_replace() only allows well formed Unicode (and rejects overly long 2 byte sequences, as well as characters above U+10000). The second preg_replace() removes overly long 3 byte sequences and UTF-16 surrogates.
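
As an illustration of the two-pass approach, the sketch below wraps both preg_replace() calls in one function and applies it to a few sample byte strings. The function name utf8_replace_invalid() and the inputs are hypothetical. The control-character range is written here as \x0E-\x1F so that it matches the characters the preg_match_all() pattern rejects; that reading is an assumption.

```php
<?php
// Hypothetical wrapper around the two preg_replace() passes; not WebCollab's code.
function utf8_replace_invalid(string $body): string
{
    // Pass 1: control characters, continuation bytes after ASCII, invalid or
    // 4-byte lead bytes, and 2/3-byte leads with the wrong number of
    // continuation bytes.
    $body = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]'
        . '|[\x00-\x7F][\x80-\xBF]+'
        . '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'
        . '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'
        . '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        '?', $body);

    // Pass 2: overly long 3-byte sequences and UTF-16 surrogates.
    return preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'
        . '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $body);
}

$valid      = utf8_replace_invalid("caf\xC3\xA9");         // unchanged
$overlong2  = utf8_replace_invalid("abc\xC0\xAFdef");      // overly long 2-byte
$overlong3  = utf8_replace_invalid("a\xE0\x80\xAFb");      // overly long 3-byte
$surrogate  = utf8_replace_invalid("a\xED\xA0\x80b");      // UTF-16 surrogate
$four_byte  = utf8_replace_invalid("x\xF0\x9F\x98\x80y");  // above U+FFFF
$control    = utf8_replace_invalid("a\x01b");              // control character
```

Each invalid sequence in these samples becomes a single '?', giving "abc?def", "a?b", "a?b", "x?y" and "a?b" respectively. Two details of the pattern are worth knowing when adapting it: the second alternative consumes the ASCII character that precedes a stray continuation byte (so "a\x80b" becomes "?b"), and a continuation byte at the very start of the string does not appear to be matched by either pass, so strict callers may want an extra check for that case.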

Alternate Methods of Validation

  • Use mb_convert_encoding() or iconv() to verify encoding:
    $str = mb_convert_encoding($str, "UTF-8", "UTF-8" );
    
    $str = @iconv("UTF-8", "UTF-8//IGNORE", $str );
    
    These methods do not replace stripped characters with '?' and they do not trap characters above U+10000, nor do they remove UTF-16 surrogates. A preg_replace() needs to be added to trap these two conditions as well. Benchmarking shows these methods to be similar in speed to the regex above.
  • Use preg_match_all() with the /u (Unicode) modifier. The PCRE engine will check for valid UTF-8, but it allows characters above U+10000 to pass and has other limitations. It will also reject the entire string, if only one character is invalid. It may also abort the PHP script giving no error messages, as we have found.
  • The 'UTF-8 to Code Point Array Converter in PHP' can be used for validation. Our benchmark testing has shown this to be slower than the regex.
  • Sample code in the PHP manual. Not tested.
  • mb_check_encoding() in the PHP manual. Requires PHP 4.4.3 or higher, or PHP 5.1.3 or higher. This function seems to accept at least some malformed UTF-8.
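
The /u approach from the list above can be sketched as a yes/no check rather than a repair step. This is an illustration under stated assumptions, not a tested replacement for the regex: the function name is_bmp_utf8() is hypothetical, and the second test for lead bytes F0 to F4 is an added step to catch the characters at U+10000 and above that /u lets through.

```php
<?php
// Hypothetical all-or-nothing check; it rejects the whole string on any fault.
function is_bmp_utf8(string $str): bool
{
    // With the /u modifier PCRE validates the whole subject as UTF-8 first.
    // For invalid input preg_match() returns false instead of 1.
    if (preg_match('//u', $str) !== 1) {
        return false;
    }
    // In valid UTF-8, bytes F0..F4 only occur as the lead byte of a 4-byte
    // sequence, i.e. a character at U+10000 or above.
    return preg_match('/[\xF0-\xF4]/', $str) !== 1;
}

$ok        = is_bmp_utf8("caf\xC3\xA9");          // true
$overlong  = is_bmp_utf8("abc\xC0\xAF");          // false: overly long sequence
$surrogate = is_bmp_utf8("a\xED\xA0\x80b");       // false: UTF-16 surrogate
$four_byte = is_bmp_utf8("x\xF0\x9F\x98\x80y");   // false: above U+FFFF
```

Unlike the regex methods, this gives no cleaned-up output, so it suits cases where bad input should be refused outright rather than patched with '?'.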

MySQL Database

You need MySQL 4.1 or later. Earlier versions do not have Unicode support.

When creating a database for PHP and UTF-8, use the command:

CREATE DATABASE database_name DEFAULT CHARACTER SET utf8;

There is no '-' (dash) in 'utf8' for MySQL.

All tables and character columns built after this will default to use the UTF-8 character set.

If you have an existing database converted to UTF-8, or create individual tables with UTF-8 columns, we have found that you must also set the database to UTF-8 to avoid problems.

ALTER DATABASE database_name DEFAULT CHARACTER SET utf8;

When connecting to MySQL with PHP, you should tell MySQL what character set to expect by using two commands:

mysql_query( "SET NAMES utf8", $database_connection );

mysql_query( "SET CHARACTER SET utf8", $database_connection );

MySQL will then expect input data to be in UTF-8, and will output results in UTF-8.
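
The mysql_query() family shown above belongs to the original mysql extension, which was removed in PHP 7. On current PHP the same step might look like the sketch below, which uses the mysqli extension instead. The helper name use_utf8_connection() is hypothetical, and $link is assumed to be a mysqli connection opened elsewhere.

```php
<?php
// Hypothetical helper: $link is an open mysqli connection created elsewhere.
function use_utf8_connection(mysqli $link): bool
{
    // mysqli_set_charset() sets the connection character set, as SET NAMES
    // does, and also informs the client library so escaping stays consistent.
    // It returns true on success and false on failure.
    return mysqli_set_charset($link, 'utf8');
}
```

A caller would run this once, straight after connecting, and treat a false return as a connection setup failure.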

It is possible to use a different connection character set than the back end database character set. MySQL will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL Database

PostgreSQL has had good UTF-8 support for a considerable time. You should create databases with UTF-8 encoding:

CREATE DATABASE database_name WITH ENCODING 'UTF8';

After connecting, PHP has a built-in function for client encoding:

pg_set_client_encoding( $database_connection, 'UTF8' );

Note that this function returns 0 on success and -1 on error.
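
Because the return value is an integer rather than a boolean, it is easy to test it the wrong way round. A minimal sketch of a wrapper that turns it into a boolean is shown below; the helper name use_utf8_client_encoding() is hypothetical, and $connection is assumed to come from pg_connect().

```php
<?php
// Hypothetical helper: $connection is an open PostgreSQL connection
// obtained from pg_connect() elsewhere.
function use_utf8_client_encoding($connection): bool
{
    // pg_set_client_encoding() returns 0 on success and -1 on error,
    // so compare against 0 explicitly instead of relying on truthiness.
    return pg_set_client_encoding($connection, 'UTF8') === 0;
}
```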

You can also use SQL commands:

SET CLIENT_ENCODING TO 'UTF8';

Or you can use the standard SQL syntax SET NAMES:

SET NAMES 'UTF8';

It is possible to use a different connection character set than the back end database character set. The PostgreSQL client will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL checks the validity of UTF-8 on input, and will abort with an error message if an invalid byte is found.

Links

UTF-8 Sampler
UTF-8 and Unicode FAQ
UTF-8 Test Page

Comments, Criticisms and Suggestions

andrewsimpson at users dot sourceforge dot net


Last modified Jan 2007