WebCollab PHP and UTF-8 Howto


PHP and UTF-8 Howto

PHP and UTF-8 Howto - Experiences from WebCollab

Writing the UTF-8 version of WebCollab in early 2004 was not straightforward. There was not much good information on PHP with UTF-8, and a lot of bad information. However, contrary to many doomsayers, PHP can be made to run with UTF-8 without too much trouble.

This page documents how we successfully made WebCollab work with UTF-8.

PHP mbstring library

PHP has an optional library specifically for handling multi-byte strings, known as mbstring (short for multi-byte string). This library makes using UTF-8 much easier. Because the library is optional, not all web hosting providers enable mbstring in their PHP installations. However, given that UTF-8 is becoming more widespread, most providers should now provide it.
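Because the library is optional, it can be worth checking for it at startup. The following is a minimal sketch; the helper name has_mbstring() and the warning text are assumptions made for illustration, not part of WebCollab.

```php
<?php
// Hypothetical helper: true when the mbstring extension is loaded in this PHP build.
function has_mbstring(): bool
{
    return extension_loaded('mbstring') && function_exists('mb_strlen');
}

if (!has_mbstring()) {
    // Illustrative handling only; a real application might stop in its installer instead.
    trigger_error('The mbstring extension is required for UTF-8 support', E_USER_WARNING);
}
```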

For most of the mbstring functions there are also plain PHP code equivalents/workarounds that can be found on the web. There is no real advantage in using these workarounds, and a number of disadvantages:

Extra code is required to reimplement the built-in functions provided by mbstring.
PHP internal character set handling is not done in UTF-8.
Further code is needed to work around the two points above, which adds complexity.

For a working example of a PHP UTF-8 application, visit the demo website for WebCollab.

HTTP Headers

Firstly we must correctly set the HTTP headers to instruct the browser to use UTF-8:

header( 'Content-Type: text/html; charset=UTF-8' );

Then to make doubly sure the browser uses UTF-8, we send a meta tag in the HTML head:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
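The two steps can be combined in one place. This is only a sketch: the function name start_utf8_page() and the page title are hypothetical, and it assumes nothing has been output before it is called.

```php
<?php
// Hypothetical helper: sends the charset header and returns a minimal HTML head.
function start_utf8_page(string $title): string
{
    // header() only works before any output has been sent.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">' . "\n"
        . '<title>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "</title>\n"
        . "</head>\n";
}

echo start_utf8_page("Caf\xC3\xA9 menu");
```

Passing 'UTF-8' to htmlspecialchars() keeps the escaping step consistent with the declared character set.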

PHP Internal Encoding

By default PHP uses 'ISO-8859-1' for its internal encoding scheme (since PHP 5.6 the default follows the default_charset setting, which is UTF-8). Set it explicitly to UTF-8:

mb_internal_encoding( 'UTF-8' );

This makes the mbstring functions 'UTF-8 aware', so the encoding does not have to be passed on every call. It also ensures that input and output stay in UTF-8 without PHP trying to force character set changes.
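A small illustration of the effect, assuming the mbstring extension is loaded. The sample word is arbitrary.

```php
<?php
mb_internal_encoding('UTF-8');

$word = "caf\xC3\xA9"; // "café": the last character takes two bytes in UTF-8

echo strlen($word), "\n";                  // counts bytes: 5
echo mb_strlen($word), "\n";               // counts characters using the internal encoding: 4
echo mb_strlen($word, 'ISO-8859-1'), "\n"; // an explicit encoding overrides the default: 5
echo mb_internal_encoding(), "\n";         // UTF-8
```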

HTTP Form Submission

Although not specifically mandated by the W3C, almost all web browsers will submit an HTTP form in the same character set as the page was served in. Put another way, if you deliver your pages in UTF-8, then submitted responses will also be in UTF-8.

There is no need to try and verify the character set in a submitted HTTP form. Our experience has been that trying to accurately determine the submitted character set will result in more 'false positive' errors than just accepting that it is correct.

Character Validation

Overly long UTF-8 sequences and UTF-16 surrogates are a serious security threat. Validation of input data is very important. An algorithm using preg_replace() is given below, and is used in current versions of WebCollab.

$body = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
    '|(?<=^|[\x00-\x7F])[\x80-\xBF]+'.
    '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
    '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
    '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/',
    '�', $body );

$body = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
    '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body );

In the above algorithm the first preg_replace() only allows well formed Unicode (it rejects overly long 2 byte sequences, as well as characters at or above U+10000, which need 4 bytes). The second preg_replace() replaces overly long 3 byte sequences and UTF-16 surrogates with '?'.

Several points to be noted here:

The characters are limited to those below U+10000 (U+FFFF is the largest possible 3 byte character), because this is a limitation of the MySQL utf8 character set, though MySQL 5.5 now has the utf8mb4 character set for 4 byte characters. The characters at or above U+10000 are, however, mostly in uncommon language scripts.
The PCRE functions (preg_*) are used because they are generally faster than the mbstring ereg equivalents, and this is byte-by-byte matching. It is not necessary to use mbstring for byte-by-byte matching.
Note that in the code above the 'u' (Unicode) PCRE pattern modifier in preg_replace() is not required. This is because the search pattern is for specific bytes and not Unicode characters.
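To try the two passes on sample input, they can be wrapped in a function. This is a sketch only: the name clean_utf8() is hypothetical, the patterns are copied unchanged from the code above, and the replacement character is written as explicit bytes so the result does not depend on the encoding of the source file.

```php
<?php
// Hypothetical wrapper around the two preg_replace() calls shown above.
function clean_utf8(string $body): string
{
    $replacement = "\xEF\xBF\xBD"; // U+FFFD REPLACEMENT CHARACTER, encoded as UTF-8

    // Pass 1: control characters, stray continuation bytes, and lead bytes
    // followed by the wrong number of continuation bytes.
    $body = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
        '|(?<=^|[\x00-\x7F])[\x80-\xBF]+'.
        '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
        '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
        '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/',
        $replacement, $body) ?? '';

    // Pass 2: overly long 3 byte sequences and UTF-16 surrogates.
    return preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
        '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $body) ?? '';
}

echo bin2hex(clean_utf8("caf\xC3\xA9")), "\n";  // 636166c3a9: valid text is untouched
echo bin2hex(clean_utf8("a\x80b")), "\n";       // 61efbfbd62: stray byte becomes U+FFFD
echo clean_utf8("\xED\xA0\x80"), "\n";          // ?: UTF-16 surrogate
```

The `?? ''` fallback covers the case where preg_replace() returns null on a PCRE error; an application might prefer to log that case instead.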

Alternate Methods of Validation

Here is an algorithm derived from http://www.w3.org/International/questions/qa-forms-utf-8.html that handles validation.

preg_match_all('/([\x09\x0a\x0d\x20-\x7e]'. // ASCII characters
    '|[\xc2-\xdf][\x80-\xbf]'. // 2-byte (except overly longs)
    '|\xe0[\xa0-\xbf][\x80-\xbf]'. // 3 byte (except overly longs)
    '|[\xe1-\xec\xee\xef][\x80-\xbf]{2}'. // 3 byte (except overly longs)
    '|\xed[\x80-\x9f][\x80-\xbf])+/', // 3 byte (except UTF-16 surrogates)
    $input, $clean_pieces );

$clean_output = join('?', $clean_pieces[0] );

This algorithm does however have a serious drawback in PHP coding: The PCRE regex preg_match_all() seems to choke on very long strings. There is an apparently related bug report for PHP about this.

Use mb_convert_encoding() or iconv() to verify encoding. Two example methods of validating:
```
$str = mb_convert_encoding($str, "UTF-8", "UTF-8" );
```
```
$str = @iconv("UTF-8", "UTF-8//IGNORE", $str );
```
These methods do not replace stripped characters with '?' and they do not trap characters at or above U+10000, nor do they remove UTF-16 surrogates. A preg_replace() needs to be added to trap these two conditions as well. Benchmarking shows these methods to be similar in speed to the regex above.
Use preg_match_all() with the /u (Unicode) modifier. The PCRE engine will check for valid UTF-8, but it allows characters at or above U+10000 to pass and has other limitations. It will also reject the entire string if even one character is invalid. In our experience it may also abort the PHP script without giving any error message.
'UTF-8 to Code Point Array Converter in PHP', which can be used for validation. Our benchmark testing has shown this to be slower than the regex.
Sample code in the PHP manual. Not tested.
mb_check_encoding() in the PHP manual. Requires PHP 4.4.3 or higher, or PHP 5.1.3 or higher. When we tested this function, it seemed to accept at least some malformed UTF-8 characters.
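As an illustration of the /u approach from the list above, the sketch below uses an empty pattern so that PCRE only validates the subject, and checks preg_last_error() so that other PCRE failures are not silently mistaken for bad input. The helper name is_valid_utf8() is hypothetical. As noted above, this check accepts 4 byte characters and only gives a yes/no answer for the whole string.

```php
<?php
// Hypothetical helper: true when $str is well-formed UTF-8 according to PCRE.
function is_valid_utf8(string $str): bool
{
    // With the 'u' modifier PCRE validates the whole subject before matching.
    $result = preg_match('//u', $str);

    if ($result === false && preg_last_error() !== PREG_BAD_UTF8_ERROR) {
        // Some other PCRE failure, for example a backtrack limit.
        trigger_error('PCRE error code ' . preg_last_error(), E_USER_WARNING);
    }

    return $result === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(is_valid_utf8("a\x80b"));      // bool(false)
```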

Equivalent Functions

Some common text handling functions do not work directly in UTF-8 and have equivalent multibyte functions. Some of the more common equivalents are listed below:

mail()	mb_send_mail()
strlen()	mb_strlen()
strpos()	mb_strpos()
strrpos()	mb_strrpos()
substr()	mb_substr()
strtolower()	mb_strtolower()
strtoupper()	mb_strtoupper()
substr_count()	mb_substr_count()
split()	mb_split()
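A short comparison of a few of these pairs on a string containing one two-byte character. This is a sketch assuming mbstring is loaded; bin2hex() is used where the byte-oriented result is not valid UTF-8 and would not display cleanly.

```php
<?php
$text = "h\xC3\xA9llo"; // "héllo": 5 characters, 6 bytes

// Byte offset versus character offset of the first "l".
echo strpos($text, 'l'), ' ', mb_strpos($text, 'l', 0, 'UTF-8'), "\n"; // 3 2

// substr() cuts through the middle of "é"; mb_substr() keeps it whole.
echo bin2hex(substr($text, 0, 2)), "\n";             // 68c3
echo bin2hex(mb_substr($text, 0, 2, 'UTF-8')), "\n"; // 68c3a9

// mb_strtoupper() also converts the non-ASCII character.
echo bin2hex(mb_strtoupper($text, 'UTF-8')), "\n";   // 48c3894c4c4f
```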

Regular Expressions

The PCRE regular expressions require a pattern modifier of 'u' to make the PCRE engine aware that UTF-8 is being used.
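The difference the modifier makes can be seen with a single two-byte character. A minimal sketch:

```php
<?php
$char = "\xC3\xA9"; // "é" encoded as two bytes

var_dump(preg_match('/^.$/', $char));  // int(0): "." matches a single byte
var_dump(preg_match('/^.$/u', $char)); // int(1): "." matches a whole character

// Splitting a string into characters also needs the 'u' modifier.
$chars = preg_split('//u', "caf\xC3\xA9", -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n";              // 4
```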

The POSIX regular expressions have equivalent multibyte functions such as below:

ereg()	mb_ereg()
ereg_replace()	mb_ereg_replace()

MySQL Database

You must use at least MySQL 4.1 for Unicode support.

When creating a database for PHP and UTF-8, use the command:

CREATE DATABASE database_name DEFAULT CHARACTER SET utf8;

Note: There is no '-' (dash) in 'utf8' for MySQL.

All tables and character columns built after this will default to use the UTF-8 character set.

If you have an existing database converted to UTF-8, or create individual tables with UTF-8 columns, we have found that you must also set the database to UTF-8 to avoid problems.

ALTER DATABASE database_name DEFAULT CHARACTER SET utf8;

When connecting to MySQL with PHP, you should tell MySQL what character set to expect by using two commands:

mysql_query( "SET NAMES utf8", $database_connection );

mysql_query( "SET CHARACTER SET utf8", $database_connection );

MySQL will then expect input data to be in UTF-8, and will output results in UTF-8.
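The mysql_query() function belongs to the old mysql extension, which was removed in PHP 7. With the mysqli extension a comparable step might look like the sketch below. The helper name db_use_utf8() and the connection details in the usage comment are placeholders, not WebCollab code.

```php
<?php
// Hypothetical helper: $db is an already opened mysqli connection.
function db_use_utf8(mysqli $db): bool
{
    // set_charset() sets the connection character set, as SET NAMES does,
    // and also informs the client library of the change.
    return $db->set_charset('utf8');
}

// Usage (connection details are placeholders):
// $db = new mysqli('localhost', 'user', 'password', 'database_name');
// db_use_utf8($db);
```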

It is possible to use a connection character set that differs from the back end database character set. MySQL will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL Database

PostgreSQL has good UTF-8 support. Ideally, you should create databases with UTF-8 encoding:

CREATE DATABASE database_name WITH ENCODING 'UTF8';

After connecting, PHP has a built-in function for client encoding:

pg_set_client_encoding( $database_connection, 'UTF8' );

Note: This function returns -1 for an error condition, rather than the 0, or boolean false that would be usual.
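Because of that return convention, a plain truthiness test would read the result backwards. A sketch of an explicit check follows; the helper name set_pg_utf8() is hypothetical and $conn is assumed to be an open connection from pg_connect().

```php
<?php
// Hypothetical helper: $conn is an open connection returned by pg_connect().
function set_pg_utf8($conn): bool
{
    // 0 means success and -1 means failure, so compare against 0 explicitly
    // rather than writing "if (!pg_set_client_encoding(...))".
    return pg_set_client_encoding($conn, 'UTF8') === 0;
}

// Usage (connection string is a placeholder):
// $conn = pg_connect('dbname=database_name');
// if (!set_pg_utf8($conn)) { /* handle the error */ }
```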

You can also use SQL commands:

SET CLIENT_ENCODING TO 'UTF8';

Or you can use the standard SQL syntax SET NAMES:

SET NAMES 'UTF8';

It is possible to use a connection character set that differs from the back end database character set. The PostgreSQL client will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL checks the validity of UTF-8 on input, and will abort with an error message if an invalid byte is found.

Links

UTF-8 Sampler
UTF-8 and Unicode FAQ
UTF-8 Test Page

Comments, Criticisms and Suggestions

andrewsimpson at users dot sourceforge dot net

Last modified Jan 2016