Submitted by Ken on Sun, 01/04/2009 - 17:10
Sooner or later, any programmer working internationally has to handle languages other than English. These contain many diacritics and other letters or symbols that do not exist in English. When it goes wrong, things can get very ugly, with strange characters (examples: ¶, é, â, ÃŒ, ü, �, ë, ®, â��, ï) appearing in your database. How do you avoid data corruption and other problems related to character encoding? We won't explain everything in depth, but we do provide some quick and easy pointers that let you get started right away.
In your my.cnf file, add the following lines below [mysqld]:
character-set-server = utf8
collation-server = utf8_unicode_ci
After restarting the server, verify that all variables are configured correctly by executing these queries:
SHOW VARIABLES LIKE 'character%';
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
SHOW VARIABLES LIKE 'collation%';
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
Also make sure every field that may one day contain special characters uses the utf8 character set with the utf8_unicode_ci collation. When creating a new table, you could use this query as an example:
CREATE TABLE IF NOT EXISTS `my_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`internationalized_text` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
After providing the login credentials and establishing a connection, the first query you should execute is "SET NAMES utf8". This ensures the data will be passed to the database correctly.
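If you connect through PDO rather than the mysql_* functions, the same effect can be had by naming the charset in the DSN, which PDO has honoured since PHP 5.3.6. The following is only a sketch: buildMysqlDsn() is a hypothetical helper, and the host, database name and credentials are placeholders.

```php
<?php
// Hypothetical helper: builds a PDO DSN that fixes the connection charset.
// The default 'utf8' matches this article; 'utf8mb4' is an alternative on
// MySQL 5.5.3+ when 4-byte characters (such as emoji) must be stored.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

// Usage (placeholder credentials, needs a running MySQL server):
// $pdo = new PDO(buildMysqlDsn('localhost', 'my_database'), 'my_user', 'my_password');
```

Putting the charset in the DSN means it is set before any query runs, so there is no separate "SET NAMES" call to forget.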
Add this header in your PHP, before any output is sent to the browser: header("Content-Type: text/html; charset=UTF-8");
Add this meta tag in the HEAD of your HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. This tells browsers that the page is encoded as UTF-8.
Although diacritics should now be handled properly without running them through any more functions (no, utf8_encode() and utf8_decode() are not required), special characters such as the ampersand (&) can still cause problems in your HTML because the page won't validate unless they are escaped (see the W3C specifications for more information). To avoid this issue, convert those characters in the text you output, not in the markup around it: $myLink = "<a href='http://example.com'>" . htmlspecialchars("Iñtërnâtiônàlizætiøn", ENT_QUOTES, 'UTF-8') . "</a>";
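To see why an extra utf8_encode() does harm, here is a small sketch of what happens when text that is already UTF-8 gets converted a second time. It uses iconv() to perform the same ISO-8859-1 to UTF-8 conversion that utf8_encode() does; the variable names are ours.

```php
<?php
// "é" as it is stored in a UTF-8 string: the two bytes C3 A9.
$correct = "\xC3\xA9";

// Treating those bytes as ISO-8859-1 and converting them to UTF-8 once more
// is what an unnecessary utf8_encode() call does.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $correct);

echo bin2hex($correct), "\n";        // c3a9
echo bin2hex($doubleEncoded), "\n";  // c383c2a9, displayed as "Ã©"
```

Each of the two original bytes is turned into a character of its own, which is where sequences like "Ã©" come from.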
If you want to display a value that might contain quotes inside a text field (example: <input name="inputName" value="$quotedText" />), it's especially important to run $quotedText through htmlspecialchars() with the ENT_QUOTES option, because the rendered HTML might otherwise look like this: <input name="inputName" value="Bruce "Batman" Wayne" />
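A minimal sketch of the escaped version, reusing the $quotedText name from the example above; passing 'UTF-8' as the third argument is our addition so the result does not depend on the PHP version's default charset.

```php
<?php
$quotedText = 'Bruce "Batman" Wayne';

// ENT_QUOTES converts both double and single quotes; naming the charset
// explicitly avoids depending on the PHP version's default.
$safe = htmlspecialchars($quotedText, ENT_QUOTES, 'UTF-8');

echo '<input name="inputName" value="' . $safe . '" />', "\n";
// <input name="inputName" value="Bruce &quot;Batman&quot; Wayne" />
```

The browser turns &quot; back into a quote when it displays the field, so the user still sees the original text.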
When using forms, make sure you add the accept-charset attribute: <form action="myFormHandler.php" method="post" accept-charset="UTF-8">. This asks the browser to send the input to the server UTF-8 encoded.
These are the only requirements for handling Unicode correctly, but care should also be taken with special characters such as quotes to avoid SQL injection. First of all, check whether Magic Quotes is enabled and, if so, make sure the slashes it added are removed:
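Since the server cannot rely on every client respecting that attribute, it may be worth checking incoming values before storing them. The sketch below is one way to do that; isValidUtf8() is a hypothetical helper name, and mb_check_encoding() is an alternative if the mbstring extension is available.

```php
<?php
// Hypothetical helper: returns true when $value is well-formed UTF-8.
// The "u" modifier makes PCRE validate the subject string; on invalid
// UTF-8 preg_match() returns false instead of 1.
function isValidUtf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

var_dump(isValidUtf8("I\xC3\xB1t\xC3\xABrn"));  // bool(true): UTF-8 bytes
var_dump(isValidUtf8("I\xF1t\xEBrn"));          // bool(false): ISO-8859-1 bytes
```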
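// Magic Quotes was removed in PHP 5.4 and this function in PHP 8.0, so check that it exists first.
if ( function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc() ) {
    $_GET = array_map('stripslashes', $_GET);
    $_POST = array_map('stripslashes', $_POST);
    $_COOKIE = array_map('stripslashes', $_COOKIE);
}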
The data should be saved to the database using the functions provided by your database abstraction layer. If you're using the default mysql_* functions provided by PHP, store data this way: "UPDATE t SET column='" . mysql_real_escape_string($_POST['input'], $conn) . "' WHERE ..."; Note that the mysql_* functions were removed in PHP 7.0; on current versions, use mysqli or PDO instead.
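With PDO, a prepared statement takes over the job of escaping. The sketch below is an illustration rather than a drop-in replacement: an in-memory SQLite database stands in for MySQL so that it runs without a server, the table reuses the column name from our CREATE TABLE example, and $input is a stand-in for $_POST['input'].

```php
<?php
// An in-memory SQLite database stands in for MySQL so the sketch runs
// without a server; with MySQL only the DSN and credentials would differ.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE my_table (id INTEGER PRIMARY KEY, internationalized_text TEXT NOT NULL)');

// Stand-in for $_POST['input'], including a quote that would break naive SQL.
$input = "Iñtërnâtiônàlizætiøn's";

// The value is bound separately from the SQL text, so no escaping is needed.
$insert = $pdo->prepare('INSERT INTO my_table (internationalized_text) VALUES (:text)');
$insert->execute([':text' => $input]);

$stored = $pdo->query('SELECT internationalized_text FROM my_table')->fetchColumn();
var_dump($stored === $input);  // bool(true)
```

Because the value never becomes part of the SQL string, quotes and diacritics arrive in the table unchanged.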
It's generally recommended not to save your PHP files as UTF-8 because some editors (including Zend Studio) do not handle them properly. We have configured PHP Eclipse to do so, however, and haven't noticed any problems yet.