Unzip on Ubuntu, the mess it makes with UTF-8 and the PHP fix

Create a file called “2 clés.html”, zip it up as html.zip, then unzip it with the following command: unzip html.zip -d /tmp. Now check the /tmp folder. What do you see?

If it looks like “2 cl#U00e9s.html”, you’re screwed if you want to use the file name any further, for instance storing it in a database as the title of the article whose content is in the html file. unzip has swapped the “é” for an escape: #U followed by the character’s code point in hex (00e9).

We need to get back to “2 clés” again, but how?

I did some googling and found an acceptable, if not fantastic, solution on Stack Overflow.

I incorporated it in my file handler class like this:

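// Converts the #Uxxxx escapes that unzip leaves in file names back into
// real UTF-8 characters, e.g. "2 cl#U00e9s.html" becomes "2 clés.html".
function fixFileName($str){
	
	// Without the mbstring extension we cannot convert, so return the name untouched.
	if(!function_exists('mb_convert_encoding'))
		return $str;
	
	// Match "#U" followed by exactly four hex digits (one UCS-2 code unit).
	// Matching the escape directly leaves any other "#U" or "\u" text alone.
	return preg_replace_callback('/#U([0-9a-f]{4})/i', function($match){
		// pack('H*') turns the hex digits into two raw bytes,
		// which are then re-encoded from big-endian UCS-2 to UTF-8.
		return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
	}, $str);
	
}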

With an instance of the file handler, I can then simply do this: $title = $filer->fixFileName($title);
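
The method above gives up and returns the name unchanged when the mbstring extension is missing. As a rough sketch of a way around that, the code point can be encoded to UTF-8 by hand. The function names codePointToUtf8 and decodeUnzipEscapes are made up for this example and are not part of my file handler class. Handling a six-digit #Lxxxxxx form for characters above U+FFFF is an assumption about what unzip emits, so test it against your own archives before relying on it.

```php
<?php
// Hypothetical helper: encode a single Unicode code point as UTF-8 bytes.
function codePointToUtf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical standalone variant of fixFileName() that needs no mbstring.
function decodeUnzipEscapes($name)
{
    // "#U" + 4 hex digits, or (assumed) "#L" + 6 hex digits.
    return preg_replace_callback(
        '/#(?:U[0-9a-f]{4}|L[0-9a-f]{6})/i',
        function ($match) {
            // Drop the two-character "#U" / "#L" prefix and parse the hex digits.
            $cp = (int) hexdec(substr($match[0], 2));

            // Leave surrogates and out-of-range values as they are.
            if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $match[0];
            }
            return codePointToUtf8($cp);
        },
        $name
    );
}

echo decodeUnzipEscapes('2 cl#U00e9s.html'), "\n"; // prints "2 clés.html"
```

Names without any escape pass through untouched, so it should be safe to run over every extracted file name.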
