Thursday, December 6, 2007

BOM Squad

Since posting about phpBB3 and the UTF-8 BOM/Signature in October, I've discovered that a lot of people seem to be tripping over UTF-8 encoded files that have a BOM, otherwise known as a Signature, embedded.
Most people seem to know what the BOM is and what it does, but don't know how to get rid of it or how it got there.

Well, now seems like a good time to go over how the BOM gets in files, how to find out if an editor is silently prepending a BOM to UTF-8 encoded files, and how it can be removed if needed.

A major source of these BOM headaches seems to be Notepad on Windows. When you save a file with UTF-8 encoding, Notepad automatically prepends the BOM to the file but doesn't tell you about it.

I think this is partly due to quite a few people suggesting that Notepad can be used to easily and quickly convert files to UTF-8 in order to solve issues with strange characters showing up in files.

utf-8 Windows Notepad
Here's a screenshot of the choices you get when saving a file using Notepad. You can see there's no mention of BOM, or Signature there.
I can't help but wonder why Microsoft decided to differentiate between big and little endian for Unicode but not offer a choice for the UTF-8 BOM.

From what I understand, Windows text editors are the main contributors of UTF-8 BOMs in files. Applications on UNIX-type systems generally don't include the BOM because it can cause problems with configuration files, which are primarily text files.

The first thing to do is check your editor's settings very carefully, particularly the areas having to do with encoding, for any mention of UTF-8, BOM, and/or Signature. You may have control over what your editor does, and it's good to know if you do.

One method to determine if your editor is secretly prepending the BOM to your files is to save an empty file using UTF-8 encoding and look at the file size. If what should be an empty file has a size of three (3) bytes, your editor is prepending a BOM. You might want to leave a single character in the file and make sure the size isn't four (4) bytes instead, since some editors may default to a non-UTF encoding for empty files.
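The file-size check can also be done by looking at the first three bytes directly, since the UTF-8 BOM is always the byte sequence EF BB BF. Here's a minimal sketch of that idea; `has_utf8_bom` is a hypothetical helper name, not something from the forum thread, and the sample files are temporary ones created just for the demonstration. It assumes PHP 5.2.1 or later for `sys_get_temp_dir` and `file_put_contents`.

```php
<?php
// Hypothetical helper (an assumption, not from the original post): reports
// whether a file starts with the three UTF-8 BOM bytes EF BB BF.
function has_utf8_bom($filename)
{
    $fh = fopen($filename, 'rb');
    if (!$fh) {
        return false; // unreadable files are simply reported as "no BOM"
    }
    $head = fread($fh, 3);
    fclose($fh);
    return $head === "\xEF\xBB\xBF";
}

// Demonstration: one sample file with a BOM and one without.
$with = tempnam(sys_get_temp_dir(), 'bom');
$without = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($with, "\xEF\xBB\xBF" . "a");
file_put_contents($without, "a");

echo filesize($with), ' byte(s), BOM: ', has_utf8_bom($with) ? 'yes' : 'no', "\n";
echo filesize($without), ' byte(s), BOM: ', has_utf8_bom($without) ? 'yes' : 'no', "\n";

// Clean up the sample files.
unlink($with);
unlink($without);
```

To test your own editor, save a file from it and pass that file's name to the helper instead of the temporary files.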

I use a Notepad replacement called Notepad2 for quick editing. That application gives me an option to save UTF-8 encoded files with or without the BOM, though it calls it a Signature. Notepad2 displays which encoding is used in the file on the application's status bar, which comes in handy from time to time. The application is self-contained and, compressed, is only about 250KB, so it's easily carried around on a USB memory stick.

Another option for getting rid of the BOM is using the Perl script in this forum thread. I tried it myself (as seen in that thread), and it works exactly as expected. This is probably one of the easiest options if the files are on an Ubuntu or other Linux system.

I'm sure there are plenty of people looking for a PHP solution to removing the BOM, so here's a function designed for that situation called debom_utf8. It should work in either PHP4 or PHP5. The methods used are generic enough that it should continue to work in PHP6.

<?php
/*
@author: http://develobert.blogspot.com/
@description : PHP Function to remove UTF-8 BOM/Signature from the beginning of a file.
@param $filename: Name of the file a BOM should be looked for and removed from.
@returns (bool): Returns true if the file didn't, or no longer contain(s) a BOM, false on error.

@example usage: echo debom_utf8('BOM.txt') ? 'free of BOM' : 'error';
*/
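function debom_utf8($filename = '')
{
    // Assign first, then compare: in "$size = filesize(...) && $size < 3"
    // the && binds tighter than =, so $size would end up a boolean.
    $size = filesize($filename);
    if($size === false)
    {// File missing or unreadable
        return false;
    }
    if($size < 3)
    {// BOM not possible
        return true;
    }
    if($fh = fopen($filename, 'r+b'))
    {
        $done = false;
        if(bin2hex(fread($fh, 3)) != 'efbbbf')
        {// No BOM found
            $done = true;
        }
        else if($size == 3)
        {// Empty other than BOM
            $done = ftruncate($fh, 0);
        }
        else
        {// Shift file contents to beginning of file
            $buffer = fread($fh, $size - 3);
            if($buffer !== false && strlen($buffer) == $size - 3 && rewind($fh))
            {// Write the shifted contents, then cut off the 3 leftover bytes
                $done = (fwrite($fh, $buffer) === strlen($buffer)) && ftruncate($fh, strlen($buffer));
            }
        }
        // Close the handle on every path, including failures
        fclose($fh);
        return $done;
    }
    return false;
}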
?>
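When several files need cleaning at once and none of them are large, a simpler whole-file approach may be enough. The sketch below is not the function above: it reads each file into memory, drops the first three bytes if they are the BOM, and writes the rest back. The name `strip_utf8_bom_files` and the placeholder path in the usage comment are made up for illustration, and it assumes PHP 5 or later for `file_put_contents`.

```php
<?php
// Hypothetical batch helper (an assumption, not part of the original post).
// Returns an array of filename => true (clean or cleaned) / false (error).
function strip_utf8_bom_files($filenames)
{
    $bom = "\xEF\xBB\xBF";
    $results = array();
    foreach ($filenames as $filename) {
        $data = @file_get_contents($filename);
        if ($data === false) {
            // File missing or unreadable
            $results[$filename] = false;
            continue;
        }
        if (substr($data, 0, 3) === $bom) {
            // Rewrite the file without its first three bytes.
            $written = file_put_contents($filename, substr($data, 3));
            $results[$filename] = ($written !== false);
        } else {
            // Nothing to remove
            $results[$filename] = true;
        }
    }
    return $results;
}

// Example usage: every .php file in a directory (the path is a placeholder).
// print_r(strip_utf8_bom_files(glob('/path/to/site/*.php')));
```

The trade-off is memory: each file is loaded whole, so for very large files the in-place approach of debom_utf8 is probably the better fit.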


Comments on other methods to deal with BOM disposal are more than welcome.

2 comments:

Anonymous said...

Hi,

I saw this post once and enjoyed reading it. Now I was in need of the script you posted above, and I searched the net for about an hour to find it again. I think you're the only one who has published a PHP script to remove the UTF-8 signature / BOM.

Thank you very, very much!!!

I still haven't had the time to make and test my suggestions, but I will come back after I've tested them.

Quote (of develobert):
I couldn't quite figure out how to detect the BOM at the binary level. I ended up converting the 3 potential BOM bytes into ASCII hexadecimal & doing a string comparison. :scratchhead:

I've seen this:
<?php
function removeBOM($str=""){
    if(substr($str, 0,3) == pack("CCC",0xef,0xbb,0xbf)) {
        $str=substr($str, 3);
    }
    return $str;
}
?>
Here

Is that what you were searching for?
It also looks much simpler than your script, but of course it checks fewer conditions.


Last but not least, it would be great if your script could take a directory path or a list of files and remove the BOM from all of them.

waiting for your answer
---------------------------------------------------

Freelance Web Designer Rudolf Megert on hopes for better internet technology said...

Yep, it would have been interesting to see a comment on this topic, as I just tried the full scriptlet but have a slightly different issue.

I use a regular HTML layout file saved as a PHP file with a session start header.
Then I use an "include" command for an otherwise external contact form file, also in PHP.

I can put that script into either of the 2 files that run on the same webpage, but if I try to place it in the second file I get an error message saying something like "already runs on - other file ......." or so.

So, one would really wish these Microsoft nerds (I use Notepad) would be a bit more open or clever about this nuisance of another hidden issue.

Suggestions on getting a solution for both files would be greatly welcome....