Manually writing the byte order mark (BOM) for an encoding into a stream
I recently discovered a problem with our WebCopy and Cyotek Sitemap Creator products to do with "corruption" of plain text documents, where non-ANSI characters appeared incorrectly. It didn't take long to realize that these programs were saving text content as ANSI files, which I found curious, as the Crawler library they use detects the response encoding and uses it to save the files.
Or does it? Consider the code below:
// Requires the System.IO and System.Text namespaces
string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  // the encoding is passed in, but see below for what it actually affects
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    writer.Write(data);
}
Looking at this, you might be tempted to assume (as I did) that this code would save the content in the given encoding. When I opened one of the files generated by similar code in Notepad++, I found it was being treated as an ANSI file. Switching the encoding to UTF-8 immediately displayed the file correctly without the "corruption". So the byte order mark (BOM) isn't written by the BinaryWriter - the encoding passed to its constructor is only used when writing characters and strings, and a byte array is written to the stream unchanged. All this time I assumed files were being saved as UTF-8 (or whatever the response encoding was) and properly supported Unicode, and all this time I was wrong.
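To make the behaviour concrete, here is a minimal sketch that writes a byte array through a BinaryWriter into a MemoryStream and returns what actually ended up in the stream. The BomDemo class and WriteWithBinaryWriter method are hypothetical names used purely for illustration; they are not part of the Crawler library.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
  // Hypothetical helper: returns the exact bytes a BinaryWriter emits
  // when asked to write a byte array using the given encoding.
  public static byte[] WriteWithBinaryWriter(byte[] data, Encoding encoding)
  {
    using (MemoryStream stream = new MemoryStream())
    {
      using (BinaryWriter writer = new BinaryWriter(stream, encoding))
      {
        // Write(byte[]) copies the bytes as-is; the encoding plays no part
        writer.Write(data);
      }

      // ToArray still works after the writer has closed the stream
      return stream.ToArray();
    }
  }
}
```

Passing in the UTF-8 bytes for "café" should give back the same five bytes, with no EF BB BF prefix in front of them.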
So how do you manually write a BOM into a document? The oddly named GetPreamble function available from the Encoding class is what you need - this returns the bytes that comprise the BOM, and you can then write this directly to your stream:
string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
  {
    writer.Write(encoding.GetPreamble());
    writer.Write(data);
  }
}
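The preamble is different for each encoding, and some encodings have none at all. As a small illustration, the sketch below formats whatever GetPreamble returns as hex; PreambleDemo and DescribePreamble are made-up names for this example only.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
  // Hypothetical helper: formats an encoding's preamble as hex,
  // or "(none)" when the encoding does not define a BOM.
  public static string DescribePreamble(Encoding encoding)
  {
    byte[] preamble = encoding.GetPreamble();

    return preamble.Length == 0
      ? "(none)"
      : BitConverter.ToString(preamble);
  }
}
```

With the built-in encodings, DescribePreamble(Encoding.UTF8) returns "EF-BB-BF", Encoding.Unicode gives "FF-FE", Encoding.BigEndianUnicode gives "FE-FF" and Encoding.ASCII gives "(none)". An instance created with new UTF8Encoding(false) also reports "(none)", so writing its preamble is harmless but adds nothing.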
Note that you only need to write a BOM if your document is actually supposed to be a text file - if it is "normal" binary data (such as an image or a gzip stream) then you definitely do not want to write a BOM, or you truly will have a corrupt file.
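One further wrinkle worth guarding against: the byte array may already begin with a BOM (I am assuming here that some servers include one in the response body), in which case writing the preamble again would duplicate it. The following is only a sketch of how that check might look, not the code used in the Crawler library; TextFileWriter, StartsWithPreamble and WriteTextBytes are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextFileWriter
{
  // Hypothetical helper: true if data already begins with the given preamble.
  public static bool StartsWithPreamble(byte[] data, byte[] preamble)
  {
    if (preamble.Length == 0 || data.Length < preamble.Length)
      return false;

    for (int i = 0; i < preamble.Length; i++)
    {
      if (data[i] != preamble[i])
        return false;
    }

    return true;
  }

  // Hypothetical helper: writes text bytes, adding the BOM only when the
  // encoding defines one and the data does not already start with it.
  // Only call this for text content, never for images or other binary data.
  public static void WriteTextBytes(Stream stream, byte[] data, Encoding encoding)
  {
    byte[] preamble = encoding.GetPreamble();

    if (preamble.Length > 0 && !StartsWithPreamble(data, preamble))
      stream.Write(preamble, 0, preamble.Length);

    stream.Write(data, 0, data.Length);
  }
}
```

Writing straight to the Stream avoids the BinaryWriter altogether, which seems reasonable given that its encoding argument has no effect on byte arrays.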
Now the files produced by WebCopy and Sitemap Creator are encoded correctly and I can be happy with yet another bug squashed, unhappy at yet another reminder of why I need to write a proper set of automated tests for the libraries I use, but happy again that I had another (albeit brief) tip to post on this blog.
Update History
- 2012-12-11 - First published
- 2020-11-21 - Updated formatting