July 9, 2010
MarkBernstein.org
 

Encodings

A puzzle for text fans out there.

Imagine you’re a program like Twig or . People drag text or paste it from all sorts of places. Some is Unicode, some is MacRoman, some is encoded in other ways.

Now, let’s also suppose that some of the text sources are themselves confused. They say, “I am Unicode utf-8,” but they aren’t. You see this all the time on the Web, for example, because sites get their metadata tags mixed up.

Now, given the possibility of misrepresentation, what’s the best policy for receiving text, such that (a) you will accept all correctly encoded text, (b) you will translate other encodings to utf-8, and (c) even if the original source lies about its encoding, you will never have invalid utf-8?

I would expect this to be trivial and routine, but if so, I’m looking in the wrong place.
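
For what it's worth, here is one candidate policy, sketched in Python. It is a guess at an answer, not a settled one. The function name to_valid_utf8, the declared parameter, and the fallback list are all hypothetical names chosen for illustration. The idea: treat the source's claim as a hint, decode strictly so that a false claim raises an error instead of slipping through, try a short list of fallbacks, and as a last resort substitute U+FFFD for whatever cannot be decoded.

```python
from typing import Optional, Sequence


def to_valid_utf8(raw: bytes,
                  declared: Optional[str] = None,
                  fallbacks: Sequence[str] = ("utf-8", "mac_roman")) -> bytes:
    """Return valid UTF-8 for `raw`, whatever the source claimed.

    `declared` is the encoding the source says it uses (a hint only).
    `fallbacks` is an assumed, application-specific list of guesses.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            # Strict decoding: bytes that do not fit the encoding raise
            # an error rather than passing through silently.
            text = raw.decode(name)
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes contradict the claimed encoding.
            # LookupError: the label is unknown or not a text encoding.
            continue
        # Re-encoding a decoded string gives well-formed UTF-8; "replace"
        # guards against any stray character that cannot be encoded.
        return text.encode("utf-8", errors="replace")
    # Nothing decoded cleanly: keep what is readable and mark the rest
    # with U+FFFD, so the result is still valid UTF-8.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# A source that says "I am Unicode utf-8" but is really MacRoman:
mislabeled = b"caf\x8e"  # 0x8E is e-acute in MacRoman
print(to_valid_utf8(mislabeled, declared="utf-8"))
```

This covers (a) and (c), and covers (b) only as well as the guesses allow. A single-byte encoding such as MacRoman accepts nearly any byte sequence, so a source that wrongly claims MacRoman will not be caught: the output is valid utf-8 but may be the wrong characters. Trying strict utf-8 before trusting the declared encoding is a plausible alternative ordering, since text in other encodings rarely happens to be well-formed utf-8.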