Since I’ve been thinking I ought to write about my work more often, and inspired by the strangeness of this incident, here goes.
I’d been trying to debug how a few ?s came to be in an ad banner tag submission. I’d dug into change logs and the other points where we log transactions, to no avail. Since we’d never seen anything like it before, I’d basically decided I’d spent enough time on it and was about to resort to an “it was caused by network ghosts” type of explanation. I figured the ?s came from some erroneous network transmission.
On our system, there was nothing strange in the tag field whatsoever. On the adserver though, there appeared some question marks, looking like this:
Then, though I don’t really know when, it hit me. I should view that offending code in a more verbose setting, don x-ray specs if you will. My first choice was vi. Lo and behold, the offending characters appeared before my eyes.
Ah-ha, there is something in there! WTF is that? Naturally, I googled “feff”. Within the first few results it was clear: the offending character is a Unicode character called ZERO WIDTH NO-BREAK SPACE. That’s just too damn perfect.
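If you don’t have vi handy, the same x-ray trick can be sketched in a few lines of Python. This is only an illustration: the function name `reveal_invisible` and the sample tag are made up for this post, not taken from our system. It also shows one plausible way a character like this could end up as a ?, namely a lossy conversion to ASCII somewhere along the way. That part is a guess on my end, not something I confirmed on the adserver.

```python
import unicodedata

def reveal_invisible(text):
    """Replace invisible format/control characters with a visible <U+XXXX> marker."""
    out = []
    for ch in text:
        # "Cf" = format characters (U+FEFF lives here), "Cc" = control characters.
        # Newlines and tabs are left alone so the text stays readable.
        if unicodedata.category(ch) in ("Cf", "Cc") and ch not in "\n\t":
            out.append("<U+%04X>" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical banner tag with a stray U+FEFF hiding inside it.
tag = '<a href="x">\ufeffbanner</a>'

print(reveal_invisible(tag))        # <a href="x"><U+FEFF>banner</a>
print(unicodedata.name("\ufeff"))   # ZERO WIDTH NO-BREAK SPACE
print("\ufeff".encode("utf-8"))     # b'\xef\xbb\xbf'

# Assumption: a lossy conversion to ASCII is one way to get a literal "?".
print("\ufeff".encode("ascii", errors="replace"))  # b'?'
```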
After that fine moment of glory, I noticed another search result of interest: a humorous blog post by a Microsoft employee whom I just happened to have recently seen (I think it was him…) during “The New Efficiency” event. (I was totally there for the info, not just the free Windows 7 Ultimate, honest.) The post is titled “Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)”. It’s the story of the famous “Notepad” application and how this particular Unicode character played a part. My favorite line from his post:
“This post is sponsored by “” U+feff (ZERO WIDTH NO-BREAK SPACE, of course)
Though he was a little bitter about the lack of visible representation here, I was unable to find the little guy to spray paint him so that you could all see him here today. He is between those quotes, I can promise you that.”
How to prevent this is tricky – one solution could be a better function to sanitize input from the user. But we already do a decent job of that. And even with more filtering, so many strange things manage to wiggle their way in. A strict white-list is probably the way to go.
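For what it’s worth, here is a rough sketch of what a strict whitelist could look like. It is not our actual sanitizer. The names `ALLOWED`, `find_disallowed` and `sanitize_tag` are hypothetical, and the allowed set (printable ASCII plus basic whitespace) is an assumption that would reject legitimate non-ASCII text, so it only makes sense if banner tags are expected to be plain ASCII.

```python
import string

# Assumption: banner tags only ever need printable ASCII and basic whitespace.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t\r\n")

def find_disallowed(text):
    """Return (position, code point) for every character outside the whitelist."""
    return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(text) if ch not in ALLOWED]

def sanitize_tag(text):
    """Drop every character that is not explicitly allowed."""
    return "".join(ch for ch in text if ch in ALLOWED)

# Hypothetical submission with a U+FEFF hiding in it.
submitted = '<img src="banner.gif">\ufeff'

print(find_disallowed(submitted))  # [(22, 'U+FEFF')]
print(sanitize_tag(submitted))     # <img src="banner.gif">
```

Reporting the offenders with `find_disallowed` before stripping them is probably the more useful half, since it tells you where the strange things are wiggling in from.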
Thanks for the laugh, FEFF.