Encoding issue with XMLHttpRequest and Firefox 3

Firefox 3.0.0 has a "strange" regression in how it encodes XMLHttpRequest requests. It's not a bug per se, just different behavior that we ran into (and no other browser does it this way).

What we basically do on the client side in JavaScript:

this.data = new XMLHttpRequest();   
this.data.open('POST', dataURI);   
this.data.send(xml); 

where "xml" is a DOMDocument Object.

In Firefox 2.0 this request came with a

Content-Type: application/xml

and the XML in the POST body was encoded in UTF-8 (with no encoding information in the XML declaration).

IE7 sends:

Content-Type: text/xml; charset=UTF-8

But Firefox 3.0.0 sends this as

Content-Type: application/xml; charset=ISO-8859-1

and the XML in the body really is ISO-8859-1 encoded, but there is no encoding information in the XML declaration (e.g. no <?xml encoding="ISO-8859-1"?>), and of course our XML loader fell flat on its nose whenever the body had non-ASCII characters in it...
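
To make the failure concrete, here is a minimal sketch for Node.js of what happens when ISO-8859-1 bytes are read as UTF-8. The sample word and the helper name encodeAsLatin1 are made up for illustration; only Node's built-in Buffer is used.

```javascript
// Sketch for Node.js. The sample word ("Zürich") is hypothetical.
function encodeAsLatin1(text) {
  // Bytes as Firefox 3.0.0 puts them in the POST body (ISO-8859-1)
  return Buffer.from(text, 'latin1');
}

const body = encodeAsLatin1('Z\u00fcrich');

// A reader that assumes UTF-8 hits the byte 0xFC, which is never valid in
// UTF-8. Node substitutes U+FFFD here; a strict XML parser reports an error.
const misread = body.toString('utf8');

// Decoding with the charset named in the Content-Type header works.
const recovered = body.toString('latin1');

console.log(JSON.stringify(misread));
console.log(recovered);
```

The same bytes are fine or broken depending only on which charset the reader assumes, which is why the HTTP header has to be consulted when the XML declaration is silent.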

Having the encoding information only in the HTTP header and not in the XML declaration is correct from a technical point of view (as far as I can remember; I didn't look up any specs), but it was pretty annoying to track down this "bug". Now I have to check on the backend how each request is encoded, and not just rely on "it's UTF-8 nowadays anyway, or at least it's written in the XML declaration, so the XML parser can take care of it" (which was maybe naive from the beginning :)).

Here's the code-snippet for the PHP server side:

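function transformFromContentTypeToUTF8($str) {
    // look for a charset parameter in the request's Content-Type header
    if (isset($_SERVER['CONTENT_TYPE']) && preg_match('#charset="?([^\s;"]+)#i', $_SERVER['CONTENT_TYPE'], $matches)) {
        // charset names are case-insensitive
        $charset = strtoupper($matches[1]);
        if ($charset == 'UTF-8') {
            return $str;
        }
        if ($charset == 'ISO-8859-1') {
            // utf8_encode() is deprecated since PHP 8.2; iconv() below handles this case too
            return utf8_encode($str);
        }
        // any other charset: let iconv convert it
        return iconv($charset, 'UTF-8', $str);
    }
    // if no charset, then return as it came
    return $str;
}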

function fixXMLEncodingFromHTTP($xml) {
    // only convert when the XML declaration does not name an encoding itself
    if (!preg_match('#<\?xml[^>]+encoding=#', $xml)) {
        return transformFromContentTypeToUTF8($xml);
    }
    return $xml;
}

$rawpost = fixXMLEncodingFromHTTP(file_get_contents('php://input'));    

// create a new DOM document out of the posted string
$xmlData = new DOMDocument();
$xmlData->loadXML($rawpost);
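
The header parsing is the fiddly part, so here is a small sketch of just that step in JavaScript, which is easy to try in isolation. The function name charsetFromContentType is hypothetical, and accepting a quoted charset value or spaces around "=" is my own lenient assumption, not something the PHP code above depends on.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
// Returns the charset in upper case, or null when the header names none.
function charsetFromContentType(contentType) {
  if (typeof contentType !== 'string') {
    return null;
  }
  // Case-insensitive match; the value may optionally be wrapped in quotes.
  const match = /;\s*charset\s*=\s*("?)([^\s;"]+)\1/i.exec(contentType);
  return match ? match[2].toUpperCase() : null;
}

console.log(charsetFromContentType('application/xml; charset=ISO-8859-1'));
console.log(charsetFromContentType('text/xml; charset="utf-8"'));
console.log(charsetFromContentType('application/xml'));
```

Normalizing to upper case keeps the later comparisons simple, since charset names are case-insensitive.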

BTW, FF 3 transforms characters outside ISO-8859-1 into numeric entities. Welcome, web 1.0 :)
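
An XML parser resolves such numeric entities on its own when it loads the document, so this only matters if the raw string is stored or compared before parsing. For that case, a minimal sketch in JavaScript; the function name decodeNumericEntities is made up, and leaving ASCII references untouched is my own choice so that "&#38;" or "&#60;" cannot turn into a bare "&" or "<" and break the markup.

```javascript
// Hypothetical helper: turn numeric character references such as "&#21271;"
// or "&#x5317;" back into real characters.
function decodeNumericEntities(xml) {
  return xml.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, (match, ref) => {
    const codePoint = ref[0] === 'x'
      ? parseInt(ref.slice(1), 16)
      : parseInt(ref, 10);
    // Leave ASCII, surrogates and out-of-range values exactly as they were.
    const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
    if (codePoint < 0x80 || codePoint > 0x10ffff || isSurrogate) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('<title>&#21271;</title>'));
```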

And of course there's already a report of this issue on Bugzilla. But I have no idea whether they'll change it back soon.


Comments

Christof @ 25.06.2008 12:25 CET
Thanks for the quick bug fix! Now just the ampersand, and I'll be completely happy ...
Yoan @ 25.06.2008 12:51 CET
Firefox 3 simply sends the current encoding. If it's ISO, it sends iso-8859-1; if it's Unicode, it sends utf-8. So check that the pages that send this are in Unicode.

Although it's not a bad thing to have some kind of content negotiation.

Firefox 3 is somewhat buggy, but not there :-)
Yoan @ 25.06.2008 12:57 CET
After reading the bug, I understood that you're using createDocument and not any existing ones, right? The workaround can be on the JS side, and should be imho, since the bug is apparently there (https://bugzilla.mozilla.org/show_bug.cgi?id=407213#c8).
Chregu @ 25.06.2008 13:28 CET
We actually do everything in UTF-8, and there is really no reason why FF3 sends that as ISO-8859-1. That was also one of my first reactions, but it really wasn't the issue.

And the workaround with new DOMParser() creates a lot of other cross-browser issues and is really ugly :)
Patrice @ 25.06.2008 15:05 CET
I thought it was a bug as I always thought the default encoding of any XML document is UTF-8. But after going back to the specification I stand corrected:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration


Source: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-EncodingDecl
chregu @ 25.06.2008 15:12 CET
Patrice: Thanks for the clarification, I knew I read something like that somewhere.
Yoan @ 25.06.2008 23:57 CET
You can be very naughty and do it that way:

var req = new XMLHttpRequest();
req.open("POST", "/spam/and/eggs.php", false);
if(document.characterSet !== xml.characterSet && xml.characterSet.toLowerCase() !== "utf-8") {
// http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_utf8_encode/
xml = new XMLSerializer().serializeToString(xml);
xml = utf8_decode(xml);
}
req.send(xml);

It's funny to see that, years later, XMLSerializer is still causing or solving trouble: http://blog.liip.ch/archive/2004/07/02/broken_xmlserializer_in_mozilla_1_7_and_firefox.html

serializer.serializeToString() works with non-document as well, so I guess you can create new nodes from the current document (with the correct characterSet) and serialize them using it. Like XMLHttpRequest, XMLSerializer is a yummy ActiveX for MSIE (that I guess leaks).

Best of luck! You shouldn't do such nasty things, you know? Can't you just concatenate strings like everyone else does? (I'm kiddin')
Chregu @ 26.06.2008 06:40 CET
Thanks for that utf8_decode link (but shouldn't it be encode?). But I still prefer the server-side option, 'cause Firefox isn't actually doing much wrong, just "different", so the server-side check has to be done anyway for correctness.

But it's always fun to explore all the possibilities, thanks for the input :)
Yoan @ 26.06.2008 10:22 CET
Yes, utf8_encode, you're right. It was a bit late. Anyway this solution isn't alright.

BTW this is the code inside Y!Mail (one bug description mentions it):

if(Lb.a){
this.a=new DOMParser().parseFromString("<?xml version='1.0' encoding='UTF-8'?><dummy/>","application/xml");
this.a.removeChild(this.a.firstChild);
}else{
this.a=activeX._new(activeX.xmldom);
}

It's for creating SOAP messages.

document.implementation.createDocument is used inside loadXML when the content is empty.

Are you already handling MSIE differently on the client side? Of course the server-side trick is a great sanity check, but it's not fixing the issue at its root.

The workaround I prefer is (only if characterSet is iso-8859-1 of course):

req.send(new XMLSerializer().serializeToString(newDoc));

It doesn't look like a hack at all.

BTW, don't forget to do a little html_entity_decode: html_entity_decode(iconv(...), ENT_NOQUOTES, "UTF-8");
I wouldn't like to see, in my DB, Chinese documents written in HTML entities only.

北 -> &#21271; = f34r!

Did I tell you that I was opinionated? ;-)

Cheers,
