How to get rid of invalid UTF-8 characters
We aggregate the PHP category from del.icio.us on the Planet PHP website. Unfortunately, the input from del.icio.us sometimes contains invalid UTF-8 byte sequences, which lead to errors in the XML parsing. The following iconv line gets rid of all of them:
$t = iconv("UTF-8","UTF-8//IGNORE",$t);
Problem solved ;) (But I also wrote a mail to the del.icio.us people about the problem, 'cause this shouldn't happen in the first place.)
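For anyone who wants a bit more safety around that one-liner, here is a minimal sketch. The helper name strip_invalid_utf8 is made up for illustration, and the sketch assumes the local iconv build honours //IGNORE (not every build does), so it checks the result instead of trusting it.

```php
<?php
// Hypothetical helper; the name is not from the original post.
function strip_invalid_utf8(string $text): string
{
    // With the /u modifier PCRE rejects malformed UTF-8, so matching an
    // empty pattern works as a validity check without extra extensions.
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid, nothing to do
    }

    // Same call as in the post. The @ hides the notice that some iconv
    // builds raise when they meet an illegal byte sequence.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $text);

    // Assumption: the local iconv honours //IGNORE. If it does not,
    // fail loudly instead of passing bad bytes on to the XML parser.
    if ($clean === false || preg_match('//u', $clean) !== 1) {
        throw new RuntimeException('iconv could not clean the input');
    }
    return $clean;
}

// "\xFF" can never occur in well-formed UTF-8.
$feedItem = "caf\xC3\xA9 \xFF php";
echo bin2hex(strip_invalid_utf8($feedItem)), "\n";
```

Checking first means valid input is passed through untouched, and the exception makes a broken iconv setup visible right away.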
Comments
You may need to run $text = iconv("ISO-8859-1","UTF-8", $text); beforehand.
Otherwise umlauts in HTML entities get cut out as well.
Regards, luc
chregu
@ 26.01.2005 21:06 CET
luc: I assumed $text is already UTF-8 encoded. If it isn't, you can do the conversion directly in one step... But if it's proper ISO-8859-1, there are no invalid UTF-8 characters in there anyway ;)
Well, doesn't the // make the rest of the line a comment? What language is this, and where does it go? ;)
admin
@ 23.01.2006 19:30 CET
chregu,
You are truely an idiot for saying what you did below.
--------------------------
chregu @ 26.01.2005 21:06 CET
luc: I assumed, $text is already UTF-8 encoded. If it isn't, you can directly do that in one step... But if it's proper ISO-8859-1, there are no invalid UTF-8 characters in there, anyway )
--------------------------
I would NEVER hire you as a programmer.
If what you said below were true, nobody would be asking these questions now would they.
I dare you to try putting a ISO-8859-1 & sign in an xml page and see what happens in your browser.
IDIOT!!!!!!!!!!!!!!!!
Dear anonymous admin
Thanks for the job offer :)
I didn't claim that the output will be "correct", but I stand by my claim that an ISO-8859-1 encoded string doesn't have *invalid* UTF-8 characters.
And about the & character. That was not part of the problem to be solved and it has the same encoding anyway. But you smart guy knew that already, of course :)
A year and a half later, triggered by this: http://www.sitepoint.com/blogs/2006/08/09/scripters-utf-8-survival-guide-slides/#comment-43992
You're both wrong ;)
"I dare you to try putting a ISO-8859-1 & sign in an xml page and see what happens in your browser."
An ISO-8859-1 & sign is the same thing as an ASCII & sign. So you could happily pull one out of an ISO-8859-1 string and drop it into a UTF-8 string, and have valid UTF-8. Putting a bare & in an XML document _will_ cause a problem, but not because of character encoding: the & sign is an XML delimiter and has to be escaped as &amp;. See http://www.xml.com/pub/a/2001/01/31/qanda.html. So @admin - best fire yourself.
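To make the ampersand point concrete, a small sketch: the sample string and the title element are invented for illustration, and htmlspecialchars() with ENT_XML1 is just one common way to do the escaping.

```php
<?php
// The ampersand is byte 0x26 in ASCII, ISO-8859-1 and UTF-8 alike.
$latin1 = "Tom & Jerry";   // made-up sample, pure ASCII
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Re-encoding changes nothing for an ASCII-only string.
var_dump($utf8 === $latin1);

// What XML needs is escaping of the delimiter, not a charset conversion.
$escaped = htmlspecialchars($utf8, ENT_XML1 | ENT_QUOTES, 'UTF-8');
echo "<title>{$escaped}</title>\n";
```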
"claim that a ISO-8859-1 encoded string doesn't have *invalid* UTF-8 characters."
If an ISO-8859-1 string only contains characters in the ASCII range, then you're right. But anything that requires 8 bits to represent (e.g. umlauts, MS smart quotes), i.e. characters outside the ASCII range, will need conversion and 2 bytes to represent in UTF-8. PHP's utf8_encode() fn would do the job.
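A quick way to see the difference, as a sketch only: is_valid_utf8 is a made-up helper name, and the sample strings are invented. It uses iconv() for the conversion, as elsewhere on this page.

```php
<?php
// Hypothetical helper: PCRE's /u modifier rejects malformed UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$ascii  = "Gruss";        // ASCII only
$latin1 = "Gr\xFC\xDFe";  // "Grüße" as ISO-8859-1, one byte per character

var_dump(is_valid_utf8($ascii));
var_dump(is_valid_utf8($latin1)); // 0xFC is not a legal UTF-8 lead byte

// After conversion each of the two non-ASCII characters takes two bytes.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump(is_valid_utf8($utf8));
echo strlen($latin1), ' -> ', strlen($utf8), " bytes\n";
```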
Ah - becoming annoying ;)
I'm not sure anymore what I really wanted to claim back then. I think it was that
$t = iconv("ISO-8859-1","UTF-8//IGNORE",$t);
doesn't make sense, since there's nothing in a proper ISO-8859-1 string that can't be converted to UTF-8, so that
$t = iconv("ISO-8859-1","UTF-8",$t);
(or utf8_encode) would be enough :)
Of course you can't output an ISO-8859-1 string as UTF-8 and expect it to be the same :)
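That claim can be checked by brute force. The following is a rough sketch, not a proof for every iconv build: it feeds all 256 possible byte values through the plain conversion and looks at what comes out.

```php
<?php
// Build a string holding every possible byte value, 0x00 through 0xFF.
$allBytes = '';
for ($i = 0; $i <= 0xFF; $i++) {
    $allBytes .= chr($i);
}

// No //IGNORE needed: every ISO-8859-1 byte maps to a code point
// in the range U+0000..U+00FF.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $allBytes);

var_dump($utf8 !== false);                // conversion succeeded
var_dump(preg_match('//u', $utf8) === 1); // result is well-formed UTF-8
var_dump($utf8 === $allBytes);            // but the bytes are not the same
echo strlen($allBytes), ' -> ', strlen($utf8), " bytes\n";
```

The 128 ASCII bytes stay one byte each and the upper 128 become two bytes each, which is why the raw bytes can't simply be relabelled as UTF-8.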
"You are truely an idiot for saying what you did below."
Perhaps you meant "truly."
You're welcome!
fred
@ 08.06.2007 18:47 CET
I don't understand why people, when typing messages on the web, seem desperate to admonish each other for their mistakes in the harshest possible way. If someone writes something that's incorrect, they'd probably appreciate being notified of their mistake -- so why not just do that, notify them? Where is the sense in insulting them? Why do people get so hot-headed on the web?
This seems to happen mostly in public forums, not in private messages like e-mails, so I think these kinds of insults are all about showing off.
bob - I think you are guilty of what fred states.
To the point: there is a good use for this code. I often have to deal with incoming data (from a source I have no control over) which claims to be UTF-8 but occasionally contains invalid characters. Because XML with invalid characters is not well-formed, it has to be sorted out one way or another. This is certainly one way of doing it.
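Another way of doing it, for setups where iconv is missing or its //IGNORE support is unreliable: a pure-PCRE sketch that keeps only well-formed UTF-8 sequences. This is a different technique from the iconv line in the post, and keep_valid_utf8 is a made-up name.

```php
<?php
// Hypothetical helper: collect every well-formed UTF-8 sequence and
// drop whatever bytes lie between them. Works on raw bytes (no /u flag).
function keep_valid_utf8(string $bytes): string
{
    $wellFormed = '/
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general case
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general case
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    /x';

    preg_match_all($wellFormed, $bytes, $matches);
    return implode('', $matches[0]);
}

echo bin2hex(keep_valid_utf8("caf\xC3\xA9 \xFF php")), "\n";
```

It builds one array entry per character, so for large feeds the iconv call is likely the cheaper option where it works.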
