How to get rid of invalid UTF-8 characters

We aggregate the PHP category from del.icio.us on the Planet PHP website. Unfortunately, the input from del.icio.us sometimes contains invalid UTF-8 byte sequences, which leads to errors in the XML parsing. The following iconv line gets rid of all of them:

 $t = iconv("UTF-8","UTF-8//IGNORE",$t);

Problem solved ;) (But I also wrote a mail to the del.icio.us people about the problem, 'cause this shouldn't happen in the first place)
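
As a minimal sketch of where that line fits, the snippet below wraps it in a helper and cleans a string before handing it to DOM. The helper name strip_invalid_utf8() and the sample fragment are made up for illustration. Some PHP/iconv combinations are reported to return false for //IGNORE when they hit an illegal byte, so the sketch falls back to mbstring in that case, which assumes the mbstring extension is loaded.

```php
<?php
/**
 * Drop byte sequences that are not valid UTF-8 (hypothetical helper).
 */
function strip_invalid_utf8(string $t): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8 input,
    // so a successful match means there is nothing to strip.
    if (preg_match('//u', $t) === 1) {
        return $t;
    }

    // The line from the post; @ hides the notice some builds raise.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $t);
    if ($clean !== false) {
        return $clean;
    }

    // Fallback (assumes mbstring is loaded): convert UTF-8 to UTF-8 and
    // have mbstring drop, rather than replace, whatever it cannot decode.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $clean = mb_convert_encoding($t, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    return $clean;
}

// Made-up feed fragment with one stray 0xFF byte in it.
$dirty = "<title>caf\xC3\xA9 \xFF ok</title>";
$clean = strip_invalid_utf8($dirty);

// Clean first, then hand the string to the XML parser.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadXML($clean);
```

Dropping bytes silently loses data, so this only suits input where a missing character is preferable to a failed parse.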

Comments

Hannes @ 25.01.2005 02:52 CEST
HAPPY HAPPY HAPPY HAPPY BIRTHDAY CHREGU!
DON'T CODE TODAY! :)
luc @ 26.01.2005 16:37 CEST
Maybe run $text = iconv("ISO-8859-1","UTF-8", $text); first.
Otherwise umlauts in HTML entities get cut out as well.
Regards, luc
chregu @ 26.01.2005 22:06 CEST
luc: I assumed, $text is already UTF-8 encoded. If it isn't, you can directly do that in one step... But if it's proper ISO-8859-1, there are no invalid UTF-8 characters in there, anyway )
Tim @ 29.01.2005 16:39 CEST
Great - but where do I put this?

- Tim
chregu @ 29.01.2005 16:42 CEST
Tim: Before you try to load the string with DOM, for example.
loo @ 17.04.2005 23:06 CEST
Sorry, but where do i put this?
yo @ 11.05.2005 00:35 CEST
well, doesn't the // make the rest of the line commented... what language is this. and where does it go ;)
admin @ 23.01.2006 20:30 CEST
chregu,

You are truely an idiot for saying what you did below.

--------------------------
chregu @ 26.01.2005 21:06 CET
luc: I assumed, $text is already UTF-8 encoded. If it isn't, you can directly do that in one step... But if it's proper ISO-8859-1, there are no invalid UTF-8 characters in there, anyway )
--------------------------

I would NEVER hire you as a programmer.

If what you said below were true, nobody would be asking these questions now would they.

I dare you to try putting a ISO-8859-1 & sign in an xml page and see what happens in your browser.

IDIOT!!!!!!!!!!!!!!!!
chregu @ 24.01.2006 01:03 CEST
Dear anonymous admin

Thanks for the job offer :)

I didn't claim that the output will be "correct", but I stand by my claim that a ISO-8859-1 encoded string doesn't have *invalid* UTF-8 characters.

And about the & character. That was not part of the problem to be solved and it has the same encoding anyway. But you smart guy knew that already, of course :)
Harry Fuecks @ 09.08.2006 15:21 CEST
A year and a half later, triggered by this: http://www.sitepoint.com/blogs/2006/08/09/scripters-utf-8-survival-guide-slides/#comment-43992

You're both wrong ;)

"I dare you to try putting a ISO-8859-1 & sign in an xml page and see what happens in your browser."

An ISO-8859-1 & sign is the same thing as an ASCII & sign, so you could happily pull one out of an ISO-8859-1 string and drop it into a UTF-8 string and still have valid UTF-8. Putting a bare & in an XML document _will_ cause a problem, though not because of character encoding: the & sign is an XML delimiter (see http://www.xml.com/pub/a/2001/01/31/qanda.html). So @admin - best fire yourself.
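
To illustrate the point about & (a sketch with a made-up title string): the byte for & is unchanged by the conversion, and what an XML document needs is escaping. htmlspecialchars() with ENT_XML1 is one way to do that; it is used here as an example, not as what Planet PHP does.

```php
<?php
// '&' is the single byte 0x26 in ASCII, ISO-8859-1 and UTF-8 alike,
// so converting it changes nothing.
$ampLatin1 = '&';
$ampUtf8 = iconv('ISO-8859-1', 'UTF-8', $ampLatin1);
$sameByte = ($ampUtf8 === $ampLatin1);

// The XML problem is about markup, not encoding: escape before embedding.
$title = 'Tom & Jerry'; // hypothetical input
$escaped = htmlspecialchars($title, ENT_XML1 | ENT_QUOTES, 'UTF-8');
$xml = "<title>$escaped</title>";

echo $xml, "\n"; // <title>Tom &amp; Jerry</title>
```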

"claim that a ISO-8859-1 encoded string doesn't have *invalid* UTF-8 characters."

If an ISO-8859-1 string only contains characters in the ASCII range, then you're right. But anything that requires 8 bits to represent (e.g. umlauts, MS smart quotes) - i.e. characters outside of the ASCII range - will need conversion and 2 bytes to represent in UTF-8 - PHP's utf8_encode() function would do the job.
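
A small sketch of the two-byte point, using a made-up sample word. It uses iconv() rather than utf8_encode(), since utf8_encode() is deprecated as of PHP 8.2. As a side note, MS smart quotes belong to Windows-1252 rather than ISO-8859-1 and take three bytes in UTF-8, so they are left out here.

```php
<?php
// "München" in ISO-8859-1: the ü is the single byte 0xFC.
$latin1 = "M\xFCnchen";

// After conversion the ü becomes the two bytes 0xC3 0xBC.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

printf("ISO-8859-1: %d bytes (%s)\n", strlen($latin1), bin2hex($latin1));
printf("UTF-8:      %d bytes (%s)\n", strlen($utf8), bin2hex($utf8));
```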

Ah - becoming annoying ;)
chregu @ 09.08.2006 15:32 CEST
I'm not sure anymore what I really wanted to claim back then. I think it was that:

$t = iconv("ISO-8859-1","UTF-8//IGNORE",$t);

doesn't make sense, since there's nothing in a proper ISO-8859-1 string that can't be converted to UTF-8, so that:

$t = iconv("ISO-8859-1","UTF-8",$t);

(or utf8_encode) would be enough :)

Of course you can't output an ISO-8859-1 string as UTF-8 and expect it to be the same :)
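
The claim that a plain conversion is enough can be checked with a short sketch: it feeds every possible byte value through iconv() without //IGNORE and verifies that the result is valid UTF-8. The variable names are illustrative only.

```php
<?php
// One string containing every byte value from 0x00 to 0xFF.
$latin1 = implode('', array_map('chr', range(0, 255)));

// Plain conversion, no //IGNORE: each byte maps to U+0000..U+00FF.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// preg_match() with /u only succeeds on valid UTF-8 input.
$converted = ($utf8 !== false);
$isValidUtf8 = $converted && preg_match('//u', $utf8) === 1;

// 128 ASCII bytes stay one byte each, the other 128 become two bytes each.
printf("%d bytes in, %d bytes out\n", strlen($latin1), strlen((string) $utf8));
```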
Robert Glen Fogarty @ 11.09.2006 01:03 CEST
"You are truely an idiot for saying what you did below."

Perhaps you meant "truly."

You're welcome!
fred @ 08.06.2007 19:47 CEST
I don't understand why people, when typing messages on the web, seem desperate to admonish each other for their mistakes in the harshest possible way. If someone writes something that's incorrect, they'd probably appreciate being notified of their mistake -- so why not just do that, notify them? Where is the sense in insulting them? Why do people get so hot-headed on the web?

This seems to mostly happen in public forums, not in private messages like e-mails, so I think these kinds of insults are all about showing off.
bob @ 25.12.2007 15:44 CEST
@fred: Hello there, captain obvious.
trev @ 05.06.2008 12:45 CEST
bob - I think you are guilty of what fred states.

To the point: there is a good use for this code. I often have to deal with incoming data (the source of which I have no control over) which claims to be UTF-8 but does occasionally have invalid chars. Because XML with invalid chars is not well-formed, it has to be sorted out one way or another. This is certainly one way of doing it.
Mark @ 13.05.2010 20:48 CEST
The solution initially suggested here worked for my problem of decoding sms chars for a php application. Thanks for posting!!
Phil @ 05.07.2010 19:31 CEST
Yep, valid ISO-8859-1 does NOT guarantee valid UTF-8. ISO-8859-1 defines character codes from 0 to 255, and a byte in the range 128-255 is not valid UTF-8 on its own.
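
A rough way to see this, assuming the preg_match() /u check is an acceptable validity test: the sketch below counts how many single byte values are not valid UTF-8 on their own, and shows that some multi-byte ISO-8859-1 strings still happen to be valid UTF-8. The sample strings are made up.

```php
<?php
// Count the byte values that are not valid UTF-8 when they stand alone.
$invalid = 0;
for ($byte = 0; $byte <= 0xFF; $byte++) {
    if (preg_match('//u', chr($byte)) !== 1) {
        $invalid++;
    }
}
printf("%d of 256 single bytes are invalid as UTF-8\n", $invalid);

// "ü" in ISO-8859-1 is a lone 0xFC byte: not valid UTF-8.
$umlautIsValid = preg_match('//u', "\xFC") === 1;

// "Ã©" in ISO-8859-1 is 0xC3 0xA9, which is also the UTF-8 encoding of "é":
// valid UTF-8, just not the text the author meant.
$mojibakeIsValid = preg_match('//u', "\xC3\xA9") === 1;
```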
