How to get rid of invalid UTF-8 characters

We aggregate the PHP category from del.icio.us on the Planet PHP website. Unfortunately, the input from del.icio.us sometimes contains invalid UTF-8 characters, which leads to errors in XML parsing. But the following iconv line gets rid of all invalid UTF-8 characters.
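
The line itself did not survive in this copy of the post. Going by the comments further down, it was presumably the iconv call sketched here. The sample string is made up, and the check for false is an assumption about how some PHP and iconv builds behave, not something the post states:

```php
<?php
// Made-up sample: "Zürich" followed by a stray 0xFF byte, which never occurs in valid UTF-8.
$text = "Z\xC3\xBCrich\xFF";

// "//IGNORE" sits inside a string literal, so it is not a comment. It asks
// iconv to skip sequences it cannot convert. @ silences the notice some builds emit.
$clean = @iconv("UTF-8", "UTF-8//IGNORE", $text);

// Assumption: on some PHP/iconv combinations the call returns false for
// invalid input even with //IGNORE, so check before using the result.
if ($clean === false) {
    echo "iconv could not clean the string on this system\n";
} else {
    echo $clean, "\n";
}
```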

Problem solved ;) (But I also wrote a mail to the del.icio.us people about the problem, ’cause this shouldn’t happen in the first place)

HAPPY HAPPY HAPPY HAPPY BIRTHDAY CHREGU!
DON’T CODE TODAY! :)

Possibly run $text = iconv("ISO-8859-1", "UTF-8", $text); beforehand.
Otherwise umlauts in HTML entities get cut out as well.
Regards, luc

luc: I assumed $text is already UTF-8 encoded. If it isn’t, you can do the conversion directly in one step… But if it’s proper ISO-8859-1, there are no invalid UTF-8 characters in there anyway :)

Great – but where do I put this?

– Tim

Tim: Before you try to load the string with dom for example
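
As a rough sketch of that placement, assuming the raw feed arrives as a string and is parsed with DOMDocument (the helper name load_clean_xml and the sample fragment are made up for illustration):

```php
<?php
// Hypothetical helper: clean the raw string, then hand it to DOM.
function load_clean_xml(string $raw): ?DOMDocument
{
    // Drop invalid UTF-8 sequences before the parser ever sees them.
    $clean = @iconv("UTF-8", "UTF-8//IGNORE", $raw);
    if ($clean === false) {
        // Assumption: some iconv builds fail on invalid input despite //IGNORE.
        return null;
    }

    $dom = new DOMDocument();
    return $dom->loadXML($clean) ? $dom : null;
}

// Made-up feed fragment with a stray 0xFF byte in the title.
$dom = load_clean_xml("<item><title>PHP\xFF news</title></item>");
if ($dom !== null) {
    echo $dom->getElementsByTagName('title')->item(0)->textContent, "\n";
}
```

The point is only the order: the iconv call runs on the raw string first, and the parser gets the cleaned result.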

Sorry, but where do I put this?

Well, doesn’t the // make the rest of the line a comment… What language is this, and where does it go? ;)

chregu,

You are truely an idiot for saying what you did below.

————————–
chregu @ 26.01.2005 21:06 CET
luc: I assumed $text is already UTF-8 encoded. If it isn’t, you can do the conversion directly in one step… But if it’s proper ISO-8859-1, there are no invalid UTF-8 characters in there anyway :)
————————–

I would NEVER hire you as a programmer.

If what you said below were true, nobody would be asking these questions now would they.

I dare you to try putting a ISO-8859-1 & sign in an xml page and see what happens in your browser.

IDIOT!!!!!!!!!!!!!!!!

Dear anonymous admin

Thanks for the job offer :)

I didn’t claim that the output will be “correct”, but I stand by my claim that a ISO-8859-1 encoded string doesn’t have *invalid* UTF-8 characters.

And about the & character. That was not part of the problem to be solved and it has the same encoding anyway. But you smart guy knew that already, of course :)

A year and a half later, triggered by this: http://www.sitepoint.com/blogs/2006/08/09/scripters-utf-8-survival-guide-slides/#comment-43992

You’re both wrong ;)

“I dare you to try putting a ISO-8859-1 & sign in an xml page and see what happens in your browser.”

An ISO-8859-1 & sign is the same thing as an ASCII & sign. So you could happily pull one out of an ISO-8859-1 string and drop it into a UTF-8 string, and have valid UTF-8. Putting a bare & in an XML document _will_ cause a problem, though not because of character encoding – rather, the & sign is an XML delimiter – see http://www.xml.com/pub/a/2001/01/31/qanda.html so @admin – best fire yourself.
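
A small sketch of the usual fix for the delimiter problem, assuming the text is already valid UTF-8 and is being written into an XML element (the sample title is made up):

```php
<?php
// Made-up sample text. The & is byte 0x26 in ASCII, ISO-8859-1 and UTF-8 alike.
$title = "Tips & tricks";

// Escape XML delimiters; this is independent of the encoding clean-up.
$escaped = htmlspecialchars($title, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo "<title>$escaped</title>\n";   // prints <title>Tips &amp; tricks</title>
```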

“claim that a ISO-8859-1 encoded string doesn’t have *invalid* UTF-8 characters.”

If an ISO-8859-1 string only contains characters in the ASCII range, then you’re right. But anything that requires 8 bits to represent (e.g. umlauts, MS smart quotes) – i.e. characters outside of the ASCII range – will need conversion and 2 bytes to represent in UTF-8 – PHP’s utf8_encode() function would do the job.
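
To make the byte counts concrete, here is a minimal sketch using iconv() rather than utf8_encode(), since utf8_encode() is deprecated as of PHP 8.2. One caveat, as far as I know: MS smart quotes belong to Windows-1252 rather than ISO-8859-1, and they need three bytes in UTF-8, so they are left out of this example.

```php
<?php
// "ü" in ISO-8859-1 is the single byte 0xFC.
$latin1 = "\xFC";

// The same character in UTF-8 is the two-byte sequence 0xC3 0xBC.
$utf8 = iconv("ISO-8859-1", "UTF-8", $latin1);

echo strlen($latin1), " byte -> ", strlen($utf8), " bytes (", bin2hex($utf8), ")\n";
// prints: 1 byte -> 2 bytes (c3bc)
```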

Ah – becoming annoying ;)

I’m not sure anymore what I really wanted to claim back then. I think it was that:

$t = iconv("ISO-8859-1", "UTF-8//IGNORE", $t);

doesn’t make sense, since there’s nothing in a proper ISO-8859-1 string that can’t be converted to UTF-8, so that:

$t = iconv("ISO-8859-1", "UTF-8", $t);

(or utf8_encode) would be enough :)

Of course you can’t output an ISO-8859-1 string as UTF-8 and expect it to be the same :)

“You are truely an idiot for saying what you did below.”

Perhaps you meant “truly.”

You’re welcome!

I don’t understand why people, when typing messages on the web, seem desperate to admonish each other for their mistakes in the harshest possible way. If someone writes something that’s incorrect, they’d probably appreciate being notified of their mistake, so why not just do that and notify them? Where is the sense in insulting them? Why do people get so hot-headed on the web?

This seems to mostly happen in public forums, not in private messages like e-mails, so I think these kinds of insults are all about showing off.

@fred: Hello there, captain obvious.

bob – I think you are guilty of what fred states.

To the point: there is a good use for this code. I often have to deal with incoming data (the source of which I have no control over) which claims to be UTF-8 but does occasionally have invalid chars. Because XML is invalid with invalid chars – it has to be sorted one way or another. This is certainly one way of doing it.

The solution initially suggested here worked for my problem of decoding SMS characters for a PHP application. Thanks for posting!!

Yep, valid ISO-8859-1 does NOT guarantee valid UTF-8. ISO-8859-1 uses byte values from 0-255, and several of those values are invalid by the UTF-8 definition.
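
This is easy to check. A minimal sketch, assuming PCRE's /u modifier as the validity test (the helper name is_valid_utf8 and the sample name are made up):

```php
<?php
// True when $s is well-formed UTF-8: with the /u modifier, preg_match()
// fails (returns false) on a subject that is not valid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// "Müller" in ISO-8859-1: the ü is the lone byte 0xFC.
$latin1 = "M\xFCller";

var_dump(is_valid_utf8($latin1));                                // bool(false)
var_dump(is_valid_utf8(iconv("ISO-8859-1", "UTF-8", $latin1)));  // bool(true)
```

So the raw ISO-8859-1 bytes are not valid UTF-8, but every one of them can be converted, which is why the plain conversion without //IGNORE is enough for proper ISO-8859-1 input.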

For me the statement

$t = iconv("UTF-8", "UTF-8//IGNORE", $t);

was very helpful in context of

“get rid of invalid UTF-8 characters”

Thanks.

@admin: do you have a problem?

Helped me too.

Thanks!

Thanxs!!!
From Bolivia

It works!!
– for mPDF
special characters…

$var = iconv("UTF-8", "UTF-8//IGNORE", $var);