Fix broken UTF-8 encoded RSS feeds in PHP

Parsing feeds with the SimpleXML object is a walk in the park. But it can turn into a pain in the ass when the provided XML feed isn’t correctly UTF-8 encoded, and I can tell you, there are some ugly ones out there in the wild.

Parsing RSS feeds with SimpleXML

Normally you can load and parse a feed as simply as this:

<?php
// Load the feed straight from a file or URL (the name must be a quoted string).
$feed = simplexml_load_file('rss.xml');

foreach ($feed->channel->item as $item) { ?>
  <h3><?= $item->title ?></h3>
  <p><?= $item->description ?></p>
<?php } ?>

That’s the standard way. Sometimes, though, the XML is not properly UTF-8 encoded, and the following nasty error occurs when loading it into the SimpleXML object.

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 6: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xEB 0x6C 0x65 0x20 in index.php on line 33
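
The bytes in that warning are a useful hint. 0xEB is "ë" in ISO-8859-1, but in UTF-8 a byte 0xEB announces a three-byte sequence and must be followed by two continuation bytes, which 0x6C ("l") is not. A minimal sketch of checking a string before parsing it follows; the helper name is_valid_utf8 is made up for this example (it is not a PHP built-in), and it relies on preg_match returning false when the subject is not valid UTF-8 under the /u modifier.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false (not 0) on malformed input.
function is_valid_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

// The four bytes from the warning: "ële " encoded as ISO-8859-1.
$broken = "\xEB\x6C\x65\x20";

// The same text correctly encoded as UTF-8 ("ë" becomes 0xC3 0xAB).
$proper = "\xC3\xAB\x6C\x65\x20";

var_dump(is_valid_utf8($broken));
var_dump(is_valid_utf8($proper));
```

A check like this only tells you that a feed is broken, not how to repair it.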

How to fix this

First I tried to filter out the bad characters with str_replace, but that’s silly. It didn’t feel right. So I looked further and stumbled on a rather unknown function: iconv.

What’s the iconv module?

This module contains an interface to the iconv character set conversion facility. With this module, you can convert a string from one character set to another, which may be the Unicode character set. Supported character sets depend on the iconv implementation of your system.

The iconv function has been available since PHP 4.0.5.

Description

string iconv ( string $in_charset , string $out_charset , string $str )

Performs a character set conversion on the string str from in_charset to out_charset. Returns the converted string, or false on failure.
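
As a small illustration of that signature, the sketch below converts the four bytes from the warning above, assuming they were meant as ISO-8859-1, into UTF-8. The variable names are only for this example.

```php
<?php
// "ële " as ISO-8859-1 bytes (the bytes reported in the parser warning).
$latin1 = "\xEB\x6C\x65\x20";

// Arguments are: source charset, target charset, input string.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// "ë" is now the two-byte UTF-8 sequence 0xC3 0xAB; the ASCII bytes are unchanged.
echo bin2hex($utf8), "\n"; // prints c3ab6c6520
```

This only gives the right result when you know the real source encoding; a wrong guess produces valid UTF-8 with the wrong characters in it.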

A better and safer way of parsing

To be on the safe side, never read a feed directly into a SimpleXML object, but do it this way:

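// Fetch the raw feed as a plain string first.
$feed = file_get_contents($feed_url);

// Convert UTF-8 to UTF-8, dropping any byte sequences that are not valid.
$feed = iconv("UTF-8","UTF-8//IGNORE",$feed);

// Only now hand the cleaned-up string to SimpleXML.
$feed = simplexml_load_string($feed);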

There is another reason to do this. Reading in a feed with file_get_contents is much faster than loading it straight into a SimpleXML object, at least in PHP < 5.2.x. So by following the above instructions you will see a performance gain and get rid of some nasty unexpected errors.
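
The three steps above can be wrapped with some error handling. What follows is a sketch under a few assumptions, not a definitive implementation: the function names sanitize_utf8, parse_feed_string and load_feed are invented for this example, and the mbstring fallback is there because some PHP and iconv library combinations have been reported to return false with a notice on illegal input even when //IGNORE is given. Note that the fallback behaves slightly differently: mb_convert_encoding substitutes a replacement character for invalid bytes instead of dropping them.

```php
<?php
// Hypothetical helper: return $raw with invalid UTF-8 sequences removed
// (iconv) or replaced (mbstring fallback), or false if neither works.
function sanitize_utf8($raw)
{
    // @ hides the notice some iconv builds raise on illegal sequences.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $raw);
    if ($clean === false && function_exists('mb_convert_encoding')) {
        $clean = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
    }
    return $clean;
}

// Hypothetical helper: parse an already fetched feed string.
// Returns a SimpleXMLElement, or false when the XML cannot be parsed.
function parse_feed_string($raw)
{
    $clean = sanitize_utf8($raw);
    if ($clean === false) {
        return false;
    }
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($clean);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml;
}

// Hypothetical helper: fetch and parse a feed from a URL or file path.
function load_feed($feed_url)
{
    $raw = @file_get_contents($feed_url);
    return $raw === false ? false : parse_feed_string($raw);
}
```

Usage then comes down to $feed = load_feed($feed_url); followed by a check for false before looping over $feed->channel->item.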

5 comments

  1. Dirty you said? Not at all. I’ve spent more than half an hour trying to find a consistent solution to this problem. Most of the references offer vague, undeveloped “pseudo-solutions” or things like “go there and you’ll find utf8_encode is your friend” or “ask your xml file provider to add encode UTF8 etc…”.
    Thanks a lot. I’ll record your RSS feed among the most useful dev. ones around.

  2. Hi
    thanks for the wonderful post. With your post I got half of my solution.

    For me a little modification to the code was required; I changed the code to $feed = iconv("UTF-8","ISO-8859-1//IGNORE",$feed);

    but thanks a lot …

  3. Absolutely genius. Saved me hours of MySQL text conversion. Now I just ignore the non-UTF code that gets input into the CMS and my XML feeds will work! Thanks a lot!
