Archive for July, 2009

Fix broken UTF-8 encoded RSS feeds in PHP

Friday, July 3rd, 2009

Parsing feeds with SimpleXML is a walk in the park. But it can turn into a pain in the ass when the provided XML feed isn’t correctly UTF-8 encoded, and I can tell you, there are some ugly ones out there in the wild.

Parsing RSS feeds with SimpleXML

Normally, loading and parsing a feed is as simple as:

<?php
// Load the feed straight from a file (or URL) into a SimpleXML object
$feed = simplexml_load_file('rss.xml');

foreach ($feed->channel->item as $item) { ?>
  <h3><?= $item->title ?></h3>
  <p><?= $item->description ?></p>
<?php } ?>

That’s the standard approach. Sometimes, though, the XML isn’t properly UTF-8 encoded, and the following nasty error occurs when loading it into the SimpleXML object:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 6: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xEB 0x6C 0x65 0x20 in index.php on line 33

How to fix this

First I tried to filter out the bad characters with str_replace, but that’s silly. It didn’t feel right. So I looked further and stumbled on a rather unknown function: iconv.

What’s the iconv module

This module contains an interface to iconv character set conversion facility. With this module, you can turn a string represented by a local character set into the one represented by another character set, which may be the Unicode character set. Supported character sets depend on the iconv implementation of your system.

The iconv function is available since PHP 4.0.5

Description

string iconv ( string $in_charset , string $out_charset , string $str )

Performs a character set conversion on the string str from in_charset to out_charset.
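
As a small illustration of the description above: the bytes in the warning (0xEB 0x6C 0x65 0x20) look like ISO-8859-1 text, where 0xEB is "ë". That is only an assumption about that particular feed; the error message itself doesn't say which character set was used. Under that assumption, a minimal sketch of a real conversion could look like this:

```php
<?php
// The bytes from the warning above: 0xEB 0x6C 0x65 0x20.
// Assumption: the feed is really ISO-8859-1, where 0xEB is "ë".
$latin1 = "\xEB\x6C\x65\x20";

// Convert from the assumed source character set to UTF-8.
$utf8 = iconv("ISO-8859-1", "UTF-8", $latin1);

// "ë" takes two bytes in UTF-8 (0xC3 0xAB), so the string grows by one byte.
echo bin2hex($utf8), "\n"; // c3ab6c6520
```

When the real source character set is known, converting from it keeps the character intact, whereas an //IGNORE conversion simply throws the offending bytes away.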

A better and safer way of parsing

To be on the safe side, never read a feed directly into a SimpleXML object. Do it this way instead:

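// Fetch the raw feed as a plain string
$feed = file_get_contents($feed_url);

// Convert UTF-8 to UTF-8; //IGNORE discards whatever can't be converted
$feed = iconv("UTF-8", "UTF-8//IGNORE", $feed);

// Parse the cleaned-up string
$feed = simplexml_load_string($feed);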

There is another reason to do this. Reading a feed with file_get_contents is much faster than loading it straight into a SimpleXML object, at least in PHP < 5.2.x. So by following the instructions above, you get a performance gain and get rid of some nasty unexpected errors.
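
The three steps above can also be wrapped in a small helper that checks each return value. This is only a sketch: the function name parse_feed_string and the sample feed are made up for this example, and the check on iconv's result is there because on some platforms iconv returns false (with a notice) instead of quietly dropping the bad bytes.

```php
<?php
// Hypothetical helper (not from the original post): clean a raw feed
// string with iconv, then parse it, returning false on any failure.
function parse_feed_string($raw)
{
    if (!is_string($raw) || $raw === '') {
        return false; // e.g. file_get_contents() failed and returned false
    }

    // On some platforms iconv() returns false instead of dropping the
    // invalid bytes, so the result has to be checked.
    $clean = @iconv("UTF-8", "UTF-8//IGNORE", $raw);
    if ($clean === false) {
        return false;
    }

    // Collect libxml errors internally instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($clean);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml; // SimpleXMLElement, or false if the XML is still broken
}

// Usage with an inline, well-formed sample feed (made up for this example).
$sample = '<rss version="2.0"><channel><item>'
        . '<title>Hello</title><description>World</description>'
        . '</item></channel></rss>';

$feed = parse_feed_string($sample);
if ($feed !== false) {
    foreach ($feed->channel->item as $item) {
        echo $item->title, ": ", $item->description, "\n";
    }
}
```

For a real feed you would pass in the result of file_get_contents($feed_url) and treat a false return value as "skip this feed", so the page doesn't fill up with warnings.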

You are browsing the archives of My Beloved PHP for July 2009.