Fetching a looot of feeds with PHP

blog.ch currently has more than 580 blogs in its database, and the usual approach of fetching one feed after another no longer works reliably, because some feeds just time out or respond very slowly and drag the whole process down.

There’s a post by Wez about how to fetch more than one “feed” at once with non-blocking sockets. This works perfectly (I didn’t test it with 500 feeds, which may be too many, but we could still split the feeds into smaller batches), but it has one little drawback: we would have to implement a reasonably decent HTTP header parser (302 and 304 status codes, etc.) to handle all the possibilities out there… HTTP_Request from PEAR does quite a good job at that, but doesn’t work with stream_socket_client.
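The core of the idea looks roughly like this (a minimal sketch, not Wez’s actual code; the feed URLs are placeholders and a real version would need an overall timeout):

<?php
// Open all connections at once with non-blocking sockets,
// then multiplex them with stream_select().
$urls = array(
    'http://example.org/feed1.xml',
    'http://example.org/feed2.xml',
);

$sockets = $requests = $responses = array();

foreach ($urls as $i => $url) {
    $p    = parse_url($url);
    $path = isset($p['path']) ? $p['path'] : '/';

    // ASYNC_CONNECT returns immediately instead of waiting for the handshake
    $s = @stream_socket_client('tcp://' . $p['host'] . ':80', $errno, $errstr, 30,
            STREAM_CLIENT_ASYNC_CONNECT | STREAM_CLIENT_CONNECT);
    if (!$s) continue;
    stream_set_blocking($s, false);

    $sockets[$i]   = $s;
    $requests[$i]  = "GET $path HTTP/1.0\r\nHost: {$p['host']}\r\nConnection: close\r\n\r\n";
    $responses[$i] = '';
}

while ($sockets) {
    $read   = $sockets;
    // only wait for writability on sockets that still have a request to send
    $write  = array_intersect_key($sockets, $requests);
    $except = null;
    if (stream_select($read, $write, $except, 5) === false) break;

    foreach ($write as $s) {                 // connection established: send the request
        $i = array_search($s, $sockets, true);
        if (isset($requests[$i])) {
            fwrite($s, $requests[$i]);
            unset($requests[$i]);
        }
    }
    foreach ($read as $s) {
        $i    = array_search($s, $sockets, true);
        $data = fread($s, 8192);
        if ($data !== false && $data !== '') {
            $responses[$i] .= $data;
        }
        if (feof($s)) {                      // server closed the connection
            fclose($s);
            unset($sockets[$i]);
        }
    }
}
// $responses now holds the raw HTTP responses (status line, headers, body) --
// and parsing those headers (redirects, 304s, ...) is exactly the missing piece.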

Does anyone know whether someone has already done that, or of any other way to help out blog.ch? Matthias would be grateful :)

http://www.php.net/manual/en/function.curl-multi-exec.php might be an alternative, with more control over the connection than the crude stream handling Wez uses.

Assuming you’re running this script as a cronjob, have you thought about having a “fetchFeeds.php” script which fetches all the relevant URLs from the database and then spawns (i.e. exec()) a “fetchFeed.php” (note the use of singular and plural!), which takes the feed URL (and other details if necessary) and fetches that feed only? Add in some usleep()’ing so you don’t spawn 580 processes (!) at once and you have a winner.
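Something like this, maybe (a rough sketch; the stand-in function and the 0.25 s delay are just placeholders):

<?php
// fetchFeeds.php -- rough sketch of the spawn-one-process-per-feed idea.
// Stand-in for whatever query pulls the feed URLs out of the database.
function getFeedUrls()
{
    return array(
        'http://example.org/feed1.xml',
        'http://example.org/feed2.xml',
    );
}

foreach (getFeedUrls() as $url) {
    // Run fetchFeed.php in the background so one slow feed can't block the rest
    exec('php fetchFeed.php ' . escapeshellarg($url) . ' > /dev/null 2>&1 &');

    // Throttle the spawning so we don't end up with 580 processes at once
    usleep(250000); // 0.25 seconds between spawns
}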

For the record (I will hopefully summarize my findings later): Mike pointed me on IRC to his PECL extension http, which has an HttpRequestPool class. See the docs for details.
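Usage looks roughly like this (a sketch from memory of the extension’s manual example, with placeholder URLs; check the docs for the exact API):

<?php
try {
    // Each feed gets its own HttpRequest; the pool sends them in parallel
    $pool = new HttpRequestPool(
        new HttpRequest('http://example.org/feed1.xml', HttpRequest::METH_GET),
        new HttpRequest('http://example.org/feed2.xml', HttpRequest::METH_GET)
    );
    $pool->send();

    foreach ($pool as $request) {
        // Status codes and response headers are already parsed for us
        echo $request->getUrl(), ': ', $request->getResponseCode(), "\n";
        $body = $request->getResponseBody();
        // ... hand $body over to the feed parser
    }
} catch (HttpException $e) {
    echo $e->getMessage(), "\n";
}

The nice part compared to raw sockets is that the status and header parsing is done by the extension, so the HTTP header parser problem mentioned above mostly disappears.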

Also, you might want to consider keeping the regular refresh schedule of the aggregator quite infrequent, and refreshing feeds one-by-one based on pings.

I’ve been thinking about doing this for the Midgard aggregator, but haven’t bothered to deal with the Magpie RSS cache system yet.

Not PHP, but what about Twisted (http://twistedmatrix.com/)? It also uses async IO (i.e. no threads to worry about), and there’s basically a ready-to-roll solution here:
“From some tests I’ve been doing, it takes 6 minutes to download and parse over 730 feeds, which is less than half a second for each feed.”

And there’s even an O’Reilly Twisted book now.

SWiK.net subscribes to a great many feeds; however, we use a lightweight threaded Java system to pull the feeds down into a directory, which is then scanned and parsed by a long-running PHP process.

I made a quick’n’dirty solution with curl_multi_*. It fetches and parses all 580 feeds from blog.ch in approx. 2 minutes. Not bad. I’ll blog about it in more depth later. The script is available here (as I said, quick’n’dirty, definitely improvable).
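The core of it looks more or less like this (a simplified sketch, not the actual script; URLs and timeouts are placeholders):

<?php
$urls = array(
    'http://example.org/feed1.xml',
    'http://example.org/feed2.xml',
);

$mh      = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // handle 301/302 redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);           // a slow feed can't stall the whole run
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers in parallel
$running = null;
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh);                      // wait for activity instead of busy-looping
    }
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $body = curl_multi_getcontent($ch);
    // ... parse $body as a feed if $code == 200
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);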

Looking in more detail at that Twisted aggregator, it’s almost all there except for conditional GETs (and other HTTP caching mechanisms, if you wish to use them). In other words, as it stands it pulls down a fresh copy of every feed each time it runs.

There’s a class in twisted.web called HTTPPageGetter which does the work. Right now it handles various status codes, including redirects, but not “Not Modified” responses – these get returned as errors to the FeederProtocol.getError method in the ActiveState example. HTTPPageGetter also doesn’t make the HTTP headers available to the caller, so right now you can’t get at things like ‘Last-Modified’ headers.

So basically it looks like you’d need to subclass HTTPPageGetter and also the factory that controls it, HTTPClientFactory. There’s already a subclass of both of these in twisted.web.client for downloading files – perhaps caching could be handled transparently in a similar manner to this downloader.