How to fetch a lot of feeds – A solution

Four days ago, I asked for a way to fetch hundreds of feeds “at once”, and here comes a solution.

The problem was that Matthias from blog.ch had serious trouble reliably fetching those 600+ feeds every hour. As he fetched them sequentially, slow or unresponsive feeds could take down the whole process. A better solution with concurrent, asynchronous requests was badly needed.

Gregor was the first to respond, suggesting the curl_multi_* functions available since PHP 5.0. Richard wanted me to exec() each request. Mike pointed me via IRC to his PHP extension http, which has an HttpRequestPool class. Henri mentioned pinging the aggregator, and last but not least, Harry wrote about Twisted, an event-driven networking framework written in Python; he even wrote a little aggregator based on it.

I didn’t want to go down the Python route (in the end, it has to work on Matthias’ hosted server and should be easy to integrate into his current setup). Telling 600+ bloggers that they have to ping blog.ch from now on wasn’t a solution either, not to speak of the problem that a lot of people can’t tell their (hosted) blog software to ping certain servers (although it would be nice as an optional service). Exec()-ing didn’t sound like a good option either, and while the http extension looks useful, I decided to give curl a try.

Here’s the quick’n’dirty script. What we basically do is get all the RSS URLs from the database and call the get_multi_rss function with them. To avoid overloading the system with too many open connections, we only do 50 requests at once and give them a timeout of 20 seconds. This fetches the 600+ feeds in at most 4 minutes, and with “good internet weather” in less than 2 minutes. Not too bad, IMHO :) The settings are quite conservative to avoid too many timeouts (we had some timeout problems with a few servers during testing, but I couldn’t reproduce them later); if you fetch more feeds at once with a shorter timeout, you can get them even faster.
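To give an idea of what the script does, here is a minimal sketch of the batching approach (not the original script; the function name, file naming and option values are just illustrative):

<?php
// Minimal sketch of the approach described above (not the original script):
// fetch $urls in chunks of $batchsize with curl_multi, writing each response
// straight into a file in $dir.
function get_multi_rss(array $urls, $dir, $batchsize = 50, $timeout = 20)
{
    foreach (array_chunk($urls, $batchsize) as $chunk) {
        $mh   = curl_multi_init();
        $conn = array();
        $f    = array();

        foreach ($chunk as $i => $url) {
            $f[$i]    = fopen($dir . '/' . md5($url) . '.xml', 'w');
            $conn[$i] = curl_init($url);
            curl_setopt($conn[$i], CURLOPT_FILE, $f[$i]);      // write the response to the file
            curl_setopt($conn[$i], CURLOPT_TIMEOUT, $timeout); // give up on slow servers
            curl_setopt($conn[$i], CURLOPT_FOLLOWLOCATION, 1);
            curl_multi_add_handle($mh, $conn[$i]);
        }

        // let curl work on all handles of this batch concurrently
        $running = null;
        do {
            $mrc = curl_multi_exec($mh, $running);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        while ($running && $mrc == CURLM_OK) {
            if (curl_multi_select($mh) != -1) {
                do {
                    $mrc = curl_multi_exec($mh, $running);
                } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            }
        }

        // clean up this batch before starting the next one
        foreach ($chunk as $i => $url) {
            curl_multi_remove_handle($mh, $conn[$i]);
            curl_close($conn[$i]);
            fclose($f[$i]);
        }
        curl_multi_close($mh);
    }
}
?>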

The system isn’t really integrated into Matthias’ feed fetching setup. What we currently do is just fetch all those feeds and put them into a directory, from where Matthias picks them up afterwards with his old script (sequential, Magpie and all that). This has mainly two reasons: Matthias’ server currently only has PHP 4 available (he will move soon to a better hoster here in Switzerland), but curl_multi is only available in PHP 5. And it had to go quickly, because the situation on blog.ch was almost unbearable. Therefore, until he has moved, has PHP 5 available and some spare time, we will fetch the feeds from the 600+ blogs for him, and he then fetches them from our server. It works pretty well, and at least the timeout problem is gone.

The next task would be to integrate such a solution into Magpie, so that the results can be parsed directly by Magpie. Looks like a doable task with some spare time. Anyone want to take it over? :)

What about integrating this into the PEAR Feed Parser thingy?
Magpie is badly written anyway (I’m using it in a system for the sheer reason that I have to support PHP 4 for now).

Anyway, the PEAR feed parser would be the way to go for him once he’s on PHP 5 (even though I haven’t looked at the code, so I can’t say whether it sucks or not).

Thanks for the plug, Helgi :) I was about to ask if you’d considered using XML_Feed_Parser. I’d really appreciate feedback from heavy-use situations.

It’s currently in alpha state, but I’m hoping to get some more unit tests and perhaps some benchmarking done over the next couple of weeks so I can push it towards beta.

http://pear.php.net/package/XML_Feed_Parser

James, thanks for the info! Magpie has suited me fine so far, but a better solution wouldn’t hurt. I’ll try it out.

Here is a PHP XML/RSS library to parse single RSS feeds and to aggregate / mix multiple RSS feeds.

http://devcorner.georgievi.net/phplib/baoxmllib/

Thanks for the link, Ivan. But it doesn’t look like it solves the “get many feeds at once” problem.

The HTTP extension’s request functionality is actually based upon libcurl, and HttpRequestPool is the interface to curl_multi, so from a performance point of view it’s probably better to use curl directly. The following phpt would be an example of how to do it with HttpRequestPool: http://cvs.php.net/co.php/pecl/http/tests/HttpRequestPool_003.phpt
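A stripped-down example of the same idea (based on the pecl_http 1.x API; the URLs and option values here are just placeholders) would be:

$urls = array(
    'http://example.org/feed1.xml',
    'http://example.org/feed2.xml',
);

$pool = new HttpRequestPool();
foreach ($urls as $url) {
    // one HttpRequest per feed; 'timeout' and 'redirect' are request options
    $pool->attach(new HttpRequest($url, HttpRequest::METH_GET,
        array('timeout' => 20, 'redirect' => 3)));
}

$pool->send(); // performs all attached requests in parallel

foreach ($pool as $request) {
    if ($request->getResponseCode() == 200) {
        $feed = $request->getResponseBody();
        // ... hand $feed over to the feed parser or write it to a file
    }
}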

Hello
I tried your script, chregu, but it doesn’t seem to fetch anything at all :/

I only get this:
Total feeds:100
timer 0
timer 0
timer 0
bad 0
array(3) {
[“badurls”]=>
array(0) {
}
[“bad”]=>
int(0)
[“good”]=>
int(0)
}
Start –
curl 0 0.00232291221619
curl 50 0.00198912620544
curl 100 0.00191688537598
magpie 0.000154972076416
Stop 1.81198120117E-05

foxydemon: You did not feed the function with URLs :)

I take them from our DB, but you have to add them yourself. Just give the function an array with the URLs you want to fetch.
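For example, something like this (the URLs are made up, and the signature follows the sketch in the post above, not necessarily the real script):

$urls = array(
    'http://blog.example.ch/wp-rss2.php',
    'http://weblog.example.org/rss.xml',
);
get_multi_rss($urls, '/path/to/feeds'); // second argument: directory for the fetched files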

At least it didn’t take too long to fetch nothing!

o_O
indeed, that was exactly it :S

So, I have managed to make your script work, BUT it is strangely really slow O_O
Here is the benchmark for only 5 feeds:
curl 0 60.5521318913
magpie 62.9399521351
Stop 9.89437103271E-05

The “only” thing I have modified in your script is that I replaced
curl_setopt($conn[$i],CURLOPT_FILE,$f[$i]);
with
curl_setopt($conn[$i],CURLOPT_URL,$url);

because I would like to avoid downloading each feed to my server before parsing it… Is that what causes it to be so slow?

Well, I tried using your original script, downloading each feed into a specific directory… and got the same result. I tried with PHP 5.0 and 5.1…
How is that possible? O_O

What is even stranger is that the downloaded feeds in the htdocs directory are *empty*.

I tried reinstalling curl, thinking it was a bug or something like that… but I still have the problem.

Do you have an idea why it doesn’t work?

foxydemon: You removed
curl_setopt($conn[$i],CURLOPT_FILE,$f[$i]);
This line saves the output to that file, so since you removed it, the content isn’t saved anywhere.

The files are zero-sized because we do a touch on each file at the end. So if a file isn’t there, “touch” creates a zero-sized one.

Why it takes so long, I don’t know. It should stop after at most 20 seconds (if you set the timeout to 20 seconds).
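In code terms, the relevant part looks roughly like this (a simplified fragment, not copied verbatim from the script):

$file  = $dir . '/' . md5($url) . '.xml';     // target file for this feed (naming is just an example)
$f[$i] = fopen($file, 'w');
curl_setopt($conn[$i], CURLOPT_URL, $url);    // which feed to fetch
curl_setopt($conn[$i], CURLOPT_FILE, $f[$i]); // curl writes the response into this open file handle
curl_setopt($conn[$i], CURLOPT_TIMEOUT, 20);  // give up on the feed after 20 seconds

// ... after the curl_multi loop has finished, every feed file gets touched,
// so a feed that could not be fetched still ends up as a zero-sized file:
touch($file);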

I used your original code without modifying anything. I asked a friend to try it on his own server with the exact same feeds, and it worked perfectly…

So I just don’t understand. I have just tried removing the touch lines to see if the content is saved before it is “touched”… but the files are still zero-sized… with some exceptions: out of 50 files, there is sometimes one feed which isn’t empty (both before and after removing the touch lines)…

The only thing I can think of is that my curl doesn’t work properly… though I have reinstalled it several times… As for PHP 5, I installed it through apt-get packages from people.dexter.org; is it possible their packages are buggy o_O ?

Well… I just don’t know what to do, since the problem doesn’t seem to come from your code at all… Perhaps you have an idea?

I installed the packages from http://dotdeb.org, which are supposed to be the most stable, and I got exactly the same problem…
My server is a dual Opteron… I just don’t understand the problem.

Any idea?

FWIW, I get this result for 622 feeds:

[“bad”]=>
int(76)
[“good”]=>
int(546)
}
Start –
curl 0 27.092756986618
curl 50 27.517721176147
curl 100 34.920758962631
curl 150 27.061724901199
curl 200 55.909920930862
curl 250 29.084393978119
curl 300 29.087845087051
curl 350 28.676599979401
curl 400 27.624128103256
curl 450 49.660580873489
curl 500 31.021418094635
curl 550 27.434705018997
curl 600 14.837267875671
magpie 11.539301156998
Stop 0.00020885467529297

I compiled php5 and libcurl, thinking the Debian packages were buggy…
But same problem o_O

Total feeds:3
timer 0
array(3) {
[“badurls”]=>
array(3) {
[0]=>
string(47) “http://www.sacarny.com/blog/wp-rss2.php?cat=2 0”
[1]=>
string(39) “http://alanjstr.blogspot.com/atom.xml 0”
[2]=>
string(41) “http://www.croczilla.com/blog/rdf10_xml 0”
}
[“bad”]=>
int(3)
[“good”]=>
int(0)
}
Start –
curl 0 45.2878551483
Stop 3.19480895996E-05

That’s CRAZY :S

My config is:
Apache 1.3.3 – PHP5.0.5 – libcurl/7.15.0 OpenSSL/0.9.7e zlib/1.2.2 libidn/0.5.18

Any idea? :/

Veery nice ;)

Hi guys,

On my Debian Sarge server I always get a timeout after the first 4 connects in the curl_multi loop… I array_chunk 10,000 URLs into packets of 100 per curl_multi call. The first 4 connects/sockets give back a normal result, usually the 5th too, but after that the rest get a timeout. If I try to fetch 100 pages from the same domain at the same time with curl_multi (100 sockets), it works, but with different hosts it can only get 4 results per loop…

Any ideas? I can’t find the problem… it must be some kind of limit in Debian or with the hoster… but I can’t find a solution… :(

Hi tosbn,

I think this is a known bug in PHP which should be fixed in PHP 5.2.

See here:

http://bugs.php.net/bug.php?id=38620

Cheers
Nick