How to fetch a lot of feeds – A solution

Four days ago, I asked for a way to fetch hundreds of feeds “at once”, and here comes a solution.

The problem was that Matthias from blog.ch had serious trouble reliably fetching those 600+ feeds every hour. As he fetched them sequentially, slow or unresponsive feeds could take down the whole process. A better solution with concurrent, asynchronous requests was badly needed.

Gregor was the first to respond, suggesting the curl_multi_* functions available since PHP 5.0. Richard wanted me to exec() each request. Mike pointed me via IRC to his PHP extension http, which has an HttpRequestPool class. Henri mentioned pinging the aggregator, and last but not least, Harry wrote about Twisted, an event-driven networking framework written in Python; he even wrote a little aggregator based on it.

I didn’t want to go down the Python route (in the end, it has to work on Matthias’ hosted server and should be easy to integrate into his current setup). Telling 600+ bloggers that they have to ping blog.ch from now on wasn’t a solution either, not to speak of the problem that a lot of people can’t tell their (hosted) blog software to ping certain servers (although it would be nice as an optional service). Exec()-ing didn’t sound like a good option either, and while the http extension looks useful, I decided to give curl a try.

Here’s the quick’n’dirty script. What we basically do is get all the RSS URLs from the database and call the get_multi_rss function with them. To avoid overloading the system with too many open connections, we only do 50 requests at once and give them a timeout of 20 seconds. This fetches the 600+ feeds in at most 4 minutes, and with “good internet weather” in less than 2 minutes. Not too bad, IMHO :) The settings are quite conservative to avoid too many timeouts (we had some timeout problems with a few servers during testing, but I couldn’t reproduce them later); if you fetch more feeds at once with a shorter timeout, you can get them even faster.
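To give an idea of what the script does, here is a minimal sketch of the batching approach (not the original script; the function name, file naming and option values are just illustrative):

<?php
// Minimal sketch of the approach described above (not the original script):
// fetch $urls in chunks of $batchsize with curl_multi, writing each response
// straight into a file in $dir.
function get_multi_rss(array $urls, $dir, $batchsize = 50, $timeout = 20)
{
    foreach (array_chunk($urls, $batchsize) as $chunk) {
        $mh   = curl_multi_init();
        $conn = array();
        $f    = array();

        foreach ($chunk as $i => $url) {
            $f[$i]    = fopen($dir . '/' . md5($url) . '.xml', 'w');
            $conn[$i] = curl_init($url);
            curl_setopt($conn[$i], CURLOPT_FILE, $f[$i]);      // write the response to the file
            curl_setopt($conn[$i], CURLOPT_TIMEOUT, $timeout); // give up on slow servers
            curl_setopt($conn[$i], CURLOPT_FOLLOWLOCATION, 1);
            curl_multi_add_handle($mh, $conn[$i]);
        }

        // let curl work on all handles of this batch concurrently
        $running = null;
        do {
            $mrc = curl_multi_exec($mh, $running);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        while ($running && $mrc == CURLM_OK) {
            if (curl_multi_select($mh) != -1) {
                do {
                    $mrc = curl_multi_exec($mh, $running);
                } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            }
        }

        // clean up this batch before starting the next one
        foreach ($chunk as $i => $url) {
            curl_multi_remove_handle($mh, $conn[$i]);
            curl_close($conn[$i]);
            fclose($f[$i]);
        }
        curl_multi_close($mh);
    }
}
?>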

The system isn’t really integrated into Matthias’ feed fetching setup. What we currently do is just fetch all those feeds and put them into a directory, from where Matthias picks them up afterwards with his old script (sequential, Magpie and all that). This has mainly two reasons: Matthias’ server currently only has PHP 4 available (he will move soon to a better hoster here in Switzerland), but curl_multi is only available in PHP 5. And it had to go quickly, because the situation on blog.ch was almost unbearable. Therefore, until he has moved, has PHP 5 available and some spare time, we will fetch the feeds from the 600+ blogs for him, and he then fetches them from our server. It works pretty well, and at least the timeout problem is gone.

The next task would be to integrate such a solution into Magpie, so that the results can be parsed directly by Magpie. Looks like a doable task with some spare time. Anyone want to take it over? :)

What about integrating this into the PEAR Feed Parser thingy?
Magpie is badly written anyway (I’m using it in a system for the sheer reason that I have to support PHP 4 for now).

Anyway, the PEAR feed parser would be the way to go for him once he’s on PHP 5 (even though I haven’t looked at the code, so I can’t say whether it sucks or not).

Thanks for the plug, Helgi :) I was about to ask if you’d considered using XML_Feed_Parser. I’d really appreciate feedback from heavy-use situations.

It’s currently in alpha state, but I’m hoping to get some more unit tests and perhaps some benchmarking done over the next couple of weeks so I can push it towards beta.

http://pear.php.net/package/XML_Feed_Parser

James, thanks for the info! Magpie has suited me fine so far, but a better solution wouldn’t hurt. I’ll try it out.

Here is a PHP XML/RSS library to parse single RSS feeds and to aggregate / mix multiple RSS feeds.

http://devcorner.georgievi.net/phplib/baoxmllib/

Thanks for the link, Ivan. But it doesn’t look like it solves the “get many feeds at once” problem.

The HTTP extension’s request functionality is actually based upon libcurl, and HttpRequestPool is the interface to curl_multi, so from a performance point of view it’s probably better to use curl directly. The following phpt would be an example of how to do it with HttpRequestPool: http://cvs.php.net/co.php/pecl/http/tests/HttpRequestPool_003.phpt
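A stripped-down example of the same idea (based on the pecl_http 1.x API; the URLs and option values here are just placeholders) would be:

$urls = array(
    'http://example.org/feed1.xml',
    'http://example.org/feed2.xml',
);

$pool = new HttpRequestPool();
foreach ($urls as $url) {
    // one HttpRequest per feed; 'timeout' and 'redirect' are request options
    $pool->attach(new HttpRequest($url, HttpRequest::METH_GET,
        array('timeout' => 20, 'redirect' => 3)));
}

$pool->send(); // performs all attached requests in parallel

foreach ($pool as $request) {
    if ($request->getResponseCode() == 200) {
        $feed = $request->getResponseBody();
        // ... hand $feed over to the feed parser or write it to a file
    }
}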

Hello
I tried your script, chregu, but it doesn’t seem to fetch anything at all :/

I only get this:
Total feeds:100
timer 0
timer 0
timer 0
bad 0
array(3) {
[“badurls”]=>
array(0) {
}
[“bad”]=>
int(0)
[“good”]=>
int(0)
}
Start –
curl 0 0.00232291221619
curl 50 0.00198912620544
curl 100 0.00191688537598
magpie 0.000154972076416
Stop 1.81198120117E-05

foxydemon: You did not feed the function with URLs :)

I take them from our DB, but you have to add them yourself. Just give the function an array with the URLs you want to fetch.
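For example, something like this (the URLs are made up, and the signature follows the sketch in the post above, not necessarily the real script):

$urls = array(
    'http://blog.example.ch/wp-rss2.php',
    'http://weblog.example.org/rss.xml',
);
get_multi_rss($urls, '/path/to/feeds'); // second argument: directory for the fetched files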

At least it didn’t take too long to fetch nothing!

o_O
indeed, that was exactly it :S

So, I have managed to make your script work, BUT it is strangely really slow O_O
Here is the benchmark for only 5 feeds:
curl 0 60.5521318913
magpie 62.9399521351
Stop 9.89437103271E-05

The “only” thing I have modified in your script is that I replaced
curl_setopt($conn[$i],CURLOPT_FILE,$f[$i]);
with
curl_setopt($conn[$i],CURLOPT_URL,$url);

because I would like to avoid downloading each feed to my server before parsing it… Is that what causes it to be so slow?

Well, I tried using your original script, downloading each feed into a specific directory… and got the same result. I tried with PHP 5.0 and 5.1…
How is that possible? O_O

What is even stranger is that the downloaded feeds in the htdocs directory are *empty*.

I tried reinstalling curl, thinking it was a bug or something like that… but I still have the problem.

Do you have an idea why it doesn’t work?

foxydemon: You removed
curl_setopt($conn[$i],CURLOPT_FILE,$f[$i]);
This line saves the output to that file, so since you removed it, the content isn’t saved anywhere.

The files are zero-sized because we do a touch on each file at the end. So if a file isn’t there, “touch” creates a zero-sized one.

Why it takes so long, I don’t know. It should stop after at most 20 seconds (if you set the timeout to 20 seconds).
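In code terms, the relevant part looks roughly like this (a simplified fragment, not copied verbatim from the script):

$file  = $dir . '/' . md5($url) . '.xml';     // target file for this feed (naming is just an example)
$f[$i] = fopen($file, 'w');
curl_setopt($conn[$i], CURLOPT_URL, $url);    // which feed to fetch
curl_setopt($conn[$i], CURLOPT_FILE, $f[$i]); // curl writes the response into this open file handle
curl_setopt($conn[$i], CURLOPT_TIMEOUT, 20);  // give up on the feed after 20 seconds

// ... after the curl_multi loop has finished, every feed file gets touched,
// so a feed that could not be fetched still ends up as a zero-sized file:
touch($file);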

I used your original code without modifying anything. I asked a friend to try it on his own server with the exact same feeds, and it worked perfectly…

So I just don’t understand. I have just tried removing the touch lines to see if the content is saved before it is “touched”… but the files are still zero-sized… with some exceptions: out of 50 files, there is sometimes one feed which isn’t empty (both before and after removing the touch lines)…

The only thing I can think of is that my curl doesn’t work properly… though I have reinstalled it several times… As for PHP 5, I installed it through apt-get packages from people.dexter.org; is it possible their packages are buggy o_O ?

Well… I just don’t know what to do, since the problem doesn’t seem to come from your code at all… Perhaps you have an idea?

I installed the packages from http://dotdeb.org, which are supposed to be the most stable, and I got exactly the same problem…
My server is a dual Opteron… I just don’t understand the problem.

Any idea?

FWIW, I get this result for 622 feeds:

[“bad”]=>
int(76)
[“good”]=>
int(546)
}
Start –
curl 0 27.092756986618
curl 50 27.517721176147
curl 100 34.920758962631
curl 150 27.061724901199
curl 200 55.909920930862
curl 250 29.084393978119
curl 300 29.087845087051
curl 350 28.676599979401
curl 400 27.624128103256
curl 450 49.660580873489
curl 500 31.021418094635
curl 550 27.434705018997
curl 600 14.837267875671
magpie 11.539301156998
Stop 0.00020885467529297

I compiled php5 and libcurl, thinking the Debian packages were buggy…
But same problem o_O

Total feeds:3
timer 0
array(3) {
[“badurls”]=>
array(3) {
[0]=>
string(47) “http://www.sacarny.com/blog/wp-rss2.php?cat=2 0”
[1]=>
string(39) “http://alanjstr.blogspot.com/atom.xml 0”
[2]=>
string(41) “http://www.croczilla.com/blog/rdf10_xml 0”
}
[“bad”]=>
int(3)
[“good”]=>
int(0)
}
Start –
curl 0 45.2878551483
Stop 3.19480895996E-05

That’s CRAZY :S

My config is:
Apache 1.3.3 – PHP5.0.5 – libcurl/7.15.0 OpenSSL/0.9.7e zlib/1.2.2 libidn/0.5.18

Any idea? :/

Veery nice ;)

Hi guys,

On my Debian Sarge server I always get a timeout after the first 4 connects in the curl_multi loop… I array_chunk 10,000 URLs into packets of 100 per curl_multi call. The first 4 connects/sockets give back a normal result, usually the 5th too, but after that the rest get a timeout. If I try to fetch 100 pages from the same domain at the same time with curl_multi (100 sockets), it works, but with different hosts it can only get 4 results per loop…

Any ideas? I can’t find the problem… it must be some kind of limit in Debian or with the hoster… but I can’t find a solution… :(

Hi tosbn,

I think this is a known bug in PHP which should be fixed in PHP 5.2.

See here:

http://bugs.php.net/bug.php?id=38620

Cheers
Nick