RSS growing pains
While I've been buried in other CTO duties, I've seen a lot of useful feedback on my latest column, "RSS Growing Pains." There's even a healthy discussion going on over at Slashdot, including entertaining (at least to me) debates about InfoWorld IT departments past and present.
Rob Malda noted in his post: "We've seen similiar problems over the years. RSS (or as it should be called, 'Speedfeed') is such a useful thing, it's unfortunate that it's ultimately just very stupid."
Mark Fletcher of Bloglines recognized the scaling problem I was talking about (and by the way, Mark, we love Bloglines for its functionality and what it does to reduce the load by acting as a proxy for so many people):
This is the scaling problem that I've been talking about since we launched Bloglines a year ago. It's a serious concern. Centralized services like Bloglines avoid this problem because we only fetch a feed once regardless of how many subscribers we have to it. Desktop aggregators can't do that, of course, and end up generating huge amounts of traffic to sites like Infoworld. There are various things that a desktop aggregator can do to mitigate the load, like using the HTTP last-modified header and supporting gzip compression. But the aggregator still has to query the server, so there will always be a load issue.
Because Bloglines has a vested interest in increasing RSS (in the generic sense) adoption, we're looking at ways we can help. We are working on a couple of projects right now, and we're of course open to suggestions.
Tim Bray posted a pointer to my column last Friday on the atom-syntax mailing list, and Dare Obasanjo responded with a post with some useful practical advice (I added links to the relevant technologies):
Testing their feed with http://www.rexswain.com/cgi-bin/httpview.cgi I see that Infoworld neither supports GZip encoding nor HTTP conditional GET. No wonder their bandwidth usage is so high. Using both techniques should reduce their bandwidth usage by at *least* a factor of 10.
Almost every time I see someone complain about how RSS wastes bandwidth, I check their RSS feed and find they aren't using basic HTTP techniques for saving bandwidth which are supported by ALL the major RSS aggregators. Until we find that existing solutions are inadequate it seems like premature optimization to start talking about even more sophisticated bandwidth management techniques specific to syndication feeds.
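For anyone who wants to see what those two techniques amount to on the server side, here is a minimal sketch in Python. It is not how our feed is served or how any particular server implements it; the function name `respond`, the lower-case header dictionary, and the exact-match ETag comparison are all assumptions made for illustration (a real server also has to handle weak validators and lists of ETags).

```python
import gzip
from email.utils import formatdate, parsedate_to_datetime


def respond(request_headers, body, etag, last_modified_ts):
    """Return (status, headers, payload) for a feed request.

    request_headers: dict keyed by lower-case header names (an assumption
    of this sketch, not something a real server hands you in this form).
    last_modified_ts: feed modification time as a Unix timestamp.
    """
    headers = {
        "ETag": etag,
        "Last-Modified": formatdate(last_modified_ts, usegmt=True),
    }

    # Conditional GET: If-None-Match takes precedence over If-Modified-Since.
    if_none_match = request_headers.get("if-none-match")
    if_modified_since = request_headers.get("if-modified-since")
    not_modified = False
    if if_none_match is not None:
        # Simplification: exact string match only.
        not_modified = if_none_match == etag
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
            not_modified = int(last_modified_ts) <= since
        except (TypeError, ValueError):
            not_modified = False  # unparseable date: send the full feed
    if not_modified:
        # Nothing changed since the reader's last poll: no body at all.
        return 304, headers, b""

    # Compress only when the client says it accepts gzip.
    if "gzip" in request_headers.get("accept-encoding", ""):
        headers["Content-Encoding"] = "gzip"
        headers["Vary"] = "Accept-Encoding"
        body = gzip.compress(body)
    headers["Content-Length"] = str(len(body))
    return 200, headers, body
```

A newsreader that echoes back the ETag or Last-Modified value from its previous poll gets an empty 304 response when the feed has not changed, and a compressed body when it has. Note that the request still has to be answered either way, which is Mark Fletcher's point above.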
First, a quick quote from my column:
Our hourly RSS surge has all the characteristics of a distributed DoS attack, and although the requests are legitimate and small, the sheer number of requests in that short time period creates some aggravating scaling issues. These issues aren’t enough to make me want to abandon RSS (in fact, I’ll keep pushing it), but its workings can create operational annoyances. If RSS is going to go from fairly big to absolutely huge, we’re all going to need to do a little more work on the plumbing.
I pointed out this quote because the "operational annoyance" (not quite the level of a "problem") had less to do with bandwidth (as many folks who responded wrongly assumed) and more to do with Apache configuration (it's always worthwhile to re-read the "Apache Performance Tuning" page). I would advise folks out there to read this page and make sure that your Apache settings support the level of simultaneous connections that a bunch of newsreaders waking up at the same time will create. In our case, we had fairly conservative settings for MaxClients and ServerLimit. Read the Apache docs for more on these two settings and you'll know where I'm headed:
The MaxClients directive sets the limit on the number of simultaneous requests that will be served. Any connection attempts over the MaxClients limit will normally be queued, up to a number based on the ListenBacklog directive. Once a child process is freed at the end of a different request, the connection will then be serviced.
. . . .
Special care must be taken when using this directive. If ServerLimit is set to a value much higher than necessary, extra, unused shared memory will be allocated. If both ServerLimit and MaxClients are set to values higher than the system can handle, Apache may not start or the system may become unstable.
So, basically, the "operational annoyance" was the thousands of simultaneous connections that eventually overwhelmed our MaxClients setting and caused requests to queue at the top of the hour, slowing down our response time but not killing our servers or draining our bandwidth. It's the kind of thing you don't notice until you hit the wall, and as our RSS traffic grew steadily, we didn't see the wall approaching. We had a higher MaxClients setting than recommended anyway, but we raised the number further to something more respectable. Believe me, though -- when RSS gets really popular at your site, the characteristics of tons of incoming connections all at once on a regular basis will seem unusual relative to "regular" web traffic. This is not the kind of news-driven surge that I saw during big breaking news events during my days at CNN.com (and even Salon.com to a lesser degree) -- it really does look like a DDoS attack, even when you know it isn't.
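To get a rough feel for why a burst hurts response time without hurting bandwidth, here is a back-of-the-envelope model in Python. It is a deliberately crude sketch: it assumes every request arrives at the same instant, every request takes the same time to serve, and exactly MaxClients requests are handled in parallel. It also ignores the ListenBacklog limit mentioned in the Apache docs above. The function names and the numbers in the usage lines are hypothetical, not our real settings or traffic.

```python
import math


def queued_at_start(burst_size, max_clients):
    """How many requests in the burst cannot be served immediately."""
    return max(0, burst_size - max_clients)


def burst_wait_seconds(burst_size, max_clients, service_seconds):
    """Worst-case wait before the last request in a burst starts being served.

    Simplified model: requests are served in "waves" of max_clients, and
    the last request waits for every earlier wave to finish.
    """
    if max_clients <= 0:
        raise ValueError("max_clients must be positive")
    if burst_size <= 0:
        return 0.0
    waves = math.ceil(burst_size / max_clients)
    return (waves - 1) * service_seconds


# Hypothetical numbers for illustration only.
print(queued_at_start(3000, 150))           # requests left waiting in the queue
print(burst_wait_seconds(3000, 150, 0.25))  # seconds the unluckiest reader waits
print(burst_wait_seconds(3000, 600, 0.25))  # same burst with a higher MaxClients
```

Even in this toy model the pattern is visible: the requests are tiny, but the wait grows with the ratio of burst size to MaxClients, so raising the limit (within what the machine's memory can handle) shortens the queue directly.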
In the final analysis, there are a number of ways to deal with RSS surges, and none of them are rocket science. But no one should assume the surges will just take care of themselves, and it's clear from what I'm seeing that more people need to know about the solutions. Not every IT group will roll with the RSS punches, and the more conservative ones might just cut RSS off at the knees even if it's only a minor hassle on top of everything else. One of the wonders of weblogs (and the web in general), though, is that some poor soul will face the same annoyance we had and hopefully will find this post and fix it quickly.
I have a ton of e-mail to dig through with other suggestions that I will post, but work calls again. . . . stay tuned.
Update: More links and discussion:
- "Will RSS Readers Clog the Web?" (story from Wired News in April)
- Dare Obasanjo suggests gzip encoding and conditional GET, but notes: "The one thing that HTTP doesn't provide is a way for clients to deal with numerous connections being made to the site at once." There are methods to deal with that, of course (more servers, CDN services like Akamai and Speedera, etc.) but those solutions smack of mindlessly adding more lanes to the freeway instead of doing the hard work of analyzing the traffic problem and working on the fundamental issues.
- Sam Ruby points back to Dare Obasanjo's suggestions and notes: The functionality is clearly there in HTTP. The word is clearly not getting out to everywhere it should be.
- Nick Bradbury of FeedDemon fame adds his voice to the chorus, and also notes that FeedDemon only checks feeds every three hours by default.
- Phil Windley: None of these problems are unsolvable and frankly, its nice to have scalability problems. It's a sign of success. (Agreed!)
Posted by Chad Dickerson at July 20, 2004 03:12 PM