oops
I’ll get the embarrassing part out of the way first. So… Ravelry had over an hour’s worth of hiccups and short periods of downtime yesterday. The cause of the problem was just a stupid full disk - I missed rotating a log, I was too lazy to set up Nagios monitoring for all my disks, and I didn’t notice the steadily climbing graph. However, I changed a whole bunch of big stuff just before the problem hit and ended up wasting time looking in all of the wrong places before this caught my eye:

I just wanted you to know that the new stuff that I’m about to gush about wasn’t the cause.
6 million Rails requests per day
Ravelry does just over 2 million page views each day. Once you add in all of the AJAX hits, RSS feeds, API calls, and a few other things it adds up to 6 million requests that actually hit the Rails app. (I just grepped one day’s master syslog for the “Completed in” lines)
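For anyone who wants to do the same kind of count without grep, here is a rough Ruby sketch. It is not the command that was actually run; it just assumes that every finished request logs exactly one line containing “Completed in”. The sample lines and the `count_completed` name are made up for illustration.

```ruby
# Count Rails request-completion lines.
# Assumption: each finished request logs one line containing "Completed in".
def count_completed(lines)
  lines.count { |line| line.include?("Completed in") }
end

# Hypothetical sample lines, for illustration only.
sample = [
  "Processing PatternsController#show (for 10.0.0.1) [GET]",
  "Completed in 0.04512 (22 reqs/sec) | 200 OK [http://example.com/patterns/1]",
  "Processing FeedsController#index (for 10.0.0.2) [GET]",
  "Completed in 0.01200 (83 reqs/sec) | 200 OK [http://example.com/feeds]"
]
puts count_completed(sample)

# For a real log file, stream it instead of loading it into memory:
#   count_completed(File.foreach("/path/to/master/syslog"))
```

Because `File.foreach` without a block returns an enumerator, the same method works on a multi-gigabyte log without reading it all at once.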
We currently have 70 Thin instances that run the Rails application (previously using Mongrel - keep reading). Each can handle a single request at a time and I guess you can call them “app servers” for lack of a better term. Requests come in to the web server, things that aren’t static files like images are passed along to Rails, the response goes back to the client.
Our biggest challenges at this layer have been 1) making sure that the perceived speed of the site isn’t affected if one or some of the instances go crazy or go down and 2) deploying new versions of the site without interruption.
HTTP server to Rails - the progression
We are now on our 3rd real “version” of our setup when it comes to the front end web server and its connection to the Rails instances on the backend. There are a lot of people thinking (and coding) about better ways to run and deploy Rails apps and I’m sure that we’ll continue to change how we do things.
Version 0: Apache + FastCGI
Used this combo in an early development version. Kind of a crappy setup, but it was all I had since I was still on a shared host. If you’ve tried playing with Ruby on a shared host with this setup and you hate it, don’t worry. There is another way.
Version 1: Apache → mod_proxy_balancer → Mongrel
This is probably the simplest setup because most people already have a standard Apache installation. Start up your Mongrels (however many you want), configure your Apache, and Apache proxies the requests to your Mongrels using a simple balancing algorithm.
Our biggest problems? Both hot deploys and recovering problem Mongrels were made difficult by mod_proxy_balancer’s behavior.
Restarting part or all of the cluster often resulted in Apache returning 500 errors anytime the affected servers were hit, even once they were back up and running properly. I often had to reload or restart Apache (which was a slow process when Apache was getting hit with lots of traffic) to bring things back to life.
Version 2: nginx → Mongrel
I originally set up an Apache alternative because we were moving (finally) from a single server to our 3 machine, multiple virtual machine setup. I wanted to try nginx and see what all of the fuss was about. nginx uses less CPU than Apache, *far* less memory, and is simpler to configure. nginx is great - I highly recommend it. The biggest benefits that we got from this setup were more free memory, lighter load, and no more 500 errors.
I used Evented Mongrel for part of this time, due to the promise of extra stability under load.
Version 3: nginx + upstream_fair → Mongrel
We moved to the fair proxy balancer for nginx shortly after it came out - it was a great improvement. The fair module solves a major problem with connecting Mongrels to nginx the usual way. Because a Rails app running on Mongrel can only handle one request at a time, nginx’s round robin sharing of the load would often lead to requests waiting in line for a busy Mongrel when there are other free instances that could be used. Switching to this setup greatly improved the perceived speed because users were much less likely to be affected by slow-running requests initiated by others. If it weren’t for this module, I would have had to move some API calls, some searches, and administrative functions to background tasks or alternate clusters. It came out at just the right time.
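To make the round robin problem concrete, here is a toy simulation in Ruby. It is a simplified model, not how upstream_fair is actually implemented: each backend handles one request at a time, and the “least busy” strategy simply picks whichever backend frees up soonest. The `total_wait` name and the workload are invented for the example.

```ruby
# Each backend handles one request at a time, so we only need to track
# the time at which each backend next becomes free.
# requests: array of [arrival_time, duration] pairs, sorted by arrival time.
# Returns the total time requests spent waiting behind a busy backend.
def total_wait(requests, backend_count, strategy)
  free_at = Array.new(backend_count, 0.0)
  wait = 0.0
  requests.each_with_index do |(arrival, duration), i|
    backend =
      if strategy == :round_robin
        i % backend_count # blindly take turns
      else
        # :least_busy - pick the backend that frees up soonest
        free_at.each_index.min_by { |b| free_at[b] }
      end
    start = [arrival, free_at[backend]].max
    wait += start - arrival
    free_at[backend] = start + duration
  end
  wait
end

# Hypothetical workload: one slow 10 second request, then a few quick ones.
requests = [[0.0, 10.0], [1.0, 0.1], [2.0, 0.1], [3.0, 0.1], [4.0, 0.1]]
puts "round robin wait: #{total_wait(requests, 2, :round_robin).round(1)}s"
puts "least busy wait:  #{total_wait(requests, 2, :least_busy).round(1)}s"
```

With two backends, round robin keeps handing every other quick request to the backend that is stuck on the slow one, while the least-busy strategy routes around it and nobody waits.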
If you are starting something small and you aren’t dead set on using Apache, I recommend this configuration as a great and simple setup. nginx is excellent, upstream_fair makes the proxying smarter.
We ran this way for the last 4 months. I finally made some changes yesterday because my feature wish list had grown too large. There were 2 major things that I was hoping to correct. First, requests were still loaded on to the single-connection handling Rails app without regard to their current state. 2000 impatient concurrent users could easily turn a little slowness into a big pile-up by clicking madly and hitting refresh (as people tend to do if a site is unresponsive). This made it really hard to do releases without putting up a maintenance page, and since we were often doing daily releases, I *really* wanted to avoid taking the site down for maintenance all the time. Second, things would sometimes go funny with the balancing. I’d notice that an entire server’s worth of Mongrels was being ignored (by looking at my graphs) and I’d have to reload nginx to wake it up. I’m pretty sure that this only happened when I added and removed servers from the cluster during deploys but I’m not positive. In any case, the behavior was a little disconcerting.
Version 4: nginx → HAProxy (running now) → Thin
Disclaimer - I’ve only been running this for 2 days and I’ll explain Thin after - the fact that I replaced Mongrel with Thin isn’t that important to my story.
HAProxy is a software load balancer. I love it. Instead of making nginx proxy requests to our set of 70 Mongrels (okay, Thins) it sends everything to HAProxy which does a much nicer job of balancing the load. HAProxy can handle HTTP and TCP traffic with some caveats - if you have *any* load balancing needs, I highly recommend that you take a peek at it.
Why I like HAProxy:
- Understands per-server connection limits and configurable request queuing
- Watches servers for up/down-ness (or single slow running requests, in the case of Rails/Mongrel) and routes requests appropriately
- “abortonclose” tosses out useless aborted requests. From the HAProxy manual: “In presence of very high loads, the servers will take some time to respond… When clients will wait for more than a few seconds, they will often hit the “STOP” button on their browser, leaving a useless request in the queue, and slowing down other users, and the servers as well, because the request will eventually be served, then aborted at the first error encountered while delivering the response.”
- Can lay off the back end and provide users with some sort of feedback (even if it is a 503 Service Unavailable error) if things are going badly.
- Great logging. It is so nice to be able to see (and analyze) how the balancing/proxying is going.
- Fast and light! Low memory footprint, low CPU impact. nginx doesn’t put it to shame, which is saying something.
- Cool and useful statistics page that shows up/down servers, session counts, request queue, server status and downtime, etc. Plenty of good stuff for my Nagios monitoring and Munin graphs.
- This one isn’t scientific. I can do my hot deploys. By its very nature, HAProxy deals really well with a rolling restart of all servers.
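The first and third points work together, and a toy model may help show why. The Ruby sketch below is not HAProxy’s implementation; it only illustrates the idea of holding requests in a proxy-side queue, giving each backend one request at a time, and discarding requests whose client has already gone away. `ToyQueue`, `Request`, and the method names are all invented for this example.

```ruby
# Toy model of a proxy-side queue in front of single-request backends.
Request = Struct.new(:id, :aborted)

class ToyQueue
  attr_reader :dispatched, :dropped

  def initialize(backend_count)
    @idle = backend_count # free slots, one per backend (like maxconn 1)
    @queue = []
    @dispatched = []      # ids handed to a backend, in order
    @dropped = []         # ids discarded because the client gave up
  end

  # A new request arrives at the proxy.
  def enqueue(request)
    @queue << request
    drain
  end

  # A backend finished its request and has a free slot again.
  def backend_finished
    @idle += 1
    drain
  end

  private

  # Hand queued requests to idle backends, skipping aborted ones.
  def drain
    while @idle > 0 && !@queue.empty?
      request = @queue.shift
      if request.aborted
        @dropped << request.id
      else
        @idle -= 1
        @dispatched << request.id
      end
    end
  end
end

queue = ToyQueue.new(1)
first, second, third = Request.new(1, false), Request.new(2, false), Request.new(3, false)
[first, second, third].each { |r| queue.enqueue(r) }
second.aborted = true  # the impatient user hit STOP while waiting
queue.backend_finished # the backend frees up
puts "dispatched: #{queue.dispatched.inspect}, dropped: #{queue.dropped.inspect}"
```

The busy backend never sees request 2 at all, so the pile-up from madly clicking users stays in the queue instead of landing on Rails.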
Like I said, it has only been 2 days, but I am loving this setup so far. It is *so nice* to have lots of configurability and lots of visibility when it comes to the connection between my HTTP server and the application servers.
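As an example of what that visibility can feed, here is a small Ruby sketch that picks out down servers from the CSV export of the stats page. It assumes the export’s header row starts with “# ” and names columns such as pxname, svname, and status; the sample data below is made up and only includes a few of the real columns, and `down_servers` is a hypothetical helper, not part of any monitoring plugin.

```ruby
# Return "proxy/server" names for every row whose status column is DOWN.
# Columns are looked up by header name, so extra columns are ignored.
def down_servers(csv_text)
  lines = csv_text.lines.map(&:chomp).reject(&:empty?)
  headers = lines.shift.sub(/\A#\s*/, "").split(",")
  lines.filter_map do |line|
    row = headers.zip(line.split(",")).to_h
    "#{row['pxname']}/#{row['svname']}" if row["status"] == "DOWN"
  end
end

# Hypothetical sample export, trimmed to a handful of columns.
sample = <<~CSV
  # pxname,svname,qcur,scur,status
  mongrels,server-1,0,1,UP
  mongrels,server-2,0,0,DOWN
  mongrels,BACKEND,3,1,UP
CSV

puts down_servers(sample).inspect
```

A Nagios check could wrap something like this and go critical whenever the list is not empty.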
The HAProxy stats page:
nginx, HAProxy, and Mongrel/Thin
I didn’t see many (any) examples on the web, so I just want to share a few configuration snippets…
nginx config
upstream haproxy {
  server 127.0.0.1:8000;
}
server {
  # blah blah do the same thing that you would with the usual Mongrel setup
}
haproxy config
This should get you started - check out the haproxy manual for all you ever wanted to know. Make sure to look into the things that I bolded.
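global
  log 127.0.0.1 local0 warning
  daemon
  # and uid, gid, daemon, maxconn...
defaults
  mode http
  retries 3
  # drop queued requests whose client has already given up
  option abortonclose
  # if a server goes down, let the request be retried on another one
  option redispatch
  maxconn 4096
  # timeouts are in milliseconds
  timeout connect 5000
  timeout client 50000
  timeout server 50000
frontend rails *:8000
  default_backend mongrels
backend mongrels
  option httpchk
  balance roundrobin
  # maxconn 1: each Rails instance handles one request at a time
  # check: health-check the server and stop routing to it when it is down
  server server-1 127.0.0.1:8001 maxconn 1 check
  server server-2 127.0.0.1:8002 maxconn 1 check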
Thin
Earlier, I mentioned that I switched from Mongrel to Thin.
Thin is basically a drop-in replacement for Mongrel - you don’t have to do anything special (other than “gem install”) to run your rails app under Thin.
Why?
The reasons why I like Thin are pretty simple. Development is very active, it is nicely packaged and includes examples, rake tasks, and scripts for controlling the service, and Rack makes things more flexible. I’m happy with the change, but I don’t think that people need to switch their Rails apps from Mongrel (yet).
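If you haven’t looked at Rack, the flexibility comes from how small the interface is: an application is any object that responds to `call(env)` and returns a status, a hash of headers, and a body that yields strings. The sketch below is a generic illustration of that contract, not anything from the Ravelry code; `HelloApp` is a made-up name and it is called directly here rather than through a real server.

```ruby
# A Rack application is any object that responds to call(env)
# and returns [status, headers, body].
HelloApp = lambda do |env|
  body = "Hello from #{env['PATH_INFO']}"
  [200, { "content-type" => "text/plain" }, [body]]
end

# Call it by hand with a minimal, hypothetical env hash.
status, headers, body = HelloApp.call("PATH_INFO" => "/patterns")
puts status
puts headers["content-type"]
puts body.first
```

Any server that speaks Rack can host an object like this, which is what makes it easy to swap the server underneath an app.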
Tomorrow is Monday
Since we are still adding 800ish people to the site a day and Monday is our busiest day, we break a new traffic record every Monday. We’ll see how things go on day 3 of my new setup.