Writing the software to run a site is only part of the work – you also have to make sure that everything keeps running smoothly. Keeping your systems tuned and happy doesn’t have to be drudgery – I’ve found it to be fun and interesting work as long as you simplify and streamline by gathering the right tools.
At the moment, we are releasing new versions of Ravelry daily. Capistrano 2 makes it easy – I hit a few keys and watch it go. Here is what Capistrano does for me during a typical release:
- Checks out the latest version of the application code from Subversion on each virtual server (actually, it updates a working copy)
- Updates the database to the new schema by running any Rails migrations that have been written since the last update. The schema version is stored in the database. Easy peasy database releases!
- Takes half of the app server cluster out of rotation and stops it before swapping to the new version of the code and bringing those servers back online
- Does the same with the other half
- Voila! New Ravelry, and users probably didn’t even notice that it was happening
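The two-phase restart in the steps above can be sketched in plain Ruby. This is not the actual Ravelry Capistrano recipe: the server names and the `stop`, `switch`, and `start` callables are hypothetical stand-ins for whatever the real deploy runs over SSH.

```ruby
# Minimal sketch of a two-phase rolling restart. Everything here is
# hypothetical: in a real deploy Capistrano would do the work over SSH.

# Split the cluster into two halves (the first half gets the extra
# server when the count is odd).
def halves(servers)
  mid = (servers.size + 1) / 2
  [servers[0...mid], servers[mid..-1]]
end

# Restart one half at a time so the other half keeps serving requests.
# `actions` maps :stop, :switch and :start to callables taking a server.
def rolling_restart(servers, actions)
  halves(servers).each do |group|
    next if group.empty?
    group.each { |s| actions[:stop].call(s) }    # take out of rotation and stop
    group.each { |s| actions[:switch].call(s) }  # point at the new release
    group.each { |s| actions[:start].call(s) }   # bring back online
  end
end

# Example run that only records what would happen.
log = []
actions = {
  stop:   ->(s) { log << "stop #{s}" },
  switch: ->(s) { log << "switch #{s}" },
  start:  ->(s) { log << "start #{s}" }
}
rolling_restart(%w[app1 app2 app3 app4], actions)
puts log
```

Because the second half is only touched after the first half has been started again, at least half of the cluster is available at every point in the run.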
All this is done over SSH – no weird client software is required. If you work with Linux (even if you don’t work with Ruby or Rails) I highly recommend that you check out Capistrano.
Sometimes we can’t do a hot deploy (for whatever code-change related reason). In that case, Cap puts up a maintenance page while it fiddles with the app servers.
Nagios and Munin – I love these two tools.
Nagios is an excellent monitoring tool. It can watch all of your services and email alerts when certain criteria are met. Ravelry has lots of moving parts – web server, app servers, master and slave database, mail, DNS, two different types of search servers, memcached cache servers, screengrabber, feed aggregator… If something goes wrong with one of these services or with a system itself (CPU, disk, etc) it is very nice to be alerted immediately. Plus, the Nagios configuration itself serves as a handy organizational tool. Best of all – it is free and flexible open source software with a feature set that beats many commercial monitoring packages.
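To make the "alert when certain criteria are met" idea concrete, here is a rough sketch of the logic inside a Nagios-style check. Nagios plugins report their result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text. The label and thresholds below are made up for illustration and are not Ravelry's real settings.

```ruby
# Sketch of a Nagios-style threshold check. The status names follow the
# standard plugin exit codes; the label and thresholds are illustrative.
NAGIOS_STATES = { 0 => "OK", 1 => "WARNING", 2 => "CRITICAL", 3 => "UNKNOWN" }

# Compare a measured value against warning and critical thresholds.
# Returns [exit_code, status_line]. A nil value means the measurement
# could not be taken, which maps to UNKNOWN.
def check_threshold(label, value, warn, crit)
  code =
    if value.nil?
      3
    elsif value >= crit
      2
    elsif value >= warn
      1
    else
      0
    end
  [code, "#{label} #{NAGIOS_STATES[code]} - value=#{value.inspect}"]
end

code, line = check_threshold("LOAD", 6.5, 4.0, 8.0)
puts line
# A real plugin would end with `exit code` so Nagios can read the status.
```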
Munin is a really flexible and simple graphing tool. Using Munin, I can have a really handy at-a-glance dashboard that shows me the health of all of my different systems and software. I graph pretty much everything that is graphable 🙂
I find these types of graphs very valuable because I can monitor resource utilization over time, take a look at the effects of code changes on resources and performance, and easily spot spikes and other oddities. Here are two example graphs. The first is the query traffic hitting our master MySQL server. The second is the load average on the VM that grabs RSS feeds and takes screenshots of people’s blogs and other websites. The spike on the graph is Firefox freaking out about something while trying to grab a screencap. You can also see that the load has been stepping up little by little – something I’ll have to look into.
I have a few other data sources that I periodically review: web usage stats, MySQL query logs, and Rails logs.
We use Google Analytics for more advanced web stats and I hardly ever look at it. For the basic stats, I use plain old Webalizer (actually the Stone Steps version). Webalizer provides most of the information that I care about from a sys/network admin perspective.
Here is some output from the Ravelry API stats. We currently have several hundred users who are using a JSON API to show works in progress on their blogs. The report shows hits, bytes, etc. and breaks them down by day. Good enough for me – I can get a rough idea of where our bandwidth is going and I can see trends in the data.
The Rails application logs include great timing information that helps me find bottlenecks in our application. I use SyslogLogger to funnel all of the Rails logs (and other useful logs) to a central syslog-ng server. The centralized logs are compressed and rotated daily, and whenever I want to take a look at performance information for one day of logs (which is plenty of data) I run a log file through pl_analyze.
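As a rough illustration of the kind of timing data involved, the sketch below totals request times per controller action. It is not how pl_analyze works internally. It assumes the older Rails log format shown in `SAMPLE`, assumes one request's lines are not interleaved with another's, and uses made-up controller names and URLs.

```ruby
# Sketch: total up "Completed in" times per controller action.
# Assumes the older Rails log format shown in SAMPLE and that the lines
# of one request are not interleaved with those of another.
SAMPLE = <<~LOG
  Processing ProjectsController#index (for 10.0.0.1 at 2008-02-01 10:00:00) [GET]
  Completed in 0.20000 (5 reqs/sec) | Rendering: 0.05000 (25%) | DB: 0.10000 (50%) | 200 OK [http://example.com/projects]
  Processing ProjectsController#index (for 10.0.0.2 at 2008-02-01 10:00:01) [GET]
  Completed in 0.40000 (2 reqs/sec) | Rendering: 0.10000 (25%) | DB: 0.25000 (62%) | 200 OK [http://example.com/projects]
  Processing PatternsController#show (for 10.0.0.3 at 2008-02-01 10:00:02) [GET]
  Completed in 0.10000 (10 reqs/sec) | Rendering: 0.02000 (20%) | DB: 0.03000 (30%) | 200 OK [http://example.com/patterns/1]
LOG

# Returns { "Controller#action" => { count: n, total: seconds } }.
def summarize(lines)
  stats = Hash.new { |h, k| h[k] = { count: 0, total: 0.0 } }
  current = nil
  lines.each do |line|
    if line =~ /Processing ([\w:]+#\w+)/
      current = $1                      # remember which action is running
    elsif current && line =~ /Completed in (\d+\.\d+)/
      stats[current][:count] += 1
      stats[current][:total] += $1.to_f
      current = nil
    end
  end
  stats
end

# Print the slowest actions (by total time) first.
summarize(SAMPLE.lines).sort_by { |_, s| -s[:total] }.each do |action, s|
  puts format("%-28s %3d requests  avg %.3fs", action, s[:count], s[:total] / s[:count])
end
```

Sorting by total time rather than average tends to surface the actions where an optimization would pay off most, since a moderately slow action that is hit constantly can cost more than a very slow one that is rarely used.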
MySQL can be configured to output a very useful slow query log. Unfortunately, you can’t set the definition of “slow query” to anything less than 1 second, but it is still pretty helpful. Once you’ve got your slow query log, you can use the handy MySQL Statement Log Analyzer (mysqlsla) to summarize the data into easily digestible statistics.
Hackmysql.com has several other useful tools, including mysqlsniffer. Sometimes, I want to grab a snapshot of ALL MySQL activity so that I can roll it up into a summary and look for waste and opportunities for caching or more refined queries. To do this, I just run the sniffer for a while and dump the output to a file.
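The "roll it up into a summary" step can be sketched like this: strip the literal values out of each statement so that queries with the same shape are counted together. This is only an illustration of the idea, not the algorithm mysqlsla or mysqlsniffer uses, and it assumes the captured output has already been reduced to one SQL statement per line. The table and column names are invented.

```ruby
# Sketch: group captured SQL statements by "shape" so that repeated
# queries stand out. Assumes one statement per line.

# Replace literal values with "?" and normalize spacing and case.
def fingerprint(sql)
  sql.strip
     .gsub(/'(?:[^'\\]|\\.)*'/, "?")   # quoted string literals
     .gsub(/\b\d+\b/, "?")             # bare numbers
     .gsub(/\s+/, " ")                 # collapse whitespace
     .downcase
end

# Returns [[shape, count], ...] with the most frequent shape first.
def rollup(statements)
  counts = Hash.new(0)
  statements.each { |sql| counts[fingerprint(sql)] += 1 }
  counts.sort_by { |_, n| -n }
end

captured = [
  "SELECT * FROM projects WHERE user_id = 42",
  "SELECT * FROM projects WHERE user_id = 7",
  "SELECT * FROM users WHERE login = 'casey'",
  "SELECT * FROM projects  WHERE user_id = 9"
]
rollup(captured).each { |shape, n| puts "#{n}  #{shape}" }
```

A shape that shows up thousands of times with different IDs is a natural candidate for caching or for being folded into a single, more refined query.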
Bits and bobs
- You probably want to monitor your app from outside your network as well. Pingdom is good and cheap.
- Don’t waste your time looking for exceptions and errors in your logfiles. Add exception_notifier to your Rails app.
- Munin plugins are really easy to write, but check out the plugin library on MuninExchange before you start rolling your own.
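To show how little a Munin plugin needs, here is a minimal sketch in Ruby. Munin runs a plugin with the argument `config` to learn how to draw the graph, and with no argument to fetch the current value. The metric (files waiting in a spool directory), the `pending` field name, the `SPOOL_DIR` variable, and the default path are all hypothetical.

```ruby
# Sketch of a Munin plugin. The metric, the field name and the spool
# directory are hypothetical examples.

# Lines printed when Munin calls the plugin with "config".
def config_lines
  [
    "graph_title Pending screenshot jobs",
    "graph_vlabel jobs",
    "pending.label pending"
  ]
end

# Lines printed on a normal run. "U" tells Munin the value is unknown,
# which is more honest than reporting 0 when the directory is missing.
def value_lines(dir)
  count = Dir.exist?(dir) ? Dir.children(dir).size : "U"
  ["pending.value #{count}"]
end

# Pick the right output for the arguments Munin passed in.
def plugin_output(args, dir)
  args.first == "config" ? config_lines : value_lines(dir)
end

puts plugin_output(ARGV, ENV.fetch("SPOOL_DIR", "/tmp/screenshot-spool"))
```

Keeping the output in small functions that return lines makes the plugin easy to check by hand before it is dropped into Munin's plugin directory.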