My Sysadmin Toolbox

Writing the software to run a site is only part of the work – you also have to make sure that everything keeps running smoothly. Keeping your systems tuned and happy doesn’t have to be drudgery – I’ve found it to be fun and interesting work as long as you simplify and streamline by gathering the right tools.

Deployment

At the moment, we are releasing new versions of Ravelry daily. Capistrano 2 makes it easy – I hit a few keys and watch it go. Here is what Capistrano does for me during a typical release:

  • Checks out the latest version of the application code from Subversion on each virtual server (actually, it updates a working copy)
  • Runs asset_packager, which minifies, combines, and versions the site’s CSS and JavaScript so that we have smaller files and can ask browsers to cache like crazy
  • Updates the database to the new schema by running any Rails migrations that have been written since the last update. The schema version is stored in the database. Easy peasy database releases!
  • Takes half of the app server cluster out of rotation and stops it, swaps in the new version of the code, and brings those servers back online
  • Does the same with the other half
  • Voila! New Ravelry, and users probably didn’t even notice that it was happening

All this is done over SSH – no weird client software is required. If you work with Linux (even if you don’t work with Ruby or Rails) I highly recommend that you check out Capistrano.

Sometimes we can’t do a hot deploy (for whatever code-change related reason). In that case, Cap puts up a maintenance page while it fiddles with the app servers.
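
As a rough illustration of the half-at-a-time restart used for hot deploys, here is a small Python sketch. It is not the actual Capistrano recipe: the step names and the `run` callback are hypothetical stand-ins for whatever commands get sent to each server over SSH.

```python
from typing import Callable, List, Sequence

# Hypothetical step names; a real deploy would map each to a remote command.
STEPS = ("remove_from_pool", "stop", "switch_release", "start", "add_to_pool")


def split_in_halves(servers: Sequence[str]) -> List[List[str]]:
    """Split the cluster into two batches; the first gets the extra server if odd."""
    middle = (len(servers) + 1) // 2
    return [list(servers[:middle]), list(servers[middle:])]


def rolling_deploy(servers: Sequence[str], run: Callable[[str, str], None]) -> None:
    """Finish every step on one half before touching the other half."""
    for batch in split_in_halves(servers):
        for step in STEPS:
            for server in batch:
                run(server, step)


if __name__ == "__main__":
    # Dry run: print the order of operations instead of executing anything.
    rolling_deploy(
        ["app1", "app2", "app3", "app4"],
        lambda server, step: print(f"{server}: {step}"),
    )
```

Because the second half is not touched until the first half is back in the pool, there is always at least part of the cluster available to serve requests.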

Monitoring

Nagios and Munin – I love these two tools.

Nagios

Nagios is an excellent monitoring tool. It can watch all of your services and email alerts when certain criteria are met. Ravelry has lots of moving parts – web server, app servers, master and slave database, mail, DNS, two different types of search servers, memcached cache servers, screengrabber, feed aggregator… If something goes wrong with one of these services or with a system itself (CPU, disk, etc) it is very nice to be alerted immediately. Plus, the Nagios configuration itself serves as a handy organizational tool. Best of all – it is free and flexible open source software with a feature set that beats many commercial monitoring packages.

Although it has a handy web interface, I usually interact with Nagios via the Firefox extension, the “nsc” command line utility, and email.
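
Nagios checks are just small programs: a plugin prints one line of status text and reports its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The sketch below shows the idea with a disk usage check in Python. The thresholds and function names are made up for illustration and are not taken from the Ravelry configuration.

```python
import shutil
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used: float, warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Map a disk usage percentage to a Nagios status and message."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path: str = "/") -> Tuple[int, str]:
    """Check one filesystem; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, f"DISK UNKNOWN - {error}"
    return classify(100.0 * usage.used / usage.total)


def main() -> int:
    status, message = check_disk()
    print(message)
    return status

# When installed as a plugin, finish the script with: raise SystemExit(main())
```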

Munin

Munin is a really flexible and simple graphing tool. Using Munin, I can have a really handy at-a-glance dashboard that shows me the health of all of my different systems and software. I graph pretty much everything that is graphable :)

I find these types of graphs very valuable because I can monitor resource utilization over time, take a look at the effects of code changes on resources and performance, and easily spot spikes and other oddities. Here are two example graphs. The first is the query traffic hitting our master MySQL server. The second is the load average on the VM that grabs RSS feeds and takes screenshots of people’s blogs and other websites. The spike on the graph is Firefox freaking out about something while trying to grab a screencap. You can also see that the load has been stepping up little by little – something I’ll have to look into.

Reporting

I have a few other data sources that I periodically review: web usage stats, MySQL query logs, and Rails logs.

Webalizer

We use Google Analytics for more advanced web stats and I hardly ever look at it. For the basic stats, I use plain old Webalizer (actually the Stone Steps version). Webalizer provides most of the information that I care about from a sys/network admin perspective.

Here is some output from the Ravelry API stats. We currently have several hundred users who are using a JSON API to show works in progress on their blogs. It shows hits, bytes, etc and breaks it down by day. Good enough for me – I can get a rough idea of where our bandwidth is going and I can see trends in the data.

Rails Logs

The Rails application logs include great timing information that helps me find bottlenecks in our application. I use SyslogLogger to funnel all of the Rails logs (and other useful logs) to a central syslog-ng server. The centralized logs are compressed and rotated daily, and whenever I want to take a look at performance information for 1 day of logs (which is plenty of data) I run a log file through pl_analyze.
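
To give a feel for the kind of roll-up a log analyzer produces, here is a much simplified Python sketch that averages request times per controller action. It is not how pl_analyze works internally. It assumes log lines of the form "Processing Controller#action ..." followed by "Completed in <seconds> ...", and it assumes requests are not interleaved, which will not hold for a busy centralized log unless lines are first grouped by process.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed line shapes; adjust the patterns to match your own log format.
PROCESSING = re.compile(r"Processing ([\w:]+#\w+)")
COMPLETED = re.compile(r"Completed in (\d+\.\d+)")


def average_times(lines: Iterable[str]) -> Dict[str, float]:
    """Return the mean 'Completed in' time for each controller action."""
    timings: Dict[str, List[float]] = defaultdict(list)
    current_action = None
    for line in lines:
        started = PROCESSING.search(line)
        if started:
            current_action = started.group(1)
            continue
        finished = COMPLETED.search(line)
        if finished and current_action:
            timings[current_action].append(float(finished.group(1)))
            current_action = None
    return {action: sum(times) / len(times) for action, times in timings.items()}


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "Processing ProjectsController#index (for 10.0.0.1 at 2007-12-20 09:00:00) [GET]",
        "Completed in 0.25000 (4 reqs/sec) | Rendering: 0.10000 (40%) | DB: 0.05000 (20%) | 200 OK",
    ]
    for action, seconds in average_times(sample).items():
        print(f"{action}: {seconds:.3f}s average")
```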

MySQL

MySQL can be configured to output a very useful slow query log. Unfortunately, you can’t set the definition of “slow query” to anything less than 1 second, but it is still pretty helpful. Once you’ve got your slow query log, you can use the handy MySQL Statement Log Analyzer (mysqlsla) to summarize the data into easily digestible statistics.

Hackmysql.com has several other useful tools, including mysqlsniffer. Sometimes, I want to grab a snapshot of ALL MySQL activity so that I can roll it up into a summary and look for waste and opportunities for caching or more refined queries. To do this, I just run the sniffer for a while and dump the output to a file.
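
The core idea behind summarizing a slow query log can be sketched in a few lines of Python: strip the literal values out of each statement so that similar queries group together, then total the time per group. This is only an illustration and not how mysqlsla is implemented. It assumes each entry has a "# Query_time: N" header line followed by a single-line statement, and the sample data is made up.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Tuple

QUERY_TIME = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")
SKIPPED_PREFIXES = ("#", "use ", "SET timestamp")


def normalize(sql: str) -> str:
    """Replace string and numeric literals with '?' so similar queries match."""
    sql = re.sub(r"'[^']*'", "?", sql)
    sql = re.sub(r"\b\d+\b", "?", sql)
    return re.sub(r"\s+", " ", sql).strip().rstrip(";")


def summarize(lines: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (statement, count, total_seconds) rows, slowest group first."""
    totals = defaultdict(lambda: [0, 0.0])
    pending_time = None
    for line in lines:
        header = QUERY_TIME.match(line)
        if header:
            pending_time = float(header.group(1))
        elif pending_time is not None and line.strip() and not line.startswith(SKIPPED_PREFIXES):
            entry = totals[normalize(line)]
            entry[0] += 1
            entry[1] += pending_time
            pending_time = None
    rows = [(sql, count, total) for sql, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    sample = [
        "# Query_time: 3  Lock_time: 0  Rows_sent: 1  Rows_examined: 500000",
        "SELECT * FROM projects WHERE user_id = 42;",
    ]
    for sql, count, total in summarize(sample):
        print(f"{count} queries, {total:.0f}s total: {sql}")
```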

Bits and bobs

  • You probably want to monitor your app from outside your network as well. Pingdom is good and cheap.
  • Don’t waste your time looking for exceptions and errors in your logfiles. Add exception_notifier to your Rails app.
  • Munin plugins are really easy to write, but check out the plugin library on MuninExchange before you start rolling your own.
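
To show how little a Munin plugin needs, here is a minimal sketch in Python that graphs the one-minute load average. Munin runs a plugin with the argument `config` to learn how to draw the graph, and with no argument to fetch the current values. The field name `load` and the graph labels are arbitrary choices for this example, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import sys
from typing import List


def munin_output(argv: List[str], load: float) -> str:
    """Return the graph description for 'config', otherwise the current value."""
    if len(argv) > 1 and argv[1] == "config":
        return "\n".join([
            "graph_title Load average",
            "graph_vlabel load",
            "graph_category system",
            "load.label load",
        ])
    return f"load.value {load:.2f}"


if __name__ == "__main__":
    one_minute_load = os.getloadavg()[0]  # Unix only
    print(munin_output(sys.argv, one_minute_load))
```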

Comments (6)

  1. Vis Major wrote:

    I knew you were a big-brained smarty pants, but these glimpses into Ravelry under the hood are really interesting and confirm even more that you’re a big-brained smarty pants. :)

    Thursday, December 20, 2007 at 9:16 pm #
  2. Richard wrote:

    GoStats might be a useful stats tool to help reduce the webstats load on your servers. GoStats is a free and rich system. (more advanced than webalizer) …and it’s always good to have a backup ;)

    Friday, December 21, 2007 at 9:28 am #
  3. trillian42 wrote:

    OK, I freely admit that 99% of the software talk goes directly over my head, but I did pass the link along to my hubby.

    I’m just proud of myself for getting the “Monkey Island” reference!

    Friday, December 21, 2007 at 7:53 pm #
  4. max wrote:

    Pure awesomeness, bro! :)

    Wednesday, December 26, 2007 at 7:21 pm #
  5. Vis Major — I can confirm that Casey is a big-brained smarty pants. Pre-Ravelry we worked at the same company. The world hasn’t been the same since he left us for some whacked knitting site.

    Wednesday, January 9, 2008 at 8:11 pm #
  6. Adam wrote:

    Thanks a lot for this post. Immensely useful.

    Monday, March 3, 2008 at 1:12 pm #
