Saving money by putting a web cache in front of Amazon S3

We use Amazon S3 to store all of our files (images, PDFs, etc.). We do this because it is cost effective – the equipment purchases alone for our own redundant, distributed storage system would cost roughly as much as 1 year of S3 storage service.

Paying Amazon to serve those files to end users? Not so cost effective.

For the last 5 months, we’ve been serving up files from a caching proxy server that only hits Amazon when necessary.

To illustrate the savings: in the last 30 days we’ve served up 2.7 billion requests for files that are stored on S3 and we’ve transferred a total of 24 terabytes of data.

Recurring monthly cost to pay Amazon to do this:

  • GET requests: $270
  • Data transfer: $3100

Total: $3370

Recurring cost to handle it ourselves:

  • Bandwidth (120 Mbps): $1200
  • Cache misses (S3 costs): roughly $400

Total: $1600

Note that this bandwidth is from a single ISP – we’re not paying for any redundancy and that’s okay because we can always temporarily serve from S3 if there is a problem.
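As a rough sanity check on the 120 Mbps figure, the 30-day totals quoted earlier can be converted into averages. This is only a sketch: it assumes decimal terabytes and perfectly even traffic, neither of which comes from the measurements above, and it says nothing about peak load.

```python
# Figures quoted in the post for the last 30 days.
REQUESTS = 2.7e9          # requests served
TERABYTES = 24            # data transferred
COMMITTED_MBPS = 120      # bandwidth being paid for

# Assumptions (not from the post): decimal terabytes and perfectly even traffic.
BYTES_PER_TB = 10**12
SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60


def traffic_profile(requests, terabytes, seconds=SECONDS_IN_30_DAYS):
    """Return (requests per second, average KB per request, average Mbps)."""
    total_bytes = terabytes * BYTES_PER_TB
    requests_per_second = requests / seconds
    avg_object_kb = total_bytes / requests / 1000
    avg_mbps = total_bytes * 8 / seconds / 1e6
    return requests_per_second, avg_object_kb, avg_mbps


rps, avg_kb, avg_mbps = traffic_profile(REQUESTS, TERABYTES)
print(f"{rps:.0f} requests/s, {avg_kb:.1f} KB per request, {avg_mbps:.1f} Mbps average")
print(f"Average utilization of the {COMMITTED_MBPS} Mbps line: {avg_mbps / COMMITTED_MBPS:.0%}")
```

Under those assumptions the average works out to roughly 74 Mbps, which leaves some headroom under 120 Mbps for busier hours.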

One-time costs to handle it ourselves:

  • Old 1U server with 8 GB RAM: $0 but let’s say $1000
  • 2 x SSD drives: $600
  • 1U SuperMicro machine acting as pfSense gigabit router: $1000
  • GigE fiber installation fees: $600
  • Fiber to Cat 5e converter: $200

Total: $3400

So after 2 months we had saved enough to cover the up-front costs.
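The break-even point can be checked with a few lines of Python. The sketch below uses only the dollar figures listed in this post; the dictionary keys and the function name are just labels chosen for illustration.

```python
import math

# Monthly cost of letting Amazon serve everything (from the post).
S3_MONTHLY = {"GET requests": 270, "Data transfer": 3100}

# Monthly cost of serving through our own cache (from the post).
SELF_HOSTED_MONTHLY = {"Bandwidth (120 Mbps)": 1200, "Cache misses (S3 costs)": 400}

# One-time costs (from the post, counting the old server at $1000).
ONE_TIME = {
    "Old 1U server": 1000,
    "2 x SSD drives": 600,
    "pfSense router": 1000,
    "Fiber installation": 600,
    "Fiber to Cat 5e converter": 200,
}


def break_even_months(before, after, one_time):
    """Whole months of savings needed to cover the one-time costs."""
    monthly_savings = sum(before.values()) - sum(after.values())
    if monthly_savings <= 0:
        raise ValueError("no monthly savings, so the costs are never recovered")
    return math.ceil(sum(one_time.values()) / monthly_savings)


monthly_savings = sum(S3_MONTHLY.values()) - sum(SELF_HOSTED_MONTHLY.values())
one_time_total = sum(ONE_TIME.values())
months = break_even_months(S3_MONTHLY, SELF_HOSTED_MONTHLY, ONE_TIME)
print(f"Saving ${monthly_savings}/month against ${one_time_total} up front: "
      f"break-even after {months} months")
```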

SSDs and pfSense helped us keep our up-front costs down. SSDs are perfect for this sort of workload, and super fast drives made it possible to repurpose a single crusty old server for this task. pfSense is fantastic open source firewall software that enabled us to build a gigabit firewall/router for $1000 instead of getting totally ripped off by Cisco or someone else on gigabit speed hardware.

There are few disadvantages to this – if anything ever goes wrong with our hardware or connection, we just fail over to S3 and stop saving money. Our app currently requires a restart to switch between Amazon and our own hosting (or some other CDN), but it’s a fairly fast and seamless process.

We use nginx to do the proxy caching. It’s pretty much zero-maintenance and the configuration was really simple and straightforward. How do you actually configure it? I posted an answer on this Stack Overflow question: How to set up Nginx as a caching reverse proxy?
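For a rough idea of what such a setup involves, here is a minimal sketch of an nginx caching proxy in front of an S3 bucket. It is not the configuration from the Stack Overflow answer or the one running in production here: the bucket name, server name, cache path, zone name, and every size and lifetime are placeholders to adjust.

```nginx
# Goes inside the http { } block of nginx.conf.
# All names, paths, sizes, and lifetimes below are placeholders.

# On-disk cache: up to 100 GB, entries dropped after 30 days without a hit.
proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3_cache:100m
                 max_size=100g inactive=30d;

server {
    listen 80;
    server_name files.example.com;

    location / {
        # Cache misses are fetched from the (hypothetical) bucket.
        proxy_pass http://example-bucket.s3.amazonaws.com;
        # Spelled out for clarity; nginx sends the proxy_pass host by default.
        proxy_set_header Host example-bucket.s3.amazonaws.com;

        proxy_cache s3_cache;
        # Used when the S3 response carries no caching headers of its own.
        proxy_cache_valid 200 30d;
        proxy_cache_valid 404 1m;
        # Keep serving cached copies if S3 is slow or returning errors.
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;

        # Exposes HIT/MISS so the cache can be checked from the client side.
        add_header X-Cache-Status $upstream_cache_status;
    }
}
```

With something like this in place, the first request for a file goes to S3 and is written to the cache directory, and later requests for the same URL are answered from local disk until the entry expires or is evicted.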

Comments (4)

  1. Charlie Crystle wrote:

    Great post–we’re on S3 as well, and I imagine as we scale we’re going to run into similar issues.

    You could also try some local caching, but I guess that depends on browser version.

    Do users lose session state on a shutdown?

    Tuesday, May 17, 2011 at 11:24 am #
  2. Jimmy wrote:

    I think you need to put cache directives on the files you serve, it could save a lot of money..

    http://pagespeed.googlelabs.com/#url=http_3A_2F_2Fwww.ravelry.com_2F&mobile=false&rule=LeverageBrowserCaching

    Friday, May 20, 2011 at 8:13 am #
  3. Casey wrote:

    Hi Jimmy,

    Yep – all of our files are already set with the proper headers directing browsers to cache them forever.

    We serve a lot of image-heavy traffic.

    Friday, May 20, 2011 at 9:10 am #
  4. hassan wrote:

    i need exactly the same thing as you’ve implemented, though i dont understand how you’d configure nginx to get an image from s3 server and cache it locally if not available.

    What i need is basically:

    image requested serve from nginx if available get from s3, save locally and serve from nginx ..

    any ideas?

    Monday, January 9, 2012 at 10:48 am #

Trackback/Pingback (1)

  1. Quora on Monday, May 16, 2011 at 2:56 pm

    What are good open source firewalls for Linux?…

    Here’s a nice article about the chap from Ravelry who used pfsense to good effect and save some money to boot!

    http://codemonkey.ravelry.com/2011/05/16/saving-money-by-putting-a-web-cache-in-front-of-amazon-s3/