(Ken Struys' Blog)

web-developer, serious schemer

Building a Scalable Web Architecture

A lot of people talk about building a scalable web architecture but don't actually explain what they mean. "Oh well you know use memcache and stuff", "No, that doesn't explain anything, how do you actually do it?". I'm sick of hearing it, so I'm going to explain how I do it. The best part about it is, other then server cycles/memory all the software I use are free. Here's a diagram of everything in the stack and that I will cover:

Let's start from the first entry point. First we have some kind of Load Balancer. In terms of software there are two great products out there, SQUID and Varnish. SQUID is the one I'm familiar with so I'm going to go over that one, but Varnish is being used by companies like Facebook, Twitter and Flickr it's really awesome and hopefully I will soon get the chance to try it out.

There are two main services SQUID provides. The first service is reverse proxying. SQUID receives a request from the web and send it's own request to one of the application servers (usually following round robin). That satisfies the load balancing of application servers. The other service SQUID provides is caching, when SQUID receives the response from the application server it caches the result by URL. You save a lot with SQUID's cache because application servers running things like PHP can be crazy CPU heavy. Caching generated results for even minutes we can handle a lot more load.

SQUID is policies are controlled by the squid.conf file. I like the configuration file because it allows you to control the caching policies. You can do things like disallow cache access of logged in users (acl aclname req_header Cookie [-i] user_cookie) and we can create rewrite programs to do url rewriting (url_rewrite_program /etc/squid/rewrite.py). I won't go through everything related policy by here's a starting configuration that I think makes sense. When you get the conf from SQUID it comes with an entire manual inline and it's hard to actually see what your configuration is doing either use mine, or remove the manual, it's available online.

Next we have our actual webservers, I usually just go with vanilla apache running whatever MVC framework I fancy. You might also want a webserver that better handles loading of assets (like images), for that I suggest nginx.

The final step that pretty much everyone takes these days is using some form in memory data cache. Usually I use memcached (distributed memcache). memcache isn't magic it's actually the simplest thing in the world: key value storage (max 1 MB values). You only really need two functions it provides, get(key) and set(key, value). The distributed aspect is that each webserver gets it's own memcache. When you add another webserver you add another memcache to the pool and all webservers can see the entire pool of memcaches. Why do the webservers need access to the entire pool? Let's say two requests come for the same page and we have a architecture decribed above. First request comes in and SQUID makes the request to the first webserver. The webserver tries to get data from memcache and misses so it pulls data from the database and stores it in it's memcache. Second request comes in, SQUID makes the request to the second webserver, this webserver checks and it hits because the first webserver stored it in the cache, no database call. If the second webserver wasn't able to see the entire distributed cache, it would have required a call to the database.

That's about all you really need to know to have pretty reasonable performance. Don't get me wrong, companies like Twitter and Facebook do far more then what I've described, but until your product requires fail whales stay lean, you can hire a scalability expert later. To give you idea of what companies actually use, I compiled the following table:

CompanyLoad BalancerWeb ServerApp ServerMemcacheDatabaseOther Cool Stuff
TwitterVarnishMongrelRuby on Rails/ScalaMemcachedCassandraMurder
WikipediaOver 40 SQUIDs and physical load balancersApacheMediaWiki/PHPMemcached (over 80 memcache servers, >3gb storage)MySQL
DiggPhysical Load BalancerApachePHP/PythonMemcachedLazyboy/CassandraRabbitMQ
RedditElastic Load BalancingTornadoPython/DjangoMemcachedb, tons of stuff precomputedPostgreSQL (mostly denormalized)All on EC2
facebookVarnishHipHop embeded , over 10,000 serversPHP/C++ via Hip-HopMemcached over 800 instancesCassandraHipHop, BigPipe, Haystack, Hadoop/Hive
YoutubeNetScaler/VarnishApache, lighttpd for videoPython with psycoMemcached a lot of pre-rendered pagesMySQLCDN for most popular videos per locale