Work With Kyle


Sharding

I’ve spent the last two posts talking about some of the performance challenges we’ve faced over the years, so today I want to shift the focus a bit and talk about performance’s cousin, scalability*.  Scalability is a broad topic, so for this post I’m going to focus on scaling databases, a particularly acute challenge for many web applications.

To provide some background for the uninitiated: databases have to be “scaled” because, if you have enough users, no matter how much you optimize performance, at some point you won’t be able to process everything on one db server. You’re going to be forced to split your data. But, in dividing your data onto multiple servers you end up creating a new challenge for yourself - you have to be able to figure out which data is on which server. The solution: sharding**.

At Bunchball, we generally try to avoid premature optimization, but this is one case where we eschewed that maxim and planned for scale from almost the beginning. We’d heard horror stories about retro-fitting sharding solutions onto mature web apps and we knew it was an inevitable hurdle we’d have to face, so we built it into our architecture early on.

Broadly, there are two types of sharding, vertical and horizontal:

Vertical (Functional) Sharding:

Vertical sharding is the partitioning of databases by feature (“vertical”). What this means is that instead of having one server do everything, you break up responsibilities so that one server is in charge of, say, leaderboards while another is in charge of virtual items, and so on. The primary virtue of this approach is that it’s simple. Your code is (hopefully) already broken up by feature, so you can just migrate your data to new servers, point different parts of the code to different databases, and voila, you have a sharding solution (your results may vary :) ).
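To make that concrete, here is a minimal sketch of what "pointing different parts of the code to different databases" can look like. The feature names come from the example above; the host names and the `database_for_feature` helper are made up for illustration and are not Bunchball's actual code.

```python
# Hypothetical feature-to-database map. The host names are invented
# for illustration; in practice they would come from configuration.
FEATURE_DATABASES = {
    "leaderboards": "db-leaderboards.internal:3306",
    "virtual_items": "db-items.internal:3306",
}


def database_for_feature(feature: str) -> str:
    """Return the address of the server that owns a feature's data."""
    try:
        return FEATURE_DATABASES[feature]
    except KeyError:
        # Fail loudly rather than silently falling back to some default server.
        raise ValueError(f"no database configured for feature {feature!r}") from None
```

The map is static: the feature alone determines the server, which is what makes this approach simple.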

Sadly, vertical sharding is not a panacea - it doesn’t scale forever. No matter how finely you partition functionality, someday, some component won’t fit on one server. So, despite its utility and simplicity, at some point you have to turn to the alternative, horizontal sharding (though the two are often used in conjunction).

Because of these limitations, at Bunchball we use vertical sharding for only a few things: in our platform we separate configuration data (each of our customers has a custom configuration) from user data. And, like almost every web service on the planet, we separate our data warehouse from our main platform. In these cases the different access patterns on the databases justified the division:
  •      Site configuration: frequent reads, infrequent writes
  •      User data: frequent reads and writes
  •      Data warehouse: rare bulk reads and writes

Horizontal Sharding:

Horizontal sharding partitions by user or customer instead of by feature. So, in theory, it’s infinitely scalable because as you add more users, you can just spin up additional servers, ad infinitum. But with the promise of infinite scalability comes additional complexity because unlike the hard-coded db connections from vertical sharding, horizontal sharding requires you to figure out which database to connect to dynamically, based on a partitioning key.
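As an illustration of choosing a database dynamically from a partitioning key, here is a minimal sketch that hashes the key and takes it modulo the number of shards. This is one common technique, not necessarily what Bunchball used; the shard host names and the `shard_for_key` helper are assumptions made up for the example.

```python
import hashlib

# Hypothetical shard hosts, invented for illustration.
SHARDS = [
    "db-shard-0.internal",
    "db-shard-1.internal",
    "db-shard-2.internal",
]


def shard_for_key(partition_key: str, shards=SHARDS) -> str:
    """Map a partitioning key (e.g. a userId) to one shard host."""
    # Use a stable hash: Python's built-in hash() for strings is
    # randomized per process, so it would route inconsistently.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

One caveat with this simple scheme: changing the number of shards changes where most keys land, so adding a server means moving a lot of data. Lookup tables and consistent hashing are common ways around that.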

The first step in designing a horizontal sharding solution is figuring out what to partition on. Often the best choice is userId, but in our case, we actually first partitioned by customer website. This meant that each of our customers had their own database (though we co-located many databases on a single physical server). This wasn’t a perfect solution, because we couldn’t handle individual, super-high-traffic sites, but it did help us scale out quickly.
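Partitioning by customer website with several databases sharing a physical server can be sketched as a small directory lookup. Everything here (the site ids, host names, and the `locate_site` and `move_site` helpers) is hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatabaseLocation:
    host: str      # physical server
    database: str  # per-customer database on that server


# Hypothetical directory: one database per customer site, with
# several databases co-located on the same physical server.
SITE_DIRECTORY = {
    "site-a": DatabaseLocation("db-host-1.internal", "site_a"),
    "site-b": DatabaseLocation("db-host-1.internal", "site_b"),
    "site-c": DatabaseLocation("db-host-2.internal", "site_c"),
}


def locate_site(site_id: str) -> DatabaseLocation:
    """Look up which server and database hold a customer site's data."""
    return SITE_DIRECTORY[site_id]


def move_site(site_id: str, new_host: str) -> None:
    """Point a site at a new server (the data copy itself is not shown)."""
    SITE_DIRECTORY[site_id] = replace(SITE_DIRECTORY[site_id], host=new_host)
```

Because each site is its own database, a busy customer can be moved to a less crowded server by updating one directory entry. What this layout cannot do is split a single site across servers, which is the limitation described above.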

That all changed when we signed MySpace, a site that at the time had 50M+ monthly uniques. We knew there was no way we could handle that level of traffic without more granular sharding, but that story will have to wait until another post…


Notes:
* The difference between performance and scalability trips up many people. At some point I’ll write a post explaining my understanding of the distinction.
** Many projects in the NoSQL movement attempt to shield their users from this complexity by making sharding transparent… more on NoSQL later… though I do think, even given the success of NoSQL, everyone should understand the basics of sharding.

  • 6 months ago

About

My name is Kyle Vigen, and I lead back end development at Bunchball, a gamification startup. I started this blog to share the cool stuff I work on with the world and to try to convince other talented engineers that they would love working with me and my co-workers at Bunchball.

I have a BS in computer science from Stanford, and in my free time I enjoy playing basketball, reading anything I can get my hands on, and talking about how I'm going to start training for a triathlon.

Please check out our careers page: http://bunchball.com/careers


Also check out:
http://WorkWithKasey.com
http://WorkWithMolly.com
http://WorkForSteve.com