Work With Kyle

  • Archive
  • RSS
  • Ask me anything

Sharding (continued)

A month ago I ended a post with a teaser about how we had to quickly scale up to handle MySpace’s traffic, so today I want to talk about the architectural changes that made that possible. In particular, I want to dive a little deeper into one of the fundamental questions we had to answer: how to horizontally partition (ie shard) our data. Broadly, there are two approaches to sharding…

Fixed Schemes:
A fixed partitioning strategy is one where a fixed characteristic of the item you’re sharding determines which partition it’s on, or, more concretely, it’s like an old school encyclopedia (the physical kind we had before Wikipedia). Those encyclopedias contained too much information for one book, so the articles were “partitioned” across multiple volumes - one book had all the As; another had the Bs and Cs, and so on. To find the right volume you used a fixed characteristic of the article, namely the first letter in the title.

There are many ways to incorporate fixed schemes into distributed systems. One naive approach is to partition users based on their last names, with all the “As” on one server, “Bs” on another, and so on. But this scheme has a couple drawbacks, the most important of which is that not all the partitions are same size (the “J” block is much bigger than the “X” one)*, and this disparity adds complexity. Fortunately, computer science has provided us with a tool for skirting exactly this obstacle - hash functions. To get an idea how hashing could help, consider the following simple, though imperfect, formula**:

node index = (hash(user name)) mod (number of nodes)

With a little hashing and modular arithmetic you can still used a fixed characteristic (the user name) to determine which node a user’s data is on, and you can also evenly distribute the data. Unfortunately, though this is a really powerful and widely used approach, it’s not perfect. Notably, whenever you add new nodes some of your old data has to be moved between nodes**. In some systems migration is manageable (see many of the NoSQL dbs), but with most relational databases, MySQL included, it’s a huge chore. So, here at Bunchball, to avoid the complexity of building and maintaining migration scripts, we turned to the alternative, dynamic schemes.

Dynamic Schemes:
In dynamic partitioning strategies the application cannot determine which node contains an item solely from the properties of the item itself. Instead, the app determines which shards contains the data by looking it up in an (often large) lookup table.

This scheme has several advantages, the obvious one being that any single piece of data can be moved to a new shard simply by updating the lookup table (and copying over the data). But although that flexibility is nice, it still leaves us with the same core problem as fixed schemes - copying MySQL data can be a pain. What we really want is to be able to add new nodes without moving data, and (surprise, surprise) with dynamic schemes we can do this. The trick is that we now control which shard new users are assigned to, so whenever a shard gets “full” we can stop assigning it new users. That way its current users never need to be re-assigned to a new shard.

At first glance, one might object that all the nodes won’t have the same number of users and/or the same amount of activity - the exact same problem we worried about with fixed schemes. But, what I left out in my discussions of fixed schemes is that though this makes life harder, with good preparation it’s not unmanageable***. To understand how this could work you first need to know two things about MySQL. First, you can put more than one logical MySQL database (ie shard) on a single physical server. Second, unlike moving individuals rows, migrating entire MySQL databases is relatively simple. Thus, using these two properties, we can maintain fairly equal database load by doing two things:

1. Keeping each of our logical databases small relative to the size our physical servers 

2. Mixing and matching databases appropriately.

One downside of using dynamic lookup tables is that they are a single point of failure and a potential bottleneck, but though these concerns are legitimate we ultimately decided that we could live with them because we thought that with memcached, database replication, and SSDs we could make it work in practice. Also, we knew that if at some point, our dynamic scheme didn’t scale, it would be pretty easy to build a fixed scheme on top of the lookup table. 

In practice, we’ve implemented a dynamic scheme that stores data for over 100M users over a couple hundred logical shards, and though we’ve spent some time moving shards around to balance out database server load and adding extra caching around our global lookup table, for the most part it’s just worked. Also, since each of our biggest customers can have their own lookup tables, we feel pretty comfortable we’ll be able to scale with the current system until, inevitably, Facebook becomes a customer.

Future topics: querying across shards, hashing rings, hybrid schemes

Notes:
* We will see solutions to this later in the article

**  If, for a given user, the hash mod (old # of nodes) doesn’t equal mod (new # of nodes) then their data has to be moved from the old node to the new node. No matter what you do, some data has to be moved. That said, techniques to reduce the amount of data moved have been developed, notably consistent hash rings (here’s a simple explanation), a common technique in NoSQL solutions, that I hope to explore in further detail in later entries.

For the mathematically inclined reader, you can show that if you add one new server at a time, that formula forces you to move almost all the data every time. In fact if the number of nodes is p, then only (1/(p + 1)) of the data stays on the same node; the rest must be moved. You can show this by proving the following two statements, which I leave to the intrepid reader (with a hint for #2 below)
1. Show that p and p + 1 are relatively prime for all integers, p
2. Prove that following two statements are equivalent (ie, one is true if and only if the other is)
   a. p and q are relatively prime
   b. x = (c % p) = (c % q) implies x = c % (pq)
3. Conclude that approximately (1 / (p + 1)) of the data stays on the same node

<
Hint: Use a consequence of the Chinese Remainder Theorem: if a and b are relatively prime then there exist integers x and y such that a * x + b * y = 1

*** Dynamic schemes present a slightly, but critically different problem from fixed schemes with regard to this issue. Also, from a didactic perspective, I still think equal size nodes are a good introduction into hashing schemes.

 

  • 4 months ago
  • 1
  • Permalink
  • Share
    Tweet

1 Notes/ Hide

  1. bunchball reblogged this from workwithkyle and added:
    Kyle Vigen, Bunchball engineer extraordinaire...Bunchball. He even mentions
  2. workwithkyle posted this
← Previous • Next →

About

My name is Kyle Vigen, and I lead back end development at Bunchball, a gamification startup. I started this blog to share the cool stuff I work on with the world and to try to convince other talented engineers that they would love working with me and my co-workers at Bunchball.

I have a BS in computer science from Stanford, and in my free time I enjoy playing basketball, reading anything I can get my hands on, and talking about how I'm going to start training for a triathlon.

Please check out our careers page: http://bunchball.com/careers


Also check out:
http://WorkWithKasey.com
http://WorkWithMolly.com
http://WorkForSteve.com
  • RSS
  • Random
  • Archive
  • Ask me anything
  • Mobile

Effector Theme by Carlo Franco.

Powered by Tumblr