Sharding (continued)
node index = (hash(user name)) mod (number of nodes)
<
Reading
Sharding
- Site configuration: frequent reads, infrequent writes
- User data: frequents reads and writes
- Data warehouse: rare bulks reads and writes
Queuing
When we left off last time I was describing our philosophy of performing pre-computation at write time to improve our read performance. At first we applied this approach only to leader boards, but as we added new features we continuously incorporated them into this framework, and it wasn’t long before we were doing quite a bit of work at write time - updating leader boards, news feeds, the data warehouse, etc… So, in solving our read response time problem, we were gradually creating a write response time problem. We needed to figure out a way to do the pre-computation without sacrificing write performance:
Parallelization:
One option was parallelization. Updating leader boards, adding to the news feed, etc… could all be done independently. So, we could modularize the work into a handful of self-contained “operations”, allowing us to realize significant performance improvements simply by kicking off a thread for each operation. This solution had a lot of nice properties:
1. It would be relatively easy to implement
2. It would reduce our response time to the response time of the longest operation.
Still, we knew that parallelization is never a panacea, and we would have to balance its benefits with the downsides - the possibility of bizarre synchronization bugs and increased code complexity. Luckily, since we chose independent units of work and relied, primarily, on the underlying database for synchronization, we were able to sidestep most of those risks. So, after a little internal discussion, we decided to throw together a simple multi-threading implementation.
This solution reaped huge benefits, but it didn’t handle a couple scenarios very well, notably:
1. Traffic spikes would cause us to spin off a ton of threads, and could overwhelm our servers
2. If any of the individual operations failed, we had to roll them all back
3. Though we sidestepped most of the new synchronization issues, lock contention increased
So, we decided to complement parallelization with another classic software paradigm, queuing.
Queuing:
Queuing, as I’m sure you know, is a simple concept: instead of doing work immediately you put in a queue and do it “when you can”. And though it isn’t applicable in all situations, it’s widely used for both time-insensitive and bulk operations.
There are a fair number of open source queuing solutions available, but after evaluating the options, we decided to build our own in house because we we wanted something simple that we had complete control over. We’re currently improving our model to make it more robust and flexible so I’ll hold off on the details until we’ve finished, but for now it suffices to say that I view it as probably the most important component in our back-end architecture - the paradigm that powers many of our most complex features.
Summary Tables
I want to kick-off the blog with a series of posts detailing some of the challenges the Bunchball engineering team has faced over the past few years. I hope they’ll provide insight into the way we think about architecture, performance, reliability, etc… This first post will likely be remedial for those of you who’ve built scalable websites, but even if you’re an expert I’d still recommend skimming through the post cause it’ll frame the rest of the series. Now, on to the story…
When our back-end was young, life was simple. When a user watched a video, we just logged the activity and rewarded the user’s good behavior with some points. When a user wanted to see a leader board with the site’s ten most frequent videos viewers, we just ran a simple SQL query (modified for readability):
select user_id, count(*) from video_watches group by user_id order by count(*) desc limit 10
Unfortunately, though simple is nice, in this case, it didn’t scale. As our user base and the number of videos they watched grew, the query took longer and longer. We needed a solution where the query time didn’t increase with the number of users. We brainstormed a couple options:
One was the reflex response to any database performance issue: to cover it up with a caching layer. The logic being that if you run a query only once every fifteen minutes then it isn’t so bad if it takes thirty seconds each time. We evaluated this approach and agreed it would solve the immediate pain point, but would also harm the user experience because the user’s scores wouldn’t show up in the leader board immediately. We also realized that even if we compromised and accepted this as a necessary evil, at some point we were still going to have to address the underlying problem. We couldn’t just let the query steadily take longer and longer. Taking all that into account, it wasn’t a tough decision to eschew the caching solution and implement our other idea, “summary tables”, instead.
Summary Tables:
At their core, leader board summary tables are simple. They’re just a database table that stores a user’s score in the leader board:
Schema: (leaderboard_id, user_id, score), indexes: (user_id), (score)
Their implementation is also straight forward: every time a user watches a video we update their score in the table. Summary tables have a lot of nice properties, most notably, their “score” index allows us to get a list of the top users quite quickly. And though they cause extra overhead from the extra db writes needed to keep them up-to-date, compared to caching or running huge select queries, they’re more than worth it.
So, I’m sure a lot of you are wondering why I began this series with summary tables. They’re certainly not a revolutionary idea. The answer is that I think they clearly illustrate our philosophy of doing extra computation at write time (ie when the user watches the video) to improve performance at read time (ie when the user looks at the leader board). Now, that’s hardly a unique philosophy (it’s shared by most teams building web services), but I think it’s a critical one that greatly informs our decision making.
Also, the motivation behind summary tables leads nicely into my next topic…
Why the Blog…
Now that you know a bit about me, let me take a second to talk about Bunchball and this blog.
Bunchball is a gamification (more to come on what this means) startup in the bay area. We’ve built a web platform and suite of widgets that help our customers incorporate game dynamics into their digital experiences. We have a team of six great engineers, but with surging interest in gamification we need to continue bringing on top talent. The challenge, as anyone who’s worked at a startup before knows, is that finding amazing people is difficult, so we’ve decided to start this blog to educate the world about engineering life at Bunchball and, hopefully, convince our readers that they want to join us on our journey.
Recruiting:
Simplifying to the essential, I believe that if you want to recruit successfully as a startup you must do two things:
1. You must expand awareness. You’ve got to get top candidates to send you their resumes
2. Once you’ve got the attention of great people, you’ve got to sell them on the position. A non-exhaustive list of what they might be looking for includes:
a. Interesting, challenging problems
b. Other great people
c. A hot industry
d. A company that strives to make the world a better place
Not surprisingly, I think Bunchball embodies all these characteristics, and I hope to use this blog to convey that.
Content:
Towards this end, I’m planning on writing about the interesting and challenging problems I face. From ten thousand feet, my responsibilities can be divided into two groups:
1. Improving our back end platform
2. Developing our analytics offering
Most of what I write about will fall under those two umbrellas - topics ranging from performance, scalability, and code architecture to data warehouses and data mining. That said, I don’t want this to be a carbon copy of every other engineering blog, so I’ll also occasionally veer off course and ramble about random topics for no reason other than that I think they’re fascinating.
On Me
Welcome to my blog! Seeing as this is my inaugural post, I figure I should take a second to tell the story of how I became a software developer at Bunchball…
When I began my freshman year of college I didn’t know what I wanted to do with my life. I only knew that my interests leaned towards math, so I spent my first two years taking all the prescribed introductory physics, math, computer science, and economics classes, but nothing captured my imagination. I did love CS theory (I’ve always been a sucker for abstract mathematics), but couldn’t bring myself to declare the major cause I found the coding tedious and worried about becoming the stereotypical “code monkey”.
It wasn’t until the end of my sophomore year, when I needed to decide on some industry for my summer job, that I was forced to take a stand. So, I flipped through some job postings and stumbled across an opening at a small startup, Robot Genius, and decided why not. It’s only a summer, it couldn’t hurt to give it a shot… And To make a long story short, it turned out to be the best summer of my tlife. Beyond all the fun I had, I learned three lessons that fundamentally changed the way I looked at my career:
1. Building software can be a lot of fun
2. A really good engineer isn’t a code monkey.
3. Working with a talented, motivated team can be extremely rewarding.
The next summer I worked at Robot Genius, again, and loved it, again. Still, I wasn’t 100% certain that startups were the place for me. Maybe I had just gotten really lucky and found a great one. Maybe working for a big company would be even better. So, after college, I tried something completely different and moved to Redmond to work for Microsoft on the Windows file system. And well… I don’t want to give the impression that working for Microsoft is terrible - it certainly isn’t - but the passion and the fire just weren’t there. Most Microsoft employees get pigeonholed into one small part of a much larger system and don’t get a chance to do the experimentation and exploration that happens every day at a startup. So, after a year, I decided it was time to return to the bay area and get back to what I knew I loved, building software at a startup.
That brings us to three years ago when I started working at Bunchball…
