head in the clouds
or: scaling with elastic map-reduce
Between a rapidly evolving compute environment in which cores are multiplying like springtime rabbits, and a business domain in which the fecundity of market data is making those same rabbits look downright prudish, we are always looking for ways to scale our efforts. There are three levels at which this can be typically done:
- “below the processor” with things like CUDA or FPGAs,
- “amongst the cores” with things like TBB, Cilk, &tc, or
- “in the cloud” (or grid) with things likes Amazon’s EC2 and its bewildering and fast-growing coterie of associated technologies and products
Today we’ll look at a simple example which we implement on top of Hadoop and then deploy into Amazon’s cloud to get a back-of-the envelope feel for what kind of scaling we might expect to gain – and at what costs – from using this smorgasboard of technologies.
‘smorgasboard’ is being generous
Probably the hardest thing whenever you look at some new set of technologies is just getting your bearings. Amazon’s stack is complex and very much of a moving target, so I’ll start by providing a bit of a taxonomy of the beasties we’ll need to at least familiarize ourselves with in order to get anything done.
First, there’s the family of Amazon’s Web Services (AWS). This has been around for a good few years, so some parts of it are quite mature and well documented and understood where others are bleeding edge with an emphasis on bleeding. All of them can be played with quite easily by simply signing up (this involves providing a credit card, so head’s up). I can’t think of any more interesting family of technologies, so if you think there’s any possibility these technologies might help you scale, then I encourage you to sign up and give it a spin. Very interesting stuff.
Anyway, within AWS we’ll play with Elastic Compute Cloud (EC2), Elastic MapReduce (EMR), Simple Storage Service (S3) and Elastic Block Storage (EBS).
EC2 provides you with direct access to Amazon’s racks of humming servers in the cloud. Thanks to the miracles of modern virtualization technologies, an entire machine can be bundled up as an image file – an Amazon Machine Image (AMI – think of an OS’s DVD but with software that you specify, configured as you please). Once bundled into an AMI, your computer can be ‘installed’ onto any number of machines within the cloud. Nifty.
Such machines can be bought for long term use (at quite affordable rates), or they can be used on an as-needed basis for an hourly charge. Hence the ‘Elastic’ in EC2. They can also be managed programmatically, and until recently, this was pretty much the only way to develop jobs in the cloud.
Each machine that you use in EC2 has some local storage associated with it, but this is not persistent across sessions. Once your instance goes down, the data is all gone. To get around this and to provide an all-around storage solution, AWS provides S3 which is, to me, an overly-simplified distributed file system which is something of a pain to use. It is also limited to file sizes of 5G which I find to be a significant annoyance. The good news is that S3 seems to be both highly durable and reasonably performant.
A more recent entry in the AWS world of storage is the EBS facility. Here, you can create arbitrary sized “disks” in the cloud which can be attached directly to your computer nodes just as though they were local disks. Very handy. They don’t have the same level of durability as S3 as they’re not distributed, but I haven’t had any difficulties with them and they’re very useful for putting together AMIs within the cloud.
Finally, EMR brings the Hadoop implementation of Google’s marvelous distributed map-reduce model into the AWS cloud. If you haven’t read about map-reduce, well perhaps you should – it’s really powerful. Hadoop is an open-source implementation of it done in Java and seems quite nice though it’s under very active development, so the APIs are still very much a work in progress. Be prepared to change your code. Even though it’s written in Java, any language can be used to implement your own applications, and it seems as though both ruby and python are used pretty extensively along with java and c/c++.
what map-reduce buys you
The first thing map-reduce buys you is something of a headache as fitting your application design into the map-reduce model isn’t so obvious. That said, it may well be worth a couple aspirin as it brings a great deal to the game.
Consider an application we might find interesting which we’d like to place in the cloud – strategy back-testing. It seems to be a very natural fit for parallel processing as each strategy is independent of the others, so theoretically we should be able to run each one on it’s own virtual server without interdependencies. While this is true, it’s rather easier than it sounds. In order to implement our distributed backtesting platform, we first need to implement one that works on one node. Fair enough. We must then introduce a “chunker” which will break the overall set of simulations into bite-sized chunks. We must then define an intermediate format for sending the strategies (or instructions) to each node. And we must define an intermediate format for the results. And then we need to implement an “assembler” which coalesces all of the results back into one result set which can be reported on or displayed etc. And if one of the nodes fails, we have to notice it. And if we notice that a node has failed, we need to figure out which job(s) it was responsible for and reallocate them to another node which presumably hasn’t failed…
And on and on. The point is that this isn’t something that you’re going to get done in an afternoon. Indeed, it’s unlikely that you could write a correct and complete specification of the system in an afternoon.
What map-reduce buys you is all of the distribution bits that I mention above. The costs are that you will have to figure out how to model things such that they’re compliant with the map-reduce model and that the end result might not be as efficient as a hand-coded solution. But when you can scale easily and cheaply, then raw efficiency may not be such an over-arching concern.
scaling in the cloud with elastic map-reduce
In order to get an idea how difficult it is to actually scale a solution, I implemented a simple program using real data and collecting some timings across a variety of AWS/EMR configurations. To do this, I first had to code against Hadoop v18.3 which is what is supported in AWS. (Initially I wrote it to v20.2 and was surprised at how much I had to change to back-port it to 18.3.)
The details of my test program aren’t important – basically counting various features from a set of heterogeneous files. The files are approximately the same size ~1.8G and there were 10 of them, so the overall dataset is a bit over 18G. I only do one map-reduce; I’m not chaining them together. On a local machine – a pre-nehalem, dual-chip w/ quad-cores (8 total) xeon server with hardware raid – the process took 1068s or a touch under 18 minutes.
I was impressed to see that with absolutely no code changes, I was able to run the system on EMR without difficulties. In fact, selecting the number of nodes to run against and the size of those nodes is just a matter of changing launch parameters. Very nice. AWS offers a variety of different specs on their servers. I tested the classic ‘small’ server which provides a 32-bit O/S, one core and 1.7G of memory as well as the ‘large’ server which comes with a 64bit O/S and 2X2 cores and 7.5Gs of memory. There are other configurations available, but I didn’t test them. In all cases I was running centos5.4.
The results are tabulated and graphed below.

performance locally and in various cloud configurations

time to complete vs. # nodes for large & small nodes
Scaling isn’t exactly linear, but it is good and it is simple and it is cheap & on-demand. Clearly for any given use-case, it will make sense to do tests like this to figure out what makes the most sense. Another dimension to consider is cost. Even partial hours count as a full hour, so putting big or many boxes against jobs that don’t take long to run isn’t cost effective. That said, even though I fecklessly ran jobs which took as little as 10 minutes while paying for 60, the entire costs associated with developing and running the example I present here was less than $4US.
One thing I noticed is that the time to initialize instances seems to grow as you increase the number of nodes, though I didn’t quantify this. In some cases, it was really quite long before real work started getting done – several minutes at least. Getting this to work for interactive processes will take some thought and, probably, some dedicated nodes. Another things to note is that although my local server did a fine job for itself, if we were to increase the dataset by a factor of, say 100, we’d reach a point where it simply couldn’t complete the operation whereas the cloud isn’t so strictly bounded. It’s also pretty interesting that my local machine with 8-cores and 24G of RAM and pretty decent RAID subsystem got soundly spanked by eight of Amazon’s cheapie 1-core boxes.
Given that it is cheap, massively and simply scalable and not terribly difficult, one could say that if you don’t spend some time with your head in the clouds, then it’s possible you’ve got it stuck in the sand.
Another excellent post! Keep up the good work.
Thanks, Craig – I’m glad you found it interesting. I’m kicking myself for not having gotten further into this earlier!
Hi Tito,
Thanks for an excellent post. Did you persist your data in EBS or S3? How different is it from reading the data locally from your disk?
I would think you would need a bit a configuration changes to make your jobs read data locally versus from the cloud. Does HDFS provide a degree of abstraction here that simplifies this?
Hi Suchindra,
I used both, but EBS for development time stuff and S3 for the actual EMR runs. EBS is treated exactly like a disk and in order to use it during an EMR run, you would need to have each hadoop node run an initialization/mount script before beginning the real work. S3, instead, can be referenced across the network with an url. Thus, I really (seriously!) didn’t need to make any config changes. Running locally, I would say something like:
doSomething /var/foo/input /var/foo/output
while running in the cloud I’d instead reference S3 ‘buckets’ in the same way:
doSomething s3://foo/input s3://foo/output
As always, the fastest way is to move deliberately (“festina lente”). Take the hadoop Wordcount example and run it locally. Then run it as one of EMR’s example apps. Then run it in EMR as your own app. Then you’re ready to start writing & running your own stuff…
Interesting stuff, a very informative post!
In my day job we use the Rackspace cloud, this is mostly for web hosting and DB apps. I’ll have to check out the Amazon tech though as it sounds rather fascinating!
@Andy
Thanks, glad you enjoyed it. Compared to a straight-forward hosting environment, aws adds whole new layers of services. I like to imagine how an algo-trading centric aws might look…
Great post thanks
Hi,
Great blog, not been updated for a while though? Hope all is well.
Mark