billions and billions

While Carl Sagan’s famous formulation introduced a generation to the vastness of the cosmos, more recent history suggests that his memorable term might now be more aptly applied to financial quantities: our deficits and debts, perhaps, for the economically or politically minded. But for those of us with the markets on our minds, the term has to evoke the enormity of the data we create and must manage every day. We’ve recently been working with the NYSE’s TAQ data in an effort to integrate it into StratBox’s back-testing and optimization capabilities, and the sheer volume of it is staggering.
Each day, the NYSE publishes all of the day’s quotes and trades as well as some reference data. Compressed, the data will just about fit onto a DVD. For one day. A DVD. Compressed. It’s really mind-boggling. A year of the stuff, uncompressed, will require over a petabyte of storage. Over 1,125,899,906,842,624 bytes. And that’s just the US Equities markets. You want options data, too? I hope your uncle is named EMC, because just managing the data is going to be a challenge…
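For anyone who wants to play with the numbers, here is a rough back-of-envelope sketch. The DVD capacity (a 4.7 GB single-layer disc) and the 252 trading days per year are assumptions for illustration, not figures published by the NYSE, and the function names are made up for this example.

```python
# Back-of-envelope storage estimate.
# All constants are assumptions for illustration, not NYSE-published figures.
DVD_BYTES = 4.7e9      # assumed: single-layer DVD capacity, in bytes
TRADING_DAYS = 252     # assumed: typical number of US trading days per year
PETABYTE = 2 ** 50     # 1,125,899,906,842,624 bytes


def compressed_bytes_per_year(daily_bytes=DVD_BYTES, days=TRADING_DAYS):
    """Compressed volume for a year if every day fills `daily_bytes`."""
    return daily_bytes * days


def implied_expansion(target_bytes=PETABYTE, daily_bytes=DVD_BYTES,
                      days=TRADING_DAYS):
    """Expansion ratio needed for a year of data to reach `target_bytes`."""
    return target_bytes / compressed_bytes_per_year(daily_bytes, days)


if __name__ == "__main__":
    yearly = compressed_bytes_per_year()
    print(f"compressed, per year: {yearly / 1e12:.2f} TB")
    print(f"expansion needed to reach 2**50 bytes: {implied_expansion():.0f}x")
```

Under these assumptions the compressed feed alone comes to a bit over a terabyte a year, so the uncompressed total depends heavily on how much the data expands once unpacked. Swap in your own daily size and day count to see how the estimate moves.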
“Information about money has become almost as important as money itself.” — Walter Wriston, former Chairman of Citicorp
The volume and profile of market data far exceed the capacity of traditional RDBMSes. While RDBMSes continue to expand their usable capacity - we have used partitioned tables with nearly a billion rows of market data which have performed astonishingly well - they simply can’t deal with the kinds of quote volumes modern markets are generating daily. This has spawned a host of specialized time-series database products, from the granddaddy, SunGard’s FAME, which I used back in the ’90s to write programs to calculate bond indices at JPM, to more recent offerings like Vhayu and kdb+. These time-series-oriented products undoubtedly have many distinguishing characteristics and features, but they share one immutable characteristic: they are unbelievably expensive - in some cases a single developer seat costs in the high six figures for an annual license.
Thus, while no doubt missing out on some of their high-end features and niceties, we’ve decided to seek solutions from some of the original purveyors of petabyte-scale data: NASA and the NCSA, through their HDF5 system. HDF5 was designed to support vast scientific data stores and boasts sophisticated capabilities in support of parallel computing environments, so it should be possible to get performance comparable to some of the high-end specialized finance products without the sticker shock. Indeed, it’s potentially another example of free software providing a meaningful contribution to finance.
In researching cost-efficient and highly parallel hardware solutions to pair with our emergent data solution, I’ve come to realize that open-source is expanding its reach into the hardware sphere.

This guy shares his experience and “recipe” for building a powerful and unique rendering cluster inside an IKEA filing cabinet. It’s admittedly on the funky side even for a SOHO operation, but it’s no joke - it’s more powerful than a lot of production blade servers used on Wall St., and it cost him less than $4K. He also includes a (very loosely described) spec for a more powerful next-generation version with some 50 teraflops of capacity! So, while the data we have to deal with is growing at an incredible rate, the tools we have to manage it are growing proportionately, at least for those who know how to leverage the work so many smart people are producing and freely sharing.
As my dad told me years ago:
“Good programmers write good programs. Great programmers steal good programs.”
At this point, we’re still in the “discovery” stage of our development of TAQ+HDF5 for StratBox, but as we progress I’ll periodically post some of our experiences.
—
UPDATE
Speaking of my dad, he saw this posting and pointed me to some class notes he’s been working on which describe the Google approach to massive information processing and fault-tolerance. Interesting and full of great links to both academic and industrial papers/sites on the topic.
Tags: back-testing, market data, open-source software, post-trade analysis, technology
Great post, tito! I like the site about that Helmer.
Thanks, Flaviu. Yeah, I’m very impressed by the Helmer and would love to have one (built for me). Unfortunately, I view debugging hardware as a very expensive and generally fruitless form of torture, so I don’t see a Helmer in my future. I have been pricing somewhat more conventional systems, and it seems that one can get an 8-core server with 32 GB of RAM and a couple of terabytes of RAID storage at about the same price point as the Helmer. Maybe one of these days I’ll give building such a machine a whirl…
Certainly I wish the culture of open-source hardware would open up a bit more, with a purchasable listing of h/w requirements and software-guy-proof instructions for building and troubleshooting a system. You would think that vendors like Newegg would try to foster such communities, but I haven’t seen it yet…
tito,
thanks for sharing this info!
When I see the level that guys like you are working at, I really understand the place of us ordinary folks involved in trading: I am testing my strategies on a laptop with AmiBroker.
Thanks for your kind words, Luca. I’ve just updated the post with some related info my dad had shared with me. The amount of info that is available to us now is staggering, and the problem has become not a paucity of information but our own limited ability to find it, make sense of it and, ultimately, act on it.