managing tick data with hdf5

One of the nicest things about the holiday season (Happy New Year, btw) is that it provides a lovely opportunity to spend some quality time with a project that’s a bit more exploratory than might be meaningfully undertaken while trading in lively markets.
A number of months ago, I mentioned using HDF5 to manage tick data as RDBMSes just aren’t up to the task and specialized Tick DBs are absurdly expensive. While I’d spent some time exploring this idea through the fall, I never had a discrete chunk of time to really explore the technology beyond determing that its Java interfaces weren’t production-worthy. This meant that we’d have to drop into C to access the functionality we’re interested in and that we’d have to come up with our own bridge out into Java for access by StratBox while StratCloud could access it directly.
Below, I describe what I’ve learned through my holiday geek-spelunking-trek including some timings on various configurable characteristics of HDF5 (e.g., compression and “chunking”).
After spending some time looking at the java interfaces to HDF5, I determined it wasn’t up to snuff. Why? Primarily because no-one seems to use it, it lags the main api from a versioning standpoint and it appears to be more-or-less impossible to build from source. Looking a bit more carefully, it seems to have been written by one (undoubtedly talented and well-meaning) individual who isn’t familiar with java. (The most egregious example was to use a javax.swing.tree.TreeNode as the base class for a key model object…)
I then spent some time looking at the native api and the underlying object model it exposes. The model is both powerful and pretty low-level. They’ve implemented many of the goodies of a file system including groups (“directories”), datasets (“files” or in RDBMS-land, “tables”) and a variety of nice linking mechanisms as well as attributes which might index or otherwise annotate data. There’s also powerful, extensible I/O options which I didn’t much study beyond compression and “chunking.”
The library is provided with two “first-class” APIs – in C and Fortran – a secondary API in C++ and then the Java interface I mentioned. Others have written interfaces for other languages, most notably the much-lauded PyTables implementation for Python which is used by many in conjunction with the popular NumPy package.
Given this spread of implementation languages I chose C and determined that I’d steal a page from the talented crew behind QuantLib and use SWIG to expose relevant functionality into Java. This has proven to be a splendid choice for my needs.
Having gotten this far, I started examining how I’d represent market data with hdf5 and came up with two broadly opposed approaches. In order to gain some insights from those more experienced, I sent the below problem statement / inquiry to the main HDF5 mailing list:
A description of the data and its use
The data is all timestamped financial streams of “tick” data. Each record is small (a few hundred bytes at the most), but there are many – in a day you may see many hundred million to a few billion. Each record is naturally partitioned by instrument (eg, “microsoft”, “ibm”, “dec crude”, etc). There are less than 30K instruments in the universe I might care about.
I (more or less) don’t care how long it takes to construct the h5 files/structures as it will be performed offline and the only critical query I care about is something like:
“Get ticks for instruments {i1…in} from time t1 to time t2 ordered by time, instr”.That is, I need to be able to “replay” a subset of the instruments within the data store over some period of time. But I really care that this be as fast as possible.
Questions
0. Am I barking up the wrong tree? Is HDF5 an appropriate technology for the use I’ve described?
1. Given the size/volume of the data, my thought is to partition h5 files by day. Uncompressed, the files will be on the order of ~25G. Does this sound reasonable? What are the key factors impacting this decision from an hdf5 perspective?
Two alternative models come immediately to mind: one big table (OBT) per day ordered by instrument and then time, or one table per instrument (OTPI) ordered by time. My current inclination is OTPI as it seems more manageable assuming the overhead of so many tables isn’t an issue.
2a. Are there other, better models you suggest I investigate?
2b. With the OBT, I’d need to be able “index into” the table to identify the beginning of each instrument’s section (at least). How would you recommend doing this? It seems possible to do this with references or perhaps a separate table with numerical indices into the main table. Any pros/cons/alternatives to these approaches?
2c. With the OTPI, I’d need to have many tables (at most ~30K) per file. Is this an issue?
2d. For both models, I’d need to be able to merge sorted sets of h5 data into one sorted set as quickly as possible. Is there any hdf5 support for doing such a thing or external libraries created for this purpose?
3. What impact on retrieval/querying should I expect to see with varying levels of compression?
4. Any suggestions on chunksizes for this application?
I was fortunate to receive some excellent responses to my query, including from Francesc Alted, the gracious author of the PyTables library, and from a gentleman who’d implemented similar functionality for his own trading environment. Interestingly, both approaches – OBT and OTPI – were championed. It seems that OTPI is probably to be preferred if the number of instruments/tables to be stored isn’t excessive (perhaps below 10K though I can’t quantify this) and the frequency of update is significant. OTPI is easier to implement as it means you can rely more upon the infrastructure provided by HDF5. OBT instead seems more scalable as you incur less overhead (and goodies) with the one table, though you pay for this by having to implement your own indexing logic.
Given the divergent advice and my own lack of hands-on familiarity with the C library, I decided to try both approaches on a prototype. Instead of looking at vast amounts of tick data, I’d try both approaches on a smaller store (~1G with ~7K instruments) of OHLCV data.
By far, the easier to implement is the OTPI approach. However, even with this relatively small amount of data, the difference in write performance and file size was substantial. Clearly, expanding this to the scale of tick data wasn’t going to yield sufficiently performant results. I focused on the OBT approach.
—
Given the length of this post and keeping in mind that the holiday season isn’t over just yet (about ten hours remaining as I write this!), I’m going to stop writing now and continue with the remainder of my implementation and findings in a follow-up post later this week..
EMS Internals, market data, open-source software, post-trade analysis, technology
sounds like you are travelling the well-trodden path of investigating tick databases, which usually end up with the conclusion that it is cheaper to buy one than build one. You mentioned in a previous post that dev-licenses can cost a 6 figure number – Kx Systems has a non-commercial license available for download free of charge. It has some limitations (e.g. 2 hour timeout) but allows you to get your feet wet with it.
A buy v build decision is always going to be specific to the organizational and technical contexts for which it’s being made and the use-cases required, not to mention cost considerations. This might be especially true for a start-up. It’s certainly impossible to economically reproduce the rich functionality of kdb+/Q and the attempt would be foolish. But if what’s needed is a sufficiently small subset of their functionality integrated with an existing platform, then it just might be reasonable… Or not. In any case, a non-commercial license is swell for educational purposes, but does nothing to solve the problems I’m looking to address.
A common application for a tick database is being able to create an array of time-synchronous prices across securities (taking into account daylight savings, etc). Someone needs to create a template db in HDF5 for time series. This would fulfill a huge need in the financial market. I have to agree 3rd party vendors are very expensive; the trick they employ is massive indexing.
David, I agree that it would be nice to see an open source version of a tick db, but I’m not too optimistic. It’s a reasonable effort, has a very limited potential audience, and just possibly people have had their fill with handouts for wall st types!
The time-synchronous use case you mention seems to me a special case of my one use-case…
Have you tested MySql ?
I have One table for ticks partitioned montly…
I store 100.000.000 of ticks each month…
I collect 30 instruments…
I’ve given-up irrevocably on trying to fit this kind of data into a relational db. Round hole, square peg as they say.
100M records is a small fraction of the *daily* US equity activity (top of book only) data delivered by TAQ which has thousands of instruments. If you’re interested in ~30 instruments, then an rdbms can be made to work, but not if you want a broader view.
That said, I knew a very smart guy who had a scheme for encoding one minute’s worth of tick data into a blob which was stored in mysql. Again, up to some size, these can be made to work, but not for larger applications…
Just for fun, how fast can you get a merged stream of ticks from your db? With hdf5 in c it was over 1M/sec (including the lookup and merge). I’m guessing that mysql is going to be at least an order of magnitude off that mark…