tick data & hdf5 (part 2)

One Big Table (and chair)
Last time I described the trajectory of my research into using hdf5 for large amounts of tick data. This time I describe the basic design of the prototype I implemented and some of its performance characteristics.
Prototype Design
With One Big Table (OBT) holding all of your data, you need some help finding what you need in that data. To this end, I wrote an index on the big table which basically stored, per instrument, beginning and ending indices into the OBT as well as the timestamp at each extreme.
So, the main table was: { conid, timestamp, open, high, low, close, adjustedClose, volume } and the index table was: { conid, minIndex, maxIndex, minTimestamp, maxTimestamp }. The index is read into memory and kept with a few other handy bits of data in a structure which represents a “connection” into the OBT.
To identify contracts, I use a long int contract identifier which is already employed within our environments. For timestamps I used the java convention of using a 64-bit long denoting milliseconds since the “epoch” and, initially, used mktime to support this approach in C. After my first iteration, I found that an incredible proportion of my time spent writing a HDF5 table from a CSV file was spent in mktime whereupon my Dad suggested the use of gmtime. Remarkably, this yielded a fully 40% improvement to the process! It’s nice having a guru in the family.
In order to retrieve data for one contract, I implemented an iterator which would find the appropriate section of the OBT for the particular contract. Another handy piece of data the connection struct maintained is the entire timestamp column for the main table. Clearly, this isn’t scalable for really big datasets and for the real implementation I’ll have to read this in on an instrument basis at the time of a query. But for this data, this seemed the sensible implementation. Two binary searches are performed on the appropriate, indexed subset of this big array to determine exactly which records will need to be read to satisfy the iterator’s query. Thus, the iterator is primed with the exact location of the data it will require on initialization. Then, buffered reads are performed by the iterator as data is requested from it.
My only critical use-case is to retrieve a time-ordered stream of OHLCVs (in this case) across potentially many contracts. This one query meets my needs both for back-testing purposes as well as statistical calculations. But it requires an efficient merge operation across a potentially large set of these iterators. To accomplish this, I’ve got a composite iterator which uses a red-black tree to keep all of the contained iterators sorted in the order of their Next() OHLCV. Thus, the composite iterator will always return the oldest OHLCV amongst all of its contained iterators.
That’s pretty much it. Apart from the cached timestamp column, this should all scale well to datasets the size of a day’s worth of tick data for the US equity markets with a few foreign markets and maybe some futures thrown in as well. Options data is a worse case than the one I’m envisioning, but I imagine a similar approach would still work, though perhaps you may have to distribute a day’s data across multiple files.
Benchmarking the Prototype
The data I tested this implementation on is daily data going back to 1990 on some 7100 us, lse and hk equities. In uncompressed, CSV format, the data weighed-in at 850M and was made-up of ~16.5M records. The tests I ran were:
- Read the CSV file and write an HDF5 file varying the compression { TRUE | FALSE } and hdf5 chunk sizes { 2^12, 2^13, 2^15, 2^17, 2^19 }. (For the chunksizes, I based it roughly on the excellent PyTables documentation and then expressed my preference for base-2 scales…)
- Respond to 100 queries across {1, 2^4, 2^8, 2^10, 2^12) randomized contracts over a randomized 2-year period (within the 19 year range).
The first part is really about hdf5 and how its different options will affect my results. The second is my actual use-case: “give me an ordered stream for some set of contracts over some (two-year) period”.
The results were interesting as these parameters matter. Especially compression. Below you can see the results of the tests.

HDF5 Write Performance
This “write test” is just a C program which reads in an 848M CSV file and writes and indexes an HDF5 file using the OBT approach as described above. Apart the curious bump in file size for the Compressed+8192 variant, the results aren’t too remarkable except to note that compression wants smaller chunksizes. Badly.

HDF5 Read Performance
Likewise, the read test is the same C program which then reads from each of the files written with the varying HDF5 write parameters. Here we really see the effect of compression on performance. It seems that if you want performance, then compression isn’t for you. If you need compression, then you need to use small chunksizes.
Apart this, it seems that performance from 1-4096 randomly selected contracts degrades reasonably linearly. The red-black tree is doing an effective job of merging even a reasonably large number of streams.
From an absolute perspective, the performance strikes me as pretty smoking in the good cases. In a two year period, you’ll have about 500 trading days. So, for the {No-compression/32768 chunk size/256 contracts} case we have (500 * 100 * 256) / 10 = ~1.28M records per second including all of the look-ups.
Java + SWIG + HDF5
One of my needs is to be able to access this functionality in Java so that the StratBox GUI can use this data. To that end, I made sure that as I was developing the prototype I maintained SWIG interfaces and a parallel test driver in Java. Apart the initial set-up, this proved pretty easy, though adding SWIG to a project does add some complexity. In any case, I wanted to see how bad a performance hit I’d get running the same tests in Java. Again, looking at the case highlighted above { No-compression/32768 chunk size/256 contracts}, the Java/SWIG timings are about twice those of HDF5′s native C. So, we take a pretty significant hit, but it seems unavoidable and ~600K records a second isn’t exactly slow.

HDF5 Read performance from Java / SWIG
Conclusion
The prototype I’ve implemented has the nice characteristic that its design is very similar to what I expect should work well with much larger quantities of tick data (as opposed to the ohlcv data I’ve used here). The only significant difference is that seeking within an indexed range would be slower as I’d first have to read in the range instead of keeping it handy in memory. Apart this, I’d need a layer on top of what I have here to manage HDF5 files as well. Of course, until I actually implement this for tick data, it’s not certain that it will be adequately performant on that much harder case, but I’m reasonably confident that something like what I’ve done here could be made to work well.
There’s a significant learning curve to using HDF5 and one clearly has to spend some time benchmarking for the specific requirements to make sure that the settings used are appropriate. If I were a little less obsessed with speed or a little more discerning about how I spend my holidays, I’d likely find using PyTables or something similar to be a much better solution than trying to roll my own in this fashion.
I just joined the HDF5 forum and newsgroup today and was looking forward to reading your thread. It’s likely user error on my part, but I fail to find an archive for either the newsgroup or the forum.
Apparently HDF5 has Groups and Datasets. Did you consider using a Group to represent a day, and Datasets to represent the tickdata for the instrumetns on that day? I wonder if the Group entity is useless or a timehog in storing tickdata.
To my knowledge, there’s no online archive for the hdf5 forum. In order to access past traffic, send an email to ‘hdf-forum-help@hdfgroup.org’ for instructions.
As for using groups & datasets in various ways, this was really the exploration I described in part 1 as OTPI (one table per instrument) vs OBT (one big table). In all cases, I planned to store one day per file. I went with OBT as I felt that it would allow me to avoid the overhead inherent in hdf5′s internal data structures.
I am trying to do something similar. I am using PyTables. However, I don’t think I am reading the data from the month DVDs correctly. It should be straight forward, but I am getting junk from the IDX files. Did you use python to read the data from DVD? Care to share your code? I would be much obliged. BTW, this is for work on my dissertation.
I didn’t use PyTables; everything I relate here was done in C and then exposed to java via swig. A similar approach might be taken with python, but I’m pretty certain that you’ll be better off using the excellent pytables package than trying my labor-intensive approach. This should be especially true if your interest is academic rather than for low-latency algo trading!
As for sharing the code, my hope is to solidify and build-upon what I’ve done for eventual integration with our StratBox/StratCloud environments, so it’s not something I’m willing to share at this point. Sorry. The good news is that it really is just a prototype, so likely wouldn’t be of much use to you in its current form!
Not sure what dvd issue you’re referring to, but there were no DVDs involved in my effort…
What’s the subject of your dissertation?
Thanks. I figured it out. I’ll let you know how it goes with PyTables. My dissertation focuses on taking the Huberman & Stanzl type optimal liquidity trading models to data. I use an econometric technique from the empirical IO literature to identify the parameters of the model from transaction data. Then with the estimated parameters I can do forward looking policy simulations. Best regards.
Nice. There’s some serious cocktail-party-fodder right there ;^>
Sounds like a difficult project – good luck!