<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hack the market &#187; market data</title>
	<atom:link href="http://www.puppetmastertrading.com/blog/index.php/category/market-data/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.puppetmastertrading.com/blog</link>
	<description>Algorithmic trading experiences</description>
	<lastBuildDate>Wed, 21 Apr 2010 23:11:41 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>peaky</title>
		<link>http://www.puppetmastertrading.com/blog/2009/12/08/peaky/</link>
		<comments>http://www.puppetmastertrading.com/blog/2009/12/08/peaky/#comments</comments>
		<pubDate>Tue, 08 Dec 2009 14:41:51 +0000</pubDate>
		<dc:creator>tito</dc:creator>
				<category><![CDATA[dereferenced]]></category>
		<category><![CDATA[market data]]></category>
		<category><![CDATA[our managed markets]]></category>

		<guid isPermaLink="false">http://www.puppetmastertrading.com/blog/?p=894</guid>
		<description><![CDATA[I came across this compelling site which uses a hardware-based ticker plant (Exegy) in a colo environment to measure peak bandwidth across scads of NA feeds and then, every minute, updates a chart like the above to capture the average messages/sec across all of them.  Pretty swank.
While the uninformed may rail against colocation rather than [...]]]></description>
			<content:encoded><![CDATA[<div class="wp-caption aligncenter" style="width: 453px"><a href="http://www.marketdatapeaks.com/"><img title="peaky" src="/images/peaky.png" alt="messages per second across all feeds" width="443" height="294" /></a><p class="wp-caption-text">messages per second across &quot;all&quot; feeds</p></div>
<p>I came across this compelling <a title="Market Data Peaks" href="http://www.marketdatapeaks.com/" target="_blank">site</a> which uses a hardware-based ticker plant (<a title="Exegy ticker plant" href="http://www.exegy.com/" target="_blank">Exegy</a>) in a colo environment to measure peak bandwidth across scads of NA feeds and then, every minute, updates a chart like the above to capture the average messages/sec across all of them.  Pretty swank.</p>
<p>While the uninformed may rail against colocation rather than focus on less intriguing issues like banana-variety corruption, they miss the basic point that colo can be done by anyone with the checkbook and the wish to do so.</p>
<div class="wp-caption alignright" style="width: 144px"><img class="   " src="/images/forrest-gump-shrimping.jpg" alt="unfair advantage?" width="134" height="134" /><p class="wp-caption-text">unfair advantage?</p></div>
<p>It&#8217;s sort of like that boat in Forrest Gump.  Forrest wanted to be a shrimper.  So he invested in a boat.  With his initial capital, hard work, perseverance and a bit of luck, Forrest made a go of it.  He might easily have not made it.  Colo is like that.  You can shrimp without a boat if you have a mask and fins, but it&#8217;s likely not a sustainable model&#8230; either way, it&#8217;s hard to see the harm in Gump&#8217;s boat.  Or colocation.</p>
<p>Hat-tip to <a title="Rodrick's Web Log !!" href="http://rodrickbrown.com/blog/" target="_blank"><em>Rodrick&#8217;s Web Log !!</em></a> for spotting the market data peaks site.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.puppetmastertrading.com/blog/2009/12/08/peaky/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>our solid-state future</title>
		<link>http://www.puppetmastertrading.com/blog/2009/09/04/our-solid-state-future/</link>
		<comments>http://www.puppetmastertrading.com/blog/2009/09/04/our-solid-state-future/#comments</comments>
		<pubDate>Fri, 04 Sep 2009 12:58:23 +0000</pubDate>
		<dc:creator>tito</dc:creator>
				<category><![CDATA[EMS Internals]]></category>
		<category><![CDATA[market data]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://www.puppetmastertrading.com/blog/?p=577</guid>
		<description><![CDATA[I&#8217;ve never been a hardware guy. Hardware has gotten so fast throughout my professional life that it has just never been a big issue. Also, on wall st we had a robust and annual budget for h/w so I&#8217;d routinely sign-off on hundreds of thousands of dollars on all sorts of machines I&#8217;d never lay [...]]]></description>
			<content:encoded><![CDATA[<div class="wp-caption alignleft" style="width: 260px"><img src="/images/solidStateFuture.jpg" alt="Mmmm... hardware.." width="250" height="249" /><p class="wp-caption-text">Mmmm... hardware..</p></div>
<p>I&#8217;ve never been a hardware guy. Hardware has gotten so fast throughout my professional life that it has just never been a big issue. Also, on Wall St we had a robust annual budget for h/w, so I&#8217;d routinely sign off on hundreds of thousands of dollars for all sorts of machines I&#8217;d never lay eyes on, and somehow they always did the trick.</p>
<p>Before 9/11, they&#8217;d be in server racks in the building or down the street, but since then they might also be in increasingly far-flung places like weehawken or long island, tampa, even texas or beyond. The machines always seemed unbelievably overpriced &#8211; I remember over the years pretty consistently paying something like $40K for a low-end db server.  But that&#8217;s what it cost and you could only purchase approved products from approved channels, so nobody spent much thought on it.  Now that I don&#8217;t have the same kinds of constraints &#8211; or budgets! &#8211; I increasingly have to think of hardware.</p>
<p>As a software engineer, the hardware itself is also insisting that I pay some uncharacteristic attention to it.  The evolution of processors has reached a point where the programming paradigms many of us have fruitfully employed over many years are no longer suited for getting full performance out of today&#8217;s machines.  The recent introduction of remarkably powerful and inexpensive parallel-computing platforms based on GPUs like <a title="CUDA" href="http://www.puppetmastertrading.com/blog/2008/11/29/nvidias-tesla-and-the-compute-unified-device-architecture/" target="_blank">nvidia&#8217;s cuda</a> also outlines a future that even current university training doesn&#8217;t address in a fashion practically adapted for institutional application.  Cores are multiplying like Tribbles.</p>
<p>The lines between persistent storage and main memory are also blurring as consumer SSDs push up from the &#8216;low&#8217;-end while exotic ioDrives and the like offer a glimpse of a world where the performance gap between the two approaches nil and after their long reign myriad metallic platters will spin no more.</p>
<p><span id="more-577"></span></p>
<p style="text-align: left;">
<div class="wp-caption aligncenter" style="width: 498px"><img class=" " src="/images/power7-die.jpg" alt="troubling like Tribbles" width="488" height="386" /><p class="wp-caption-text">troubling like Tribbles</p></div>
<p>There have been some steps taken towards taming the core dilemma.  Google&#8217;s introduction of the distributed map-reduce paradigm and all of the associated plumbing on top of computers in the &#8216;cloud&#8217; is probably the boldest and most effective reaction thus far, but it&#8217;s not always obvious that you want your stuff running in someone else&#8217;s cloud, and the approach has many other natural limitations.  This is also a solution &#8216;in the large&#8217; and sometimes you need performant solutions on a different, smaller scale.  Here, the development of functional languages and idioms may be of some help, but there certainly don&#8217;t seem to be clear winners yet.</p>
<p>Erlang, Ocaml, Haskell, Scala and others all seem to have very limited impact thus far and all face big challenges before receiving widespread welcome.  Worse, any language can be mangled into expressing things poorly so the languages can only be a meaningful aid to programmers who are able to adjust their mindset for a new world&#8217;s computing paradigm.  This likely won&#8217;t be easy for many until there is an established set of usable programming idioms and toolsets for dealing with concurrency on a whole new scale.  To me, it seems that functional programming might well be an important part of that, but it&#8217;s difficult to imagine it as the complete game changer any time soon.  As it is, many of us have already been using functional languages (in my case, sql and R) on a regular or even daily basis for a long time, so it&#8217;s difficult to cast this as the revolution, bottled.</p>
<p>As cores proliferate and the bandwidth amongst them increases, new challenges and opportunities are unveiled.  Feeding all of those hungry cores can be a chore.  If you already had problems with i/o bound processes, then adding cores is sort of like adding liquidity to a debt crisis &#8211; not obviously helpful. We faced an issue like this recently while trying to improve the throughput of our backtesting subsystem in particularly poor-performing cases.</p>
<p>In these cases, we use daily data to initialize an intraday strategy.  For example, we might look at all equities over a trailing 3 or 5-year period, and perform various analytics – like calculating correlation matrices, beta against various benchmarks, volatility, etc – to determine which names might be candidates for the strategy’s portfolio.  We found that an inordinate amount of our time was spent just performing these morning analytics and that much of the cost of a day’s backtest went to this oft-repeated morning exercise.  While working through this issue, we noted that we’ve reached a point where a decent dual-Nehalem server (with 16 logical cores) could be built with 48G of RAM for something along the lines of $6K.  Seriously.  So we stuck all of our daily data into memory and the improvement has been essentially infinite.  Maybe not the best example, but hopefully illustrative of the fact that the h/w underneath us is changing qualitatively and that we need to be more active in involving it in our design decisions.</p>
<p>Of course, even a yawning expanse like 48G or even a terabyte (in not too many years, after all, I should be able to buy 1T of RAM for my boxes at a similar price point to what I pay for 48G today), eventually gets consumed and so we can’t hope to employ this solution for all our problems.  We continue to develop our historical TAQ infrastructure (most recently discussed <a title="tick data and hdf5" href="../2009/01/06/tick-data-hdf5-part-2/" target="_blank">here</a>) and this is certainly an example where buying memory isn’t going to get you very far.  But SSDs now are getting reasonable and so our current approach uses memory as much as possible but when it needs to get volumes of detailed historical data we’ve placed our indices onto SSDs while the actual stores themselves reside on much more affordable RAID arrays.  Reading indices is now much faster and adds minimal i/o overhead to a very i/o-bound problem.</p>
<p>By constantly looking at new languages and concurrency idioms, vigilantly assessing the current and projected costs of ‘solid-state’ solutions to our most vexing problems, and just staying active and creative, we hold out some hope that we can transition gracefully to what is looking increasingly like a solid-state future.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.puppetmastertrading.com/blog/2009/09/04/our-solid-state-future/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>tick data &amp; hdf5 (part 2)</title>
		<link>http://www.puppetmastertrading.com/blog/2009/01/06/tick-data-hdf5-part-2/</link>
		<comments>http://www.puppetmastertrading.com/blog/2009/01/06/tick-data-hdf5-part-2/#comments</comments>
		<pubDate>Tue, 06 Jan 2009 18:46:06 +0000</pubDate>
		<dc:creator>tito</dc:creator>
				<category><![CDATA[EMS Internals]]></category>
		<category><![CDATA[market data]]></category>
		<category><![CDATA[open-source software]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://www.puppetmastertrading.com/blog/?p=280</guid>
		<description><![CDATA[Last time I described the trajectory of my research into using hdf5 for large amounts of tick data.  This time I describe the basic design of the prototype I implemented and some of its performance characteristics.


Prototype Design
With One Big Table (OBT) holding all of your data, you need some help finding what you need in [...]]]></description>
			<content:encoded><![CDATA[<div class="wp-caption aligncenter" style="width: 510px"><img src="/images/obt.jpg" alt="" width="500" height="375" /><p class="wp-caption-text">One Big Table (and chair)</p></div>
<p style="text-align: left;"><a title="part 1" href="http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/" target="_blank">Last time</a> I described the trajectory of my research into using hdf5 for large amounts of tick data.  This time I describe the basic design of the prototype I implemented and some of its performance characteristics.</p>
<p style="text-align: left;">
<p style="text-align: left;"><span id="more-280"></span></p>
<p style="text-align: left;"><strong>Prototype Design</strong></p>
<p style="text-align: left;">With One Big Table (OBT) holding all of your data, you need some help finding what you need in that data.  To this end, I wrote an index on the big table which basically stored, per instrument, beginning and ending indices into the OBT as well as the timestamp at each extreme.</p>
<p style="text-align: left;">So, the main table was: { conid, timestamp, open, high, low, close, adjustedClose, volume } and the index table was: { conid, minIndex, maxIndex, minTimestamp, maxTimestamp }.  The index is read into memory and kept with a few other handy bits of data in a structure which represents a &#8220;connection&#8221; into the OBT.</p>
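<p style="text-align: left;">As a rough illustration of the two tables described above, here is a minimal Python sketch (the prototype itself is in C and is not shown in this post).  The names <em>Bar</em>, <em>IndexEntry</em> and <em>build_index</em> are hypothetical, and the sketch assumes the OBT is already sorted by contract and then by time.</p>

```python
from collections import namedtuple

# Hypothetical row layouts mirroring the column lists given in the post.
Bar = namedtuple("Bar", "conid timestamp open high low close adjusted_close volume")
IndexEntry = namedtuple("IndexEntry", "conid min_index max_index min_timestamp max_timestamp")


def build_index(obt):
    """Scan an OBT sorted by (conid, timestamp) and return {conid: IndexEntry}."""
    index = {}
    for row_number, bar in enumerate(obt):
        entry = index.get(bar.conid)
        if entry is None:
            # First row seen for this contract: both extremes start here.
            index[bar.conid] = IndexEntry(bar.conid, row_number, row_number,
                                          bar.timestamp, bar.timestamp)
        else:
            # Rows are time-ordered within a contract, so only the max side moves.
            index[bar.conid] = entry._replace(max_index=row_number,
                                              max_timestamp=bar.timestamp)
    return index


# Tiny made-up table: two bars for contract 101, one for contract 202.
obt = [
    Bar(101, 1000, 10.0, 11.0, 9.5, 10.5, 10.5, 500),
    Bar(101, 2000, 10.5, 12.0, 10.0, 11.5, 11.5, 700),
    Bar(202, 1500, 50.0, 51.0, 49.0, 50.5, 50.5, 300),
]
index = build_index(obt)
```

<p style="text-align: left;">With this toy data, contract 101 maps to rows 0 through 1 and contract 202 to row 2 alone, which is all a reader needs to jump straight to one instrument&#8217;s section.</p>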
<p style="text-align: left;">To identify contracts, I use a <em>long int</em> contract identifier which is already employed within our environments.  For timestamps I used the java convention of using a <em>64-bit long</em> denoting milliseconds since the “epoch” and, initially, used <em>mktime</em> to support this approach in C.  After my first iteration, I found that an incredible proportion of my time spent writing an HDF5 table from a CSV file was spent in <em>mktime</em> whereupon my <a title="Professor Giorgio Ingargiola" href="http://www.cis.temple.edu/~ingargio/" target="_blank">Dad</a> suggested the use of <em>gmtime</em>.  Remarkably, this yielded a full 40% improvement to the process!  It&#8217;s nice having a guru in the family.</p>
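<p style="text-align: left;">For readers unfamiliar with the timestamp convention, the sketch below shows it in Python rather than C.  It is only an illustration, not the conversion code from the prototype: the function name <em>to_epoch_millis</em> and the timestamp format are hypothetical, and it assumes the source timestamps are in UTC so that no local-time lookup is needed.</p>

```python
import calendar
import time


def to_epoch_millis(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a UTC timestamp string into milliseconds since 1970-01-01 UTC."""
    parsed = time.strptime(text, fmt)      # broken-down time, no timezone applied
    return calendar.timegm(parsed) * 1000  # interpret the fields as UTC


millis = to_epoch_millis("2009-01-06 18:46:06")
```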
<p style="text-align: left;">In order to retrieve data for one contract, I implemented an iterator which would find the appropriate section of the OBT for the particular contract.  Another handy piece of data the connection struct maintained is the entire timestamp column for the main table.  Clearly, this isn&#8217;t scalable for really big datasets and for the real implementation I&#8217;ll have to read this in on an instrument basis at the time of a query.  But for this data, this seemed the sensible implementation.  Two binary searches are performed on the appropriate, indexed subset of this big array to determine exactly which records will need to be read to satisfy the iterator&#8217;s query.  Thus, the iterator is primed with the exact location of the data it will require on initialization.  Then, buffered reads are performed by the iterator as data is requested from it.</p>
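<p style="text-align: left;">The two binary searches might look something like the following Python sketch.  The function name <em>locate_rows</em> and its arguments are hypothetical; it assumes the cached timestamp column is sorted within each contract&#8217;s slice and that the index supplies inclusive <em>minIndex</em> and <em>maxIndex</em> values.</p>

```python
from bisect import bisect_left, bisect_right


def locate_rows(timestamps, min_index, max_index, start_ts, end_ts):
    """Return the range of OBT rows for one contract whose timestamps fall
    between start_ts and end_ts inclusive. The range is empty if none match."""
    lo, hi = min_index, max_index + 1                   # search only this contract's slice
    first = bisect_left(timestamps, start_ts, lo, hi)   # first row at or after start_ts
    stop = bisect_right(timestamps, end_ts, lo, hi)     # one past the last row at or before end_ts
    return range(first, stop)


# Cached timestamp column: contract A occupies rows 0..3, contract B rows 4..6.
timestamps = [1000, 2000, 3000, 4000, 1500, 2500, 3500]
rows_a = locate_rows(timestamps, 0, 3, 1500, 3000)
rows_b = locate_rows(timestamps, 4, 6, 1500, 2500)
```

<p style="text-align: left;">Because the result is known up front, an iterator built on top of it can issue buffered reads over exactly those rows and nothing else.</p>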
<p style="text-align: left;">My only critical use-case is to retrieve a time-ordered stream of OHLCVs (in this case) across potentially many contracts.  This one query meets my needs both for back-testing purposes as well as statistical calculations.  But it requires an efficient merge operation across a potentially large set of these iterators.  To accomplish this, I&#8217;ve got a composite iterator which uses a red-black tree to keep all of the contained iterators sorted in the order of their <em>Next()</em> OHLCV.  Thus, the composite iterator will always return the oldest OHLCV amongst all of its contained iterators.</p>
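<p style="text-align: left;">The merge can be sketched in a few lines of Python.  Note that this sketch swaps in a binary heap (Python&#8217;s <em>heapq</em>) for the red-black tree used in the prototype; both keep the iterator with the oldest pending record at the front.  The function name <em>merge_streams</em> and the tuple layout (timestamp, conid, close) are hypothetical.</p>

```python
import heapq


def merge_streams(streams):
    """Yield records from many time-ordered iterables, oldest first.

    Each record is a tuple starting with (timestamp, conid), so ties on
    timestamp are broken by contract id."""
    heap = []
    for stream_id, stream in enumerate(streams):
        iterator = iter(stream)
        head = next(iterator, None)
        if head is not None:
            # stream_id breaks exact ties so the heap never compares iterators.
            heap.append((head, stream_id, iterator))
    heapq.heapify(heap)
    while heap:
        head, stream_id, iterator = heapq.heappop(heap)
        yield head
        following = next(iterator, None)
        if following is not None:
            # Re-insert the stream keyed on its next pending record.
            heapq.heappush(heap, (following, stream_id, iterator))


stream_a = [(1000, 101, 10.5), (3000, 101, 11.5)]
stream_b = [(1500, 202, 50.5), (3000, 202, 51.0)]
merged = list(merge_streams([stream_a, stream_b]))
```

<p style="text-align: left;">Each record costs one pop and at most one push, so the work per record grows with the logarithm of the number of streams, which is the same bound a balanced tree gives.</p>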
<p style="text-align: left;">That&#8217;s pretty much it.  Apart from the cached timestamp column, this should all scale well to datasets the size of a day&#8217;s worth of tick data for the US equity markets with a few foreign markets and maybe some futures thrown in as well.  Options data is a worse case than the one I&#8217;m envisioning, but I imagine a similar approach would still work, though perhaps you may have to distribute a day&#8217;s data across multiple files.</p>
<p style="text-align: left;"><strong>Benchmarking the Prototype</strong></p>
<p style="text-align: left;">The data I tested this implementation on is daily data going back to 1990 on some 7100 US, LSE and HK equities.  In uncompressed CSV format, the data weighed in at 850M and was made up of ~16.5M records.  The tests I ran were:</p>
<ul>
<li>Read the CSV file and write an HDF5 file varying the compression { TRUE | FALSE } and hdf5 chunk sizes { 2^12, 2^13, 2^15, 2^17, 2^19 }.  (For the chunksizes, I based it roughly on the <a title="PyTables chunksize Guidelines" href="http://www.pytables.org/docs/manual/ch05.html#chunksizeFineTune" target="_blank">excellent PyTables documentation</a> and then expressed my preference for base-2 scales&#8230;)</li>
<li>Respond to 100 queries across {1, 2^4, 2^8, 2^10, 2^12} randomized contracts over a randomized 2-year period (within the 19-year range).</li>
</ul>
<p>The first part is really about hdf5 and how its different options will affect my results.  The second is my actual use-case: &#8220;give me an ordered stream for some set of contracts over some (two-year) period&#8221;.</p>
<p>The results were interesting as these parameters matter.  Especially compression.  Below you can see the results of the tests.</p>
<div class="wp-caption aligncenter" style="width: 406px"><img title="HDF5 Write Performance " src="/images/h5Write.jpg" alt="HDF5 Write Performance " width="396" height="244" /><p class="wp-caption-text">HDF5 Write Performance </p></div>
<p>This &#8220;write test&#8221; is just a C program which reads in an 848M CSV file and writes and indexes an HDF5 file using the OBT approach as described above.  Apart from the curious bump in file size for the Compressed+8192 variant, the results aren&#8217;t too remarkable except to note that compression wants smaller chunksizes.  Badly.</p>
<div class="wp-caption aligncenter" style="width: 548px"><img title="HDF5 Read Performance " src="/images/h5Read.jpg" alt="" width="538" height="244" /><p class="wp-caption-text">HDF5 Read Performance </p></div>
<p>Likewise, the read test is the same C program which then reads from each of the files written with the varying HDF5 write parameters.  Here we really see the effect of compression on performance.  It seems that if you want performance, then compression isn&#8217;t for you.  If you need compression, then you need to use small chunksizes.</p>
<p>Apart from this, it seems that performance from 1-4096 randomly selected contracts degrades <em>reasonably</em> linearly.  The red-black tree is doing an effective job of merging even a reasonably large number of streams.</p>
<p>From an absolute perspective, the performance strikes me as pretty smoking in the good cases.  In a two year period, you&#8217;ll have about 500 trading days.  So, for the {No-compression/32768 chunk size/256 contracts} case we have (500 * 100 * 256) / 10 =  ~1.28M records per second including all of the look-ups.</p>
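<p>Spelled out as a small Python calculation, the estimate looks like this.  The variable names are mine, and reading the divisor of 10 as the elapsed time in seconds for the 100 queries is an assumption based on the arithmetic above.</p>

```python
trading_days = 500      # roughly two years of daily bars per contract
queries = 100           # queries per test run
contracts = 256         # contracts per query in the highlighted case
elapsed_seconds = 10    # assumption: the divisor used in the arithmetic above

records_read = trading_days * queries * contracts
records_per_second = records_read / elapsed_seconds
print(f"{records_read:,} records, {records_per_second:,.0f} per second")
```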
<p><strong>Java + SWIG + HDF5</strong></p>
<p>One of my needs is to be able to access this functionality in Java so that the StratBox GUI can use this data.  To that end, I made sure that as I was developing the prototype I maintained SWIG interfaces and a parallel test driver in Java.  Apart from the initial set-up, this proved pretty easy, though adding SWIG to a project does add some complexity.  In any case, I wanted to see how bad a performance hit I&#8217;d get running the same tests in Java.  Again, looking at the case highlighted above { No-compression/32768 chunk size/256 contracts}, the Java/SWIG timings are about twice those of HDF5&#8217;s native C.  So, we take a pretty significant hit, but it seems unavoidable and ~600K records a second isn&#8217;t exactly slow.</p>
<div class="wp-caption aligncenter" style="width: 548px"><img title="HDF5 Read performance from Java / SWIG" src="/images/h5Java.jpg" alt="HDF5 Read performance from Java / SWIG" width="538" height="64" /><p class="wp-caption-text">HDF5 Read performance from Java / SWIG</p></div>
<p><strong>Conclusion</strong></p>
<p>The prototype I&#8217;ve implemented has the nice characteristic that its design is very similar to what I expect should work well with much larger quantities of tick data (as opposed to the ohlcv data I&#8217;ve used here).  The only significant difference is that seeking within an indexed range would be slower as I&#8217;d first have to read in the range instead of keeping it handy in memory.  Apart from this, I&#8217;d need a layer on top of what I have here to manage HDF5 files as well.  Of course, until I actually implement this for tick data, it&#8217;s not certain that it will be adequately performant on that much harder case, but I&#8217;m reasonably confident that something like what I&#8217;ve done here could be made to work well.</p>
<p>There&#8217;s a significant learning curve to using HDF5 and one clearly has to spend some time benchmarking for the specific requirements to make sure that the settings used are appropriate.  If I were a little less obsessed with speed or a little more discerning about how I spend my holidays, I&#8217;d likely find using PyTables or something similar to be a much better solution than trying to roll my own in this fashion.</p>
<p style="text-align: left;">
]]></content:encoded>
			<wfw:commentRss>http://www.puppetmastertrading.com/blog/2009/01/06/tick-data-hdf5-part-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>managing tick data with hdf5</title>
		<link>http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/</link>
		<comments>http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/#comments</comments>
		<pubDate>Sun, 04 Jan 2009 18:45:36 +0000</pubDate>
		<dc:creator>tito</dc:creator>
				<category><![CDATA[EMS Internals]]></category>
		<category><![CDATA[market data]]></category>
		<category><![CDATA[open-source software]]></category>
		<category><![CDATA[post-trade analysis]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://www.puppetmastertrading.com/blog/?p=262</guid>
		<description><![CDATA[
One of the nicest things about the holiday season (Happy New Year, btw) is that it provides a lovely opportunity to spend some quality time with a project that&#8217;s a bit more exploratory than might be meaningfully undertaken while trading in lively markets.
A number of months ago, I mentioned using HDF5 to manage tick data [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" style="margin-left: 5px; margin-right: 5px;" title="big data" src="/images/bigdata.jpg" alt="" width="328" height="248" /></p>
<p>One of the nicest things about the holiday season (Happy New Year, btw) is that it provides a lovely opportunity to spend some quality time with a project that&#8217;s a bit more exploratory than might be meaningfully undertaken while trading in lively markets.</p>
<p>A number of months ago, I <a title="billions and billions" href="http://www.puppetmastertrading.com/blog/2008/08/22/billions-and-billions/" target="_blank">mentioned</a> using <a title="HDF5" href="http://www.hdfgroup.org/HDF5/" target="_blank">HDF5</a> to manage tick data as RDBMSes just aren&#8217;t up to the task and specialized Tick DBs are absurdly expensive.  While I&#8217;d spent some time exploring this idea through the fall, I never had a discrete chunk of time to really explore the technology beyond determining that its Java interfaces weren&#8217;t production-worthy.  This meant that we&#8217;d have to drop into C to access the functionality we&#8217;re interested in and that we&#8217;d have to come up with our own bridge out into Java for access by StratBox while StratCloud could access it directly.</p>
<p>Below, I describe what I&#8217;ve learned through my holiday geek-spelunking-trek including some timings on various configurable characteristics of HDF5 (e.g., compression and &#8220;chunking&#8221;).</p>
<p><span id="more-262"></span></p>
<p>After spending some time looking at the Java interface to HDF5, I determined it wasn&#8217;t up to snuff.  Why?  Primarily because no-one seems to use it, it lags the main api from a versioning standpoint and it appears to be more-or-less impossible to build from source.  Looking a bit more carefully, it seems to have been written by one (undoubtedly talented and well-meaning) individual who isn&#8217;t familiar with Java.  (The most egregious example was to use a javax.swing.tree.TreeNode as the base class for a key model object&#8230;)</p>
<p>I then spent some time looking at the native api and the underlying object model it exposes.  The model is both powerful and pretty low-level.  They&#8217;ve implemented many of the goodies of a file system including groups (&#8220;directories&#8221;), datasets (&#8220;files&#8221; or in RDBMS-land, &#8220;tables&#8221;) and a variety of nice linking mechanisms as well as attributes which might index or otherwise annotate data.  There are also powerful, extensible I/O options which I didn&#8217;t much study beyond compression and &#8220;chunking.&#8221;</p>
<p>The library is provided with two &#8220;first-class&#8221; APIs &#8211; in C and Fortran &#8211; a secondary API in C++ and then the Java interface I mentioned.  Others have written interfaces for other languages, most notably the much-lauded <a title="PyTables" href="http://www.pytables.org" target="_blank">PyTables</a> implementation for Python which is used by many in conjunction with the popular <a title="Numpy" href="http://numpy.scipy.org/" target="_blank">NumPy</a> package.</p>
<p>Given this spread of implementation languages I chose C and determined that I&#8217;d steal a page from the talented crew behind <a title="QuantLib" href="http://quantlib.org" target="_blank">QuantLib</a> and use <a title="Simplified Wrapper and Interface Generator" href="http://www.swig.org/" target="_blank">SWIG</a> to expose relevant functionality into Java.  This has proven to be a splendid choice for my needs.</p>
<p>Having gotten this far, I started examining how I&#8217;d represent market data with hdf5 and came up with two broadly opposed approaches.  In order to gain some insights from those more experienced, I sent the below problem statement / inquiry to the main HDF5 mailing list:</p>
<blockquote><p><span style="font-weight: bold;">A description of the data and its use</span></p>
<p>The data is all timestamped financial streams of &#8220;tick&#8221; data.  Each record is small (a few hundred bytes at the most), but there are many &#8211; in a day you may see many hundred million to a few billion.  Each record is naturally partitioned by instrument (eg, &#8220;microsoft&#8221;, &#8220;ibm&#8221;, &#8220;dec crude&#8221;, etc).  There are less than 30K instruments in the universe I might care about.</p>
<p>I (more or less) don&#8217;t care how long it takes to construct the h5 files/structures as it will be performed offline and the only critical query I care about is something like:</p>
<div style="margin-left: 40px;"><span style="font-style: italic;">&#8220;Get ticks for instruments {i1&#8230;in} from time t1 to time t2 ordered by time, instr&#8221;. </span></div>
<p>That is, I need to be able to &#8220;replay&#8221; a subset of the instruments within the data store over some period of time.  But I really care that this be as fast as possible.</p>
<p><span style="font-weight: bold;">Questions </span></p>
<p>0.  Am I barking up the wrong tree?  Is HDF5 an appropriate technology for the use I&#8217;ve described?</p>
<p>1. Given the size/volume of the data, my thought is to partition h5 files by day.  Uncompressed, the files will be on the order of ~25G.  Does this sound reasonable?  What are the key factors impacting this decision from an hdf5 perspective?</p>
<p>Two alternative models come immediately to mind: one big table (OBT) per day ordered by instrument and then time, or one table per instrument (OTPI) ordered by time.  My current inclination is OTPI as it seems more manageable assuming the overhead of so many tables isn&#8217;t an issue.</p>
<p>2a.  Are there other, better models you suggest I investigate?</p>
<p>2b.  With the OBT, I&#8217;d need to be able &#8220;index into&#8221; the table to identify the beginning of each instrument&#8217;s section (at least).  How would you recommend doing this?  It seems possible to do this with references or perhaps a separate table with numerical indices into the main table.  Any pros/cons/alternatives to these approaches?</p>
<p>2c.  With the OTPI, I&#8217;d need to have many tables (at most ~30K) per file.  Is this an issue?</p>
<p>2d. For both models, I&#8217;d need to be able to merge sorted sets of h5 data into one sorted set as quickly as possible.  Is there any hdf5 support for doing such a thing or external libraries created for this purpose?</p>
<p>3. What impact on retrieval/querying should I expect to see with varying levels of  compression?</p>
<p>4. Any suggestions on chunksizes for this application?</p></blockquote>
<p>I was fortunate to receive some excellent responses to my query, including from Francesc Alted, the gracious author of the PyTables library, and from a gentleman who&#8217;d implemented similar functionality for his own trading environment.  Interestingly, both approaches &#8211; OBT and OTPI &#8211; were championed.  It seems that OTPI is probably to be preferred if the number of instruments/tables to be stored isn&#8217;t excessive (perhaps below 10K though I can&#8217;t quantify this) and the frequency of update is significant.  OTPI is easier to implement as it means you can rely more upon the infrastructure provided by HDF5.  OBT instead seems more scalable as you incur less overhead (and goodies) with the one table, though you pay for this by having to implement your own indexing logic.</p>
<p>Given the divergent advice and my own lack of hands-on familiarity with the C library, I decided to try both approaches on a prototype.  Instead of looking at vast amounts of tick data, I&#8217;d try both approaches on a smaller store  (~1G with ~7K instruments) of OHLCV data.</p>
<p>By far, the easier to implement is the OTPI approach.  However, even with this relatively small amount of data, the difference in write performance and file size was substantial.  Clearly, expanding this to the scale of tick data wasn&#8217;t going to yield sufficiently performant results.  I focused on the OBT approach.</p>
<p>&#8212;</p>
<p>Given the length of this post and keeping in mind that the holiday season isn&#8217;t over just yet (about ten hours remaining as I write this!), I&#8217;m going to stop writing now and continue with the remainder of my implementation and findings in a follow-up post later this week.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>trading the news</title>
		<link>http://www.puppetmastertrading.com/blog/2008/11/18/trading-the-news/</link>
		<comments>http://www.puppetmastertrading.com/blog/2008/11/18/trading-the-news/#comments</comments>
		<pubDate>Tue, 18 Nov 2008 21:03:02 +0000</pubDate>
		<dc:creator>tito</dc:creator>
				<category><![CDATA[back-testing]]></category>
		<category><![CDATA[market data]]></category>
		<category><![CDATA[startup]]></category>
		<category><![CDATA[strategy development]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://www.puppetmastertrading.com/blog/?p=151</guid>
		<description><![CDATA[ Inevitably one of the first ideas people have when they start thinking about how to write a trading algorithm turns out to be among the hardest: trading the news.  The problems are many and in some cases not so obvious&#8230;but the natural appeal of the idea seems universally compelling.
Just after the dot.com craze, a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" title="news?" src="/images/news.jpg" alt="" width="282" height="320" /> Inevitably one of the first ideas people have when they start thinking about how to write a trading algorithm turns out to be among the hardest: trading the news.  The problems are many and in some cases not so obvious&#8230;but the natural appeal of the idea seems universally compelling.</p>
<p>Just after the dot.com craze, a brilliant friend of mine (who had just sold his web consulting startup) decided to write a book.  The premise was glorious.  A bunch of clever college-age kids formed a startup to predict the stock market.  The method they used was to constantly comb the web with ultra-sophisticated algorithms which would run across giant server farms overnight and ultimately <strong><em>generate tomorrow&#8217;s headlines</em></strong>.  Based on the headlines that their system generated, they would place trades that would take advantage of these predicted events.</p>
<p>Sadly, my friend never went on to complete his book, so I don&#8217;t know how it all turned out.  (Instead, he went on to start another successful company, this time in the field of robotics.)  While he was writing it, I loved getting new drafts as they were filled with clever ideas.  But the core idea of predicting headlines and then using those headlines to trade always struck me as especially cute.</p>
<p>For those of us without access to news-predicting algos, writing strategies based on the news is rather less straightforward, though there is a growing variety of products and services aiming to fill the gaps.  Today must have been trading-the-news-day as I found a few articles on the topic in my mailbox and even received a cold call from a vendor, <a title="Need to Know News" href="http://www.needtoknownews.com/" target="_blank">Need to Know News</a>, with just such an offering.  Below I&#8217;ll look at some of these offerings and consider some of the issues involved in writing trading strategies based on the news.<span id="more-151"></span></p>
<p>The idea of trading the news resonates with me as it was one of the first things I tried to automate.  In particular, I spent some time looking to trade crude and natural gas futures based on their respective weekly EIA reports.  This experience led me to a couple of conclusions.</p>
<p><strong>The market knows before the news wires do. </strong>This is a big problem for two reasons.  The first is just plain-Jane latency, and I know that vendors are now effectively reducing latency to the sub-second level, but the market still knows first.  The second issue is deeper and is captured nicely by a quote I heard (sorry &#8211; I don&#8217;t remember where) which went something like:</p>
<blockquote><p>&#8220;capital markets are the original social networks&#8221;</p></blockquote>
<p>Which suggests that markets have their own internal languages and understandings.  Thus, <em>translating </em>a market-impacting event, like an EIA report, into a human-readable form to reason about it (even if there&#8217;s no human doing the reasoning) and then action it back into the market struck me as a necessarily lossy transformation.</p>
<p>Another way of seeing this is just to consider: what to do based on the news?  Sell a build of inventory?  Maybe with <em>your</em> money.  Compare the number against expectations?  Whose?  The semantic content that exists in the market seems to be intrinsically richer than what one might extract from a news wire.  But this problem of actioning a news item brings up the next issue.</p>
<p><strong>The problem of history</strong>.  One of the wonderful and horrible things about market data is that there is a lot of it going back a long ways.  This is an expensive pickle to <a title="Billions and billions" href="http://www.puppetmastertrading.com/blog/2008/08/22/billions-and-billions/" target="_blank">manage</a>, but it at least means that you have the ability to look back almost arbitrarily far into the past to see how markets responded to various conditions.  This isn&#8217;t so true with most news wires.</p>
<p>To some degree, these issues are being addressed by vendors.  And some of them will be addressed by how people utilize the news feeds.  If, instead of trying to write a strategy based wholly on the news, I try to improve an existing strategy by annotating its model with data gleaned from a feed, I might wind up with much better results.</p>
<p>Indeed, <a title="SIN Research Report: Algo-Trading on News " href="http://puppetmastertrading.com/images/News_n_Algorithmic_Trading_Research_Report_June08.pdf" target="_blank">one of the papers</a> I came across today is a research report from <a title="Securities Industry News" href="http://www.securitiesindustry.com/" target="_blank">Securities Industry News</a> (incidentally, my first tip-off that Citi was going down the toilet was when my management told me I could no longer keep my subscription to that fine periodical).  In it, market participants indicate that they expect to improve existing algos more than create brand-new ones.  But they also failed to complain about the lack of back-testable feed histories, so&#8230;</p>
<p>The <a title="AlphaSimplex assesses Reuters news feed" href="http://puppetmastertrading.com/images/Reuters_NewsScope_Event_Indices_Whitepaper.pdf" target="_blank">other paper</a> is a much more detailed quantitative exposition on how to build a news reading algo and some statistical analysis on how well it annotated the market.  This one is the work of <a title="Andrew Lo" href="http://web.mit.edu/alo/www/" target="_blank">Andrew Lo</a>&#8217;s <a title="AlphaSimplex Group" href="http://www.alphasimplex.com/" target="_blank">AlphaSimplex Group</a> and as such is required reading for anyone who really wants to implement such systems.</p>
<p>Ever since my first experiences with trying to trade energy commodities based on the EIA report via news feed, I&#8217;ve been pretty skeptical of the utility of trying to algorithmically trade the news.  That said, like everything else in this space, there&#8217;s an incredible amount of innovation going on and an incredible number of seriously smart and motivated folks working to ensure that this will be a productive path for those with the ability to tackle the formidable complexities presented.</p>
<p>&#8211;</p>
<p>Updated: added link to Need to Know News&#8217; site</p>
]]></content:encoded>
			<wfw:commentRss>http://www.puppetmastertrading.com/blog/2008/11/18/trading-the-news/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>billions and billions</title>
		<link>http://www.puppetmastertrading.com/blog/2008/08/22/billions-and-billions/</link>
		<comments>http://www.puppetmastertrading.com/blog/2008/08/22/billions-and-billions/#comments</comments>
		<pubDate>Fri, 22 Aug 2008 16:21:22 +0000</pubDate>
		<dc:creator>tito</dc:creator>
				<category><![CDATA[back-testing]]></category>
		<category><![CDATA[market data]]></category>
		<category><![CDATA[open-source software]]></category>
		<category><![CDATA[post-trade analysis]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://www.puppetmastertrading.com/blog-test/?p=81</guid>
		<description><![CDATA[
While Carl Sagan&#8217;s famous formulation introduced a generation to the vastness of the cosmos, more recent history suggests that his memorable term might now be more aptly applied to financial extents: our deficits and debts, perhaps, to the economically or politically minded.  But for those of us with the markets on our mind, the [...]]]></description>
			<content:encoded><![CDATA[<p><img hspace="7" align="left" alt="billions and billions" title="billions and billions" src="http://puppetmastertrading.com/images/stars.jpg" /></p>
<p>While Carl Sagan&#8217;s famous formulation introduced a generation to the vastness of the cosmos, more recent history suggests that his memorable term might now be more aptly applied to financial extents: our deficits and <a target="_blank" title="US Public Debt" href="http://en.wikipedia.org/wiki/United_States_public_debt">debts</a>, perhaps, to the economically or politically minded.  But for those of us with the markets on our mind, the term has to evoke the enormity of the data we create and must manage every day.  We&#8217;ve recently been working with the <a target="_blank" title="NYSE TAQ Data" href="http://www.nyxdata.com/nysedata/default.aspx?tabid=730">NYSE&#8217;s TAQ data</a> in an effort to integrate it into <a target="_blank" title="Puppetmaster Trading: StratBox" href="http://puppetmastertrading.com">StratBox</a>&#8217;s back-testing and optimization capabilities.  And the enormity of the data is really just staggering.</p>
<p>Each day, the NYSE publishes all of the day&#8217;s quotes and trades as well as some reference data.  Compressed, the data will just about fit onto a DVD.  For one day.  A DVD.  Compressed.  It&#8217;s really mind-boggling.  A year of the stuff, uncompressed, will require over a <em>petabyte </em>of storage.  Over 1,125,899,906,842,624 bytes.  And that&#8217;s just the US Equities markets.  You want options data, too?  I hope your uncle is named <a title="EMC - " target="_blank" href="http://www.emc.com/index.htm">EMC</a>, because just managing the data is going to be <em>a challenge</em>&#8230;</p>
<p><span id="more-81"></span></p>
<blockquote><p>&#8220;Information about money has become almost as important as money itself.&#8221; &#8212; Walter Wriston, former Chairman of Citicorp</p></blockquote>
<p>The enormity and profile of market data far exceed the capacity of traditional RDBMSes. While RDBMSes continue to expand their usable capacity &#8211; we have used partitioned tables with nearly a billion rows of market data which have performed astonishingly well &#8211; they simply can&#8217;t deal with the kinds of quote volumes modern markets are generating daily.  This has spawned a host of specialized timeseries database products, from the granddaddy, <a target="_blank" title="Sungard's Fame" href="http://www.sungard.com/Fame/">Sungard&#8217;s FAME</a>, which I&#8217;d used back in the 90&#8217;s to write programs to calculate bond indices at JPM, to more recent offerings like <a target="_blank" title="Vhayu" href="http://www.vhayu.com/">Vhayu</a> and <a target="_blank" title="kdb+" href="http://kx.com/">Kdb+</a>.  These timeseries-oriented data products undoubtedly have many distinguishing characteristics and features, but they share one immutable characteristic: they are unbelievably expensive &#8211; in some cases a single developer seat costs in the high six figures for an annual license.</p>
<p>Thus, while no doubt missing out on some of their high-end features and niceties, we&#8217;ve decided to seek solutions from some of the original purveyors of petabyte-scaled data: NASA and the NCSA through their <a title="HDF5: what is it?" target="_blank" href="http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html">HDF5</a> system.  Because HDF5 was designed to support vast scientific data stores and boasts sophisticated capabilities in support of parallel computing environments, it should be possible to get performance comparable to some of the high-end specialized finance products without the sticker shock.  Indeed, it&#8217;s potentially <a title="Quantlib" target="_blank" href="http://puppetmastertrading.com/blog/2008/06/14/using-quantlib-from-java/">another example of free software</a> providing a meaningful contribution to finance.</p>
<p>In researching cost-efficient and highly parallel hardware solutions to pair with our emergent data solution, I&#8217;ve come to realize that open-source is expanding its reach into the hardware sphere.</p>
<p><img title="Linux cluster in an IKEA Filing cabinet" alt="Linux cluster in an IKEA Filing cabinet" src="http://puppetmastertrading.com/images/helmer.png" /></p>
<p><a target="_blank" title="Helmer" href="http://helmer.sfe.se/">This guy</a> shares his experience and &#8220;recipe&#8221; for building a powerful and unique rendering cluster inside an IKEA filing cabinet.  It&#8217;s admittedly on the funky side for even a SOHO operation, but it&#8217;s no joke &#8211; it&#8217;s more powerful than a lot of production blade servers used on Wall Street and it cost him less than $4K.  He also includes a (very loosely described) spec for a more powerful next-generation version with some 50 teraflops of capacity!  So, while the data we&#8217;re having to deal with is growing at an incredible rate, the tools we have to manage it are growing proportionately for those who know how to leverage the work so many smart people are producing and freely sharing.</p>
<p>As my dad told me years ago:</p>
<blockquote><p>&#8220;Good programmers write good programs.  Great programmers <em>steal </em>good programs.&#8221;</p></blockquote>
<p>At this point, we&#8217;re still in the &#8220;discovery&#8221; stage of our development of TAQ+HDF5 for StratBox, but as we progress I&#8217;ll periodically post some of our experiences.</p>
<p>&#8212;</p>
<p>UPDATE</p>
<p>Speaking of my dad, he saw this posting and pointed me to some <a target="_blank" title="Massive Information processing and Fault-Tolerance: The Google Approach" href="http://www.cis.temple.edu/~ingargio/cis307/readings/MapReduce.html">class notes</a> he&#8217;s been working on which describe the Google approach to massive information processing and fault-tolerance.  Interesting and full of great links to both academic and industrial papers/sites on the topic.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.puppetmastertrading.com/blog/2008/08/22/billions-and-billions/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
