Archive

Archive for February, 2010

lock free

February 16th, 2010

One of the recurring technology themes in these pages has been the ongoing and dramatic move from single to multi-core systems and the need to seriously increase the parallelism in our software designs. For me, one of the seminal, large-grained design patterns was the SEDA Architecture. For years, this informed my systems’ designs and formed a conceptual backbone for development. That said, I’ve been broadly aware for some time that SEDA’s golden age has (incredibly!) already passed us by, but haven’t identified what might replace it as a reference point for my design efforts.

Before considering tools, languages or patterns that might help, we need to reflect on the problem(s) we’re trying to solve. The problems inside an EMS look to me, after years of development, a lot like network routing problems. Indeed, my current view is that this (not just concurrency as I’d suggested at the time) is why the unfortunate Aleynikov & co. at GS were using erlang.

Why network routing? Think about the load on an EMS. The main issue is that you’re getting many thousands of teeny little messages per second and only a relatively small number of them matter to only a relatively small subset of ‘agents’ within the system. Reducing latency is all about making sure the time you spend on each message is minimized, and that the agents who are interested in a particular message needn’t wait for each other to do whatever they care to do based on the message. So, really you’re trying to route each message through your system with as few ‘hops’ possible and as much parallelism as you can muster under the (radically!) new assumption that you may have hundreds or thousands of cores available to you during the lifetime of the design.

I spent some time thinking (hoping) that languages might help furnish an answer. Perhaps a move to a functional language like erlang, ocaml or scala might help furnish at least a partial answer. But erlang is slow and peculiar, ocaml doesn’t support intra-process concurrency and scala looks like a bloated language on a bloated platform (jvm+java class library). And none of them seem to have achieved anything near the critical mass which is so crucial for the development of usable libraries and the availability of skilled developers with long experience in the technology.  Naturally, reasonable people will disagree about such things, but this is my view (today). Java is ok (and certainly sells servers), but it’s not obvious how it’s going to help me offload my work onto a GPU anytime soon (and jni is both painful and slow) and I’ve never been able to get comfortable with just how damn big VMs get.  Image size isn’t free and if we’re looking to go deep into the sub-millisecond response time, while running thousands of concurrent strategies, it seems we need to disintermediate the VMs and interpreters of the world. If they’re really necessary, they can be happily used for the analysis process (as I currently use R), or they can be lit-up and bridged from some lower-level language for batch-like services.

The good people at Intel have been thinking about this problem for a while as have many other seriously over-educated people. One of the (sensible sounding) conclusions reached as people look for ways to solve problems similar to my own, is that in such systems we should keep messages waiting as little as possible – ideally, not at all(!). This can be a problem in SEDA-like architectures which are basically made-up of (non-blocking, asynchronous) i/o processes linked to (blocking) queues linking pools of workers. Blocking queues can pile up and cause all sorts of problems like priority inversion and other such enigmatically named nasties. Lock-free queues and other data structures, algos and techniques promise some ways around this and I’ve been spending time looking into how they might be employed to address my issues.

Before I’m besieged by throngs of angry erlang/ocaml/scala/java developers, allow me one last observation on the topic.  (Peeved python and ruby users may rant away – vous m’amusez  ;^)

Why might a lock free algorithm be better than an equivalent, hardware-based locking implementation?  The answer isn’t obvious.  If locking is implemented in hardware as is typical (eg, with a compare-and-swap (CAS) instruction), then its explicit cost is measurable in (few) nanoseconds.  Hardware is fast.  The issue isn’t the speed of execution of the underlying primitives so much as it’s a consequence of the side effects of these operations at a very low level.  For real performance, cache coherence is King.  See here for an accessible discussion by IBM’s Paul McKenney and here for some remarkable examples from Igor Ostrovsky.  This indicates that if you want the highest possible performance, you need to be aware of what is happening ‘in the metal.’  So we need to use a system-level language and erlang, java & friends lose their candidacy in spite of any fantastic benefits they might offer.

Given that even the DoD has mostly given up on ADA means that we’re left with C/C++.

Ok, so language doesn’t seem to resolve much for us. (Indeed, it was mostly hopeful thinking on my part – design is mostly language agnostic and hardware is hardware…)

Apart from Intel’s own Threading Building Blocks (TBB) framework, there are a variety of toolkits available for exploiting lock free parallelism. Perhaps the newest and least known is called FastFlow, which is a C++ template library that provides a variety of facilities for writing efficient lock-free network models. It also claims to be faster than TBB, Cilk and OpenMP while holding out the promise of one day becoming CUDA- (or more generally, GPU-) aware which would be an incredible win. Finally, it is very small – the current version (not including tests and examples), weighs in at ~5K lines of (mostly) C++ templates.  Thus, it seems to me particularly well-suited for some experimentation to assess the fit of these techniques in this space and the level of difficulty of doing so.

In the remainder of this post, I’ll briefly describe the FF design and then illustrate a sample C++ program which uses FastFlow to ‘architecturally prototype’ a feed handler interacting with strategies inside an EMS / strategy container.

Read more…

EMS Internals, technology

transitions

February 8th, 2010

Today we return to our series on regime switching and the topic of managing portfolios of strategies.  In particular, we build on the examples illustrated in sensitivity testing and steppin’ out, in which we showed historical and then real-time ‘forward-walking’ of strategies.  The next step we’d described was to evolve the techniques illustrated to support the real-time management of a portfolio of strategies.

In the example below, we look at another ‘meta’ strategy named StrategyPortfolio which maintains a dynamic portfolio – P – of strategies which it will select from a set of strategies – S – running concurrently in simulation.  The constituents of P as well as their cash allocations and parameterizations will be rebalanced/adjusted regularly after an initial ‘out-of-sample’ period during which only the S strategies are run.

Apart education, the intention of this strategy, as I’d originally suggested here, is to ‘back-into’ a regime-switching strategy without attempting to directly quantify the regimes explicitly.

This has proved to be even more interesting than I’d expected, not so much because it performs particularly well (though it’s promising), but because of all of the things it has taught us.  In particular, the transitions are a killer and there are properties of strategies which (dis-)qualify them from being effective in such a scheme…

Read more…

EMS Internals, portfolio management, regime-switching, strategy development

Kooderive

February 3rd, 2010
photo by Simon Rogerson

photo by Simon Rogerson

Some time back, I’d written about NVidia’s CUDA noting that it looked ideal for many asset-pricing and monte-carlo type problems in finance.  At the time, I was hopeful that it would be quickly integrated into existing open source efforts like QuantLib, but adoption has proved slower than I’d hoped, most likely because implementing non-trivial problems on CUDA is, well, even less trivial than doing them without..

LMM on CUDA

Happily, I’ve just seen a promising first step in this direction as Über-quant and C++ artisan Mark Joshi recently announced an open-source project, Kooderive which looks to implement the LIBOR Market Model (LMM)  on top of CUDA.  His announcement on the QuantLib mailing lists reads:

Dear All,

various people have shown interest in the use of CUDA with QuantLib. I
have now made some progress on a CUDA implementation of the LIBOR
market model
.

In particular, I now have a path generator for the LMM working which
does 16384 paths for 40 rates, 40 steps, 5 factor model, displaced
diffusion predictor-corrector that takes 0.1 seconds on my Quadro 4600.

The state of the project is code fragments that can be called from
other code. Those who are interested can get the code via
the subversion repository on kooderive.sourceforge.net .  The only
project file is currently for VC9 x64. It also uses thrust and the
CUDA SDK.

The next stage will be writing routines, that use QuantLib for the CPU
stuff and kooderive for the GPU stuff,  to actually price things.

A gentle reminder that I will be giving a course on the LMM and
QuantLib in June in London, and I will include a session on kooderive
if there
is sufficient interest.

I am happy to take code contributions for kooderive. However, I am not
looking for a redesign of the library or contributions which introduce
dependence on other libraries. I am interested in contributions of
separate routines and of optimizations of existing routines that do
not change interfaces.

regards

Mark

Pricing exotic interest rate derivatives – The LIBOR Market Model in
QuantLib June 2010, London,
http://www.moneyscience.com/training/index.html

Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com


EMS Internals, dereferenced, monte-carlo methods, open-source software, options pricing, technology