Home > EMS Internals, technology > lock free

lock free

February 16th, 2010 Leave a comment Go to comments

One of the recurring technology themes in these pages has been the ongoing and dramatic move from single to multi-core systems and the need to seriously increase the parallelism in our software designs. For me, one of the seminal, large-grained design patterns was the SEDA Architecture. For years, this informed my systems’ designs and formed a conceptual backbone for development. That said, I’ve been broadly aware for some time that SEDA’s golden age has (incredibly!) already passed us by, but haven’t identified what might replace it as a reference point for my design efforts.

Before considering tools, languages or patterns that might help, we need to reflect on the problem(s) we’re trying to solve. The problems inside an EMS look to me, after years of development, a lot like network routing problems. Indeed, my current view is that this (not just concurrency as I’d suggested at the time) is why the unfortunate Aleynikov & co. at GS were using erlang.

Why network routing? Think about the load on an EMS. The main issue is that you’re getting many thousands of teeny little messages per second and only a relatively small number of them matter to only a relatively small subset of ‘agents’ within the system. Reducing latency is all about making sure the time you spend on each message is minimized, and that the agents who are interested in a particular message needn’t wait for each other to do whatever they care to do based on the message. So, really you’re trying to route each message through your system with as few ‘hops’ possible and as much parallelism as you can muster under the (radically!) new assumption that you may have hundreds or thousands of cores available to you during the lifetime of the design.

I spent some time thinking (hoping) that languages might help furnish an answer. Perhaps a move to a functional language like erlang, ocaml or scala might help furnish at least a partial answer. But erlang is slow and peculiar, ocaml doesn’t support intra-process concurrency and scala looks like a bloated language on a bloated platform (jvm+java class library). And none of them seem to have achieved anything near the critical mass which is so crucial for the development of usable libraries and the availability of skilled developers with long experience in the technology.  Naturally, reasonable people will disagree about such things, but this is my view (today). Java is ok (and certainly sells servers), but it’s not obvious how it’s going to help me offload my work onto a GPU anytime soon (and jni is both painful and slow) and I’ve never been able to get comfortable with just how damn big VMs get.  Image size isn’t free and if we’re looking to go deep into the sub-millisecond response time, while running thousands of concurrent strategies, it seems we need to disintermediate the VMs and interpreters of the world. If they’re really necessary, they can be happily used for the analysis process (as I currently use R), or they can be lit-up and bridged from some lower-level language for batch-like services.

The good people at Intel have been thinking about this problem for a while as have many other seriously over-educated people. One of the (sensible sounding) conclusions reached as people look for ways to solve problems similar to my own, is that in such systems we should keep messages waiting as little as possible – ideally, not at all(!). This can be a problem in SEDA-like architectures which are basically made-up of (non-blocking, asynchronous) i/o processes linked to (blocking) queues linking pools of workers. Blocking queues can pile up and cause all sorts of problems like priority inversion and other such enigmatically named nasties. Lock-free queues and other data structures, algos and techniques promise some ways around this and I’ve been spending time looking into how they might be employed to address my issues.

Before I’m besieged by throngs of angry erlang/ocaml/scala/java developers, allow me one last observation on the topic.  (Peeved python and ruby users may rant away – vous m’amusez  ;^)

Why might a lock free algorithm be better than an equivalent, hardware-based locking implementation?  The answer isn’t obvious.  If locking is implemented in hardware as is typical (eg, with a compare-and-swap (CAS) instruction), then its explicit cost is measurable in (few) nanoseconds.  Hardware is fast.  The issue isn’t the speed of execution of the underlying primitives so much as it’s a consequence of the side effects of these operations at a very low level.  For real performance, cache coherence is King.  See here for an accessible discussion by IBM’s Paul McKenney and here for some remarkable examples from Igor Ostrovsky.  This indicates that if you want the highest possible performance, you need to be aware of what is happening ‘in the metal.’  So we need to use a system-level language and erlang, java & friends lose their candidacy in spite of any fantastic benefits they might offer.

Given that even the DoD has mostly given up on ADA means that we’re left with C/C++.

Ok, so language doesn’t seem to resolve much for us. (Indeed, it was mostly hopeful thinking on my part – design is mostly language agnostic and hardware is hardware…)

Apart from Intel’s own Threading Building Blocks (TBB) framework, there are a variety of toolkits available for exploiting lock free parallelism. Perhaps the newest and least known is called FastFlow, which is a C++ template library that provides a variety of facilities for writing efficient lock-free network models. It also claims to be faster than TBB, Cilk and OpenMP while holding out the promise of one day becoming CUDA- (or more generally, GPU-) aware which would be an incredible win. Finally, it is very small – the current version (not including tests and examples), weighs in at ~5K lines of (mostly) C++ templates.  Thus, it seems to me particularly well-suited for some experimentation to assess the fit of these techniques in this space and the level of difficulty of doing so.

In the remainder of this post, I’ll briefly describe the FF design and then illustrate a sample C++ program which uses FastFlow to ‘architecturally prototype’ a feed handler interacting with strategies inside an EMS / strategy container.

FastFlow architecture

FastFlow architecture

FastFlow’s (FF) goals seem quite ambitious, but the design is layered to allow users to work at a level that makes sense for the problems they are trying to solve, so they needn’t worry themselves (or even understand necessarily) layers below where they need to work.  At bottom is a thin layer of hardware-specific code supporting, one level up, single-producer-single-consumer (SPSC) lock-free queues and a threading model for their efficient interaction.  Above this are defined objects for generalizing from simple to arbitrary network types.  Here you see the definition of a composable Node which serves as the key abstraction for general-purpose streaming networks.  Above this are defined a variety of skeleton templates for building commonly useful graph types.  Because they are Nodes themselves, they can be nested and coupled to any depth.

Consider a simple pipeline:

a simple FastFlow pipeline

A farm is slightly more complex and acts something like a load balancing threadpool.  The Emitter pushes ‘tasks’ of arbitrary type to an arbitrary number of workers who then coalesce back into an (optional) Collector.  A simple Farm makes no promises about the ordering of tasks, but (I believe) FF provides mechanisms for ensuring their ordering at the collector.

a FastFlow Farm

Finally, because of the composable nature of Nodes, all of these graph types can be combined arbitrarily.  Here you see a Farm of pipelines with an ‘accelerator’ (which, if I understand correctly is just an integrated thread for pushing tasks into the network) and a feedback channel.

composition of FastFlow subnets

For a more complete (and probably accurate!) description of FastFlow, I encourage you to visit their site and tutorial.  The documentation remains limited, but there’s a good and growing selection of examples to examine and all of the source is available for inspection.

a simple example

Given my use-case, my first interest is in seeing how FastFlow might work in the context of asynchronous, non-blocking network handling.  To this end, I’ve built a simple Farm as depicted above in which the Emitter uses select() to read messages off the network which it parses and pushes into the farm for ‘processing’ which is currently NOP-ed.  The messages/tasks are then coalesced into a collector which does nothing more than free memory allocated in the Emitter.  Within the Farm, each of the nodes inhabits its own thread and outside of the farm I’ve implemented a very simple ‘client’ which connects to the Emitter’s socket and sends delimited Integral ‘messages.’  Ordinarily this would appear in its own process, but for the sake of only requiring 1 file, I’ve mashed them all together and the client lives in its own thread.  That’s it.

It turns out that the FastFlow part of this is, by far, the simplest.  Since the FF skeletons are implemented as templates, extending them couldn’t be much easier.  Here’s the definition of the worker and collector components.

// typical worker: does little ;^p
class Worker: public ff_node {
public:
  void * svc(void * task) {
    return task;
  }
};

// collector that just frees the malloc'ed memory
class freeing_collector: public ff_node {
public:
  void * svc(void * task) {
    int * t = (int *)task;
    if (*t == -1) return NULL;
    else free(task);
    return GO_ON;
  }
};

They are both nodes and their only requirement is to implement the svc method which is called by the FF framework when they are meant to service a task.  They can optionally implement svc_init and svc_end methods which provide an opportunity for initialization and cleanup.  The socket-reading emitter makes use of the svc_init method to initially setup the socket.  Later, as it repeatedly reads from the socket into a buffer, it cracks the incoming stream and then calls the ff_send_out() method to push those messages out to the downstream workers.  The whole example is about 350 lines of code and nearly 200 lines are consumed by the Emitter.  Almost all of this bulk is network-related, so I won’t illustrate here.  But highlighted here you can see the main FF-specific part of the Emitter:

 // handle new data coming in
 void new_data(int nbytes, const char* buf) {
   static char remainder[kMaxMsgSz];
   int *t;
   int j = 0;
   int i = 0;
   char msgbuf[kMaxMsgSz];
   for (; i < nbytes; i++) {
     if ( kDelimiter == buf[i] ) {
       int rlen = (j==0) ? strlen(remainder) : 0; // remainder?
       int len = i-j;
       if (rlen>0) strncpy(msgbuf,remainder,rlen);
       int start = (rlen==0) ? 0 : rlen;
       strncpy(&(msgbuf[start]),&(buf[j]),len);
       msgbuf[len+rlen] = '\0';
       t = (int *)malloc(sizeof(int));
       *t = atoi(msgbuf);
       j = i;
       ff_send_out(t);  // push message into farm
     }
   } // for

   int k = j+1;
   if (k < i) { // save 'remainder'
     strncpy(remainder,&(buf[k]),(i-k));
   } else remainder[0] = '\0';
 }

Finally, all of the pieces are assembled and launched in the main().

//  We construct a 'server' which is a fastflow emitter and which reads
//    integral 'msgs' from a client over a non-blocking socket.  Msgs
//    are parsed and fed into a fastflow farm for further handling.
//    The client is placed within its own thread - this is only so the
//    whole example can be placed in one file.
//
int main(int argc, char * argv[]) {
 // use: argv[0] <#msgs=1024> <#port=9999>

 int msgs = (argc > 1) ? atoi(argv[1]) : 1024;
 int port = (argc > 2) ? atoi(argv[2]) : 9999;

 int nworkers = 5;               // how many workers will recv msgs

 printf("main: sending #%d msgs to port <%d>\n",msgs,port);

 select_reader sr(port);         // we create 'server'
 freeing_collector fc;           // and a freeing collector
 ff_farm<> farm;                 // and a farm for it to live in
 farm.add_emitter(&sr);          // add both to the farm
 farm.add_collector(&fc);

 std::vector<ff_node *> workers; // build a collection of workers
 for(int i =0; i < nworkers; i++) { workers.push_back(new Worker); }

 farm.add_workers(workers);      // add all workers to the farm

 farm.run();                     // launch the farm
 printf("main: started farm.\n");

 sr.wait_til_ready();            // don't create client til srvr ready

 // create client which will send msgs via socket
 //
 printf("main: creating client...\n");
 client_thread* client = new client_thread(msgs,port);
 if (client != NULL) client->join();

 farm.wait();                    // wait for farm to be done its workload

 // emit some stats
#if defined(TRACE_FASTFLOW)
 std::cout << "DONE, time= " << farm.ffTime() << " (ms)\n";
 farm.ffStats(std::cout);
#endif

 return 0;
}

The whole example can be found here.

Writing this example was a bit of a ‘tale of two frameworks’ for me as I found the FastFlow part to be really easy and powerful.  On the other hand, I wanted to try out the boost::asio framework for non-blocking networking i/o (primarily in the hope it might cut down the code size a bit) and while implementing the trivial (blocking) client was, well, trivial, implementing a non-blocking server with boost::asio was a serious pain and I ultimately decided that Linus just might be on to something wrt to boost and scrapped that part of the effort (but kept the boost client).

Nothing about this example illuminates the potential performance advantages of a lock free approach, but this exercise has certainly convinced me that the FF authors have managed to combine very low-level performance considerations within a package which can be easily used from a high-level perspective.  Most problems will most likely be better served by higher-level programming languages, but some problems remain intractably bound to the metal and I suspect that writing ultra-low-latency trading systems will remain in the latter class for some time to come.

Categories: EMS Internals, technology
  1. Craig
    February 17th, 2010 at 13:34 | #1

    As usual programmers will always disagree on the right path to take, I used boost::asio for all my non-blocking i/o and have had no problems whatsoever, what did you find so painful?

  2. February 17th, 2010 at 15:13 | #2

    > what did you find so painful?

    Getting a clue. If I dust-off my vintage WR Stevens book and turn to chapter 12 on advanced i/o, I learn about select() in about six (dense) pages. I’m forced to learn about the arguments it takes and a variety of macros for manipulating them. Admittedly, it’s not easy to use select() correctly. (Indeed, I’ve already been informed that my little example Emitter has at least one error and readily admit that it suffers from use in an example that’s contrived-to-the-point-of-absurdity.) But to get the basic idea of it doesn’t take much more than reading a few man pages.

    Going to boost::asio’s reference almost couldn’t be a more removed experience. I didn’t bother to count, but there must be nearly 100 classes/templates/options/services/functions/&tc to wade through.

    I experienced terrifying flashbacks of Sun’s visionary decision to put corba and rmi in the java class library.

    I’m sure that once you get ‘grounded’ in the asio mindset it all makes sense and is very clean and nice, but I just wanted to whip together a simple example and get on with my day. The fact that google doesn’t include asio in the teeny sliver of boost that they allow, didn’t much inspire me to make the investment of time.

    But my real point was the difference I saw in using the two libraries. boost::asio has been around for a long time and has good documentation (I think – certainly there’s plenty of it), but I found it hard to understand and get to work as desired. Fastflow is doing non-trivial stuff, is brand new and has practically no documentation, but using it was very easy – there’s not a lot of concepts you need to learn because the designers kept it simple and used layers of abstraction to hide things you might not initially care about.

    Best,

  3. Craig
    February 17th, 2010 at 17:48 | #3

    Interesting, I just used the boost::asio examples to cobble together my connection classes, the timers where a little more obscure, but none of this took very long. Once I had the TCP client/server stuff going, it was only slightly more code to expand this to SSL. It is true select() can be surmised very quickly, but one may spend years finding subtle bugs and implementation problems. I do all my work with C++/Boost so I guess it I had a head start, and perhaps was a little bias :)

  4. February 18th, 2010 at 16:34 | #4

    Going back to my C roots I’d consider some type of shared memory pool with a “circular” queue. Make it the responsibility of all readers to keep up, remember the “position” they were last at and pulling out any new updates for processing.

    For layering purposes perhaps have queue managers that would take the last “position” as a request parameter and return 1-N available updates if any present. Since it’s read only memory to all but one locking/blocking should be minimal.

    Queue managers, server processes, could be placed in front of pools of consumers and provide answers based on reading snapshots from the master queue. It would be possible for queue managers to have different filtering capabilities if not all messages were useful for all potential clients.

    Heh, obviously, I like to roll my own solutions, but that doesn’t always work well with trying to fit into some massive organized framework. It is however fun when there are no real world issues to deal with during the design process… ;)

  5. February 19th, 2010 at 08:55 | #5

    Rookie: your roots are showing ;^>

  6. April 12th, 2010 at 13:22 | #6

    LINKED IN VISITORS:

    I see I’ve been getting a bunch of traffic on this post from some group in LinkedIn, but since I’m not a member of that (or any) group in LinkedIn, I can’t even see what the group is let alone why you folks are visiting here…

    Anyone care to enlighten me? Thanks!

  7. Marco
    April 13th, 2010 at 06:48 | #7

    Tito, I’ve mentioned this page in a post within the “Trading technologies” linkedin group. I hope this is not a problem for you.

    Marco

  8. April 13th, 2010 at 06:54 | #8

    Hi Marco – no problem at all of course! Thank you for letting me know as I was simply curious… Best regards

  9. Corporate Serf
    October 18th, 2011 at 11:34 | #9

    Bit late in the post, but have you looked at the “purely functional data structures”/ “functional arrays” work? Quite simply instead of “array[index] = 42;” you do “newArray = array[ with index = 42]” with the intent that you get to keep pointers to both the old and the new array. The data structures are tree based and keeps some sort of “modifications” in a tree like fashion. The clojure language does this on top of the java vm and allegedly it is almost lock free (some locks in garbage collection I would imagine, so you will need to fake that if you are doing it in C).

  1. February 9th, 2011 at 10:58 | #1
You must be logged in to post a comment.