Sunday 3 May 2015

Measuring Latency in Ultra Low Latency Systems


For years people have said: optimise last, don’t worry, you can fix it later.

For ultra low latency it's simply not true. Every line of code should be written with optimisation in mind. When writing SubMicroTrading I wrote micro benchmarks to compare performance at a micro level. I tested pooling patterns, different queue implementations, and the impact of inheritance and generics. Some of the results were surprising. The one which stands out as misleading was the benchtest of generic pools versus concrete pools. Theoretically generics should have no runtime impact, but on a PC I had a micro benchmark which showed some latency spikes. Because of this I made my code generator generate a discrete pool and recycle class. Of course when I reran the benchmarks on Linux with tuned JVM parameters there was zero difference!

It is absolutely essential to understand the key use cases and to have a controlled benchtest setup which you can guarantee is not being used for anything else when testing.

Colocation OMS

The first use case for SubMicroTrading was as a colocation normalising order management with client risk checks. The OMS received client orders, validated them, ran risk checks then sent a market order to the exchange. The OMS would normalise all exchange eccentricities and allow for easy client customisation. This was before IB clients were allowed sponsored access. With sponsored access the need for ultra low latency order management systems disappeared … bad timing on my part. 

With this use case I ran a client simulator and an exchange simulator on the sim server, and the OMS on the trade server. The client simulator stamped a nanosecond timestamp on the outgoing client order. When the OMS generated a market order, it stamped the order with the total time spent within the OMS as well as the original client order timestamp from the client simulator. The exchange simulator and client simulator both use thread affinity to the same core, and thus the nanosecond timestamp can be used to determine a realistic end to end time. You can guesstimate the time in the NICs and network by:

    TimeInNICsAndNetwork = ( timeMarketOrderReceived - clientOrderTimeStamp ) - timeInOMS

There were 4 NIC TCP hops, so roughly:

    (Rough) NIC TCP hop  = TimeInNICsAndNetwork / 4
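The two formulas above can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical (not taken from SubMicroTrading), and the numbers in `main` are made up to show the arithmetic, not measured results.

```java
// Hypothetical helper; names and sample values are illustrative only.
final class NicHopEstimate {

    /** Time spent outside the OMS (NICs plus network), in nanoseconds. */
    static long timeInNicsAndNetwork(long clientOrderTimestampNanos,
                                     long marketOrderReceivedNanos,
                                     long timeInOmsNanos) {
        long endToEndNanos = marketOrderReceivedNanos - clientOrderTimestampNanos;
        return endToEndNanos - timeInOmsNanos;
    }

    /** Rough cost of a single NIC TCP hop, assuming all hops cost the same. */
    static long perHopNanos(long timeInNicsAndNetworkNanos, int hops) {
        if (hops <= 0) {
            throw new IllegalArgumentException("hops must be positive");
        }
        return timeInNicsAndNetworkNanos / hops;
    }

    public static void main(String[] args) {
        // Made-up timestamps, both taken from the same clock on the sim server.
        long clientOrderTimestamp = 1_000_000L;
        long marketOrderReceived  = 1_025_000L;
        long timeInOms            = 5_000L;

        long outsideOms = timeInNicsAndNetwork(clientOrderTimestamp, marketOrderReceived, timeInOms);
        System.out.println("NICs + network (ns): " + outsideOms);
        System.out.println("Rough per hop (ns):  " + perHopNanos(outsideOms, 4));
    }
}
```

Dividing evenly by the hop count is a simplification. It hides any asymmetry between the hops, which is why the result is only a rough figure.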

Because the timestamps are generated on one host (even 1 core or 1 CPU) you don’t need expensive PTP time synchronisation. Using System.nanoTime to measure latency is not recommended: depending on the OS clock source it can end up using the HPET timer, which shows some aberrations of many milliseconds on long test runs. RDTSC seems to be more accurate but can have problems measuring across cores (I will include the code in a later blog on JNI).

To be honest, however, at this level all we are really concerned about is having a repeatable testbed. Change some parameter: what's the impact? Run the benchtest several times and measure the delta. Now we can see if we made things better or worse. Note that now Solarflare have added a packet capture facility, I would hope that capturing input/output allows accurate perf measurements.
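As a minimal sketch of "run several times and measure the delta", the snippet below compares a baseline set of runs against runs with a changed parameter. Using the median of the per-run figures is my assumption here (it is less sensitive to one bad run than the mean), and the class name and sample numbers are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for comparing repeated benchtest runs; not SubMicroTrading code.
final class RunDelta {

    /** Median of the per-run figures (for example, one latency summary value per run). */
    static double median(double[] runs) {
        if (runs.length == 0) {
            throw new IllegalArgumentException("need at least one run");
        }
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Percentage change of the candidate median relative to the baseline median. */
    static double deltaPercent(double[] baselineRuns, double[] candidateRuns) {
        double base = median(baselineRuns);
        return (median(candidateRuns) - base) / base * 100.0;
    }

    public static void main(String[] args) {
        // Made-up per-run latencies in microseconds.
        double[] baseline  = {100.0, 110.0, 105.0};
        double[] candidate = {84.0, 90.0, 84.0};
        System.out.printf("delta: %.1f%%%n", deltaPercent(baseline, candidate));
    }
}
```

A negative delta means the change reduced latency. With only a handful of runs, a small delta is likely to be noise rather than a real improvement.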

Algo Trading System : Tick To Trade

The second use case was for Tick to Trade. Here I wrote an algo container, optimal book manager and market data session handling using the SMT framework.



NIC1 was used for market data, and NIC2 for the trade connection.

In this use case, the simulation server replays captured market data using tcpreplay at various rates (from x1 to flat out which was around 800,000pps).

The trade server gets the market data through NIC1 into the market data session(s). CME has so many sessions that I wrote a dynamic session generator to lazily create sessions based on subscribed contracts. The market data is consumed by the algos, which can then send orders to the exchange session over NIC2. Each order is sent with a marker field identifying the tick that generated it. This allows the order to be accurately correlated with the tick that really triggered it.
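The marker-field correlation can be sketched as below. This is an assumption-laden illustration for offline analysis of captured data: the class, the use of a numeric tick id as the marker, and the method names are all hypothetical. A `HashMap` with boxed keys creates garbage, so this is not something to put on the trading hot path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical offline correlator; identifiers are illustrative, not SMT framework APIs.
final class TickToTradeCorrelator {

    // tick id -> time the tick was seen arriving (nanoseconds, single clock source)
    private final Map<Long, Long> tickArrivalNanos = new HashMap<>();

    /** Record when a market data tick arrived. */
    void onTick(long tickId, long arrivalNanos) {
        tickArrivalNanos.put(tickId, arrivalNanos);
    }

    /**
     * Match an outgoing order to the tick named in its marker field.
     * Returns the tick to trade latency in nanoseconds, or -1 if the tick is unknown.
     */
    long onOrder(long markerTickId, long orderSentNanos) {
        Long arrival = tickArrivalNanos.get(markerTickId);
        return (arrival == null) ? -1L : orderSentNanos - arrival;
    }

    public static void main(String[] args) {
        TickToTradeCorrelator correlator = new TickToTradeCorrelator();
        correlator.onTick(42L, 1_000L);                       // made-up tick id and timestamp
        System.out.println(correlator.onOrder(42L, 4_500L));  // order carrying marker 42
    }
}
```

The lookup uses `get` rather than `remove` because a single tick may trigger more than one order.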

Hybrid FPGA, market data providers, exchange adapters all need to be considered holistically within the trading system as a whole for the KEY use cases of the system. Benchtest them not at the rate of a passive day but at the maximum rate you may need … consider the highest market spike rates and then try normal market rate x1, x5, x10, x100 to understand different performance levels.

For really accurate figures we need to use NIC splitters on the trade server and capture input/output from both NICs. I have done this most recently with a TipOFF device (with a Grandmaster clock and Solarflare PTP 10GE NICs, which have CPU independent host clock adjustment). Here you measure reasonably accurate wire to wire latency. This is the best way to compare tick to trade performance for a system as a whole and thus bypass the smoke and mirrors from component service providers.

I have seen people working on ultra low latency FX systems state that the P99.9999 is most important because the arb opportunities are 1 in 1,000,000 and they believe that the opportunity at the exchange will occur just when the trading system has jitter … of course it was completely uncorroborated and personally I think it was tosh. Arb ops are likely to happen during peak loads, so it's critical that the system has throughput that can cope with max load without, say, causing hidden packet backup in network buffers (beware dodgy TCP partial stack implementations that are likely to fail during network congestion / peaks).

Key measurements in my mind are the P50, P90, P95, P99 … with P95 being the most important. I am sure there will be plenty of people who disagree. But at the end of the day the only really important stat is how much money you make !! 
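For reference, here is a minimal sketch of pulling P50, P90, P95 and P99 out of a set of latency samples. It uses the nearest-rank definition of a percentile, which is my choice for the example (other definitions interpolate between samples). The class name is hypothetical and the samples are synthetic.

```java
import java.util.Arrays;

// Hypothetical helper; nearest-rank percentiles over recorded latency samples.
final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample such that at least pct% of samples are <= it. */
    static long percentile(long[] samplesNanos, double pct) {
        if (samplesNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (pct <= 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("pct must be in (0, 100]");
        }
        long[] sorted = samplesNanos.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Synthetic samples: 1..100 ns in reverse order.
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = 100 - i;
        }
        for (double pct : new double[] {50, 90, 95, 99}) {
            System.out.println("P" + (int) pct + " = " + percentile(samples, pct) + " ns");
        }
    }
}
```

Sorting a copy on every call is fine for post-run analysis. For anything on a live box you would want a preallocated histogram instead, so the measurement itself produces no garbage.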

Note: ensure that any monitoring processes on trade servers are lightweight, do NO GC and do not impact the trading system's performance. Run the testbed with and without them for a few days to see the impact.