For years people have said: optimise last, don't worry, you can fix it later.
For ultra low latency it's simply not true. Every line of code should be written with optimisation in mind. When writing SubMicroTrading I wrote micro benchmarks to compare performance at a micro level. I tested pooling patterns, different queue implementations, and the impact of inheritance and generics. Some of the results were surprising. The one which stands out as misleading was the benchtest of generic pools versus concrete pools. Theoretically generics should have no runtime impact, but on a PC I had a micro benchmark which showed some latency spikes. Because of this I made my code generator produce a discrete pool and recycle class for each type. Of course, when I reran the benchmarks on Linux with tuned JVM parameters there was zero difference!
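To give a flavour, here is a minimal, hypothetical sketch of that kind of pooling micro benchmark: a generic pool against a hand-written concrete pool, timed over repeated get/recycle cycles. The class names are illustrative, not the SMT ones.

```java
// Hypothetical sketch of a pooling micro benchmark - not the SMT code.
import java.util.ArrayDeque;

public class PoolBench {

    static class Order { long qty; long price; }          // sample pooled type

    // generic pool using a factory callback (subject to type erasure)
    static class GenericPool<T> {
        private final ArrayDeque<T> free = new ArrayDeque<>();
        private final java.util.function.Supplier<T> factory;
        GenericPool(java.util.function.Supplier<T> factory) { this.factory = factory; }
        T get()           { T t = free.poll(); return (t != null) ? t : factory.get(); }
        void recycle(T t) { free.push(t); }
    }

    // concrete pool of the kind a code generator could emit - no generics, no casts
    static class OrderPool {
        private final ArrayDeque<Order> free = new ArrayDeque<>();
        Order get()           { Order o = free.poll(); return (o != null) ? o : new Order(); }
        void recycle(Order o) { free.push(o); }
    }

    static long benchGeneric(GenericPool<Order> pool, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Order o = pool.get();
            o.qty = i;
            pool.recycle(o);
        }
        return System.nanoTime() - start;
    }

    static long benchConcrete(OrderPool pool, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Order o = pool.get();
            o.qty = i;
            pool.recycle(o);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int warmup = 5, runs = 10, iterations = 1_000_000;
        GenericPool<Order> generic = new GenericPool<>(Order::new);
        OrderPool concrete = new OrderPool();
        // warm up so the JIT has compiled both paths before measuring
        for (int i = 0; i < warmup; i++) { benchGeneric(generic, iterations); benchConcrete(concrete, iterations); }
        for (int i = 0; i < runs; i++) {
            System.out.printf("run %d generic %d ns, concrete %d ns%n",
                    i, benchGeneric(generic, iterations), benchConcrete(concrete, iterations));
        }
    }
}
```

As the story above shows, the result of a benchtest like this can differ completely between a desktop PC and a tuned Linux box, so always rerun on the target environment before acting on it.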
It is absolutely essential to understand the key use cases and to have a controlled benchtest setup which you can guarantee is not in use by anything else when testing.
Colocation OMS
The first use case for SubMicroTrading was as a colocated normalising order management system with client risk checks. The OMS received client orders, validated them, ran risk checks, then sent a market order to the exchange. The OMS normalised all exchange eccentricities and allowed for easy client customisation. This was before IB clients were allowed sponsored access; with sponsored access the need for ultra low latency order management systems disappeared … bad timing on my part.
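As a rough illustration of that flow, here is a hypothetical sketch (not the SMT classes) of a validate, risk check, then normalise-and-send pipeline. It also shows the timestamps discussed below being carried on the outbound order.

```java
// Hypothetical sketch of the colocation OMS flow: validate the client order,
// run risk checks, then send a normalised market order to the exchange.
// All names are illustrative, not the SMT implementation.
class ClientOrder { long clOrdId; String symbol; long qty; double price; long clientNanoTS; }
class MarketOrder { long clOrdId; String exchangeSymbol; long qty; double price;
                    long clientNanoTS; long nanosInOMS; }

interface ExchangeSession { void send(MarketOrder order); }
interface RiskChecker     { boolean check(ClientOrder order); }

class SimpleOMS {
    private final ExchangeSession exchange;
    private final RiskChecker     risk;

    SimpleOMS(ExchangeSession exchange, RiskChecker risk) {
        this.exchange = exchange;
        this.risk     = risk;
    }

    void onClientOrder(ClientOrder o) {
        long entry = System.nanoTime();
        if (o.qty <= 0 || o.price <= 0) { reject(o, "invalid order");     return; }
        if (!risk.check(o))             { reject(o, "risk check failed"); return; }

        MarketOrder m  = normalise(o);               // map to exchange-specific fields
        m.clientNanoTS = o.clientNanoTS;             // propagate the client simulator stamp
        m.nanosInOMS   = System.nanoTime() - entry;  // total time spent inside the OMS
        exchange.send(m);
    }

    private MarketOrder normalise(ClientOrder o) {
        MarketOrder m    = new MarketOrder();        // real code would take this from a pool
        m.clOrdId        = o.clOrdId;
        m.exchangeSymbol = o.symbol;                 // real code maps symbology per exchange
        m.qty            = o.qty;
        m.price          = o.price;
        return m;
    }

    private void reject(ClientOrder o, String reason) { /* send reject back to the client */ }
}
```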
With this use case I ran a client simulator and an exchange simulator on the sim server, and the OMS on the trade server. The client simulator stamped a nanosecond timestamp on the outgoing client order. When the OMS generated a market order it stamped the order with the total time spent within the OMS as well as the original client order timestamp from the client simulator. The exchange simulator and client simulator both use thread affinity to the same core, so the nanosecond timestamps can be used to determine a realistic end to end time. You can guesstimate the time in the NICs and network by:
TimeInNICsAndNetwork = ( timeMarketOrderReceived - clientOrderTimeStamp ) - timeInOMS
There were 4 NIC TCP hops, so roughly:

(rough) NIC TCP hop = TimeInNICsAndNetwork / 4
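A minimal sketch of that calculation as the exchange simulator might perform it (field and method names are illustrative):

```java
// Hypothetical sketch of the latency decomposition on the exchange simulator.
// clientOrderTimeStamp and nanosInOMS are carried on the market order, as above.
class LatencyBreakdown {
    static void report(long timeMarketOrderReceived, long clientOrderTimeStamp, long nanosInOMS) {
        long endToEnd             = timeMarketOrderReceived - clientOrderTimeStamp;
        long timeInNICsAndNetwork = endToEnd - nanosInOMS;
        long roughPerTcpHop       = timeInNICsAndNetwork / 4;   // 4 NIC TCP hops in this setup

        System.out.printf("endToEnd=%dns oms=%dns nic+net=%dns perHop~%dns%n",
                endToEnd, nanosInOMS, timeInNICsAndNetwork, roughPerTcpHop);
    }
}
```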
Because the timestamps are generated on one host (indeed on one core of one CPU), you don't need expensive PTP time synchronisation. Using System.nanoTime to measure latency is not recommended: on my hardware it used the HPET timer, which showed aberrations of many milliseconds on long test runs. RDTSC seems to be more accurate but can have problems measuring across cores (I will include the code in a later blog on JNI).
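Whatever timer you use, it is worth sanity checking it on the target box first. Here is a hypothetical sketch that flags unusually large gaps between successive System.nanoTime calls on an otherwise idle, pinned core:

```java
// Hypothetical sketch: spot large jumps between successive System.nanoTime calls.
// On a quiet, pinned core most deltas are tens of nanoseconds; a multi-millisecond
// delta suggests scheduling jitter or a timer aberration worth investigating.
public class NanoTimeJitterCheck {
    public static void main(String[] args) {
        final long thresholdNs = 100_000;          // report anything over 100us
        long prev = System.nanoTime();
        for (long i = 0; i < 1_000_000_000L; i++) {
            long now   = System.nanoTime();
            long delta = now - prev;
            if (delta > thresholdNs) {
                System.out.printf("iteration %d: gap of %d ns%n", i, delta);
            }
            prev = now;
        }
    }
}
```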
To be honest, however, at this level all we are really concerned about is having a repeatable testbed. If we change some parameter, what's the impact? Run the benchtest several times and measure the delta; now we can see whether we have made things better or worse. Note that now Solarflare have added a packet capture facility, I would hope that capturing input/output packets would allow accurate performance measurements.
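A hypothetical sketch of that rerun-and-compare loop, reporting the spread across runs so a change can be judged against run-to-run noise rather than a single figure:

```java
// Hypothetical sketch of "run the benchtest several times and measure the delta".
import java.util.Arrays;
import java.util.function.LongSupplier;

public class DeltaRunner {
    // benchtest returns a single latency figure per run (e.g. median one-way latency in ns)
    static void run(String label, LongSupplier benchtest, int runs) {
        long[] results = new long[runs];
        for (int i = 0; i < runs; i++) results[i] = benchtest.getAsLong();
        Arrays.sort(results);
        System.out.printf("%s: min=%d mid=%d max=%d spread=%d ns%n",
                label, results[0], results[runs / 2], results[runs - 1],
                results[runs - 1] - results[0]);
    }
}
```

Run it once for the baseline configuration and once for the changed one; only a difference well outside the observed spread is worth believing.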
Algo Trading System: Tick To Trade
The second use case was for Tick to Trade. Here I wrote an algo container, an optimal book manager and market data session handling using the SMT framework.
NIC1 was used for
market data, and NIC2 for the trade connection.
In this use case, the simulation server replays captured market data using tcpreplay at various rates (from x1 up to flat out, which was around 800,000 pps).
The trade server receives the market data through NIC1 into the market data session(s). CME has so many sessions that I wrote a dynamic session generator to lazily create sessions based on subscribed contracts. The market data is consumed by the algos, which can then send orders to the exchange session over NIC2. Each order is sent with a marker field identifying the tick that generated it, allowing accurate correlation of an order with the tick that really triggered it.
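A hypothetical sketch of the marker field idea (names are illustrative, not the SMT code): the algo copies the triggering tick's sequence number onto the outbound order so the exchange simulator can pair each order with the exact tick that produced it.

```java
// Hypothetical sketch of tagging an order with the tick that triggered it.
class Tick      { long seqNum; long exchangeNanoTS; double price; }
class AlgoOrder { long qty; double price; long triggerTickSeqNum; }

class TickToTradeAlgo {
    private final java.util.function.Consumer<AlgoOrder> exchangeSession;   // NIC2 side

    TickToTradeAlgo(java.util.function.Consumer<AlgoOrder> exchangeSession) {
        this.exchangeSession = exchangeSession;
    }

    void onTick(Tick tick) {
        if (shouldTrade(tick)) {
            AlgoOrder order = new AlgoOrder();        // real code would take this from a pool
            order.qty   = 1;
            order.price = tick.price;
            order.triggerTickSeqNum = tick.seqNum;    // marker field for correlation
            exchangeSession.accept(order);
        }
    }

    private boolean shouldTrade(Tick tick) { return tick.price > 0; }   // placeholder signal
}
```

On the simulator side, matching triggerTickSeqNum against the replayed capture gives the tick-to-trade time for every order, not just an average.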
Hybrid FPGA components, market data providers and exchange adapters all need to be considered holistically within the trading system as a whole for its KEY use cases. Benchtest them not at the rate of a quiet day but at the maximum rate you may need … consider the highest market spike rates, and then try normal market rate x1, x5, x10, x100 to understand the different performance levels.
For really accurate figures we need to use NIC splitters on the trade server and capture input/output from both NICs. I have done this most recently with a TipOFF device (with a grandmaster clock and Solarflare PTP 10GE NICs, which have CPU independent host clock adjustment). This way you measure reasonably accurate wire to wire latency. It is the best way to compare tick to trade performance for the system as a whole, and thus bypass the smoke and mirrors from component service providers.
I have seen people working on ultra low latency FX systems state that P99.9999 is the most important statistic, because the arb opportunities are 1 in 1,000,000 and they believe the opportunity at the exchange will occur exactly when the trading system has any jitter … of course it was completely uncorroborated and personally I think it was tosh. Arb opportunities are likely to happen during peak loads, so it's critical that the system has enough throughput to cater for maximum load without, say, causing hidden packet backup in network buffers (beware dodgy partial TCP stack implementations that are likely to fail during network congestion / peaks).
Key measurements in my mind are the P50, P90, P95 and P99 … with P95 being the most important. I am sure there will be plenty of people who disagree, but at the end of the day the only really important stat is how much money you make!!
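For reference, a minimal sketch of computing those percentile figures from a run's latency samples (nearest-rank method; not the SMT reporting code):

```java
// Hypothetical sketch: nearest-rank percentiles over a sorted copy of the samples.
import java.util.Arrays;

public class Percentiles {
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil((p / 100.0) * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    static void report(long[] latenciesNs) {
        long[] sorted = latenciesNs.clone();
        Arrays.sort(sorted);
        System.out.printf("P50=%d P90=%d P95=%d P99=%d ns%n",
                percentile(sorted, 50), percentile(sorted, 90),
                percentile(sorted, 95), percentile(sorted, 99));
    }
}
```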
Note: ensure that any monitoring processes on the trade servers are lightweight, do NO GC and do not impact the trading system's performance. Run the testbed with and without them for a few days to see the impact.
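As an illustration of what "lightweight and no GC" can mean in practice, a hypothetical sketch (not the SMT code) of an allocation-free latency recorder that a monitor could poll without disturbing the trading thread:

```java
// Hypothetical sketch of a GC-free latency recorder: samples go into a buffer
// preallocated once, so the hot path performs no allocation and no boxing.
public class GcFreeLatencyRecorder {
    private final long[] samples;          // preallocated once, never resized
    private int next;

    public GcFreeLatencyRecorder(int capacity) {
        samples = new long[capacity];
    }

    // called on the trading thread: one array store, nothing for the GC to do
    public void record(long latencyNs) {
        samples[next] = latencyNs;
        next = (next + 1) % samples.length;
    }

    // called by the monitoring process/thread: copies into a buffer the monitor owns
    public void snapshot(long[] dest) {
        System.arraycopy(samples, 0, dest, 0, Math.min(dest.length, samples.length));
    }
}
```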