Sunday, 26 April 2015

Holistic Latency for Ultra Low Latency Systems



Definition of holistic

"characterized by the belief that the parts of something are intimately interconnected and explicable only by reference to the whole."

It's crucial to understand all the components that make up a trading system, as each has its own latency characteristics.

| Layer | Sample Factor | Mitigation | Effort |
|---|---|---|---|
| Software Program | Design | Key design patterns / extensible framework / efficient code | Hard |
| Language | Inherent | Compile time args / run time args | Easy |
| | GC | Object Pooling | Medium |
| | JIT | Warmup Code | High |
| Operating System | Scheduler | Thread Affinity / Core Spinning | Easy |
| | Hard Page Fault | Appropriate memory in server, Process sizing | Easy |
| | Network Stack | Kernel Bypass Drivers and Tuning, Socket spinning | Easy |
| Hardware | CPU | Disable H/T, get fastest CPU, Overclock | Easy |
| | Memory | Buy memory with lowest latency and ensure enough | Easy |
| | Network | Buy Solarflare NIC | Easy |
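
To make the "Object Pooling" mitigation for GC a little more concrete, here is a minimal sketch of the idea: pre-allocate mutable event objects once and recycle them, so the hot path allocates nothing. The `TickPool` and `Tick` names and fields are hypothetical, chosen only for illustration, and the pool is deliberately single-threaded; a real system would need to decide who owns each object and when it is safe to release it.

```java
import java.util.ArrayDeque;

/** Minimal single-threaded object pool (illustrative sketch, NOT thread-safe). */
public final class TickPool {

    /** Hypothetical mutable event type, reused instead of reallocated. */
    public static final class Tick {
        public long instrumentId;
        public long priceTicks;
        public long quantity;

        void reset() {
            instrumentId = 0;
            priceTicks = 0;
            quantity = 0;
        }
    }

    private final ArrayDeque<Tick> free;

    public TickPool(int size) {
        free = new ArrayDeque<>(size);
        // Allocate everything up front, before the latency-sensitive phase.
        for (int i = 0; i < size; i++) {
            free.push(new Tick());
        }
    }

    /** Takes a pooled instance; falls back to allocation only if the pool is empty. */
    public Tick acquire() {
        Tick t = free.poll();
        return t != null ? t : new Tick();
    }

    /** Clears the instance and returns it to the pool for reuse. */
    public void release(Tick t) {
        t.reset();
        free.push(t);
    }

    public int available() {
        return free.size();
    }
}
```

The fallback allocation in `acquire()` is one possible design choice; logging or failing fast when the pool runs dry is another, since an empty pool usually means the sizing was wrong.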

I have built four rack servers with a box full of Chelsio, Mellanox and Solarflare NICs. By far the easiest to install, easiest to tune and best performing was Solarflare. I was really disappointed in the Mellanox cards. Solarflare's OpenOnload provides one-sided acceleration suitable for colocation purposes, at no extra cost. This was several years ago, so maybe Mellanox have their own one-sided acceleration now, but for me it's come too late. I have preached Solarflare NICs to everyone I know.

Consider the following scenario:

Read next packet from socket
Decode market data tick into exchange-normalised event
Log event
Place event into queue for async consumption

How much benefit will the end system see from saving 20 nanoseconds by switching the queue from a ConcurrentLinkedQueue to a RingBuffer (e.g. Disruptor)? Will it be twice as quick? No. What about how you read the packet off the socket? What about the log event? What about the queue size characteristics?
I have seen FX systems with man-years of effort put into latency optimisation that didn't even use OpenOnload for their Solarflare cards! Which carries the higher risk: using OpenOnload, or the code a 10-man team has written over 3 years?
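
To put the 20 nanosecond question in proportion, here is a back-of-envelope sketch that expresses the saving as a fraction of the whole scenario. The per-stage costs below are ASSUMED placeholder numbers, not measurements from any real system; the point is only the arithmetic, so substitute your own wire-to-wire figures.

```java
public final class HolisticSaving {

    /** Fraction of end-to-end latency removed by shaving savedNanos off one stage. */
    public static double fractionSaved(long[] stageNanos, long savedNanos) {
        long total = 0;
        for (long n : stageNanos) {
            total += n;
        }
        return (double) savedNanos / total;
    }

    public static void main(String[] args) {
        // ASSUMED, purely illustrative stage costs in nanoseconds.
        long[] stages = {
            5_000, // read next packet from socket (kernel network stack)
            500,   // decode tick into normalised event
            2_000, // log event
            100    // place event into queue
        };
        double f = fractionSaved(stages, 20);
        System.out.printf("20 ns saved = %.2f%% of end-to-end latency%n", f * 100);
    }
}
```

With these made-up numbers the queue swap buys roughly a quarter of one percent, while the socket read and the log event dominate the total. Your own numbers will differ, but the shape of the argument is the same: measure the whole path before optimising one part of it.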

You must understand the key use cases for your system where latency is important, then create end-to-end, repeatable bench tests in a fully controlled environment that reflects production. I suggest wire-to-wire timings with PTP Solarflare NICs. Alternatively, use two servers (one simulation, one trading) with dual Solarflare NICs; in this case you don't need PTP (I will cover this in a later blog with the JNI code I put together).

Micro-benchmarks must be used with care. They can give a good comparison of performance against other implementations, but won't necessarily produce production system gains given all the variables in play. For example, in the above scenario consider what happens when a fixed-size queue fills up. More on bench testing another day.
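
As a small illustration of the full-queue case, the sketch below pushes events into a bounded queue that nobody is draining. It uses the JDK's `ArrayBlockingQueue` as a stand-in for any fixed-size queue (it is not the Disruptor), and the class and method names are hypothetical. A micro-benchmark where the consumer always keeps up would never exercise this path.

```java
import java.util.concurrent.ArrayBlockingQueue;

public final class FullQueueDemo {

    /** Offers `events` items to a bounded queue with no consumer; returns how many were rejected. */
    public static int rejectedWhenNoConsumer(int capacity, int events) {
        ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (long i = 0; i < events; i++) {
            // offer() returns false immediately when the queue is full.
            // put() would instead block the producer, which in the scenario
            // above is the thread reading packets off the socket.
            if (!queue.offer(i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected = " + rejectedWhenNoConsumer(1024, 1500));
    }
}
```

Either outcome, dropped events or a stalled reader, is a design decision the bench test has to cover; an unbounded queue avoids both but trades them for allocation and unbounded memory growth.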