Saturday, 23 May 2015

Coding for Ultra Low Latency

For a system to operate as fast as possible, every line of code needs to be optimal. If you take the approach of writing lazy code and optimising later, you will end up rewriting everything. A profiler won't help you at the nanosecond level; the overhead of running with profiler metrics will have you "chasing your tail"!

Writing optimal code from the start of a project is easy: set up coding standards and enforce them. Have a simple set of guidelines that everyone follows.

Minimise synchronisation
The synchronized keyword used to be really slow and was avoided, with more complex lock classes used in preference. With the advent of under-the-cover lock spinning this is no longer the case. That said, even when a lock is uncontended you still pay the cost of read and write memory barriers, so use synchronized only where it is absolutely needed, i.e. where you have real concurrency.
The key here is application design: make components single-threaded and achieve throughput via concurrent instances which are independent and require no synchronisation.
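That design can be sketched as follows (hypothetical names, and the blocking queues are just for the sketch; a real system would more likely use lock-free ring buffers at the boundaries): work is partitioned by key across independent single-threaded workers, so each worker owns its own state and needs no synchronisation of its own.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ShardedWorkers {
    // Route each instrument to a fixed shard so exactly one thread owns its state.
    public static int shardFor(long instrumentId, int shards) {
        return (int) (instrumentId % shards);
    }

    public static void main(String[] args) throws InterruptedException {
        final int SHARDS = 2;
        @SuppressWarnings("unchecked")
        final BlockingQueue<Long>[] queues = new BlockingQueue[SHARDS];
        final long[] counts = new long[SHARDS];   // each slot touched by one worker only
        final Thread[] workers = new Thread[SHARDS];

        for (int i = 0; i < SHARDS; i++) {
            queues[i] = new ArrayBlockingQueue<>(1024);
            final int shard = i;
            workers[i] = new Thread(() -> {
                try {
                    long id;
                    while ((id = queues[shard].take()) >= 0) {
                        counts[shard]++;          // private state: no lock, no CAS
                    }
                } catch (InterruptedException ignored) { }
            });
            workers[i].start();
        }

        for (long id = 0; id < 100; id++) queues[shardFor(id, SHARDS)].put(id);
        for (BlockingQueue<Long> q : queues) q.put(-1L);   // poison pill ends each worker

        long total = 0;
        for (int i = 0; i < SHARDS; i++) { workers[i].join(); total += counts[i]; }
        System.out.println(total);   // 100: every event handled without user-level locking
    }
}
```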
Minimise use of volatile variables
Understand how your building blocks work, e.g. AtomicInteger and ConcurrentHashMap.
Only use concurrent techniques for the code that needs to be concurrent.
Minimise use of CAS operations
An efficient atomic operation which bypasses the O/S and is implemented by a CPU instruction. However, making it atomic and consistent incurs a memory barrier, hurting cache effectiveness. So use it where needed and not where not!
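As a concrete example (a sketch, not SMT code): a lock-free clamped counter built on AtomicLong.compareAndSet. Each CAS maps to a single CPU instruction, but still carries that barrier cost, which is why it belongs only on genuinely contended paths.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Add delta but never exceed max; lock-free via a CAS retry loop.
    public long addClamped(long delta, long max) {
        long cur, next;
        do {
            cur  = value.get();
            next = Math.min(cur + delta, max);
        } while (!value.compareAndSet(cur, next));   // retry only if another thread raced us
        return next;
    }
}
```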
Avoid copying objects unnecessarily
I see this A LOT, and the overhead soon mounts up.
The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue is that static state is shared across instances running in separate threads, and so forces synchronisation.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, they used abstract concepts with object values stored in maps. Not only do these systems run slowly, they lack compile-time safety and are simply a pain. Use maps where they are needed … e.g. a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
Presize collections
Understand the cost of growing collections: a HashMap, for example, has to create a new array of double the size and then rehash all its elements, an expensive operation when the map has grown to hundreds of thousands of entries. Make initial sizes configurable.
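For example (a sketch assuming HashMap's default 0.75 load factor), a helper that picks an initial capacity large enough to avoid any resize for an expected entry count:

```java
import java.util.HashMap;
import java.util.Map;

public class Presize {
    // Smallest initial capacity that holds expectedEntries without a resize,
    // given HashMap's default load factor of 0.75.
    public static <K, V> Map<K, V> newMap(int expectedEntries) {
        int capacity = (int) (expectedEntries / 0.75f) + 1;
        return new HashMap<>(capacity);
    }
}
```

Driving expectedEntries from configuration keeps the growth cost out of the trading day.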
Reuse heuristics
At the end of the day, write out the size of all collections. The next time the process is bounced, presize them to the previously stored maximums.
Generate other metrics too, such as the number of orders created, hit percentage, and max tick rate per second … figures that can be used to understand performance and give context to unexpected latency.
Use Object Orientation
Avoiding object orientation for fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro, end-to-end scale, what's the impact? In Java all instance methods are virtual by default, but the JIT compiler knows which classes are currently loaded and can not only avoid a vtable lookup but also inline the code. The benefit of object orientation is huge: component reuse and extensibility make it easy to extend and create new strategies without swathes of cut-and-paste code.
Use final keyword everywhere
Help the JIT compiler optimise. If in future a method or class needs extending, you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Huge methods will never be JIT-compiled; big complex methods may be compiled, but the compiler may end up recompiling them again and again trying to optimise. David Straker wrote "KISS" on the board and I never forgot it! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives, using long over Long, and thus avoid any auto-boxing overhead (turn the auto-boxing warning on in your IDE).
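A small illustration (hypothetical methods): both loops compute the same sum, but the boxed version allocates a new Long on most iterations (anything outside the small integer cache), while the primitive version allocates nothing.

```java
public class Boxing {
    public static long sumBoxed(int n) {
        Long total = 0L;               // each += unboxes, adds, and boxes a fresh Long
        for (int i = 0; i < n; i++) total += i;
        return total;
    }

    public static long sumPrimitive(int n) {
        long total = 0L;               // stays a primitive: no allocation at all
        for (int i = 0; i < n; i++) total += i;
        return total;
    }
}
```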
Avoid Immutables
Immutable objects are fine when long-lived, but cause GC for anything else … e.g. a trading system consuming market data would GC every second if each tick created an immutable POJO.
Avoid String
String is immutable and a big no-no for ultra low latency systems. In SMT I have ZString, an immutable "string-like" interface, with ViewString and ReusableString as concrete implementations.
Avoid Char
Use byte and byte[] and avoid translation between byte and char on every I/O operation.
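For instance (a sketch, hypothetical method), a numeric field arriving off a socket can be parsed straight from the byte[] without ever materialising a char[] or String. It assumes well-formed ASCII; real code would validate the input.

```java
public class AsciiParse {
    // Parse a signed ASCII integer directly from the wire buffer.
    public static long parseLong(byte[] buf, int offset, int len) {
        boolean neg = buf[offset] == '-';
        long v = 0;
        for (int i = offset + (neg ? 1 : 0); i < offset + len; i++) {
            v = v * 10 + (buf[i] - '0');   // byte arithmetic, no char conversion
        }
        return neg ? -v : v;
    }
}
```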
Avoid temp objects
Objects take time to construct and initialise. Consider reusing instance variables instead (provided the instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows the invoking code to avoid object creation and to reuse instances where appropriate.


String str = order.toString();    // the API forces construction of a temporary String

Versus

_str.reset();                     // a reusable "working" instance variable
order.toString( _str );           // buffer passed into the method, so no temp objects required
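A toy version of that pattern (these classes are illustrative only; SMT's real ZString/ReusableString are considerably richer): the caller owns a reusable buffer and the object writes into it.

```java
public class ReuseDemo {
    public static final class ReusableBuf {
        private final StringBuilder sb = new StringBuilder(128);
        public ReusableBuf reset()          { sb.setLength(0); return this; }
        public ReusableBuf append(String s) { sb.append(s); return this; }
        public ReusableBuf append(long v)   { sb.append(v); return this; }
        @Override public String toString()  { return sb.toString(); }  // inspection only: allocates
    }

    public static final class Order {
        public long id;
        public long qty;
        // Caller supplies the buffer, so this method creates no temporary objects.
        public void toString(ReusableBuf out) {
            out.append("ORD ").append(id).append(" x").append(qty);
        }
    }
}
```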

Don’t make everything reusable
Only where the objects would otherwise cause GC.
Object reuse comes with a risk of corruption; a key goal of Java was to avoid those nasty bugs.
Unfortunately for ultra low latency it's not optional: you have to reuse objects (remember there are places in the Java class libraries that already use pools and reuse).
Avoid finalize
Objects which hold resources such as files and sockets should attempt to shut down cleanly and not rely on finalisers. Add explicit open and close methods, and add shutdown handlers to close cleanly where possible.
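A minimal sketch of the explicit lifecycle (hypothetical class): implement AutoCloseable so try-with-resources, or a registered shutdown hook, closes the resource deterministically instead of waiting on a finaliser that may never run.

```java
public class Conn implements AutoCloseable {
    private boolean open;

    public Conn open()            { open = true;  return this; }  // explicit open
    @Override public void close() { open = false; }               // explicit close; no finalize()
    public boolean isOpen()       { return open; }
}
```

Typical use is try (Conn c = new Conn().open()) { … } for scoped resources, plus Runtime.getRuntime().addShutdownHook(…) for orderly close of long-lived ones.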
Avoid threadlocal
Every ThreadLocal get involves a map lookup keyed on the current thread, so only use it where really needed.
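When a ThreadLocal genuinely is needed, hoist the get() out of the hot path; this sketch (hypothetical names) pays the per-thread map lookup once and then works through a plain local reference.

```java
public class TlDemo {
    private static final ThreadLocal<StringBuilder> BUF =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    public static String format(long id) {
        StringBuilder sb = BUF.get();   // one map lookup, then reuse the local ref
        sb.setLength(0);                // reset the per-thread buffer, no allocation
        sb.append("id=").append(id);
        return sb.toString();
    }
}
```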
24 * 7
Design your systems to run 24 * 7 …. common in the 80s and 90s, less so now in finance.