Saturday, 23 May 2015

Coding for Ultra Low Latency

For a system to operate as fast as possible, every line of code needs to be optimal. If you take the approach of writing lazy code and optimising later, you will end up rewriting everything. A profiler won't help you at the nanosecond level; the overhead of running with profiler metrics will have you "chasing your tail"!

Writing optimal code from the start of the project is easy: set up coding standards and enforce them. Have a simple set of guidelines that everyone follows.

Minimise synchronisation
The synchronized keyword used to be really slow and was avoided, with more complex lock classes used in preference. But with the advent of under-the-cover lock spinning this is no longer the case. That said, even if the lock is uncontended you still pay the overhead of read and write memory barriers. So use synchronized where it's absolutely needed, ie where you have real concurrency.
Key here is application design: you want components to be single threaded, achieving throughput via concurrent instances which are independent and require no synchronisation (see the sketch below).
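A minimal sketch of that partitioned design (hypothetical names; a real low latency system would use a lock-free queue for the handoff rather than ArrayBlockingQueue):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch: N independent single-threaded workers, events partitioned by
    // instrument id so worker state is never shared and needs no locks
    public final class ShardedDispatcher {
        private final BlockingQueue<Runnable>[] _queues;

        @SuppressWarnings("unchecked")
        public ShardedDispatcher( int numWorkers ) {
            _queues = new BlockingQueue[ numWorkers ];
            for ( int i = 0; i < numWorkers; i++ ) {
                final BlockingQueue<Runnable> q = new ArrayBlockingQueue<>( 1024 );
                _queues[ i ] = q;
                Thread t = new Thread( () -> {
                    try {
                        while ( true ) q.take().run();   // single consumer per queue
                    } catch ( InterruptedException e ) {
                        Thread.currentThread().interrupt();
                    }
                }, "worker-" + i );
                t.setDaemon( true );
                t.start();
            }
        }

        // the same instrument always routes to the same worker thread
        public void dispatch( long instrumentId, Runnable event ) throws InterruptedException {
            _queues[ (int) Math.floorMod( instrumentId, (long) _queues.length ) ].put( event );
        }
    }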
Minimise use of volatile variables
Understand how your building blocks work eg AtomicInteger, ConcurrentHashMap.
Only use concurrent techniques for the code that needs to be concurrent.
Minimise use of CAS operations
CAS is an efficient atomic operation that bypasses the O/S and is implemented by a CPU instruction. However, making it atomic and consistent incurs a memory barrier, hurting cache effectiveness. So use it where needed, and not where not! A quick illustration follows.
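For illustration, a classic CAS retry loop, roughly what AtomicLong.incrementAndGet did under the covers in older JDKs:

    import java.util.concurrent.atomic.AtomicLong;

    // Each compareAndSet maps to a single CPU instruction but carries full
    // memory-barrier semantics, so keep it off genuinely single-threaded paths
    static long nextSeqNum( AtomicLong seq ) {
        long cur;
        do {
            cur = seq.get();
        } while ( !seq.compareAndSet( cur, cur + 1 ) );
        return cur + 1;
    }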
Avoid copying objects unnecessarily
I see this A LOT and the overhead can soon mount up.
The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue comes from the required concurrency of shared state across instances running in separate threads.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, they used abstract concepts with object values stored in maps. Not only do these systems run slowly, but they lack compile time safety and are simply a pain. Use maps where they are needed … eg a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
Presize collections
Understand the cost of growing collections: eg a HashMap has to create a new array double the size and then rehash its elements, an expensive operation when the map is growing into hundreds of thousands of entries. Make initial sizes configurable, as in the sketch below.
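A minimal sketch (the property name and Order type are hypothetical): size the backing array up front, accounting for HashMap's default 0.75 load factor, so the map never rehashes on the hot path.

    // presize from config so the map never rehashes while growing
    int expectedOrders = Integer.getInteger( "smt.presize.orders", 200_000 );  // one-off config read
    Map<Long, Order> ordersById = new HashMap<>( (int) ( expectedOrders / 0.75f ) + 1 );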
Reuse heuristics
At the end of the day, write out the size of all collections. Next time the process is bounced, resize to the previously stored max (as sketched below).
Generate other metrics like number of orders created, hit percentage, max tick rate per second … figures that can be used to understand performance and give context to unexpected latency.
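A minimal sketch of persisting the heuristics (file and property names are hypothetical):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;

    // at shutdown, persist the max size each collection reached so the next
    // run can presize to the previous max
    void storeSizingHeuristics( int maxOrdersSeen, int maxBooksSeen ) throws IOException {
        Properties sizing = new Properties();
        sizing.setProperty( "orders.maxSize", Integer.toString( maxOrdersSeen ) );
        sizing.setProperty( "books.maxSize",  Integer.toString( maxBooksSeen ) );
        try ( OutputStream out = Files.newOutputStream( Paths.get( "sizing.properties" ) ) ) {
            sizing.store( out, "collection sizing heuristics" );
        }
    }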
Use Object Orientation
Avoiding object orientation due to fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro end-to-end scale what's the impact? In Java all methods are virtual, but the JIT compiler knows what classes are currently loaded and can not only avoid a vtable lookup but can also inline the code. The benefit of object orientation is huge: component reuse and extensibility make it easy to extend and create new strategies without swathes of cut-and-paste code.
Use final keyword everywhere
Help the JIT compiler optimise. If in future a method or class needs extending, you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Really big methods will never be compiled; big complex methods may be compiled, but the compiler may end up recompiling and recompiling the method to try and optimise. David Straker wrote "KISS" on the board and I never forgot it! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives and use long over Long, thus avoiding any auto boxing overhead (and turn the auto boxing warning on).
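For example, assuming a long[] of tick values, the boxed version allocates a new Long on nearly every iteration:

    long total = 0;                            // primitive accumulation: no allocation
    for ( int i = 0; i < ticks.length; i++ ) {
        total += ticks[ i ];
    }

    Long boxedTotal = 0L;                      // boxed: each += unboxes then boxes a new Long
    for ( int i = 0; i < ticks.length; i++ ) {
        boxedTotal += ticks[ i ];              // only small cached values avoid the allocation
    }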
Avoid Immutables
Immutable objects are fine for long-lived objects, but cause GC for anything else … eg a trading system consuming market data would GC every second if each tick created an immutable POJO. A mutable, reusable tick is sketched below.
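A minimal sketch (hypothetical class) of a mutable tick updated in place rather than reallocated per update:

    // one instance reused per update instead of a fresh immutable POJO per tick
    public final class MarketTick {
        private long _price;   // fixed-point price, eg price * 10^6
        private int  _qty;

        public void set( long price, int qty ) {
            _price = price;
            _qty   = qty;
        }

        public long getPrice() { return _price; }
        public int  getQty()   { return _qty; }
    }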
Avoid String
String is immutable and is a big no-no for ultra low latency systems. In SMT I have a ZString immutable "string-like" interface, with ViewString and ReusableString concrete implementations.
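A simplified sketch of the idea (NOT the actual SMT definitions):

    // read-only byte-based "string" view, plus a reusable mutable implementation
    interface ZString {
        int  length();
        byte getByte( int idx );
    }

    final class ReusableString implements ZString {
        private byte[] _bytes = new byte[ 64 ];
        private int    _len;

        public void reset()          { _len = 0; }
        public void append( byte b ) { ensure( _len + 1 ); _bytes[ _len++ ] = b; }

        @Override public int  length()           { return _len; }
        @Override public byte getByte( int idx ) { return _bytes[ idx ]; }

        private void ensure( int cap ) {
            if ( cap > _bytes.length ) {
                _bytes = java.util.Arrays.copyOf( _bytes, Math.max( cap, _bytes.length * 2 ) );
            }
        }
    }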
Avoid Char
Use byte and byte[] and avoid translation between byte and char on every IO operation. For example, numeric fields can be parsed straight from the byte buffer:
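A minimal sketch: parse an ASCII unsigned integer directly from the byte buffer, with no byte-to-char translation and no temporary String:

    static int parseUInt( byte[] buf, int offset, int len ) {
        int val = 0;
        for ( int i = offset; i < offset + len; i++ ) {
            val = val * 10 + ( buf[ i ] - '0' );   // ASCII digit to value, no allocation
        }
        return val;
    }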
Avoid temp objects
Objects take time to construct and initialise. Consider using instance variables for reuse instead (if the instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows invoking code to avoid object creation and to reuse instances where appropriate :-


String str = order.toString();      // the API forces construction of a temporary string

Versus

_str.reset();                       // a reusable "working" instance var
order.toString( _str );             // buffer passed into the method, no temp objects required

Don’t make everything reusable
Only where otherwise the objects would cause GC.
Object reuse comes with a risk of corruption; a key goal of Java was to avoid those nasty bugs.
Unfortunately for ultra low latency it's not an option, you have to reuse objects (remember there are places in the Java class libraries that already use pools and reuse).
Avoid finalize
Objects which hold resources such as files and sockets should all attempt to shut down cleanly and not rely on finalisers. Add explicit open and close methods, and add shutdown handlers to cleanly close if possible.
Avoid threadlocal
Every threadlocal call involves a map lookup keyed off the current thread, so only use it where really needed.
24 * 7
Design your systems to run 24 * 7 …. common in the 80's and 90's, less so now in finance.


Saturday, 16 May 2015

Java Bytecode Latency Impact

In the 80's I remember building NAND circuits to represent code. It was pretty cool seeing how code could be implemented at a circuit level. What I was unsure of when I started SubMicroTrading was the performance impact of Java bytecode and whether there were any possible optimisations available.

To cut a long story short, I found only one worthwhile optimisation, and that's how a switch statement is represented in bytecode.

Consider the following switch statement :-

    switch( a ) {
    case 10 : doAStuff(); break;
    case 20 : doBStuff(); break;
    case 30 : doCStuff(); break;
    case 40 : doDStuff(); break;
    ...
    }

    is conceptually the same as

    if      ( a == 10 ) doAStuff();
    else if ( a == 20 ) doBStuff();
    else if ( a == 30 ) doCStuff();
    else if ( a == 40 ) doDStuff();

Think about that: if you are parsing FIX with 1000 possible FIX tags and an average of 20 fields per message, then a linear top-down search averages around 500 comparisons per field, ie roughly 10,000 comparisons per message. If you want to process 1,000,000 events per second, that would be 10,000,000,000 comparisons per second. (A huge number based on linear search; binary search is clearly much better … but the point is it's still a cost that can be avoided.)

Java has two bytecodes for switch statements. A lookupswitch is in effect a table of key-to-jump-label pairs … ie you have to search the table to find the correct key entry. A tableswitch is in effect a table of jump labels indexed directly by the key value (minus the table offset, ie the lowest key value in the switch statement).

For ultra low latency you should consider adding an ant task to check the bytecode for any "lookupswitch" statements. For message processing on most exchanges you can safely force a switch statement to become a tableswitch by adding packer entries so there are no gaps between the key values. In my CODEC generators I stipulate a max pack range, eg 100, and any sparse values are handled within the default statement, eg via a second switch, or an if statement if only a couple of keys are of interest. Like everything, test with real data to see the impact. For me, tableswitch made a HUGE difference; a hand-written illustration follows, then a generated sample.
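As a minimal hand-written sketch (hypothetical tags and handler methods), padding the gaps keeps the key range dense so javac emits a tableswitch:

    // hypothetical tags/handlers: cases 35..40 form a dense range, so javac
    // emits a tableswitch instead of a lookupswitch
    switch( tag ) {
    case 35: handleMsgType();  break;
    case 38: handleOrderQty(); break;
    case 40: handleOrdType();  break;
    case 36: case 37: case 39:       // packer entries filling the gaps
    default:                         // sparse / unknown tags handled here
        handleUnknownTag();
        break;
    }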

Sample tableswitch with packer case statements :-

Here is the start of the switch statement within the Standard44Decoder generated class :-

        final byte msgType = _fixMsg[ _idx ];
        switch( msgType ) {
        case '8':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeExecReport();
        case 'D':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeNewOrderSingle();
        ...
        // packers
        case '6': case '7': case ':': case ';': case '<': case '=':
        case '>': case '?': case '@': case 'B': case 'C': case 'E':
            break;


Dumping the bytecode with javap shows the resulting tableswitch :-

javap -c ./com/rr/model/generated/fix/codec/Standard44Decoder.class >s44.bc

  protected final com.rr.core.model.Message doMessageDecode();
    Code:
...
      74: tableswitch   { // 48 to 71
                    48: 549                                      // '0' ie heartbeat is the first entry
                    49: 987
                    50: 841
                    51: 768
                    52: 914
                    53: 695
                    54: 1060
                    55: 1060
                    56: 184                                        // '8' this was the first switch entry in java code
                    57: 476
                    58: 1060                                     // 1060 is same as default and are the packer entries
                    59: 1060
                    60: 1060
                    61: 1060
                    62: 1060
                    63: 1060
                    64: 1060
                    65: 622
                    66: 1060
                    67: 1060
                    68: 257
                    69: 1060
                    70: 403
                    71: 330
               default: 1060
          }
     184: aload_0
     185: getfield      #462                // Field _fixMsg:[B
     188: aload_0
     189: getfield      #466                // Field _idx:I
     192: iconst_1
     193: iadd
     194: baload
     195: iconst_1
     196: if_icmpeq     242
     199: aload_0
     200: new           #475                // class java/lang/StringBuilder
     203: dup
     204: ldc_w         #477                // String Unsupported fix message type

You could easily grep for lookupswitch … remember that over time extra case statements could be added that cause the switch to become a lookupswitch again.


PS> clearly tableswitch isn't suitable for very sparse switch statements, but in my experience it's very useful in trading systems.

Saturday, 9 May 2015

Java JVM Tuning for Ultra Low Latency

There is no JVM arg that fits all applications; the key is to have a repeatable full test bed and run full-scale benchmarks over hours, not seconds. Rinse and repeat several times for EACH arg change. The args I focus on are the ones used by SubMicroTrading, which performs no GC and has almost no JIT post warmup.

Please note some of these flags may now be on by default … sorry, I haven't checked; still worth bringing them to attention I think.

For standard Java applications which do lots of GC with mainly short-lived objects, I would recommend trying the G1 collector … for market data I found it much better than concurrent mark sweep. I will blog about that another time … I spent weeks tuning poorly designed apps (advice: don't bother, buy Zing).

Note each Java upgrade brings new options and tweaks existing performance, sometimes up, sometimes down, so re-benchmark each Java upgrade.

Treat micro benchmarks with care … I have seen a Generics micro benchmark give completely different results on a PC than on Linux (a topic to discuss another time).

Avoid biased locking … it incurred regular millisecond latency in systems I have tested.

JVM Args for Recording Jitter (JIT/GC)

-XX:+PrintCompilation
-XX:+CITime
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+LogCompilation
-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails

Rather than regurgitate what I previously googled on understanding the output from PrintCompilation, see :- http://blog.joda.org/2011/08/printcompilation-jvm-flag.html

For ultra low latency you want no GC and no JIT, so in SMT I preallocate pools and run warmup code, then invoke System.gc(). I take note of the last compilation entry, then while re-running the controlled bench test I look for new JIT output (generally recompilation). When this occurs I go back to the warmup code and find out why the routine had to be recompiled. This generally comes down to either the code not being warmed up, or the routine being too complicated for the compiler. Either add further warmup code or simplify the routine. Adding final everywhere really helps. A minimal warmup sketch follows.
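A minimal sketch of the warmup idea (names are hypothetical):

    // drive the hot path enough times to trip CompileThreshold on every routine
    // you care about, then force a single GC before the process goes live
    public static void warmup( Runnable hotPath, int iterations ) {
        for ( int i = 0; i < iterations; i++ ) {
            hotPath.run();          // must exercise every branch you care about
        }
        System.gc();                // collect warmup garbage now, not mid-session
    }

    // eg warmup( () -> decoder.decode( sampleExecReport ), 20_000 );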

Writing warmup code is a pain, and I am gutted the java.lang.Compiler.disable() method is not implemented (or at least it wasn't in OpenJDK 1.6 … empty method, doh!). Ideally I would invoke this when the application is warm and have no recompilation due to the compiler thinking it can make further optimisations.

Java can recompile and recompile; in my experience this only happens when a method is too complex. Of course, if a recompilation is due to Java having inlined a non-final method and the optimisation proving premature, then the code needs to be corrected. What I want to avoid is recompilation from edge cases that infrequently go into code branches.

Note you cannot guarantee no GC and no JIT under every situation in a complex system. What you can do is guarantee no JIT/GC for the KEY scenarios that the business demands. If a trading system does 10 million trades a day, I would set a goal of no GC/JIT under NORMAL conditions at 30 million trades, then check performance up to 100 million to see at which point jitter occurs. If, for example, the exchange disconnects you during the day and that kicks in a few milliseconds of JIT, it's not important. You don't need to pool every object … just the key ones that cause GC. More on that in a future blog on SuperPools.

I remember speaking to Gil Tene from Azul while working at Morgan Stanley, and really tried to get across how much more of a pain JIT is than GC. Some exciting developments seem to have been made with Zing and I would have been very interested in benchtesting it … alas I just don't have time at present. Very impressed with Azul and Gil and how they respond to queries and enhance their product ….. so much better than Sun/Oracle were with Java.



SubMicroTrading JVM Arguments

The following are the arguments that SubMicroTrading runs with; this includes the algo container, OMS, exchange sim and client sim.

-XX:+BackgroundCompilation
Even with this on there is still latency in switching in newly compiled routines. I really wish that switch time was much, much quicker!

-XX:CompileThreshold=1000
If you don't want to benefit from the fastest possible code given the runtime heuristics, you can force initial compilation with -Xcomp … an option if you don't want to write warmup code. This may run 10% slower, but sometimes much slower depending on the code.

-XX:+TieredCompilation
Code is initially compiled with the C1 (GUI/client) compiler, then when it reaches the invocation limit it is recompiled with the fully optimised C2 (server) compiler. The C1 compiler is much quicker to compile a routine than the C2 compiler, and this reduced some outlying latency in SMT for routines that were not compiled during warmup (eg for code paths not covered in warmup).

-XX:-ClassUnloading
Disable class unloading; I don't want any possible jitter from this. SMT doesn't use custom class loaders and tries to load all required classes during warmup.

-XX:+UseCompilerSafepoints
I had hoped that disabling compiler safepoints would reduce JIT jitter, but in SMT's multithreaded system it brings instability, so I ensure the safepoints are on ….. more jitter I don't want, ho hum.

-XX:CompileCommandFile=.hotspot_compiler
The filename used to be picked up by default, but now you have to use this command.
This is really handy: if you have a small routine you can't simplify further which causes many recompilations, prevent it by adding a line to this file, example :-

exclude sun/nio/ch/FileChannelImpl force

This means the routine won't be compiled; you need to benchmark to determine if running the routine as bytecode has a noticeable impact.

-XX:+UseCompressedOops
I kind of expected this to have a small performance overhead, but in fact it slightly improved performance … perhaps through reduced object size and fitting more instances into CPU cache.

-Xnoclassgc
Again, all classes are loaded during warmup and I don't want any possible jitter from trying to free up / unload them.

-XX:-RelaxAccessControlCheck
To be honest I have no idea why I still have this in, or even if it's still required!

-Djava.net.preferIPv4Stack=true
If you upgraded Java and spent hours working out why your socket code isn't working anymore, this could well be it … DOH!!!

-server
Don't forget this if running benchmarks on a PC.

-XX:+UseFastAccessorMethods

-XX:+UseFastJNIAccessors

-XX:+UseThreadPriorities
Not sure this is needed for SMT; I use a JNI function to hwloc routines for thread core affinity.

-XX:-UseCodeCacheFlushing

-XX:-UseBiasedLocking
Disable biased locking; this caused horrendous jitter in ultra low latency systems with discrete threading models.
Probably the single biggest cause of jitter from a JVM arg that I found.

-XX:+UseNUMA
Assuming you have a multi-CPU system, this can have a significant impact … google NUMA architecture.



JVM Arguments to experiment with … didn’t help SMT, but may help you

-XX:-DoEscapeAnalysis
Try disabling escape analysis and see what the impact is.

-Xcomp
Mentioned earlier: compiles code up front, as opposed to optimising based on runtime heuristics.
Avoids JIT jitter, but the code in general is slower than dynamically compiled code.
Can't remember if it compiles all classes on startup or when each class is loaded; google failed to help me here!

-XX:+UseCompressedStrings
Use byte arrays in Strings instead of chars. SMT has its own ReusableString which uses byte arrays.
Obviously a no-go for systems that require multi-byte char sets like Japanese Shift-JIS.
All IO is in bytes, so this avoids the constant translation between char and byte.

-XX:-UseCounterDecay
Experiment with disabling / re-enabling recompilation decay timers. I believe the decay timers delay recompilation from happening within 30 seconds; a real pain in warmup code. I run warmup code, pause 30 seconds, then rerun! There must be a better way. Wish decent documentation existed that wasn't hidden away!

-XX:PerMethodRecompilationCutoff=1
Try setting a maximum recompilation boundary … didn't help me much.


I have tried many many other JVM args but none of those had any favourable impact on SMT performance.





Monday, 4 May 2015

Linux Tuning for Ultra Low Latency

These are some of my notes from BIOS and Linux tuning for Ultra Low Latency with SubMicroTrading.

Don't blindly copy ANY settings here; test each one for impact and pick the values that suit your system. I include them as possible points of interest.
I have spent many weeks simply doing this; it's tedious, but necessary to determine the best settings for your system.

BIOS settings

Research every option; remember to change one item at a time and run a benchtest to ascertain the impact.

Disable hyper-threading
Disable turbo mode if overclocking
Disable all options related to power saving (eg CPU C State support)
Set SATA Configuration to ENHANCED
ACPI Power Management Features : APIC ACPI SCI IRQ and High Precision Timer ENABLED
If using a PCIe 3 network card, check the slot with the card HAS PCIe 3 enabled .. the default may be PCIe 2

Operating System Tuning for Ultra Low Latency

It's over 20 years since I worked at a low level with Unix/Linux, and the truth is I have forgotten more about the kernel than I now know. I am NOT an O/S specialist. For SubMicroTrading I didn't have the luxury of paying someone to configure Linux for me, so I had to do it myself. I still have my trusty Stevens UNIX Network Programming book, which helped, and of course today we have Google! David Riddoch from Solarflare was also helpful at answering questions regarding Solarflare and OpenOnload tuning.

I started with RedHat and dismissed the Realtime variant as it was slower for my benchtest. I currently recommend CentOS 5.10 (I am somewhat behind the later versions, but honestly what do they have that helps with low latency?).

Read the Solarflare optimisation document; I only wish it had existed when I started! INSTALL OpenOnload!! I don't understand why people working at microsecond-level latency still don't use kernel bypass! OpenOnload is great as it's non-intrusive and requires ZERO application code changes.

INSTALL SOLARFLARE (requires the Linux install to have a dev env) … note these notes are OLD so versions will be well out of date

copy the SolarFlare drivers to /INSTALL/SOLARFLARE

s1) rpmbuild --rebuild /INSTALL/SOLARFLARE/sfc-3.0.6.2199-1.src.rpm

==> CREATES  /usr/src/redhat/RPMS/x86_64/kernel-module-sfc-RHEL5-2.6.18-194.el5-3.0.6.2199-1.x86_64.rpm

s2) Install the RPM

rpm -ivh /usr/src/redhat/RPMS/x86_64/kernel-module-sfc-RHEL5-2.6.18-194.el5-3.0.6.2199-1.x86_64.rpm

==> eth2 and eth3 are now available
==> use rpm -e if old version around

s3) install OpenOnLoad

    tar -xvf openonload-20100923.tar
    ./scripts/onload_install
    modprobe -r sfc
    modprobe sfc

    openonload now ready to use

s4) install BIOS update tools
   
    gunzip SF-104451-LS-4_Solarstorm_Linux_and_VMware_ESX_Utilities_RPM.tgz
    tar -xvf SF-104451-LS-4_Solarstorm_Linux_and_VMware_ESX_Utilities_RPM.tar
    ==> creates ==> sfutils-3.0.8.2216-1.rpm
    rpm -ivh /INSTALL/SOLARFLARE/sfutils-3.0.8.2216-1.rpm
   
    (if you get a clash with a previous version use rpm -e {oldRpm})
   
s5) check if a BIOS update is required and update
   
    sfupdate
    sfupdate --write
   
    onload_tool disable_cstates persist


s6) set up an env var with your required profile, eg latency :-

export PRERUN="onload --profile=latency "


s7) simply add $PRERUN to the start of your application invocation command

$PRERUN java …….


Beware O/S upgrades: Linux 6 had some extra horrid latency which, after several days of tweaking, I still hadn't eradicated … I went back to CentOS 5.10.

Here are my notes from CentOS / RedHat installation

Deselect virtualisation
Disable firewall
Disable SELinux
Delete the SWAP partition (you don't want swapping, so ensure you have enough memory!)

Obviously disabling SELinux and the firewall is for benchtesting in a controlled environment. For colocation running you need to determine an appropriate security level for your org. If you must have a firewall between you and the exchange, then use a hardware one. Benchtest without security, then with security on, so you know the impact.

Protect cores against unwanted intrusion (will be discussed when I blog on using thread affinity)

Avoid millisecond latency impact by using a discrete threading model with core affinity via the kernel param isolcpus. The O/S won't schedule onto these cores, so you will need to use thread affinity to bind threads to the protected cores (code to follow in a later blog).

Edit the /boot/grub/grub.conf

kernel /vmlinuz-2.6.18-194.el5 ro root=LABEL=RH_ROOT    nohz=off  isolcpus=6,7,8,9,10,11    rhgb

Kernel Params

There are many! Here are some to look at :-

transparent_hugepage=never
intel_idle.max_cstate=0
nohz=off
nosoftlockup
idle=poll

Disable unwanted services

/sbin/chkconfig --list | grep "5:on"
 
chkconfig irqbalance off
chkconfig anacron off
chkconfig atd off
chkconfig avahi-daemon off
chkconfig bluetooth off
chkconfig cups off      
chkconfig hidd off      
chkconfig isdn off      
chkconfig pand off      
chkconfig rhnsd off     
chkconfig sendmail off  
chkconfig cpuspeed off  
chkconfig NetworkManager off
chkconfig iptables off
chkconfig ip6tables off
chkconfig libvirt-guests off

These are the services I disabled; obviously you need to ensure you don't need a service before you disable it.
IRQBalance and CpuSpeed were the main services that I wished to disable … at the risk of sounding like a broken record: disable a single service, bench test, rinse, repeat. Don't disable ANY service without checking if YOU need it first!

System Scripts : rc.local

Edit /etc/rc.local

ethtool -C eth2 adaptive-rx off
ethtool -C eth2 rx-usecs 0 rx-frames 0 rx-usecs-high 0 rx-usecs-low 0 pkt-rate-low 0 pkt-rate-high 0
ethtool -C eth2 rx-usecs-irq 60
ethtool -A eth2 rx off tx off

ethtool -C eth3 adaptive-rx off
ethtool -C eth3 rx-usecs 0 rx-frames 0 rx-usecs-high 0 rx-usecs-low 0 pkt-rate-low 0 pkt-rate-high 0
ethtool -C eth3 rx-usecs-irq 60
ethtool -A eth3 rx off tx off

echo 0 > /sys/class/net/eth2/device/lro
echo 0 > /sys/class/net/eth3/device/lro

Only use rx-usecs-irq 60 IF using OpenOnload …. You can experiment with this setting; I use spin reading so I should never require an IRQ … but I found if I set it lower or higher I could get nasty jitter.


System Scripts : sysctl.conf

Edit /etc/sysctl.conf

kernel.sysrq = 0
kernel.core_uses_pid = 1

net.ipv4.tcp_low_latency=1

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maximum size of a message queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

kernel.isolcpus=6,7,8,9,10,11
kernel.vsyscall64 = 2

Some other useful CentOS "stuff"

Set Run Level

change the /etc/inittab default run level from 5 to 3

id:3:initdefault:

Network config

/etc/sysconfig/network-scripts/ifcfg-eth*

After changing, run

/etc/init.d/network restart

Label Root Partition

Label the root partition, eg to RH_ROOT … so it's not confused with any other later O/S installs

SET DEVICE LABEL READY FOR GRUB  (** DON'T FORGET TO UPDATE fstab OR IT WON'T BOOT **)
e2label /dev/sda7 RH_ROOT

   edit /etc/fstab
LABEL=RH_ROOT          /                       ext3    defaults        1 1

Edit the /boot/grub/grub.conf
         kernel /vmlinuz-2.6.18-194.el5 ro root=LABEL=RH_ROOT    nohz=off  isolcpus=6,7,8,9,10,11    rhgb
         

Check Kernel CPU Params

     cat /sys/devices/system/cpu/cpuidle/current_driver

Related boot params to keep the CPU out of idle states :-

         intel_idle.max_cstate=0   idle=poll   transparent_hugepage=never   processor.max_cstate=0