Saturday, 23 May 2015

Coding for Ultra Low Latency

For a system to operate as fast as possible, every line of code needs to be optimal. If you take the approach of writing lazy code and optimising later, you will end up rewriting everything. A profiler won't help you at the nanosecond level; the overhead of running with profiler metrics will have you "chasing your tail"!

Writing optimal code from the start of the project is easy: set up coding standards and enforce them. Have a simple set of guidelines that everyone follows.

Minimise synchronisation
The synchronized keyword used to be really slow and was avoided, with more complex lock classes used in preference. But with the advent of under-the-cover lock spinning this is no longer the case. That said, even if the lock is uncontended you still pay the overhead of read and write memory barriers. So use synchronized where it's absolutely needed, ie where you have real concurrency.
Key here is application design: you want components to be single threaded, achieving throughput via concurrent instances which are independent and require no synchronisation (see the sketch below).
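A minimal sketch of that partitioned design (hypothetical names; a real low latency system would use a lock-free queue for the handoff rather than ArrayBlockingQueue):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch: N independent single-threaded workers, events partitioned by
    // instrument id so worker state is never shared and needs no locks
    public final class ShardedDispatcher {
        private final BlockingQueue<Runnable>[] _queues;

        @SuppressWarnings("unchecked")
        public ShardedDispatcher( int numWorkers ) {
            _queues = new BlockingQueue[ numWorkers ];
            for ( int i = 0; i < numWorkers; i++ ) {
                final BlockingQueue<Runnable> q = new ArrayBlockingQueue<>( 1024 );
                _queues[ i ] = q;
                Thread t = new Thread( () -> {
                    try {
                        while ( true ) q.take().run();   // single consumer per queue
                    } catch ( InterruptedException e ) {
                        Thread.currentThread().interrupt();
                    }
                }, "worker-" + i );
                t.setDaemon( true );
                t.start();
            }
        }

        // the same instrument always routes to the same worker thread
        public void dispatch( long instrumentId, Runnable event ) throws InterruptedException {
            _queues[ (int) Math.floorMod( instrumentId, (long) _queues.length ) ].put( event );
        }
    }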
Minimise use of volatile variables
Understand how your building blocks work eg AtomicInteger, ConcurrentHashMap.
Only use concurrent techniques for the code that needs to be concurrent.
Minimise use of CAS operations
CAS is an efficient atomic operation that bypasses the O/S and is implemented by a CPU instruction. However, making it atomic and consistent incurs a memory barrier, hurting cache effectiveness. So use it where needed, and not where not! A quick illustration follows.
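For illustration, a classic CAS retry loop, roughly what AtomicLong.incrementAndGet did under the covers in older JDKs:

    import java.util.concurrent.atomic.AtomicLong;

    // Each compareAndSet maps to a single CPU instruction but carries full
    // memory-barrier semantics, so keep it off genuinely single-threaded paths
    static long nextSeqNum( AtomicLong seq ) {
        long cur;
        do {
            cur = seq.get();
        } while ( !seq.compareAndSet( cur, cur + 1 ) );
        return cur + 1;
    }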
Avoid copying objects unnecessarily
I see this A LOT and the overhead can soon mount up.
The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue comes from the required concurrency of shared state across instances running in separate threads.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, they used abstract concepts with object values stored in maps. Not only do these systems run slowly, but they lack compile time safety and are simply a pain. Use maps where they are needed … eg a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
Presize collections
Understand the cost of growing collections: eg a HashMap has to create a new array double the size and then rehash its elements, an expensive operation when the map is growing into hundreds of thousands of entries. Make initial sizes configurable, as in the sketch below.
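A minimal sketch (the property name and Order type are hypothetical): size the backing array up front, accounting for HashMap's default 0.75 load factor, so the map never rehashes on the hot path.

    // presize from config so the map never rehashes while growing
    int expectedOrders = Integer.getInteger( "smt.presize.orders", 200_000 );  // one-off config read
    Map<Long, Order> ordersById = new HashMap<>( (int) ( expectedOrders / 0.75f ) + 1 );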
Reuse heuristics
At the end of the day, write out the size of all collections. Next time the process is bounced, resize to the previously stored max (as sketched below).
Generate other metrics like number of orders created, hit percentage, max tick rate per second … figures that can be used to understand performance and give context to unexpected latency.
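A minimal sketch of persisting the heuristics (file and property names are hypothetical):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;

    // at shutdown, persist the max size each collection reached so the next
    // run can presize to the previous max
    void storeSizingHeuristics( int maxOrdersSeen, int maxBooksSeen ) throws IOException {
        Properties sizing = new Properties();
        sizing.setProperty( "orders.maxSize", Integer.toString( maxOrdersSeen ) );
        sizing.setProperty( "books.maxSize",  Integer.toString( maxBooksSeen ) );
        try ( OutputStream out = Files.newOutputStream( Paths.get( "sizing.properties" ) ) ) {
            sizing.store( out, "collection sizing heuristics" );
        }
    }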
Use Object Orientation
Avoiding object orientation due to fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro end-to-end scale what's the impact? In Java all methods are virtual, but the JIT compiler knows what classes are currently loaded and can not only avoid a vtable lookup but can also inline the code. The benefit of object orientation is huge: component reuse and extensibility make it easy to extend and create new strategies without swathes of cut-and-paste code.
Use final keyword everywhere
Help the JIT compiler optimise. If in future a method or class needs extending, you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Really big methods will never be compiled; big complex methods may be compiled, but the compiler may end up recompiling and recompiling the method to try and optimise. David Straker wrote "KISS" on the board and I never forgot it! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives and use long over Long, thus avoiding any auto boxing overhead (and turn the auto boxing warning on).
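For example, assuming a long[] of tick values, the boxed version allocates a new Long on nearly every iteration:

    long total = 0;                            // primitive accumulation: no allocation
    for ( int i = 0; i < ticks.length; i++ ) {
        total += ticks[ i ];
    }

    Long boxedTotal = 0L;                      // boxed: each += unboxes then boxes a new Long
    for ( int i = 0; i < ticks.length; i++ ) {
        boxedTotal += ticks[ i ];              // only small cached values avoid the allocation
    }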
Avoid Immutables
Immutable objects are fine for long-lived objects, but cause GC for anything else … eg a trading system consuming market data would GC every second if each tick created an immutable POJO. A mutable, reusable tick is sketched below.
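A minimal sketch (hypothetical class) of a mutable tick updated in place rather than reallocated per update:

    // one instance reused per update instead of a fresh immutable POJO per tick
    public final class MarketTick {
        private long _price;   // fixed-point price, eg price * 10^6
        private int  _qty;

        public void set( long price, int qty ) {
            _price = price;
            _qty   = qty;
        }

        public long getPrice() { return _price; }
        public int  getQty()   { return _qty; }
    }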
Avoid String
String is immutable and is a big no-no for ultra low latency systems. In SMT I have a ZString immutable "string-like" interface, with ViewString and ReusableString concrete implementations.
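A simplified sketch of the idea (NOT the actual SMT definitions):

    // read-only byte-based "string" view, plus a reusable mutable implementation
    interface ZString {
        int  length();
        byte getByte( int idx );
    }

    final class ReusableString implements ZString {
        private byte[] _bytes = new byte[ 64 ];
        private int    _len;

        public void reset()          { _len = 0; }
        public void append( byte b ) { ensure( _len + 1 ); _bytes[ _len++ ] = b; }

        @Override public int  length()           { return _len; }
        @Override public byte getByte( int idx ) { return _bytes[ idx ]; }

        private void ensure( int cap ) {
            if ( cap > _bytes.length ) {
                _bytes = java.util.Arrays.copyOf( _bytes, Math.max( cap, _bytes.length * 2 ) );
            }
        }
    }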
Avoid Char
Use byte and byte[] and avoid translation between byte and char on every IO operation. For example, numeric fields can be parsed straight from the byte buffer:
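A minimal sketch: parse an ASCII unsigned integer directly from the byte buffer, with no byte-to-char translation and no temporary String:

    static int parseUInt( byte[] buf, int offset, int len ) {
        int val = 0;
        for ( int i = offset; i < offset + len; i++ ) {
            val = val * 10 + ( buf[ i ] - '0' );   // ASCII digit to value, no allocation
        }
        return val;
    }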
Avoid temp objects
Objects take time to construct and initialise. Consider using instance variables for reuse instead (if the instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows invoking code to avoid object creation and to reuse instances where appropriate :-


String str = order.toString();      // the API forces construction of a temporary string

Versus

_str.reset();                       // a reusable "working" instance var
order.toString( _str );             // buffer passed into the method, no temp objects required

Don’t make everything reusable
Only where otherwise the objects would cause GC.
Object reuse comes with a risk of corruption; a key goal of Java was to avoid those nasty bugs.
Unfortunately for ultra low latency it's not an option, you have to reuse objects (remember there are places in the Java class libraries that already use pools and reuse).
Avoid finalize
Objects which hold resources such as files and sockets should all attempt to shut down cleanly and not rely on finalisers. Add explicit open and close methods, and add shutdown handlers to cleanly close if possible.
Avoid threadlocal
Every threadlocal call involves a map lookup keyed off the current thread, so only use it where really needed.
24 * 7
Design your systems to run 24 * 7 …. common in the 80's and 90's, less so now in finance.


Saturday, 16 May 2015

Java Bytecode Latency Impact

In the 80's I remember building NAND circuits to represent code. It was pretty cool seeing how code could be implemented at a circuit level. What I was unsure of when I started SubMicroTrading was the performance impact of Java bytecode and whether there were any possible optimisations available.

To cut a long story short, I found only one worthwhile optimisation, and that's how a switch statement is represented in bytecode.

Consider the following switch statement :-

    switch( a ) {
    case 10 : doAStuff(); break;
    case 20 : doBStuff(); break;
    case 30 : doCStuff(); break;
    case 40 : doDStuff(); break;
    ...
    }

    is conceptually the same as

    if      ( a == 10 ) doAStuff();
    else if ( a == 20 ) doBStuff();
    else if ( a == 30 ) doCStuff();
    else if ( a == 40 ) doDStuff();

Think about that: if you are parsing FIX with 1000 possible FIX tags and an average of 20 fields per message, then a linear top-down search averages around 500 comparisons per field, ie roughly 10,000 comparisons per message. If you want to process 1,000,000 events per second, that would be 10,000,000,000 comparisons per second. (A huge number based on linear search; binary search is clearly much better … but the point is it's still a cost that can be avoided.)

Java has two bytecodes for switch statements. A lookupswitch is in effect a table of key-to-jump-label pairs … ie you have to search the table to find the correct key entry. A tableswitch is in effect a table of jump labels indexed directly by the key value (minus the table offset, ie the lowest key value in the switch statement).

For ultra low latency you should consider adding an ant task to check the bytecode for any "lookupswitch" statements. For message processing on most exchanges you can safely force a switch statement to become a tableswitch by adding packer entries so there are no gaps between the key values. In my CODEC generators I stipulate a max pack range, eg 100, and any sparse values are handled within the default statement, eg via a second switch, or an if statement if only a couple of keys are of interest. Like everything, test with real data to see the impact. For me, tableswitch made a HUGE difference; a hand-written illustration follows, then a generated sample.
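As a minimal hand-written sketch (hypothetical tags and handler methods), padding the gaps keeps the key range dense so javac emits a tableswitch:

    // hypothetical tags/handlers: cases 35..40 form a dense range, so javac
    // emits a tableswitch instead of a lookupswitch
    switch( tag ) {
    case 35: handleMsgType();  break;
    case 38: handleOrderQty(); break;
    case 40: handleOrdType();  break;
    case 36: case 37: case 39:       // packer entries filling the gaps
    default:                         // sparse / unknown tags handled here
        handleUnknownTag();
        break;
    }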

Sample tableswitch with packer case statements :-

Here is the start of the switch statement within the Standard44Decoder generated class :-

        final byte msgType = _fixMsg[ _idx ];
        switch( msgType ) {
        case '8':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeExecReport();
        case 'D':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeNewOrderSingle();
        ...
        // packers
        case '6': case '7': case ':': case ';': case '<': case '=':
        case '>': case '?': case '@': case 'B': case 'C': case 'E':
            break;


Dumping the bytecode with javap shows the resulting tableswitch :-

javap -c ./com/rr/model/generated/fix/codec/Standard44Decoder.class >s44.bc

  protected final com.rr.core.model.Message doMessageDecode();
    Code:
...
      74: tableswitch   { // 48 to 71
                    48: 549                                      // '0' ie heartbeat is the first entry
                    49: 987
                    50: 841
                    51: 768
                    52: 914
                    53: 695
                    54: 1060
                    55: 1060
                    56: 184                                        // '8' this was the first switch entry in java code
                    57: 476
                    58: 1060                                     // 1060 is same as default and are the packer entries
                    59: 1060
                    60: 1060
                    61: 1060
                    62: 1060
                    63: 1060
                    64: 1060
                    65: 622
                    66: 1060
                    67: 1060
                    68: 257
                    69: 1060
                    70: 403
                    71: 330
               default: 1060
          }
     184: aload_0
     185: getfield      #462                // Field _fixMsg:[B
     188: aload_0
     189: getfield      #466                // Field _idx:I
     192: iconst_1
     193: iadd
     194: baload
     195: iconst_1
     196: if_icmpeq     242
     199: aload_0
     200: new           #475                // class java/lang/StringBuilder
     203: dup
     204: ldc_w         #477                // String Unsupported fix message type

You could easily grep for lookupswitch … remember that over time extra case statements could be added that cause the switch to become a lookupswitch again.


PS> clearly tableswitch isn't suitable for very sparse switch statements, but in my experience it's very useful in trading systems.

Saturday, 9 May 2015

Java JVM Tuning for Ultra Low Latency

There is no JVM arg that fits all applications; the key is to have a repeatable full test bed and run full-scale benchmarks over hours, not seconds. Rinse and repeat several times for EACH arg change. The args I focus on are the ones used by SubMicroTrading, which performs no GC and has almost no JIT post warmup.

Please note some of these flags may now be on by default … sorry, I haven't checked; still worth bringing them to attention I think.

For standard Java applications which do lots of GC with mainly short-lived objects, I would recommend trying the G1 collector … for market data I found it much better than concurrent mark sweep. I will blog about that another time … I spent weeks tuning poorly designed apps (advice: don't bother, buy Zing).

Note each Java upgrade brings new options and tweaks existing performance, sometimes up, sometimes down, so re-benchmark each Java upgrade.

Treat micro benchmarks with care … I have seen a Generics micro benchmark give completely different results on a PC than on Linux (a topic to discuss another time).

Avoid biased locking … it incurred regular millisecond latency in systems I have tested.

JVM Args for Recording Jitter (JIT/GC)

-XX:+PrintCompilation
-XX:+CITime
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+LogCompilation
-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails

Rather than regurgitate what I previously googled on understanding the output from PrintCompilation, see :- http://blog.joda.org/2011/08/printcompilation-jvm-flag.html

For ultra low latency you want no GC and no JIT, so in SMT I preallocate pools and run warmup code, then invoke System.gc(). I take note of the last compilation entry, then while re-running the controlled bench test I look for new JIT output (generally recompilation). When this occurs I go back to the warmup code and find out why the routine had to be recompiled. This generally comes down to either the code not being warmed up, or the routine being too complicated for the compiler. Either add further warmup code or simplify the routine. Adding final everywhere really helps. A minimal warmup sketch follows.
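A minimal sketch of the warmup idea (names are hypothetical):

    // drive the hot path enough times to trip CompileThreshold on every routine
    // you care about, then force a single GC before the process goes live
    public static void warmup( Runnable hotPath, int iterations ) {
        for ( int i = 0; i < iterations; i++ ) {
            hotPath.run();          // must exercise every branch you care about
        }
        System.gc();                // collect warmup garbage now, not mid-session
    }

    // eg warmup( () -> decoder.decode( sampleExecReport ), 20_000 );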

Writing warmup code is a pain, and I am gutted the java.lang.Compiler.disable() method is not implemented (or at least it wasn't in OpenJDK 1.6 … empty method, doh!). Ideally I would invoke this when the application is warm and have no recompilation due to the compiler thinking it can make further optimisations.

Java can recompile and recompile; in my experience this only happens when a method is too complex. Of course, if a recompilation is due to Java having inlined a non-final method and the optimisation proving premature, then the code needs to be corrected. What I want to avoid is recompilation from edge cases that infrequently go into code branches.

Note you cannot guarantee no GC and no JIT under every situation in a complex system. What you can do is guarantee no JIT/GC for the KEY scenarios that the business demands. If a trading system does 10 million trades a day, I would set a goal of no GC/JIT under NORMAL conditions at 30 million trades, then check performance up to 100 million to see at which point jitter occurs. If, for example, the exchange disconnects you during the day and that kicks in a few milliseconds of JIT, it's not important. You don't need to pool every object … just the key ones that cause GC. More on that in a future blog on SuperPools.

I remember speaking to Gil Tene from Azul while working at Morgan Stanley, and really tried to get across how much more of a pain JIT is than GC. Some exciting developments seem to have been made with Zing and I would have been very interested in benchtesting it … alas I just don't have time at present. Very impressed with Azul and Gil and how they respond to queries and enhance their product ….. so much better than Sun/Oracle were with Java.



SubMicroTrading JVM Arguments

The following are the arguments that SubMicroTrading runs with; this includes the algo container, OMS, exchange sim and client sim.

-XX:+BackgroundCompilation
Even with this on there is still latency in switching in newly compiled routines. I really wish that switch time was much, much quicker!

-XX:CompileThreshold=1000
If you don't want to benefit from the fastest possible code given the runtime heuristics, you can force initial compilation with -Xcomp … an option if you don't want to write warmup code. This may run 10% slower, but sometimes much slower depending on the code.

-XX:+TieredCompilation
Code is initially compiled with the C1 (GUI/client) compiler, then when it reaches the invocation limit it is recompiled with the fully optimised C2 (server) compiler. The C1 compiler is much quicker to compile a routine than the C2 compiler, and this reduced some outlying latency in SMT for routines that were not compiled during warmup (eg for code paths not covered in warmup).

-XX:-ClassUnloading
Disable class unloading; I don't want any possible jitter from this. SMT doesn't use custom class loaders and tries to load all required classes during warmup.

-XX:+UseCompilerSafepoints
I had hoped that disabling compiler safepoints would reduce JIT jitter, but in SMT's multithreaded system it brings instability, so I ensure the safepoints are on ….. more jitter I don't want, ho hum.

-XX:CompileCommandFile=.hotspot_compiler
The filename used to be picked up by default, but now you have to use this command.
This is really handy: if you have a small routine you can't simplify further which causes many recompilations, prevent it by adding a line to this file, example :-

exclude sun/nio/ch/FileChannelImpl force

This means the routine won't be compiled; you need to benchmark to determine if running the routine as bytecode has a noticeable impact.

-XX:+UseCompressedOops
I kind of expected this to have a small performance overhead, but in fact it slightly improved performance … perhaps through reduced object size and fitting more instances into CPU cache.

-Xnoclassgc
Again, all classes are loaded during warmup and I don't want any possible jitter from trying to free up / unload them.

-XX:-RelaxAccessControlCheck
To be honest I have no idea why I still have this in, or even if it's still required!

-Djava.net.preferIPv4Stack=true
If you upgraded Java and spent hours working out why your socket code isn't working anymore, this could well be it … DOH!!!

-server
Don't forget this if running benchmarks on a PC.

-XX:+UseFastAccessorMethods

-XX:+UseFastJNIAccessors

-XX:+UseThreadPriorities
Not sure this is needed for SMT; I use a JNI function to hwloc routines for thread core affinity.

-XX:-UseCodeCacheFlushing

-XX:-UseBiasedLocking
Disable biased locking; this caused horrendous jitter in ultra low latency systems with discrete threading models.
Probably the single biggest cause of jitter from a JVM arg that I found.

-XX:+UseNUMA
Assuming you have a multi-CPU system, this can have a significant impact … google NUMA architecture.



JVM Arguments to experiment with … didn’t help SMT, but may help you

-XX:-DoEscapeAnalysis
Try disabling escape analysis and see what the impact is.

-Xcomp
Mentioned earlier: compiles code up front, as opposed to optimising based on runtime heuristics.
Avoids JIT jitter, but the code in general is slower than dynamically compiled code.
Can't remember if it compiles all classes on startup or when each class is loaded; google failed to help me here!

-XX:+UseCompressedStrings
Use byte arrays in Strings instead of chars. SMT has its own ReusableString which uses byte arrays.
Obviously a no-go for systems that require multi-byte char sets like Japanese Shift-JIS.
All IO is in bytes, so this avoids the constant translation between char and byte.

-XX:-UseCounterDecay
Experiment with disabling / re-enabling recompilation decay timers. I believe the decay timers delay recompilation from happening within 30 seconds; a real pain in warmup code. I run warmup code, pause 30 seconds, then rerun! There must be a better way. Wish decent documentation existed that wasn't hidden away!

-XX:PerMethodRecompilationCutoff=1
Try setting a maximum recompilation boundary … didn't help me much.


I have tried many many other JVM args but none of those had any favourable impact on SMT performance.





Monday, 4 May 2015

Linux Tuning for Ultra Low Latency

These are some of my notes from BIOS and Linux tuning for Ultra Low Latency with SubMicroTrading.

Don't blindly copy ANY settings here; test each one for impact and pick the values that suit your system. I include them as possible points of interest.
I have spent many weeks simply doing this; it's tedious, but necessary to determine the best settings for your system.

BIOS settings

Research every option; remember to change one item at a time and run a benchtest to ascertain the impact.

Disable hyper-threading
Disable turbo mode if overclocking
Disable all options related to power saving (eg CPU C State support)
Set SATA Configuration to ENHANCED
ACPI Power Management Features : APIC ACPI SCI IRQ and High Precision Timer ENABLED
If using a PCIe 3 network card, check the slot with the card HAS PCIe 3 enabled .. the default may be PCIe 2

Operating System Tuning for Ultra Low Latency

It's over 20 years since I worked at a low level with Unix/Linux, and the truth is I have forgotten more about the kernel than I now know. I am NOT an O/S specialist. For SubMicroTrading I didn't have the luxury of paying someone to configure Linux for me, so I had to do it myself. I still have my trusty Stevens UNIX Network Programming book, which helped, and of course today we have Google! David Riddoch from Solarflare was also helpful at answering questions regarding Solarflare and OpenOnload tuning.

I started with RedHat and dismissed the Realtime variant as it was slower for my benchtest. I currently recommend CentOS 5.10 (I am somewhat behind the later versions, but honestly what do they have that helps with low latency?).

Read the Solarflare optimisation document; I only wish it had existed when I started! INSTALL OpenOnload!! I don't understand why people working at microsecond-level latency still don't use kernel bypass! OpenOnload is great as it's non-intrusive and requires ZERO application code changes.

INSTALL SOLARFLARE (requires the Linux install to have a dev env) … note these notes are OLD so versions will be well out of date

copy the SolarFlare drivers to /INSTALL/SOLARFLARE

s1) rpmbuild --rebuild /INSTALL/SOLARFLARE/sfc-3.0.6.2199-1.src.rpm

==> CREATES  /usr/src/redhat/RPMS/x86_64/kernel-module-sfc-RHEL5-2.6.18-194.el5-3.0.6.2199-1.x86_64.rpm

s2) Install the RPM

rpm -ivh /usr/src/redhat/RPMS/x86_64/kernel-module-sfc-RHEL5-2.6.18-194.el5-3.0.6.2199-1.x86_64.rpm

==> eth2 and eth3 are now available
==> use rpm -e if old version around

s3) install OpenOnLoad

    tar -xvf openonload-20100923.tar
    ./scripts/onload_install
    modprobe -r sfc
    modprobe sfc

    openonload now ready to use

s4) install BIOS update tools
   
    gunzip SF-104451-LS-4_Solarstorm_Linux_and_VMware_ESX_Utilities_RPM.tgz
    tar -xvf SF-104451-LS-4_Solarstorm_Linux_and_VMware_ESX_Utilities_RPM.tar
    ==> creates ==> sfutils-3.0.8.2216-1.rpm
    rpm -ivh /INSTALL/SOLARFLARE/sfutils-3.0.8.2216-1.rpm
   
    (if you get a clash with a previous version use rpm -e {oldRpm})
   
s5) check if a BIOS update is required and update
   
    sfupdate
    sfupdate --write
   
    onload_tool disable_cstates persist


s6) set up an env var with your required profile, eg latency :-

export PRERUN="onload --profile=latency "


s7) simply add $PRERUN to the start of your application invocation command

$PRERUN java …….


Beware O/S upgrades: Linux 6 had some extra horrid latency which, after several days of tweaking, I still hadn't eradicated … I went back to CentOS 5.10.

Here are my notes from CentOS / RedHat installation

Deselect virtualisation
Disable firewall
Disable SELinux
Delete the SWAP partition (you don't want swapping, so ensure you have enough memory!)

Obviously disabling SELinux and the firewall is for benchtesting in a controlled environment. For colocation running you need to determine an appropriate security level for your org. If you must have a firewall between you and the exchange, then use a hardware one. Benchtest without security, then with security on, so you know the impact.

Protect cores against unwanted intrusion (will be discussed when I blog on using thread affinity)

Avoid millisecond latency impact by using a discrete threading model with core affinity via the kernel param isolcpus. The O/S won't schedule onto these cores, so you will need to use thread affinity to bind threads to the protected cores (code to follow in a later blog).

Edit the /boot/grub/grub.conf

kernel /vmlinuz-2.6.18-194.el5 ro root=LABEL=RH_ROOT    nohz=off  isolcpus=6,7,8,9,10,11    rhgb

Kernel Params

There are many! Here are some to look at :-

transparent_hugepage=never
intel_idle.max_cstate=0
nohz=off
nosoftlockup
idle=poll

Disable unwanted services

/sbin/chkconfig --list | grep "5:on"
 
chkconfig irqbalance off
chkconfig anacron off
chkconfig atd off
chkconfig avahi-daemon off
chkconfig bluetooth off
chkconfig cups off      
chkconfig hidd off      
chkconfig isdn off      
chkconfig pand off      
chkconfig rhnsd off     
chkconfig sendmail off  
chkconfig cpuspeed off  
chkconfig NetworkManager off
chkconfig iptables off
chkconfig ip6tables off
chkconfig libvirt-guests off

These are the services I disabled; obviously you need to ensure you don't need a service before you disable it.
IRQBalance and CpuSpeed were the main services that I wished to disable … at the risk of sounding like a broken record: disable a single service, bench test, rinse, repeat. Don't disable ANY service without checking if YOU need it first!

System Scripts : rc.local

Edit /etc/rc.local

ethtool -C eth2 adaptive-rx off
ethtool -C eth2 rx-usecs 0 rx-frames 0 rx-usecs-high 0 rx-usecs-low 0 pkt-rate-low 0 pkt-rate-high 0
ethtool -C eth2 rx-usecs-irq 60
ethtool -A eth2 rx off tx off

ethtool -C eth3 adaptive-rx off
ethtool -C eth3 rx-usecs 0 rx-frames 0 rx-usecs-high 0 rx-usecs-low 0 pkt-rate-low 0 pkt-rate-high 0
ethtool -C eth3 rx-usecs-irq 60
ethtool -A eth3 rx off tx off

echo 0 > /sys/class/net/eth2/device/lro
echo 0 > /sys/class/net/eth3/device/lro

Only use rx-usecs-irq 60 IF using OpenOnload …. You can experiment with this setting; I use spin reading so I should never require an IRQ … but I found if I set it lower or higher I could get nasty jitter.


System Scripts : sysctl.conf

Edit /etc/sysctl.conf

kernel.sysrq = 0
kernel.core_uses_pid = 1

net.ipv4.tcp_low_latency=1

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maximum size of a message queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

kernel.isolcpus=6,7,8,9,10,11
kernel.vsyscall64 = 2

Some other useful CentOS "stuff"

Set Run Level

change the /etc/inittab default run level from 5 to 3

id:3:initdefault:

Network config

/etc/sysconfig/network-scripts/ifcfg-eth*

After changing, run

/etc/init.d/network restart

Label Root Partition

Label the root partition, eg to RH_ROOT … so it's not confused with any other later O/S installs

SET DEVICE LABEL READY FOR GRUB  (** DON'T FORGET TO UPDATE fstab OR IT WON'T BOOT **)
e2label /dev/sda7 RH_ROOT

   edit /etc/fstab
LABEL=RH_ROOT          /                       ext3    defaults        1 1

Edit the /boot/grub/grub.conf
         kernel /vmlinuz-2.6.18-194.el5 ro root=LABEL=RH_ROOT    nohz=off  isolcpus=6,7,8,9,10,11    rhgb
         

Check Kernel CPU Params

     cat /sys/devices/system/cpu/cpuidle/current_driver

Related boot params to keep the CPU out of idle states :-

         intel_idle.max_cstate=0   idle=poll   transparent_hugepage=never   processor.max_cstate=0