Please check out the article on SubMicroTrading at theTradingMesh.com:
http://www.thetradingmesh.com/pg/blog/Richard.Rose/read/742893
A blog about techniques used to build www.SubMicroTrading.com
Minimise synchronisation
The synchronized keyword used to be really slow, so more complex lock classes were often used in preference. With the advent of under-the-cover lock spinning that is no longer the case. That said, even when a lock is uncontended you still pay the overhead of read and write memory barriers. So use synchronized only where it is absolutely needed, i.e. where you have real concurrency. The key is application design: make components single threaded and achieve throughput via concurrent instances which are independent and require no synchronisation.
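A minimal sketch of that design (class and method names are mine, not SMT's): independent single-threaded worker instances, each owning its own state, so the hot loop needs no locks at all.

```java
// Hypothetical sketch: throughput via independent single-threaded workers
// rather than shared, synchronised state.
public class ShardedWorkers {
    static final class Worker implements Runnable {
        private long processed; // owned by exactly one thread: no synchronisation needed

        @Override public void run() {
            for (int i = 0; i < 1_000_000; i++) {
                processed++; // plain field update: no lock, no memory barrier
            }
        }
    }

    // Run n independent workers, one per thread, and total their results.
    static long runAll(int n) throws InterruptedException {
        Worker[] workers = new Worker[n];
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Worker();
            threads[i] = new Thread(workers[i]);
            threads[i].start();
        }
        long total = 0;
        for (int i = 0; i < n; i++) {
            threads[i].join(); // join() provides the happens-before edge for the read below
            total += workers[i].processed;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(4)); // 4000000
    }
}
```

The only synchronisation left is the Thread.join at the end, which is exactly where the real concurrency is.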
Minimise use of volatile variables
Understand how your building blocks work, e.g. AtomicInteger and ConcurrentHashMap. Only use concurrent techniques for the code that genuinely needs to be concurrent.
Minimise use of CAS operations
Compare-and-swap is an efficient atomic operation that bypasses the O/S and is implemented as a CPU instruction. However, making it atomic and consistent incurs a memory barrier, which hits cache effectiveness. So use it where needed, and not where not!
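For illustration, a classic CAS retry loop on AtomicLong (the capped-counter example is mine): each compareAndSet maps to a single CPU instruction, but it is still a full barrier, so it belongs only on genuinely shared state.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative CAS loop: read, compute, attempt the swap, retry on contention.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Adds delta but never lets the counter exceed cap.
    public long addCapped(long delta, long cap) {
        for (;;) {
            long cur = value.get();
            long next = Math.min(cur + delta, cap);
            if (value.compareAndSet(cur, next)) {
                return next; // our CAS won: next is now the published value
            }
            // Another thread won the race: loop and retry with the fresh value.
        }
    }

    public static void main(String[] args) {
        CasCounter c = new CasCounter();
        System.out.println(c.addCapped(7, 10)); // 7
        System.out.println(c.addCapped(7, 10)); // 10 (capped)
    }
}
```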
Avoid copying objects unnecessarily
I see this A LOT, and the overhead soon mounts up. The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue is the concurrency control then required for state shared across instances running in separate threads.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, abstract concepts were stored as object values in maps. Not only do these systems run slowly, they lack compile-time safety and are simply a pain. Use maps where they are needed, e.g. a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
Presize collections
Understand the cost of growing collections: e.g. a HashMap has to create a new array of double the size and then rehash its elements, an expensive operation when the map is growing into the hundreds of thousands. Make the initial size configurable.
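A small sketch of presizing (helper name is mine): HashMap resizes once size exceeds capacity times the load factor (0.75 by default), so to hold n entries without a single resize, ask for n / 0.75 up front.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: allocate a HashMap big enough that it never resizes and rehashes
// while growing to the expected entry count.
public class PresizedMap {
    public static <K, V> Map<K, V> mapFor(int expectedEntries) {
        // Resize triggers when size > capacity * 0.75, so divide by the load factor.
        int initialCapacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        // For, say, 500,000 orders the backing array is allocated once up front
        // instead of doubling and rehashing its way up from 16 buckets.
        Map<Long, String> orders = mapFor(500_000);
        orders.put(1L, "order-1");
        System.out.println(orders.size()); // 1
    }
}
```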
Reuse heuristics
At the end of the day, write out the size of all collections. The next time the process is bounced, resize to the previously stored maximums. Generate other metrics too, such as number of orders created, hit percentage and max tick rate per second: figures that can be used to understand performance and give context to unexpected latency.
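One way that heuristic might look (file format and names are illustrative, not SMT's): record each collection's peak size at shutdown, then presize from the stored figures on the next start.

```java
import java.io.*;
import java.util.Properties;

// Sketch: persist peak collection sizes across process bounces so the next
// run can presize its collections from real observed figures.
public class SizingHeuristics {
    private final Properties sizes = new Properties();

    // Called at shutdown for each named collection.
    public void record(String name, int maxSize) {
        sizes.setProperty(name, Integer.toString(maxSize));
    }

    // Called at startup: previous run's peak, or a default on first run.
    public int suggestedSize(String name, int defaultSize) {
        return Integer.parseInt(sizes.getProperty(name, Integer.toString(defaultSize)));
    }

    public void save(File file) throws IOException {
        try (OutputStream out = new FileOutputStream(file)) {
            sizes.store(out, "peak collection sizes from last run");
        }
    }

    public void load(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            sizes.load(in);
        }
    }
}
```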
Use Object Orientation
Avoiding object orientation for fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro, end-to-end scale what's the impact? In Java all methods are virtual, but the JIT compiler knows which classes are currently loaded and can not only avoid a vtable lookup but also inline the code. The benefit of object orientation is huge: component reuse and extensibility make it easy to extend and create new strategies without swathes of cut-and-paste code.
Use the final keyword everywhere
Help the JIT compiler optimise. If in future a method or class needs extending, you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Very big methods will never be compiled; big complex methods may be compiled, but the compiler may end up recompiling and recompiling the method trying to optimise it. David Straker wrote "KISS" on the board and I never forgot it! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives: use long over Long and so avoid any auto boxing overhead (and turn the auto boxing warning on).
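A quick illustration of the difference (method names are mine): the boxed loop allocates a fresh Long on almost every step via the hidden Long.valueOf, while the primitive loop generates no garbage at all.

```java
// Sketch: boxed vs primitive accumulation on a hot path.
public class BoxingDemo {
    // Boxed accumulator: each += unboxes, adds, then boxes a new Long
    // (outside the small cached value range, that's one allocation per step).
    static Long sumBoxed(long n) {
        Long total = 0L;
        for (long i = 0; i < n; i++) {
            total += i; // hidden Long.valueOf(...) on every iteration
        }
        return total;
    }

    // Primitive accumulator: a plain register-friendly loop, zero garbage.
    static long sumPrimitive(long n) {
        long total = 0;
        for (long i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumPrimitive(1000)); // 499500
    }
}
```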
Avoid Immutables
Immutable objects are fine for long-lived objects, but cause GC for anything else; e.g. a trading system consuming market data would GC every second if each tick created an immutable POJO.
Avoid String
String is immutable and is a big no-no for ultra low latency systems. In SMT I have ZString, an immutable "string-like" interface, with ViewString and ReusableString as concrete implementations.
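SMT's actual ZString/ViewString/ReusableString APIs are not shown in the article; purely as a rough illustration of the idea, a byte-backed string that is mutated in place rather than reallocated per message might look like:

```java
import java.util.Arrays;

// Illustrative only: a reusable byte-backed string in the spirit of a
// ReusableString. The buffer is reset and refilled, so steady-state
// operation creates no garbage.
public class SimpleReusableString {
    private byte[] bytes = new byte[32];
    private int len;

    public SimpleReusableString reset() { len = 0; return this; }

    public SimpleReusableString append(byte b) {
        if (len == bytes.length) bytes = Arrays.copyOf(bytes, len * 2); // grow rarely
        bytes[len++] = b;
        return this;
    }

    public SimpleReusableString append(CharSequence s) {
        for (int i = 0; i < s.length(); i++) append((byte) s.charAt(i)); // ASCII assumed
        return this;
    }

    public int length() { return len; }

    @Override public String toString() { // for debugging only: this DOES allocate
        return new String(bytes, 0, len);
    }
}
```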
Avoid Char
Use byte and byte[] and avoid translation between byte and char on every IO operation.
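A tiny sketch of staying in bytes end to end (helper name is mine): for ASCII wire protocols you can write straight into the outbound byte buffer, with no char-to-byte encoding step per message.

```java
import java.nio.charset.StandardCharsets;

// Sketch: write ASCII text directly into a byte buffer, skipping the
// char<->byte translation that Writers/Readers perform on every IO op.
public class ByteMessages {
    // Returns the new offset after the written text.
    public static int writeAscii(byte[] dest, int offset, String s) {
        for (int i = 0; i < s.length(); i++) {
            dest[offset + i] = (byte) s.charAt(i); // valid for ASCII only: one byte per char
        }
        return offset + s.length();
    }

    public static void main(String[] args) {
        byte[] buf = new byte[64];
        int end = writeAscii(buf, 0, "35=D");
        System.out.println(new String(buf, 0, end, StandardCharsets.US_ASCII)); // 35=D
    }
}
```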
Avoid temp objects
Objects take time to construct and initialise. Consider reusing instance variables instead (if the instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows invoking code to avoid object creation and reuse instances where appropriate.

String str = order.toString();  // the API forces construction of a temporary String

versus

_str.reset();                   // a reusable "working" instance var
order.toString( _str );         // buffer passed into the method, so no temp objects required
Don't make everything reusable
Reuse just where the objects would otherwise cause GC. Object reuse comes with a risk of corruption, and a key goal of Java was to avoid those nasty bugs. Unfortunately for ultra low latency it's not optional: you have to reuse objects (remember there are places in the Java class libraries that already use pools and reuse).
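A minimal single-threaded recycling pool, as an illustration (names are mine, not SMT's pool classes). The release method is where the corruption risk mentioned above lives: the caller must never touch a tick after handing it back.

```java
import java.util.ArrayDeque;

// Sketch: pool only the objects that would otherwise be churned per event.
// Single-threaded by design (see the synchronisation tip), so no locking.
public class TickPool {
    public static final class Tick {
        long price;
        long qty;
        Tick set(long price, long qty) { this.price = price; this.qty = qty; return this; }
    }

    private final ArrayDeque<Tick> free = new ArrayDeque<>();

    // Reuse a recycled tick if one is free, otherwise grow on demand.
    public Tick acquire(long price, long qty) {
        Tick t = free.poll();
        if (t == null) t = new Tick();
        return t.set(price, qty);
    }

    // The caller must not hold a reference past this point: using a released
    // tick is exactly the corruption bug that GC normally protects you from.
    public void release(Tick t) {
        free.push(t);
    }
}
```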
Avoid finalize
Objects which hold resources such as files and sockets should all attempt to shut down cleanly and not rely on finalisers. Add explicit open and close methods, and add shutdown handlers to close cleanly where possible.
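The explicit lifecycle might be sketched like this (class name is mine): open/close methods, deterministic release in finally, and a JVM shutdown hook as a last resort on an orderly exit, with finalize() nowhere in sight.

```java
// Sketch: explicit resource lifecycle instead of finalize().
public class ManagedResource implements AutoCloseable {
    private boolean open;

    public void open() {
        open = true; // e.g. open the socket / file here
    }

    @Override public void close() {
        if (open) {
            open = false; // e.g. flush and release the socket / file here
        }
    }

    public boolean isOpen() { return open; }

    public static void main(String[] args) {
        ManagedResource r = new ManagedResource();
        // Last-resort cleanup on orderly JVM shutdown (not a substitute for close()).
        Runtime.getRuntime().addShutdownHook(new Thread(r::close));
        r.open();
        try {
            // ... use the resource ...
        } finally {
            r.close(); // deterministic release on the normal path
        }
    }
}
```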
Avoid ThreadLocal
Every ThreadLocal get involves a lookup in the current thread's ThreadLocalMap, so only use it where really needed.
24 * 7
Design your systems to run 24 * 7. This was common in the 80's and 90's, less so now in finance.
-XX:+PrintCompilation
-XX:+CITime
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+LogCompilation
-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+BackgroundCompilation
Even with this on there is still latency in switching in newly compiled routines. I really wish that switch time was much, much quicker!
-XX:CompileThreshold=1000
Lowers the number of invocations before a method is compiled. If you don't want to write warmup code, you can instead force initial compilation with -Xcomp, but then you don't benefit from the fastest possible code given the runtime heuristics: it may run 10% slower, and sometimes much slower, depending on the code.
-XX:+TieredCompilation
Code is initially compiled with the C1 (client/GUI) compiler, then once it reaches the invocation limit it is recompiled with the fully optimised C2 (server) compiler. The C1 compiler is much quicker to compile a routine than C2, and this reduced some outlying latency in SMT for routines that were not compiled during warmup (e.g. code paths not covered in warmup).
-XX:-ClassUnloading
Disable class unloading; I don't want any possible jitter from this. SMT doesn't use custom class loaders and tries to load all required classes during warmup.
-XX:+UseCompilerSafepoints
I had hoped that disabling compiler safepoints would reduce JIT jitter, but in SMT's multithreaded system it brings instability, so I ensure the safepoints stay on. More jitter I don't want, ho hum.
-XX:CompileCommandFile=.hotspot_compiler
The filename used to be picked up by default, but now you have to use this command. It is really handy: if you have a small routine that you can't simplify further and that causes many recompilations, prevent compilation by adding a line to this file, for example:

exclude sun/nio/ch/FileChannelImpl force

This means the routine won't be compiled; you need to benchmark to determine whether running the routine as bytecode has a noticeable impact.
-XX:+UseCompressedOops
I half expected this to have a small performance overhead, but in fact it slightly improved performance, perhaps through reduced object size fitting more instances into CPU cache.
-Xnoclassgc
Again, all classes are loaded during warmup and I don't want any possible jitter from trying to free up / unload them.
-XX:-RelaxAccessControlCheck
To be honest, I have no idea why I still have this in, or even whether it's still required!
-Djava.net.preferIPv4Stack=true
If you upgraded Java and spent hours working out why your socket code isn't working any more, this could well be it. DOH!!!
-server
Don't forget this if running benchmarks on a PC.
-XX:+UseFastAccessorMethods
-XX:+UseFastJNIAccessors
-XX:+UseThreadPriorities
Not sure this is needed for SMT; I use JNI calls to hwloc routines for thread core affinity.
-XX:-UseCodeCacheFlushing
-XX:-UseBiasedLocking
Disable biased locking: it causes horrendous jitter in ultra low latency systems with discrete threading models. Probably the single biggest cause of jitter from a JVM arg that I found.
-XX:+UseNUMA
Assuming you have a multi-CPU system, this can have a significant impact; google "NUMA architecture".
-XX:-DoEscapeAnalysis
Try disabling escape analysis and see what the impact is.
-Xcomp
Mentioned earlier: compile each class when loaded, as opposed to optimising based on runtime heuristics. This avoids JIT jitter, but the code is in general slower than dynamically compiled code. I can't remember whether it compiles all classes on startup or as each class is loaded; google failed to help me here!
-XX:+UseCompressedStrings
Use byte arrays in Strings instead of char arrays. SMT has its own ReusableString which uses byte arrays. Obviously a no-go for systems that require multi-byte character sets such as Japanese Shift-JIS. All IO is in bytes, so this avoids the constant translation between char and byte.
-XX:-UseCounterDecay
Experiment with disabling / re-enabling the recompilation decay timers. I believe the decay timers delay recompilation from happening within 30 seconds, a real pain in warmup code: I run warmup code, pause 30 seconds, then rerun! There must be a better way. I wish decent documentation existed that wasn't hidden away!
-XX:PerMethodRecompilationCutoff=1
Try setting the maximum recompilation boundary; it didn't help me much.