There is no JVM arg that fits all applications; the key is to have a repeatable full test bed and run full scale benchmarks over hours, not seconds. Rinse and repeat several times for EACH arg change. The args I focus on are the ones SubMicroTrading runs with, which performs no GC and has almost no JIT post warmup.
Please note some of these flags may now be on by default … sorry, I haven't checked, but I think they are still worth bringing to attention.
For standard java
applications which do lots of GC with mainly short lived objects, I would
recommend try the G1 collector … for market data I found it much better than
concurrent mark sweep. I will blog about that another time … spent weeks tuning
poorly designed apps (advice don’t bother buy Zing).
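If you want a hedged starting point only (the pause target and main class below are illustrative placeholders, not tuned values), G1 can be switched on with something like:

java -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -verbose:gc com.example.MarketDataApp

As always, benchmark it against your own workload before trusting any pause target.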
Note each Java upgrade brings new options and tweaks existing performance, sometimes up, sometimes down, so re-benchmark after each Java upgrade.
Treat micro benchmarks with care … I will discuss the Generics benchmark another time and explain how the results on my PC differed from Linux.
Avoid biased locking … it incurs regular millisecond latency in systems I have tested.
JVM Args for Recording Jitter (JIT/GC)
-XX:+PrintCompilation
-XX:+CITime
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+LogCompilation
-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
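For example, a jitter-recording run might combine them along these lines (the main class and log file are just placeholders, not SMT's):

java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+CITime \
     -XX:+PrintInlining -XX:+LogCompilation \
     -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails \
     com.example.MyApp > jit_gc_diag.log 2>&1

Note that PrintInlining and LogCompilation are diagnostic flags, so UnlockDiagnosticVMOptions must appear before them.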
Rather than regurgitate what I previously googled, for understanding the output from PrintCompilation see :- http://blog.joda.org/2011/08/printcompilation-jvm-flag.html
For ultra low latency you want no GC and no JIT, so in SMT I preallocate pools and run warmup code, then invoke System.gc(). I take note of the last compilation entry, then while re-running the controlled bench test I look for new JIT output (generally recompilation). When this occurs I go back to the warmup code and find out why the routine had to be recompiled. This generally comes down to either the code not being warmed up, or the routine being too complicated for the compiler. Either add further warmup code or simplify the routine. Adding final everywhere really helps.
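As a rough sketch only (the pool setup, event type and iteration count below are made up for illustration, not SMT's actual code), the warmup phase looks something like this:

public final class WarmupRunner {

    // well above the -XX:CompileThreshold setting so the hot paths get compiled
    private static final int WARMUP_ITERATIONS = 20_000;

    public static void warmup() {
        preallocatePools();                      // all steady-state objects created up front

        for (int i = 0; i < WARMUP_ITERATIONS; i++) {
            runHotPathWithSyntheticEvent(i);     // drive the real code paths so the JIT compiles them now
        }

        System.gc();                             // one explicit collection, then no further GC expected
    }

    private static void preallocatePools() {
        // hypothetical: fill pools of orders, market data events, buffers, etc.
    }

    private static void runHotPathWithSyntheticEvent(final int seed) {
        // hypothetical: push a synthetic event through the same routines the
        // live system will use, exercising the branches expected in production
    }
}

After this returns I note the tail of the PrintCompilation output; anything that compiles later, during the benchmark run itself, is something the warmup missed.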
Writing warmup code is a pain, and I am gutted the java.lang.Compiler.disable() method is not implemented (or at least it wasn't in OpenJDK 1.6 … empty method, doh!). Ideally I would invoke this when the application is warm and have no recompilation due to the compiler thinking it can make further optimisations.
Java can recompile and recompile; in my experience this only happens when a method is too complex. Of course, if a recompilation happens because Java inlined a non-final method and the optimisation turned out to be premature, then the code needs to be corrected. What I want to avoid is recompilation triggered by edge cases that only infrequently go into certain code branches.
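A minimal sketch of the kind of thing I mean (the event handler below is hypothetical, not SMT code): if warmup only ever exercises the normal branch, the first real error event can trigger a deoptimisation and recompile mid-session, so the warmup should feed both branches:

public final class EventHandler {

    interface Event { boolean isError(); }

    public void onEvent(final Event event) {
        if (event.isError()) {
            handleError(event);    // edge case branch, rarely taken in production
        } else {
            handleNormal(event);   // hot path
        }
    }

    private void handleError(final Event event)  { /* hypothetical error handling */ }
    private void handleNormal(final Event event) { /* hypothetical hot path */ }

    // feed both branches during warmup so the compiled code is not thrown
    // away the first time an error event arrives in the live session
    public void warmupBothBranches(final int iterations) {
        for (int i = 0; i < iterations; i++) {
            final boolean error = (i % 10 == 0);
            onEvent(() -> error);
        }
    }
}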
Note you cannot guarantee no GC and no JIT under every situation in a complex system. What you can do is guarantee no JIT/GC for the KEY specified scenarios that the business demands. If a trading system does 10 million trades a day, I would set a goal of no GC/JIT under NORMAL conditions with 30 million trades, then check performance up to 100 million to see at which point jitter occurs. If, for example, the exchange disconnects you during the day and that kicks in a few milliseconds of JIT, it's not important. You don't need to pool every object … just the key ones that cause GC. More on that in a future blog on SuperPools.
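For illustration only (this is nothing like SMT's SuperPools, just the general shape of a preallocated pool), pooling the key object types looks roughly like this:

import java.util.ArrayDeque;
import java.util.function.Supplier;

public final class SimplePool<T> {

    private final ArrayDeque<T> free;

    public SimplePool(final int size, final Supplier<T> factory) {
        free = new ArrayDeque<>(size);
        for (int i = 0; i < size; i++) {
            free.push(factory.get());          // preallocate everything during warmup
        }
    }

    public T acquire() {
        final T obj = free.poll();
        if (obj == null) {
            // pool sized too small; in steady state this should never happen
            throw new IllegalStateException("pool exhausted");
        }
        return obj;
    }

    public void release(final T obj) {
        free.push(obj);                        // return for reuse, no garbage created
    }
}

Size the pool from the business scenario (e.g. the 30 million trades target above) so the steady-state run never allocates.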
I remember speaking to Gil Tene from Azul while working at Morgan Stanley, and really trying to get across how much more of a pain JIT is than GC. Some exciting developments seem to have been made with Zing and I would have been very interested in bench testing that … alas I just don't have time at present. Very impressed with Azul and Gil and how they respond to queries and enhance their product ….. so much better than Sun/Oracle were with Java.
SubMicroTrading JVM Arguments
The following are the arguments that SubMicroTrading runs with; this includes the algo container, OMS, exchange sim and client sim.
-XX:+BackgroundCompilation
    Even with this on there is still latency when switching in newly compiled routines. I really wish that switch time was much, much quicker!
-XX:CompileThreshold=1000
    Lower the number of invocations required before a routine is compiled, so compilation kicks in sooner during warmup. If you don't want to benefit from the fastest possible code given the runtime heuristics, you can force initial compilation with -Xcomp … an option if you don't want to write warmup code. It may run 10% slower, but sometimes much slower, depending on the code.
-XX:+TieredCompilation
    Code is initially compiled with the C1 (client) compiler, then when it reaches the invocation limit it is recompiled with the fully optimised C2 (server) compiler. The C1 compiler is much quicker to compile a routine than the C2 compiler, and this reduced some outlying latency in SMT for routines that were not compiled during warmup (e.g. code paths not covered in warmup).
-XX:-ClassUnloading
    Disable class unloading; don't want any possible jitter from this. SMT doesn't use custom class loaders and tries to load all required classes during warmup.
-XX:+UseCompilerSafepoints
    I had hoped that disabling compiler safepoints would reduce JIT jitter, but in SMT's multithreaded system it brings instability, so I ensure the safepoints are on ….. more jitter I don't want, ho hum.
-XX:CompileCommandFile=.hotspot_compiler
    The file used to be picked up by default, but now you have to specify it with this option. This is really handy: if you have a small routine you can't simplify further which causes many recompilations, then prevent it by adding a line to this file, for example :- exclude sun/nio/ch/FileChannelImpl force. This means the routine won't be compiled; you need to benchmark to determine whether running the routine as bytecode has a noticeable impact.
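To illustrate the file format (both entries below are hypothetical examples, not taken from SMT), the file is simply one directive per line in the form "exclude <class> <method>":

exclude com/example/myapp/OrderBook rebuildDepth
exclude com/example/myapp/FixEncoder encodeHeartbeat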
-XX:+UseCompressedOops
    I kind of expected this to have a small performance overhead, but in fact it has slightly improved performance … perhaps through reduced object size and fitting more instances into CPU cache.
-Xnoclassgc
    Again, all classes are loaded during warmup and I don't want any possible jitter from trying to free up / unload them.
-XX:-RelaxAccessControlCheck
    To be honest, no idea why I still have this in, or even if it's still required!
-Djava.net.preferIPv4Stack=true
    If you have upgraded Java and spent hours working out why your socket code isn't working anymore, this could well be it … DOH!!!
-server
    Don't forget this if running benchmarks on a PC.
-XX:+UseFastAccessorMethods

-XX:+UseFastJNIAccessors
-XX:+UseThreadPriorities
    Not sure this is needed for SMT; I use JNI calls to hwloc routines for thread core affinity.
-XX:-UseCodeCacheFlushing
-XX:-UseBiasedLocking
    Disable biased locking; it causes horrendous jitter in ultra low latency systems with discrete threading models. Probably the single biggest cause of jitter from a JVM arg that I found.
-XX:+UseNUMA
    Assuming you have a multi-CPU system, this can have a significant impact … google NUMA architecture.
JVM Arguments to experiment with … didn't help SMT, but may help you
-XX:-DoEscapeAnalysis
    Try disabling escape analysis and see what the impact is.
-Xcomp
    Mentioned earlier: compile classes when loaded, as opposed to optimising based on runtime heuristics. Avoids JIT jitter, but the code is in general slower than dynamically compiled code. Can't remember if it compiles all classes on startup or as each class is loaded; google failed to help me here!
-XX:+UseCompressedStrings
    Use byte arrays in Strings instead of char arrays. SMT has its own ReusableString which uses byte arrays. Obviously a no-go for systems that require multi-byte character sets like Japanese Shift-JIS. All IO is in bytes, so this avoids the constant translation between char and byte.
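For a rough sense of the idea (this is a simplified sketch, not SMT's actual ReusableString), a byte-backed reusable string keeps one preallocated buffer and copies bytes straight off the wire:

import java.util.Arrays;

public final class ByteBackedString {

    private byte[] bytes;
    private int    length;

    public ByteBackedString(final int capacity) {
        bytes = new byte[capacity];            // preallocated once, reused per message
    }

    public void reset() {
        length = 0;                            // reuse the buffer, no new allocation
    }

    // copy bytes straight from the IO buffer, no char conversion
    public void copyFrom(final byte[] src, final int offset, final int len) {
        if (len > bytes.length) {
            bytes = Arrays.copyOf(bytes, len); // growth should only happen during warmup
        }
        System.arraycopy(src, offset, bytes, 0, len);
        length = len;
    }

    public int  length()            { return length; }
    public byte byteAt(final int i) { return bytes[i]; }
}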
-XX:-UseCounterDecay
    Experiment with disabling / re-enabling the invocation counter decay timers. I believe the decay timers can delay recompilation from happening within 30 seconds, which is a real pain in warmup code. I run the warmup code, pause 30 seconds, then rerun! There must be a better way. I wish decent documentation existed that wasn't hidden away!
-XX:PerMethodRecompilationCutoff=1
    Try setting the maximum per-method recompilation boundary … it didn't help me much.
I have tried many, many other JVM args, but none of them had any favourable impact on SMT performance.