Sunday 16 August 2015

Avoid Unnecessary Allocations and Memcpy's

This post may seem obvious, but it's worth a mention anyway.

Low latency Java needs to follow a different path from standard Java, avoiding allocations and unnecessary memcpy's.


Consider the following examples: how many memory allocations occur for the StringBuilder?

A) Typical Java code

private Message read() {
    StringBuilder info = new StringBuilder();

    Message m = decode();

    info.append( "seqNum=" ).append( m.getSeqNum() )
        .append( ", isDup=" ).append( m.isPosDup() );

    m.dump( info );

    log( info );

    return m;
}

GC .. who cares ? Multiple allocs per call to read(). Thread safe.

B) Presized string buffer

private Message read() {
    StringBuilder info = new StringBuilder(1024);

    Message m = decode();

    info.append( "seqNum=" ).append( m.getSeqNum() )
        .append( ", isDup=" ).append( m.isPosDup() );

    m.dump( info );

    log( info );

    return m;
}

GC .. who cares ? One buffer alloc per call to read(). Thread safe.

C) Member variable

private final StringBuilder _info = new StringBuilder();

private Message read() {

    _info.setLength(0);

    Message m = decode();

    _info.append( "seqNum=" ).append( m.getSeqNum() )
         .append( ", isDup=" ).append( m.isPosDup() );

    m.dump( _info );

    log( _info );

    return m;
}

The StringBuilder grows lazily: some allocations until the buffer hits the max required size.
Best option where memory is limited and there can be many instances.
Not thread safe ... but that's FINE, as all code is assumed single threaded unless otherwise specified.
In ultra low latency the threading model is explicit and all contention is minimised and understood.

D) Presized member variable

private final StringBuilder _info = new StringBuilder(1024);

private Message read() {

    _info.setLength(0);

    Message m = decode();

    if ( isLogEnabled() ) {
        _info.append( "seqNum=" ).append( m.getSeqNum() )
             .append( ", isDup=" ).append( m.isPosDup() );

        m.dump( _info );

        log( _info );
    }

    return m;
}

Ok, so the log guard should have been in all the examples, but in typical Java code it's ignored and the allocations and memcpy's are paid ... a lazy tax.
This is the ultra low latency approach for the StringBuilder: a single buffer allocation, presized to the max requirement.
Not thread safe ... but that's FINE, as all code is assumed single threaded unless otherwise specified.
In ultra low latency the threading model is explicit and all contention is minimised and understood.

FYI SubMicroTrading doesn't use StringBuilder but ReusableString, to avoid the overhead of toString() and to use byte instead of char.
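To give a feel for the idea, here is a minimal sketch of a reusable byte-backed string .. this is hypothetical and not the real ReusableString, which has a much richer API :-

public final class SimpleReusableString {

    private byte[] _buf;
    private int    _len;

    public SimpleReusableString( int presize )  { _buf = new byte[ presize ]; }

    public SimpleReusableString reset()         { _len = 0; return this; }   // reuse, no realloc

    public SimpleReusableString append( byte b ) {
        if ( _len == _buf.length ) _buf = java.util.Arrays.copyOf( _buf, _buf.length * 2 );
        _buf[ _len++ ] = b;
        return this;
    }

    public SimpleReusableString append( String s ) {        // ASCII only in this sketch
        for ( int i = 0; i < s.length(); i++ ) append( (byte) s.charAt( i ) );
        return this;
    }

    public SimpleReusableString append( long v ) {          // no toString(), no temp objects
        if ( v < 0 ) { append( (byte) '-' ); v = -v; }      // (ignores Long.MIN_VALUE for brevity)
        if ( v >= 10 ) append( v / 10 );
        return append( (byte) ('0' + (v % 10)) );
    }

    public int    length()   { return _len; }
    public byte[] getBytes() { return _buf; }               // bytes [0.._len) go straight to the log / socket
}

The key point is the bytes never round trip through char or a temporary String on the way to the log or socket buffer.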



Tuesday 11 August 2015

SubMicroTrading Open Source on GitHub

As explained in my last post, I have now open sourced SubMicroTrading.com on GitHub

                                          SubMicroTrading on GitHub

There is almost a quarter of a million lines of code.

Code highlights for me are

1) The socket session class hierarchy and cleanly extending it across all the different exchange sessions

2) Exchange Agnostic Algo Container to make writing strategies easier

3) Highly concurrent book maintenance in MarketDataController and Book snapping with optimisation to snap at most once per thread without a map lookup

Please check it out !!


Sunday 7 June 2015

SubMicroTrading Ultra Low Latency Java Trading Framework Preparing for Open Source


If you are using FPGA or the current third party providers for trading where microseconds really matter then please consider evaluating SubMicroTrading (TM). This is not vapourware or empty promises. It’s a trading framework almost 5 years in the making that I am preparing for open source with a target date August 2015.

Java can be used for ultra low latency.

        Just because someone doesn't know how to do something, doesn't mean it's not possible

Key stats measured wire to wire in an independent lab using TipOff

800,000+ CME fast fix messages decoded per core per second

4 micros average tick to trade, wire to wire at 800,000 ticks/second (MAX tcpreplay), which includes :-
Read packet from wire using Solarflare OpenOnload and deliver to main memory (1 micro)
Decode market data tick into a normalised POJO event (<1 micro)
Update book and invoke algo container (<1 micro)
Simple algo which crosses the spread every X ticks and creates an order POJO
Encode order to CME fix order request, ready to write to socket buffer (<1 micro)
Write packet to wire using Solarflare OpenOnload (1 micro)

In-process latency is measured at 2 micros, with 2 micros in/out of Solarflare/OpenOnload.
Note latency is highly dependent on configuration and data topology (which is why concurrency is so important).

What's in the open release
Current model and generated codecs including ETI, UTP, Millenium, Fix, FastFix, CME MDP
All market data and exchange codecs convert from external wire format to normalised common internal POJO events
Possibly fastest standard Fix engine on planet
Possibly fastest FastFix implementation on planet
Possibly fastest log engine on planet
Possibly fastest memory mapped index paged persistence
Possibly fastest OMS on planet
Custom exchange session engines for ETI, UTP, Millenium, Fix, FastFix
Exchange trading simulator (works with any of the generated codecs like ETI)
Complete core of SubMicroTrading including thread core affinity
Component architecture for easy configuration of flow pipelines
Ability to extend and customise the source code of any component

What's not in the first open release
Encoder/Decoder and model generator
Exchange and market data agnostic Algo container
CME dynamic on the fly session generation
Book Manager and Book Conflation for optimal concurrent update processing
Sample spread algo implementation really shows the power of Java algos.

Note when comparing SubMicroTrading with other products, remember you have the source, you have full control. Don’t compare apples and oranges … SubMicroTrading converts wire messages to appropriate normalised domain objects allowing a clean and simple to use algo container.

To really compare performance you must test wire to wire within a controlled network. For really high throughput and to avoid exchange test environment throttling, run the exchange trading simulator and market data replay on a separate server. You can then run SubMicroTrading on the trading server then switch to an alternative implementation. Try the T1 benchmark at different TCP replay rates.

Try it .. it's pretty amazing to run the market data playback via tcpreplay, the trading application and the exchange simulator all on a low power laptop. To see the true power, run on tuned CentOS Linux with custom NIO and thread affinity configured.

Follow the blog or register on the website for confirmation on the open launch.

Saturday 6 June 2015

Setting Thread Affinity and Priority using JNI in SubMicroTrading


 
There is no real mystique in JNI calls, and the idea that JNI is slow is a misconception. If you keep your JNI interfaces simple then, when the code is compiled, it's just another function call (albeit with an extra two parameters).

I recommend wrapping JNI calls within an envelope which allows switching between Linux, Windows, or perhaps no custom JNI at all. I developed SubMicroTrading on a little Dell Adamo laptop and could run the exchange sim, market data sim and trading application all on a dual core with 4GB RAM … try doing that in C++ !

In SubMicroTrading all custom JNI calls (excluding custom NIO) are wrapped within a class called NativeHooksImpl (simplified and cut down version below).

public class NativeHooksImpl implements NativeHooks {

    private static boolean _linuxNative   = false;
   
    static {
        if ( Env.isUseLinuxNative() ) {
            System.loadLibrary( "submicrocore" );
            _linuxNative = true;
        }
    }

    private static NativeHooks _instance = new NativeHooksImpl();
    public  static NativeHooks instance() { return _instance; }

    private static native void jniSetPriority( int mask, int priority );
   
    @Override public void setPriority( Thread thread, int mask, int priority ) {
        if ( _linuxNative ) {
            jniSetPriority( mask, priority );
        } else {
            thread.setPriority( priority );
        }
    }
    ………...

To generate the header file :-

javah -force -classpath ..\..\bin -o src\SubMicroCore_jni.h com.rr.core.os.NativeHooksImpl

Sample entry from the generated header …. Clearly the actual function must match the definition

/*
 * Class:     com_rr_core_os_NativeHooksImpl
 * Method:    jniSetPriority
 * Signature: (II)V
 */
JNIEXPORT void JNICALL Java_com_rr_core_os_NativeHooksImpl_jniSetPriority(JNIEnv *, jclass, jint, jint);

Implementation of the set priority method .. note this sets the cpumask and priority for the CURRENT thread, so invoke this method at the start of the thread's run() method. SubMicroTrading keeps all the thread and priority mappings in a config file, which is essential. I use different configs for each PC/server.

JNIEXPORT void JNICALL Java_com_rr_core_os_NativeHooksImpl_jniSetPriority( JNIEnv *env, jclass clazz, jint cpumask, jint priority ) {

    int topodepth;
    hwloc_topology_t topology;
    hwloc_cpuset_t cpuset;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    topodepth = hwloc_topology_get_depth(topology);

    cpuset = hwloc_bitmap_alloc();
    hwloc_bitmap_from_ulong( cpuset, (unsigned int)cpumask );

    char *str;
    hwloc_bitmap_asprintf(&str, cpuset);

    printf("cpumask [%d] => hwloc [%s]\n", cpumask, str);

    if (hwloc_set_cpubind(topology, cpuset, HWLOC_CPUBIND_THREAD)) {
        printf("Couldn't bind cpuset %s\n", str);
    } else {
        printf("BOUND cpuset %s\n", str);
    }

    free(str);

    /* Free our cpuset copy */
    hwloc_bitmap_free(cpuset);

    /* Destroy topology object. */
    hwloc_topology_destroy(topology);
}
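On the Java side, a hypothetical usage sketch .. the mask and priority values here are illustrative, not from any SMT config :-

public final class PinnedWorker implements Runnable {

    private static final int CPU_MASK = 0x08;   // illustrative : bind to core 3
    private static final int PRIORITY = 10;     // illustrative priority

    @Override public void run() {
        // bind affinity and priority for THIS thread before entering the hot loop
        NativeHooksImpl.instance().setPriority( Thread.currentThread(), CPU_MASK, PRIORITY );

        while( !Thread.currentThread().isInterrupted() ) {
            // ... hot event loop ...
        }
    }
}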

Here's the Linux makefile I wrote :-

# g++: 3.2.3

.SUFFIXES:        .c

TMP_PATH=./target
BIN_PATH=./bin/linux

#DEBUG=            -g
DEBUG=

DLL_NAME=        libsubmicrocore.so

CC=                      gcc
CFLAGS=          -O3 -march=nocona -m64 -I"${JAVA_HOME}/include" -I"${JAVA_HOME}/include/linux" -DLINUX -fPIC -I${HWLOC_HOME}/include -I./sun
LD=                      gcc
LDFLAGS=-L${HWLOC_HOME}/lib -L${JAVA_HOME}/jre/lib/amd64 -L${JAVA_HOME}/jre/lib/amd64/server

LIBS=-m64 -lhwloc -ljava -ljvm -lverify -lnio -lnet -lrt

all:    setup lib

setup:
	mkdir -p ${TMP_PATH}
	mkdir -p ${BIN_PATH}

lib: ${BIN_PATH}/${DLL_NAME}

${BIN_PATH}/${DLL_NAME}: ${TMP_PATH}/SubMicroCore_jni.o
	${LD} ${LDFLAGS} -LD ${LIBS} -shared -o ${TMP_PATH}/${DLL_NAME} ${TMP_PATH}/SubMicroCore_jni.o
	cp -f ${TMP_PATH}/${DLL_NAME}  ${BIN_PATH}/${DLL_NAME}

${TMP_PATH}/SubMicroCore_jni.o: src/SubMicroCore_jni.c src/SubMicroCore_jni.h
	${CC} ${CFLAGS} -o ${TMP_PATH}/SubMicroCore_jni.o -c src/SubMicroCore_jni.c

clean:
	rm -rf ${TMP_PATH}/*
	rm -rf ${BIN_PATH}/*

FYI I had written a Windows version of the library but ditched it, as for ultra low latency you really need the level of control that Linux gives you … especially as it's free !

This is all EASY thanks to the good work of the hwloc project :-


I am using a pretty old version (I think 1.0.2 … can't check as my Linux servers are offline atm) so the API may have changed, but I would expect the impact to be minimal.

I will give recommendations on how to use thread affinity and priority in a post on threading models. Please use with care: poor usage can grind a system to a halt.

My plan is still to open source components from SubMicroTrading, which will include the complete JNI layer .. ie the above plus various timer and microsecond sleep functions.


Saturday 23 May 2015

Coding for Ultra Low Latency

For a system to operate as fast as possible, every line of code needs to be optimal. If you take the approach of writing lazy code and then optimising, you will end up rewriting everything. A profiler won't help you at the nanosecond level; the overhead of running with profiler metrics will have you "chasing your tail" !

Writing optimal code from the start of the project is easy: set up coding standards and enforce them. Have a simple set of guidelines that everyone follows.

Minimise synchronisation
The synchronized keyword used to be really slow and was avoided, with more complex lock classes used in preference. But with the advent of under-the-covers lock spinning this is no longer the case. That said, even when a lock is uncontended you still pay the overhead of read and write memory barriers. So use synchronized where it's absolutely needed, ie where you have real concurrency.
The key here is application design: you want components to be single threaded, achieving throughput via concurrent instances which are independent and require no synchronisation.
Minimise use of volatile variables
Understand how your building blocks work eg AtomicInteger, ConcurrentHashMap.
Only use concurrent techniques for the code that needs to be concurrent.
Minimise use of CAS operations
An efficient atomic operation that bypasses the O/S and is implemented as a CPU instruction. However, to make it atomic and consistent it incurs a memory barrier, hitting cache effectiveness. So use it where needed, and not where not !
Avoid copying objects unnecessarily
I see this A LOT and the overhead soon mounts up.
The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue comes from the required concurrency of shared state across instances running in separate threads.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, they used abstract concepts with object values stored in maps. Not only do these systems run slowly, they lack compile time safety and are simply a pain. Use maps where they are needed … eg a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
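For illustration, a hypothetical contrast of the two styles :-

// map based "object model" : a lookup, an unbox and a cast on every access
Map<String,Object> order = new HashMap<String,Object>();
order.put( "qty", 100L );
long qty = (Long) order.get( "qty" );

// real object model : direct field access, checked at compile time
public final class Order {
    private long _qty;
    public  long getQty()           { return _qty; }
    public  void setQty( long qty ) { _qty = qty; }
}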
Presize collections
Understand the cost of growing collections: eg a HashMap has to create a new array of double the size and then rehash its elements, an expensive operation when the map has grown into the hundreds of thousands. Make initial sizes configurable, as in the sketch below.
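For example (sizes are illustrative, Order is the hypothetical class from above), presize for the expected peak and allow for the load factor so the map never rehashes :-

// expecting up to 500,000 live orders : presize so no rehash ever occurs
// HashMap resizes when size > capacity * loadFactor, so divide by the load factor
int   expectedOrders  = 500000;       // illustrative .. make it configurable
float loadFactor      = 0.75f;
int   initialCapacity = (int) (expectedOrders / loadFactor) + 1;

Map<String,Order> orders = new HashMap<String,Order>( initialCapacity, loadFactor );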
Reuse heuristics
At the end of the day, write out the size of all collections. Next time the process is bounced, presize to the previously stored max.
Generate other metrics, like number of orders created, hit percentage, max tick rate per second … figures that can be used to understand performance and give context to unexpected latency.
Use Object Orientation
Avoiding object orientation for fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro, end to end scale what's the impact ? In Java all methods are virtual, but the JIT compiler knows which classes are currently loaded and can not only avoid a vtable lookup but also inline the code. The benefit of object orientation is huge: component reuse and extensibility make it easy to extend and create new strategies without swathes of cut and paste code.
Use final keyword everywhere
Help the JIT compiler optimise .. if in future a method or class needs extending then you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Huge methods will never be compiled; big complex methods may be compiled, but the compiler may end up recompiling and recompiling the method to try and optimise. David Straker wrote "KISS" on the board and I never forgot it ! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives and use long over Long, thus avoiding any auto boxing overhead (turn the auto boxing warning on).
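A trivial illustration of the cost :-

// boxed : values outside the small Long cache allocate a new Long every iteration
Long boxedTotal = 0L;
for ( Long i = 0L; i < 1000000L; i++ ) {
    boxedTotal += i;                  // unbox, add, rebox .. garbage on every pass
}

// primitive : no allocation at all
long total = 0;
for ( long i = 0; i < 1000000; i++ ) {
    total += i;
}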
Avoid Immutables
Immutable objects are fine for long lived objects, but cause GC for anything else … eg a trading system consuming market data would GC every second if each tick created an immutable POJO.
Avoid String
String is immutable and a big no-no for ultra low latency systems. In SMT I have ZString, an immutable "string-like" interface, with ViewString and ReusableString as concrete implementations.
Avoid Char
Use byte and byte[], and avoid translation between byte and char on every IO operation.
Avoid temp objects
Objects take time to construct and initialise. Consider using instance variables for reuse instead (if the instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows invoking code to avoid object creation and reuse instances where appropriate :-


String str = order.toString();      // the API forces construction of a temporary String

Versus

_str.reset();                       // a reusable "working" instance var
order.toString( _str );             // buffer passed into the method, so no temp objects required

Don’t make everything reusable
Only where the objects would otherwise cause GC.
Object reuse comes with a risk of corruption; a key goal of Java was to avoid those nasty bugs.
Unfortunately for ultra low latency that's not an option: you have to reuse objects (remember there are places in the Java class libraries that already use pools and reuse).
Avoid finalize
Objects which hold resources such as files and sockets should all attempt to shut down cleanly and not rely on finalisers. Add explicit open and close methods, and add shutdown handlers to close cleanly where possible.
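A hypothetical sketch of the pattern :-

// explicit close plus a shutdown hook .. no reliance on finalize()
public final class SessionResource {

    private final java.net.Socket _socket;

    public SessionResource( java.net.Socket socket ) {
        _socket = socket;
        // an orderly JVM shutdown still closes the socket cleanly
        Runtime.getRuntime().addShutdownHook( new Thread() {
            @Override public void run() { closeQuietly(); }
        } );
    }

    public void close() throws java.io.IOException { _socket.close(); }

    private void closeQuietly() {
        try { close(); } catch ( java.io.IOException e ) { /* best effort on shutdown */ }
    }
}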
Avoid threadlocal
Every ThreadLocal get involves a map lookup keyed on the current thread, so only use it where really needed.
24 * 7
Design your systems to run 24 * 7 …. common in the 80's and 90's, less so now in finance.


Saturday 16 May 2015

Java Bytecode Latency Impact

In the 80's I remember building NAND circuits to represent code. It was pretty cool seeing how code could be implemented at the circuit level. What I was unsure of when I started SubMicroTrading was the performance impact of Java bytecode and whether any optimisations were available.

To cut a long story short, I found only one worthwhile optimisation, and that's how a switch statement is represented in bytecode.

Consider the following switch statement :-

    switch( a ) {
    case 10 : doAStuff(); break;
    case 20 : doBStuff(); break;
    case 30 : doCStuff(); break;
    case 40 : doDStuff(); break;
    ...
    }

    is conceptually the same as

    if      ( a == 10 ) doAStuff();
    else if ( a == 20 ) doBStuff();
    else if ( a == 30 ) doCStuff();
    else if ( a == 40 ) doDStuff();

Think about that: if you are parsing fix with 1000 possible fix tags and an average of 20 fields per message, then a linear scan averaging 500 comparisons per field means around 10,000 comparisons per message. If you want to process 1,000,000 events per second, that would be 10,000,000,000 comparisons per second. (A huge number based on linear top down search; binary search is clearly much better … but the point is it's still a cost that can be avoided.)

Java has two bytecodes for switch statements. lookupswitch is in effect a table of key value to jump label … ie you have to search the table to find the correct key entry. tableswitch is in effect a table of jump labels which are indexed directly by the key value (minus the table offset, ie the lowest key value in the switch statement).

For ultra low latency you should consider adding an ant task to check the bytecode for any "lookupswitch" statements. For message processing on most exchanges you can safely force a switch statement to become a tableswitch by adding packer entries so there are no gaps between the key values. In my CODEC generators I stipulate a max pack range, eg 100, and any sparse values are handled within the default statement, eg via a second switch, or an if statement if only a couple of keys are of interest. Like everything, test with real data to see the impact. For me, tableswitch made a HUGE difference.

Sample tableswitch with packer case statements :-

Here is the start of the switch statement within the Standard44Decoder generated class :-

        final byte msgType = _fixMsg[ _idx ];
        switch( msgType ) {
        case '8':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeExecReport();
        case 'D':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeNewOrderSingle();
        ………….
        // packers
        case '6': case '7': case ':': case ';': case '<': case '=':
        case '>': case '?': case '@': case 'B': case 'C': case 'E':
            break;


javap -c ./com/rr/model/generated/fix/codec/Standard44Decoder.class >s44.bc

  protected final com.rr.core.model.Message doMessageDecode();
    Code:
…….
      74: tableswitch   { // 48 to 71
                    48: 549                                      // '0' ie heartbeat is the first entry
                    49: 987
                    50: 841
                    51: 768
                    52: 914
                    53: 695
                    54: 1060
                    55: 1060
                    56: 184                                        // '8' this was the first switch entry in java code
                    57: 476
                    58: 1060                                     // 1060 is same as default and are the packer entries
                    59: 1060
                    60: 1060
                    61: 1060
                    62: 1060
                    63: 1060
                    64: 1060
                    65: 622
                    66: 1060
                    67: 1060
                    68: 257
                    69: 1060
                    70: 403
                    71: 330
               default: 1060
          }
     184: aload_0
     185: getfield      #462                // Field _fixMsg:[B
     188: aload_0
     189: getfield      #466                // Field _idx:I
     192: iconst_1
     193: iadd
     194: baload
     195: iconst_1
     196: if_icmpeq     242
     199: aload_0
     200: new           #475                // class java/lang/StringBuilder
     203: dup
     204: ldc_w         #477                // String Unsupported fix message type

You could easily grep for lookupswitch … and remember, over time extra case statements could be added that cause the switch to become a lookupswitch again.
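For example, reusing the javap dump from above :-

javap -c ./com/rr/model/generated/fix/codec/Standard44Decoder.class | grep lookupswitch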


PS> clearly tableswitch isn't suitable for very sparse switch statements, but in my experience it's very useful in trading systems.

Saturday 9 May 2015

Java JVM Tuning for Ultra Low Latency

There is no JVM arg that fits all applications; the key is to have a repeatable full test bed and run full scale benchmarks over hours, not seconds. Rinse and repeat several times for EACH arg change. The args I focus on are the ones used by SubMicroTrading, which performs no GC and has almost no JIT post warmup.

Please note some of these flags may now be on by default … sorry, I haven't checked, but they're still worth bringing to attention I think.

For standard Java applications which do lots of GC with mainly short lived objects, I would recommend trying the G1 collector … for market data I found it much better than concurrent mark sweep. I will blog about that another time … I spent weeks tuning poorly designed apps (advice: don't bother, buy Zing).

Note each Java upgrade brings new options and tweaks existing performance, sometimes up, sometimes down, so re-benchmark on each Java upgrade.

Treat micro benchmarks with care … for example, a generics micro benchmark I ran behaved quite differently on a PC than on Linux.

Avoid biased locking … it incurs regular millisecond latency in systems I have tested.

JVM Args for Recording Jitter (JIT/GC)

-XX:+PrintCompilation
-XX:+CITime
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+LogCompilation
-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails

Rather than regurgitate what I previously googled, here is a good write-up on understanding the output from PrintCompilation :- http://blog.joda.org/2011/08/printcompilation-jvm-flag.html

For ultra low latency you want no GC and no JIT, so in SMT I preallocate pools and run warmup code, then invoke System.gc(). I take note of the last compilation entry, then while re-running a controlled bench test I look for new JIT output (generally recompilation). When this occurs I go back to the warmup code and find out why the routine had to be recompiled. This generally comes down to either the code not being warmed up, or the routine being too complicated for the compiler. Either add further warmup code or simplify the routine. Adding final everywhere really helps.
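A simplified sketch of the pattern .. the names here are mine, not the SMT API :-

// hypothetical warmup : drive the hot path past the compile threshold, then GC the garbage away
private static final int WARMUP_ITERATIONS = 20000;    // comfortably above -XX:CompileThreshold=1000

public void warmup( Decoder decoder, byte[] sampleMsg ) {
    for ( int i = 0; i < WARMUP_ITERATIONS; i++ ) {
        decoder.decode( sampleMsg, 0, sampleMsg.length );   // results discarded, compiler warmed
    }
    System.gc();    // collect the warmup garbage now, not during trading
    // any new -XX:+PrintCompilation entries after this point mean a code path was
    // missed in warmup, or the routine is too complex for the compiler
}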

Writing warmup code is a pain, and I am gutted the java.lang.Compiler.disable() method is not implemented (or at least it wasn't in OpenJDK 1.6 … empty method, doh!). Ideally I would invoke this when the application is warm, and have no recompilation due to the compiler thinking it can make further optimisations.

Java can recompile and recompile; in my experience this only happens when a method is too complex. Of course, if a recompilation occurs because Java inlined a non final method and the optimisation was premature, then the code needs to be corrected. What I want to avoid is recompilation from edge cases that infrequently go into code branches.

Note you cannot guarantee no GC and no JIT under every situation in a complex system. What you can do is guarantee no JIT/GC for the KEY scenarios that the business demands. If a trading system does 10 million trades a day, I would set a goal of no GC/JIT under NORMAL conditions at 30 million trades, then check performance up to 100 million to see at which point jitter occurs. If, for example, the exchange disconnects you during the day and that kicks in a few milliseconds of JIT, it's not important. You don't need to pool every object … just the key ones that cause GC. More on that in a future blog on SuperPools.

I remember speaking to Gil Tene from Azul while working at Morgan Stanley, and really tried to get across how much more of a pain JIT is than GC. Some exciting developments seem to have been made with Zing and I would have been very interested in benchtesting it … alas I just don't have time at present. Very impressed with Azul and Gil and how they respond to queries and enhance their product ….. so much better than Sun/Oracle were with Java.



SubMicroTrading JVM Arguments

The following are the arguments that SubMicroTrading runs with; this includes the algo container, OMS, exchange sim and client sim.

-XX:+BackgroundCompilation
Even with this on there is still latency in switching in newly compiled routines. I really wish that switch time was much much quicker !

-XX:CompileThreshold=1000
If you don't want to benefit from the fastest possible code given the runtime heuristics, you can force initial compilation with -Xcomp … an option if you don't want to write warmup code. This may run 10% slower, but sometimes much slower depending on the code.

-XX:+TieredCompilation
Code is initially compiled with the C1 (GUI/client) compiler, then when it reaches the invocation limit it is recompiled with the fully optimised C2 (server) compiler. The C1 compiler is much quicker to compile a routine than the C2 compiler, and this reduced some outlying latency in SMT for routines that were not compiled during warmup (eg for code paths not covered in warmup).

-XX:-ClassUnloading
Disable class unloading; I don't want any possible jitter from this. SMT doesn't use custom class loaders and tries to load all required classes during warmup.

-XX:+UseCompilerSafepoints
I had hoped that disabling compiler safepoints would reduce JIT jitter, but in SMT's multithreaded system it brings instability, so I ensure the safepoints are on ….. more jitter I don't want, ho hum.

-XX:CompileCommandFile=.hotspot_compiler
The filename used to be picked up by default but now you have to use this command.
This is really handy: if you have a small routine you can't simplify further which causes many recompilations, prevent them by adding a line to this file, example :-

exclude sun/nio/ch/FileChannelImpl force

This means the routine won't be compiled; you need to benchmark to determine if running the routine as bytecode has a noticeable impact.

-XX:+UseCompressedOops
I kind of expected this to have a small performance overhead, but in fact it slightly improved performance … perhaps through reduced object size and fitting more instances into CPU cache.

-Xnoclassgc
Again, all classes are loaded during warmup and I don't want any possible jitter from trying to free up / unload them.

-XX:-RelaxAccessControlCheck
To be honest, no idea why I still have this in, or even if it's still required !

-Djava.net.preferIPv4Stack=true
If you upgraded Java and spent hours working out why your socket code isn't working anymore, this could well be it … DOH !!!

-server
Don't forget this if running benchmarks on a PC.

-XX:+UseFastAccessorMethods

-XX:+UseFastJNIAccessors

-XX:+UseThreadPriorities
Not sure this is needed for SMT; I use a JNI function to the hwloc routines for thread core affinity.

-XX:-UseCodeCacheFlushing

-XX:-UseBiasedLocking
Disable biased locking; this causes horrendous jitter in ultra low latency systems with discrete threading models.
Probably the single biggest cause of jitter from a JVM arg that I found.

-XX:+UseNUMA
Assuming you have a multi CPU system this can have a significant impact … google NUMA architecture.



JVM Arguments to experiment with … didn’t help SMT, but may help you

-XX:-DoEscapeAnalysis
Try disabling escape analysis and see what the impact is.

-Xcomp
Mentioned earlier: compile code up front, as opposed to optimising based on runtime heuristics.
Avoids JIT jitter, but the code in general is slower than dynamically compiled code.
Can't remember if it compiles all classes on startup or each class as it is loaded; google failed to help me here !

-XX:+UseCompressedStrings
Use byte arrays in Strings instead of char. SMT has its own ReusableString which uses byte arrays.
Obviously a no go for systems that require multi byte char sets like Japanese Shift-JIS.
All IO is in bytes, so avoid the constant translation between char and byte.

-XX:-UseCounterDecay
Experiment with disabling / re-enabling recompilation decay timers. I believe the decay timers delay recompilation from happening within 30 seconds: a real pain in warmup code. I run warmup code, pause 30 seconds, then rerun! There must be a better way. I wish decent documentation existed that wasn't hidden away !

-XX:PerMethodRecompilationCutoff=1
Try setting a maximum recompilation boundary … didn't help me much.


I have tried many, many other JVM args but none of them had any favourable impact on SMT performance.