Sunday, 16 August 2015

Avoid Unnecessary Allocations and Memcpy's

This post may seem obvious, but it's worth a mention anyway.

Low latency Java needs to follow a different path from standard Java, avoiding allocations and unnecessary memcpy's.


Consider the following examples: how many memory allocations occur for the StringBuilder in each?

A) Typical Java code

private Message read() {
    StringBuilder info = new StringBuilder();

    Message m = decode();

    info.append( "seqNum=" ).append( m.getSeqNum() )
        .append( ", isDup=" ).append( m.isPosDup() );

    m.dump( info );

    log( info );

    return m;
}

GC .. who cares? Multiple allocs per call to read(). Thread safe.

B) Presized string buffer

private Message read() {
    StringBuilder info = new StringBuilder( 1024 );

    Message m = decode();

    info.append( "seqNum=" ).append( m.getSeqNum() )
        .append( ", isDup=" ).append( m.isPosDup() );

    m.dump( info );

    log( info );

    return m;
}

GC .. who cares? One buffer alloc per call to read(). Thread safe.

C) Member variable

private final StringBuilder _info = new StringBuilder();

private Message read() {

    _info.setLength( 0 );

    Message m = decode();

    _info.append( "seqNum=" ).append( m.getSeqNum() )
         .append( ", isDup=" ).append( m.isPosDup() );

    m.dump( _info );

    log( _info );

    return m;
}

Lazy initialisation for the StringBuilder's internal buffer: some allocs until the buffer hits its max required size.
Best option where memory is limited and there can be many instances.
Not thread safe ... but that's FINE, as all code is assumed single threaded unless otherwise specified.
In ultra low latency systems the threading model is explicit and all contention is minimised and understood.

D) Presized member variable

private final StringBuilder _info = new StringBuilder( 1024 );

private Message read() {

    _info.setLength( 0 );

    Message m = decode();

    if ( isLogEnabled() ) {
        _info.append( "seqNum=" ).append( m.getSeqNum() )
             .append( ", isDup=" ).append( m.isPosDup() );

        m.dump( _info );

        log( _info );
    }

    return m;
}

Ok, so the guards should have been in all the examples, but in typical Java code they are ignored and the allocations and memcpy paid ... a lazy tax.
Ultra low latency approach for the StringBuilder: a single buffer allocation, presized to the max requirement.
Not thread safe ... but that's FINE, as all code is assumed single threaded unless otherwise specified.
In ultra low latency systems the threading model is explicit and all contention is minimised and understood.

FYI SubMicroTrading doesn't use StringBuilder but ReusableString, which avoids the overhead of toString() and uses byte instead of char.
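For illustration only (the real ReusableString does much more), here is a minimal sketch of a reusable byte-backed builder along those lines; the name is made up to avoid confusion with the SMT class :-

public final class ReusableBufSketch {
    private byte[] _bytes = new byte[ 1024 ];
    private int    _len   = 0;

    public void reset()    { _len = 0; }        // rewind for reuse, no reallocation
    public int  length()   { return _len; }
    public byte[] bytes()  { return _bytes; }   // raw bytes, ready for a socket buffer

    public ReusableBufSketch append( String s ) {           // ascii only in this sketch
        for ( int i = 0; i < s.length(); i++ ) appendByte( (byte) s.charAt( i ) );
        return this;
    }

    public ReusableBufSketch append( long val ) {           // digits written directly, no temp String
        if ( val < 0 ) { appendByte( (byte) '-' ); val = -val; }  // (Long.MIN_VALUE not handled here)
        if ( val >= 10 ) append( val / 10 );
        appendByte( (byte) ('0' + (int) (val % 10)) );
        return this;
    }

    private void appendByte( byte b ) {
        if ( _len == _bytes.length ) {                       // grows only until max size reached
            byte[] bigger = new byte[ _bytes.length * 2 ];
            System.arraycopy( _bytes, 0, bigger, 0, _len );
            _bytes = bigger;
        }
        _bytes[ _len++ ] = b;
    }
}

Used as a member variable with reset() per call, as in example D, this settles at zero allocations once warm.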



Tuesday, 11 August 2015

SubMicroTrading Open Source on GitHub

As explained in my last post, I have now open sourced SubMicroTrading on GitHub.

                                          SubMicroTrading on GitHub

There is almost a quarter of a million lines of code.

Code highlights for me are

1) The socket session class hierarchy and cleanly extending it across all the different exchange sessions

2) Exchange Agnostic Algo Container to make writing strategies easier

3) Highly concurrent book maintenance in MarketDataController, and book snapping optimised to snap at most once per thread without a map lookup

Please check it out !!


Sunday, 7 June 2015

SubMicroTrading Ultra Low Latency Java Trading Framework Preparing for Open Source


If you are using FPGA or the current third party providers for trading where microseconds really matter, then please consider evaluating SubMicroTrading (TM). This is not vapourware or empty promises. It's a trading framework almost 5 years in the making that I am preparing for open source with a target date of August 2015.

Java can be used for ultra low latency.

        Just because someone doesn't know how to do something, doesn't mean it's not possible

Key stats measured wire to wire in an independent lab using TipOff

800,000+ CME fast fix messages decoded per core per second

4 micros average tick to trade, wire to wire at 800,000 ticks/second (MAX tcpreplay), which includes :-
read packet from wire using Solarflare OpenOnload and deliver to main memory (1 micro)
decode market data tick into a normalised POJO event (<1 micro)
update book and invoke algo container (<1 micro)
simple algo which crosses the spread every X ticks and creates an order POJO
encode order to a CME fix order request, ready to write to the socket buffer (<1 micro)
write packet to wire using Solarflare OpenOnload (1 micro)

In-process latency is measured at 2 micros, with 2 micros in/out of Solarflare/OpenOnload.
Note latency is highly dependent on configuration and data topology (which is why concurrency is so important).

What's in the open release
Current model and generated codecs including ETI, UTP, Millenium, Fix, FastFix, CME MDP
All market data and exchange codecs convert from external wire format to normalised common internal POJO events
Possibly fastest standard Fix engine on planet
Possibly fastest FastFix implementation on planet
Possibly fastest log engine on planet
Possibly fastest memory mapped index paged persistence
Possibly fastest OMS on planet
Custom exchange session engines for ETI, UTP, Millenium, Fix, FastFix
Exchange trading simulator (works with any of the generated codecs like ETI)
Complete core of SubMicroTrading including thread core affinity
Component architecture for easy configuration of flow pipelines
Ability to extend and customise the source code of any component

What's not in the first open release
Encoder/Decoder and model generator
Exchange and market data agnostic Algo container
CME dynamic on the fly session generation
Book Manager and Book Conflation for optimal concurrent update processing
Sample spread algo implementation really shows the power of Java algos.

Note when comparing SubMicroTrading with other products, remember you have the source, you have full control. Don’t compare apples and oranges … SubMicroTrading converts wire messages to appropriate normalised domain objects allowing a clean and simple to use algo container.

To really compare performance you must test wire to wire within a controlled network. For really high throughput, and to avoid exchange test environment throttling, run the exchange trading simulator and market data replay on a separate server. You can then run SubMicroTrading on the trading server, then switch to an alternative implementation. Try the T1 benchmark at different tcpreplay rates.

Try it: it's pretty amazing to replay the market data with tcpreplay and run the trading application and exchange simulator all on a low power laptop. To see the true power, run on tuned CentOS linux with custom NIO and thread affinity configured.

Follow the blog or register on the website for confirmation on the open launch.

Saturday, 6 June 2015

Setting Thread Affinity and Priority using JNI in SubMicroTrading


 
There is no real mystique in JNI calls; the idea that JNI is slow is a misconception. If you keep your JNI interfaces simple, then when the code is compiled it's just another function call (albeit with an extra two parameters).

I recommend wrapping JNI calls within an envelope which allows switching between linux, windows and perhaps no custom JNI at all. I developed SubMicroTrading on a little Dell Adamo laptop and could run the exchange sim, market data sim and trading application all on a dual core with 4GB RAM … try doing that in C++!

In SubMicroTrading all custom JNI calls (excluding custom NIO) are wrapped within a class called NativeHooksImpl (a simplified and cut down version is shown below).

public class NativeHooksImpl implements NativeHooks {

    private static boolean _linuxNative   = false;
   
    static {
        if ( Env.isUseLinuxNative() ) {
            System.loadLibrary( "submicrocore" );
            _linuxNative = true;
        }
    }

    private static NativeHooks _instance = new NativeHooksImpl();
    public  static NativeHooks instance() { return _instance; }

    private static native void jniSetPriority( int mask, int priority );
   
    @Override public void setPriority( Thread thread, int mask, int priority ) {
        if ( _linuxNative ) {
            jniSetPriority( mask, priority );
        } else {
            thread.setPriority( priority );
        }
    }
    ………...

To generate the header file :-

javah -force -classpath ..\..\bin -o src\SubMicroCore_jni.h com.rr.core.os.NativeHooksImpl

Sample entry from the generated header …. clearly the actual function implementation must match this definition.

/*
 * Class:     com_rr_core_os_NativeHooksImpl
 * Method:    jniSetPriority
 * Signature: (II)V
 */
JNIEXPORT void JNICALL Java_com_rr_core_os_NativeHooksImpl_jniSetPriority(JNIEnv *, jclass, jint, jint);

Implementation of the set priority method .. note this sets the cpumask and priority for the CURRENT thread. Invoke this method at the start of the thread's run() method (see the usage sketch after the code). SubMicroTrading keeps all the thread and priority mappings in a config file, which is essential. I use different configs for each different PC/server.

JNIEXPORT void JNICALL Java_com_rr_core_os_NativeHooksImpl_jniSetPriority( JNIEnv *env, jclass clazz, jint cpumask, jint priority ) {

    int topodepth;
    hwloc_topology_t topology;
    hwloc_cpuset_t cpuset;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    topodepth = hwloc_topology_get_depth(topology);

    cpuset = hwloc_bitmap_alloc();
    hwloc_bitmap_from_ulong( cpuset, (unsigned int)cpumask );

    char *str;
    hwloc_bitmap_asprintf(&str, cpuset);

    printf("cpumask [%d] => hwloc [%s]\n", cpumask, str);

    if (hwloc_set_cpubind(topology, cpuset, HWLOC_CPUBIND_THREAD)) {
        printf("Couldn't bind cpuset %s\n", str);
    } else {
        printf("BOUND cpuset %s\n", str);
    }

    free(str);

    /* Free our cpuset copy */
    hwloc_bitmap_free(cpuset);

    /* Destroy topology object. */
    hwloc_topology_destroy(topology);
}
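For illustration, a hedged sketch of using it from the Java side; the worker class and the mask value are made up, and in SubMicroTrading the real mappings come from the config file :-

public final class PinnedWorker implements Runnable {

    @Override public void run() {
        // bind the CURRENT thread to core 2 (mask 0x4) before entering the hot loop
        NativeHooksImpl.instance().setPriority( Thread.currentThread(), 0x4, Thread.MAX_PRIORITY );

        // ... hot loop now runs on the bound core
    }
}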

Here's the linux makefile I wrote :-

# g++: 3.2.3

.SUFFIXES:        .c

TMP_PATH=./target
BIN_PATH=./bin/linux

#DEBUG=            -g
DEBUG=

DLL_NAME=        libsubmicrocore.so

CC=                      gcc
CFLAGS=          -O3 -march=nocona -m64 -I"${JAVA_HOME}/include" -I"${JAVA_HOME}/include/linux" -DLINUX -fPIC -I${HWLOC_HOME}/include -I./sun
LD=                      gcc
LDFLAGS=-L${HWLOC_HOME}/lib -L${JAVA_HOME}/jre/lib/amd64 -L${JAVA_HOME}/jre/lib/amd64/server

LIBS=-m64 -lhwloc -ljava -ljvm -lverify -lnio -lnet -lrt

all:	setup lib

setup:
	mkdir -p ${TMP_PATH}
	mkdir -p ${BIN_PATH}

lib:	${BIN_PATH}/${DLL_NAME}

${BIN_PATH}/${DLL_NAME}: ${TMP_PATH}/SubMicroCore_jni.o
	${LD} ${LDFLAGS} -LD ${LIBS} -shared -o ${TMP_PATH}/${DLL_NAME} ${TMP_PATH}/SubMicroCore_jni.o
	cp -f ${TMP_PATH}/${DLL_NAME} ${BIN_PATH}/${DLL_NAME}

${TMP_PATH}/SubMicroCore_jni.o: src/SubMicroCore_jni.c src/SubMicroCore_jni.h
	${CC} ${CFLAGS} -o ${TMP_PATH}/SubMicroCore_jni.o -c src/SubMicroCore_jni.c

clean:
	rm -rf ${TMP_PATH}/*
	rm -rf ${BIN_PATH}/*

FYI I had written a windows version of the library but ditched it, as for ultra low latency you really need the level of control that linux gives you … especially as it's free!

This is all EASY due to the good work on the hwloc project :-

https://www.open-mpi.org/projects/hwloc/

I am using a pretty old version (I think 1.0.2 … can't check as my linux servers are offline atm) so the API may have changed again, but I would expect the impact to be minimal.

I will give recommendations on how to use thread affinity and priority in a post on threading models. Please use with care: poor usage can grind a system to a halt.

My plan is still to open source components from SubMicroTrading, which will include the complete JNI layer .. ie the above plus various timer and microsecond sleep functions.


Saturday, 23 May 2015

Coding for Ultra Low Latency

For a system to operate as fast as possible, every line of code needs to be optimal. If you take the approach of writing lazy code then optimising, you will end up rewriting everything. A profiler won't help you at the nanosecond level; the overhead of running with profiler metrics will have you "chasing your tail"!

Writing optimal code from the start of the project is easy: set up coding standards and enforce them. Have a simple set of guidelines that everyone follows.

Minimise synchronisation
The synchronized keyword used to be really slow and was avoided, with more complex lock classes used in preference. But with the advent of under-the-covers lock spinning this is no longer the case. That said, even if the lock is uncontended you still pay the overhead of a read and write memory barrier. So use synchronized where it's absolutely needed, ie where you have real concurrency.
The key here is application design: you want components to be single threaded, achieving throughput via concurrent instances which are independent and require no synchronisation (sketched below).
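As a sketch of what I mean (illustrative names, not the SMT classes): partition work by key so each single-threaded worker owns its state exclusively and needs no locks :-

import java.util.HashMap;
import java.util.Map;

public final class PartitionedWorkers {

    // each worker is only ever touched by one thread, so its map needs no synchronisation
    public static final class Worker {
        private final Map<Long, long[]> _books = new HashMap<>();

        public void onTick( long instId, long px, long qty ) {
            long[] book = _books.get( instId );
            if ( book == null ) { book = new long[2]; _books.put( instId, book ); }
            book[0] = px; book[1] = qty;
        }
    }

    private final Worker[] _workers;

    public PartitionedWorkers( int numThreads ) {
        _workers = new Worker[ numThreads ];
        for ( int i = 0; i < numThreads; i++ ) _workers[ i ] = new Worker();
    }

    // same instrument always routes to the same worker (ids assumed non-negative),
    // so throughput scales by adding independent instances, not by adding locks
    public Worker workerFor( long instId ) {
        return _workers[ (int) (instId % _workers.length) ];
    }
}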
Minimise use of volatile variables
Understand how your building blocks work eg AtomicInteger, ConcurrentHashMap.
Only use concurrent techniques for the code that needs to be concurrent.
Minimise use of CAS operations
An efficient atomic operation, bypassing the O/S and implemented by a CPU instruction. However, making it atomic and consistent incurs a memory barrier, hitting cache effectiveness. So use it where needed and not where not! A small example follows.
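A tiny illustration of that rule: pay for the CAS only on genuinely shared state :-

import java.util.concurrent.atomic.AtomicLong;

public final class Counters {

    private long _localCount;                                  // single threaded : no memory barrier
    private final AtomicLong _sharedCount = new AtomicLong();  // shared : CPU CAS + barrier per update

    public void onEventSingleThreaded() { _localCount++; }
    public void onEventShared()         { _sharedCount.incrementAndGet(); }

    public long localCount()  { return _localCount; }
    public long sharedCount() { return _sharedCount.get(); }
}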
Avoid copying objects unnecessarily
I see this A LOT, and the overhead soon mounts up.
The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue comes from the required concurrency on shared state across instances running in separate threads.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, they used abstract concepts with object values stored in maps. Not only do these systems run slowly, but they lack compile time safety and are simply a pain. Use maps where they are needed … eg a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
Presize collections
Understand the cost of growing collections: eg a HashMap has to create a new array of double the size and then rehash its elements, an expensive operation when the map has grown into the hundreds of thousands of entries. Make initial sizes configurable, as shown below.
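For example (property name made up), remembering that a HashMap resizes once size exceeds capacity times the load factor :-

import java.util.HashMap;
import java.util.Map;

// configurable initial size, eg from a system property
final int expected = Integer.getInteger( "map.presize.orders", 200_000 );

// capacity must exceed expected / loadFactor (default 0.75) or the map will still rehash while filling
final Map<Long, Object> orders = new HashMap<>( (int) (expected / 0.75f) + 1 );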
Reuse heuristics
At the end of the day, write out the size of all collections. Next time the process is bounced, resize to the previous stored max.
Generate other metrics like number of orders created, hit percentage, max tick rate per second … figures that can be used to understand performance and give context to unexpected latency. A minimal sketch of the size heuristic follows.
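A minimal sketch of that heuristic (file name and keys are made up) :-

import java.io.*;
import java.util.Properties;

public final class SizeHeuristics {

    private final Properties _sizes = new Properties();
    private final File       _file  = new File( "collectionSizes.properties" );

    public void load() throws IOException {
        if ( _file.exists() ) {
            try ( InputStream in = new FileInputStream( _file ) ) { _sizes.load( in ); }
        }
    }

    // called at end of day with the observed high-water mark for each collection
    public void recordMax( String name, int maxSize ) {
        _sizes.setProperty( name, Integer.toString( maxSize ) );
    }

    // called on startup so the collection is presized to the previous max
    public int presizeFor( String name, int dflt ) {
        return Integer.parseInt( _sizes.getProperty( name, Integer.toString( dflt ) ) );
    }

    public void save() throws IOException {
        try ( OutputStream out = new FileOutputStream( _file ) ) {
            _sizes.store( out, "collection high-water marks" );
        }
    }
}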
Use Object Orientation
Avoiding object orientation due to fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro, end to end scale, what's the impact? In Java all methods are virtual, but the JIT compiler knows what classes are currently loaded and can not only avoid a vtable lookup but also inline the code. The benefit of object orientation is huge: component reuse and extensibility make it easy to extend and create new strategies without swathes of cut and paste code.
Use final keyword everywhere
Help the JIT compiler optimise .. if in the future a method or class needs extending, you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Big, big methods will never be compiled; big complex methods may be compiled, but the compiler may end up recompiling and recompiling the method to try and optimise it. David Straker wrote "KISS" on the board and I never forgot it! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives and use long over Long, thus avoiding any auto boxing overhead (turn the auto boxing warning on). The snippet below shows the cost.
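For example, the boxed version below generates garbage on the hot path while the primitive version generates none :-

long total = 0;          // primitive : stays in a register, no allocation
Long boxedTotal = 0L;    // boxed : each += unboxes, adds, then boxes a new Long
for ( int i = 0; i < 1_000_000; i++ ) {
    total += i;
    boxedTotal += i;     // garbage on every iteration (beyond the small Long cache)
}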
Avoid Immutables
Immutable objects are fine for long lived objects, but can cause GC for anything else … eg a trading system with market data would GC every second if each tick created an immutable POJO.
Avoid String
String is immutable and a big no-no for ultra low latency systems. In SMT I have ZString, an immutable "string-like" interface, with ViewString and ReusableString as the concrete implementations; a rough sketch of the idea follows.
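A minimal sketch of the idea (deliberately not the real SMT classes): a byte-based string-like interface plus a zero-copy view over an existing buffer :-

// string-like abstraction over bytes : no char conversion, no temp Strings
interface ByteStringSketch {
    int  length();
    byte byteAt( int idx );
}

// a zero-copy "view" : points at a region of an existing buffer instead of copying it
final class ViewSketch implements ByteStringSketch {
    private byte[] _buf;
    private int    _offset;
    private int    _len;

    public void set( byte[] buf, int offset, int len ) {
        _buf = buf; _offset = offset; _len = len;    // no allocation, no memcpy
    }

    @Override public int  length()          { return _len; }
    @Override public byte byteAt( int idx ) { return _buf[ _offset + idx ]; }
}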
Avoid Char
Use byte and byte[], avoiding translation between byte and char on every IO operation.
Avoid temp objects
Objects take time to construct and initialise. Consider using instance variables for reuse instead (if the instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows the invoking code to avoid object creation and reuse instances where appropriate :-


String str = order.toString();      // the api forces construction of a temporary string

Versus

_str.reset();                       // a reusable "working" instance var
order.toString( _str );             // buffer passed into the method, so no temp objects required

Don't make everything reusable
Only where otherwise the objects would cause GC.
Object reuse comes with a risk of corruption; a key goal of Java was to avoid those nasty bugs.
Unfortunately for ultra low latency that's not an option: you have to reuse objects (remember there are places in the Java class libraries that already use pools and reuse).
Avoid finalize
Objects which hold resources such as files and sockets should attempt to shut down cleanly and not rely on finalisers. Add explicit open and close methods, and add shutdown handlers to close cleanly if possible.
Avoid threadlocal
Every threadlocal call involves a map lookup for the current thread, so only use it where really needed.
24 * 7
Design your systems to run 24 * 7 …. common in the 80's and 90's, less so now in finance.


Saturday, 16 May 2015

Java Bytecode Latency Impact

In the 80's I remember building NAND circuits to represent code. It was pretty cool seeing how code could be implemented at a circuit level. What I was unsure of when I started SubMicroTrading was the performance impact of Java bytecode and whether there were any possible optimisations available.

To cut a long story short, I found only one worthwhile optimisation, and that's how a switch statement is represented in bytecode.

Consider the following switch statement :-

    switch( a ) {
    case 10 : doAStuff(); break;
    case 20 : doBStuff(); break;
    case 30 : doCStuff(); break;
    case 40 : doDStuff(); break;
    ...
    }

    is conceptually the same as

    if ( a == 10 ) doAStuff();
    else if ( a == 20 ) doBStuff();
    else if ( a == 30 ) doCStuff();
    else if ( a == 40 ) doDStuff();

Think about that: if you are parsing fix and you have 1000 possible fix tags, with an average of 20 fields in a message, then to process a fix message you could be making an average of 10,000 comparisons. If you want to process 1,000,000 events per second, that would be 10,000,000,000 comparisons per second. (A huge number based on linear top down search; binary search is clearly much better … but the point is it's still a cost that can be avoided.)

Java has two bytecodes for switch statements. lookupswitch is in effect a table of key-to-jump-label entries … ie you have to search the table to find the correct key entry. tableswitch is in effect a table of jump labels indexed directly by the key value (minus the table offset, ie the lowest key value in the switch statement).

For ultra low latency you should consider adding an ant task to check the bytecode for any "lookupswitch" statements. For message processing on most exchanges you can safely force a switch statement to become a tableswitch by adding packer entries so there are no gaps between the key values. In my CODEC generators I stipulate a max pack range, eg 100, and any sparse values are handled within the default statement, eg via a second switch, or an if statement if only a couple of keys are of interest. Like everything, test with real data to see the impact. For me, the tableswitch made a HUGE difference.

Sample tableswitch with packer case statements :-

Here is the start of the switch statement within the generated Standard44Decoder class :-

        final byte msgType = _fixMsg[ _idx ];
        switch( msgType ) {
        case '8':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeExecReport();
        case 'D':
            if ( _fixMsg[_idx+1 ] != FixField.FIELD_DELIMITER ) { // 2 byte message type
                throwDecodeException( "Unsupported fix message type " + _fixMsg[_idx] + _fixMsg[_idx+1] );
            }
            _idx += 2;
            return decodeNewOrderSingle();
        ………….
        // packers
        case '6': case '7': case ':': case ';': case '<': case '=':
        case '>': case '?': case '@': case 'B': case 'C': case 'E':
            break;


javap -c ./com/rr/model/generated/fix/codec/Standard44Decoder.class >s44.bc

  protected final com.rr.core.model.Message doMessageDecode();
    Code:
…….
      74: tableswitch   { // 48 to 71
                    48: 549                                      // '0' ie heartbeat is the first entry
                    49: 987
                    50: 841
                    51: 768
                    52: 914
                    53: 695
                    54: 1060
                    55: 1060
                    56: 184                                        // '8' this was the first switch entry in java code
                    57: 476
                    58: 1060                                     // 1060 is same as default and are the packer entries
                    59: 1060
                    60: 1060
                    61: 1060
                    62: 1060
                    63: 1060
                    64: 1060
                    65: 622
                    66: 1060
                    67: 1060
                    68: 257
                    69: 1060
                    70: 403
                    71: 330
               default: 1060
          }
     184: aload_0
     185: getfield      #462                // Field _fixMsg:[B
     188: aload_0
     189: getfield      #466                // Field _idx:I
     192: iconst_1
     193: iadd
     194: baload
     195: iconst_1
     196: if_icmpeq     242
     199: aload_0
     200: new           #475                // class java/lang/StringBuilder
     203: dup
     204: ldc_w         #477                // String Unsupported fix message type

You could easily grep for lookupswitch … remember over time extra case statements could be added that cause the switch to become a lookupswitch again.


PS> clearly tableswitch isn't suitable for very sparse switch statements, but in my experience it's very useful in trading systems.

Saturday, 9 May 2015

Java JVM Tuning for Ultra Low Latency

There is no JVM arg that fits all applications; the key is to have a repeatable full test bed and run full scale benchmarks over hours, not seconds. Rinse and repeat several times for EACH arg change. The args I focus on here are the ones SubMicroTrading runs with, which performs no GC and has almost no JIT post warmup.

Please note some of these flags may now be on by default … sorry, I haven't checked; still worth bringing them to attention I think.

For standard Java applications which do lots of GC with mainly short lived objects, I would recommend trying the G1 collector … for market data I found it much better than concurrent mark sweep. I will blog about that another time … I spent weeks tuning poorly designed apps (advice: don't bother, buy Zing).

Note each Java upgrade brings new options and tweaks existing performance, sometimes up, sometimes down, so re-benchmark each Java upgrade.

Treat micro benchmarks with care; I will discuss the Generics benchmark another time and explain how results on a PC differed from Linux.

Avoid biased locking … it incurred regular millisecond latency in systems I have tested.

JVM Args for Recording Jitter (JIT/GC)

-XX:+PrintCompilation
-XX:+CITime
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+LogCompilation
-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails

Rather than regurgitate what I previously googled on understanding the output from PrintCompilation :- http://blog.joda.org/2011/08/printcompilation-jvm-flag.html

For ultra low latency you want no GC and no JIT, so in SMT I preallocate pools and run warmup code, then invoke System.gc(). I take note of the last compilation entry; then, while re-running a controlled bench test, I look for new JIT output (generally recompilation). When this occurs I go back to the warmup code and find out why the routine had to be recompiled. This generally comes down to either the code not being warmed up, or the routine being too complicated for the compiler. Either add further warmup code or simplify the routine. Adding final everywhere really helps. A minimal sketch of the warmup sequence is below.
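A minimal sketch of that warmup sequence (names illustrative, not the SMT API) :-

public final class Warmup {

    // enough iterations to comfortably pass the compile threshold
    private static final int ITERATIONS = 20_000;

    public static void warm( Runnable hotPath ) {
        for ( int i = 0; i < ITERATIONS; i++ ) {
            hotPath.run();              // must drive every branch you care about
        }
        System.gc();                    // reclaim warmup garbage before going live
    }
}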

Writing warmup code is a pain, and I am gutted the java.lang.Compiler.disable() method is not implemented (or at least it wasn't in OpenJDK 1.6 … empty method, doh!). Ideally I would invoke this when the application is warm and have no recompilation due to the compiler thinking it can make further optimisations.

Java can recompile and recompile; in my experience this only happens when a method is too complex. Of course, if a recompilation occurs because Java inlined a non-final method and the optimisation was premature, then the code needs to be corrected. What I want to avoid is recompilation from edge cases that only infrequently go into code branches.

Note you cannot guarantee no GC and no JIT under every situation in a complex system. What you can do is guarantee no JIT/GC for KEY specified scenarios that the business demands. If a trading system does 10 million trades a day, I would set a goal of no GC/JIT under NORMAL conditions with 30 million trades, then check performance up to 100 million to see at which point jitter occurs. If, for example, the exchange disconnects you during the day and that kicks in a few milliseconds of JIT, it's not important. You don't need to pool every object … just the key ones that cause GC. More on that in a future blog on SuperPools.

I remember speaking to Gil Tene from Azul while working at Morgan Stanley, and I really tried to get across how much more of a pain JIT is than GC. Some exciting developments seem to have been made with Zing and I would have been very interested in benchtesting it … alas I just don't have time at present. Very impressed with Azul and Gil and how they respond to queries and enhance their product ….. so much better than Sun/Oracle were with Java.



SubMicroTrading JVM Arguments

The following are the arguments that SubMicroTrading runs with; this includes the algo container, OMS, exchange sim and client sim.

-XX:+BackgroundCompilation
Even with this on there is still latency in switching in newly compiled routines. I really wish that switch time was much, much quicker!

-XX:CompileThreshold=1000
If you don't want to benefit from the fastest possible code given the runtime heuristics, you can force initial compilation with -Xcomp … an option if you don't want to write warmup code. This may run 10% slower, but sometimes much slower, depending on the code.

-XX:+TieredCompilation
Code is initially compiled with the C1 (GUI/client) compiler, then when it reaches the invocation limit it is recompiled with the fully optimised C2 (server) compiler. The C1 compiler is much quicker to compile a routine than the C2 compiler, and this reduced some outlying latency in SMT for routines that were not compiled during warmup (eg for code paths not covered in warmup).

-XX:-ClassUnloading
Disable class unloading; I don't want any possible jitter from this. SMT doesn't use custom class loaders and tries to load all required classes during warmup.

-XX:+UseCompilerSafepoints
I had hoped that disabling compiler safepoints would reduce JIT jitter, but in SMT's multithreaded system it brings instability, so I ensure the safepoints are on ….. more jitter I don't want, ho hum.

-XX:CompileCommandFile=.hotspot_compiler
The filename used to be picked up by default, but now you have to use this command. This is really handy: if you have a small routine you can't simplify further which causes many recompilations, prevent them by adding a line to this file, for example :-

exclude sun/nio/ch/FileChannelImpl force

This means the routine won't be compiled; you need to benchmark to determine if running the routine as bytecode has a noticeable impact.

-XX:+UseCompressedOops
I kind of expected this to have a small performance overhead, but in fact it slightly improved performance … perhaps through reduced object size and fitting more instances into CPU cache.

-Xnoclassgc
Again, all classes are loaded during warmup and I don't want any possible jitter from trying to free up / unload them.

-XX:-RelaxAccessControlCheck
To be honest, no idea why I still have this in or even if it's still required!

-Djava.net.preferIPv4Stack=true
If you upgraded Java and spent hours working out why your socket code isn't working anymore, this could well be it … DOH!!!

-server
Don't forget this if running benchmarks on a PC.

-XX:+UseFastAccessorMethods

-XX:+UseFastJNIAccessors

-XX:+UseThreadPriorities
Not sure this is needed for SMT; I use a JNI function to hwloc routines for thread core affinity.

-XX:-UseCodeCacheFlushing

-XX:-UseBiasedLocking
Disable biased locking; this causes horrendous jitter in ultra low latency systems with discrete threading models. Probably the single biggest cause of jitter from a JVM arg that I found.

-XX:+UseNUMA
Assuming you have a multi-CPU system, this can have significant impact … google NUMA architecture.
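For illustration, a hypothetical launch line combining the flags above (the jar and main class names are made up) :-

java -server -XX:+BackgroundCompilation -XX:CompileThreshold=1000 -XX:+TieredCompilation \
     -XX:-ClassUnloading -XX:+UseCompilerSafepoints -XX:CompileCommandFile=.hotspot_compiler \
     -XX:+UseCompressedOops -Xnoclassgc -XX:-RelaxAccessControlCheck \
     -Djava.net.preferIPv4Stack=true -XX:+UseFastAccessorMethods -XX:+UseFastJNIAccessors \
     -XX:+UseThreadPriorities -XX:-UseCodeCacheFlushing -XX:-UseBiasedLocking -XX:+UseNUMA \
     -cp smt.jar com.example.TradingMain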



JVM Arguments to experiment with … didn’t help SMT, but may help you

-XX:-DoEscapeAnalysis
Try disabling escape analysis and see what the impact is.

-Xcomp
Mentioned earlier: compile code when loaded, as opposed to optimising based on runtime heuristics. Avoids JIT jitter, but the code in general is slower than dynamically compiled code. Can't remember if it compiles all classes on startup or as each class is loaded; google failed to help me here!

-XX:+UseCompressedStrings
Use byte arrays in Strings instead of char. SMT has its own ReusableString which uses byte arrays. Obviously a no-go for systems that require multi-byte char sets like Japanese Shift-JIS. All IO is in bytes, so avoid the constant translation between char and byte.

-XX:-UseCounterDecay
Experiment with disabling / re-enabling recompilation decay timers. I believe the decay timers delay recompilation from happening within 30 seconds. A real pain in warmup code: I run warmup code, pause 30 seconds, then rerun! Must be a better way. Wish decent documentation existed that wasn't hidden away!

-XX:PerMethodRecompilationCutoff=1
Try setting a maximum recompilation boundary … didn't help me much.


I have tried many, many other JVM args, but none of them had any favourable impact on SMT performance.