Sunday 7 June 2015

SubMicroTrading Ultra Low Latency Java Trading Framework Preparing for Open Source


If you are using FPGA or the current third party providers for trading where microseconds really matter, then please consider evaluating SubMicroTrading (TM). This is not vapourware or empty promises. It’s a trading framework almost 5 years in the making that I am preparing for open source, with a target date of August 2015.

Java can be used for ultra low latency.

        Just because someone doesn’t know how to do something, doesn’t mean it’s not possible

Key stats measured wire to wire in an independent lab using TipOff

800,000+ CME fast fix messages decoded per core per second

4 micros average tick to trade, wire to wire at 800,000 ticks/second (MAX tcpreplay) includes :-
Read packet from wire using Solarflare openonload and deliver to main memory (1 micro)
decode market data tick into a normalised POJO event (<1micro)
update book and invoke algo container (<1micro)
simple algo which crosses spread every X ticks and creates order POJO
Encode order to CME fix order request and ready to write to socket buffer (<1micro)
write packet to wire using Solarflare openonload (1 micro)

In process latency is measured at 2 micros with 2 micros in/out of Solarflare/OpenOnload
Note latency is highly dependent on configuration and data topology (which is why concurrency is so important)
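
As a rough illustration of why concurrency matters at these rates, the quoted figures can be turned into a back-of-envelope calculation. This is only a sketch: the class and method names are hypothetical, and it assumes Little's law (average items in flight = arrival rate x latency) is a fair approximation for the replayed feed.

```java
// Back-of-envelope arithmetic using the figures quoted above
// (800,000 ticks/second, 4 micros wire to wire). Names are hypothetical.
final class LatencyBudget {

    /** Average gap between consecutive ticks, in microseconds. */
    static double interArrivalMicros(double ticksPerSecond) {
        return 1_000_000.0 / ticksPerSecond;
    }

    /** Little's law: average ticks in flight = arrival rate x latency. */
    static double ticksInFlight(double ticksPerSecond, double latencyMicros) {
        return ticksPerSecond * latencyMicros / 1_000_000.0;
    }

    public static void main(String[] args) {
        double rate = 800_000.0;   // ticks per second, from the stats above
        double wireToWire = 4.0;   // micros, from the stats above

        System.out.printf("inter-arrival gap: %.2f micros%n", interArrivalMicros(rate));
        System.out.printf("ticks in flight  : %.1f%n", ticksInFlight(rate, wireToWire));
    }
}
```

Under those assumptions a new tick arrives before the previous one has left the box, so the stages have to overlap rather than run strictly one after another.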

What's in the open release
Current model and generated codecs including ETI, UTP, Millenium, Fix, FastFix, CME MDP
All market data and exchange codecs convert from external wire format to normalised common internal POJO events
Possibly fastest standard Fix engine on planet
Possibly fastest FastFix implementation on planet
Possibly fastest log engine on planet
Possibly fastest memory mapped index paged persistence
Possibly fastest OMS on planet
Custom exchange session engines for ETI, UTP, Millenium, Fix, FastFix
Exchange trading simulator (works with any of the generated codecs like ETI)
Complete core of SubMicroTrading including thread core affinity
Component architecture for easy configuration of flow pipelines
Ability to extend and customise the source code of any component

What's not in the first open release
Encoder/Decoder and model generator
Exchange and market data agnostic Algo container
CME dynamic on the fly session generation
Book Manager and Book Conflation for optimal concurrent update processing
Sample spread algo implementation really shows the power of Java algos.

Note when comparing SubMicroTrading with other products, remember you have the source, you have full control. Don’t compare apples and oranges … SubMicroTrading converts wire messages to appropriate normalised domain objects allowing a clean and simple to use algo container.
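
To make the idea of converting a wire message into a normalised domain object concrete, here is a minimal sketch of a FIX tag=value decoder filling a reusable POJO. It is not the SubMicroTrading codec: the class names, field names and the handful of tags handled are assumptions for illustration, and it allocates Strings, which a real low latency codec would avoid.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical normalised event; field names are illustrative only. */
final class NewOrderEvent {
    String symbol;    // FIX tag 55
    char   side;      // FIX tag 54: '1' = buy, '2' = sell
    long   quantity;  // FIX tag 38
    double price;     // FIX tag 44
}

final class TinyFixDecoder {
    private static final byte SOH = 0x01;  // standard FIX field delimiter

    /** Decodes tag=value fields from buf[0..len) into a caller-supplied event. */
    static void decode(byte[] buf, int len, NewOrderEvent out) {
        int i = 0;
        while (i < len) {
            int tag = 0;
            while (i < len && buf[i] != '=') {      // accumulate the numeric tag
                tag = tag * 10 + (buf[i++] - '0');
            }
            i++;                                    // skip '='
            int start = i;
            while (i < len && buf[i] != SOH) i++;   // scan to end of value
            int valLen = i - start;
            i++;                                    // skip SOH

            switch (tag) {
                case 55: out.symbol = text(buf, start, valLen); break;
                case 54: if (valLen > 0) out.side = (char) buf[start]; break;
                case 38: out.quantity = Long.parseLong(text(buf, start, valLen)); break;
                case 44: out.price = Double.parseDouble(text(buf, start, valLen)); break;
                default: break;                     // ignore tags we don't model
            }
        }
    }

    private static String text(byte[] buf, int start, int len) {
        return new String(buf, start, len, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // '|' stands in for SOH to keep the sample readable
        byte[] raw = "35=D|55=ESM5|54=1|38=10|44=2100.25|"
                .replace('|', '\u0001').getBytes(StandardCharsets.US_ASCII);
        NewOrderEvent ev = new NewOrderEvent();
        decode(raw, raw.length, ev);
        System.out.println(ev.symbol + " side=" + ev.side + " qty=" + ev.quantity + " px=" + ev.price);
    }
}
```

The point of the pattern is that the algo only ever sees NewOrderEvent style objects, whatever the exchange wire format was.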

To really compare performance you must test wire to wire within a controlled network. For really high throughput and to avoid exchange test environment throttling, run the exchange trading simulator and market data replay on a separate server. You can then run SubMicroTrading on the trading server then switch to an alternative implementation. Try the T1 benchmark at different TCP replay rates.

Try it, it's pretty amazing to run the market data back using tcpreplay, the trading application and exchange simulator all on a low power laptop. To see the true power, run on tuned CentOS linux with custom NIO and thread affinity configured.

Follow the blog or register on the website for confirmation on the open launch.

Saturday 6 June 2015

Setting Thread Affinity and Priority using JNI in SubMicroTrading


 
There is no real mystique in JNI calls; the idea that JNI is slow is a misconception. If you keep your JNI interfaces simple then when the code is compiled it's just another function call (albeit with an extra two parameters).

I recommend wrapping JNI calls within an envelope which can allow switching between linux, windows and perhaps no custom JNI. I developed SubMicroTrading on a little Dell Adamo laptop and could run the exchange sim, market data sim, trading application all on dual core with 4GB RAM … try doing that in C++ !

In SubMicroTrading all custom JNI calls (excluding custom NIO) are wrapped within a class called NativeHooksImpl (simplified and cut down version below).

public class NativeHooksImpl implements NativeHooks {

    private static boolean _linuxNative   = false;
   
    static {
        if ( Env.isUseLinuxNative() ) {
            System.loadLibrary( "submicrocore" );
            _linuxNative = true;
        }
    }

    private static NativeHooks _instance = new NativeHooksImpl();
    public  static NativeHooks instance() { return _instance; }

    private static native void jniSetPriority( int mask, int priority );
   
    @Override public void setPriority( Thread thread, int mask, int priority ) {
        if ( _linuxNative ) {
            jniSetPriority( mask, priority );
        } else {
            thread.setPriority( priority );
        }
    }
    ………...

To generate the header file :-

javah -force -classpath ..\..\bin -o src\SubMicroCore_jni.h com.rr.core.os.NativeHooksImpl

Sample entry from the generated header …. Clearly the actual function must match the definition

/*
 * Class:     com_rr_core_os_NativeHooksImpl
 * Method:    jniSetPriority
 * Signature: (II)V
 */
JNIEXPORT void JNICALL Java_com_rr_core_os_NativeHooksImpl_jniSetPriority(JNIEnv *, jclass, jint, jint);

Implementation for the set priority method .. Note this sets the cpumask and priority for the CURRENT thread. Invoke this method at the start of the thread run() method. SubMicroTrading keeps all the thread and priority mappings in a config file which is essential. I use different configs for each different PC/server.

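#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>
#include "SubMicroCore_jni.h"

/* Binds the calling thread to the CPUs in cpumask.
   In this cut down listing the priority argument is not applied. */
JNIEXPORT void JNICALL Java_com_rr_core_os_NativeHooksImpl_jniSetPriority( JNIEnv *env, jclass clazz, jint cpumask, jint priority ) {

    int topodepth;
    hwloc_topology_t topology;
    hwloc_cpuset_t cpuset;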

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    topodepth = hwloc_topology_get_depth(topology);

    cpuset = hwloc_bitmap_alloc();
    hwloc_bitmap_from_ulong( cpuset, (unsigned int)cpumask );

    char *str;
    hwloc_bitmap_asprintf(&str, cpuset);

    printf("cpumask [%d] => hwloc [%s]\n", cpumask, str);

    if (hwloc_set_cpubind(topology, cpuset, HWLOC_CPUBIND_THREAD)) {
        printf("Couldn't bind cpuset %s\n", str);
    } else {
        printf("BOUND cpuset %s\n", str);
    }

    free(str);

    /* Free our cpuset copy */
    hwloc_bitmap_free(cpuset);

    /* Destroy topology object. */
    hwloc_topology_destroy(topology);
}
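
On the Java side, the advice to invoke the hook at the start of run() and to keep the mappings in config can be sketched as below. This is a hedged illustration only: AffinityHooks, JavaOnlyHooks, ThreadConfig and PinnedWorker are hypothetical names, the config is hard coded rather than read from a file, and the fallback hook ignores the mask just as the non-native branch of NativeHooksImpl does. The mask helper assumes bit n of the cpumask selects CPU index n as hwloc sees it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the NativeHooks interface shown earlier. */
interface AffinityHooks {
    void setPriority(Thread thread, int mask, int priority);
}

/** Pure Java fallback: ignores the mask and only sets the Java priority. */
final class JavaOnlyHooks implements AffinityHooks {
    @Override public void setPriority(Thread thread, int mask, int priority) {
        thread.setPriority(priority);
    }
}

/** One entry of the per-machine thread config (hypothetical shape). */
final class ThreadConfig {
    final int mask;
    final int priority;

    ThreadConfig(int mask, int priority) {
        this.mask = mask;
        this.priority = priority;
    }

    /** Builds a cpumask with one bit set per core index, e.g. cores 1 and 3 give binary 1010. */
    static int maskOf(int... cores) {
        int mask = 0;
        for (int core : cores) mask |= 1 << core;
        return mask;
    }
}

/** Wraps a task so the affinity call is the first thing the thread does. */
final class PinnedWorker implements Runnable {
    private final AffinityHooks hooks;
    private final ThreadConfig cfg;
    private final Runnable body;

    PinnedWorker(AffinityHooks hooks, ThreadConfig cfg, Runnable body) {
        this.hooks = hooks;
        this.cfg = cfg;
        this.body = body;
    }

    @Override public void run() {
        // Must run on the worker itself: the native call binds the CURRENT thread.
        hooks.setPriority(Thread.currentThread(), cfg.mask, cfg.priority);
        body.run();
    }
}

final class AffinityDemo {
    public static void main(String[] args) throws InterruptedException {
        // Hard coded here; thread names and core numbers are made up.
        Map<String, ThreadConfig> config = new HashMap<>();
        config.put("MarketDataReader", new ThreadConfig(ThreadConfig.maskOf(1), Thread.MAX_PRIORITY));
        config.put("OrderWriter", new ThreadConfig(ThreadConfig.maskOf(2), Thread.NORM_PRIORITY));

        AffinityHooks hooks = new JavaOnlyHooks();
        for (Map.Entry<String, ThreadConfig> e : config.entrySet()) {
            Runnable task = () -> System.out.println(Thread.currentThread().getName() + " running");
            Thread t = new Thread(new PinnedWorker(hooks, e.getValue(), task), e.getKey());
            t.start();
            t.join();
        }
    }
}
```

Swapping JavaOnlyHooks for a JNI backed implementation would leave the worker code unchanged, which is the benefit of wrapping the native calls.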

Here's the linux makefile I wrote :-

# g++: 3.2.3

.SUFFIXES:        .c

TMP_PATH=./target
BIN_PATH=./bin/linux

#DEBUG=            -g
DEBUG=

DLL_NAME=        libsubmicrocore.so

CC=                      gcc
CFLAGS=          -O3 -march=nocona -m64 -I"${JAVA_HOME}/include" -I"${JAVA_HOME}/include/linux" -DLINUX -fPIC -I${HWLOC_HOME}/include -I./sun
LD=                      gcc
LDFLAGS=-L${HWLOC_HOME}/lib -L${JAVA_HOME}/jre/lib/amd64 -L${JAVA_HOME}/jre/lib/amd64/server

LIBS=-m64 -lhwloc -ljava -ljvm -lverify -lnio -lnet -lrt

all:    setup lib

setup:
	mkdir -p ${TMP_PATH}
	mkdir -p ${BIN_PATH}

lib: ${BIN_PATH}/${DLL_NAME}

${BIN_PATH}/${DLL_NAME}: ${TMP_PATH}/SubMicroCore_jni.o
	${LD} ${LDFLAGS} -LD ${LIBS} -shared -o ${TMP_PATH}/${DLL_NAME} ${TMP_PATH}/SubMicroCore_jni.o
	cp -f ${TMP_PATH}/${DLL_NAME}  ${BIN_PATH}/${DLL_NAME}

${TMP_PATH}/SubMicroCore_jni.o: src/SubMicroCore_jni.c src/SubMicroCore_jni.h
	${CC} ${CFLAGS} -o ${TMP_PATH}/SubMicroCore_jni.o -c src/SubMicroCore_jni.c

clean:
	rm -rf ${TMP_PATH}/*
	rm -rf ${BIN_PATH}/*

FYI I had written a windows version of the library but ditched it, as for ultra low latency you really need the control level that linux gives you … especially as it's free !

This is all EASY due to the good work on the hwloc project.


I am using a pretty old version (I think 1.0.2 … can't check as my linux servers are offline atm) so the API may have changed again, but I would expect the impact to be minimal.

I will give recommendations for how to use thread affinity and priority in a post on threading models. Please use with care, you can grind a system to a halt by poor usage.

My plan is still to open source components from SubMicroTrading which will include the complete JNI layer, i.e. the above plus various timer and microsecond sleep functions.