Locating JIT-generated machine code using gdb

This article describes how I was able to locate machine instructions generated by OpenJDK JIT compiler on Linux using gdb debugger. It may be useful to those who want to understand what JIT does with your code.I used this procedure to find that JIT emits efficient branchless code, as described here. I decided to share this information because I spent considerable amount of time figuring it out. It’s possible that easier approaches exist – please feel free to suggest.

First, I used these 2 options with java command: -XX:+AggressiveOpts -XX:+PrintCompilation

AggressiveOpts forces JVM to compile most everything it sees into native instructions. PrintCompilation makes it report what it does compile. With these 2 options together, you have a great chance that your target function will be compiled and you get a log statement to confirm that it does.

Next, I found that it is best to place a breakpoint just before your target section of code. But how does one set a breakpoint? Java breakpoints are useless for native code debugging, because Java debuggers don’t give you an opportunity to jump from Java code into native instructions. A breakpoint in Java interpreter code in JVM would be triggered countless number of times while JVM interprets your Java program. I found this to be impractical even for tiny handcrafted sample programs. Answer: you create a simple NOOP native function and set breakpoint in it!

I took inspiration from Fahd’s core dumper code He describes how to invoke native code from Java to cause a core dump (by deliberately dereferencing null pointer). We can take this example, but eliminate the core dump. Our function will do absolutely nothing – it will simply serve as a hook to hang the debugger breakpoint on.

The steps will be:

1. Create a Java class:


public class Hookable {

// load the library

static {

System.loadLibrary("nativelib");

}

// native method declaration

public native void here();

public static void main(String[] args) {

// any prep work

new Hookable().here();

// Your target code goes here

}

}

2. Compile your class with javac


javac Hookable.java

3. Generate JNI header file


javah -jni Hookable

4. Write C code for the function that you will use a breakpoint in gdb:


#include "Hookable.h"

void foo() {

int a=2;

return;

}

JNIEXPORT void JNICALL Java_Hookable

(JNIEnv *env, jobject obj) {

foo();

}

5. Compile the C code


gcc -fPIC -o libnativelib.so -shared -I$JAVA_HOME/include/linux/ -I$JAVA_HOME/include/ Hookable.c

6. Launch gdb to runJava under its control:


gdb --args java -XX:+AggressiveOpts -XX:+PrintCompilation Hookable

When gdb starts, add a breakpoint for foo. gdb will complain, because Java is not yet running, so foo can’t be found yet. Confirm that you really want to add this breakpoint by responding “y” to the prompt:


(gdb) break foo

Function "foo" not defined.

Make breakpoint pending on future shared library load? (y or [n]) <b>y</b>

Breakpoint 1 (foo) pending.

Run Java:


(gdb) run

Starting program: /usr/bin/java -XX:+AggressiveOpts -XX:+PrintCompilation Hookable

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

[New Thread 0x7ffff7fce700 (LWP 7336)]

…

gdb will stop at the breakpoint in foo:

Breakpoint 1, 0x00007fffe48c96ff in foo ()

from /home/ivan2/dev/benchmarks/libnativelib.so

(gdb)

Now would be a good time to set up your debugging session. For example, tell gdb to display next machine instruction to be executed:


(gdb) display/i $pc

1: x/i $pc

=> 0x7fffe48c96ff <foo+4>: movl $0x2,-0x4(%rbp)

(gdb)

Exit foo and Java_Hookable, the 2 frames that exist solely as a marker and a place to hang breakpoint on


(gdb) fin

Run till exit from #0 0x00007fffe48c96ff in foo ()

(gdb) fin

Run till exit from #0 0x00007fffe48c9723 in Java_Hookable_here ()

Your debugging session is now stopped at the point marked “// Your target code goes here”. Next debugger step wioll take you out of native C code and back into Java realm, exactly where you want to be. Happy hunting for JIT output!

Advertisements

Confirmed: OpenJDK JIT compiler emits efficient conditional move machine codes

On popular programmers’ Q&A site stackoverflow.com the hit question of the last few years dealt with the effect of branching on program performance. I recommend that you read the brilliant explanation there, but if you don’t have much time, here’s the summary. Modern processors execute commands in a pipeline. While One command is finishing executing, the next one is already started, the next one is being prepared and so on. Pipelining can only work if processor knows which command will execute next. If the program is linear, determining next command is trivial. But if the program branches (think “if” statement), processor can only guess what comes next. If the processor guesses wrong, it would have to back out several steps of the wrong branch it took. Such backout would severely degrade performance.

Fortunately, modern processors have branch prediction mechanisms, which allow them to guess branching correctly when a particular branch exhibits consistent behavior. An example would be an if statement inside a loop. If code takes “then” branch consistently, processor will be able to make correct guess and performance will remain optimal.

So efficient code would avoid branching or at least make it branching predictable. One could assume that to write efficient code, programmer needs to be aware of branch prediction and code around it. An example in the linked Stack Overflow article shows how sorting data before processing it in a loop creates consistent branching and dramatically improves performance. Another trick would be to replace mini-branches with equivalent branchless code. For example, function min(a,b) is normally written as

if (a<b) return a; else return b; 

It could rewritten (for Java’s 32 bit signed ints, with an assumption on input range not causing overflow/underflow) as

return a-((b-a)>>31&1)*(a-b)

Should you write code in this obscure style to gain performance? I’m going to show that in Java this trick is unnecessary.

The key to the problem lies in the fact that in some cases, an if statement doesn’t produce branching! X86 processors (Intel and AMD) that power most PCs and great many servers have recognized Conditional Move instructions in Pentium Pro, released about 20 years ago in 1995. Conditional move instruction is a way to achieve branching effect without branching code. The meaning of this instruction is “move value from one place to another IF specified condition is met, do nothing otherwise”. This instruction can be used to determine the minimum of 2 values or to express a ternary operator (?:) in C or Java.

The Stack Overflow post cited benchmark results that suggest that C compilers are not consistent in emitting these efficient conditional move instructions. Java seemed to use conditional move, based on performance results. I ran a slightly different benchmark (computing minimum of 2 numbers using if-return and ternary operator), comparing run times between branch-friendly ordered input data and noisy random input data. There was no difference in execution times, suggesting that branching was not a factor.

I decided to check what instructions OpenJDK JIT compiler generates. For my tests I used Java 7 on my home Ubuntu Linux running on 64 bit AMD processor.


ivan2@bigbox:~$ java -version

java version "1.7.0_79"

OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)

OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

I wrote a primitive test case with simple min function:


public int ifmin(int a, int b) {

if (a<b) return a; else return b;

}

My goal was to confirm that this simple code will be compiled to native instructions as a conditional move (CMOVxx), not a conditiona jump (Jxx). My primitive function compiled to this Java bytecode:


0: iload_0

1: iload_1

2: if_icmpge 7

5: iload_0

6: ireturn

7: iload_1

8: ireturn

I caused JVM to compile the bytecode to native instructions and traced execution with gdb debugger. By the way, locating compiled code took quite a bit of effort. I probably missed an obvious approach, but in any case, I explain how I pinpointed JIT-generated machine code in a separate post. I verified that compilation happened in a log file:


Compilation events (4 events):

Event: 0.077 Thread 0x00007f03bc0a9000 1 ivan.Code::ifmin (9 bytes)

I was then able to trace the execution of the function and see the actual machine instructions. The relevant section of gdb session looked like this:


=> 0x7fffed060190: sub $0x18,%rsp

=> 0x7fffed060197: mov %rbp,0x10(%rsp)

The instructions above populated registers in preparation for comparison.


=> 0x7fffed06019c: cmp %ecx,%edx

The above instruction compared a with b


=> 0x7fffed06019e: mov %ecx,%eax

And the above moved b to result


=> 0x7fffed0601a0: <b>cmovl</b> %edx,%eax

Next: conditional move if a<b. The logic of this steps is like this: prior to this step, result=b (we just moved b to result). Now, if a<b,result will be overwritten with a. So if a<b, then result=a. Other wise, result remains unchanged, meaning result=b. Just what we wanted!

Next, finalize and return – nothing terribly interesting


=> 0x7fffed0601a3: add $0x10,%rsp

=> 0x7fffed0601a7: pop %rbp

=> 0x7fffed0601a8: test %eax,0xaf96e52(%rip) # 0x7ffff7ff7000

=> 0x7fffed0601ae: retq

Similarly, I confirmed that ternary operator is compiled into conditional move instruction. This Java code:


int a = aarr[1];

int b = barr[2];

res[3] = a<b?a:b;

produced similar machine code sequence, including conditional move:


=> 0x7fffed05fc9c: cmp %ebp,%r11d

The above compares a with b


=> 0x7fffed05fc9f: <b>cmovl</b> %r11d,%ebp

And this lines overwrites one with the other – conditionally!!

Conclusion: for simple conditionals, modern JIT compilers emit efficient branchless code. No additional programming tricks are required. Clever interview trick of computing minimum of 2 numbers through obscure bitwise code should remain confined to interview rooms – it does nothing to improve performance in real applications.