Abstractions are tricky to get right. In order to be good, they have to be easy enough for programmers to understand, performant enough for the problem at hand, and their utility must ultimately outweigh the cost of implementing (and maintaining) them.

My definition of a “good abstraction” above is the reason why I don’t like getters. In my opinion, they are a useless abstraction. They introduce additional code that tries to solve a non-existent problem. Sure, it’s trivial code, and it can be auto-generated by your IDE. But it’s additional code. And additional code == additional work. Right?

After a recent debate with my teammate, I decided to measure the overhead introduced by getters in a modern HotSpot JVM running on an x86 CPU. Getters may look innocent, but any compiler person knows that they can translate to a fairly involved sequence of CPU instructions. I wanted to see whether these inner workings translate to degraded performance.

Getters from a compiler’s perspective

Accessing a field of some object is a very basic operation. It boils down to taking some reference to an object and accessing memory at address reference + field offset. This operation is natively supported on most CPU architectures.

Getters wrap field access in a method. Most obviously, this introduces a method call. A method is basically a fancy function that receives a reference to this as an implicit first argument. This means that in order to invoke a method we need to perform all the same steps involved in calling a function: push the return address onto the stack, pass the arguments, branch to the function, execute its instructions, pass the return value back, pop the return address from the stack, and branch back to the caller. Quite a ritual.

Functions are a cheap abstraction, but they’re not free. In fact, one of the easiest, yet most impactful optimizations performed by almost all compilers is function inlining.

Finally, we’re working with Java, which means that we’re forced to use virtual method dispatch almost every time we call an instance method. Why? Well, think about it. Everything is an object, and every class (other than Object itself) extends some other class. When we call foo.bar(), we have to check what foo’s actual runtime type is. Maybe it’s not a Foo, which defines bar(), but an instance of a subclass of Foo that overrides that method.
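To make this concrete, here’s a minimal, hypothetical example (Foo, LoudFoo, and DispatchDemo are not part of the benchmark) showing why the runtime type decides which method actually runs:

class Foo {
    int bar() {
        return 1;
    }
}

class LoudFoo extends Foo {
    @Override
    int bar() {
        return 2;
    }
}

public class DispatchDemo {
    public static void main(String[] args) {
        Foo foo = new LoudFoo();       // static type Foo, runtime type LoudFoo
        System.out.println(foo.bar()); // virtual dispatch picks LoudFoo.bar() and prints 2
    }
}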

Designing the benchmark

I wanted to design the benchmark in such a way that field accesses happen in isolation. Also, I wanted to let the optimizer have its way with the field access, but I couldn’t allow it to optimize the access away entirely.

With these requirements in mind, I devised the following procedure:

  1. Create a large array of wrapper objects that store random integers. These wrappers will expose their integer in different ways, allowing us to compare field access strategies.
  2. Loop over the array and compute the sum of all integers.
  3. Display the sum.

These are the wrapper objects that I came up with:

RawWrapper

RawWrapper is just a simple object that exposes the number through a public final field.

public final class RawWrapper {
    public final int number;

    public RawWrapper(int number) {
        this.number = number;
    }
}

GetterWrapper

GetterWrapper exposes the number through a getter int getNumber().

public final class GetterWrapper {
    private final int number;

    public GetterWrapper(int number) {
        this.number = number;
    }

    public int getNumber() {
        return number;
    }
}

ParentGetterWrapper and ChildGetterWrapper

The last two wrappers were designed with a trickier goal in mind. I’ve read before that the HotSpot compiler can devirtualize method calls, which means that it could make getter accesses just as fast as raw field accesses. I wanted to create a scenario that would prevent the compiler from doing that.

Class ParentGetterWrapper is a getter wrapper object just like GetterWrapper. However, it’s not declared as final, meaning that it can be extended by other classes.

public class ParentGetterWrapper {
    private final int number;

    public ParentGetterWrapper(int number) {
        this.number = number;
    }

    public int getNumber() {
        return number;
    }
}

ChildGetterWrapper extends ParentGetterWrapper. It inherits the getter from its parent, but now the optimizer can no longer tell, from the declared type alone, which class a getNumber() call will resolve to.

public class ChildGetterWrapper extends ParentGetterWrapper {
    public ChildGetterWrapper(int number) {
        super(number);
    }
}

The wrappers array was populated with a random mix of both classes.

public static ParentGetterWrapper[] generateVirtualGetterWrappers(int benchmarkSize) {
    ParentGetterWrapper[] wrappers = new ParentGetterWrapper[benchmarkSize];
    Random randomGenerator = new Random();
    for (int i = 0; i < benchmarkSize; i++) {
        int number = randomGenerator.nextInt();
        int coinToss = randomGenerator.nextInt(2);
        ParentGetterWrapper wrapper;
        if (coinToss == 0) {
            wrapper = new ParentGetterWrapper(number);
        } else {
            wrapper = new ChildGetterWrapper(number);
        }
        wrappers[i] = wrapper;
    }
    return wrappers;
}
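The benchmark methods themselves are just summation loops over these arrays. Here’s a minimal sketch of the raw-field variant, reconstructed from the bytecode shown in the next section (the exact code lives in the repo); the getter variants are identical except that they call getNumber():

public static int benchmarkRawWrappers(RawWrapper[] wrappers) {
    int sum = 0;
    for (RawWrapper wrapper : wrappers) {
        sum += wrapper.number; // direct field access
    }
    return sum;                // the sum is displayed later, so the loop can't be optimized away
}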

Analyzing benchmark output

You can find all the code and output in this article’s GitHub repo. All output discussed below is located under the output directory.

Output will depend on your JDK, OS, and host CPU architecture. I compiled and ran the benchmark with Temurin OpenJDK 20.0.2 running on a MacBook with an Intel Core i7-9750H.

Bytecode

First, let’s take a look at the bytecode that javac generated for the different benchmark methods.

RawWrapper

This access boiled down to a single getfield instruction at byte offset 25 in the benchmarkRawWrappers method.

public static int benchmarkRawWrappers(foo.rida.RawWrapper[]);
  descriptor: ([Lfoo/rida/RawWrapper;)I
  flags: (0x0009) ACC_PUBLIC, ACC_STATIC
  Code:
    stack=2, locals=6, args_size=1
       0: iconst_0
       1: istore_1
       2: aload_0
       3: astore_2
       4: aload_2
       5: arraylength
       6: istore_3
       7: iconst_0
       8: istore        4
      10: iload         4
      12: iload_3
      13: if_icmpge     36
      16: aload_2
      17: iload         4
      19: aaload
      20: astore        5
      22: iload_1
      23: aload         5
      25: getfield      #80                 // Field foo/rida/RawWrapper.number:I
      28: iadd
      29: istore_1
      30: iinc          4, 1
      33: goto          10
      36: iload_1
      37: ireturn

GetterWrapper and ParentGetterWrapper

The bytecode generated for benchmarkGetterWrappers and benchmarkVirtualGetterWrappers was structurally identical. Unsurprisingly, we got an invokevirtual at byte offset 25, which performs virtual method dispatch for getNumber(). The getter itself, inside GetterWrapper and ParentGetterWrapper, compiles down to a getfield followed by an ireturn.

public static int benchmarkGetterWrappers(foo.rida.GetterWrapper[]);
  descriptor: ([Lfoo/rida/GetterWrapper;)I
  flags: (0x0009) ACC_PUBLIC, ACC_STATIC
  Code:
    stack=2, locals=6, args_size=1
       0: iconst_0
       1: istore_1
       2: aload_0
       3: astore_2
       4: aload_2
       5: arraylength
       6: istore_3
       7: iconst_0
       8: istore        4
      10: iload         4
      12: iload_3
      13: if_icmpge     36
      16: aload_2
      17: iload         4
      19: aaload
      20: astore        5
      22: iload_1
      23: aload         5
      25: invokevirtual #84                 // Method foo/rida/GetterWrapper.getNumber:()I
      28: iadd
      29: istore_1
      30: iinc          4, 1
      33: goto          10
      36: iload_1
      37: ireturn
public int getNumber();
  descriptor: ()I
  flags: (0x0001) ACC_PUBLIC
  Code:
    stack=1, locals=1, args_size=1
       0: aload_0
       1: getfield      #7                  // Field number:I
       4: ireturn

Machine code

Java bytecode may get compiled by two different compilers at runtime. At first, HotSpot just interprets the bytecode, which is slow but bearable for infrequently executed code. When HotSpot detects that some method (or some hot loop) runs often, it compiles it to machine code using C1, a compiler designed to compile quickly at the cost of applying fewer optimizations. Then, if the compiled method continues to run often, HotSpot recompiles it using C2, a slower compiler that optimizes code aggressively.

For brevity I’ll only discuss C2-generated code. I’ll also omit large portions of assembly and focus only on the part where number is retrieved from the wrapper and added to the sum.

RawWrapper

The field access was predictably compiled down to a mov instruction with an offset. Additionally, HotSpot unrolled the loop, fetching several wrappers in a single iteration. The unrolled assembly is a bit jarring to read, but you can see the three movs that load number into eax, r11d, and r10d. These registers are then added to r13d, which holds the running sum.

mov r10d,DWORD PTR [rbx+r14*4+0x10]
add r13d,DWORD PTR [r12+r10*8+0xc]  ; implicit exception: dispatches to 0x000000011721afa8
movsxd rsi,r14d
mov r10d,DWORD PTR [rbx+rsi*4+0x14] ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@19 (line 76)
mov eax,DWORD PTR [r12+r10*8+0xc]   ; implicit exception: dispatches to 0x000000011721afa8
                                    ;*getfield number {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@25 (line 77)
mov r10d,DWORD PTR [rbx+rsi*4+0x18] ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@19 (line 76)
mov r11d,DWORD PTR [r12+r10*8+0xc]  ; implicit exception: dispatches to 0x000000011721afa8
                                    ;*getfield number {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@25 (line 77)
mov esi,DWORD PTR [rbx+rsi*4+0x1c]  ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@19 (line 76)
mov r10d,DWORD PTR [r12+rsi*8+0xc]  ; implicit exception: dispatches to 0x000000011721afa8
                                    ;*getfield number {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@25 (line 77)
add r13d,eax
add r13d,r11d
add r13d,r10d                       ;*iadd {reexecute=0 rethrow=0 return_oop=0}
                                    ; - foo.rida.SimpleBenchmark::benchmarkRawWrappers@28 (line 77)

GetterWrapper

The code is… completely identical to the RawWrapper version. HotSpot not only inlined and devirtualized the getter call, but also applied the same unrolling optimization, fetching several wrappers per iteration.

ParentGetterWrapper and ChildGetterWrapper

C2 saw right through my charade with the phony inheritance and produced the same code as for the other wrappers. It even used the same registers as before, which made it feel surprisingly personal.

Confirming results with microbenchmarks

I’m not a big fan of microbenchmarks, but I wanted to double-check the results just to make sure that I wasn’t misreading the assembly. JMH confirmed that the average times for all three field-access approaches were within the margin of error of one another.

Benchmark                                     Score        Error
JmhBenchmark.benchmarkRawWrappers             311.917 ns   ±8.322 ns
JmhBenchmark.benchmarkGetterWrappers          312.967 ns   ±8.474 ns
JmhBenchmark.benchmarkVirtualGetterWrappers   323.835 ns   ±8.074 ns
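For reference, here is a minimal sketch of what such a JMH benchmark could look like. It is only a sketch: the actual setup (array size, warmup, forks) is in the repo, and the array size below is a placeholder.

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import foo.rida.RawWrapper;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JmhBenchmark {
    private RawWrapper[] rawWrappers;

    @Setup
    public void setup() {
        // Array size is a placeholder; the real benchmark size is defined in the repo.
        Random random = new Random();
        rawWrappers = new RawWrapper[1_000];
        for (int i = 0; i < rawWrappers.length; i++) {
            rawWrappers[i] = new RawWrapper(random.nextInt());
        }
    }

    @Benchmark
    public int benchmarkRawWrappers() {
        int sum = 0;
        for (RawWrapper wrapper : rawWrappers) {
            sum += wrapper.number;
        }
        return sum; // returning the sum keeps JMH from dead-code-eliminating the loop
    }
}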

So there is no overhead at all?

Not quite. We always pay a cost for abstractions, whether it’s execution time, compile time, engineer time, or something else. Here are the compilation statistics for the benchmark:

Method                                        Bytecode size   Native size   Compile time
JmhBenchmark.benchmarkRawWrappers             38              496           2 ms
JmhBenchmark.benchmarkGetterWrappers          38              496           2 ms
JmhBenchmark.benchmarkVirtualGetterWrappers   38              512           3 ms

Notice that C2 generated more code for benchmarkVirtualGetterWrappers and spent more time doing it? What happened?

benchmarkVirtualGetterWrappers receives an array of ParentGetterWrapper objects, which may actually contain ChildGetterWrapper instances. Both classes share the same implementation of getNumber(), so the compiler figured it could cheat by calling that method directly instead of performing virtual dispatch every time. It then went one step further and inlined the method, resulting in code identical to the raw field access.

However, the JVM cannot guarantee that benchmarkVirtualGetterWrappers won’t someday be passed a class that does override getNumber(). What should it do then? It cannot execute incorrect code for the sake of speed.
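For example, a class like the following (hypothetical, not part of the benchmark) would invalidate the optimization the moment it is loaded and its instances are passed in:

public class OverridingGetterWrapper extends ParentGetterWrapper {
    public OverridingGetterWrapper(int number) {
        super(number);
    }

    @Override
    public int getNumber() {
        // A different implementation: the devirtualized, inlined call in the
        // optimized method would now return the wrong value for this class.
        return super.getNumber() * 2;
    }
}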

HotSpot solves this by inserting “uncommon traps” into the generated assembly. If code that violates the optimizer’s view of the world (uncommon code) enters an optimized method, it triggers the “trap” and execution falls back to the unoptimized, but correct, bytecode. In our case, the uncommon trap generated by C2 transfers execution back to a path that performs virtual dispatch of getNumber().

call   0x000000011c906180 ; ImmutableOopMap {}
                          ;*invokevirtual getNumber {reexecute=0 rethrow=0 return_oop=0}
                          ; - foo.rida.SimpleBenchmark::benchmarkVirtualGetterWrappers@25 (line 93)
                          ;   {runtime_call UncommonTrapBlob}

Why didn’t HotSpot generate uncommon traps for benchmarkGetterWrappers? Since GetterWrapper is declared final, HotSpot reasonably concluded that no other implementation of getNumber() can ever exist.

Am I going to start using getters now?

I still think that the only good reason to use getters is to follow an existing project’s convention. But convincing myself to use getters wasn’t my goal.

My goal was to explore and share how getters are handled by a modern HotSpot JVM. Hopefully, you’ve now seen that even things that seem trivial to the programmer can be fairly complex under the hood. Oracle hired the best engineers to make HotSpot an excellent tool. But that’s all it is. A tool. And a tool can never reason about the program better than you, the programmer. It’s your responsibility to master your tool and make informed decisions on how to use it well.