Monday, May 30, 2016

Benchmarking Java Locks with Counters

These days I am analyzing some Java Flight Recordings from taken from WSO2 API Manager performance tests and I found out that main processing threads were in "BLOCKED" state in some situations.

The threads were mainly blocked due to "synchronized" methods in Java. Synchronizing the methods in a critical section of request processing causes bottlenecks and it has an impact on the throughput and overall latency.

Then I was thinking whether we could avoid synchronizing the whole method. The main problem with synchronized is that only one thread can run that critical section. When it comes to consumer/producer scenarios, we may need to give read access to data in some threads and write access to a thread to edit data exclusively. Java provides ReadWriteLock for these kinds of scenarios.

Java 8 provides another kind of lock named StampedLock. The StampedLock provides an alternative way to the standard ReadWriteLock and it also supports optimistic reads. I'm not going to compare the features and functionalities of the each lock type in this blog post. You may read the StampedLock Idioms by Dr. Heinz M. Kabutz.

I'm more interested in finding out which lock is faster when it is accessed by multiple threads. Let's write a benchmark!


The code for benchmarks


There is an article on "Java 8 StampedLocks vs. ReadWriteLocks and Synchronized" by Tal Weiss, who is the CEO of Takipi. In that article, there is a benchmark for Java locks with different counter implementations. I'm using that counters benchmark as the basis for my benchmark. 

I also found another fork of the same benchmark and it has added the Optimistic Stamped version and Fair mode of ReentrantReadWriteLock. I found out about that from the slides on "Java 8 - Stamped Lock" by Haim Yadid after I got my benchmark results.

I also looked at the article "Java Synchronization (Mutual Exclusion) Benchmark" by Baptiste Wicht.

I'm using the popular JMH library for my benchmark. The JMH has now become the standard way to write Java microbenchmarks. The benchmarks done by Tal Weiss do not use JMH.

See JMH Resources by Nitsan Wakart for an introduction to JMH and related links to get more information about JMH.

I used Thread Grouping feature in JMH and the Group states for benchmarking different counter implementations.

This is my first attempt in writing a proper microbenchmark. If there are any problems with the code, please let me know. When we talk about benchmarks, it's important to know that you should not expect the same results in a real life application and the code may behave differently in runtime.

There are 11 counter implementations. I also benchmarked the fair and non-fair modes of ReentrantLockReentrantReadWriteLock and Semaphore.

Class Diagram for Counter implementations

There are altogether 14 benchmark methods!

  1. Adder - Using LongAdder. This is introduced in Java 8.
  2. Atomic - Using AtomicLong
  3. Dirty - Not using any mechanism to control concurrent access
  4. Lock Fair Mode - Using ReentrantLock
  5. Lock Non-Fair Mode - Using ReentrantLock
  6. Read Write Lock Fair Mode - Using ReentrantReadWriteLock
  7. Read Write Lock Non-Fair Mode - Using ReentrantReadWriteLock
  8. Semaphore Fair Mode - Using Semaphore
  9. Semaphore Non-Fair Mode - Using Semaphore
  10. Stamped - Using  StampedLock
  11. Optimistic Stamped - Using  StampedLock with tryOptimisticRead(); If it fails, I used the read lock. There are no more attempts to tryOptimisticRead().
  12. Synchronized - Using synchronized block with an object
  13. Synchronized Method - Using synchronized keyword in methods
  14. Volatile  - Using volatile keyword for counter variable

The code is available at https://github.com/chrishantha/microbenchmarks/tree/v0.0.1-initial-counter-impl

Benchmark Results


As I mentioned, I used Thread Grouping feature in JMH and I ran the benchmarks for different thread group distributions. There were 10 iterations after 5 warm-up iterations. I measured only the "throughput". Measuring latency would be very difficult (as the minimum throughtput values were having around 6 digits)

The thread group distribution was passed by the "-tg" argument to the JMH and the first number was used for "get" (read) operations and the second number was used for "increment" (write) operations.

There are many combinations we can use to run the benchmark tests. I used 12 combinations for thread group distribution and those are specified in the benchmark script.

These 12 combinations include the scenarios tested by Tal Weiss and Baptiste Wicht.

The benchmark was run on my Thinkpad T530 laptop.

$ hwinfo --cpu --short
cpu:                                                            
                       Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, 3394 MHz
                       Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, 3333 MHz
                       Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, 3305 MHz
                       Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, 3333 MHz
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15866        4424        7761         129        3680       11204
Swap:         18185           0       18185

Note: I added the "Dirty" counter only to compare the results, but I omitted it from the benchmark as no one wants to keep a dirty counter in their code. :)

I have committed all results to the Github repository and I used gnuplot for the graphs.

It's very important to note that the graphs show the throughput for both reader and writer threads. If you need to look at individual reader and writer throughput, you can refer the results at https://github.com/chrishantha/microbenchmarks/tree/v0.0.1-initial-counter-impl/counters/results

Let's see the results!

1 Reader, 1 Writer

2 Readers, 2 Writers

4 Readers, 4 Writers

5 Readers, 5 Writers

10 Readers, 10 Writers

16 Readers, 16 Writers

64 Readers, 64 Writers

128 Readers, 128 Writers

1 Reader, 19 Writers

19 Readers, 1 Writer

4 Readers, 16 Writers

16 Readers, 4 Writers



Conclusion


Following are some conclusions we can make when looking at above results

  1. Optimistic Stamped counter has much better throughput when there is high contention.
  2. Fair modes of the locks are very slow.
  3. Adder counter has better throughput than Atomic counter when there are more writers.
  4. When there are less threads, the Synchronized and Synchronized Method counters has better throughput than using a Read Write Lock (in non-fair mode, which is the default)
  5. The Lock counter also has better throughput than Read Write Lock when there are less threads.

Adder, Atomic and Volatile counter examples do not show a way to provide mutual exclusion, but those are thread-safe ways to keep a count. You may refer benchmark results for other counters with Java locks if you want to have a mutual exclusion to some logic in your code.

In this benchmark, the read write lock has performed poorly. The reason could be that there are writers continuously trying to access the write lock. There may be situations that a write lock may be required less frequently and therefore this benchmark is probably not a good way to evaluate performance for read write locks.

Please make sure that you run the benchmarks for your scenarios before making a decision based on these results. Even my benchmarks give slightly different results for each run. So, it's not a good idea to rely entirely on benchmarks and you must test the performance of the overall application.


If there are any questions or comments on the results or regarding benchmark code, please let me know.