
1 This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
Parallel-lazy performance: Java 8 vs Scala vs GS Collections
Craig Motlin, June 2014

2 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

3 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid Lots of claims and opinions

4 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid Lots of evidence

5 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

6 Intro Solve the same problem in all three libraries – Java (1.8.0_05) – GS Collections (5.1.0) – Scala (2.11.0) Count how many even numbers are in a list of numbers Then accomplish the same thing in parallel – Data-level parallelism – Batch the data – Use all the cores

7 Performance Factors Tests that isolate individual performance factors Count Filter, Transform, Transform, Filter, convert to List Aggregation – Market value stats aggregated by product or category

8 Count: Serial
Java: long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
GS Collections: int evens = fastList.count(each -> each % 2 == 0);
Scala: val evens = arrayBuffer.count(_ % 2 == 0)

9 Count: Serial Lazy
Java: long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
GS Collections: int evens = fastList.asLazy().count(each -> each % 2 == 0);
Scala: val evens = arrayBuffer.view.count(_ % 2 == 0)

10 Count: Parallel Lazy
Java: long evens = arrayList.parallelStream().filter(each -> each % 2 == 0).count();
GS Collections: int evens = fastList.asParallel(executorService, BATCH_SIZE).count(each -> each % 2 == 0);
Scala: val evens = arrayBuffer.par.count(_ % 2 == 0)

11 Parallel Lazy
[Diagram: 1M source elements are split into batches (1-10k, 10k-20k, 30k-40k, ..., 990k-1M); each batch is filtered and counted (about 5k evens per batch), and the per-batch counts are reduced to the total of 500k.]
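A minimal hand-rolled sketch of this batch-and-reduce shape, using a plain ExecutorService; the class, method, and parameter names are illustrative, not any of the three libraries' internals:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class BatchedCountSketch {
    // Batch the data, count evens per batch on the pool, then reduce the per-batch counts.
    static long countEvens(List<Integer> numbers, ExecutorService executorService, int batchSize)
            throws InterruptedException, ExecutionException {
        List<Future<Long>> futures = new ArrayList<>();
        for (int start = 0; start < numbers.size(); start += batchSize) {
            List<Integer> batch =
                    numbers.subList(start, Math.min(start + batchSize, numbers.size()));
            futures.add(executorService.submit(
                    () -> batch.stream().filter(each -> each % 2 == 0).count()));
        }
        long total = 0;
        for (Future<Long> future : futures) {
            total += future.get();   // the reduce step: sum the per-batch counts
        }
        return total;
    }
}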

12 Parallel Eager
[Diagram: the same 1M elements are batched (1-10k, 10k-20k, ..., 990k-1M); each batch is first filtered into an intermediate collection of evens (2, 4, 6, 8, ... per batch), then counted (about 5k per batch), and the counts are reduced to 500k.]

13 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid Time for some numbers!

14 Serial Count ops/s (higher is better)

15 Parallel Count ops/s (higher is better) Measured on an 8 core Linux VM Intel Xeon E5-2697 v2 8x

16 Java Microbenchmark Harness “JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.” 5 forked JVMs per test 100 warmup iterations per JVM 50 measurement iterations per JVM 1 second of looping per iteration http://openjdk.java.net/projects/code-tools/jmh/
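A hedged sketch of one way to express that configuration with JMH annotations, assuming they are available in the JMH version the tests use; the actual benchmarks may set forks and iterations on the command line instead (@GenerateMicroBenchmark was later renamed @Benchmark):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@Fork(5)                                                                // 5 forked JVMs per test
@Warmup(iterations = 100, time = 1, timeUnit = TimeUnit.SECONDS)        // 100 warmup iterations, 1 second of looping each
@Measurement(iterations = 50, time = 1, timeUnit = TimeUnit.SECONDS)    // 50 measurement iterations per JVM
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class CountTest
{
    // @GenerateMicroBenchmark methods go here
}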

17 Java Microbenchmark Harness
@GenerateMicroBenchmark
public void parallel_lazy_jdk()
{
    long evens = this.integersJDK.parallelStream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}

18 Java Microbenchmark Harness @Setup includes megamorphic warmup. More info on megamorphic dispatch is in the appendix. This is something that JMH does not handle for you!

19 Java Microbenchmark Harness Throughput: higher is better. Enough warmup iterations so that standard deviation is low.
Benchmark                        Mode   Samples  Mean     Mean error  Units
CountTest.parallel_eager_gsc     thrpt  250      629.961  8.305       ops/s
CountTest.parallel_lazy_gsc      thrpt  250      595.023  7.153       ops/s
CountTest.parallel_lazy_jdk      thrpt  250      415.382  7.766       ops/s
CountTest.parallel_lazy_scala    thrpt  250      331.938  2.141       ops/s
CountTest.serial_eager_gsc       thrpt  250      115.197  0.328       ops/s
CountTest.serial_eager_scala     thrpt  250       91.167  0.864       ops/s
CountTest.serial_lazy_gsc        thrpt  250       73.625  3.619       ops/s
CountTest.serial_lazy_jdk        thrpt  250       58.182  0.477       ops/s
CountTest.serial_lazy_scala      thrpt  250       84.200  1.033       ops/s
...

20 Performance Factors Tests that isolate individual performance factors Count Filter, Transform, Transform, Filter, convert to List Aggregation – Market value stats aggregated by product or category

21 Java Microbenchmark Harness Performance tests are open-sourced. Read them and run them on your hardware: https://github.com/goldmansachs/gs-collections/

22 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns

23 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Isolated by using array-backed lists. ArrayList, FastList, and ArrayBuffer Isolated because combination of intermediate results is simple addition. Let’s look at reasons for the differences in count()

24 Count: Java 8

25 Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}

26 Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}
filter(Predicate).count() instead of count(Predicate)

27 Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}
filter(Predicate).count() instead of count(Predicate). Is count() just incrementing a counter?

28 Count: Java 8 implementation
public final long count()
{
    return mapToLong(e -> 1L).sum();
}

public final long sum()
{
    return reduce(0, Long::sum);
}

/** @since 1.8 */
public static long sum(long a, long b)
{
    return a + b;
}

29 Count: Java 8 implementation
this.integersJDK.stream().filter(each -> each % 2 == 0).mapToLong(e -> 1L).reduce(0, Long::sum);
this.integersGSC.asLazy().count(each -> each % 2 == 0);

30 Count: Java 8 implementation
this.integersJDK.stream().filter(each -> each % 2 == 0).mapToLong(e -> 1L).reduce(0, Long::sum);
this.integersGSC.asLazy().count(each -> each % 2 == 0);
Seems like extra work
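For comparison, a baseline sketch: counting with a plain loop over the same ArrayList<Integer>, which is roughly the per-element work the GS Collections version on the next slides boils down to (the class and method names here are illustrative):

import java.util.List;

final class PlainCountSketch {
    static long countEvens(List<Integer> integers) {
        long evens = 0;
        for (Integer each : integers) {     // auto-unboxes once per element
            if (each % 2 == 0) {
                evens++;
            }
        }
        return evens;
    }
}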

31 Count: GS Collections

32 @GenerateMicroBenchmark
public void serial_lazy_gsc()
{
    int evens = this.integersGSC.asLazy().count(each -> each % 2 == 0);
    Assert.assertEquals(SIZE / 2, evens);
}

33 Count: GS Collections AbstractLazyIterable.java
public int count(Predicate<? super T> predicate)
{
    CountProcedure<T> procedure = new CountProcedure<T>(predicate);
    this.forEach(procedure);
    return procedure.getCount();
}

34 Count: GS Collections FastList.java
public void forEach(Procedure<? super T> procedure)
{
    for (int i = 0; i < this.size; i++)
    {
        procedure.value(this.items[i]);
    }
}

35 Count: GS Collections
public class CountProcedure<T> implements Procedure<T>
{
    private final Predicate<? super T> predicate;
    private int count;
    ...
    public void value(T object)
    {
        if (this.predicate.accept(object))
        {
            this.count++;
        }
    }

    public int getCount()
    {
        return this.count;
    }
}

36 Count: GS Collections
public class CountProcedure<T> implements Procedure<T>
{
    private final Predicate<? super T> predicate;
    private int count;
    ...
    public void value(T object)
    {
        if (this.predicate.accept(object))
        {
            this.count++;
        }
    }

    public int getCount()
    {
        return this.count;
    }
}
Predicate from the test: each -> each % 2 == 0

37 Count: Scala

38 Count: Scala implementation TraversableOnce.scala
def count(p: A => Boolean): Int = {
  var cnt = 0
  for (x <- this)
    if (p(x)) cnt += 1
  cnt
}

39 Count: Scala implementation TraversableOnce.scala
def count(p: A => Boolean): Int = {
  var cnt = 0
  for (x <- this)
    if (p(x)) cnt += 1
  cnt
}
The for-comprehension becomes a call to foreach(). The lambda closes over cnt: it executes the predicate and increments cnt, just like CountProcedure.

40 Count: Scala implementation
public final java.lang.Object apply(java.lang.Object);
   0: aload_0
   1: aload_1
   2: invokestatic  #32  // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
   5: invokevirtual #34  // Method apply:(I)Z
   8: invokestatic  #38  // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
  11: areturn

public boolean apply$mcZI$sp(int);
   0: iload_1
   1: iconst_2
   2: irem
   3: iconst_0
   4: if_icmpne 11
   7: iconst_1
   8: goto 12
  11: iconst_0
  12: ireturn

public final boolean apply(int);
   0: aload_0
   1: iload_1
   2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
   5: ireturn

41 Count: Scala implementation
[Diagram: per element, the boxed Integer is unboxed to int via Integer.intValue(); the lambda _ % 2 == 0 runs on the primitive (bytecode: irem), producing a boolean; the boolean is boxed back to Boolean via Boolean.valueOf(boolean).]
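Written out as plain Java, the boxed bridge call in the bytecode above does roughly this per element; an illustration of the extra boxing steps, not Scala library code:

// Illustration only: what the generated function class's apply(Object) bridge effectively does.
final class EvenPredicateBridge {
    Object apply(Object boxedElement) {
        int unboxed = ((Integer) boxedElement).intValue();  // Integer -> int  (unboxToInt)
        boolean result = unboxed % 2 == 0;                  // the lambda body (bytecode: irem)
        return Boolean.valueOf(result);                     // boolean -> Boolean (boxToBoolean)
    }
}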

42 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Scala’s auto-boxing Java’s pull lazy evaluation

43 Performance Factors Tests that isolate individual performance factors Count Filter, Transform, Transform, Filter, convert to List Aggregation – Market value stats aggregated by product or category

44 Parallel / Lazy / JDK
List<Integer> list = this.integersJDK.parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);

45 Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);

46 Parallel / Lazy / Scala
val list = this.integers.par
  .filter(each => each % 10000 != 0)
  .map(String.valueOf)
  .map(Integer.valueOf)
  .filter(each => (each + 1) % 10000 != 0)
  .toBuffer
Assert.assertEquals(999800, list.size)

47 Stacked computation ops/s (higher is better) 8x

48 Parallel / Lazy / JDK
List<Integer> list = this.integersJDK.parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);

49 Parallel / Lazy / JDK
List<Integer> list = this.integersJDK.parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);
Collectors.toList() is built from: ArrayList::new, List::add, and (left, right) -> { left.addAll(right); return left; }
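Spelled out, those three pieces form a collector roughly equivalent to Collectors.toList(); an illustrative reconstruction, not the JDK source:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

final class ToListSketch {
    static final Collector<Integer, List<Integer>, List<Integer>> TO_LIST =
            Collector.of(
                    ArrayList::new,     // supplier: a new list for each leaf task
                    List::add,          // accumulator: add one element
                    (left, right) -> { left.addAll(right); return left; });  // combiner: copy right into left at every join
}

The combiner is where the cost lives: every join copies the right-hand partial result into the left, which is what the next slides quantify.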

50 Fork-Join Merge Intermediate results are merged in a tree Merging is O(n log n) work and garbage

51 Fork-Join Merge Amount of work done by last thread is O(n)
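To make the merge cost concrete, a minimal sketch (not the JDK's actual fork-join code) of tree-merging per-batch result lists with the addAll() combiner from slide 49:

import java.util.ArrayList;
import java.util.List;

final class MergeTreeSketch {
    // Pairwise merging of per-batch lists. Every pass copies elements via addAll(),
    // producing garbage, and the final pass moves half of all the data in one thread.
    static List<Integer> mergeTree(List<List<Integer>> perBatchResults) {
        if (perBatchResults.isEmpty()) {
            return new ArrayList<>();
        }
        List<List<Integer>> level = new ArrayList<>(perBatchResults);
        while (level.size() > 1) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                List<Integer> left = level.get(i);
                if (i + 1 < level.size()) {
                    left.addAll(level.get(i + 1));  // the combiner step: copying at every join
                }
                next.add(left);
            }
            level = next;
        }
        return level.get(0);
    }
}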

52 Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);
ParallelIterable.toList() returns a CompositeFastList, a List with an O(1) implementation of addAll()

53 Parallel / Lazy / GSC
public final class CompositeFastList<E>
{
    private final FastList<FastList<E>> lists = FastList.newList();

    public boolean addAll(Collection<? extends E> collection)
    {
        FastList<E> collectionToAdd = collection instanceof FastList
            ? (FastList<E>) collection
            : new FastList<E>(collection);
        this.lists.add(collectionToAdd);
        return true;
    }
    ...
}

54 CompositeFastList Merge Merging is O(1) work per batch. [Diagram: each batch's result list is simply appended to a single CompositeFastList (CFL).]

55 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Fork-join is general purpose but requires merge work Specialized data structures meant for combining

56 Thread Pools

57 Parallel: GSC.asParallel(this.executorService, BATCH_SIZE) You must specify your own batch size – 10,000 is fine – size / (8 * #cores) is fine You must specify your own thread pool – Can share, or not – Can tailor for CPU-bound Executors.newFixedThreadPool( Runtime.getRuntime().availableProcessors()) – Or IO-Bound Executors.newFixedThreadPool(maxDbConnections)
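A sketch of the setup described above; the thread pool and batch size are whatever the caller chooses, the class and method here are illustrative, and executor shutdown is omitted:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.gs.collections.impl.list.mutable.FastList;

final class AsParallelSetupSketch {
    static int countEvens(FastList<Integer> integersGSC) {
        // CPU-bound work: one thread per core, as the slide suggests.
        ExecutorService executorService =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        // Either 10,000, or scale the batch size to the data and the machine.
        int batchSize = Math.max(1,
                integersGSC.size() / (8 * Runtime.getRuntime().availableProcessors()));

        return integersGSC.asParallel(executorService, batchSize)
                .count(each -> each % 2 == 0);
    }
}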

58 Parallel: Scala One shared fork-join pool, configurable Batch sizes are dynamic and respond to work stealing Minimum batch size: 1 + size / (8 * #cores)

59 Parallel: Java 8 One shared fork-join pool, not configurable Batch sizes are dynamic and respond to work stealing Minimum batch size: – max(1, size / (4 * (#cores - 1))) – Default pool also has #cores – 1 threads, plus main thread helps – Can be changed with system property java.util.concurrent.ForkJoinPool.common.parallelism
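For example, a small sketch of adjusting and checking the common pool's parallelism; the property has to be set before the common pool is first touched, since it is read when the pool initializes:

import java.util.concurrent.ForkJoinPool;

final class CommonPoolSketch {
    public static void main(String[] args) {
        // Equivalent to -Djava.util.concurrent.ForkJoinPool.common.parallelism=7 on the command line.
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "7");

        // Verify what the common pool actually got.
        System.out.println(ForkJoinPool.commonPool().getParallelism());
    }
}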

60 Aggregation

61 Aggregation Domain

62 Aggregate by Categories 8x

63 Aggregate by Accounts 8x

64 Aggregate by Category Streams Map categoryDoubleMap = this.jdkPositions.parallelStream().collect( Collectors.groupingBy( Position::getCategory, Collectors.summarizingDouble(Position::getMarketValue)));

65 Aggregate by Category GSC MapIterable categoryDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE).aggregateInPlaceBy( Position::getCategory, MarketValueStatistics::new, MarketValueStatistics::acceptThis);

66 Aggregate by Category GSC MapIterable categoryDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE).aggregateInPlaceBy( Position::getCategory, MarketValueStatistics::new, MarketValueStatistics::acceptThis); What if we group by Account instead?

67 Aggregate by Account GSC MapIterable accountDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE).aggregateInPlaceBy( Position::getAccount, MarketValueStatistics::new, MarketValueStatistics::acceptThis); What if we group by Account instead?

68 Collapse factor
MapIterable categoryDoubleMap: there are 26 categories, so the map has 26 keys.
MapIterable accountDoubleMap: there are 100k accounts, so the map has 100k keys.

69 Collapse factor Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys

70 Collapse factor Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys See Mohammad Rezaei’s presentation from QCon 2012 called “Fine Grained Coordinated Parallelism in a Real World Application.”
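For illustration only, here is the shape of the single-shared-map approach using the JDK's ConcurrentHashMap rather than GS Collections' internal ConcurrentHashMapUnsafe; Position and MarketValueStatistics are the deck's domain types, and the assumption that getCategory() returns a String is mine:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

final class SharedMapAggregationSketch {
    // Every batch writes into the same shared map; compute() updates one key atomically,
    // so there is no merge step, but threads contend when there are only a few keys.
    static void aggregateBatch(List<Position> batch,
            ConcurrentHashMap<String, MarketValueStatistics> results) {
        for (Position position : batch) {
            results.compute(position.getCategory(), (category, stats) -> {
                MarketValueStatistics updated = stats == null ? new MarketValueStatistics() : stats;
                updated.acceptThis(position);   // mutating aggregation, as in aggregateInPlaceBy
                return updated;
            });
        }
    }
}

With 26 categories every batch hits the same 26 keys, which is the contention cost described above; with 100k accounts the writes spread out and the shared map avoids the merge work entirely.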

71 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Test groupBy, aggregateBy

72 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

73 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

74 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

75 Q&A

76 http://github.com/goldmansachs/gs-collections http://github.com/goldmansachs/gs-collections-kata @GoldmanSachs http://stackoverflow.com/questions/tagged/gs-collections craig.motlin@gs.com Info in appendix Sets Handcoded parallelism Megamorphic warmup

77 Appendix

78 Hashtable Sets

79 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Isolated by using array-backed lists. ArrayList, FastList, and ArrayBuffer What if we use Java’s HashSet, Scala’s HashSet, and GS Collections’ UnifiedSet?

80 Parallel Count ops/s (higher is better) Measured on an 8 core Linux VM Intel Xeon E5-2697 v2 8x Lists: FastList | ArrayList | ArrayBuffer

81 Parallel Count ops/s (higher is better) Sets: UnifiedSet | HashSet (Java’s) | HashSet (Scala’s) 8x

82 Parallel / Lazy / GSC
MutableSet<Integer> set = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toSet();
Verify.assertSize(999_800, set);
ParallelIterable.toSet() uses a concurrent set. No combination step. No preserving order.

83 Hand coded parallelism

84 Hand coded Parallel / Lazy
MutableList<Integer> list = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(integer -> integer % 10_000 != 0
        && (Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);

85 Stacked computation ops/s (higher is better) 8x

86 Method inlining

87 Count: SAM method calls Let's take a closer look at both implementations of count(). Let's assume that @FunctionalInterface method calls are costly and count them as we go. We'll revisit this assumption.

88 Count: GS Collections
java.lang.Thread.State: RUNNABLE
    at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
    at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
    at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
    at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
    at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
    at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
Slide annotations: the lower frames are the execution of the lazy evaluation; the top frames are executed once per element, and that is where we'll look for @FunctionalInterface method calls.

89 Count: GS Collections Grand total of 2 @FunctionalInterface method calls

90 Count: Java 8
java.lang.Thread.State: RUNNABLE
    at java.lang.Long.sum(Long.java:1587)
    at java.util.stream.LongPipeline$$Lambda$3.887750041.applyAsLong(Unknown Source:-1)
    at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
    at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
    at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
    at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
    at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
Slide annotations: the lower frames are the execution of the pipeline; the top frames are executed once per element, and that is where we'll look for @FunctionalInterface method calls.

91 Count: Java 8 Grand total of 6 @FunctionalInterface method calls

92 Count: Scala Scala implementation is similar to GS Collections Grand total of 2 @FunctionalInterface method calls

93 @FunctionalInterface method calls Why do we care about @FunctionalInterface method calls? The JIT compiler inlines short method bodies like our Predicates. The exact nature of the inlining has a dramatic impact on performance.

94 @FunctionalInterface method calls
JMH forks a new JVM for each test. During both stages of JIT compilation, this.predicate is our test Predicate, so the JVM will perform monomorphic inlining.
public void value(T object)
{
    if (this.predicate.accept(object))
    {
        this.count++;
    }
}
Predicate from the test: each -> each % 2 == 0

95 @FunctionalInterface method calls The dispatch algorithm in pseudo code:
if (this.predicate instanceof lambda$serial_lazy_gsc$1)
{
    if (object % 2 == 0)
    {
        this.count++;
    }
}
else
{
    [recompile]
    if (this.predicate.accept(object))
    {
        this.count++;
    }
}

96 @FunctionalInterface method calls The next recompilation will result in bimorphic inlining. The recompilation after that will result in megamorphic method dispatch: a classic table lookup and jump, in other words no inlining. That is a dramatic performance penalty for fast methods like count().
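If you want to watch these decisions happen in your own runs, HotSpot can print them; a hedged sketch, assuming the JMH version in use lets @Fork append JVM arguments (otherwise pass the same flags on the command line):

// The diagnostic flags print each inlining decision ("inline (hot)", "callee is too large", ...).
@Fork(value = 5, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintInlining"})
public class CountTest
{
    // benchmark methods as before
}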

97 Megamorphic method dispatch How do we trigger megamorphic deoptimization?
@Setup(Level.Trial)
public void setUp_megamorphic()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);

    long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count();
    Assert.assertEquals(SIZE / 2, odds);

    long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count();
    Assert.assertEquals(SIZE / 2, evens2);
}
This is something that JMH does not handle for you!

98 Megamorphic Count ops/s (higher is better) 8x

99 Megamorphic method dispatch Why force megamorphic deoptimization? Some implementations will have extra virtual method calls (@FunctionalInterface method calls). Microbenchmarks aren't realistic, but which is more realistic (less unrealistic)? You will trigger this deoptimization in normal production code, as soon as there is more than one call to this API anywhere in the executed code.

