
1 This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
Parallel-lazy performance: Java 8 vs Scala vs GS Collections
Craig Motlin, June 2014

2 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

3 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid Lots of claims and opinions

4 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid Lots of evidence

5 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

6 Intro Solve the same problem in all three libraries – Java (1.8.0_05) – GS Collections (5.1.0) – Scala (2.11.0) Count how many even numbers are in a list of numbers Then accomplish the same thing in parallel – Data-level parallelism – Batch the data – Use all the cores

7 Performance Factors Tests that isolate individual performance factors Count Filter, Transform, Transform, Filter, convert to List Aggregation – Market value stats aggregated by product or category

8 Count: Serial
Java: long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
GS Collections: int evens = fastList.count(each -> each % 2 == 0);
Scala: val evens = arrayBuffer.count(_ % 2 == 0)

9 Count: Serial Lazy
Java: long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
GS Collections: int evens = fastList.asLazy().count(each -> each % 2 == 0);
Scala: val evens = arrayBuffer.view.count(_ % 2 == 0)

10 Count: Parallel Lazy
Java: long evens = arrayList.parallelStream().filter(each -> each % 2 == 0).count();
GS Collections: int evens = fastList.asParallel(executorService, BATCH_SIZE).count(each -> each % 2 == 0);
Scala: val evens = arrayBuffer.par.count(_ % 2 == 0)

11 Parallel Lazy
[Diagram: 1M source elements are split into batches (1-10k, 10k-20k, 30k-40k, ..., 990k-1M); each batch is filtered and counted (about 5k evens per batch), and the per-batch counts are reduced to the total of 500k.]
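A minimal hand-rolled sketch of this batch-and-reduce shape, using a plain ExecutorService; the class, method, and parameter names are illustrative, not any of the three libraries' internals:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class BatchedCountSketch {
    // Batch the data, count evens per batch on the pool, then reduce the per-batch counts.
    static long countEvens(List<Integer> numbers, ExecutorService executorService, int batchSize)
            throws InterruptedException, ExecutionException {
        List<Future<Long>> futures = new ArrayList<>();
        for (int start = 0; start < numbers.size(); start += batchSize) {
            List<Integer> batch =
                    numbers.subList(start, Math.min(start + batchSize, numbers.size()));
            futures.add(executorService.submit(
                    () -> batch.stream().filter(each -> each % 2 == 0).count()));
        }
        long total = 0;
        for (Future<Long> future : futures) {
            total += future.get();   // the reduce step: sum the per-batch counts
        }
        return total;
    }
}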

12 Parallel Eager
[Diagram: the same 1M elements are batched (1-10k, 10k-20k, ..., 990k-1M); each batch is first filtered into an intermediate collection of evens (2, 4, 6, 8, ... per batch), then counted (about 5k per batch), and the counts are reduced to 500k.]

13 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid Time for some numbers!

14 Serial Count ops/s (higher is better)

15 Parallel Count ops/s (higher is better) Measured on an 8 core Linux VM Intel Xeon E5-2697 v2 8x

16 Java Microbenchmark Harness “JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.” 5 forked JVMs per test 100 warmup iterations per JVM 50 measurement iterations per JVM 1 second of looping per iteration http://openjdk.java.net/projects/code-tools/jmh/
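A hedged sketch of one way to express that configuration with JMH annotations, assuming they are available in the JMH version the tests use; the actual benchmarks may set forks and iterations on the command line instead (@GenerateMicroBenchmark was later renamed @Benchmark):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@Fork(5)                                                                // 5 forked JVMs per test
@Warmup(iterations = 100, time = 1, timeUnit = TimeUnit.SECONDS)        // 100 warmup iterations, 1 second of looping each
@Measurement(iterations = 50, time = 1, timeUnit = TimeUnit.SECONDS)    // 50 measurement iterations per JVM
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class CountTest
{
    // @GenerateMicroBenchmark methods go here
}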

17 Java Microbenchmark Harness
@GenerateMicroBenchmark
public void parallel_lazy_jdk()
{
    long evens = this.integersJDK.parallelStream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}

18 Java Microbenchmark Harness @Setup includes megamorphic warmup. More info on megamorphic dispatch is in the appendix. This is something that JMH does not handle for you!

19 Java Microbenchmark Harness Throughput: higher is better. Enough warmup iterations so that standard deviation is low.
Benchmark                        Mode   Samples  Mean     Mean error  Units
CountTest.parallel_eager_gsc     thrpt  250      629.961  8.305       ops/s
CountTest.parallel_lazy_gsc      thrpt  250      595.023  7.153       ops/s
CountTest.parallel_lazy_jdk      thrpt  250      415.382  7.766       ops/s
CountTest.parallel_lazy_scala    thrpt  250      331.938  2.141       ops/s
CountTest.serial_eager_gsc       thrpt  250      115.197  0.328       ops/s
CountTest.serial_eager_scala     thrpt  250       91.167  0.864       ops/s
CountTest.serial_lazy_gsc        thrpt  250       73.625  3.619       ops/s
CountTest.serial_lazy_jdk        thrpt  250       58.182  0.477       ops/s
CountTest.serial_lazy_scala      thrpt  250       84.200  1.033       ops/s
...

20 Performance Factors Tests that isolate individual performance factors Count Filter, Transform, Transform, Filter, convert to List Aggregation – Market value stats aggregated by product or category

21 Java Microbenchmark Harness Performance tests are open-sourced. Read them and run them on your hardware: https://github.com/goldmansachs/gs-collections/

22 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns

23 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Isolated by using array-backed lists. ArrayList, FastList, and ArrayBuffer Isolated because combination of intermediate results is simple addition. Let’s look at reasons for the differences in count()

24 Count: Java 8

25 Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}

26 Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}
filter(Predicate).count() instead of count(Predicate)

27 Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
}
filter(Predicate).count() instead of count(Predicate). Is count() just incrementing a counter?

28 Count: Java 8 implementation
public final long count()
{
    return mapToLong(e -> 1L).sum();
}

public final long sum()
{
    return reduce(0, Long::sum);
}

/** @since 1.8 */
public static long sum(long a, long b)
{
    return a + b;
}

29 Count: Java 8 implementation
this.integersJDK.stream().filter(each -> each % 2 == 0).mapToLong(e -> 1L).reduce(0, Long::sum);
this.integersGSC.asLazy().count(each -> each % 2 == 0);

30 Count: Java 8 implementation
this.integersJDK.stream().filter(each -> each % 2 == 0).mapToLong(e -> 1L).reduce(0, Long::sum);
this.integersGSC.asLazy().count(each -> each % 2 == 0);
Seems like extra work
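For comparison, a baseline sketch: counting with a plain loop over the same ArrayList<Integer>, which is roughly the per-element work the GS Collections version on the next slides boils down to (the class and method names here are illustrative):

import java.util.List;

final class PlainCountSketch {
    static long countEvens(List<Integer> integers) {
        long evens = 0;
        for (Integer each : integers) {     // auto-unboxes once per element
            if (each % 2 == 0) {
                evens++;
            }
        }
        return evens;
    }
}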

31 Count: GS Collections

32 @GenerateMicroBenchmark
public void serial_lazy_gsc()
{
    int evens = this.integersGSC.asLazy().count(each -> each % 2 == 0);
    Assert.assertEquals(SIZE / 2, evens);
}

33 Count: GS Collections AbstractLazyIterable.java
public int count(Predicate<? super T> predicate)
{
    CountProcedure<T> procedure = new CountProcedure<T>(predicate);
    this.forEach(procedure);
    return procedure.getCount();
}

34 Count: GS Collections FastList.java
public void forEach(Procedure<? super T> procedure)
{
    for (int i = 0; i < this.size; i++)
    {
        procedure.value(this.items[i]);
    }
}

35 Count: GS Collections
public class CountProcedure<T> implements Procedure<T>
{
    private final Predicate<? super T> predicate;
    private int count;
    ...
    public void value(T object)
    {
        if (this.predicate.accept(object))
        {
            this.count++;
        }
    }

    public int getCount()
    {
        return this.count;
    }
}

36 Count: GS Collections
public class CountProcedure<T> implements Procedure<T>
{
    private final Predicate<? super T> predicate;
    private int count;
    ...
    public void value(T object)
    {
        if (this.predicate.accept(object))
        {
            this.count++;
        }
    }

    public int getCount()
    {
        return this.count;
    }
}
Predicate from the test: each -> each % 2 == 0

37 Count: Scala

38 Count: Scala implementation TraversableOnce.scala
def count(p: A => Boolean): Int = {
  var cnt = 0
  for (x <- this)
    if (p(x)) cnt += 1
  cnt
}

39 Count: Scala implementation TraversableOnce.scala
def count(p: A => Boolean): Int = {
  var cnt = 0
  for (x <- this)
    if (p(x)) cnt += 1
  cnt
}
The for-comprehension becomes a call to foreach(). The lambda closes over cnt: it executes the predicate and increments cnt, just like CountProcedure.

40 Count: Scala implementation
public final java.lang.Object apply(java.lang.Object);
   0: aload_0
   1: aload_1
   2: invokestatic  #32  // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
   5: invokevirtual #34  // Method apply:(I)Z
   8: invokestatic  #38  // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
  11: areturn

public boolean apply$mcZI$sp(int);
   0: iload_1
   1: iconst_2
   2: irem
   3: iconst_0
   4: if_icmpne 11
   7: iconst_1
   8: goto 12
  11: iconst_0
  12: ireturn

public final boolean apply(int);
   0: aload_0
   1: iload_1
   2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
   5: ireturn

41 Count: Scala implementation
[Diagram: per element, the boxed Integer is unboxed to int via Integer.intValue(); the lambda _ % 2 == 0 runs on the primitive (bytecode: irem), producing a boolean; the boolean is boxed back to Boolean via Boolean.valueOf(boolean).]
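Written out as plain Java, the boxed bridge call in the bytecode above does roughly this per element; an illustration of the extra boxing steps, not Scala library code:

// Illustration only: what the generated function class's apply(Object) bridge effectively does.
final class EvenPredicateBridge {
    Object apply(Object boxedElement) {
        int unboxed = ((Integer) boxedElement).intValue();  // Integer -> int  (unboxToInt)
        boolean result = unboxed % 2 == 0;                  // the lambda body (bytecode: irem)
        return Boolean.valueOf(result);                     // boolean -> Boolean (boxToBoolean)
    }
}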

42 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Scala’s auto-boxing Java’s pull lazy evaluation

43 Performance Factors Tests that isolate individual performance factors Count Filter, Transform, Transform, Filter, convert to List Aggregation – Market value stats aggregated by product or category

44 Parallel / Lazy / JDK
List<Integer> list = this.integersJDK.parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);

45 Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);

46 Parallel / Lazy / Scala
val list = this.integers.par
  .filter(each => each % 10000 != 0)
  .map(String.valueOf)
  .map(Integer.valueOf)
  .filter(each => (each + 1) % 10000 != 0)
  .toBuffer
Assert.assertEquals(999800, list.size)

47 Stacked computation ops/s (higher is better) 8x

48 Parallel / Lazy / JDK
List<Integer> list = this.integersJDK.parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);

49 Parallel / Lazy / JDK
List<Integer> list = this.integersJDK.parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);
Collectors.toList() is built from: ArrayList::new, List::add, and (left, right) -> { left.addAll(right); return left; }
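Spelled out, those three pieces form a collector roughly equivalent to Collectors.toList(); an illustrative reconstruction, not the JDK source:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

final class ToListSketch {
    static final Collector<Integer, List<Integer>, List<Integer>> TO_LIST =
            Collector.of(
                    ArrayList::new,     // supplier: a new list for each leaf task
                    List::add,          // accumulator: add one element
                    (left, right) -> { left.addAll(right); return left; });  // combiner: copy right into left at every join
}

The combiner is where the cost lives: every join copies the right-hand partial result into the left, which is what the next slides quantify.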

50 Fork-Join Merge Intermediate results are merged in a tree Merging is O(n log n) work and garbage

51 Fork-Join Merge Amount of work done by last thread is O(n)
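To make the merge cost concrete, a minimal sketch (not the JDK's actual fork-join code) of tree-merging per-batch result lists with the addAll() combiner from slide 49:

import java.util.ArrayList;
import java.util.List;

final class MergeTreeSketch {
    // Pairwise merging of per-batch lists. Every pass copies elements via addAll(),
    // producing garbage, and the final pass moves half of all the data in one thread.
    static List<Integer> mergeTree(List<List<Integer>> perBatchResults) {
        if (perBatchResults.isEmpty()) {
            return new ArrayList<>();
        }
        List<List<Integer>> level = new ArrayList<>(perBatchResults);
        while (level.size() > 1) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                List<Integer> left = level.get(i);
                if (i + 1 < level.size()) {
                    left.addAll(level.get(i + 1));  // the combiner step: copying at every join
                }
                next.add(left);
            }
            level = next;
        }
        return level.get(0);
    }
}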

52 Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);
ParallelIterable.toList() returns a CompositeFastList, a List with an O(1) implementation of addAll()

53 Parallel / Lazy / GSC
public final class CompositeFastList<E>
{
    private final FastList<FastList<E>> lists = FastList.newList();

    public boolean addAll(Collection<? extends E> collection)
    {
        FastList<E> collectionToAdd = collection instanceof FastList
            ? (FastList<E>) collection
            : new FastList<E>(collection);
        this.lists.add(collectionToAdd);
        return true;
    }
    ...
}

54 CompositeFastList Merge Merging is O(1) work per batch. [Diagram: each batch's result list is simply appended to a single CompositeFastList (CFL).]

55 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Fork-join is general purpose but requires merge work Specialized data structures meant for combining

56 Thread Pools

57 Parallel: GSC.asParallel(this.executorService, BATCH_SIZE) You must specify your own batch size – 10,000 is fine – size / (8 * #cores) is fine You must specify your own thread pool – Can share, or not – Can tailor for CPU-bound Executors.newFixedThreadPool( Runtime.getRuntime().availableProcessors()) – Or IO-Bound Executors.newFixedThreadPool(maxDbConnections)
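A sketch of the setup described above; the thread pool and batch size are whatever the caller chooses, the class and method here are illustrative, and executor shutdown is omitted:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.gs.collections.impl.list.mutable.FastList;

final class AsParallelSetupSketch {
    static int countEvens(FastList<Integer> integersGSC) {
        // CPU-bound work: one thread per core, as the slide suggests.
        ExecutorService executorService =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        // Either 10,000, or scale the batch size to the data and the machine.
        int batchSize = Math.max(1,
                integersGSC.size() / (8 * Runtime.getRuntime().availableProcessors()));

        return integersGSC.asParallel(executorService, batchSize)
                .count(each -> each % 2 == 0);
    }
}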

58 Parallel: Scala One shared fork-join pool, configurable Batch sizes are dynamic and respond to work stealing Minimum batch size: 1 + size / (8 * #cores)

59 Parallel: Java 8 One shared fork-join pool, not configurable Batch sizes are dynamic and respond to work stealing Minimum batch size: – max(1, size / (4 * (#cores - 1))) – Default pool also has #cores – 1 threads, plus main thread helps – Can be changed with system property java.util.concurrent.ForkJoinPool.common.parallelism
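For example, a small sketch of adjusting and checking the common pool's parallelism; the property has to be set before the common pool is first touched, since it is read when the pool initializes:

import java.util.concurrent.ForkJoinPool;

final class CommonPoolSketch {
    public static void main(String[] args) {
        // Equivalent to -Djava.util.concurrent.ForkJoinPool.common.parallelism=7 on the command line.
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "7");

        // Verify what the common pool actually got.
        System.out.println(ForkJoinPool.commonPool().getParallelism());
    }
}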

60 Aggregation

61 Aggregation Domain

62 Aggregate by Categories 8x

63 Aggregate by Accounts 8x

64 Aggregate by Category Streams Map categoryDoubleMap = this.jdkPositions.parallelStream().collect( Collectors.groupingBy( Position::getCategory, Collectors.summarizingDouble(Position::getMarketValue)));

65 Aggregate by Category GSC MapIterable categoryDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE).aggregateInPlaceBy( Position::getCategory, MarketValueStatistics::new, MarketValueStatistics::acceptThis);

66 Aggregate by Category GSC MapIterable categoryDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE).aggregateInPlaceBy( Position::getCategory, MarketValueStatistics::new, MarketValueStatistics::acceptThis); What if we group by Account instead?

67 Aggregate by Account GSC MapIterable accountDoubleMap = this.gscPositions.asParallel(this.executorService, BATCH_SIZE).aggregateInPlaceBy( Position::getAccount, MarketValueStatistics::new, MarketValueStatistics::acceptThis); What if we group by Account instead?

68 Collapse factor
MapIterable categoryDoubleMap: there are 26 categories, so the map has 26 keys.
MapIterable accountDoubleMap: there are 100k accounts, so the map has 100k keys.

69 Collapse factor Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys

70 Collapse factor Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys See Mohammad Rezaei’s presentation from QCon 2012 called “Fine Grained Coordinated Parallelism in a Real World Application.”
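For illustration only, here is the shape of the single-shared-map approach using the JDK's ConcurrentHashMap rather than GS Collections' internal ConcurrentHashMapUnsafe; Position and MarketValueStatistics are the deck's domain types, and the assumption that getCategory() returns a String is mine:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

final class SharedMapAggregationSketch {
    // Every batch writes into the same shared map; compute() updates one key atomically,
    // so there is no merge step, but threads contend when there are only a few keys.
    static void aggregateBatch(List<Position> batch,
            ConcurrentHashMap<String, MarketValueStatistics> results) {
        for (Position position : batch) {
            results.compute(position.getCategory(), (category, stats) -> {
                MarketValueStatistics updated = stats == null ? new MarketValueStatistics() : stats;
                updated.acceptThis(position);   // mutating aggregation, as in aggregateInPlaceBy
                return updated;
            });
        }
    }
}

With 26 categories every batch hits the same 26 keys, which is the contention cost described above; with 100k accounts the writes spread out and the shared map avoids the merge work entirely.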

71 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Test groupBy, aggregateBy

72 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

73 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

74 Goals Compare Java Streams, Scala parallel Collections, and GS Collections Convince you to use GS Collections Convince you to do your own performance testing Identify when to avoid parallel APIs Identify performance pitfalls to avoid

75 Q&A

76 http://github.com/goldmansachs/gs-collections http://github.com/goldmansachs/gs-collections-kata @GoldmanSachs http://stackoverflow.com/questions/tagged/gs-collections craig.motlin@gs.com Info in appendix Sets Handcoded parallelism Megamorphic warmup

77 Appendix

78 Hashtable Sets

79 Performance Factors Factors that may affect performance Underlying container implementation Combine strategy Fork-join vs batching (and batch size) Push vs pull lazy evaluation Collapse factor Unknown unknowns Isolated by using array-backed lists. ArrayList, FastList, and ArrayBuffer What if we use Java’s HashSet, Scala’s HashSet, and GS Collections’ UnifiedSet?

80 Parallel Count ops/s (higher is better) Measured on an 8 core Linux VM Intel Xeon E5-2697 v2 8x Lists: FastList | ArrayList | ArrayBuffer

81 Parallel Count ops/s (higher is better) Sets: UnifiedSet | HashSet (Java’s) | HashSet (Scala’s) 8x

82 Parallel / Lazy / GSC
MutableSet<Integer> set = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toSet();
Verify.assertSize(999_800, set);
ParallelIterable.toSet() uses a concurrent set. No combination step. No preserving order.

83 Hand coded parallelism

84 Hand coded Parallel / Lazy
MutableList<Integer> list = this.integersGSC.asParallel(this.executorService, BATCH_SIZE)
    .select(integer -> integer % 10_000 != 0
        && (Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);

85 Stacked computation ops/s (higher is better) 8x

86 Method inlining

87 Count: SAM method calls Let's take a closer look at both implementations of count(). Let's assume that @FunctionalInterface method calls are costly and count them as we go. We'll revisit this assumption.

88 Count: GS Collections
java.lang.Thread.State: RUNNABLE
    at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
    at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
    at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
    at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
    at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
    at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
Slide annotations: the lower frames are the execution of the lazy evaluation; the top frames are executed once per element, and that is where we'll look for @FunctionalInterface method calls.

89 Count: GS Collections Grand total of 2 @FunctionalInterface method calls

90 Count: Java 8
java.lang.Thread.State: RUNNABLE
    at java.lang.Long.sum(Long.java:1587)
    at java.util.stream.LongPipeline$$Lambda$3.887750041.applyAsLong(Unknown Source:-1)
    at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
    at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
    at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
    at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
    at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
Slide annotations: the lower frames are the execution of the pipeline; the top frames are executed once per element, and that is where we'll look for @FunctionalInterface method calls.

91 Count: Java 8 Grand total of 6 @FunctionalInterface method calls

92 Count: Scala Scala implementation is similar to GS Collections Grand total of 2 @FunctionalInterface method calls

93 @FunctionalInterface method calls Why do we care about @FunctionalInterface method calls? The JIT compiler inlines short method bodies like our Predicates. The exact nature of the inlining has a dramatic impact on performance.

94 @FunctionalInterface method calls
JMH forks a new JVM for each test. During both stages of JIT compilation, this.predicate is our test Predicate, so the JVM will perform monomorphic inlining.
public void value(T object)
{
    if (this.predicate.accept(object))
    {
        this.count++;
    }
}
Predicate from the test: each -> each % 2 == 0

95 @FunctionalInterface method calls The dispatch algorithm in pseudo code:
if (this.predicate instanceof lambda$serial_lazy_gsc$1)
{
    if (object % 2 == 0)
    {
        this.count++;
    }
}
else
{
    [recompile]
    if (this.predicate.accept(object))
    {
        this.count++;
    }
}

96 @FunctionalInterface method calls The next recompilation will result in bimorphic inlining. The recompilation after that will result in megamorphic method dispatch: a classic table lookup and jump, in other words no inlining. That is a dramatic performance penalty for fast methods like count().
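If you want to watch these decisions happen in your own runs, HotSpot can print them; a hedged sketch, assuming the JMH version in use lets @Fork append JVM arguments (otherwise pass the same flags on the command line):

// The diagnostic flags print each inlining decision ("inline (hot)", "callee is too large", ...).
@Fork(value = 5, jvmArgsAppend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintInlining"})
public class CountTest
{
    // benchmark methods as before
}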

97 Megamorphic method dispatch How do we trigger megamorphic deoptimization?
@Setup(Level.Trial)
public void setUp_megamorphic()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);

    long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count();
    Assert.assertEquals(SIZE / 2, odds);

    long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count();
    Assert.assertEquals(SIZE / 2, evens2);
}
This is something that JMH does not handle for you!

98 Megamorphic Count ops/s (higher is better) 8x

99 Megamorphic method dispatch Why force megamorphic deoptimization? Some implementations will have extra virtual method calls (@FunctionalInterface method calls). Microbenchmarks aren't realistic, but which is more realistic (less unrealistic)? You will trigger this deoptimization in normal production code, as soon as there is more than one call to this API anywhere in the executed code.

