Monday, 14 November 2016

[Java 8 / Parallel stream / Stream] Should I always use parallel stream instead of stream ?

Streams are probably one of the most commonly used feature of Java 8. At first people discover forEach() method, then map() and filter() and so on. Some of them starts reading about functional programming but from my experience I'd say that in general people still think that stream is just an improved looping structure.

Then comes this exciting moment when they realize that it all can be much, much faster because there's also parallelStream. And then problems come...

Very often when I'm waiting for something I look into the code and try to fix some crappy parts. This one I've found yesterday:
resource.setRegions(product.getRegions().parallelStream().map(Region::getName).collect(toList()));
Our database has something like ten regions. Let's see how long it takes to collect such items using stream and parallelStream.
public static void main(String[] args) {
    final List<Region> regions = IntStream.range(0, 10)
                        .mapToObj(i -> new Regioan("region:" + i))
                        .collect(toList());

    useLabel("stream()").andLogPerformanceOf(() -> regions.stream()
                                                         .map(Region::getName)
                                                         .collect(toList()));
    useLabel("parallelStream()").andLogPerformanceOf(() -> regions.parallelStream()
                                                              .map(Region::getName)
                                                              .collect(toList()));
}
useLabel(...).andLogPerformanceOf(...) is just a simple wrapper that runs a piece of code and logs time taken (I'll paste it at the end of the article). First run shows:
stream() started
stream() completed. Time elapsed = 1 millis
parallelStream() started
parallelStream() completed. Time elapsed = 9 millis
And some more results:
10 elements
Stream
Parallel stream
4
10
2
3
2
14
2
18
1
6
3
17
2
8
5
7
2
14
1
8
As you can see in all cases stream() is faster than parallelStream(). Parallel stream has much higher overhead compared to stream which uses single thread. When you want to split collection's computation you need to divide the input so that the threads compute similar amount of data, run the threads, collect results and so on.

Let's make the input list bigger.
100 elements
Stream
Parallel stream
2
12
2
11
1
5
2
7
2
8
1
6
3
6
2
6
4
9
6
18
Parallel stream is still slower.
1000 elements
Stream
Parallel stream
9
14
2
20
2
7
9
23
3
9
2
20
3
6
2
5
3
9
3
5
Still slower.
10 000 elements
Stream
Parallel stream
8
7
6
23
12
9
5
10
6
9
16
19
7
9
14
14
11
22
20
18
For 10k elements the results are similar.
1 000 000 elements
Stream
Parallel stream
1423
 65
 1715  91
 1244  63
 1345  68
 1458 91
 1479  65
 1415  48
 1584 87
 1425  61
 1506 73
Having list that contains 1M elements parallel stream is way faster but how often do you work with such big collections ?

Let's get back to the main question: Should I always use parallel stream instead of stream ?

Definitely not.

You should consider parallel version:
  • when you work with huge collections
  • when computation of single element takes much time
I suppose that each case should be considered separately. Performance stronlgy depends on operations you perform so in my opinion trying to define some kind of conditions when parallel stream should be used simply doesn't make sense.

You've seen example that transforms huge collection. You can find another one which shows processing collection for which computing single element takes much time in my post here: How to control pool size while using parallel stream.

You should also remember that if you want to make your code parallel IT HAS TO BE immutable. I stronly recommend reading about functional programming principles.

That's all. I've promised to paste the tool that logs performance so here you are:
/**
 * @author Grzegorz Taramina
 *         Created on: 13/06/16
 */
public class PerformanceLoggingBlock implements Logging {
    private final String label;

    public static PerformanceLoggingBlock useLabel(final String label) {
        return new PerformanceLoggingBlock(label);
    }

    private PerformanceLoggingBlock(final String label) {
        this.label = label;
    }

    public void andLogPerformanceOf(final Runnable runnable) {
        perfLog().info(label + " started");
        Stopwatch stopwatch = Stopwatch.createStarted();
        runnable.run();
        perfLog().info(label + " completed. Time elapsed = " + stopwatch.elapsed(MILLISECONDS) + " millis");
    }

    public <T> T andLogPerformanceOf(final Supplier<T> supplier) {
        System.out.println(label + " started");
        Stopwatch stopwatch = Stopwatch.createStarted();
        T result = supplier.get();
        System.out.println(label + " completed. Time elapsed = " + stopwatch.elapsed(MILLISECONDS) + " millis");
        return result;
    }
}

No comments:

Post a Comment