Monday 14 November 2016

[Java 8 / Parallel stream / Stream] Should I always use a parallel stream instead of a stream?

Streams are probably one of the most commonly used features of Java 8. At first people discover the forEach() method, then map(), filter() and so on. Some of them start reading about functional programming, but from my experience I'd say that in general people still think of a stream as just an improved looping construct.

Then comes the exciting moment when they realize that it can all be much, much faster because there's also parallelStream(). And then the problems come...

Very often, when I'm waiting for something, I look into the code and try to fix some crappy parts. I found this one yesterday:
resource.setRegions(product.getRegions().parallelStream().map(Region::getName).collect(toList()));
Our database has something like ten regions. Let's see how long it takes to collect that many items using stream() and parallelStream().
public static void main(String[] args) {
    final List<Region> regions = IntStream.range(0, 10)
                        .mapToObj(i -> new Region("region:" + i))
                        .collect(toList());

    useLabel("stream()").andLogPerformanceOf(() -> regions.stream()
                                                         .map(Region::getName)
                                                         .collect(toList()));
    useLabel("parallelStream()").andLogPerformanceOf(() -> regions.parallelStream()
                                                              .map(Region::getName)
                                                              .collect(toList()));
}
useLabel(...).andLogPerformanceOf(...) is just a simple wrapper that runs a piece of code and logs the time taken (I'll paste it at the end of the article). The first run shows:
stream() started
stream() completed. Time elapsed = 1 millis
parallelStream() started
parallelStream() completed. Time elapsed = 9 millis
And some more results (times in milliseconds):
10 elements
Stream    Parallel stream
4         10
2         3
2         14
2         18
1         6
3         17
2         8
5         7
2         14
1         8
As you can see, in all cases stream() is faster than parallelStream(). A parallel stream has much higher overhead than a sequential stream, which runs on a single thread. To split a collection's computation you need to divide the input so that the threads process similar amounts of data, run the threads, combine the results and so on.
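To make the "run the threads" part visible, here is a minimal sketch that records which threads execute the mapping function. ThreadUsageDemo and threadsUsedBy are hypothetical names made up for this illustration, and the concurrent set is used only to observe the threads; it is not something you would keep in production code.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class ThreadUsageDemo {

    // Runs a trivial map over the given stream and returns the names
    // of all threads that executed the mapping function.
    static Set<String> threadsUsedBy(Stream<Integer> stream) {
        // Thread-safe set, because a parallel stream calls the lambda from several threads.
        Set<String> names = ConcurrentHashMap.newKeySet();
        stream.map(i -> {
            names.add(Thread.currentThread().getName());
            return i * 2;
        }).collect(Collectors.toList());
        return names;
    }

    public static void main(String[] args) {
        List<Integer> numbers = IntStream.range(0, 10_000)
                .boxed()
                .collect(Collectors.toList());

        // A sequential stream does all the work on the calling thread.
        System.out.println("stream():         " + threadsUsedBy(numbers.stream()));
        // A parallel stream may spread the work over the calling thread
        // and worker threads of the common ForkJoinPool.
        System.out.println("parallelStream(): " + threadsUsedBy(numbers.parallelStream()));
    }
}
```

The sequential call reports only the calling thread. What the parallel call reports depends on the machine and on how the work happens to be split, so treat its output as an observation rather than a guarantee.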

Let's make the input list bigger.
100 elements
Stream    Parallel stream
2         12
2         11
1         5
2         7
2         8
1         6
3         6
2         6
4         9
6         18
Parallel stream is still slower.
1000 elements
Stream    Parallel stream
9         14
2         20
2         7
9         23
3         9
2         20
3         6
2         5
3         9
3         5
Still slower.
10 000 elements
Stream    Parallel stream
8         7
6         23
12        9
5         10
6         9
16        19
7         9
14        14
11        22
20        18
For 10k elements the results are similar.
1 000 000 elements
Stream    Parallel stream
1423      65
1715      91
1244      63
1345      68
1458      91
1479      65
1415      48
1584      87
1425      61
1506      73
With a list of 1M elements the parallel stream is way faster, but how often do you work with such big collections?
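The timings in this post appear to come from single runs, so they probably also include one-off costs such as class loading and JIT compilation. If you want to repeat the experiment, a rough sketch like the one below runs each pipeline a number of times before measuring and then averages the timed runs. WarmedUpTiming and averageMicros are hypothetical names, and the run counts are arbitrary assumptions; for serious measurements a dedicated benchmarking harness such as JMH is a better choice.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class WarmedUpTiming {

    // Runs the task warmUpRuns times without timing it, then timedRuns times
    // with timing, and returns the average duration of a timed run in microseconds.
    static <T> long averageMicros(Supplier<T> task, int warmUpRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        for (int i = 0; i < warmUpRuns; i++) {
            task.get();
        }
        long totalNanos = 0;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.get();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / timedRuns / 1_000;
    }

    public static void main(String[] args) {
        // Ten short strings stand in for the ten region names.
        List<String> names = IntStream.range(0, 10)
                .mapToObj(i -> "region:" + i)
                .collect(Collectors.toList());

        long sequential = averageMicros(
                () -> names.stream().map(String::toUpperCase).collect(Collectors.toList()),
                1_000, 1_000);
        long parallel = averageMicros(
                () -> names.parallelStream().map(String::toUpperCase).collect(Collectors.toList()),
                1_000, 1_000);

        System.out.println("stream():         " + sequential + " us on average");
        System.out.println("parallelStream(): " + parallel + " us on average");
    }
}
```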

Let's get back to the main question: should I always use a parallel stream instead of a stream?

Definitely not.

You should consider the parallel version:
  • when you work with huge collections
  • when the computation for a single element takes a long time
I suppose that each case should be considered separately. Performance strongly depends on the operations you perform, so in my opinion trying to define universal conditions for when a parallel stream should be used simply doesn't make sense.
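The second bullet can be sketched with an artificially slow operation. In the example below slowName is a hypothetical stand-in for something expensive (a remote call, a heavy computation); the 100 ms sleep and the 8 elements are assumptions chosen only to make the difference easy to see.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SlowElementDemo {

    // Hypothetical expensive operation: pretends to work for 100 ms per element.
    static String slowName(int id) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "region:" + id;
    }

    // Runs the task once and returns the elapsed time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<Integer> ids = IntStream.range(0, 8).boxed().collect(Collectors.toList());

        // Sequential: the eight sleeps happen one after another.
        long sequential = timeMillis(() -> ids.stream()
                .map(SlowElementDemo::slowName)
                .collect(Collectors.toList()));

        // Parallel: the sleeps can overlap, depending on how many threads are available.
        long parallel = timeMillis(() -> ids.parallelStream()
                .map(SlowElementDemo::slowName)
                .collect(Collectors.toList()));

        System.out.println("stream():         " + sequential + " ms");
        System.out.println("parallelStream(): " + parallel + " ms");
    }
}
```

The sequential version needs at least 8 × 100 ms. How much the parallel version gains depends on the number of cores and on the size of the common pool, so the actual numbers will differ from machine to machine.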

You've seen an example that transforms a huge collection. You can find another one, which processes a collection where computing a single element takes a long time, in my post here: How to control pool size while using parallel stream.

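You should also remember that if you want to make your code parallel, the data it operates on HAS TO BE immutable, or at least the functions you pass to the stream must not modify state shared between threads. I strongly recommend reading about functional programming principles.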
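A small sketch of what goes wrong with shared mutable state. SharedStateDemo and its two methods are hypothetical names for this illustration: the first version lets several threads add to a plain ArrayList, the second lets the stream build the result itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SharedStateDemo {

    // Unsafe: several threads call add() on a plain ArrayList, which is not thread-safe.
    // Elements can be lost, or an ArrayIndexOutOfBoundsException can be thrown.
    static List<Integer> doubledUnsafe(int n) {
        List<Integer> result = new ArrayList<>();
        IntStream.range(0, n).parallel().forEach(i -> result.add(i * 2));
        return result;
    }

    // Safe: the lambda touches no shared state; collect() assembles the list.
    static List<Integer> doubledSafe(int n) {
        return IntStream.range(0, n).parallel()
                .map(i -> i * 2)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("safe size:   " + doubledSafe(n).size());
        try {
            // May print fewer than n elements, or fail, depending on thread timing.
            System.out.println("unsafe size: " + doubledUnsafe(n).size());
        } catch (RuntimeException e) {
            System.out.println("unsafe version failed: " + e);
        }
    }
}
```

The safe version always returns n elements in the original order. The unsafe one may work on one run and break on the next, which is exactly what makes this kind of bug hard to spot.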

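That's all. I promised to paste the tool that logs performance, so here you are:
import static java.util.concurrent.TimeUnit.MILLISECONDS;

import java.util.function.Supplier;

import com.google.common.base.Stopwatch; // Guava's Stopwatch

/**
 * Runs a piece of code and logs how long it took.
 * Logging is a project-specific interface (not shown here) that presumably provides perfLog().
 *
 * @author Grzegorz Taramina
 *         Created on: 13/06/16
 */
public class PerformanceLoggingBlock implements Logging {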
    private final String label;

    public static PerformanceLoggingBlock useLabel(final String label) {
        return new PerformanceLoggingBlock(label);
    }

    private PerformanceLoggingBlock(final String label) {
        this.label = label;
    }

    public void andLogPerformanceOf(final Runnable runnable) {
        perfLog().info(label + " started");
        Stopwatch stopwatch = Stopwatch.createStarted();
        runnable.run();
        perfLog().info(label + " completed. Time elapsed = " + stopwatch.elapsed(MILLISECONDS) + " millis");
    }

    public <T> T andLogPerformanceOf(final Supplier<T> supplier) {
        System.out.println(label + " started");
        Stopwatch stopwatch = Stopwatch.createStarted();
        T result = supplier.get();
        System.out.println(label + " completed. Time elapsed = " + stopwatch.elapsed(MILLISECONDS) + " millis");
        return result;
    }
}

Wednesday 9 November 2016

[Scala / Java / Gradle] How to add Scala to a Java project and use both?

I've been developing Java projects for a couple of years and I remember the day when I could finally use Java 8. I was pretty excited that I could abandon Guava's FluentIterable, the command pattern, anonymous classes and so on, but I soon realized that it's not enough when you want to write concise functional code.

Although Java 8 makes a significant step forward, it's nothing compared to Scala. So far I haven't heard of a company that decided to rewrite a huge Java project in Scala, but fortunately Scala runs on the JVM, so you can use both.

OK, I have a Java 8 + Gradle project. Let's add Scala :)

In build.gradle you need to add the scala plugin:
apply plugin: 'scala'
And the Scala library/compiler dependencies (scalaVersion is a build property holding the Scala version you use):
compile group: 'org.scala-lang', name: 'scala-library', version: scalaVersion
compile group: 'org.scala-lang', name: 'scala-compiler', version: scalaVersion
You should also make sure that .java and .scala files are compiled together, so:
sourceSets.main.scala.srcDir "src/main/java"
sourceSets.test.scala.srcDir "src/test/java"
sourceSets.main.java.srcDirs = []
sourceSets.test.java.srcDirs = []
The project I've been working on has multiple modules, so I've added the sourceSets and dependencies in the subprojects section. Remember to install the Scala plugin in your IDE (in IntelliJ IDEA it's called Scala).

That's all :)

Now when I build the project I get the following output:
➜  cw git:(develop) git status
On branch develop
Your branch is up-to-date with 'origin/develop'.
nothing to commit, working directory clean
➜  cw git:(develop) gradle build
:compileJava UP-TO-DATE
:compileScala UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:jar UP-TO-DATE
:assemble UP-TO-DATE
:compileTestJava UP-TO-DATE
:compileTestScala UP-TO-DATE
:processTestResources UP-TO-DATE
:testClasses UP-TO-DATE
:test UP-TO-DATE
:check UP-TO-DATE
:build UP-TO-DATE
...
As you can see, there's compileScala among the tasks, and it in fact builds both the .java and the .scala files. That's because of our sourceSets: the Java source directories are handed to the Scala compiler instead of the Java one, which allows you to use Scala classes in .java files and Java classes in .scala files.

You should also make sure that you choose a proper version of Scala. I use 2.11.5, which works with Java 8 without any issues.