Java and .NET - Comparing Streams to LINQ
In .NET, LINQ is an easy way to simplify querying datasets. Java has no direct counterpart, but since the introduction of Java 8 in 2014 you can use "streams". Both LINQ and streams simplify querying datasets and give developers an easy API to use. Streams are similar to LINQ in many ways, but there are still some differences. Because Java only introduced streams in 2014, the way they are applied to querying sets of data can seem a little half-hearted to a .NET developer (LINQ has been available since 2007, when it shipped with .NET Framework 3.5).
Nonetheless, it is interesting to take a look at the differences and similarities between both ways of querying data sources. Though LINQ in itself is defined both as a language construct and as a library, we will limit ourselves to the library part since it bears the most similarities to Java streams. We will also focus on LINQ to Objects only.
1. LINQ and Java stream operations
All LINQ and Java stream operations are part of one of three groups. These are:
1. Fetch the data
2. Create a query
3. Execute the query
The following will give a more detailed explanation of each of these three groups.
Fetch the data
First, we need to fetch the data that we are going to manipulate. The source of the data doesn’t matter. What is important is that the result set we are going to work on implements the IEnumerable<T> interface (for LINQ) or the Collection<E> interface (for Java), either directly or via its parents. Java arrays are the exception: they do not implement Collection, but can be turned into a stream with Arrays.stream(). This way we know that we are working with a sequence of elements that can be queried.
Create a query
After you have fetched the data that you want to manipulate, you can write a query using that dataset. A query is just that: a list of criteria that specifies what subset of the data you want to retrieve. There is one major difference between Java and .NET related to the querying of data. In .NET, LINQ queries use deferred execution: a query does not run when you define it, but each time you enumerate its result. This means that, if you enumerate the same query ten times, it will be executed ten times. By contrast, a Java stream can be consumed only once: if you run a second terminal operation on the same stream, it throws an IllegalStateException.
This decision was made when the engineers for Java 8 were designing the implementation of the streaming model while working on JSR-335: the designers chose to throw an exception whenever you reuse a stream that has already been consumed or closed. The .NET team, in contrast, chose to allow re-enumeration and put the entire responsibility for it with the developer who uses the query.
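The Java side of this behaviour can be sketched as follows. This is a minimal illustration, not code from the JDK documentation: the class name StreamReuseDemo, its method names, and the sample numbers are assumptions made for this example. It shows the exception on reuse and one common workaround, a Supplier that builds a fresh stream for every execution.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

class StreamReuseDemo {

    // Returns true if a second terminal operation on the same stream fails.
    static boolean secondUseThrows(List<Integer> numbers) {
        Stream<Integer> query = numbers.stream().filter(x -> x > 5);
        query.count(); // the first terminal operation consumes the stream
        try {
            query.count(); // second use of the same stream object
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Workaround: keep a recipe for the query and build a new stream per execution.
    static long countTwice(List<Integer> numbers) {
        Supplier<Stream<Integer>> query = () -> numbers.stream().filter(x -> x > 5);
        return query.get().count() + query.get().count();
    }

    public static void main(String[] args) {
        // Sample data chosen for this sketch only.
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(secondUseThrows(numbers)); // true
        System.out.println(countTwice(numbers));      // 10
    }
}
```

The Supplier comes closest to the LINQ model: every call to get() runs the query again against the current contents of the list.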
Execute the query
You probably also want to do something with the subset of data you just retrieved: you can retrieve the objects in full or transform them using map or reduce functionality. We should always make sure that side effects are either intentional or avoided. Both Java and .NET will happily let you introduce side effects on the objects themselves, but will throw exceptions when you try to modify the collection that you are currently iterating over.
For example, persons.ForEach(x => x.Name = "C"); is OK, since it only affects the objects inside the collection. By contrast, persons.ForEach(x => persons.Add(new Developer("C"))); will throw an InvalidOperationException, because the collection itself was modified.
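The same two cases can be reproduced in Java. The sketch below is only an illustration: the Developer class is a minimal stand-in invented for this example, and the behaviour shown is that of java.util.ArrayList, which reports the problem with a ConcurrentModificationException. Other collection types may behave differently.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

class SideEffectDemo {

    // Minimal stand-in for the Developer type used in the article.
    static class Developer {
        String name;
        Developer(String name) { this.name = name; }
    }

    static List<Developer> sample() {
        List<Developer> persons = new ArrayList<>();
        persons.add(new Developer("A"));
        persons.add(new Developer("B"));
        return persons;
    }

    // Changing the elements is allowed: the list itself is not modified.
    static String renameAll(List<Developer> persons) {
        persons.forEach(x -> x.name = "C");
        return persons.get(0).name + persons.get(1).name;
    }

    // Adding to the list while iterating over it is detected by ArrayList.
    static boolean addWhileIteratingThrows(List<Developer> persons) {
        try {
            persons.forEach(x -> persons.add(new Developer("C")));
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(renameAll(sample()));               // CC
        System.out.println(addWhileIteratingThrows(sample())); // true
    }
}
```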
2. Implementation
The following basic examples give you a general idea of what you can do with the LINQ and stream functionality. Because, let’s be honest, code speaks louder than words.
// the .NET example
dataset.Where(x => x > 5).Sum();
// the Java example
Arrays.stream(data).filter(x -> x > 5).mapToInt(Integer::intValue).sum();
// Result: 40
In .NET (always the first example), dataset is the “fetch the data” part: in our case a static array of integers. The “create a query” part of the statement is the Where() call: here we say that we only want to consider the elements that are greater than 5. Finally, we execute the query by calling Sum(); this tells the application that we want the sum of every element that comes out of the query part (in our case 6 + 7 + 8 + 9 + 10 = 40).
In Java (always the second one), we can see similarities: data is the fetched data (an Integer[] here, which is why mapToInt is needed), and the query part consists of both filter and mapToInt. We call sum() to execute the query and get the same result. Notice how, in this case, we create two intermediary streams: one contains the results of the filter operation and the second one contains the same elements, but is returned as an IntStream.
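For readers who want to run the Java variant, here is a self-contained sketch. The class name SumDemo and the concrete array contents (the integers 1 to 10) are assumptions for this example; the article only tells us that the elements greater than 5 are 6 to 10.

```java
import java.util.Arrays;

class SumDemo {

    static int sumGreaterThanFive(Integer[] data) {
        return Arrays.stream(data)            // fetch the data: a Stream<Integer>
                .filter(x -> x > 5)           // create the query: keep elements above 5
                .mapToInt(Integer::intValue)  // unbox to an IntStream so sum() is available
                .sum();                       // execute the query
    }

    public static void main(String[] args) {
        Integer[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // assumed sample data
        System.out.println(sumGreaterThanFive(data)); // 40
    }
}
```

If data were declared as an int[] instead, Arrays.stream(data) would already return an IntStream and the mapToInt step would have to be dropped.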
Differences
Some of the methods in .NET are not available in Java. We will briefly explore two of these methods.
First, .NET provides the developer with two distinct functions to filter the initial dataset: The TakeWhile and Where functions.
The difference between TakeWhile() and Where() is that TakeWhile() stops at the first element for which the condition returns false. Where(), by contrast, keeps going and evaluates every element, as you can see below:
var strings = new[] {
"abc", "bab", "cab", "ddd", "aaa", "xyz", "abc"
};
strings.Where(x => x.Contains('a')).ToList(); // abc,bab,cab,aaa,abc
strings.TakeWhile(x => x.Contains('a')).ToList(); // abc,bab,cab
This is relevant, since a badly chosen query can cause performance issues, for example when we have a huge dataset that is streamed on demand or a dataset that is potentially infinite, e.g. a stream of prime numbers. Java 8 has no equivalent of TakeWhile(); Stream.takeWhile() was added in JDK 9.
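On JDK 9 or later, the comparison can be reproduced with Stream.takeWhile(). The sketch below assumes such a JDK; the class name TakeWhileDemo and the helper firstSquaresBelow are made up for this illustration. The last method shows why the distinction matters for an infinite source: with filter() instead of takeWhile() the pipeline would never finish.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TakeWhileDemo {

    // Counterpart of Where(): inspects every element.
    static List<String> where(List<String> strings) {
        return strings.stream()
                .filter(x -> x.contains("a"))
                .collect(Collectors.toList());
    }

    // Counterpart of TakeWhile(): stops at the first element that fails the test.
    static List<String> takeWhile(List<String> strings) {
        return strings.stream()
                .takeWhile(x -> x.contains("a")) // requires JDK 9 or later
                .collect(Collectors.toList());
    }

    // takeWhile() ends an otherwise infinite stream of squares.
    static List<Integer> firstSquaresBelow(int limit) {
        return Stream.iterate(1, n -> n + 1)
                .map(n -> n * n)
                .takeWhile(square -> square < limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> strings = List.of("abc", "bab", "cab", "ddd", "aaa", "xyz", "abc");
        System.out.println(where(strings));        // [abc, bab, cab, aaa, abc]
        System.out.println(takeWhile(strings));    // [abc, bab, cab]
        System.out.println(firstSquaresBelow(50)); // [1, 4, 9, 16, 25, 36, 49]
    }
}
```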
A second example is OfType(). Java has no built-in equivalent, but you can work around this by combining the filter and map operations.
var dataset = new Person[] {
new Developer("Maarten"), new Developer("Robin"),
new ProjectManager("Frank"), new ProjectManager("Hans")};
// the .NET example
var subset = dataset.OfType<Developer>().ToList();
// the Java example (dataset is assumed to be a List<Person> here)
List<Developer> subset = dataset.stream()
.filter(x -> x instanceof Developer)
.map(x -> (Developer)x)
.collect(Collectors.toList());
This returns a collection of developers. In this case, the subset contains “Maarten” and “Robin” while leaving out “Frank” and “Hans”.
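If you need this pattern more than once, the filter and map steps can be wrapped in a small generic helper. The ofType method below is a hypothetical helper written for this article, not part of the JDK, and the Person, Developer and ProjectManager classes are minimal stand-ins for the types used above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class OfTypeDemo {

    // Minimal stand-ins for the article's types.
    static class Person {
        final String name;
        Person(String name) { this.name = name; }
    }
    static class Developer extends Person {
        Developer(String name) { super(name); }
    }
    static class ProjectManager extends Person {
        ProjectManager(String name) { super(name); }
    }

    // Hypothetical helper (not a JDK method): keeps the elements that are
    // instances of the given class and casts them to that type.
    static <T> List<T> ofType(List<?> source, Class<T> type) {
        return source.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .collect(Collectors.toList());
    }

    static List<String> names(List<? extends Person> persons) {
        return persons.stream().map(p -> p.name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> dataset = Arrays.asList(
                new Developer("Maarten"), new Developer("Robin"),
                new ProjectManager("Frank"), new ProjectManager("Hans"));
        List<Developer> subset = ofType(dataset, Developer.class);
        System.out.println(names(subset)); // [Maarten, Robin]
    }
}
```

Using Class.isInstance and Class.cast instead of instanceof and a hard-coded cast is what lets the helper work for any target type.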
// the .NET example
var subset2 = dataset.ToList();
// the Java example
Collection<Person> subset2 = dataset.stream()
.collect(Collectors.toList());
This returns a new list with the same contents as the original dataset. The two are equal in content, but they are not the same object, as these two checks show:
// returns true
// the .NET example
subset2.SequenceEqual(dataset)
// the Java example
subset2.equals(dataset)
// returns false
// the .NET example
Equals(subset2, dataset)
// the Java example
subset2 == dataset
Parallelism
Both streams and LINQ support parallel processing, the former using .parallelStream() and the latter using .AsParallel(). .NET supports this from .NET 4.0 onwards with the “PLINQ” execution engine. Mostly they do what you think they do, namely process data in parallel, but .NET has one pitfall compared to Java: the order of the results is not guaranteed unless you also call the AsOrdered() method.
When you use behavioural parameters that are stateless and non-interfering, Java guarantees that the result of the computation is the same, whether you run it in parallel or sequentially. As always, it’s important to make sure that the processing you do doesn’t introduce any unintended side effects.
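A small sketch of that guarantee, with class and method names invented for this example: the lambdas are stateless and do not touch the source, so the parallel pipeline yields the same sum as the sequential one, and collecting an ordered stream into a list keeps the encounter order even when the work is split across threads.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelDemo {

    static int sequentialSum(int upTo) {
        return IntStream.rangeClosed(1, upTo)
                .filter(x -> x > 5)
                .sum();
    }

    static int parallelSum(int upTo) {
        return IntStream.rangeClosed(1, upTo)
                .parallel()          // same pipeline, processed in parallel
                .filter(x -> x > 5)  // stateless, non-interfering predicate
                .sum();
    }

    // The range is an ordered source, so the resulting list keeps that order.
    static List<Integer> parallelEvens(int upTo) {
        return IntStream.rangeClosed(1, upTo)
                .parallel()
                .filter(x -> x % 2 == 0)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sequentialSum(10)); // 40
        System.out.println(parallelSum(10));   // 40
        System.out.println(parallelEvens(10)); // [2, 4, 6, 8, 10]
    }
}
```

Note that forEach() on a parallel stream makes no such ordering promise; forEachOrdered() exists for the cases where the order of the side effects matters.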
Both technologies can be used to speed up and simplify the development process. One should, however, pay attention to the details mentioned above, because careless use can lead to code that is both faulty and slow. In particular, keep in mind the “hidden” behaviour in .NET, such as deferred execution, and possible side effects when processing in parallel.
More information on LINQ can be found, as with all things .NET, on the documentation part of the MSDN site. For more information on Java streams, see the API documentation at the Oracle site. If there's another topic you'd like to see supported in comparing Java to .NET, do leave a comment.