-
Notifications
You must be signed in to change notification settings - Fork 339
Are You Trying to Replace SQL with Java 8 Stream?
If you are among the programmers who are trying to replace SQL with Java 8 stream, I believe you need to read this essay really carefully. Stream lacks special data objects for computing structured data. The class library is, in fact, much less professional than SQL in structured data computations. By contrast, SPL has a complete set of data objects for structured data computations and provides structured data computation capability independent of databases. The programming language is the real structured data computation professional.
Stream was first introduced with Java 8. The class library offers richer and smoother Lambda syntax, facilitating the process of achieving set-oriented operations. This has attracted many programmers to try to substitute SQL with it. Actually, Stream is not better than SQL in terms of professional equipment for structured data computation.
When members of a set are of simple data types (integer, floating-point, string or date), Stream is convenient to use for implementing set-oriented operations. Below is an example of integer array filtering:
IntStream iStream=IntStream.of(1,3,5,2,3,6);
IntStream r1=iStream.filter(m->m>2);
Sorting:
Stream r2=iStream.boxed().sorted();
Aggregate:
int r3=iStream.sum();
The three pieces of code are short, easy to understand and master. Stream also implements distinct, concat, and contain and other operations with simple code. The class library is convenient for handling set-oriented operations.
The data objects in structured data computations, however, are of record type (Map\ entity\ record) rather than simple data types. Stream becomes not so convenient to use in computing records. To sort Orders table by Client field in the reversed direction and Amount field in ascending order, for instance, Stream generates the following code:
// Ordersis of Stream<Order>type, and Order is of record type, which are defined as follows:
//record Order(int OrderID, String Client, int SellerId, double Amount, Date OrderDate) {}
Stream<Order> result=Orders
.sorted((sAmount1,sAmount2)->Double.compare(sAmount1.Amount,sAmount2.Amount))
.sorted((sClient1,sClient2)->CharSequence.compare(sClient2.Client,sClient1.Client));
Obviously, the Stream code is more complicated than SQL code. An uncommon thing about the code is that the order of the sort fields has to be reversed. The record type data object in the code implements equals function automatically. But if you use the entity type, you need to manually implement the function to do the sorting, which adds more code to the existing piece.
Though inconvenient in performing sorting, Stream uses a logical structure similar to SQL’s. It handles some other computations with different logical structures that SQL programmers are not familiar with. To group Orders table by SellerID field and sum Amount field, for instance, Stream has the code below:
…
Map<Object, DoubleSummaryStatistics> c=Orders.collect(Collectors.groupingBy(r->r.SellerId,Collectors.summarizingDouble(r->r.Amount)));
for(Object sellerid:c.keySet()){
DoubleSummaryStatistics r =c.get(sellerid);
System.out.println("group is:"+sellerid+" sum is:"+r.getSum());
}
In the above code, not only groupingBy but multiple functions, including collect, Collectors, summarizingDouble and DoubleSummaryStatistics, etc., are used to achieve the grouping & aggregation computation. The result is not the ordinary set of records of, say, Stream type, but one of Map type, from which data needs to be retrieved through loops because the data types are not as consistent as SQL.
It is relatively simple if the grouping and aggregation is performed on a single field, but it is complicated when the computation involves multiple fields. To group a table by the year and Client, for instance:
Calendar cal=Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c=Orders.collect(Collectors.groupingBy(
r->{
cal.setTime(r.OrderDate);
return cal.get(Calendar.YEAR)+"_"+r.SellerId;
},
Collectors.summarizingDouble(r->{
return r.Amount;
})
)
);
for(Object sellerid:c.keySet()){
DoubleSummaryStatistics r =c.get(sellerid);
String year_sellerid[]=((String)sellerid).split("_");
System.out.println("group is (year):"+year_sellerid[0]+"\t (sellerid):"+year_sellerid[1]+"\t sum is:"+r.getSum());
}
To achieve the multi-field grouping & aggregation, Stream first concatenates the multiple fields into one using the separator (such as the underline) to convert the grouping & aggregation into one on a single filed, and then split the field after the computation is done. The logical structure is different from both that in SQL and that it uses for performing the single-field aggregate. Both SQL programmers and Java programmers are not accustomed to it.
The code is already complicated when only one data object is involved. We can imagine what the code will become if the computation occurs between multiple data objects. To perform intersection between two orders table of same structures, for instance:
Stream<Order> result=Stream.concat(Orders1, Orders2)
.collect(Collectors.collectingAndThen(
Collectors.toMap(Order::OrderID,
Function.identity(),
(l, r) -> l,
LinkedHashMap::new),
f -> f.values().stream()));
Stream handles both sorting and grouping & aggregation using functions with intuitive names (sorted\groupingBy), but it uses concat to perform intersection (and difference and union). concat originally refers to concatenation, which is not intuitive at all. This is because Stream does not support a computation between data object. It has no choice but to resort to hardcoding, which requires a certain level of skills and optimization experience.
Stream does not provide a direct support for join operations. To perform the inner join on Orders table and Employee table, and group the result by Employee.Dept and sum Orders.Amount, for instance:
Map<Integer, Employee> EIds = Employees.collect(Collectors.toMap(Employee::EId, Function.identity()));
//Create new OrderRelationclass, where SellerIdcontains single valuesthat point to corresponding Employeeobjects
record OrderRelation(int OrderID, String Client, Employee SellerId, double Amount, Date OrderDate){}
Stream<OrderRelation> ORS=Orders.map(r -> {
Employee e=EIds.get(r.SellerId);
OrderRelation or=new OrderRelation(r.OrderID,r.Client,e,r.Amount,r.OrderDate);
return or;
}).filter(e->e.SellerId!=null);
Map<String, DoubleSummaryStatistics> c=ORS.collect(Collectors.groupingBy(r->r.SellerId.Dept,Collectors.summarizingDouble(r->r.Amount)));
for(String dept:c.keySet()){
DoubleSummaryStatistics r =c.get(dept);
System.out.println("group(dept):"+dept+" sum(Amount):"+r.getSum());
}
The hardcoded inner join is long and complex. Left joins and outer joins need hardcoding too but with different logic and more complexity, which is a challenge even to Java programmers.
Joins are an important part of structured computations, but Stream does not give good support for it. That defines its structured data computation competency as less than SQL.
Before Stream, it is knotty for Java to achieve set-oriented operations. Its introduction to Java adds special structured data computation support, including basic set-oriented operations and Lambda-syntax-friendly design, to the high-level language. Yet, Stream still needs to use Java-based data types to do operations due to the lack of professional structured data objects. The improvements are only superficial.
In fact, no class libraries that implement computations within Java are truly professional due to the lack of solid low-level support. The fundamental reason is that Java lacks professional structured data objects. A structured computation returns a result set whose structure varies according to computing processes and generates intermediate results of dynamic structures. It is almost impossible to pre-define the structures. Yet, Java is a strongly typed language, which requires that structures of data objects be pre-defined (otherwise only difficult to manipulate types of data objects, say map, can be used), resulting in rigid and inflexible coding process and greatly restricting the ability of Lambada syntax. If it is an interpreted language, it can simplify the definition of a parameter by specifying within a function that a parameter expression will be parsed as a value or a function. Java is a compiled language that cannot distinguish different types of parameters. It can only implement an anonymous function (using Lambda syntax) by designing a complicated and difficult-to-understand interface. This is difficult even for SQL programmers. Though the structured data computations can be considerably simplified by skipping the data object to directly reference fields (like the form of “UnitPriceQuantity”), Java cannot support the cleverly simple syntax due to the lack of professional structured data objects. It thus produces lengthy and non-intuitive code (like “x.UnitPricex.Quantity”).
In a word, Stream is far from professional in computing structured data because of the lack of special structured data objects, while SQL is professional enough but relies heavily on databases. Both have their advantages and disadvantages. It would be best if we could combine SQL’s professional structured data computation syntax with Stream’s independent computing ability.
For this we have esproc SPL.
esProc SPL offers independent structured data computation capability and a complete set of structured data objects that make it as capable as SQL in handling structured data. To achieve the above computing task, SPL generates much simpler code than Stream.
Below is SPL’s way of performing double-field sorting:
A | |
---|---|
1 | =Orders=file("Orders.txt").import@t() |
2 | =Orders.sort(-Client, Amount) |
The @t option enables reading the first row as field names. We can then do the computation directly with field names without referencing data objects. -Client represents performing the specific operation in the reversed order.
SPL writes code in cells. The cell name can be used to replace a variable name, and the above script can be written as follows:
A | |
---|---|
1 | =file("Orders.txt").import@t() |
2 | =A1.sort(-Client, Amount) |
SPL also allows writing all code in one line when reading experience won’t be affected, generating shorter script:
A | |
---|---|
1 | =file("Orders.txt").import@t().sort(-Client, Amount) |
SPL performs single-field grouping & aggregation as follows:
=Orders.groups(SellerId; sum(Amount))
The SPL language has a simple and consistent syntax structure to enable easy usage when the single-field operation is extended to double fields:
=Orders.groups(year(OrderDate),Client; sum(Amount))
SPL is the professional structured data computation language. It simplifies computations on a single data object as well as makes it easy to do computations on multiple data objects. To perform intersection on subsets of two orders tables, for instance:
=Orders1 ^ Orders2
The circumflex symbol ^ in the above code is the intersection operator, a counterpart of the SPL isect function. Similar set operators in SPL include & (union), \ (difference), and the special operator | for concatenation.
SPL has simple code for performing an inner join operation:
=join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))
As SQL does, SPL just makes few changes on the code to switch to another type of join without an overhaul. We use join@1 to represent a left join, and to perform a full join, we just need to change the option to @f, which is join@f.
SPL offers the general JDBC interface. The SPL code can be easily embedded into a Java program for execution (similar to the SQL way of coding) or called from a Java program in the form of a script file (similar to the stored procedure).
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("=file(\"Orders.txt\").import@t().sort(-Client, Amount)");
//result = statement.executeQuery("call splFileName(?)");
…
Find more information about SPL’s invocation by Java HERE.
SPL boasts far more powerful computational ability than SQL. For the following complicated computing scenarios, it is much easier to handle them in SPL than in SQL.
Example 1: Duty.xlsx stores daily duty records. Usually, one person is on duty for several continuous workdays before another one takes place. The task is to get the information of continuous duties for each person according to the file.
Part of the original Duty.xlsx:
Date | Name |
---|---|
2018-03-01 | Emily |
2018-03-02 | Emily |
2018-03-04 | Emily |
2018-03-04 | Johnson |
2018-04-05 | Ashley |
2018-03-06 | Emily |
2018-03-07 | Emily |
… | … |
Part of the result table:
Name | Begin | End |
---|---|---|
Emily | 2018-03-01 | 2018-03-03 |
Johnson | 2018-03-04 | 2018-03-04 |
Ashley | 2018-03-05 | 2018-03-05 |
Emily | 2018-03-06 | 2018-03-07 |
… | … | … |
SQL is not good at handling order-based computations. The process is complicated and difficult as it uses the window functions to generate nested subqueries. SPL supplies special functions to do the order-based computations. There is only one line of the key code.
A | |
---|---|
1 | =T("D:/data/Duty.xlsx") |
2 | =A1.group@o(name) |
3 | =A2.new(name,~.m(1).date:begin,~.m(-1).date:end) |
Example 2: Database table sales stores sales amount information for customers. The main fields are client and amount. The task is to find the top N big customers whose amount takes up at least half of the total sales and sort them in descending order by amount. Below is the SPL code:
A | B | |
---|---|---|
1 | =demo.query(“select client,amount from sales”).sort(amount:-1) | Get data and sort it in descending order |
2 | =A1.cumulate(amount) | Get a sequence of cumulative amounts |
3 | =A2.m(-1)/2 | The last member is the total |
4 | =A2.pselect(~>=A3) | Get the position where the cumulative amount exceeds half of the total |
5 | =A1(to(A4)) | Get values by positions |
SPL is more convenient than SQL in handling complicated computing scenarios because the former has a complete set of structured data objects and the special grid-style, stepwise, and easy to debug coding mode.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code