-
Notifications
You must be signed in to change notification settings - Fork 339
Java or SQL, which is faster for data operation?
SQL has always been the main language to calculate and process structured data. However, as technology evolves, Java is now often used to process data in order to meet the needs of scalability. Data processing in Java can significantly reduce the coupling between database and application (database is only used for storage), increase the flexibility of overall framework and make application easier to expand and maintain.
Yet, in the era of big data, computing performance is an unavoidable topic, a technology that runs too slow will offset its advantages in other aspects. So, which is faster for data operation, Java or SQL?
In terms of data processing, SQL is more widely used, and more advantageous in performance in general. The high performance of SQL mainly comes from the following two aspects:
-
Optimization engine. Since SQL is a declarative language, users usually only need to describe the calculation task without caring about specific implementation details. Once users finish coding, database will automatically optimize the query plan and execution method based on the description of SQL statement, thereby improving query efficiency. This is what the role of optimization engine is. In addition, database offers very mature algorithms for conventional operation, many calculations can be executed quickly with the help of optimization engine.
-
Data storage. Data computing and storage are inseparable, and the database combines both of them. Once the storage is available, we can carry out many speed-up methods in practice, such as index, cache, partition, redundancy. Especially for the AP-based data warehouse, we can design specialized storage (such as columnar storage) for computing. The combined work of storage and algorithms can achieve high efficiency.
Nevertheless, SQL has its weaknesses. Due to the limitations of SQL description ability, we have to adopt an indirect way to implement complex query tasks, making it very cumbersome to code. More importantly, once the complexity of SQL statement increases, it will be difficult for optimization engine to play its role (it can't guess the intention of SQL statement, and thus it has to execute according to literally expressed logic, resulting in very poor performance). Therefore, the optimization engine works only for simply tasks. In fact, the complexity of SQL statement that make the optimizer ineffective is very low.
On the other hand, the data types and algorithms of SQL are not comprehensive enough. If we want to use an algorithm that is beyond the scope of SQL algorithms, we need to write ourselves, such as writing a user-defined function (UDF). However, if some basic data types are not supported, or it needs to design a storage based on calculation characteristic, it doesn’t help even with UDF.
Let's take an example: calculate TopN:
SELECT TOP 10 x FROM T ORDER BY x DESC
Most databases have the ability to optimize this task, and won’t do a sorting in fact. However, if the task is changed to calculate in-group TopN, the code will be:
select * from
(select y,*,row_number() over (partition by y order by x desc) rn from T)
where rn<=10
Although the code complexity doesn’t increase much, the optimization engine will get confused and can't guess its real intention, and will probably do a big sorting according to the literally expressed logic. As the sorting of big data is a very slow action, the final result is poor performance.
Since SQL lacks the explicit set data type, and the database doesn’t provide an algorithm that regards TopN as an aggregation operation, we have to hope that the optimizer is powerful enough. Unfortunately, the optimization result is often unsatisfactory.
Another example: when carrying out a funnel analysis on user churn rate for an e-commerce company, SQL needs to resort to sub-queries because multiple steps and interdependent order are involved:
with e1 as (
select uid,1 as step1,min(etime) as t1
from event
where etime>= to_date('2021-01-10') and etime<to_date('2021-01-25')
and eventtype='eventtype1' and …
group by 1),
e2 as (
select uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
from event as e2
inner join e1 on e2.uid = e1.uid
where e2.etime>= to_date('2021-01-10') and e2.etime<to_date('2021-01-25')
and e2.etime > t1 and e2.etime < t1 + 7
and eventtype='eventtype2' and …
group by 1),
e3 as (
select uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
from event as e3
inner join e2 on e3.uid = e2.uid
where e3.etime>= to_date('2021-01-10') and e3.etime<to_date('2021-01-25')
and e3.etime > t2 and e3.etime < t1 + 7
and eventtype='eventtype3' and …
group by 1)
select
sum(step1) as step1,
sum(step2) as step2,
sum(step3) as step3
from
e1
left join e2 on e1.uid = e2.uid
left join e3 on e2.uid = e3.uid
This is just a three-step funnel calculation, but as you can see, this code is quite complex, and the performance is low. In the real-world computing scenario, this calculation was performed on 400 million rows of data, the user did not get result in 3 minutes running on Snowflake's Medium server (equivalent to 32 cores). In addition, SQL does not provide enough support for ordered operation, and naturally the database itself does not provide efficient algorithms for ordered operation. As a result, we have to write it as such a self-JOIN form, resulting in low efficiency.
Obviously, nothing is perfect in this world. Despite the fact that the “inherent” optimization algorithms and “automatic” optimization ability of SQL (database) do make us feel easy to use sometimes, and we only need to write SQL code, and don’t have to consider the execution steps, we find it very difficult to intervene in its execution path, let alone get the database to execute the algorithms we designed. Such inflexibility together with framework shortcomings (tight coupling, poor scalability) will often offset some of SQL advantages in performance, thereby causing the users to be reluctant to use SQL.
In contrast, Java, as a fully functional programming language, is much more flexible. Not only can we use Java to implement complex calculations, but we can also write various high-performance algorithms to improve computing performance - after all, there's nothing that code can't solve. For example, if the in-group TopN mentioned above is calculated in Java, then big sorting can be avoided (we only need to use a set with same length with the groups, and keep the set members as the largest N members of group), and the calculation can be done in only one traversal. As for the funnel analysis, using Java also works well, we just need to read out the behavior data of one user each time and then count the funnels in order.
However, a key point is necessary to execute complex calculations and improve computing performance in Java because algorithms not only need to be thought of but also implemented. Yet, it is usually not easy to write these algorithms, and many programmers lack such ability. For example, for the two-table join operation, the easiest method to think of and implement is to use two-layer loop, which has a computing complexity of n*m. If we want to make the running speed faster, we can use HASH JOIN, which can reduce the complexity by K times (provided that HASH can be evenly divided into K parts), but this algorithm is beyond the ability of many programmers. Furthermore, if we know the data is in order, then we can use the ordered merge algorithm, it can reduce the complexity to n+m, yet it is still very difficult to implement. What we’re talking about is just an operation on two tables, if the operation involves multiple tables, it will be more difficult to implement. In that case, we will probably adopt a low-performance algorithm that is easy to write, resulting in a situation where the conventional operations implemented in Java runs slower than that implemented in SQL.
There are two reasons why it is difficult to implement the algorithms in Java. One is that these algorithms are inherently difficult; the other is that Java is too native and does not provide sufficiently rich structured data computing library, we have to do from scratch in order to implement these data calculations. Therefore, it is cumbersome and requires highly skilled programmers. Unfortunately, many programmers don’t have such ability, or they may be unwilling to code in Java even if they can. For example, the following Java code is for a conventional grouping and aggregating:
Map<Integer, Double> sellerAmountMap = new HashMap<>();
Iterator<Order> iterator = orders.iterator();
while (iterator.hasNext()) {
Order order = iterator.next();
int sellerId = order.getSellerId();
double amount = order.getAmount();
if (sellerAmountMap.containsKey(sellerId)) {
double currentAmount = sellerAmountMap.get(sellerId);
amount += currentAmount;
}
sellerAmountMap.put(sellerId, amount);
}
for (Map.Entry<Integer, Double> entry : sellerAmountMap.entrySet()) {
System.out.println("Seller ID: " + entry.getKey() + ", Total Amount: " + entry.getValue());
}
This is just an in-memory grouping and aggregating operation performed on one field. If multiple fields are involved, and both the grouping keys and the aggregation formulas are expression, the code will be many times more complex, and the coding difficulty will increase significantly.
Stream in Java8 or higher version provides some computing libraries:
Map<Object, DoubleSummaryStatistics> c=Orders.collect(Collectors.groupingBy(r->r.SellerId,Collectors.summarizingDouble(r->r.Amount)));
for(Object sellerid:c.keySet()){
DoubleSummaryStatistics r =c.get(sellerid);
System.out.println("group is:"+sellerid+" sum is:"+r.getSum());
}
Although this code uses multiple classes and functions such as groupingBy, collect, Collectors, summarizingDouble, and DoubleSummaryStatistics, it is still somewhat complex.
Multi-Field grouping and aggregating (group by Year and Client):
Calendar cal=Calendar.getInstance();
Map<Object, DoubleSummaryStatistics> c=Orders.collect(Collectors.groupingBy(
r->{
cal.setTime(r.OrderDate);
return cal.get(Calendar.YEAR)+"_"+r.SellerId;
},
Collectors.summarizingDouble(r->{
return r.Amount;
})));
for(Object sellerid:c.keySet()){
DoubleSummaryStatistics r =c.get(sellerid);
String year_sellerid[]=((String)sellerid).split("_");
System.out.println("group is (year):"+year_sellerid[0]+"\t (sellerid):"+year_sellerid[1]+"\t sum is:"+r.getSum());
}
Obviously, this code is much more complex. Coding in Java is totally different from coding in SQL. When this task is implemented in SQL, the coding difficulty is almost the same whether it is to group a single field or multiple fields, and whether or not a formula is used. Moreover, this is just a simple grouping and aggregating operation, the complexity of coding other high-performance algorithms will surely be more complex. In addition, the computing library of Java only offers in-memory algorithms, making it difficult to process big data, and hence Java also falls short compared with SQL in this regard. It is already very difficult to code these calculations in Java, let alone guarantee the performance after coding.
Another problem of Java is that Java does not provide a common efficient storage. As a result, we often need to use databases as storage medium (leading to low computing performance), and calculate only after the data is read out through JDBC. Since the efficiency of JDBC of most databases is not high, the IO cost will be very high when the amount of data is large. Even if it runs fast in subsequent computation, the overall time is still long (IO cost is greater than computing cost).
If there is an efficient storage mechanism that does not rely on database (such as a file system), then the IO efficiency problem of databases can be avoided.
An efficient own storage mechanism still makes sense for improving performance, even though we must use database as the data source or destination and put up with the inefficient IO of database in some scenarios. Specifically, big data computing will involve writing buffers sometimes (such as big sorting), and for some complex businesses, the intermediate results need to be stored during computing. In the event that the buffer and intermediate data are written to database, the cost will be too high. Writing data to database is much slower than reading data from database. Data writing and reading will result in very high time cost. In this case, using an independent storage to write buffers or store intermediate results can solve the problem caused by frequent read/write of database effectively. In addition, for some databases, especially for the data warehouse serving AP, we can usually copy historical cold data from the database to external storage to obtain higher computing performance. If we always read a large amount of historical data from database for calculation, it will often result in an embarrassing situation where the IO time far exceeds computing time.
Is it feasible to use files to buffer the data? After all, it is easy for Java to read and write files. Since the efficiency of reading and writing file is much higher than that of reading and writing database, it is theoretically possible. However, implementing an efficient file storage mechanism is still difficult and cumbersome just as algorithm above.
The simplest method is to use text, which is very common. Yet, the shortcoming of using text is low in read and write efficiency. The problem lies in data type parsing because it is time-consuming to parse various data into corresponding type, with parsing strings and dates being the slowest.
In this case, does it work if we serialize the objects and write them as binary format? It's not that simple. The reason is that although this method solves the problem of data type parsing, it may not be more efficient than text, and may occupy more space than text (for example, 1 of long type will take up 8 bytes, which is even longer than 1 of text). In addition, while it reduces the data type parsing time, it increases the hard disk read and write time, so it is hard to judge whether the performance is improved. In short, for an efficient storage, any means that can improve usage efficiency should be taken into account, such as encode and compression, columnar storage and even index. After all means are considered, we will foresee that the development work very is very challenging.
From the above analysis we can see that the advantages and disadvantages of SQL are basically opposite to those of Java. The advantages of SQL are exactly the disadvantages of Java, and vice versa. However, both have performance issues. When the performance of SQL is low, there's nothing we can do. When the performance of Java is low, we can do something to improve performance, but it is too difficult and has little effect.
Is there a language that combines the advantages of both SQL and Java, and allows us to enjoy the simplicity of SQL and the flexibility of Java while achieving high performance?
Yes, SPL (Structured Process Language) is exactly such a language. As a programming language specifically for processing structured data, SPL provides rich structured data computing library, encapsulates many high-performance algorithms that can be used directly, and fully retains the advantages of SQL. Since SPL syntax is simple and flexible, it is more convenient to implement various complex calculations than Java. In addition, SPL provides its own file storage, it can solve the IO efficiency problem of database efficiently. Moreover, SPL has good integration and can be embedded in the application to process data and obtain the same flexibility as Java.
SPL provides a professional structured data object: table sequence, and offers rich computing library based on the table sequence, including conventional operations like filtering, grouping, sorting, distinct and join. Here below are some common operations:
Orders.sort(Amount) // sort
Orders.select(Amount*Quantity>3000 && like(Client,"*S*")) // filter
Orders.groups(Client; sum(Amount)) // group
Orders.id(Client) // distinct
join(Orders:o,SellerId ; Employees:e,EId) // join
…
Similar to SQL, these libraries also adopt mature algorithms of the industry and can operate efficiently.
With these rich data types and computing library, SOL code becomes concise, and the performance is also improved. For example, the TopN mentioned above can be coded as:
T.groups(y;top(-5;x))
SPL regards TopN as an aggregation operation, just like SUM and COUNT, except that its return value is a set. In this way, the sorting action is avoided in SPL statement that takes top N, the execution efficiency is thus increased.
Another example, when using SPL to do the funnel analysis mentioned above, the code is:
A | |
1 | =["etype1","etype2","etype3"] |
2 | =file("event.ctx").open() |
3 | =A2.cursor(id,etime,etype;etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype) && …) |
4 | =A3.group(uid).(~.sort(etime)) |
5 | =A4.new(~.select@1(etype==A1(1)):first,~:all).select(first) |
6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null)))) |
7 | =A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3) |
Compared with SQL code given above, this SPL code is shorter. The reason is that SPL not only supports procedural and ordered calculations, but offers rich computing library and agile syntax. In addition, this code can handle funnel involving any number of steps and more universal (we just need to modify the parameter).
Moreover, this SPL code exhibits excellent running performance. For the same data size (400 million rows), running this code on a 12-core, 1.7G low-end server only takes less than 10 seconds, which is at least 20 times faster than database. If hardware difference is considered, the performance gap will be bigger.
SPL provides many similar high-performance algorithms, including:
- In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning...
- External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval...
- Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping...
- Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning...
- Merge and join: ordered merging, merge by segment, association positioning, attached table...
- Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension...
- Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern fault tolerance and spare-wheel-pattern fault tolerance, load balancing...
From these ready algorithms in SPL, we can select the appropriate one directly based on specific application scenario instead of developing algorithm from scratch like Java. Moreover, SPL provides more rich algorithms than SQL. For example, for the join operation mentioned earlier, SPL has made a more detailed division. Specifically, for many-to-one foreign key association, we can use the algorithms like foreign key addressization, instant addressization and foreign key sequence-numberization; If the dimension table is relatively large, we can use the big dimension table search technology to achieve join; If both the dimension table and the fact table are large, we can adopt the one-side partitioning algorithm; For the one-to-one homo-dimension tables association and the many-to-one primary-sub table association, the ordered merge algorithm can be utilized. When the amount of data is too large, we can further improve performance through segmental parallel computing.
From the perspective of computing library, since SPL offers richer algorithms than Java and SQL, we only need to select and combine these algorithms when processing data, which makes coding simple and convenient while improving performance.
Relying solely on a rich computing library doesn’t work well. If we still utilize the database as storage like Java, the same problem that the IO cost is greater than computing cost still exists. For this reason, SPL designs its own binary file storage, which is not just a simple binary format, but an efficient file format that integrates multiple mechanisms such as encoding, compression, columnar storage, index, and segmentation.
Currently, SPL provides two high-performance file types: bin file and composite table. The bin file adopts the compression technology (faster reading due to less space occupation), and stores the data types (faster reading as a result of avoiding parsing data type). Since the bin file supports the double increment segmentation mechanism that can append data, it is easy to implement parallel computing by utilizing the segmentation strategy, and the computing performance is guaranteed. The composite table supports the columnar storage, which has great advantage when the number of columns (fields) involved in calculation is small. In addition, not only is the minmax index adopted on composite table, but the double increment segmentation mechanism is supported, allowing us to utilize the advantage of columnar storage, and improve the performance more easily through parallel computing.
By means of high-performance storage, we can migrate data from database to files to obtain higher computing performance. Even if the data cannot be migrated from database, using SPL’s high-performance storage to write buffers or intermediate results is also helpful for improving performance. And we can copy the historical cold data to file storage for efficient computation.
In addition to performance advantage, file storage can bring us many other benefits: i) lower cost; ii) file system's tree directory structure makes file management more convenient; iii) it is easier to implement the separation between storage and computation, allowing us to expand separately for storage or computation, which is more in line with the requirements of modern application framework.
We don’t have to use SPL’s own file format as SPL has good openness and can access multiple data sources. When the computing performance and data real-timeness are given priority, we can solidify cold data to SPL storage, and read hot data still from database, thus implementing real-time query and HTAP.
We have already discussed the high performance and code simplicity of SPL. Can SPL inherit the benefits of Java in terms of framework?
The answer is yes.
SPL has good integration and can be embedded in application to serve as an in-application computing engine, which makes SPL have the flexibility same as Java, and play its role in application scenarios such as microservice.
SPL provides standard JDBC driver, allowing us to request SPL calculation results like invoking stored procedure.
Below is an example of invoking SPL code through JDBC:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
Overall, the performance of SQL depends on the performance of optimization engine, yet the optimization engine often fails when facing slightly complex calculations and, if we use SQL, a bloated (inflexible) application framework is unavoidable; As for Java, it has good flexibility and can handle complex calculations, however, if we want to achieve high performance, various situations have to be taken into account, and we have to do from scratch, which is very complex. Moreover, considering the dependence on database storage, Java is not advantageous in performance. In contrast, SPL combines the advantages of both SQL and Java. Not only is SPL concise in syntax and flexible in use, but it also offers a large number of built-in high-performance algorithms and a high-performance file storage, which can fully guarantee the computing performance and better meet the requirements of contemporary applications.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code