Releases: StarRocks/starrocks
3.4.0
Release date: January 24, 2025
Data Lake Analytics
- Optimized Iceberg V2 query performance and lowered memory usage by reducing repeated reads of delete-files.
- Supports column mapping for Delta Lake tables, allowing queries against data after Delta Schema Evolution. For more information, see Delta Lake catalog - Feature support.
- Data Cache related improvements:
- Introduces a Segmented LRU (SLRU) Cache eviction strategy, which significantly defends against cache pollution from occasional large queries, improves cache hit rate, and reduces fluctuations in query performance. In simulated test cases with large queries, SLRU-based query performance can be improved by 70% or even higher. For more information, see Data Cache - Cache replacement policies.
- Unified the Data Cache instance used in both shared-data architecture and data lake query scenarios to simplify the configuration and improve resource utilization. For more information, see Data Cache.
- Provides an adaptive I/O strategy optimization for Data Cache, which flexibly routes some query requests to remote storage based on the cache disk's load and performance, thereby enhancing overall access throughput.
- Supports automatic collection of external table statistics through automatic ANALYZE tasks triggered by queries. It can provide more accurate NDV information compared to metadata files, thereby optimizing the query plan and improving query performance. For more information, see Query-triggered collection.
- Provides Time Travel query capability for Iceberg, allowing data to be read from a specified BRANCH or TAG by specifying TIMESTAMP or VERSION.
- Supports asynchronous delivery of query fragments for data lake queries. It avoids the restriction that FE must obtain all files to be queried before BE can execute a query, thus allowing FE to fetch query files and BE to execute queries in parallel, and reducing the overall latency of data lake queries involving a large number of files that are not in the cache. Meanwhile, it reduces the memory load on FE due to caching the file list and improves query stability. (Currently, the optimization for Hudi and Delta Lake is implemented, while the optimization for Iceberg is still under development.)
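To illustrate why the Segmented LRU (SLRU) eviction strategy listed above resists pollution from occasional large queries, here is a minimal conceptual sketch. It is not the StarRocks Data Cache implementation; the class name, segment sizes, and promotion rule are assumptions chosen for illustration. New entries land in a probationary segment, and only entries that are accessed again move to a protected segment, so a one-off scan can only churn the probationary segment.

```python
from collections import OrderedDict


class SLRUCache:
    """Illustrative Segmented LRU cache (hypothetical, not StarRocks code)."""

    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()  # oldest entry first
        self.protected = OrderedDict()  # oldest entry first

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)  # second access: promote
            self.protected[key] = value
            if len(self.protected) > self.protected_size:
                # Demote the coldest protected entry instead of dropping it.
                old_key, old_value = self.protected.popitem(last=False)
                self._insert_probation(old_key, old_value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self.probation.pop(key, None)
            self._insert_probation(key, value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict coldest probationary entry


cache = SLRUCache(probation_size=2, protected_size=2)
cache.put("hot", "block-0")
cache.get("hot")                  # second touch promotes "hot"
for i in range(100):              # a large one-off scan
    cache.put(f"scan-{i}", f"block-{i}")
hot_survived = cache.get("hot") is not None
```

In this sketch the scan evicts only other scan blocks, while the previously re-accessed entry stays cached. A plain LRU of the same total size would have evicted it.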
Performance Improvement and Query Optimization
- [Experimental] Offers a preliminary Query Feedback feature for automatic optimization of slow queries. The system will collect the execution details of slow queries, automatically analyze their query plans for potential opportunities for optimization, and generate a tailored optimization guide for each query. If CBO generates the same bad plan for subsequent identical queries, the system will locally optimize this query plan based on the guide. For more information, see Query Feedback.
- [Experimental] Supports Python UDFs, offering more convenient function customization compared to Java UDFs. For more information, see Python UDF.
- [Experimental] Supports Arrow Flight interface for more efficient reading of large data volumes in query results. It also allows BE, instead of FE, to process the returned results, greatly reducing the pressure on FE. It is especially suitable for business scenarios involving big data analysis and processing, and machine learning.
- Enables the pushdown of multi-column OR predicates, allowing queries with multi-column OR conditions (for example, a = xxx OR b = yyy) to utilize certain column indexes, thus reducing data read volume and improving query performance.
- Optimized TPC-DS query performance by roughly 20% under the TPC-DS 1TB Iceberg dataset. Optimization methods include table pruning and aggregated column pruning using primary and foreign keys, and aggregation pushdown.
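The benefit of pushing down a multi-column OR predicate can be shown with a small conceptual sketch. It assumes a simple value-to-row-id index per column, which is a stand-in for whatever index structures the storage engine actually uses; the data and function names are hypothetical. Each side of the OR is answered from its own index and the row-id sets are unioned, so only matching rows are read.

```python
from collections import defaultdict

# Hypothetical table contents.
rows = [
    {"a": 1, "b": "x"},
    {"a": 2, "b": "y"},
    {"a": 3, "b": "x"},
    {"a": 1, "b": "z"},
]


def build_index(rows, column):
    """Map each value of `column` to the set of row ids that hold it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        index[row[column]].add(row_id)
    return index


index_a = build_index(rows, "a")
index_b = build_index(rows, "b")


def or_lookup(a_value, b_value):
    """Evaluate `a = a_value OR b = b_value` using the two column indexes."""
    matching_ids = index_a.get(a_value, set()) | index_b.get(b_value, set())
    return [rows[i] for i in sorted(matching_ids)]


result = or_lookup(1, "y")
# Reference answer from a full scan, for comparison only.
full_scan = [r for r in rows if r["a"] == 1 or r["b"] == "y"]
```

Both approaches return the same rows; the indexed path simply avoids touching rows that match neither condition.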
Shared-data Enhancements
- Supports Query Cache, aligning with the shared-nothing architecture.
- Supports synchronous materialized views, aligning with the shared-nothing architecture.
Storage Engine
- Unified all partitioning methods into the expression partitioning and supported multi-level partitioning, where each level can be any expression. For more information, see Expression Partitioning.
- [Preview] Supports all native aggregate functions in Aggregate tables. By introducing a generic aggregate function state storage framework, all native aggregate functions supported by StarRocks can be used to define an Aggregate table.
- Supports vector indexes, enabling fast approximate nearest neighbor searches (ANNS) of large-scale, high-dimensional vectors, which are commonly required in deep learning and machine learning scenarios. Currently, StarRocks supports two types of vector indexes: IVFPQ and HNSW.
Loading
- INSERT OVERWRITE now supports a new semantic - Dynamic Overwrite. When this semantic is enabled, the ingested data will either create new partitions or overwrite existing partitions that correspond to the new data records. Partitions not involved will not be truncated or deleted. This semantic is especially useful when users want to recover data in specific partitions without specifying the partition names. For more information, see Dynamic Overwrite.
- Optimized the data ingestion with INSERT from FILES to replace Broker Load as the preferred loading method:
- FILES now supports listing files in remote storage, and providing basic statistics of the files. For more information, see FILES - list_files_only.
- INSERT now supports matching columns by name, which is especially useful when users load data from numerous columns with identical names. (The default behavior matches columns by their position.) For more information, see Match column by name.
- INSERT supports specifying PROPERTIES, aligning with other loading methods. Users can specify strict_mode, max_filter_ratio, and timeout for INSERT operations to control the behavior and quality of the data ingestion. For more information, see INSERT - PROPERTIES.
- INSERT from FILES supports pushing down the target table schema check to the Scan stage of FILES to infer a more accurate source data schema. For more information, see Push down target table schema check.
- FILES supports unionizing files with different schemas. The schemas of Parquet and ORC files are unionized based on the column names, and those of CSV files are unionized based on the position (order) of the columns. When there are mismatched columns, users can choose to fill the columns with NULL or return an error by specifying the property fill_mismatch_column_with. For more information, see Union files with different schema.
- FILES supports inferring the STRUCT type data from Parquet files. (In earlier versions, STRUCT data is inferred as STRING type.) For more information, see Infer STRUCT type from Parquet.
- Supports merging multiple concurrent Stream Load requests into a single transaction and committing data in a batch, thus improving the throughput of real-time data ingestion. It is designed for high-concurrency, small-batch (from KB to tens of MB) real-time loading scenarios. It can reduce the excessive data versions caused by frequent loading operations, resource consumption during Compaction, and IOPS and I/O latency brought by excessive small files.
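The Dynamic Overwrite semantic described at the top of this list can be modeled in a few lines. This is only a sketch of the behavior as the note states it, not of how StarRocks executes INSERT OVERWRITE; the table is represented as a dictionary of partitions and the partitioning rule (one partition per dt value) is an assumption made for the example.

```python
def partition_of(record):
    """Hypothetical partitioning rule: one partition per `dt` value."""
    return record["dt"]


def dynamic_overwrite(table, new_records):
    """Return the table state after a dynamic overwrite.

    Partitions that receive new records are replaced or created;
    partitions that receive none are left untouched.
    """
    incoming = {}
    for record in new_records:
        incoming.setdefault(partition_of(record), []).append(record)
    result = dict(table)     # partitions not involved are carried over as-is
    result.update(incoming)  # matching partitions replaced, new ones created
    return result


table = {
    "2025-01-01": [{"dt": "2025-01-01", "v": 1}],
    "2025-01-02": [{"dt": "2025-01-02", "v": 2}],
}
new_records = [
    {"dt": "2025-01-02", "v": 20},
    {"dt": "2025-01-03", "v": 30},
]
after = dynamic_overwrite(table, new_records)
```

Here the 2025-01-01 partition keeps its data, 2025-01-02 is overwritten, and 2025-01-03 is created, without any partition name being specified by the caller.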
Others
- Optimized the graceful exit process of BE and CN by accurately displaying the status of BE or CN nodes during a graceful exit as SHUTDOWN.
- Optimized log printing to avoid excessive disk space being occupied.
- Shared-nothing clusters now support backing up and restoring more objects: logical view, external catalog metadata, and partitions created with expression partitioning and list partitioning strategies.
- [Preview] Supports CheckPoint on Follower FE to avoid excessive memory on Leader FE during CheckPoint, thereby improving the stability of Leader FE.
Downgrade Notes
- Clusters can be downgraded from v3.4.0 only to v3.3.9 and later.
3.4.0-RC01
Release date: January 13, 2025
Data Lake Analytics
- Optimized Iceberg V2 query performance and lowered memory usage by reducing repeated reads of delete-files.
- Supports column mapping for Delta Lake tables, allowing queries against data after Delta Schema Evolution. For more information, see Delta Lake catalog - Feature support.
- Data Cache related improvements:
- Introduces a Segmented LRU (SLRU) Cache eviction strategy, which significantly defends against cache pollution from occasional large queries, improves cache hit rate, and reduces fluctuations in query performance. In simulated test cases with large queries, SLRU-based query performance can be improved by 70% or even higher. For more information, see Data Cache - Cache replacement policies.
- Unified the Data Cache instance used in both shared-data architecture and data lake query scenarios to simplify the configuration and improve resource utilization. For more information, see Data Cache.
- Provides an adaptive I/O strategy optimization for Data Cache, which flexibly routes some query requests to remote storage based on the cache disk's load and performance, thereby enhancing overall access throughput.
- Supports automatic collection of external table statistics through automatic ANALYZE tasks triggered by queries. It can provide more accurate NDV information compared to metadata files, thereby optimizing the query plan and improving query performance. For more information, see Query-triggered collection.
Performance Improvement and Query Optimization
- [Experimental] Offers a preliminary Query Feedback feature for automatic optimization of slow queries. The system will collect the execution details of slow queries, automatically analyze their query plans for potential opportunities for optimization, and generate a tailored optimization guide for each query. If CBO generates the same bad plan for subsequent identical queries, the system will locally optimize this query plan based on the guide. For more information, see Query Feedback.
- [Experimental] Supports Python UDFs, offering more convenient function customization compared to Java UDFs. For more information, see Python UDF.
Shared-data Enhancements
- Supports Query Cache, aligning with the shared-nothing architecture.
- Supports synchronous materialized views, aligning with the shared-nothing architecture.
Storage Engine
- Unified all partitioning methods into the expression partitioning and supported multi-level partitioning, where each level can be any expression. For more information, see Expression Partitioning.
Loading
- INSERT OVERWRITE now supports a new semantic - Dynamic Overwrite. When this semantic is enabled, the ingested data will either create new partitions or overwrite existing partitions that correspond to the new data records. Partitions not involved will not be truncated or deleted. This semantic is especially useful when users want to recover data in specific partitions without specifying the partition names. For more information, see Dynamic Overwrite.
- Optimized the data ingestion with INSERT from FILES to replace Broker Load as the preferred loading method:
- FILES now supports listing files in remote storage, and providing basic statistics of the files. For more information, see FILES - list_files_only.
- INSERT now supports matching columns by name, which is especially useful when users load data from numerous columns with identical names. (The default behavior matches columns by their position.) For more information, see Match column by name.
- INSERT supports specifying PROPERTIES, aligning with other loading methods. Users can specify strict_mode, max_filter_ratio, and timeout for INSERT operations to control the behavior and quality of the data ingestion. For more information, see INSERT - PROPERTIES.
- INSERT from FILES supports pushing down the target table schema check to the Scan stage of FILES to infer a more accurate source data schema. For more information, see Push down target table schema check.
- FILES supports unionizing files with different schemas. The schemas of Parquet and ORC files are unionized based on the column names, and those of CSV files are unionized based on the position (order) of the columns. When there are mismatched columns, users can choose to fill the columns with NULL or return an error by specifying the property fill_mismatch_column_with. For more information, see Union files with different schema.
- FILES supports inferring the STRUCT type data from Parquet files. (In earlier versions, STRUCT data is inferred as STRING type.) For more information, see Infer STRUCT type from Parquet.
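The name-based schema union that this list describes for Parquet and ORC files can be sketched as follows. This is a conceptual model only: files are represented as (column names, rows) pairs, and the boolean flag is a hypothetical stand-in for the choice the fill_mismatch_column_with property offers between filling with NULL and returning an error. Position-based union for CSV files is not shown.

```python
def union_by_name(files, fill_mismatch_with_null=True):
    """Union files by column name.

    files: list of (columns, rows) pairs, where rows are lists of values
    in the order given by columns.
    """
    # The unified schema keeps columns in order of first appearance.
    all_columns = []
    for columns, _ in files:
        for col in columns:
            if col not in all_columns:
                all_columns.append(col)

    merged = []
    for columns, file_rows in files:
        missing = [c for c in all_columns if c not in columns]
        if missing and not fill_mismatch_with_null:
            raise ValueError(f"mismatched columns: {missing}")
        for row in file_rows:
            record = dict(zip(columns, row))
            # Columns absent from this file are filled with None (NULL).
            merged.append([record.get(c) for c in all_columns])
    return all_columns, merged


file_1 = (["id", "name"], [[1, "a"]])
file_2 = (["id", "city"], [[2, "Oslo"]])
columns, merged_rows = union_by_name([file_1, file_2])
```

With the flag set to False, the same call raises an error instead of filling the gaps, which mirrors the stricter of the two behaviors mentioned in the note.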
Others
- Optimized the graceful exit process of BE and CN by accurately displaying the status of BE or CN nodes during a graceful exit as SHUTDOWN.
- Optimized log printing to avoid excessive disk space being occupied.
Downgrade Notes
- Clusters can be downgraded from v3.4.0 only to v3.3.9 and later.
3.3.9
Release date: January 12, 2025
New Features
- Supports the translation of Trino SQL into StarRocks SQL. #54185
Improvements
- Corrected FE node names starting with bdbje_reset_election_group to enhance clarity. #54399
- Implemented vectorization for the IF function on ARM architectures. #53093
- ALTER SYSTEM CREATE IMAGE supports creating an image for StarManager. #54370
- Supports deleting cloud-native indexes of Primary Key tables in shared-data clusters. #53971
- Enforced the refresh of materialized views when the FORCE keyword is specified. #52081
- Supports specifying hints in CACHE SELECT. #54697
- Supports loading compressed CSV files using the FILES() function. Supported compression formats include gzip, bz2, lz4, deflate, and zstd. #54626
- Supports assigning multiple values to the same column in an UPDATE statement. #54534
Bug Fixes
Fixed the following issues:
- Unexpected errors when refreshing materialized views built on JDBC catalogs. #54487
- Instability in results when a Delta Lake table joins itself. #54473
- Upload retries fail when backing up data to HDFS. #53679
- BFD initialization errors on the aarch64 architecture. #54372
- Sensitive information recorded in BE logs. #54677
- Errors in Compaction-related metrics in profiles. #54678
- BE crashes caused by creating tables with nested TIME types. #54601
- Query plan errors for LIMIT queries with subquery TOP-N. #54507
Downgrade Notes
- Clusters can be downgraded from v3.3.9 only to v3.2.11 and later.
3.2.14
Release date: January 8, 2025
Improvements
- Supports collecting statistics of Paimon tables. #52858
- Included node information and histogram metrics in JSON metrics. #53735
Bug Fixes
Fixed the following issues:
- The score of the Primary Key table index was not updated in the Commit phase. #41737
- Incorrect execution plans for max(count(distinct)) when low-cardinality optimization is enabled. #53403
- When the List partition column has NULL values, queries against the Min/Max value of the partition column will lead to incorrect partition pruning. #53235
- Upload retries fail when backing up data to HDFS. #53679
3.3.8
Release date: January 3, 2025
Improvements
- Added a cluster idle API to assist in determining cluster status. #53850
- Included node information and histogram metrics in JSON metrics. #53735
- Optimized the MemTable for Primary Key tables in shared-data clusters. #54178
- Optimized memory usage and statistics for Primary Key tables in shared-data clusters. #54358
- Introduced a limit on the number of partitions scanned per node for queries requiring full-table or large-scale partition scans, enhancing system stability by reducing scanning pressure on individual BE or CN nodes. #53747
- Supports collecting statistics of Paimon tables. #52858
- Supports configuration of S3 client request timeout for shared-data clusters. #54211
Bug Fixes
Fixed the following issues:
- BE crashes caused by inconsistencies in the DelVec of Primary Key tables. #53460
- Issues with lock release of Primary Key tables in shared-data clusters. #53878
- Errors of UDFs nested in functions are not returned in query failures. #44297
- Transactions are blocked at the Decommission phase because they depend on the original replicas. #49349
- Queries against Delta Lake tables use relative paths instead of filenames for file retrieval. #53949
- An error is returned when querying Delta Lake Shallow Clone tables. #54044
- Case sensitivity issues when reading Paimon using JNI. #54041
- An error is returned during INSERT OVERWRITE operations on Hive tables created in Hive. #53792
- SHOW TABLE STATUS command does not validate view privileges. #53811
- Missing FE metrics. #53058
- Memory leaks in INSERT tasks. #53809
- Concurrency issues caused by missing write locks in replication tasks. #54061
- partition_ttl of tables in the statistics database does not take effect. #54398
- Query Cache-related issues:
- Issues with materialized view Union Rewrite. #54293
- Missing padding in string updates for partial updates in Primary Key tables. #54182
- Incorrect execution plans for max(count(distinct)) when low-cardinality optimization is enabled. #53403
- Issues with changing the excluded_refresh_tables parameter of materialized views. #53394
Behavior Changes
- Changed the default value of persistent_index_type for Primary Key tables in shared-data clusters to CLOUD_NATIVE, that is, enabled Persistent Index by default. #52209
3.1.17
Release date: January 3, 2025
Bug Fixes
Fixed the following issues:
- Cross-cluster Data Migration Tool caused the Follower FE to crash during data synchronization and commit, due to not accounting for the deletion of partitions in the target cluster. #54061
- BE in the target cluster might crash when synchronizing tables with DELETE operations using Cross-cluster Data Migration Tool. #54081
- A bug in the BDBJE handshake mechanism where Leader FE would reject reconnection attempts from Follower FE when connection is being re-established, causing Follower FE nodes to exit. #50412
- Duplicate memory statistics in FE lead to excessive memory usage. #53055
- The statuses of the asynchronous materialized view refresh tasks are inconsistent across multiple FE nodes, which leads to inaccurate states of the materialized view during queries. #54236
3.2.13
Release date: December 13, 2024
Improvements
- Supports setting a time range within which Base Compaction is forbidden for a specific table. #50120
Bug Fixes
Fixed the following issues:
- The loadRowsRate field returned 0 after executing SHOW ROUTINE LOAD. #52151
- The Files() function read columns that were not queried. #52210
- Prometheus failed to parse materialized view metrics with special characters in their names. (Now materialized view metrics support tags.) #52782
- The array_map function caused BE to crash. #52909
- Metadata Cache issues caused BE to crash. #52968
- Routine Load tasks were canceled due to expired transactions. (Now tasks are canceled only if the database or table no longer exists). #50334
- Stream Load failures when submitted using HTTP 1.0. #53010 #53008
- Issues related to Glue and S3 integration: #48433
- Some error messages did not display the root cause.
- Error messages for writing to a Hive partitioned table with the partition column of type STRING when Glue was used as the metadata service.
- Dropping Hive tables failed without proper error messages when the user lacked sufficient permissions.
- The storage_cooldown_time property for materialized views did not take effect when set to maximum. #52079
3.1.16
Release date: December 16, 2024
Improvements
- Optimized table-related statistics. #50316
Bug Fixes
Fixed the following issues:
- Insufficient granularity in error code handling for disk full scenarios caused the BE to mistakenly identify disk errors and delete data. #51411
- Stream Load failures when submitted using HTTP 1.0. #53010 #53008
- Routine Load tasks were canceled due to expired transactions (now tasks are canceled only if the database or table no longer exists and paused when transactions expired). #50334
- Unloading data using EXPORT with Broker to file:// resulted in a file rename error, causing the export to fail. #52544
- If the join condition in an equi-join is an expression based on a low-cardinality column, the system may incorrectly push down a Runtime Filter predicate, leading to a BE crash. #50690
3.3.7
Release date: November 29, 2024
New Features
- Added a new Materialized View parameter, excluded_refresh_tables, to exclude tables that need to be refreshed. #50926
Improvements
- Rewrote unnest(bitmap_to_array) as unnest_bitmap to improve performance. #52870
- Reduced the write and delete operations of Txn logs. #42542
Bug Fixes
Fixed the following issues:
- Failure to connect Power BI to external tables. #52977
- Misleading FE Thrift RPC failure messages in logs. #52706
- Routine Load tasks were canceled due to expired transactions (now tasks are canceled only if the database or table no longer exists). #50334
- Stream Load failures when submitted using HTTP 1.0. #53010 #53008
- Integer overflow of partition IDs. #52965
- Hive Text Reader failed to recognize the last empty element. #52990
- Issues caused by array_map in Join conditions. #52911
- Metadata cache issues under high concurrency scenarios. #52968
- The whole materialized view was refreshed when a partition was dropped from the base table. #52740
3.3.6
Release date: November 18, 2024
Improvements
- Optimized internal repair logic for Primary Key tables. #52707
- Optimized the internal implementation of histograms of statistics. #52400
- Supports adjusting log level via the FE configuration item sys_log_warn_modules to reduce Hudi Catalog logging. #52709
- Supports constant folding in the yearweek function. #52714
- Avoided push-down for Lambda functions. #52655
- Divided the Query Error metric into three: Internal Error Rate, Analysis Error Rate, and Timeout Rate. #52646
- Avoided constant expressions being extracted as common expressions within array_map. #52541
- Optimized the Text-based Rewrite of materialized views. #52498
Bug Fixes
Fixed the following issues:
- The unique_constraints and foreign_constraints parameters were incomplete in SHOW CREATE TABLE for cloud-native tables in shared-data clusters. #52804
- Some materialized views were activated even when enable_mv_automatic_active_check was set to false. #52799
- Memory usage did not decrease after stale memory flush. #52613
- Resource leak caused by Hudi file-system views. #52738
- Concurrent Publish and Update operations on Primary Key tables may cause issues. #52687
- Failures to terminate queries on clients. #52185
- Multi-column List partitions cannot be pushed down. #51036
- Incorrect result due to the lack of hasnull property in ORC files. #52555
- An issue caused by using uppercase column names in ORDER BY during table creation. #52513
- An error was returned after running ALTER TABLE PARTITION (*) SET ("storage_cooldown_ttl" = "xxx"). #52482
Behavior Changes
- In earlier versions, scale-in operations would fail if there were insufficient replicas for views in the _statistics_ database. Starting from v3.3.6, if nodes are scaled in to 3 or more, view replicas are set to 3; if there is only 1 node after the scale-in, view replicas are set to 1, allowing for successful scale-in. #51799
  Affected views include:
  - column_statistics
  - histogram_statistics
  - table_statistic_v1
  - external_column_statistics
  - external_histogram_statistics
  - pipe_file_list
  - loads_history
  - task_run_history
- New Primary Key tables no longer allow __op as a column name, even if allow_system_reserved_names is set to true. Existing tables are unaffected. #52621
- Expression-partitioned tables cannot have partition names modified. #52557
- Deprecated FE parameters heartbeat_mgr_blocking_queue_size and profile_process_threads_num. #52236
- Enabled persistent index on object storage by default for Primary Key tables in shared-data clusters. #52209
- Disallowed manual changes to bucketing methods for tables with the random bucketing method. #52120
- Backup and Restore-related parameter changes: #52111
  - make_snapshot_worker_count supports dynamic configuration.
  - release_snapshot_worker_count supports dynamic configuration.
  - upload_worker_count supports dynamic configuration. Its default value is changed from 1 to the number of CPU cores on the machine where the BE resides.
  - download_worker_count supports dynamic configuration. Its default value is changed from 1 to the number of CPU cores on the machine where the BE resides.
- The return type of SELECT @@autocommit has changed from BOOLEAN to BIGINT. #51946
- Added a new FE configuration item, max_bucket_number_per_partition, to control the maximum number of buckets per partition. #47852
- Enabled memory usage checks by default for Primary Key tables. #52393
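The replica rule for the _statistics_ views stated in the first item of this list can be restated as a small function. This is only a restatement of the two cases the note gives, with a hypothetical function name; the note does not say what happens for other node counts, so the sketch returns None for those rather than guessing.

```python
def statistics_view_replicas(nodes_after_scale_in):
    """Replica count for the affected views after a scale-in (v3.3.6 and later).

    Covers only the cases stated in the release note.
    """
    if nodes_after_scale_in >= 3:
        return 3
    if nodes_after_scale_in == 1:
        return 1
    return None  # not specified in the release note


replicas_for_five_nodes = statistics_view_replicas(5)
replicas_for_one_node = statistics_view_replicas(1)
```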