
Issue with Reading LZ4-Compressed Parquet File Using Spark 3.5 + Blaze #771

Open
merrily01 opened this issue Jan 17, 2025 · 0 comments

Describe the bug

Reading an LZ4-compressed Parquet file with Spark 3.5 + Blaze fails with a native execution error when Blaze is enabled. The same read succeeds when Blaze is disabled.

To Reproduce
Steps to reproduce the behavior:

  1. The LZ4-compressed Parquet file that reproduces the issue is attached:
    part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet.txt

    Note: Remove the “.txt” suffix to restore the original Parquet file name before proceeding.

  2. Upload the LZ4-compressed Parquet file from step 1 to HDFS.

  3. Launch spark-shell with Spark 3.5 + Blaze.

  4. Enable the Blaze switch and read the Parquet file from step 1. The query fails with the following error:

scala> spark.conf.set("spark.blaze.enable", true)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
25/01/17 17:01:31 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (tjtx16-35-27.58os.org executor 2): java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[ParquetScan] error: Execution error: output_with_sender[ParquetScan]: output() returns error: Arrow error: External error: Arrow: Parquet argument error: External: the offset to copy is not contained in the decompressed buffer
	at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
	at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:95)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:143)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:662)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:682)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
  5. Disable the Blaze switch and read the same Parquet file. The query succeeds and displays the results:
scala> spark.conf.set("spark.blaze.enable", false)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|cp_catalog_page_sk|cp_catalog_page_id|cp_start_date_sk|cp_end_date_sk|cp_department|cp_catalog_number|cp_catalog_page_number|      cp_description|  cp_type|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|                 1|  AAAAAAAABAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     1|In general basic ...|bi-annual|
|                 2|  AAAAAAAACAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     2|English areas wil...|bi-annual|
|                 3|  AAAAAAAADAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     3|Times could not a...|bi-annual|
|                 4|  AAAAAAAAEAAAAAAA|         2450815|          NULL|         NULL|                1|                  NULL|                NULL|bi-annual|
|                 5|  AAAAAAAAFAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     5|Classic buildings...|bi-annual|
|                 6|  AAAAAAAAGAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     6|Exciting principl...|bi-annual|
|                 7|  AAAAAAAAHAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     7|National services...|bi-annual|
|                 8|  AAAAAAAAIAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     8|Areas see early f...|bi-annual|
|                 9|  AAAAAAAAJAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     9|Intensive, econom...|bi-annual|
|                10|  AAAAAAAAKAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    10|Careful, intense ...|bi-annual|
|                11|  AAAAAAAALAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    11|At least national...|bi-annual|
|                12|  AAAAAAAAMAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    12|Girls indicate so...|bi-annual|
|                13|  AAAAAAAANAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    13|Miles see mainly ...|bi-annual|
|                14|  AAAAAAAAOAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    14|Rooms would say a...|bi-annual|
|                15|  AAAAAAAAPAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    15|Legal, required e...|bi-annual|
|                16|  AAAAAAAAABAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    16|Schools must know...|bi-annual|
|                17|  AAAAAAAABBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    17|More than true ca...|bi-annual|
|                18|  AAAAAAAACBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    18|Shops end problem...|bi-annual|
|                19|  AAAAAAAADBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    19|Poor, hostile gui...|bi-annual|
|                20|  AAAAAAAAEBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    20|Appropriate years...|bi-annual|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
only showing top 20 rows
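
Before uploading in step 2, it may be worth confirming that the attachment survived the rename in step 1 intact. The sketch below is one way to do that in plain Scala; `stripTxtSuffix` and `hasParquetMagic` are hypothetical helper names, not part of Spark or Blaze. It relies only on the fact that a Parquet file begins and ends with the 4-byte magic `PAR1`, so it checks file integrity at the container level and says nothing about the compression codec.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper: drop the ".txt" suffix that was added so the
// file could be attached to the issue. Returns the restored path.
def stripTxtSuffix(attached: Path): Path = {
  val name = attached.getFileName.toString
  require(name.endsWith(".txt"), s"expected a .txt suffix: $name")
  val restored = attached.resolveSibling(name.stripSuffix(".txt"))
  Files.move(attached, restored, StandardCopyOption.REPLACE_EXISTING)
}

// Hypothetical helper: true if the file starts and ends with "PAR1".
def hasParquetMagic(file: Path): Boolean = {
  val magic = "PAR1".getBytes(StandardCharsets.US_ASCII)
  val raf = new RandomAccessFile(file.toFile, "r")
  try {
    if (raf.length() < 2L * magic.length) false
    else {
      val head = new Array[Byte](magic.length)
      val tail = new Array[Byte](magic.length)
      raf.readFully(head)                    // first 4 bytes
      raf.seek(raf.length() - magic.length)  // jump to the last 4 bytes
      raf.readFully(tail)
      head.sameElements(magic) && tail.sameElements(magic)
    }
  } finally raf.close()
}
```

Calling `stripTxtSuffix` on the downloaded attachment and then `hasParquetMagic` on the returned path should be enough to rule out a truncated or altered download as the cause of the failure in step 4.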

Expected behavior

  1. With the Blaze switch enabled, reading the Parquet file above fails and throws an error;
  2. With the Blaze switch disabled, reading the same file succeeds and displays the results.

Screenshots
Enable the Blaze switch:
Image

Disable the Blaze switch:

Image

Additional context

Spark version: 3.5
