[SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution #49678

Open · wants to merge 4 commits into base: master
Conversation

@BOOTMGR BOOTMGR commented Jan 26, 2025

What changes were proposed in this pull request?

Wrap `Dataset.rdd` inside `withNewRDDExecutionId`, which performs important setup tasks, such as copying Spark properties into the SparkContext's thread-locals, before executing the SparkPlan to fetch data. This also makes it possible to track any prerequisite stages (shuffle, scan, etc.) for generating the RDD in the Spark UI.
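The intent can be sketched in plain Scala (this is a hypothetical model, not Spark's actual internals): an execution-scoped wrapper that makes properties visible via a thread-local for the duration of the body, then restores the previous state.

```scala
// Hypothetical sketch of what an execution-scoped wrapper like
// withNewRDDExecutionId provides: thread-local properties are set before the
// "plan" runs and restored afterwards. Names here are illustrative only.
object TrackedExecutionSketch {
  private val localProps = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // Runs `body` with `props` merged into the thread-local scope, then restores.
  def withNewRDDExecutionId[T](props: Map[String, String])(body: => T): T = {
    val saved = localProps.get()
    localProps.set(saved ++ props)
    try body finally localProps.set(saved)
  }

  def currentProps: Map[String, String] = localProps.get()

  def main(args: Array[String]): Unit = {
    val seen = withNewRDDExecutionId(Map("spark.sql.caseSensitive" -> "true")) {
      currentProps("spark.sql.caseSensitive") // visible inside the scope
    }
    println(seen)                  // true
    println(currentProps.isEmpty)  // true: restored after the scope ends
  }
}
```

The key property is the `try`/`finally` restore: the tracked scope cannot leak settings into unrelated work on the same thread.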

Why are the changes needed?

When a Dataset is converted into an RDD, the SparkPlan is executed without any execution context. This leads to:

  1. No tracking is available in the Spark UI for the stages needed to build the RDD.
  2. Thread-local Spark properties may not be set in the RDD execution context, so they are not sent with the TaskContext, even though some operations, such as reading Parquet files, depend on them (e.g., case sensitivity).
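Point 2 can be illustrated without Spark at all: a `ThreadLocal` set on one thread is simply invisible to another thread, which is analogous to session-local SQL properties never reaching the TaskContext on executors. (Plain-Scala illustration; not Spark code.)

```scala
// Minimal illustration of the failure mode: a ThreadLocal set on the
// "driver" thread is invisible to a fresh worker thread, analogous to
// session properties that are never propagated into the TaskContext.
object ThreadLocalLossDemo {
  val caseSensitive = new ThreadLocal[String]

  // Returns (value seen on the setting thread, value seen on a worker thread).
  def run(): (String, String) = {
    caseSensitive.set("true")
    var seenOnWorker: String = "unset"
    val worker = new Thread(() => { seenOnWorker = caseSensitive.get() })
    worker.start(); worker.join()
    (caseSensitive.get(), seenOnWorker)
  }

  def main(args: Array[String]): Unit = {
    val (onDriver, onWorker) = run()
    println(onDriver) // true -- visible on the thread that set it
    println(onWorker) // null -- lost across threads
  }
}
```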

Test scenario:

test("SPARK-50994: RDD conversion is performed with execution context") {
  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
      withTempDir { dir =>
        val dummyDF = Seq((1, 1.0), (2, 2.0), (3, 3.0), (1, 1.0)).toDF("a", "A")
        dummyDF.write.format("parquet").mode("overwrite").save(dir.getCanonicalPath)

        val df = spark.read.parquet(dir.getCanonicalPath)
        val encoder = ExpressionEncoder(df.schema)
        val deduplicated = df.dropDuplicates(Array("a"))
        val rdd = deduplicated.flatMap(row => Seq(row))(encoder).rdd

        val output = spark.createDataFrame(rdd, df.schema)
        checkAnswer(output, Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)))
      }
    }
  }
}

In the above scenario,

  • The call to `.rdd` triggers execution, which performs a shuffle after reading the Parquet files.
  • However, while reading the Parquet files, `spark.sql.caseSensitive` is not set in the SQLConf consulted by the parquet-mr reader (even though it was set on the session).
  • This leads to an unexpected, wrong result from `dropDuplicates`, since it may deduplicate by either `a` or `A`. The expectation is to deduplicate by column `a` only.
  • This behaviour does not apply to the vectorized Parquet reader, which reads the case-sensitivity flag from the Hadoop configuration; that reader is therefore disabled in the test.
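Why case sensitivity matters here can be shown with a toy column-resolution sketch (plain Scala, not Spark's analyzer): with a schema containing both `a` and `A`, a case-insensitive lookup cannot distinguish the two columns.

```scala
// Toy sketch of column resolution against the test schema ("a", "A"):
// case-insensitive matching conflates the two columns, which is why the
// dedup key can silently resolve to the wrong one.
object ColumnResolutionDemo {
  val schema = Seq("a", "A") // duplicate names modulo case, as in the test DF

  def resolve(name: String, caseSensitive: Boolean): Seq[String] =
    if (caseSensitive) schema.filter(_ == name)
    else schema.filter(_.equalsIgnoreCase(name))

  def main(args: Array[String]): Unit = {
    println(resolve("a", caseSensitive = true))  // List(a)    -- unambiguous
    println(resolve("a", caseSensitive = false)) // List(a, A) -- ambiguous
  }
}
```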

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing testcases & new test case added for specific scenario

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 26, 2025
@BOOTMGR BOOTMGR changed the title Perform RDD conversion under tracked execution SPARK-50994: Perform RDD conversion under tracked execution Jan 26, 2025
@BOOTMGR BOOTMGR changed the title SPARK-50994: Perform RDD conversion under tracked execution [SPARK-50994][SQL] Perform RDD conversion under tracked execution Jan 26, 2025
Correct, because `checkAnswer` in the test case calls `rdd.count()`, which is now a tracked operation that invokes the Spark event listener
@BOOTMGR BOOTMGR changed the title [SPARK-50994][SQL] Perform RDD conversion under tracked execution [SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution Jan 26, 2025
@BOOTMGR commented Jan 26, 2025

Marking WIP; this will require some more work around event listeners and observables due to the exposure of RDD stages.

`materializedRdd` is the actual holder, which is initialized on demand by operations like `.rdd`, `foreachPartition`, etc.
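The on-demand holder described above can be sketched with a `lazy val` (hypothetical names; a `Seq` stands in for the RDD so the sketch runs without Spark):

```scala
// Hypothetical model of the materializedRdd pattern: the underlying
// collection is computed at most once, on first access, and shared by
// every operation that needs it (.rdd, foreachPartition, ...).
class DatasetSketch[T](compute: () => Seq[T]) {
  // materializedRdd analogue: built lazily, cached after first use.
  lazy val materialized: Seq[T] = compute()

  def rdd: Seq[T] = materialized
  def foreachPartition(f: Seq[T] => Unit): Unit = f(materialized)
}
```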