[SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution #49678

Open · wants to merge 4 commits into base: master
Conversation

@BOOTMGR BOOTMGR commented Jan 26, 2025

What changes were proposed in this pull request?

Wrap `Dataset.rdd` inside `withNewRDDExecutionId`, which performs important setup tasks, such as copying Spark properties into the SparkContext's thread-locals, before executing the SparkPlan to fetch data. This also makes it possible to track any prerequisite stages (shuffle, scan, etc.) for generating the RDD in the Spark UI.
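The intent can be sketched in plain Scala (this is a hypothetical model, not Spark's actual internals): an execution-scoped wrapper that makes properties visible via a thread-local for the duration of the body, then restores the previous state.

```scala
// Hypothetical sketch of what an execution-scoped wrapper like
// withNewRDDExecutionId provides: thread-local properties are set before the
// "plan" runs and restored afterwards. Names here are illustrative only.
object TrackedExecutionSketch {
  private val localProps = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // Runs `body` with `props` merged into the thread-local scope, then restores.
  def withNewRDDExecutionId[T](props: Map[String, String])(body: => T): T = {
    val saved = localProps.get()
    localProps.set(saved ++ props)
    try body finally localProps.set(saved)
  }

  def currentProps: Map[String, String] = localProps.get()

  def main(args: Array[String]): Unit = {
    val seen = withNewRDDExecutionId(Map("spark.sql.caseSensitive" -> "true")) {
      currentProps("spark.sql.caseSensitive") // visible inside the scope
    }
    println(seen)                  // true
    println(currentProps.isEmpty)  // true: restored after the scope ends
  }
}
```

The key property is the `try`/`finally` restore: the tracked scope cannot leak settings into unrelated work on the same thread.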

Why are the changes needed?

When a Dataset is converted into an RDD, the SparkPlan is executed without any execution context. This leads to:

  1. No tracking is available in the Spark UI for the stages needed to build the RDD.
  2. Thread-local Spark properties may not be set in the RDD execution context, so they are not sent with the TaskContext, even though some operations, such as reading Parquet files, depend on them (e.g., case sensitivity).
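Point 2 can be illustrated without Spark at all: a `ThreadLocal` set on one thread is simply invisible to another thread, which is analogous to session-local SQL properties never reaching the TaskContext on executors. (Plain-Scala illustration; not Spark code.)

```scala
// Minimal illustration of the failure mode: a ThreadLocal set on the
// "driver" thread is invisible to a fresh worker thread, analogous to
// session properties that are never propagated into the TaskContext.
object ThreadLocalLossDemo {
  val caseSensitive = new ThreadLocal[String]

  // Returns (value seen on the setting thread, value seen on a worker thread).
  def run(): (String, String) = {
    caseSensitive.set("true")
    var seenOnWorker: String = "unset"
    val worker = new Thread(() => { seenOnWorker = caseSensitive.get() })
    worker.start(); worker.join()
    (caseSensitive.get(), seenOnWorker)
  }

  def main(args: Array[String]): Unit = {
    val (onDriver, onWorker) = run()
    println(onDriver) // true -- visible on the thread that set it
    println(onWorker) // null -- lost across threads
  }
}
```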

Test scenario:

test("SPARK-50994: RDD conversion is performed with execution context") {
  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
      withTempDir { dir =>
        val dummyDF = Seq((1, 1.0), (2, 2.0), (3, 3.0), (1, 1.0)).toDF("a", "A")
        dummyDF.write.format("parquet").mode("overwrite").save(dir.getCanonicalPath)

        val df = spark.read.parquet(dir.getCanonicalPath)
        val encoder = ExpressionEncoder(df.schema)
        val deduplicated = df.dropDuplicates(Array("a"))
        val rdd = deduplicated.flatMap(row => Seq(row))(encoder).rdd

        val output = spark.createDataFrame(rdd, df.schema)
        checkAnswer(output, Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)))
      }
    }
  }
}

In the above scenario,

  • The call to `.rdd` triggers execution, which performs a shuffle after reading the Parquet files.
  • However, while reading the Parquet files, `spark.sql.caseSensitive` is not set in the SQLConf consulted by the parquet-mr reader (even though it was set on the session).
  • This leads to an unexpected, wrong result from `dropDuplicates`, since it may deduplicate by either `a` or `A`. The expectation is to deduplicate by column `a` only.
  • This behaviour does not apply to the vectorized Parquet reader, which reads the case-sensitivity flag from the Hadoop configuration; that reader is therefore disabled in the test.
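Why case sensitivity matters here can be shown with a toy column-resolution sketch (plain Scala, not Spark's analyzer): with a schema containing both `a` and `A`, a case-insensitive lookup cannot distinguish the two columns.

```scala
// Toy sketch of column resolution against the test schema ("a", "A"):
// case-insensitive matching conflates the two columns, which is why the
// dedup key can silently resolve to the wrong one.
object ColumnResolutionDemo {
  val schema = Seq("a", "A") // duplicate names modulo case, as in the test DF

  def resolve(name: String, caseSensitive: Boolean): Seq[String] =
    if (caseSensitive) schema.filter(_ == name)
    else schema.filter(_.equalsIgnoreCase(name))

  def main(args: Array[String]): Unit = {
    println(resolve("a", caseSensitive = true))  // List(a)    -- unambiguous
    println(resolve("a", caseSensitive = false)) // List(a, A) -- ambiguous
  }
}
```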

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing testcases & new test case added for specific scenario

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 26, 2025
@BOOTMGR BOOTMGR changed the title Perform RDD conversion under tracked execution SPARK-50994: Perform RDD conversion under tracked execution Jan 26, 2025
@BOOTMGR BOOTMGR changed the title SPARK-50994: Perform RDD conversion under tracked execution [SPARK-50994][SQL] Perform RDD conversion under tracked execution Jan 26, 2025
Correct, because `checkAnswer` in the test case calls `rdd.count()`, which is now a tracked operation that invokes the Spark event listener
@BOOTMGR BOOTMGR changed the title [SPARK-50994][SQL] Perform RDD conversion under tracked execution [SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution Jan 26, 2025
@BOOTMGR commented Jan 26, 2025

Marking WIP; this will require some more work around event listeners and observables due to the exposure of RDD stages.

`materializedRdd` is the actual holder, which is initialized on demand by operations like `.rdd`, `foreachPartition`, etc.
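The on-demand holder described above can be sketched with a `lazy val` (hypothetical names; a `Seq` stands in for the RDD so the sketch runs without Spark):

```scala
// Hypothetical model of the materializedRdd pattern: the underlying
// collection is computed at most once, on first access, and shared by
// every operation that needs it (.rdd, foreachPartition, ...).
class DatasetSketch[T](compute: () => Seq[T]) {
  // materializedRdd analogue: built lazily, cached after first use.
  lazy val materialized: Seq[T] = compute()

  def rdd: Seq[T] = materialized
  def foreachPartition(f: Seq[T] => Unit): Unit = f(materialized)
}
```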