
[SPARK-47618][CORE] Use Magic Committer for all S3 buckets by default #45740

Draft
wants to merge 1 commit into master
Conversation

@dongjoon-hyun (Member) commented Mar 27, 2024

What changes were proposed in this pull request?

This PR aims to use Apache Hadoop Magic Committer for all S3 buckets by default in Apache Spark 4.0.0.
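Concretely (a sketch distilled from the diff discussed below; the real change also guards on the cloud committer module being present), the new default amounts to:

```scala
import org.apache.spark.SparkConf

// Sketch of the proposed default: turn on the S3A Magic Committer for
// every bucket via the "*" per-bucket wildcard key, unless the user has
// already configured it.
val conf = new SparkConf()
conf.setIfMissing("spark.hadoop.fs.s3a.bucket.*.committer.magic.enabled", "true")
```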

Why are the changes needed?

The Apache Hadoop Magic Committer has been used with S3 buckets to get the best performance since S3 became fully consistent on December 1st, 2020. As the AWS documentation puts it:

> Amazon S3 provides strong read-after-write consistency for PUT and DELETE requests of objects in your Amazon S3 bucket in all AWS Regions. This behavior applies to both writes to new objects as well as PUT requests that overwrite existing objects and DELETE requests. In addition, read operations on Amazon S3 Select, Amazon S3 access control lists (ACLs), Amazon S3 Object Tags, and object metadata (for example, the HEAD object) are strongly consistent.

Does this PR introduce any user-facing change?

Yes, the migration guide is updated.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member, Author):

Hi, do you have any concerns about using the S3 Magic Committer by default, @steveloughran?

@dongjoon-hyun (Member, Author):

Also, cc @viirya

@viirya (Member) left a comment:

It looks good to me, unless there are additional concerns from @steveloughran.

@dongjoon-hyun (Member, Author):

Thank you, @viirya.

@dongjoon-hyun (Member, Author):

All tests passed.

```
@@ -420,6 +420,14 @@ class SparkContext(config: SparkConf) extends Logging {
    // HADOOP-19097 Set fs.s3a.connection.establish.timeout to 30s
    // We can remove this after Apache Hadoop 3.4.1 releases
    conf.setIfMissing("spark.hadoop.fs.s3a.connection.establish.timeout", "30s")
    try {
      // Try to enable Magic Committer by default for all buckets
      Utils.classForName("org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
```
@yaooqinn (Member):

Use Utils.classIsLoadable?
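For context, a sketch of what the suggestion changes (as it would sit inside SparkContext, where conf and Utils are in scope; the throwing behavior of classForName is visible in the stack trace later in this thread): Utils.classForName throws ClassNotFoundException when the class is absent and therefore needs a try/catch, while Utils.classIsLoadable returns a Boolean and folds into a plain condition:

```scala
// Before: classForName throws if spark-hadoop-cloud is not on the classpath.
try {
  Utils.classForName("org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  conf.setIfMissing("spark.hadoop.fs.s3a.bucket.*.committer.magic.enabled", "true")
} catch {
  case _: ClassNotFoundException => // module absent: leave the config alone
}

// After: classIsLoadable probes for the class without throwing.
if (Utils.classIsLoadable("org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")) {
  conf.setIfMissing("spark.hadoop.fs.s3a.bucket.*.committer.magic.enabled", "true")
}
```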

@dongjoon-hyun (Member, Author):

Thank you, @yaooqinn.

@yaooqinn (Member):

LGTM from my side. Thank you, @dongjoon-hyun.

Comment on lines 424 to 429:

```scala
if (Utils.classIsLoadable("org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")) {
  conf.setIfMissing("spark.hadoop.fs.s3a.bucket.*.committer.magic.enabled", "true")
}
```
Member:

If the class is not there, will setting the config cause any problem?

@dongjoon-hyun (Member, Author):

Yes. At the first commit of this PR, CI showed the following:

```
[info] - SPARK-23731 plans should be canonicalizable after being (de)serialized *** FAILED *** (53 milliseconds)
[info]   java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
[info]   at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
[info]   at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
[info]   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info]   at java.base/java.lang.Class.forName0(Native Method)
[info]   at java.base/java.lang.Class.forName(Class.java:467)
[info]   at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
[info]   at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
[info]   at org.apache.spark.util.Utils$.classForName(Utils.scala:97)
[info]   at org.apache.spark.internal.io.FileCommitProtocol$.instantiate(FileCommitProtocol.scala:213)
[info]   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
```
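Reading the trace: FileCommitProtocol.instantiate reflectively loads whatever commit-protocol class the configuration names, so if the spark-hadoop-cloud jar is absent the load fails at write time. A simplified sketch of that path (not the actual Spark source; the real method takes more parameters):

```scala
// Simplified sketch of the reflective load shown in the trace:
// Utils.classForName throws ClassNotFoundException when the configured
// committer class is not on the classpath.
def instantiateSketch(className: String, jobId: String, outputPath: String): Any = {
  val clazz = Utils.classForName(className) // FileCommitProtocol.scala:213 in the trace
  clazz.getDeclaredConstructor(classOf[String], classOf[String])
    .newInstance(jobId, outputPath)
}
```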

@dongjoon-hyun (Member, Author):

Thank you for the review, @yaooqinn and @viirya.

I might have overlooked side-effect cases on other object storages. I'll convert this PR to a draft and test with Google Cloud Storage at least.

@dongjoon-hyun marked this pull request as draft on March 28, 2024 03:48
@steveloughran (Contributor):

I have no problems with the PR; we have made it the default in our releases.

This could be a good time to revisit why there's some separate PathOutputCommitter stuff at all; originally it existed because Spark built against releases without the new PathOutputCommitter interface. This no longer holds: could anything needed from it be pulled up into the main committer?

One recurrent trouble spot we have with committing work is Parquet; it requires all committers to be a subclass of ParquetOutputCommitter, hence the (ugly, brittle) wrapping stuff. Life would be a lot easier if Parquet didn't mind it being any PathOutputCommitter; it would just skip the schema writing.

Of course, we then come up against the fact that Parquet still wants to build against Hadoop 2.8. Everyone needs to move on, especially as Hadoop's Java 11+ support is 3.2.x+ only.
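For readers unfamiliar with the wrapping being described, a simplified sketch of the binding pattern (class name and structure assumed for illustration, not the actual BindingParquetOutputCommitter source): subclass ParquetOutputCommitter so Parquet's subclass check passes, and delegate the real work to whatever committer PathOutputCommitterFactory resolves for the destination.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.{PathOutputCommitter, PathOutputCommitterFactory}
import org.apache.parquet.hadoop.ParquetOutputCommitter

// Sketch: satisfies Parquet's "must be a ParquetOutputCommitter" check
// while forwarding all commit operations to an FS-specific committer.
class BindingCommitterSketch(path: Path, context: TaskAttemptContext)
  extends ParquetOutputCommitter(path, context) {

  private val delegate: PathOutputCommitter =
    PathOutputCommitterFactory
      .getCommitterFactory(path, context.getConfiguration)
      .createOutputCommitter(path, context)

  override def setupJob(jobContext: JobContext): Unit = delegate.setupJob(jobContext)
  override def setupTask(taskContext: TaskAttemptContext): Unit = delegate.setupTask(taskContext)
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean =
    delegate.needsTaskCommit(taskContext)
  override def commitTask(taskContext: TaskAttemptContext): Unit = delegate.commitTask(taskContext)
  override def commitJob(jobContext: JobContext): Unit = delegate.commitJob(jobContext)
  // Parquet's summary/schema writing is simply skipped, which is the point.
}
```

The real binding referenced in this thread lives in Spark's hadoop-cloud module as org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter.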

@dongjoon-hyun (Member, Author):

Thank you for your feedback, @steveloughran. Yeah, as you mentioned, this is blocked by exactly these two configurations:

```
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
```

@steveloughran (Contributor):

What code is doing the instanceof check? It's in Parquet, correct? Unless it's been told to save a summary, it shouldn't care...

@dongjoon-hyun (Member, Author):

Oh, I should be clearer.

Unlike the spark.hadoop.fs.s3a.* configurations, the following are applied globally to all filesystems. We cannot do that for non-S3A filesystems. That's the side effect I need to remove in this PR.

```
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
```
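To make the contrast concrete, a sketch (the bucket name is hypothetical): S3A keys can be scoped to one bucket or to all S3A buckets, while the two SQL keys apply to every output path regardless of scheme:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()

// S3A options can be scoped per bucket ("my-bucket" is hypothetical)...
conf.set("spark.hadoop.fs.s3a.bucket.my-bucket.committer.magic.enabled", "true")
// ...or to all S3A buckets, without touching hdfs://, gs://, or file://:
conf.set("spark.hadoop.fs.s3a.bucket.*.committer.magic.enabled", "true")

// These two, by contrast, are global: they take effect for every output
// path, whatever its filesystem, which is the side effect at issue here.
conf.set("spark.sql.parquet.output.committer.class",
  "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
conf.set("spark.sql.sources.commitProtocolClass",
  "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
```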

@steveloughran (Contributor):

So both of those bindings hand off to PathOutputCommitterFactory(), which looks for a committer factory in the config key mapreduce.outputcommitter.factory.class:

  • FileOutputCommitterFactory: the classic committer
  • NamedCommitterFactory: the class named in mapreduce.outputcommitter.named.classname

and then falls back to the per-scheme mapreduce.outputcommitter.factory.scheme.SCHEMA factory definition.

The idea being: you get an FS-specific committer unless you ask for something else.

  • The Parquet one is there because Parquet is fussy about its committer subclasses; that should be reviewed (where?)
  • And PathOutputCommitProtocol is probably surplus now that Spark can use PathOutputCommitter everywhere...
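A configuration sketch of that lookup chain (the key names are as listed above; the s3a factory value is Hadoop's stock S3ACommitterFactory, shown as an example, and com.example.MyCommitter is a hypothetical placeholder):

```
# 1) An explicit factory, if set, wins:
mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.NamedCommitterFactory
#    NamedCommitterFactory then instantiates exactly this class:
mapreduce.outputcommitter.named.classname=com.example.MyCommitter
# 2) Otherwise, fall back to the per-scheme factory for the destination, e.g. s3a://
mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
```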

github-actions bot:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@dongjoon-hyun (Member, Author):

I removed the Stale tag.

github-actions bot commented Nov 7, 2024:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
