This is more or less a continuation of the previous discussion about launching "large" datasets. In that discussion, I mentioned facing an unrelated error with KMeans and "GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry" with Linear Regression. I no longer get that unrelated error with KMeans, and I was able to use the GPU-accelerated KMeans algorithm for a while, but recently I have been getting either "GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry" or a CPU OoM when the KMeans workload in the spark-rapids-ml benchmark reaches the collect stage (after the fitting stages) on a ~30GB dataset. The executor node has 1TB of RAM. I am posting this here instead of filing an issue because I suspect it is caused by my configuration.
The only things I changed recently are migrating from HDFS to MinIO (I had tried to migrate to Spark on K8s, but was unsuccessful because the executors could not be launched properly), changing parts of my configuration, and changing how I created my dataset. I was able to confirm that the MinIO migration is not the cause, since I get the same error with HDFS.
As for the configuration changes, I scaled the number of cores and the amount of memory per executor with the number of GPUs so that each run would use as much memory and as many cores as possible, for example 450GB of executor memory and 450GB of pinned memory (I tried lowering the pinned memory to something like 128GB, since I was worried the pinned pool was too large, but it still does not work). I tried switching back to my previous configuration and to the configuration found in the spark-rapids-ml benchmark, but I am still getting GPU (or sometimes CPU) OoM. I also decreased the number of cores, the target batch size, the executor memory, and the pinned memory as much as possible, and separately tried increasing the executor and pinned memory to see if that would get rid of the GPU OoM, but the error is the same. As an aside, I remember that when I previously disabled GPU-accelerated SQL, the fit and transform stages would still run through cuML, but now cuML no longer seems to be used when it is disabled, since I don't see "Invoking cuml fit" or similar messages anymore.
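For concreteness, the kind of settings I have been varying look roughly like the PySpark sketch below. The property names are the standard Spark and spark-rapids ones referred to above (executor memory/cores, pinned pool size, target batch size, GPU amounts, and the GPU SQL toggle), but the values are placeholders rather than my exact configuration, and the RAPIDS plugin jar is assumed to already be on the classpath:

from pyspark.sql import SparkSession

# A minimal sketch only -- the values below are illustrative, not my exact settings,
# and the spark-rapids plugin jar is assumed to already be on the driver/executor classpath.
spark = (
    SparkSession.builder
    .appName("kmeans-oom-repro")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Executor sizing (I have tried everything from the benchmark defaults up to ~450g).
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "128g")
    # spark-rapids memory knobs I have been raising and lowering.
    .config("spark.rapids.memory.pinnedPool.size", "8g")
    .config("spark.rapids.sql.batchSizeBytes", "536870912")  # target batch size, 512 MiB here
    .config("spark.rapids.sql.concurrentGpuTasks", "2")
    # GPU scheduling.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # Toggling this off is what used to still run fit/transform through cuML.
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)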
The last thing I changed is how I created my dataset. I made a subset of the data by copying the first n snappy parquet files from an existing parquet directory into a new directory. I was worried this might be the reason it does not work, but when I disable GPU-accelerated SQL the Spark application can process the dataset just fine, and the GPU-accelerated application can still run the fitting stages, so I don't think this is the cause.
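For reference, the subset was created along these lines (a minimal sketch; the paths and the file count n are placeholders, and I am showing local paths here even though the real data sits in HDFS/MinIO):

import shutil
from pathlib import Path

# Hypothetical paths -- replace with the real source and destination directories.
src_dir = Path("/data/full_dataset_parquet")
dst_dir = Path("/data/subset_parquet")
n = 100  # number of snappy parquet part files to keep

dst_dir.mkdir(parents=True, exist_ok=True)

# Copy the first n part files (sorted by name) into the new directory.
for part_file in sorted(src_dir.glob("*.snappy.parquet"))[:n]:
    shutil.copy2(part_file, dst_dir / part_file.name)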
Interestingly enough, with my previous configuration (not the default configuration), the workload succeeds if I use more GPUs but fails if I use fewer GPUs. I am including the results for the default configuration, since it was much faster than using more cores and memory.
Default configuration in the spark-rapids-ml benchmark (note: the snapshot JAR is used here, but I also tried the release JAR to no avail)
Resource Profile Id 0:
    Executor Reqs: cores: [amount: 4], memory: [amount: 131072], offHeap: [amount: 0], gpu: [amount: 1]
    Task Reqs: cpus: [amount: 1.0], gpu: [amount: 0.25]
Resource Profile Id 1:
    Executor Reqs:
    Task Reqs: cpus: [amount: 4.0], gpu: [amount: 1.0]
STDERR of executor
Spark Executor Command: "/usr/lib/jvm/temurin-17-jdk-amd64/bin/java" "-cp" "/home/ysan/fr/spark-3.5//conf/:/home/ysan/fr/spark-3.5/assembly/target/scala-2.12/jars/*:/home/ysan/fr/hadoop-3.3/etc/hadoop/" "-Xmx131072M" "-Dspark.network.timeout=10000001s" "-Dspark.history.ui.port=18080" "-Dspark.driver.port=42825" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@master:42825" "--executor-id" "0" "--hostname" "EXECUTOR_IP" "--cores" "4" "--app-id" "app-20240624133942-0007" "--worker-url" "spark://Worker@EXECUTOR_IP:40347" "--resourceProfileId" "0" "--resourcesFile" "/home/ysan/fr/spark-3.5/work/app-20240624133942-0007/0/resource-executor-14229378749067763550.json"
========================================
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
INFO: Process 3551247 found CUDA visible device(s): 0
2024-06-24 04:48:54,753 - spark_rapids_ml.clustering.KMeans - INFO - Loading data into python worker memory
2024-06-24 04:54:04,904 - spark_rapids_ml.clustering.KMeans - INFO - Initializing cuml context
2024-06-24 04:54:06,435 - spark_rapids_ml.clustering.KMeans - INFO - Invoking cuml fit
1039452704
I have overwritten the driver STDERR from the previous application, but it is mostly the same; I have only included the logs from just before the fitting stage finishes.
STDERR of driver
24/06/24 04:59:23 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:24 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:25 INFO BarrierCoordinator: Current barrier epoch for Stage 6 (Attempt 0) is 1.
24/06/24 04:59:25 INFO BarrierCoordinator: Barrier sync epoch 1 from Stage 6 (Attempt 0) received update from Task 4, current progress: 1/1.
24/06/24 04:59:25 INFO BarrierCoordinator: Barrier sync epoch 1 from Stage 6 (Attempt 0) received all updates from tasks, finished successfully.
24/06/24 04:59:25 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:26 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:27 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:28 INFO BlockManagerInfo: Added taskresult_4 in memory on 115.145.178.219:41087 (size: 25.9 MiB, free: 56.7 GiB)
24/06/24 04:59:28 INFO TransportClientFactory: Successfully created connection to /115.145.178.219:41087 after 2 ms (0 ms spent in bootstraps)
24/06/24 04:59:28 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 4) in 645595 ms on 115.145.178.219 (executor 0) (1/1)
24/06/24 04:59:28 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
24/06/24 04:59:28 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 42175
24/06/24 04:59:28 INFO DAGScheduler: ResultStage 6 (fit at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:167) finished in 645.616 s
24/06/24 04:59:28 INFO DAGScheduler: Job 3 is finished. Cancelling potential speculative or zombie tasks for this job
24/06/24 04:59:28 INFO TaskSchedulerImpl: Killing all running tasks in stage 6: Stage finished
24/06/24 04:59:28 INFO DAGScheduler: Job 3 finished: fit at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:167, took 645.637668 s
24/06/24 04:59:28 INFO BlockManagerInfo: Removed taskresult_4 on 115.145.178.219:41087 in memory (size: 25.9 MiB, free: 56.7 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 248.0 B, free 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 4.0 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on master:45345 (size: 4.0 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece1 stored as bytes in memory (estimated size 4.0 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece1 in memory on master:45345 (size: 4.0 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece2 stored as bytes in memory (estimated size 4.0 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece2 in memory on master:45345 (size: 4.0 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece3 stored as bytes in memory (estimated size 3.5 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece3 in memory on master:45345 (size: 3.5 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO SparkContext: Created broadcast 10 from broadcast at NativeMethodAccessorImpl.java:0
24/06/24 04:59:29 WARN GpuOverrides:
!Exec cannot run on GPU because the Exec InMemoryTableScanExec has been disabled, and is disabled by default because there could be complications when using it with AQE with Spark-3.5.0 and Spark-3.5.1. For more details please check NVIDIA/spark-rapids#10603. Set spark.rapids.sql.exec.InMemoryTableScanExec to true if you wish to enable it
@expression feature_array#0 could run on GPU
24/06/24 04:59:29 INFO GpuOverrides: Plan conversion to the GPU took 15.39 ms
24/06/24 04:59:29 INFO GpuOverrides: GPU plan transition optimization took 5.68 ms
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 21.5 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.5 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on master:45345 (size: 2.5 KiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO SparkContext: Created broadcast 11 from collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173
24/06/24 04:59:29 INFO SparkContext: Starting job: collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173
24/06/24 04:59:29 INFO DAGScheduler: Registering RDD 64 (collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173) as input to shuffle 1
24/06/24 04:59:29 INFO DAGScheduler: Got job 4 (collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173) with 1 output partitions
24/06/24 04:59:29 INFO DAGScheduler: Final stage: ResultStage 9 (collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173)
24/06/24 04:59:29 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 8)
24/06/24 04:59:29 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 8)
24/06/24 04:59:29 INFO DAGScheduler: Submitting ShuffleMapStage 8 (MapPartitionsRDD[64] at collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173), which has no missing parents
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 45.8 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 20.8 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on master:45345 (size: 20.8 KiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:1585
24/06/24 04:59:29 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[64] at collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173) (first 15 tasks are for partitions Vector(0))
24/06/24 04:59:29 INFO TaskSchedulerImpl: Adding task set 8.0 with 1 tasks resource profile 0
24/06/24 04:59:29 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 5) (115.145.178.219, executor 0, partition 0, PROCESS_LOCAL, 7817 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on 115.145.178.219:41087 (size: 20.8 KiB, free: 56.7 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 115.145.178.219:41087 (size: 2.5 KiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece1 in memory on 115.145.178.219:41087 (size: 4.0 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on 115.145.178.219:41087 (size: 4.0 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece2 in memory on 115.145.178.219:41087 (size: 4.0 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece3 in memory on 115.145.178.219:41087 (size: 3.5 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_python on disk on 115.145.178.219:41087 (size: 25.8 MiB)
24/06/24 05:05:28 WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 5) (115.145.178.219 executor 0): com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:164)
at com.nvidia.spark.rapids.GpuBatchUtils$.concatSpillBatchesAndClose(GpuBatchUtils.scala:195)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$concatInputBatch$1(BatchGroupUtils.scala:458)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:66)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.concatInputBatch(BatchGroupUtils.scala:429)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$next$10(BatchGroupUtils.scala:420)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:416)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:395)
at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:200)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:199)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:314)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.AbstractProjectSplitIterator.hasNext(basicPhysicalOperators.scala:233)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:751)
at scala.Option.getOrElse(Option.scala:189)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
at scala.Option.map(Option.scala:230)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
24/06/24 05:05:28 INFO TaskSetManager: Starting task 0.1 in stage 8.0 (TID 6) (115.145.178.219, executor 0, partition 0, PROCESS_LOCAL, 7817 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_7_piece0 on master:45345 in memory (size: 2.5 KiB, free: 76.6 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_9_piece0 on master:45345 in memory (size: 20.3 KiB, free: 76.6 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_9_piece0 on 115.145.178.219:41087 in memory (size: 20.3 KiB, free: 56.7 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_5_piece0 on master:45345 in memory (size: 2.5 KiB, free: 76.6 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_5_piece0 on 115.145.178.219:41087 in memory (size: 2.5 KiB, free: 56.7 GiB)
24/06/24 05:11:19 WARN TaskSetManager: Lost task 0.1 in stage 8.0 (TID 6) (115.145.178.219 executor 0): com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:164)
at com.nvidia.spark.rapids.GpuBatchUtils$.concatSpillBatchesAndClose(GpuBatchUtils.scala:195)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$concatInputBatch$1(BatchGroupUtils.scala:458)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:66)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.concatInputBatch(BatchGroupUtils.scala:429)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$next$10(BatchGroupUtils.scala:420)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:416)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:395)
at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:200)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:199)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:314)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.AbstractProjectSplitIterator.hasNext(basicPhysicalOperators.scala:233)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:751)
at scala.Option.getOrElse(Option.scala:189)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
at scala.Option.map(Option.scala:230)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
24/06/24 05:11:19 INFO TaskSetManager: Starting task 0.2 in stage 8.0 (TID 7) (115.145.178.219, executor 0, partition 0, PROCESS_LOCAL, 7817 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])