This is more or less a continuation of the previous discussion about launching "large" datasets. In that discussion, I mentioned facing an unrelated error with KMeans and "GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry" with Linear Regression. I no longer get that unrelated error with KMeans, and I was able to use the GPU-accelerated KMeans algorithm for a while, but recently I have been getting either "GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry" or a CPU OoM when the KMeans workload in the spark-rapids-ml benchmark reaches the collect stage (after the fitting stages) on a ~30GB dataset. The executor node has 1TB of RAM. I am posting this here instead of filing an issue because I suspect it is caused by my configuration.
The only things I changed recently are migrating from HDFS to MinIO (I had tried to migrate to Spark on K8s, but was unsuccessful because the executors could not be launched properly), changing parts of my configuration, and changing how I created my dataset. I was able to confirm that the MinIO migration is not the cause, since I get the same error with HDFS.
As for the configuration changes, I scaled the number of cores and the amount of memory per executor with the number of GPUs so that each run would use as much memory and as many cores as possible, for example 450GB of executor memory and 450GB of pinned memory (I tried lowering the pinned memory to something like 128GB, since I was worried the pinned pool was too large, but it still does not work). I tried switching back to my previous configuration and to the configuration found in the spark-rapids-ml benchmark, but I am still getting GPU (or sometimes CPU) OoM. I also decreased the number of cores, the target batch size, the executor memory, and the pinned memory as much as possible, and separately tried increasing the executor and pinned memory to see if that would get rid of the GPU OoM, but the error is the same. As an aside, I remember that when I previously disabled GPU-accelerated SQL, the fit and transform stages would still run through cuML, but now cuML no longer seems to be used when it is disabled, since I don't see "Invoking cuml fit" or similar messages anymore.
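For concreteness, the kind of settings I have been varying look roughly like the PySpark sketch below. The property names are the standard Spark and spark-rapids ones referred to above (executor memory/cores, pinned pool size, target batch size, GPU amounts, and the GPU SQL toggle), but the values are placeholders rather than my exact configuration, and the RAPIDS plugin jar is assumed to already be on the classpath:

from pyspark.sql import SparkSession

# A minimal sketch only -- the values below are illustrative, not my exact settings,
# and the spark-rapids plugin jar is assumed to already be on the driver/executor classpath.
spark = (
    SparkSession.builder
    .appName("kmeans-oom-repro")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Executor sizing (I have tried everything from the benchmark defaults up to ~450g).
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "128g")
    # spark-rapids memory knobs I have been raising and lowering.
    .config("spark.rapids.memory.pinnedPool.size", "8g")
    .config("spark.rapids.sql.batchSizeBytes", "536870912")  # target batch size, 512 MiB here
    .config("spark.rapids.sql.concurrentGpuTasks", "2")
    # GPU scheduling.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # Toggling this off is what used to still run fit/transform through cuML.
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)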
The last thing I changed is how I created my dataset. I made a subset of the data by copying the first n snappy parquet files from an existing parquet directory into a new directory. I was worried this might be the reason it does not work, but when I disable GPU-accelerated SQL the Spark application can process the dataset just fine, and the GPU-accelerated application can still run the fitting stages, so I don't think this is the cause.
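For reference, the subset was created along these lines (a minimal sketch; the paths and the file count n are placeholders, and I am showing local paths here even though the real data sits in HDFS/MinIO):

import shutil
from pathlib import Path

# Hypothetical paths -- replace with the real source and destination directories.
src_dir = Path("/data/full_dataset_parquet")
dst_dir = Path("/data/subset_parquet")
n = 100  # number of snappy parquet part files to keep

dst_dir.mkdir(parents=True, exist_ok=True)

# Copy the first n part files (sorted by name) into the new directory.
for part_file in sorted(src_dir.glob("*.snappy.parquet"))[:n]:
    shutil.copy2(part_file, dst_dir / part_file.name)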
Interestingly enough, with my previous configuration (not the default configuration), the workload succeeds if I use more GPUs but fails if I use fewer GPUs. I am including the results for the default configuration, since it was much faster than using more cores and memory.
Default configuration in the spark-rapids-ml benchmark (note: the snapshot JAR is used here, but I also tried the release JAR to no avail)
Resource Profile Id 0:
    Executor Reqs: cores: [amount: 4], memory: [amount: 131072], offHeap: [amount: 0], gpu: [amount: 1]
    Task Reqs: cpus: [amount: 1.0], gpu: [amount: 0.25]
Resource Profile Id 1:
    Executor Reqs:
    Task Reqs: cpus: [amount: 4.0], gpu: [amount: 1.0]
STDERR of executor
Spark Executor Command: "/usr/lib/jvm/temurin-17-jdk-amd64/bin/java" "-cp" "/home/ysan/fr/spark-3.5//conf/:/home/ysan/fr/spark-3.5/assembly/target/scala-2.12/jars/*:/home/ysan/fr/hadoop-3.3/etc/hadoop/" "-Xmx131072M" "-Dspark.network.timeout=10000001s" "-Dspark.history.ui.port=18080" "-Dspark.driver.port=42825" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "-Duser.timezone=UTC" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@master:42825" "--executor-id" "0" "--hostname" "EXECUTOR_IP" "--cores" "4" "--app-id" "app-20240624133942-0007" "--worker-url" "spark://Worker@EXECUTOR_IP:40347" "--resourceProfileId" "0" "--resourcesFile" "/home/ysan/fr/spark-3.5/work/app-20240624133942-0007/0/resource-executor-14229378749067763550.json"
========================================
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
INFO: Process 3551247 found CUDA visible device(s): 0
2024-06-24 04:48:54,753 - spark_rapids_ml.clustering.KMeans - INFO - Loading data into python worker memory
2024-06-24 04:54:04,904 - spark_rapids_ml.clustering.KMeans - INFO - Initializing cuml context
2024-06-24 04:54:06,435 - spark_rapids_ml.clustering.KMeans - INFO - Invoking cuml fit
1039452704
I have overwritten the driver STDERR from the previous application, but it is mostly the same; I have only included the logs from just before the fitting stage finishes.
STDERR of driver
24/06/24 04:59:23 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:24 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:25 INFO BarrierCoordinator: Current barrier epoch for Stage 6 (Attempt 0) is 1.
24/06/24 04:59:25 INFO BarrierCoordinator: Barrier sync epoch 1 from Stage 6 (Attempt 0) received update from Task 4, current progress: 1/1.
24/06/24 04:59:25 INFO BarrierCoordinator: Barrier sync epoch 1 from Stage 6 (Attempt 0) received all updates from tasks, finished successfully.
24/06/24 04:59:25 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:26 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:27 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 6 because the barrier taskSet requires 1 slots, while the total number of available slots is 0.
24/06/24 04:59:28 INFO BlockManagerInfo: Added taskresult_4 in memory on 115.145.178.219:41087 (size: 25.9 MiB, free: 56.7 GiB)
24/06/24 04:59:28 INFO TransportClientFactory: Successfully created connection to /115.145.178.219:41087 after 2 ms (0 ms spent in bootstraps)
24/06/24 04:59:28 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 4) in 645595 ms on 115.145.178.219 (executor 0) (1/1)
24/06/24 04:59:28 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
24/06/24 04:59:28 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 42175
24/06/24 04:59:28 INFO DAGScheduler: ResultStage 6 (fit at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:167) finished in 645.616 s
24/06/24 04:59:28 INFO DAGScheduler: Job 3 is finished. Cancelling potential speculative or zombie tasks for this job
24/06/24 04:59:28 INFO TaskSchedulerImpl: Killing all running tasks in stage 6: Stage finished
24/06/24 04:59:28 INFO DAGScheduler: Job 3 finished: fit at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:167, took 645.637668 s
24/06/24 04:59:28 INFO BlockManagerInfo: Removed taskresult_4 on 115.145.178.219:41087 in memory (size: 25.9 MiB, free: 56.7 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 248.0 B, free 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 4.0 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on master:45345 (size: 4.0 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece1 stored as bytes in memory (estimated size 4.0 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece1 in memory on master:45345 (size: 4.0 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece2 stored as bytes in memory (estimated size 4.0 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece2 in memory on master:45345 (size: 4.0 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_10_piece3 stored as bytes in memory (estimated size 3.5 MiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_10_piece3 in memory on master:45345 (size: 3.5 MiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO SparkContext: Created broadcast 10 from broadcast at NativeMethodAccessorImpl.java:0
24/06/24 04:59:29 WARN GpuOverrides:
!Exec cannot run on GPU because the Exec InMemoryTableScanExec has been disabled, and is disabled by default because there could be complications when using it with AQE with Spark-3.5.0 and Spark-3.5.1. For more details please check NVIDIA/spark-rapids#10603. Set spark.rapids.sql.exec.InMemoryTableScanExec to true if you wish to enable it
@expression feature_array#0 could run on GPU
24/06/24 04:59:29 INFO GpuOverrides: Plan conversion to the GPU took 15.39 ms
24/06/24 04:59:29 INFO GpuOverrides: GPU plan transition optimization took 5.68 ms
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 21.5 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.5 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on master:45345 (size: 2.5 KiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO SparkContext: Created broadcast 11 from collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173
24/06/24 04:59:29 INFO SparkContext: Starting job: collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173
24/06/24 04:59:29 INFO DAGScheduler: Registering RDD 64 (collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173) as input to shuffle 1
24/06/24 04:59:29 INFO DAGScheduler: Got job 4 (collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173) with 1 output partitions
24/06/24 04:59:29 INFO DAGScheduler: Final stage: ResultStage 9 (collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173)
24/06/24 04:59:29 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 8)
24/06/24 04:59:29 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 8)
24/06/24 04:59:29 INFO DAGScheduler: Submitting ShuffleMapStage 8 (MapPartitionsRDD[64] at collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173), which has no missing parents
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 45.8 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 20.8 KiB, free 76.6 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on master:45345 (size: 20.8 KiB, free: 76.6 GiB)
24/06/24 04:59:29 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:1585
24/06/24 04:59:29 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[64] at collect at /home/ysan/spark_test/spark-rapids-ml-24.04/python/benchmark/benchmark/bench_kmeans.py:173) (first 15 tasks are for partitions Vector(0))
24/06/24 04:59:29 INFO TaskSchedulerImpl: Adding task set 8.0 with 1 tasks resource profile 0
24/06/24 04:59:29 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 5) (115.145.178.219, executor 0, partition 0, PROCESS_LOCAL, 7817 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on 115.145.178.219:41087 (size: 20.8 KiB, free: 56.7 GiB)
24/06/24 04:59:29 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 115.145.178.219:41087 (size: 2.5 KiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece1 in memory on 115.145.178.219:41087 (size: 4.0 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on 115.145.178.219:41087 (size: 4.0 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece2 in memory on 115.145.178.219:41087 (size: 4.0 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_piece3 in memory on 115.145.178.219:41087 (size: 3.5 MiB, free: 56.7 GiB)
24/06/24 04:59:30 INFO BlockManagerInfo: Added broadcast_10_python on disk on 115.145.178.219:41087 (size: 25.8 MiB)
24/06/24 05:05:28 WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 5) (115.145.178.219 executor 0): com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:164)
at com.nvidia.spark.rapids.GpuBatchUtils$.concatSpillBatchesAndClose(GpuBatchUtils.scala:195)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$concatInputBatch$1(BatchGroupUtils.scala:458)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:66)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.concatInputBatch(BatchGroupUtils.scala:429)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$next$10(BatchGroupUtils.scala:420)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:416)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:395)
at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:200)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:199)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:314)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.AbstractProjectSplitIterator.hasNext(basicPhysicalOperators.scala:233)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:751)
at scala.Option.getOrElse(Option.scala:189)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
at scala.Option.map(Option.scala:230)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
24/06/24 05:05:28 INFO TaskSetManager: Starting task 0.1 in stage 8.0 (TID 6) (115.145.178.219, executor 0, partition 0, PROCESS_LOCAL, 7817 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_7_piece0 on master:45345 in memory (size: 2.5 KiB, free: 76.6 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_9_piece0 on master:45345 in memory (size: 20.3 KiB, free: 76.6 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_9_piece0 on 115.145.178.219:41087 in memory (size: 20.3 KiB, free: 56.7 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_5_piece0 on master:45345 in memory (size: 2.5 KiB, free: 76.6 GiB)
24/06/24 05:09:42 INFO BlockManagerInfo: Removed broadcast_5_piece0 on 115.145.178.219:41087 in memory (size: 2.5 KiB, free: 56.7 GiB)
24/06/24 05:11:19 WARN TaskSetManager: Lost task 0.1 in stage 8.0 (TID 6) (115.145.178.219 executor 0): com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:164)
at com.nvidia.spark.rapids.GpuBatchUtils$.concatSpillBatchesAndClose(GpuBatchUtils.scala:195)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$concatInputBatch$1(BatchGroupUtils.scala:458)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:66)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.concatInputBatch(BatchGroupUtils.scala:429)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.$anonfun$next$10(BatchGroupUtils.scala:420)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:416)
at org.apache.spark.sql.rapids.execution.python.CombiningIterator.next(BatchGroupUtils.scala:395)
at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:200)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:199)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:314)
at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.AbstractProjectSplitIterator.hasNext(basicPhysicalOperators.scala:233)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:751)
at scala.Option.getOrElse(Option.scala:189)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
at scala.Option.map(Option.scala:230)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
24/06/24 05:11:19 INFO TaskSetManager: Starting task 0.2 in stage 8.0 (TID 7) (115.145.178.219, executor 0, partition 0, PROCESS_LOCAL, 7817 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])