Enable Memory Profiling

Launch a training job with the following command (alternatively, set the same options in the TOML config file):

CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder memory_snapshot

--profiling.enable_memory_snapshot: enables memory profiling.
--profiling.save_memory_snapshot_folder: sets the folder that memory snapshots are dumped into (./outputs/memory_snapshot/ by default).
- If the job runs out of memory (OOM), the snapshots are saved in ./outputs/memory_snapshot/iteration_x_exit.
- Regular snapshots (taken every profiling.profile_freq iterations) are saved in ./outputs/memory_snapshot/iteration_x.
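
To make the folder layout above concrete, here is a minimal sketch that lists the snapshot folders under a snapshot root and flags the ones written on OOM. It assumes that `iteration_x` and `iteration_x_exit` are directories and that `x` is an integer step number; the helper name `list_snapshot_dirs` is hypothetical and not part of the training code.

```python
import re
from pathlib import Path

# Assumed naming: "iteration_12" (regular) and "iteration_12_exit" (written on OOM).
_ITER_DIR = re.compile(r"^iteration_(\d+)(_exit)?$")


def list_snapshot_dirs(snapshot_root):
    """Return (iteration, is_oom_exit, path) tuples sorted by iteration.

    Hypothetical helper: it only inspects directory names and does not
    open any snapshot file.
    """
    found = []
    for child in Path(snapshot_root).iterdir():
        match = _ITER_DIR.match(child.name)
        if child.is_dir() and match:
            is_oom_exit = match.group(2) is not None
            found.append((int(match.group(1)), is_oom_exit, child))
    # Sort numerically so iteration_2 comes before iteration_10.
    return sorted(found, key=lambda item: (item[0], item[1]))


if __name__ == "__main__":
    root = Path("./outputs/memory_snapshot")  # default folder from the text above
    if root.is_dir():
        for iteration, is_oom_exit, path in list_snapshot_dirs(root):
            kind = "OOM exit" if is_oom_exit else "regular"
            print(f"iteration {iteration:>6} ({kind}): {path}")
```

Sorting on the parsed integer rather than the folder name keeps the listing in training order, which makes it easier to find the last regular snapshot before an OOM.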

You can find the saved pickle files in your output folder. To visualize a snapshot file, drag and drop it onto https://pytorch.org/memory_viz. To learn more about memory profiling, please visit this tutorial.
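
Before uploading a file to the visualizer, it can be useful to confirm that it loads at all. The sketch below assumes a snapshot is an ordinary Python pickle whose top-level object is a dictionary; it makes no assumption about specific key names, and `summarize_snapshot` is a hypothetical helper rather than part of any library.

```python
import pickle
from pathlib import Path


def summarize_snapshot(pickle_path):
    """Load one snapshot pickle and describe its top-level structure.

    Returns a dict mapping each top-level key to (type name, length or None).
    Only unpickle files you produced yourself: pickle can run code on load.
    """
    with Path(pickle_path).open("rb") as f:
        snapshot = pickle.load(f)

    # Assumption: the snapshot is a dict. Otherwise report only its type.
    if not isinstance(snapshot, dict):
        return {"type": type(snapshot).__name__}

    summary = {}
    for key, value in snapshot.items():
        size = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, size)
    return summary
```

Calling `summarize_snapshot` on a file path returns a small dictionary you can print; an empty or truncated file raises an exception at load time, which is a quick signal that the dump did not complete.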
