Add save_checkpoint arg for TIMM training to simplify validation (#1701)
Co-authored-by: Jimin Ha <[email protected]>
Co-authored-by: regisss <[email protected]>
3 people authored Jan 23, 2025
1 parent 626551b commit 66bc191
Showing 3 changed files with 38 additions and 102 deletions.
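In short, both TIMM training scripts (train_hpu_lazy.py and train_hpu_graph.py) gain an opt-in `--save_checkpoint` flag, and the `utils.CheckpointSaver` is only constructed when the flag is passed, so quick validation runs no longer accumulate checkpoint files. A minimal usage sketch, reusing the README's running example below (the dataset and model are illustrative, not mandated by the change):

```bash
# Graph-mode fine-tuning that also writes per-epoch checkpoints.
# Omitting --save_checkpoint (the new default) skips CheckpointSaver entirely.
python train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --dataset-download \
    --save_checkpoint
```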
82 changes: 2 additions & 80 deletions examples/pytorch-image-models/README.md
@@ -16,20 +16,7 @@ limitations under the License.
 
 # pyTorch-IMage-Models (TIMM) Examples with HPUs
 
-This directory contains the scripts that showcases how to inference/fine-tune the TIMM models on intel's HPUs with the lazy/graph modes. We support the trainging for single/multiple HPU cards both two. Currently we support several most downloadable models from Hugging Face as below list.
-
-- [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k)
-- [timm/resnet18.a1_in1k](https://huggingface.co/timm/resnet18.a1_in1k)
-- [timm/resnet18.fb_swsl_ig1b_ft_in1k](https://huggingface.co/timm/resnet18.fb_swsl_ig1b_ft_in1k)
-- [timm/wide_resnet50_2.racm_in1k](https://huggingface.co/timm/wide_resnet50_2.racm_in1k)
-- [timm/efficientnet_b3.ra2_in1k](https://huggingface.co/timm/efficientnet_b3.ra2_in1k)
-- [timm/efficientnet_lite0.ra_in1k](https://huggingface.co/timm/efficientnet_lite0.ra_in1k)
-- [timm/efficientnet_b0.ra_in1k](https://huggingface.co/timm/efficientnet_b0.ra_in1k)
-- [timm/nf_regnet_b1.ra2_in1k](https://huggingface.co/timm/nf_regnet_b1.ra2_in1k)
-- [timm/mobilenetv3_large_100.ra_in1k](https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k)
-- [timm/tf_mobilenetv3_large_minimal_100.in1k](https://huggingface.co/timm/tf_mobilenetv3_large_minimal_100.in1k)
-- [timm/vit_base_patch16_224.augreg2_in21k_ft_in1k](https://huggingface.co/timm/vit_base_patch16_224.augreg2_in21k_ft_in1k)
-- [timm/vgg19.tv_in1k](https://huggingface.co/timm/vgg19.tv_in1k)
+This directory contains scripts that showcase how to run inference with or fine-tune TIMM models on Intel HPUs in lazy or graph mode. Training is supported on both single and multiple HPU cards. Currently the 10 most-downloaded models from [Hugging Face timm](https://huggingface.co/timm) are supported. The inference and training examples below use [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) as the test model; usage is the same for the other models.
 
 ## Requirements
 
@@ -46,20 +33,6 @@ pip install .
 
 Here we show how to fine-tune the [imagenette2-320 dataset](https://huggingface.co/datasets/johnowhitaker/imagenette2-320) and model with [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) from Hugging Face.
 
-### Training with HPU lazy mode
-
-```bash
-python train_hpu_lazy.py \
---data-dir ./ \
---dataset hfds/johnowhitaker/imagenette2-320 \
---device 'hpu' \
---model resnet50.a1_in1k \
---train-split train \
---val-split train \
---dataset-download
-```
-
-python train_hpu_lazy.py --data-dir='./' --dataset hfds/johnowhitaker/imagenette2-320 --device='hpu' --model resnet50.a1_in1k
 ### Training with HPU graph mode
 
 ```bash
@@ -70,41 +43,13 @@ python train_hpu_graph.py \
 --model resnet50.a1_in1k \
 --train-split train \
 --val-split train \
---dataset-download
+--dataset-download
 ```
 
-Here the results for lazy mode is shown below for example:
-
-```bash
-Train: 0 [ 0/73 ( 1%)] Loss: 6.86 (6.86) Time: 9.575s, 13.37/s (9.575s, 13.37/s) LR: 1.000e-05 Data: 0.844 (0.844)
-Train: 0 [ 50/73 ( 70%)] Loss: 6.77 (6.83) Time: 0.320s, 400.32/s (0.470s, 272.39/s) LR: 1.000e-05 Data: 0.217 (0.047)
-Test: [ 0/30] Time: 6.593 (6.593) Loss: 6.723 ( 6.723) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.000)
-Test: [ 30/30] Time: 3.856 (0.732) Loss: 6.615 ( 6.691) Acc@1: 0.000 ( 0.076) Acc@5: 1.176 ( 3.287)
-
-Train: 1 [ 0/73 ( 1%)] Loss: 6.69 (6.69) Time: 0.796s, 160.74/s (0.796s, 160.74/s) LR: 1.001e-02 Data: 0.685 (0.685)
-Train: 1 [ 50/73 ( 70%)] Loss: 3.23 (3.76) Time: 0.160s, 798.85/s (0.148s, 863.22/s) LR: 1.001e-02 Data: 0.053 (0.051)
-Test: [ 0/30] Time: 0.663 (0.663) Loss: 1.926 ( 1.926) Acc@1: 46.094 ( 46.094) Acc@5: 85.938 ( 85.938)
-Test: [ 30/30] Time: 0.022 (0.126) Loss: 1.462 ( 1.867) Acc@1: 63.529 ( 39.261) Acc@5: 83.529 ( 85.096)
-
-```
-
-
 ## Multi-HPU training
 
 Here we show how to fine-tune the [imagenette2-320 dataset](https://huggingface.co/datasets/johnowhitaker/imagenette2-320) and model with [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) from Hugging Face.
 
-### Training with HPU lazy mode
-```bash
-torchrun --nnodes 1 --nproc_per_node 2 \
-train_hpu_lazy.py \
---data-dir ./ \
---dataset hfds/johnowhitaker/imagenette2-320 \
---device 'hpu' \
---model resnet50.a1_in1k \
---train-split train \
---val-split train \
---dataset-download
-```
 ### Training with HPU graph mode
 
 ```bash
@@ -119,20 +64,6 @@ torchrun --nnodes 1 --nproc_per_node 2 \
 --dataset-download
 ```
 
-Here the results for lazy mode is shown below for example:
-
-```bash
-Train: 0 [ 0/36 ( 3%)] Loss: 6.88 (6.88) Time: 10.036s, 25.51/s (10.036s, 25.51/s) LR: 1.000e-05 Data: 0.762 (0.762)
-Test: [ 0/15] Time: 7.796 (7.796) Loss: 6.915 ( 6.915) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.000)
-Test: [ 15/15] Time: 1.915 (1.263) Loss: 6.847 ( 6.818) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.688)
-
-Train: 1 [ 0/36 ( 3%)] Loss: 6.84 (6.84) Time: 6.687s, 38.28/s (6.687s, 38.28/s) LR: 2.001e-02 Data: 0.701 (0.701)
-Test: [ 0/15] Time: 1.315 (1.315) Loss: 2.463 ( 2.463) Acc@1: 14.062 ( 14.062) Acc@5: 48.828 ( 48.828)
-Test: [ 15/15] Time: 0.020 (0.180) Loss: 1.812 ( 1.982) Acc@1: 52.326 ( 32.934) Acc@5: 66.279 ( 75.064)
-
-```
-
-
 
 ## Single-HPU inference
 
@@ -149,15 +80,6 @@ python inference.py \
 --graph_mode
 ```
 
-### HPU with lazy mode
-```bash
-python inference.py \
---data-dir='./' \
---dataset hfds/johnowhitaker/imagenette2-320 \
---device='hpu' \
---model resnet50.a1_in1k \
---split train
-```
 
 
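The new flag should compose with the multi-card launch shown above in the same way, since torchrun passes the remaining arguments through to the training script; a sketch under that assumption:

```bash
# Hypothetical multi-HPU graph-mode run with per-epoch checkpointing enabled.
torchrun --nnodes 1 --nproc_per_node 2 \
    train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --dataset-download \
    --save_checkpoint
```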
29 changes: 18 additions & 11 deletions examples/pytorch-image-models/train_hpu_graph.py
@@ -136,6 +136,12 @@
     metavar="PATH",
     help="Load this checkpoint into model after initialization (default: none)",
 )
+group.add_argument(
+    "--save_checkpoint",
+    action="store_true",
+    default=False,
+    help="saving checkpoint for each epoch",
+)
 group.add_argument(
     "--resume",
     default="",
@@ -1048,17 +1054,18 @@ def main():
             ]
         )
         output_dir = utils.get_outdir(args.output if args.output else "./output/train", exp_name)
-        saver = utils.CheckpointSaver(
-            model=model,
-            optimizer=optimizer,
-            args=args,
-            model_ema=model_ema,
-            amp_scaler=loss_scaler,
-            checkpoint_dir=output_dir,
-            recovery_dir=output_dir,
-            decreasing=decreasing_metric,
-            max_history=args.checkpoint_hist,
-        )
+        if args.save_checkpoint:
+            saver = utils.CheckpointSaver(
+                model=model,
+                optimizer=optimizer,
+                args=args,
+                model_ema=model_ema,
+                amp_scaler=loss_scaler,
+                checkpoint_dir=output_dir,
+                recovery_dir=output_dir,
+                decreasing=decreasing_metric,
+                max_history=args.checkpoint_hist,
+            )
         with open(os.path.join(output_dir, "args.yaml"), "w") as f:
             f.write(args_text)
 
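A checkpoint written by a `--save_checkpoint` run can presumably be fed back into training through the option whose metavar and help text appear just above the new argument; the sketch below assumes that option is timm's usual `--initial-checkpoint`, and that `utils.CheckpointSaver` keeps timm's default file naming (`model_best.pth.tar`) under the `./output/train/<exp_name>` directory created by `utils.get_outdir`:

```bash
# Hypothetical follow-up run: initialize the model from a previously saved
# checkpoint. Flag name, path, and file name are assumptions based on
# upstream timm defaults, not confirmed by this diff.
python train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --initial-checkpoint ./output/train/<exp_name>/model_best.pth.tar
```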
29 changes: 18 additions & 11 deletions examples/pytorch-image-models/train_hpu_lazy.py
@@ -138,6 +138,12 @@
     metavar="PATH",
     help="Load this checkpoint into model after initialization (default: none)",
 )
+group.add_argument(
+    "--save_checkpoint",
+    action="store_true",
+    default=False,
+    help="saving checkpoint for each epoch",
+)
 group.add_argument(
     "--resume",
     default="",
@@ -1047,17 +1053,18 @@ def main():
             ]
         )
         output_dir = utils.get_outdir(args.output if args.output else "./output/train", exp_name)
-        saver = utils.CheckpointSaver(
-            model=model,
-            optimizer=optimizer,
-            args=args,
-            model_ema=model_ema,
-            amp_scaler=loss_scaler,
-            checkpoint_dir=output_dir,
-            recovery_dir=output_dir,
-            decreasing=decreasing_metric,
-            max_history=args.checkpoint_hist,
-        )
+        if args.save_checkpoint:
+            saver = utils.CheckpointSaver(
+                model=model,
+                optimizer=optimizer,
+                args=args,
+                model_ema=model_ema,
+                amp_scaler=loss_scaler,
+                checkpoint_dir=output_dir,
+                recovery_dir=output_dir,
+                decreasing=decreasing_metric,
+                max_history=args.checkpoint_hist,
+            )
 with open(os.path.join(output_dir, "args.yaml"), "w") as f:
            f.write(args_text)
 
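train_hpu_lazy.py receives the identical change, so checkpointing is opt-in for lazy mode as well; a sketch mirroring the lazy-mode command that the README change above removed:

```bash
# Hypothetical lazy-mode run with per-epoch checkpointing enabled.
python train_hpu_lazy.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --dataset-download \
    --save_checkpoint
```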
