Add save_checkpoint arg for TIMM training to simplify validation (#1701)
Co-authored-by: Jimin Ha <[email protected]>
Co-authored-by: regisss <[email protected]>
3 people authored Jan 23, 2025
1 parent 626551b commit 66bc191
Showing 3 changed files with 38 additions and 102 deletions.
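In short, both TIMM training scripts (train_hpu_lazy.py and train_hpu_graph.py) gain an opt-in `--save_checkpoint` flag, and the `utils.CheckpointSaver` is only constructed when the flag is passed, so quick validation runs no longer accumulate checkpoint files. A minimal usage sketch, reusing the README's running example below (the dataset and model are illustrative, not mandated by the change):

```bash
# Graph-mode fine-tuning that also writes per-epoch checkpoints.
# Omitting --save_checkpoint (the new default) skips CheckpointSaver entirely.
python train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --dataset-download \
    --save_checkpoint
```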
82 changes: 2 additions & 80 deletions examples/pytorch-image-models/README.md
@@ -16,20 +16,7 @@ limitations under the License.
 
 # pyTorch-IMage-Models (TIMM) Examples with HPUs
 
-This directory contains the scripts that showcases how to inference/fine-tune the TIMM models on intel's HPUs with the lazy/graph modes. We support the trainging for single/multiple HPU cards both two. Currently we support several most downloadable models from Hugging Face as below list.
-
-- [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k)
-- [timm/resnet18.a1_in1k](https://huggingface.co/timm/resnet18.a1_in1k)
-- [timm/resnet18.fb_swsl_ig1b_ft_in1k](https://huggingface.co/timm/resnet18.fb_swsl_ig1b_ft_in1k)
-- [timm/wide_resnet50_2.racm_in1k](https://huggingface.co/timm/wide_resnet50_2.racm_in1k)
-- [timm/efficientnet_b3.ra2_in1k](https://huggingface.co/timm/efficientnet_b3.ra2_in1k)
-- [timm/efficientnet_lite0.ra_in1k](https://huggingface.co/timm/efficientnet_lite0.ra_in1k)
-- [timm/efficientnet_b0.ra_in1k](https://huggingface.co/timm/efficientnet_b0.ra_in1k)
-- [timm/nf_regnet_b1.ra2_in1k](https://huggingface.co/timm/nf_regnet_b1.ra2_in1k)
-- [timm/mobilenetv3_large_100.ra_in1k](https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k)
-- [timm/tf_mobilenetv3_large_minimal_100.in1k](https://huggingface.co/timm/tf_mobilenetv3_large_minimal_100.in1k)
-- [timm/vit_base_patch16_224.augreg2_in21k_ft_in1k](https://huggingface.co/timm/vit_base_patch16_224.augreg2_in21k_ft_in1k)
-- [timm/vgg19.tv_in1k](https://huggingface.co/timm/vgg19.tv_in1k)
+This directory contains scripts that showcase how to run inference with or fine-tune TIMM models on Intel HPUs in lazy or graph mode. Training is supported on both single and multiple HPU cards. Currently the 10 most-downloaded models from [Hugging Face timm](https://huggingface.co/timm) are supported. The inference and training examples below use [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) as the test model; usage is the same for the other models.
 
 ## Requirements
 
@@ -46,20 +33,6 @@ pip install .
 
 Here we show how to fine-tune the [imagenette2-320 dataset](https://huggingface.co/datasets/johnowhitaker/imagenette2-320) and model with [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) from Hugging Face.
 
-### Training with HPU lazy mode
-
-```bash
-python train_hpu_lazy.py \
---data-dir ./ \
---dataset hfds/johnowhitaker/imagenette2-320 \
---device 'hpu' \
---model resnet50.a1_in1k \
---train-split train \
---val-split train \
---dataset-download
-```
-
-python train_hpu_lazy.py --data-dir='./' --dataset hfds/johnowhitaker/imagenette2-320 --device='hpu' --model resnet50.a1_in1k
 ### Training with HPU graph mode
 
 ```bash
@@ -70,41 +43,13 @@ python train_hpu_graph.py \
 --model resnet50.a1_in1k \
 --train-split train \
 --val-split train \
---dataset-download
+--dataset-download
 ```
 
-Here the results for lazy mode is shown below for example:
-
-```bash
-Train: 0 [ 0/73 ( 1%)] Loss: 6.86 (6.86) Time: 9.575s, 13.37/s (9.575s, 13.37/s) LR: 1.000e-05 Data: 0.844 (0.844)
-Train: 0 [ 50/73 ( 70%)] Loss: 6.77 (6.83) Time: 0.320s, 400.32/s (0.470s, 272.39/s) LR: 1.000e-05 Data: 0.217 (0.047)
-Test: [ 0/30] Time: 6.593 (6.593) Loss: 6.723 ( 6.723) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.000)
-Test: [ 30/30] Time: 3.856 (0.732) Loss: 6.615 ( 6.691) Acc@1: 0.000 ( 0.076) Acc@5: 1.176 ( 3.287)
-
-Train: 1 [ 0/73 ( 1%)] Loss: 6.69 (6.69) Time: 0.796s, 160.74/s (0.796s, 160.74/s) LR: 1.001e-02 Data: 0.685 (0.685)
-Train: 1 [ 50/73 ( 70%)] Loss: 3.23 (3.76) Time: 0.160s, 798.85/s (0.148s, 863.22/s) LR: 1.001e-02 Data: 0.053 (0.051)
-Test: [ 0/30] Time: 0.663 (0.663) Loss: 1.926 ( 1.926) Acc@1: 46.094 ( 46.094) Acc@5: 85.938 ( 85.938)
-Test: [ 30/30] Time: 0.022 (0.126) Loss: 1.462 ( 1.867) Acc@1: 63.529 ( 39.261) Acc@5: 83.529 ( 85.096)
-
-```
-
-
 ## Multi-HPU training
 
 Here we show how to fine-tune the [imagenette2-320 dataset](https://huggingface.co/datasets/johnowhitaker/imagenette2-320) and model with [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) from Hugging Face.
 
-### Training with HPU lazy mode
-```bash
-torchrun --nnodes 1 --nproc_per_node 2 \
-train_hpu_lazy.py \
---data-dir ./ \
---dataset hfds/johnowhitaker/imagenette2-320 \
---device 'hpu' \
---model resnet50.a1_in1k \
---train-split train \
---val-split train \
---dataset-download
-```
 ### Training with HPU graph mode
 
 ```bash
@@ -119,20 +64,6 @@ torchrun --nnodes 1 --nproc_per_node 2 \
 --dataset-download
 ```
 
-Here the results for lazy mode is shown below for example:
-
-```bash
-Train: 0 [ 0/36 ( 3%)] Loss: 6.88 (6.88) Time: 10.036s, 25.51/s (10.036s, 25.51/s) LR: 1.000e-05 Data: 0.762 (0.762)
-Test: [ 0/15] Time: 7.796 (7.796) Loss: 6.915 ( 6.915) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.000)
-Test: [ 15/15] Time: 1.915 (1.263) Loss: 6.847 ( 6.818) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.688)
-
-Train: 1 [ 0/36 ( 3%)] Loss: 6.84 (6.84) Time: 6.687s, 38.28/s (6.687s, 38.28/s) LR: 2.001e-02 Data: 0.701 (0.701)
-Test: [ 0/15] Time: 1.315 (1.315) Loss: 2.463 ( 2.463) Acc@1: 14.062 ( 14.062) Acc@5: 48.828 ( 48.828)
-Test: [ 15/15] Time: 0.020 (0.180) Loss: 1.812 ( 1.982) Acc@1: 52.326 ( 32.934) Acc@5: 66.279 ( 75.064)
-
-```
-
-
 
 ## Single-HPU inference
 
@@ -149,15 +80,6 @@ python inference.py \
 --graph_mode
 ```
 
-### HPU with lazy mode
-```bash
-python inference.py \
---data-dir='./' \
---dataset hfds/johnowhitaker/imagenette2-320 \
---device='hpu' \
---model resnet50.a1_in1k \
---split train
-```
 
 
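The new flag should compose with the multi-card launch shown above in the same way, since torchrun passes the remaining arguments through to the training script; a sketch under that assumption:

```bash
# Hypothetical multi-HPU graph-mode run with per-epoch checkpointing enabled.
torchrun --nnodes 1 --nproc_per_node 2 \
    train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --dataset-download \
    --save_checkpoint
```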
29 changes: 18 additions & 11 deletions examples/pytorch-image-models/train_hpu_graph.py
@@ -136,6 +136,12 @@
     metavar="PATH",
     help="Load this checkpoint into model after initialization (default: none)",
 )
+group.add_argument(
+    "--save_checkpoint",
+    action="store_true",
+    default=False,
+    help="saving checkpoint for each epoch",
+)
 group.add_argument(
     "--resume",
     default="",
@@ -1048,17 +1054,18 @@ def main():
             ]
         )
         output_dir = utils.get_outdir(args.output if args.output else "./output/train", exp_name)
-        saver = utils.CheckpointSaver(
-            model=model,
-            optimizer=optimizer,
-            args=args,
-            model_ema=model_ema,
-            amp_scaler=loss_scaler,
-            checkpoint_dir=output_dir,
-            recovery_dir=output_dir,
-            decreasing=decreasing_metric,
-            max_history=args.checkpoint_hist,
-        )
+        if args.save_checkpoint:
+            saver = utils.CheckpointSaver(
+                model=model,
+                optimizer=optimizer,
+                args=args,
+                model_ema=model_ema,
+                amp_scaler=loss_scaler,
+                checkpoint_dir=output_dir,
+                recovery_dir=output_dir,
+                decreasing=decreasing_metric,
+                max_history=args.checkpoint_hist,
+            )
         with open(os.path.join(output_dir, "args.yaml"), "w") as f:
             f.write(args_text)
 
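A checkpoint written by a `--save_checkpoint` run can presumably be fed back into training through the option whose metavar and help text appear just above the new argument; the sketch below assumes that option is timm's usual `--initial-checkpoint`, and that `utils.CheckpointSaver` keeps timm's default file naming (`model_best.pth.tar`) under the `./output/train/<exp_name>` directory created by `utils.get_outdir`:

```bash
# Hypothetical follow-up run: initialize the model from a previously saved
# checkpoint. Flag name, path, and file name are assumptions based on
# upstream timm defaults, not confirmed by this diff.
python train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --initial-checkpoint ./output/train/<exp_name>/model_best.pth.tar
```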
29 changes: 18 additions & 11 deletions examples/pytorch-image-models/train_hpu_lazy.py
@@ -138,6 +138,12 @@
     metavar="PATH",
     help="Load this checkpoint into model after initialization (default: none)",
 )
+group.add_argument(
+    "--save_checkpoint",
+    action="store_true",
+    default=False,
+    help="saving checkpoint for each epoch",
+)
 group.add_argument(
     "--resume",
     default="",
@@ -1047,17 +1053,18 @@ def main():
             ]
         )
         output_dir = utils.get_outdir(args.output if args.output else "./output/train", exp_name)
-        saver = utils.CheckpointSaver(
-            model=model,
-            optimizer=optimizer,
-            args=args,
-            model_ema=model_ema,
-            amp_scaler=loss_scaler,
-            checkpoint_dir=output_dir,
-            recovery_dir=output_dir,
-            decreasing=decreasing_metric,
-            max_history=args.checkpoint_hist,
-        )
+        if args.save_checkpoint:
+            saver = utils.CheckpointSaver(
+                model=model,
+                optimizer=optimizer,
+                args=args,
+                model_ema=model_ema,
+                amp_scaler=loss_scaler,
+                checkpoint_dir=output_dir,
+                recovery_dir=output_dir,
+                decreasing=decreasing_metric,
+                max_history=args.checkpoint_hist,
+            )
 with open(os.path.join(output_dir, "args.yaml"), "w") as f:
            f.write(args_text)
 
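train_hpu_lazy.py receives the identical change, so checkpointing is opt-in for lazy mode as well; a sketch mirroring the lazy-mode command that the README change above removed:

```bash
# Hypothetical lazy-mode run with per-epoch checkpointing enabled.
python train_hpu_lazy.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --dataset-download \
    --save_checkpoint
```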
