Add save_checkpoint arg for TIMM training to simplify validation #1701

Open · wants to merge 6 commits into base: main
82 changes: 2 additions & 80 deletions examples/pytorch-image-models/README.md
@@ -16,20 +16,7 @@ limitations under the License.

# pyTorch-IMage-Models (TIMM) Examples with HPUs

This directory contains scripts that showcase how to run inference with and fine-tune TIMM models on Intel's HPUs in lazy and graph modes. Training is supported on both single and multiple HPU cards. Currently we support several of the most-downloaded models from Hugging Face, listed below:

- [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k)
- [timm/resnet18.a1_in1k](https://huggingface.co/timm/resnet18.a1_in1k)
- [timm/resnet18.fb_swsl_ig1b_ft_in1k](https://huggingface.co/timm/resnet18.fb_swsl_ig1b_ft_in1k)
- [timm/wide_resnet50_2.racm_in1k](https://huggingface.co/timm/wide_resnet50_2.racm_in1k)
- [timm/efficientnet_b3.ra2_in1k](https://huggingface.co/timm/efficientnet_b3.ra2_in1k)
- [timm/efficientnet_lite0.ra_in1k](https://huggingface.co/timm/efficientnet_lite0.ra_in1k)
- [timm/efficientnet_b0.ra_in1k](https://huggingface.co/timm/efficientnet_b0.ra_in1k)
- [timm/nf_regnet_b1.ra2_in1k](https://huggingface.co/timm/nf_regnet_b1.ra2_in1k)
- [timm/mobilenetv3_large_100.ra_in1k](https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k)
- [timm/tf_mobilenetv3_large_minimal_100.in1k](https://huggingface.co/timm/tf_mobilenetv3_large_minimal_100.in1k)
- [timm/vit_base_patch16_224.augreg2_in21k_ft_in1k](https://huggingface.co/timm/vit_base_patch16_224.augreg2_in21k_ft_in1k)
- [timm/vgg19.tv_in1k](https://huggingface.co/timm/vgg19.tv_in1k)
This directory contains scripts that showcase how to run inference with and fine-tune TIMM models on Intel's HPUs in lazy and graph modes. Training is supported on both single and multiple HPU cards. Currently we support the 10 most-downloaded models from [Hugging Face timm](https://huggingface.co/timm). The examples below use [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) for both inference and training; usage is the same for the other models.

## Requirements

@@ -46,20 +33,6 @@ pip install .

Here we show how to fine-tune [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) on the [imagenette2-320 dataset](https://huggingface.co/datasets/johnowhitaker/imagenette2-320) from Hugging Face.

### Training with HPU lazy mode

```bash
python train_hpu_lazy.py \
--data-dir ./ \
--dataset hfds/johnowhitaker/imagenette2-320 \
--device 'hpu' \
--model resnet50.a1_in1k \
--train-split train \
--val-split train \
--dataset-download
```
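
In lazy mode, operators accumulate into a graph that only executes when a mark step is reached, so the script presumably flushes the device once per iteration. A minimal sketch of that pattern, assuming a SynapseAI install provides the `habana_frameworks` bridge (`htcore.mark_step()` is Habana's documented lazy-mode flush):

```python
import timm
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge (assumes SynapseAI install)

# Build a pretrained TIMM model and move it to the HPU
model = timm.create_model("resnet50.a1_in1k", pretrained=True).to("hpu")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224).to("hpu")
labels = torch.randint(0, 1000, (8,)).to("hpu")

loss = criterion(model(images), labels)
loss.backward()
htcore.mark_step()  # lazy mode: execute the accumulated graph after backward
optimizer.step()
htcore.mark_step()  # and again after the optimizer update
```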

### Training with HPU graph mode

```bash
@@ -70,41 +43,13 @@ python train_hpu_graph.py \
--model resnet50.a1_in1k \
--train-split train \
--val-split train \
    --dataset-download
```

Example results for lazy mode are shown below:

```bash
Train: 0 [ 0/73 ( 1%)] Loss: 6.86 (6.86) Time: 9.575s, 13.37/s (9.575s, 13.37/s) LR: 1.000e-05 Data: 0.844 (0.844)
Train: 0 [ 50/73 ( 70%)] Loss: 6.77 (6.83) Time: 0.320s, 400.32/s (0.470s, 272.39/s) LR: 1.000e-05 Data: 0.217 (0.047)
Test: [ 0/30] Time: 6.593 (6.593) Loss: 6.723 ( 6.723) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.000)
Test: [ 30/30] Time: 3.856 (0.732) Loss: 6.615 ( 6.691) Acc@1: 0.000 ( 0.076) Acc@5: 1.176 ( 3.287)

Train: 1 [ 0/73 ( 1%)] Loss: 6.69 (6.69) Time: 0.796s, 160.74/s (0.796s, 160.74/s) LR: 1.001e-02 Data: 0.685 (0.685)
Train: 1 [ 50/73 ( 70%)] Loss: 3.23 (3.76) Time: 0.160s, 798.85/s (0.148s, 863.22/s) LR: 1.001e-02 Data: 0.053 (0.051)
Test: [ 0/30] Time: 0.663 (0.663) Loss: 1.926 ( 1.926) Acc@1: 46.094 ( 46.094) Acc@5: 85.938 ( 85.938)
Test: [ 30/30] Time: 0.022 (0.126) Loss: 1.462 ( 1.867) Acc@1: 63.529 ( 39.261) Acc@5: 83.529 ( 85.096)

```


## Multi-HPU training

Here we show how to fine-tune [timm/resnet50.a1_in1k](https://huggingface.co/timm/resnet50.a1_in1k) on the [imagenette2-320 dataset](https://huggingface.co/datasets/johnowhitaker/imagenette2-320) from Hugging Face.

### Training with HPU lazy mode
```bash
torchrun --nnodes 1 --nproc_per_node 2 \
train_hpu_lazy.py \
--data-dir ./ \
--dataset hfds/johnowhitaker/imagenette2-320 \
--device 'hpu' \
--model resnet50.a1_in1k \
--train-split train \
--val-split train \
--dataset-download
```
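
`torchrun` launches one worker per card and exports `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` to each of them; the scripts presumably initialize distributed training along these lines (a sketch, assuming the `hccl` backend registered by `habana_frameworks`):

```python
import os
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend (assumption)

def init_distributed():
    # torchrun exports these variables for every worker it spawns
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
    return rank, world_size

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"worker {rank}/{world_size} ready")
```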
### Training with HPU graph mode

```bash
@@ -119,20 +64,6 @@ torchrun --nnodes 1 --nproc_per_node 2 \
--dataset-download
```

Example results for lazy mode are shown below:

```bash
Train: 0 [ 0/36 ( 3%)] Loss: 6.88 (6.88) Time: 10.036s, 25.51/s (10.036s, 25.51/s) LR: 1.000e-05 Data: 0.762 (0.762)
Test: [ 0/15] Time: 7.796 (7.796) Loss: 6.915 ( 6.915) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.000)
Test: [ 15/15] Time: 1.915 (1.263) Loss: 6.847 ( 6.818) Acc@1: 0.000 ( 0.000) Acc@5: 0.000 ( 0.688)

Train: 1 [ 0/36 ( 3%)] Loss: 6.84 (6.84) Time: 6.687s, 38.28/s (6.687s, 38.28/s) LR: 2.001e-02 Data: 0.701 (0.701)
Test: [ 0/15] Time: 1.315 (1.315) Loss: 2.463 ( 2.463) Acc@1: 14.062 ( 14.062) Acc@5: 48.828 ( 48.828)
Test: [ 15/15] Time: 0.020 (0.180) Loss: 1.812 ( 1.982) Acc@1: 52.326 ( 32.934) Acc@5: 66.279 ( 75.064)

```



## Single-HPU inference

@@ -149,15 +80,6 @@ python inference.py \
--graph_mode
```
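
Graph mode records the model's forward pass once and replays it on later calls, avoiding per-op host overhead. A sketch of what `--graph_mode` presumably enables, assuming the `wrap_in_hpu_graph` helper from Habana's HPU graphs API:

```python
import timm
import torch
import habana_frameworks.torch.hpu.graphs as htgraphs  # assumption: SynapseAI install

model = timm.create_model("resnet50.a1_in1k", pretrained=True).to("hpu").eval()
model = htgraphs.wrap_in_hpu_graph(model)  # capture the forward graph once, replay afterwards

x = torch.randn(8, 3, 224, 224).to("hpu")
with torch.no_grad():
    out = model(x)  # first call records; later calls with the same shape replay the graph
print(out.shape)
```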

### HPU with lazy mode
```bash
python inference.py \
--data-dir='./' \
--dataset hfds/johnowhitaker/imagenette2-320 \
--device='hpu' \
--model resnet50.a1_in1k \
--split train
```



29 changes: 18 additions & 11 deletions examples/pytorch-image-models/train_hpu_graph.py
@@ -136,6 +136,12 @@
metavar="PATH",
help="Load this checkpoint into model after initialization (default: none)",
)
group.add_argument(
"--save_checkpoint",
action="store_true",
default=False,
help="saving checkpoint for each epoch",
)
group.add_argument(
"--resume",
default="",
@@ -1048,17 +1054,18 @@ def main():
]
)
output_dir = utils.get_outdir(args.output if args.output else "./output/train", exp_name)
saver = utils.CheckpointSaver(
model=model,
optimizer=optimizer,
args=args,
model_ema=model_ema,
amp_scaler=loss_scaler,
checkpoint_dir=output_dir,
recovery_dir=output_dir,
decreasing=decreasing_metric,
max_history=args.checkpoint_hist,
)
if args.save_checkpoint:
saver = utils.CheckpointSaver(
model=model,
optimizer=optimizer,
args=args,
model_ema=model_ema,
amp_scaler=loss_scaler,
checkpoint_dir=output_dir,
recovery_dir=output_dir,
decreasing=decreasing_metric,
max_history=args.checkpoint_hist,
)
with open(os.path.join(output_dir, "args.yaml"), "w") as f:
f.write(args_text)

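With this change, a training run only writes checkpoints when `--save_checkpoint` is passed (the same flag is added to `train_hpu_lazy.py` below), which keeps quick validation runs from filling the output directory. An illustrative invocation, reusing the arguments from the README:

```bash
python train_hpu_graph.py \
    --data-dir ./ \
    --dataset hfds/johnowhitaker/imagenette2-320 \
    --device 'hpu' \
    --model resnet50.a1_in1k \
    --train-split train \
    --val-split train \
    --save_checkpoint
```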
29 changes: 18 additions & 11 deletions examples/pytorch-image-models/train_hpu_lazy.py
@@ -138,6 +138,12 @@
metavar="PATH",
help="Load this checkpoint into model after initialization (default: none)",
)
group.add_argument(
"--save_checkpoint",
action="store_true",
default=False,
help="saving checkpoint for each epoch",
)
group.add_argument(
"--resume",
default="",
@@ -1047,17 +1053,18 @@ def main():
]
)
output_dir = utils.get_outdir(args.output if args.output else "./output/train", exp_name)
saver = utils.CheckpointSaver(
model=model,
optimizer=optimizer,
args=args,
model_ema=model_ema,
amp_scaler=loss_scaler,
checkpoint_dir=output_dir,
recovery_dir=output_dir,
decreasing=decreasing_metric,
max_history=args.checkpoint_hist,
)
if args.save_checkpoint:
saver = utils.CheckpointSaver(
model=model,
optimizer=optimizer,
args=args,
model_ema=model_ema,
amp_scaler=loss_scaler,
checkpoint_dir=output_dir,
recovery_dir=output_dir,
decreasing=decreasing_metric,
max_history=args.checkpoint_hist,
)
with open(os.path.join(output_dir, "args.yaml"), "w") as f:
f.write(args_text)

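Since `saver` is now bound only when the flag is set, every later use of it needs a guard or a `None` default; otherwise a run without `--save_checkpoint` would raise `NameError` when the training loop tries to save. A self-contained sketch of the opt-in pattern (`DummySaver` is a hypothetical stand-in for timm's `utils.CheckpointSaver`):

```python
import argparse

class DummySaver:
    """Hypothetical stand-in for timm's utils.CheckpointSaver."""
    def save_checkpoint(self, epoch, metric=None):
        print(f"checkpoint saved for epoch {epoch} (metric={metric})")

parser = argparse.ArgumentParser()
parser.add_argument("--save_checkpoint", action="store_true", default=False)
args = parser.parse_args([])  # empty argv here; pass --save_checkpoint to enable saving

saver = None  # default to None so a disabled run never hits an unbound name
if args.save_checkpoint:
    saver = DummySaver()

for epoch in range(2):
    eval_metric = 0.0  # placeholder for the epoch's validation metric
    if saver is not None:  # guard every use of the optional saver
        saver.save_checkpoint(epoch, metric=eval_metric)
```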