Low loss on train set, but near 0 accuracy test set? #256
Comments
Hello! How many GPUs are you using to run this experiment?

Hi, I used 2 GPUs for training with broadcast_buffers=True (see the config sketch below).

Hi, I have the same problem. Have you solved it?

+1
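For reference, the broadcast_buffers=True change mentioned in the reply above can be expressed in the lazy config. This is only a sketch, assuming detrex's common/train.py exposes a train.ddp dict the way detectron2's lazy-config baselines do; it is not necessarily the exact edit the author made:

```python
from detrex.config import get_config

train = get_config("common/train.py").train

# Broadcast module buffers (e.g. BatchNorm running statistics) from rank 0 at
# each forward pass, so every rank evaluates with the same buffer values.
train.ddp.broadcast_buffers = True
```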
My DINO config is as follows:

```python
from detrex.config import get_config
from .models.dino_r50 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
optimizer = get_config("common/optim.py").AdamW
lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep
train = get_config("common/train.py").train

# modify training config
train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
train.output_dir = "./output/dino_r50_4scale_12ep"

# max training iterations
train.max_iter = 90000

# run evaluation every 5000 iters
train.eval_period = 5000

# log training information every 20 iters
train.log_period = 20

# save checkpoint every 5000 iters
train.checkpointer.period = 5000

# gradient clipping for training
train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

# set training devices
train.device = "cuda"
model.device = train.device

# modify optimizer config
optimizer.lr = 1e-4
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# modify dataloader config
dataloader.train.num_workers = 16

# please note that this is the total batch size:
# suppose you're using 4 GPUs for training, then the batch size
# for each GPU is 16 / 4 = 4
dataloader.train.total_batch_size = 16

# dump the testing results into output_dir for visualization
dataloader.evaluator.output_dir = train.output_dir
```
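As a side note on the optimizer block above, here is a minimal illustration of how lr_factor_func scales the learning rate per parameter group; the module names below are hypothetical and only serve to show the 0.1x backbone multiplier:

```python
# Hypothetical module names, only to illustrate the backbone LR multiplier.
base_lr = 1e-4
lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# Backbone parameters train at one tenth of the base learning rate.
print(base_lr * lr_factor_func("backbone.res2.0.conv1"))
# All other parameters keep the base learning rate.
print(base_lr * lr_factor_func("transformer.decoder.layers.0"))
```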
The training error is very low
[04/17 06:35:07] d2.utils.events INFO: eta: 0:00:00 iter: 19999 total_loss: 13.36 loss_class: 0.3083 loss_bbox: 0.09793 loss_giou: 0.5997 loss_class_0: 0.4523 loss_bbox_0: 0.08994 loss_giou_0: 0.5179 loss_class_1: 0.3937 loss_bbox_1: 0.0916 loss_giou_1: 0.5637 loss_class_2: 0.3323 loss_bbox_2: 0.08917 loss_giou_2: 0.6124 loss_class_3: 0.3066 loss_bbox_3: 0.09503 loss_giou_3: 0.583 loss_class_4: 0.3075 loss_bbox_4: 0.1018 loss_giou_4: 0.6114 loss_class_enc: 0.49 loss_bbox_enc: 0.08078 loss_giou_enc: 0.5312
But the accuracy is near 0 on the test set
[04/17 06:48:59] d2.evaluation.coco_evaluation INFO: Evaluation results for bbox:
What could be the reason for this?
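One way to narrow this down is to re-run evaluation on the saved checkpoint outside the training loop, on a single process. Below is a minimal sketch using detectron2's lazy-config API; the config path and checkpoint filename are assumptions about this setup and should be adjusted:

```python
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.evaluation import inference_on_dataset

# Assumed paths: point these at the actual project config and checkpoint file.
cfg = LazyConfig.load("projects/dino/configs/dino_r50_4scale_12ep.py")

model = instantiate(cfg.model).to(cfg.train.device)
model.eval()
DetectionCheckpointer(model).load("./output/dino_r50_4scale_12ep/model_final.pth")

# Run COCO evaluation in a single process; if the AP recovers here, the problem
# lies in the distributed evaluation setup (e.g. unsynchronized buffers), not
# in the trained weights themselves.
results = inference_on_dataset(
    model,
    instantiate(cfg.dataloader.test),
    instantiate(cfg.dataloader.evaluator),
)
print(results)
```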