
Low loss on train set, but near 0 accuracy test set? #256

benhuhaudau opened this issue Apr 17, 2023 · 5 comments

@benhuhaudau
My DINO config is as follows:

```python
from detrex.config import get_config
from .models.dino_r50 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
optimizer = get_config("common/optim.py").AdamW
lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep
train = get_config("common/train.py").train

# modify training config
train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
train.output_dir = "./output/dino_r50_4scale_12ep"

# max training iterations
train.max_iter = 90000

# run evaluation every 5000 iters
train.eval_period = 5000

# log training information every 20 iters
train.log_period = 20

# save checkpoint every 5000 iters
train.checkpointer.period = 5000

# gradient clipping for training
train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

# set training devices
train.device = "cuda"
model.device = train.device

# modify optimizer config
optimizer.lr = 1e-4
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# modify dataloader config
dataloader.train.num_workers = 16

# note that this is the *total* batch size: suppose you're using
# 4 GPUs for training, then the batch size per GPU is 16 / 4 = 4
dataloader.train.total_batch_size = 16

# dump the testing results into output_dir for visualization
dataloader.evaluator.output_dir = train.output_dir
```
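The total-vs-per-GPU batch size comment in the config can be checked with a few lines of plain Python (illustrative only; the variable names are mine, not detrex API):

```python
# detrex splits dataloader.train.total_batch_size evenly across GPUs,
# so the batch size each GPU sees is total / num_gpus.
total_batch_size = 16
num_gpus = 4
per_gpu_batch_size = total_batch_size // num_gpus
print(per_gpu_batch_size)  # -> 4
```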

The training loss is very low:

```
[04/17 06:35:07] d2.utils.events INFO: eta: 0:00:00 iter: 19999 total_loss: 13.36 loss_class: 0.3083 loss_bbox: 0.09793 loss_giou: 0.5997 loss_class_0: 0.4523 loss_bbox_0: 0.08994 loss_giou_0: 0.5179 loss_class_1: 0.3937 loss_bbox_1: 0.0916 loss_giou_1: 0.5637 loss_class_2: 0.3323 loss_bbox_2: 0.08917 loss_giou_2: 0.6124 loss_class_3: 0.3066 loss_bbox_3: 0.09503 loss_giou_3: 0.583 loss_class_4: 0.3075 loss_bbox_4: 0.1018 loss_giou_4: 0.6114 loss_class_enc: 0.49 loss_bbox_enc: 0.08078 loss_giou_enc: 0.5312
```

But the accuracy on the test set is close to 0:

```
[04/17 06:48:59] d2.evaluation.coco_evaluation INFO: Evaluation results for bbox:
```

| AP    | AP50  | AP75  | APs   | APm   | APl   |
|-------|-------|-------|-------|-------|-------|
| 0.137 | 0.694 | 0.014 | 0.097 | 0.203 | 0.146 |

What could be the reason for this?

@rentainhe
Collaborator

Hello! How many GPUs are you using to run this experiment?

@benhuhaudau
Author

Hi, I used 2 GPUs for training with broadcast_buffers=True.
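For anyone comparing DDP setups: in detectron2-style lazy configs the DDP options usually live under `train.ddp`, so toggling `broadcast_buffers` might look like the fragment below. This is a sketch, not verified against this detrex version; the `train.ddp` key is an assumption based on detectron2's `configs/common/train.py` template, where `broadcast_buffers` defaults to False.

```python
# Hypothetical config fragment, assuming detectron2-style train.ddp
# options; check your detrex version's common/train.py before relying
# on these keys.
train.ddp.broadcast_buffers = False
train.ddp.find_unused_parameters = False
```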

@1106X

1106X commented Dec 27, 2023

Hi, I have the same problem. Have you solved it?

@46-neko

46-neko commented Nov 1, 2024

Same thing here. I've tried with COCO2017 and with HRIPCB, got literal 0 AP for everything.

[screenshots: COCO evaluation output showing 0 AP across all metrics]

Trained on a 3090 w/ 24GB VRAM.

@chengjc2019

+1
