Low loss on train set, but near 0 accuracy test set? #256
Comments
Hello! How many GPUs are you using to run this experiment?

Hi, I used 2 GPUs for training with broadcast_buffers=True (see the config sketch below).

Hi, I have the same problem. Have you solved it?

+1
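For reference, the broadcast_buffers=True change mentioned in the reply above can be expressed in the lazy config. This is only a sketch, assuming detrex's common/train.py exposes a train.ddp dict the way detectron2's lazy-config baselines do; it is not necessarily the exact edit the author made:

```python
from detrex.config import get_config

train = get_config("common/train.py").train

# Broadcast module buffers (e.g. BatchNorm running statistics) from rank 0 at
# each forward pass, so every rank evaluates with the same buffer values.
train.ddp.broadcast_buffers = True
```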
My DINO config is as follows:

```python
from detrex.config import get_config
from .models.dino_r50 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
optimizer = get_config("common/optim.py").AdamW
lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep
train = get_config("common/train.py").train

# modify training config
train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
train.output_dir = "./output/dino_r50_4scale_12ep"

# max training iterations
train.max_iter = 90000

# run evaluation every 5000 iters
train.eval_period = 5000

# log training information every 20 iters
train.log_period = 20

# save checkpoint every 5000 iters
train.checkpointer.period = 5000

# gradient clipping for training
train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

# set training devices
train.device = "cuda"
model.device = train.device

# modify optimizer config
optimizer.lr = 1e-4
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# modify dataloader config
dataloader.train.num_workers = 16

# please note that this is the total batch size:
# suppose you're using 4 GPUs for training, then the batch size
# for each GPU is 16 / 4 = 4
dataloader.train.total_batch_size = 16

# dump the testing results into output_dir for visualization
dataloader.evaluator.output_dir = train.output_dir
```
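As a side note on the optimizer block above, here is a minimal illustration of how lr_factor_func scales the learning rate per parameter group; the module names below are hypothetical and only serve to show the 0.1x backbone multiplier:

```python
# Hypothetical module names, only to illustrate the backbone LR multiplier.
base_lr = 1e-4
lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# Backbone parameters train at one tenth of the base learning rate.
print(base_lr * lr_factor_func("backbone.res2.0.conv1"))
# All other parameters keep the base learning rate.
print(base_lr * lr_factor_func("transformer.decoder.layers.0"))
```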
The training error is very low
[04/17 06:35:07] d2.utils.events INFO: eta: 0:00:00 iter: 19999 total_loss: 13.36 loss_class: 0.3083 loss_bbox: 0.09793 loss_giou: 0.5997 loss_class_0: 0.4523 loss_bbox_0: 0.08994 loss_giou_0: 0.5179 loss_class_1: 0.3937 loss_bbox_1: 0.0916 loss_giou_1: 0.5637 loss_class_2: 0.3323 loss_bbox_2: 0.08917 loss_giou_2: 0.6124 loss_class_3: 0.3066 loss_bbox_3: 0.09503 loss_giou_3: 0.583 loss_class_4: 0.3075 loss_bbox_4: 0.1018 loss_giou_4: 0.6114 loss_class_enc: 0.49 loss_bbox_enc: 0.08078 loss_giou_enc: 0.5312
But the accuracy is near 0 on the test set
[04/17 06:48:59] d2.evaluation.coco_evaluation INFO: Evaluation results for bbox:
What could be the reason for this?
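One way to narrow this down is to re-run evaluation on the saved checkpoint outside the training loop, on a single process. Below is a minimal sketch using detectron2's lazy-config API; the config path and checkpoint filename are assumptions about this setup and should be adjusted:

```python
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.evaluation import inference_on_dataset

# Assumed paths: point these at the actual project config and checkpoint file.
cfg = LazyConfig.load("projects/dino/configs/dino_r50_4scale_12ep.py")

model = instantiate(cfg.model).to(cfg.train.device)
model.eval()
DetectionCheckpointer(model).load("./output/dino_r50_4scale_12ep/model_final.pth")

# Run COCO evaluation in a single process; if the AP recovers here, the problem
# lies in the distributed evaluation setup (e.g. unsynchronized buffers), not
# in the trained weights themselves.
results = inference_on_dataset(
    model,
    instantiate(cfg.dataloader.test),
    instantiate(cfg.dataloader.evaluator),
)
print(results)
```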