
The shape of probs_seq does not match with the shape of the vocabulary Segmentation fault (core dumped) #9

Open
thunder123321 opened this issue Feb 3, 2022 · 6 comments

@thunder123321

[/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
[/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
[/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
Segmentation fault (core dumped)

I have encountered this problem without modifying the original code. Could you tell me what might be causing it?
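
(For context, the failing check enforces that the last dimension of the probability tensor equals the vocabulary size passed to the decoder. A minimal sketch of that contract, assuming the parlance/ctcdecode Python API; the labels and shapes below are illustrative, not taken from DSLP.)

import torch
from ctcdecode import CTCBeamDecoder

labels = ["_", "a", "b", "c"]  # decoder vocabulary; "_" is the CTC blank
decoder = CTCBeamDecoder(labels, beam_width=1, blank_id=0)

# decode() expects probabilities of shape (batch, time, len(labels));
# if the last dimension differs from len(labels), the C++ check above fails.
probs = torch.softmax(torch.randn(2, 10, len(labels)), dim=-1)
beam_results, beam_scores, timesteps, out_lens = decoder.decode(probs)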

@thunder123321
Author

I'm running the “CTC with DSLP” code:

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de --save-dir checkpoints --eval-tokenized-bleu \
    --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
    --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
    --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5 --fixed-validation-seed 7 --ddp-backend=no_c10d \
    --share-all-embeddings --decoder-learned-pos --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
    --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
    --fp16 --clip-norm 2.0 --max-update 300000 --task translation_lev --criterion nat_loss --arch nat_ctc_sd --noise full_mask \
    --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1 --concat-yhat --concat-dropout 0.0 --label-smoothing 0.0 \
    --activation-fn gelu --dropout 0.1 --max-tokens 2048 --update-freq 4

@thunder123321
Author

FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary

I encountered this problem when running both the GLAT+CTC+SD and the CTC+SD code.
What does this error mean? I haven't changed the DSLP code. I hope the author can clarify this for me.

@chenyangh
Owner

Hello, @thunder123321

Unfortunately, there is not enough information for me to tell what went wrong in your setup.
My best guess is that the error is related to your ctcdecode installation.

BTW, I just tested a clean clone of the repo with your script, and it works on my side.

@chenyangh
Owner

chenyangh commented Feb 16, 2022

Actually, ctcdecode is only used as a post-processing step in the final version, since I only used beam size 1.
I think you can use the --plain-ctc option to avoid using ctcdecode.
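
(For reference, with beam size 1, CTC beam search reduces to a per-step argmax followed by the usual CTC collapse. A minimal sketch of that idea, assuming a (time, vocab) tensor of log-probabilities and blank id 0; this is an illustration, not the DSLP implementation.)

import torch

def greedy_ctc_decode(log_probs, blank_id=0):
    # log_probs: (time, vocab). Take the best token at each step,
    # collapse consecutive repeats, then drop blanks. This is what
    # beam search with beam size 1 reduces to.
    toks = log_probs.argmax(dim=-1).tolist()
    toks = [v for i, v in enumerate(toks) if i == 0 or v != toks[i - 1]]
    return [v for v in toks if v != blank_id]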

However, you then need to do some post-processing here:

history.append(output_tokens.clone())

You may incorporate this function:

extra_symbols_to_ignore = []
if hasattr(tgt_dict, "blank_index"):
    extra_symbols_to_ignore.append(tgt_dict.blank_index)
if hasattr(tgt_dict, "mask_index"):
    extra_symbols_to_ignore.append(tgt_dict.mask_index)

def _ctc_postprocess(tokens):
    hyp = tokens
    # Collapse consecutive repeated tokens (standard CTC de-duplication)
    _toks = hyp.int().tolist()
    _toks = [v for i, v in enumerate(_toks) if i == 0 or v != _toks[i - 1]]
    # Drop blank/mask symbols from the collapsed sequence
    hyp = hyp.new_tensor([v for v in _toks if v not in extra_symbols_to_ignore])
    return hyp
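
A hypothetical usage sketch (the names output_tokens and tgt_dict are assumed from the surrounding fairseq decoding code, not quoted from DSLP):

# Post-process each hypothesis in the batch before detokenization/scoring.
hyps = [_ctc_postprocess(output_tokens[i]) for i in range(output_tokens.size(0))]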

@thunder123321
Author

Hi @chenyangh, thank you very much for answering my question. I noticed that the two post-processing snippets you mentioned appear in the generation.py file. Does that mean I just need to add the --plain-ctc flag? When I added --plain-ctc in my experiment, I found that the memory footprint was higher. Is ctcdecode used to reduce the memory footprint?

@chenyangh
Owner

chenyangh commented Feb 17, 2022

Hi @thunder123321, --plain-ctc was introduced to replace the ctcdecode module (which is much slower, even with beam size 1). However, the --plain-ctc option does not perform post-processing during training. That is why I suggested the modifications above in case you cannot get ctcdecode working.

In terms of memory consumption, I am not sure whether that is caused by the --plain-ctc option. But I do remember that at some point during development, the model suddenly started consuming more RAM per batch. Unfortunately, I haven't identified the reason.
