How to train a model when the label contains a comma? #14588

diegonc · 2025-01-24T17:46:50Z

diegonc
Jan 24, 2025

I was following the tutorial on how to fine-tune a model located here and when I'm about to run ppocr/utils/gen_label.py I realized my labels are splited by the comma they contain (for instance 999,000000 gets converted to 999\t000000)

How should I generate the label txt if there is a comma in there?

Answered by GreatV

Jan 24, 2025

To address the issue of training a model when the label contains a comma, let’s break it down step by step based on the problem description and related documents:

Problem: Labels with Commas are Split

When running ppocr/utils/gen_label.py, labels like 999,000000 are incorrectly split into 999\t000000 due to the script treating the comma as a delimiter. This happens because many OCR label generation tools use specific delimiters (e.g., commas or tabs) to separate image paths and their corresponding labels in the ground truth file.

Solution: Handling Labels with Commas

To ensure that your labels containing commas are not split incorrectly, you can modify the label generation script (gen_l…

View full answer

GreatV · 2025-01-24T17:52:11Z

GreatV
Jan 24, 2025
Maintainer

To address the issue of training a model when the label contains a comma, let’s break it down step by step based on the problem description and related documents:

Problem: Labels with Commas are Split

When running ppocr/utils/gen_label.py, labels like 999,000000 are incorrectly split into 999\t000000 due to the script treating the comma as a delimiter. This happens because many OCR label generation tools use specific delimiters (e.g., commas or tabs) to separate image paths and their corresponding labels in the ground truth file.

Solution: Handling Labels with Commas

To ensure that your labels containing commas are not split incorrectly, you can modify the label generation script (gen_label.py) or preprocess your labels to use a delimiter that does not conflict with the actual content of the labels. Here’s how:

1. Modify the Label Generation Script

In the gen_label.py script:

Locate the part of the code where the labels are being written to the ground truth file. Usually, this involves joining the image path and label with a delimiter (e.g., a comma or tab).
Replace the comma delimiter with a more unique delimiter, such as the pipe symbol (|).

Example:

# Original code (example):
output_line = f"{image_path},{label}\n"

# Modified code:
output_line = f"{image_path}|{label}\n"

Then, ensure that the OCR model’s data loader is updated to recognize the new delimiter (|) when reading the ground truth file.

2. Escape or Encode Commas in Labels

If you prefer to keep the comma as the delimiter, you can escape commas within labels by replacing them with a placeholder string (e.g., \u002C or __COMMA__) during label generation. When modifying the gen_label.py script, preprocess the labels as follows:

Example:

# Replace commas in the label with a placeholder
label = label.replace(",", "__COMMA__")
output_line = f"{image_path},{label}\n"

During inference or evaluation, reverse this replacement to restore the original label.

3. Use Quotation Marks Around Labels

Another common approach is to wrap labels containing commas in quotation marks. Update the script to add quotes around labels if they contain a comma:

Example:

if ',' in label:
    label = f'"{label}"'
output_line = f"{image_path},{label}\n"

You will also need to ensure the data loader can correctly parse labels with quotation marks.

Additional Considerations

Impact on Training and Dictionary

Dictionary Updates: If your labels include special characters (e.g., commas), ensure they are included in your character_dict_path file. If the dictionary does not have these characters, the model will not recognize them during training or inference.
- For example, add , (comma) as a separate line in the dictionary:
```
a
b
c
,
0
1
```
Pre-trained Models: If you are fine-tuning a pre-trained model, changing the dictionary might cause issues if the pre-trained model was trained on a different character set. However, this can be mitigated by training with a sufficiently large dataset that includes the new characters.

Related Discussion Insights

From the linked discussion on fine-tuning PaddleOCR models, there are a few points that might be relevant to your case:

Accuracy Starting at Zero:
- It is normal for accuracy to start at zero during fine-tuning, especially if the dataset differs significantly from the pre-trained model’s training data. However, if pre-trained parameters are not being loaded correctly, it could also lead to this behavior. Double-check the pretrained_model path and ensure it points to a valid file.
Changing the Dictionary:
- Changing the dictionary is possible, but it may reduce initial performance if the new dictionary significantly differs from the original one. Fine-tuning with a large dataset can help mitigate this.
Removing the GTC Strategy:
- To remove the GTC strategy, you may need to adjust the Loss configuration, remove NRTRLoss, and use a simpler labeling method like CTCLabelEncode.

Summary

To train a model with labels containing commas without splitting them:

Modify the gen_label.py script:
- Use a unique delimiter (e.g., |).
- Escape commas in labels or wrap them in quotes.
Update the OCR model’s data loader to handle the new format.
Ensure that the dictionary includes commas or any other special characters in your labels.
If using a pre-trained model, ensure the character set aligns or fine-tune with a large dataset to adapt the model to the new dictionary.

By addressing these points, you can resolve the issue of labels being split incorrectly due to commas.

Response generated by feifei-bot | chatgpt-4o-latest

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to train a model when the label contains a comma? #14588

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

How to train a model when the label contains a comma? #14588

diegonc Jan 24, 2025

Problem: Labels with Commas are Split

Solution: Handling Labels with Commas

Replies: 1 comment

GreatV Jan 24, 2025 Maintainer

Problem: Labels with Commas are Split

Solution: Handling Labels with Commas

1. Modify the Label Generation Script

2. Escape or Encode Commas in Labels

3. Use Quotation Marks Around Labels

Additional Considerations

Impact on Training and Dictionary

Related Discussion Insights

Summary

diegonc
Jan 24, 2025

GreatV
Jan 24, 2025
Maintainer