I was following the tutorial on how to fine-tune a model located here and when I'm about to run … How should I generate the label txt if there is a comma in there?
Replies: 1 comment
To address the issue of training a model when the label contains a comma, let’s break it down step by step based on the problem description and related documents:

Problem: Labels with Commas are Split

When running `ppocr/utils/gen_label.py`, labels like `999,000000` are incorrectly split into `999\t000000` because the script treats the comma as a delimiter.

Solution: Handling Labels with Commas

To ensure that labels containing commas are not split incorrectly, you can modify the label generation script in one of the following ways.

1. Modify the Label Generation Script

In the script, replace the comma delimiter with a character that does not occur in your labels, such as `|`.

Example:

    # Original code (example):
    output_line = f"{image_path},{label}\n"
    # Modified code:
    output_line = f"{image_path}|{label}\n"

Then, ensure that the OCR model’s data loader is updated to recognize the new delimiter (`|`).

2. Escape or Encode Commas in Labels

If you prefer to keep the comma as the delimiter, you can escape commas within labels by replacing them with a placeholder string (e.g., `__COMMA__`).

Example:

    # Replace commas in the label with a placeholder
    label = label.replace(",", "__COMMA__")
    output_line = f"{image_path},{label}\n"

During inference or evaluation, reverse this replacement to restore the original label.

3. Use Quotation Marks Around Labels

Another common approach is to wrap labels containing commas in quotation marks. Update the script to add quotes around a label if it contains a comma:

Example:

    if ',' in label:
        label = f'"{label}"'
    output_line = f"{image_path},{label}\n"

You will also need to ensure the data loader can correctly parse labels with quotation marks.

Additional Considerations

Impact on Training and Dictionary

Related Discussion Insights

From the linked discussion on fine-tuning PaddleOCR models, there are a few points that might be relevant to your case:

Summary

To train a model with labels containing commas without splitting them:

By addressing these points, you can resolve the issue of labels being split incorrectly due to commas.

Response generated by feifei-bot | chatgpt-4o-latest
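For what it's worth, the quoting and placeholder approaches from the reply above can be sketched with Python's standard `csv` module. This is a minimal sketch, not PaddleOCR code: the `image_path,label` layout is taken from the reply's examples, and the function names and sample image paths are hypothetical.

```python
import csv
import io

PLACEHOLDER = "__COMMA__"  # placeholder string from the reply's example


def write_quoted(rows):
    """Serialize (image_path, label) pairs as comma-separated lines.

    csv.writer (default QUOTE_MINIMAL) wraps a field in double quotes
    only when it contains the delimiter, a quote, or a line break.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def read_quoted(text):
    """Parse the text back into (image_path, label) pairs."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def escape_label(label):
    """Approach 2: hide commas behind a placeholder before writing."""
    return label.replace(",", PLACEHOLDER)


def unescape_label(label):
    """Reverse escape_label when reading predictions or ground truth."""
    return label.replace(PLACEHOLDER, ",")


# Hypothetical sample data; the first label contains a comma.
rows = [("imgs/0001.jpg", "999,000000"), ("imgs/0002.jpg", "hello")]
text = write_quoted(rows)
print(text)
print(read_quoted(text) == rows)
print(unescape_label(escape_label("999,000000")))
```

Using `csv` rather than adding quotes by hand also covers labels that themselves contain a double quote: the writer doubles it on output and the reader restores it. Whatever consumes the label file during training still has to parse it the same way.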
To address the issue of training a model when the label contains a comma, let’s break it down step by step based on the problem description and related documents:
Problem: Labels with Commas are Split
When running `ppocr/utils/gen_label.py`, labels like `999,000000` are incorrectly split into `999\t000000` due to the script treating the comma as a delimiter. This happens because many OCR label generation tools use specific delimiters (e.g., commas or tabs) to separate image paths and their corresponding labels in the ground truth file.
Solution: Handling Labels with Commas
To ensure that your labels containing commas are not split incorrectly, you can modify the label generation script (
gen_l…