Appended text outside the table in the table cells #545

npn-zakipoint · 2025-02-19T06:21:05Z

Hi, I am extracting tabular data with camelot. Recently I found that camelot merges text from outside the table into the table cells. I read my PDF document with the following configuration:

tables = camelot.read_pdf(pdf_file_2, line_scale=40, copy_text=["v", "h"], strip_text='\n', parallel=True,
                          layout_kwargs={'detect_vertical': False}, backend="pdfium")

The input PDF has a clear tabular structure: the table has proper ruling lines, with rows and, in some cases, spanned columns, and it is detected correctly. I validated the extracted tables with the .plot() method:

camelot.plot(tables[0], kind='contour').show()

Do you have any idea why it merges text from outside the table (in this case, text that is more than 200-300 pixels above the table)? I think this is a bug in camelot, because it maps text outside the table to the table. There might be an issue in how words on the page are mapped to the table's cells based on their x and y coordinates. Thank you.
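
To make the coordinate concern concrete, here is a minimal sketch of the kind of containment check one might expect before a piece of text is assigned to a table. It is not camelot's actual implementation; the function names, the `min_overlap` threshold, and the example coordinates are all hypothetical, and bounding boxes are assumed to be `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left.

```python
from typing import Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left (assumed convention)
BBox = Tuple[float, float, float, float]


def overlap_fraction(text_bbox: BBox, table_bbox: BBox) -> float:
    """Return the fraction of the text box's area that lies inside the table box."""
    tx0, ty0, tx1, ty1 = text_bbox
    bx0, by0, bx1, by1 = table_bbox
    width = min(tx1, bx1) - max(tx0, bx0)
    height = min(ty1, by1) - max(ty0, by0)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    area = (tx1 - tx0) * (ty1 - ty0)
    return (width * height) / area if area > 0 else 0.0


def belongs_to_table(text_bbox: BBox, table_bbox: BBox, min_overlap: float = 0.5) -> bool:
    """Hypothetical rule: accept text only if most of its box is inside the table."""
    return overlap_fraction(text_bbox, table_bbox) >= min_overlap


table = (50.0, 100.0, 550.0, 400.0)
inside = (60.0, 380.0, 300.0, 395.0)   # text in the first table row
above = (60.0, 650.0, 300.0, 665.0)    # text about 250 points above the table
merged = (60.0, 380.0, 300.0, 665.0)   # one box covering both pieces of text

print(belongs_to_table(inside, table))
print(belongs_to_table(above, table))
print(overlap_fraction(merged, table))
```

The `merged` case is the interesting one: if the layout step hands over one box that covers both the far-away text and a table row, that box still intersects the table, so any rule based only on intersection would pull the outside text into a cell.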

OS: Linux
python: 3.10
camelot-py: 1.0.0
backend: pdfium

npn-zakipoint · 2025-02-19T11:23:29Z

I traced camelot's extraction pipeline and found that camelot relies on pdfminer to group text into words and sentences. For the PDF I tested, pdfminer returns pieces of text that are very far apart in the document inside a single LTTextBoxHorizontal box. I also tweaked the LAParams, but nothing worked for me.

def get_page_layout(
    filename,
    line_overlap=0.5,
    char_margin=1.0,
    line_margin=0.5,
    word_margin=0.1,
    boxes_flow=0.5,
    detect_vertical=True,
    all_texts=True,
):
    """Return a PDFMiner LTPage object and page dimension of a single page pdf.

    To get the definitions of kwargs, see
    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.

    Parameters
    ----------
    filename : string
        Path to pdf file.
    line_overlap : float
    char_margin : float
    line_margin : float
    word_margin : float
    boxes_flow : float
    detect_vertical : bool
    all_texts : bool

    Returns
    -------
    layout : object
        PDFMiner LTPage object.
    dim : tuple
        Dimension of pdf page in the form (width, height).

    """
    with open(filename, "rb") as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(
                f"Text extraction is not allowed: {filename}"
            )
        laparams = LAParams(
            line_overlap=line_overlap,
            char_margin=char_margin,
            line_margin=line_margin,
            word_margin=word_margin,
            boxes_flow=boxes_flow,
            detect_vertical=detect_vertical,
            all_texts=all_texts,
        )
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        page = next(PDFPage.create_pages(document), None)
        if page is None:
            raise PDFTextExtractionNotAllowed
        interpreter.process_page(page)
        layout = device.get_result()
        width = layout.bbox[2]
        height = layout.bbox[3]
        dim = (width, height)
        print(dim)
        return layout, dim

I think this is a big issue, because camelot is built on top of pdfminer and the layout returned by the library deviates from the ground truth. If you know of any way to work around such issues, it would be really helpful to me and the community. Thank you.
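
One way to inspect the behaviour described above is to look at the individual lines inside each LTTextBoxHorizontal and check how far apart they are vertically. The following is a rough diagnostic sketch, not a fix and not part of camelot: `split_lines_by_gap`, `lines_from_pdf`, and the `max_gap` value are hypothetical names and settings chosen for illustration. The pdfminer.six calls used (`extract_pages`, `LTTextBoxHorizontal`, `LTTextLineHorizontal`, `.bbox`, `.get_text()`) are imported lazily, so the grouping logic can be tried without a PDF.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text) in PDF points, origin at the bottom-left
Line = Tuple[float, float, float, float, str]


def split_lines_by_gap(lines: List[Line], max_gap: float) -> List[List[Line]]:
    """Group lines top to bottom; start a new group when the vertical gap exceeds max_gap."""
    ordered = sorted(lines, key=lambda ln: ln[3], reverse=True)  # top of page first
    groups: List[List[Line]] = []
    for line in ordered:
        # Gap = bottom (y0) of the previous line minus top (y1) of the current line.
        if groups and groups[-1][-1][1] - line[3] <= max_gap:
            groups[-1].append(line)
        else:
            groups.append([line])
    return groups


def lines_from_pdf(path: str) -> List[List[Line]]:
    """Return one list of lines per LTTextBoxHorizontal on the first page of the PDF."""
    # Imported here so split_lines_by_gap works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal

    page = next(iter(extract_pages(path, maxpages=1)))
    boxes: List[List[Line]] = []
    for element in page:
        if isinstance(element, LTTextBoxHorizontal):
            boxes.append([
                (*line.bbox, line.get_text().strip())
                for line in element
                if isinstance(line, LTTextLineHorizontal)
            ])
    return boxes


# Made-up example of one text box holding a far-away header plus two table lines.
box = [
    (60.0, 650.0, 300.0, 665.0, "header"),
    (60.0, 380.0, 300.0, 395.0, "cell a"),
    (60.0, 362.0, 300.0, 377.0, "cell b"),
]
groups = split_lines_by_gap(box, max_gap=30.0)
print([[ln[4] for ln in g] for g in groups])
```

With a real file, running `split_lines_by_gap` over each entry returned by `lines_from_pdf(pdf_file_2)` would show which text boxes contain lines separated by more than `max_gap` points, which may help when attaching a reproducible example to the issue.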

npn-zakipoint added the bug (Something isn't working) label on Feb 19, 2025
