Appended text outside the table in the table cells #545

Open
npn-zakipoint opened this issue Feb 19, 2025 · 1 comment
Labels
bug Something isn't working

npn-zakipoint commented Feb 19, 2025

Hi, I am extracting tabular data with camelot. Recently I found that camelot merges text from outside the table into the table cells. I read my PDF document with the following configuration:

    tables = camelot.read_pdf(
        pdf_file_2,
        line_scale=40,
        copy_text=["v", "h"],
        strip_text="\n",
        parallel=True,
        layout_kwargs={"detect_vertical": False},
        backend="pdfium",
    )

The input PDF has a clear tabular structure: the table is drawn with proper ruling lines, some cells span rows or columns, and the table is detected correctly. I validated the detected tables with the .plot() method:

camelot.plot(tables[0], kind='contour').show()

Do you have any idea why it merged text from outside the table (text that sits more than 200-300 pixels above the table in this case)? I think this is a bug in camelot, because text outside the table is being mapped into it. There might be an issue in how words on the page are mapped to the table's cells based on their x and y coordinates. Thank you.

OS: Linux
python: 3.10
camelot-py: 1.0.0
backend: pdfium
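
The coordinate-based mapping suspected above can be illustrated with a small, library-independent sketch. This is not camelot's actual code: `Box` and `overlap_ratio` are hypothetical names introduced only for illustration, and the coordinates are made up. It shows how a text box that was merged from two distant pieces of text can still partially overlap a table region, which is the situation in which any overlap-based assignment could pull outside text into a cell.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in PDF points, origin at the bottom-left of the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_ratio(text: Box, table: Box) -> float:
    """Return the fraction of the text box's area that lies inside the table box."""
    width = min(text.x1, table.x1) - max(text.x0, table.x0)
    height = min(text.y1, table.y1) - max(text.y0, table.y0)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    text_area = (text.x1 - text.x0) * (text.y1 - text.y0)
    return (width * height) / text_area if text_area > 0 else 0.0


# Made-up coordinates: a table region and three candidate text boxes.
table = Box(50, 100, 550, 400)
inside = Box(60, 380, 200, 395)   # a normal cell text
above = Box(60, 650, 200, 665)    # a paragraph far above the table
merged = Box(60, 380, 200, 665)   # one box covering both of the above

for name, box in [("inside", inside), ("above", above), ("merged", merged)]:
    print(name, round(overlap_ratio(box, table), 3))
```

A box like `merged` overlaps the table only slightly, so a threshold on the overlap ratio (rather than accepting any intersection) is one possible way to flag it for inspection.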

@npn-zakipoint npn-zakipoint added the bug Something isn't working label Feb 19, 2025
npn-zakipoint commented Feb 19, 2025

I traced camelot's extraction pipeline and found that camelot relies on pdfminer to group characters into words and lines. In the PDF I tested, pdfminer returns pieces of text that are very far apart on the page inside a single LTTextBoxHorizontal. I also tweaked the LAParams passed in the code below, but nothing worked for me.

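from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


def get_page_layout(
    filename,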
    line_overlap=0.5,
    char_margin=1.0,
    line_margin=0.5,
    word_margin=0.1,
    boxes_flow=0.5,
    detect_vertical=True,
    all_texts=True,
):
    """Return a PDFMiner LTPage object and page dimension of a single page pdf.

    To get the definitions of kwargs, see
    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.

    Parameters
    ----------
    filename : string
        Path to pdf file.
    line_overlap : float
    char_margin : float
    line_margin : float
    word_margin : float
    boxes_flow : float
    detect_vertical : bool
    all_texts : bool

    Returns
    -------
    layout : object
        PDFMiner LTPage object.
    dim : tuple
        Dimension of pdf page in the form (width, height).

    """
    with open(filename, "rb") as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(
                f"Text extraction is not allowed: {filename}"
            )
        laparams = LAParams(
            line_overlap=line_overlap,
            char_margin=char_margin,
            line_margin=line_margin,
            word_margin=word_margin,
            boxes_flow=boxes_flow,
            detect_vertical=detect_vertical,
            all_texts=all_texts,
        )
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        page = next(PDFPage.create_pages(document), None)
        if page is None:
            raise PDFTextExtractionNotAllowed
        interpreter.process_page(page)
        layout = device.get_result()
        width = layout.bbox[2]
        height = layout.bbox[3]
        dim = (width, height)
        print(dim)
        return layout, dim


I think this is a big issue, because camelot is built on top of pdfminer and the layout pdfminer returns deviates from the actual page layout. If you know of any way to work around this, it would be really helpful to me and the community. Thank you.
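
To narrow down which text boxes are affected, a diagnostic along the following lines might help. It is a sketch under stated assumptions, not a fix: `group_lines_by_gap` and `report_suspicious_boxes` are hypothetical helper names, the 50-point gap threshold is arbitrary, and the LAParams values simply mirror the `detect_vertical=False` setting used above. It uses pdfminer.six's `extract_pages` to walk each LTTextBoxHorizontal and reports boxes whose lines are separated by a large vertical gap.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page.
BBox = Tuple[float, float, float, float]


def group_lines_by_gap(line_bboxes: Iterable[BBox], max_gap: float) -> List[List[BBox]]:
    """Split line boxes into groups wherever the vertical gap exceeds max_gap."""
    ordered = sorted(line_bboxes, key=lambda b: b[3], reverse=True)  # top-most first
    groups: List[List[BBox]] = []
    for bbox in ordered:
        # Gap = bottom edge of the previous line minus top edge of this line.
        if groups and groups[-1][-1][1] - bbox[3] <= max_gap:
            groups[-1].append(bbox)
        else:
            groups.append([bbox])
    return groups


def report_suspicious_boxes(pdf_path: str, max_gap: float = 50.0) -> None:
    """Print every horizontal text box whose lines are more than max_gap apart."""
    # Imported lazily so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBoxHorizontal, LTTextLineHorizontal

    laparams = LAParams(detect_vertical=False, all_texts=True)
    for page in extract_pages(pdf_path, laparams=laparams):
        for element in page:
            if not isinstance(element, LTTextBoxHorizontal):
                continue
            lines = [o.bbox for o in element if isinstance(o, LTTextLineHorizontal)]
            groups = group_lines_by_gap(lines, max_gap)
            if len(groups) > 1:
                print(f"page {page.pageid}: box {element.bbox} has {len(groups)} distant line groups")
                print(repr(element.get_text()))
```

Calling `report_suspicious_boxes("document.pdf")` (the path is a placeholder) lists the boxes that combine distant lines, which could then be attached to this issue as a reproducible example.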
