Hi, I am extracting tabular data with camelot. Recently, I found that camelot merges text from outside the table into the table cells. I am using the following configurations while reading my PDF document:
The input PDF has a clear tabular structure, i.e. the table has proper lines with rows and, in some cases, spanned columns, and the table is detected properly. I have validated the extracted tables using the .plot() method.
camelot.plot(tables[0], kind='contour').show()
Do you have any idea why it merged text from outside the table (in this case, text that is more than 200-300 pixels above it)? I think this is a bug in camelot, because it maps text outside the table to the table. There might be an issue in how words on the page are mapped to the table's cells based on their x and y coordinates. Thank you.
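As a rough illustration of the coordinate-mapping hypothesis above, the sketch below compares two ways of deciding whether a text bounding box belongs to a table. The helpers `center_inside` and `overlap_fraction` are hypothetical and are not part of camelot; the sketch makes no claim about which rule camelot applies internally. Boxes are assumed to be `(x0, y0, x1, y1)` tuples with the origin at the bottom-left, as pdfminer reports them, and all coordinates are made up.

```python
# Hypothetical helpers, not camelot API. Boxes are (x0, y0, x1, y1) with the
# origin at the bottom-left of the page, as pdfminer reports them.

def center_inside(text_bbox, table_bbox):
    """Return True if the center of text_bbox lies within table_bbox."""
    tx0, ty0, tx1, ty1 = text_bbox
    bx0, by0, bx1, by1 = table_bbox
    cx = (tx0 + tx1) / 2
    cy = (ty0 + ty1) / 2
    return bx0 <= cx <= bx1 and by0 <= cy <= by1


def overlap_fraction(text_bbox, table_bbox):
    """Return the fraction of text_bbox's area that overlaps table_bbox."""
    tx0, ty0, tx1, ty1 = text_bbox
    bx0, by0, bx1, by1 = table_bbox
    width = max(0.0, min(tx1, bx1) - max(tx0, bx0))
    height = max(0.0, min(ty1, by1) - max(ty0, by0))
    area = (tx1 - tx0) * (ty1 - ty0)
    return (width * height) / area if area > 0 else 0.0


# Made-up coordinates for illustration only.
table = (50.0, 100.0, 550.0, 400.0)
inside = (60.0, 120.0, 200.0, 135.0)     # ordinary cell text
stretched = (60.0, 380.0, 200.0, 650.0)  # one box spanning the table top and text far above

for name, box in (("inside", inside), ("stretched", stretched)):
    print(name, center_inside(box, table), round(overlap_fraction(box, table), 3))
```

The `stretched` box stands in for a text box that was grouped together with text far above the table. It touches the table, so a rule based on any overlap would assign it to a cell, while a center-based or overlap-fraction rule would reject it.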
OS: Linux
python: 3.10
camelot-py: 1.0.0
backend: pdfium
I have tested camelot's extraction pipeline and found that camelot is built on top of pdfminer, which groups text into words and sentences. In the PDF document I tested, pdfminer returned pieces of text that are very far apart on the page in a single LTTextBoxHorizontal box. I have also tweaked the LAParams, but nothing worked for me.
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


def get_page_layout(
    filename,
    line_overlap=0.5,
    char_margin=1.0,
    line_margin=0.5,
    word_margin=0.1,
    boxes_flow=0.5,
    detect_vertical=True,
    all_texts=True,
):
    """Return a PDFMiner LTPage object and page dimension of a single page pdf.

    To get the definitions of kwargs, see
    https://pdfminersix.rtfd.io/en/latest/reference/composable.html.

    Parameters
    ----------
    filename : string
        Path to pdf file.
    line_overlap : float
    char_margin : float
    line_margin : float
    word_margin : float
    boxes_flow : float
    detect_vertical : bool
    all_texts : bool

    Returns
    -------
    layout : object
        PDFMiner LTPage object.
    dim : tuple
        Dimension of pdf page in the form (width, height).
    """
    with open(filename, "rb") as f:
        parser = PDFParser(f)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(
                f"Text extraction is not allowed: {filename}"
            )
        # Layout analysis parameters control how characters are grouped
        # into lines and lines into text boxes.
        laparams = LAParams(
            line_overlap=line_overlap,
            char_margin=char_margin,
            line_margin=line_margin,
            word_margin=word_margin,
            boxes_flow=boxes_flow,
            detect_vertical=detect_vertical,
            all_texts=all_texts,
        )
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Only the first page is processed (single-page PDF expected).
        page = next(PDFPage.create_pages(document), None)
        if page is None:
            raise PDFTextExtractionNotAllowed
        interpreter.process_page(page)
        layout = device.get_result()
        width = layout.bbox[2]
        height = layout.bbox[3]
        dim = (width, height)
        print(dim)  # debug output: page size as (width, height)
        return layout, dim
I think this is a big issue because camelot is built on top of pdfminer, and the layout output from the library deviates from the ground truth. If you know of any solution to such issues, it would be really helpful to me and the community. Thank you.
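One way to confirm the grouping described above is to list the text boxes whose lines are separated by an unusually large vertical gap. The following is a minimal sketch, not a fix: `split_by_vertical_gap` and `suspicious_boxes` are hypothetical helper names, the `max_gap_ratio` threshold of 1.5 is an arbitrary assumption, and only `extract_pages`, `LAParams`, `LTTextBoxHorizontal` and `LTTextLineHorizontal` come from pdfminer.six.

```python
def split_by_vertical_gap(line_bboxes, max_gap_ratio=1.5):
    """Cluster line bboxes (x0, y0, x1, y1), reading top to bottom.

    A new cluster starts when the gap between the previous line's bottom and
    the current line's top exceeds max_gap_ratio times the previous line's
    height.
    """
    ordered = sorted(line_bboxes, key=lambda b: b[3], reverse=True)
    clusters = []
    for bbox in ordered:
        if clusters:
            prev = clusters[-1][-1]
            gap = prev[1] - bbox[3]
            height = prev[3] - prev[1]
            if gap <= max_gap_ratio * height:
                clusters[-1].append(bbox)
                continue
        clusters.append([bbox])
    return clusters


def suspicious_boxes(pdf_path, max_gap_ratio=1.5, **la_kwargs):
    """Yield (text, clusters) for text boxes whose lines form >1 cluster."""
    # Imported lazily so the pure helper above works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import (
        LAParams,
        LTTextBoxHorizontal,
        LTTextLineHorizontal,
    )

    for page in extract_pages(pdf_path, laparams=LAParams(**la_kwargs)):
        for element in page:
            if not isinstance(element, LTTextBoxHorizontal):
                continue
            lines = [
                line.bbox
                for line in element
                if isinstance(line, LTTextLineHorizontal)
            ]
            clusters = split_by_vertical_gap(lines, max_gap_ratio)
            if len(clusters) > 1:
                yield element.get_text(), clusters
```

Calling `list(suspicious_boxes("page.pdf", line_margin=0.5))` on a single-page file (the path is a placeholder) would show which boxes contain far-apart lines. Because each line carries its own bounding box, the clusters could then be matched against the table area separately instead of using the box as a whole.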