Table/Figure captions often broken when Tables and Figures follow ref-list #109

Open
axfelix opened this issue May 24, 2017 · 12 comments

@axfelix
Contributor

axfelix commented May 24, 2017

Hi Martin,

Something we've been noticing across our corpus and would like to improve is the very low accuracy of tagging table and figure captions in meTypeset when the tables and figures are deliberately positioned at the end of the Word document by the author to meet some journals' formatting requirements.

Sometimes the result is the table or figure caption being tagged as its own paragraph, which I assume can be improved on its own by e.g. making the caption classifier more aggressive or adding another set of linguistic cues as has been done for https://github.com/MartinPaulEve/meTypeset/tree/master/language
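To make the "linguistic cues" idea concrete, here is a minimal sketch of a cue-based caption check. The pattern and the name `looks_like_caption` are hypothetical and are not taken from meTypeset's language files; the check is deliberately naive and would also accept a body sentence that happens to open with "Table 2 shows...".

```python
import re

# Hypothetical cue pattern: a table/figure label, a number (arabic or roman),
# optional punctuation, then whitespace or end of text.
CAPTION_CUE = re.compile(
    r"^\s*(table|figure|fig\.?)\s+(\d+|[ivxlcdm]+)\s*[.:)\-]?(\s|$)",
    re.IGNORECASE,
)


def looks_like_caption(text):
    """Return True if the paragraph text opens with a table/figure label."""
    return bool(CAPTION_CUE.match(text))
```

A stricter variant could additionally require that the paragraph sits next to a table or figure, which would cut down on false positives in running text.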

However, I'm often seeing the captions being subsumed into the ref-list (though the tables themselves are always detected properly), and this is especially obvious when there are a few tables or figures in a row and only the first one has its title "broken."

What would be the best way for me to address this? I know the bibliography classifier is run before the caption classifiers in https://github.com/MartinPaulEve/meTypeset/blob/master/bin/nlmprocessor.py, and I believe it tries specifically to carve out the last block of unstructured text, so I'd want to be careful to regression test any changes we make to this behaviour.

@axfelix
Contributor Author

axfelix commented May 24, 2017

It seems like this should be caught by stuff like https://github.com/MartinPaulEve/meTypeset/blob/master/bin/teimanipulate.py#L438 -- are we dropping linebreaks somehow in these cases?

@MartinPaulEve
Owner

MartinPaulEve commented May 26, 2017 via email

@axfelix
Contributor Author

axfelix commented May 26, 2017

table_after_refs.docx

Here's a start, I'll see if I can trim it down further.

@axfelix
Contributor Author

axfelix commented May 26, 2017

table_after_refs_minimal.docx

It actually breaks considerably worse this way...

It definitely looks like the bibliography classifier is being overzealous, but there's nothing very illuminating in the debug output.

@axfelix
Contributor Author

axfelix commented May 26, 2017

Just prodding around at this point, but after looping through elements_to_parse in teimanipulate and printing their children, it looks like there are definitely some elements getting added to the ref-list that contain only table rows, in the latter test example:

$ metypeset table_after_refs_minimal.docx test
[<Element {http://www.tei-c.org/ns/1.0}ref at 0x4022d88>]
[<Element {http://www.tei-c.org/ns/1.0}hi at 0x4022d88>, <Element {http://www.tei-c.org/ns/1.0}hi at 0x4022c48>]
[<Element {http://www.tei-c.org/ns/1.0}row at 0x4022d88>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4022b48>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4022c48>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4022948>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031048>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031788>, <Element {http://www.tei-c.org/ns/1.0}row at 0x40310c8>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031088>, <Element {http://www.tei-c.org/ns/1.0}row at 0x40311c8>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031108>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031148>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031188>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031588>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031208>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031288>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031348>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031448>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031548>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031fc8>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031888>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031f88>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4036048>]
[<Element {http://www.tei-c.org/ns/1.0}hi at 0x4022d88>, <Element {http://www.tei-c.org/ns/1.0}hi at 0x4022b48>]
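For anyone wanting to reproduce this kind of inspection outside meTypeset, a rough standalone sketch follows. It uses the standard library's `xml.etree.ElementTree` rather than whatever teimanipulate uses internally, and the helper names and the sample TEI fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def local_name(element):
    """Strip the '{namespace}' prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def summarize_children(candidates):
    """Return one list of child tag names per candidate element."""
    return [[local_name(child) for child in candidate] for candidate in candidates]


def contains_only_rows(candidate):
    """True if the element has children and all of them are TEI rows."""
    children = list(candidate)
    return bool(children) and all(c.tag == "{%s}row" % TEI_NS for c in children)


# Invented fragment: a reference, a table, and a caption-like paragraph.
sample = ET.fromstring(
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    "<p><ref>Smith 2010</ref></p>"
    "<table><row/><row/></table>"
    "<p><hi>Table 1</hi><hi>Caption</hi></p>"
    "</div>"
)
candidates = list(sample)
summary = summarize_children(candidates)
row_only = [contains_only_rows(c) for c in candidates]
```

Printing tag names instead of the raw element reprs makes row-only candidates easy to spot in the list.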

@MartinPaulEve
Owner

MartinPaulEve commented May 27, 2017 via email

@axfelix
Contributor Author

axfelix commented May 27, 2017

Yeah, that's what I thought. As a starting point for fixing this, should I be trying to prevent problematic items from being added to that list at all? Or remove them once they're there? I'm not sure which has more implications for the reference classifier "leaving them alone".

We are seeing several documents submitted to some journals that always have tables and figures after references.

@axfelix
Contributor Author

axfelix commented Jun 19, 2017

Hey Martin,

Any chance you can revisit this with me?

@MartinPaulEve
Owner

MartinPaulEve commented Jun 20, 2017 via email

@axfelix
Contributor Author

axfelix commented Jun 20, 2017

Sure! The two test documents I uploaded earlier in this thread (a couple comments up) should hopefully be illuminating and work for this purpose -- basically, the bibliography classifier needs to not ruin parsing of tables (especially table captions) that are located below the bibliography.

I'll be on vacation the rest of the week myself and at a conference next week, but let me know what I can do to help further.

@MartinPaulEve
Owner

OK, so I've looked at this a bit further, and I'm not quite sure what we should do with it. Should tables be appended to the end of the body? Obviously, JATS separates the ref-list from the body, and we assume that once we've reached the ref-list we're done with the body.

If we want to include this in the body, we need to add special handling for content that should never occur in a ref-list, and that handling would need to cover tables.

Let me know your thoughts.
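As a rough illustration of the "special handling" option, the sketch below moves blacklisted elements out of a ref-list and appends them to the body of a simplified JATS-like tree. The blacklist contents, the function name, and the document shape are assumptions for illustration only; this is not how meTypeset currently behaves.

```python
import xml.etree.ElementTree as ET

# Hypothetical blacklist of tags that should never sit directly in a ref-list.
NOT_REFERENCES = {"table-wrap", "table", "fig"}


def move_non_references_to_body(article):
    """Move blacklisted children of <ref-list> to the end of <body>.

    Returns the number of elements moved; their relative order is preserved.
    """
    body = article.find("body")
    ref_list = article.find("back/ref-list")
    if body is None or ref_list is None:
        return 0
    # Collect first, then mutate, so we never modify a list while iterating it.
    misplaced = [child for child in ref_list if child.tag in NOT_REFERENCES]
    for child in misplaced:
        ref_list.remove(child)
        body.append(child)
    return len(misplaced)


# Invented example document with a table and a figure stranded in the ref-list.
article = ET.fromstring(
    "<article><body><p>Text</p></body><back><ref-list>"
    '<ref id="r1"/><table-wrap id="t1"/><ref id="r2"/><fig id="f1"/>'
    "</ref-list></back></article>"
)
moved = move_non_references_to_body(article)
```

Removing items after classification, rather than preventing them from being added, keeps the reference classifier itself untouched, which may make regression testing simpler.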

@axfelix
Contributor Author

axfelix commented Jun 27, 2017

I think appending it to the body is a good idea. Would the blacklist implementation you're proposing be sufficient to catch table captions if they precede the table itself and otherwise look like "normal" paragraphs? It's probably not the most salient issue, but it is what led to us looking into this in the first place...
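One way the caption question might be handled, sketched under assumptions: treat a paragraph as a caption only when it directly precedes a table and opens with a table/figure label, so that a reference entry that merely starts with "Table" stays in the ref-list. The element names, the regex, and `split_trailing_block` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical label cue; real cues would need to be tuned on the corpus.
CAPTION_START = re.compile(r"^\s*(table|figure|fig\.?)\s+\d+", re.IGNORECASE)


def split_trailing_block(elements):
    """Partition elements into (references, body_items).

    A <table> always goes to body_items. A <p> goes with it only when it
    directly precedes a table and its text starts with a caption label.
    """
    references, body_items = [], []
    for index, element in enumerate(elements):
        following = elements[index + 1] if index + 1 < len(elements) else None
        text = "".join(element.itertext())
        is_caption = (
            element.tag == "p"
            and following is not None
            and following.tag == "table"
            and CAPTION_START.match(text) is not None
        )
        if element.tag == "table" or is_caption:
            body_items.append(element)
        else:
            references.append(element)
    return references, body_items


# Invented trailing block: one reference, then two captioned tables.
block = ET.fromstring(
    "<div>"
    "<p>Smith, J. (2010). A title.</p>"
    "<p>Table 1. Sample sizes</p><table><row/></table>"
    "<p>Table 2. Outcomes</p><table><row/></table>"
    "</div>"
)
references, body_items = split_trailing_block(list(block))
```

Requiring adjacency to a table is what lets several captioned tables in a row be handled uniformly, rather than only the first one.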
