Some parsing errors trigger a cascade of failures #23

Open
jtmorgan opened this issue Aug 17, 2016 · 5 comments

Comments

@jtmorgan

I haven't been able to pin down the cause yet, but it looks like, in some cases, a failure to parse one comment causes WikiChatter to fail on the rest of the comments on the page, even those in different threads. It still splits the later comments into text blocks, but it doesn't extract the author or timestamp, and it doesn't try to reconstruct the thread structure.

This may be related to this issue (@yuvipanda and I are working on the same dataset). I can provide examples, but I wanted to check whether the cause is already known before I do more spot checks. Has anyone else experienced this?

@kjschiroo
Collaborator

Could you provide the revision id number for one of the pages that is parsing poorly?

@jtmorgan
Author

Hi @kjschiroo, thanks for the quick reply. Here's an example

In this case, maybe it has something to do with the fact that the final comment in the thread was added after the fact?

That's the only edit to this page that wasn't made by a lowercase sigmabot.

@kjschiroo
Collaborator

kjschiroo commented Aug 17, 2016

@jtmorgan I think this has been fixed with the current version of the parser. I'm guessing it had something to do with the date formatting, but I don't remember exactly.

Using the current version from this input:
695594020.txt

I get this output:
output_pretty.txt

Please let me know if this is not the case.

@kjschiroo
Collaborator

Revising my previous statement: the error does occur with the current version. It appears to be an issue with mwparserfromhell (mwp). mwp isn't parsing the user page link as a wikilink, so we never conclude that we have found a signature. To demonstrate this:

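import mwparserfromhell as mwp

# Read the wikitext from the attached input file.
with open("695594020.txt", encoding="utf-8") as f:
    text = f.read()

# Find the top-level node that contains the signature timestamp.
wcode = mwp.parse(text)
for node in wcode.nodes:
    if "04:42, 12 December 2015 (UTC)" in node:
        print(node)
        print(type(node))
        print("**************************************")
        break

# Re-parsing a single node should give back a single node,
# but here the Text node splits into Text and Wikilink nodes.
wcode = mwp.parse(str(node))
for node in wcode.nodes:
    print(node)
    print(type(node))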

Gives this as output:

Hello TeaHouse !

I wondering or curious about why my article Data Excellence submission was declined.
I forward you the message of a reviewer: This submission seems to be a test edit and not an article worthy of an encyclopedia. Please use the sandbox for any editing tests, but do not submit for review until you have an article that you want reviewed for inclusion in Wikipedia. Thank you.

I don't understand really what can i do ?
Can you help me ? [[User:Yanniyolo|Yanniyolo]] ([[User talk:Yanniyolo|talk]]) 04:42, 12 December 2015 (UTC)


<class 'mwparserfromhell.nodes.text.Text'>
**************************************

Hello TeaHouse !

I wondering or curious about why my article Data Excellence submission was declined.
I forward you the message of a reviewer: This submission seems to be a test edit and not an article worthy of an encyclopedia. Please use the sandbox for any editing tests, but do not submit for review until you have an article that you want reviewed for inclusion in Wikipedia. Thank you.

I don't understand really what can i do ?
Can you help me ? 
<class 'mwparserfromhell.nodes.text.Text'>
[[User:Yanniyolo|Yanniyolo]]
<class 'mwparserfromhell.nodes.wikilink.Wikilink'>
 (
<class 'mwparserfromhell.nodes.text.Text'>
[[User talk:Yanniyolo|talk]]
<class 'mwparserfromhell.nodes.wikilink.Wikilink'>
) 04:42, 12 December 2015 (UTC)


<class 'mwparserfromhell.nodes.text.Text'>

I will raise an issue with them.
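
Until that is resolved upstream, one possible stopgap would be to look for signatures in the raw wikitext with a regular expression instead of relying on the node types. This is only a sketch, not what WikiChatter actually does: the pattern and the `find_signatures` helper are hypothetical, and they assume English Wikipedia style UTC timestamps with the user link and timestamp on the same line.

```python
import re

# Hypothetical pattern, not WikiChatter's actual signature grammar:
# a [[User:...]] or [[User talk:...]] link followed, on the same line,
# by a timestamp such as "04:42, 12 December 2015 (UTC)".
USER_LINK = r"\[\[\s*User(?:[ _]talk)?\s*:\s*(?P<user>[^|\]#/]+)[^\]]*\]\]"
TIMESTAMP = r"(?P<timestamp>\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\))"
SIGNATURE = re.compile(USER_LINK + r"[^\n]*?" + TIMESTAMP)


def find_signatures(raw_text):
    """Return (user, timestamp) pairs found in raw wikitext."""
    return [
        (match.group("user").strip(), match.group("timestamp"))
        for match in SIGNATURE.finditer(raw_text)
    ]


line = (
    "Can you help me ? [[User:Yanniyolo|Yanniyolo]] "
    "([[User talk:Yanniyolo|talk]]) 04:42, 12 December 2015 (UTC)"
)
print(find_signatures(line))
```

Because this works on the raw string, it does not depend on whether mwp returns the signature as one Text node or as separate Text and Wikilink nodes.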

@kjschiroo
Collaborator

For reference, the issue with them is: earwig/mwparserfromhell#160
