Add option to skip check for article elements #2

noahyoungs · 2024-04-08T16:25:52Z

I added an argument to MainContentExtractor.extract() that bypasses the check for article elements, so the function falls back to Trafilatura when no main element is found in the HTML. I think this is helpful because I've found that some webpages have article elements that don't contain the page's main content; the main content is found elsewhere on the page.
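
For illustration, the control flow described above might look roughly like the sketch below. It is not the library's actual code: `extract_sketch`, `skip_article`, and `fallback` are hypothetical names, the standard-library `html.parser` stands in for whatever parser the library really uses, and `fallback` is a placeholder callable for the Trafilatura step.

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect the text inside the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # nesting depth inside the target element
        self.done = False   # True once the first target element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_of(html, tag):
    """Return whitespace-normalized text of the first <tag>, or None if empty."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()) or None


def extract_sketch(html, skip_article=False, fallback=lambda h: None):
    """Hypothetical flow: <main>, then <article> unless skipped, then fallback."""
    main = text_of(html, "main")
    if main:
        return main
    if not skip_article:
        article = text_of(html, "article")
        if article:
            return article
    # Placeholder for the Trafilatura-based extraction step.
    return fallback(html)
```

With a page whose article element holds only a share widget, the default call returns the widget text, while `skip_article=True` hands the page to the fallback instead.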

A further improvement to the logic could be to use the contents of article elements only if they contain text above a minimum character count, or above a minimum percentage of all the text found on the webpage, but this is just an idea.
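
A minimal sketch of that threshold idea, using only the standard library. The function name `article_is_substantial` and the default values for `min_chars` and `min_share` are assumptions chosen for illustration, not tuned values, and the sketch accepts the article if either criterion is met.

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect text inside the first <tag>; with tag=None, collect all text."""

    def __init__(self, tag=None):
        super().__init__()
        self.tag = tag
        self.depth = 0 if tag else 1  # tag=None means "always collecting"
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_of(html, tag=None):
    """Return whitespace-normalized text (note: script/style text is not filtered)."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def article_is_substantial(html, min_chars=200, min_share=0.5):
    """Accept the article if it is long enough or holds enough of the page text."""
    article = text_of(html, "article")
    total = text_of(html)
    if not total:
        return False
    return len(article) >= min_chars or len(article) / len(total) >= min_share
```

An extractor could call this before trusting the article element and move on to the next check when it returns `False`.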

HawkClaws · 2024-04-09T00:53:10Z

Thank you for your pull request!
I agree that it would be good to have a flag that controls how far the process should go (that is, which functionality of this library you want to use).
However, I'm not comfortable with a flag that skips only the middle one of several checkpoints.
Either there should be more general flags that let you proceed to a specific check, or the methods should be split up so that they can be used individually.
I think we need to think this design through.

noahyoungs · 2024-04-09T02:43:19Z

Thank you for your review.
At first I thought about having a list parameter like ['main', 'article', 'trafilatura'] that defines not only which checks are used but also the order in which they run. But then I realized that trafilatura always has to be at the end of the process, because the other checks are much more likely to find nothing, and there could be a high error rate if trafilatura is not used. That's when I decided against implementing this.

However, I completely understand why you might think that my current implementation with just the skip article flag is sloppy. If you think it's a good idea, I can implement a list like ['main', 'article'] that defines the checks that happen before trafilatura and the order in which they are performed.

Otherwise, I will experiment with the article flag in my forked repo and create a new PR if I come up with a better solution.

By the way, thank you for sharing this library. I think it is great.

Edit: I overlooked the check that uses the deepest elements with id "main" or "content" before using trafilatura. This check can be added to the list as well.
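
A rough sketch of the ordered-list idea, where the caller chooses which checks run and in what order, and the fallback always runs last. All names here (`CHECKS`, `extract_with_order`, `checks`, `fallback`) are hypothetical, the standard-library `html.parser` stands in for the library's real parsing, and `fallback` is a placeholder callable for the trafilatura step. The id-based check is not implemented; it could be registered as one more entry in `CHECKS`.

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect the text inside the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_of(html, tag):
    """Return whitespace-normalized text of the first <tag>, or None if empty."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()) or None


# Registry of named checks; each takes HTML and returns text or None.
CHECKS = {
    "main": lambda html: text_of(html, "main"),
    "article": lambda html: text_of(html, "article"),
}


def extract_with_order(html, checks=("main", "article"), fallback=lambda h: None):
    """Run the named checks in the given order; the fallback always comes last."""
    for name in checks:
        if name not in CHECKS:
            raise ValueError(f"unknown check: {name!r}")
        text = CHECKS[name](html)
        if text:
            return text
    return fallback(html)
```

Passing `checks=("main",)` reproduces the skip-article behaviour, and `checks=()` goes straight to the fallback.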

noahyoungs added 2 commits April 9, 2024 01:15

added argument to skip article

23849fa

fix

ad6f2d0
