Add option to skip check for article elements #2

Open: wants to merge 2 commits into base: main

Conversation

noahyoungs
Contributor

I added an argument to MainContentExtractor.extract() that bypasses the check for article elements, so the function falls back to Trafilatura when no main element is found in the HTML. I think this is helpful because I've found that some webpages have article elements that don't contain the page's main content; the main content sits elsewhere on the page.

A further improvement to the logic could be to use the contents of article elements only if they contain text above a minimum character count, or above a minimum percentage of all the text found on the webpage, but this is just an idea.
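To make the threshold idea concrete, here is a rough sketch using only the standard library. It is not code from this PR or from the library: `article_is_substantial`, `min_chars`, and `min_ratio` are hypothetical names, and the default values are arbitrary placeholders.

```python
from html.parser import HTMLParser


class _ArticleTextCounter(HTMLParser):
    """Counts text characters on the whole page and inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while inside at least one <article>
        self.skip_depth = 0     # > 0 while inside <script> or <style>
        self.total_chars = 0
        self.article_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore script and style bodies
        n = len(data.strip())
        self.total_chars += n
        if self.article_depth:
            self.article_chars += n


def article_is_substantial(html, min_chars=200, min_ratio=0.0):
    """Hypothetical helper: True if the <article> text passes both thresholds."""
    counter = _ArticleTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return False
    ratio = counter.article_chars / counter.total_chars
    return counter.article_chars >= min_chars and ratio >= min_ratio
```

With something like this, the article check would only be trusted when the helper returns True; otherwise extraction would move on to the next check.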

@HawkClaws
Owner

Thank you for your pull request!
I agree with you that it would be good to have a flag that says how far the process should go (what functionality of this library you want to use).
However, I don't feel comfortable with the fact that there are several checkpoints, but only the middle one can be skipped.
Either you should have advanced flags so that you can proceed to a specific check process, or you should split up the methods so that they can be used individually.
I think we need to think about such things.

@noahyoungs
Contributor Author

noahyoungs commented Apr 9, 2024

Thank you for your review.
At first I thought about having a list parameter like ['main', 'article', 'trafilatura'] that defines not only which checks are used but also the order in which they run. But then I realized that Trafilatura always has to come last, because the other checks are much more likely to find nothing, and the error rate could be high if Trafilatura is not used. That's why I decided against implementing this.

However, I completely understand why you might think that my current implementation with just the skip-article flag is sloppy. If you think it's a good idea, I can implement a list like ['main', 'article'] that defines the checks that run before Trafilatura and the order in which they are performed.

Otherwise, I will experiment with the article flag in my forked repo and create a new PR if I come up with a better solution.

By the way, thank you for sharing this library. I think it is great.

Edit: I overlooked the check that uses the deepest elements with id "main" or "content" before using Trafilatura. This check can be added to the list as well.
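For illustration, the ordered-list idea could look roughly like the sketch below. Everything in it is an assumption rather than the library's actual code: `extract_with_checks`, `REGISTRY`, and `fallback_extractor` are hypothetical names, the regex checks are simplistic stand-ins for the real element lookups, and the tag-stripping fallback only marks the place where Trafilatura would be called.

```python
import re


def _find_tag(tag):
    """Build a stand-in check that returns the inner HTML of the first <tag>."""
    pattern = re.compile(rf"<{tag}\b[^>]*>(.*?)</{tag}>", re.DOTALL | re.IGNORECASE)

    def check(html):
        match = pattern.search(html)
        return match.group(1).strip() if match else None

    return check


# Hypothetical registry of the checks that may run before the fallback.
REGISTRY = {"main": _find_tag("main"), "article": _find_tag("article")}


def fallback_extractor(html):
    """Placeholder for the Trafilatura step: here it only strips tags."""
    return re.sub(r"<[^>]+>", " ", html).strip()


def extract_with_checks(html, checks, registry=REGISTRY, fallback=fallback_extractor):
    """Run the named checks in order; the fallback always runs last."""
    for name in checks:
        if name not in registry:
            raise ValueError(f"unknown check: {name!r}")
    for name in checks:
        result = registry[name](html)
        if result:
            return name, result
    return "fallback", fallback(html)
```

Passing `["main"]` would reproduce the skip-article behavior from this PR, `["main", "article"]` the default order, and an empty list would go straight to the fallback.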
