Add TrafilaturaExtractor class #431
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,11 @@ | |
|
||
import pytest | ||
|
||
from nemo_curator.download import ResiliparseExtractor, download_and_extract | ||
from nemo_curator.download import ( | ||
ResiliparseExtractor, | ||
TrafilaturaExtractor, | ||
download_and_extract, | ||
) | ||
from nemo_curator.download.commoncrawl import ( | ||
CommonCrawlWARCDownloader, | ||
CommonCrawlWARCExtractor, | ||
|
@@ -12,68 +16,73 @@ | |
) | ||
|
||
|
||
@pytest.fixture | ||
def html_string(): | ||
# Modified from https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/abdf1966fb3cefe3e0790e510ab5cb1446f99a79/tests/resiliparse/extract/test_html2text.py | ||
html = """<!doctype html> | ||
<head> | ||
<title>My Title</title> | ||
<meta charset="utf-8"> | ||
<style>* { margin: 0; }</style> | ||
</head> | ||
<body> | ||
<section id="wrapper"> | ||
<nav> | ||
<ul> | ||
<li>Nav 1</li> | ||
<li> | ||
<p>Nav 2</p> | ||
<ul> | ||
<li><p>Nav 3</p></li> | ||
</ul> | ||
</li> | ||
</ul> | ||
</nav> | ||
<main> | ||
This is a sample paragraph. In it we write words. | ||
These are stopwords: because did than has near we almost while what still. | ||
<a href="#foo" hidden>bar</a> | ||
|
||
<p> | ||
This paragraph doesn't have many stopwords. Remove it. | ||
<br>Let's keep this paragraph: either came does last new took taken making became from. | ||
</p> | ||
|
||
<button aria-hidden="true">Click here</button> | ||
<input type="hidden" value="foo"> | ||
<input type="text" value="Some text" placeholder="Insert text"> | ||
<input type="text" placeholder="Insert text"> | ||
<img src="" alt="Some image"> | ||
<object data="" class="some-class hidden">Cannot display object</object> | ||
</main> | ||
<script language="vbscript" type="text/vbscript">MsgBox("Hello World!")</script> | ||
<noscript>Sorry, your browser doesn't support VB Script!</noscript> | ||
<div><div><div><footer id="global-footer"> | ||
Copyright (C) 2021 Foo Bar | ||
</footer></div></div></div> | ||
</section> | ||
</body> | ||
</html>""" | ||
return html | ||
|
||
|
||
class TestDownload: | ||
def test_imports(self): | ||
from nemo_curator.download import ( | ||
JusTextExtractor, | ||
ResiliparseExtractor, | ||
TrafilaturaExtractor, | ||
download_arxiv, | ||
download_common_crawl, | ||
download_wikipedia, | ||
) | ||
|
||
assert True | ||
|
||
def test_resiliparse_extract_text(self): | ||
# Modified from https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/abdf1966fb3cefe3e0790e510ab5cb1446f99a79/tests/resiliparse/extract/test_html2text.py | ||
html = """<!doctype html> | ||
<head> | ||
<title>My Title</title> | ||
<meta charset="utf-8"> | ||
<style>* { margin: 0; }</style> | ||
</head> | ||
<body> | ||
<section id="wrapper"> | ||
<nav> | ||
<ul> | ||
<li>Nav 1</li> | ||
<li> | ||
<p>Nav 2</p> | ||
<ul> | ||
<li><p>Nav 3</p></li> | ||
</ul> | ||
</li> | ||
</ul> | ||
</nav> | ||
<main> | ||
This is a sample paragraph. In it we write words. | ||
These are stopwords: because did than has near we almost while what still. | ||
<a href="#foo" hidden>bar</a> | ||
|
||
<p> | ||
This paragraph doesn't have many stopwords. Remove it. | ||
<br>Let's keep this paragraph: either came does last new took taken making became from. | ||
</p> | ||
|
||
<button aria-hidden="true">Click here</button> | ||
<input type="hidden" value="foo"> | ||
<input type="text" value="Some text" placeholder="Insert text"> | ||
<input type="text" placeholder="Insert text"> | ||
<img src="" alt="Some image"> | ||
<object data="" class="some-class hidden">Cannot display object</object> | ||
</main> | ||
<script language="vbscript" type="text/vbscript">MsgBox("Hello World!")</script> | ||
<noscript>Sorry, your browser doesn't support VB Script!</noscript> | ||
<div><div><div><footer id="global-footer"> | ||
Copyright (C) 2021 Foo Bar | ||
</footer></div></div></div> | ||
</section> | ||
</body> | ||
</html>""" | ||
|
||
def test_resiliparse_extract_text(self, html_string): | ||
algorithm = ResiliparseExtractor() | ||
stop_words = get_stop_list_dict() | ||
result = algorithm.extract_text(html, stop_words["ENGLISH"]) | ||
result = algorithm.extract_text(html_string, stop_words["ENGLISH"]) | ||
|
||
expected = [ | ||
"This is a sample paragraph. In it we write words. These are stopwords: because did than has near we almost while what still.", | ||
|
@@ -82,6 +91,17 @@ def test_resiliparse_extract_text(self): | |
|
||
assert result == expected | ||
|
||
def test_trafilatura_extract_text(self, html_string): | ||
algorithm = TrafilaturaExtractor() | ||
stop_words = get_stop_list_dict() | ||
result = algorithm.extract_text(html_string, stop_words["ENGLISH"]) | ||
|
||
expected = [ | ||
"Let's keep this paragraph: either came does last new took taken making became from.", | ||
] | ||
Comment on lines +146 to +148

Trafilatura has a really bad bug where it is returning the string twice. I can double check all of my logic, but in the case that this is a Trafilatura-specific issue, I am debating whether I should add our own exact deduplication code into this. (Trafilatura actually has their own

Possible related issue: adbar/trafilatura#634

I have not scoped to see what conditions cause this.

Another issue: adbar/trafilatura#768

Solved by adding support for these extraction and deduplication parameters: https://trafilatura.readthedocs.io/en/latest/settings.html.

This seems like an issue customers might run into. Have you looked in papers and such for good default values? If Trafilatura's defaults aren't good (or aren't used by most researchers), it could be good to substitute our own.

Yeah, I can look into it. At the very least, I think setting
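For reference, the "exact deduplication" fallback floated in this thread could look roughly like the sketch below. It is not the code in this PR and not Trafilatura's own deduplication; `dedupe_exact` is a hypothetical helper name, and it assumes the extractor returns a list of paragraph strings, as the `expected` list in the test suggests.

```python
from typing import Iterable, List


def dedupe_exact(paragraphs: Iterable[str]) -> List[str]:
    """Drop exact repeats, keeping the first occurrence and the original order.

    Hypothetical helper for illustration only; not part of NeMo Curator
    or Trafilatura.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = paragraph.strip()
        # Skip empty strings and anything already emitted.
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(key)
    return unique


# The symptom described above: the same extracted string returned twice.
duplicated = [
    "Let's keep this paragraph: either came does last new took taken making became from.",
    "Let's keep this paragraph: either came does last new took taken making became from.",
]
print(dedupe_exact(duplicated))
```

A post-processing step like this only catches byte-for-byte repeats after whitespace stripping; near-duplicates would still pass through, which is one reason to prefer the library's own settings where they work.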
||
|
||
assert result == expected | ||
|
||
def test_common_crawl_urls(self): | ||
start_snapshot = "2021-04" | ||
end_snapshot = "2021-10" | ||
|
Can you add a class level docstring explaining what trafilatura is?
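One possible shape for such a docstring is sketched below, as a starting point rather than the wording that was merged. `TrafilaturaExtractorSketch` is a hypothetical stand-in class, and the `extract_text(html, stop_words)` signature is assumed from how the test above calls it.

```python
class TrafilaturaExtractorSketch:
    """Hypothetical stand-in for the extractor added in this PR.

    Trafilatura is an open-source Python library for extracting the main
    text content of a web page from its HTML while discarding boilerplate
    such as navigation menus, headers, and footers.

    This class wraps that extraction so it can be used as one of the
    text-extraction algorithms alongside the JusText and Resiliparse
    extractors.
    """

    def extract_text(self, html, stop_words):
        # Assumed signature, mirroring the call in the test above.
        raise NotImplementedError("Illustration only; see the PR for the real logic.")


print(TrafilaturaExtractorSketch.__doc__)
```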