Add open-source text extraction libraries #293

Open: wants to merge 33 commits into base: main
Changes from 1 commit
33 commits
d49463b
feat: add justext
garrethlee Sep 25, 2024
9fbc6b2
fix: remove justext cli comment
garrethlee Sep 25, 2024
a6cce5d
feat: add resiliparse
garrethlee Sep 25, 2024
add6807
feat: add inscriptis
garrethlee Sep 27, 2024
e3a7285
feat: add readabilipy
garrethlee Sep 27, 2024
2a6ef15
feat: add readability
garrethlee Sep 27, 2024
84f1ed4
feat: add require_readability to utils
garrethlee Sep 27, 2024
c085736
feat: add tests
garrethlee Sep 27, 2024
ea3a915
feat: changed configs & pyproject
garrethlee Sep 27, 2024
891850e
fix: move postprocessor to init
garrethlee Sep 27, 2024
76816c8
feat: implement clean_html method in extractors and update Inscriptis…
garrethlee Dec 11, 2024
b5cd839
feat: add justext
garrethlee Sep 25, 2024
8f5fc26
fix: remove justext cli comment
garrethlee Sep 25, 2024
b3e7942
feat: add resiliparse
garrethlee Sep 25, 2024
24d1594
feat: add inscriptis
garrethlee Sep 27, 2024
be87f9a
feat: add readabilipy
garrethlee Sep 27, 2024
9f83073
feat: add readability
garrethlee Sep 27, 2024
c40db90
feat: add require_readability to utils
garrethlee Sep 27, 2024
b3912b2
feat: add tests
garrethlee Sep 27, 2024
0eb2f2a
feat: changed configs & pyproject
garrethlee Sep 27, 2024
462ffc1
fix: move postprocessor to init
garrethlee Sep 27, 2024
af5a067
feat: implement clean_html method in extractors and update Inscriptis…
garrethlee Dec 11, 2024
e273e51
Merge branch 'main' into feat/text-extraction
garrethlee Dec 11, 2024
d0f6ead
Merge remote-tracking branch 'refs/remotes/origin/feat/text-extractio…
garrethlee Dec 11, 2024
26bf413
refactor: move warning log to constructor to avoid ballooning log fil…
garrethlee Dec 18, 2024
c9f1c2b
style: fixed lint errors
garrethlee Dec 21, 2024
2ac81d5
style: fixed ruff format errors
garrethlee Dec 21, 2024
751ec13
remove additional brotlipy in pyproject
garrethlee Dec 21, 2024
0be58ae
undo changes made to tokenizer (for number tokenization experiment)
garrethlee Dec 22, 2024
56a71ed
delete modular extractor due to redundancy
garrethlee Dec 22, 2024
cd18c59
improved test case robustness
garrethlee Dec 22, 2024
aae7e33
nit
guipenedo Dec 26, 2024
f816913
changed trafilatura default args
garrethlee Dec 28, 2024
feat: add require_readability to utils
garrethlee committed Dec 11, 2024
commit c40db90a40591f4d79671e75755c7ba7d46600a3
8 changes: 8 additions & 0 deletions tests/utils.py
@@ -87,6 +87,14 @@ def require_inscriptis(test_case):
    return test_case


def require_readabilipy(test_case):
    try:
        import readabilipy  # noqa: F401
    except ImportError:
        test_case = unittest.skip("test requires readabilipy")(test_case)
    return test_case


def require_pyarrow(test_case):
    try:
        import pyarrow  # noqa: F401
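
For illustration, here is a minimal sketch of how a decorator like require_readabilipy could be applied to a test. The function body is copied from the diff above. The test class ExampleReadabilipyTest, its test_import method, and the run_example helper are hypothetical names that do not appear in this PR. The sketch only checks that the module imports and does not call any readabilipy API.

```python
import io
import unittest


def require_readabilipy(test_case):
    """Skip the decorated test when readabilipy cannot be imported."""
    try:
        import readabilipy  # noqa: F401
    except ImportError:
        test_case = unittest.skip("test requires readabilipy")(test_case)
    return test_case


class ExampleReadabilipyTest(unittest.TestCase):  # hypothetical test class
    @require_readabilipy
    def test_import(self):
        # Only reached when readabilipy is installed; otherwise unittest
        # reports this test as skipped instead of failed.
        import readabilipy

        self.assertIsNotNone(readabilipy)


def run_example():
    """Run the example test quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleReadabilipyTest)
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite)


result = run_example()
```

With this pattern, a missing optional dependency shows up as a skip with the reason "test requires readabilipy" rather than as an import error, so the suite stays green on machines without the extra installed.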