feat(cli): Add --filename-template and --max-length options #763

Open
wants to merge 1 commit into
base: master
6 changes: 5 additions & 1 deletion README.md
@@ -66,6 +66,11 @@ the University of Munich.
  - JSON
  - HTML, XML and [XML-TEI](https://tei-c.org/)

- Flexible output file naming:
  - Template-based filename generation with variables like `{domain}`, `{path}`, `{hash}`
  - Path length control and automatic truncation
  - Safe character handling and URL component parsing

- Optional add-ons:
  - Language detection on extracted content
  - Speed optimizations
@@ -74,7 +79,6 @@ the University of Munich.
  - Regular updates, feature additions, and optimizations
  - Comprehensive documentation


### Evaluation and alternatives

Trafilatura consistently outperforms other open-source libraries in text
6 changes: 6 additions & 0 deletions docs/quickstart.rst
@@ -119,7 +119,13 @@ Extraction options are also available on the command-line and they can be combin
$ < myfile.html trafilatura --json --no-tables
Use ``--filename-template`` to control how output filenames are generated based on the URL and content.

.. code-block:: bash

    $ trafilatura -u "https://example.com/path/dirs" --filename-template "{domain}/{path_dirs}/{hash}.{ext}" --markdown -o output/

This will produce a file named ``example.com/path/dirs/uOHdo6wKo4IK0pkL.md`` in the ``output`` directory.
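
As a rough illustration of how the placeholders in this example relate to the URL, the following standalone Python sketch expands the same template with the standard library. The helper name `expand_template` and the hash stand-in (BLAKE2b plus URL-safe base64) are assumptions made for illustration, not Trafilatura's implementation, so the hash part will differ from the file name shown above.

```python
import base64
import hashlib
from urllib.parse import urlsplit


def expand_template(template: str, url: str, content: str, ext: str) -> str:
    """Fill a filename template from URL parts (illustrative only)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Stand-in hash: 12 bytes of BLAKE2b, URL-safe base64 (16 characters).
    digest = hashlib.blake2b(content.encode("utf-8"), digest_size=12).digest()
    values = {
        "domain": parts.netloc,
        "path": "_".join(segments),        # flattened path
        "path_dirs": "/".join(segments),   # path kept as directories
        "hash": base64.urlsafe_b64encode(digest).decode("ascii"),
        "ext": ext,
    }
    return template.format(**values)


print(expand_template("{domain}/{path_dirs}/{hash}.{ext}",
                      "https://example.com/path/dirs", "some text", "md"))
```

The printed path has the same shape as the documented one: the domain, the URL path as directories, then a hash-based file name with the `md` extension.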

Further steps
-------------
11 changes: 11 additions & 0 deletions docs/settings.rst
@@ -52,6 +52,17 @@ Using a custom file on the command-line
With the ``--config-file`` option, followed by the file name or path. All the required variables have to be present in the custom file.


Filename Generation
^^^^^^^^^^^^^^^^^^^^^
Two options customize how output filenames are generated:

- ``--filename-template``: template string for generating filenames, using variables such as ``{domain}``, ``{path}``, ``{hash}`` and ``{ext}``. Example: ``--filename-template "{domain}/{hash}.{ext}"``
- ``--max-length``: maximum total path length, including directory components (default: 250 characters). Example: ``--max-length 200``

The filename template can include directory separators to preserve parts of the original URL's path structure. Unsafe characters in values taken from the URL are sanitized automatically. If the total path would exceed ``--max-length``, it is truncated: directory and file components are kept as long as possible, and the file name can be reduced to a minimum of ``{hash}.{ext}``.
Invalid variables, or unsafe characters in the static parts of the template, raise an error.
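
One way to picture the sanitizing step is the sketch below, which replaces characters that file systems commonly reject in a single path component. The function name `sanitize_component`, the character set and the underscore replacement are assumptions chosen for illustration. They are not necessarily the rules applied by ``--filename-template``.

```python
import re

# Characters frequently disallowed in file names, plus control characters (assumed set).
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_component(value: str, replacement: str = "_") -> str:
    """Replace unsafe characters in a single path component (sketch)."""
    cleaned = UNSAFE_CHARS.sub(replacement, value)
    # Avoid components that are empty or consist only of dots and spaces.
    return cleaned.strip(". ") or replacement


print(sanitize_component('report: "Q1"?'))
```

Sanitizing each component separately keeps the directory separators that come from the template itself, while separators hidden inside a substituted value are neutralized.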


Adapting settings in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^

10 changes: 10 additions & 0 deletions docs/troubleshooting.rst
@@ -101,3 +101,13 @@ Download first and extract later
Since they have distinct characteristics, it can be useful to separate the infrastructure needed for download from the extraction. Using a custom IP or network infrastructure can also prevent your usual IP from getting banned.

For an approach using files from the Common Crawl and Trafilatura, see the external tool `datatrove/process_common_crawl_dump.py <https://github.com/huggingface/datatrove/blob/main/examples/process_common_crawl_dump.py>`_.


Invalid template variables and filenames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you see an error about invalid template variables, check that your ``--filename-template`` string only uses supported variables such as ``{domain}`` or ``{hash}``.
Refer to the ``trafilatura/filename.py`` source for a complete list.

An error about unsafe characters in the filename template means that characters like ``<>``, ``:`` or ``"`` were used outside of ``{variable}`` sections.
Use only alphanumeric characters, underscores, dashes, dots and forward slashes in the static parts of the template.
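
To check a template before running the command, a small pre-flight script along these lines can help. It is a sketch under stated assumptions: `check_template` is a hypothetical helper, the set of variable names is the one listed in the CLI usage documentation for this option (the source file may define more), and dots are accepted in static text because the documented examples use them.

```python
import re
from string import Formatter

# Variable names as listed in the usage documentation (possibly incomplete).
SUPPORTED = {"domain", "path", "path_dirs", "params", "hash", "ext", "lang"}
# Static text: letters, digits, underscores, dashes, dots and slashes (assumed).
SAFE_STATIC = re.compile(r"^[A-Za-z0-9_./-]*$")


def check_template(template: str) -> list:
    """Return a list of problems found in the template (empty if none)."""
    problems = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for literal, field, _spec, _conv in Formatter().parse(template):
        if not SAFE_STATIC.match(literal):
            problems.append(f"unsafe static text: {literal!r}")
        if field is not None and field not in SUPPORTED:
            problems.append(f"unknown variable: {{{field}}}")
    return problems


print(check_template("{domain}/{title}:{hash}.{ext}"))
```

An empty list means the template only uses the listed variables and the assumed safe characters. It does not guarantee that the command itself accepts the template.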
38 changes: 35 additions & 3 deletions docs/usage-cli.rst
@@ -92,6 +92,33 @@ Output as TXT without metadata is the default, another format can be selected in
*HTML output is available from version 1.11, Markdown from version 1.9 onwards.*


Filename Customization
~~~~~~~~~~~~~~~~~~~~~~

Use ``--filename-template`` to control how output filenames are generated based on the URL and content. Supported variables:

- ``{domain}``: Website domain
- ``{path}``: URL path segments, joined by underscores
- ``{path_dirs}``: URL path segments, joined by directory separators
- ``{params}``: URL query parameters
- ``{hash}``: Hash of extracted content
- ``{ext}``: File extension
- ``{lang}``: Identified language

Example: ``--filename-template "{domain}/{hash}.{ext}"``

Use ``--max-length`` to set the maximum total path length, including any directories. It defaults to 250 characters.

If the generated path would exceed this limit, it is truncated according to the following rules:

1. Individual directory and file components are preserved as long as possible.
2. The file component is reduced to a minimum of ``{hash}.{ext}``.
3. The ``--output-dir`` is omitted from length calculations.

Example: ``--max-length 200``

Invalid template variables, or unsafe characters in the static parts of the template, raise an error.
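
The truncation rules can be pictured with a deliberately simplified sketch. It is not the algorithm used by the option: the hypothetical `truncate_path` below falls back directly to the minimal ``{hash}.{ext}`` file name and then drops trailing directories, whereas the actual implementation may keep part of the original file name. The output directory is joined only after truncation, so it does not count towards the limit.

```python
import posixpath


def truncate_path(rel_path: str, hash_value: str, ext: str, max_length: int = 250) -> str:
    """Shorten a relative output path to fit max_length where possible (sketch)."""
    if len(rel_path) <= max_length:
        return rel_path
    directory, _filename = posixpath.split(rel_path)
    minimal = f"{hash_value}.{ext}"  # the file name is never shortened below this
    dirs = directory.split("/") if directory else []
    # Drop trailing directories until the minimal file name fits.
    while dirs and len("/".join(dirs + [minimal])) > max_length:
        dirs.pop()
    return "/".join(dirs + [minimal])


relative = truncate_path("example.com/" + "a" * 100 + ".md", "abc123", "md", max_length=40)
# The output directory is added afterwards and is not part of the length check.
print(posixpath.join("output", relative))
```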


Optimizing for precision and recall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -166,7 +193,7 @@ Two major command line arguments are necessary here:

.. hint::
Backup of HTML sources can be useful for archival and further processing:

``$ trafilatura --input-file links.txt --output-dir converted/ --backup-dir html-sources/ --xml``


@@ -288,14 +315,15 @@ For all usage instructions see ``trafilatura -h``:
trafilatura [-h] [-i INPUTFILE | --input-dir INPUTDIR | -u URL]
[--parallel PARALLEL] [-b BLACKLIST] [--list]
[-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
[--filename-template FILENAME_TEMPLATE] [--max-length MAX_LENGTH]
[--feed [FEED] | --sitemap [SITEMAP] | --crawl [CRAWL] |
--explore [EXPLORE] | --probe [PROBE]] [--archived]
[--url-filter URL_FILTER [URL_FILTER ...]] [-f]
[--formatting] [--links] [--images] [--no-comments]
[--no-tables] [--only-with-metadata] [--with-metadata]
[--target-language TARGET_LANGUAGE] [--deduplicate]
[--config-file CONFIG_FILE] [--precision] [--recall]
[--output-format {csv,json,html,markdown,txt,xml,xmltei} |
--csv | --html | --json | --markdown | --xml | --xmltei]
[--validate-tei] [-v] [--version]

@@ -331,6 +359,11 @@ Output:
preserve a copy of downloaded files in a backup
directory
--keep-dirs keep input directory structure and file names
--filename-template FILENAME_TEMPLATE
template for generating filenames (e.g.
{domain}/{path}-{hash}.{ext})
--max-length MAX_LENGTH
maximum length for generated file paths

Navigation:
Link discovery and web crawling
@@ -381,4 +414,3 @@ Format:
--xml shorthand for XML output
--xmltei shorthand for XML TEI output
--validate-tei validate XML TEI output

64 changes: 64 additions & 0 deletions tests/cli_tests.py
@@ -21,6 +21,7 @@

from trafilatura import cli, cli_utils, spider, settings
from trafilatura.downloads import add_to_compressed_dict, fetch_url
from trafilatura.filename import generate_hash_filename
from trafilatura.utils import LANGID_FLAG

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
@@ -586,6 +587,67 @@ def test_probing():
else:
assert f.getvalue().strip() == url

def test_filename_template_cli_integration():
    """Test CLI integration with FilenameTemplate."""
    # Test hierarchical structure with no extension
    testargs = ["", "--filename-template", "{domain}/{path_dirs}", "--output-dir", "/tmp/test", "-u", "https://example.com/blog/post1"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path, destination_dir = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 1")
    assert destination_dir == "/tmp/test/example.com/blog"
    assert output_path == "/tmp/test/example.com/blog/post1"

    # Test with markdown extension
    testargs = ["", "--filename-template", "{domain}/{path_dirs}.{ext}", "--output-dir", "/tmp/test", "--markdown", "-u", "https://example.com/blog/post1"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path2, destination_dir2 = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 1")
    assert destination_dir2 == "/tmp/test/example.com/blog"
    assert output_path2 == "/tmp/test/example.com/blog/post1.md"

    # Test flattened structure
    testargs = ["", "--filename-template", "{domain}/{path}", "--output-dir", "/tmp/test", "-u", "https://example.com/articles/tech/news"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path3, destination_dir3 = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 2")
    assert destination_dir3 == "/tmp/test/example.com"
    assert output_path3 == "/tmp/test/example.com/articles_tech_news"

    # Test with parameters
    testargs = ["", "--filename-template", "{domain}/{path_dirs}/{hash}-{params}", "--output-dir", "/tmp/test", "-u", "https://example.com/articles/tech?id=123&cat=news"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path4, destination_dir4 = cli_utils.determine_output_path(args=args, orig_filename="", content="Test content 3")
    assert destination_dir4 == "/tmp/test/example.com/articles/tech"
    assert output_path4 == f"/tmp/test/example.com/articles/tech/{generate_hash_filename('Test content 3')}-cat-news_id-123"

@pytest.mark.usefixtures("caplog")
def test_filename_template_cli_errors(caplog):
    """Test error handling in CLI filename template integration."""
    # Test URL too long
    testargs = ["", "--filename-template", "{domain}/{path_dirs}", "--output-dir", "/tmp/test", "-u", "https://example.com/" + "a" * 100, "--max-length", "100"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    output_path, destination_dir = cli_utils.determine_output_path(args=args, orig_filename="", content="test content")
    assert "_ttt_" in output_path
    assert destination_dir == "/tmp/test/example.com"
    assert generate_hash_filename("test content") in output_path

    # Test no URL
    testargs = ["", "--filename-template", "{domain}/{path}", "--output-dir", "/tmp/test"]
    with patch.object(sys, "argv", testargs):
        args = cli.parse_args(testargs)

    caplog.set_level(logging.WARNING)
    output_path2, destination_dir2 = cli_utils.determine_output_path(args=args, orig_filename="", content="test content")
    assert "Template generation failed: URL is required for template variables" in caplog.text
    assert output_path2 == "/tmp/test"
    assert generate_hash_filename("test content") in destination_dir2

if __name__ == "__main__":
    test_parser()
@@ -599,3 +661,5 @@ def test_probing():
    test_crawling()
    test_download()
    test_probing()
    test_filename_template_cli_integration()
    test_filename_template_cli_errors()
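
The value expected for ``{params}`` in the parameters test above, `cat-news_id-123`, is consistent with sorting the query parameters by key, joining each key and value with a dash, and joining the pairs with underscores. The sketch below reproduces that value with the standard library. `flatten_params` is a hypothetical name, and the sorting rule is inferred from the test expectation rather than taken from the implementation.

```python
from urllib.parse import parse_qsl, urlsplit


def flatten_params(url: str) -> str:
    """Turn a URL query string into a file-name-friendly token (sketch)."""
    # parse_qsl returns (key, value) pairs; sorting makes the result order-independent.
    pairs = sorted(parse_qsl(urlsplit(url).query))
    return "_".join(f"{key}-{value}" for key, value in pairs)


print(flatten_params("https://example.com/articles/tech?id=123&cat=news"))
```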
2 changes: 1 addition & 1 deletion tests/deduplication_tests.py
@@ -8,10 +8,10 @@
import trafilatura.deduplication

from trafilatura import extract
-from trafilatura.cli_utils import generate_hash_filename
 from trafilatura.core import Extractor
 from trafilatura.deduplication import (LRUCache, Simhash, content_fingerprint,
                                        duplicate_test)
+from trafilatura.filename import generate_hash_filename


DEFAULT_OPTIONS = Extractor()