review text
adbar committed Feb 16, 2024
1 parent 24aaeb1 commit d3b7014
Showing 4 changed files with 21 additions and 15 deletions.
6 changes: 3 additions & 3 deletions README.rst
@@ -48,7 +48,7 @@ Introduction

Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to commonly used formats.

Smart navigation and going from HTML bulk to essential parts can alleviate many problems related to text quality, by **focusing on the actual content**, **avoiding the noise** caused by recurring elements (headers, footers etc.), and **making sense of the data** with selected information. The extractor is designed to be **robust and reasonably fast**, it runs in production on millions of documents.
Going from HTML bulk to essential parts can alleviate many problems related to text quality, by **focusing on the actual content**, **avoiding the noise** caused by recurring elements (headers, footers etc.), and **making sense of the data** with selected information. The extractor is designed to be **robust and reasonably fast**, it runs in production on millions of documents.

The tool's versatility makes it **useful for quantitative and data-driven approaches**. It is used in the academic domain and beyond (e.g. in natural language processing, computational social science, search engine optimization, and information security).

@@ -128,7 +128,7 @@ Usage and documentation
- `Tutorials and use cases <https://trafilatura.readthedocs.io/en/latest/tutorials.html>`_


For video tutorials see this Youtube playlist:
Youtube playlist with video tutorials in several languages:

- `Web scraping tutorials and how-tos <https://www.youtube.com/watch?v=8GkiOM17t0Q&list=PL-pKWbySIRGMgxXQOtGIz1-nbfYLvqrci>`_

@@ -200,7 +200,7 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
Software ecosystem
~~~~~~~~~~~~~~~~~~

This software is part of a larger ecosystem. Case studies and publications are listed on the `Used By documentation page <https://trafilatura.readthedocs.io/en/latest/used-by.html>`_.
Case studies and publications are listed on the `Used By documentation page <https://trafilatura.readthedocs.io/en/latest/used-by.html>`_.

Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:

6 changes: 3 additions & 3 deletions docs/index.rst
@@ -78,7 +78,7 @@ Evaluation and alternatives

Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content. The extractor tries to strike a balance between limiting noise and including all valid parts.

The `benchmark section <evaluation.html>`_ details alternatives and results and the `evaluation readme <https://github.com/adbar/trafilatura/blob/master/tests/README.rst>`_ describes how to reproduce the evaluation.
The `benchmark section <evaluation.html>`_ details alternatives and results, the `evaluation readme <https://github.com/adbar/trafilatura/blob/master/tests/README.rst>`_ describes how to reproduce the evaluation.


In a nutshell
@@ -102,7 +102,7 @@ On the command-line:
$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...
For more information please refer to `usage documentation <usage.html>`_ and `tutorials <tutorials.html>`_.
For more see `usage documentation <usage.html>`_ and `tutorials <tutorials.html>`_.


.. raw:: html
@@ -181,7 +181,7 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
Software ecosystem
~~~~~~~~~~~~~~~~~~

This software is part of a larger ecosystem. Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.
Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.

Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:

2 changes: 1 addition & 1 deletion docs/tutorials.rst
@@ -32,7 +32,7 @@ Blog posts
Videos
^^^^^^

Youtube playlist
Youtube playlist with video tutorials in several languages
`Web scraping how-tos and tutorials <https://www.youtube.com/watch?v=8GkiOM17t0Q&list=PL-pKWbySIRGMgxXQOtGIz1-nbfYLvqrci>`_.


22 changes: 14 additions & 8 deletions docs/usage-api.rst
@@ -32,17 +32,23 @@ Examples
The API takes JSON as input and a corresponding header is required. It then returns a JSON string with the result.
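
As a minimal sketch of the request shape described above, the following uses only the Python standard library to build a POST request with a JSON body and the matching ``content-type`` header. The endpoint URL and the ``url``/``args``/``output_format`` fields are taken from the curl example in this file; the helper names ``build_request`` and ``send`` are hypothetical, and the structure of the returned JSON is not specified here, so the response is simply decoded and returned as-is.

```python
import json
import urllib.request

# Demo endpoint as quoted in this document; availability is not guaranteed.
API_URL = "https://trafilatura.mooo.com/extract-demo"


def build_request(page_url, output_format="xml"):
    """Build (but do not send) a POST request carrying a JSON payload.

    Hypothetical helper: only the payload fields shown in the curl
    example are used.
    """
    payload = {"url": page_url, "args": {"output_format": output_format}}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        # The API requires a header that matches the JSON input.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON string returned by the API."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_request("https://example.org")
    # Inspect the request locally; call send(req) to actually query the API.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to check the payload and header without any network access.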


CLI
~~~

.. code-block:: bash
$ curl -X POST "https://trafilatura.mooo.com/extract-demo" \
-H "content-type: application/json" \
--data '{
"url": "https://example.org",
"args": {
"output_format": "xml"
}
}'
-H "content-type: application/json" \
--data '{
"url": "https://example.org",
"args": {
"output_format": "xml"
}
}'
Python
~~~~~~

.. code-block:: python
@@ -67,5 +73,5 @@ The API takes JSON as input and a corresponding header is required. It then retu
Further information
-------------------

The API is still an early-stage product and the code is currently not available under an open-source license.
The API is still an early-stage product and the code is not available under an open-source license.
