diff --git a/README.rst b/README.rst
index 3882d94e..19768284 100644
--- a/README.rst
+++ b/README.rst
@@ -48,7 +48,7 @@ Introduction
 Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to commonly used formats.

-Smart navigation and going from HTML bulk to essential parts can alleviate many problems related to text quality, by **focusing on the actual content**, **avoiding the noise** caused by recurring elements (headers, footers etc.), and **making sense of the data** with selected information. The extractor is designed to be **robust and reasonably fast**, it runs in production on millions of documents.
+Going from HTML bulk to essential parts can alleviate many problems related to text quality, by **focusing on the actual content**, **avoiding the noise** caused by recurring elements (headers, footers etc.), and **making sense of the data** with selected information. The extractor is designed to be **robust and reasonably fast**, it runs in production on millions of documents.

 The tool's versatility makes it **useful for quantitative and data-driven approaches**. It is used in the academic domain and beyond (e.g. in natural language processing, computational social science, search engine optimization, and information security).

@@ -128,7 +128,7 @@ Usage and documentation
 - `Tutorials and use cases `_

-For video tutorials see this Youtube playlist:
+Youtube playlist with video tutorials in several languages:

 - `Web scraping tutorials and how-tos `_

@@ -200,7 +200,7 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
 Software ecosystem
 ~~~~~~~~~~~~~~~~~~

-This software is part of a larger ecosystem. Case studies and publications are listed on the `Used By documentation page `_.
+Case studies and publications are listed on the `Used By documentation page `_.

 Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:

diff --git a/docs/index.rst b/docs/index.rst
index 08cb69b0..319b185f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -78,7 +78,7 @@ Evaluation and alternatives
 Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content. The extractor tries to strike a balance between limiting noise and including all valid parts.

-The `benchmark section `_ details alternatives and results and the `evaluation readme `_ describes how to reproduce the evaluation.
+The `benchmark section `_ details alternatives and results, the `evaluation readme `_ describes how to reproduce the evaluation.

 In a nutshell
@@ -102,7 +102,7 @@ On the command-line:
     $ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
     # outputs main content and comments as plain text ...

-For more information please refer to `usage documentation `_ and `tutorials `_.
+For more see `usage documentation `_ and `tutorials `_.

 .. raw:: html

@@ -181,7 +181,7 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
 Software ecosystem
 ~~~~~~~~~~~~~~~~~~

-This software is part of a larger ecosystem. Case studies and publications are listed on the `Used By documentation page `_.
+Case studies and publications are listed on the `Used By documentation page `_.

 Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:

diff --git a/docs/tutorials.rst b/docs/tutorials.rst
index 694b6a42..b5e271bb 100644
--- a/docs/tutorials.rst
+++ b/docs/tutorials.rst
@@ -32,7 +32,7 @@ Blog posts
 Videos
 ^^^^^^

-Youtube playlist
+Youtube playlist with video tutorials in several languages

 `Web scraping how-tos and tutorials `_.

diff --git a/docs/usage-api.rst b/docs/usage-api.rst
index f62f6979..5c7fdff1 100644
--- a/docs/usage-api.rst
+++ b/docs/usage-api.rst
@@ -32,17 +32,23 @@ Examples
 The API takes JSON as input and a corresponding header is required. It then returns a JSON string with the result.

+CLI
+~~~
+
 .. code-block:: bash

     $ curl -X POST "https://trafilatura.mooo.com/extract-demo" \
-    -H "content-type: application/json" \
-    --data '{
-    "url": "https://example.org",
-    "args": {
-    "output_format": "xml"
-    }
-    }'
+    -H "content-type: application/json" \
+    --data '{
+    "url": "https://example.org",
+    "args": {
+    "output_format": "xml"
+    }
+    }'
+
+Python
+~~~~~~

 .. code-block:: python

@@ -67,5 +73,5 @@ The API takes JSON as input and a corresponding header is required. It then retu
 Further information
 -------------------

-The API is still an early-stage product and the code is currently not available under an open-source license.
+The API is still an early-stage product and the code is not available under an open-source license.