Release v0.2

Removes the dependency on allow- and deny-list files and manages this feature through the UI.

Added:
- Pluggable allow- and deny-list filters (@Sidneys1).
- UI management for the deny-list, allow-list entries, and their filters (@Sidneys1).

Removed:
- The old file-based allow- and deny-list filters.

Signed-off-by: Sidneys1 <[email protected]>
Sidneys1 · Jul 2, 2024 · 0749c7c
2 parents d1131b1 + 0d90364
Showing 53 changed files with 1,590 additions and 804 deletions.
diff --git a/.vscode/extensions.json b/.vscode/extensions.json
@@ -1,5 +1,10 @@
 {
     "recommendations": [
-        "ms-python.isort"
+        "ms-python.isort",
+        "streetsidesoftware.code-spell-checker",
+        "editorconfig.editorconfig",
+        "bierner.github-markdown-preview",
+        "eeyore.yapf",
+        "ms-python.mypy-type-checker"
     ]
 }
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -10,5 +10,8 @@
     "isort.args": [
         "--settings-path=${workspaceFolder}"
     ],
-    "python.analysis.typeCheckingMode": "off"
+    "cSpell.words": [
+        "hostnames"
+    ],
+    "python.analysis.typeCheckingMode": "standard"
 }
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,21 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/), and this project adheres to
 [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+v0.2 - 2024-07-02
+-----------------
+
+Removes the dependency on allow- and deny-list files and manages this feature through the UI.
+
+### Added
+
+- Pluggable allow- and deny-list filters (@sidneys1).
+- UI management for the deny-list, allow-list entries, and their filters (@sidneys1).
+
+### Removed
+
+- The old file-based allow- and deny-list filters.
+
+
 v0.1 - 2024-06-06
 -----------------
 

diff --git a/README.md b/README.md
@@ -125,87 +125,6 @@ podman-compose up --build --profile elasticsearch
 Configuration
 -------------
 
-### Allow and Deny Lists
-
-Memoria utilizes allow and deny lists to filter incoming history items so that unwanted websites aren't indexed. These
-lists are currently just text files containing one rule per line.
-
-<details>
-
-Shell-like quotation marks and backslashes are
-supported. A history item will be downloaded by Memoria, given the entries matching its domain name, if the URL is:
-
-1. Matched by any *strong* allowlist entry pertaining; or
-2. Matched by any *weak* allowlist entry pertaining, **and** doesn't match any *strong* denylist entries pertaining.
-
-Additionally, if a subdomain is not matched by any entries then its parent domains will be used sequentially. For
-example, if `gist.github.com` doesn't match any entries, then entries for `github.com` will be checked.
-
-A *weak* list entry is composed of just a domain name:
-```sh
-example.com
-```
-While a *strong* list entry is composed of a domain name and zero or more rules that can further restrict the entry:
-```sh
-example.com /login r^/$
-```
-
-There are currently two types of rules:
-- **Path rules** start with `/` and match if the URL path-part begins with this value.
-- **Regular expression rules** start with `r` and match if any part of the URL matches.
-
-So, to break it down, putting `example.com` in the allowlist and this entry in the denylist:
-
-<h3><code><ruby>example.com<rt>domain</rt></ruby> <ruby><code>/login</code><rt>path&ensp;rule</rt></ruby> <ruby><code>r^/$'</code><rt>regex&ensp;rule</rt></ruby></code></h3>
-
-Would result in these URLs being allowed:
-
-- `https://example.com/foo`
-- `https://example.com/foo/bar/baz#link?search=bat`
-
-And these URLs being denied:
-
-- <h3><samp>https:<wbr>//www<wbr>.<ruby><code>example.com</code><rt>domain</rt></ruby><ruby><code>/login</code><rt>path&ensp;rule</rt></ruby></samp></h3>
-- <h3><samp>https:<wbr>//www<wbr>.<ruby><code>example.com</code><rt>domain</rt></ruby><ruby><code>/login</code><rt>path&ensp;rule</rt>/flow2?step=0</ruby></samp></h3>
-- <h3><samp>https://<ruby><code>example.com</code><rt>domain</rt></ruby><ruby><code>/</code>&nbsp;&nbsp;&nbsp;<rt>regex rule</rt></ruby></samp></h3>
-
-<details><summary>Examples</summary>
-
-- Allow all URLs under GitHub.com, except login, search, my (Sidneys1) own projects and pages, and searches within
-  projects or organizations:
-
-  ```sh
-  # allowlist.txt
-  github.com
-
-  # denylist.txt
-  github.com /login /search /Sidneys1/ 'r/(?:search|repositories|issues)\?q='
-  ```
-
-- Allow any page under a domain except the landing page (`example.com/`):
-
-  ```sh
-  # allowlist.txt
-  example.com
-
-  # denylist.txt
-  example.com r^/$
-  ```
-
-* Deny any page at stackoverflow.com except questions:
-
-  ```sh
-  # allowlist.txt
-  stackoverflow.com /questions/ /q/
-
-  # denylist.txt
-  stackoverflow.com
-  ```
-
-</details>
-
-</details>
-
 ### Options
 
 Memoria has several deployment configuration options that control overall behavior. These can be set via environment
@@ -227,11 +146,6 @@ variables or container secrets. The following configuration options are provided
 
 $\frac{cpus}{2}$[^2]</td></tr>
     </tbody>
-    <tbody>
-        <tr><th rowspan="4">Allow/Deny Lists</th>
-            <td><code>allowlist</code></td> <td>Path to a file defining allowlist<sup><a href="#allow-and-deny-lists">§</a></sup> entries</td> <td><code>./data/allowlist.txt</code></td></tr>
-        <tr><td><code>denylist</code></td>  <td>Path to a file defining denylist<sup><a href="#allow-and-deny-lists">§</a></sup> entries</td>  <td><code>./data/denylist.txt</code></td></tr>
-    </tbody>
     <tbody>
         <tr><th rowspan="4">Databases</th>
             <td><code>database_uri</code></td>     <td>Connection URI to the Memoria database</td>   <td><code>sqlite+aiosqlite:///./data/memoria.db</code></td></tr>
@@ -261,12 +175,12 @@ Plugins
 -------
 
 Memoria utilizes a plugin architecture that allows for different methods of downloading URLs, transforming the
-downloaded content, and extracting indexable plain text from the content. Selecting which plugins Memoria will use is
-described in [§Configuration](#configuration).
+downloaded content, and extracting indexable plain text from the content. Selecting which plugins Memoria will use for
+web content retrieval and processing is described in [§Configuration](#configuration).
 
 <details>
 
-There are currently three types of Memoria Plugins, used during web content retrieval and processing:
+There are currently three types of Memoria Plugins used during web content retrieval and processing:
 - **Downloaders**<br>
   Downloaders are responsible for accessing a URL and retrieving its content from the internet. They can provide this
   content in many different formats to the next plugin in the stack. The most basic Downloaders (like the built-in
@@ -287,8 +201,15 @@ There are currently three types of Memoria Plugins, used during web content retr
   the original downloaded HTML (before any potential modification by Filter plugins) for `<meta ...>` values that could
   be used to enrich the Elasticsearch document, such as `"author"` or `"description"`.
 
+Other types of plugins:
+- **Scraping Rule Filters**<br>
+  Scraping rule filter plugins allow the Scraping Rules in the Settings UI to be extended with new functionality. These
+  filters help determine which history URLs will be retrieved and scraped.
+
 </details>
 
+<!-- TODO: section on scraping rule plugins -->
+
 > [!TIP]
 > See the [📑 Plugin Development](./docs/Plugin%20Development.md) guide for information on developing your own Memoria plugins.
 

diff --git a/docs/Plugin Development.md b/docs/Plugin Development.md
@@ -10,6 +10,7 @@ Plugin Development
   * 💾 [§ Downloaders](#downloaders)
   * ⚗️ [§ Filters](#filters)
   * 🔤 [§ Extractors](#extractors)
+* ⚗️ [§ Scraping Rule Filtering](#scraping-rule-filtering)
 
 A plugin is any Python class implementing one or more [plugin types](#plugin-functionalities) that is exported as an
 [importlib entrypoint][ep] in the `memoria` group.
@@ -106,3 +107,15 @@ content-type.
 [dl]: ../src/memoria/plugins/downloader.py
 [fl]: ../src/memoria/plugins/filter.py
 [ex]: ../src/memoria/plugins/extractor.py
+
+### Scraping Rule Filtering
+
+When Memoria ingests history items it must decide for each item whether to scrape and index the URL.
+This decision is managed by the Scraping Rules section of the Settings UI.
+For each Host configured in the Scraping Rules there can be optional 'filters' that cause the rule to apply only if the
+filters are met.
+Different types of filters are provided by the `AllowlistRule` plugin type. These plugins provide a `matches` function
+that is used to determine whether a given URL matches a specific filter value.
+
+For example, a "prefix" filter with a value of `/login` checks that the path component of a URL starts with
+`/login`.
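
The "prefix" filter described above could be sketched roughly as follows. This is a minimal illustration, not Memoria's actual implementation: the class name `PrefixFilter` and the `matches(url, value)` signature are assumptions made for this example, and the class deliberately does not subclass the real `AllowlistRule` plugin type, whose interface is not shown in this diff.

```python
from urllib.parse import urlsplit


class PrefixFilter:
    """Hypothetical 'prefix' filter (assumed name and signature, not Memoria's API)."""

    def matches(self, url: str, value: str) -> bool:
        # Compare only the path component of the URL, so the scheme, host,
        # query string, and fragment never influence the result.
        return urlsplit(url).path.startswith(value)


prefix = PrefixFilter()
print(prefix.matches("https://example.com/login/flow2?step=0", "/login"))  # True
print(prefix.matches("https://example.com/foo", "/login"))                 # False
```

Parsing the URL first, rather than searching the raw string, keeps a value such as `/login` from matching when it only appears in a query parameter.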
diff --git a/pyproject.toml b/pyproject.toml
@@ -24,7 +24,6 @@ dependencies = [
     "fasthx~=0.2403.1",
     "fastapi~=0.111.0",
     "humanize~=4.9.0",
-    "ijson~=3.2.3",
     "pydantic-settings~=2.2.1",
     "python-magic~=0.4.27",
     "SQLAlchemy[asyncio]~=2.0.30"
@@ -46,6 +45,8 @@ dev = [
     "fastapi-cli",
     "isort",
     "mypy",
+    "types-aiofiles",
+    "types-beautifulsoup4",
 ]
 uvicorn = [
     "uvicorn",

diff --git a/src/memoria/__about__.py b/src/memoria/__about__.py
@@ -1,4 +1,4 @@
-__version__ = "v0.1"
+__version__ = "v0.2"
 __description__ = """A selfhosted service for indexing and searching personal web history."""
 __authors__ = [{'name': 'Sidneys1', 'email': '[email protected]'}]