Release v0.2
Removes the dependency on allow- and deny-list files and manages this feature through the UI.

Added:
- Pluggable allow- and deny-list filters (@Sidneys1).
- UI management for the deny-list, allow-list entries, and their filters (@Sidneys1).

Removed:
- The old file-based allow- and deny-list filters.

Signed-off-by: Sidneys1 <[email protected]>
Sidneys1 committed Jul 2, 2024
2 parents d1131b1 + 0d90364 commit 0749c7c
Showing 53 changed files with 1,590 additions and 804 deletions.
7 changes: 6 additions & 1 deletion .vscode/extensions.json
@@ -1,5 +1,10 @@
  {
  "recommendations": [
- "ms-python.isort"
+ "ms-python.isort",
+ "streetsidesoftware.code-spell-checker",
+ "editorconfig.editorconfig",
+ "bierner.github-markdown-preview",
+ "eeyore.yapf",
+ "ms-python.mypy-type-checker"
  ]
  }
5 changes: 4 additions & 1 deletion .vscode/settings.json
@@ -10,5 +10,8 @@
  "isort.args": [
  "--settings-path=${workspaceFolder}"
  ],
- "python.analysis.typeCheckingMode": "off"
+ "cSpell.words": [
+ "hostnames"
+ ],
+ "python.analysis.typeCheckingMode": "standard"
  }
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,21 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/), and this project adheres to
[Semantic Versioning](https://semver.org/spec/v2.0.0.html).

v0.2 - 2024-07-02
-----------------

Removes the dependency on allow- and deny-list files and manages this feature through the UI.

### Added

- Pluggable allow- and deny-list filters (@sidneys1).
- UI management for the deny-list, allow-list entries, and their filters (@sidneys1).

### Removed

- The old file-based allow- and deny-list filters.


v0.1 - 2024-06-06
-----------------

99 changes: 10 additions & 89 deletions README.md
@@ -125,87 +125,6 @@ podman-compose up --build --profile elasticsearch
Configuration
-------------

### Allow and Deny Lists

Memoria utilizes allow and deny lists to filter incoming history items so that unwanted websites aren't indexed. These
lists are currently just text files containing one rule per line.

<details>

Shell-like quotation marks and backslashes are supported. Given the entries that match its domain name, a history item
will be downloaded by Memoria if the URL is:

1. Matched by any applicable *strong* allowlist entry; or
2. Matched by any applicable *weak* allowlist entry, **and** not matched by any applicable *strong* denylist entry.

Additionally, if a subdomain is not matched by any entries then its parent domains will be used sequentially. For
example, if `gist.github.com` doesn't match any entries, then entries for `github.com` will be checked.

A *weak* list entry is composed of just a domain name:
```sh
example.com
```
While a *strong* list entry is composed of a domain name and zero or more rules that can further restrict the entry:
```sh
example.com /login r^/$
```

There are currently two types of rules:
- **Path rules** start with `/` and match if the URL path-part begins with this value.
- **Regular expression rules** start with `r` and match if any part of the URL matches.

So, to break it down, putting `example.com` in the allowlist and this entry in the denylist:

<h3><code><ruby>example.com<rt>domain</rt></ruby> <ruby><code>/login</code><rt>path&ensp;rule</rt></ruby> <ruby><code>r^/$</code><rt>regex&ensp;rule</rt></ruby></code></h3>

Would result in these URLs being allowed:

- `https://example.com/foo`
- `https://example.com/foo/bar/baz#link?search=bat`

And these URLs being denied:

- <h3><samp>https:<wbr>//www<wbr>.<ruby><code>example.com</code><rt>domain</rt></ruby><ruby><code>/login</code><rt>path&ensp;rule</rt></ruby></samp></h3>
- <h3><samp>https:<wbr>//www<wbr>.<ruby><code>example.com</code><rt>domain</rt></ruby><ruby><code>/login</code><rt>path&ensp;rule</rt>/flow2?step=0</ruby></samp></h3>
- <h3><samp>https://<ruby><code>example.com</code><rt>domain</rt></ruby><ruby><code>/</code>&nbsp;&nbsp;&nbsp;<rt>regex rule</rt></ruby></samp></h3>

<details><summary>Examples</summary>

- Allow all URLs under GitHub.com, except login, search, my (Sidneys1) own projects and pages, and searches within
projects or organizations:

```sh
# allowlist.txt
github.com

# denylist.txt
github.com /login /search /Sidneys1/ 'r/(?:search|repositories|issues)\?q='
```

- Allow any page under a domain except the landing page (`example.com/`):

```sh
# allowlist.txt
example.com

# denylist.txt
example.com r^/$
```

- Deny any page at stackoverflow.com except questions:

```sh
# allowlist.txt
stackoverflow.com /questions/ /q/

# denylist.txt
stackoverflow.com
```

</details>

</details>
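
The matching rules described above can be sketched in Python. This is a minimal illustration, not Memoria's actual implementation: the names `rule_matches` and `is_allowed`, the dict-of-lists representation (a domain mapped to its rules, with an empty list standing for a *weak* entry), and the choice to test regular expression rules against the path plus query string are all assumptions made for this example.

```python
import re
from urllib.parse import urlsplit


def rule_matches(rule: str, path: str, query: str) -> bool:
    """Path rules start with '/', regular expression rules start with 'r'."""
    if rule.startswith("/"):
        return path.startswith(rule)
    if rule.startswith("r"):
        # Assumption: the regex is searched against the path plus query string.
        target = path + ("?" + query if query else "")
        return re.search(rule[1:], target) is not None
    raise ValueError(f"unknown rule type: {rule!r}")


def is_allowed(url: str,
               allowlist: dict[str, list[str]],
               denylist: dict[str, list[str]]) -> bool:
    """Hypothetical decision function; an empty rule list is a *weak* entry."""
    parts = urlsplit(url)
    path = parts.path or "/"
    labels = (parts.hostname or "").split(".")
    # Try the full host first, then each parent domain in turn.
    for i in range(len(labels)):
        domain = ".".join(labels[i:])
        allow = allowlist.get(domain)
        deny = denylist.get(domain)
        if allow is None and deny is None:
            continue  # no entries for this domain, fall back to its parent
        strong_allow = bool(allow) and any(
            rule_matches(r, path, parts.query) for r in allow)
        weak_allow = allow is not None and not allow
        strong_deny = bool(deny) and any(
            rule_matches(r, path, parts.query) for r in deny)
        return strong_allow or (weak_allow and not strong_deny)
    return False  # assumption: URLs with no applicable entries are not downloaded


allow = {"example.com": []}
deny = {"example.com": ["/login", "r^/$"]}
print(is_allowed("https://example.com/foo", allow, deny))        # True
print(is_allowed("https://www.example.com/login", allow, deny))  # False
print(is_allowed("https://example.com/", allow, deny))           # False
```

Treating any matching rule in an entry as sufficient mirrors the `example.com /login r^/$` breakdown above, where either the path rule or the regular expression rule alone causes a URL to be denied.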

### Options

Memoria has several deployment configuration options that control overall behavior. These can be set via environment
@@ -227,11 +146,6 @@ variables or container secrets. The following configuration options are provided

$\frac{cpus}{2}$[^2]</td></tr>
</tbody>
<tbody>
<tr><th rowspan="4">Allow/Deny Lists</th>
<td><code>allowlist</code></td> <td>Path to a file defining allowlist<sup><a href="#allow-and-deny-lists">§</a></sup> entries</td> <td><code>./data/allowlist.txt</code></td></tr>
<tr><td><code>denylist</code></td> <td>Path to a file defining denylist<sup><a href="#allow-and-deny-lists">§</a></sup> entries</td> <td><code>./data/denylist.txt</code></td></tr>
</tbody>
<tbody>
<tr><th rowspan="4">Databases</th>
<td><code>database_uri</code></td> <td>Connection URI to the Memoria database</td> <td><code>sqlite+aiosqlite:///./data/memoria.db</code></td></tr>
@@ -261,12 +175,12 @@ Plugins
-------

Memoria utilizes a plugin architecture that allows for different methods of downloading URLs, transforming the
downloaded content, and extracting indexable plain text from the content. Selecting which plugins Memoria will use is
described in [§Configuration](#configuration).
downloaded content, and extracting indexable plain text from the content. Selecting which plugins Memoria will use for
web content retrieval and processing is described in [§Configuration](#configuration).

<details>

There are currently three types of Memoria Plugins, used during web content retrieval and processing:
There are currently three types of Memoria Plugins used during web content retrieval and processing:
- **Downloaders**<br>
Downloaders are responsible for accessing a URL and retrieving its content from the internet. They can provide this
content in many different formats to the next plugin in the stack. The most basic Downloaders (like the built-in
@@ -287,8 +201,15 @@ There are currently three types of Memoria Plugins, used during web content retr
the original downloaded HTML (before any potential modification by Filter plugins) for `<meta ...>` values that could
be used to enrich the Elasticsearch document, such as `"author"` or `"description"`.

Other types of plugins:
- **Scraping Rule Filters**<br>
Scraping rule filter plugins allow the Scraping Rules in the Settings UI to be extended with new functionality. These
filters help determine which history URLs will be retrieved and scraped.

</details>

<!-- TODO: section on scraping rule plugins -->

> [!TIP]
> See the [📑 Plugin Development](./docs/Plugin%20Development.md) guide for information on developing your own Memoria plugins.
13 changes: 13 additions & 0 deletions docs/Plugin Development.md
@@ -10,6 +10,7 @@ Plugin Development
* 💾 [§ Downloaders](#downloaders)
* ⚗️ [§ Filters](#filters)
* 🔤 [§ Extractors](#extractors)
* ⚗️ [§ Scraping Rule Filtering](#scraping-rule-filtering)

A plugin is any Python class implementing one or more [plugin types](#plugin-functionalities) that is exported as an
[importlib entrypoint][ep] in the `memoria` group.
@@ -106,3 +107,15 @@ content-type.
[dl]: ../src/memoria/plugins/downloader.py
[fl]: ../src/memoria/plugins/filter.py
[ex]: ../src/memoria/plugins/extractor.py

### Scraping Rule Filtering

When Memoria ingests history items, it must decide for each item whether to scrape and index the URL.
This decision is managed by the Scraping Rules section of the Settings UI.
Each Host configured in the Scraping Rules can have optional 'filters' that cause the rule to apply only when the
filters are met.
Different types of filters are provided by the `AllowlistRule` plugin type. These plugins provide a `matches` function
that is used to determine whether a given URL matches a specific filter value.
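
As a rough illustration of the shape such a plugin might take, the sketch below implements a path-prefix filter. It is standalone by design: the real `AllowlistRule` base class and the exact signature of `matches` are not shown in this guide, so the class name `PrefixFilter` and the `(url, value)` parameters are assumptions made for this example.

```python
from urllib.parse import urlsplit


class PrefixFilter:
    """Hypothetical filter: matches when the URL path starts with the filter value."""

    def matches(self, url: str, value: str) -> bool:
        # Only the path component is compared; host, query, and fragment are ignored.
        return urlsplit(url).path.startswith(value)


prefix = PrefixFilter()
print(prefix.matches("https://example.com/login/flow2?step=0", "/login"))  # True
print(prefix.matches("https://example.com/foo", "/login"))                 # False
```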

For example, a "prefix" filter could have a value of `/login` and check that the path component of a URL starts with
"`/login`".
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -24,7 +24,6 @@ dependencies = [
  "fasthx~=0.2403.1",
  "fastapi~=0.111.0",
  "humanize~=4.9.0",
- "ijson~=3.2.3",
  "pydantic-settings~=2.2.1",
  "python-magic~=0.4.27",
  "SQLAlchemy[asyncio]~=2.0.30"
@@ -46,6 +45,8 @@ dev = [
  "fastapi-cli",
  "isort",
  "mypy",
+ "types-aiofiles",
+ "types-beautifulsoup4",
  ]
  uvicorn = [
  "uvicorn",
2 changes: 1 addition & 1 deletion src/memoria/__about__.py
@@ -1,4 +1,4 @@
- __version__ = "v0.1"
+ __version__ = "v0.2"
  __description__ = """A selfhosted service for indexing and searching personal web history."""
  __authors__ = [{'name': 'Sidneys1', 'email': '[email protected]'}]
