Commit

Merge pull request #75 from uktrade/docs/update-query-data
Minor improvements to docs
leo-mazzone authored Feb 27, 2025
2 parents 03f9bb8 + e628db7 commit fdcf1d7
Showing 33 changed files with 138 additions and 20 deletions.
2 changes: 2 additions & 0 deletions docs/api/client/index.md
@@ -2,6 +2,8 @@

`matchbox.client` is the client used to interact with the [Matchbox server](../../server/install.md).

All names in `matchbox.client` are also accessible from the top-level `matchbox` module.

::: matchbox.client
options:
show_root_heading: true
12 changes: 12 additions & 0 deletions docs/client/explore_resolutions.md
@@ -0,0 +1,12 @@
Matchbox lets you link many sources of data in many different ways. But when you query it, which way should you choose?

A *resolution*, or *point of resolution*, represents a queryable state describing how to cluster entities from one or more data sources. A resolution can represent an original data source, a deduplicated data source, or the result of linking two or more resolutions.

In order to explore which resolutions are stored on Matchbox, you can use the following client method:

=== "Example"
```python
from matchbox import draw_resolution_graph

draw_resolution_graph()
```
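
The definition of a resolution given above implies a directed acyclic graph: sources sit at the roots, and each deduplicated or linked resolution points back at the resolutions it was built from. The following is a minimal sketch of that structure in plain Python, not Matchbox's implementation. The resolution names, the `RESOLUTION_PARENTS` mapping, and the `ancestors` and `kind` helpers are all hypothetical, and the sketch assumes a deduplicated resolution has exactly one parent.

```python
from collections import deque

# Hypothetical resolution names; each maps to the resolutions it was built from.
# Assumption: sources have no parents, dedupers one, linkers two or more.
RESOLUTION_PARENTS = {
    "companies_house": [],
    "hmrc_exporters": [],
    "dedupe_companies_house": ["companies_house"],
    "dedupe_hmrc_exporters": ["hmrc_exporters"],
    "link_ch_hmrc": ["dedupe_companies_house", "dedupe_hmrc_exporters"],
}


def ancestors(resolution: str, parents: dict[str, list[str]]) -> set[str]:
    """Return every resolution the given one is built on, directly or not."""
    seen: set[str] = set()
    queue = deque(parents[resolution])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def kind(resolution: str, parents: dict[str, list[str]]) -> str:
    """Classify a resolution by how many parents it has."""
    count = len(parents[resolution])
    if count == 0:
        return "source"
    return "dedupe" if count == 1 else "link"


for name in RESOLUTION_PARENTS:
    print(name, kind(name, RESOLUTION_PARENTS), sorted(ancestors(name, RESOLUTION_PARENTS)))
```

Querying at `link_ch_hmrc` in this toy graph would draw on all four of its ancestors, whereas querying at `dedupe_companies_house` would only involve `companies_house`. That is the sense in which choosing a resolution chooses "which way" the data is clustered.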
16 changes: 9 additions & 7 deletions docs/client/query-data.md
@@ -7,16 +7,16 @@ Given a primary key and a source dataset, retrieves all primary keys that share
=== "Example"
```python
import matchbox as mb
from matchbox.client.helpers import selector
from matchbox import select
import sqlalchemy

engine = sqlalchemy.create_engine('postgresql://')

mb.match(
select("datahub_companies", engine=engine),
source=select("companies_house", engine=engine),
source_pk="8534735",
source="dbt.companieshouse",
target="hmrc.exporters",
resolution="companies",
resolution_name="last_linker",
)
```

@@ -44,13 +44,13 @@ Retrieves entire data sources along with a unique entity identifier according to
=== "Example"
```python
import matchbox as mb
from matchbox.client.helpers import selector
from matchbox import select
import sqlalchemy

engine = sqlalchemy.create_engine('postgresql://')

mb.query(
selector(
select(
{
"dbt.companieshouse": ["company_name"],
"hmrc.exporters": ["year", "commodity_codes"],
@@ -68,4 +68,6 @@ Retrieves entire data sources along with a unique entity identifier according to
122 Acme Ltd. 2024 ['72142', '72143']
5 Gamma Exports 2023 ['90328', '90329']
...
```
```

For more information on how to use the functions on this page, please check out the relevant examples in the [client API docs](../../api/client/).
7 changes: 5 additions & 2 deletions docs/index.md
@@ -9,9 +9,12 @@ hide:

---

Learn how to quickly install and use Matchbox.
Learn how to quickly install and use Matchbox:

* The **client** lets you query and link/dedupe data
* The **server** is for setting up a new Matchbox instance for your organisation.

[:octicons-download-16: Install client](./client/install.md){ .md-button .md-button--primary } [:octicons-download-16: Install server](./server/install.md){ .md-button .md-button--primary }
[:octicons-zap-16: Get started with the client](./client/install.md){ .md-button .md-button--primary } [:octicons-download-16: Deploy server in your org](./server/install.md){ .md-button .md-button--primary }

</div>

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,6 +9,7 @@ nav:
- Use cases: use-cases.md
- Client:
- Installation: client/install.md
- Explore resolutions: client/explore_resolutions.md
- Retrieve: client/query-data.md
- Link and deduplicate: client/link-data.md
- API:
8 changes: 1 addition & 7 deletions src/matchbox/__init__.py
@@ -4,10 +4,4 @@
load_dotenv(dotenv_path)

# Environment variables must be loaded first for other imports to work

from matchbox.client.helpers.cleaner import process # NoQA: E402
from matchbox.client.helpers.index import index # NoQA: E402
from matchbox.client.helpers.selector import match, query, select # NoQA: E402
from matchbox.client.models.models import make_model # NoQA: E402

__all__ = ("make_model", "process", "select", "query", "match", "index")
from matchbox.client import * # noqa: E402, F403
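
The switch to a star import means the top-level package re-exports whatever the subpackage lists in `__all__`, and nothing else. The sketch below demonstrates that Python rule in isolation, using a throwaway in-memory module. The module name `demo_client` and its three functions are hypothetical stand-ins, not Matchbox code.

```python
import sys
import types

# Build a stand-in for a subpackage (hypothetical name `demo_client`).
client = types.ModuleType("demo_client")
client.query = lambda: "query result"
client.match = lambda: "match result"
client.select = lambda: "selected"  # defined, but left out of __all__
client.__all__ = ("query", "match")
sys.modules["demo_client"] = client

# Equivalent to a top-level package running `from demo_client import *`.
top_level: dict = {}
exec("from demo_client import *", top_level)

# Ignore dunder entries such as __builtins__ that exec adds to the namespace.
exported = sorted(name for name in top_level if not name.startswith("__"))
print(exported)  # ['match', 'query']
```

With this pattern the subpackage's `__all__` becomes the single source of truth for the public API. As far as this diff shows, the new `__all__` in `matchbox.client` does not list `select`, so it may be worth checking that `from matchbox import select` in the updated docs still resolves.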
11 changes: 7 additions & 4 deletions src/matchbox/client/__init__.py
@@ -1,6 +1,9 @@
"""All client-side functionalities of Matchbox."""

from matchbox.client.helpers.cleaner import process
from matchbox.client.helpers.index import index
from matchbox.client.helpers.selector import match, query
from matchbox.client.models.models import make_model
from matchbox.client.visualisation import draw_resolution_graph

__all__ = (
# Visualisation
"draw_resolution_graph",
)
__all__ = ("process", "index", "match", "query", "make_model", "draw_resolution_graph")
2 changes: 2 additions & 0 deletions src/matchbox/client/_handler.py
@@ -1,3 +1,5 @@
"""Functions abstracting the interaction with the server API."""

import time
from collections.abc import Iterable
from io import BytesIO
2 changes: 2 additions & 0 deletions src/matchbox/client/_logging.py
@@ -1,3 +1,5 @@
"""Client-side logging utilities."""

import logging
import sys

Empty file removed src/matchbox/client/clean/.gitkeep
Empty file.
2 changes: 2 additions & 0 deletions src/matchbox/client/clean/__init__.py
@@ -1,3 +1,5 @@
"""Library of default cleaning functions."""

from matchbox.client.clean.lib import (
company_name,
company_number,
2 changes: 2 additions & 0 deletions src/matchbox/client/clean/lib.py
@@ -1,3 +1,5 @@
"""Implementation of default cleaning functions."""

from functools import partial

from pandas import DataFrame
2 changes: 2 additions & 0 deletions src/matchbox/client/clean/steps/__init__.py
@@ -1,3 +1,5 @@
"""Low-level components of default cleaning functions."""

from matchbox.client.clean.steps.clean_basic import (
array_except,
array_intersect,
2 changes: 2 additions & 0 deletions src/matchbox/client/clean/steps/clean_basic.py
@@ -1,3 +1,5 @@
"""Low-level primitives supporting default cleaning functions."""

from typing import Dict, List

from matchbox.client.clean.utils import ABBREVIATIONS, STOPWORDS
3 changes: 3 additions & 0 deletions src/matchbox/client/clean/steps/clean_basic_original.py
@@ -1,3 +1,6 @@
"""Legacy cleaning rules inherited by the Company Matching Service."""


def cms_original_clean_company_name_general(column):
"""Replicates the original Company Matching Service company name cleaning
regex exactly. Intended to help replicate the methodology for comparison.
2 changes: 2 additions & 0 deletions src/matchbox/client/clean/utils.py
@@ -1,3 +1,5 @@
"""Generic utilities for default cleaning functions."""

from typing import Callable

import duckdb
2 changes: 2 additions & 0 deletions src/matchbox/client/helpers/__init__.py
@@ -1,3 +1,5 @@
"""Core functionalities of the Matchbox client."""

from matchbox.client.helpers.cleaner import cleaner, cleaners
from matchbox.client.helpers.comparison import comparison
from matchbox.client.helpers.selector import select
43 changes: 43 additions & 0 deletions src/matchbox/client/helpers/cleaner.py
@@ -1,17 +1,60 @@
"""Functions to pre-process data sources."""

from typing import Any, Callable, Dict

from pandas import DataFrame


def cleaner(function: Callable, arguments: Dict) -> Dict[str, Dict[str, Any]]:
"""Define a function to clean a dataset.
Args:
function: the callable implementing the cleaning behaviour
arguments: a dictionary of keyword arguments to pass to the cleaning function
Returns:
A representation of the cleaner ready to be passed to the `cleaners()` function
"""
return {function.__name__: {"function": function, "arguments": arguments}}


def cleaners(*cleaner: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
"""Combine multiple cleaners in a single object to pass to `process()`
Args:
cleaner: Output of the `cleaner()` function
Returns:
A representation of multiple cleaners to be passed to the `process()` function
Examples:
```python
clean_pipeline = cleaners(
cleaner(
normalise_company_number,
{"column": "company_number"},
),
cleaner(
normalise_postcode,
{"column": "postcode"},
),
)
```
"""
return {k: v for d in cleaner for k, v in d.items()}


def process(data: DataFrame, pipeline: Dict[str, Dict[str, Any]]) -> DataFrame:
"""Apply cleaners to input dataframe.
Args:
data: The dataframe to process
pipeline: Output of the `cleaners()` function
Returns:
The processed dataset
"""
curr = data
for func in pipeline.keys():
curr = pipeline[func]["function"](curr, **pipeline[func]["arguments"])
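
The three helpers documented in this hunk compose into a small pipeline: `cleaner()` wraps a function with its keyword arguments, `cleaners()` merges those wrappers into one dictionary, and `process()` applies them in insertion order. Below is a dependency-free sketch of that flow. It copies the function bodies visible in the diff, but substitutes a list of dictionaries for the pandas `DataFrame`, assumes `process()` ends by returning `curr` (the end of the function is not shown here), and uses two hypothetical cleaning functions, `strip_whitespace` and `uppercase`.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]  # stand-in for a DataFrame in this sketch


def cleaner(function: Callable, arguments: Dict) -> Dict[str, Dict[str, Any]]:
    # Keyed by the function's name, as in the diff.
    return {function.__name__: {"function": function, "arguments": arguments}}


def cleaners(*cleaner: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    # Merge the single-entry dictionaries, preserving the order given.
    return {k: v for d in cleaner for k, v in d.items()}


def process(data: Rows, pipeline: Dict[str, Dict[str, Any]]) -> Rows:
    curr = data
    for func in pipeline.keys():
        curr = pipeline[func]["function"](curr, **pipeline[func]["arguments"])
    return curr  # assumed: the return statement is elided in the diff


# Hypothetical cleaning functions, for illustration only.
def strip_whitespace(rows: Rows, column: str) -> Rows:
    return [{**row, column: row[column].strip()} for row in rows]


def uppercase(rows: Rows, column: str) -> Rows:
    return [{**row, column: row[column].upper()} for row in rows]


pipeline = cleaners(
    cleaner(strip_whitespace, {"column": "company_name"}),
    cleaner(uppercase, {"column": "company_name"}),
)
rows = [{"company_name": "  Acme Ltd. "}, {"company_name": "Gamma Exports"}]
cleaned = process(rows, pipeline)
print(cleaned)

# Because entries are keyed by function name, reusing a function overwrites
# the earlier entry rather than adding a second step.
same_function_twice = cleaners(
    cleaner(uppercase, {"column": "company_name"}),
    cleaner(uppercase, {"column": "postcode"}),
)
print(len(same_function_twice))  # 1
```

One consequence of keying on `function.__name__` is visible in the last lines: applying the same function to two different columns in one pipeline keeps only the last set of arguments. Wrapping the function under a distinct name for each column would be one way around this.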
2 changes: 2 additions & 0 deletions src/matchbox/client/helpers/comparison.py
@@ -1,3 +1,5 @@
"""Functions to compare fields in different datasets."""

import sqlglot.expressions as exp
from sqlglot import parse_one
from sqlglot.errors import ParseError
2 changes: 2 additions & 0 deletions src/matchbox/client/helpers/index.py
@@ -1,3 +1,5 @@
"""Functions to index data sources to the Matchbox server."""

from sqlalchemy import Engine

from matchbox.client import _handler
12 changes: 12 additions & 0 deletions src/matchbox/client/helpers/selector.py
@@ -1,3 +1,5 @@
"""Functions to select and retrieve data from the Matchbox server."""

import itertools
from os import getenv
from typing import Literal
@@ -216,6 +218,16 @@ def match(
If None, uses the resolutions' default threshold
If an integer, uses that threshold for the specified resolution, and the
resolution's cached thresholds for its ancestors
Examples:
```python
mb.match(
select("datahub_companies", engine=engine),
source=select("companies_house", engine=engine),
source_pk="8534735",
resolution_name="last_linker",
)
```
"""
if len(source) > 1:
raise ValueError("Only one source can be matched at one time")
1 change: 1 addition & 0 deletions src/matchbox/client/models/__init__.py
@@ -0,0 +1 @@
"""Deduplication and linking methodologies."""
2 changes: 2 additions & 0 deletions src/matchbox/client/models/dedupers/__init__.py
@@ -1,3 +1,5 @@
"""Deduplication methodologies."""

from matchbox.client.models.dedupers.naive import NaiveDeduper

__all__ = ("NaiveDeduper",)
2 changes: 2 additions & 0 deletions src/matchbox/client/models/dedupers/base.py
@@ -1,3 +1,5 @@
"""Base class for deduplication methodologies."""

import warnings
from abc import ABC, abstractmethod

2 changes: 2 additions & 0 deletions src/matchbox/client/models/dedupers/naive.py
@@ -1,3 +1,5 @@
"""A deduplication methodology based on a deterministic set of conditions."""

from typing import List, Type

import duckdb
2 changes: 2 additions & 0 deletions src/matchbox/client/models/linkers/__init__.py
@@ -1,3 +1,5 @@
"""Linking methodologies."""

from matchbox.client.models.linkers.deterministic import DeterministicLinker
from matchbox.client.models.linkers.splinklinker import SplinkLinker
from matchbox.client.models.linkers.weighteddeterministic import (
2 changes: 2 additions & 0 deletions src/matchbox/client/models/linkers/base.py
@@ -1,3 +1,5 @@
"""Base class for linkers."""

import warnings
from abc import ABC, abstractmethod

2 changes: 2 additions & 0 deletions src/matchbox/client/models/linkers/deterministic.py
@@ -1,3 +1,5 @@
"""A linking methodology based on a deterministic set of conditions."""

from typing import Type

import duckdb
2 changes: 2 additions & 0 deletions src/matchbox/client/models/linkers/splinklinker.py
@@ -1,3 +1,5 @@
"""A linking methodology leveraging Splink."""

import ast
import inspect
import logging
2 changes: 2 additions & 0 deletions src/matchbox/client/models/linkers/weighteddeterministic.py
@@ -1,3 +1,5 @@
"""A linking methodology that applies different weights to field comparisons."""

from typing import List, Type

import duckdb
2 changes: 2 additions & 0 deletions src/matchbox/client/models/models.py
@@ -1,3 +1,5 @@
"""Functions and classes to define, run and register models."""

from typing import Any, ParamSpec, TypeVar

from pandas import DataFrame
2 changes: 2 additions & 0 deletions src/matchbox/client/results.py
@@ -1,3 +1,5 @@
"""Objects representing the results of running a model client-side."""

import logging
from functools import wraps
from typing import TYPE_CHECKING, Any, Callable, Hashable, ParamSpec, TypeVar
2 changes: 2 additions & 0 deletions src/matchbox/client/visualisation.py
@@ -1,3 +1,5 @@
"""Visualisation utilities."""

import rustworkx as rx
from matplotlib.figure import Figure
from rustworkx.visualization import mpl_draw