add feature file for duplicate detection #103

Closed
9 changes: 9 additions & 0 deletions reports/README.md
@@ -0,0 +1,9 @@
# Automation Tools Reports Module

A collection of reporting scripts that can be run independently of the
automation tools or in concert with them.

## Duplicates

The duplicates module identifies duplicate entries across all of the AIPs in
your AIP store. See the [README](duplicates/README.md) for details.
Empty file added reports/__init__.py
Empty file.
133 changes: 133 additions & 0 deletions reports/duplicates/README.md
@@ -0,0 +1,133 @@
# Duplicates

The duplicates module identifies duplicate entries across all of the AIPs in
your AIP store.

## Configuration

**Python**

The duplicates module has its own dependencies. To ensure it can run, please
install these first:

* `$ sudo pip install -r requirements.txt`

**Storage Service**

To configure your report, modify [config.json](config.json) with information
about how to connect to your Storage Service, e.g.
```json
{
  "storage_service_url": "http://127.0.0.1:62081",
  "storage_service_user": "test",
  "storage_service_api_key": "test"
}
```

## Running the script

Once the module is configured, the script can be run in a number of ways:

* **From the duplicates directory:** `$ python duplicates.py`
* **From the reports folder as a module:** `$ python -m duplicates.duplicates`
* **From the automation-tools folder as a module:** `$ python -m reports.duplicates.duplicates`

## Output

The tool has two outputs:

* `aipstore-duplicates.json`
* `aipstore-duplicates.csv`

A description of those follows:

* **JSON**: Reports the packages across which duplicates have been found and
lists the duplicate objects organized by checksum. This output might be
useful for developers creating other tooling around this work, e.g.
visualizations, since JSON is an easy-to-manipulate standard in most
programming languages.

The JSON output is organized as follows:
```json
{
  "manifest_data": {
    "{matched-checksum-1}": [
      {
        "basename": "{filename}",
        "date_modified": "{modified-date}",
        "dirname": "{directory-name}",
        "filepath": "{relative-path}",
        "package_name": "{package-name}",
        "package_uuid": "{package-uuid}"
      },
      {
        "basename": "{filename}",
        "date_modified": "{modified-date}",
        "dirname": "{directory-name}",
        "filepath": "{relative-path}",
        "package_name": "{package-name}",
        "package_uuid": "{package-uuid}"
      },
      {
        "basename": "{filename}",
        "date_modified": "{modified-date}",
        "dirname": "{directory-name}",
        "filepath": "{relative-path}",
        "package_name": "{package-name}",
        "package_uuid": "{package-uuid}"
      }
    ],
    "{matched-checksum-2}": [
      {
        "basename": "{filename}",
        "date_modified": "{modified-date}",
        "dirname": "{directory-name}",
        "filepath": "{relative-path}",
        "package_name": "{package-name}",
        "package_uuid": "{package-uuid}"
      },
      {
        "basename": "{filename}",
        "date_modified": "{modified-date}",
        "dirname": "{directory-name}",
        "filepath": "{relative-path}",
        "package_name": "{package-name}",
        "package_uuid": "{package-uuid}"
      }
    ]
  },
  "packages": {
    "{package-uuid-1}": "{package-name-1}",
    "{package-uuid-2}": "{package-name-2}"
  }
}
```
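
As an illustration, the JSON report can be consumed with a few lines of
Python. The snippet below is a hypothetical consumer (not part of this PR)
that relies only on the `manifest_data` structure documented above:

```python
import json

# Load the report produced by duplicates.py (hypothetical consumer code).
with open("aipstore-duplicates.json") as report_file:
    report = json.load(report_file)

# Every key in manifest_data is a checksum shared by two or more files.
for checksum, files in report["manifest_data"].items():
    packages = {entry["package_uuid"] for entry in files}
    print("{}: {} files across {} packages".format(
        checksum, len(files), len(packages)))
```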

* **CSV**: Reports the same information as a 2D representation. The CSV is
ready to be manipulated in tools such as
[OpenRefine](http://openrefine.org/). The table dynamically resizes: rows
have different widths depending on how many duplicate files there are to
report for each checksum.
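
Because column counts vary per row, a generic reader is the safest way to
process the CSV programmatically. A minimal, hypothetical sketch (nothing
here assumes named columns; the exact layout is whatever the tool emits):

```python
import csv

# Read the dynamically sized CSV row by row (hypothetical consumer code).
with open("aipstore-duplicates.csv") as report_file:
    for row in csv.reader(report_file):
        # Rows differ in length depending on how many duplicates were
        # found for a given checksum, so avoid fixed-width assumptions.
        print(len(row), row)
```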

## Process followed

Much of the work done by this package relies on the
[amclient package](https://github.com/artefactual-labs/amclient). The process
used to create a report is as follows (a sketch of the first step appears
after the list):

1. Retrieve a list of all AIPs across all pipelines.
2. For every AIP, download its bag manifest (all manifest permutations are
tested, so all duplicates are discovered whether your Archivematica instances
use MD5, SHA1, or SHA256).
3. For every entry in the bag manifest, record the checksum, package, and
path.
4. Filter objects with matching checksums into a duplicates report.
5. For every matched file in the duplicates report, download the package METS
file.
6. Using the METS file, augment the report with `date_modified` information.
(Other data might be added in future.)
7. Output the report as JSON to `aipstore-duplicates.json`.
8. Re-format the report into a 2D table and output it to
`aipstore-duplicates.csv`.
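
For orientation, step 1 might look roughly like the sketch below. It reuses
the `AppConfig` class added in this PR; the `aips()` call reflects one
reading of the amclient API and should be treated as an assumption rather
than a confirmed signature.

```python
from appconfig import AppConfig

# Hypothetical sketch of step 1: ask the Storage Service for every AIP
# across all pipelines. The aips() method is assumed, not confirmed.
am = AppConfig().get_am_client()
for aip in am.aips():
    print(aip.get("uuid"), aip.get("current_path"))
```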

## Future work

As a standalone module, the duplicates work could be developed in a number of
ways that might be desirable in an archival appraisal workflow.
Empty file added reports/duplicates/__init__.py
Empty file.
29 changes: 29 additions & 0 deletions reports/duplicates/appconfig.py
@@ -0,0 +1,29 @@
# -*- coding: utf-8 -*-

import json
import os

from amclient import AMClient


class AppConfig:
    def __init__(self):
        """Load configuration from config.json alongside this module."""
        config_file = os.path.join(os.path.dirname(__file__), "config.json")
        self._load_config(config_file)

    def _load_config(self, config_file):
        """Read the Storage Service connection details from config_file."""
        with open(config_file) as json_config:
            conf = json.load(json_config)
            self.storage_service_user = conf.get("storage_service_user")
            self.storage_service_api_key = conf.get("storage_service_api_key")
            self.storage_service_url = conf.get("storage_service_url")

    def get_am_client(self):
        """Return an Archivematica API client to the caller."""
        am = AMClient()
        am.ss_url = self.storage_service_url
        am.ss_user_name = self.storage_service_user
        am.ss_api_key = self.storage_service_api_key
        return am
5 changes: 5 additions & 0 deletions reports/duplicates/config.json
@@ -0,0 +1,5 @@
{
  "storage_service_url": "http://127.0.0.1:62081",
  "storage_service_user": "test",
  "storage_service_api_key": "test"
}
20 changes: 20 additions & 0 deletions reports/duplicates/duplicates.feature
@@ -0,0 +1,20 @@
Feature: Identify true duplicates in the Archivematica AIP store.

  Background: Alma uses checksums and archival context to determine "true" duplicate files in their collection (i.e. the context of creation and use is identical).

  Scenario: Generate a duplicate report
    Given an AIP has been ingested
    When a duplicate checksum is found
    Then a duplicates report is generated

  Scenario Outline: Detect a true duplicate file
    Given a duplicates report is available
    When a file's <properties> are equivalent
    Then the files are true duplicates

    Examples:
      | properties    |
      | AIP dir_name  |
      | base_name     |
      | file_path     |
      | date_modified |
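
For reviewers unfamiliar with Gherkin: a feature file like this is normally
bound to step definitions by a runner such as behave. A hypothetical sketch
of bindings for the first scenario follows (no step definitions are included
in this PR, and `run_duplicates_report` is an invented helper):

```python
from behave import given, when, then


@given("an AIP has been ingested")
def step_aip_ingested(context):
    # Hypothetical helper; fixture setup is not part of this PR.
    context.report = run_duplicates_report()


@when("a duplicate checksum is found")
def step_duplicate_found(context):
    context.duplicates = context.report.get("manifest_data", {})


@then("a duplicates report is generated")
def step_report_generated(context):
    assert context.duplicates, "expected at least one duplicate checksum"
```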