add feature file for duplicate detection #103

Closed

Changes from 14 commits
9 changes: 9 additions & 0 deletions reports/README.md
@@ -0,0 +1,9 @@
# Automation Tools Reports Module

A collection of reporting scripts that can be run independently of the
automation tools or in concert with them.

## Duplicates

The duplicates report identifies duplicate entries across all AIPs in your
entire AIP store. See the [README](duplicates/README.md) for details.
Empty file added reports/__init__.py
Empty file.
133 changes: 133 additions & 0 deletions reports/duplicates/README.md
@@ -0,0 +1,133 @@
# Duplicates

The duplicates module identifies duplicate entries across all AIPs in your
entire AIP store.

## Configuration

**Python**

The duplicates module has its own dependencies. To ensure it can run, please
install these first:

* `$ sudo pip install -r requirements.txt`

**Storage Service**

To configure your report, modify [config.json](config.json) with information
about how to connect to your Storage Service, e.g.
```json
{
"storage_service_url": "http://127.0.0.1:62081",
"storage_service_user": "test",
"storage_service_api_key": "test"
}
```

## Running the script

Once configured, there are a number of ways to run the script.

* **From the duplicates directory:** `$ python duplicates.py`
* **From the report folder as a module:** `$ python -m duplicates.duplicates`
* **From the automation-tools folder as a module:** `$ python -m reports.duplicates.duplicates`

## Output

The tool has two outputs:

* `aipstore-duplicates.json`
* `aipstore-duplicates.csv`

A description of those follows:

* **JSON**: Reports the packages across which duplicates have been found and
lists duplicate objects organized by checksum. The output might be useful for
developers creating other tooling around this work, e.g. visualizations, as
JSON is easy to manipulate in most programming languages.

The JSON output is organised as follows:
```json
{
"manifest_data": {
"{matched-checksum-1}": [
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
}
],
"{matched-checksum-2}": [
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
}
]
},
"packages": {
"{package-uuid}": "{package-name}",
"{package-uuid}": "{package-name}"
}
}
```

* **CSV**: Reports the same information as a 2D representation. The CSV is
ready-made to be manipulated in tools such as
[OpenRefine](http://openrefine.org/). Rows vary in width because different
rows can have different numbers of duplicate files to report (a sketch of this
flattening follows).
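
As an illustration of that flattening, the sketch below shows one way the JSON
report might be turned into variable-width CSV rows. It is not the tool's
actual implementation: the function name and column layout are assumptions.

```python
import csv
import json


def write_duplicates_csv(json_path, csv_path):
    """Flatten the JSON duplicates report into one row per matched checksum.

    Each row carries the checksum followed by the package name and path of
    every duplicate, so rows naturally differ in width.
    """
    with open(json_path) as json_file:
        report = json.load(json_file)

    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        for checksum, entries in report.get("manifest_data", {}).items():
            row = [checksum]
            for entry in entries:
                row.extend([entry.get("package_name"), entry.get("filepath")])
            writer.writerow(row)


# Example usage, once the report has been generated:
# write_duplicates_csv("aipstore-duplicates.json", "aipstore-duplicates.csv")
```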

## Process followed

Much of the work done by this package relies on the
[amclient package](https://github.com/artefactual-labs/amclient). The process
used to create a report is as follows:

1. Retrieve a list of all AIPs across all pipelines.
2. For every AIP, download its bag manifest (all manifest permutations are
tested, so all duplicates are discovered whether you are using MD5, SHA1 or
SHA256 in your Archivematica instances).
3. For every entry in the bag manifest, record the checksum, package, and path.
4. Filter objects with matching checksums into a duplicates report (see the
sketch after this list).
5. For every matched file in the duplicates report, download the package METS
file.
6. Using the METS file, augment the report with date_modified information
(other data might be added in future).
7. Output the report as JSON to `aipstore-duplicates.json`.
8. Re-format the report into a 2D table and output it to
`aipstore-duplicates.csv`.
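
To make steps 3 and 4 concrete, here is a minimal sketch of grouping
bag-manifest entries by checksum and keeping only the matches. It is
illustrative only: the parsing is simplified and the function names are not
taken from duplicates.py.

```python
from collections import defaultdict


def parse_manifest(manifest_text, package_name, package_uuid):
    """Yield (checksum, entry) pairs from a BagIt manifest (step 3).

    A manifest line is "<checksum> <relative-path>"; this parser is
    simplified for illustration and skips malformed lines.
    """
    for line in manifest_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        checksum, filepath = parts
        yield checksum, {
            "filepath": filepath,
            "package_name": package_name,
            "package_uuid": package_uuid,
        }


def build_duplicates_report(manifests):
    """Group manifest entries by checksum, keeping only matches (step 4).

    `manifests` is an iterable of (manifest_text, package_name,
    package_uuid) tuples, one per AIP.
    """
    by_checksum = defaultdict(list)
    for manifest_text, name, uuid in manifests:
        for checksum, entry in parse_manifest(manifest_text, name, uuid):
            by_checksum[checksum].append(entry)
    # Only checksums that appear more than once are duplicates.
    return {k: v for k, v in by_checksum.items() if len(v) > 1}
```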

## Future work

As a standalone module, the duplicates work could be developed in a number of
ways that might be desirable in an archival appraisal workflow.
Empty file added reports/duplicates/__init__.py
Empty file.
29 changes: 29 additions & 0 deletions reports/duplicates/appconfig.py
@@ -0,0 +1,29 @@
# -*- coding: utf-8 -*-

import json
import os

from amclient import AMClient


class AppConfig:
def __init__(self):
"""Initialize class."""
config_file = os.path.join(os.path.dirname(__file__), "config.json")
self._load_config(config_file)

def _load_config(self, config_file):
"""Load our configuration information."""
with open(config_file) as json_config:
conf = json.load(json_config)
self.storage_service_user = conf.get("storage_service_user")
self.storage_service_api_key = conf.get("storage_service_api_key")
self.storage_service_url = conf.get("storage_service_url")

def get_am_client(self):
"""Return an Archivematica API client to the caller."""
am = AMClient()
am.ss_url = self.storage_service_url
am.ss_user_name = self.storage_service_user
am.ss_api_key = self.storage_service_api_key
return am
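
For illustration, a caller could use this class to drive step 1 of the process
described above. The `aips()` call below is an assumption about the amclient
API rather than something shown in this diff; check the amclient documentation
for the exact method name and return shape.

```python
from appconfig import AppConfig

# Build an authenticated Storage Service client from config.json.
am = AppConfig().get_am_client()

# Assumption: amclient exposes an aips() call that returns package metadata
# from the Storage Service; verify this against the amclient documentation.
all_aips = am.aips()
print(all_aips)
```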
5 changes: 5 additions & 0 deletions reports/duplicates/config.json
@@ -0,0 +1,5 @@
{
"storage_service_url": "http://127.0.0.1:62081",
"storage_service_user": "test",
"storage_service_api_key": "test"
}
14 changes: 14 additions & 0 deletions reports/duplicates/duplicates.feature
@@ -0,0 +1,14 @@
Feature: Identify true duplicates in your Archivematica AIP store.

Background: A checksum match might indicate that duplicate objects exist in your Archivematica AIP archival storage but further analysis of the object’s context will determine whether you have identified a “true” duplicate in the archival sense (i.e. the context of creation and use is identical).

Scenario: Detect a true duplicate file
Given an AIP has been ingested
When the duplicates.py script is run
And a duplicate checksum is found
Then the aipstore-duplicates.csv file is generated
Contributor

Behavior-driven development asks about the what, not the how. In this example, we don't want to rely on the mechanism duplicates.py because that script name might change. We should shorten this to:

Given an AIP has been ingested
When a duplicate checksum is found
Then a duplicates report is generated

When the AIP dir_name is equivalent
When the base_name is equivalent
When the file_path is equivalent
When the date_modified is equivalent
Then the files are true duplicates
Contributor

I feel like this is a second scenario: the first is Generate a duplicates report, the second is Detect a true duplicate file. It would look something like:

Given a duplicates report
When a file's <properties> are equivalent
Then the files are true duplicates

And we can use a table to describe some rows of properties we need to validate.

It would be good to get clarification on which feature we're writing for - this current version looks like it could be tagged @future because I can see it's where we might want to be. For IISH the feature is about generating the report so that the analysis can be done - I think it's the difference between Generate a duplicates report and Generate a true duplicates report.
