Enable duplicate detection via bag manifests
This commit enables duplicate detection via bag manifests in the AIP store, comparing each AIP against the other AIPs.
1 parent 2ea6c72, commit 4db9351
Showing 14 changed files with 582 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Automation Tools Reports Module | ||
|
||
A collection of reporting scripts that can be run independently of the | ||
automation tools or in concert with them. | ||
|
||
## Duplicates | ||
|
||
The duplicates report identifies duplicate entries across all AIPs in your AIP | ||
store. See the [README](duplicates/README.md). |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
# Duplicates | ||
|
||
The duplicates report identifies duplicate entries across all AIPs in your AIP | ||
store. | ||
|
||
## Configuration | ||
|
||
**Python** | ||
|
||
The duplicates module has its own dependencies. To ensure it can run, please | ||
install these first: | ||
|
||
* `$ sudo pip install -r requirements.txt` | ||
|
||
**Storage Service** | ||
|
||
To configure your report, modify [config.json](config.json) with information | ||
about how to connect to your Storage Service, e.g. | ||
```json | ||
{ | ||
"storage_service_url": "http://127.0.0.1:62081", | ||
"storage_service_user": "test", | ||
"storage_service_api_key": "test" | ||
} | ||
``` | ||
|
||
## Running the script | ||
|
||
Once configured, there are a number of ways to run the script. | ||
|
||
* **From the duplicates directory:** `$ python duplicates.py` | ||
* **From the reports folder as a module:** `$ python -m duplicates.duplicates` | ||
* **From the automation-tools folder as a module:** `$ python -m reports.duplicates.duplicates` | ||
|
||
## Output | ||
|
||
The tool has two outputs: | ||
|
||
* `aipstore-duplicates.json` | ||
* `aipstore-duplicates.csv` | ||
|
||
A description of those follows: | ||
|
||
* **JSON**: Reports on the packages in which duplicates have been found and | ||
lists the duplicate objects organized by checksum. The output might be | ||
useful for developers creating other tooling around this work, e.g. | ||
visualizations, as JSON is easy to manipulate in most programming | ||
languages. | ||
|
||
The JSON output is organized as follows: | ||
```json | ||
{ | ||
"manifest_data": { | ||
"{matched-checksum-1}": [ | ||
{ | ||
"basename": "{filename}", | ||
"date_modified": "{modified-date}", | ||
"dirname": "{directory-name}", | ||
"filepath": "{relative-path}", | ||
"package_name": "{package-name}", | ||
"package_uuid": "{package-uuid}" | ||
}, | ||
{ | ||
"basename": "{filename}", | ||
"date_modified": "{modified-date}", | ||
"dirname": "{directory-name}", | ||
"filepath": "{relative-path}", | ||
"package_name": "{package-name}", | ||
"package_uuid": "{package-uuid}" | ||
}, | ||
{ | ||
"basename": "{filename}", | ||
"date_modified": "{modified-date}", | ||
"dirname": "{directory-name}", | ||
"filepath": "{relative-path}", | ||
"package_name": "{package-name}", | ||
"package_uuid": "{package-uuid}" | ||
} | ||
], | ||
"{matched-checksum-2}": [ | ||
{ | ||
"basename": "{filename}", | ||
"date_modified": "{modified-date}", | ||
"dirname": "{directory-name}", | ||
"filepath": "{relative-path}", | ||
"package_name": "{package-name}", | ||
"package_uuid": "{package-uuid}" | ||
}, | ||
{ | ||
"basename": "{filename}", | ||
"date_modified": "{modified-date}", | ||
"dirname": "{directory-name}", | ||
"filepath": "{relative-path}", | ||
"package_name": "{package-name}", | ||
"package_uuid": "{package-uuid}" | ||
} | ||
] | ||
}, | ||
"packages": { | ||
"{package-uuid}": "{package-name}", | ||
"{package-uuid}": "{package-name}" | ||
} | ||
} | ||
``` | ||
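
As a rough illustration of how downstream tooling might consume this structure, the following sketch counts the copies recorded for each checksum and the number of duplicate entries per package. It assumes only the keys shown in the layout above; the function names `load_report` and `summarize_report` are hypothetical and are not part of the module.

```python
import json
from collections import Counter


def load_report(path="aipstore-duplicates.json"):
    """Load the JSON report from disk (the default path is the documented output name)."""
    with open(path) as report_file:
        return json.load(report_file)


def summarize_report(report):
    """Return (copies per checksum, duplicate entries per package name)."""
    manifest_data = report.get("manifest_data", {})
    packages = report.get("packages", {})
    # Number of files sharing each matched checksum.
    copies = {checksum: len(entries) for checksum, entries in manifest_data.items()}
    # Number of duplicate entries that each package contributes.
    per_package = Counter()
    for entries in manifest_data.values():
        for entry in entries:
            # Prefer the name from the "packages" lookup, fall back to the entry itself.
            name = packages.get(entry["package_uuid"], entry.get("package_name"))
            per_package[name] += 1
    return copies, dict(per_package)
```

A caller could run `summarize_report(load_report())` after the script has written its output to see which packages overlap most.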
|
||
* **CSV**: Reports the same information as a 2D table. The | ||
CSV is ready-made to be manipulated in tools such as | ||
[OpenRefine](http://openrefine.org/). The number of columns in the CSV | ||
varies, because rows can report different numbers of duplicate files. | ||
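
One way a variable-width table like this could be produced is sketched below. The column layout (one row per checksum, with a repeated group of columns for each duplicate) is an assumption made for illustration and may not match the layout the script actually writes; `FIELDS`, `report_to_rows` and `rows_to_csv` are hypothetical names.

```python
import csv
import io

# Hypothetical choice of per-duplicate columns, taken from the JSON keys.
FIELDS = ("package_name", "filepath", "date_modified")


def report_to_rows(manifest_data):
    """Flatten checksum -> entries into rows padded to a common width."""
    widest = max((len(entries) for entries in manifest_data.values()), default=0)
    header = ["checksum"]
    for index in range(1, widest + 1):
        header.extend("{}_{}".format(field, index) for field in FIELDS)
    rows = [header]
    for checksum, entries in sorted(manifest_data.items()):
        row = [checksum]
        for entry in entries:
            row.extend(entry.get(field, "") for field in FIELDS)
        # Pad shorter rows so every row has as many cells as the header.
        row.extend([""] * (len(header) - len(row)))
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Padding every row to the width of the widest one keeps the table rectangular, which spreadsheet-style tools generally handle more predictably than ragged rows.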
|
||
## Process followed | ||
|
||
Much of the work done by this package relies on the | ||
[amclient package](https://github.com/artefactual-labs/amclient). The process | ||
used to create a report is as follows: | ||
|
||
1. Retrieve a list of all AIPs across all pipelines. | ||
2. For every AIP, download its bag manifest (all manifest | ||
permutations are tested, so duplicates are discovered whether you are using | ||
MD5, SHA1 or SHA256 in your Archivematica instances). | ||
3. For every entry in the bag manifest, record the checksum, package, and path. | ||
4. Filter objects with matching checksums into a duplicates report. | ||
5. For every matched file in the duplicates report, download the package METS | ||
file. | ||
6. Using the METS file, augment the report with `date_modified` information | ||
(other data might be added in future). | ||
7. Output the report as JSON to `aipstore-duplicates.json`. | ||
8. Re-format the report to output in a 2D table to `aipstore-duplicates.csv`. | ||
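
Steps 3 and 4 can be sketched as follows, assuming the manifest text has already been downloaded for each package. This is a minimal illustration, not the module's implementation: `parse_manifest` and `find_duplicates` are hypothetical names, and the sketch relies only on the BagIt manifest convention of one `checksum filepath` pair per line.

```python
from collections import defaultdict


def parse_manifest(manifest_text):
    """Yield (checksum, filepath) pairs from BagIt manifest text."""
    for line in manifest_text.splitlines():
        # Split on the first run of whitespace only, so paths may contain spaces.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # Skip blank or malformed lines.
        checksum, filepath = parts
        yield checksum, filepath.strip()


def find_duplicates(manifests):
    """Group entries by checksum and keep checksums seen more than once.

    `manifests` maps a package UUID to that package's manifest text.
    """
    seen = defaultdict(list)
    for package_uuid, manifest_text in manifests.items():
        for checksum, filepath in parse_manifest(manifest_text):
            seen[checksum].append(
                {"package_uuid": package_uuid, "filepath": filepath}
            )
    return {
        checksum: entries for checksum, entries in seen.items() if len(entries) > 1
    }
```

Note that this sketch also reports two files with the same checksum inside a single package; whether the script treats that case as a duplicate is not stated here.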
|
||
## Future work | ||
|
||
As a standalone module, the duplicates work could be developed in a number of | ||
ways that might be desirable in an archival appraisal workflow. |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# -*- coding: utf-8 -*- | ||
|
||
import json | ||
import os | ||
|
||
from amclient import AMClient | ||
|
||
|
||
class AppConfig: | ||
    def __init__(self): | ||
        """Initialize class.""" | ||
        config_file = os.path.join(os.path.dirname(__file__), "config.json") | ||
        self._load_config(config_file) | ||
|
||
    def _load_config(self, config_file): | ||
        """Load our configuration information.""" | ||
        with open(config_file) as json_config: | ||
            conf = json.load(json_config) | ||
        self.storage_service_user = conf.get("storage_service_user") | ||
        self.storage_service_api_key = conf.get("storage_service_api_key") | ||
        self.storage_service_url = conf.get("storage_service_url") | ||
|
||
    def get_am_client(self): | ||
        """Return an Archivematica API client to the caller.""" | ||
        am = AMClient() | ||
        am.ss_url = self.storage_service_url | ||
        am.ss_user_name = self.storage_service_user | ||
        am.ss_api_key = self.storage_service_api_key | ||
        return am |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
{ | ||
"storage_service_url": "http://127.0.0.1:62081", | ||
"storage_service_user": "test", | ||
"storage_service_api_key": "test" | ||
} |