add feature file for duplicate detection #103

peterVG · 2019-04-01T21:19:18Z

first draft

ross-spencer

Hi @peterVG, I left some comments where we can pull things apart a little more. The direction is good though. If you have questions, perhaps we can discuss them in the meeting later?

The checks that you are writing look great. As we finalize this first draft, I'd like to see us introduce some negative assertions into a scenario of their own so that we can test not just what we expect to happen, but what we definitely don't want to happen.

E.g. in the scenario as it is written, we assert that when all properties match, a report is generated. But we don't assert that when only one or two out of three or four properties match, no report is generated.

The negative assertions are just as important in this scenario because we can generate True results very easily doing silly things in code. We should make sure that they don't manifest themselves when we're only testing for positives and then affect future disposal recommendations.
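A minimal sketch of the positive and negative cases, assuming each file is represented as a dict of properties. The function name `is_true_duplicate` and the property list are hypothetical and are not taken from duplicates.py:

```python
# Hypothetical property names, borrowed from the draft feature file.
PROPERTIES = ("checksum", "base_name", "file_path", "date_modified")


def is_true_duplicate(first, second, properties=PROPERTIES):
    """Return True only when every listed property is present and equal."""
    return all(
        name in first and name in second and first[name] == second[name]
        for name in properties
    )


original = {
    "checksum": "abc123",
    "base_name": "report.pdf",
    "file_path": "objects/report.pdf",
    "date_modified": "2019-04-01",
}
exact_copy = dict(original)
renamed_copy = dict(original, base_name="report-final.pdf")

# Positive assertion: all properties match.
assert is_true_duplicate(original, exact_copy)
# Negative assertion: one property differs, so this is not a true duplicate.
assert not is_true_duplicate(original, renamed_copy)
# Negative assertion: missing properties must not count as a match.
assert not is_true_duplicate({}, {})
```

The last assertion is the kind of "silly" True result worth guarding against: comparing two missing values would otherwise look like a match.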


ross-spencer · 2019-04-02T08:00:25Z

reports/duplicates/duplicates.feature

+	Given an AIP has been ingested
+	When the duplicates.py script is run 
+	And a duplicate checksum is found
+	Then the api-store-duplicates.csv file is generated


Behavior-driven development asks about the what, not the how. In this example, we don't want to rely on the mechanism duplicates.py, because that script name might change. We should shorten this to:

    Given an AIP has been ingested
    When a duplicate checksum is found
    Then a duplicates report is generated
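To illustrate why the wording can stay independent of the script name, here is a rough sketch of the behavior those steps describe, assuming the input is a list of (path, checksum) pairs. The function names `find_duplicate_checksums` and `generate_report` are hypothetical; the real logic lives in duplicates.py and may look quite different:

```python
from collections import defaultdict


def find_duplicate_checksums(files):
    """Group file paths by checksum, keeping only checksums seen more than once."""
    by_checksum = defaultdict(list)
    for path, checksum in files:
        by_checksum[checksum].append(path)
    return {
        checksum: paths
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }


def generate_report(files):
    """Return (checksum, path) report rows, or None when no duplicate is found."""
    duplicates = find_duplicate_checksums(files)
    if not duplicates:
        return None
    return [
        (checksum, path)
        for checksum, paths in sorted(duplicates.items())
        for path in paths
    ]
```

A step implementation can call whatever function ends up doing this work, so renaming the script would not require touching the feature file.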

ross-spencer · 2019-04-02T08:11:31Z

reports/duplicates/duplicates.feature

+	When the base_name is equivalent
+	When the file_path is equivalent
+	When the date_modified is equivalent
+	Then the files are true duplicates


I feel like this is a second scenario: the first is "Generate a duplicate report" and the second is "Detect a true duplicate file". It would look something like:

    Given a duplicates report
    When a file's <properties> are equivalent
    Then the files are true duplicates

And we can use a table to describe some rows of properties we need to validate.
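For anyone new to outlines, this is roughly what the table does: each row is substituted into the `<properties>` placeholder, giving one concrete scenario per row. The sketch below only mimics that expansion in plain Python. It is not how Behave implements it, and `expand_outline` is a made-up name:

```python
OUTLINE = [
    "Given a duplicates report",
    "When a file's <properties> are equivalent",
    "Then the files are true duplicates",
]

# One dict per table row, keyed by the column header.
EXAMPLES = [
    {"properties": "base_name"},
    {"properties": "file_path"},
    {"properties": "date_modified"},
]


def expand_outline(steps, rows):
    """Return one list of concrete steps for each example row."""
    scenarios = []
    for row in rows:
        expanded = []
        for step in steps:
            # Replace every <header> placeholder with the row's value.
            for header, value in row.items():
                step = step.replace("<{}>".format(header), value)
            expanded.append(step)
        scenarios.append(expanded)
    return scenarios


for scenario in expand_outline(OUTLINE, EXAMPLES):
    print("\n".join(scenario))
    print()
```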

It would be good to get clarification on which feature we're writing for - this current version looks like it could be tagged @future because I can see how it's where we might want to be. For IISH the feature is about generating the report so that the analysis can be done - I think it's the difference between Generate a duplicates report and Generate a true duplicates report.

Co-Authored-By: peterVG <[email protected]>

ross-spencer

I think this is nearly there @peterVG, and the table looks great. There's a single line missing, and we just need to change the spacing.

First, add something like this to the second scenario: Given a duplicates report is available

Then correct the tabs to spaces and use more programming-language-like indentation, with four spaces for each indent:

(If you're happy with adding the line above, you can copy and paste the below):

Feature: Identify true duplicates in the Archivematica AIP store.

Background: Alma uses checksums and archival context to determine "true" duplicate files in their collection (i.e. the context of creation and use is identical).

Scenario: Generate a duplicate report
    Given an AIP has been ingested
    When a duplicate checksum is found
    Then a duplicates report is generated
    
Scenario Outline: Detect a true duplicate file
    Given a duplicates report is available
    When a file's <properties> are equivalent
    Then the files are true duplicates

    Examples:
    | properties    |
    | AIP dir_name  |  
    | base_name     |
    | file_path     |
    | date_modified |

peterVG · 2019-04-04T20:11:35Z

Thanks Ross. I committed these updates to my patch branch for your review, and for a PR to your dev/issue-448-add-duplicate-reporting-mechanism branch if you agree.

ross-spencer · 2019-04-05T09:48:52Z

Looks great @peterVG. The last thing is the structure of the repository. Can I propose:

├── duplicates
│   ├── appconfig.py
│   ├── config.json
│   ├── duplicates.py
│   ├── features    <--- New structure
│   │   ├── duplicates.feature    <--- Your feature file
│   │   └── steps    <--- We can add this folder another time
│   │       └── duplicates.steps
│   ├── __init__.py
│   ├── loggingconfig.py
│   ├── parsemets.py
│   ├── README.md
│   ├── requirements
│   │   ├── base.txt
│   │   ├── local.txt
│   │   └── production.txt
│   ├── requirements.txt
│   └── serialize_to_csv.py
├── __init__.py
└── README.md

Ref: https://behave.readthedocs.io/en/latest/gherkin.html#feature-testing-layout

I'm proposing this structure as opposed to the tests layout because in Archivematica our unit tests normally live under tests, and I think keeping the two separate makes it clean and easy to understand what we're doing.

CC. @replaceafill this could be our first feature that uses Behave outside of AMAUAT - do you think this seems sensible as a layout for a standalone feature?

replaceafill · 2019-04-05T15:00:25Z

@ross-spencer Nice! And yes, it makes sense to me. I'd just rename duplicates.steps to duplicates_steps.py if that's a Python module.

ross-spencer · 2019-06-26T17:59:17Z

Closing in favor of #118; the feature file work has been rebased and cherry-picked into it.

ross-spencer and others added 14 commits March 14, 2019 17:03

Enable duplicate detection via bag manifests

c75f378

Fortify existing codebase

efde38e

Extract data from METS

1880375

Augment data with dates

f20e733

Fortify work before demo

5798512

Add logging and fixup delete error

966f65f

Add CSV capability

03cbeef

Output to files and stream

319c889

Prepare for demo

c968398

Order CSV columns

b8f3ac7

Minor refactor for reliability

e4a1298

Update column ordering

27c83ba

Add pip instructions to the configuration section

743861b

add feature file for duplicate detection

ed8cfa8

first draft

peterVG requested a review from ross-spencer April 1, 2019 21:19

ross-spencer reviewed Apr 2, 2019

ross-spencer and others added 2 commits April 2, 2019 09:38

Update reports/duplicates/duplicates.feature

53ed9d7

Co-Authored-By: peterVG <[email protected]>

incorporate revisions suggested by Ross

c20bda6

ross-spencer reviewed Apr 3, 2019

update formatting (#105)

6936415

ross-spencer self-requested a review April 5, 2019 09:41

ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch from 743861b to 1874b3d Compare April 10, 2019 11:25

ross-spencer closed this Jun 26, 2019

ross-spencer reopened this Jun 26, 2019

ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch from 743861b to 4db9351 Compare June 26, 2019 16:20

ross-spencer closed this Jun 26, 2019
