Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add feature file for duplicate detection #103

Closed

Conversation

peterVG
Copy link

@peterVG peterVG commented Apr 1, 2019

first draft

@peterVG peterVG requested a review from ross-spencer April 1, 2019 21:19
Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @peterVG I left some comments where we can pull things apart a little more. The direction is good though. If you have questions, perhaps we can discuss them in the meeting later?

The checks that you are writing look great. As we finalize this first draft, I'd like to see us introduce some negative assertions into a scenario of their own so that we can test not just what we expect to happen, but what we definitely don't want to happen.

E.g. in the scenario as it is written, we assert when all properties match we will generate a report. But we don't assert that when only one or two out of three or four properties match, then no report will be generated.

The negative assertions are just as important in this scenario because we can generate True results very easily doing silly things in code. We should make sure that they don't manifest themselves when we're only testing for positives and then affect future disposal recommendations.

reports/duplicates/duplicates.feature Outdated Show resolved Hide resolved
reports/duplicates/duplicates.feature Outdated Show resolved Hide resolved
Given an AIP has been ingested
When the duplicates.py script is run
And a duplicate checksum is found
Then the api-store-duplicates.csv file is generated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavioral driven development asks about the what not the how. In this example, we don't want to rely on the mechanism duplicates.py because that script name might change. We should shorten this to:

Given an AIP has been ingested
When a duplicate checksum is found
Then a duplicates report is generated

When the base_name is equivalent
When the file_path is equivalent
When the date_modified is equivalent
Then the files are true duplicates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is a second scenario, the first Generate a duplicate report, the second Detect a true duplicate file, this would look something like:

Given a duplicates report
When a file's <properties> are equivalent
Then the files are true duplicates

And we can use a table to describe some rows of properties we need to validate.

It would be good to get clarification on which feature we're writing for - this current version looks like it could be tagged @future because I can see how it's where we might want to be. For IISH the feature is about generating the report so that the analysis can be done - I think it's the difference between Generate a duplicates report and Generate a true duplicates report.

Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is nearly there @peterVG and the table looks great. There's a single line missing, and we just need to change the spacing, so we'll end up with:

Adding something like this in the second scenario: Given a duplicates report is available

And then correcting the tabs to spaces, and using more programming-language like indentation, so four-spaces for each indent:

(If you're happy with adding the line above, you can copy and paste the below):

Feature: Identify true duplicates in the Archivematica AIP store.

Background: Alma uses checksums and archival context to determine "true" duplicate files in their collection (i.e. the context of creation and use is identical).

Scenario: Generate a duplicate report
    Given an AIP has been ingested
    When a duplicate checksum is found
    Then a duplicates report is generated
    
Scenario Outline: Detect a true duplicate file
    Given a duplicates report is available
    When a file's <properties> are equivalent
    Then the files are true duplicates

    Examples:
    | properties    |
    | AIP dir_name  |  
    | base_name     |
    | file_path     |
    | date_modified |

@peterVG
Copy link
Author

peterVG commented Apr 4, 2019

Thanks Ross. I committed these updates to my patch branch for your review and PR to your dev/issue-448-add-duplicate-reporting-mechanism if you agree.

@ross-spencer ross-spencer self-requested a review April 5, 2019 09:41
@ross-spencer
Copy link
Contributor

Looks great @peterVG - last thing is structure of the repository, can I propose:

├── duplicates
│   ├── appconfig.py
│   ├── config.json
│   ├── duplicates.py
│   ├── features    <--- New structure
│   │   ├── duplicates.feature    <--- Your feature file
│   │   └── steps    <--- We can add this folder another time
│   │       └── duplicates.steps
│   ├── __init__.py
│   ├── loggingconfig.py
│   ├── parsemets.py
│   ├── README.md
│   ├── requirements
│   │   ├── base.txt
│   │   ├── local.txt
│   │   └── production.txt
│   ├── requirements.txt
│   └── serialize_to_csv.py
├── __init__.py
└── README.md

Ref: https://behave.readthedocs.io/en/latest/gherkin.html#feature-testing-layout

I'm proposing this structure opposed to the tests layout, because in Archivematica our Unit Tests are normally under tests, and I think keeping these separated makes it clean and easy to understand what we're doing.

CC. @replaceafill this could be our first feature that uses Behave outside of AMAUAT - do you think this seems sensible as a layout for a standalone feature?

@replaceafill
Copy link
Member

@ross-spencer Nice! And yes, it makes sense to me. I'd just rename duplicates.steps to duplicates_steps.py if that's a Python module.

@ross-spencer ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch from 743861b to 1874b3d Compare April 10, 2019 11:25
@ross-spencer ross-spencer reopened this Jun 26, 2019
@ross-spencer ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch from 743861b to 4db9351 Compare June 26, 2019 16:20
@ross-spencer
Copy link
Contributor

Closing in favor of #118 the feature file work has been rebased and cherry-picked into there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants