add feature file for duplicate detection #103
Conversation
first draft
Hi @peterVG, I left some comments where we can pull things apart a little more. The direction is good, though. If you have questions, perhaps we can discuss them in the meeting later?
The checks that you are writing look great. As we finalize this first draft, I'd like to see us introduce some negative assertions into a scenario of their own so that we can test not just what we expect to happen, but what we definitely don't want to happen.
E.g. in the scenario as it is written, we assert when all properties match we will generate a report. But we don't assert that when only one or two out of three or four properties match, then no report will be generated.
The negative assertions are just as important in this scenario because we can generate True results very easily by doing silly things in code. We should make sure that those false positives don't slip through when we're only testing for positives, and then affect future disposal recommendations.
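To make the point concrete, here is a minimal, hypothetical Python sketch of a positive and a negative assertion side by side. The function and property names are illustrative only, not the actual duplicates.py API:

```python
# Hypothetical strict "true duplicate" check: only returns True when ALL
# archival-context properties match. Property names follow the scenario;
# the function name is an assumption, not the real duplicates.py API.
PROPERTIES = ("base_name", "file_path", "date_modified")

def is_true_duplicate(file_a: dict, file_b: dict) -> bool:
    """Return True only when every property is equivalent."""
    return all(file_a[p] == file_b[p] for p in PROPERTIES)

a = {
    "base_name": "report.pdf",
    "file_path": "/objects/report.pdf",
    "date_modified": "2019-01-01",
}
b = dict(a)                                  # all three properties match
c = dict(a, file_path="/other/report.pdf")   # only two of three match

# Positive assertion: all properties match -> true duplicate.
assert is_true_duplicate(a, b)
# Negative assertion: a partial match must NOT count as a true duplicate.
assert not is_true_duplicate(a, c)
```

The negative assertion on `c` is the one that catches a sloppy implementation (e.g. one that only compares checksums) that would still pass the positive case.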
Given an AIP has been ingested
When the duplicates.py script is run
And a duplicate checksum is found
Then the api-store-duplicates.csv file is generated
Behavior-driven development asks about the what, not the how. In this example, we don't want to rely on the mechanism duplicates.py
because that script name might change. We should shorten this to:
Given an AIP has been ingested
When a duplicate checksum is found
Then a duplicates report is generated
When the base_name is equivalent
When the file_path is equivalent
When the date_modified is equivalent
Then the files are true duplicates
I feel like this is really two scenarios: the first, Generate a duplicate report; the second, Detect a true duplicate file. The second would look something like:
Given a duplicates report
When a file's <properties> are equivalent
Then the files are true duplicates
And we can use a table to describe some rows of properties we need to validate.
It would be good to get clarification on which feature we're writing for. This current version looks like it could be tagged @future, because I can see how it's where we might want to be. For IISH, the feature is about generating the report so that the analysis can be done. I think it's the difference between Generate a duplicates report and Generate a true duplicates report.
Co-Authored-By: peterVG <[email protected]>
I think this is nearly there @peterVG and the table looks great. There's a single line missing, and we just need to change the spacing, so we'll end up with:
Adding something like this in the second scenario: Given a duplicates report is available
And then correcting the tabs to spaces, and using more programming-language-like indentation, so four spaces for each indent:
(If you're happy with adding the line above, you can copy and paste the below):
Feature: Identify true duplicates in the Archivematica AIP store.

    Background: Alma uses checksums and archival context to determine "true" duplicate files in their collection (i.e. the context of creation and use is identical).

    Scenario: Generate a duplicate report
        Given an AIP has been ingested
        When a duplicate checksum is found
        Then a duplicates report is generated

    Scenario Outline: Detect a true duplicate file
        Given a duplicates report is available
        When a file's <properties> are equivalent
        Then the files are true duplicates

        Examples:
            | properties    |
            | AIP dir_name  |
            | base_name     |
            | file_path     |
            | date_modified |
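As a rough sketch of how the Generate a duplicate report scenario could be satisfied, assuming files are represented as simple (checksum, path) pairs. The function name, field names, and CSV layout here are assumptions for illustration, not the script's actual interface:

```python
import csv
import io
from collections import defaultdict

def generate_duplicates_report(files):
    """Group (checksum, path) pairs by checksum and return CSV text
    listing every file whose checksum appears more than once."""
    by_checksum = defaultdict(list)
    for checksum, path in files:
        by_checksum[checksum].append(path)

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["checksum", "path"])
    for checksum, paths in by_checksum.items():
        if len(paths) > 1:  # a duplicate checksum is found
            for path in paths:
                writer.writerow([checksum, path])
    return out.getvalue()

report = generate_duplicates_report([
    ("abc123", "aip-1/objects/a.txt"),
    ("abc123", "aip-2/objects/b.txt"),
    ("fff999", "aip-1/objects/c.txt"),  # unique, so excluded from report
])
```

Note that the report deliberately only asserts checksum equality; deciding which of those rows are true duplicates is left to the second scenario, matching the report/analysis split discussed above.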
Thanks Ross. I committed these updates to my patch branch for your review and PR to your dev/issue-448-add-duplicate-reporting-mechanism if you agree.
Looks great @peterVG. The last thing is the structure of the repository; can I propose:
Ref: https://behave.readthedocs.io/en/latest/gherkin.html#feature-testing-layout
I'm proposing this structure as opposed to the …
CC @replaceafill: this could be our first feature that uses Behave outside of AMAUAT. Do you think this seems sensible as a layout for a standalone feature?
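For reference, a minimal standalone layout following the behave documentation linked above might look like this (the feature and step file names are placeholders, not names taken from this PR):

```
features/
    identify-true-duplicates.feature
    steps/
        duplicates_steps.py
```

behave discovers `.feature` files under `features/` and loads their step implementations from `features/steps/`, so a single feature plus one steps module is enough for a standalone repository.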
@ross-spencer Nice! And yes, it makes sense to me. I'd just rename
743861b to 1874b3d
743861b to 4db9351
Closing in favor of #118; the feature file work has been rebased and cherry-picked into there.