Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplicate checksum verification #1012

Merged
merged 4 commits into from
Mar 27, 2018

Conversation

jrwdunham
Copy link
Contributor

Fixes #918

In short, removes the once-per-file verifyPREMISChecksums_v0.0 micro-service and replaces it with once-per-SIP equivalent functionality in a new verifyAIP_v1.0 client script (which supersedes verifyAIP_v0.0). Question: I have deleted verifyAIP_v0.0 because other superseded client scripts appear to have been deleted. Should it be retained?

The new verifyAIP_v1.0 parses the BagIt manifest file for the checksums that bag has just calculated instead of needlessly re-calculating them.

The functional consequences of this change are as follows.

  1. AIP METS files will no longer have fixity check PREMIS events for each file.

  2. The AIP pointer file will now contain a single fixity check PREMIS event for the entire AIP:

    <mets:mdWrap MDTYPE="PREMIS:EVENT">
      <mets:xmlData>
        <premis:event xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
          <premis:eventIdentifier>
            <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
            <premis:eventIdentifierValue>bdf5c772-1923-4241-a88b-1f949ab542c2</premis:eventIdentifierValue>
          </premis:eventIdentifier>
          <premis:eventType>fixity check</premis:eventType>
          <premis:eventDateTime>2018-03-23T16:57:37.602859+00:00</premis:eventDateTime>
          <premis:eventDetail>program="python, bag"; module="hashlib.sha256()"</premis:eventDetail>
          <premis:eventOutcomeInformation>
            <premis:eventOutcome>Pass</premis:eventOutcome>
            <premis:eventOutcomeDetail>
              <premis:eventOutcomeDetailNote>
                All 3 checksums generated at start of transfer match those generated by BagIt (bag).
              </premis:eventOutcomeDetailNote>
            </premis:eventOutcomeDetail>
          </premis:eventOutcomeInformation>
          <premis:linkingAgentIdentifier>
            <premis:linkingAgentIdentifierType>preservation system</premis:linkingAgentIdentifierType>
            <premis:linkingAgentIdentifierValue>Archivematica-1.7</premis:linkingAgentIdentifierValue>
          </premis:linkingAgentIdentifier>
          <premis:linkingAgentIdentifier>
            <premis:linkingAgentIdentifierType>repository code</premis:linkingAgentIdentifierType>
            <premis:linkingAgentIdentifierValue>test</premis:linkingAgentIdentifierValue>
          </premis:linkingAgentIdentifier>
        </premis:event>
      </mets:xmlData>
    </mets:mdWrap>
    
  3. AIPs without pointer files will have no record of the SIP/AIP-level fixity check. This was determined to be an acceptable trade-off and should be remedied once we [begin creating pointer files for uncompressed AIPs] (Problem: uncompressed AIPs need pointer files archivematica-storage-service#324), which is motivated by other considerations as well.

@@ -130,11 +130,10 @@ trimVerifyChecksums_v0.0 = %clientScriptsDirectory%trimVerifyChecksums.py
trimVerifyManifest_v0.0 = %clientScriptsDirectory%trimVerifyManifest.py
updateSizeAndChecksum_v0.0 = %clientScriptsDirectory%archivematicaUpdateSizeAndChecksum.py
validateFile_v1.0 = %clientScriptsDirectory%validateFile.py
verifyAIP_v0.0 = %clientScriptsDirectory%verifyAIP.py
verifyAIP_v1.0 = %clientScriptsDirectory%verify_AIP.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.python.org/dev/peps/pep-0008/#package-and-module-names

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

import sys

import django
django.setup()
Copy link
Member

@sevein sevein Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this call inside the if __name__ == '__main__': block? I find it's more explicit and clearer. If there was a case where someone is using this module then the caller should be responsible to initialize Django. Probably not very important, but I've been doing it this way recently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok



def extract_aip(aip_path, extract_path):
os.makedirs(extract_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default mode is ok? Just checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be ok. Extraction works

"""

sip_uuid = sys.argv[1] # %sip_uuid%
aip_path = sys.argv[2] # SIPDirectory%%sip_name%-%sip_uuid%.7z
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice way to put it with the examples, that's useful :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do it, but I'll take the compliment

return_code = 1

if return_code == 0:
return_code = verify_checksums(bag, sip_uuid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could argue that it's not verify_checksums's concern to know about exit codes - it'd be more pythonic to have it raise an exception right? Feel free to ignore this. Just a random thought, I've done the same so many times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you're right. I'll change.


# cleanup
if not is_uncompressed_aip:
shutil.rmtree(extract_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make the script report an error if the clean up did not succeed? You may want to trap OSError if that's not the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

- Removes the verifyPREMISChecksums_v0.0 client script
  verifyPREMISChecksums.py.
- Removes verifyPREMISChecksums_v0.0 from archivematicaClientModules.
- Introduces a migration that removes the "Verify checksums generated on
  ingest" micro-service.
- After performing Bag validation to verify the AIP, the new verify_AIP client
  script (verifyAIP_v1.0) queries the db for transfer-generated checksums and
  then verifies that they match what is documented in the corresponding
  bag-generated manifest file.
- Added DEFAULT_CHECKSUM_ALGORITHM to MCPClient Django settings.
- Modified "remove verify premis checksums" migration so that the "Verify AIP"
  micro-service would point to the new verifyAIP_v1.0 client script as its
  `execute` value.
- Added cosmetic refactoring of databaseFunctions.py::insertIntoEvents.
@sevein
Copy link
Member

sevein commented Mar 26, 2018

Sorry I didn't realize that it wasn't really your code what I was looking at!

@qubot qubot force-pushed the dev/issue-918-deduplicate-checksum-verification branch from defb42c to 9451abf Compare March 26, 2018 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants