Misidentification / FIDO vs. Siegfried #76
Hi Genevieve

The underlying problem here is the PRONOM byte signatures. The SVG signatures (for PUIDs fmt/91, fmt/92 and fmt/413) all take a form that requires an XML declaration to match.

Fido reports two SVG results (fmt/92 and fmt/413), but not on the basis of a byte match, just on the extension matching (that is the "extension" in the last field of your Fido report). Siegfried is more conservative than Fido when reporting extension matches: if an extension matches a signature but the format has a byte signature that hasn't matched, then siegfried won't return a result but will instead give UNKNOWN and will list the possible extension matches in the warning field. The rationale for this is that in situations where the extension says one thing and the file contents say another, it is safest for users to inspect the file and verify.

I think the best solution here is to request an update of the SVG signatures, so that they don't require an XML declaration to match. You can request changes to PRONOM using this form: https://apps.nationalarchives.gov.uk/PRONOM/submitinfo.htm. I've just made this request to the TNA and hopefully this will be amended in a future release of PRONOM.

Another (future) option might be to use a MIME-Info signature file with siegfried. The latest release of siegfried added this option so that you can now choose to use the signatures from the Apache Tika or Freedesktop.org projects, rather than PRONOM. These signatures have better XML detection than PRONOM (because they have signature types that look for the root tags and namespaces of XML files, rather than just treating them as byte streams) and so are more reliable for formats like SVG.
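To illustrate the difference between the two styles of detection, here is a minimal Python sketch. It is not how siegfried, Tika or PRONOM implement their signatures; the function names (`has_xml_declaration`, `root_tag`, `looks_like_svg`) are hypothetical, and the declaration check is a deliberate simplification of a byte signature. It only shows why a root-tag and namespace test can still recognise an SVG file that has no XML declaration.

```python
import io
import xml.etree.ElementTree as ET

# The SVG namespace URI defined by the W3C SVG specification.
SVG_NS = "http://www.w3.org/2000/svg"


def has_xml_declaration(data: bytes) -> bool:
    """Simplified stand-in for a byte signature that needs '<?xml' up front."""
    return data.startswith(b"<?xml")


def root_tag(data: bytes):
    """Return (namespace, local_name) of the root element, or None if unparsable."""
    try:
        # The first "start" event is always the root element.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            tag = elem.tag
            if tag.startswith("{"):
                namespace, _, local = tag[1:].partition("}")
                return namespace, local
            return "", tag
    except ET.ParseError:
        return None
    return None


def looks_like_svg(data: bytes) -> bool:
    """Root-tag style check: an <svg> root element in the SVG namespace."""
    return root_tag(data) == (SVG_NS, "svg")


if __name__ == "__main__":
    sample = b'<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"></svg>'
    print("declaration present:", has_xml_declaration(sample))
    print("root tag says SVG:", looks_like_svg(sample))
```

A file like `sample` fails the declaration-style check but passes the root-tag check, which is the situation described above for SVG files without an XML declaration.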
The "Try Siegfried" demonstrator at http://www.itforarchivists.com/siegfried now gives both PRONOM and Tika results when it scans, and you can see for your sample file that while it is UNKNOWN for PRONOM, the Tika identifier gives a correct match.

This is a new feature of siegfried that hasn't found its way into Archivematica yet, but I'm hopeful that in future releases of Archivematica you'll get more options about how siegfried is configured.

I hope this all makes sense, and thanks again for the report,

cheers
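The reporting rule described in the comment above can be sketched roughly as follows. This is an illustration in Python, not siegfried's or Fido's code: the `Format` class and both functions are hypothetical, and the handling of formats that have no byte signature at all is an assumption, since the comment only covers formats whose byte signature failed to match.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Format:
    puid: str
    extensions: Tuple[str, ...]
    has_byte_signature: bool


def conservative_report(ext: str, byte_matches: Sequence[str],
                        formats: Sequence[Format]) -> Tuple[List[str], str]:
    """Siegfried-style rule as described in the thread: returns (puids, warning)."""
    if byte_matches:
        return list(byte_matches), ""
    ext_matches = [f for f in formats if ext in f.extensions]
    # Assumption: an extension match is only trusted when the format has
    # no byte signature that could have contradicted it.
    trusted = [f.puid for f in ext_matches if not f.has_byte_signature]
    if trusted:
        return trusted, "match on extension only"
    if ext_matches:
        puids = ", ".join(f.puid for f in ext_matches)
        return ["UNKNOWN"], "no match; possibilities based on extension are " + puids
    return ["UNKNOWN"], "no match"


def permissive_report(ext: str, byte_matches: Sequence[str],
                      formats: Sequence[Format]) -> Tuple[List[str], str]:
    """Permissive rule: fall back to every extension match when bytes fail."""
    if byte_matches:
        return list(byte_matches), "signature"
    return [f.puid for f in formats if ext in f.extensions], "extension"


# The three SVG PUIDs named in the thread, each with a byte signature.
SVG_FORMATS = [
    Format("fmt/91", ("svg",), True),
    Format("fmt/92", ("svg",), True),
    Format("fmt/413", ("svg",), True),
]

if __name__ == "__main__":
    print(conservative_report("svg", [], SVG_FORMATS))
    print(permissive_report("svg", [], SVG_FORMATS))
```

With no byte match, the conservative rule yields UNKNOWN plus a warning listing the candidates, while the permissive rule returns the extension candidates as results. The permissive function is only a contrast case and is not meant to reproduce Fido's exact output.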
Thanks for making that request with PRONOM - and for clarifying things!
Just to note - this is still on our backlog and I'll try to address it in our next release, probably around late May
The latest
Apologies Richard, my fault entirely. I'm aiming for November for v95 and this update will be included.
@Dclipsham You are on FIRE!!!
Hi Richard,
Here are the outputs of the two tools for the attached vector image. In Archivematica, Siegfried is defaulting to identifying this .svg as Generic TXT, which is a problem mainly because the format normalization policies are different (and also because it's simply incorrect). FIDO, however, identifies this file correctly. See below:
SIEGFRIED OUTPUT -- Siegfried in Archivematica defaults to ID'ing as TXT:
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ sf '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
---
siegfried : 1.1.0
scandate : 2016-03-29T13:10:23-04:00
signature : archivematica.sig
created : 2015-05-16T20:44:59+10:00
identifiers :
  name : 'archivematica'
  details : 'DROID_SignatureFile_V82.xml; container-signature-20150327.xml; extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'
filename : '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
filesize : 1629
errors :
matches :
  puid : UNKNOWN
  format :
  version :
  mime :
  basis :
  warning : 'no match; possibilities based on extension are fmt/91, fmt/92, fmt/413'
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ ^C
FIDO OUTPUT - in Archivematica, ID's as SVG:
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ fido '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
FIDO v1.3.1 (formats-v81.xml, container-signature-20130501.xml, format_extensions.xml)
bad repeat interval
bad repeat interval
bad repeat interval
OK,95,fmt/92,"Scalable Vector Graphics","External",1629,"/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg","image/svg+xml","extension"
OK,95,fmt/413,"Scalable Vector Graphics Tiny","External",1629,"/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg","None","extension"
FIDO: Processed 1 files in 129.50 msec, 8 files/sec
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$
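For anyone who wants to check mechanically whether FIDO results like the ones above rest only on the file extension, here is a small Python sketch. It assumes the nine-column CSV layout seen in this output; the column names in `FIELDS` are labels chosen for this example rather than taken from FIDO's documentation, and `parse_fido_lines` and `extension_only` are hypothetical helpers.

```python
import csv
from typing import Dict, Iterable, List

# Assumed labels for the nine columns in the FIDO output shown above.
FIELDS = [
    "status", "time", "puid", "format_name", "signature_name",
    "filesize", "filename", "mimetype", "match_type",
]


def parse_fido_lines(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only well-formed 'OK' result rows; banners and warnings are skipped."""
    rows = []
    for record in csv.reader(lines):
        if len(record) == len(FIELDS) and record[0] == "OK":
            rows.append(dict(zip(FIELDS, record)))
    return rows


def extension_only(rows: List[Dict[str, str]]) -> bool:
    """True when there is at least one result and every one is an extension match."""
    return bool(rows) and all(r["match_type"] == "extension" for r in rows)


if __name__ == "__main__":
    path = "/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg"
    output = [
        "bad repeat interval",
        'OK,95,fmt/92,"Scalable Vector Graphics","External",1629,"%s","image/svg+xml","extension"' % path,
        'OK,95,fmt/413,"Scalable Vector Graphics Tiny","External",1629,"%s","None","extension"' % path,
    ]
    rows = parse_fido_lines(output)
    print([r["puid"] for r in rows], "extension only:", extension_only(rows))
```

Results flagged this way are candidates for manual inspection, since no byte signature confirmed them.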
Archivematica Report:
IDCommand UUID: 8cc792b4-362d-4002-8981-a4e808c04b24
File: (17776d39-5796-4f37-8a1e-40706fd40e8a) /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/FPR_Test_SVG-487653f8-6df5-4835-b30a-90f45b65ff3e/objects/green-blue-70220cf9da9f0b6cff6086e78b69ddfb-2.svg
x-fmt/111
Command output: x-fmt/111
/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/FPR_Test_SVG-487653f8-6df5-4835-b30a-90f45b65ff3e/objects/green-blue-70220cf9da9f0b6cff6086e78b69ddfb-2.svg identified as a Generic TXT
Attached file:
green-blue.svg.zip