Misidentification / FIDO vs. Siegfried #76

Open
genfhk opened this issue Mar 29, 2016 · 7 comments
@genfhk

genfhk commented Mar 29, 2016

Hi Richard,
Here are the outputs of two tools for the attached vector image. In Archivematica, Siegfried defaults to identifying this .svg as Generic TXT, which is a problem mainly because the format normalization policies are different (and it's also just incorrect). FIDO, however, identifies this file correctly. See below:

SIEGFRIED OUTPUT -- Siegfried in Archivematica defaults to ID'ing as TXT:
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ sf '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
---
siegfried : 1.1.0
scandate : 2016-03-29T13:10:23-04:00
signature : archivematica.sig
created : 2015-05-16T20:44:59+10:00
identifiers :
  - name : 'archivematica'
    details : 'DROID_SignatureFile_V82.xml; container-signature-20150327.xml; extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'

filename : '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
filesize : 1629
errors :
matches :
  - id : archivematica
    puid : UNKNOWN
    format :
    version :
    mime :
    basis :
    warning : 'no match; possibilities based on extension are fmt/91, fmt/92, fmt/413'
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ ^C
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$

FIDO OUTPUT - in Archivematica, ID's as SVG:
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$ fido '/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg'
FIDO v1.3.1 (formats-v81.xml, container-signature-20130501.xml, format_extensions.xml)
bad repeat interval
bad repeat interval
bad repeat interval
OK,95,fmt/92,"Scalable Vector Graphics","External",1629,"/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg","image/svg+xml","extension"
OK,95,fmt/413,"Scalable Vector Graphics Tiny","External",1629,"/home/archives/Desktop/FPR_Test 2/Image-Vector/SVG/green-blue.svg","None","extension"
FIDO: Processed 1 files in 129.50 msec, 8 files/sec
archives@archives-ThinkStation-P300:~/Desktop/FPR_Test 2/Image-Vector/SVG$

Archivematica Report:
IDCommand UUID: 8cc792b4-362d-4002-8981-a4e808c04b24
File: (17776d39-5796-4f37-8a1e-40706fd40e8a) /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/FPR_Test_SVG-487653f8-6df5-4835-b30a-90f45b65ff3e/objects/green-blue-70220cf9da9f0b6cff6086e78b69ddfb-2.svg
x-fmt/111

Command output: x-fmt/111
/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/FPR_Test_SVG-487653f8-6df5-4835-b30a-90f45b65ff3e/objects/green-blue-70220cf9da9f0b6cff6086e78b69ddfb-2.svg identified as a Generic TXT

Attached file:
green-blue.svg.zip

@richardlehane
Owner

Hi Genevieve
thanks very much for this detailed report.

The underlying problem here is the PRONOM byte signatures. The SVG signatures (for PUIDs fmt/91, fmt/92 and fmt/413) all take the form <?xml version="1.0"*<svg. Your sample file begins with <svg and is missing that <?xml declaration so none of the byte signatures match.
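To illustrate why the sample misses, here is a minimal Python sketch. It approximates the signature shape quoted above with a regular expression. This is an assumption made for illustration only: PRONOM signatures are byte sequences with wildcards and offsets, not regexes, and the helper name and sample bytes below are hypothetical rather than taken from the attached file.

```python
import re

# Rough regex stand-in for the signature shape <?xml version="1.0"*<svg:
# an XML declaration at the very start, any bytes, then "<svg".
# Assumption: real PRONOM signatures are not regexes and carry
# offset limits that this sketch ignores.
SVG_SIG = re.compile(rb'^<\?xml version="1\.0".*<svg', re.DOTALL)


def matches_svg_signature(data: bytes) -> bool:
    """Return True if the bytes fit the approximated SVG signature."""
    return SVG_SIG.search(data) is not None


# Hypothetical sample contents, not the attached green-blue.svg.
with_declaration = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<svg xmlns="http://www.w3.org/2000/svg"></svg>'
)
without_declaration = b'<svg xmlns="http://www.w3.org/2000/svg"></svg>'

print(matches_svg_signature(with_declaration))     # True
print(matches_svg_signature(without_declaration))  # False
```

A file that opens directly with `<svg` never satisfies the leading declaration, so every signature of this shape fails regardless of what follows.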

Fido reports two svg results (fmt/92 and fmt/413) but not on the basis of a byte match, just on the extension matching (that's the "extension" in the last field of your Fido report). Siegfried is more conservative than Fido when reporting extension matches: if an extension matches a signature but the format has a byte signature that hasn't matched, then siegfried won't return a result but will instead give UNKNOWN and will list the possible extension matches in the warning field. The rationale for this is that in situations where the extension says one thing, and the file contents say another, it is safest for users to inspect the file and verify.
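The rule described above can be sketched roughly as follows. This is not siegfried's code (siegfried is written in Go); the function name, arguments, and the "match on extension only" note are hypothetical, and the branch for formats that have no byte signature is an inference from the description rather than something stated here.

```python
from typing import Iterable, List, Set, Tuple


def conservative_id(byte_matches: List[str],
                    ext_matches: List[str],
                    has_byte_sig: Iterable[str]) -> Tuple[str, str]:
    """Return (puid, warning) following the conservative rule.

    byte_matches: PUIDs whose byte signature matched the file
    ext_matches:  PUIDs whose registered extension fits the filename
    has_byte_sig: PUIDs that define a byte signature at all
    """
    with_sig: Set[str] = set(has_byte_sig)
    if byte_matches:
        return byte_matches[0], ""
    # Assumed: a format without any byte signature can only ever be
    # identified by extension, so an extension hit is accepted for it.
    ext_only = [p for p in ext_matches if p not in with_sig]
    if ext_only:
        return ext_only[0], "match on extension only"
    if ext_matches:
        # Extension says one thing, contents failed to confirm it.
        return "UNKNOWN", ("no match; possibilities based on extension are "
                           + ", ".join(ext_matches))
    return "UNKNOWN", "no match"


svg_puids = ["fmt/91", "fmt/92", "fmt/413"]
print(conservative_id([], svg_puids, svg_puids))
```

With no byte matches and three extension candidates that all define byte signatures, the sketch yields UNKNOWN plus a warning listing the candidates, the same shape as the sf output in the report.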

I think the best solution here is to request an update of the SVG signatures, so that they don't require an xml declaration to match. You can request changes to PRONOM using this form: https://apps.nationalarchives.gov.uk/PRONOM/submitinfo.htm. I've just made this request to the TNA and hopefully this will be amended in a future release of PRONOM.

Another (future) option might be to use a MIME-Info signature file with siegfried. The latest release of siegfried added this option so that you can now choose to use the signatures from the Apache Tika or Freedesktop.org projects, rather than PRONOM. These signatures have better XML detection than PRONOM (because they have signature types that look for the root tags and namespaces of XML files, rather than just treating them as byte streams) and so are more reliable for formats like SVG. The "Try Siegfried" demonstrator at http://www.itforarchivists.com/siegfried now gives both PRONOM and Tika results when it scans and you can see for your sample file that while it is UNKNOWN for PRONOM, the Tika identifier gives a correct match:

[Screenshot: "Try Siegfried" results for the sample file]
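As a rough illustration of what "looking at the root tag and namespace" means, here is a minimal Python sketch using the standard library XML parser. It is not how Tika, Freedesktop.org, or siegfried implement their matching; the function name and sample bytes are hypothetical, and it parses the whole document, whereas a real identifier would more likely inspect only the start of the file.

```python
import xml.etree.ElementTree as ET

# The SVG namespace URI defined by the W3C.
SVG_NS = "http://www.w3.org/2000/svg"


def looks_like_svg(data: bytes) -> bool:
    """Return True if the document's root element is <svg> in the SVG namespace."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be identified this way.
        return False
    # ElementTree renders a namespaced tag as "{namespace}localname".
    return root.tag == "{%s}svg" % SVG_NS


# Hypothetical sample contents, not the attached green-blue.svg.
no_declaration = b'<svg xmlns="http://www.w3.org/2000/svg"></svg>'
with_declaration = b'<?xml version="1.0"?>\n' + no_declaration

print(looks_like_svg(no_declaration))
print(looks_like_svg(with_declaration))
```

Because the check keys on the parsed root element rather than on leading bytes, the presence or absence of an XML declaration makes no difference to the result.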

This is a new feature of siegfried that hasn't found its way into archivematica yet, but I'm hopeful that in future releases of archivematica you'll get more options about how siegfried is configured.

I hope this all makes sense, and thanks again for the report,

cheers
Richard

@genfhk
Author

genfhk commented Mar 30, 2016

Thanks for making that request with PRONOM - and for clarifying things!

@Dclipsham

Just to note - this is still on our backlog and I'll try to address it in our next release, probably around late May.

@ablwr

ablwr commented Sep 26, 2018

The latest siegfried now identifies this sample .svg as 'UNKNOWN'!

@richardlehane
Owner

I did make a PRONOM request for this in 2016 but the PRONOM signatures all still require an xml declaration that is missing from this file:
[Screenshot of the PRONOM SVG signatures]

So the ID remains unknown and this issue remains open :( I do have an idea to convert PRONOM xml signatures to proper XML signatures that can be matched by siegfried's XML matching algo (which is only used for mime-info signatures like tika at present). That's a piece of work I'm yet to get around to. But it would resolve this issue independent of a PRONOM change and would make PRONOM xml matching better across the board.

@Dclipsham

Apologies Richard, my fault entirely. I'm aiming for November for v95 and this update will be included.

@ablwr

ablwr commented Oct 1, 2018

@Dclipsham You are on FIRE!!!
