Idea: don't blindly trust the extension over magic detection #2

dumblob · 2015-12-08T21:58:00Z

Right now the extension is tried first, and if a match is found, that result is returned. This bypasses actual type recognition via the magic database and makes type detection very unreliable.

I would try to combine both resolution methods (by extension and by magic lookup). Three possibilities come to mind:

1. Be smart: consult the magic DB first, and for MIME types that are hard to tell apart by content (e.g. C versus C++ source code, or the whole text/ group) stick to (or take a hint from) the extension.
2. If the magic-based detection and the extension-based detection don't match, let the user choose which one should be preferred (magic DB or extension). Keep in mind, though, that this is not batch-friendly behavior and is thus strongly discouraged for a terminal utility like this one.
3. Show both results in a fixed order.


technosaurus · 2015-12-09T08:54:31Z

This is already on the TODO list as MIMEverify(): ensure that the magic and the extension agree. However, at the moment there are a lot of common extensions that don't have magic checks (I just haven't had the time to track down and interpret the info). I would like to support magic checks on most of the popular formats before I implement that function; otherwise it is pretty useless.
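For illustration only, an agreement check of this kind could look roughly like the sketch below. The thread does not give the signature or behavior of MIMEverify(), so the function name `verify_sketch`, the result enum, and the three-entry rule table are all hypothetical. The third result value models the problem described above: for a type without a magic check, agreement simply cannot be verified.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for an extension-versus-magic comparison. */
enum verify_result {
    VERIFY_AGREE,     /* a magic rule exists and the bytes match it   */
    VERIFY_DISAGREE,  /* a magic rule exists but the bytes do not     */
    VERIFY_NO_MAGIC   /* no magic rule known for this MIME type       */
};

struct magic_rule {
    const char *mime;
    const unsigned char *sig;
    size_t sig_len;
};

/* Leading bytes of a few well-known formats. */
static const unsigned char sig_png[] = { 0x89, 'P', 'N', 'G' };
static const unsigned char sig_pdf[] = { '%', 'P', 'D', 'F', '-' };
static const unsigned char sig_gif[] = { 'G', 'I', 'F', '8' };

static const struct magic_rule magic_rules[] = {
    { "image/png",       sig_png, sizeof sig_png },
    { "application/pdf", sig_pdf, sizeof sig_pdf },
    { "image/gif",       sig_gif, sizeof sig_gif },
};

/* ext_mime is the type guessed from the extension; buf holds the file's
 * first len bytes. */
enum verify_result verify_sketch(const char *ext_mime,
                                 const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof magic_rules / sizeof magic_rules[0]; i++) {
        const struct magic_rule *r = &magic_rules[i];
        if (strcmp(r->mime, ext_mime) != 0)
            continue;
        if (len >= r->sig_len && memcmp(buf, r->sig, r->sig_len) == 0)
            return VERIFY_AGREE;
        return VERIFY_DISAGREE;
    }
    return VERIFY_NO_MAGIC;
}
```

The more extensions fall into the `VERIFY_NO_MAGIC` case, the less such a function can actually verify, which is why broad magic coverage has to come first.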

BTW, the existing function is that way on purpose. It's a lot faster to check the extension, since the file doesn't have to be opened, so there is no userspace->kernel->userspace switching. This is an efficient way for servers to provide MIME information on files of known origin. I have seen heavily loaded servers serving static files where, after the files had been pre-compressed, over half of the CPU time was spent in libmagic... Switching to extension-based MIME guessing more than doubled the throughput, to the point where the server was bandwidth limited.

dumblob · 2015-12-09T08:58:28Z

Of course using magic is way slower. On the other hand, web servers are a very specific use case, so the default behaviour should be reliable rather than quick at any cost. Making it configurable at compile time sounds like a solution.
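A compile-time switch along those lines might be sketched as follows. The macro name `MIME_TRUST_EXTENSION` and the functions `probe_extension`, `probe_magic` and `mime_detect` are made up for this example and are not part of the project. The default build consults the file contents first; building with `-DMIME_TRUST_EXTENSION` restores the fast path that opens the file only when the extension is unknown.

```c
#include <stdio.h>
#include <string.h>

/* Cheap path: looks only at the name and never opens the file. */
const char *probe_extension(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return NULL;
    if (strcmp(dot, ".png") == 0) return "image/png";
    if (strcmp(dot, ".txt") == 0) return "text/plain";
    return NULL;
}

/* Reliable path: opens the file and checks its leading bytes.
 * Only a PNG check is shown; a real table would cover many formats. */
const char *probe_magic(const char *path)
{
    static const unsigned char png_sig[4] = { 0x89, 'P', 'N', 'G' };
    unsigned char buf[4];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    if (n == sizeof buf && memcmp(buf, png_sig, sizeof buf) == 0)
        return "image/png";
    return NULL;
}

/* The build flag decides which probe is trusted first; the other one
 * is only a fallback when the first yields nothing. */
const char *mime_detect(const char *path)
{
#ifdef MIME_TRUST_EXTENSION
    const char *mime = probe_extension(path);   /* fast, for trusted files */
    return mime != NULL ? mime : probe_magic(path);
#else
    const char *mime = probe_magic(path);       /* reliable default */
    return mime != NULL ? mime : probe_extension(path);
#endif
}
```

A server serving static files of known origin would build with the flag set and rarely touch the file at all, while a general-purpose terminal utility would keep the default.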
