Idea: don't blindly trust the extension over magic detection #2

Open
dumblob opened this issue Dec 8, 2015 · 2 comments


dumblob commented Dec 8, 2015

Right now the extension is tried first, and if a match is found, that result is returned. This skips the actual type recognition against the magic database, so a file with a wrong or misleading extension is reported with the wrong type, which makes detection very unreliable.

I would try to combine both resolution methods (by extension and by magic lookup). Three possibilities come to mind:

  1. be smart: consult the magic DB first, and for MIME types that are hard to tell apart (e.g. C versus C++ source code, or the whole MIME group text/) stick to (or take a hint from) the extension
  2. if the MIME detection and the extension detection don't match, let the user choose which recognition type should be preferred (MIME DB or extension); keep in mind, though, that this is not batch-friendly behavior and is thus strongly discouraged for such a terminal utility
  3. show both results in a fixed order
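
Option 1 above could be sketched roughly as follows. This is only an illustration under stated assumptions: `combine_mime` and `is_text_group` are hypothetical names, not functions of this project, and both guesses are assumed to have been computed already and passed in as strings (NULL when a method found nothing).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when a MIME string belongs to the text/ group. */
static int is_text_group(const char *mime)
{
    return mime != NULL && strncmp(mime, "text/", 5) == 0;
}

/* Hypothetical combiner for option 1.
 * magic_mime: guess from the magic DB, or NULL if no magic check matched.
 * ext_mime:   guess from the file extension, or NULL if unknown. */
const char *combine_mime(const char *magic_mime, const char *ext_mime)
{
    if (magic_mime == NULL)
        return ext_mime;              /* nothing from magic: extension or NULL */
    if (is_text_group(magic_mime) && is_text_group(ext_mime))
        return ext_mime;              /* let the extension refine within text/ */
    return magic_mime;                /* otherwise the content-based guess wins */
}
```

In this sketch the extension can only refine a guess inside text/, so a text file named `photo.png` is still reported as text.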

technosaurus (Owner) commented

This is already on the TODO list as MIMEverify(): ensure that the magic and the extension agree. At the moment, however, many common extensions have no magic check (I just haven't had the time to track down and interpret the information). I would like to support magic checks for most of the popular formats before I implement that function; otherwise it would be pretty useless.
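
A minimal sketch of what such an agreement check might look like, assuming both guesses are already available as strings. The actual MIMEverify() signature is not shown in this thread, so `verify_sketch` and the `verify_result` enum are hypothetical. The third state covers the case described above, where an extension has no magic check yet.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for the agreement check. */
enum verify_result {
    VERIFY_AGREE,     /* both methods produced the same MIME type */
    VERIFY_MISMATCH,  /* both produced a type, but they differ */
    VERIFY_UNKNOWN    /* at least one method had no answer */
};

/* Compare the extension-based and magic-based guesses.
 * Either argument may be NULL when that method found no match. */
enum verify_result verify_sketch(const char *ext_mime, const char *magic_mime)
{
    if (ext_mime == NULL || magic_mime == NULL)
        return VERIFY_UNKNOWN;
    return strcmp(ext_mime, magic_mime) == 0 ? VERIFY_AGREE : VERIFY_MISMATCH;
}
```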

BTW, the existing function works that way on purpose. It's a lot faster to check the extension, since the file doesn't have to be opened, so there is no userspace->kernel->userspace switching. This is an efficient way for servers to provide MIME information for files of known origin. I have seen heavily loaded servers serving static files where, after the files had been pre-compressed, over half of the CPU time was spent in libmagic. Switching to extension-based MIME guessing more than doubled the throughput, to the point where the server was bandwidth limited.
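
To illustrate why the extension path is cheap, here is a rough sketch of a lookup that works on the path string alone and never opens the file. It is not this project's code: `mime_from_extension`, `ext_entry` and the three-entry table are made up for the example, and matching is case-sensitive to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry mapping an extension to a MIME type. */
struct ext_entry { const char *ext; const char *mime; };

/* Tiny illustrative table; a real one would be much larger. */
static const struct ext_entry ext_table[] = {
    { "html", "text/html" },
    { "png",  "image/png" },
    { "gz",   "application/gzip" },
};

/* Guess a MIME type from the path only; no open() or read() is needed. */
const char *mime_from_extension(const char *path)
{
    const char *base = strrchr(path, '/');
    const char *dot;
    size_t i;

    base = base ? base + 1 : path;      /* look only at the last path component */
    dot = strrchr(base, '.');
    if (dot == NULL || dot == base || dot[1] == '\0')
        return NULL;                    /* no extension, dotfile, or trailing dot */
    for (i = 0; i < sizeof ext_table / sizeof ext_table[0]; i++)
        if (strcmp(dot + 1, ext_table[i].ext) == 0)
            return ext_table[i].mime;
    return NULL;                        /* extension not in the table */
}
```

The whole lookup is string handling in user space, which is the saving described above; the price is that it trusts the file name completely.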

dumblob (Author) commented Dec 9, 2015

Of course using magic is way slower. On the other hand, web servers are a very specific use case, so the default behaviour should be reliable rather than quick at any cost. Making it configurable at compile time sounds like a solution.
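
A compile-time switch along these lines might look like the sketch below. The macro `MIME_PREFER_EXTENSION` and the function `pick_mime` are hypothetical names invented for this example, and for brevity the function takes two precomputed guesses; a real fast path would skip the magic lookup entirely rather than compute it and discard it.

```c
#include <stddef.h>

/* Hypothetical switch: build with -DMIME_PREFER_EXTENSION=1 for the fast,
 * server-style behaviour. The default (0) puts reliability first. */
#ifndef MIME_PREFER_EXTENSION
#define MIME_PREFER_EXTENSION 0
#endif

/* Choose between two guesses; either may be NULL when that method failed. */
const char *pick_mime(const char *magic_mime, const char *ext_mime)
{
#if MIME_PREFER_EXTENSION
    return ext_mime != NULL ? ext_mime : magic_mime;   /* extension first */
#else
    return magic_mime != NULL ? magic_mime : ext_mime; /* magic first */
#endif
}
```

Because the choice is made by the preprocessor, the unused branch is not compiled in, so a server build pays nothing for the more careful default.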
