Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accented character in filename prevents payload verification on macOS #140

Open
dmoles opened this issue Dec 4, 2022 · 10 comments
Open

Comments

@dmoles
Copy link

dmoles commented Dec 4, 2022

Environment

  • bagit-java version: v5.1.1
  • Java version: OpenJDK Runtime Environment Zulu17.30+15-CA (build 17.0.1+12-LTS)
  • OS: macOS 12.6.1 Monterey

Details

Given

  • I have a bag containing non-Latin characters
  • And the file is correctly hashed and listed in the manifest

When

  • I verify the bag

Then

  • PayloadVerifier should not raise FileNotInManifestException

Discussion

The underlying issue appears to be that sun.nio.fs.UnixPath compares paths based on their internal byte representation rather than their string representation, and that somehow, the internal byte representation of the Path produced by PayloadVerifier.getAllFilesListedInManifests() differs from that in the manifest.

I created a file with the name contrôle.txt, where as near as I can tell, the ô is represented as the single Unicode codepoint 0xf4. Certainly that's what it is in the manifest, where if I open the file in a hex editor I can clearly see that it's correctly represented as the two-byte UTF-8 sequence 0xc3 0xb4.

However, when getAllFilesListedInManifests() creates a Path object, it somehow ends up representing it as a plain o (0x6f) plus a combining circumflex accent (codepoint x0302), so the internal byte representation has the UTF-8 bytes 0x6f 0xcc 0x82, the comparison fails, and the file appears not to be in the manifest.

I'm not sure at just what point the conversion from ô to o-plus-combining ^ happens -- probably some kind of Unicode normalization that one is going through and the other isn't

However, oddly, when I call toString(), the string representations of the paths are equal. So one workaround would be to convert the Path objects to Strings before making the comparison.

@dmoles
Copy link
Author

dmoles commented Dec 4, 2022

I've added a test here, but TBH I'm having a lot of trouble getting this code to build with modern versions of Java — I had to update to a later version of Gradle 4 and comment out pretty much all the code quality plugins and tasks — so I'm reluctant to try to put together a full pull request.

@jscancella
Copy link
Contributor

@dmoles have you tried using https://github.com/jscancella/bagging ? I created it because it seems like LoC is no longer maintaining this repo.

@dmoles
Copy link
Author

dmoles commented Dec 4, 2022

@jscancella Thanks for that! Unfortunately I'm not using bagit-java directly but via the Fedora Import/Export Utility.

Trying your library, it looks like it has the same issue. I'll see if I can create a PR with a test case for you.

@dmoles
Copy link
Author

dmoles commented Dec 4, 2022

It looks like this may be a long-standing Java / macOS issue to do with how HFS+ does Unicode normalization.

If toString() conversion seems too risky, an alternative would be to compare the paths with Files.isSameFile(Path, Path), which does seem to admit they're the same.

@dmoles
Copy link
Author

dmoles commented Jan 10, 2023

FWIW, I now think the correct fix is to normalize the paths before comparison, probably via canonical decomposition (java.text.Normalizer.Form.NFD).

@acdha
Copy link
Member

acdha commented Jan 10, 2023

FWIW, I now think the correct fix is to normalize the paths before comparison, probably via canonical decomposition (java.text.Normalizer.Form.NFD).

Normalization is definitely the way to handle comparisons but testing is a challenge for BagIt since the character names are stored in a manifest file with a known encoding process but are also subject to what tools and filesystems do for normalization (e.g. Git's core.precomposeunicode setting).

@dmoles
Copy link
Author

dmoles commented Jan 10, 2023

@acdha Hmm, good catch. Providing test data as zip or tar archives might get around that, but I wouldn't be surprised if different tools/libraries for handling those have different behavior as well.

@acdha
Copy link
Member

acdha commented Jan 10, 2023

I'm almost certain that instinct is correct (IIRC some ZIP implementations normalize) so I'm wondering whether this needs to be created on checkout or if we want to have the Java codebase generate those names (possibly mocked?) in its test suite.

@dmoles
Copy link
Author

dmoles commented Jan 10, 2023

@acdha This might be a case for just (or additionally) pulling out the comparison function and unit-testing that.

@whikloj
Copy link

whikloj commented Feb 16, 2023

Just wondering if this is an issue and also more generally following on to this comment whether this library is still being maintained. It appears that the CircleCi builds use Java 8 so I'm wondering if there was a plan to do testing against Java 11 and 17? If this does not work against either, are there cycles to get that work completed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants