Read lines with default encoding from os #12942

breedx-splk · 2024-12-20T22:19:26Z

Repeat of #12419 which fixes #12418

@laurit had previously asked if a file could be provided to reproduce this. Unfortunately, it's problematic because it depends on the os encoding. The original problem was from z/OS.....which I have no reasonable way of reproducing with tests etc. It's odd that z/os wouldn't support or use UTF-8, but here we are.

This change should at least handle those edge cases.

laurit · 2024-12-21T11:51:30Z

To me using Charset.defaultCharset() doesn't seem like the best option. Since jdk18 the default charset is utf-8 unless user explicitly configures to use a different encoding, see https://openjdk.org/jeps/400. On older jdks users could use -Dfile.encoding=utf-8 even if the system encoding is something different.
Looking at https://github.com/moby/sys/blob/mountinfo/v0.7.2/mountinfo/mountinfo_linux.go#L23 I think they are also reading it as utf-8, could be wrong, don't really know any go. Without a sample that doesn't decode as with utf-8 it is hard to guess why it would fail. Could be that z/os does something weird. Also could be that the assumption of this file containing utf-8 is wrong and only works because everybody uses utf-8. Weren't paths just blobs from the linux kernel perspective? Instead of Charset.defaultCharset() I'd try StandardCharsets.ISO_8859_1. From what I understand we don't really care about the paths or whatever that could be in utf-8, container id is a hex string and the other symbols we use for parsing are all present in iso-8859-1.

breedx-splk · 2025-01-06T23:48:58Z

Since jdk18 the default charset is utf-8 unless user explicitly configures to use a different encoding, see https://openjdk.org/jeps/400.

Right, this PR is exactly a workaround for that specific case on older platforms (like z/os apparently) that don't use utf8 as a default. In other words, when we call the single-argument Files.lines() method, we get the jvm sensible default, which is utf8.

Instead of Charset.defaultCharset() I'd try StandardCharsets.ISO_8859_1.

🤯 You're blowing my mind with this recommendation. Just no. Maybe you have confidence in this, but I sure don't.

laurit · 2025-01-07T08:53:48Z

Since jdk18 the default charset is utf-8 unless user explicitly configures to use a different encoding, see https://openjdk.org/jeps/400.

Right, this PR is exactly a workaround for that specific case on older platforms (like z/os apparently) that don't use utf8 as a default. In other words, when we call the single-argument Files.lines() method, we get the jvm sensible default, which is utf8.

Let me rephrase, your proposed fix does not work on jdk18 and newer and might not work on older jdks if user configures utf-8 as the default encoding for jvm.

Instead of Charset.defaultCharset() I'd try StandardCharsets.ISO_8859_1.

🤯 You're blowing my mind with this recommendation. Just no. Maybe you have confidence in this, but I sure don't.

An alternative would be to read the file as bytes. If it decodes as utf-8 then all is fine, if it doesn't try another encoding. Considering that we only care about the 7-bit chars that are the same in utf-8 and iso-8859-1 imo just using iso-8859-1 to read the file would be simpler.

breedx-splk · 2025-01-07T20:29:55Z

your proposed fix does not work on jdk18 and newer and might not work on older jdks if user configures utf-8 as the default encoding for jvm

Can you help me to understand why you think that this won't work?

In the case of newer JVMs, as you pointed out, the default is already utf8, so Files.lines(path, Charset.defaultCharset()) should be effectively the same as Files.lines(path, Charset.forName("UTF-8") right? And this is effectively the same as Files.lines(path) which is what the code was before. This should still be fine, right?

This PR isn't trying to work-around the case where the user misconfigures the default encoding for the jvm (to mismatch the os, for example). It only works around the case where the jvm default is not utf-8 because the system default is not utf-8.

Considering that we only care about the 7-bit chars that are the same in utf-8 and iso-8859-1 [snip]

Not only am I not sure that we only care about the chars that are the same, I have not done the charset comparison to verify that the even ARE the same.

laurit · 2025-01-08T11:26:14Z

Can you help me to understand why you think that this won't work?

In the case of newer JVMs, as you pointed out, the default is already utf8, so Files.lines(path, Charset.defaultCharset()) should be effectively the same as Files.lines(path, Charset.forName("UTF-8") right? And this is effectively the same as Files.lines(path) which is what the code was before. This should still be fine, right?

This is all correct, the only thing you are forgetting is that reading the file as utf-8 is what is causing the issue in the first place. Using Files.lines(path, Charset.defaultCharset()) could fix the issue only if Charset.defaultCharset() does not return utf-8.

Considering that we only care about the 7-bit chars that are the same in utf-8 and iso-8859-1 [snip]

Not only am I not sure that we only care about the chars that are the same, I have not done the charset comparison to verify that the even ARE the same.

If you look at https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/resources/library/src/main/java/io/opentelemetry/instrumentation/resources/CgroupV1ContainerIdExtractor.java and https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/resources/library/src/main/java/io/opentelemetry/instrumentation/resources/CgroupV2ContainerIdExtractor.java you'll see that we only search for simple characters like /, : and return a hex string.
The 7bit character set is called ASCII. Both iso-8859-1 and utf-8 extend ascii, check the wikipedia links.

breedx-splk · 2025-01-08T23:53:01Z

Using Files.lines(path, Charset.defaultCharset()) could fix the issue only if Charset.defaultCharset() does not return utf-8.

Right, this is literally what is stated in the description of #12418 and what this PR addresses.

I've added a test, maybe that helps to add some clarity.

laurit · 2025-01-13T13:54:04Z

Using Files.lines(path, Charset.defaultCharset()) could fix the issue only if Charset.defaultCharset() does not return utf-8.

Right, this is literally what is stated in the description of #12418 and what this PR addresses.

So you are fine with this fix ceasing to work on jdk18+ where Charset.defaultCharset() returns utf-8?

I've added a test, maybe that helps to add some clarity.

With this test are you hinting that since the original issue is on z/os the encoding is going to be an EBCDIC encoding that is not compatible with ascii? It is a shame that we couldn't get a sample for the failing file from the original reporter.
I'm not super convinced about the usefulness of the test, but we can keep it if you think it is useful. I'd prefer a comment that explains that Charset.defaultCharset() is used to work around an encoding issue on z/os and perhaps link to the original issue. I'd guess that a proper fix for this would involve getting the encoding for z/os with https://www.ibm.com/docs/api/v1/content/SSYKE2_8.0.0/com.ibm.java.zsecurity.api.80.doc/com.ibm.jzos/com/ibm/jzos/ZUtil.html#getDefaultPlatformEncoding-- or some other means, but this clearly can be implemented only by someone who knows z/os so the best we can do is use Charset.defaultCharset().

breedx-splk · 2025-01-13T18:11:09Z

So you are fine with this fix ceasing to work on jdk18+ where Charset.defaultCharset() returns utf-8?

It seemed to be good enough for the original author. 🤷🏻

With this test are you hinting that since the original issue is on z/os the encoding is going to be an EBCDIC encoding that is not compatible with ascii?

That's right. But without access to z/or or input from the original author, we can't really be sure.

used to work around an encoding issue on z/os ...[snip]

Right, it's an unusual and strange use case for sure....especially when you consider that the "file" in question is a linux/docker cgroups virtual/fake file...which somehow suggests that the semeru jvm is running in z/os within docker? Or the original author is just hacking things to make the resource detector believe that it's in linux when it is not. I dunno.

In any case, I was only hoping to mop up a prior abandoned PR with a two-line change, and this has absolutely devolved into something consuming way way way way way (way) much more time and energy than I had ever intended.

laurit · 2025-01-14T12:48:37Z

Right, it's an unusual and strange use case for sure....especially when you consider that the "file" in question is a linux/docker cgroups virtual/fake file...which somehow suggests that the semeru jvm is running in z/os within docker? Or the original author is just hacking things to make the resource detector believe that it's in linux when it is not. I dunno.

https://www.ibm.com/docs/en/zos/3.1.0?topic=system-process-associated-files apparently z/os can have a file named /proc/self/mountinfo. Perhaps the best option would be to detect that it is running on z/os with https://github.com/OpenLiberty/open-liberty/blob/2cceab1e16ee45e8fbeb7a8b8d8e6b78bb3bc6d4/dev/com.ibm.ws.crypto.ltpakeyutil/src/com/ibm/ws/crypto/ltpakeyutil/LTPAKeyUtil.java#L180 and just disable the resource detection.

SoftlySplinter · 2025-01-21T11:31:59Z

To add some context as the original author - I was just trying to run the Java Agent of a z/OS Java application. It immediately failed and, after a little digging, found what I thought would be an easy fix (and also got stuck in corparate limbo trying to get authorization for CLA).

I'm happy to try the suggestion from @laurit, as that would probably be a better fix.

breedx-splk requested a review from a team as a code owner December 20, 2024 22:19

trask approved these changes Dec 20, 2024

View reviewed changes

jaydeluca approved these changes Dec 21, 2024

View reviewed changes

breedx-splk added 2 commits January 8, 2025 15:22

read lines with default encoding from os

d45c897

add test to cover alternate encodings

7e2fcaa

breedx-splk force-pushed the use_default_charset branch from 1789deb to 7e2fcaa Compare January 8, 2025 23:50

breedx-splk closed this Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read lines with default encoding from os #12942

Read lines with default encoding from os #12942

breedx-splk commented Dec 20, 2024

laurit commented Dec 21, 2024

breedx-splk commented Jan 6, 2025

laurit commented Jan 7, 2025

breedx-splk commented Jan 7, 2025

laurit commented Jan 8, 2025

breedx-splk commented Jan 8, 2025

laurit commented Jan 13, 2025

breedx-splk commented Jan 13, 2025

laurit commented Jan 14, 2025

SoftlySplinter commented Jan 21, 2025

Read lines with default encoding from os #12942

Read lines with default encoding from os #12942

Conversation

breedx-splk commented Dec 20, 2024

laurit commented Dec 21, 2024

breedx-splk commented Jan 6, 2025

laurit commented Jan 7, 2025

breedx-splk commented Jan 7, 2025

laurit commented Jan 8, 2025

breedx-splk commented Jan 8, 2025

laurit commented Jan 13, 2025

breedx-splk commented Jan 13, 2025

laurit commented Jan 14, 2025

SoftlySplinter commented Jan 21, 2025