-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAK file reading: don't assume UTF-8 encoding #718
DAK file reading: don't assume UTF-8 encoding #718
Conversation
When extracting strings from a DAK (.stp or .pat) format file, we shouldn't assume that these strings are encoded using UTF-8. In particular, it appears localized color names are not encoded using UTF-8, and trying to decode them as UTF-8 causes pattern loading to fail. Another place where strings are read from the files is as part of the .stp decryption process. In that case, it seems safer to not try to interpret those byte sequences at all. This commit changes the strings to Python bytestrings, so we don't need to guess at the encoding that DAK actually uses for color names, and we can be sure the bytes in the encryption keys are unmodified.
WalkthroughThe changes in the pull request primarily involve updates to the Changes
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
src/main/python/main/ayab/pattern_import.py (3)
54-54
: LGTM! Consistent byte string handling in the Color class.The changes to the
Color
class constructor andstring
method are consistent with the PR objective. Usingbytes
for thename
parameter and the!r
format specifier ensures proper handling of potentially non-UTF-8 encoded color names.For improved clarity, consider adding a comment explaining the use of
!r
for thename
attribute:def string(self) -> str: return ( f"{hex(self.code)}, {str(self.n)}, '{chr(cast(int, self.symbol))}'," + f" {self.name!r}, {hex(int.from_bytes(self.rgb, 'big'))}" # !r ensures proper representation of bytes )This comment would help future maintainers understand the reasoning behind using
!r
.Also applies to: 78-78
462-470
: LGTM! Consistent byte string handling in key calculation.The changes in the
__calc_key
method to use byte strings are consistent with the PR objective. This ensures that all string manipulations are performed on raw bytes, preventing any potential encoding issues.Consider optimizing the repeated
str(...).encode()
calls by using byte string literals or f-strings with byte conversion:keystring = b'%d' % getWordAt(data, 0x3D) # or keystring = f'{getWordAt(data, 0x3D)}'.encode()This approach might be more efficient and cleaner, especially for the repeated conversions of numeric values to byte strings.
Also applies to: 488-497
566-596
: Improve f-string usage in error messages.The use of f-strings for error messages is a good improvement for readability. However, there's a minor issue with the f-string usage. When the entire string is an f-string, you don't need to use
str()
inside it. Consider updating the error messages as follows:self.exit(f".cut file misspecified at row {row}", -5) # FIXME translate self.exit(f"row {row} is too long", -5) # FIXME translateThis change makes the code slightly cleaner and more idiomatic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- src/main/python/main/ayab/pattern_import.py (9 hunks)
🧰 Additional context used
🪛 Ruff
src/main/python/main/ayab/pattern_import.py
386-386: Yoda condition detected
Rewrite as
i == 0
(SIM300)
🔇 Additional comments (4)
src/main/python/main/ayab/pattern_import.py (4)
35-37
: LGTM! Improved handling of potentially non-UTF-8 encoded data.The change from returning
str
tobytes
aligns well with the PR objective. By avoiding implicit decoding, this modification prevents potential issues with non-UTF-8 encoded data in DAK files. This is a good step towards more robust file handling.
386-394
: LGTM! Consistent use of byte strings for color names.The change to use
b""
instead of an empty string for thename
parameter in theColor
constructor call is consistent with the PR objective. This ensures that color names are always stored as byte strings, preventing any potential encoding issues.🧰 Tools
🪛 Ruff
386-386: Yoda condition detected
Rewrite as
i == 0
(SIM300)
Line range hint
1-596
: Overall, excellent changes addressing the UTF-8 encoding assumption issue.The modifications throughout this file consistently move towards using byte strings instead of Unicode strings, which effectively addresses the PR objective of not assuming UTF-8 encoding. These changes should resolve the issues with loading patterns that have non-UTF-8 encoded color names.
Key improvements:
getStringAt
function now returns bytes, avoiding implicit decoding.Color
class now handles color names as byte strings.- Key calculation in
StpPatternConverter
uses byte string operations.- Consistent use of byte strings in various methods across different converter classes.
These changes significantly improve the robustness of the DAK file handling, especially for files with localized color names. The code is now better equipped to handle various encodings without making assumptions about UTF-8.
To further improve the code:
- Consider adding comments explaining the byte string usage where it might not be immediately obvious.
- Look for opportunities to optimize repeated string-to-bytes conversions, especially in the
__calc_key
method.- Ensure that any code that consumes the color names is updated to handle byte strings appropriately.
Great work on addressing this important issue!
450-450
: LGTM! Consistent use of byte strings in __appendKeystring method.The change in the method signature from
str
tobytes
is consistent with the PR objective of handling non-UTF-8 encoded data. This modification ensures that the key string manipulation is performed on raw bytes rather than decoded strings.To ensure the implementation is correct, please run the following script to check the
__appendKeystring
method:✅ Verification successful
Verification Successful:
__appendKeystring
Method Implementation ConfirmedThe
__appendKeystring
method is correctly implemented to handle byte strings, ensuring that key string manipulation is performed on raw bytes as intended.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of __appendKeystring method # Test: Search for the __appendKeystring method implementation ast-grep --lang python --pattern $'def __appendKeystring(next_string: bytes, max_size: int) -> bytes: $$$ 'Length of output: 338
I tested using one of my files and running a simulation and it is working. |
Thanks for your feedback. |
Problem
Fixes #716.
Proposed solution
When extracting strings from a DAK (.stp or .pat) format file, we shouldn't assume that these strings are encoded using UTF-8.
In particular, it appears localized color names are not encoded using UTF-8, and trying to decode them as UTF-8 causes pattern loading to fail.
Another place where strings are read from the files is as part of the .stp decryption process. In that case, it seems safer to not try to interpret those byte sequences at all.
This PR changes the strings to Python bytestrings, so we don't need to guess at the encoding that DAK actually uses for color names, and we can be sure the bytes in the encryption keys are unmodified.
How to test
Try to load the file attached in #716.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor