Encode characters in Latin-1 to avoid (de)serialization failure (#37)
h1994st committed Apr 3, 2022
1 parent c34493d commit ff4e5a2
Showing 3 changed files with 10 additions and 3 deletions.
4 changes: 3 additions & 1 deletion grammars/README.md
@@ -27,4 +27,6 @@ An example, `test_hex.json`, is included in this directory:
}
```

-Note that, this workaround only works for ASCII characters (i.e., `\u0000` \~ `\u007f`). Otherwise, the special characters will be converted into more than one byte in the UTF-8 encoding. A wrong grammar file, `wrong_hex.json`, is included in this directory as well. Please refer to [post1](https://www.utf8-chartable.de/) and [post2](https://stackoverflow.com/a/59624562) for more details.
+Note that, this workaround only works for ASCII characters (i.e., `\u0000` \~ `\u00ff`). Otherwise, the grammar file cannot be processed.
+
+References: [post1](https://www.utf8-chartable.de/), [post2](https://stackoverflow.com/a/59624562), [post3](https://stackoverflow.com/a/66601996), and [post4](https://stackoverflow.com/questions/66601743/python3-str-to-bytes-convertation-problem).
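
The byte counts behind this note can be checked directly. The following is a minimal Python sketch, not part of the commit; `byte_lengths` is a hypothetical helper introduced only for illustration.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 length, Latin-1 length) in bytes for one character.

    Hypothetical helper for illustration; it does not exist in the repository.
    """
    return len(ch.encode("utf-8")), len(ch.encode("latin-1"))


# Characters up to U+007F take one byte in both encodings.
# Characters from U+0080 to U+00FF take two bytes in UTF-8 but one in Latin-1.
for ch in ["\u0041", "\u007f", "\u0080", "\u00ff"]:
    utf8_len, latin1_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: utf-8={utf8_len}, latin-1={latin1_len}")

# Code points above U+00FF have no Latin-1 representation, so encoding fails.
try:
    "\u0100".encode("latin-1")
except UnicodeEncodeError as exc:
    print(f"U+0100 cannot be encoded as Latin-1: {exc.reason}")
```

This suggests that the old limit of `\u007f` came from UTF-8 expanding higher characters to two bytes, and that the new limit of `\u00ff` is the top of the Latin-1 range.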
9 changes: 7 additions & 2 deletions grammars/f1_c_gen.py
@@ -82,9 +82,14 @@ def to_bytes(self):
         val_len = len(self.val)
         ret += val_len.to_bytes(4, byteorder='little', signed=False)
         # val
-        val_bytes = bytes(self.val, 'utf-8')
+        # Latin-1 is an 8-bit character set. The first 128 characters of its
+        # set are identical to the US ASCII standard. By encoding the string as
+        # Latin-1, we can handle all hex characters from \u0000 to \u00ff
+        # Refs:
+        # - https://stackoverflow.com/questions/66601743/python3-str-to-bytes-convertation-problem
+        # - https://kb.iu.edu/d/aepu
+        val_bytes = bytes(self.val, 'latin-1')
         if val_len != len(val_bytes):
-            # NOTE: we only support ASCII characters (i.e., single-byte characters)
             print(f'The length of `val` should be {val_len}, but found {len(val_bytes)}.')
             print(f'`val` bytes in UTF-8 encoding: {val_bytes}')
             print('Please check your grammar file!')
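
The effect of the change on (de)serialization can be sketched in isolation. This is a minimal example that assumes a record made of only a 4-byte little-endian length followed by the value; the real `to_bytes` may write other fields, and `serialize_val` and `deserialize_val` are hypothetical names that are not in the repository.

```python
def serialize_val(val: str) -> bytes:
    """Hypothetical helper: 4-byte little-endian length, then Latin-1 payload."""
    # Latin-1 maps each code point U+0000..U+00FF to exactly one byte, so the
    # character count equals the payload size. Anything above U+00FF raises
    # UnicodeEncodeError instead of silently producing extra bytes.
    payload = val.encode("latin-1")
    return len(val).to_bytes(4, byteorder="little", signed=False) + payload


def deserialize_val(buf: bytes) -> str:
    """Hypothetical inverse of serialize_val."""
    length = int.from_bytes(buf[:4], byteorder="little", signed=False)
    return buf[4:4 + length].decode("latin-1")


original = "\u0000\u007f\u0080\u00ff"
blob = serialize_val(original)

# The length prefix and the payload size agree, so the round trip is lossless.
assert len(blob) == 4 + len(original)
assert deserialize_val(blob) == original

# With UTF-8 the payload would be longer than the recorded length.
assert len(original.encode("utf-8")) == 6
```

Under these assumptions, a reader that trusts the length prefix recovers the exact string with Latin-1, whereas with UTF-8 the prefix (4) would not match the payload size (6).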
File renamed without changes.

1 comment on commit ff4e5a2

@ilyaa-c2a commented on ff4e5a2 Feb 27, 2023

Dumping extended ASCII to the C file still results in Unicode characters instead of the original 128+ ASCII bytes.
One should use write(fuzz_src.encode('latin-1')).
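
A minimal sketch of what this comment describes, assuming `fuzz_src` is a Python string holding the generated C source. The file names and the sample string here are made up for illustration, and how the repository actually opens its output file is not shown in this commit.

```python
import os
import tempfile

# Hypothetical generated C source containing two characters above U+007F.
fuzz_src = 'const char tok[] = "\u00e9\u00ff";\n'

with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.c")
    latin1_path = os.path.join(tmp, "latin1.c")

    # Text mode with UTF-8: each character above U+007F becomes two bytes.
    # newline="" disables newline translation so sizes match on every platform.
    with open(utf8_path, "w", encoding="utf-8", newline="") as f:
        f.write(fuzz_src)

    # Binary mode with explicit Latin-1 bytes, as the comment suggests:
    # every character is written as exactly one byte.
    with open(latin1_path, "wb") as f:
        f.write(fuzz_src.encode("latin-1"))

    utf8_size = os.path.getsize(utf8_path)
    latin1_size = os.path.getsize(latin1_path)

print(f"utf-8 file: {utf8_size} bytes, latin-1 file: {latin1_size} bytes")
```

Note that `write` accepts `bytes` only when the file is opened in binary mode (`"wb"`); opening in text mode with `encoding="latin-1"` is an equivalent alternative.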