diff --git a/packages/README.md b/packages/README.md
index d858f42c..c2429315 100644
--- a/packages/README.md
+++ b/packages/README.md
@@ -60,40 +60,64 @@ The release has a SHA2-256 checksum and size as shown:
 
     SHA-256 d116d9e85c77826c1fd3ff4d18c56c311a6295c8247f83686cd7e8805963220f 28953 newgid-1.10.tgz
 
-### `utf8-unicode`
-
-The code for `utf8-unicode` which processes named files or standard
-input as UTF-8 and prints the sequence of bytes that make up a single
-character and the Unicode character it corresponds to.
-
-For example:
-
-    $ echo "grep –i where | wc -l" | utf8-unicode
-    0x67 = U+0067
-    0x72 = U+0072
-    0x65 = U+0065
-    0x70 = U+0070
-    0x20 = U+0020
-    0xE2 0x80 0x93 = U+2013
-    0x69 = U+0069
-    0x20 = U+0020
-    0x77 = U+0077
-    0x68 = U+0068
-    0x65 = U+0065
-    0x72 = U+0072
-    0x65 = U+0065
-    0x20 = U+0020
-    0x7C = U+007C
-    0x20 = U+0020
-    0x77 = U+0077
-    0x63 = U+0063
-    0x20 = U+0020
-    0x2D = U+002D
-    0x6C = U+006C
+### `strspan`
+
+The code for library functions `str_span()` and `str_cspan()`, which are
+related to, but different from, `strspn()` and `strcspn()`.
+These functions are designed to be used repeatedly with the same set of
+search characters.
+They precompute a table of which characters are to be matched.
+One of the functions `set_span()` or `set_ranges()` is used to
+initialize the precomputed data.
+They readily out-perform `strspn()` and `strcspn()` on moderate-size
+searches.
+
+The distribution includes a timing program `test2.strspan` which can be
+run with a number of files.
+It measures the time to read the files, processing them with
+`strlen()` and `strchr()` to warm up the cache and I/O buffers.
+It then runs `str_span()` and `str_cspan()` on the same files, and then
+`strspn()` and `strcspn()`.
+It can be effective to name the same file multiple times on the command
+line.
+
+Example use:
+
+    $ test2.strspan bible-be.txt bible-be.txt bible-be.txt
+    # NB: The tests for str_span and strspn are comparable
+    # The tests for strlen and strchr are not comparable
+    strlen 0.187297 (4467663) bible-be.txt
+    strlen 0.186324 (4467663) bible-be.txt
+    strlen 0.187616 (4467663) bible-be.txt
+    strchr 0.182676 (4467663) bible-be.txt
+    strchr 0.185405 (4467663) bible-be.txt
+    strchr 0.184813 (4467663) bible-be.txt
+    str_span 0.195715 (4467663) bible-be.txt
+    str_span 0.199516 (4467663) bible-be.txt
+    str_span 0.194588 (4467663) bible-be.txt
+    strspn 0.347890 (4467663) bible-be.txt
+    strspn 0.346028 (4467663) bible-be.txt
+    strspn 0.347305 (4467663) bible-be.txt
+    $ test2.strspan great.panjandrum great.panjandrum great.panjandrum
+    # NB: The tests for str_span and strspn are comparable
+    # The tests for strlen and strchr are not comparable
+    strlen 0.000046 (487) great.panjandrum
+    strlen 0.000031 (487) great.panjandrum
+    strlen 0.000030 (487) great.panjandrum
+    strchr 0.000036 (487) great.panjandrum
+    strchr 0.000030 (487) great.panjandrum
+    strchr 0.000030 (487) great.panjandrum
+    str_span 0.000035 (487) great.panjandrum
+    str_span 0.000032 (487) great.panjandrum
+    str_span 0.000031 (487) great.panjandrum
+    strspn 0.000061 (487) great.panjandrum
+    strspn 0.000052 (487) great.panjandrum
+    strspn 0.000053 (487) great.panjandrum
     $
 
-The Unicode EN DASH U+2013 was why that `grep` command was failing with
-an error about being unable to find the file `where`.
+These results show that `str_span()` and `str_cspan()` are marginally
+slower than using `strlen()` or `strchr()`, but considerably quicker than
+using `strspn()` and `strcspn()`.
 
 ### `timecmd`
 
@@ -115,7 +139,7 @@ The code for `timecmd` which measures elapsed time of commands specified as part
 
 Example uses:
 
-    $ timecmd -m sleep 65
+    $ timecmd -m sleep 65
     2020-03-01 08:42:58.079 [PID 16916] sleep 65
     2020-03-01 08:44:03.086 [PID 16916; status 0x0000] - 1m 5.007s
     $ timecmd -b -m sleep 65
@@ -135,5 +159,40 @@ Example uses:
 The sample commands all produced no output.
 It works fine with commands that do.
 
+### `utf8-unicode`
+
+The code for `utf8-unicode` which processes named files or standard
+input as UTF-8 and prints the sequence of bytes that make up a single
+character and the Unicode character it corresponds to.
+
+For example:
+
+    $ echo "grep –i where | wc -l" | utf8-unicode
+    0x67 = U+0067
+    0x72 = U+0072
+    0x65 = U+0065
+    0x70 = U+0070
+    0x20 = U+0020
+    0xE2 0x80 0x93 = U+2013
+    0x69 = U+0069
+    0x20 = U+0020
+    0x77 = U+0077
+    0x68 = U+0068
+    0x65 = U+0065
+    0x72 = U+0072
+    0x65 = U+0065
+    0x20 = U+0020
+    0x7C = U+007C
+    0x20 = U+0020
+    0x77 = U+0077
+    0x63 = U+0063
+    0x20 = U+0020
+    0x2D = U+002D
+    0x6C = U+006C
+    $
+
+The Unicode EN DASH U+2013 was why that `grep` command was failing with
+an error about being unable to find the file `where`.
+
 Jonathan Leffler (jonathan.leffler@gmail.com)
-Sunday 1st March 2020
+Wednesday 18th March 2020
diff --git a/packages/strspan-1.03.tgz b/packages/strspan-1.03.tgz
new file mode 100644
index 00000000..3331a79b
Binary files /dev/null and b/packages/strspan-1.03.tgz differ
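
Note on the `strspan` text added by this change: it says the functions precompute a table of which characters are to be matched, so that repeated searches with the same set are cheap. The sketch below illustrates that general idea only. The names `span_table`, `table_init()`, `table_span()` and `table_cspan()` are hypothetical and are not the package's `set_span()` / `str_span()` interface, whose signatures the README does not show.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type (not from the strspan package):
 * one membership flag per possible unsigned char value. */
typedef struct {
    unsigned char member[UCHAR_MAX + 1];
} span_table;

/* Build the table once from a NUL-terminated set of characters. */
static void table_init(span_table *t, const char *set)
{
    memset(t->member, 0, sizeof t->member);
    for (const unsigned char *p = (const unsigned char *)set; *p != '\0'; p++)
        t->member[*p] = 1;
}

/* Length of the initial segment of s made up only of characters in the set.
 * Each character costs one table lookup, however large the set is. */
static size_t table_span(const span_table *t, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p != '\0' && t->member[*p])
        p++;
    return (size_t)(p - (const unsigned char *)s);
}

/* Length of the initial segment of s containing no characters in the set. */
static size_t table_cspan(const span_table *t, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p != '\0' && !t->member[*p])
        p++;
    return (size_t)(p - (const unsigned char *)s);
}
```

A caller would run `table_init(&t, "0123456789")` once and then call `table_span(&t, line)` for each input line, so the cost of analysing the set is paid a single time instead of on every call.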