Put all IPA characters into NFD #3

Open
kalvinchang opened this issue Nov 30, 2022 · 2 comments

Comments

@kalvinchang
Member

Discussion from Slack - quoting Young Min

(space) 32 0x20
- 45 0x2d
/ 47 0x2f
3 51 0x33
a 97 0x61
b 98 0x62
c 99 0x63
d 100 0x64
e 101 0x65
f 102 0x66
g 103 0x67
h 104 0x68
i 105 0x69
j 106 0x6a
k 107 0x6b
l 108 0x6c
m 109 0x6d
n 110 0x6e
o 111 0x6f
p 112 0x70
s 115 0x73
t 116 0x74
u 117 0x75
v 118 0x76
w 119 0x77
x 120 0x78
y 121 0x79
z 122 0x7a
² 178 0xb2
³ 179 0xb3
¹ 185 0xb9
ã 227 0xe3
ä 228 0xe4
æ 230 0xe6
õ 245 0xf5
ø 248 0xf8
ĩ 297 0x129
ŋ 331 0x14b
œ 339 0x153
ũ 361 0x169
ɐ 592 0x250
ɑ 593 0x251
ɒ 594 0x252
ɔ 596 0x254
ɕ 597 0x255
ɖ 598 0x256
ɘ 600 0x258
ə 601 0x259
ɛ 603 0x25b
ɜ 604 0x25c
ɡ 609 0x261
ɣ 611 0x263
ɤ 612 0x264
ɦ 614 0x266
ɨ 616 0x268
ɪ 618 0x26a
ɯ 623 0x26f
ɲ 626 0x272
ɳ 627 0x273
ɵ 629 0x275
ɸ 632 0x278
ɻ 635 0x27b
ʂ 642 0x282
ʈ 648 0x288
ʊ 650 0x28a
ʏ 655 0x28f
ʐ 656 0x290
ʑ 657 0x291
ʔ 660 0x294
ʰ 688 0x2b0
ʷ 695 0x2b7
ː 720 0x2d0
˥ 741 0x2e5
˦ 742 0x2e6
˧ 743 0x2e7
˨ 744 0x2e8
˩ 745 0x2e9
 ̂ 770 0x302
 ̃ 771 0x303
 ̊ 778 0x30a
 ̞ 798 0x31e
 ̠ 800 0x320
 ̥ 805 0x325
 ̩ 809 0x329
 ̯ 815 0x32f
 ̱ 817 0x331
 ͘ 856 0x358
 ͡ 865 0x361
ẽ 7869 0x1ebd
ỹ 7929 0x1ef9
⁴ 8308 0x2074
ⁿ 8319 0x207f
㣇 14535 0x38c7
䴉 19721 0x4d09

this is the list of all the characters that appear in the wikihan-ipa.tsv file in the wikihan repo (each with its decimal and hex code point)
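A minimal sketch of how an inventory like this could be produced, for anyone who wants to regenerate it. The function name `char_inventory` is hypothetical, and dropping tabs and newlines is an assumption (the list above includes the space but no control characters); this is not necessarily how the original list was made.

```python
def char_inventory(text: str) -> list[tuple[str, int, str]]:
    """Return (character, decimal code point, hex code point), sorted by code point."""
    # Assumption: tabs and line breaks are TSV structure, not content, so skip them.
    chars = set(text) - {"\t", "\n", "\r"}
    return [(ch, ord(ch), hex(ord(ch))) for ch in sorted(chars)]


# Tiny in-memory sample standing in for the contents of the TSV file.
sample = "k\u02b0\u00e3\t\u014bg\n"
for ch, dec, hx in char_inventory(sample):
    print(ch, dec, hx)
```

To run it on the real data, read the file with `encoding="utf-8"` and pass its contents in place of `sample`.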

at first glance,

g 103 0x67
ã 227 0xe3
ä 228 0xe4
õ 245 0xf5
ĩ 297 0x129
ũ 361 0x169
ẽ 7869 0x1ebd
ỹ 7929 0x1ef9
㣇 14535 0x38c7
䴉 19721 0x4d09

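these are the characters that look suspicious: ASCII g (0x67) where IPA ɡ (0x261) is expected, precomposed accented letters, and two CJK ideographs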
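Since the issue asks for NFD, here is a small sketch of what NFD normalization would do to each of the flagged characters, using Python's standard `unicodedata` module. The helper name `nfd_code_points` is made up for illustration. The precomposed letters split into a base letter plus a combining mark (for example U+0303, the combining tilde that already appears in the inventory), while ASCII g and the two CJK ideographs come back unchanged, so NFD alone does not resolve those.

```python
import unicodedata


def nfd_code_points(ch: str) -> str:
    """Return the code points of ch after NFD normalization, e.g. 'U+0061 U+0303'."""
    return " ".join(f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch))


# The flagged characters, written as escapes so the precomposed forms are unambiguous.
flagged = ["g", "\u00e3", "\u00e4", "\u00f5", "\u0129", "\u0169",
           "\u1ebd", "\u1ef9", "\u38c7", "\u4d09"]
for ch in flagged:
    print(f"U+{ord(ch):04X} -> {nfd_code_points(ch)}")
```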

I think it might be cleaner if we checked every character in the file, looked it up in a list of all the valid IPA code points as per https://en.wikipedia.org/wiki/Phonetic_symbols_in_Unicode#endnote_centralized_tag, and handled the ones that are not in that list
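The check proposed above could look roughly like this. It is only a sketch: `find_invalid_chars` is a hypothetical helper, `VALID_SAMPLE` is a tiny illustrative subset rather than the real list of IPA code points, and normalizing to NFD before the lookup is an assumption based on the issue title.

```python
import unicodedata


def find_invalid_chars(text: str, valid: set[str]) -> dict[str, int]:
    """Count characters of the NFD-normalized text that are not in `valid`."""
    counts: dict[str, int] = {}
    for ch in unicodedata.normalize("NFD", text):
        # Whitespace separates fields and syllables, so it is not reported.
        if ch.isspace() or ch in valid:
            continue
        counts[ch] = counts.get(ch, 0) + 1
    return counts


# Illustrative subset only: a e i o u, U+014B (eng), U+0261 (IPA g),
# U+0303 (combining tilde), U+02B0 (aspiration).
VALID_SAMPLE = set("aeiou") | {"\u014b", "\u0261", "\u0303", "\u02b0"}

# Precomposed a-tilde passes after NFD; ASCII g is reported.
print(find_invalid_chars("\u00e3 g\u014b", VALID_SAMPLE))
```

Returning counts instead of a plain set makes it easier to tell a one-off typo from a systematic problem such as the ASCII g.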

@kalvinchang changed the title from "Put all characters into NFD" to "Put all IPA characters into NFD" on Nov 30, 2022
@bahducoup

We can get the list of valid IPA characters from these files in panphon:

{'ɡ͡b', 'd̪', 'ʕ', 'ʁ', 'd', 'ɨ', 't', 'a', 'ɻ', '˨', 'd͡ʒ', 'ɱ', 'ʙ', '˩', 'ʀ', 'i', 'x', 'ɸ', 'ǁ', 'ə', 'ɕ', 'ð', 't͡ɬ', 'ʍ', 't̪͡s̪', 't̪͡ɬ̪', 'ɤ', 'ɓ', 'ʔ', 'ʑ', 'l̪', 'ʊ', 'k', 'ʄ', 'ɟ', 'n̪', 'b', 'ʘ', 'ɘ', 'æ', 'β', 'ʎ', 'p͡t', 'ɧ', 'ɜ', 'ɯ', 'ɫ', 'ʌ', 'z̪', 'v', 'j', 'ɳ', 'ɭ', 'ɣ', 'd͡ʑ', 'ɬ', 'l', 'p͡ɸ', 'ɔ', 'ʟ', 'ʃ', 'œ', 'd͡z', 'ǀ', 'θ', 'w', 'y', 't͡s', 'ɲ', 'r̪', 'e', 'n', 'ɒ', 'b͡v', 'ɽ', 'c', 'ɡ͡ɣ', 'ɗ', 'ɡ', 'ɺ', 'ʂ', 'ɛ', 'k͡x', 't͡ɕ', 'ʛ', 'd̪͡ɮ̪', 'ʋ', 'ɾ', 't͡ʃ', 'h', 'z', 'ʐ', 'ɹ', 'q͡χ', 'ʒ', 'ɰ', 'ɶ', '˥', 'ɟ͡ʝ', 't̪', 'ʈ', 'o', 'ŋ', 'ɢ͡ʁ', 'u', 'ħ', 'ɐ', 't̪͡θ', 'r', 'ɴ', 'ɪ', 'p', '˧', 'ɢ', 'm', 'b͡β', 'ʝ', 'f', 'ɥ', 'ɠ', 'ɖ', 'd̪͡ð', 'ʈ͡ʂ', 'ɞ', 'p͡f', 'b͡d', 'd͡ɮ', 's', 'ʏ', 'ç', 'ǃ', 'c͡ç', 'k͡p', 'ɦ', 'ǂ', 'χ', 'd̪͡z̪', 'ɖ͡ʐ', 'ɵ', 's̪', 'ɮ', 'ɬ̪', 'ɑ', 'ʉ', 'ø', '˦', 'q'}
{'̥', '̼', 'ʰ', '̤', '̻', 'ʼ', '̆', '̟', '˞', '̘', '̩', 'ˤ', 'ᶣ', 'ʲ', '̰', '̴', '̙', 'ˠ', 'ˀ', 'ⁿ', 'ˡ', '̺', '̃', '̝', '̯', '̞', '̈', 'ʷ', '̠', 'ː'}
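Many entries in the first set are multi-character segments (affricates with tie bars, dentals with a combining bridge), so they cannot be compared one character at a time as they stand. One possible approach, sketched here with a hypothetical helper `allowed_code_points` and small sample sets rather than the full panphon data, is to flatten both sets into individual NFD code points and use the result as the allow-list.

```python
import unicodedata


def allowed_code_points(bases: set[str], diacritics: set[str]) -> set[str]:
    """Flatten segments and diacritics into the set of NFD code points they use."""
    allowed: set[str] = set()
    for segment in bases | diacritics:
        allowed.update(unicodedata.normalize("NFD", segment))
    return allowed


# Small samples standing in for the two sets above:
# d + tie bar + ezh, dental t, c-cedilla; combining tilde, aspiration.
bases_sample = {"d\u0361\u0292", "t\u032a", "\u00e7"}
diacritics_sample = {"\u0303", "\u02b0"}

for ch in sorted(allowed_code_points(bases_sample, diacritics_sample)):
    print(f"U+{ord(ch):04X}")
```

One thing to watch: NFD also decomposes the IPA letter ç (U+00E7) into c plus a combining cedilla (U+0327), so the allow-list and the data need to be normalized the same way for the comparison to be meaningful.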

@kalvinchang
Member Author

kalvinchang commented Dec 1, 2022

Using "g" instead of IPA ɡ should be fixed in dmort27/epitran#134
We just need to update the Epitran version to 1.24

In addition, the repo's Middle Chinese transcriptions need to be re-run
