refactor(locale): filter and cleanup PersonEntryDefintions data #3266
base: refactor/person/sex
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## refactor/person/sex #3266 +/- ##
=======================================================
- Coverage 99.97% 99.97% -0.01%
=======================================================
Files 2811 2811
Lines 217025 183684 -33341
Branches 941 940 -1
=======================================================
- Hits 216973 183632 -33341
Misses 52 52
Perhaps it would be possible to summarise the length of each locale definition file before and after the changes, to get a feel for which methods and locales are actually impacted without having to review the giant diff, e.g.
I sorted by entry first, and removed
previously
This suggests to me that rather than a fixed 80% gendered and 20% generic result, if say female is requested, it should pick randomly from the female definitions concatenated with the generic definitions, so that locales with only a small number of generic definitions don't keep picking the same small number of generic names.
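To make the concatenation proposal concrete, here is a minimal sketch. The helper names `genericShareWhenConcatenated` and `pickFromConcatenated` are hypothetical and not part of the faker API; the sketch only shows that, with a concatenated pool, the chance of a generic result scales with the size of the generic list instead of being fixed at 20%.

```typescript
// Hypothetical helpers for illustration only; not part of the faker API.

// Share of results that come from the generic list when the requested
// (specific) list and the generic list are concatenated into one pool.
function genericShareWhenConcatenated(
  specificCount: number,
  genericCount: number
): number {
  const total = specificCount + genericCount;
  if (total === 0) {
    throw new RangeError('Both lists are empty.');
  }
  return genericCount / total;
}

// Uniform pick from the concatenated pool. `random` is injectable so the
// behaviour can be checked deterministically; it must return a value in [0, 1).
function pickFromConcatenated<T>(
  specific: readonly T[],
  generic: readonly T[],
  random: () => number = Math.random
): T {
  const pool = [...specific, ...generic];
  if (pool.length === 0) {
    throw new RangeError('Both lists are empty.');
  }
  return pool[Math.floor(random() * pool.length)];
}
```

With 99 female names and a single generic name, the generic name would come up about 1% of the time under this scheme, rather than 20% of the time under a fixed 80/20 split.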
We also considered this. We also considered weighting them:
In summary, we haven't found the perfect solution yet.
I kind of tend to just proceed with a non-optimal weight distribution and tweak it in subsequent PRs.
What if the percentage of generic names was something that could be set for each locale definition separately?
Then you could have, say, a 10% chance of getting a gender-neutral English prefix, but a 50% chance of getting a gender-neutral Chinese first_name.
Is finding the right percentage for each distribution a (merge or release) blocking issue for you, or is that something we can adjust in later PRs?
I'd say yes. Having 20 percent of all Japanese first names output the same name because there is only one generic name feels like a bug/regression.
I'm not sure what the best way to handle this is. We don't want to leave them all in generic, otherwise female names would be returned when you asked for male. So we would have to go through and split the generic names into male and female. There are some (free and paid) APIs which might be able to help with that, like https://genderize.io/
My plan for this - if you are fine with it - looks like this:
Please let me know what you think of this and what your suggestions are.
In general that sounds fine. There's no great hurry for this, and we are less likely to accidentally break things if we spread this over a few releases. However, I think we should try to figure out what we will do with the problematic locales so we don't get stuck in future. Even if we truncate en generic first names to 1000 first, that's a lot to go through by hand.
IMO we can either check the existing list, which can be a lot, or we could search for a new list. Whatever is easier for us.
Would we allow 1000 male and 1000 female names? Or 1000 total across all genders? |
I think the current script limits it to up to 1000 each. |
…or/person/sex-localeData
How about using a ratio of:
Percentage of choosing specific
Percentage of choosing generic
    'Živana',
    'Žofie',
  ],
  generic: ['Nikola', 'René'],
Is it safe to move them to female and male respectively?
Team Decision
We believe that these values represent the use case best while leaning towards specific values, if specific has been requested.
Just wanted to check I understood this right. For example, if there were 9 generic first names, 25 female first names and 36 male first names, then if I request firstName("female") I'll get a name from the female list versus the generic list in a ratio of 3*sqrt(25) : sqrt(9) = 15:3, i.e. I get a name from the female list 15/18 of the time (83.3 percent).
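The arithmetic in this example can be checked with a small sketch. It assumes the ratio is exactly as read in the comment above, 3 * sqrt(specific count) : sqrt(generic count); the function name `specificProbability` and the `specificFactor` parameter are hypothetical and not taken from the faker code base.

```typescript
// Hypothetical helper for illustration only; not part of the faker API.
// Assumes the ratio described above:
//   specific : generic = specificFactor * sqrt(specificCount) : sqrt(genericCount)
function specificProbability(
  specificCount: number,
  genericCount: number,
  specificFactor = 3
): number {
  const specificWeight = specificFactor * Math.sqrt(specificCount);
  const genericWeight = Math.sqrt(genericCount);
  const total = specificWeight + genericWeight;
  if (total === 0) {
    throw new RangeError('Both lists are empty.');
  }
  // Probability that the name is drawn from the specific (e.g. female) list.
  return specificWeight / total;
}

// 25 female and 9 generic names: 15 / (15 + 3) = 15/18
const femaleShare = specificProbability(25, 9);
```

Under the same assumption, a request for male with 36 male and 9 generic names gives 18:3, so the male list would be used 18/21 of the time. A locale with a single generic name and 100 specific names would use the generic name 1/31 of the time, which is the case the earlier comments were worried about.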
Second part of #3058 (`generic` prefixes)
Extension of #3259
This PR cleans up the PersonEntryDefintions locale data.
- `generic` values are checked whether they exist in exactly one of `female` or `male`; if so, they are removed from `generic`. This solves the issue where `generic = merge(female, male)`.
- `female` values are checked whether they are in `generic`; if so, they are removed from `female`.
- `female` values are checked whether they are in `male`; if so, they are added to `generic` and removed from `female`.
- `male` values are checked whether they are in `generic`; if so, they are removed from `male`.

I haven't run the script yet, because there is a large diff, due to the person data not being sorted.

Summary (changes only)
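The four cleanup rules in the description can be sketched roughly as follows. This is not the script used in this PR: the `SexedEntries` interface and the `cleanupEntries` function are hypothetical names, and the first rule is read here as "present in exactly one of `female` or `male`", which is an assumption about the intended meaning.

```typescript
// Hypothetical shape and helper for illustration only; not the PR's script.
interface SexedEntries {
  generic: string[];
  female: string[];
  male: string[];
}

function cleanupEntries(input: SexedEntries): SexedEntries {
  const generic = new Set(input.generic);
  const female = new Set(input.female);
  const male = new Set(input.male);

  // Rule 1 (assumed reading): a generic value found in exactly one of
  // female/male is not generic after all, so drop it from generic.
  for (const value of Array.from(generic)) {
    if (female.has(value) !== male.has(value)) {
      generic.delete(value);
    }
  }

  // Rule 2: female values that are still in generic are removed from female.
  for (const value of Array.from(female)) {
    if (generic.has(value)) {
      female.delete(value);
    }
  }

  // Rule 3: female values that are also in male are moved to generic.
  for (const value of Array.from(female)) {
    if (male.has(value)) {
      generic.add(value);
      female.delete(value);
    }
  }

  // Rule 4: male values that are in generic are removed from male.
  for (const value of Array.from(male)) {
    if (generic.has(value)) {
      male.delete(value);
    }
  }

  // Sets keep insertion order, so the output order follows the input order.
  return {
    generic: Array.from(generic),
    female: Array.from(female),
    male: Array.from(male),
  };
}
```

Applying the rules in this order means a name listed under both `female` and `male` ends up only in `generic`, while a name that was merged into `generic` from a single list ends up only in that list.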