Corpus analysis #21

iandoug · 2021-08-10T18:59:03Z

iandoug
Aug 10, 2021

Hi

Okay, the Leipzig corpus is better if not perfect.

Attached analysis of character frequency in Leipzig and Gutenberg, plus average. The data scientists can advise if that is the correct methodology :-)
It's sorted by character rather than count.

spanish-character-frequency-v1.ods

Final Leipzig corpus:
~/1web/spanish/leipzig $ wc spanish_leipzig-clean2.txt
5,480,692 lines, 265,281,891 words, 1,610,201,701 characters

Bigrams coming.

iandoug · 2021-08-10T20:31:12Z

iandoug
Aug 10, 2021
Author

Bigrams. Sorted. All permutations.

Do you need trigrams?

spanish-bigram-frequency-v1.ods

2 replies

binarybottle Aug 10, 2021
Maintainer

Fantastic! I don't need trigrams.

iandoug Aug 10, 2021
Author

I think Den (BEAKL) is going to have a go with Opt and trigrams will be useful. Will see tomorrow.

iandoug · 2021-08-11T21:00:35Z

iandoug
Aug 11, 2021
Author

First attempt at Cervantes Monkey. Needed to fiddle a bit to get Unicode to behave.

100k text and frequencies attached. Will try 1MB overnight ... takes about 4+ hours if I remember correctly.

monkey-freq.txt
spanishmonkeytest.txt

2 replies

iandoug Aug 12, 2021
Author

Memory a bit rusty ... the 1MB version takes over a day.

Video of it at work. The "size" in the yellow text is the number of bigrams left to process. It speeds up as that gets smaller, to about 15 characters per second. The stack started at about 1.2 MB (1MB + 20% spare).

Sorry it's a mkv in zip, Github does not allow mkv and VLC is not co-operating to convert it... some encoder not found.

cervantes-monkey-2021-08-12_11.07.05.zip

iandoug Aug 12, 2021
Author

Decided to see how fast it ran on the new PC.
Much faster... already at over 110 chars/second.
Screen recorder can actually save mp4...

cervantes-monkey-2-2021-08-12_11.44.43.mp4

iandoug · 2021-08-12T08:00:07Z

iandoug
Aug 12, 2021
Author

Trigrams plus counts from Leipzig corpus.

I cut it off at the point where the count/highest is 0.01, so just the 1222 most common. Should be enough.
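For anyone wanting to reproduce a cutoff like this, here is a minimal sketch, assuming trigrams are counted over the raw character stream (whether the original counts spanned spaces or line breaks is not stated). The function name `top_trigrams` and the `ratio` parameter are hypothetical, not taken from the actual scripts.

```python
from collections import Counter


def top_trigrams(text, ratio=0.01):
    """Return (trigram, count) pairs whose count is at least
    `ratio` times the count of the most frequent trigram."""
    # Slide a 3-character window over the text and count each window.
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    if not counts:
        return []
    highest = max(counts.values())
    # most_common() yields pairs in descending order of count.
    return [(tri, n) for tri, n in counts.most_common() if n / highest >= ratio]
```

With the default `ratio=0.01` this keeps every trigram that occurs at least 1% as often as the most common one, which is the rule described above.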

spanish-trios-leipzig.csv
tab-separated.

BTW, before importing into spreadsheet, highlight the first column and Format Cells to text, else the stuff may get interpreted as a formula. Spreadsheets are too clever for their own good. This applies to the other .csv frequency files as well.

Cheers, Ian

0 replies

iandoug · 2021-08-12T11:08:10Z

iandoug
Aug 12, 2021
Author

1MB chained bigrams, plus char frequency.

Original job still running :-)

cervantes-1mb.txt-freq.txt
cervantes-1mb.txt

0 replies

iandoug · 2021-08-12T11:09:39Z

iandoug
Aug 12, 2021
Author

I think that concludes things with the corpus.

Will add the texts to KLAnext, but need to revise the user interface a bit ...

2 replies

Lobo-Feroz Aug 13, 2021

Awesome work you did here, Ian,

I'm sure this will be insanely useful for future keyboard designers and testers.

We're very lucky to have you around !

iandoug Aug 13, 2021
Author

Other size chained bigrams and analysis.
Never going to get exact char frequency as in full corpus, but should be good enough for development use.

10, 20, 30, 40, 50, 60, 100k, 1MB
spanish-chained-bigrams.zip

binarybottle · 2021-08-15T18:34:44Z

binarybottle
Aug 15, 2021
Maintainer

Thank you so much, @iandoug!

Now that I've returned from my travels, I will run the engram-es layout optimization code on your cleaned-up version of the Leipzig corpus, which is the largest and I hope most representative of what people would type.

3 replies

iandoug Aug 15, 2021
Author

Leipzig does not include "book type" text ... dialogue etc.

binarybottle Aug 15, 2021
Maintainer

According to https://wortschatz.uni-leipzig.de/en/download/Spanish:

Used text material was taken from news websites (typically on a daily basis via RSS feeds).

Used text material was crawled from news websites and may be older than the specified year.

Used text material was taken from randomly chosen Web sites.

Used text material was taken from Wikipedia dumps.

Did you grab all of these? Even these have different volumes, so we have to be careful about bias.

iandoug Aug 15, 2021
Author

I took the largest file in each row, except for those labelled as South/Central American.
There were a lot of non-Spanish names etc., which got dropped... I dropped the entire offending line.
See the stats above ... it's bigger than most corpora I've seen analysed.
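The exact test for an "offending" line is not described here. As a rough sketch only, assuming it was based on a list of known Spanish words (an assumption, as are the names `keep_line`, `clean_corpus` and `known_words`), dropping whole lines could look like this:

```python
import re

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def keep_line(line, known_words):
    """True if every word in the line is in the known-word set."""
    return all(w.lower() in known_words for w in WORD_RE.findall(line))


def clean_corpus(lines, known_words):
    """Drop the entire line as soon as one word is unknown."""
    return [ln for ln in lines if keep_line(ln, known_words)]
```

Dropping the whole line rather than the single word keeps the remaining bigrams and trigrams intact, at the cost of losing some valid text.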

Will look at bigrams tomorrow.

binarybottle · 2021-08-15T18:49:20Z

binarybottle
Aug 15, 2021
Maintainer

Okay, the Leipzig corpus is better if not perfect.
It's sorted by character rather than count.

@iandoug -- There appears to be a redundant letter "i":

¡ 80563
i 139041230

2 replies

iandoug Aug 15, 2021
Author

One is i. The other is inverted exclamation. Github needs better fonts.
i¡

binarybottle Aug 15, 2021
Maintainer

¡Got it -- thanks!

binarybottle · 2021-08-15T22:38:51Z

binarybottle
Aug 15, 2021
Maintainer

@iandoug -- I would like to put together a text data repo for Spanish like the one I did for English, with the modified files and a readme with a description of your modifications -- could you provide both? And, not speaking or reading Spanish, I don't know how representative each of these categories of text data may be for what people will type. Do you have any sense what the relative representativeness might be, so we can weight them accordingly? What I'm thinking is if you could provide the count of the one-grams and bigrams for each category (news, websites, etc.), then we can combine them with each other and with the Gutenberg counts as we see fit, and others can do so for themselves in the future.

18 replies

Lobo-Feroz Aug 20, 2021

goodwords2.zip

Thanks Ian! This is great to have. Will try processing and uploading it to type-fu.

I suppose part of the problem is any understanding of how Spanish changes words from one use case to the next.... whether tense, gender, whatever. So if every possibility was not covered in the word lists, then there will be gaps.

This is exactly what happens with the vast majority of the words in badwords.txt. Most of them are perfectly valid words, only with less common gender, plurals, or verb tense. Exactly this.

iandoug Aug 21, 2021
Author

Sorry I screwed up ... I uploaded the wrong bad words file. Was the original.
Should have been this one. How do these look?
badwords2.txt

Lobo-Feroz Aug 21, 2021

Sorry I screwed up ... I uploaded the wrong bad words file. Was the original.
Should have been this one. How do these look?
badwords2.txt

This one has far fewer words than the previous one, and most of them have typos from missing characters, a missing acute accent and so on. Most words here are actually invalid.

That means all valid words from the previous file have been already added to your dictionary? That's good news.

How could I help with this one?

iandoug Aug 21, 2021
Author

Then I think we are done? :-) (And Arno can relax? :-) )

I grabbed some Spanish books from Gutenberg, will see how clean their "Chapter 1"'s are to add to KLA or for your typing apps.

"La cigüeña tocaba cada vez mejor el saxofón y el búho pedía kiwi y queso."
Benjamín pidió una bebida de kiwi y fresa. Noé, sin vergüenza, la más exquisita champaña del menú.
El cadáver de Wamba, rey godo de España, fue exhumado y trasladado en una caja de zinc que pesó un kilo.
El pingüino Wenceslao hizo kilómetros bajo exhaustiva lluvia y frío; añoraba a su querido cachorro.
El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja.
El viejo Señor Gómez pedía queso, kiwi y habas, pero le ha tocado un saxofón.
Es extraño mojar queso en la cerveza o probar whisky de garrafa.
Ese libro explica en su epígrafe las hazañas y aventuras de Don Quijote de la Mancha en Kuwait.
Extraño pan de col y kiwi se quemó bajo fugaz vaho.
José compró una vieja zampoña en Perú. Excusándose, Sofía tiró su whisky al desagüe de la banqueta.
Jovencillo emponzoñado de whisky, qué mala figurota exhibes.
Jovencillo emponzoñado de whisky: ¡qué figurota exhibe!
La cigüeña gigante bebió ocho copas de whisky, más quince jarras llenas de fría cerveza rubia, y enseguida huyó en un taxi.
La niña, viéndose atrapada en el áspero baúl índigo y sintiendo asfixia, lloró de vergüenza; mientras que la frustrada madre llamaba a su hija diciendo: “¿Dónde estás Waleska?”.
Queda gazpacho, fibra, látex, jamón, kiwi y viñas.
Quiere la boca exhausta vid, kiwi, piña y fugaz jamón.
Whisky bueno: ¡excitad mi frágil pequeña vejez!

binarybottle Aug 21, 2021
Maintainer

Whew!!! On to punctuation today...

binarybottle · 2021-08-16T16:52:15Z

binarybottle
Aug 16, 2021
Maintainer

@iandoug -- Zenodo is a great idea. There is also osf.io, which is a free service that I use a lot; it caps files at around 5 GB. We should upload all files and the description to either or both services when we are ready, and link to them from our documentation.

1 reply

iandoug Sep 12, 2021
Author

Tried to register at osf but I am NOT doing that {)@({@#($ captcha.

binarybottle · 2021-08-16T16:54:05Z

binarybottle
Aug 16, 2021
Maintainer

@iandoug -- are you recommending that we derive counts for the Leipzig (summed across all categories), and also for Gutenberg, and combine these counts as we see fit afterwards?

2 replies

iandoug Aug 16, 2021
Author

That is what I did.

In order to get a "combined" bigram score, I multiplied the count for each pair in Gutenberg by (total leipzig)/(total gutenberg), ie scaling the whole corpus to be about the same, then simply added the counts for each bigram together.
I suppose we could do the same for the letters-only counts. Will see how they look.
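The scaling step can be sketched as follows. This is an illustration under assumptions: the real work was done in the spreadsheet, and the function name `combine_bigrams` and the dict-of-counts inputs are hypothetical.

```python
def combine_bigrams(leipzig, gutenberg):
    """Scale Gutenberg counts up to the Leipzig total, then add per bigram."""
    # (total leipzig) / (total gutenberg), as described above.
    scale = sum(leipzig.values()) / sum(gutenberg.values())
    combined = dict(leipzig)
    for pair, count in gutenberg.items():
        # Bigrams missing from Leipzig start from zero.
        combined[pair] = combined.get(pair, 0) + count * scale
    return combined
```

After scaling, both corpora contribute the same total mass, so the sum weights them roughly equally regardless of their original sizes.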

For your purposes, for the characters, I converted each char's count to a percentage of the total (for Leipzig and Gutenberg separately), then took the average of the percentages.

So effectively weighted the result as 50% Leipzig and 50% Gutenberg.
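A minimal sketch of that averaging, assuming the per-character counts are available as dicts (the names `to_percent` and `average_percent` are hypothetical, and a character absent from one corpus is treated as 0% there, which is an assumption):

```python
def to_percent(counts):
    """Convert raw counts to percentages of the corpus total."""
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}


def average_percent(leipzig, gutenberg):
    """Average the two percentage tables: a 50/50 weighting of the corpora."""
    p_l, p_g = to_percent(leipzig), to_percent(gutenberg)
    chars = sorted(set(p_l) | set(p_g))
    return {ch: (p_l.get(ch, 0.0) + p_g.get(ch, 0.0)) / 2 for ch in chars}
```

Because each corpus is normalised to 100% before averaging, the much larger Leipzig corpus does not swamp Gutenberg.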

binarybottle Aug 16, 2021
Maintainer

Sounds great!

iandoug · 2021-08-16T20:05:16Z

iandoug
Aug 16, 2021
Author

Attached spreadsheet with bigrams, Leipzig and Gutenberg

Letters only
lowercase -> uppercase
diacritic -> plain
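One way to apply these three rules before counting, as a sketch only: it assumes Unicode NFD decomposition is an acceptable way to strip diacritics, which also turns ñ into N. Whether the spreadsheet treated ñ that way is not stated here. The names `normalize` and `letter_bigrams` are hypothetical.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Strip diacritics (NFD, drop combining marks) and uppercase."""
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return plain.upper()


def letter_bigrams(text):
    """Count adjacent pairs in which both characters are letters."""
    norm = normalize(text)
    return Counter(a + b for a, b in zip(norm, norm[1:])
                   if a.isalpha() and b.isalpha())
```

Pairs that include a space, digit or punctuation mark are skipped, so bigrams never span a word boundary.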

spanish-bigrams-unicase-letters-leipzig-gutenberg.ods

If you only want first and second last columns, export to csv, close, open csv, delete unwanted columns, save, import ...
I left it as spreadsheet so you could see the formulas.

2 replies

binarybottle Aug 16, 2021
Maintainer

You are awesome, @iandoug! This is exactly what I needed!
Now I can get cranking!!!

iandoug Aug 16, 2021
Author

Char frequencies still coming.

binarybottle · 2021-08-20T13:49:05Z

binarybottle
Aug 20, 2021
Maintainer

I'm working with the assumption that the unicase letter bigram frequency file that I received is final, because this is how I generated the current layout for which I am conducting tests.

2 replies

Lobo-Feroz Aug 20, 2021

I don't know if you are referring to the fact that I'm still cleaning the "bad words" list. I don't think the count for very low frequency words would be statistically relevant for the monogram and bigram frequencies that Ian has given us already, for layout optimization purposes. We're scraping the bottom of the barrel for valid but actually pretty rare words here.

But I do think that having a cleaner corpus and probably a good Spanish dictionary of valid words can have some other uses, for instance to generate random lists of words for typing training, to validate further Spanish corpora, other uses for KLA maybe...

binarybottle Aug 20, 2021
Maintainer

That makes sense. Thank you!
I'll continue with what I'm doing...

Lobo-Feroz · 2021-08-21T16:23:28Z

Lobo-Feroz
Aug 21, 2021

Hey @iandoug

Gutenberg's corpus distorts symbol count by using them as special characters.

For instance the underscore is used to indicate italicized words.

https://www.gutenberg.org/attic/html_faq.html#can-i-submit-a-html-or-other-format-of-somebody-elses-text
"A plain text file, using extended character sets like ISO-8859 [V.76] or Unicode [V.77] and underscores for italics, can capture all of the author’s intent in almost all cases."

And that's why there are a gajillion underscores in Gutenberg.

This, obviously, distorts its frequency massively. It's true that the underscore serves this purpose nowadays: many apps like WhatsApp, Telegram or this very GitHub use underscores for italics, and usually asterisks for bold.

The question is: should we entertain this convention in order to create an optimal Spanish layout?

1 reply

iandoug Aug 21, 2021
Author

Why do Spanish authors put so much in italics? :-)

English books use italics and bold very seldom.

Underscore is also used a lot in code.

I suggest you keep it with dash, and just worry about dash frequency, which is probably also distorted by Gutenberg using -- for —
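If someone did want to correct for both distortions before counting symbols, a rough preprocessing sketch might look like the following. It assumes Gutenberg italics always appear as a matched `_..._` pair on one line and that every `--` stands for an em dash; neither is guaranteed, and the name `clean_gutenberg` is hypothetical.

```python
import re

# A matched pair of underscores around text on a single line.
ITALIC_RE = re.compile(r"_([^_\n]+)_")


def clean_gutenberg(text):
    """Remove underscore italics markers and restore em dashes."""
    text = ITALIC_RE.sub(r"\1", text)      # _word_ -> word
    return text.replace("--", "\u2014")    # double hyphen -> em dash (U+2014)
```

Text that contains several unrelated underscores on one line (code, for example) would be mangled by this, so it only suits plain book text.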

binarybottle · 2021-09-02T15:15:51Z

binarybottle
Sep 2, 2021
Maintainer

@iandoug -- It would be nice to wrap this up and put the relevant files and description up in osf.io / zenodo. If you don't have time, I would be happy to do so -- just point me to the most relevant files. Thanks!

6 replies

iandoug Sep 5, 2021
Author

Attached draft for review and comment. Probably still needs a bit of polish.

The "file list" (section 5) is not final yet.

Please ensure I credited everybody correctly. :-)

@NickG13
@binarybottle
@Lobo-Feroz

Thanks, Ian

Creating-a-corpus-and-chained-bigrams-for-Spanish-keyboard-development-and-evaluation-0.9.0.pdf

binarybottle Sep 7, 2021
Maintainer

FANTASTIC, @iandoug! You have done a great job documenting the project, and people will find this very helpful!

Lobo-Feroz Sep 11, 2021

Awesome work @iandoug !

What great documentation, both to show that this layout has been thoroughly researched, and as a base for other optimized Spanish layouts based on different paradigms.

And thanks for the credits!

Found a little typo. In "6. Acknowledgements" there's a missing space after the parenthesis and before "Arno" here:

Thanks to (alphabetically)Arno Klein

Also the beginning of the PDF title shows up as:

Zep Tepi Mathematics 201

Which looks like a different project altogether. Possibly a field unchanged from the document used as template.

Congratulations again! 👍

iandoug Sep 12, 2021
Author

Done. Hope all is correct :-)

Report and corpus/analysis files:

https://zenodo.org/record/5501914

all versions: https://doi.org/10.5281/zenodo.5501913

this version: https://doi.org/10.5281/zenodo.5501914

Suspect German will be next, but we need to add deadkey support to KLA ...

Cheers, Ian

iandoug Sep 12, 2021
Author

Also the beginning of the PDF title shows up as:

Zep Tepi Mathematics 201

Which looks like a different project altogether. Possibly a field unchanged from the document used as template.

Yuch. Only saw that after uploading. Only shows up when viewing the PDF.

Uploaded v1.0.1 to fix that. Is indeed from template... I need to get back to Giza :-)

binarybottle · 2021-09-12T16:16:16Z

binarybottle
Sep 12, 2021
Maintainer

Congratulations and great work, @iandoug !!!

0 replies
