-
Bigrams. Sorted. All permutations. Do you need trigrams?
-
First attempt at Cervantes Monkey. Needed to fiddle a bit to get Unicode to behave. 100k text and frequencies attached. Will try 1MB overnight ... takes about 4+ hours if I remember correctly.
-
Trigrams plus counts from the Leipzig corpus. I cut the list off at the point where a trigram's count divided by the highest count falls to 0.01, which leaves just the 1222 most common. Should be enough.
spanish-trios-leipzig.csv
BTW, before importing into a spreadsheet, highlight the first column and set Format Cells to Text, else the entries may get interpreted as formulas. Spreadsheets are too clever for their own good. This applies to the other .csv frequency files as well.
Cheers, Ian
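For anyone wanting to reproduce that kind of cutoff on their own text, here is a minimal sketch. It is not the script used to build the attached file: the function name `top_trigrams` and the letters-only filter are assumptions made for illustration, and only the 0.01 ratio comes from the description above.

```python
from collections import Counter


def top_trigrams(text, ratio=0.01):
    """Count letter trigrams and keep those whose count is at least
    `ratio` times the count of the most frequent trigram."""
    counts = Counter(
        text[i:i + 3]
        for i in range(len(text) - 2)
        if text[i:i + 3].isalpha()  # assumption: skip trigrams with spaces or symbols
    )
    if not counts:
        return []
    cutoff = counts.most_common(1)[0][1] * ratio
    # most_common() returns (trigram, count) pairs sorted by descending count
    return [(tri, n) for tri, n in counts.most_common() if n >= cutoff]
```

Calling `top_trigrams(open("corpus.txt", encoding="utf-8").read())` on a hypothetical `corpus.txt` would give the list to write out as CSV.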
-
1MB chained bigrams, plus char frequency. Original job still running :-)
-
I think that concludes things with the corpus. Will add the texts to KLAnext, but need to revise the user interface a bit ...
-
Thank you so much, @iandoug! Now that I've returned from my travels, I will run the engram-es layout optimization code on your cleaned-up version of the Leipzig corpus, which is the largest and I hope most representative of what people would type.
-
@iandoug -- There appears to be a redundant letter "i":
-
@iandoug -- I would like to put together a text data repo for Spanish like the one I did for English, with the modified files and a readme with a description of your modifications -- could you provide both?
And, not speaking or reading Spanish, I don't know how representative each of these categories of text data may be for what people will type. Do you have any sense of what the relative representativeness might be, so we can weight them accordingly?
What I'm thinking is that if you could provide the counts of the one-grams and bigrams for each category (news, websites, etc.), then we can combine them with each other and with the Gutenberg counts as we see fit, and others can do so for themselves in the future.
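To make the weighting idea concrete, here is a rough sketch of one way per-category counts could be combined. The category names, the weights, and the function `combine_counts` are hypothetical; nothing here is an agreed method. Each category is first converted to relative frequencies so that a large category does not dominate simply by size, and the weights then express how representative we think each one is.

```python
from collections import Counter


def combine_counts(category_counts, weights):
    """Weighted sum of per-category relative frequencies.

    category_counts: {category: {ngram: count}}
    weights:         {category: weight}, hypothetical values chosen by the user
    """
    # Ignore empty categories so they do not take a share of the weight
    nonempty = {
        name: counts
        for name, counts in category_counts.items()
        if sum(counts.values()) > 0
    }
    total_weight = sum(weights[name] for name in nonempty)
    combined = Counter()
    for name, counts in nonempty.items():
        total = sum(counts.values())
        share = weights[name] / total_weight
        for ngram, n in counts.items():
            combined[ngram] += share * (n / total)
    return combined


# Made-up numbers, for illustration only
example = combine_counts(
    {"news": {"de": 3, "la": 1}, "web": {"de": 1, "la": 1}},
    {"news": 1.0, "web": 1.0},
)
```

With per-category count files published, anyone could rerun this with their own weights.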
-
@iandoug -- Zenodo is a great idea. There is also osf.io, which is a free service that I use a lot; it caps files at around 5 GB. We should upload all files and a description to either or both services when we are ready, and link to them from our documentation.
-
@iandoug -- Are you recommending that we derive counts for the Leipzig corpus (summed across all categories), and also for Gutenberg, and combine these counts as we see fit afterwards?
-
Attached spreadsheet with bigrams for Leipzig and Gutenberg:
spanish-bigrams-unicase-letters-leipzig-gutenberg.ods
If you only want the first and second-to-last columns, export to csv, close, open the csv, delete the unwanted columns, save, import ...
-
I'm working with the assumption that the unicase letter bigram frequency file that I received is final, because this is how I generated the current layout for which I am conducting tests.
-
Hey @iandoug, the Gutenberg corpus distorts symbol counts because it uses some symbols as markup. For instance, the underscore is used to mark italicized words: https://www.gutenberg.org/attic/html_faq.html#can-i-submit-a-html-or-other-format-of-somebody-elses-text
That's why there are a gajillion underscores in Gutenberg, which massively inflates the underscore's frequency.
It's true that the underscore does serve this purpose nowadays: many apps like WhatsApp, Telegram, or GitHub itself use underscores for italics, and usually asterisks for bold. The question is: should we take this convention into account when creating an optimal Spanish layout?
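If we decide not to count the markup, one option is to strip the paired underscores before computing frequencies. The snippet below is only a sketch under the assumption that italics always appear as `_word or phrase_` on a single line; the pattern and the name `strip_italic_markers` are mine, not part of any existing cleanup script. Underscores inside a word, as in `snake_case`, are left alone.

```python
import re
from collections import Counter

# An opening "_" not preceded by a word character, a closing "_" not followed
# by one, and no whitespace directly inside the markers.
ITALIC = re.compile(r"(?<!\w)_(?=\S)([^_\n]+?)(?<=\S)_(?!\w)")


def strip_italic_markers(text):
    """Remove paired underscore markers but keep the text they enclose."""
    return ITALIC.sub(r"\1", text)


sample = "El _Quijote_ usa snake_case y _otra cosa_."
cleaned = strip_italic_markers(sample)

# Compare underscore counts before and after cleaning
before = Counter(sample)["_"]
after = Counter(cleaned)["_"]
```

Comparing the underscore count before and after on the real corpus would show how much of it is markup rather than typed text.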
-
@iandoug -- It would be nice to wrap this up and put the relevant files and description up in osf.io / zenodo. If you don't have time, I would be happy to do so -- just point me to the most relevant files. Thanks!
-
Congratulations and great work, @iandoug !!!
-
Hi
Okay, the Leipzig corpus is better if not perfect.
Attached analysis of character frequency in Leipzig and Gutenberg, plus average. The data scientists can advise if that is the correct methodology :-)
It's sorted by character rather than count.
spanish-character-frequency-v1.ods
Final Leipzig corpus:
~/1web/spanish/leipzig $ wc spanish_leipzig-clean2.txt
5,480,692 lines, 265,281,891 words, 1,610,201,701 characters
Bigrams coming.
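On the methodology question, there are at least two ways to get one number per character from two corpora, and they can differ a lot when the corpora differ in size. The sketch below contrasts them. The function names are illustrative, the numbers are made up, and I am not claiming either one matches what the spreadsheet does.

```python
from collections import Counter


def relative(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def average_of_frequencies(a, b):
    """Average the two relative frequencies: each corpus counts equally."""
    ra, rb = relative(a), relative(b)
    chars = sorted(set(ra) | set(rb))  # sorted by character, not by count
    return {ch: (ra.get(ch, 0.0) + rb.get(ch, 0.0)) / 2 for ch in chars}


def pooled_frequencies(a, b):
    """Add the raw counts first: the larger corpus dominates."""
    return relative(Counter(a) + Counter(b))


# Made-up counts: a large corpus and a small one with a different mix
large = {"a": 900, "e": 100}
small = {"a": 5, "e": 5}
averaged = average_of_frequencies(large, small)
pooled = pooled_frequencies(large, small)
```

Here the averaged share of "a" is 0.7 while the pooled share is about 0.896, so the choice matters whenever one corpus is much bigger than the other.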