diff --git a/CHANGELOG.md b/CHANGELOG.md index 4fa1bfd27..ef5c161f0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -20,7 +20,7 @@ ## [3.6.0](https://github.com/BLKSerene/Wordless/releases/tag/3.6.0) - ??/??/2024 ### 🎉 New Features -- Measures: Add effect size - squared association ratio +- Measures: Add effect size - conditional probability / squared association ratio - Utils: Add Stanza's Sindhi dependency parser ### 📌 Bugfixes diff --git a/doc/doc.md b/doc/doc.md index d98138534..36b202682 100644 --- a/doc/doc.md +++ b/doc/doc.md @@ -914,6 +914,7 @@ Ukrainian |KOI8-U |✔ Urdu |CP1006 |✔ Vietnamese |CP1258 |✔ + ### [12.4 Supported Measures](#doc) @@ -946,8 +947,6 @@ The following variables would be used in formulas:
**NumCharsAlpha**: Number of alphabetic characters (letters, CJK characters, etc.) -Test of Statistical Significance|Measure of Bayes Factor|Formula ---------------------------------|-----------------------|------- -Fisher's exact test
([Pedersen, 1996](#ref-pedersen-1996))||See: [Fisher's exact test - Wikipedia](https://en.wikipedia.org/wiki/Fisher%27s_exact_test#Example) -Log-likelihood ratio test
([Dunning, 1993](#ref-dunning-1993))|Log-likelihood ratio test
([Wilson, 2013](#ref-wilson-2013))|![Formula](/doc/measures/statistical_significance/log_likehood_ratio_test.svg) -Mann-Whitney U test
([Kilgarriff, 2001](#ref-kilgarriff-2001))||See: [Mann–Whitney U test - Wikipedia](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Calculations) -Pearson's chi-squared test
([Hofland & Johansson, 1982](#ref-hofland-johansson-1982); [Oakes, 1998](#ref-oakes-1998))||![Formula](/doc/measures/statistical_significance/pearsons_chi_squared_test.svg) -Student's t-test (1-sample)
([Church et al., 1991](#ref-church-et-al-1991))||![Formula](/doc/measures/statistical_significance/students_t_test_1_sample.svg) -Student's t-test (2-sample)
([Paquot & Bestgen, 2009](#ref-paquot-bestgen-2009))|Student's t-test (2-sample)
([Wilson, 2013](#ref-wilson-2013))|![Formula](/doc/measures/statistical_significance/students_t_test_2_sample.svg) -Z-test
([Dennis, 1964](#ref-dennis-1964))||![Formula](/doc/measures/statistical_significance/z_test.svg) -Z-test (Berry-Rogghe)
([Berry-Rogghe, 1973](#ref-berry-rogghe-1973))||![Formula](/doc/measures/statistical_significance/z_test_berry_rogghe.svg)
where **S** is the average span size on both sides of the node word. +Test of Statistical Significance|Measure of Bayes Factor|Formula|Collocation Extraction|Keyword Extraction +--------------------------------|-----------------------|-------|----------------------|------------------ +Fisher's exact test
([Pedersen, 1996](#ref-pedersen-1996); [Kilgarriff, 2001, p. 105](#ref-kilgarriff-2001))||See: [Fisher's exact test - Wikipedia](https://en.wikipedia.org/wiki/Fisher%27s_exact_test#Example)|✔|✔ +Log-likelihood ratio test
([Dunning, 1993](#ref-dunning-1993); [Kilgarriff, 2001, p. 105](#ref-kilgarriff-2001))|Log-likelihood ratio test
([Wilson, 2013](#ref-wilson-2013))|![Formula](/doc/measures/statistical_significance/log_likehood_ratio_test.svg)|✔|✔ +Mann-Whitney U test
([Kilgarriff, 2001, pp. 103–104](#ref-kilgarriff-2001))||See: [Mann–Whitney U test - Wikipedia](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Calculations)|✖️|✔ +Pearson's chi-squared test
([Hofland & Johansson, 1982, p. 12](#ref-hofland-johansson-1982); [Dunning, 1993, p. 63](#ref-dunning-1993); [Oakes, 1998, p. 25](#ref-oakes-1998))||![Formula](/doc/measures/statistical_significance/pearsons_chi_squared_test.svg)|✔|✔ +Student's t-test (1-sample)
([Church et al., 1991, pp. 120–126](#ref-church-et-al-1991))||![Formula](/doc/measures/statistical_significance/students_t_test_1_sample.svg)|✔|✖️ +Student's t-test (2-sample)
([Paquot & Bestgen, 2009, pp. 252–253](#ref-paquot-bestgen-2009))|Student's t-test (2-sample)
([Wilson, 2013](#ref-wilson-2013))|![Formula](/doc/measures/statistical_significance/students_t_test_2_sample.svg)|✖️|✔ +Z-test
([Dennis, 1964, p. 69](#ref-dennis-1964))||![Formula](/doc/measures/statistical_significance/z_test.svg)|✔|✖️ +Z-test (Berry-Rogghe)
([Berry-Rogghe, 1973](#ref-berry-rogghe-1973))||![Formula](/doc/measures/statistical_significance/z_test_berry_rogghe.svg)
where **S** is the average span size on both sides of the node word.|✔|✖️ -Measure of Effect Size|Formula -----------------------|------- -%DIFF
([Gabrielatos & Marchi, 2011](#ref-gabrielatos-marchi-2011))|![Formula](/doc/measures/effect_size/pct_diff.svg) -Cubic association ratio
([Daille, 1994](#ref-daille-1994))|![Formula](/doc/measures/effect_size/im3.svg) -Dice-Sørensen coefficient
([Smadja et al., 1996](#ref-smadja-et-al-1996))|![Formula](/doc/measures/effect_size/dice_sorensen_coeff.svg) -Difference coefficient
([Hofland & Johansson, 1982](#ref-hofland-johansson-1982); [Gabrielatos, 2018](#ref-gabrielatos-2018))|![Formula](/doc/measures/effect_size/diff_coeff.svg) -Jaccard index
([Dunning, 1998](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/jaccard_index.svg) -Kilgarriff's ratio
([Kilgarriff, 2009](#ref-kilgarriff-2009))|![Formula](/doc/measures/effect_size/kilgarriffs_ratio.svg)
where **α** is the smoothing parameter, whose value could be changed via **Menu Bar → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter**. -logDice
([RychlĂ˝, 2008](#ref-rychly-2008))|![Formula](/doc/measures/effect_size/log_dice.svg) -Log-frequency biased MD
([Thanopoulos et al., 2002](#ref-thanopoulos-et-al-2002))|![Formula](/doc/measures/effect_size/lfmd.svg) -Log Ratio
([Hardie, 2014](#ref-hardie-2014))|![Formula](/doc/measures/effect_size/log_ratio.svg) -MI.log-f
([Kilgarriff & Tugwell, 2002](#ref-kilgarriff-tugwell-2002); [Lexical Computing Ltd., 2015](#ref-lexical-computing-ltd-2015))|![Formula](/doc/measures/effect_size/mi_log_f.svg) -Minimum sensitivity
([Pedersen, 1998](#ref-pedersen-1998))|![Formula](/doc/measures/effect_size/min_sensitivity.svg) -Mutual Dependency
([Thanopoulos et al., 2002](#ref-thanopoulos-et-al-2002))|![Formula](/doc/measures/effect_size/md.svg) -Mutual Expectation
([Dias et al., 1999](#ref-dias-et-al-1999))|![Formula](/doc/measures/effect_size/me.svg) -Mutual information
([Dunning, 1998](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/mi.svg) -Odds ratio
([Pojanapunya & Todd, 2016](#ref-pojanapunya-todd-2016))|![Formula](/doc/measures/effect_size/odds_ratio.svg) -Pointwise mutual information
([Church & Hanks, 1990](#ref-church-hanks-1990))|![Formula](/doc/measures/effect_size/pmi.svg) -Poisson collocation measure
([Quasthoff & Wolff, 2002](#ref-quasthoff-wolff-2002))|![Formula](/doc/measures/effect_size/poisson_collocation_measure.svg) -Squared association ratio
([Daille, 1995](#ref-daille-1995))|![Formula](/doc/measures/effect_size/im2.svg) -Squared phi coefficient
([Church & Gale, 1991](#ref-church-gale-1991))|![Formula](/doc/measures/effect_size/squared_phi_coeff.svg) +Measure of Effect Size|Formula|Collocation Extraction|Keyword Extraction +----------------------|-------|----------------------|------------------ +%DIFF
([Gabrielatos & Marchi, 2011](#ref-gabrielatos-marchi-2011))|![Formula](/doc/measures/effect_size/pct_diff.svg)|✖️|✔ +Conditional probability
([Durrant, 2008, p. 84](#ref-durrant-2008))|![Formula](/doc/measures/effect_size/conditional_probability.svg)|✔|✖️ +Cubic association ratio
([Daille, 1994, p. 139](#ref-daille-1994); [Kilgarriff, 2001, p. 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im3.svg)|✔|✔ +Dice-Sørensen coefficient
([Smadja et al., 1996, p. 8](#ref-smadja-et-al-1996))|![Formula](/doc/measures/effect_size/dice_sorensen_coeff.svg)|✔|✖️ +Difference coefficient
([Hofland & Johansson, 1982, p. 14](#ref-hofland-johansson-1982); [Gabrielatos, 2018, p. 236](#ref-gabrielatos-2018))|![Formula](/doc/measures/effect_size/diff_coeff.svg)|✖️|✔ +Jaccard index
([Dunning, 1998, p. 48](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/jaccard_index.svg)|✔|✖️ +Kilgarriff's ratio
([Kilgarriff, 2009](#ref-kilgarriff-2009))|![Formula](/doc/measures/effect_size/kilgarriffs_ratio.svg)
where **α** is the smoothing parameter, whose value can be changed via **Menu Bar → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter**.|✖️|✔ +logDice
([Rychlý, 2008, p. 9](#ref-rychly-2008))|![Formula](/doc/measures/effect_size/log_dice.svg)|✔|✖️ +Log-frequency biased MD
([Thanopoulos et al., 2002, p. 621](#ref-thanopoulos-et-al-2002))|![Formula](/doc/measures/effect_size/lfmd.svg)|✔|✖️ +Log Ratio
([Hardie, 2014](#ref-hardie-2014))|![Formula](/doc/measures/effect_size/log_ratio.svg)|✔|✔ +MI.log-f
([Kilgarriff & Tugwell, 2002](#ref-kilgarriff-tugwell-2002); [Lexical Computing Ltd., 2015, p. 4](#ref-lexical-computing-ltd-2015))|![Formula](/doc/measures/effect_size/mi_log_f.svg)|✔|✖️ +Minimum sensitivity
([Pedersen, 1998](#ref-pedersen-1998))|![Formula](/doc/measures/effect_size/min_sensitivity.svg)|✔|✖️ +Mutual Dependency
([Thanopoulos et al., 2002, p. 621](#ref-thanopoulos-et-al-2002))|![Formula](/doc/measures/effect_size/md.svg)|✔|✖️ +Mutual Expectation
([Dias et al., 1999](#ref-dias-et-al-1999))|![Formula](/doc/measures/effect_size/me.svg)|✔|✖️ +Mutual information
([Dunning, 1998, pp. 49–52](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/mi.svg)|✔|✖️ +Odds ratio
([Pecina, 2005, p. 15](#ref-pecina-2005); [Pojanapunya & Todd, 2016](#ref-pojanapunya-todd-2016))|![Formula](/doc/measures/effect_size/odds_ratio.svg)|✔|✔ +Pointwise mutual information
([Church & Hanks, 1990](#ref-church-hanks-1990); [Kilgarriff, 2001, pp. 104–105](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/pmi.svg)|✔|✔ +Poisson collocation measure
([Quasthoff & Wolff, 2002](#ref-quasthoff-wolff-2002))|![Formula](/doc/measures/effect_size/poisson_collocation_measure.svg)|✔|✖️ +Squared association ratio
([Daille, 1995, p. 21](#ref-daille-1995); [Kilgarriff, 2001, p. 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im2.svg)|✔|✔ +Squared phi coefficient
([Church & Gale, 1991](#ref-church-gale-1991))|![Formula](/doc/measures/effect_size/squared_phi_coeff.svg)|✔|✖️ ## [13 References](#doc) @@ -1579,7 +1580,7 @@ Measure of Effect Size|Formula 1. [**^**](#ref-cttr) Carroll, J. B. (1964). *Language and thought*. Prentice-Hall. -1. [**^**](#ref-carrolls-d2) [**^**](#ref-carrolls-um) Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. *Computer Studies in the Humanities and Verbal Behaviour*, *3*(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x +1. [**^**](#ref-carrolls-d2) [**^**](#ref-carrolls-um) Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. *ETS Research Bulletin Series*, *1970*(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x 1. [**^**](#ref-rgl) Caylor, J. S., & Sticht, T. G. (1973). *Development of a simple readability index for job reading material*. Human Resource Research Organization. https://ia902703.us.archive.org/31/items/ERIC_ED076707/ERIC_ED076707.pdf @@ -1613,7 +1614,7 @@ Measure of Effect Size|Formula 1. [**^**](#ref-dawoods-readability-formula) Dawood, B.A.K. (1977). *The relationship between readability and selected language variables* [Unpublished master’s thesis]. University of Baghdad. -1. [**^**](#ref-z-test) Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), *Proceedings of the symposium on statistical association methods for mechanized documentation* (pp. 61–148). National Bureau of Standards. +1. [**^**](#ref-z-test) Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), *Statistical association methods for mechanized documentation: Symposium proceedings* (pp. 61–148). National Bureau of Standards. 1. 
[**^**](#ref-me) Dias, G., Guilloré, S., & Pereira Lopes, J. G. (1999). Language independent automatic acquisition of rigid multiword units from unrestricted text corpora. In A. Condamines, C. Fabre, & M. Péry-Woodley (Eds.), *TALN'99: 6ème Conférence Annuelle Sur le Traitement Automatique des Langues Naturelles* (pp. 333–339). TALN. @@ -1625,9 +1626,11 @@ Measure of Effect Size|Formula 1. [**^**](#ref-logttr) [**^**](#ref-logttr) Dugast, D. (1979). *Vocabulaire et stylistique: I théâtre et dialogue, travaux de linguistique quantitative*. Slatkine. -1. [**^**](#ref-log-likehood-ratio-test) Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. *Computational Linguistics*, *19*(1), 61–74. +1. [**^**](#ref-log-likehood-ratio-test) [**^**](#ref-pearsons-chi-squared-test) Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. *Computational Linguistics*, *19*(1), 61–74. 1. [**^**](#ref-jaccard-index) [**^**](#ref-mi) Dunning, T. E. (1998). *Finding structure in text, genome and other symbolic sequences* [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 + +1. [**^**](#ref-conditional-probability) Durrant, P. (2008). *High frequency collocations and second language learning* [Doctoral dissertation, University of Nottingham]. Nottingham eTheses. https://eprints.nottingham.ac.uk/10622/1/final_thesis.pdf 1. [**^**](#ref-osman) El-Haj, M., & Rayson, P. (2016). OSMAN: A novel Arabic readability metric. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), *Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)* (pp. 250–255). European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2016/index.html @@ -1678,11 +1681,11 @@ Linguistic Computing Bulletin*, *7*(2), 172–177. 1. [**^**](#ref-re) Kandel, L., & Moles, A. 
(1958). Application de l’indice de flesch à la langue française. *The Journal of Educational Research*, *21*, 283–287. -1. [**^**](#ref-mann-whiteney-u-test) Kilgarriff, A. (2001). Comparing corpora. *International Journal of Corpus Linguistics*, *6*(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil +1. [**^**](#ref-fishers-exact-test) [**^**](#ref-log-likehood-ratio-test) [**^**](#ref-mann-whiteney-u-test) [**^**](#ref-im3) [**^**](#ref-pmi) [**^**](#ref-im2) Kilgarriff, A. (2001). Comparing corpora. *International Journal of Corpus Linguistics*, *6*(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil -1. [**^**](#ref-kilgarriffs-ratio) Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), *Proceedings of the Corpus Linguistics Conference 2009* (p. 171). University of Liverpool. +1. [**^**](#ref-kilgarriffs-ratio) Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), *Proceedings of the Corpus Linguistics Conference 2009 (CL2009)* (Article 171). University of Liverpool. -1. [**^**](#ref-mi-log-f) Kilgarriff, A., & Tugwell, D. (2002). WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In *Proceedings of the 8th Machine Translation Summit* (pp. 187–190). European Association for Machine Translation. +1. [**^**](#ref-mi-log-f) Kilgarriff, A., & Tugwell, D. (2001). WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In B. Maegaard (Ed.), *Proceedings of Machine Translation Summit VIII* (pp. 187–190). European Association for Machine Translation. 1. [**^**](#ref-ari) [**^**](#ref-gl) [**^**](#ref-fog-index) Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). *Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel* (Report No. RBR 8-75). Naval Air Station Memphis. 
https://apps.dtic.mil/sti/pdfs/ADA006655.pdf @@ -1700,7 +1703,7 @@ Linguistic Computing Bulletin*, *7*(2), 172–177. 1. [**^**](#ref-gulpease) Lucisano, P., & Emanuela Piemontese, M. (1988). GULPEASE: A formula for the prediction of the difficulty of texts in Italian. *Scuola e Città*, *39*(3), 110–124. -1. [**^**](#ref-num-syls-luong-nguyen-dinh-1000) [**^**](#ref-luong-nguyen-dinhs-readability-formula) Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. *2018 10th International Conference on Knowledge and Systems Engineering (KSE)* (pp. 198–202). IEEE. https://doi.org/10.1109/KSE.2018.8573379 +1. [**^**](#ref-num-syls-luong-nguyen-dinh-1000) [**^**](#ref-luong-nguyen-dinhs-readability-formula) Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. In T. M. Phuong & M. L. Nguyen (Eds.), *Proceedings of 2018 10th International Conference on Knowledge and Systems Engineering (KSE)* (pp. 198–202). IEEE. https://doi.org/10.1109/KSE.2018.8573379 1. [**^**](#ref-lynes-d3) Lyne, A. A. (1985). Dispersion. In A. A. Lyne (Ed.), *The vocabulary of French business correspondence: Word frequencies, collocations, and problems of lexicometric method* (pp. 101–124). Slatkine. @@ -1710,7 +1713,7 @@ Linguistic Computing Bulletin*, *7*(2), 172–177. 1. [**^**](#ref-eflaw) McAlpine, R. (2006). *From plain English to global English*. Journalism Online. Retrieved October 31, 2024, from https://www.angelfire.com/nd/nirmaldasan/journalismonline/fpetge.html -1. [**^**](#ref-mtld) McCarthy, P. M. (2005). *An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD)* [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global. +1. [**^**](#ref-mtld) McCarthy, P. M. (2005). 
*An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD)* (Publication No. 3199485) [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global. 1. [**^**](#ref-hdd) [**^**](#ref-mtld) McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. *Behavior Research Methods*, *42*(2), 381–392. https://doi.org/10.3758/BRM.42.2.381 @@ -1731,6 +1734,8 @@ Linguistic Computing Bulletin*, *7*(2), 172–177. 1. [**^**](#ref-fishers-exact-test) Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), *Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference* (pp. 188–200). The South–Central Regional SAS Users' Group. 1. [**^**](#ref-min-sensitivity) Pedersen, T. (1998). Dependent bigram identification. In *Proceedings of the Fifteenth National Conference on Artificial Intelligence* (p. 1197). AAAI Press. + +1. [**^**](#ref-odds-ratio) Pecina, P. (2005). An extensive empirical study of collocation extraction methods. In C. Callison-Burch & S. Wan (Eds.), *Proceedings of the Student Research Workshop* (pp. 13–18). Association for Computational Linguistics. 1. [**^**](#ref-fog-index) Pisarek, W. (1969). Jak mierzyć zrozumiałość tekstu? *Zeszyty Prasoznawcze*, *4*(42), 35–48. @@ -1746,7 +1751,7 @@ Linguistic Computing Bulletin*, *7*(2), 172–177. 1. [**^**](#ref-rosengrens-s) [**^**](#ref-rosengrens-kf) Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. *Études de linguistique appliquée*, *1*, 103–127. -1. [**^**](#ref-log-dice) Rychlý, P. (2008). A lexicographyer-friendly association score. In P. Sojka & A. Horák (Eds.), *Proceedings of Second Workshop on Recent Advances in Slavonic Natural Languages Processing*. Masaryk University +1. [**^**](#ref-log-dice) Rychlý, P. (2008). 
A lexicographer-friendly association score. In P. Sojka & A. Horák (Eds.), *Proceedings of Second Workshop on Recent Advances in Slavonic Natural Languages Processing* (pp. 6–9). Masaryk University. 1. [**^**](#ref-ald) [**^**](#ref-fald) [**^**](#ref-arf) [**^**](#ref-farf) [**^**](#ref-awt) [**^**](#ref-fawt) Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. *Journal of Quantitative Linguistics*, *9*(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 diff --git a/doc/measures/effect_size/conditional_probability.svg b/doc/measures/effect_size/conditional_probability.svg new file mode 100644 index 000000000..547f7ed3e --- /dev/null +++ b/doc/measures/effect_size/conditional_probability.svg @@ -0,0 +1,29 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/tests/test_colligation_extractor.py b/tests/test_colligation_extractor.py index 36031b8b0..cadaa7fb7 100644 --- a/tests/test_colligation_extractor.py +++ b/tests/test_colligation_extractor.py @@ -34,12 +34,12 @@ def test_colligation_extractor(): tests_statistical_significance = [ test_statistical_significance for test_statistical_significance, vals in main.settings_global['tests_statistical_significance'].items() - if vals['collocation_extractor'] + if vals['collocation'] ] measures_bayes_factor = [ measure_bayes_factor for measure_bayes_factor, vals in main.settings_global['measures_bayes_factor'].items() - if vals['collocation_extractor'] + if vals['collocation'] ] measures_effect_size = list(main.settings_global['measures_effect_size'].keys()) diff --git a/tests/test_collocation_extractor.py b/tests/test_collocation_extractor.py index 90bc79fb5..b0d16f3b1 100644 --- a/tests/test_collocation_extractor.py +++ b/tests/test_collocation_extractor.py @@ -34,12 +34,12 @@ def test_collocation_extractor(): tests_statistical_significance = [ test_statistical_significance for test_statistical_significance, vals in 
main.settings_global['tests_statistical_significance'].items() - if vals['collocation_extractor'] + if vals['collocation'] ] measures_bayes_factor = [ measure_bayes_factor for measure_bayes_factor, vals in main.settings_global['measures_bayes_factor'].items() - if vals['collocation_extractor'] + if vals['collocation'] ] measures_effect_size = list(main.settings_global['measures_effect_size'].keys()) diff --git a/tests/test_keyword_extractor.py b/tests/test_keyword_extractor.py index 802512301..d5073c8ce 100644 --- a/tests/test_keyword_extractor.py +++ b/tests/test_keyword_extractor.py @@ -31,12 +31,12 @@ def test_keyword_extractor(): tests_statistical_significance = [ test_statistical_significance for test_statistical_significance, vals in main.settings_global['tests_statistical_significance'].items() - if vals['keyword_extractor'] + if vals['keyword'] ] measures_bayes_factor = [ measure_bayes_factor for measure_bayes_factor, vals in main.settings_global['measures_bayes_factor'].items() - if vals['keyword_extractor'] + if vals['keyword'] ] measures_effect_size = list(main.settings_global['measures_effect_size'].keys()) diff --git a/tests/tests_measures/test_measures_adjusted_freq.py b/tests/tests_measures/test_measures_adjusted_freq.py index db22c8c7a..9e88403b5 100644 --- a/tests/tests_measures/test_measures_adjusted_freq.py +++ b/tests/tests_measures/test_measures_adjusted_freq.py @@ -22,7 +22,7 @@ main = wl_test_init.Wl_Test_Main() -# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 410) +# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 
410 def test_fald(): assert round(wl_measures_adjusted_freq.fald(main, test_measures_dispersion.TOKENS, 'a'), 3) == 11.764 assert wl_measures_adjusted_freq.fald(main, test_measures_dispersion.TOKENS, 'aa') == 0 @@ -36,9 +36,9 @@ def test_fawt(): assert wl_measures_adjusted_freq.fawt(main, test_measures_dispersion.TOKENS, 'aa') == 0 # References: -# Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/ -# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. (p. 122) -# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 409) +# Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. ETS Research Bulletin Series, 1970(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x | p. 13 +# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. | p. 122 +# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 409 def test_carrolls_um(): assert round(wl_measures_adjusted_freq.carrolls_um(main, [2, 1, 1, 1, 0]), 2) == 4.31 assert round(wl_measures_adjusted_freq.carrolls_um(main, [4, 2, 1, 1, 0]), 3) == 6.424 @@ -46,9 +46,9 @@ def test_carrolls_um(): assert wl_measures_adjusted_freq.carrolls_um(main, [0, 0, 0, 0, 0]) == 0 # References -# Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. 
Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x -# Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée, 1, 103–127. (p. 115) -# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. (p. 122) +# Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. ETS Research Bulletin Series, 1970(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x | p. 14 +# Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée, 1, 103–127. | p. 115 +# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. | p. 122 def test_juillands_u(): assert round(wl_measures_adjusted_freq.juillands_u(main, [0, 4, 3, 2, 1]), 2) == 6.46 assert round(wl_measures_adjusted_freq.juillands_u(main, [2, 2, 2, 2, 2]), 0) == 10 @@ -56,9 +56,9 @@ def test_juillands_u(): assert wl_measures_adjusted_freq.juillands_u(main, [0, 0, 0, 0, 0]) == 0 # References: -# Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée, 1, 103–127. (p. 117) -# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. (p. 122) -# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 409) +# Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. 
Études de linguistique appliquée, 1, 103–127. | p. 117 +# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. | p. 122 +# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 409 def test_rosengres_kf(): assert round(wl_measures_adjusted_freq.rosengrens_kf(main, [2, 2, 2, 2, 1]), 2) == 8.86 assert round(wl_measures_adjusted_freq.rosengrens_kf(main, [4, 2, 1, 1, 0]), 3) == 5.863 @@ -66,14 +66,14 @@ def test_rosengres_kf(): assert wl_measures_adjusted_freq.rosengrens_kf(main, [0, 0, 0, 0, 0]) == 0 # References: -# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. (p. 122) -# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 409) +# Engwall, G. (1974). Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. | p. 122 +# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 409 def test_engwalls_fm(): assert round(wl_measures_adjusted_freq.engwalls_fm(main, [4, 2, 1, 1, 0]), 1) == 6.4 assert round(wl_measures_adjusted_freq.engwalls_fm(main, [1, 2, 3, 4, 5]), 0) == 15 assert wl_measures_adjusted_freq.engwalls_fm(main, [0, 0, 0, 0, 0]) == 0 -# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 409) +# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. 
International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 409 def test_kromers_ur(): assert round(wl_measures_adjusted_freq.kromers_ur(main, [2, 1, 1, 1, 0]), 1) == 4.5 assert wl_measures_adjusted_freq.kromers_ur(main, [0, 0, 0, 0, 0]) == 0 diff --git a/tests/tests_measures/test_measures_dispersion.py b/tests/tests_measures/test_measures_dispersion.py index 35fd1329a..990ba65fa 100644 --- a/tests/tests_measures/test_measures_dispersion.py +++ b/tests/tests_measures/test_measures_dispersion.py @@ -21,7 +21,7 @@ main = wl_test_init.Wl_Test_Main() -# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (pp. 406, 410) +# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | pp. 406, 410 TOKENS = 'b a m n i b e u p k | b a s a t b e w q n | b c a g a b e s t a | b a g h a b e a a t | b a h a a b e a x a'.replace('|', '').split() DISTS = [2, 10, 2, 9, 2, 5, 2, 3, 3, 1, 3, 2, 1, 3, 2] @@ -41,14 +41,14 @@ def test_awt(): assert wl_measures_dispersion.awt(main, TOKENS, 'a') == 3.18 assert wl_measures_dispersion.awt(main, TOKENS, 'aa') == 0 -# Reference: Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x +# Reference: Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. ETS Research Bulletin Series, 1970(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x | p. 
13 def test_carrolls_d2(): assert round(wl_measures_dispersion.carrolls_d2(main, [2, 1, 1, 1, 0]), 4) == 0.8277 assert wl_measures_dispersion.carrolls_d2(main, [0, 0, 0, 0]) == 0 # References: -# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 416) -# Lijffijt, J., & Gries, S. T. (2012). Correction to Stefan Th. Gries’ “dispersions and adjusted frequencies in corpora” International Journal of Corpus Linguistics, 17(1), 147–149. https://doi.org/10.1075/ijcl.17.1.08lij (p. 148) +# Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 416 +# Lijffijt, J., & Gries, S. T. (2012). Correction to Stefan Th. Gries’ “dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics, 17(1), 147–149. https://doi.org/10.1075/ijcl.17.1.08lij | p. 148 def test_griess_dp(): main.settings_custom['measures']['dispersion']['griess_dp']['apply_normalization'] = False @@ -60,22 +60,22 @@ def test_griess_dp(): assert round(wl_measures_dispersion.griess_dp(main, [2, 1, 0]), 1) == 0.5 assert wl_measures_dispersion.griess_dp(main, [0, 0, 0, 0]) == 0 -# Reference: Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x +# Reference: Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. ETS Research Bulletin Series, 1970(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x | p. 
14 def test_juillands_d(): assert round(wl_measures_dispersion.juillands_d(main, [0, 4, 3, 2, 1]), 4) == 0.6464 assert wl_measures_dispersion.juillands_d(main, [0, 0, 0, 0]) == 0 -# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 408) +# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 408 def test_lynes_d3(): assert round(wl_measures_dispersion.lynes_d3(main, [1, 2, 3, 4, 5]), 3) == 0.944 assert wl_measures_dispersion.lynes_d3(main, [0, 0, 0, 0]) == 0 -# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 407) +# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 407 def test_rosengrens_s(): assert round(wl_measures_dispersion.rosengrens_s(main, [1, 2, 3, 4, 5]), 3) == 0.937 assert wl_measures_dispersion.rosengrens_s(main, [0, 0, 0, 0]) == 0 -# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri (p. 408) +# Reference: Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri | p. 
408 def test_zhangs_distributional_consistency(): assert round(wl_measures_dispersion.zhangs_distributional_consistency(main, [1, 2, 3, 4, 5]), 3) == 0.937 assert wl_measures_dispersion.zhangs_distributional_consistency(main, [0, 0, 0, 0]) == 0 diff --git a/tests/tests_measures/test_measures_effect_size.py b/tests/tests_measures/test_measures_effect_size.py index f29af43c1..fb0d9aed4 100644 --- a/tests/tests_measures/test_measures_effect_size.py +++ b/tests/tests_measures/test_measures_effect_size.py @@ -35,17 +35,17 @@ def assert_zeros(func, result = 0): numpy.array([result] * 10) ) -# Reference: Gabrielatos, C., & Marchi, A. (2012, September 13–14). Keyness: Appropriate metrics and practical issues [Conference session]. CADS International Conference 2012, University of Bologna, Italy. (pp. 21-22) +# Reference: Gabrielatos, C., & Marchi, A. (2011, November 5). Keyness: Matching metrics to definitions [Conference session]. Corpus Linguistics in the South 1, University of Portsmouth, United Kingdom. https://eprints.lancs.ac.uk/id/eprint/51449/4/Gabrielatos_Marchi_Keyness.pdf | p. 18 def test_pct_diff(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.pct_diff( main, - numpy.array([20] * 2), - numpy.array([1] * 2), - numpy.array([29954 - 20] * 2), - numpy.array([23691 - 1] * 2) - ), 2), - numpy.array([1481.83] * 2) + numpy.array([206523] * 2), + numpy.array([178174] * 2), + numpy.array([959641 - 206523] * 2), + numpy.array([1562358 - 178174] * 2) + ), 1), + numpy.array([88.7] * 2) ) numpy.testing.assert_array_equal( @@ -59,10 +59,23 @@ def test_pct_diff(): numpy.array([float('-inf'), float('inf'), 0]) ) +# Reference: Durrant, P. (2008). High frequency collocations and second language learning [Doctoral dissertation, University of Nottingham]. Nottingham eTheses. https://eprints.nottingham.ac.uk/10622/1/final_thesis.pdf | pp. 
80, 84 +def test_conditional_probability(): + numpy.testing.assert_array_equal( + numpy.round(wl_measures_effect_size.conditional_probability( + main, + numpy.array([28, 28]), + numpy.array([8002, 15740]), + numpy.array([15740, 8002]), + numpy.array([97596164, 97596164]) + ), 3), + numpy.array([0.178, 0.349]) + ) + def test_im3(): assert_zeros(wl_measures_effect_size.im3) -# Reference: Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), pp. 1–38. (p. 13) +# Reference: Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), pp. 1–38. | p. 13 def test_dice_sorensen_coeff(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.dice_sorensen_coeff( @@ -77,7 +90,7 @@ def test_dice_sorensen_coeff(): assert_zeros(wl_measures_effect_size.dice_sorensen_coeff) -# Reference: Hofland, K., & Johanson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities. (p. 471) +# Reference: Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities. | p. 471 def test_diff_coeff(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.diff_coeff( @@ -95,7 +108,7 @@ def test_diff_coeff(): def test_jaccard_index(): assert_zeros(wl_measures_effect_size.jaccard_index) -# Reference: Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (p. 171). University of Liverpool. +# Reference: Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (CL2009) (Article 171). University of Liverpool. 
def test_kilgarriffs_ratio(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.kilgarriffs_ratio( @@ -164,7 +177,7 @@ def test_md(): def test_me(): assert_zeros(wl_measures_effect_size.me) -# Reference: Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 (p. 51) +# Reference: Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 | p. 51 def test_mi(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.mi( @@ -179,7 +192,7 @@ def test_mi(): assert_zeros(wl_measures_effect_size.mi) -# Reference: Pojanapunya, P., & Todd, R. W. (2016). Log-likelihood and odds ratio keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory, 15(1), pp. 133–167. https://doi.org/10.1515/cllt-2015-0030 (p. 154) +# Reference: Pojanapunya, P., & Todd, R. W. (2016). Log-likelihood and odds ratio keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory, 15(1), pp. 133–167. https://doi.org/10.1515/cllt-2015-0030 | p. 154 def test_odds_ratio(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.odds_ratio( @@ -203,7 +216,7 @@ def test_odds_ratio(): numpy.array([float('-inf'), float('inf'), 0]) ) -# Reference: Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29. (p. 24) +# Reference: Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29. | p. 
24 def test_pmi(): numpy.testing.assert_array_equal( numpy.round(wl_measures_effect_size.pmi( @@ -241,6 +254,7 @@ def test_squared_phi_coeff(): if __name__ == '__main__': test_pct_diff() + test_conditional_probability() test_im3() test_dice_sorensen_coeff() test_diff_coeff() diff --git a/tests/tests_measures/test_measures_lexical_density_diversity.py b/tests/tests_measures/test_measures_lexical_density_diversity.py index a64edf280..729e084d9 100644 --- a/tests/tests_measures/test_measures_lexical_density_diversity.py +++ b/tests/tests_measures/test_measures_lexical_density_diversity.py @@ -30,7 +30,7 @@ TOKENS_101 = ['This', 'is', 'a', 'sentence', '.'] * 20 + ['another'] TOKENS_1000 = ['This', 'is', 'a', 'sentence', '.'] * 200 -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 26). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 26 TOKENS_225 = [1] * 11 + [2, 3] * 9 + [4] * 7 + [5, 6] * 6 + [7, 8] * 5 + list(range(9, 16)) * 4 + list(range(16, 22)) * 3 + list(range(22, 40)) * 2 + list(range(40, 125)) def get_test_text(tokens): @@ -130,31 +130,31 @@ def test_popescu_macutek_altmanns_b1_b2_b3_b4_b5(): assert round(b4, 3) == 0.078 assert round(b5, 3) == 0.664 -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 30). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 30 def test_popescus_r1(): r1 = wl_measures_lexical_density_diversity.popescus_r1(main, text_tokens_225) assert round(r1, 4) == 0.8667 -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 39). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 39 def test_popescus_r2(): r2 = wl_measures_lexical_density_diversity.popescus_r2(main, text_tokens_225) assert round(r2, 3) == 0.871 -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 51). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). 
Word frequency studies. Mouton de Gruyter. | p. 51 def test_popescus_r3(): r3 = wl_measures_lexical_density_diversity.popescus_r3(main, text_tokens_225) assert round(r3, 4) == 0.3778 -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 59). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 59 def test_popescus_r4(): r4 = wl_measures_lexical_density_diversity.popescus_r4(main, text_tokens_225) assert round(r4, 4) == 0.6344 -# Reference: Popescu, I.-I. (2009). Word frequency studies (pp. 170, 172). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | pp. 170, 172 def test_repeat_rate(): settings['repeat_rate']['use_data'] = 'Rank-frequency distribution' rr_distribution = wl_measures_lexical_density_diversity.repeat_rate(main, text_tokens_225) @@ -169,7 +169,7 @@ def test_rttr(): assert rttr == 5 / 100 ** 0.5 -# Reference: Popescu, I.-I. (2009). Word frequency studies (pp. 176, 178). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | pp. 176, 178 def test_shannon_entropy(): settings['shannon_entropy']['use_data'] = 'Rank-frequency distribution' h_distribution = wl_measures_lexical_density_diversity.shannon_entropy(main, text_tokens_225) diff --git a/tests/tests_measures/test_measures_statistical_significance.py b/tests/tests_measures/test_measures_statistical_significance.py index d9d925cfa..18aa93813 100644 --- a/tests/tests_measures/test_measures_statistical_significance.py +++ b/tests/tests_measures/test_measures_statistical_significance.py @@ -55,7 +55,7 @@ def test_get_alt(): assert wl_measures_statistical_significance.get_alt('Left-tailed') == 'less' assert wl_measures_statistical_significance.get_alt('Right-tailed') == 'greater' -# References: Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188-200). 
The South–Central Regional SAS Users' Group. (p. 10) +# Reference: Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188–200). The South-Central Regional SAS Users' Group. | p. 10 def test_fishers_exact_test(): settings['fishers_exact_test']['direction'] = 'Two-tailed' test_stats, p_vals = wl_measures_statistical_significance.fishers_exact_test( main, @@ -100,7 +100,7 @@ def test_fishers_exact_test(): assert test_stats == [None] * 2 numpy.testing.assert_array_equal(p_vals, numpy.array([1] * 2)) -# References: Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74. (p. 72) +# Reference: Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74. | p. 72 def test_log_likelihood_ratio_test(): settings['log_likelihood_ratio_test']['apply_correction'] = False gs, _ = wl_measures_statistical_significance.log_likelihood_ratio_test( main, @@ -134,7 +134,7 @@ def test_log_likelihood_ratio_test(): numpy.testing.assert_array_equal(gs, numpy.array([0, 0])) numpy.testing.assert_array_equal(p_vals, numpy.array([1, 1])) -# References: Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil (p. 238) +# Reference: Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil | p. 238 def test_mann_whitney_u_test(): u1s, _ = wl_measures_statistical_significance.mann_whitney_u_test( main, @@ -175,8 +175,8 @@ ) # References: -# Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74. (p. 73) -# Pedersen, T. (1996). Fishing for exactness. In T. 
Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188-200). The South–Central Regional SAS Users' Group. (p. 10) +# Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74. | p. 73 +# Pedersen, T. (1996). Fishing for exactness. In T. Winn (Ed.), Proceedings of the Sixth Annual South-Central Regional SAS Users' Group Conference (pp. 188–200). The South-Central Regional SAS Users' Group. | p. 10 def test_pearsons_chi_squared_test(): settings['pearsons_chi_squared_test']['apply_correction'] = False chi2s, _ = wl_measures_statistical_significance.pearsons_chi_squared_test( main, @@ -209,7 +209,7 @@ def test_pearsons_chi_squared_test(): numpy.testing.assert_array_equal(chi2s, numpy.array([0] * 2)) numpy.testing.assert_array_equal(p_vals, numpy.array([1] * 2)) -# Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press. (pp. 164-165) +# Reference: Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press. | pp. 
164–165 def test_students_t_test_1_sample(): t_stats, _ = wl_measures_statistical_significance.students_t_test_1_sample( main, diff --git a/tests/tests_widgets/test_widgets.py b/tests/tests_widgets/test_widgets.py index a60cbf35e..a58cd43a7 100644 --- a/tests/tests_widgets/test_widgets.py +++ b/tests/tests_widgets/test_widgets.py @@ -125,11 +125,11 @@ def test_wl_widgets_search_settings_tokens(): def test_wl_widgets_context_settings(): wl_widgets.wl_widgets_context_settings(main, tab = 'concordancer') -def test_wl_widgets_measures_wordlist_generator(): - wl_widgets.wl_widgets_measures_wordlist_generator(main) +def test_wl_widgets_measures_wordlist_ngram_generation(): + wl_widgets.wl_widgets_measures_wordlist_ngram_generation(main) -def test_wl_widgets_measures_collocation_extractor(): - wl_widgets.wl_widgets_measures_collocation_extractor(main, tab = 'collocation_extractor') +def test_wl_widgets_measures_collocation_keyword_extraction(): + wl_widgets.wl_widgets_measures_collocation_keyword_extraction(main, tab = 'collocation_extractor') def test_wl_widgets_table_settings(): table = QTableView() @@ -223,8 +223,8 @@ def test_wl_widgets_direction(): test_wl_widgets_search_settings() test_wl_widgets_context_settings() - test_wl_widgets_measures_wordlist_generator() - test_wl_widgets_measures_collocation_extractor() + test_wl_widgets_measures_wordlist_ngram_generation() + test_wl_widgets_measures_collocation_keyword_extraction() test_wl_widgets_table_settings() test_wl_widgets_table_settings_span_position() diff --git a/wordless/wl_colligation_extractor.py b/wordless/wl_colligation_extractor.py index c61cb65c2..6d31aaf82 100644 --- a/wordless/wl_colligation_extractor.py +++ b/wordless/wl_colligation_extractor.py @@ -214,7 +214,10 @@ def __init__(self, main): self.combo_box_measure_bayes_factor, self.label_measure_effect_size, self.combo_box_measure_effect_size - ) = wl_widgets.wl_widgets_measures_collocation_extractor(self, tab = 'collocation_extractor') + ) = 
wl_widgets.wl_widgets_measures_collocation_keyword_extraction( + self, + extraction_type = 'collocation' + ) self.combo_box_limit_searching.addItems([ self.tr('None'), diff --git a/wordless/wl_collocation_extractor.py b/wordless/wl_collocation_extractor.py index 4c13aff35..65ae8e281 100644 --- a/wordless/wl_collocation_extractor.py +++ b/wordless/wl_collocation_extractor.py @@ -213,7 +213,10 @@ def __init__(self, main): self.combo_box_measure_bayes_factor, self.label_measure_effect_size, self.combo_box_measure_effect_size - ) = wl_widgets.wl_widgets_measures_collocation_extractor(self, tab = 'collocation_extractor') + ) = wl_widgets.wl_widgets_measures_collocation_keyword_extraction( + self, + extraction_type = 'collocation' + ) self.combo_box_limit_searching.addItems([ self.tr('None'), diff --git a/wordless/wl_keyword_extractor.py b/wordless/wl_keyword_extractor.py index b893d9711..e9765a10d 100644 --- a/wordless/wl_keyword_extractor.py +++ b/wordless/wl_keyword_extractor.py @@ -128,7 +128,10 @@ def __init__(self, main): self.combo_box_measure_bayes_factor, self.label_measure_effect_size, self.combo_box_measure_effect_size - ) = wl_widgets.wl_widgets_measures_collocation_extractor(self, tab = 'keyword_extractor') + ) = wl_widgets.wl_widgets_measures_collocation_keyword_extraction( + self, + extraction_type = 'keyword' + ) self.combo_box_test_statistical_significance.currentTextChanged.connect(self.generation_settings_changed) self.combo_box_measure_bayes_factor.currentTextChanged.connect(self.generation_settings_changed) diff --git a/wordless/wl_measures/wl_measures_adjusted_freq.py b/wordless/wl_measures/wl_measures_adjusted_freq.py index eff4fabd4..612ae628e 100644 --- a/wordless/wl_measures/wl_measures_adjusted_freq.py +++ b/wordless/wl_measures/wl_measures_adjusted_freq.py @@ -26,8 +26,8 @@ # Euler-Mascheroni Constant C = -scipy.special.digamma(1) -# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. 
Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 # Average logarithmic distance +# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 def fald(main, tokens, search_term): dists = wl_measures_dispersion._get_dists(tokens, search_term) @@ -40,10 +40,12 @@ def fald(main, tokens, search_term): return fald # Average reduced frequency +# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 def farf(main, tokens, search_term): return wl_measures_dispersion.arf(main, tokens, search_term) # Average waiting time +# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 def fawt(main, tokens, search_term): dists = wl_measures_dispersion._get_dists(tokens, search_term) @@ -55,7 +57,7 @@ def fawt(main, tokens, search_term): return fawt # Carroll's Um -# Reference: Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x +# Reference: Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. ETS Research Bulletin Series, 1970(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x def carrolls_um(main, freqs): freq_total = sum(freqs) @@ -65,7 +67,7 @@ def carrolls_um(main, freqs): return um # Engwall's FM -# Reference: Engwall, G. (1974). 
Fréquence et distribution du vocabulaire dans un choix de romans français [Unpublished doctoral dissertation]. Stockholm University. | p. 53 def juillands_u(main, freqs): d = wl_measures_dispersion.juillands_d(main, freqs) u = max(0, d) * sum(freqs) @@ -73,7 +75,7 @@ def juillands_u(main, freqs): return u # Juilland's U -# Reference: Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton. +# Reference: Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton. | p. LXVIII def rosengrens_kf(main, freqs): return numpy.sum(numpy.sqrt(freqs)) ** 2 / len(freqs) diff --git a/wordless/wl_measures/wl_measures_dispersion.py b/wordless/wl_measures/wl_measures_dispersion.py index 5a2572e92..a04e31c0e 100644 --- a/wordless/wl_measures/wl_measures_dispersion.py +++ b/wordless/wl_measures/wl_measures_dispersion.py @@ -23,7 +23,6 @@ from wordless.wl_measures import wl_measures_adjusted_freq -# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 def _get_dists(tokens, search_term): positions = numpy.array([i for i, token in enumerate(tokens) if token == search_term]) @@ -37,6 +36,7 @@ def _get_dists(tokens, search_term): return dists # Average logarithmic distance +# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 def ald(main, tokens, search_term): dists = _get_dists(tokens, search_term) @@ -48,6 +48,7 @@ def ald(main, tokens, search_term): return ald # Average reduced frequency +# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. 
https://doi.org/10.1076/jqul.9.3.215.14124 def arf(main, tokens, search_term): dists = _get_dists(tokens, search_term) @@ -60,6 +61,7 @@ def arf(main, tokens, search_term): return arf # Average waiting time +# Reference: Savický, P., & Hlaváčová, J. (2002). Measures of word commonness. Journal of Quantitative Linguistics, 9(3), 215–231. https://doi.org/10.1076/jqul.9.3.215.14124 def awt(main, tokens, search_term): dists = _get_dists(tokens, search_term) @@ -71,7 +73,7 @@ def awt(main, tokens, search_term): return awt # Carroll's D₂ -# Reference: Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2), 61–65. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x +# Reference: Carroll, J. B. (1970). An alternative to Juilland's usage coefficient for lexical frequencies. ETS Research Bulletin Series, 1970(2), i–15. https://doi.org/10.1002/j.2333-8504.1970.tb00778.x def carrolls_d2(main, freqs): freqs = numpy.array(freqs) @@ -108,7 +110,7 @@ def griess_dp(main, freqs): return dp # Juilland's D -# Reference: Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton. +# Reference: Juilland, A., & Chang-Rodriguez, E. (1964). Frequency dictionary of Spanish words. Mouton. | p. LIII def juillands_d(main, freqs): freqs = numpy.array(freqs) diff --git a/wordless/wl_measures/wl_measures_effect_size.py b/wordless/wl_measures/wl_measures_effect_size.py index 5073da287..348566467 100644 --- a/wordless/wl_measures/wl_measures_effect_size.py +++ b/wordless/wl_measures/wl_measures_effect_size.py @@ -40,15 +40,22 @@ def pct_diff(main, o11s, o12s, o21s, o22s): ) ) +# Conditional probability +# Reference: Durrant, P. (2008). High frequency collocations and second language learning [Doctoral dissertation, University of Nottingham]. Nottingham eTheses. https://eprints.nottingham.ac.uk/10622/1/final_thesis.pdf | p. 
84 +def conditional_probability(main, o11s, o12s, o21s, o22s): + _, _, ox1s, _ = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s) + + return wl_measure_utils.numpy_divide(o11s, ox1s) * 100 + # Cubic association ratio -# Reference: Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid= +# Reference: Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid= | p. 139 def im3(main, o11s, o12s, o21s, o22s): e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 3, e11s)) # Dice-Sørensen coefficient -# Reference: Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. +# Reference: Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. | p. 8 def dice_sorensen_coeff(main, o11s, o12s, o21s, o22s): o1xs, _, ox1s, _ = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s) @@ -56,8 +63,8 @@ def dice_sorensen_coeff(main, o11s, o12s, o21s, o22s): # Difference coefficient # References: -# Hofland, K., & Johanson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities. -# Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. 
Marchi (Eds.), Corpus approaches to discourse: A critical review (pp. 225–258). Routledge. +# Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities. | p. 14 +# Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus approaches to discourse: A critical review (pp. 225–258). Routledge. | p. 236 def diff_coeff(main, o11s, o12s, o21s, o22s): _, _, ox1s, ox2s = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s) @@ -71,12 +78,12 @@ def diff_coeff(main, o11s, o12s, o21s, o22s): ) # Jaccard index -# Reference: Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 +# Reference: Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 | p. 48 def jaccard_index(main, o11s, o12s, o21s, o22s): return wl_measure_utils.numpy_divide(o11s, o11s + o12s + o21s) # Kilgarriff's ratio -# Reference: Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (p. 171). University of Liverpool. +# Reference: Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (CL2009) (Article 171). University of Liverpool. def kilgarriffs_ratio(main, o11s, o12s, o21s, o22s): smoothing_param = main.settings_custom['measures']['effect_size']['kilgarriffs_ratio']['smoothing_param'] @@ -86,14 +93,14 @@ def kilgarriffs_ratio(main, o11s, o12s, o21s, o22s): ) # logDice -# Reference: Rychlý, P. (2008). A lexicographyer-friendly association score. In P. Sojka & A. 
Horák (Eds.), Proceedings of Second Workshop on Recent Advances in Slavonic Natural Languages Processing. Masaryk University +# Reference: Rychlý, P. (2008). A lexicographer-friendly association score. In P. Sojka & A. Horák (Eds.), Proceedings of Second Workshop on Recent Advances in Slavonic Natural Language Processing (pp. 6–9). Masaryk University. def log_dice(main, o11s, o12s, o21s, o22s): o1xs, _, ox1s, _ = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s) return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(2 * o11s, o1xs + ox1s), default = 14) # Log-frequency biased MD -# Reference: Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association. +# Reference: Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association. | p. 621 def lfmd(main, o11s, o12s, o21s, o22s): e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) @@ -121,8 +128,8 @@ def lfmd(main, o11s, o12s, o21s, o22s): # MI.log-f # References: -# Kilgarriff, A., & Tugwell, D. (2002). WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In Proceedings of the 8th Machine Translation Summit (pp. 187–190). European Association for Machine Translation. -# Lexical Computing. (2015, July 8). Statistics used in Sketch Engine. Sketch Engine. https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/ +# Kilgarriff, A., & Tugwell, D. (2001). 
WASP-bench: An MT lexicographers' workstation supporting state-of-the-art lexical disambiguation. In B. Maegaard (Ed.), Proceedings of Machine Translation Summit VIII (pp. 187–190). European Association for Machine Translation. +# Lexical Computing. (2015, July 8). Statistics used in Sketch Engine. Sketch Engine. https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/ | p. 4 def mi_log_f(main, o11s, o12s, o21s, o22s): e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) @@ -139,7 +146,7 @@ def min_sensitivity(main, o11s, o12s, o21s, o22s): ) # Mutual Dependency -# Reference: Thanopoulos, A, Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González, & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association. +# Reference: Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association. | p. 621 def md(main, o11s, o12s, o21s, o22s): e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) @@ -153,7 +160,7 @@ def me(main, o11s, o12s, o21s, o22s): return o11s * wl_measure_utils.numpy_divide(2 * o11s, o1xs + ox1s) # Mutual information -# Reference: Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 +# Reference: Dunning, T. E. (1998). Finding structure in text, genome and other symbolic sequences [Doctoral dissertation, University of Sheffield]. arXiv. https://arxiv.org/pdf/1207.1847 | pp. 
49–52 def mi(main, o11s, o12s, o21s, o22s): oxxs = o11s + o12s + o21s + o22s e11s, e12s, e21s, e22s = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) @@ -200,7 +207,7 @@ def poisson_collocation_measure(main, o11s, o12s, o21s, o22s): ) # Squared association ratio -# Reference: Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University. +# Reference: Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University. | p. 21 def im2(main, o11s, o12s, o21s, o22s): e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) diff --git a/wordless/wl_measures/wl_measures_lexical_density_diversity.py b/wordless/wl_measures/wl_measures_lexical_density_diversity.py index e0491c5cb..9985fccfe 100644 --- a/wordless/wl_measures/wl_measures_lexical_density_diversity.py +++ b/wordless/wl_measures/wl_measures_lexical_density_diversity.py @@ -39,7 +39,7 @@ def brunets_index(main, text): # Corrected TTR # References: # Carroll, J. B. (1964). Language and thought. Prentice-Hall. -# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment (p. 26). Palgrave Macmillan. +# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan. | p. 26 def cttr(main, text): return text.num_types / numpy.sqrt(2 * text.num_tokens) @@ -115,7 +115,7 @@ def honores_stat(main, text): return r # Lexical density -# Reference: Halliday, M. A. K. (1989). Spoken and written language (2nd ed., p. 64). Oxford University Press. +# Reference: Halliday, M. A. K. (1989). Spoken and written language (2nd ed.). Oxford University Press. | p. 
64 def lexical_density(main, text): if text.lang in main.settings_global['pos_taggers']: wl_pos_tagging.wl_pos_tag_universal(main, text.get_tokens_flat(), lang = text.lang, tagged = text.tagged) @@ -135,19 +135,19 @@ def lexical_density(main, text): # LogTTR # Herdan: -# Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics (p. 28). Mouton. +# Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics. Mouton. | p. 28 # Somers: # Somers, H. H. (1966). Statistical methods in literary analysis. In J. Leeds (Ed.), The computer and literary style (pp. 128–140). Kent State University Press. -# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment (p. 28). Palgrave Macmillan. +# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan. | p. 28 # Rubet: # Dugast, D. (1979). Vocabulaire et stylistique: I théâtre et dialogue, travaux de linguistique quantitative. Slatkine. -# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment (p. 28). Palgrave Macmillan. +# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan. | p. 28 # Maas: # Maas, H.-D. (1972). Über den zusammenhang zwischen wortschatzumfang und länge eines textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96. # Dugast: # Dugast, D. (1978). Sur quoi se fonde la notion d’étendue théoretique du vocabulaire? Le Français Moderne, 46, 25–32. # Dugast, D. (1979). Vocabulaire et stylistique: I théâtre et dialogue, travaux de linguistique quantitative. Slatkine. -# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment (p. 
28). Palgrave Macmillan. +# Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan. | p. 28 def logttr(main, text): variant = main.settings_custom['measures']['lexical_density_diversity']['logttr']['variant'] @@ -167,7 +167,7 @@ def logttr(main, text): # Mean segmental TTR # References: # Johnson, W. (1944). Studies in language behavior: I. a program of research. Psychological Monographs, 56(2), 1–15. https://doi.org/10.1037/h0093508 -# McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) [Doctoral dissertation, The University of Memphis] (p. 37). ProQuest Dissertations and Theses Global. +# McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) (Publication No. 3199485) [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global. | p. 37 def msttr(main, text): num_tokens_seg = main.settings_custom['measures']['lexical_density_diversity']['msttr']['num_tokens_in_each_seg'] @@ -187,7 +187,7 @@ def msttr(main, text): # Measure of textual lexical diversity # References: -# McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) [Doctoral dissertation, The University of Memphis] (pp. 95–96, 99–100). ProQuest Dissertations and Theses Global. +# McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) (Publication No. 3199485) [Doctoral dissertation, The University of Memphis]. ProQuest Dissertations and Theses Global. | pp. 95–96, 99–100 # McCarthy, P. M., & Jarvis, S. (2010). 
MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392. https://doi.org/10.3758/BRM.42.2.381 def mtld(main, text): mtlds = numpy.empty(shape = 2) @@ -275,7 +275,7 @@ def popescu_macutek_altmanns_b1_b2_b3_b4_b5(main, text): return b1, b2, b3, b4, b5 # Popescu's R₁ -# Reference: Popescu, I.-I. (2009). Word frequency studies (pp. 18, 30, 33). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | pp. 18, 30, 33 def popescus_r1(main, text): types_freqs = collections.Counter(text.get_tokens_flat()) ranks = numpy.empty(shape = text.num_types) @@ -309,7 +309,7 @@ def popescus_r1(main, text): return r1 # Popescu's R₂ -# Reference: Popescu, I.-I. (2009). Word frequency studies (pp. 35–36, 38). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | pp. 35–36, 38 def popescus_r2(main, text): types_freqs = collections.Counter(text.get_tokens_flat()) freqs_nums_types = sorted(collections.Counter(types_freqs.values()).items()) @@ -344,7 +344,7 @@ def popescus_r2(main, text): return r2 # Popescu's R₃ -# Reference: Popescu, I.-I. (2009). Word frequency studies (pp. 48–49, 53). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | pp. 48–49, 53 def popescus_r3(main, text): types_freqs = collections.Counter(text.get_tokens_flat()) ranks_freqs = [ @@ -373,7 +373,7 @@ def popescus_r3(main, text): return r3 # Popescu's R₄ -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 57). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 57 def popescus_r4(main, text): types_freqs = collections.Counter(text.get_tokens_flat()) @@ -389,7 +389,7 @@ def popescus_r4(main, text): return r4 # Repeat rate -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 166). Mouton de Gruyter. 
+# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 166 def repeat_rate(main, text): use_data = main.settings_custom['measures']['lexical_density_diversity']['repeat_rate']['use_data'] @@ -414,7 +414,7 @@ def rttr(main, text): return text.num_types / numpy.sqrt(text.num_tokens) # Shannon entropy -# Reference: Popescu, I.-I. (2009). Word frequency studies (p. 173). Mouton de Gruyter. +# Reference: Popescu, I.-I. (2009). Word frequency studies. Mouton de Gruyter. | p. 173 def shannon_entropy(main, text): use_data = main.settings_custom['measures']['lexical_density_diversity']['shannon_entropy']['use_data'] @@ -450,7 +450,7 @@ def ttr(main, text): return text.num_types / text.num_tokens # vocd-D -# Reference: Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment (pp. 51, 56–57). Palgrave Macmillan. +# Reference: Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development: Quantification and assessment. Palgrave Macmillan. | pp. 51, 56–57 def vocdd(main, text): def ttr(n, d): return (d / n) * (numpy.sqrt(1 + 2 * n / d) - 1) @@ -480,7 +480,7 @@ def ttr(n, d): return popt[0] # Yule's characteristic K -# Reference: Yule, G. U. (1944). The statistical study of literary vocabulary (pp. 52–53). Cambridge University Press. +# Reference: Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press. | pp. 52–53 def yules_characteristic_k(main, text): types_freqs = collections.Counter(text.get_tokens_flat()) freqs_nums_types = collections.Counter(types_freqs.values()) @@ -493,7 +493,7 @@ def yules_characteristic_k(main, text): return k # Yule's Index of Diversity -# Reference: Williams, C. B. (1970). Style and vocabulary: Numerical studies (p. 100). Griffin. +# Reference: Williams, C. B. (1970). Style and vocabulary: Numerical studies. Griffin. | p. 
100 def yules_index_of_diversity(main, text): types_freqs = collections.Counter(text.get_tokens_flat()) freqs_nums_types = collections.Counter(types_freqs.values()) diff --git a/wordless/wl_measures/wl_measures_readability.py b/wordless/wl_measures/wl_measures_readability.py index d5faaf697..6838c4b15 100644 --- a/wordless/wl_measures/wl_measures_readability.py +++ b/wordless/wl_measures/wl_measures_readability.py @@ -183,7 +183,7 @@ def get_num_sentences_sample(text, sample, sample_start): ) # Al-Heeti's readability formula -# Reference: Al-Heeti, K. N. (1984). Judgment analysis technique applied to readability prediction of Arabic reading material [Doctoral dissertation, University of Northern Colorado] (pp. 102, 104, 106). ProQuest Dissertations and Theses Global. +# Reference: Al-Heeti, K. N. (1984). Judgment analysis technique applied to readability prediction of Arabic reading material [Doctoral dissertation, University of Northern Colorado]. ProQuest Dissertations and Theses Global. | pp. 102, 104, 106 def rd(main, text): if text.lang == 'ara': text = get_nums(main, text) @@ -232,9 +232,9 @@ def aari(main, text): # Automated Readability Index # References: -# Smith, E. A., & Senter, R. J. (1967). Automated readability index (p. 8). Aerospace Medical Research Laboratories. https://apps.dtic.mil/sti/pdfs/AD0667273.pdf +# Smith, E. A., & Senter, R. J. (1967). Automated readability index. Aerospace Medical Research Laboratories. https://apps.dtic.mil/sti/pdfs/AD0667273.pdf | p. 8 # Navy: -# Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75, p. 14). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf +# Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). 
Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf | p. 14 def ari(main, text): text = get_nums(main, text) @@ -257,7 +257,7 @@ def ari(main, text): return ari # Bormuth's cloze mean & grade placement -# Reference: Bormuth, J. R. (1969). Development of readability analyses (pp. 152, 160). U.S. Department of Health, Education, and Welfare. http://files.eric.ed.gov/fulltext/ED029166.pdf +# Reference: Bormuth, J. R. (1969). Development of readability analyses. U.S. Department of Health, Education, and Welfare. http://files.eric.ed.gov/fulltext/ED029166.pdf | pp. 152, 160 def bormuths_cloze_mean(main, text): if text.lang.startswith('eng_'): text = get_nums(main, text) @@ -515,7 +515,7 @@ def devereux_readability_index(main, text): # Dickes-Steiwer Handformel # References: # Dickes, P. & Steiwer, L. (1977). Ausarbeitung von lesbarkeitsformeln für die deutsche sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 9(1), 20–28. -# Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache (p. 57). Jugend und Volk. +# Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk. | p. 57 def dickes_steiwer_handformel(main, text): text = get_nums(main, text) @@ -547,7 +547,7 @@ def elf(main, text): return elf # Flesch-Kincaid grade level -# Reference: Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75, p. 14). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf +# Reference: Kincaid, J. P., Fishburne, R. P., Rogers, R. 
L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf | p. 14 def gl(main, text): if text.lang in main.settings_global['syl_tokenizers']: text = get_nums(main, text) @@ -571,7 +571,7 @@ def gl(main, text): # Powers-Sumner-Kearl: # Powers, R. D., Sumner, W. A., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99–105. https://doi.org/10.1037/h0043254 # Dutch (Douma): -# Douma, W. H. (1960). De leesbaarheid van landbouwbladen: Een onderzoek naar en een toepassing van leesbaarheidsformules [Readability of Dutch farm papers: A discussion and application of readability-formulas] (p. 453). Afdeling Sociologie en Sociografie van de Landbouwhogeschool Wageningen. https://edepot.wur.nl/276323 +# Douma, W. H. (1960). De leesbaarheid van landbouwbladen: Een onderzoek naar en een toepassing van leesbaarheidsformules [Readability of Dutch farm papers: A discussion and application of readability-formulas]. Afdeling Sociologie en Sociografie van de Landbouwhogeschool Wageningen. https://edepot.wur.nl/276323 | p. 453 # Dutch (Brouwer's Leesindex A): # Brouwer, R. H. M. (1963). Onderzoek naar de leesmoeilijkheid van Nederlands proza. Paedagogische Studiën, 40, 454–464. https://objects.library.uu.nl/reader/index.php?obj=1874-205260&lan=en # French: @@ -579,17 +579,17 @@ def gl(main, text): # Sitbon, L., Bellot, P., & Blache, P. (2007). Eléments pour adapter les systèmes de recherche d’information aux dyslexiques. Revue TAL : traitement automatique des langues, 48(2), 123–147. # German: # Amstad, T. (1978). Wie verständlich sind unsere Zeitungen? [Unpublished doctoral dissertation]. University of Zurich. -# Bamberger, R., & Vanecek, E. (1984). 
Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache (p. 56). Jugend und Volk. +# Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk. | p. 56 # Italian: # Franchina, V., & Vacca, R. (1986). Adaptation of Flesh readability index on a bilingual text written by the same author both in Italian and English languages. Linguaggi, 3, 47–49. # Garais, E. (2011). Web applications readability. Journal of Information Systems and Operations Management, 5(1), 117–121. http://www.rebe.rau.ro/RePEc/rau/jisomg/SP11/JISOM-SP11-A13.pdf # Russian: -# Oborneva, I. V. (2006). Автоматизированная оценка сложности учебных текстов на основе статистических параметров [Doctoral dissertation, Institute for Strategy of Education Development of the Russian Academy of Education] (p. 13). Freereferats.ru. https://static.freereferats.ru/_avtoreferats/01002881899.pdf?ver=3 +# Oborneva, I. V. (2006). Автоматизированная оценка сложности учебных текстов на основе статистических параметров [Doctoral dissertation, Institute for Strategy of Education Development of the Russian Academy of Education]. Freereferats.ru. https://static.freereferats.ru/_avtoreferats/01002881899.pdf?ver=3 | p. 13 # Spanish (Fernández Huerta): # Fernández Huerta, J. (1959). Medidas sencillas de lecturabilidad. Consigna, 214, 29–32. # Garais, E. (2011). Web applications readability. Journal of Information Systems and Operations Management, 5(1), 117–121. http://www.rebe.rau.ro/RePEc/rau/jisomg/SP11/JISOM-SP11-A13.pdf # Spanish (Szigriszt Pazos): -# Szigriszt Pazos, F. (1993). Sistemas predictivos de legibilidad del mensaje escrito: Formula de perspicuidad [Doctoral dissertation, Complutense University of Madrid] (p. 247). Biblos-e Archivo. https://repositorio.uam.es/bitstream/handle/10486/2488/3907_barrio_cantalejo_ines_maria.pdf?sequence=1&isAllowed=y +# Szigriszt Pazos, F. (1993). 
Sistemas predictivos de legibilidad del mensaje escrito: Formula de perspicuidad [Doctoral dissertation, Complutense University of Madrid]. Biblos-e Archivo. https://repositorio.uam.es/bitstream/handle/10486/2488/3907_barrio_cantalejo_ines_maria.pdf?sequence=1&isAllowed=y | p. 247 # Ukrainian: # Partiko, Z. V. (2001). Zagal’ne redaguvannja. Normativni osnovi. Afiša. # Grzybek, P. (2010). Text difficulty and the Arens-Altmann law. In P. Grzybek, E. Kelih, & J. Mačutek (eds.), Text and language: Structures · functions · interrelations quantitative perspectives. Praesens Verlag. https://www.iqla.org/includes/basic_references/qualico_2009_proceedings_Grzybek_Kelih_Macutek_2009.pdf @@ -707,7 +707,7 @@ def re_farr_jenkins_paterson(main, text): return re # FORCAST -# Reference: Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material (p. 3). Human Resource Research Organization. https://ia902703.us.archive.org/31/items/ERIC_ED076707/ERIC_ED076707.pdf +# Reference: Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material. Human Resource Research Organization. https://ia902703.us.archive.org/31/items/ERIC_ED076707/ERIC_ED076707.pdf | p. 3 def rgl(main, text): if text.lang in main.settings_global['syl_tokenizers']: text = get_nums(main, text) @@ -728,7 +728,7 @@ def rgl(main, text): # Fucks's Stilcharakteristik # References: # Fucks, W. (1955). Unterschied des prosastils von dichtern und anderen schriftstellern: Ein beispiel mathematischer stilanalyse. Bouvier. -# Briest, W. (1974). Kann man Verständlichkeit messen? STUF - Language Typology and Universals, 27(1-3), 543–563. https://doi.org/10.1524/stuf.1974.27.13.543 +# Briest, W. (1974). Kann man Verständlichkeit messen? STUF - Language Typology and Universals, 27(1–3), 543–563. 
https://doi.org/10.1524/stuf.1974.27.13.543 def fuckss_stilcharakteristik(main, text): if text.lang in main.settings_global['syl_tokenizers']: text = get_nums(main, text) @@ -764,11 +764,11 @@ def gulpease(main, text): # Gunning Fog Index # References: -# Gunning, R. (1968). The technique of clear writing (revised ed., p. 38). McGraw-Hill Book Company. +# Gunning, R. (1968). The technique of clear writing (revised ed.). McGraw-Hill Book Company. | p. 38 # Powers-Sumner-Kearl: # Powers, R. D., Sumner, W. A., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99–105. https://doi.org/10.1037/h0043254 # Navy: -# Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75, p. 14). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf +# Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and Flesch reading ease formula) for Navy enlisted personnel (Report No. RBR 8-75). Naval Air Station Memphis. https://apps.dtic.mil/sti/pdfs/ADA006655.pdf | p. 14 # Polish: # Pisarek, W. (1969). Jak mierzyć zrozumiałość tekstu? Zeszyty Prasoznawcze, 4(42), 35–48. def fog_index(main, text): @@ -889,7 +889,7 @@ def mu(main, text): return mu # Lensear Write Formula -# Reference: O’Hayre, J. (1966). Gobbledygook has gotta go (p. 8). U.S. Government Printing Office. https://www.governmentattic.org/15docs/Gobbledygook_Has_Gotta_Go_1966.pdf +# Reference: O’Hayre, J. (1966). Gobbledygook has gotta go. U.S. Government Printing Office. https://www.governmentattic.org/15docs/Gobbledygook_Has_Gotta_Go_1966.pdf | p. 
8 def lensear_write_formula(main, text): if text.lang.startswith('eng_') and text.lang in main.settings_global['syl_tokenizers']: text = get_nums(main, text) @@ -945,10 +945,10 @@ def lix(main, text): # Lorge Readability Index # References: # Lorge, I. (1944). Predicting readability. Teachers College Record, 45, 404–419. -# DuBay, W. H. (2006). In W. H. DuBay (Ed.), The classic readability studies (pp. 46–60). Impact Information. https://files.eric.ed.gov/fulltext/ED506404.pdf +# Lorge, I. (1944). Predicting readability. In W. H. DuBay (Ed.), The classic readability studies (pp. 46–60). Impact Information. https://files.eric.ed.gov/fulltext/ED506404.pdf # Corrected: # Lorge, I. (1948). The Lorge and Flesch readability formulae: A correction. School and Society, 67, 141–142. -# DuBay, W. H. (2006). In W. H. DuBay (Ed.), The classic readability studies (pp. 46–60). Impact Information. https://files.eric.ed.gov/fulltext/ED506404.pdf +# Lorge, I. (1944). Predicting readability. In W. H. DuBay (Ed.), The classic readability studies (pp. 46–60). Impact Information. https://files.eric.ed.gov/fulltext/ED506404.pdf def lorge_readability_index(main, text): if text.lang.startswith('eng_'): text = get_nums(main, text) @@ -987,7 +987,7 @@ def lorge_readability_index(main, text): return lorge # Luong-Nguyen-Dinh's readability formula -# Reference: Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. 2018 10th International Conference on Knowledge and Systems Engineering (KSE) (pp. 198–202). IEEE. https://doi.org/10.1109/KSE.2018.8573379 +# Reference: Luong, A.-V., Nguyen, D., & Dinh, D. (2018). A new formula for Vietnamese text readability assessment. In T. M. Phuong & M. L. Nguyen (Eds.), Proceedings of 2018 10th International Conference on Knowledge and Systems Engineering (KSE) (pp. 198–202). IEEE. 
https://doi.org/10.1109/KSE.2018.8573379 def luong_nguyen_dinhs_readability_formula(main, text): if text.lang == 'vie': text = get_nums(main, text) @@ -1026,7 +1026,7 @@ def eflaw(main, text): return eflaw # neue Wiener Literaturformeln -# Reference: Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache (p. 82). Jugend und Volk. +# Reference: Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk. | p. 82 def nwl(main, text): if text.lang.startswith('deu_'): text = get_nums(main, text) @@ -1054,7 +1054,7 @@ def nwl(main, text): return nwl # neue Wiener Sachtextformel -# Reference: Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache (pp. 83–84). Jugend und Volk. +# Reference: Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk. | pp. 83–84 def nws(main, text): if text.lang.startswith('deu_'): text = get_nums(main, text) @@ -1173,7 +1173,7 @@ def rix(main, text): # References: # McLaughlin, G. H. (1969). SMOG Grading: A new readability formula. Journal of Reading, 12(8), 639–646. # German: -# Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache (p. 78). Jugend und Volk. +# Bamberger, R., & Vanecek, E. (1984). Lesen-verstehen-lernen-schreiben: Die schwierigkeitsstufen von texten in deutscher sprache. Jugend und Volk. | p. 
78 def smog_grading(main, text): if text.lang in main.settings_global['syl_tokenizers']: text = get_nums(main, text) diff --git a/wordless/wl_measures/wl_measures_statistical_significance.py b/wordless/wl_measures/wl_measures_statistical_significance.py index 39aab6407..a6d9f9cfe 100644 --- a/wordless/wl_measures/wl_measures_statistical_significance.py +++ b/wordless/wl_measures/wl_measures_statistical_significance.py @@ -109,7 +109,7 @@ def log_likelihood_ratio_test(main, o11s, o12s, o21s, o22s): return gs, p_vals # Mann-Whitney U test -# References: Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 232–263. https://doi.org/10.1075/ijcl.6.1.05kil +# Reference: Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133. https://doi.org/10.1075/ijcl.6.1.05kil | pp. 103–104 def mann_whitney_u_test(main, freqs_x1s, freqs_x2s): settings = main.settings_custom['measures']['statistical_significance']['mann_whitney_u_test'] @@ -131,8 +131,8 @@ def mann_whitney_u_test(main, freqs_x1s, freqs_x2s): # Pearson's chi-squared test # References: -# Hofland, K., & Johanson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities. -# Oakes, M. P. (1998). Statistics for corpus linguistics. Edinburgh University Press. +# Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities. | p. 12 +# Oakes, M. P. (1998). Statistics for corpus linguistics. Edinburgh University Press. | p. 25 def pearsons_chi_squared_test(main, o11s, o12s, o21s, o22s): settings = main.settings_custom['measures']['statistical_significance']['pearsons_chi_squared_test'] @@ -155,7 +155,7 @@ def pearsons_chi_squared_test(main, o11s, o12s, o21s, o22s): return chi2s, p_vals # Student's t-test (1-sample) -# References: Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). Using statistics in lexical analysis. 
In U. Zernik (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon (pp. 115–164). Psychology Press. +# Reference: Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). Using statistics in lexical analysis. In U. Zernik (Ed.), Lexical acquisition: Exploiting on-line resources to build a lexicon (pp. 115–164). Psychology Press. | pp. 120–126 def students_t_test_1_sample(main, o11s, o12s, o21s, o22s): settings = main.settings_custom['measures']['statistical_significance']['students_t_test_1_sample'] @@ -178,7 +178,7 @@ def students_t_test_1_sample(main, o11s, o12s, o21s, o22s): return t_stats, p_vals # Student's t-test (2-sample) -# References: Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. Language and Computers, 68, 247–269. +# Reference: Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. Language and Computers, 68, 247–269. | pp. 252–253 def students_t_test_2_sample(main, freqs_x1s, freqs_x2s): settings = main.settings_custom['measures']['statistical_significance']['students_t_test_2_sample'] @@ -218,7 +218,7 @@ def _z_test_p_val(z_scores, direction): return p_vals # Z-test -# References: Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), Proceedings of the symposium on statistical association methods for mechanized documentation (pp. 61–148). National Bureau of Standards. +# Reference: Dennis, S. F. (1964). The construction of a thesaurus automatically from a sample of text. In M. E. Stevens, V. E. Giuliano, & L. B. Heilprin (Eds.), Statistical association methods for mechanized documentation: Symposium proceedings (pp. 61–148). National Bureau of Standards. | p. 
69 def z_test(main, o11s, o12s, o21s, o22s): settings = main.settings_custom['measures']['statistical_significance']['z_test'] diff --git a/wordless/wl_ngram_generator.py b/wordless/wl_ngram_generator.py index d81f12738..da5a48270 100644 --- a/wordless/wl_ngram_generator.py +++ b/wordless/wl_ngram_generator.py @@ -244,7 +244,7 @@ def __init__(self, main): self.combo_box_measure_dispersion, self.label_measure_adjusted_freq, self.combo_box_measure_adjusted_freq - ) = wl_widgets.wl_widgets_measures_wordlist_generator(self) + ) = wl_widgets.wl_widgets_measures_wordlist_ngram_generation(self) self.spin_box_allow_skipped_tokens.setRange(1, 20) diff --git a/wordless/wl_settings/wl_settings_global.py b/wordless/wl_settings/wl_settings_global.py index 89c570475..329417dad 100644 --- a/wordless/wl_settings/wl_settings_global.py +++ b/wordless/wl_settings/wl_settings_global.py @@ -3594,18 +3594,19 @@ def init_settings_global(): 'effect_size': { _tr('wl_settings_global', 'None'): 'none', '%DIFF': 'pct_diff', + _tr('wl_settings_global', 'Conditional probability'): 'conditional_probability', _tr('wl_settings_global', 'Cubic association ratio'): 'im3', - _tr('wl_settings_global', "Dice's coefficient"): 'dices_coeff', + _tr('wl_settings_global', "Dice-Sørensen coefficient"): 'dice_sorensen_coeff', _tr('wl_settings_global', 'Difference coefficient'): 'diff_coeff', _tr('wl_settings_global', 'Jaccard index'): 'jaccard_index', _tr('wl_settings_global', "Kilgarriff's ratio"): 'kilgarriffs_ratio', 'logDice': 'log_dice', _tr('wl_settings_global', 'Log-frequency biased MD'): 'lfmd', - _tr('wl_settings_global', 'Log ratio'): 'log_ratio', + _tr('wl_settings_global', 'Log Ratio'): 'log_ratio', 'MI.log-f': 'mi_log_f', _tr('wl_settings_global', 'Minimum sensitivity'): 'min_sensitivity', - _tr('wl_settings_global', 'Mutual dependency'): 'md', - _tr('wl_settings_global', 'Mutual expectation'): 'me', + _tr('wl_settings_global', 'Mutual Dependency'): 'md', + _tr('wl_settings_global', 'Mutual 
Expectation'): 'me', _tr('wl_settings_global', 'Mutual information'): 'mi', _tr('wl_settings_global', 'Odds ratio'): 'or', _tr('wl_settings_global', 'Pointwise mutual information'): 'pmi', @@ -3738,8 +3739,8 @@ def init_settings_global(): 'col_text': None, 'func': None, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': True }, 'fishers_exact_test': { @@ -3747,64 +3748,64 @@ def init_settings_global(): 'col_text': None, 'func': wl_measures_statistical_significance.fishers_exact_test, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': True }, 'log_likelihood_ratio_test': { 'col_text': _tr('wl_settings_global', 'Log-likelihood Ratio'), 'func': wl_measures_statistical_significance.log_likelihood_ratio_test, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': True }, 'mann_whitney_u_test': { 'col_text': 'U1', 'func': wl_measures_statistical_significance.mann_whitney_u_test, 'to_sections': True, - 'collocation_extractor': False, - 'keyword_extractor': True + 'collocation': False, + 'keyword': True }, 'pearsons_chi_squared_test': { 'col_text': 'χ2', 'func': wl_measures_statistical_significance.pearsons_chi_squared_test, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': True }, 'students_t_test_1_sample': { 'col_text': _tr('wl_settings_global', 't-statistic'), 'func': wl_measures_statistical_significance.students_t_test_1_sample, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': False }, 'students_t_test_2_sample': { 'col_text': _tr('wl_settings_global', 't-statistic'), 'func': wl_measures_statistical_significance.students_t_test_2_sample, 'to_sections': True, - 'collocation_extractor': False, - 'keyword_extractor': True + 'collocation': 
False, + 'keyword': True }, 'z_test': { 'col_text': _tr('wl_settings_global', 'z-score'), 'func': wl_measures_statistical_significance.z_test, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': False }, 'z_test_berry_rogghe': { 'col_text': _tr('wl_settings_global', 'z-score'), 'func': wl_measures_statistical_significance.z_test_berry_rogghe, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': False + 'collocation': True, + 'keyword': False } }, @@ -3812,124 +3813,171 @@ def init_settings_global(): 'none': { 'func': None, 'to_sections': None, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': True }, 'log_likelihood_ratio_test': { 'func': wl_measures_bayes_factor.bayes_factor_log_likelihood_ratio_test, 'to_sections': False, - 'collocation_extractor': True, - 'keyword_extractor': True + 'collocation': True, + 'keyword': True }, 'students_t_test_2_sample': { 'func': wl_measures_bayes_factor.bayes_factor_students_t_test_2_sample, 'to_sections': True, - 'collocation_extractor': False, - 'keyword_extractor': True + 'collocation': False, + 'keyword': True }, }, 'measures_effect_size': { 'none': { 'col_text': None, - 'func': None + 'func': None, + 'collocation': True, + 'keyword': True }, 'pct_diff': { 'col_text': '%DIFF', - 'func': wl_measures_effect_size.pct_diff + 'func': wl_measures_effect_size.pct_diff, + 'collocation': False, + 'keyword': True + }, + + 'conditional_probability': { + 'col_text': 'P', + 'func': wl_measures_effect_size.conditional_probability, + 'collocation': True, + 'keyword': False }, 'im3': { 'col_text': 'IM³', - 'func': wl_measures_effect_size.im3 + 'func': wl_measures_effect_size.im3, + 'collocation': True, + 'keyword': True }, 'dice_sorensen_coeff': { - 'col_text': _tr('wl_settings_global', 'Dice-Sørensen coefficient'), - 'func': wl_measures_effect_size.dice_sorensen_coeff + 'col_text': 
_tr('wl_settings_global', 'Dice-Sørensen Coefficient'), + 'func': wl_measures_effect_size.dice_sorensen_coeff, + 'collocation': True, + 'keyword': False }, 'diff_coeff': { 'col_text': _tr('wl_settings_global', 'Difference Coefficient'), - 'func': wl_measures_effect_size.diff_coeff + 'func': wl_measures_effect_size.diff_coeff, + 'collocation': False, + 'keyword': True }, 'jaccard_index': { 'col_text': _tr('wl_settings_global', 'Jaccard Index'), - 'func': wl_measures_effect_size.jaccard_index - }, - - 'lfmd': { - 'col_text': 'LFMD', - 'func': wl_measures_effect_size.lfmd + 'func': wl_measures_effect_size.jaccard_index, + 'collocation': True, + 'keyword': False }, 'kilgarriffs_ratio': { 'col_text': _tr('wl_settings_global', "Kilgarriff's Ratio"), - 'func': wl_measures_effect_size.kilgarriffs_ratio + 'func': wl_measures_effect_size.kilgarriffs_ratio, + 'collocation': False, + 'keyword': True }, 'log_dice': { 'col_text': 'logDice', - 'func': wl_measures_effect_size.log_dice + 'func': wl_measures_effect_size.log_dice, + 'collocation': True, + 'keyword': False + }, + + 'lfmd': { + 'col_text': 'LFMD', + 'func': wl_measures_effect_size.lfmd, + 'collocation': True, + 'keyword': False }, 'log_ratio': { 'col_text': _tr('wl_settings_global', 'Log Ratio'), - 'func': wl_measures_effect_size.log_ratio + 'func': wl_measures_effect_size.log_ratio, + 'collocation': True, + 'keyword': True }, 'mi_log_f': { 'col_text': 'MI.log-f', - 'func': wl_measures_effect_size.mi_log_f + 'func': wl_measures_effect_size.mi_log_f, + 'collocation': True, + 'keyword': False }, 'min_sensitivity': { 'col_text': _tr('wl_settings_global', 'Minimum Sensitivity'), - 'func': wl_measures_effect_size.min_sensitivity + 'func': wl_measures_effect_size.min_sensitivity, + 'collocation': True, + 'keyword': False }, 'md': { 'col_text': 'MD', - 'func': wl_measures_effect_size.md + 'func': wl_measures_effect_size.md, + 'collocation': True, + 'keyword': False }, 'me': { 'col_text': 'ME', - 'func': 
wl_measures_effect_size.me + 'func': wl_measures_effect_size.me, + 'collocation': True, + 'keyword': False }, 'mi': { 'col_text': 'MI', - 'func': wl_measures_effect_size.mi + 'func': wl_measures_effect_size.mi, + 'collocation': True, + 'keyword': False }, 'or': { 'col_text': 'OR', - 'func': wl_measures_effect_size.odds_ratio + 'func': wl_measures_effect_size.odds_ratio, + 'collocation': True, + 'keyword': True }, 'pmi': { 'col_text': 'PMI', - 'func': wl_measures_effect_size.pmi + 'func': wl_measures_effect_size.pmi, + 'collocation': True, + 'keyword': True }, 'poisson_collocation_measure': { 'col_text': _tr('wl_settings_global', 'Poisson Collocation Measure'), - 'func': wl_measures_effect_size.poisson_collocation_measure + 'func': wl_measures_effect_size.poisson_collocation_measure, + 'collocation': True, + 'keyword': False }, 'im2': { 'col_text': 'IM²', - 'func': wl_measures_effect_size.im2 + 'func': wl_measures_effect_size.im2, + 'collocation': True, + 'keyword': True }, 'squared_phi_coeff': { 'col_text': 'φ2', - 'func': wl_measures_effect_size.squared_phi_coeff + 'func': wl_measures_effect_size.squared_phi_coeff, + 'collocation': True, + 'keyword': False } }, diff --git a/wordless/wl_widgets/wl_widgets.py b/wordless/wl_widgets/wl_widgets.py index b45dc270e..8022eed5c 100644 --- a/wordless/wl_widgets/wl_widgets.py +++ b/wordless/wl_widgets/wl_widgets.py @@ -616,7 +616,7 @@ def wl_widgets_context_settings(parent, tab): return label_context_settings, button_context_settings # Generation Settings -def wl_widgets_measures_wordlist_generator(parent): +def wl_widgets_measures_wordlist_ngram_generation(parent): label_measure_dispersion = QLabel(_tr('wl_widgets', 'Measure of dispersion:'), parent) combo_box_measure_dispersion = wl_boxes.Wl_Combo_Box_Measure(parent, measure_type = 'dispersion') label_measure_adjusted_freq = QLabel(_tr('wl_widgets', 'Measure of adjusted frequency:'), parent) @@ -627,7 +627,7 @@ def wl_widgets_measures_wordlist_generator(parent): 
label_measure_adjusted_freq, combo_box_measure_adjusted_freq ) -def wl_widgets_measures_collocation_extractor(parent, tab): +def wl_widgets_measures_collocation_keyword_extraction(parent, extraction_type): main = wl_misc.find_wl_main(parent) label_test_statistical_significance = QLabel(_tr('wl_widgets', 'Test of statistical significance:'), parent) @@ -641,16 +641,23 @@ def wl_widgets_measures_collocation_extractor(parent, tab): measure_text = combo_box_test_statistical_significance.itemText(i) measure_code = wl_measure_utils.to_measure_code(main, 'statistical_significance', measure_text) - if not main.settings_global['tests_statistical_significance'][measure_code][tab]: + if not main.settings_global['tests_statistical_significance'][measure_code][extraction_type]: combo_box_test_statistical_significance.removeItem(i) for i in reversed(range(combo_box_measure_bayes_factor.count())): measure_text = combo_box_measure_bayes_factor.itemText(i) measure_code = wl_measure_utils.to_measure_code(main, 'bayes_factor', measure_text) - if not main.settings_global['measures_bayes_factor'][measure_code][tab]: + if not main.settings_global['measures_bayes_factor'][measure_code][extraction_type]: combo_box_measure_bayes_factor.removeItem(i) + for i in reversed(range(combo_box_measure_effect_size.count())): + measure_text = combo_box_measure_effect_size.itemText(i) + measure_code = wl_measure_utils.to_measure_code(main, 'effect_size', measure_text) + + if not main.settings_global['measures_effect_size'][measure_code][extraction_type]: + combo_box_measure_effect_size.removeItem(i) + return ( label_test_statistical_significance, combo_box_test_statistical_significance, label_measure_bayes_factor, combo_box_measure_bayes_factor, diff --git a/wordless/wl_wordlist_generator.py b/wordless/wl_wordlist_generator.py index 3e6d12b9d..17d4c5544 100644 --- a/wordless/wl_wordlist_generator.py +++ b/wordless/wl_wordlist_generator.py @@ -132,7 +132,7 @@ def __init__(self, main): 
self.combo_box_measure_dispersion, self.label_measure_adjusted_freq, self.combo_box_measure_adjusted_freq - ) = wl_widgets.wl_widgets_measures_wordlist_generator(self) + ) = wl_widgets.wl_widgets_measures_wordlist_ngram_generation(self) self.checkbox_syllabification.stateChanged.connect(self.generation_settings_changed) self.combo_box_measure_dispersion.currentTextChanged.connect(self.generation_settings_changed)
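The hunks above rename the per-measure `collocation_extractor`/`keyword_extractor` flags to `collocation`/`keyword` and extend the combo-box filtering in `wl_widgets_measures_collocation_keyword_extraction` so that effect-size measures, like significance tests and Bayes-factor measures, are hidden when they do not apply to the requested extraction type. The sketch below illustrates that filtering pattern in isolation; the dict is sample data modeled on the entries in this diff, not Wordless's actual settings, and `filter_measures` is a hypothetical helper, not a function from the codebase.

```python
# Illustrative subset of the measures_effect_size settings seen in the diff.
# Each measure carries per-feature availability flags: 'collocation' and
# 'keyword', matching the renamed keys introduced by this patch.
MEASURES_EFFECT_SIZE = {
    'none': {'col_text': None, 'collocation': True, 'keyword': True},
    'pct_diff': {'col_text': '%DIFF', 'collocation': False, 'keyword': True},
    'conditional_probability': {'col_text': 'P', 'collocation': True, 'keyword': False},
    'pmi': {'col_text': 'PMI', 'collocation': True, 'keyword': True},
}

def filter_measures(measures, extraction_type):
    # Keep only the measures whose flag for the given extraction type
    # ('collocation' or 'keyword') is True — the same check the widget
    # performs when it walks the combo box in reverse and calls
    # removeItem() on disabled entries.
    return {
        code: settings
        for code, settings in measures.items()
        if settings[extraction_type]
    }
```

Iterating the combo box in `reversed(range(...))`, as the diff does, matters in the real widget because removing an item shifts the indices of everything after it; the dict comprehension above sidesteps that concern entirely.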