diff --git a/CHANGELOG.md b/CHANGELOG.md index b86e18e8d..5c57e3373 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -28,6 +28,9 @@ - File Area: Fix Open Files - Opening Non-text Files - Do not show this again - Utils: Fix Wordless's Japanese kanji tokenizer +### ❌ Removals +- Measures: Remove effect size - Log-frequency biased MD / Mutual Dependency + ## [3.5.0](https://github.com/BLKSerene/Wordless/releases/tag/3.5.0) - 07/01/2024 ### 🎉 New Features - File Area: Add support for .lrc and .pptx files diff --git a/doc/doc.md b/doc/doc.md index ce402312f..60d283815 100644 --- a/doc/doc.md +++ b/doc/doc.md @@ -1495,9 +1495,6 @@ Kilgarriff's ratio: logDice: \text{logDice} = 14 + \log_{2} \frac{2 \times O_{11}}{O_{1x} + O_{x1}} -Log-frequency biased MD: - \text{LFMD} = \log_{2} \frac{O_{11}}{E_{11}} + \log_{2} O_{11} - Log Ratio: \text{Log Ratio} = \log_{2} \frac{\frac{O_{11}}{O_{x1}}}{\frac{O_{12}}{O_{x2}}} @@ -1507,9 +1504,6 @@ MI.log-f: Minimum sensitivity: \text{S} = \min\left\{\frac{O_{11}}{O_{1x}},\;\frac{O_{11}}{O_{x1}}\right\} -Mutual Dependency: - \text{MD} = \log_{2} \frac{{O_{11}}^2}{E_{11}} - Mutual Expectation: \text{ME} = O_{11} \times \frac{2 \times O_{11}}{O_{1x} + O_{x1}} @@ -1538,27 +1532,30 @@ Squared phi coefficient: Measure of Effect Size|Formula|Collocation Extraction|Keyword Extraction ----------------------|-------|:--------------------:|:----------------: Conditional probability
([Durrant, 2008, p. 84](#ref-durrant-2008))|![Formula](/doc/measures/effect_size/conditional_probability.svg)|✔|✖️ -Cubic association ratio
([Daille, 1994, p. 139](#ref-daille-1994); [Kilgarriff, 2001, p, 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im3.svg)|✔|✔
ΔP
([Gries, 2013](#ref-gries-2013))|![Formula](/doc/measures/effect_size/delta_p.svg)|✔|✖️ Dice-Sørensen coefficient
([Smadja et al., 1996, p. 8](#ref-smadja-et-al-1996))|![Formula](/doc/measures/effect_size/dice_sorensen_coeff.svg)|✔|✖️ Difference coefficient
([Hofland & Johansson, 1982, p. 14](#ref-hofland-johansson-1982); [Gabrielatos, 2018, p. 236](#ref-gabrielatos-2018))|![Formula](/doc/measures/effect_size/diff_coeff.svg)|✖️|✔ Jaccard index
([Dunning, 1998, p. 48](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/jaccard_index.svg)|✔|✖️ Kilgarriff's ratio
([Kilgarriff, 2009](#ref-kilgarriff-2009))|![Formula](/doc/measures/effect_size/kilgarriffs_ratio.svg)
where **α** is the smoothing parameter, whose value could be changed via **Menu Bar → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter**.|✖️|✔ logDice
([Rychlý, 2008, p. 9](#ref-rychly-2008))|![Formula](/doc/measures/effect_size/log_dice.svg)|✔|✖️ -Log-frequency biased MD
([Thanopoulos et al., 2002, p. 621](#ref-thanopoulos-et-al-2002))|![Formula](/doc/measures/effect_size/lfmd.svg)|✔|✖️ Log Ratio
([Hardie, 2014](#ref-hardie-2014))|![Formula](/doc/measures/effect_size/log_ratio.svg)|✔|✔ MI.log-f
([Kilgarriff & Tugwell, 2002](#ref-kilgarriff-tugwell-2002); [Lexical Computing Ltd., 2015, p. 4](#ref-lexical-computing-ltd-2015))|![Formula](/doc/measures/effect_size/mi_log_f.svg)|✔|✖️ Minimum sensitivity
([Pedersen, 1998](#ref-pedersen-1998))|![Formula](/doc/measures/effect_size/min_sensitivity.svg)|✔|✖️ -Mutual Dependency
([Thanopoulos et al., 2002, p. 621](#ref-thanopoulos-et-al-2002))|![Formula](/doc/measures/effect_size/md.svg)|✔|✖️ Mutual Expectation
([Dias et al., 1999](#ref-dias-et-al-1999))|![Formula](/doc/measures/effect_size/me.svg)|✔|✖️ Mutual information
([Dunning, 1998, pp. 49–52](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/mi.svg)|✔|✖️ Odds ratio
([Pecina, 2005, p. 15](#ref-pecina-2005), [Pojanapunya & Todd, 2016](#ref-pojanapunya-todd-2016))|![Formula](/doc/measures/effect_size/odds_ratio.svg)|✔|✔ %DIFF
([Gabrielatos & Marchi, 2011](#ref-gabrielatos-marchi-2011))|![Formula](/doc/measures/effect_size/pct_diff.svg)|✖️|✔ Pointwise mutual information
([Church & Hanks, 1990](#ref-church-hanks-1990); [Kilgarriff, 2001, pp. 104–105](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/pmi.svg)|✔|✔ +Pointwise mutual information (cubic)**¹**
([Daille, 1994, p. 139](#ref-daille-1994); [Kilgarriff, 2001, p. 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im3.svg)|✔|✔ +Pointwise mutual information (squared)**¹**
([Daille, 1995, p. 21](#ref-daille-1995); [Kilgarriff, 2001, p. 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im2.svg)|✔|✔ Poisson collocation measure
([Quasthoff & Wolff, 2002](#ref-quasthoff-wolff-2002))|![Formula](/doc/measures/effect_size/poisson_collocation_measure.svg)|✔|✖️ -Squared association ratio
([Daille, 1995, p. 21](#ref-daille-1995); [Kilgarriff, 2001, p, 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im2.svg)|✔|✔ Squared phi coefficient
([Church & Gale, 1991](#ref-church-gale-1991))|![Formula](/doc/measures/effect_size/squared_phi_coeff.svg)|✔|✖️ +> [!NOTE] +1. The calculations of *Pointwise mutual information (squared)* and *Pointwise mutual information (cubic)* are exactly the same as those of *Mutual Dependency* and *Log-frequency biased MD*, respectively, which were proposed in:
+ +
Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), *Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)* (pp. 620–625). European Language Resources Association.
+ ## [13 References](#doc) @@ -1778,8 +1775,6 @@ Linguistic Computing Bulletin*, *7*(2), 172–177. 1. [**^**](#ref-num-words-spache) [**^**](#ref-spache-readability-formula) Spache, G. (1974). *Good reading for poor readers* (Rev. 9th ed.). Garrard. 1. [**^**](#ref-re) Szigriszt Pazos, F. (1993). *Sistemas predictivos de legibilidad del mensaje escrito: Formula de perspicuidad* [Doctoral dissertation, Complutense University of Madrid]. Biblos-e Archivo. https://repositorio.uam.es/bitstream/handle/10486/2488/3907_barrio_cantalejo_ines_maria.pdf?sequence=1&isAllowed=y - -1. [**^**](#ref-lfmd) [**^**](#ref-md) Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), *Proceedings of the Third International Conference on Language Resources and Evaluation* (pp. 620–625). European Language Resources Association. 1. [**^**](#ref-trankle-bailers-readability-formula) Tränkle, U., & Bailer, H. (1984). Kreuzvalidierung und neuberechnung von lesbarkeitsformeln für die Deutsche sprache. *Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie*, *16*(3), 231–244. 
diff --git a/doc/measures/effect_size/lfmd.svg b/doc/measures/effect_size/lfmd.svg deleted file mode 100644 index 88f886d18..000000000 --- a/doc/measures/effect_size/lfmd.svg +++ /dev/null @@ -1,45 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/doc/measures/effect_size/md.svg b/doc/measures/effect_size/md.svg deleted file mode 100644 index 8d317ea7c..000000000 --- a/doc/measures/effect_size/md.svg +++ /dev/null @@ -1,33 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/tests/tests_measures/test_measures_effect_size.py b/tests/tests_measures/test_measures_effect_size.py index 1cb937a16..963f52437 100644 --- a/tests/tests_measures/test_measures_effect_size.py +++ b/tests/tests_measures/test_measures_effect_size.py @@ -50,9 +50,6 @@ def test_conditional_probability(): assert_zeros(wl_measures_effect_size.conditional_probability) -def test_im3(): - assert_zeros(wl_measures_effect_size.im3) - # Reference: Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165. https://doi.org/10.1075/ijcl.18.1.09gri | p. 144 def test_delta_p(): numpy.testing.assert_array_equal( @@ -119,9 +116,6 @@ def test_kilgarriffs_ratio(): def test_log_dice(): assert_zeros(wl_measures_effect_size.log_dice, result = 14) -def test_lfmd(): - assert_zeros(wl_measures_effect_size.lfmd) - # Reference: Hardie, A. (2014, April 28). Log Ratio: An informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/. 
def test_log_ratio(): numpy.testing.assert_array_equal( @@ -164,9 +158,6 @@ def test_min_sensitivity(): assert_zeros(wl_measures_effect_size.min_sensitivity) -def test_md(): - assert_zeros(wl_measures_effect_size.md) - def test_me(): assert_zeros(wl_measures_effect_size.me) @@ -248,12 +239,15 @@ def test_pmi(): assert_zeros(wl_measures_effect_size.pmi) -def test_poisson_collocation_measure(): - assert_zeros(wl_measures_effect_size.poisson_collocation_measure) +def test_im3(): + assert_zeros(wl_measures_effect_size.im3) def test_im2(): assert_zeros(wl_measures_effect_size.im2) +def test_poisson_collocation_measure(): + assert_zeros(wl_measures_effect_size.poisson_collocation_measure) + # Reference: Church, K. W., & Gale, W. A. (1991, September 29–October 1). Concordances for parallel text [Paper presentation]. Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, Oxford, United Kingdom. def test_squared_phi_coeff(): numpy.testing.assert_array_equal( @@ -271,23 +265,21 @@ def test_squared_phi_coeff(): if __name__ == '__main__': test_conditional_probability() - test_im3() test_delta_p() test_dice_sorensen_coeff() test_diff_coeff() test_jaccard_index() test_kilgarriffs_ratio() test_log_dice() - test_lfmd() test_log_ratio() test_mi_log_f() test_min_sensitivity() - test_md() test_me() test_mi() test_odds_ratio() test_pct_diff() test_pmi() - test_poisson_collocation_measure() + test_im3() test_im2() + test_poisson_collocation_measure() test_squared_phi_coeff() diff --git a/wordless/wl_measures/wl_measures_effect_size.py b/wordless/wl_measures/wl_measures_effect_size.py index 922bc5b7a..98939dbaa 100644 --- a/wordless/wl_measures/wl_measures_effect_size.py +++ b/wordless/wl_measures/wl_measures_effect_size.py @@ -29,13 +29,6 @@ def conditional_probability(main, o11s, o12s, o21s, o22s): return wl_measure_utils.numpy_divide(o11s, ox1s) * 100 -# Cubic association ratio -# Reference: Daille, B. (1994). 
Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid= | p. 139 -def im3(main, o11s, o12s, o21s, o22s): - e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) - - return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 3, e11s)) - # ΔP # Reference: Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165. https://doi.org/10.1075/ijcl.18.1.09gri def delta_p(main, o11s, o12s, o21s, o22s): @@ -88,13 +81,6 @@ def log_dice(main, o11s, o12s, o21s, o22s): return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(2 * o11s, o1xs + ox1s), default = 14) -# Log-frequency biased MD -# Reference: Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In M. G. González & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association. | p. 621 -def lfmd(main, o11s, o12s, o21s, o22s): - e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) - - return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 2, e11s)) + wl_measure_utils.numpy_log2(o11s) - # Log Ratio # Reference: Hardie, A. (2014, April 28). Log Ratio: An informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/ def log_ratio(main, o11s, o12s, o21s, o22s): @@ -134,13 +120,6 @@ def min_sensitivity(main, o11s, o12s, o21s, o22s): wl_measure_utils.numpy_divide(o11s, ox1s) ) -# Mutual Dependency -# Reference: Thanopoulos, A, Fakotakis, N., & Kokkinakis, G. (2002). 
Comparative evaluation of collocation extraction metrics. In M. G. González, & C. P. S. Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 620–625). European Language Resources Association. | p. 621 -def md(main, o11s, o12s, o21s, o22s): - e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) - - return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 2, e11s)) - # Mutual Expectation # Reference: Dias, G., Guilloré, S., & Pereira Lopes, J. G. (1999). Language independent automatic acquisition of rigid multiword units from unrestricted text corpora. In A. Condamines, C. Fabre, & M. Péry-Woodley (Eds.), TALN'99: 6ème Conférence Annuelle Sur le Traitement Automatique des Langues Naturelles (pp. 333–339). TALN. def me(main, o11s, o12s, o21s, o22s): @@ -202,6 +181,20 @@ def pmi(main, o11s, o12s, o21s, o22s): return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s, e11s)) +# Pointwise mutual information (cubic) +# Reference: Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques [Doctoral thesis, Paris Diderot University]. Béatrice Daille. http://www.bdaille.com/index.php?option=com_docman&task=doc_download&gid=8&Itemid= | p. 139 +def im3(main, o11s, o12s, o21s, o22s): + e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) + + return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 3, e11s)) + +# Pointwise mutual information (squared) +# Reference: Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University. | p. 
21 +def im2(main, o11s, o12s, o21s, o22s): + e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) + + return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 2, e11s)) + # Poisson collocation measure # Reference: Quasthoff, U., & Wolff, C. (2002). The poisson collocation measure and its applications. Proceedings of 2nd International Workshop on Computational Approaches to Collocations. IEEE. def poisson_collocation_measure(main, o11s, o12s, o21s, o22s): @@ -213,13 +206,6 @@ def poisson_collocation_measure(main, o11s, o12s, o21s, o22s): wl_measure_utils.numpy_log(oxxs) ) -# Squared association ratio -# Reference: Daille, B. (1995). Combined approach for terminology extraction: Lexical statistics and linguistic filtering. UCREL technical papers (Vol. 5). Lancaster University. | p. 21 -def im2(main, o11s, o12s, o21s, o22s): - e11s, _, _, _ = wl_measures_statistical_significance.get_freqs_expected(o11s, o12s, o21s, o22s) - - return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 2, e11s)) - # Squared phi coefficient # Reference: Church, K. W., & Gale, W. A. (1991, September 29–October 1). Concordances for parallel text [Paper presentation]. Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, Oxford, United Kingdom. 
def squared_phi_coeff(main, o11s, o12s, o21s, o22s): diff --git a/wordless/wl_settings/wl_settings_global.py b/wordless/wl_settings/wl_settings_global.py index 8ebcfe2f1..c36c6af86 100644 --- a/wordless/wl_settings/wl_settings_global.py +++ b/wordless/wl_settings/wl_settings_global.py @@ -3594,25 +3594,23 @@ def init_settings_global(): 'effect_size': { _tr('wl_settings_global', 'None'): 'none', _tr('wl_settings_global', 'Conditional probability'): 'conditional_probability', - _tr('wl_settings_global', 'Cubic association ratio'): 'im3', 'ΔP': 'delta_p', _tr('wl_settings_global', 'Dice-Sørensen coefficient'): 'dice_sorensen_coeff', _tr('wl_settings_global', 'Difference coefficient'): 'diff_coeff', _tr('wl_settings_global', 'Jaccard index'): 'jaccard_index', _tr('wl_settings_global', "Kilgarriff's ratio"): 'kilgarriffs_ratio', 'logDice': 'log_dice', - _tr('wl_settings_global', 'Log-frequency biased MD'): 'lfmd', _tr('wl_settings_global', 'Log Ratio'): 'log_ratio', 'MI.log-f': 'mi_log_f', _tr('wl_settings_global', 'Minimum sensitivity'): 'min_sensitivity', - _tr('wl_settings_global', 'Mutual Dependency'): 'md', _tr('wl_settings_global', 'Mutual Expectation'): 'me', _tr('wl_settings_global', 'Mutual information'): 'mi', _tr('wl_settings_global', 'Odds ratio'): 'or', '%DIFF': 'pct_diff', _tr('wl_settings_global', 'Pointwise mutual information'): 'pmi', + _tr('wl_settings_global', 'Pointwise mutual information (cubic)'): 'im3', + _tr('wl_settings_global', 'Pointwise mutual information (squared)'): 'im2', _tr('wl_settings_global', 'Poisson collocation measure'): 'poisson_collocation_measure', - _tr('wl_settings_global', 'Squared association ratio'): 'im2', _tr('wl_settings_global', 'Squared phi coefficient'): 'squared_phi_coeff' } }, @@ -3849,13 +3847,6 @@ def init_settings_global(): 'keyword': False }, - 'im3': { - 'col_text': 'IM³', - 'func': wl_measures_effect_size.im3, - 'collocation': True, - 'keyword': True - }, - 'delta_p': { 'col_text': 'ΔP', 'func': 
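
Review note: the removal of `md` and `lfmd` rests on the equivalence stated in the new footnote in `doc/doc.md`. The sketch below checks that equivalence numerically on a single 2×2 contingency table. It is a minimal illustration, not the project's implementation: the `*_sketch` functions and `expected_o11` are hypothetical scalar stand-ins for the array-based functions in `wl_measures_effect_size.py`, and the expected frequency is assumed to be the usual row total × column total / grand total, since `get_freqs_expected` is not shown in this diff. Zero frequencies, which the real code routes through `wl_measure_utils.numpy_divide` and `numpy_log2`, are not handled here.

```python
import math

def expected_o11(o11, o12, o21, o22):
    # Assumption: E11 = (row total * column total) / grand total
    o1x = o11 + o12
    ox1 = o11 + o21
    oxx = o11 + o12 + o21 + o22
    return o1x * ox1 / oxx

def im2_sketch(o11, o12, o21, o22):
    # Pointwise mutual information (squared): log2(O11^2 / E11)
    return math.log2(o11 ** 2 / expected_o11(o11, o12, o21, o22))

def im3_sketch(o11, o12, o21, o22):
    # Pointwise mutual information (cubic): log2(O11^3 / E11)
    return math.log2(o11 ** 3 / expected_o11(o11, o12, o21, o22))

def md_sketch(o11, o12, o21, o22):
    # Mirrors the removed md(): log2(O11^2 / E11)
    return math.log2(o11 ** 2 / expected_o11(o11, o12, o21, o22))

def lfmd_sketch(o11, o12, o21, o22):
    # Mirrors the removed lfmd(): log2(O11^2 / E11) + log2(O11)
    return md_sketch(o11, o12, o21, o22) + math.log2(o11)

# Arbitrary example table with non-zero cells (values chosen for illustration)
table = (10, 20, 30, 940)

assert math.isclose(md_sketch(*table), im2_sketch(*table))
assert math.isclose(lfmd_sketch(*table), im3_sketch(*table))
```

The second assertion holds because log2(O11² / E11) + log2(O11) = log2(O11³ / E11), so the removed functions computed the same values as `im2` and `im3` for any table with O11 > 0.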
wl_measures_effect_size.delta_p, @@ -3898,13 +3889,6 @@ def init_settings_global(): 'keyword': False }, - 'lfmd': { - 'col_text': 'LFMD', - 'func': wl_measures_effect_size.lfmd, - 'collocation': True, - 'keyword': False - }, - 'log_ratio': { 'col_text': _tr('wl_settings_global', 'Log Ratio'), 'func': wl_measures_effect_size.log_ratio, @@ -3926,13 +3910,6 @@ def init_settings_global(): 'keyword': False }, - 'md': { - 'col_text': 'MD', - 'func': wl_measures_effect_size.md, - 'collocation': True, - 'keyword': False - }, - 'me': { 'col_text': 'ME', 'func': wl_measures_effect_size.me, @@ -3968,11 +3945,11 @@ def init_settings_global(): 'keyword': True }, - 'poisson_collocation_measure': { - 'col_text': _tr('wl_settings_global', 'Poisson Collocation Measure'), - 'func': wl_measures_effect_size.poisson_collocation_measure, + 'im3': { + 'col_text': 'IM³', + 'func': wl_measures_effect_size.im3, 'collocation': True, - 'keyword': False + 'keyword': True }, 'im2': { @@ -3982,6 +3959,13 @@ def init_settings_global(): 'keyword': True }, + 'poisson_collocation_measure': { + 'col_text': _tr('wl_settings_global', 'Poisson Collocation Measure'), + 'func': wl_measures_effect_size.poisson_collocation_measure, + 'collocation': True, + 'keyword': False + }, + 'squared_phi_coeff': { 'col_text': 'φ2', 'func': wl_measures_effect_size.squared_phi_coeff,