Measures: Add effect size - ΔP
BLKSerene committed Nov 3, 2024
1 parent c445373 commit c24322c
Showing 6 changed files with 132 additions and 58 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -20,7 +20,7 @@

## [3.6.0](https://github.com/BLKSerene/Wordless/releases/tag/3.6.0) - ??/??/2024
### 🎉 New Features
- Measures: Add effect size - conditional probability / squared association ratio
- Measures: Add effect size - conditional probability / ΔP / squared association ratio
- Utils: Add Stanza's Sindhi dependency parser

### 📌 Bugfixes
14 changes: 10 additions & 4 deletions doc/doc.md
@@ -1471,15 +1471,15 @@ Test of Statistical Significance|Measure of Bayes Factor|Formula|Collocation Ext
<span id="ref-z-test-berry-rogghes"></span>Z-test (Berry-Rogghe)<br>([Berry-Rogghe, 1973](#ref-berry-rogghe-1973))||![Formula](/doc/measures/statistical_significance/z_test_berry_rogghe.svg)<br>where **S** is the average span size on both sides of the node word.|✔|✖️

<!--
%DIFF:
\text{%DIFF} = \frac{\left(\frac{O_{11}}{O_{x1}} - \frac{O_{12}}{O_{x2}}\right) \times 100}{\frac{O_{12}}{O_{x2}}}
Conditional probability:
\text{P} = \frac{O_{11}}{O_{x1}} \times 100
Cubic association ratio:
\text{IM}^3 = \log_{2} \frac{{O_{11}}^3}{E_{11}}
ΔP:
\Delta\text{P} = \frac{O_{11}}{O_{x1}} - \frac{O_{12}}{O_{x2}}
Dice-Sørensen coefficient:
\text{DSC} = \frac{2 \times O_{11}}{O_{1x} + O_{x1}}
@@ -1519,6 +1519,9 @@ Mutual information:
Odds ratio:
\text{Odds ratio} = \frac{O_{11} \times O_{22}}{O_{12} \times O_{21}}
%DIFF:
\text{%DIFF} = \frac{\left(\frac{O_{11}}{O_{x1}} - \frac{O_{12}}{O_{x2}}\right) \times 100}{\frac{O_{12}}{O_{x2}}}
Pointwise mutual information:
\text{PMI} = \log_{2} \frac{O_{11}}{E_{11}}
@@ -1534,9 +1537,9 @@ Squared phi coefficient:

Measure of Effect Size|Formula|Collocation Extraction|Keyword Extraction
----------------------|-------|:--------------------:|:----------------:
<span id="ref-pct-diff"></span>%DIFF<br>([Gabrielatos & Marchi, 2011](#ref-gabrielatos-marchi-2011))|![Formula](/doc/measures/effect_size/pct_diff.svg)|✖️|✔
<span id="ref-conditional-probability"></span>Conditional probability<br>([Durrant, 2008, p. 84](#ref-durrant-2008))|![Formula](/doc/measures/effect_size/conditional_probability.svg)|✔|✖️
<span id="ref-im3"></span>Cubic association ratio<br>([Daille, 1994, p. 139](#ref-daille-1994); [Kilgarriff, 2001, p. 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im3.svg)|✔|✔
<span id="ref-delta-p"></span>ΔP<br>([Gries, 2013](#ref-gries-2013))|![Formula](/doc/measures/effect_size/delta_p.svg)|✔|✖️
<span id="ref-dice-sorensen-coeff"></span>Dice-Sørensen coefficient<br>([Smadja et al., 1996, p. 8](#ref-smadja-et-al-1996))|![Formula](/doc/measures/effect_size/dice_sorensen_coeff.svg)|✔|✖️
<span id="ref-diff-coeff"></span>Difference coefficient<br>([Hofland & Johansson, 1982, p. 14](#ref-hofland-johansson-1982); [Gabrielatos, 2018, p. 236](#ref-gabrielatos-2018))|![Formula](/doc/measures/effect_size/diff_coeff.svg)|✖️|✔
<span id="ref-jaccard-index"></span>Jaccard index<br>([Dunning, 1998, p. 48](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/jaccard_index.svg)|✔|✖️
@@ -1550,6 +1553,7 @@ Measure of Effect Size|Formula|Collocation Extraction|Keyword Extraction
<span id="ref-me"></span>Mutual Expectation<br>([Dias et al., 1999](#ref-dias-et-al-1999))|![Formula](/doc/measures/effect_size/me.svg)|✔|✖️
<span id="ref-mi"></span>Mutual information<br>([Dunning, 1998, pp. 49–52](#ref-dunning-1998))|![Formula](/doc/measures/effect_size/mi.svg)|✔|✖️
<span id="ref-odds-ratio"></span>Odds ratio<br>([Pecina, 2005, p. 15](#ref-pecina-2005); [Pojanapunya & Todd, 2016](#ref-pojanapunya-todd-2016))|![Formula](/doc/measures/effect_size/odds_ratio.svg)|✔|✔
<span id="ref-pct-diff"></span>%DIFF<br>([Gabrielatos & Marchi, 2011](#ref-gabrielatos-marchi-2011))|![Formula](/doc/measures/effect_size/pct_diff.svg)|✖️|✔
<span id="ref-pmi"></span>Pointwise mutual information<br>([Church & Hanks, 1990](#ref-church-hanks-1990); [Kilgarriff, 2001, pp. 104–105](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/pmi.svg)|✔|✔
<span id="ref-poisson-collocation-measure"></span>Poisson collocation measure<br>([Quasthoff & Wolff, 2002](#ref-quasthoff-wolff-2002))|![Formula](/doc/measures/effect_size/poisson_collocation_measure.svg)|✔|✖️
<span id="ref-im2"></span>Squared association ratio<br>([Daille, 1995, p. 21](#ref-daille-1995); [Kilgarriff, 2001, p. 99](#ref-kilgarriff-2001))|![Formula](/doc/measures/effect_size/im2.svg)|✔|✔
@@ -1655,6 +1659,8 @@ Measure of Effect Size|Formula|Collocation Extraction|Keyword Extraction
1. [**^**](#ref-pct-diff) Gabrielatos, C., & Marchi, A. (2011, November 5). *Keyness: Matching metrics to definitions* [Conference session]. Corpus Linguistics in the South 1, University of Portsmouth, United Kingdom. https://eprints.lancs.ac.uk/id/eprint/51449/4/Gabrielatos_Marchi_Keyness.pdf
<span id="ref-gries-2008"></span>
1. [**^**](#ref-griess-dp) Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. *International Journal of Corpus Linguistics*, *13*(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri
<span id="ref-gries-2013"></span>
1. [**^**](#ref-delta-p) Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. *International Journal of Corpus Linguistics*, *18*(1), 137–165. https://doi.org/10.1075/ijcl.18.1.09gri
<span id="ref-guiraud-1954"></span>
1. [**^**](#ref-rttr) Guiraud, P. (1954). *Les caractères statistiques du vocabulaire: Essai de méthodologie*. Presses Universitaires de France.
<span id="ref-gunning-1968"></span>
34 changes: 34 additions & 0 deletions doc/measures/effect_size/delta_p.svg
66 changes: 42 additions & 24 deletions tests/tests_measures/test_measures_effect_size.py
@@ -35,30 +35,6 @@ def assert_zeros(func, result = 0):
        numpy.array([result] * 10)
    )

# Reference: Gabrielatos, C., & Marchi, A. (2011, November 5). Keyness: Matching metrics to definitions [Conference session]. Corpus Linguistics in the South 1, University of Portsmouth, United Kingdom. https://eprints.lancs.ac.uk/id/eprint/51449/4/Gabrielatos_Marchi_Keyness.pdf | p. 18
def test_pct_diff():
    numpy.testing.assert_array_equal(
        numpy.round(wl_measures_effect_size.pct_diff(
            main,
            numpy.array([206523] * 2),
            numpy.array([178174] * 2),
            numpy.array([959641 - 206523] * 2),
            numpy.array([1562358 - 178174] * 2)
        ), 1),
        numpy.array([88.7] * 2)
    )

    numpy.testing.assert_array_equal(
        wl_measures_effect_size.pct_diff(
            main,
            numpy.array([0, 1, 0]),
            numpy.array([1, 0, 0]),
            numpy.array([0, 0, 0]),
            numpy.array([1, 1, 0])
        ),
        numpy.array([float('-inf'), float('inf'), 0])
    )

# Reference: Durrant, P. (2008). High frequency collocations and second language learning [Doctoral dissertation, University of Nottingham]. Nottingham eTheses. https://eprints.nottingham.ac.uk/10622/1/final_thesis.pdf | pp. 80, 84
def test_conditional_probability():
    numpy.testing.assert_array_equal(
@@ -72,9 +48,26 @@ def test_conditional_probability():
        numpy.array([0.178, 0.349])
    )

    assert_zeros(wl_measures_effect_size.conditional_probability)

def test_im3():
    assert_zeros(wl_measures_effect_size.im3)

# Reference: Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165. https://doi.org/10.1075/ijcl.18.1.09gri | p. 144
def test_delta_p():
    numpy.testing.assert_array_equal(
        numpy.round(wl_measures_effect_size.delta_p(
            main,
            numpy.array([5610, 5610]),
            numpy.array([2257, 168938]),
            numpy.array([168938, 2257]),
            numpy.array([10233063, 10233063])
        ), 3),
        numpy.array([0.032, 0.697])
    )

    assert_zeros(wl_measures_effect_size.delta_p)

# Reference: Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. | p. 13
def test_dice_sorensen_coeff():
    numpy.testing.assert_array_equal(
@@ -216,6 +209,30 @@ def test_odds_ratio():
        numpy.array([float('-inf'), float('inf'), 0])
    )

# Reference: Gabrielatos, C., & Marchi, A. (2011, November 5). Keyness: Matching metrics to definitions [Conference session]. Corpus Linguistics in the South 1, University of Portsmouth, United Kingdom. https://eprints.lancs.ac.uk/id/eprint/51449/4/Gabrielatos_Marchi_Keyness.pdf | p. 18
def test_pct_diff():
    numpy.testing.assert_array_equal(
        numpy.round(wl_measures_effect_size.pct_diff(
            main,
            numpy.array([206523] * 2),
            numpy.array([178174] * 2),
            numpy.array([959641 - 206523] * 2),
            numpy.array([1562358 - 178174] * 2)
        ), 1),
        numpy.array([88.7] * 2)
    )

    numpy.testing.assert_array_equal(
        wl_measures_effect_size.pct_diff(
            main,
            numpy.array([0, 1, 0]),
            numpy.array([1, 0, 0]),
            numpy.array([0, 0, 0]),
            numpy.array([1, 1, 0])
        ),
        numpy.array([float('-inf'), float('inf'), 0])
    )

# Reference: Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29. | p. 24
def test_pmi():
    numpy.testing.assert_array_equal(
@@ -256,6 +273,7 @@ def test_squared_phi_coeff():
    test_pct_diff()
    test_conditional_probability()
    test_im3()
    test_delta_p()
    test_dice_sorensen_coeff()
    test_diff_coeff()
    test_jaccard_index()
43 changes: 25 additions & 18 deletions wordless/wl_measures/wl_measures_effect_size.py
@@ -22,24 +22,6 @@

from wordless.wl_measures import wl_measures_statistical_significance, wl_measure_utils

# %DIFF
# Reference: Gabrielatos, C., & Marchi, A. (2011, November 5). Keyness: Matching metrics to definitions [Conference session]. Corpus Linguistics in the South 1, University of Portsmouth, United Kingdom. https://eprints.lancs.ac.uk/id/eprint/51449/4/Gabrielatos_Marchi_Keyness.pdf
def pct_diff(main, o11s, o12s, o21s, o22s):
    _, _, ox1s, ox2s = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s)

    return numpy.where(
        (o11s == 0) & (o12s > 0),
        -numpy.inf,
        numpy.where(
            (o11s > 0) & (o12s == 0),
            numpy.inf,
            wl_measure_utils.numpy_divide(
                (wl_measure_utils.numpy_divide(o11s, ox1s) - wl_measure_utils.numpy_divide(o12s, ox2s)) * 100,
                wl_measure_utils.numpy_divide(o12s, ox2s)
            )
        )
    )

# Conditional probability
# Reference: Durrant, P. (2008). High frequency collocations and second language learning [Doctoral dissertation, University of Nottingham]. Nottingham eTheses. https://eprints.nottingham.ac.uk/10622/1/final_thesis.pdf | p. 84
def conditional_probability(main, o11s, o12s, o21s, o22s):
@@ -54,6 +36,13 @@ def im3(main, o11s, o12s, o21s, o22s):

    return wl_measure_utils.numpy_log2(wl_measure_utils.numpy_divide(o11s ** 3, e11s))

# ΔP
# Reference: Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165. https://doi.org/10.1075/ijcl.18.1.09gri
def delta_p(main, o11s, o12s, o21s, o22s):
    _, _, ox1s, ox2s = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s)

    return wl_measure_utils.numpy_divide(o11s, ox1s) - wl_measure_utils.numpy_divide(o12s, ox2s)
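
For readers without the Wordless helpers at hand, the ΔP computation above can be sketched in standalone NumPy. This is a minimal sketch rather than the project's code: `safe_divide` and `delta_p_sketch` are hypothetical names, the column totals are assumed to be O11 + O21 and O12 + O22, and division by zero is assumed to yield 0. That last assumption is consistent with the `assert_zeros` check in the tests, but it has not been confirmed against `wl_measure_utils.numpy_divide` itself.

```python
import numpy as np

def safe_divide(a, b):
    # Hypothetical helper: element-wise a / b, returning 0 where b == 0.
    # (Assumed behavior; the real wl_measure_utils.numpy_divide may differ.)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.zeros(np.broadcast(a, b).shape, dtype=float)
    np.divide(a, b, out=out, where=(b != 0))
    return out

def delta_p_sketch(o11, o12, o21, o22):
    # Convert the four cells of each 2x2 contingency table to float arrays.
    o11, o12, o21, o22 = (np.asarray(x, dtype=float) for x in (o11, o12, o21, o22))

    # Column totals (assumed definition of O_x1 and O_x2).
    ox1 = o11 + o21
    ox2 = o12 + o22

    # ΔP = O11 / O_x1 - O12 / O_x2
    return safe_divide(o11, ox1) - safe_divide(o12, ox2)

# Frequencies taken from test_delta_p in this commit.
result = delta_p_sketch(
    [5610, 5610],
    [2257, 168938],
    [168938, 2257],
    [10233063, 10233063],
)
print(np.round(result, 3))
```

With the frequencies from `test_delta_p`, the two values round to 0.032 and 0.697, matching the expected array in the test. Swapping O12 and O21 between the two columns shows that ΔP is directional: the same pair gives a weak association in one direction and a strong one in the other.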

# Dice-Sørensen coefficient
# Reference: Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. | p. 8
def dice_sorensen_coeff(main, o11s, o12s, o21s, o22s):
@@ -188,6 +177,24 @@ def odds_ratio(main, o11s, o12s, o21s, o22s):
        )
    )

# %DIFF
# Reference: Gabrielatos, C., & Marchi, A. (2011, November 5). Keyness: Matching metrics to definitions [Conference session]. Corpus Linguistics in the South 1, University of Portsmouth, United Kingdom. https://eprints.lancs.ac.uk/id/eprint/51449/4/Gabrielatos_Marchi_Keyness.pdf
def pct_diff(main, o11s, o12s, o21s, o22s):
    _, _, ox1s, ox2s = wl_measures_statistical_significance.get_freqs_marginal(o11s, o12s, o21s, o22s)

    return numpy.where(
        (o11s == 0) & (o12s > 0),
        -numpy.inf,
        numpy.where(
            (o11s > 0) & (o12s == 0),
            numpy.inf,
            wl_measure_utils.numpy_divide(
                (wl_measure_utils.numpy_divide(o11s, ox1s) - wl_measure_utils.numpy_divide(o12s, ox2s)) * 100,
                wl_measure_utils.numpy_divide(o12s, ox2s)
            )
        )
    )
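
The nested `numpy.where` calls in `pct_diff` handle the two cases in which the plain ratio is undefined. A standalone sketch of that branching follows. The names `safe_div` and `pct_diff_sketch` are hypothetical, and the zero-on-division-by-zero behavior is an assumption inferred from the tests, not taken from the Wordless helpers.

```python
import numpy as np

def safe_div(a, b):
    # Hypothetical helper: element-wise a / b, returning 0 where b == 0.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.zeros(np.broadcast(a, b).shape, dtype=float)
    np.divide(a, b, out=out, where=(b != 0))
    return out

def pct_diff_sketch(o11, o12, o21, o22):
    o11, o12, o21, o22 = (np.asarray(x, dtype=float) for x in (o11, o12, o21, o22))

    # Relative frequencies in the two columns (assumed column totals).
    p1 = safe_div(o11, o11 + o21)
    p2 = safe_div(o12, o12 + o22)

    # Regular case: percentage difference of p1 relative to p2.
    regular = safe_div((p1 - p2) * 100, p2)

    return np.where(
        (o11 == 0) & (o12 > 0),
        -np.inf,                      # absent in the first column only
        np.where(
            (o11 > 0) & (o12 == 0),
            np.inf,                   # absent in the second column only
            regular,                  # both present, or both absent (gives 0)
        ),
    )

# Edge cases taken from test_pct_diff in this commit.
edge = pct_diff_sketch([0, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0])
print(edge)
```

The three columns exercise each branch in turn: negative infinity, positive infinity, and 0 when every cell is zero, which is the array `test_pct_diff` expects.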

# Pointwise mutual information
# Reference: Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
def pmi(main, o11s, o12s, o21s, o22s):