Config issue #15

hubalaga · 2022-01-30T11:41:14Z

Hi

In the example, this line does not run:
yake.cfg["candidate_selection"] = {"ngram": 3}

It fails with TypeError: unhashable type: 'dict'
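One plausible reading of this error, assuming the config value is used directly as a key into a name-to-function registry (an assumption, not confirmed from the spacy_ke source): a dict cannot be hashed, so it cannot serve as a lookup key. The sketch below reproduces that with a hypothetical toy registry; `REGISTRY`, `register`, and `resolve` are illustrative names, not part of the library.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a name -> function registry (not the spacy_ke one)
REGISTRY: Dict[str, Callable[..., str]] = {}


def register(name: str):
    """Store the decorated function under a string name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("ngram")
def ngram_selection(text: str, n: int = 3) -> str:
    return f"ngram(n={n})"


def resolve(value):
    """Look up a selector by the configured value, used as a dict key."""
    return REGISTRY[value]


# A string name is hashable, so the lookup works
assert resolve("ngram") is ngram_selection

# A dict is not hashable, so the lookup fails before any function is found
try:
    resolve({"ngram": 3})
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

Under that assumption, the config entry accepts only a registered name, which is why passing `n` has to happen inside a registered function.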

Facenomore23 · 2022-02-07T13:17:49Z

The only ways I found to change the number of words in the n-gram were:

Change the code in the package directory, setting n to another number. Here I want at most 2 words per keyword:

@registry.candidate_selection.register("ngram")
def ngram_selection(doc: Doc, n=2) -> Iterable[Candidate]:

Or create another function as suggested, copying the code from the method and changing the number n (which is basically the same as the above):

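from typing import Iterable, Iterator, Tuple

from spacy.tokens import Doc, Span
# Candidate is assumed to be importable from spacy_ke.util, next to registry
from spacy_ke.util import Candidate, registry


@registry.candidate_selection.register("custom")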
def custom_selection(doc: Doc, n=2) -> Iterable[Candidate]:
    """Get keywords candidates from ngrams.
    Args:
        doc (Doc): doc.
        n (int): ngram range.
    Returns:
        Iterable[Candidate]
    """
    
    def _is_candidate(span: Span, min_length=3, min_word_length=2, alpha=True) -> bool:
        """Check if N-gram span is qualified as a candidate.
        Args:
            span (Span): n-gram.
            min_length (int): minimum length for n-gram.
            min_word_length (int): minimum length for word in an ngram.
            alpha (bool): Filter n-grams with non-alphanumeric words.
        Returns:
            bool
        """
        n_span = len(span)
        # discard if composed of 1-2 characters
        if len(span.text) < min_length:
            return False
        for token_idx, token in enumerate(span):
            # discard if contains punct
            if token.is_punct:
                return False
            # discard if contains tokens of 1-2 characters
            if len(token.text) < min_word_length:
                return False
            # discard if contains non alphanumeric
            if alpha and not token.is_alpha:
                return False
            # discard if first/last word is a stop
            if token_idx in (0, n_span - 1) and token.is_stop:
                return False
        # discard if ends with NOUN -> VERB
        if n_span >= 2 and span[-2].pos_ == "NOUN" and span[-1].pos_ == "VERB":
            return False
        return True

    surface_forms = [sf for sf in _ngrams(doc, n=n) if _is_candidate(sf[1])]
    return _merge_surface_forms(surface_forms)

def _ngrams(doc: Doc, n=3) -> Iterator[Tuple[int, Span]]:
    """Select all the n-grams and populate the candidate container.
    Args:
        doc (Doc): doc.
        n (int): the maximum n-gram length, defaults to 3.
    Returns:
        Iterator of (sentence_id<int>, ngram<Span>) tuples
    """
    for sentence_id, sentence in enumerate(doc.sents):
        n_tokens = len(sentence)
        window = min(n, n_tokens)
        for j in range(n_tokens):
            for k in range(j + 1, min(j + 1 + window, n_tokens + 1)):
                yield sentence_id, sentence[j:k]


def _merge_surface_forms(surface_forms: Iterator[Tuple[int, Span]]) -> Iterable[Candidate]:
    """De-dup candidate surface forms.
    Args:
        surface_forms (Iterable): tuples of <sent_i, span>.
    Returns:
        List of candidates.
    """
    candidates = dict()
    for sent_i, span in surface_forms:
        idx = span.lemma_.lower()
        try:
            c = candidates[idx]
        except (KeyError, IndexError):
            lexical_form = [token.lemma_.lower() for token in span]
            c = candidates[idx] = Candidate(lexical_form, [], [])
        c.surface_forms.append(span)
        c.sentence_ids.append(sent_i)
    return list(candidates.values())

Then set yake.cfg["candidate_selection"] = "custom", as you just did.

I did not find any other way to change the number of words in keywords.
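To see what n controls in the windowing loop above, here is a minimal sketch of the same loop on a plain list of strings instead of a spaCy Doc. The function name `ngrams` and the sample tokens are illustrative assumptions; only the loop bounds are taken from the code in this comment.

```python
from typing import Iterator, List, Tuple


def ngrams(tokens: List[str], n: int = 3) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous slice of 1 to n tokens."""
    n_tokens = len(tokens)
    window = min(n, n_tokens)
    for j in range(n_tokens):
        # k is the exclusive end index, capped at the end of the list
        for k in range(j + 1, min(j + 1 + window, n_tokens + 1)):
            yield tuple(tokens[j:k])


tokens = ["natural", "language", "processing"]

up_to_two = list(ngrams(tokens, n=2))
up_to_three = list(ngrams(tokens, n=3))

# n is an upper bound: shorter n-grams are produced as well
print(max(len(g) for g in up_to_two))
print(max(len(g) for g in up_to_three))
```

So n=2 does not mean "only bigrams": single words are still candidates, and anything longer than two words is never generated.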

talmago · 2022-02-08T15:01:14Z

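Following @Facenomore23's comment, I'd try:

from typing import Iterable

from spacy.tokens import Doc
# Candidate is assumed to be importable from spacy_ke.util, next to registry
from spacy_ke.util import Candidate, registry, ngram_selection

@registry.candidate_selection.register("custom")
def custom_selection(doc: Doc, n=3) -> Iterable[Candidate]:
    # Named differently from the imported ngram_selection, so the call below does not recurse
    return ngram_selection(doc, n=2)  # or any other logic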
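One thing to watch in a wrapper like this: if the wrapper has the same name as the function it wants to call, the call inside the body resolves to the wrapper itself and recurses until Python gives up. The sketch below shows both cases with a hypothetical toy registry and stand-in selectors that simply return the n they received; none of these names come from spacy_ke.

```python
REGISTRY = {}


def register(name):
    """Hypothetical decorator storing a function under a string name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


def ngram_selection(tokens, n=3):
    """Stand-in selector: reports the n it was called with."""
    return n


@register("custom")
def custom_selection(tokens, n=3):
    # Distinct name, so this call reaches the original selector
    return ngram_selection(tokens, n=2)


def other_selection(tokens, n=3):
    """Second stand-in selector, about to be shadowed."""
    return n


@register("shadowing")
def other_selection(tokens, n=3):  # rebinds the name other_selection
    # The name now points at this wrapper, so this is a self-call
    return other_selection(tokens, n=2)


custom_n = REGISTRY["custom"](["a", "b"])

try:
    REGISTRY["shadowing"](["a", "b"])
    recursed = False
except RecursionError:
    recursed = True

print(custom_n, recursed)
```

Giving the wrapper its own name, or importing the original under an alias, avoids the problem.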
