Config issue #15

Open
hubalaga opened this issue Jan 30, 2022 · 2 comments

Comments

@hubalaga

Hi,

The following line from the example cannot be run:
yake.cfg["candidate_selection"] = {"ngram": 3}

It fails with TypeError: unhashable type: 'dict'
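
The error message suggests that the config value ends up being used as a lookup key for a registered function name, so it has to be hashable (a string, for instance). Here is a minimal sketch of that failure mode, assuming a plain name-to-function mapping. The `REGISTRY` dict and `resolve` helper are hypothetical stand-ins for illustration, not spacy_ke's actual code:

```python
# Hypothetical stand-in for a name -> function registry; not spacy_ke's real implementation.
REGISTRY = {"ngram": lambda doc, n=3: f"ngrams up to {n}"}


def resolve(name):
    """Look up a selection function by its registered name."""
    return REGISTRY[name]


cfg = {"candidate_selection": "ngram"}
selector = resolve(cfg["candidate_selection"])  # a string key can be looked up

cfg["candidate_selection"] = {"ngram": 3}
try:
    resolve(cfg["candidate_selection"])  # a dict is unhashable, so it cannot be a key
except TypeError as err:
    message = str(err)
```

If the library works this way, the config can only name a selection function and cannot carry its arguments, which would explain why the parameter has to be changed elsewhere.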

@Facenomore23

Facenomore23 commented Feb 7, 2022

The only ways I found to change the number of words in the n-gram were:

  • Change the code in the package directory, setting n to another number. Here I want at most 2 words per keyword:
@registry.candidate_selection.register("ngram")
def ngram_selection(doc: Doc, n=2) -> Iterable[Candidate]:
  • Create another function as suggested, copying the code from the method and changing n (which is basically the same as the above):
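# Imports the snippet relies on (assuming Candidate lives in spacy_ke.util next to registry)
from typing import Iterable, Iterator, Tuple

from spacy.tokens import Doc, Span
from spacy_ke.util import Candidate, registry

@registry.candidate_selection.register("custom")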
def custom_selection(doc: Doc, n=2) -> Iterable[Candidate]:
    """Get keywords candidates from ngrams.
    Args:
        doc (Doc): doc.
        n (int): ngram range.
    Returns:
        Iterable[Candidate]
    """
    
    def _is_candidate(span: Span, min_length=3, min_word_length=2, alpha=True) -> bool:
        """Check if N-gram span is qualified as a candidate.
        Args:
            span (Span): n-gram.
            min_length (int): minimum length for n-gram.
            min_word_length (int): minimum length for word in an ngram.
            alpha (bool): Filter n-grams with non-alphanumeric words.
        Returns:
            bool
        """
        n_span = len(span)
        # discard if composed of 1-2 characters
        if len(span.text) < min_length:
            return False
        for token_idx, token in enumerate(span):
            # discard if contains punct
            if token.is_punct:
                return False
            # discard if contains tokens shorter than min_word_length
            if len(token.text) < min_word_length:
                return False
            # discard if contains non alphanumeric
            if alpha and not token.is_alpha:
                return False
            # discard if first/last word is a stop
            if token_idx in (0, n_span - 1) and token.is_stop:
                return False
        # discard if ends with NOUN -> VERB
        if n_span >= 2 and span[-2].pos_ == "NOUN" and span[-1].pos_ == "VERB":
            return False
        return True

    surface_forms = [sf for sf in _ngrams(doc, n=n) if _is_candidate(sf[1])]
    return _merge_surface_forms(surface_forms)

def _ngrams(doc: Doc, n=3) -> Iterator[Tuple[int, Span]]:
    """Select all the n-grams and populate the candidate container.
    Args:
        doc (Doc): doc.
        n (int): the maximum n-gram length, defaults to 3.
    Returns:
        Iterator of (sentence_id<int>, ngram<Span>) tuples
    """
    for sentence_id, sentence in enumerate(doc.sents):
        n_tokens = len(sentence)
        window = min(n, n_tokens)
        for j in range(n_tokens):
            for k in range(j + 1, min(j + 1 + window, n_tokens + 1)):
                yield sentence_id, sentence[j:k]


def _merge_surface_forms(surface_forms: Iterator[Tuple[int, Span]]) -> Iterable[Candidate]:
    """De-dup candidate surface forms.
    Args:
        surface_forms (Iterable): tuples of <sent_i, span>.
    Returns:
        List of candidates.
    """
    candidates = dict()
    for sent_i, span in surface_forms:
        idx = span.lemma_.lower()
        try:
            c = candidates[idx]
        except (KeyError, IndexError):
            lexical_form = [token.lemma_.lower() for token in span]
            c = candidates[idx] = Candidate(lexical_form, [], [])
        c.surface_forms.append(span)
        c.sentence_ids.append(sent_i)
    return list(candidates.values())

Then you have to set yake.cfg["candidate_selection"] = "custom", as you just did.

I did not find any other way to change the number of words in keywords.
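
To make it clearer what `n` controls, here is a small sketch of the same loop bounds as `_ngrams`, applied to a plain list of strings instead of a spaCy `Doc`. The `ngrams` helper and the sample sentence are illustrative only:

```python
from typing import Iterator, List, Tuple


def ngrams(tokens: List[str], n: int = 3) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous run of 1..n tokens, using the loop bounds from _ngrams."""
    n_tokens = len(tokens)
    window = min(n, n_tokens)
    for j in range(n_tokens):
        for k in range(j + 1, min(j + 1 + window, n_tokens + 1)):
            yield tuple(tokens[j:k])


sentence = ["natural", "language", "processing"]
bigrams = list(ngrams(sentence, n=2))
longest = max(len(gram) for gram in bigrams)  # never exceeds n
```

So `n` is an upper bound: with n=2 the candidates include single words as well as two-word phrases, but nothing longer.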

@talmago
Owner

talmago commented Feb 8, 2022

Following @Facenomore23's comment, I'd try:

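from typing import Iterable

from spacy.tokens import Doc
from spacy_ke.util import Candidate, registry, ngram_selection  # assuming Candidate is exported here

@registry.candidate_selection.register("custom")
def custom_selection(doc: Doc, n=3) -> Iterable[Candidate]:
    # a distinct name keeps the imported ngram_selection from being shadowed (which would recurse forever)
    return ngram_selection(doc, n=2)  # or any other logic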
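
The wrapper idea can be tried out without spaCy installed. The sketch below uses a hypothetical stand-in registry (`SELECTORS`, `register`) and a toy selector over plain strings. None of these names come from spacy_ke; they only mirror the shape of the suggestion:

```python
from typing import Callable, Dict, List

# Hypothetical stand-in registry; spacy_ke's own registry is not used here.
SELECTORS: Dict[str, Callable[..., List[str]]] = {}


def register(name: str):
    """Decorator that stores a function under the given name."""
    def decorator(func):
        SELECTORS[name] = func
        return func
    return decorator


@register("ngram")
def ngram_selection(tokens: List[str], n: int = 3) -> List[str]:
    """Toy selector: join every contiguous run of 1..n tokens into a phrase."""
    return [
        " ".join(tokens[j:k])
        for j in range(len(tokens))
        for k in range(j + 1, min(j + n, len(tokens)) + 1)
    ]


@register("custom")
def custom_selection(tokens: List[str], n: int = 3) -> List[str]:
    # Delegate to the base selector with n pinned to 2. The wrapper needs its
    # own name: reusing "ngram_selection" would make this call invoke itself.
    return ngram_selection(tokens, n=2)


phrases = SELECTORS["custom"](["keyword", "extraction", "with", "spacy"])
```

Selecting "custom" by name then yields phrases of at most two words, while the original "ngram" entry stays registered and unchanged.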
