[odin] EmbeddingsResource should extend ExplicitWordEmbeddingMap #679
Odin's EmbeddingsResource extends the deprecated SanitizedWordEmbeddingMap. Tangential, but have we given any thought to instead using an ANN index (ex. annoy4s) for Odin?
In order to avoid the deprecation, the code below can be used. However, since an InputStream is being used, nothing keeps track of whether this set of vectors has already been loaded for other purposes. Coordinating that would take something more; one possibility is sketched after the snippet.

```scala
package org.clulab.odin.impl

import org.clulab.embeddings.{ExplicitWordEmbeddingMap, WordEmbeddingMap}
import org.clulab.scala.WrappedArray._

import java.io.InputStream

trait OdinResource

// for distributional similarity comparisons
class EmbeddingsResource(is: InputStream) extends OdinResource {
  // Parse the vectors from the stream in text (non-binary) format.
  val wordEmbeddingMap = ExplicitWordEmbeddingMap(is, binary = false)

  // Dot product of the two word vectors, or -1 if either word is out of vocabulary.
  def similarity(w1: String, w2: String): Double = {
    val scoreOpt = for {
      vec1 <- wordEmbeddingMap.get(w1)
      vec2 <- wordEmbeddingMap.get(w2)
    } yield WordEmbeddingMap.dotProduct(vec1, vec2).toDouble

    scoreOpt.getOrElse(-1d)
  }
}
```
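A minimal sketch of that coordination, assuming nothing beyond the `ExplicitWordEmbeddingMap` constructor shown above: a process-wide pool that caches each loaded map under its resource name, so repeated requests share one copy. The `EmbeddingsPool` object is hypothetical, not existing processors API.

```scala
import org.clulab.embeddings.{ExplicitWordEmbeddingMap, WordEmbeddingMap}

import scala.collection.mutable

// Hypothetical sketch: cache loaded embedding maps by resource name so that
// several EmbeddingsResources can share a single copy of the vectors.
object EmbeddingsPool {
  private val maps = mutable.Map.empty[String, WordEmbeddingMap]

  def getOrLoad(resourceName: String): WordEmbeddingMap = synchronized {
    maps.getOrElseUpdate(resourceName, {
      // Assumes the vectors are available as a classpath resource in text format.
      val is = getClass.getResourceAsStream(resourceName)
      try ExplicitWordEmbeddingMap(is, binary = false)
      finally is.close()
    })
  }
}
```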
Thanks for the snippet, @kwalcock. Have you all talked about using an ANN index for a large set of embeddings? Since processors is still using static word embeddings, I am thinking n-gram embeddings could help to improve the relevance of multi-token matches.
As in approximate nearest neighbor? Some were used for ConceptAlignment in alignment/indexer/knn/hnswlib. Specifically, this library was used: hnswlib. Only individual strings were added to the index, so I suppose that's unigram. Are you wanting to pair the words and concatenate their vectors? I haven't heard of that mentioned in relation to processors.
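For reference, a minimal sketch of what indexing word vectors with that library looks like, based on the hnswlib Scala wrapper's documented usage (the `Word` case class and the toy vectors here are illustrative, not code from ConceptAlignment):

```scala
import com.github.jelmerk.knn.scalalike._
import com.github.jelmerk.knn.scalalike.hnsw._

// Each indexed item pairs a word (the id) with its embedding vector.
case class Word(id: String, vector: Array[Float]) extends Item[String, Array[Float]] {
  override def dimensions: Int = vector.length
}

val words = Seq(
  Word("stallone", Array(0.1f, 0.8f, 0.3f)),
  Word("schwarzenegger", Array(0.2f, 0.7f, 0.4f))
)

// Build an HNSW index over cosine distance and query for nearest neighbors.
val index = HnswIndex[String, Array[Float], Word, Float](
  3, floatCosineDistance, maxItemCount = words.size)
index.addAll(words)
val neighbors = index.findNearest(words.head.vector, k = 10)
```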
No, I meant averaging: summing and then averaging element-wise. It seems my memory is mistaken, though; we don't currently support this kind of thing.
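For illustration, a minimal sketch of that composition, an element-wise sum divided by the number of words. The `averageVectors` helper is hypothetical; only `WordEmbeddingMap.get` is assumed from processors:

```scala
import org.clulab.embeddings.WordEmbeddingMap

// Hypothetical helper, not existing processors API: compose an n-gram
// embedding by summing the word vectors element-wise and dividing by n.
def averageVectors(map: WordEmbeddingMap, words: Seq[String]): Option[Array[Float]] = {
  val vectors = words.flatMap(word => map.get(word))
  if (words.isEmpty || vectors.size != words.size) None // give up if any word is unknown
  else {
    val sum = new Array[Float](vectors.head.length)
    for (vector <- vectors; i <- sum.indices)
      sum(i) += vector(i)
    Some(sum.map(_ / vectors.size))
  }
}
```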
Yes, an approximate nearest neighbors index. Right now Odin token constraints support expressions like … Larger context: I am thinking about extending Odin to support a new kind of embedding-based NER (just a sketch below):

```yaml
- name: "embedding-ner"
  label: ActionStar
  type: embedding
  # will compare available embeddings for n-grams of the specified sizes
  phrases: [1, 2, 3]
  pattern: |
    ave("Sylvester Stallone", "Arnold Schwarzenegger") > .9
```