Providingmethod sourcedocument #2676
@vincent-4 I think it's okay to deprecate the string parsing methods, i.e., remove them completely? Once this is in a reasonable state I will verify regressions. There's no need/reason to keep the jank around...
hey @vincent-4 this PR is still in draft... is it ready for me to look at?
Yeah, I would appreciate it. Btw, unrelated, but while you're looking at this: I left a message about parquet-floor's issues. Essentially, parquet-floor's re-implementation of parquet's native stuff breaks, and I can't get it working properly. (Although I'm guessing you already saw on Slack.)
private Map<String, String> fields;

public Document(JsonNode json) {
  super();
  this.raw = json.toPrettyString();
  this.id = json.get("docid").asText();
  this.contents = json.get("vector").toString();
  JsonNode vectorNode = json.get("vector");
  if (vectorNode != null && vectorNode.isArray()) {
indentation issue?
@@ -23,7 +23,8 @@
 import java.nio.file.Path;

 /**
- * A JSON document collection where the user can specify directly the vector to be indexed.
+ * A JSON document collection where the user can specify directly the vector to
revert? keep on single line please.
@@ -39,7 +40,8 @@ public FileSegment<JsonVectorCollection.Document> createFileSegment(Path p) thro
   }

   @Override
-  public FileSegment<JsonVectorCollection.Document> createFileSegment(BufferedReader bufferedReader) throws IOException {
+  public FileSegment<JsonVectorCollection.Document> createFileSegment(BufferedReader bufferedReader)
+      throws IOException {
revert? don't change things that aren't relevant?

 public Document(JsonNode json) {
   super(json);

-  // We're going to take the map associated with "vector" and generate pseudo-document.
+  // We're going to take the map associated with "vector" and generate
+  // pseudo-document.
same
@@ -17,6 +17,8 @@
package io.anserini.collection;

import java.util.Map;
import org.junit.Test;
import static org.junit.Assert.*;
don't use * import please.
@@ -29,11 +29,16 @@
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonStringVectorTopicReader extends TopicReader<String> {
  private final Map<String, float[]> vectorCache = new HashMap<>();
can you explain why we need a cache? I don't think topics are repeated?
hi @vincent-4 gave you some comments.
e58113a to 2e5ff74
405d1a5 to 2546813
 * @return vector as float array, or null if not available
 * @throws IOException if error encountered during access to index
 */
public static float[] getDenseVector(IndexReader reader, String docid) throws IOException {
Do we need this method anymore? I don't think we ever parse from string in the current impl?
Removed above - passes tests
Thoughts?
    }
  });
  this.contents = vectorNode.toString();
I'm thinking here... what should be the behavior here? Returning `null` is another option if the document only has vector data? Thoughts?
If we return `null` here then we would never need `getDenseVector`, right?
Yeah then we can omit it..
We can just have one rep. for the vector
Implementing!
Yup, I think that makes sense, thanks!
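For readers following the thread, the agreed behavior might look roughly like the sketch below. It is not the code from this PR: `VectorOnlyDocument` and its fields are hypothetical names, and the only ideas taken from the discussion are that `contents()` returns `null` for vector-only documents and that the vector is kept in a single representation.

```java
import java.util.Arrays;

// Hypothetical document type (not an Anserini class) that keeps the dense
// vector as its single representation.
final class VectorOnlyDocument {
  private final String id;
  private final float[] vector;

  VectorOnlyDocument(String id, float[] vector) {
    this.id = id;
    // Copy so later changes to the caller's array do not leak into the document.
    this.vector = Arrays.copyOf(vector, vector.length);
  }

  String id() {
    return id;
  }

  // No string form of the vector is stored, so nothing has to be parsed back later.
  String contents() {
    return null;
  }

  // Direct access to the vector; a copy is returned to keep the document immutable.
  float[] vector() {
    return Arrays.copyOf(vector, vector.length);
  }
}
```

With this shape, a caller that needs the vector asks the document for it directly, which is why a string-parsing helper such as `getDenseVector` would no longer be needed.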
lg for now. I'm going to run regressions and make sure everything still works before I merge.
Changes made:
- Added `vector()` method to `SourceDocument` interface as an optional default method that returns `float[]` for direct vector access
- Implemented `vector()` in collection classes:
  - `JsonDenseVectorCollection`: Returns parsed vector for array format
  - `JsonVectorCollection`: Returns vector for dense format, null for sparse
- Modified document generators to use direct vector access:
  - `vector()` first
- Added vector validation in tests:
  - `validateVectorIfPresent` in JsonVectorCollectionTest

Addresses: #2661
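A minimal sketch of the pattern described in the summary above, for illustration only. `SketchSourceDocument`, `DenseSketchDocument`, `TextSketchDocument`, and `requireVector` are hypothetical names, not Anserini's actual types. The sketch also assumes that the default `vector()` returns `null` when no vector is available, which the summary implies for the sparse case but does not state for the interface default.

```java
// Hypothetical stand-in for the SourceDocument interface; only vector() is
// described in the PR summary.
interface SketchSourceDocument {
  String id();

  String contents();

  // Assumption: the default reports "no vector" as null, so existing
  // implementations keep compiling without changes.
  default float[] vector() {
    return null;
  }
}

// Dense document: overrides vector() to expose the parsed array directly.
final class DenseSketchDocument implements SketchSourceDocument {
  private final String id;
  private final float[] vector;

  DenseSketchDocument(String id, float[] vector) {
    this.id = id;
    this.vector = vector;
  }

  @Override public String id() { return id; }
  @Override public String contents() { return null; }
  @Override public float[] vector() { return vector; }
}

// Text document: inherits the default vector(), which returns null.
final class TextSketchDocument implements SketchSourceDocument {
  private final String id;
  private final String contents;

  TextSketchDocument(String id, String contents) {
    this.id = id;
    this.contents = contents;
  }

  @Override public String id() { return id; }
  @Override public String contents() { return contents; }
}

final class VectorAccessSketch {
  // Hypothetical generator-side helper: ask the document for its vector
  // first and fail with a clear message if it has none.
  static float[] requireVector(SketchSourceDocument doc) {
    float[] v = doc.vector();
    if (v == null) {
      throw new IllegalArgumentException("Document " + doc.id() + " has no dense vector");
    }
    return v;
  }
}
```

Making the method a default keeps the change backward compatible: only the collections that actually carry dense vectors need to override it.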