Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Providingmethod sourcedocument #2676

Merged
merged 5 commits into from
Jan 21, 2025

Conversation

vincent-4
Copy link
Contributor

Changes made:

  1. Added vector() method to SourceDocument interface as an optional default method that returns float[] for direct vector access

  2. Implemented vector() in collection classes:

    • JsonDenseVectorCollection: Returns parsed vector for array format
    • JsonVectorCollection: Returns vector for dense format, null for sparse
  3. Modified document generators to use direct vector access:

    • Try vector() first
    • Fall back to string parsing for backward compatibility
    • Affects: JsonDenseVectorDocumentGenerator, JsonInvertedDenseVectorDocumentGenerator
  4. Added vector validation in tests:

    • Added helper method validateVectorIfPresent in JsonVectorCollectionTest
    • Verifies correct vector parsing and null handling
    • Tests both array and non-array formats

Addresses: #2661

@lintool
Copy link
Member

lintool commented Jan 12, 2025

@vincent-4 I think it's okay to deprecate the string parsing methods? I.e., remove completely. Once this is in a reasonable state I will verify regressions - there's no need/reason to keep the jank around...

@lintool
Copy link
Member

lintool commented Jan 16, 2025

hey @vincent-4 this PR is still in draft... is it ready for me to look at?

@vincent-4
Copy link
Contributor Author

vincent-4 commented Jan 16, 2025

Yeah, I would appreciate it.

Btw, unrelated, but I thought I'd send a message while you're looking at this too: but I left a message about parquet-floor's issues: essentially the re-implementation of parquet's native stuff by parquet-floor breaks, and I can't get it working properly. (Although I'm guessing you already saw on Slack)

private Map<String, String> fields;

public Document(JsonNode json) {
super();
this.raw = json.toPrettyString();
this.id = json.get("docid").asText();
this.contents = json.get("vector").toString();
JsonNode vectorNode = json.get("vector");
if (vectorNode != null && vectorNode.isArray()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation issue?

@@ -23,7 +23,8 @@
import java.nio.file.Path;

/**
* A JSON document collection where the user can specify directly the vector to be indexed.
* A JSON document collection where the user can specify directly the vector to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert? keep on single line please.

@@ -39,7 +40,8 @@ public FileSegment<JsonVectorCollection.Document> createFileSegment(Path p) thro
}

@Override
public FileSegment<JsonVectorCollection.Document> createFileSegment(BufferedReader bufferedReader) throws IOException {
public FileSegment<JsonVectorCollection.Document> createFileSegment(BufferedReader bufferedReader)
throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert? don't change things that aren't relevant?


public Document(JsonNode json) {
super(json);

// We're going to take the map associated with "vector" and generate pseudo-document.
// We're going to take the map associated with "vector" and generate
// pseudo-document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -17,6 +17,8 @@
package io.anserini.collection;

import java.util.Map;
import org.junit.Test;
import static org.junit.Assert.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use * import please.

@@ -29,11 +29,16 @@
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonStringVectorTopicReader extends TopicReader<String> {
private final Map<String, float[]> vectorCache = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why we need a cache? I don't think topics are repeated?

@lintool lintool marked this pull request as ready for review January 17, 2025 01:18
@lintool
Copy link
Member

lintool commented Jan 17, 2025

hi @vincent-4 gave you some comments.

@vincent-4 vincent-4 force-pushed the providingmethod-sourcedocument branch from e58113a to 2e5ff74 Compare January 17, 2025 01:48
@vincent-4 vincent-4 force-pushed the providingmethod-sourcedocument branch from 405d1a5 to 2546813 Compare January 17, 2025 03:07
* @return vector as float array, or null if not available
* @throws IOException if error encountered during access to index
*/
public static float[] getDenseVector(IndexReader reader, String docid) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this method anymore? I don't think we ever parse from string in the current impl?

@vincent-4
Copy link
Contributor Author

Removed above - passes tests
Do you know of other instances of parsing that can be made redundant? I looked at a few but didn't immediately recognize anything standing out.

@lintool lintool self-requested a review January 18, 2025 17:36
Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts?

}
});
this.contents = vectorNode.toString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking here... what should be the behavior here? Returning null is another option if the document only has vector data? Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return null here then we would never need getDenseVector right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah then we can omit it..
We can just have one rep. for the vector
Implementing!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I think that makes sense, thanks!

@lintool
Copy link
Member

lintool commented Jan 18, 2025

lg for now. I'm going to run regressions and make sure everything still works before I merge.

@lintool lintool self-requested a review January 21, 2025 00:30
@lintool lintool merged commit df900e8 into castorini:master Jan 21, 2025
1 check passed
@vincent-4 vincent-4 deleted the providingmethod-sourcedocument branch January 21, 2025 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants