-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Release/v0.3.3 #47
Release/v0.3.3 #47
Conversation
* feat[chat]: vectorize extraction result for improved chat content * feat[Chat]: use references from extraction results as well
* feat(rollbar): add error boundaries on app and main view * refactor(rollbar): clean up error handling component * feat[Rollbar]: improve router logic
WalkthroughThe changes in this pull request introduce enhancements across several files, primarily focusing on the chat functionality and error handling mechanisms. The Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #47 +/- ##
==========================================
+ Coverage 60.15% 60.22% +0.07%
==========================================
Files 37 37
Lines 1719 1750 +31
==========================================
+ Hits 1034 1054 +20
- Misses 685 696 +11 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 14
🧹 Outside diff range and nitpick comments (5)
backend/app/config.py (1)
29-31
: Consider adding docstrings to document the purpose of these parameters.
Adding docstrings would help other developers understand:
- The purpose of these parameters
- The expected value ranges
- The impact on the chat extraction process
Example:
class Settings(BaseSettings):
# ... existing code ...
# Extraction References for chat
chat_extraction_doc_threshold: float = 0.5
"""Similarity threshold for document extraction (0 to 1). Higher values mean stricter matching."""
chat_extraction_max_docs: int = 50
"""Maximum number of documents to extract for each chat query."""
frontend/src/components/ErrorFallback.tsx (1)
43-49
: Add documentation for exported components.
The exported components would benefit from JSDoc documentation explaining their specific use cases and differences.
Add documentation like this:
+/**
+ * Global error fallback component with light text for dark backgrounds
+ * @example
+ * <ErrorBoundary FallbackComponent={GlobalErrorFallback}>
+ * <App />
+ * </ErrorBoundary>
+ */
export const GlobalErrorFallback = (props: ErrorFallbackProps) => (
<ErrorFallbackBase {...props} textColor="text-white" />
);
+/**
+ * View-specific error fallback component with dark text for light backgrounds
+ * @example
+ * <ErrorBoundary FallbackComponent={ViewErrorFallback}>
+ * <View />
+ * </ErrorBoundary>
+ */
export const ViewErrorFallback = (props: ErrorFallbackProps) => (
<ErrorFallbackBase {...props} textColor="text-black" />
);
backend/tests/processing/test_process_queue.py (1)
185-220
: Consider adding more test coverage.
While the test effectively validates the basic functionality, consider adding:
- Verification of ChromaDB initialization parameters
- Error cases (e.g., invalid project_id, None references)
Example test case for ChromaDB initialization:
# Verify ChromaDB initialization
mock_chroma_db.assert_called_once_with(
f"panda-etl-{project_id}",
similarity_threshold=3
)
backend/app/api/v1/chat.py (1)
84-84
: Consistency in concatenating references
When concatenating extraction_doc["reference"]
with docs[index]
, ensure that the order reflects how you want the references to appear. Currently, the new reference is prepended. If the desired order is chronological, you may want to append instead.
Consider adjusting the concatenation:
-docs[index] = f'{extraction_doc["reference"]}\n\n{docs[index]}'
+docs[index] = f'{docs[index]}\n\n{extraction_doc["reference"]}'
backend/app/processing/process_queue.py (1)
88-96
: Improve error message clarity
The error message references 'chat', which might be misleading in this context. Consider updating the message to accurately reflect the operation.
Apply this diff to improve the error message:
except Exception:
- logger.error(f"Failed to vectorize extraction results for chat {traceback.format_exc()}")
+ logger.error(f"Failed to vectorize extraction results: {traceback.format_exc()}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
frontend/yarn.lock
is excluded by!**/yarn.lock
,!**/*.lock
📒 Files selected for processing (9)
- backend/app/api/v1/chat.py (1 hunks)
- backend/app/config.py (1 hunks)
- backend/app/processing/process_queue.py (3 hunks)
- backend/tests/processing/test_process_queue.py (2 hunks)
- frontend/.env.example (1 hunks)
- frontend/package.json (2 hunks)
- frontend/src/app/(app)/layout.tsx (2 hunks)
- frontend/src/app/layout.tsx (2 hunks)
- frontend/src/components/ErrorFallback.tsx (1 hunks)
🧰 Additional context used
🪛 Gitleaks
frontend/.env.example
3-3: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🔇 Additional comments (11)
frontend/src/app/(app)/layout.tsx (2)
7-8
: LGTM! Error handling imports are properly structured.
The imports for error handling components are correctly placed and follow React best practices.
22-26
:
Verify Rollbar configuration setup.
While the ErrorBoundary implementation looks good, the Rollbar configuration seems to be missing. This is necessary for proper error tracking functionality.
Please add the Rollbar configuration before the layout component:
import { ErrorBoundary } from "@rollbar/react";
import { ViewErrorFallback } from "@/components/ErrorFallback";
+ import { Provider } from "@rollbar/react";
+
+ const rollbarConfig = {
+ accessToken: process.env.NEXT_PUBLIC_ROLLBAR_CLIENT_TOKEN,
+ environment: process.env.NODE_ENV
+ };
export default function RootLayout({
Also, wrap the entire layout with the Rollbar Provider:
export default function RootLayout({
children,
}: Readonly<{
children: React.ReactNode;
}>) {
return (
+ <Provider config={rollbarConfig}>
<SidebarProvider>
<ScrollProvider>
{/* ... existing layout code ... */}
</ScrollProvider>
</SidebarProvider>
+ </Provider>
);
}
Let's verify the environment variable is properly set:
frontend/src/app/layout.tsx (1)
7-8
: Verify component naming consistency
The import statement uses GlobalErrorFallback
but the AI summary mentions ViewErrorFallback
. Please ensure the component name is consistent across the codebase.
frontend/src/components/ErrorFallback.tsx (2)
1-13
: LGTM! Clean imports and well-structured interfaces.
The imports are well-organized and the interfaces are properly typed with clear separation of concerns.
1-49
: Verify error boundary integration in layout files.
Let's confirm that the ErrorFallback components are properly integrated with ErrorBoundary components.
✅ Verification successful
Error boundary integration is properly implemented
The error boundaries are correctly set up in the layout files:
- Root layout (
frontend/src/app/layout.tsx
) usesGlobalErrorFallback
with white text - App layout (
frontend/src/app/(app)/layout.tsx
) usesViewErrorFallback
with black text - Both layouts properly import
ErrorBoundary
from@rollbar/react
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify ErrorBoundary usage with ErrorFallback components
# Search for ErrorBoundary usage with our components
echo "Checking ErrorBoundary integration..."
rg -l "ErrorBoundary.*(?:Global|View)ErrorFallback" --type tsx
# Check if ErrorBoundary is imported where our components are used
echo "Checking ErrorBoundary imports..."
rg "import.*ErrorBoundary" --type tsx
Length of output: 322
Script:
#!/bin/bash
# Description: Verify ErrorBoundary usage with ErrorFallback components
# Search for ErrorBoundary usage with our components (without file type restriction)
echo "Checking ErrorBoundary integration..."
rg -l "ErrorBoundary.*(?:Global|View)ErrorFallback"
# Check for ErrorBoundary imports
echo "Checking ErrorBoundary imports..."
rg "import.*ErrorBoundary"
# Find all layout files that might use ErrorBoundary
echo "Checking layout files..."
fd "layout\.(ts|tsx|js|jsx)$"
# Search for usage of our ErrorFallback components
echo "Checking ErrorFallback usage..."
rg "(Global|View)ErrorFallback"
Length of output: 1361
frontend/package.json (1)
43-43
: LGTM!
The rollbar package is correctly added as a dependency, which is required as a peer dependency for @rollbar/react.
backend/tests/processing/test_process_queue.py (1)
3-3
: LGTM: Import changes are appropriate.
The addition of MagicMock
and vectorize_extraction_process_step
imports aligns with the new test requirements.
Also applies to: 9-9
backend/app/api/v1/chat.py (3)
68-85
: Ensure compatibility with the ChromaDB API
Verify that the usage of ChromaDB
and its methods aligns with the latest API specifications. Any discrepancies may lead to runtime errors.
Consult the ChromaDB documentation or codebase to confirm that parameters like similarity_threshold
and methods like get_relevant_docs
are used correctly.
78-85
:
Correct the indexing when iterating over extraction_docs["metadatas"]
Accessing extraction_docs["metadatas"][0]
assumes that the first element contains all metadata, which may not be accurate. Instead, you should iterate over all metadata entries to ensure all relevant documents are processed.
Apply this diff to fix the iteration:
-# Append text from single documents together
-for extraction_doc in extraction_docs["metadatas"][0]:
+for extraction_doc in extraction_docs["metadatas"]:
Double-check that extraction_docs["metadatas"]
is a list of metadata dictionaries corresponding to your documents.
72-75
: Verify the return structure of get_relevant_docs
Ensure that the get_relevant_docs
method returns data in the expected format. Misalignment in the expected structure can lead to KeyError
or incorrect data processing in subsequent steps.
Run the following script to confirm the structure of extraction_docs
:
✅ Verification successful
Based on the search results, I can now provide a definitive response about the structure of get_relevant_docs
:
Return structure of get_relevant_docs
is consistent with usage
From the test file and implementation, we can see that get_relevant_docs
returns a dictionary with expected keys including "metadatas" and "documents". This is verified by:
- The base class in
vectorstore.py
defines it returnsList[dict]
- Test file shows expected structure with "metadatas" containing page information
- Multiple usages in
process_queue.py
consistently access these fields - The implementation in ChromaDB class maintains this contract
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the keys returned by get_relevant_docs
# Test: Print the keys of the returned dictionary
python -c "
from app.vectorstore.chroma import ChromaDB
extract_vectorstore = ChromaDB('panda-etl-extraction-${project_id}')
extraction_docs = extract_vectorstore.get_relevant_docs('${chat_request.query}', k=${settings.chat_extraction_max_docs})
print(extraction_docs.keys())
"
Length of output: 331
Script:
#!/bin/bash
# Let's examine the implementation of get_relevant_docs method
ast-grep --pattern 'def get_relevant_docs'
# Also search for its usage patterns
rg "get_relevant_docs" -A 3
Length of output: 4573
backend/app/processing/process_queue.py (1)
49-51
: LGTM
Extracting project_id
and filename
from the process
and process_step
objects is appropriate for the subsequent processing steps.
# Extraction References for chat | ||
chat_extraction_doc_threshold: float = 0.5 | ||
chat_extraction_max_docs: int = 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider adding value validation for the new parameters.
While the default values are reasonable, it would be beneficial to add validation to ensure:
chat_extraction_doc_threshold
stays between 0 and 1chat_extraction_max_docs
remains positive
You can use pydantic's validators to enforce these constraints:
from pydantic import validator
class Settings(BaseSettings):
# ... existing code ...
@validator("chat_extraction_doc_threshold")
def validate_threshold(cls, v):
if not 0 <= v <= 1:
raise ValueError("chat_extraction_doc_threshold must be between 0 and 1")
return v
@validator("chat_extraction_max_docs")
def validate_max_docs(cls, v):
if v <= 0:
raise ValueError("chat_extraction_max_docs must be positive")
return v
<ErrorBoundary fallbackUI={GlobalErrorFallback}> | ||
<ReactQueryClientProvider> | ||
<Toaster position="top-right" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider specific error handling for React Query errors
While the ErrorBoundary will catch React Query errors, consider adding a specific error handler for query errors using React Query's onError
global configuration.
// In ReactQueryClientProvider.tsx
const queryClient = new QueryClient({
defaultOptions: {
queries: {
onError: (error) => {
// Log to Rollbar with specific context
rollbar.error('Query Error', error, {
context: 'React Query'
});
}
}
}
});
const rollbarConfig = { | ||
accessToken: process.env.NEXT_PUBLIC_ROLLBAR_ACCESS_TOKEN, | ||
environment: process.env.NODE_ENV ?? "production", | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add type safety and review environment fallback
Two suggestions for improvement:
- Consider adding TypeScript interface for the Rollbar configuration.
- The default fallback to "production" environment could be risky. Consider using a more restrictive fallback or throwing an error if NODE_ENV is not set.
+interface RollbarConfig {
+ accessToken: string | undefined;
+ environment: string;
+}
+
-const rollbarConfig = {
+const rollbarConfig: RollbarConfig = {
accessToken: process.env.NEXT_PUBLIC_ROLLBAR_ACCESS_TOKEN,
- environment: process.env.NODE_ENV ?? "production",
+ environment: process.env.NODE_ENV ?? "development",
};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const rollbarConfig = { | |
accessToken: process.env.NEXT_PUBLIC_ROLLBAR_ACCESS_TOKEN, | |
environment: process.env.NODE_ENV ?? "production", | |
}; | |
interface RollbarConfig { | |
accessToken: string | undefined; | |
environment: string; | |
} | |
const rollbarConfig: RollbarConfig = { | |
accessToken: process.env.NEXT_PUBLIC_ROLLBAR_ACCESS_TOKEN, | |
environment: process.env.NODE_ENV ?? "development", | |
}; |
<Provider config={rollbarConfig}> | ||
<ErrorBoundary fallbackUI={GlobalErrorFallback}> | ||
<ReactQueryClientProvider> | ||
<Toaster position="top-right" /> | ||
{children} | ||
</ReactQueryClientProvider> | ||
</ErrorBoundary> | ||
</Provider> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider additional Rollbar configuration options
The basic Rollbar setup is good, but consider adding these recommended configuration options for better error tracking:
captureUncaught
: to capture unhandled exceptionscaptureUnhandledRejections
: for promise rejectionspayload
: to include version and environment data
const rollbarConfig: RollbarConfig = {
accessToken: process.env.NEXT_PUBLIC_ROLLBAR_ACCESS_TOKEN,
environment: process.env.NODE_ENV ?? "development",
+ captureUncaught: true,
+ captureUnhandledRejections: true,
+ payload: {
+ client: {
+ javascript: {
+ code_version: process.env.NEXT_PUBLIC_APP_VERSION,
+ source_map_enabled: true,
+ }
+ }
+ }
};
Committable suggestion was skipped due to low confidence.
const pathname = usePathname(); | ||
const [initialPathname, setInitialPathname] = useState<string | null>(null); | ||
|
||
useEffect(() => { | ||
if (initialPathname === null) { | ||
setInitialPathname(pathname); | ||
} else if (pathname !== initialPathname && error) { | ||
resetError(); | ||
} | ||
}, [pathname, initialPathname, error, resetError]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify pathname tracking logic to prevent potential race conditions.
The current implementation using initialPathname
state could be simplified to avoid potential race conditions during rapid navigation changes.
Consider this alternative implementation:
- const [initialPathname, setInitialPathname] = useState<string | null>(null);
+ const [previousPathname, setPreviousPathname] = useState(pathname);
useEffect(() => {
- if (initialPathname === null) {
- setInitialPathname(pathname);
- } else if (pathname !== initialPathname && error) {
+ if (pathname !== previousPathname && error) {
resetError();
+ setPreviousPathname(pathname);
}
- }, [pathname, initialPathname, error, resetError]);
+ }, [pathname, previousPathname, error, resetError]);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const pathname = usePathname(); | |
const [initialPathname, setInitialPathname] = useState<string | null>(null); | |
useEffect(() => { | |
if (initialPathname === null) { | |
setInitialPathname(pathname); | |
} else if (pathname !== initialPathname && error) { | |
resetError(); | |
} | |
}, [pathname, initialPathname, error, resetError]); | |
const pathname = usePathname(); | |
const [previousPathname, setPreviousPathname] = useState(pathname); | |
useEffect(() => { | |
if (pathname !== previousPathname && error) { | |
resetError(); | |
setPreviousPathname(pathname); | |
} | |
}, [pathname, previousPathname, error, resetError]); |
extract_vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}", | ||
similarity_threshold=settings.chat_extraction_doc_threshold) | ||
|
||
# Extract reference documents from the extraction results from db | ||
extraction_docs = extract_vectorstore.get_relevant_docs( | ||
chat_request.query, | ||
k=settings.chat_extraction_max_docs | ||
) | ||
|
||
# Append text from single documents together | ||
for extraction_doc in extraction_docs["metadatas"][0]: | ||
index = next((i for i, item in enumerate(ordered_file_names) if item == extraction_doc["filename"]), None) | ||
if index is None: | ||
ordered_file_names.append(extraction_doc["filename"]) | ||
docs.append(extraction_doc["reference"]) | ||
else: | ||
docs[index] = f'{extraction_doc["reference"]}\n\n{docs[index]}' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling for potential exceptions
The new code block does not account for possible exceptions such as KeyError
or AttributeError
that may arise from accessing dictionary keys or list indices. To ensure robustness, include error handling around these operations.
Wrap the code in a try-except block or validate the data before accessing it. For example:
for extraction_doc in extraction_docs.get("metadatas", []):
if "filename" in extraction_doc and "reference" in extraction_doc:
# Proceed with processing
else:
# Handle missing keys appropriately
extract_vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}", | ||
similarity_threshold=settings.chat_extraction_doc_threshold) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify persist_directory
when initializing ChromaDB
When creating the extract_vectorstore
, you should specify the persist_directory
to ensure that the vector store points to the correct data location. Without it, ChromaDB
might not load the expected data, leading to inconsistencies.
Apply this diff to include the persist_directory
:
extract_vectorstore = ChromaDB(
f"panda-etl-extraction-{project_id}",
+ persist_directory=settings.persist_directory,
similarity_threshold=settings.chat_extraction_doc_threshold
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
extract_vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}", | |
similarity_threshold=settings.chat_extraction_doc_threshold) | |
extract_vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}", | |
persist_directory=settings.persist_directory, | |
similarity_threshold=settings.chat_extraction_doc_threshold) |
index = next((i for i, item in enumerate(ordered_file_names) if item == extraction_doc["filename"]), None) | ||
if index is None: | ||
ordered_file_names.append(extraction_doc["filename"]) | ||
docs.append(extraction_doc["reference"]) | ||
else: | ||
docs[index] = f'{extraction_doc["reference"]}\n\n{docs[index]}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Optimize filename lookup by using a dictionary
The current method of finding the index using a list comprehension has a time complexity of O(n). For better performance, consider using a dictionary to map filenames to their indices.
Refactor the code as follows:
+# Create a mapping from filename to index for quick lookup
+filename_to_index = {filename: idx for idx, filename in enumerate(ordered_file_names)}
for extraction_doc in extraction_docs["metadatas"]:
- index = next((i for i, item in enumerate(ordered_file_names) if item == extraction_doc["filename"]), None)
+ index = filename_to_index.get(extraction_doc["filename"])
if index is None:
ordered_file_names.append(extraction_doc["filename"])
docs.append(extraction_doc["reference"])
+ # Update the mapping with the new filename and index
+ filename_to_index[extraction_doc["filename"]] = len(ordered_file_names) - 1
else:
docs[index] = f'{extraction_doc["reference"]}\n\n{docs[index]}'
Committable suggestion was skipped due to low confidence.
# vectorize extraction result | ||
try: | ||
vectorize_extraction_process_step(project_id=project_id, | ||
process_step_id=process_step_id, | ||
filename=filename, | ||
references=data["context"]) | ||
except Exception : | ||
logger.error(f"Failed to vectorize extraction results for chat {traceback.format_exc()}") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid catching broad exceptions
Catching all exceptions with except Exception
can make debugging difficult and may mask unexpected errors. It's better to catch specific exceptions or re-raise the original exception after logging.
Consider modifying the exception handling as follows:
try:
vectorize_extraction_process_step(project_id=project_id,
process_step_id=process_step_id,
filename=filename,
references=data["context"])
-except Exception:
+except SomeSpecificException as e:
logger.error(f"Failed to vectorize extraction results: {traceback.format_exc()}")
+ raise e
Replace SomeSpecificException
with the specific exception(s) that vectorize_extraction_process_step
might raise. If you need to handle multiple exceptions, you can list them in a tuple.
Committable suggestion was skipped due to low confidence.
def vectorize_extraction_process_step(project_id: int, process_step_id: int, filename: str, references: dict) -> None: | ||
# Vectorize extraction result and dump in database | ||
field_references = {} | ||
|
||
# Loop to concatenate sources for each reference | ||
for extraction_references in references: | ||
for extraction_reference in extraction_references: | ||
sources = extraction_reference.get("sources", []) | ||
if sources: | ||
sources_catenated = "\n".join(sources) | ||
field_references.setdefault(extraction_reference["name"], "") | ||
field_references[extraction_reference["name"]] += ( | ||
"\n" + sources_catenated if field_references[extraction_reference["name"]] else sources_catenated | ||
) | ||
|
||
# Only proceed if there are references to add | ||
if not field_references: | ||
return | ||
|
||
# Initialize Vectorstore | ||
vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}") | ||
|
||
docs = [f"{filename} {key}" for key in field_references] | ||
metadatas = [ | ||
{ | ||
"project_id": project_id, | ||
"process_step_id": process_step_id, | ||
"filename": filename, | ||
"reference": reference | ||
} | ||
for reference in field_references.values() | ||
] | ||
|
||
# Add documents to vectorstore | ||
vectorstore.add_docs(docs=docs, metadatas=metadatas) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix incorrect assignment of documents and metadata in vectorization
Currently, the docs
variable contains filename
and key
, which are not the actual content to be vectorized. The metadatas
assign the concatenated sources to the reference
field, which may not be intended. This could lead to incorrect data being stored in the vector store.
Apply this diff to correct the assignments:
# Build docs and metadatas
-def vectorize_extraction_process_step(project_id: int, process_step_id: int, filename: str, references: dict) -> None:
- # ... [existing code] ...
- # Initialize Vectorstore
- vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}")
-
- docs = [f"{filename} {key}" for key in field_references]
- metadatas = [
- {
- "project_id": project_id,
- "process_step_id": process_step_id,
- "filename": filename,
- "reference": reference
- }
- for reference in field_references.values()
- ]
+ # Initialize Vectorstore
+ vectorstore = ChromaDB(f"panda-etl-extraction-{project_id}")
+
+ docs = list(field_references.values())
+ metadatas = [
+ {
+ "project_id": project_id,
+ "process_step_id": process_step_id,
+ "filename": filename,
+ "reference_name": key
+ }
+ for key in field_references.keys()
+ ]
Explanation:
- Docs Correction:
docs
should contain the concatenated sources (field_references.values()
), which are the actual content to be vectorized. - Metadata Correction: In
metadatas
, usekey
(the reference name) for thereference_name
field to accurately describe the metadata associated with each document.
Ensure that docs
and metadatas
are correctly aligned so that each document corresponds to the correct metadata entry.
Committable suggestion was skipped due to low confidence.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Tests