feat: add pandas df accessor #287

shreyashankar · 2025-01-23T01:26:04Z

Adds a pandas DataFrame accessor (.semantic) that provides LLM-powered operations through DocETL's engine. This integration is inspired by the LOTUS paper (Patel et al. 2024) and enables:

Semantic mapping with LLMs (df.semantic.map())
Semantic filtering (df.semantic.filter())
Fuzzy merging of DataFrames (df.semantic.merge())
Semantic aggregation with optional fuzzy matching (df.semantic.agg())
Automatic cost tracking and operation history

Example usage:

import pandas as pd
from docetl import SemanticAccessor  # importing this registers the .semantic accessor

df = pd.DataFrame({"text": ["Apple released iPhone 15", "Google launches Pixel 8"]})
df.semantic.set_config(default_model="gpt-4o-mini")

# Each row is rendered into the prompt via {{input.<column>}}
result = df.semantic.map(
    prompt="Extract company and product from: {{input.text}}",
    output_schema={"company": "str", "product": "str"},
)

# result is a DataFrame with two new columns: company and product
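
For readers new to pandas accessors, the mechanism behind a namespace like df.semantic can be sketched with pandas' public extension API, pd.api.extensions.register_dataframe_accessor. The sketch below is not DocETL's implementation: the accessor name toy_semantic, the class ToySemanticAccessor, the fake_extract stub (which stands in for an LLM call), and the cost_per_row parameter are all hypothetical and exist only to show the shape of the pattern, including a running cost total and an operation history.

```python
import pandas as pd


# Hypothetical accessor name; DocETL's real accessor is registered as "semantic".
@pd.api.extensions.register_dataframe_accessor("toy_semantic")
class ToySemanticAccessor:
    """Toy stand-in for an LLM-backed accessor (assumption, not DocETL code)."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj
        self.total_cost = 0.0  # running cost of all operations on this DataFrame
        self.history = []      # (operation name, output columns) per call

    def map(self, fn, output_columns, cost_per_row=0.0):
        # Apply fn to every row (as a dict) and collect the returned fields.
        rows = [fn(row) for row in self._df.to_dict("records")]
        new_cols = pd.DataFrame(rows, index=self._df.index)[output_columns]
        # Record cost and history on the accessor attached to the input frame.
        self.total_cost += cost_per_row * len(self._df)
        self.history.append(("map", list(output_columns)))
        return pd.concat([self._df, new_cols], axis=1)


def fake_extract(row):
    """Stub in place of an LLM: first word is the company, words 3+ the product."""
    company, _, rest = row["text"].partition(" ")
    product = rest.split(" ", 1)[1]
    return {"company": company, "product": product}


df = pd.DataFrame({"text": ["Apple released iPhone 15", "Google launches Pixel 8"]})
result = df.toy_semantic.map(fake_extract, ["company", "product"], cost_per_row=0.01)
```

pandas caches the accessor object on the DataFrame it was first accessed from, so df.toy_semantic.total_cost and df.toy_semantic.history keep accumulating across calls on df, while result is a new DataFrame that starts with a fresh accessor.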

Note: Individual operations are still optimized internally, but a pipeline built by chaining calls through the pandas interface cannot be optimized as a whole. For pipeline-level optimization, use the YAML or Python API instead.
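
One way to read the note above is that a rewrite across steps needs the whole pipeline declared before anything runs. The toy below illustrates that general idea only; it is not DocETL's optimizer, and Step, optimize, run, and the sample rows are hypothetical names and data made up for this sketch. It moves a filter ahead of a map when the filter does not read anything the map writes, which reduces the number of per-row calls without changing the output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    kind: str                        # "map" or "filter"
    fn: Callable
    reads: frozenset = frozenset()   # fields the step reads
    writes: frozenset = frozenset()  # fields the step adds (maps only)


def optimize(steps):
    """Toy rewrite: swap map->filter into filter->map when the two are independent."""
    steps = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(steps) - 1):
            a, b = steps[i], steps[i + 1]
            if a.kind == "map" and b.kind == "filter" and not (b.reads & a.writes):
                steps[i], steps[i + 1] = b, a
                changed = True
    return steps


def run(steps, rows):
    """Execute steps in order; return the output rows and the number of per-row calls."""
    calls = 0
    for step in steps:
        calls += len(rows)
        if step.kind == "map":
            rows = [{**r, **step.fn(r)} for r in rows]
        else:
            rows = [r for r in rows if step.fn(r)]
    return rows, calls


# Made-up sample data for the illustration.
rows = [
    {"text": "Apple released iPhone 15", "lang": "en"},
    {"text": "Google launches Pixel 8", "lang": "en"},
    {"text": "Ein Beispielsatz auf Deutsch", "lang": "de"},
]

pipeline = [
    Step("map", lambda r: {"n_words": len(r["text"].split())},
         reads=frozenset({"text"}), writes=frozenset({"n_words"})),
    Step("filter", lambda r: r["lang"] == "en", reads=frozenset({"lang"})),
]

naive_rows, naive_calls = run(pipeline, rows)
fast_rows, fast_calls = run(optimize(pipeline), rows)
```

Here the declared pipeline needs 6 per-row calls as written and 5 after the swap, with identical output. When each step is instead a separate method call that returns a DataFrame, there is no such list of steps to inspect ahead of time.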

shreyashankar added 4 commits January 23, 2025 02:23

feat: add pandas df accessor

9cb068f

feat: add pandas df accessor

b64a431

feat: add pandas df accessor

93af306

rebase with main

7ae7077

shreyashankar merged commit 2a259a0 into main Jan 25, 2025
4 of 5 checks passed
