Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Align configuration for inference and evaluation #61

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

undo76
Copy link
Contributor

@undo76 undo76 commented Dec 9, 2024

Ensure that evaluation can be configured using the same configuration as inference.

#60

@undo76 undo76 force-pushed the feat/ms-shared-configuration branch from dbc16b0 to 0d21062 Compare December 9, 2024 14:13
@undo76 undo76 force-pushed the feat/ms-shared-configuration branch from 0d21062 to 32f496e Compare December 9, 2024 14:16
Copy link
Member

@lsorber lsorber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review on the config modifications only. I chose to focus on this first as it will affect the remainder of the PR as well. My goal is to try and minimise the additions we add to the config dataclass. I'd like to keep it as simple as we can make it. Every parameter should be self-explanatory.

src/raglite/_config.py Outdated Show resolved Hide resolved
@@ -53,6 +67,12 @@ class RAGLiteConfig:
),
compare=False, # Exclude the reranker from comparison to avoid lru_cache misses.
)
search_method: "SearchMethod" = field(default_factory=_default_search_method, compare=False)
system_prompt: str | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which system prompt is this? Can we leave it out of the config? I don't believe we use one for RAG currently.

Copy link
Contributor Author

@undo76 undo76 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the one for the specific use case. It contains information about the assistant role, language, style, etc. I think it is important to keep it here and make it available to the evaluation as it contains valuable information. Also my aim is that just by modifying the Config on could switch from one use case to another without modifying anything else.

In other words, everything that could be modified to improve the performance should be in Config (num_chunks, search method and prompts)

I am using it like this:

 messages = [
        {"role": "system", "content": config.system_prompt},
        *history,
        create_rag_instruction(
            user_prompt=user_prompt,
            context=chunk_spans,
            config=config,
        ),
    ]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling we'll end up with potentially many system prompts though in the future, and then it will be difficult to distinguish between them. Can we solve this with the same partial trick?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to use the same trick (any suggestions?). I don't think that we should use the assistant instructions (role, language, tone, format, examples, etc) in the rag instructions as they are immutable. One option would be to leave it outside of the configuration as an application specific feature, but then the evaluation would need to take care of the system prompt for the answers and evaluation phase.

Maybe we could configure create_rag_instruction and create_system_prompt as (partial) functions. (I would call them system_prompt and user_prompt,

To be clear: I consider that the evaluation pipeline should take into account the same system prompt that is going to be used during inference. Imagine that the system_prompt says that all the answers should be in Dutch, this information should be taken into account by the answer generation for evaluation.

Another thing I want is to configure everything in a single configuration class. This way switching different versions or use cases becomes trivial.

Finally, (a long shot maybe), by having access to the system_prompt, that describes the particular use cases could be useful for other RAG phases. We could leverage the system prompt in order to augment the chunks with contextual information, hypothetical questions, keywords, etc.

TL;DR: I am willing to change how it is configured (partial or another method), but I think that it should be included in the configuration.

src/raglite/_config.py Outdated Show resolved Hide resolved
@undo76
Copy link
Contributor Author

undo76 commented Dec 12, 2024

Big refactoring to prevent cyclic dependencies. Not fully convinced about the interface yet. In particular I don't like config.retrieval, but it is taking shape. Other thing I don't like is that it is not possible to execute the different phases separately.

@undo76 undo76 force-pushed the feat/ms-shared-configuration branch from f1ef291 to 8e6436e Compare December 13, 2024 09:32
Copy link
Member

@lsorber lsorber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another review round.

answered_evals: pd.DataFrame | int = 100, config: RAGLiteConfig | None = None
answered_evals_df: pd.DataFrame,
*,
metrics: Sequence[Any] | None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a Sequence[Metric] | None or maybe even a list[Metric] | None, where from ragas.metrics.base import Metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but as ragas is an optional dependency I used Any. A better solution would be to assign Metric conditionally to Any or ragas.metrics.base.Metric depending on the presence of ragas.

Comment on lines 16 to +18
strict: bool = False, # noqa: FBT001,FBT002
config: RAGLiteConfig | None = None,
*,
config: "RAGLiteConfig",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put the * above strict, as a positional True or False without the keyword argument name would not be very clear.

src/raglite/_query_adapter.py Show resolved Hide resolved
search: SearchMethod = hybrid_search,
config: RAGLiteConfig | None = None,
search: "ChunkSearchMethod",
rerank: Optional["ChunkRerankingMethod"] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python 3.10+:

Suggested change
rerank: Optional["ChunkRerankingMethod"] = None,
rerank: "ChunkRerankingMethod" | None = None,

Comment on lines +31 to +36
if rerank:
chunks = rerank(query, chunk_ids=chunk_ids, config=config)
else:
chunks = retrieve_chunks(chunk_ids, config=config)
context = retrieve_chunk_spans(chunks, chunk_neighbors=chunk_neighbors, config=config)
return context[:max_chunk_spans]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the algorithm as implemented:

  1. Let's say you retrieve 40 chunks with search.
  2. Then you rerank those 40 chunks with rerank.
  3. Then you retrieve all chunk spans from those 40 reranked chunks.
  4. Only then do you limit the number of chunk spans to the desired amount (or not any at all).

There are two downsides to this algorithm:

  1. Retrieving all 40 chunks is relatively expensive if you intend to throw away a number of chunk spans afterwards.
  2. If you don't throw away any chunk spans (which is the default), then reranking doesn't do very much as the LLM gets to see the same chunks whether you rerank or not!

To address both issues, I think we should introduce a max_chunks: int argument and update the algorithm as follows:

  1. Let's say you retrieve 40 chunks with search.
  2. Then you rerank those 40 chunks with rerank, and keep only the top max_chunks.
  3. Then you retrieve all chunk spans from the top max_chunks.
  4. Optional: limit the number of chunk spans to max_chunk_spans (with a default of None as is the case now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I understand. max_chunks is always set to a value and search always returns [:max_chunks], therefore step 2 and step 3 are also limited to [:max_chunks] in the original algorithm.

Regarding the downsides:

  1. Chunk spans consist of a list of contiguous chunks retrieved by search (+ neighbours). We need to compose all the chunk spans as we need to assign a pooled score using the sum of reciprocal rankings of the chunks, and then return the top max_chunk_spans. Maybe you are thinking that for some aggregated span scores we can be sure that they will never be in the top, so we could skip the neighbours retrieval phase, indeed, but the algo is not trivial and depends on the chunk score pooling function. (e.g. if we decided that span_score = max(chunk_scores) it would be trivial).
  2. You are right here, that the default makes the reranking less useful. Maybe it would be better not to allow None in max_chunk_spans

For instance, I have been using this configuration:

retrieval = partial(
    retrieve_rag_context,
    max_chunk_spans=3,
    search_method=partial(
        hybrid_search,
        max_chunks=20,
    ),
    rerank=rerank_chunks,
    chunk_neighbors=(-1, 1),
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have two options:

  1. Option A:
    a. We search for chunks (say, 40)
    b. We rerank all chunks (requires retrieving all chunks)
    c. We truncate the chunks to max_chunks chunks (say, 10)
    d. We transform chunks into chunk spans and return all of them (no truncation of chunk spans)
  2. Option B:
    a. We search for chunks (say, 40)
    b. We rerank all chunks (requires retrieving all chunks)
    c. We transform chunks into chunk spans (ordered according to their aggregate chunk relevance)
    d. We truncate the chunk spans to max_chunk_spans chunk spans (say, 3)

Comparison:

Option A Option B
1. Database load 🟡 retrieves ~40 chunks (for reranking) 🟠 retrieves ~40 chunks (for reranking) + neighbours (for chunk spans)
2. Predictable number of context tokens 🟢 max_chunk chunks, each with 1440 chars 🔴 number of chunks is dynamic, only number of chunk spans is known
3. Chunk span quality 🟡 Some less relevant chunks may be omitted from the chunk spans 🟢 The returned chunk spans will contain all relevant chunks + because there are more chunks there may be fewer chunk spans because there are fewer 'gaps'

Summary: Option B seems like it may result in higher quality chunk spans than Option A, but at the expense of poor control over how many context tokens it generates.

Do you agree? Are there other important aspects I'm missing in the comparison?

I have a feeling that there may be a way to optimise the chunk span construction and perhaps get the best of both worlds. I'll see if I can give that some thought.



def create_rag_instruction(
user_prompt: str,
context: list[ChunkSpan],
*,
rag_instruction_template: str = RAG_INSTRUCTION_TEMPLATE,
rag_instruction_template: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to remove the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the RAG_INSTRUCTION_TEMPLATE is quite specific and the instructions should be in a system prompt. If we want to keep it, I think that something more general makes more sense. I am using the one below and adding the instructions to the system prompt (doing so, it helps with caching too)

rag_instruction_template = "{user_prompt}\n\n{context}"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we should provide a reasonable default that 'just works' so that users don't necessarily have to provide their own template.

The current template is fairly generic in my opinion – is there anything you would change to it to make it more generally applicable?

Comment on lines +69 to +78
rag_instruction_template: str | None = None,
) -> list[dict[str, str]]:
"""Compose a list of messages to generate a response."""
messages = [
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
*(history or []),
create_rag_instruction(
user_prompt=user_prompt,
context=context if context else [],
rag_instruction_template=rag_instruction_template or DEFAULT_RAG_INSTRUCTION_TEMPLATE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do end up keeping this function, why not:

Suggested change
rag_instruction_template: str | None = None,
) -> list[dict[str, str]]:
"""Compose a list of messages to generate a response."""
messages = [
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
*(history or []),
create_rag_instruction(
user_prompt=user_prompt,
context=context if context else [],
rag_instruction_template=rag_instruction_template or DEFAULT_RAG_INSTRUCTION_TEMPLATE,
rag_instruction_template: str = DEFAULT_RAG_INSTRUCTION_TEMPLATE,
) -> list[dict[str, str]]:
"""Compose a list of messages to generate a response."""
messages = [
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
*(history or []),
create_rag_instruction(
user_prompt=user_prompt,
context=context if context else [],
rag_instruction_template=rag_instruction_template,

) -> list[dict[str, str]]:
"""Compose a list of messages to generate a response."""
messages = [
*([{"role": "system", "content": system_prompt}] if system_prompt else []),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a system prompt is provided, this will prevent prompt caching.

And if we were to drop this system prompt feature, then compose_rag_messages is equivalent to messages.append(create_rag_instruction(user_prompt, context)).

Therefore, I think we can actually do without compose_rag_messages, or is there a good reason to have this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how this prevents (prefix) prompt catching. If system_prompt is static, at least with OpenAI prompt caching, the common tokens prefix will be cached. Maybe there is a confusion with the syntax I used. It is just a compact way of adding conditionally the system prompt (using a list for this is weird, I reckon, but it is a way of expressing "nothing" to be added)

I use this method in a couple of places, including eval. I think it is a good abstraction to keep as it removes the cognitive load of remembering which order and format the messages should be constructed. Just give the user prompt, the system prompt, the history and the contexts and it takes care of composing the right format for you. Also this method could be extended to manage max num of messages, filtering out tool results in history, adding special caching mechanisms as in Claude, etc.

Finally, about dropping the system prompt for good. I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why do you prefer one over the other?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how this prevents (prefix) prompt catching.

I think it can break prompt caching when using a 'rolling message history': when the message history is too long to fit in the context, you can drop the first k messages until it does fit in the context. But if you drop the first k messages and then reinsert the system prompt, you're creating a new prefix that hasn't been seen yet and would be a cache miss. Subsequent messages would likely have to drop the first messages again and the process repeats, resulting in consistent cache misses once you hit the maximum context size.

Just give the user prompt, the system prompt, the history and the contexts and it takes care of composing the right format for you. Also this method could be extended to manage max num of messages, filtering out tool results in history, adding special caching mechanisms as in Claude, etc.

In my opinion, the 'management' of the message history should not be RAGLite's responsibility. Features like rolling history, system prompt, tool usage filtering (or not), should be managed at the application level. For example, in a Chainlit application I think it should really be Chainlit managing the message history. Of course, RAGLite has contributions to make to that message history, but if you agree that it does not 'own' the message history, its modifications to the message history should be minimal.

Finally, about dropping the system prompt for good. I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why do you prefer one over the other?

I do agree that that is necessary, but I don't agree that should be RAGLite's responsibility (cf. motivation above). I do think RAGLite should offer a way to inject context into a message history, but with minimal interference. For that reason, I would prefer to not even offer a function that modifies the message history and instead offer only a function to append a message to the message history.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how this prevents (prefix) prompt catching.

I think it can break prompt caching when using a 'rolling message history': when the message history is too long to fit in the context, you can drop the first k messages until it does fit in the context. But if you drop the first k messages and then reinsert the system prompt, you're creating a new prefix that hasn't been seen yet and would be a cache miss. Subsequent messages would likely have to drop the first messages again and the process repeats, resulting in consistent cache misses once you hit the maximum context size.

Just give the user prompt, the system prompt, the history and the contexts and it takes care of composing the right format for you. Also this method could be extended to manage max num of messages, filtering out tool results in history, adding special caching mechanisms as in Claude, etc.

In my opinion, the 'management' of the message history should not be RAGLite's responsibility. Features like rolling history, system prompt, and tool usage filtering (or not) should be managed at the application level. For example, in a Chainlit application I think it should really be Chainlit managing the message history. Of course, RAGLite has contributions to make to that message history, but if you agree that it does not 'own' the message history, its modifications to the message history should be minimal.

Finally, about dropping the system prompt for good. I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why do you prefer one over the other?

I do agree that that is necessary, but I don't agree that should be RAGLite's responsibility (cf. motivation above). I do think RAGLite should offer a way to inject context into a message history, but with minimal interference. For that reason, I would prefer to not even offer a function that modifies the message history and instead offer only a function to append a message to the message history.

src/raglite/_chainlit.py Show resolved Hide resolved
@@ -270,28 +273,41 @@ class Eval(SQLModel, table=True):
document: Document = Relationship(back_populates="evals")

@staticmethod
def from_chunks(
question: str, contexts: list[Chunk], ground_truth: str, **kwargs: Any
def from_contexts(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would call it from_context (singular) for concistency with context: list[ChunkSpan] arguments elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants