-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Align configuration for inference and evaluation #61
base: main
Are you sure you want to change the base?
Conversation
dbc16b0
to
0d21062
Compare
0d21062
to
32f496e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review on the config modifications only. I chose to focus on this first as it will affect the remainder of the PR as well. My goal is to try and minimise the additions we add to the config dataclass. I'd like to keep it as simple as we can make it. Every parameter should be self-explanatory.
src/raglite/_config.py
Outdated
@@ -53,6 +67,12 @@ class RAGLiteConfig: | |||
), | |||
compare=False, # Exclude the reranker from comparison to avoid lru_cache misses. | |||
) | |||
search_method: "SearchMethod" = field(default_factory=_default_search_method, compare=False) | |||
system_prompt: str | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which system prompt is this? Can we leave it out of the config? I don't believe we use one for RAG currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the one for the specific use case. It contains information about the assistant role, language, style, etc. I think it is important to keep it here and make it available to the evaluation as it contains valuable information. Also my aim is that just by modifying the Config on could switch from one use case to another without modifying anything else.
In other words, everything that could be modified to improve the performance should be in Config (num_chunks, search method and prompts)
I am using it like this:
messages = [
{"role": "system", "content": config.system_prompt},
*history,
create_rag_instruction(
user_prompt=user_prompt,
context=chunk_spans,
config=config,
),
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling we'll end up with potentially many system prompts though in the future, and then it will be difficult to distinguish between them. Can we solve this with the same partial
trick?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to use the same trick (any suggestions?). I don't think that we should use the assistant instructions (role, language, tone, format, examples, etc) in the rag instructions as they are immutable. One option would be to leave it outside of the configuration as an application specific feature, but then the evaluation would need to take care of the system prompt for the answers and evaluation phase.
Maybe we could configure create_rag_instruction and create_system_prompt as (partial) functions. (I would call them system_prompt and user_prompt,
To be clear: I consider that the evaluation pipeline should take into account the same system prompt that is going to be used during inference. Imagine that the system_prompt says that all the answers should be in Dutch, this information should be taken into account by the answer generation for evaluation.
Another thing I want is to configure everything in a single configuration class. This way switching different versions or use cases becomes trivial.
Finally, (a long shot maybe), by having access to the system_prompt, that describes the particular use cases could be useful for other RAG phases. We could leverage the system prompt in order to augment the chunks with contextual information, hypothetical questions, keywords, etc.
TL;DR: I am willing to change how it is configured (partial or another method), but I think that it should be included in the configuration.
Big refactoring to prevent cyclic dependencies. Not fully convinced about the interface yet. In particular I don't like |
f1ef291
to
8e6436e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another review round.
answered_evals: pd.DataFrame | int = 100, config: RAGLiteConfig | None = None | ||
answered_evals_df: pd.DataFrame, | ||
*, | ||
metrics: Sequence[Any] | None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a Sequence[Metric] | None
or maybe even a list[Metric] | None
, where from ragas.metrics.base import Metric
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but as ragas is an optional dependency I used Any. A better solution would be to assign Metric conditionally to Any
or ragas.metrics.base.Metric
depending on the presence of ragas.
strict: bool = False, # noqa: FBT001,FBT002 | ||
config: RAGLiteConfig | None = None, | ||
*, | ||
config: "RAGLiteConfig", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put the *
above strict
, as a positional True
or False
without the keyword argument name would not be very clear.
search: SearchMethod = hybrid_search, | ||
config: RAGLiteConfig | None = None, | ||
search: "ChunkSearchMethod", | ||
rerank: Optional["ChunkRerankingMethod"] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python 3.10+:
rerank: Optional["ChunkRerankingMethod"] = None, | |
rerank: "ChunkRerankingMethod" | None = None, |
if rerank: | ||
chunks = rerank(query, chunk_ids=chunk_ids, config=config) | ||
else: | ||
chunks = retrieve_chunks(chunk_ids, config=config) | ||
context = retrieve_chunk_spans(chunks, chunk_neighbors=chunk_neighbors, config=config) | ||
return context[:max_chunk_spans] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the algorithm as implemented:
- Let's say you retrieve 40 chunks with
search
. - Then you rerank those 40 chunks with
rerank
. - Then you retrieve all chunk spans from those 40 reranked chunks.
- Only then do you limit the number of chunk spans to the desired amount (or not any at all).
There are two downsides to this algorithm:
- Retrieving all 40 chunks is relatively expensive if you intend to throw away a number of chunk spans afterwards.
- If you don't throw away any chunk spans (which is the default), then reranking doesn't do very much as the LLM gets to see the same chunks whether you rerank or not!
To address both issues, I think we should introduce a max_chunks: int
argument and update the algorithm as follows:
- Let's say you retrieve 40 chunks with
search
. - Then you rerank those 40 chunks with
rerank
, and keep only the topmax_chunks
. - Then you retrieve all chunk spans from the top
max_chunks
. - Optional: limit the number of chunk spans to
max_chunk_spans
(with a default ofNone
as is the case now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I understand. max_chunks
is always set to a value and search always returns [:max_chunks]
, therefore step 2 and step 3 are also limited to [:max_chunks]
in the original algorithm.
Regarding the downsides:
- Chunk spans consist of a list of contiguous chunks retrieved by search (+ neighbours). We need to compose all the chunk spans as we need to assign a pooled score using the sum of reciprocal rankings of the chunks, and then return the top max_chunk_spans. Maybe you are thinking that for some aggregated span scores we can be sure that they will never be in the top, so we could skip the neighbours retrieval phase, indeed, but the algo is not trivial and depends on the chunk score pooling function. (e.g. if we decided that
span_score = max(chunk_scores)
it would be trivial). - You are right here, that the default makes the reranking less useful. Maybe it would be better not to allow None in
max_chunk_spans
For instance, I have been using this configuration:
retrieval = partial(
retrieve_rag_context,
max_chunk_spans=3,
search_method=partial(
hybrid_search,
max_chunks=20,
),
rerank=rerank_chunks,
chunk_neighbors=(-1, 1),
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have two options:
- Option A:
a. We search for chunks (say, 40)
b. We rerank all chunks (requires retrieving all chunks)
c. We truncate the chunks tomax_chunks
chunks (say, 10)
d. We transform chunks into chunk spans and return all of them (no truncation of chunk spans) - Option B:
a. We search for chunks (say, 40)
b. We rerank all chunks (requires retrieving all chunks)
c. We transform chunks into chunk spans (ordered according to their aggregate chunk relevance)
d. We truncate the chunk spans tomax_chunk_spans
chunk spans (say, 3)
Comparison:
Option A | Option B | |
---|---|---|
1. Database load | 🟡 retrieves ~40 chunks (for reranking) | 🟠 retrieves ~40 chunks (for reranking) + neighbours (for chunk spans) |
2. Predictable number of context tokens | 🟢 max_chunk chunks, each with 1440 chars |
🔴 number of chunks is dynamic, only number of chunk spans is known |
3. Chunk span quality | 🟡 Some less relevant chunks may be omitted from the chunk spans | 🟢 The returned chunk spans will contain all relevant chunks + because there are more chunks there may be fewer chunk spans because there are fewer 'gaps' |
Summary: Option B seems like it may result in higher quality chunk spans than Option A, but at the expense of poor control over how many context tokens it generates.
Do you agree? Are there other important aspects I'm missing in the comparison?
I have a feeling that there may be a way to optimise the chunk span construction and perhaps get the best of both worlds. I'll see if I can give that some thought.
|
||
|
||
def create_rag_instruction( | ||
user_prompt: str, | ||
context: list[ChunkSpan], | ||
*, | ||
rag_instruction_template: str = RAG_INSTRUCTION_TEMPLATE, | ||
rag_instruction_template: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to remove the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the RAG_INSTRUCTION_TEMPLATE is quite specific and the instructions should be in a system prompt. If we want to keep it, I think that something more general makes more sense. I am using the one below and adding the instructions to the system prompt (doing so, it helps with caching too)
rag_instruction_template = "{user_prompt}\n\n{context}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we should provide a reasonable default that 'just works' so that users don't necessarily have to provide their own template.
The current template is fairly generic in my opinion – is there anything you would change to it to make it more generally applicable?
rag_instruction_template: str | None = None, | ||
) -> list[dict[str, str]]: | ||
"""Compose a list of messages to generate a response.""" | ||
messages = [ | ||
*([{"role": "system", "content": system_prompt}] if system_prompt else []), | ||
*(history or []), | ||
create_rag_instruction( | ||
user_prompt=user_prompt, | ||
context=context if context else [], | ||
rag_instruction_template=rag_instruction_template or DEFAULT_RAG_INSTRUCTION_TEMPLATE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do end up keeping this function, why not:
rag_instruction_template: str | None = None, | |
) -> list[dict[str, str]]: | |
"""Compose a list of messages to generate a response.""" | |
messages = [ | |
*([{"role": "system", "content": system_prompt}] if system_prompt else []), | |
*(history or []), | |
create_rag_instruction( | |
user_prompt=user_prompt, | |
context=context if context else [], | |
rag_instruction_template=rag_instruction_template or DEFAULT_RAG_INSTRUCTION_TEMPLATE, | |
rag_instruction_template: str = DEFAULT_RAG_INSTRUCTION_TEMPLATE, | |
) -> list[dict[str, str]]: | |
"""Compose a list of messages to generate a response.""" | |
messages = [ | |
*([{"role": "system", "content": system_prompt}] if system_prompt else []), | |
*(history or []), | |
create_rag_instruction( | |
user_prompt=user_prompt, | |
context=context if context else [], | |
rag_instruction_template=rag_instruction_template, |
) -> list[dict[str, str]]: | ||
"""Compose a list of messages to generate a response.""" | ||
messages = [ | ||
*([{"role": "system", "content": system_prompt}] if system_prompt else []), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a system prompt is provided, this will prevent prompt caching.
And if we were to drop this system prompt feature, then compose_rag_messages
is equivalent to messages.append(create_rag_instruction(user_prompt, context))
.
Therefore, I think we can actually do without compose_rag_messages
, or is there a good reason to have this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how this prevents (prefix) prompt catching. If system_prompt
is static, at least with OpenAI prompt caching, the common tokens prefix will be cached. Maybe there is a confusion with the syntax I used. It is just a compact way of adding conditionally the system prompt (using a list for this is weird, I reckon, but it is a way of expressing "nothing" to be added)
I use this method in a couple of places, including eval. I think it is a good abstraction to keep as it removes the cognitive load of remembering which order and format the messages should be constructed. Just give the user prompt, the system prompt, the history and the contexts and it takes care of composing the right format for you. Also this method could be extended to manage max num of messages, filtering out tool results in history, adding special caching mechanisms as in Claude, etc.
Finally, about dropping the system prompt for good. I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why do you prefer one over the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how this prevents (prefix) prompt catching.
I think it can break prompt caching when using a 'rolling message history': when the message history is too long to fit in the context, you can drop the first k messages until it does fit in the context. But if you drop the first k messages and then reinsert the system prompt, you're creating a new prefix that hasn't been seen yet and would be a cache miss. Subsequent messages would likely have to drop the first messages again and the process repeats, resulting in consistent cache misses once you hit the maximum context size.
Just give the user prompt, the system prompt, the history and the contexts and it takes care of composing the right format for you. Also this method could be extended to manage max num of messages, filtering out tool results in history, adding special caching mechanisms as in Claude, etc.
In my opinion, the 'management' of the message history should not be RAGLite's responsibility. Features like rolling history, system prompt, tool usage filtering (or not), should be managed at the application level. For example, in a Chainlit application I think it should really be Chainlit managing the message history. Of course, RAGLite has contributions to make to that message history, but if you agree that it does not 'own' the message history, its modifications to the message history should be minimal.
Finally, about dropping the system prompt for good. I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why do you prefer one over the other?
I do agree that that is necessary, but I don't agree that should be RAGLite's responsibility (cf. motivation above). I do think RAGLite should offer a way to inject context into a message history, but with minimal interference. For that reason, I would prefer to not even offer a function that modifies the message history and instead offer only a function to append a message to the message history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how this prevents (prefix) prompt catching.
I think it can break prompt caching when using a 'rolling message history': when the message history is too long to fit in the context, you can drop the first k messages until it does fit in the context. But if you drop the first k messages and then reinsert the system prompt, you're creating a new prefix that hasn't been seen yet and would be a cache miss. Subsequent messages would likely have to drop the first messages again and the process repeats, resulting in consistent cache misses once you hit the maximum context size.
Just give the user prompt, the system prompt, the history and the contexts and it takes care of composing the right format for you. Also this method could be extended to manage max num of messages, filtering out tool results in history, adding special caching mechanisms as in Claude, etc.
In my opinion, the 'management' of the message history should not be RAGLite's responsibility. Features like rolling history, system prompt, and tool usage filtering (or not) should be managed at the application level. For example, in a Chainlit application I think it should really be Chainlit managing the message history. Of course, RAGLite has contributions to make to that message history, but if you agree that it does not 'own' the message history, its modifications to the message history should be minimal.
Finally, about dropping the system prompt for good. I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why do you prefer one over the other?
I do agree that that is necessary, but I don't agree that should be RAGLite's responsibility (cf. motivation above). I do think RAGLite should offer a way to inject context into a message history, but with minimal interference. For that reason, I would prefer to not even offer a function that modifies the message history and instead offer only a function to append a message to the message history.
@@ -270,28 +273,41 @@ class Eval(SQLModel, table=True): | |||
document: Document = Relationship(back_populates="evals") | |||
|
|||
@staticmethod | |||
def from_chunks( | |||
question: str, contexts: list[Chunk], ground_truth: str, **kwargs: Any | |||
def from_contexts( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would call it from_context
(singular) for concistency with context: list[ChunkSpan]
arguments elsewhere.
Ensure that evaluation can be configured using the same configuration as inference.
#60