FEATURE: PDF support for rag pipeline (#1118)
This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

**1. LLM Model Association for RAG and Personas:**

-   **New Database Columns:** Adds `rag_llm_model_id` to both `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM.  Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
-   **Migration:**  Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) to populate the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` based on the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration to remove the latter.
-   **Model Changes:**  The `AiPersona` and `AiTool` models now belong to an `LlmModel` (`belongs_to :rag_llm_model`) via `rag_llm_model_id`. `DiscourseAi::Completions::Llm.proxy` now accepts an `LlmModel` instance instead of just an identifier, so `LlmModel#to_llm` passes `self`.  `AiPersona` also gains `default_llm` and `question_consolidator_llm` associations backed by `default_llm_id` and `question_consolidator_llm_id`.
-   **UI Updates:**  The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled).  The RAG options component displays an LLM selector.
-   **Serialization:** The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.
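
To illustrate what the migration described above has to do, the sketch below maps a legacy string identifier to a numeric id. It is not the migration's actual code: `legacy_llm_to_id` is a hypothetical helper, and the `custom:12` identifier format is an assumption inferred from the removed `default_llm.split(":").last.to_i` lookup in `allowed_seeded_model`.

```ruby
# Hypothetical helper (not part of the plugin): pull the numeric LlmModel id
# out of a legacy string identifier. Assumes the id is the last
# colon-separated segment, e.g. "custom:12".
def legacy_llm_to_id(identifier)
  return nil if identifier.nil? || identifier.strip.empty?

  id = identifier.split(":").last.to_i
  # String#to_i returns 0 for non-numeric input, so treat 0 as "no id".
  id.positive? ? id : nil
end

legacy_llm_to_id("custom:12") # => 12
legacy_llm_to_id(nil)         # => nil
```

Returning `nil` for unparseable values leaves the new column empty rather than pointing it at a non-existent row, which is one reasonable choice for a nullable foreign key.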

**2. PDF and Image Support for RAG:**

-   **Site Setting:** Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
-   **File Upload Validation:** The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled.  Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
-   **PDF Processing:** Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
-   **Image Processing:** A new utility class, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
-   **RAG Digestion Job:** The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
-   **UI Updates:**  The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.
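
The upload gate described above can be sketched in plain Ruby. This mirrors the logic of `validate_extension!` in the diff, but `authorized_extension?` and the two constants are hypothetical names, and the site setting is passed in as a boolean so the example runs outside Discourse.

```ruby
# Extensions that are always accepted for RAG uploads.
BASE_EXTENSIONS = %w[txt md].freeze
# Extensions accepted only when PDF/image indexing is switched on.
PDF_IMAGE_EXTENSIONS = %w[pdf png jpg jpeg].freeze

# Hypothetical stand-in for the controller check; `pdf_images_enabled`
# plays the role of SiteSetting.ai_rag_pdf_images_enabled.
def authorized_extension?(filename, pdf_images_enabled:)
  # File.extname returns ".pdf"; drop the dot. Files without an
  # extension yield nil here, hence the fallback to "".
  extension = File.extname(filename)[1..-1] || ""
  allowed = BASE_EXTENSIONS.dup
  allowed.concat(PDF_IMAGE_EXTENSIONS) if pdf_images_enabled
  allowed.include?(extension)
end

authorized_extension?("manual.pdf", pdf_images_enabled: false) # => false
authorized_extension?("manual.pdf", pdf_images_enabled: true)  # => true
```

As in the diff, the comparison is case-sensitive, so a file named `manual.PDF` would be rejected by this sketch even with the setting enabled.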

**3. Refactoring and Improvements:**

-   **LLM Enumeration:** The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
-   **AI Helper:** The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
-   **Bot and Persona Updates:** Code across the bot and persona layers now references LLMs through the new `LlmModel` association (by id) instead of through string identifiers.
-   **Audit Logs:** The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- **Eval Script:** An evaluation script is included.
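
A rough sketch of the shape `values_for_serialization` is described as returning (id, name, vision_enabled) follows. The `LlmRecord` struct, its `display_name` and `api_key` fields, and the helper name `llm_values_for_serialization` are assumptions made for illustration; the real enumerator reads `LlmModel` records and may differ.

```ruby
# Stand-in for an LlmModel row. Field names other than `id` are assumed.
LlmRecord = Struct.new(:id, :display_name, :vision_enabled, :api_key, keyword_init: true)

# Hypothetical helper: keep only the fields the frontend needs, so
# sensitive or bulky attributes (such as an API key) are never serialized.
def llm_values_for_serialization(records)
  records.map do |llm|
    { id: llm.id, name: llm.display_name, vision_enabled: llm.vision_enabled }
  end
end

records = [
  LlmRecord.new(id: 1, display_name: "Example model", vision_enabled: true, api_key: "secret"),
]
llm_values_for_serialization(records)
# => [{ id: 1, name: "Example model", vision_enabled: true }]
```

Building the hash explicitly, rather than filtering a full attribute dump, means a newly added column stays private unless someone opts it in.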

**4. Testing:**

-    The PR introduces a new eval system for LLMs, which lets us test how functionality behaves across various LLM providers. It lives in `/evals`.
SamSaffron authored Feb 14, 2025
1 parent e2afbc2 commit 5e80f93
Showing 54 changed files with 1,329 additions and 141 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@ node_modules
/gems
/auto_generated
.env
+evals/log
+evals/cases
@@ -14,5 +14,7 @@ export default class DiscourseAiToolsEditRoute extends DiscourseRoute {

controller.set("allTools", toolsModel);
controller.set("presets", toolsModel.resultSetMeta.presets);
+controller.set("llms", toolsModel.resultSetMeta.llms);
+controller.set("settings", toolsModel.resultSetMeta.settings);
}
}
@@ -11,5 +11,7 @@ export default class DiscourseAiToolsNewRoute extends DiscourseRoute {

controller.set("allTools", toolsModel);
controller.set("presets", toolsModel.resultSetMeta.presets);
+controller.set("llms", toolsModel.resultSetMeta.llms);
+controller.set("settings", toolsModel.resultSetMeta.settings);
}
}
@@ -3,5 +3,7 @@
@tools={{this.allTools}}
@model={{this.model}}
@presets={{this.presets}}
+@llms={{this.llms}}
+@settings={{this.settings}}
/>
</section>
@@ -3,5 +3,7 @@
@tools={{this.allTools}}
@model={{this.model}}
@presets={{this.presets}}
+@llms={{this.llms}}
+@settings={{this.settings}}
/>
</section>
22 changes: 16 additions & 6 deletions app/controllers/discourse_ai/admin/ai_personas_controller.rb
@@ -32,10 +32,19 @@ def index
}
end
llms =
-DiscourseAi::Configuration::LlmEnumerator
-.values(allowed_seeded_llms: SiteSetting.ai_bot_allowed_seeded_models)
-.map { |hash| { id: hash[:value], name: hash[:name] } }
-render json: { ai_personas: ai_personas, meta: { tools: tools, llms: llms } }
+DiscourseAi::Configuration::LlmEnumerator.values_for_serialization(
+allowed_seeded_llm_ids: SiteSetting.ai_bot_allowed_seeded_models_map,
+)
+render json: {
+ai_personas: ai_personas,
+meta: {
+tools: tools,
+llms: llms,
+settings: {
+rag_pdf_images_enabled: SiteSetting.ai_rag_pdf_images_enabled,
+},
+},
+}
end

def new
@@ -187,15 +196,16 @@ def ai_persona_params
:priority,
:top_p,
:temperature,
-:default_llm,
+:default_llm_id,
:user_id,
:max_context_posts,
:vision_enabled,
:vision_max_pixels,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
:rag_conversation_chunks,
-:question_consolidator_llm,
+:rag_llm_model_id,
+:question_consolidator_llm_id,
:allow_chat_channel_mentions,
:allow_chat_direct_messages,
:allow_topic_mentions,
1 change: 1 addition & 0 deletions app/controllers/discourse_ai/admin/ai_tools_controller.rb
@@ -90,6 +90,7 @@ def ai_tool_params
:summary,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
+:rag_llm_model_id,
rag_uploads: [:id],
parameters: [:name, :type, :description, :required, enum: []],
)
@@ -49,6 +49,7 @@ def upload_file
def validate_extension!(filename)
extension = File.extname(filename)[1..-1] || ""
authorized_extensions = %w[txt md]
+authorized_extensions.concat(%w[pdf png jpg jpeg]) if SiteSetting.ai_rag_pdf_images_enabled
if !authorized_extensions.include?(extension)
raise Discourse::InvalidParameters.new(
I18n.t(
35 changes: 33 additions & 2 deletions app/jobs/regular/digest_rag_upload.rb
@@ -28,7 +28,7 @@ def execute(args)

# Check if this is the first time we process this upload.
if fragment_ids.empty?
-document = get_uploaded_file(upload)
+document = get_uploaded_file(upload: upload, target: target)
return if document.nil?

RagDocumentFragment.publish_status(upload, { total: 0, indexed: 0, left: 0 })
@@ -163,7 +163,38 @@ def first_chunk(text, chunk_tokens:, tokenizer:, splitters: ["\n\n", "\n", ".",
[buffer, split_char]
end

-def get_uploaded_file(upload)
+def get_uploaded_file(upload:, target:)
+if %w[pdf png jpg jpeg].include?(upload.extension) && !SiteSetting.ai_rag_pdf_images_enabled
+raise Discourse::InvalidAccess.new(
+"The setting ai_rag_pdf_images_enabled is false, can not index images and pdfs.",
+)
+end
+if upload.extension == "pdf"
+pages =
+DiscourseAi::Utils::PdfToImages.new(
+upload: upload,
+user: Discourse.system_user,
+).uploaded_pages
+
+return(
+DiscourseAi::Utils::ImageToText.as_fake_file(
+uploads: pages,
+llm_model: target.rag_llm_model,
+user: Discourse.system_user,
+)
+)
+end
+
+if %w[png jpg jpeg].include?(upload.extension)
+return(
+DiscourseAi::Utils::ImageToText.as_fake_file(
+uploads: [upload],
+llm_model: target.rag_llm_model,
+user: Discourse.system_user,
+)
+)
+end
+
store = Discourse.store
@file ||=
if store.external?
88 changes: 46 additions & 42 deletions app/models/ai_persona.rb
@@ -1,8 +1,8 @@
# frozen_string_literal: true

class AiPersona < ActiveRecord::Base
-# TODO remove this line 01-1-2025
-self.ignored_columns = %i[commands allow_chat mentionable]
+# TODO remove this line 01-10-2025
+self.ignored_columns = %i[default_llm question_consolidator_llm]

# places a hard limit, so per site we cache a maximum of 500 classes
MAX_PERSONAS_PER_SITE = 500
@@ -12,7 +12,7 @@ class AiPersona < ActiveRecord::Base
validates :system_prompt, presence: true, length: { maximum: 10_000_000 }
validate :system_persona_unchangeable, on: :update, if: :system
validate :chat_preconditions
-validate :allowed_seeded_model, if: :default_llm
+validate :allowed_seeded_model, if: :default_llm_id
validates :max_context_posts, numericality: { greater_than: 0 }, allow_nil: true
# leaves some room for growth but sets a maximum to avoid memory issues
# we may want to revisit this in the future
@@ -30,6 +30,10 @@ class AiPersona < ActiveRecord::Base
belongs_to :created_by, class_name: "User"
belongs_to :user

+belongs_to :default_llm, class_name: "LlmModel"
+belongs_to :question_consolidator_llm, class_name: "LlmModel"
+belongs_to :rag_llm_model, class_name: "LlmModel"
+
has_many :upload_references, as: :target, dependent: :destroy
has_many :uploads, through: :upload_references

@@ -62,7 +66,7 @@ def self.persona_users(user: nil)
user_id: persona.user_id,
username: persona.user.username_lower,
allowed_group_ids: persona.allowed_group_ids,
-default_llm: persona.default_llm,
+default_llm_id: persona.default_llm_id,
force_default_llm: persona.force_default_llm,
allow_chat_channel_mentions: persona.allow_chat_channel_mentions,
allow_chat_direct_messages: persona.allow_chat_direct_messages,
@@ -157,12 +161,12 @@ def class_instance
user_id
system
mentionable
-default_llm
+default_llm_id
max_context_posts
vision_enabled
vision_max_pixels
rag_conversation_chunks
-question_consolidator_llm
+question_consolidator_llm_id
allow_chat_channel_mentions
allow_chat_direct_messages
allow_topic_mentions
@@ -302,7 +306,7 @@ def chat_preconditions
if (
allow_chat_channel_mentions || allow_chat_direct_messages || allow_topic_mentions ||
force_default_llm
-) && !default_llm
+) && !default_llm_id
errors.add(:default_llm, I18n.t("discourse_ai.ai_bot.personas.default_llm_required"))
end
end
@@ -332,13 +336,12 @@ def ensure_not_system
end

def allowed_seeded_model
-return if default_llm.blank?
+return if default_llm_id.blank?

-llm = LlmModel.find_by(id: default_llm.split(":").last.to_i)
-return if llm.nil?
-return if !llm.seeded?
+return if default_llm.nil?
+return if !default_llm.seeded?

-return if SiteSetting.ai_bot_allowed_seeded_models.include?(llm.id.to_s)
+return if SiteSetting.ai_bot_allowed_seeded_models_map.include?(default_llm.id.to_s)

errors.add(:default_llm, I18n.t("discourse_ai.llm.configuration.invalid_seeded_model"))
end
@@ -348,36 +351,37 @@ def allowed_seeded_model
#
# Table name: ai_personas
#
-# id :bigint not null, primary key
-# name :string(100) not null
-# description :string(2000) not null
-# system_prompt :string(10000000) not null
-# allowed_group_ids :integer default([]), not null, is an Array
-# created_by_id :integer
-# enabled :boolean default(TRUE), not null
-# created_at :datetime not null
-# updated_at :datetime not null
-# system :boolean default(FALSE), not null
-# priority :boolean default(FALSE), not null
-# temperature :float
-# top_p :float
-# user_id :integer
-# default_llm :text
-# max_context_posts :integer
-# vision_enabled :boolean default(FALSE), not null
-# vision_max_pixels :integer default(1048576), not null
-# rag_chunk_tokens :integer default(374), not null
-# rag_chunk_overlap_tokens :integer default(10), not null
-# rag_conversation_chunks :integer default(10), not null
-# question_consolidator_llm :text
-# tool_details :boolean default(TRUE), not null
-# tools :json not null
-# forced_tool_count :integer default(-1), not null
-# allow_chat_channel_mentions :boolean default(FALSE), not null
-# allow_chat_direct_messages :boolean default(FALSE), not null
-# allow_topic_mentions :boolean default(FALSE), not null
-# allow_personal_messages :boolean default(TRUE), not null
-# force_default_llm :boolean default(FALSE), not null
+# id :bigint not null, primary key
+# name :string(100) not null
+# description :string(2000) not null
+# system_prompt :string(10000000) not null
+# allowed_group_ids :integer default([]), not null, is an Array
+# created_by_id :integer
+# enabled :boolean default(TRUE), not null
+# created_at :datetime not null
+# updated_at :datetime not null
+# system :boolean default(FALSE), not null
+# priority :boolean default(FALSE), not null
+# temperature :float
+# top_p :float
+# user_id :integer
+# max_context_posts :integer
+# vision_enabled :boolean default(FALSE), not null
+# vision_max_pixels :integer default(1048576), not null
+# rag_chunk_tokens :integer default(374), not null
+# rag_chunk_overlap_tokens :integer default(10), not null
+# rag_conversation_chunks :integer default(10), not null
+# tool_details :boolean default(TRUE), not null
+# tools :json not null
+# forced_tool_count :integer default(-1), not null
+# allow_chat_channel_mentions :boolean default(FALSE), not null
+# allow_chat_direct_messages :boolean default(FALSE), not null
+# allow_topic_mentions :boolean default(FALSE), not null
+# allow_personal_messages :boolean default(TRUE), not null
+# force_default_llm :boolean default(FALSE), not null
+# rag_llm_model_id :bigint
+# default_llm_id :bigint
+# question_consolidator_llm_id :bigint
#
# Indexes
#
3 changes: 2 additions & 1 deletion app/models/ai_tool.rb
@@ -8,6 +8,7 @@ class AiTool < ActiveRecord::Base
validates :script, presence: true, length: { maximum: 100_000 }
validates :created_by_id, presence: true
belongs_to :created_by, class_name: "User"
+belongs_to :rag_llm_model, class_name: "LlmModel"
has_many :rag_document_fragments, dependent: :destroy, as: :target
has_many :upload_references, as: :target, dependent: :destroy
has_many :uploads, through: :upload_references
@@ -371,4 +372,4 @@ def self.presets
# rag_chunk_tokens :integer default(374), not null
# rag_chunk_overlap_tokens :integer default(10), not null
# tool_name :string(100) default(""), not null
-#
+# rag_llm_model_id :bigint
2 changes: 1 addition & 1 deletion app/models/llm_model.rb
@@ -70,7 +70,7 @@ def self.provider_params
end

def to_llm
-DiscourseAi::Completions::Llm.proxy(identifier)
+DiscourseAi::Completions::Llm.proxy(self)
end

def identifier
8 changes: 7 additions & 1 deletion app/serializers/ai_custom_tool_list_serializer.rb
@@ -6,7 +6,13 @@ class AiCustomToolListSerializer < ApplicationSerializer
has_many :ai_tools, serializer: AiCustomToolSerializer, embed: :objects

def meta
-{ presets: AiTool.presets }
+{
+presets: AiTool.presets,
+llms: DiscourseAi::Configuration::LlmEnumerator.values_for_serialization,
+settings: {
+rag_pdf_images_enabled: SiteSetting.ai_rag_pdf_images_enabled,
+},
+}
end

def ai_tools
1 change: 1 addition & 0 deletions app/serializers/ai_custom_tool_serializer.rb
@@ -10,6 +10,7 @@ class AiCustomToolSerializer < ApplicationSerializer
:script,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
+:rag_llm_model_id,
:created_by_id,
:created_at,
:updated_at
5 changes: 3 additions & 2 deletions app/serializers/localized_ai_persona_serializer.rb
@@ -14,15 +14,16 @@ class LocalizedAiPersonaSerializer < ApplicationSerializer
:allowed_group_ids,
:temperature,
:top_p,
-:default_llm,
+:default_llm_id,
:user_id,
:max_context_posts,
:vision_enabled,
:vision_max_pixels,
:rag_chunk_tokens,
:rag_chunk_overlap_tokens,
:rag_conversation_chunks,
-:question_consolidator_llm,
+:rag_llm_model_id,
+:question_consolidator_llm_id,
:tool_details,
:forced_tool_count,
:allow_chat_channel_mentions,