Integrate llama.cpp in Aprapipes and a module ImageToTextXForm which can describe an image #345

kushaljain-apra · 2024-04-17T13:57:07Z

IMPORTANT: All PRs must be linked to an issue (except for extremely trivial and straightforward changes).

Fixes #[Issue]

Description

Added a vcpkg port for llama.cpp.
Added support for LLM and encoder models through the LLM Model and Encoder Model abstract classes.
Added the Llava and Clip Encoder models for ImageToTextXForm.
Added a Model Strategy class to implement different model strategies.
Added an ImageToTextXForm module, which uses the Llava model to describe an image.

Alternative(s) considered

Have you considered any alternatives? And if so, why have you chosen the approach in this PR?

Type

Feature

Screenshots (if applicable)

Checklist

I have read the Contribution Guidelines
I have written Unit Tests
I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach

yashrajsapra · 2024-04-17T15:17:19Z

base/include/ClipEncoder.h

+#include "EncoderModelAbstract.h"
+
+class ClipEncoderProps : public EncoderModelAbstractProps {
+public:


Move all the definitions to the .cpp file.

yashrajsapra · 2024-04-17T15:18:21Z

base/include/EncoderModelAbstract.h

+    VIVIT // Video Vision Transformer
+  };
+
+  enum DataType { TEXT = 0, IMAGE, AUDIO, TEXT_EMBEDDING, IMAGE_EMBEDDING, AUDIO_EMBEDDING };


Maintain all the enums in a single place.

yashrajsapra · 2024-04-18T10:25:10Z

base/CMakeLists.txt

@@ -608,6 +625,9 @@ target_link_libraries(aprapipesut
  liblzma::liblzma
  bigint::bigint
  sfml-audio
+	${COMMON_LIB}
+	llama


Fix formatting

yashrajsapra · 2024-04-18T10:26:27Z

base/include/EncoderModelAbstract.h

+
+  EncoderModelAbstractProps() {
+    modelArchitecture = ModelArchitectureType::BERT;
+    inputTypes = {DataType::TEXT};


Move all of these definitions into the .cpp file.

yashrajsapra · 2024-04-18T10:29:07Z

base/include/EncoderModelAbstract.h

+    VIVIT // Video Vision Transformer
+  };
+
+  enum DataType { TEXT = 0, IMAGE, AUDIO, TEXT_EMBEDDING, IMAGE_EMBEDDING, AUDIO_EMBEDDING };


Use FrameType instead of DataType.

yashrajsapra · 2024-04-18T11:19:12Z

base/src/Llava.cpp

+      return true;
+    }
+  }
+  return false;


Throw an AIP exception.
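
As a rough illustration of this comment, the sketch below replaces a `return false` fall-through with an exception. `ModelInputError` and `requireSupportedType` are hypothetical names, and `std::runtime_error` only stands in for the project's own AIP exception type, whose interface is not shown in this thread.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the project's AIP exception type.
class ModelInputError : public std::runtime_error {
public:
  using std::runtime_error::runtime_error;
};

// Hypothetical helper: returns normally when frameType is supported and
// throws otherwise, so the caller cannot silently ignore a failed check.
inline void requireSupportedType(const std::vector<int> &supported,
                                 int frameType) {
  for (int candidate : supported) {
    if (candidate == frameType) {
      return;
    }
  }
  throw ModelInputError("unsupported frame type: " +
                        std::to_string(frameType));
}
```

With this shape the validation failure carries a message to the caller rather than a bare boolean.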

yashrajsapra · 2024-04-18T11:21:17Z

base/src/Llava.cpp

+    updateProps(_props);
+  }
+
+  void updateProps(LlavaProps &_props) {


Rename it to setModelProps.
Remove the call from the constructor of the Detail class.

yashrajsapra · 2024-04-18T11:25:11Z

base/src/Llava.cpp

+bool Llava::modelInit() {
+  llama_backend_init(false /*NUMA Architecure set to false*/);
+
+  mDetail->mLlavaModelParams = llama_model_default_params();


Put this in setModelProps.

yashrajsapra · 2024-04-18T11:29:37Z

base/src/Llava.cpp

+  auto frame = frames.begin()->second;
+  auto frameType = frame->getMetadata()->getFrameType();
+  int nPast = 0;
+  std::string systemPrompt = "A chat between a curious human and an artificial intelligence assistant.  The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:";


@joiskash
The system prompt will differ between versions of the same model; it may come from the Model Strategy class.

The system prompt should be a property.
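
One possible shape for this, sketched under assumptions: the struct and field names below (`LlavaPromptProps`, `systemPrompt`, `userPrompt`, `withSystemPrompt`) are hypothetical and are not the PR's `LlavaProps`. The default value is simply the string that is hard-coded in the diff above.

```cpp
#include <string>

// Hypothetical props; the field names are illustrative, not taken from the PR.
struct LlavaPromptProps {
  // Preamble that may differ between model versions. The default is the
  // string that is hard-coded in the diff above.
  std::string systemPrompt =
      "A chat between a curious human and an artificial intelligence "
      "assistant.  The assistant gives helpful, detailed, and polite answers "
      "to the human's questions.\nUSER:";
  std::string userPrompt = "Describe the image";
};

// A strategy (or the caller) can override the preamble for another model
// version without touching the inference code.
inline LlavaPromptProps withSystemPrompt(LlavaPromptProps props,
                                         const std::string &systemPrompt) {
  props.systemPrompt = systemPrompt;
  return props;
}
```

A strategy class could then build its props with `withSystemPrompt(LlavaPromptProps{}, ...)` and pass them to the model.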

yashrajsapra · 2024-04-18T12:05:23Z

base/src/Llava.cpp

+  std::cout << "\n";
+
+  /*Prediction token by token*/
+  for(int i = 0; i < nPredict; i++) {


@mraduldubey @joiskash
We need different strategies for:

1. Chunk-based implementation
2. One big chunk of data

For now, we will only have the second implementation.
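
To make the two options concrete, here is a minimal sketch of a token loop that can emit either per chunk or once at the end. Everything in it is an assumption for illustration: `OutputMode`, `generate`, and the `nextToken` and `emit` callbacks are hypothetical, and `nextToken` only stands in for the model's per-token prediction.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical names for the two output strategies discussed above.
enum class OutputMode { Chunked, SingleChunk };

// nextToken stands in for the model's per-token prediction; emit receives
// the generated text, either every chunkTokens tokens or once at the end.
inline void generate(const std::function<std::string(int)> &nextToken,
                     int nPredict, OutputMode mode, std::size_t chunkTokens,
                     const std::function<void(const std::string &)> &emit) {
  std::string pending;
  std::size_t pendingTokens = 0;
  for (int i = 0; i < nPredict; i++) {
    pending += nextToken(i);
    pendingTokens++;
    if (mode == OutputMode::Chunked && pendingTokens >= chunkTokens) {
      emit(pending);  // hand over a partial result early
      pending.clear();
      pendingTokens = 0;
    }
  }
  if (!pending.empty()) {
    emit(pending);  // SingleChunk output, or the remainder of a chunked run
  }
}
```

Keeping the mode as a parameter means the chunk-based variant can be added later without changing the prediction loop itself.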

yashrajsapra · 2024-04-24T11:13:49Z

base/include/ModelStrategy.h

+# pragma once
+
+#include "Module.h"
+#include "ClipEncoder.h"


Remove the headers that we are not using.

Include only the specific headers that are needed.

yashrajsapra · 2024-04-24T11:22:27Z

base/include/SceneDescriptorXForm.h

+#include "Module.h"
+
+class SceneDescriptorXFormProps : public ModuleProps {
+public:


Instead of creating a new enum, directly use the model strategies enum.

yashrajsapra · 2024-04-24T11:41:58Z

base/src/ModelStrategy.cpp

+  }
+}
+
+/*LLAVA SCENE-DESCRIPTOR STRATEGY*/


@joiskash @mraduldubey
-> We can keep this constructor as the default, but the user should also have the option to use different model weights, GPU layer offloading, and prompts.

-> Don't use hard-coded values.

yashrajsapra · 2024-04-24T11:42:22Z

base/src/ModelStrategy.cpp

+SceneDescriptorModelStrategy::SceneDescriptorModelStrategy() : ModelStrategy() {
+  auto clipProps = ClipEncoderProps("./data/llm/llava/llava-v1.6-7b/mmproj-model-f16.gguf");
+  auto llavaProps = LlavaProps("./data/llm/llava/llava-v1.6-7b/llava-v1.6-mistral-7b.Q8_0.gguf", "Describe the image", 2048, 512, 0.8, 10, 256);
+


Add checks for the model paths.
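
A minimal sketch of what such a check could look like, combined with caller-supplied options instead of hard-coded values. `StrategyOptions`, `requireModelFile`, and `validate` are hypothetical names, and `std::invalid_argument` is only a stand-in for whatever exception type the project prefers. The sketch assumes C++17 for `std::filesystem`.

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical options bundle; names and defaults are illustrative only.
struct StrategyOptions {
  std::string encoderModelPath;
  std::string llmModelPath;
  std::string prompt = "Describe the image";
  int gpuLayers = 0;
};

// Fails fast when a model file is missing, before any model is loaded.
inline void requireModelFile(const std::string &path) {
  std::error_code ec;  // non-throwing overload; errors count as "not found"
  if (!std::filesystem::is_regular_file(path, ec)) {
    throw std::invalid_argument("model file not found: " + path);
  }
}

// Checks every path the strategy is about to hand to the models.
inline void validate(const StrategyOptions &options) {
  requireModelFile(options.encoderModelPath);
  requireModelFile(options.llmModelPath);
}
```

A strategy constructor could accept a `StrategyOptions` value and call `validate` first, keeping a default-constructed variant for the common case.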

yashrajsapra · 2024-04-24T11:51:46Z

base/src/ModelStrategy.cpp

+}
+
+/*LLAVE TEXT-TO-TEXT STRATEGY*/
+LlavaTextToTextModelStrategy::LlavaTextToTextModelStrategy() : ModelStrategy() {


Same comment as SceneDescriptorModelStrategy

yashrajsapra · 2024-04-24T12:09:47Z

base/src/SceneDescriptorXForm.cpp

+  }
+  return Module::term();
+}
+


See the comments on ClipEncoder.

yashrajsapra · 2024-04-24T12:16:15Z

base/src/SceneDescriptorXForm.cpp

+bool SceneDescriptorXForm::process(frame_container &frames) {
+  /*Encoder Model*/
+  mDetail->modelStrategy->encoderModel->push(frames);
+  mDetail->modelStrategy->encoderModel->step();


Option 1
-> push should internally call step.
Option 2
-> If the queue is not empty, the model should automatically pick up the job.
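
The first option can be sketched as follows. `QueuedModel` is a hypothetical, heavily simplified wrapper (strings instead of frames) and is not the PR's `EncoderModelAbstract`. It only shows `push` draining the queue itself, so that callers never invoke `step` directly.

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical minimal model wrapper; not the PR's EncoderModelAbstract.
class QueuedModel {
public:
  // Option 1: push enqueues the input and immediately processes the queue.
  void push(const std::string &input) {
    mQueue.push(input);
    step();
  }

  const std::vector<std::string> &results() const { return mResults; }

private:
  // step is private, so it cannot be forgotten or called twice by a caller.
  void step() {
    while (!mQueue.empty()) {
      mResults.push_back("processed:" + mQueue.front());  // placeholder work
      mQueue.pop();
    }
  }

  std::queue<std::string> mQueue;
  std::vector<std::string> mResults;
};
```

The second option would instead need a worker that watches the queue, which adds threading concerns that this sketch leaves out.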

yashrajsapra · 2024-04-24T12:20:26Z

base/src/SceneDescriptorXForm.cpp

+  mDetail->modelStrategy->encoderModel->step();
+  auto clipFrame =
+      makeFrame(mDetail->modelStrategy->encoderModel->getFrameSize());
+  auto clipMetaData = boost::shared_ptr<FrameMetadata>(


push should take two arguments (the input buffer and a reference to the output buffer).
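
A sketch of that signature under stated assumptions: `Buffer`, `BufferContainer`, and `pushThrough` are hypothetical stand-ins for the project's frame types and the model's `push`, and the "inference" is only a byte copy. Allocation is delegated to the caller through a factory callback, so the model does not need to know how frames are created.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

using Buffer = std::vector<std::uint8_t>;  // stand-in for a frame
using BufferPtr = std::shared_ptr<Buffer>;
using BufferContainer = std::map<std::string, BufferPtr>;  // pin id -> frame

// Reads from `in`, writes results into `out`. The caller supplies makeBuffer
// so that output memory comes from the caller's own allocator.
inline bool pushThrough(const BufferContainer &in, BufferContainer &out,
                        const std::function<BufferPtr(std::size_t)> &makeBuffer) {
  for (const auto &entry : in) {
    BufferPtr result = makeBuffer(entry.second->size());
    // Placeholder "inference": copy the input bytes into the output buffer.
    std::copy(entry.second->begin(), entry.second->end(), result->begin());
    out[entry.first] = result;
  }
  return !out.empty();
}
```

Because both sides are containers, the output of one model can be passed directly as the input of the next.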

yashrajsapra · 2024-04-24T12:23:16Z

base/src/SceneDescriptorXForm.cpp

+  mDetail->modelStrategy->llmModel->step();
+  auto outFrame = makeFrame(mDetail->modelStrategy->llmModel->getFrameSize());
+  mDetail->modelStrategy->llmModel->getFrames(outFrame);
+


Input -> Frame Container
Output -> Frame Container

yashrajsapra · 2024-04-24T12:26:38Z

base/test/sceneDescriptorXForm_tests.cpp

+#include "ModelStrategy.h"
+#include "Module.h"
+#include "ExternalSinkModule.h"
+


Add more tests for "getProps" and "setProps".

joiskash

I have left a few comments; please address them. Thanks!

joiskash · 2024-07-01T07:07:50Z

base/CMakeLists.txt

+find_library(COMMON_LIB NAMES common_llama.lib libcommon_llama.a REQUIRED)
+find_library(LLAVA_LIB NAMES llavalib.lib libllavalib.a REQUIRED)


Why do we need a separate find_library call if we are already using find_package for llama?

joiskash · 2024-10-16T05:09:54Z

base/src/ClipEncoder.cpp

+   auto outputPinId = inputFrameContainer.begin()->first;
+  auto inputFrame = inputFrameContainer.begin()->second;
+  mDetail->storedData = llava_image_embed_make_with_bytes(
+      mDetail->mClipContext, 8,


Move the hard-coded thread count (8) into the props.
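
For illustration only, the thread count could live in a props field whose default matches the current literal. `ClipThreadProps`, `nThreads`, and `effectiveThreads` are hypothetical names, not part of the PR.

```cpp
// Hypothetical props field; the default mirrors the literal 8 in the diff.
struct ClipThreadProps {
  int nThreads = 8;
};

// Guards against zero or negative values coming from user configuration.
inline int effectiveThreads(const ClipThreadProps &props) {
  return props.nThreads < 1 ? 1 : props.nThreads;
}
```

The call site would then pass `effectiveThreads(props)` in place of the literal.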

joiskash · 2024-10-16T05:13:55Z

base/src/ClipEncoder.cpp

+  float *char_buffer = (float *)frame->data();
+  char_buffer += sizeof(llava_image_embed);
+  memcpy(char_buffer, mDetail->storedData->embed,
+         clip_embd_nbytes(mDetail->mClipContext));


Clear storedData if not needed
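
One way to make the cleanup automatic is to hold the embedding in a `std::unique_ptr` with a custom deleter. The sketch below uses stand-in types (`FakeEmbed`, `fakeEmbedMake`, `fakeEmbedFree`) so that it is self-contained. In the real code the deleter would presumably call the matching free function from the llava header, which is assumed here to be `llava_image_embed_free`.

```cpp
#include <memory>

// Stand-ins for the C-style embedding struct and its make/free functions.
struct FakeEmbed {
  float *embed;
  int nImagePos;
};

inline int &liveEmbeds() {  // allocation counter, for demonstration only
  static int count = 0;
  return count;
}

inline FakeEmbed *fakeEmbedMake(int n) {
  liveEmbeds()++;
  return new FakeEmbed{new float[n](), n};
}

inline void fakeEmbedFree(FakeEmbed *e) {
  if (e == nullptr) {
    return;
  }
  delete[] e->embed;
  delete e;
  liveEmbeds()--;
}

// The deleter ties the free function to the pointer's lifetime.
struct EmbedDeleter {
  void operator()(FakeEmbed *e) const { fakeEmbedFree(e); }
};
using EmbedPtr = std::unique_ptr<FakeEmbed, EmbedDeleter>;
```

With `EmbedPtr storedData`, calling `storedData.reset()` after the copy releases the embedding at once, and destruction covers any early-return path.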

joiskash · 2024-10-16T05:22:38Z

base/src/ImageToTextXForm.cpp

+  frame_container clipFrames;
+  mDetail->modelStrategy->encoderModel->push(frames, clipFrames, [&](size_t size) -> frame_sp {return makeFrame(size, mDetail->mOutputPinId); });
+
+  /*LLM Model*/
+  frame_container llavaFrames;
+  mDetail->modelStrategy->llmModel->push(clipFrames, llavaFrames, [&](size_t size) -> frame_sp {return makeFrame(size, mDetail->mOutputPinId); });


Do we need a queue in the models? Since everything is blocking and looks

joiskash · 2024-10-16T05:22:48Z

base/src/EncoderModelAbstract.cpp

+
+bool EncoderModelAbstract::term()
+{
+  mQue->clear();


joiskash · 2024-10-16T05:26:11Z

base/src/ImageToTextXForm.cpp

+
+void ImageToTextXForm::setProps(ImageToTextXFormProps &props)
+{
+  if (props.modelStrategyType != mDetail->mProps.modelStrategyType)


Do not allow gpuLayers and modelStrategyType to change once the module is running.
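
A minimal sketch of such a guard, assuming that rejecting the change with an exception is acceptable. `StrategyType`, `XFormProps`, and `XFormState` are hypothetical and simplified; only the field names `gpuLayers` and `modelStrategyType` come from the comment above.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

enum class StrategyType { ImageToText, TextToText };  // hypothetical values

// Hypothetical, simplified props.
struct XFormProps {
  StrategyType modelStrategyType = StrategyType::ImageToText;
  int gpuLayers = 0;
  std::string prompt = "Describe the image";
};

class XFormState {
public:
  explicit XFormState(XFormProps props) : mProps(std::move(props)) {}

  // Fields that require reloading the model are rejected; the rest update.
  void setProps(const XFormProps &next) {
    if (next.modelStrategyType != mProps.modelStrategyType ||
        next.gpuLayers != mProps.gpuLayers) {
      throw std::invalid_argument(
          "modelStrategyType and gpuLayers cannot change after init");
    }
    mProps = next;
  }

  const XFormProps &getProps() const { return mProps; }

private:
  XFormProps mProps;
};
```

Changing only the prompt goes through, while a different `gpuLayers` value is rejected and leaves the stored props untouched.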

joiskash · 2024-10-16T05:28:16Z

base/src/ModelStrategy.cpp

Please remove all hardcoded values

kushaljain-apra added 11 commits April 11, 2024 07:44

add llama vcpkg port

f64bdad

update whisper portfile to support cuda

0f9f19d

add llm model abstract class

c864c13

add encoder model abstract class

5524ad9

add Llava Model class

4365203

add Clip Encoder Class

7b00c68

add Model Strategy Class

bee2754

add SceneDescriptorXform Module

efd4082

add unit tests

c60e9ec

update cmakelists.txt

d099726

update framemetadata.h

76d37ac

kushaljain-apra requested a review from yashrajsapra April 17, 2024 13:58

yashrajsapra requested changes Apr 18, 2024


yashrajsapra reviewed Apr 24, 2024


kushaljain-apra added 2 commits May 2, 2024 15:33

Updated code to resolve PR conversations

926a07f

change SceneDescriptorXForm Module name to ImageToTextXForm Module

0a42a70

kushaljain-apra changed the title ~~Integrate llama.cpp in Aprapipes and a module SceneDescriptorXForm which can describe an image~~ Integrate llama.cpp in Aprapipes and a module ImageToTextXForm which can describe an image May 6, 2024

joiskash requested changes Oct 16, 2024
