Integrate llama.cpp in Aprapipes and a module ImageToTextXForm which can describe an image #345

Open
wants to merge 13 commits into
base: main

Conversation

kushaljain-apra
Collaborator

@kushaljain-apra kushaljain-apra commented Apr 17, 2024

IMPORTANT: All PRs must be linked to an issue (except for extremely trivial and straightforward changes).

Fixes #[Issue]

Description

  • Added a vcpkg port for llama.cpp
  • Added support for LLM and encoder models through the Llm Model and Encoder Model abstract classes
  • Added the Llava and Clip Encoder models for ImageToTextXForm
  • Added a Model Strategy class to implement different model strategies
  • Added the ImageToTextXForm module, which uses the Llava model to describe an image

Alternative(s) considered

Have you considered any alternatives? And if so, why have you chosen the approach in this PR?

Type

Type Choose one: (Feature)

Screenshots (if applicable)

Checklist

  • I have read the Contribution Guidelines
  • I have written Unit Tests
  • I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach

#include "EncoderModelAbstract.h"

class ClipEncoderProps : public EncoderModelAbstractProps {
public:

Collaborator

Move all the definitions to the .cpp file.

Collaborator Author

Done

VIVIT // Video Vision Transformer
};

enum DataType { TEXT = 0, IMAGE, AUDIO, TEXT_EMBEDDING, IMAGE_EMBEDDING, AUDIO_EMBEDDING };

Collaborator

Keep all the enums in a single place.

Collaborator Author

Done

@@ -608,6 +625,9 @@ target_link_libraries(aprapipesut
liblzma::liblzma
bigint::bigint
sfml-audio
${COMMON_LIB}
llama

Collaborator

Fix formatting

Collaborator Author

Done


EncoderModelAbstractProps() {
modelArchitecture = ModelArchitectureType::BERT;
inputTypes = {DataType::TEXT};

Collaborator

Move all of these definitions into the .cpp file.

Collaborator Author

Done

VIVIT // Video Vision Transformer
};

enum DataType { TEXT = 0, IMAGE, AUDIO, TEXT_EMBEDDING, IMAGE_EMBEDDING, AUDIO_EMBEDDING };

Collaborator

Use FrameType instead of DataType.

Collaborator Author

Done

return true;
}
}
return false;

Collaborator

Throw an AIP exception here instead of returning false.

Collaborator Author

Done
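
One way to read the "throw instead of returning false" request is sketched below. It is a minimal, self-contained illustration only: `ModelInputError` is a hypothetical stand-in for the project's own AIP exception type, and `InputKind` and `requireSupportedInput` are made-up names rather than code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the project's AIP exception type.
struct ModelInputError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical input kinds; the review suggests the real code use FrameType.
enum class InputKind { Text = 0, Image, Audio };

// Returns normally when `actual` is supported; throws otherwise, so the
// caller cannot silently ignore an unsupported input.
inline void requireSupportedInput(const std::vector<InputKind>& supported,
                                  InputKind actual) {
    const bool found =
        std::find(supported.begin(), supported.end(), actual) != supported.end();
    if (!found) {
        throw ModelInputError("unsupported input kind: " +
                              std::to_string(static_cast<int>(actual)));
    }
}

// Usage: a supported kind passes, an unsupported one throws.
inline void usageExample() {
    const std::vector<InputKind> supported{InputKind::Text};
    requireSupportedInput(supported, InputKind::Text);
    bool threw = false;
    try {
        requireSupportedInput(supported, InputKind::Image);
    } catch (const ModelInputError&) {
        threw = true;
    }
    assert(threw);
}
```

The point of the change is that a bare `return false` can be dropped by the caller, while an exception has to be handled or it stops the pipeline.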

updateProps(_props);
}

void updateProps(LlavaProps &_props) {

Collaborator

Rename it to setModelProps and remove the call from the constructor of the Detail class.

Collaborator Author

Done

bool Llava::modelInit() {
llama_backend_init(false /*NUMA Architecure set to false*/);

mDetail->mLlavaModelParams = llama_model_default_params();

Collaborator

Put this in setModelProps.

Collaborator Author

Done

auto frame = frames.begin()->second;
auto frameType = frame->getMetadata()->getFrameType();
int nPast = 0;
std::string systemPrompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:";

Collaborator

@joiskash
The system prompt will differ between versions of the same model, so it may need to come from the Model Strategy class.

Collaborator Author

The system prompt should be a property.

Collaborator Author

Done
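
A rough sketch of what "system prompt as a property" could look like follows. `PromptProps` and `buildPrompt` are hypothetical names for illustration; the PR's actual `LlavaProps` layout is not shown in this conversation, so nothing here should be read as its real signature.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical props type: the system prompt is data supplied by the caller
// (or by a model strategy), not a literal buried in process().
struct PromptProps {
    std::string systemPrompt;
    std::string userPrompt;

    PromptProps(std::string system, std::string user)
        : systemPrompt(std::move(system)), userPrompt(std::move(user)) {}
};

// Joins the two parts; any role marker such as "USER:" is assumed to be
// part of the system prompt, as in the snippet under review.
inline std::string buildPrompt(const PromptProps& props) {
    return props.systemPrompt + props.userPrompt;
}

// Usage: a different model version only needs a different systemPrompt value.
inline void usageExample() {
    PromptProps props("You are a helpful assistant.\nUSER:", "Describe the image");
    assert(buildPrompt(props) ==
           "You are a helpful assistant.\nUSER:Describe the image");
}
```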

std::cout << "\n";

/*Prediction token by token*/
for(int i = 0; i < nPredict; i++) {

Collaborator

@mraduldubey @joiskash
We need different strategies for:

  1. A chunk-based implementation
  2. One big chunk of data

Collaborator Author

@kushaljain-apra kushaljain-apra Apr 30, 2024

For now, we will only have the second implementation (one big chunk of data).

# pragma once

#include "Module.h"
#include "ClipEncoder.h"

Collaborator

Remove the headers we are not using.

Collaborator

Include only the specific headers that are needed.

Collaborator Author

Done

#include "Module.h"

class SceneDescriptorXFormProps : public ModuleProps {
public:

Collaborator

Instead of creating a new enum, use the Model Strategy enum directly.

Collaborator Author

Done

}
}

/*LLAVA SCENE-DESCRIPTOR STRATEGY*/

Collaborator

@joiskash @mraduldubey
-> We can keep this constructor as the default, but the user should also have the option to set different model weights, the number of layers offloaded to the GPU, and the prompts.

-> Don't use hardcoded values.

SceneDescriptorModelStrategy::SceneDescriptorModelStrategy() : ModelStrategy() {
auto clipProps = ClipEncoderProps("./data/llm/llava/llava-v1.6-7b/mmproj-model-f16.gguf");
auto llavaProps = LlavaProps("./data/llm/llava/llava-v1.6-7b/llava-v1.6-mistral-7b.Q8_0.gguf", "Describe the image", 2048, 512, 0.8, 10, 256);

Collaborator

Add checks that the model paths exist.
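
A path check of this kind could be sketched with C++17 `std::filesystem`, as below. `requireModelFile` and `CheckedModelProps` are hypothetical helpers for illustration, and using `std::invalid_argument` is an assumption; the project would presumably raise its own exception type instead.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>
#include <utility>

// Throws std::invalid_argument unless `path` names an existing regular file.
inline void requireModelFile(const std::string& path) {
    std::error_code ec;  // non-throwing overload: any error counts as "not a file"
    if (!std::filesystem::is_regular_file(path, ec)) {
        throw std::invalid_argument("model file not found: " + path);
    }
}

// Hypothetical props whose constructor validates the path up front, so a bad
// path fails at construction rather than deep inside model loading.
struct CheckedModelProps {
    std::string modelPath;

    explicit CheckedModelProps(std::string path) : modelPath(std::move(path)) {
        requireModelFile(modelPath);
    }
};

// Usage: a missing file is rejected before any model is loaded.
inline void usageExample() {
    bool threw = false;
    try {
        CheckedModelProps props("./no/such/dir/model.gguf");
    } catch (const std::invalid_argument&) {
        threw = true;
    }
    assert(threw);
}
```

Taking the path as a constructor argument also covers the request above to let the user supply different model weights instead of a hardcoded location.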

}

/*LLAVE TEXT-TO-TEXT STRATEGY*/
LlavaTextToTextModelStrategy::LlavaTextToTextModelStrategy() : ModelStrategy() {

Collaborator

Same comment as for SceneDescriptorModelStrategy.

}
return Module::term();
}

Collaborator

See the comments on ClipEncoder; they apply here as well.

bool SceneDescriptorXForm::process(frame_container &frames) {
/*Encoder Model*/
mDetail->modelStrategy->encoderModel->push(frames);
mDetail->modelStrategy->encoderModel->step();

Collaborator

Option 1: push should call step internally.
Option 2: if the queue is not empty, we should pick up the job automatically.

Collaborator Author

Done

mDetail->modelStrategy->encoderModel->step();
auto clipFrame =
makeFrame(mDetail->modelStrategy->encoderModel->getFrameSize());
auto clipMetaData = boost::shared_ptr<FrameMetadata>(

Collaborator

push should take two arguments: the input buffer and a reference to the output buffer.

Collaborator Author

Done
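
The shape being asked for (a blocking `push` that takes the input and a reference to the output, and runs `step` itself) might look roughly like this toy version. All types here (`Frame`, `FrameContainer`, `MakeFrame`, `ToyEncoder`) are hypothetical stand-ins and not the pipeline's real classes, and the byte inversion is a placeholder for inference.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the pipeline's frame types.
using Frame = std::vector<unsigned char>;
using FramePtr = std::shared_ptr<Frame>;
using FrameContainer = std::map<std::string, FramePtr>;
using MakeFrame = std::function<FramePtr(std::size_t)>;

class ToyEncoder {
public:
    // Blocking call: reads `input`, fills `output`, and runs step() itself,
    // so the caller never has to call step() separately.
    bool push(const FrameContainer& input, FrameContainer& output,
              const MakeFrame& makeFrame) {
        for (const auto& entry : input) {
            if (!entry.second) return false;  // reject null frames
            FramePtr out = makeFrame(entry.second->size());
            if (!out || out->size() != entry.second->size()) return false;
            step(*entry.second, *out);
            output[entry.first] = out;
        }
        return true;
    }

private:
    // Placeholder "inference": inverts every byte.
    static void step(const Frame& in, Frame& out) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = static_cast<unsigned char>(255 - in[i]);
        }
    }
};

// Usage: the module passes its own frame factory, as a lambda.
inline void usageExample() {
    ToyEncoder encoder;
    FrameContainer in, out;
    in["pin0"] = std::make_shared<Frame>(Frame{0, 10, 255});
    const bool ok = encoder.push(
        in, out, [](std::size_t n) { return std::make_shared<Frame>(n); });
    assert(ok);
    assert(out.size() == 1);
}
```

Passing the frame factory in keeps allocation with the module, so the model never needs to know how frames are created.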

mDetail->modelStrategy->llmModel->step();
auto outFrame = makeFrame(mDetail->modelStrategy->llmModel->getFrameSize());
mDetail->modelStrategy->llmModel->getFrames(outFrame);

Collaborator

The input should be a frame container, and the output should be a frame container as well.

Collaborator Author

Done

#include "ModelStrategy.h"
#include "Module.h"
#include "ExternalSinkModule.h"

Collaborator

Add more tests for getProps and setProps.

@kushaljain-apra changed the title from "Integrate llama.cpp in Aprapipes and a module SceneDescriptorXForm which can describe an image" to "Integrate llama.cpp in Aprapipes and a module ImageToTextXForm which can describe an image" on May 6, 2024

Collaborator

@joiskash joiskash left a comment

I have left a few comments; please address them. Thanks!

Comment on lines +606 to +607
find_library(COMMON_LIB NAMES common_llama.lib libcommon_llama.a REQUIRED)
find_library(LLAVA_LIB NAMES llavalib.lib libllavalib.a REQUIRED)

Collaborator

Why do we need separate find_library calls if we are already using find_package for llama?

auto outputPinId = inputFrameContainer.begin()->first;
auto inputFrame = inputFrameContainer.begin()->second;
mDetail->storedData = llava_image_embed_make_with_bytes(
mDetail->mClipContext, 8,

Collaborator

Move the number of threads (hardcoded as 8) into the props.

float *char_buffer = (float *)frame->data();
char_buffer += sizeof(llava_image_embed);
memcpy(char_buffer, mDetail->storedData->embed,
clip_embd_nbytes(mDetail->mClipContext));

Collaborator

Clear storedData once it is no longer needed.
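
One common way to guarantee this kind of cleanup is to hold the C-allocated object in a `std::unique_ptr` with a custom deleter. The sketch below uses hypothetical stand-ins (`ToyEmbed`, `toyEmbedMake`, `toyEmbedFree`) rather than the real llama.cpp types. The llava header is understood to provide `llava_image_embed_free` for embeddings it allocates, but that should be checked against the llama.cpp version in use.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an embedding allocated by a C API.
struct ToyEmbed {
    float* embed;
    int count;
};

// Counts embeddings that are currently allocated (for demonstration only).
inline int& liveEmbeds() {
    static int n = 0;
    return n;
}

inline ToyEmbed* toyEmbedMake(int count) {
    ++liveEmbeds();
    return new ToyEmbed{new float[count](), count};
}

inline void toyEmbedFree(ToyEmbed* e) {
    if (!e) return;
    delete[] e->embed;
    delete e;
    --liveEmbeds();
}

// Deleter that routes unique_ptr cleanup through the C-style free function.
struct ToyEmbedDeleter {
    void operator()(ToyEmbed* e) const { toyEmbedFree(e); }
};
using ToyEmbedPtr = std::unique_ptr<ToyEmbed, ToyEmbedDeleter>;

// Usage: replacing or dropping the pointer frees the previous embedding.
inline void usageExample() {
    {
        ToyEmbedPtr stored(toyEmbedMake(4));
        assert(liveEmbeds() == 1);
        stored.reset(toyEmbedMake(8));  // the old embedding is freed here
        assert(liveEmbeds() == 1);
    }
    assert(liveEmbeds() == 0);  // freed when `stored` went out of scope
}
```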

Comment on lines +152 to +157
frame_container clipFrames;
mDetail->modelStrategy->encoderModel->push(frames, clipFrames, [&](size_t size) -> frame_sp {return makeFrame(size, mDetail->mOutputPinId); });

/*LLM Model*/
frame_container llavaFrames;
mDetail->modelStrategy->llmModel->push(clipFrames, llavaFrames, [&](size_t size) -> frame_sp {return makeFrame(size, mDetail->mOutputPinId); });

Collaborator

Do we need a queue in the models? Since everything is blocking and looks


bool EncoderModelAbstract::term()
{
mQue->clear();

Collaborator

Rename this to mQueue.


void ImageToTextXForm::setProps(ImageToTextXFormProps &props)
{
if (props.modelStrategyType != mDetail->mProps.modelStrategyType)

Collaborator

Do not allow gpuLayers and modelStrategyType to be changed through setProps.
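
A guard of this sort could be sketched as follows. `ToyProps` and `ToyModule` are hypothetical; only the field names `modelStrategyType` and `gpuLayers` come from the review comment, and rejecting the change with `std::invalid_argument` is an assumption about the desired behaviour.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical props; field names follow the review comment.
struct ToyProps {
    int modelStrategyType;
    int gpuLayers;
    std::string prompt;
};

class ToyModule {
public:
    explicit ToyModule(ToyProps props) : mProps(std::move(props)) {}

    // Accepts prompt changes but rejects changes to load-time settings.
    void setProps(const ToyProps& next) {
        if (next.modelStrategyType != mProps.modelStrategyType ||
            next.gpuLayers != mProps.gpuLayers) {
            throw std::invalid_argument(
                "modelStrategyType and gpuLayers cannot change after construction");
        }
        mProps.prompt = next.prompt;
    }

    const ToyProps& getProps() const { return mProps; }

private:
    ToyProps mProps;
};

// Usage: the prompt can be updated, the GPU layer count cannot.
inline void usageExample() {
    ToyModule module(ToyProps{0, 10, "Describe the image"});
    module.setProps(ToyProps{0, 10, "Count the people"});
    assert(module.getProps().prompt == "Count the people");
    bool threw = false;
    try {
        module.setProps(ToyProps{0, 20, "Count the people"});
    } catch (const std::invalid_argument&) {
        threw = true;
    }
    assert(threw);
}
```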

Collaborator

Please remove all hardcoded values.

3 participants