[Feature Request] Video VL Support #1290

Open
kleineluka opened this issue Dec 29, 2024 · 4 comments

@kleineluka

Describe the Issue
New multimodal models support not only image captioning (which Kobold implements) but video captioning as well. For examples, see Qwen2-VL or Apollo (which is built on Qwen).

Additional Information:
For UI implementation, a simple "Add video" button beside the "Add img" button would suffice - although I believe getting it working with the API is more important. If there is already a way to achieve this with Kobold and I'm mistaken, please let me know!

Thank you for all the hard work! ^_^

@jabberjabberjabber

The API can in fact already analyze videos. Here is a demo.

@kleineluka
Author

Is this the same way that models like Qwen caption videos? From a brief overview of the repository you linked, it looks like that is just captioning frame-by-frame. Admittedly, I'm not too sure how the native video support works, but I would've expected it to be a different process than sending frame-by-frame and captioning as pictures?

@jabberjabberjabber

Yes, in fact I copied the ffmpeg idea from MiniCPM-V-2.6:

For MiniCPM-V 2.6, we took the approach of extracting frames from the video file and inputting each frame's data sequentially to the model. At the code level, I introduced the open source library ffmpeg to implement video frame extraction, and added the "video" parameter to the args of llama.cpp to read video files.
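
To make the frame-extraction step concrete, here is a minimal sketch in Python. It is not the MiniCPM-V or llama.cpp code; it only shows one way to sample frames with the ffmpeg CLI. The function names, the `frame_%04d.jpg` naming pattern, and the default of one frame per second are assumptions chosen for illustration, and ffmpeg is assumed to be on PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video_path, out_dir, fps=1.0):
    """Return an ffmpeg argument list that samples `fps` frames per second as JPEGs.

    The output naming pattern is a hypothetical choice, not taken from any project.
    """
    pattern = str(Path(out_dir) / "frame_%04d.jpg")
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", pattern]


def extract_frames(video_path, out_dir, fps=1.0):
    """Run ffmpeg and return the extracted frame paths in playback order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, out_dir, fps), check=True)
    # Zero-padded names sort lexicographically in the same order as the video.
    return sorted(Path(out_dir).glob("frame_*.jpg"))
```

Calling `extract_frames("clip.mp4", "frames")` would leave the sampled frames in `frames/`, ready to be sent to the model one at a time or in small groups.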

@jabberjabberjabber

Is this the same way that models like Qwen caption videos? From a brief overview of the repository you linked, it looks like that is just captioning frame-by-frame. Admittedly, I'm not too sure how the native video support works, but I would've expected it to be a different process than sending frame-by-frame and captioning as pictures?

You change the 'batch-size' to send it multiple images at once. Unfortunately, in its current version KoboldCpp will not allow more than 4 images to be submitted at the same time, so that's our limitation.
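
A rough sketch of how that batching could look in Python, assuming the 4-image limit mentioned above. The `prompt` and `images` field names (with images as base64 strings) are an assumption about the request body and should be checked against the KoboldCpp API documentation; `chunk` and `build_payload` are hypothetical helpers, not part of any existing tool.

```python
import base64

# Limit reported in this thread for the current KoboldCpp version.
MAX_IMAGES_PER_REQUEST = 4


def chunk(items, size=MAX_IMAGES_PER_REQUEST):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_payload(frame_bytes, prompt):
    """Build one request body for a group of frames.

    Field names are assumed, not confirmed by this thread.
    """
    if len(frame_bytes) > MAX_IMAGES_PER_REQUEST:
        raise ValueError("too many images for a single request")
    images = [base64.b64encode(b).decode("ascii") for b in frame_bytes]
    return {"prompt": prompt, "images": images}
```

With 10 extracted frames this yields three requests (4, 4, and 2 frames), so each caption only covers a short window of the video rather than the whole clip.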
