[PoC]: Support encode only models by Workflow Defined Engine #8452
This is a significant change from our current architecture. We'll consider incorporating this when we refactor our core framework.
As vLLM supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results. With WDE, there is no need for one module to be compatible with all features. I hope you like this new architecture @DarkLight1337
It looks nice for sure, but there are many abstractions that are difficult to adopt immediately. If you want us to use this architecture, I suggest that you split this PR up into smaller chunks and gradually refactor the code base rather than doing everything all at once.
I want to experiment with encode-only models and integrate with the existing code only at the entrypoints, so that the impact of the modifications is minimal.
Do you have any better suggestions? @DarkLight1337
This PR serves as a demonstration. I can modify it and resubmit a PR.
Don’t be too anxious.
Not really, just do as you have said.
You should list the features in your PR and how they correspond to #8779. Otherwise, people would have to read through your whole PR to understand what is going on.
I feel that although this PR has some similarities with #8779, their focus is different and there is no way to compare their features one-to-one.
I'd say your PR focuses on tackling the second goal in #8779 (the new architecture will be extensible and modular). You should explain in detail how this is being achieved (in particular, what types of abstractions you are using). That way, we can consider those aspects when planning how to refactor the existing code.
Many things are simple to write in code but very complicated to explain. Can you let them look at the code?
The thing is, people don't want to look at 10k lines of code to understand what is going on. If you want them to use this code, it is your responsibility to explain it.
OK, I'll try.
Yeah, that should be good enough to start with.
Probably need to address this, e.g. which types of models are not supported under this workflow? If it's a large category then we'll have to find a solution to it.
Looks better. You should also think about how to adopt this incrementally.
Sorry, was afk. You can post it whenever you like.
I think the first part, supporting BERT, is pretty good, but it's still 6,000 lines of code. PTAL #8964
6,000 lines of code to support BERT. I feel like a clown.
#9166 has many similarities with Workflow Defined Engine. Can you invite @WoosukKwon to participate in the discussion of this PR?
I suggest you comment directly on his PR. You can also join our Slack workspace (see README) and ping him.
PTAL #8453
Brief introduction
What new models need to be supported
These models were all requested in issues and are also well known:
These models are roughly divided into three categories:
What new features these new models have
What the above three categories have in common is that they have only the prefill stage. To make the terminology more precise, "prefill only" is used below.
You can think of "prefill only" as a fancier way of writing "encode only".
New features:
How the engine architecture needs to support these features flexibly and efficiently
If we directly add new features to existing modules, these modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.
The most flexible and efficient way to support the prefill only models is to implement different modules for models of different architectures and load the required modules on demand.
I call this architecture Workflow Defined Engine, or WDE for short.
I divided the Engine into the following modules.
With WDE, there is no need for one module to be compatible with all features. You can use Python's dynamic loading to load different modules at the highest level, for different models and different needs.
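A minimal sketch of the dynamic-loading idea, not the PR's actual code: a workflow is a plain object whose fields are `"module:Class"` strings, and the engine resolves them with `importlib` only when it is built. The names `Workflow`, `PrefillOnlyWorkflow`, `lazy_import`, and `build_engine` are hypothetical, and standard-library classes stand in for real scheduler and executor implementations so that the snippet runs on its own.

```python
import importlib
from dataclasses import dataclass, fields


def lazy_import(path: str):
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, _, class_name = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


@dataclass(frozen=True)
class Workflow:
    # Hypothetical component slots; the stdlib classes are placeholders only.
    scheduler: str = "queue:Queue"
    executor: str = "concurrent.futures:ThreadPoolExecutor"


@dataclass(frozen=True)
class PrefillOnlyWorkflow(Workflow):
    # A prefill-only model swaps in its own scheduler
    # without touching the default workflow.
    scheduler: str = "queue:PriorityQueue"


def build_engine(workflow: Workflow) -> dict:
    """Import each component class named by the workflow, on demand."""
    return {f.name: lazy_import(getattr(workflow, f.name))
            for f in fields(workflow)}


engine = build_engine(PrefillOnlyWorkflow())
print(engine["scheduler"].__name__)  # PriorityQueue
```

Because components are named by string, a module that a given model does not need is never imported, and a new model type only has to declare the slots it overrides.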
Some models cannot use the common Workflow. When you don’t know where to put the dirty code, you can always create a new workflow and link the model architecture to the new workflow to avoid leaving dirty code everywhere for the sake of compatibility.
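Linking a model architecture to its own workflow could look roughly like the registry below. This is only an illustration under assumed names: `WORKFLOW_REGISTRY`, `register_workflow`, `resolve_workflow`, and all architecture and workflow names are hypothetical and not taken from the PR.

```python
# Hypothetical default used by every architecture without a special case.
DEFAULT_WORKFLOW = "CommonWorkflow"

# Hypothetical registry: model architecture name -> workflow name.
WORKFLOW_REGISTRY: dict[str, str] = {}


def register_workflow(architecture: str, workflow: str) -> None:
    """Link one model architecture to a dedicated workflow."""
    WORKFLOW_REGISTRY[architecture] = workflow


def resolve_workflow(architecture: str) -> str:
    """Return the dedicated workflow if one exists, else the common one."""
    return WORKFLOW_REGISTRY.get(architecture, DEFAULT_WORKFLOW)


# The special-case code lives in its own workflow instead of being
# spread across the common modules.
register_workflow("OddModel", "OddModelWorkflow")

print(resolve_workflow("OddModel"))    # OddModelWorkflow
print(resolve_workflow("OtherModel"))  # CommonWorkflow
```

With this shape, the common workflow never needs an `if architecture == ...` branch; the exception is contained in one registry entry and one workflow.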
Let's start splitting this PR and try to merge it into main.