-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(workflow): Implement a simplified CoAct workflow #3770
Conversation
just fyi, the integration tests seem to fail because of some
in some results. |
Thanks, still waiting for reviews on it, if it is good to go I will look into the tests. |
2890cc5
to
fc44e17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thanks a bunch for this @ryanhoangt !
I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.
Thanks Prof., I'll do and update how it goes soon. |
It might be in the paper(s), but I don't quite like that the prompts now talk of |
Re the json in the visualizer, seems like it is because we don't format the finish action yet.
Good catch, this seems to be another bug. Might be because this action is not handled properly: return AgentFinishAction(thought=thought, outputs={'content': thought})
Yeah I also noticed this issue. My intention is to make the Planner include the full user message (hence the full problem statement in swe-bench) to give executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." as you saw, which is problematic.
"let's come up with 2 global plans sequentially" - this is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner. Similar to OpenHands/evaluation/swe_bench/run_infer.py Lines 248 to 290 in 41ddba8
|
I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all. |
Yeah that makes sense, I can try doing that in the next run |
Okay finally the score is converging to what we want, thanks @enyst for all the improvement suggestions! On the subset of 93 verified instances, CoAct resolved 33/93 while CodeAct resolved 39/93. Some plots: Comparing instances resolved in each category, seems like CoAct doesn't perform very well on easy level instances: I'm gonna upload the trajectories to Huggingface shortly. |
@@ -257,6 +265,10 @@ def _get_messages(self, state: State) -> list[Message]: | |||
else: | |||
raise ValueError(f'Unknown event type: {type(event)}') | |||
|
|||
if message and message.role == 'user' and not self.initial_task_str[0]: | |||
# first user message | |||
self.initial_task_str[0] = message.content[0].text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, do we still need this?
Cheers! This is great news. ❤️ The reason I suggested we take a look at the default agent changes, was just to make sure that it doesn't change its normal behavior. Give or take some details that I'm guessing integration tests will be unhappy with, so we can see and fix them if so, I think it shouldn't be a problem. |
The trajectory is uploaded to the visualizer here. I'm going to run evaluation on all 300 instances with the remote runtime to see how it goes, also clean up code a bit and fix tests. |
Hello @ryanhoangt. Just checking in to see if this is something that you will continue working on? There's lots of changes that have gone in recently and don't want you to run into too many hard to resolve conflicts as it seems like it's an involved PR. |
Hey @mamoodi, thanks for checking in. I’m a bit tied up with other tasks at the moment, so I won’t be able to get back to this right away. Maybe we can close the PR for now and I will try to circle back when I have more bandwidth. |
As per Ryan's comment, I'm going to close this PR for now. Whenever Ryan is ready, it will be reopened. Thank you. |
Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG
CodeActAgent
overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help.CoActPlannerAgent
finished butCodeActAgent
failed (I expected both to be able to complete it):Give a summary of what the PR does, explaining any non-trivial design decisions
Some next steps to improve this may be:
swe-bench-lite
instances.BrowsingAgent
orCoActExecutorAgent
) to persist its history through the multi-turn conversation.Link of any specific issues this addresses
#3077