
feat(workflow): Implement a simplified CoAct workflow #3770

Closed
wants to merge 53 commits

Conversation

@ryanhoangt (Contributor) commented Sep 7, 2024

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

  • This PR implements a simplified multi-agent workflow inspired by the CoAct paper.
  • Currently, in the SWE-bench eval, there are complex instances that OpenHands fails on, especially ones where a single CodeActAgent overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help.
  • An overkill-ish successful trajectory with replanning can be found here.
  • A task that CoActPlannerAgent finished but CodeActAgent failed on (I expected both to be able to complete it):

Give a summary of what the PR does, explaining any non-trivial design decisions

  • Modify CodeAct to make it accept delegated tasks.
  • Implement two new agents, a planner and an executor, with the same abilities as CodeAct but different system prompts and additional action parsers (a sketch follows this list).
  • Nit: adjust the delegate message shown in the UI.
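
To make the delegation flow concrete, here is a minimal sketch of how a planner can hand one phase of its plan to the executor. This is illustrative only, not the PR's actual code: the import path may differ between OpenHands versions, and the `CoActExecutorAgent` registration name and `task` input key are assumptions.

```python
# Minimal sketch (not this PR's code): a planner hands one phase of its
# plan to an executor through OpenHands' delegation action. The import
# path, agent name, and input key are assumptions for illustration.
from openhands.events.action import AgentDelegateAction

def delegate_phase(phase_description: str) -> AgentDelegateAction:
    """Wrap one phase of the global plan as a task for the executor agent."""
    return AgentDelegateAction(
        agent='CoActExecutorAgent',          # registered delegate name (assumed)
        inputs={'task': phase_description},  # payload the executor receives
    )
```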

Some next steps to improve this may be:

  • Try eval on some swe-bench-lite instances.
  • Adjust the system/user prompts and few-shot examples to further specialize the two agents. Also define the structure for the plan (e.g., its components, etc.). The two agents can now cooperate to finish a SWE-bench issue.
  • Use a meta prompt to reinforce the agents' actions, to make sure they follow the workflow.
  • Experiment with the ability for the global planner to refuse a replan request from the executor.
  • Implement the ability for the delegated agent (e.g., BrowsingAgent or CoActExecutorAgent) to persist its history through the multi-turn conversation (a sketch follows this list).
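
On the last point, one rough sketch of what persisting a delegate's history could look like: cache delegate agents by name so a re-delegated agent keeps its accumulated state instead of starting fresh each turn. Purely illustrative; this is not how the PR (or OpenHands) manages delegates.

```python
# Rough sketch (assumption, not this PR's code): cache delegates by name
# so a re-delegated agent keeps the history it built up in earlier turns.
from typing import Any, Callable, Dict

class DelegateRegistry:
    def __init__(self) -> None:
        self._delegates: Dict[str, Any] = {}

    def get_or_create(self, name: str, factory: Callable[[], Any]) -> Any:
        """Return the cached delegate for `name`, creating it on first use."""
        if name not in self._delegates:
            self._delegates[name] = factory()
        return self._delegates[name]
```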

Link of any specific issues this addresses

#3077

@tobitege (Collaborator) commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

@ryanhoangt (Author) commented Sep 7, 2024

> just fyi, the integration tests seem to fail because of some
>
> "action_suffix": "browse"
>
> in some results.

Thanks, I'm still waiting for reviews on it; if it's good to go I'll look into the tests.

@neubig self-requested a review September 8, 2024 13:32

@neubig (Contributor) left a comment

Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

@ryanhoangt (Author) replied:

> Hey, thanks a bunch for this @ryanhoangt !
>
> I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

Thanks Prof., I'll do that and update you on how it goes soon.

@tobitege (Collaborator) commented:

It might be in the paper(s), but I don't quite like that the prompts now talk of "agent", while everywhere else it is "assistant". 🤔

@ryanhoangt (Author) commented Sep 18, 2024

> I think it's visible when you look at the trajectories linked above. I'm looking now at the first of those 2, and step 9 is like:

Re the JSON in the visualizer, it seems to be because we don't format the finish action yet.

> prompt_039.log - It has an observation using JSON.

Good catch, this seems to be another bug. It might be because this action is not handled properly:

    return AgentFinishAction(thought=thought, outputs={'content': thought})
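
For illustration, the missing handling might look roughly like the sketch below; the function name and the visualizer's calling convention are hypothetical, not the project's actual API.

```python
# Hypothetical sketch: render AgentFinishAction's closing thought instead
# of dumping the raw serialized action as JSON. Names are illustrative.
def render_finish_action(action) -> str:
    content = action.outputs.get('content') if action.outputs else None
    return content or action.thought
```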

> There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt.

Yeah, I also noticed this issue. My intention is to make the Planner include the full user message (hence the full problem statement in SWE-bench) to give the executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." as you saw, which is problematic.

> I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

"let's come up with 2 global plans sequentially" - this is an extra piece of prompt used only in the SWE-bench evaluation for CoActPlanner. Similar to CodeActSWEAgent below, it can be used to steer the agent a bit to be better at a specific task, but I'm not sure the current "2 global plans" approach is the optimal way to go. With CodeActAgent there are many cases where the agent just fixed the issue without creating any tests.

```python
if agent_class == 'CodeActSWEAgent':
    instruction = (
        'We are currently solving the following issue within our repository. Here is the issue text:\n'
        '--- BEGIN ISSUE ---\n'
        f'{instance.problem_statement}\n'
        '--- END ISSUE ---\n\n'
    )
    if USE_HINT_TEXT and instance.hints_text:
        instruction += (
            f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
        )
    instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issue discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""
else:
    # Testing general agents
    instruction = (
```

@enyst (Collaborator) commented Sep 18, 2024

I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all.
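
For instance, a minimal sketch of deterministic injection, assuming the planner keeps the first user message in `initial_task_str` (as in the diff reviewed below); the method name is an illustrative assumption, not the PR's actual code:

```python
# Sketch: build the delegated task ourselves instead of hoping the LLM
# copies the user message. Assumes the planner stores the first user
# message in self.initial_task_str[0]; the method name is hypothetical.
def build_executor_task(self, phase_description: str) -> str:
    return (
        f'Original request:\n{self.initial_task_str[0]}\n\n'
        f'Current phase to execute:\n{phase_description}'
    )
```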

@ryanhoangt (Author) replied:

Yeah, that makes sense. I can try doing that in the next run.

@ryanhoangt (Author) commented:

Okay, the score is finally converging to what we want. Thanks @enyst for all the improvement suggestions! On the subset of 93 verified instances, CoAct resolved 33/93 while CodeAct resolved 39/93.

Some plots:

Comparing instances resolved in each category, it seems like CoAct doesn't perform very well on easy-level instances:

I'm gonna upload the trajectories to Huggingface shortly.

```diff
@@ -257,6 +265,10 @@ def _get_messages(self, state: State) -> list[Message]:
             else:
                 raise ValueError(f'Unknown event type: {type(event)}')

+            if message and message.role == 'user' and not self.initial_task_str[0]:
+                # first user message
+                self.initial_task_str[0] = message.content[0].text
```
A collaborator left a review comment on these lines:
Just wondering, do we still need this?

@enyst (Collaborator) commented Sep 26, 2024

Cheers! This is great news. ❤️

The reason I suggested we take a look at the default agent changes was just to make sure that it doesn't change its normal behavior. Aside from some details that I'm guessing the integration tests will be unhappy with (we can see and fix them if so), I think it shouldn't be a problem.

@ryanhoangt (Author) replied:

> The reason I suggested we take a look at the default agent changes was just to make sure that it doesn't change its normal behavior. Aside from some details that I'm guessing the integration tests will be unhappy with (we can see and fix them if so), I think it shouldn't be a problem.

The trajectory is uploaded to the visualizer here. I'm going to run the evaluation on all 300 instances with the remote runtime to see how it goes, and also clean up the code a bit and fix the tests.

@mamoodi (Collaborator) commented Nov 1, 2024

Hello @ryanhoangt. Just checking in to see if this is something you will continue working on. There are lots of changes that have gone in recently, and I don't want you to run into too many hard-to-resolve conflicts, as it seems like an involved PR.

@ryanhoangt (Author) replied:

Hey @mamoodi, thanks for checking in. I’m a bit tied up with other tasks at the moment, so I won’t be able to get back to this right away. Maybe we can close the PR for now and I will try to circle back when I have more bandwidth.

@mamoodi (Collaborator) commented Nov 14, 2024

As per Ryan's comment, I'm going to close this PR for now. Whenever Ryan is ready, it will be reopened. Thank you.

@mamoodi closed this Nov 14, 2024
@enyst mentioned this pull request Dec 27, 2024
@enyst mentioned this pull request Jan 10, 2025
Labels
agent framework (Strategies for prompting, agent, etc.) · enhancement (New feature or request)