
[Experiment] Add symbol navigation commands into the editor #5092

Closed

Conversation

@ryanhoangt (Contributor) commented Nov 17, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR is to:


Link of any specific issues this addresses

@ryanhoangt requested a review from xingyaoww November 21, 2024 13:24
@xingyaoww (Collaborator) left a comment


This PR now looks good to me! Can you share some detailed evaluation results here before we merge this?

@@ -63,7 +63,7 @@ opentelemetry-exporter-otlp-proto-grpc = "1.25.0"
 modal = "^0.64.145"
 runloop-api-client = "0.7.0"
 pygithub = "^2.5.0"
-openhands-aci = "^0.1.1"
+openhands-aci = {git = "https://github.com/All-Hands-AI/openhands-aci.git", rev = "ht/jump-commands"}
Collaborator


Don't forget to update this when we merge jump-commands in the ACI repo.

Collaborator


Once this is updated & CI is fixed, I'd be happy to approve this PR.

@ryanhoangt (Contributor, Author) commented Nov 22, 2024

Eval results for the PR on a subset of swe-bench-lite:

| Model | PR resolved | Baseline resolved |
|---|---|---|
| claude-3-5-sonnet-20241022 | 35/59 | 35/59 |
| deepseek-chat | 8/59 | 7/59 |
| gpt-4o-2024-05-13 (without in-context example) | 17/43 | 19/43 |

PR run details:

  • claude-3-5-sonnet-20241022: 59 submitted instances, 0 empty patches, 35 resolved, 24 unresolved, 0 errors
  • deepseek-chat: 59 submitted instances, 14 empty patches, 8 resolved, 51 unresolved, 0 errors
  • gpt-4o-2024-05-13: 43 submitted instances, 5 empty patches, 17 resolved, 26 unresolved, 0 errors

Some issues I noticed when looking into the trajectories for gpt-4o:

  • A few instances got stuck because the agent couldn't include a longer context in old_str to ensure uniqueness. Some overcame that, but then got stuck trying to fix wrong indentation in str_replace calls whose new_str spans multiple lines.
  • Many instances hit a runtime error, e.g. Unknown tool call: create/view..., which caused the whole trajectory to crash. This can be observed in both the baseline and the PR. I'm not sure if we should do some retries here or catch the error and inform the LLM (see the sketch after this list).
  • The other instances' trajectories look normal to me, and I can't spot any obvious issues with them.
    One thing I noticed is that 9 of the 19 resolved baseline instances DID encounter runtime errors during execution, which IMO contributes somewhat to the uncertainty and variation in the results.
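
A minimal sketch of the "catch the error and inform the LLM" option from the second bullet above, assuming a hypothetical tool-dispatch layer; ToolRegistry and ErrorObservation are illustrative names, not the actual OpenHands classes:

```python
# Illustrative sketch only: ToolRegistry and ErrorObservation are hypothetical
# stand-ins, not the actual OpenHands API.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ErrorObservation:
    """Returned to the LLM instead of raising, so the trajectory keeps going."""
    content: str


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def dispatch(self, name: str, **kwargs: Any) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            # Instead of crashing the trajectory on "Unknown tool call",
            # return an error observation the LLM can react to.
            known = ", ".join(sorted(self._tools)) or "(none registered)"
            return ErrorObservation(
                f"Unknown tool call: {name}. Available tools: {known}. "
                "Please retry with one of the available tools."
            )
        return tool(**kwargs)
```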

@enyst (Collaborator) commented Nov 22, 2024

Just to clarify, do you mean 17/43 vs. 19/43? It's always worth looking into what happens, just in case, but the usual differences in what the LLM "decides" to do are much higher than this.

@ryanhoangt (Contributor, Author) commented Nov 22, 2024

but the usual differences in what the LLM "decides" to do are much higher than this.

Yeah I agree. Although it's not desirable, sometimes just a small change in the action/output of a single step can lead to a completely different ending for the same instance.

Just to clarify, do you mean 17/43 vs. 19/43? It's always worth looking into what happens, just in case

I think the simple approach is maybe just to compare resolve rates. If the PR's is worse than the baseline's, then we should look into what happened. Can you clarify your suggestion a bit here?

@enyst (Collaborator) commented Nov 22, 2024

Just to clarify, do you mean 17/43 vs. 19/43? It's always worth looking into what happens, just in case

I think the simple approach is maybe just to compare resolve rates. If the PR's is worse than the baseline's, then we should look into what happened. Can you clarify your suggestion a bit here?

Oh definitely. I would make a script to tell me what succeeded in one and failed in the other, both ways (it can happen), then look at the actual events for anything suspicious.

I'm not suggesting anything different; I'm saying that 2/43 is within the range of possible "normal" different outcomes. It's worth looking into, but it's not obvious it means "degradation". Likewise, deepseek-chat at 8/59 vs. 7/59 is good to see, but it might just be random and not a "real" improvement.
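
A minimal sketch of the comparison script described above, assuming each run produces a JSON report that can be read as a list of {"instance_id": ..., "resolved": ...} records; the file layout and keys here are assumptions, not the actual swe-bench report schema:

```python
# Sketch of a run-comparison script: find instances resolved in one run but
# not the other, in both directions. The report format is an assumption.
import json
import sys


def load_resolved(path: str) -> dict[str, bool]:
    with open(path) as f:
        records = json.load(f)
    return {r["instance_id"]: bool(r["resolved"]) for r in records}


def main(baseline_path: str, pr_path: str) -> None:
    baseline = load_resolved(baseline_path)
    pr = load_resolved(pr_path)
    common = baseline.keys() & pr.keys()

    only_baseline = sorted(i for i in common if baseline[i] and not pr[i])
    only_pr = sorted(i for i in common if pr[i] and not baseline[i])

    print(f"Resolved in baseline but not in PR ({len(only_baseline)}):")
    print("\n".join(only_baseline) or "  (none)")
    print(f"Resolved in PR but not in baseline ({len(only_pr)}):")
    print("\n".join(only_pr) or "  (none)")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```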

@enyst (Collaborator) commented Nov 22, 2024

Just to note on this:

Many instances hit a runtime error, e.g. Unknown tool call: create/view..., which caused the whole trajectory to crash. This can be observed in both the baseline and the PR. I'm not sure if we should do some retries here or catch the error and inform the LLM.

This sounds like something we could catch and inform the LLM about. If we can do that, it would also be useful outside of evals, in regular use.

@xingyaoww (Collaborator) commented Nov 22, 2024

Many instances hit a runtime error, e.g. Unknown tool call: create/view..., which caused the whole trajectory to crash. This can be observed in both the baseline and the PR. I'm not sure if we should do some retries here or catch the error and inform the LLM.

This sounds like something we could catch and inform the LLM about. If we can do that, it would also be useful outside of evals, in regular use.

I believe this should be fixed in #5113?

@enyst (Collaborator) commented Nov 22, 2024

I believe this should be fixed in #5113?

Oh thanks! I was thinking I'd seen something, but I couldn't remember where in the code. Maybe the run used a branch based on a main older than four days.

@ryanhoangt (Contributor, Author) commented Nov 22, 2024

Yeah, it seems like my PR didn't include that change, unfortunately. Also thanks @enyst for the comment, that makes sense. We may be able to tell more confidently with more instances run, but it's not very economical 🥹

@ryanhoangt (Contributor, Author) commented Dec 5, 2024

Took the chance to run a full eval on swe-bench-lite for claude -- fortunately we got performance comparable with the baseline v2.2 (130/300) and v2.1 (125/300) on the leaderboard. At least it's not degrading performance much, so maybe we can continue improving upon this with All-Hands-AI/openhands-aci#19:

05:00:07 - openhands:INFO: eval_infer.py:418 - # resolved: 127 / 300. (42.33%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # failed_apply_patch: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # error_eval: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # empty_generation: 3 / 300. (1.00%)

There are only a bit over 20 instances (~7%) where the agent called either of the two new navigation commands, so it makes sense that the result is not very different. I'm going to look more closely at the results and post a more detailed comparison here.
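
A rough sketch of how that navigation-command usage could be counted, assuming one JSONL trajectory file per instance with a "tool_name" field on each event; both the layout and the field name are assumptions for illustration:

```python
# Sketch for counting instances that invoked the new navigation commands.
# The trajectory layout (one .jsonl of events per instance, each event with a
# "tool_name" field) is assumed for illustration only.
import json
from collections import Counter
from pathlib import Path

NAV_COMMANDS = {"jump_to_definition", "find_references"}


def nav_command_usage(traj_dir: str) -> tuple[set[str], Counter]:
    """Return instance ids that used a navigation command, plus per-command counts."""
    instances: set[str] = set()
    counts: Counter = Counter()
    for path in Path(traj_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            name = event.get("tool_name")
            if name in NAV_COMMANDS:
                instances.add(path.stem)  # file name doubles as the instance id
                counts[name] += 1
    return instances, counts


if __name__ == "__main__":
    used, counts = nav_command_usage("trajectories/")
    print(f"{len(used)} instances called a navigation command; counts: {dict(counts)}")
```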

@ryanhoangt marked this pull request as ready for review December 5, 2024 06:10
@ryanhoangt (Contributor, Author)

Took a look at the results; I can't find anything significant/interesting for now, possibly due to the small difference in the results. Some plots:

  • Comparing v2.1, v2.2 and the PR: the difference between v2.1 and v2.2 seems pretty weird; I don't think we have many changes between the two versions.
    [plots: v2.1 vs. v2.2, PR vs. v2.1, PR vs. v2.2]
  • Regarding instances where the two commands are called: there are 21 instances where either command is called.
    [plots: # instances with jump_to_definition vs. find_references called; # resolved instances in the 21-instance subset, PR vs. v2.2]
  • I thought about comparing resolved instances per difficulty as well, but 69/300 Lite instances are without gold annotations and many resolved instances fall into this subset, so the plots are not very interpretable.
    [plot: diff-level-lite]

@mamoodi (Collaborator) commented Dec 23, 2024

Going to assume this is still in progress?

@ryanhoangt (Contributor, Author)

Yes, I'm working on a refactor and will circle back to this PR soon!

@ryanhoangt (Contributor, Author) commented Jan 14, 2025

I've been running evals and the results are not improving much. Given we have other work with higher priority (e.g. model routing), I'll close this PR for now and circle back to it later.

@ryanhoangt closed this Jan 14, 2025