
[Experiment] Add symbol navigation commands into the editor #5092

Closed

Conversation

@ryanhoangt (Contributor) commented Nov 17, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR is to:


Link of any specific issues this addresses

@ryanhoangt requested a review from xingyaoww November 21, 2024 13:24
@xingyaoww (Collaborator) left a comment


This PR now looks good to me! Can you share some detailed evaluation results here before we merge this?

@@ -63,7 +63,7 @@ opentelemetry-exporter-otlp-proto-grpc = "1.25.0"
 modal = "^0.64.145"
 runloop-api-client = "0.7.0"
 pygithub = "^2.5.0"
-openhands-aci = "^0.1.1"
+openhands-aci = {git = "https://github.com/All-Hands-AI/openhands-aci.git", rev = "ht/jump-commands"}
Collaborator


Don't forget to update this when we merge jump-commands in the ACI repo.

Collaborator


Once this is updated & CI is fixed, I'd be happy to approve this PR.

@ryanhoangt (Contributor, Author) commented Nov 22, 2024

Eval results for the PR on a subset of swe-bench-lite:

| Model | PR resolved | Baseline resolved |
|---|---|---|
| claude-3-5-sonnet-20241022 | 35/59 | 35/59 |
| deepseek-chat | 8/59 | 7/59 |
| gpt-4o-2024-05-13 (without in-context example) | 17/43 | 19/43 |

PR run details:

  • claude-3-5-sonnet-20241022: 59 submitted instances, 0 empty patches, 35 resolved, 24 unresolved, 0 errors
  • deepseek-chat: 59 submitted instances, 14 empty patches, 8 resolved, 51 unresolved, 0 errors
  • gpt-4o-2024-05-13: 43 submitted instances, 5 empty patches, 17 resolved, 26 unresolved, 0 errors

Some issues I noticed when looking into the trajectories for gpt-4o:

  • A few instances got stuck because the agent couldn't include a longer context in old_str to ensure uniqueness. Some overcame that, but then got stuck trying to fix wrong indentation in str_replace calls whose new_str spans multiple lines.
  • Many instances hit a runtime error, e.g. Unknown tool call: create/view..., which caused the whole trajectory to crash. This can be observed in both the baseline and the PR. I'm not sure if we should do some retries here or catch the error and inform the LLM (see the sketch after this list).
  • The other instances' trajectories look normal to me, and I can't spot any obvious issues with them.
    One thing I noticed is that 9 of the 19 resolved baseline instances DID encounter runtime errors during execution, which IMO contributes somewhat to the uncertainty and variation in the results.
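
A minimal sketch of the "catch the error and inform the LLM" option from the second bullet above, assuming a hypothetical tool-dispatch layer; ToolRegistry and ErrorObservation are illustrative names, not the actual OpenHands classes:

```python
# Illustrative sketch only: ToolRegistry and ErrorObservation are hypothetical
# stand-ins, not the actual OpenHands API.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ErrorObservation:
    """Returned to the LLM instead of raising, so the trajectory keeps going."""
    content: str


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def dispatch(self, name: str, **kwargs: Any) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            # Instead of crashing the trajectory on "Unknown tool call",
            # return an error observation the LLM can react to.
            known = ", ".join(sorted(self._tools)) or "(none registered)"
            return ErrorObservation(
                f"Unknown tool call: {name}. Available tools: {known}. "
                "Please retry with one of the available tools."
            )
        return tool(**kwargs)
```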

@enyst (Collaborator) commented Nov 22, 2024

Just to clarify, do you mean 17/43 vs. 19/43? It's always worth looking into what happens, just in case, but the usual differences in what the LLM "decides" to do are much higher than this.

@ryanhoangt (Contributor, Author) commented Nov 22, 2024

but the usual differences in what the LLM "decides" to do are much higher than this.

Yeah I agree. Although it's not desirable, sometimes just a small change in the action/output of a single step can lead to a completely different ending for the same instance.

Just to clarify, do you mean 17/43 vs. 19/43? It's always worth looking into what happens, just in case

I think the simple approach is maybe just to compare resolve rates. If the PR's is worse than the baseline's, then we should look into what happened. Can you clarify your suggestion a bit here?

@enyst (Collaborator) commented Nov 22, 2024

Just to clarify, do you mean 17/43 vs. 19/43? It's always worth looking into what happens, just in case

I think the simple approach is maybe just to compare resolve rates. If the PR's is worse than the baseline's, then we should look into what happened. Can you clarify your suggestion a bit here?

Oh definitely. I would make a script to tell me what succeeded in one and failed in the other, both ways (it can happen), then look at the actual events for anything suspicious.

I'm not suggesting anything different; I'm saying that 2/43 is within the range of possible "normal" different outcomes. It's worth looking into, but it's not obvious it means "degradation". Likewise, deepseek-chat at 8/59 vs. 7/59 is good to see, but it might just be random and not a "real" improvement.
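
A minimal sketch of the comparison script described above, assuming each run produces a JSON report that can be read as a list of {"instance_id": ..., "resolved": ...} records; the file layout and keys here are assumptions, not the actual swe-bench report schema:

```python
# Sketch of a run-comparison script: find instances resolved in one run but
# not the other, in both directions. The report format is an assumption.
import json
import sys


def load_resolved(path: str) -> dict[str, bool]:
    with open(path) as f:
        records = json.load(f)
    return {r["instance_id"]: bool(r["resolved"]) for r in records}


def main(baseline_path: str, pr_path: str) -> None:
    baseline = load_resolved(baseline_path)
    pr = load_resolved(pr_path)
    common = baseline.keys() & pr.keys()

    only_baseline = sorted(i for i in common if baseline[i] and not pr[i])
    only_pr = sorted(i for i in common if pr[i] and not baseline[i])

    print(f"Resolved in baseline but not in PR ({len(only_baseline)}):")
    print("\n".join(only_baseline) or "  (none)")
    print(f"Resolved in PR but not in baseline ({len(only_pr)}):")
    print("\n".join(only_pr) or "  (none)")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```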

@enyst (Collaborator) commented Nov 22, 2024

Just to note on this:

Many instances hit a runtime error, e.g. Unknown tool call: create/view..., which caused the whole trajectory to crash. This can be observed in both the baseline and the PR. I'm not sure if we should do some retries here or catch the error and inform the LLM.

This sounds like something we could catch and inform the LLM about. If we can do that, it would also be useful outside of evals, in regular use.

@xingyaoww (Collaborator) commented Nov 22, 2024

Many instances hit a runtime error, e.g. Unknown tool call: create/view..., which caused the whole trajectory to crash. This can be observed in both the baseline and the PR. I'm not sure if we should do some retries here or catch the error and inform the LLM.

This sounds like something we could catch and inform the LLM about. If we can do that, it would also be useful outside of evals, in regular use.

I believe this should be fixed in #5113?

@enyst (Collaborator) commented Nov 22, 2024

I believe this should be fixed in #5113?

Oh thanks! I was thinking I'd seen something, but I couldn't remember where in the code. Maybe the run used a branch based on a main older than four days.

@ryanhoangt (Contributor, Author) commented Nov 22, 2024

Yeah, it seems like my PR didn't include that change, unfortunately. Also thanks @enyst for the comment, that makes sense. We may be able to tell more confidently with more instances run, but it's not very economical 🥹

@ryanhoangt (Contributor, Author) commented Dec 5, 2024

Took the chance to run a full eval on swe-bench-lite for claude -- fortunately we got performance comparable with the baseline v2.2 (130/300) and v2.1 (125/300) on the leaderboard. At least it's not degrading performance much, so maybe we can continue improving upon this with All-Hands-AI/openhands-aci#19:

05:00:07 - openhands:INFO: eval_infer.py:418 - # resolved: 127 / 300. (42.33%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # failed_apply_patch: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # error_eval: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # empty_generation: 3 / 300. (1.00%)

There are only a bit over 20 instances (~7%) where the agent called either of the two new navigation commands, so it makes sense that the result is not very different. I'm going to look more closely at the results and post a more detailed comparison here.
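
A rough sketch of how that navigation-command usage could be counted, assuming one JSONL trajectory file per instance with a "tool_name" field on each event; both the layout and the field name are assumptions for illustration:

```python
# Sketch for counting instances that invoked the new navigation commands.
# The trajectory layout (one .jsonl of events per instance, each event with a
# "tool_name" field) is assumed for illustration only.
import json
from collections import Counter
from pathlib import Path

NAV_COMMANDS = {"jump_to_definition", "find_references"}


def nav_command_usage(traj_dir: str) -> tuple[set[str], Counter]:
    """Return instance ids that used a navigation command, plus per-command counts."""
    instances: set[str] = set()
    counts: Counter = Counter()
    for path in Path(traj_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            name = event.get("tool_name")
            if name in NAV_COMMANDS:
                instances.add(path.stem)  # file name doubles as the instance id
                counts[name] += 1
    return instances, counts


if __name__ == "__main__":
    used, counts = nav_command_usage("trajectories/")
    print(f"{len(used)} instances called a navigation command; counts: {dict(counts)}")
```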

@ryanhoangt marked this pull request as ready for review December 5, 2024 06:10
@ryanhoangt (Contributor, Author)

Took a look at the results; I can't find anything significant/interesting for now, possibly due to the small difference in the results. Some plots:

  • Comparing v2.1, v2.2 and the PR: the difference between v2.1 and v2.2 seems pretty weird; I don't think we have many changes between the two versions.
    [plots: v2.1 vs. v2.2, PR vs. v2.1, PR vs. v2.2]
  • Regarding instances where the two commands are called: there are 21 instances where either command is called.
    [plots: # instances with jump_to_definition vs. find_references called; # resolved instances in the 21-instance subset, PR vs. v2.2]
  • I thought about comparing resolved instances per difficulty as well, but 69/300 Lite instances are without gold annotations and many resolved instances fall into this subset, so the plots are not very interpretable.
    [plot: diff-level-lite]

@mamoodi (Collaborator) commented Dec 23, 2024

Going to assume this is still in progress?

@ryanhoangt (Contributor, Author)

Yes, I'm working on a refactor and will circle back to this PR soon!

@ryanhoangt (Contributor, Author) commented Jan 14, 2025

I've been running evals and the results are not improving much. Given we have other work with higher priority (e.g. model routing), I'll close this PR for now and circle back to it later.

@ryanhoangt closed this Jan 14, 2025