[Experiment] Add symbol navigation commands into the editor #5092
This PR now looks good to me! Can you share some detailed evaluation results there before we merge this?
@@ -63,7 +63,7 @@ opentelemetry-exporter-otlp-proto-grpc = "1.25.0"
 modal = "^0.64.145"
 runloop-api-client = "0.7.0"
 pygithub = "^2.5.0"
-openhands-aci = "^0.1.1"
+openhands-aci = {git = "https://github.com/All-Hands-AI/openhands-aci.git", rev = "ht/jump-commands"}
Don't forget to update this when we merge jump-commands in the ACI repo.
Once this is updated & CI is fixed, I'd be happy to approve this PR.
Eval results for the PR on a subset of swe-bench-lite:
Some issues when I looked into the trajectories for
Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case, but the usual differences in what the LLM "decides" to do are much higher than this.
Yeah I agree. Although it's not desirable, sometimes just a small change in the action/output of a single step can lead to a completely different ending for the same instance.
I think the simple approach is maybe to just compare the resolve rate. If it's worse than the baseline, then we should look into what happened. Can you clarify your suggestion a bit here?
Oh definitely. I would make a script to tell me what succeeded in one and failed in the other, both ways (it can happen), then look at the actual events for anything suspicious. I'm not suggesting anything different, I'm saying that 2/43 is within the range of possible "normal" different outcomes. It's worth looking into, but it's not obvious it means "degradation". Same as for deepseek-chat: 8/43 vs 7/43 is good to see, but it might be just random and not a "real" improvement.
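A minimal sketch of the kind of comparison script described above. It assumes each run has already been reduced to a mapping from instance id to a resolved flag; that data shape, the function name `compare_runs`, and the instance ids are hypothetical and not taken from the eval harness output.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """List the instances whose outcome differs between two eval runs, both ways."""
    # Only compare instances that were evaluated in both runs.
    shared = baseline.keys() & candidate.keys()
    return {
        # Resolved by the baseline but not by the candidate (possible regressions).
        "only_baseline": sorted(i for i in shared if baseline[i] and not candidate[i]),
        # Resolved by the candidate but not by the baseline (possible improvements).
        "only_candidate": sorted(i for i in shared if candidate[i] and not baseline[i]),
    }


if __name__ == "__main__":
    # Hypothetical instance ids, for illustration only.
    baseline = {"inst-a": True, "inst-b": True, "inst-c": False}
    candidate = {"inst-a": True, "inst-b": False, "inst-c": True}
    for label, ids in compare_runs(baseline, candidate).items():
        print(label, ids)
```

The two lists are the instances whose trajectories are worth reading by hand, since a net change of 2 can hide several flips in each direction.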
Just to note on this:
This sounds like something we could catch and inform the LLM about. If we can do that, it would also be useful outside evals, in regular use.
I believe this should be fixed in #5113?
Oh thanks! I was thinking I saw something, but I couldn't remember where in the code. Maybe the run was with a branch set to an older
Yeah, seems like my PR didn't include that change unfortunately. Also thanks @enyst for the comment, that makes sense. We may be able to tell more confidently with more instances run, but it's not very economical 🥹
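To put a rough number on "within the range of normal outcomes", here is a sketch that expresses the 17/43 vs 19/43 gap in units of standard error. It treats the two runs as independent binomial samples, which is a simplifying assumption (the runs share the same instances, so a paired analysis would be more precise), and the function names are made up for illustration.

```python
import math


def resolve_rate_stderr(resolved: int, total: int) -> float:
    """Standard error of a resolve rate, under a binomial approximation."""
    p = resolved / total
    return math.sqrt(p * (1 - p) / total)


def gap_in_stderr_units(resolved_a: int, resolved_b: int, total: int) -> float:
    """Gap between two resolve rates, divided by the standard error of the gap.

    Assumes the two runs are independent, which is only an approximation here.
    """
    se_gap = math.sqrt(
        resolve_rate_stderr(resolved_a, total) ** 2
        + resolve_rate_stderr(resolved_b, total) ** 2
    )
    return abs(resolved_a - resolved_b) / total / se_gap


if __name__ == "__main__":
    # The counts discussed in this thread: 17/43 vs 19/43.
    print(round(gap_in_stderr_units(17, 19, 43), 2))  # roughly 0.44
```

A gap well under one standard error is consistent with run-to-run noise. Since the standard error shrinks with the square root of the number of instances, a much larger run is needed before a gap of this size becomes distinguishable.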
Took the chance to run a full eval on
There are only a little over 20 instances where the agent called either of the 2 new navigation commands (~7%), so it makes sense that the result is not very different. I'm going to look more closely at the result and post a more detailed comparison here.
Going to assume this is still in progress?
Yes, I'm working on a refactor and will circle back to this PR soon!
I ran the eval and the result is not improving much. Given that we have some other work with higher priority (e.g. model routing), I'll close this PR for now and circle back to it later.
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR is to:
- Add the jump_to_definition and find_references navigation commands (Add navigation commands into the editor openhands-aci#5)
Link of any specific issues this addresses