
[Bug]: Failing to run OpenRouter AND Ollama #5310

Open
BradKML opened this issue Nov 28, 2024 · 26 comments
Labels
bug Something isn't working

Comments

@BradKML

BradKML commented Nov 28, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

Note: both setups use Qwen2.5 Coder, and all of the runs failed halfway through

OpenHands Installation

Docker command in README

OpenHands Version

Latest Docker image with 0.14

Operating System

WSL on Windows

Logs, Errors, Screenshots, and Additional Context

No response

@BradKML BradKML added the bug Something isn't working label Nov 28, 2024
@ryanhoangt
Contributor

ryanhoangt commented Nov 28, 2024

The issue in trajectory (1) is from a bug in the aci -- I made a fix for it here. Not too sure what happened with the other two trajectories.

SmartManoj added a commit to SmartManoj/openhands-aci that referenced this issue Nov 28, 2024
…ded for `create` command

* Update `OHEditor.__call__` to raise `EditorToolParameterMissingError` with appropriate message
* Add test case to verify the error is raised when `file_text` is missing for `create` command

fixes All-Hands-AI/OpenHands#5310
@BradKML
Author

BradKML commented Nov 28, 2024

@ryanhoangt how can I solve the other two then? What other info should I give regarding the matter to help debug this?

@ryanhoangt
Contributor

ryanhoangt commented Nov 28, 2024

Can you check the logs in terminal to see what errors happened that caused the state to change? You can also set export DEBUG=1 to have more details visible.
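For the Docker setup, the same variable can be passed on the `docker run` line with `-e DEBUG=1` (as in a later comment); a minimal sketch for a shell session:

```shell
# Minimal sketch: enable OpenHands debug logging for the current shell.
# With the README's Docker command, pass the same variable as `-e DEBUG=1`.
export DEBUG=1
echo "DEBUG is set to: $DEBUG"
```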

@mamoodi
Collaborator

mamoodi commented Nov 28, 2024

For my own understanding are we expecting Qwen2.5 coder to be able to work well enough with OpenHands?

@ryanhoangt
Contributor

ryanhoangt commented Nov 28, 2024

Probably not; from the discussion here, it seems like it's not very good at tool use. I also saw that in the trajectories @BradKML posted recently. Would be nice if we had the SWE-bench score for it, though.

@BradKML I think they also mentioned in the discussion that there were some issues with OpenRouter; could that be the cause of the errors in the two later trajectories? Can you try another provider besides Ollama, just to see if it works?

@BradKML
Author

BradKML commented Nov 29, 2024

@ryanhoangt so currently Ollama tool use is meh and has a few pull requests in flight. Makes sense.
But the two later trajectories ARE exclusively from OpenRouter, so I am not sure how that would be an issue.
@mamoodi tool use should be important here (not just programming ability)

Extra note: okay now with a proper instruct model, it pretends that it did something when it did not? https://www.all-hands.dev/share?share_id=c151515f5d38a08393bb2cbf1a9cd1f207fe0dd1c832f238f0ec8ab1e0e15fc6 https://www.all-hands.dev/share?share_id=ec00360072ed24855106caba8749df79fabf93da289edfc76f56a2ec939d7ce3

Addendum: Using OpenRouter with https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct seems to lead to another error that jams over to another session? https://www.all-hands.dev/share?share_id=e5f726e521af742492f8cd9153a50e61448c7b044d520650bdbd2865a23d6ccc https://www.all-hands.dev/share?share_id=43cea05fd76eb9a3ab358de70b0413157d10489fd8e098cf5fe80f7abb8651b2

@BradKML
Author

BradKML commented Nov 29, 2024

Switching to Llama3 70B through OpenRouter using the non-advanced settings panels yields the same set of issues https://www.all-hands.dev/share?share_id=cc3af61bf54669abce07879e0bc810bce8bdb815735d0766b2776beb872afc0c

Trying another Ollama model to see how it goes: phi3:3.8b-mini-128k-instruct-q4_K_M (not gonna lie, even when the conversation flows, it does not use the tools and instead flies off the rails) https://www.all-hands.dev/share?share_id=ffa8469051ba8178ae33c38a0ec7cbd811b45b408cfbbdc78d57425d0a0bb799 https://www.all-hands.dev/share?share_id=e042acc5afb5c99e069d369c5f5c9317a2a8a69470c0522b9ec0f50eafedba06

Ollama with phi3:14b-medium-128k-instruct-q4_K_M to see if it is an issue with model size, but nope: apparently Phi3 is really bad at anything beyond code editing and gives back even more broken results https://www.all-hands.dev/share?share_id=abf824759491cb5367461cb8beaffe6269933a2b9db0bf2e82ace01aea25b17e

Ollama with OpenCoder to see if another SOTA SLM can do the job; welp, it managed to mess up file creation and got stuck in a loop (it would probably be useful to add guardrails on long-form outputs) https://www.all-hands.dev/share?share_id=33f618f4bc251656660f93b809a27c190dd14deeacb8b60725a768e770dbe8ee

@BradKML
Author

BradKML commented Nov 29, 2024

OpenRouter with Gemma 2 27B (since it is not on the core dashboard) to see if switching models leads to extra issues when running the service... #4920 with litellm.BadRequestError: OpenrouterException https://www.all-hands.dev/share?share_id=5b034f25b6b9ed286de9d78fd47634db7b6220aa378ccb480c720ca62498c7ac

BTW this has always been with CodeAct Agent as the default, on a WSL/Docker installation. But to @mamoodi, it would be better to pursue two goals simultaneously to improve things:

  • Create sufficient guardrails (e.g. prompt engineering, LLM output checks) such that most LLMs misinterpret less and use the tools properly (assuming that most are capable)
  • Collect statistics on tool-use failures in general, so that problematic LLMs can be flagged and meta-patterns in LLM design can be discovered

Reference for the future #496

@BradKML
Author

BradKML commented Dec 5, 2024

Testing 0.15 with Ollama instead, and it is not writing files

Examining Phi-3, something similar happens (I expected it to write bad code, but not being able to write files at all is worse)

docker run -it --pull=always -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.15-nikolaik -e LOG_ALL_EVENTS=true -v /var/run/docker.sock:/var/run/docker.sock -e DEBUG=1 -p 3000:3000 --add-host host.docker.internal:host-gateway --name openhands-app docker.all-hands.dev/all-hands-ai/openhands:0.15

BTW, another OpenRouter issue that might be related: #3435

P.S. another Agentic loop for OpenRouter through the default settings panel https://www.all-hands.dev/share?share_id=27f4d1c00ac45c6026e54116bc294c7dcb86227941d64fa4d1e284d95c8dd3c0

@ryanhoangt if I use DEBUG=1 does it show up in these online links? Or do I really need to redo these in the terminal?

@BradKML
Author

BradKML commented Dec 5, 2024

For reference @ryanhoangt, it seems the OpenRouter setup with CodeActAgent as default will force itself to use Git files that don't exist. openrouter_fails.txt
@SmartManoj have fun with this log btw

Will test Ollama and dump its logs tomorrow to see why there are discrepancies between two providers of the same model

@BradKML
Author

BradKML commented Dec 5, 2024

Ran OpenRouter a second time, but (a) there is clearly no .gitignore, and (b) it uses the default Docker command without specified folders; maybe there is a permissions issue that needs documentation https://pastebin.com/SvBJ8X4c
Addendum: did another set of tests; there is definitely a state-changing bug Openrouter_superbug.txt another_round.txt

Addendum 2: it seems that once the error is hit, it is stuck, and then it automatically resumes in the command line? how_can_this.txt why_is_it_restarting.txt

@enyst
Collaborator

enyst commented Dec 7, 2024

@BradKML Thank you for the detailed reports and logs! Let me quickly note a couple of things on the latest:

Re:

Ran OpenRouter a second time but there is (a) clearly no .gitignore (b) uses the default Docker command without specified folders and maybe there is a permissions issues that needs documentation https://pastebin.com/SvBJ8X4c

If you are referring to logging like this:

**ErrorObservation**
2024-12-05 23:25:52     |File not found: /workspace/.gitignore. Your current working directory is /workspace.

You can actually ignore this. There is indeed no .gitignore; frankly, the spam-log is a bug. It doesn't affect anything, but it's annoying. It's not the agent nor the LLM, and it doesn't affect the actions; it's the execution environment wasting time.

In the log, it looks like the LLM installed some packages whose versions don't match, and when it hit an exception, it started editing the site-packages files.

2024-12-06 07:07:15     |python3 /workspace/backtester_sma.py
...
2024-12-06 07:07:15     |22:51:32 - openhands:DEBUG: action_execution_server.py:169 - Action output:
2024-12-06 07:07:15     |**CmdOutputObservation (source=None, exit code=1)**
2024-12-06 07:07:15     |python3 /workspace/backtester_sma.py
2024-12-06 07:07:15     |Traceback (most recent call last):
2024-12-06 07:07:15     |  File "/workspace/backtester_sma.py", line 3, in <module>
2024-12-06 07:07:15     |    import pandas_ta as ta
2024-12-06 07:07:15     |  File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/pandas_ta/__init__.py", line 116, in <module>
2024-12-06 07:07:15     |    from pandas_ta.core import *
2024-12-06 07:07:15     |  File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/pandas_ta/core.py", line 18, in <module>
2024-12-06 07:07:15     |    from pandas_ta.momentum import *
2024-12-06 07:07:15     |  File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/pandas_ta/momentum/__init__.py", line 34, in <module>
2024-12-06 07:07:15     |    from .squeeze_pro import squeeze_pro
2024-12-06 07:07:15     |  File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/pandas_ta/momentum/squeeze_pro.py", line 2, in <module>
2024-12-06 07:07:15     |    from numpy import NaN as npNaN
2024-12-06 07:07:15     |ImportError: cannot import name 'NaN' from 'numpy' (/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/numpy/__init__.py). Did you mean: 'nan'?
...
2024-12-06 07:07:15     |**CmdRunAction (source=EventSource.AGENT)**
2024-12-06 07:07:15     |THOUGHT: It looks like there's an issue with the import statement in `pandas_ta`. The correct import for `NaN` from `numpy` should be `nan` (lowercase). Let's fix this by editing the `squeeze_pro.py` file in `pandas_ta`.
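(For context: that traceback is the known incompatibility where NumPy 2.x removed the uppercase `NaN` alias that older pandas_ta releases import. A minimal illustration, assuming NumPy is installed; the sensible fix is pinning `numpy<2` or upgrading pandas_ta, not editing site-packages as the agent attempted:)

```python
import numpy as np

# NumPy 2.0 removed the uppercase alias, so `from numpy import NaN`
# raises ImportError in newer environments; only lowercase `nan` exists.
nan_value = np.nan  # was `np.NaN` under NumPy 1.x

print(np.isnan(nan_value))  # → True
```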

Re:

Addendum: did another set of test, there is definitely a state changing bug Openrouter_superbug.txt another_round.txt

I'm not sure what you mean by a state-changing bug? At first sight, what I see in the log is that the LLM didn't include the appropriate indentation when asking for edits, so the edits weren't performed.

@enyst
Collaborator

enyst commented Dec 7, 2024

Re: function calling

Just to clarify: we introduced function calling relatively recently (a month or two ago), in the sense of using the actual "tool use" / "function call" APIs offered by some providers. Only a few models (and their providers) worked very well with it, and it is enabled only for those.

A number of models tested clearly worked better without this; for them, we fall back to defining a sort of function calling via prompting, like we always did prior to this feature (just tell the model what and how to answer if it wants an action done).

More importantly: please see

for details and data on what worked in tests.

A lot of models are not capable of following instructions well enough, or, when they sort of do, they get confused at some point by their own multi-step history and start trying to ask for actions that don't exist, etc.

@enyst
Collaborator

enyst commented Dec 7, 2024

Re:

BTW another issue with OpenRouter that might be related to the issue #3435

That issue was fixed a while ago. I updated the issue to make it clear it was actually fixed. Sorry for the confusion.

Re:

why are agents stuck on loop when the errors are unique each time? #5355

Sorry, can you elaborate?

@enyst
Collaborator

enyst commented Dec 7, 2024

Re: Docker. I'm not sure, but if you suspect there's an issue with Docker (workspace, permissions, etc.), please open another issue for that so we can look into it on its own. It's too different to deal with here.

Re: stuck in loops.
So far, I'm looking at the logs you posted, and I see

  • some loops are caught by us - we give an error about being stuck in a loop
  • some logs that look like loops to the human eye are not caught. I'll look deeper into this kind of thing. I'm guessing the code that looks for patterns in edits wasn't updated after some very significant editing changes (introducing edits via the "tool use" APIs mentioned above), so it might be missing things it should catch.
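As a sketch of the kind of pattern check being described (hypothetical and much simpler than the real detector):

```python
# Hypothetical, much-simplified sketch of stuck-loop detection: flag the
# agent when its last few actions are identical. OpenHands' actual
# detector checks richer patterns (action/observation pairs, near-
# duplicate edits, etc.); this only illustrates the idea.
def is_stuck(action_history: list[str], window: int = 3) -> bool:
    """True if the last `window` actions are all the same."""
    if len(action_history) < window:
        return False
    return len(set(action_history[-window:])) == 1

print(is_stuck(["edit foo.py", "edit foo.py", "edit foo.py"]))  # → True
print(is_stuck(["edit foo.py", "run tests", "edit foo.py"]))    # → False
```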

@enyst
Collaborator

enyst commented Dec 7, 2024

Re:

@ryanhoangt if I use DEBUG=1 does it show up in these online links? Or do I really need to redo these in the terminal?

It doesn't show up in the trajectories shared via Share feedback. DEBUG=1 enables debug logging in the console and the log file, but I think Share feedback sends only the actual events, not the logs.

@BradKML
Author

BradKML commented Dec 7, 2024

@enyst thanks for all the replies

  • For now, Qwen 2.5 Coder Instruct is what I will use as the core FOSS alternative, since DeepSeek 2.5 and Llama 3.1 are harder to run locally (and I am very pessimistic about QwQ and DeepSeek R1)
  • Dumping Ollama errors in the near future might be necessary, because the tool-use issue is weird when OpenRouter works just fine
  • Will also use the console log file from Docker for now to skip the shallow dump on the website; I hope things can be resolved in easier ways
  • Regarding the loop, can't we just raise the temperature a little, or adjust Top-P/Top-K/Min-P sampling, so that it is willing to switch tactics?
  • Other bugs like LiteLLM OpenRouter exceptions (it will legitimately stop in the middle, maybe due to some weird throttling) are also a bit maddening

@enyst
Collaborator

enyst commented Dec 14, 2024

Regarding the loop, can't we just raise the temperature a little or adjust Top-P/Top-K/Min-P sampling such that it is willing to switch tactics?

That's one option, I think. If you want to try, a PR is most welcome. I assume we will evaluate the approach.

@BradKML
Author

BradKML commented Dec 16, 2024

@enyst there are a few possible solutions around this

Borrowed advice from "guides" (weirdly enough, their recommended Top-P and temperature are correlated, which is suspicious) https://promptengineering.org/prompt-engineering-with-temperature-and-top-p https://dropchat.co/blog/controlling-gpt-4s-creative-thermostat-understanding-temperature-and-top_p-sampling https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683

  • Boilerplates and docstrings: Top P and Temperature set to 0.1-0.2
  • New code and alternative algorithms: Top P and Temperature set to 0.3-0.4
  • Experimentation and dialogue: Top P and Temperature set to 0.4-0.6

Some underlying observations and recommendations:

  • Default: bumping temperature above 0 is probably good for breaking loops (since Top-P does not need to be exactly 1), which is doable here https://github.com/All-Hands-AI/OpenHands/blob/main/config.template.toml
  • PR-worthy: gently cranking up temperature is likely beneficial when classical examples cannot help; if the system is stuck in a loop N times (even with chat logs), raise the temperature by, say, 0.05 or 0.1, let the system break out of the loop creatively, and return the temperature to 0 once the problem is resolved
  • Complex: creating a new sampling method requires changes in LiteLLM, so better to start by tweaking params instead
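The "PR Worthy" escalation idea above could be sketched roughly like this (a hypothetical helper for illustration; OpenHands does not currently do this, and the class and method names are invented):

```python
# Hypothetical sketch of temperature escalation: start deterministic
# (temperature 0), nudge the temperature up each time a loop is
# detected, and reset once the agent makes progress again.
class AdaptiveTemperature:
    def __init__(self, base: float = 0.0, step: float = 0.05, cap: float = 0.6):
        self.base = base    # deterministic default
        self.step = step    # increment per detected loop
        self.cap = cap      # never exceed this, to avoid incoherent output
        self.current = base

    def on_loop_detected(self) -> float:
        """Raise temperature a notch so sampling can break the pattern."""
        self.current = min(self.current + self.step, self.cap)
        return self.current

    def on_progress(self) -> float:
        """Return to deterministic sampling once the loop is broken."""
        self.current = self.base
        return self.current

t = AdaptiveTemperature()
print(t.on_loop_detected())  # → 0.05
print(t.on_progress())       # → 0.0
```

The returned value would be passed as the `temperature` parameter on each LLM call.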

P.S. Regarding self-repetition in conversation, DRY is seen as an alternative to classical penalty systems. XTC (exclude top choice) also looks interesting, but both are slightly more forced.

Also need to thank @SmartManoj for this SmartManoj#134 (comment)

@SmartManoj
Contributor

The trajectory is 21 MB, so the browser couldn't handle it. The following script will save it to a file.

import requests
import json

# The feedback_id identifies the shared trajectory on the share backend
json_data = {
    'feedback_id': '4dbb93310608f43026c9843cf184ce93240b31565f6eb913fbcda369d43ec639',
}

# Fetch the full 21 MB trajectory JSON directly, bypassing the browser
response = requests.post(
    'https://show-od-trajectory-3u9bw9tx.uc.gateway.dev/show-od-trajectory',
    json=json_data,
)

# Save it to disk for offline inspection
with open('response.json', 'w') as f:
    json.dump(response.json(), f)

@BradKML
Author

BradKML commented Dec 23, 2024

@SmartManoj does it look "off" in any way? Like, are the OpenRouter connection errors still present?

@SmartManoj
Contributor

SmartManoj commented Dec 23, 2024

[screenshot]
The error was not included in this trajectory.


Could you share the error message?

(Or) could you add these lines and share again, if there is no sensitive data in the error message?

@BradKML
Author

BradKML commented Jan 11, 2025

Sorry, just putting up with this other error @SmartManoj #6056

@BradKML
Author

BradKML commented Jan 29, 2025

I wanted to report this update:

  1. DeepSeek proper is a lifesaver when it comes to loops, BUT its servers on OpenRouter are being bombarded this week
  2. DeepSeek R1 distills based on Llama still lead to the same hard-stuck loops, RIP
  3. Probably need to check back on Qwen 2.5 Coder again sooner or later
