Errors only with mesh #189

Open
VorlonCD opened this issue Nov 27, 2024 · 7 comments

@VorlonCD

Hello!

Merry Thanksgiving??

I have two Windows machines, both with YOLOv5 .NET and YOLOv5 6.2 enabled. If I point to either one individually with mesh disabled, they work all day long without trouble, so I think that rules out cards, memory, and drivers.

As soon as I enable mesh (no matter which machine is the master), I start to get one of two errors, depending on which module processes the request:

YOLOv5 .NET
{"error":"No File supplied for object detection.","inferenceMs":0,"processMs":0,"analysisRoundTripMs":30000,"success":false,"moduleName":"Object Detection (YOLOv5 .NET)","moduleId":"ObjectDetectionYOLOv5Net","command":"detect","requestId":"0d584a1f-c553-407b-a3f8-ba29139a7238","processedBy":"PCNAME","timestampUTC":"Wed, 27 Nov 2024 15:23:33 GMT"}

YOLOv5 6.2
{"success":false,"error":"Error occurred on the server","moduleId":"ObjectDetectionYOLOv5-6.2","moduleName":"Object Detection (YOLOv5 6.2)","code":500,"command":"detect","requestId":"7b64bd0f-39eb-45f0-ac84-9c67ecf49e53","inferenceDevice":"GPU","analysisRoundTripMs":30000,"processedBy":"PCNAME","timestampUTC":"Wed, 27 Nov 2024 14:11:11 GMT"}'

For 6.2 there is also an error in the server log that doesn't appear for .NET. From googling around, I believe it essentially means the same thing as the error above, "file not found":

Response rec'd from Object Detection (YOLOv5 6.2) command 'detect' (...822dc4)

Object Detection (YOLOv5 6.2):  [AttributeError] : Traceback (most recent call last):
  File "C:\Program Files\CodeProject\AI\modules\ObjectDetectionYOLOv5-6.2\detect.py", line 140, in do_detection
    det                  = detector(img, size=640)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\models\common.py", line 689, in forward
    if im.shape[0] < 5:  # image in CHW
AttributeError: 'NoneType' object has no attribute 'shape'

So it seems there may be an issue with one server uploading the file to the other server (both are on the same 10-gig network).
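The traceback shows the model being called with `img` equal to `None`, which is consistent with the image never arriving at the receiving server. A minimal sketch of a guard that would turn the opaque 500 into an explicit error follows. The function name `safe_detect`, the `detector` callable, and the returned dictionary layout are all hypothetical stand-ins for illustration, not the actual `detect.py` code.

```python
from typing import Any, Callable, Dict, Optional


def safe_detect(detector: Callable[..., Any],
                img: Optional[Any],
                size: int = 640) -> Dict[str, Any]:
    """Run the detector only when an image was actually decoded.

    `detector` and the returned dict layout are hypothetical; they only
    mirror the shape of the call seen in the traceback.
    """
    if img is None:
        # Without this guard the model raises
        # AttributeError: 'NoneType' object has no attribute 'shape'.
        return {"success": False,
                "error": "No image received with the request"}
    return {"success": True, "predictions": detector(img, size=size)}
```

With a guard like this, a forwarded request that lost its file would report the missing image directly instead of failing inside the model.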

I'm throwing a lot of 4K, 2 MB files at it, mixed in with some lower-resolution images, so maybe that's a factor in why it doesn't happen on every single call.

I haven't really fleshed this out yet, but I think I need to restart the whole CodeProject service after the error to get the mesh to work correctly again.

What else can we do to troubleshoot this?
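One low-effort troubleshooting step is to collect the raw JSON responses and count the failures per machine and module, to see whether the errors cluster on one mesh node. The sketch below assumes the responses were saved as strings; it relies only on the `success`, `processedBy`, `moduleId`, and `error` fields visible in the two samples above. The function name `tally_failures` is made up for this example.

```python
import json
from collections import Counter
from typing import Iterable


def tally_failures(raw_responses: Iterable[str]) -> Counter:
    """Count failed responses per (processedBy, moduleId, error)."""
    counts: Counter = Counter()
    for raw in raw_responses:
        try:
            body = json.loads(raw)
        except json.JSONDecodeError:
            body = None
        if not isinstance(body, dict):
            # Not a JSON object at all; count it separately.
            counts[("?", "?", "unparseable response")] += 1
            continue
        if not body.get("success", False):
            key = (body.get("processedBy", "?"),
                   body.get("moduleId", "?"),
                   body.get("error", "?"))
            counts[key] += 1
    return counts
```

Calling `tally_failures(saved_responses).most_common()` then lists the worst combinations first.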

My system:

  • CodeProject.AI Server version: 2.9.3
  • OS: Windows 10 [NO FIREWALL, out of the box AV, 10gig wired connection between devices]
  • System RAM 128 GB
  • GPU, machine 1: NVIDIA T1000 (8 GB)
  • GPU, machine 2: NVIDIA GeForce RTX 3080 (12 GiB)
@ChrisMaunder ChrisMaunder added the bug Something isn't working label Nov 27, 2024
@ChrisMaunder ChrisMaunder self-assigned this Nov 27, 2024
@ChrisMaunder
Contributor

Are both servers running the same version of CodeProject.AI?

@VorlonCD
Author

Yes, clean install on both machines. The only setting I changed was setting .NET to Large and 6.2 to Huge. I've never tried without doing that, so I guess it could be a factor.

@ChrisMaunder
Contributor

It should not affect anything.

@ChrisMaunder
Contributor

Any further info on this? Does it only happen with certain images? Does model size change anything? I can't replicate this in my testing.

@ChrisMaunder ChrisMaunder added Can't Replicate Unable to replicate this issue and removed bug Something isn't working labels Dec 5, 2024
@VorlonCD
Author

Hi Chris,

Yesterday I did a bunch of testing and finally had to disable mesh in CPAI. I went back to using multiple servers in AITOOL (which has a sort of "poor man's" mesh/queue that I wrote, and it's been working smoothly). I never had this issue before the few most recent versions, and I usually stay close to the current release.

It always works when sending requests directly to individual machines.

I tried 3 other Linux + Docker machines (all different hardware) with the latest CPAI version (I had to wipe the mounted config folders to get ALL of them to update without errors?). So it's not just an issue with a single type of hardware/machine.

And I had to enable -p 32168:32168/udp in the Docker config to get mesh to work at all, even though I had pointed to the mesh servers via the KnownMeshHostNames setting in serversettings.json on the main server.

It seems that almost always, after the first call to any mesh server, it gives one of the errors above, and then the mesh server goes into a state where (as in another recent mesh issue posted) a test in AI EXPLORER says "No prediction returned" in 0ms. Sometimes if I keep submitting it actually works, but most often I have to restart the service/container. But then a new mesh request will break it right away. To me this feels like an async issue in the code.

Another weird thing I saw is that when mesh was enabled it listed a server at an IP address that doesn't exist! It couldn't connect to it, but the fact that a non-existent IP on my network was listed freaked me out a bit. I proceeded to check the DHCP and network history in my PIHOLE, which acts as DNS and DHCP, but found no reference. My network is 10.0.1.xxx and it listed a 10.0.1.87, which I could not find any trace of in my router logs or PIHOLE logs. And nothing on NMAP.

I DO have a 10.0.1.97, which is in fact a CPAI-enabled mesh machine. So could something in the code be mixing up .97 and .87? I didn't see anything in the mesh code dealing with IP addresses that seemed to be related, though.

A few other possible factors -

  • My machine has VMWare installed (its license is free now!), which installs some extra virtual network adapters that could be messing with things. I've seen a few other ghost/inactive mesh servers that did NOT have an IP listed and that I didn't think should be there. So perhaps the link-local / loopback checks in meshmonitor.cs need tweaking?
  • Also, I always set each module to the highest model size.
  • I also always prioritize IPv4 on my machines, so a ping never returns an IPv6 address (DisabledComponents=0x20 should get you the solution).
  • In the JSON config file I've tried giving it both a list of IPs and a list of host names, but I have ALWAYS had a list there, so it could be a factor.
  • I've never tested with BI, only AITOOL.
  • In the code, it looks like it gets the local IP address by opening a UDP socket to 8.8.8.8 (the Google DNS server). My router has a feature to force all DNS traffic to go through my PIHOLE, but I think it just redirects based on port 53, not the IP. I could be wrong, and it could be screwing with the detection.

Dug around in the code and compared it to a copy I had from April, I think... It would take forever for me to figure out how to actually debug this, since I never really work with web server code...

  • I see ProxyController.cs > ForwardAsync() changed the way it creates/re-creates the HttpRequestMessage. Is that the primary function used to forward requests to mesh servers? Maybe it doesn't always correctly handle the images?
  • In the same file, and in other files too, I see functions changed to async. Maybe worth reviewing for a bug related to that?
  • Maybe a bug in DispatchRemoteRequest?? This line doesn't "await":

responseObject = response!.Content.ReadFromJsonAsync<JsonObject>().Result;

Maybe it should be this... (or might it skip to the next line of code before it finishes reading?)

responseObject = await response!.Content.ReadFromJsonAsync<JsonObject>();
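A note on the two variants: in C#, `.Result` blocks the calling thread until the task completes, so it does not skip ahead to the next line. The usual concern is that it ties up a thread (and can deadlock in some contexts) while other requests wait. The sketch below is a rough Python `asyncio` analogue of that effect, not a translation of the C# code; all names in it are made up for illustration.

```python
import asyncio
import time
from typing import Awaitable, Callable, List

Worker = Callable[[List[str], str, float], Awaitable[None]]


async def read_blocking(log: List[str], name: str, delay: float) -> None:
    # Blocking call: nothing else on the event loop runs until it returns.
    time.sleep(delay)
    log.append(name)


async def read_awaited(log: List[str], name: str, delay: float) -> None:
    # Awaited call: the loop is free to serve other requests meanwhile.
    await asyncio.sleep(delay)
    log.append(name)


async def completion_order(worker: Worker) -> List[str]:
    """Start a slow and a fast request together; return the finish order."""
    log: List[str] = []
    await asyncio.gather(worker(log, "slow", 0.2),
                         worker(log, "fast", 0.01))
    return log
```

With `read_blocking` the fast request has to wait behind the slow one, while with `read_awaited` it finishes first. Both variants still read the full response before continuing.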

My VS AsyncFixer plugin also pointed out a few more async issues that I don't really think are related to the issue at hand...

ModuleRunnerbase.cs > processQueue: Use await CancelAsync: _cancellationTokenSource.Cancel();

Same in StopAsync, use CancelAsync.

PackageDownloader.cs > DownloadFileAsync: Use await File.WriteAllBytesAsync.

And in 3 or more other places you could await the async versions of JsonSerializer.serialize / deserialize.

It would be great if ProxyController.cs and MeshMonitor.cs were littered with TRACE logging to show more detail about when/what was sent/broadcast, and if all HTTP response failure codes were logged in the exceptions. I saw a few missing. I'd love to know more about when/where my ghost mesh servers come from and to make sure I see every possible error that could be missing in mesh communication.

Thanks!

@ChrisMaunder
Contributor

Thanks for the detailed spelunking. I'll review the async calls.

I think adding trace is the best bandaid for now. If server A passes a message to server B, and server B throws a 500, server B should at least log what went wrong locally if it's unable to send back a more detailed error to server A. That would help diagnose faster.

As to the ghost IP, my guess is VMWare. If you have a mesh server sitting inside a VM (even Docker), that throws things off too. Networking is really messy. We tried to make it robust while we had time, but we never got the chance to cover some of the edge cases.
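A minimal sketch of that idea, assuming a hypothetical handler on the receiving server: log the failure locally with the request ID and origin before returning the 500, so the detail survives even if the caller only sees a generic error. The logger name, function name, and response layout are illustrative assumptions, not CodeProject.AI's API.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("mesh.trace")  # hypothetical logger name


def handle_forwarded_request(request_id: str,
                             origin: str,
                             process: Callable[[], Dict[str, Any]]
                             ) -> Dict[str, Any]:
    """Run `process` and log any failure locally before answering."""
    logger.debug("request %s received from %s", request_id, origin)
    try:
        result = process()
    except Exception as exc:
        # logger.exception records the full traceback on this server.
        logger.exception("request %s from %s failed", request_id, origin)
        return {"success": False, "code": 500, "requestId": request_id,
                "error": f"{type(exc).__name__}: {exc}"}
    logger.debug("request %s completed", request_id)
    return result
```

Including the exception type and message in the returned body also gives the originating server something more specific than "Error occurred on the server".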

@pjsgsy

pjsgsy commented Jan 10, 2025

I think I may have the same issue. My 'non-working' server is the MESH one, separate from the main server and originator of the requests. MESH is the only difference in config, really.

#262
