Errors only with mesh #189
Comments
Are both servers running the same version of CodeProject.AI?
Yes, clean install for both machines. The only setting I changed was setting .net to Large and 6.2 to Huge. I've never tried without doing that so I guess it could be a factor.
It should not affect anything.
Any further info on this? Does it only happen with certain images? Does model size change anything? I can't replicate this on my testing.
Hi Chris,
Yesterday I did a bunch of testing and finally had to disable mesh in CPAI. I went back to using multiple servers in AITOOL (which has a sort of "poor man's" mesh/queue that I wrote, and it's been working smoothly).
I never had the issue before the few most recent versions, and I usually stay close to the current one. It always works when sending requests directly to individual machines. I tried 3 other Linux + Docker machines (all different hardware) with the latest CPAI version (I had to wipe the mounted config folders to get ALL of them to update without errors?). So it's not just an issue with a single type of hardware/machine. And I had to enable
It seems that almost always after the first call to any mesh server it gives one of the errors as above, and then the mesh server goes into a state where, like in another recent mesh issue posted, when you test in AI EXPLORER it says "
Another weird thing I saw is that when mesh was enabled it listed a server at an IP address that doesn't exist! It couldn't connect to it, but the fact that a non-existent IP on my network was listed freaked me out a bit. I proceeded to check DHCP and network history in my Pi-hole, which acts as DNS and DHCP, but found no reference. My network is 10.0.1.xxx and it listed a 10.0.1.87, which I could not find any trace of in my router logs or Pi-hole logs. And nothing in nmap. I DO have a 10.0.1.97, which is in fact a CPAI-enabled mesh machine. So could something in the code be mixing up .97 and .87? I didn't see anything in the mesh code related to IP addresses that seemed to be related, though.
A few other possible factors -
Dug around in the code and compared to a copy I had from April I think.... It would take forever for me to figure out how to actually debug since I never really work with web server code...
Maybe should be this... (or it might skip to next line of code before it finishes reading?)
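To illustrate the kind of bug being guessed at here (this is not the CodeProject.AI code, and the snippet proposed above is not shown in the thread), here is a minimal Python asyncio sketch. The names `read_upload`, `without_await` and `with_await` are made up for the example. It only shows the general pattern: if a read is started but not awaited, the next line can run before any data has arrived.

```python
import asyncio


async def read_upload(chunks, sink):
    """Hypothetical stand-in for reading an uploaded file in pieces."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        sink.extend(chunk)


async def without_await(chunks):
    """Starts the read but inspects the buffer before the read has run."""
    sink = bytearray()
    task = asyncio.create_task(read_upload(chunks, sink))
    size_seen = len(sink)  # the task has not been given a chance to run yet
    await task  # only so the example shuts down cleanly
    return size_seen


async def with_await(chunks):
    """Waits for the read to finish before inspecting the buffer."""
    sink = bytearray()
    await read_upload(chunks, sink)
    return len(sink)


if __name__ == "__main__":
    data = [b"abc", b"de"]
    print("without await:", asyncio.run(without_await(data)))
    print("with await:", asyncio.run(with_await(data)))
```

If something similar were happening on the receiving mesh server, the module would see an empty upload, which would at least be consistent with a "No File supplied" style error. That link is a guess, not something confirmed in this thread.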
My VS AsyncFixer plugin also pointed out a few more async issues that I don't really think are related to the issue at hand...
ModuleRunnerbase.cs > processQueue: Use await CancelAsync: Same in StopAsync, use CancelAsync.
PackageDownloader.cs > DownloadFileAsync: Use
And 3 or more other places where you could use the awaited async versions of JsonSerializer.serialize / deserialize.
It would be great if ProxyControler.cs and MeshMonitor.cs were littered with TRACE logging to show more detail about when and what was sent or broadcast. And log all HTTP response failure codes in the exceptions. I saw a few missing. I'd love to know more about when and where my ghost mesh servers come from, and to make sure I see every possible error that could be going missing in mesh communication.
Thanks!
Thanks for the detailed spelunking. I'll review the async calls.
I think adding trace is the best band-aid for now. If server A passes a message to server B, and server B throws a 500, server B should at least log what went wrong locally if it's unable to send a more detailed error back to server A. That would help diagnose faster.
As to the ghost IP, my guess is VMware. If you have a mesh server sitting inside a VM (even Docker), then that throws things off too. Networking is really messy. We tried to make it robust while we had time, but we never got the chance to cover some of the edge cases.
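A rough sketch of the "log locally before returning a 500" idea, written in Python for brevity rather than taken from the server's C# code. The function name `handle_forwarded_request`, the logger name, and the handler signature are all hypothetical; only the shape of the error response mirrors the JSON posted in this issue.

```python
import logging

# Hypothetical logger name; the real server has its own logging setup.
logger = logging.getLogger("mesh.server_b")


def handle_forwarded_request(handler, request_id, payload):
    """Run a request forwarded by another mesh server.

    On failure, write the full exception to the local log first, then
    return the same generic 500 body the caller would have seen anyway.
    """
    try:
        result = handler(payload)  # handler is assumed to return a dict
        return {"success": True, "requestId": request_id, **result}
    except Exception:
        # logger.exception records the message plus the traceback locally,
        # so the detail is not lost even if the caller only gets a 500.
        logger.exception("Forwarded request %s failed", request_id)
        return {
            "success": False,
            "code": 500,
            "error": "Error occurred on the server",
            "requestId": request_id,
        }
```

With something along these lines, the machine that actually failed would keep the traceback even when the originating server only receives the generic message.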
I think I may have the same issue. My 'non-working' server is the MESH one, separate from the main server and originator of the requests. MESH is the only difference in config, really.
Hello!
Merry Thanksgiving??
I have 2 Windows machines, both with YOLOv5 .NET and YOLOv5 6.2 enabled. If I point to either individually without mesh enabled, they work all day long without trouble. So I think that rules out cards, memory, and drivers.
As soon as I enable mesh (no matter which machine is the master) I start to get 1 of 2 errors depending on which module processes the request:
YOLOv5 .NET
{"error":"No File supplied for object detection.","inferenceMs":0,"processMs":0,"analysisRoundTripMs":30000,"success":false,"moduleName":"Object Detection (YOLOv5 .NET)","moduleId":"ObjectDetectionYOLOv5Net","command":"detect","requestId":"0d584a1f-c553-407b-a3f8-ba29139a7238","processedBy":"PCNAME","timestampUTC":"Wed, 27 Nov 2024 15:23:33 GMT"}
YOLOv5 6.2
{"success":false,"error":"Error occurred on the server","moduleId":"ObjectDetectionYOLOv5-6.2","moduleName":"Object Detection (YOLOv5 6.2)","code":500,"command":"detect","requestId":"7b64bd0f-39eb-45f0-ac84-9c67ecf49e53","inferenceDevice":"GPU","analysisRoundTripMs":30000,"processedBy":"PCNAME","timestampUTC":"Wed, 27 Nov 2024 14:11:11 GMT"}
For 6.2 there is also an error in the server log that doesn't appear for .NET. From googling around, I believe it essentially means the same thing as above, "File Not found":
So it seems like there may be an issue with one server uploading the file to the other server (both are on the same 10-gig network).
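For anyone comparing failures across modules, here is a small Python sketch that pulls the interesting fields out of the two responses above. The helper name `summarize_response` and the threshold constant are made up for this example. Treating a 30000 ms round trip as a sign of a 30-second timeout is only an assumption based on both responses reporting exactly that value.

```python
import json

# Assumption: both failures report analysisRoundTripMs of exactly 30000,
# which may point at a 30 s timeout somewhere. This is not confirmed.
SUSPECTED_TIMEOUT_MS = 30000

# The two responses from this issue, copied as posted.
NET_ERROR = '{"error":"No File supplied for object detection.","inferenceMs":0,"processMs":0,"analysisRoundTripMs":30000,"success":false,"moduleName":"Object Detection (YOLOv5 .NET)","moduleId":"ObjectDetectionYOLOv5Net","command":"detect","requestId":"0d584a1f-c553-407b-a3f8-ba29139a7238","processedBy":"PCNAME","timestampUTC":"Wed, 27 Nov 2024 15:23:33 GMT"}'
V62_ERROR = '{"success":false,"error":"Error occurred on the server","moduleId":"ObjectDetectionYOLOv5-6.2","moduleName":"Object Detection (YOLOv5 6.2)","code":500,"command":"detect","requestId":"7b64bd0f-39eb-45f0-ac84-9c67ecf49e53","inferenceDevice":"GPU","analysisRoundTripMs":30000,"processedBy":"PCNAME","timestampUTC":"Wed, 27 Nov 2024 14:11:11 GMT"}'


def summarize_response(raw):
    """Reduce a raw JSON response to the fields useful for triage."""
    data = json.loads(raw)
    round_trip = data.get("analysisRoundTripMs", 0)
    return {
        "module": data.get("moduleId"),
        "success": data.get("success", False),
        "error": data.get("error"),
        "code": data.get("code"),  # absent in the .NET response
        "processed_by": data.get("processedBy"),
        "hit_suspected_timeout": round_trip >= SUSPECTED_TIMEOUT_MS,
    }


if __name__ == "__main__":
    for raw in (NET_ERROR, V62_ERROR):
        print(summarize_response(raw))
```

Running the same summary over a batch of saved responses would show whether every mesh failure sits at the 30000 ms mark and which machine (`processedBy`) reported it.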
I'm throwing a lot of 4K, 2 MB files at it mixed in with some lower-resolution images, so maybe that's a factor in why it doesn't happen on every single call.
I haven't really fleshed this out yet, but I think I need to restart the whole CodeProject.AI service after the error to get the mesh working correctly again.
What else can we do to troubleshoot this?
My system: