One node can see the other, but not reciprocal #603

Open
michaelkerr opened this issue Jan 12, 2025 · 8 comments

Comments

@michaelkerr

michaelkerr commented Jan 12, 2025

I have a MacBook Pro (M1 Max, 32GB) and a Mac Mini (M4, 16GB). Both are running exo with no errors in the console.

The MacBook sees the mini in the cluster (2 nodes, both listed), but the mini can't see the MacBook (1 node, only itself listed).

To make things more interesting: the MacBook's tiny chat doesn't work, but the mini's does. The MacBook shows the start of the prompt response generation in the terminal, then hangs:

<lbegin......

Cutting Knowledge...
Today Date...

@AlexCheema
Contributor

How are they networked? Is this over WiFi?

@michaelkerr
Author

Yessir - wifi.

@iseanwang

I had the same issue, although I used Thunderbolt 5 to connect my machines. Here is how I addressed it:

  1. Select System Settings - Network - Set Service Order.
  2. Make sure the connection method you use between your devices is first in the priority order.

Hope that helps!
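To see which interface a machine actually uses to reach the other node, one option is to ask the OS which local address it would pick for that peer. This is a minimal sketch, not part of exo: `local_ip_for` is a hypothetical helper, and the port value is arbitrary because a UDP `connect` sends no packets.

```python
import socket


def local_ip_for(peer: str, port: int = 9) -> str:
    """Return the local IPv4 address the OS would use to reach `peer`.

    Connecting a UDP socket sends nothing on the wire; it only asks the
    routing table which local interface address would be selected.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((peer, port))
        return sock.getsockname()[0]
```

Calling `local_ip_for` with the other node's address on each machine shows whether traffic would leave over the Thunderbolt bridge, Ethernet, or WiFi address. If the two machines disagree about which network they share, that is a likely place to look.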

@michaelkerr
Author

Appreciate the suggestion.

Unfortunately, no change. The Thunderbolt bridge is connected, and the machines have different IPs. Same behavior: the MacBook can see the mini, but the mini can't see the MacBook.

@dollarusername

dollarusername commented Jan 28, 2025

I have the exact same issue. The first run after a fresh install on both hosts appeared to work okay, as both hosts downloaded the model. Then I had to leave and take one host with me. Since I came back, no amount of restarting exo on both nodes makes it work: one host sees both, the other only sees one.
Both of my nodes are connected via Ethernet to the same switch.

@AlexCheema
Contributor

AlexCheema commented Jan 29, 2025

> I have the exact same issue. The first run after a fresh install on both hosts appeared to work okay, as both hosts downloaded the model. Then I had to leave and take one host with me. Since I came back, no amount of restarting exo on both nodes makes it work: one host sees both, the other only sees one. Both of my nodes are connected via Ethernet to the same switch.

Can you try running on the latest main? I made some fixes to blocking code that would cause nodes to appear unhealthy.

@dollarusername

I was using the latest, as of yesterday. Today I just rebooted the host that only showed the other host and it all worked: I was able to download and run a 70B model on two 32GB MacBooks without issue!
So it seems the problem may be something in the network stack, or possibly something cached.

@dollarusername

dollarusername commented Jan 30, 2025

A follow-up here, as the connection issues persist.
What's interesting is that node2 doesn't show node1; however, if I download a model on node1, it shows as downloading on node2 and goes and downloads that new model on node1 as well, even though node2 shows that it is not connected to the other host.
Downloading a model (via tinygrad) on node2 DOESN'T show up on node1.

When I run a query on node1, I see the query activity on node2 as well, even though node2 still shows only 1 node connected.
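One-way visibility like this can be checked outside exo by testing whether UDP datagrams travel in each direction between the two hosts. The following is a minimal sketch under assumptions: it presumes node discovery depends on UDP broadcast, and the function names and any port you pick are hypothetical rather than taken from exo.

```python
import socket
from typing import Optional, Tuple


def open_listener(port: int, host: str = "") -> socket.socket:
    """Bind a UDP socket that waits for probe datagrams on `port`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def send_probe(port: int, message: bytes,
               address: str = "255.255.255.255") -> None:
    """Send one datagram; the default address is the limited broadcast."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # SO_BROADCAST is required before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(message, (address, port))


def wait_for_probe(sock: socket.socket,
                   timeout: float = 5.0) -> Optional[Tuple[str, bytes]]:
    """Return (sender_ip, payload), or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        data, (sender_ip, _sender_port) = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return sender_ip, data
```

Run `wait_for_probe(open_listener(port))` on one machine and `send_probe(port, b"hello")` on the other with the same free port, then swap roles. If the probe arrives in only one direction, the asymmetry is in the network path or a firewall rather than in exo itself.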

I did a pull today on both nodes, so they are running the latest code.

I have also noticed that node2 (the unhealthy node) sometimes crashes with a GPU timeout error:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[1] 10267 abort exo

Node2 is an M1 Pro, node1 is an M2 Pro.
