
Local sparsification #1027

Open · fgregg opened this issue May 24, 2022 · 9 comments
fgregg commented May 24, 2022

We are contemplating exposing an argument that would sparsify the entire pair-score edgelist (#1026).

It would be very nice if we had some principled way to do this automatically for users. @fjsj and I discussed some "global" approaches previously (#834 (review)), but they didn't look too good.

Flavio also provided some links to "local" sparsification.

It would be great to see if one of these would be appropriate.

fgregg commented May 29, 2022

Structure-Preserving Sparsification Methods for Social Networks is about methods that work on unweighted graphs.

They do reference Extracting the multiscale backbone of complex weighted networks approvingly as an approach for weighted networks.
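
For reference, here's a minimal sketch of that paper's disparity filter (the edgelist format, function name, and `alpha` default are my assumptions; weights are treated as positive similarities):

```python
from collections import defaultdict

def disparity_filter(edges, alpha=0.05):
    """Keep edges that are locally significant for at least one
    endpoint, per Serrano et al. (2009). `edges` is a list of
    (i, j, weight) tuples with positive weights."""
    strength = defaultdict(float)
    degree = defaultdict(int)
    for i, j, w in edges:
        for node in (i, j):
            strength[node] += w
            degree[node] += 1

    def significant(node, w):
        k = degree[node]
        if k == 1:
            return True  # a node's only edge is trivially kept
        p = w / strength[node]
        # p-value of a normalized weight this large if the node's
        # strength were split uniformly at random over its k edges
        return (1 - p) ** (k - 1) < alpha

    return [
        (i, j, w) for i, j, w in edges
        if significant(i, w) or significant(j, w)
    ]
```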

fgregg commented May 31, 2022

In Unsupervised Sparsification of Similarity Graphs, Gollub and Stein make the following claim (or something close to it):

Let $c(o)$ indicate the class membership of object $o$. If $o_1$ and $o_2$ belong to the same class, then

$$ \Pr(c(o_1) = c(o_2)) > \max\{\Pr(c(o_1) = c(o_{x \not\in \{1, 2\}})),\Pr(c(o_2) = c(o_{y \not\in \{1, 2\}}))\} $$

which should imply that

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \not\in \{1, 2\}})),\mathbb{E}(s(o_2, o_{y \not\in \{1, 2\}}))\} $$

where $s(o_i, o_j)$ is the similarity score between $o_i$ and $o_j$.

The authors then make the unexpected claim that this is equivalent to

$$ s(o_1, o_2) > \max\{s(o_1, \bar{o}),s(o_2, \bar{o})\} $$

i.e., that the similarity between two objects that belong to the same class should be larger than the similarity between either of those objects and the graph centroid $\bar{o}$.

This seems clearly untrue. Imagine a graph with only two nodes: the centroid lies between them, so each node is at least as similar to the centroid as to the other node, and under the proposed rule the two nodes would never be put in the same class.

The more defensible rule, that

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \neq 1})),\mathbb{E}(s(o_2, o_{y \neq 2}))\} $$

has some surprising consequences. It means that the threshold for putting two nodes in the same class should be higher for nodes near the center of the graph and lower for nodes at the periphery.
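
As an edge filter, a sketch of this more defensible rule might look like the following (the edgelist format and function name are my assumptions; scores are similarities, and each node's mean excludes the edge under test, matching the $x \not\in \{1, 2\}$ condition):

```python
from collections import defaultdict

def sparsify_by_mean(edges):
    """Keep edge (i, j) only if its similarity beats the mean
    similarity of *both* endpoints to everything else.
    `edges` is a list of (i, j, score) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, j, s in edges:
        for node in (i, j):
            totals[node] += s
            counts[node] += 1

    def mean_excluding(node, s):
        # mean similarity of `node` to all others, excluding this edge
        if counts[node] == 1:
            return float("-inf")  # nothing else to compare against
        return (totals[node] - s) / (counts[node] - 1)

    return [
        (i, j, s) for i, j, s in edges
        if s > max(mean_excluding(i, s), mean_excluding(j, s))
    ]
```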

fgregg commented May 31, 2022

Improving Short Text Clustering by Similarity Matrix Sparsification basically picks up at

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \not\in \{1, 2\}})),\mathbb{E}(s(o_2, o_{y \not\in \{1, 2\}}))\} $$

but adds a standard deviation term.

Let $u_i$ be the mean similarity between $o_i$ and other nodes and let $s_i$ be the standard deviation of those similarities.

Then, Rakib and his co-authors propose the rule that

$$ s(o_1, o_2) > \min\{u_1 + \alpha s_1, u_2 + \alpha s_2\} $$

This is basically a z-score based approach, and seems quite interesting. But beyond setting $\alpha$ to 0, it's not clear how one would choose that value in a principled way.

Also, the shift from maximum to minimum seems hard to justify.
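
A sketch of Rakib's rule as a filter (the edgelist format and the use of the population standard deviation are my assumptions; $\alpha$ defaults to 0 here, since that's the only value we have a story for):

```python
import statistics
from collections import defaultdict

def sparsify_rakib(edges, alpha=0.0):
    """Keep edge (i, j) only if its similarity exceeds
    min(u_i + alpha * s_i, u_j + alpha * s_j), where u_n and s_n are
    the mean and standard deviation of node n's similarities."""
    sims = defaultdict(list)
    for i, j, s in edges:
        sims[i].append(s)
        sims[j].append(s)
    u = {n: statistics.mean(v) for n, v in sims.items()}
    sd = {n: statistics.pstdev(v) for n, v in sims.items()}
    return [
        (i, j, s) for i, j, s in edges
        if s > min(u[i] + alpha * sd[i], u[j] + alpha * sd[j])
    ]
```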

fgregg commented May 31, 2022

Ultimately, we are going to be thresholding clusters, so let's think about how we can use that fact for sparsification.

Let $t$ be our distance threshold. Let's start with "average" (average-linkage) clustering, and let edge distances take a value between 0 and 1.

Pairwise

If two nodes $i, j$ form a dense component by themselves and $w_{i,j} > t$, we can discard this edge early.

Triangle

Three nodes $i, j, k$ form a dense component with weights $w_{i,j}, w_{i,k}, w_{j,k}$. What is the average value of the edges incident on $i$ such that all three records end up in one cluster? Assume $w_{j,k} = 0$ and that $j$ and $k$ are already in cluster $C_{j,k}$; let $C_i$ be the singleton cluster containing only node $i$. With average linkage, the merge happens when

$$ D(C_{j,k}, C_i) = u_i < t $$

where $u_i$ is the mean weight of the edges from $i$ to $C_{j,k}$.

(useful ref on linkage algos)

So, what is the largest value that a particular edge can have while still allowing the merge? Assume a node has $n$ incident edges, one of weight $x$ and the rest with a value of 0; then

$$ \frac{x}{n} < t $$

$$ x < nt $$

For example, with $t = 0.5$ and $n = 4$, a single edge could weigh as much as $2$, far above any distance we'd actually see, so long as the other edges were 0. This does not look very promising as a form of edge sparsification, but I wonder if it could be valuable as a form of node sparsification.

If every edge incident on a node $i$ has a weight greater than $t$, then $i$ can't be a part of any cluster.
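
A sketch of this node filter (the edgelist format and function name are my assumptions; weights are distances):

```python
from collections import defaultdict

def drop_unclusterable_nodes(edges, t):
    """Drop every node whose incident edges all have distance > t;
    under average linkage such a node can never merge into a cluster,
    so all of its edges can be discarded with it."""
    best = defaultdict(lambda: float("inf"))
    for i, j, w in edges:
        best[i] = min(best[i], w)
        best[j] = min(best[j], w)
    keep = {n for n, w in best.items() if w <= t}
    return [(i, j, w) for i, j, w in edges if i in keep and j in keep]
```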

fgregg commented Jun 1, 2022

Extracting the multiscale backbone of complex weighted networks is very local. Dyads that scored very low would not be removed, since a node's only edge is always locally significant under the filter. Not quite what we want.

fgregg commented Jun 1, 2022

> If every edge incident on a node $i$ has a weight greater than $t$, then $i$ can't be a part of any cluster.

Just did a quick and dirty check on the csv example, and filtering out nodes like this removed a bit over 10% of edges.

fgregg commented Jun 1, 2022

Another thing we could sparsify:

If an edge is the only path between two nodes, and the edge weight is greater than $t$ (i.e., $w_{c,d} > t$), then we can disregard that edge.

```mermaid
graph TD;
    A-->C;
    B-->C;
    C-->D;
    D-->E;
    D-->F;
```

I don't know if there's an efficient way to identify such single-path edges.
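
For what it's worth, such edges are exactly the graph's bridges, and Tarjan's bridge-finding algorithm identifies all of them in linear time. A sketch using networkx's implementation (the function name and edgelist format are my assumptions):

```python
import networkx as nx

def drop_heavy_bridges(edges, t):
    """Drop every bridge (an edge that is the only path between two
    parts of the graph) whose distance exceeds t."""
    G = nx.Graph()
    G.add_weighted_edges_from(edges)
    heavy = {
        frozenset(e) for e in nx.bridges(G)
        if G.edges[e]["weight"] > t
    }
    return [
        (i, j, w) for i, j, w in edges
        if frozenset((i, j)) not in heavy
    ]
```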

fgregg commented Jun 1, 2022

Actually, if all of the edges between two parts of a graph have weights greater than the threshold, we should be able to cut those edges.

```mermaid
graph TD;
    A-->C;
    B-->C;
    B-->F;
    C-->D;
    D-->E;
    D-->F;
```

If $w_{c,d}$ and $w_{b,f}$ are both greater than the threshold, we should be able to cut both of them.
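
A sketch of this idea (the function name and edgelist format are my assumptions): take the connected components of the subgraph that keeps only under-threshold edges; any edge whose endpoints land in different components is part of an over-threshold cut and can be dropped.

```python
import networkx as nx

def drop_cut_edges(edges, t):
    """Drop edges that cross between connected components of the
    subgraph containing only edges with distance <= t; every edge
    in such a cut has distance > t by construction."""
    G = nx.Graph()
    G.add_nodes_from(n for i, j, _ in edges for n in (i, j))
    G.add_edges_from((i, j) for i, j, w in edges if w <= t)
    label = {}
    for c, comp in enumerate(nx.connected_components(G)):
        for node in comp:
            label[node] = c
    return [(i, j, w) for i, j, w in edges if label[i] == label[j]]
```

Of course, taking connected components of the under-threshold edges is already single-linkage clustering at $t$, which is rather the point of the next comment.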

fgregg commented Jun 1, 2022

Keep going down this road, and we are basically re-implementing clustering...
