
Local sparsification #1027

Open · fgregg opened this issue May 24, 2022 · 9 comments
fgregg commented May 24, 2022

We are contemplating exposing an argument that would sparsify the entire pair-score edgelist (#1026).

It would be very nice if we had some principled way to do this automatically for users. @fjsj and I discussed some "global" approaches previously (#834 (review)), but they didn't look too good.

Flavio also provided some links to "local" sparsification.

It would be great to see if one of these would be appropriate.

fgregg commented May 29, 2022

Structure-Preserving Sparsification Methods for Social Networks is about methods that work on unweighted graphs.

They do reference Extracting the multiscale backbone of complex weighted networks approvingly as an approach for weighted networks.
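
For reference, here's a minimal sketch of that paper's disparity filter (the edgelist format, function name, and `alpha` default are my assumptions; weights are treated as positive similarities):

```python
from collections import defaultdict

def disparity_filter(edges, alpha=0.05):
    """Keep edges that are locally significant for at least one
    endpoint, per Serrano et al. (2009). `edges` is a list of
    (i, j, weight) tuples with positive weights."""
    strength = defaultdict(float)
    degree = defaultdict(int)
    for i, j, w in edges:
        for node in (i, j):
            strength[node] += w
            degree[node] += 1

    def significant(node, w):
        k = degree[node]
        if k == 1:
            return True  # a node's only edge is trivially kept
        p = w / strength[node]
        # p-value of a normalized weight this large if the node's
        # strength were split uniformly at random over its k edges
        return (1 - p) ** (k - 1) < alpha

    return [
        (i, j, w) for i, j, w in edges
        if significant(i, w) or significant(j, w)
    ]
```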

fgregg commented May 31, 2022

In Unsupervised Sparsification of Similarity Graphs, Gollub and Stein make the following claim (or something close to it):

Let $c(o)$ indicate the class membership of object $o$. If $o_1$ and $o_2$ belong to the same class, then

$$ \Pr(c(o_1) = c(o_2)) > \max\{\Pr(c(o_1) = c(o_{x \not\in \{1, 2\}})),\Pr(c(o_2) = c(o_{y \not\in \{1, 2\}}))\} $$

which should imply that

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \not\in \{1, 2\}})),\mathbb{E}(s(o_2, o_{y \not\in \{1, 2\}}))\} $$

where $s(o_i, o_j)$ is the similarity score between $o_i$ and $o_j$.

The authors then make the unexpected claim that this is equivalent to

$$ s(o_1, o_2) > \max\{s(o_1, \bar{o}),s(o_2, \bar{o})\} $$

i.e., that the similarity between two objects that belong to the same class should be larger than the similarity between either of those objects and the graph centroid $\bar{o}$.

This seems clearly untrue. Imagine a graph with only two nodes: the centroid lies between them, so each node is at least as similar to the centroid as to the other node, and under the proposed rule the two nodes would never be put in the same class.

The more defensible rule, that

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \neq 1})),\mathbb{E}(s(o_2, o_{y \neq 2}))\} $$

has some surprising consequences. It means that the threshold for putting two nodes in the same class should be higher for nodes near the center of the graph and lower for nodes at the periphery.
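
As an edge filter, a sketch of this more defensible rule might look like the following (the edgelist format and function name are my assumptions; scores are similarities, and each node's mean excludes the edge under test, matching the $x \not\in \{1, 2\}$ condition):

```python
from collections import defaultdict

def sparsify_by_mean(edges):
    """Keep edge (i, j) only if its similarity beats the mean
    similarity of *both* endpoints to everything else.
    `edges` is a list of (i, j, score) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, j, s in edges:
        for node in (i, j):
            totals[node] += s
            counts[node] += 1

    def mean_excluding(node, s):
        # mean similarity of `node` to all others, excluding this edge
        if counts[node] == 1:
            return float("-inf")  # nothing else to compare against
        return (totals[node] - s) / (counts[node] - 1)

    return [
        (i, j, s) for i, j, s in edges
        if s > max(mean_excluding(i, s), mean_excluding(j, s))
    ]
```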

fgregg commented May 31, 2022

Improving Short Text Clustering by Similarity Matrix Sparsification basically picks up at

$$ s(o_1, o_2) > \max\{\mathbb{E}(s(o_1, o_{x \not\in \{1, 2\}})),\mathbb{E}(s(o_2, o_{y \not\in \{1, 2\}}))\} $$

but adds a standard deviation term.

Let $u_i$ be the mean similarity between $o_i$ and other nodes and let $s_i$ be the standard deviation of those similarities.

Then, Rakib and his co-authors propose the rule that

$$ s(o_1, o_2) > \min\{u_1 + \alpha s_1, u_2 + \alpha s_2\} $$

This is basically a z-score based approach, and seems quite interesting. But beyond setting $\alpha$ to 0, it's not clear how one would choose that value in a principled way.

Also, the shift from maximum to minimum seems hard to justify.
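
A sketch of Rakib's rule as a filter (the edgelist format and the use of the population standard deviation are my assumptions; $\alpha$ defaults to 0 here, since that's the only value we have a story for):

```python
import statistics
from collections import defaultdict

def sparsify_rakib(edges, alpha=0.0):
    """Keep edge (i, j) only if its similarity exceeds
    min(u_i + alpha * s_i, u_j + alpha * s_j), where u_n and s_n are
    the mean and standard deviation of node n's similarities."""
    sims = defaultdict(list)
    for i, j, s in edges:
        sims[i].append(s)
        sims[j].append(s)
    u = {n: statistics.mean(v) for n, v in sims.items()}
    sd = {n: statistics.pstdev(v) for n, v in sims.items()}
    return [
        (i, j, s) for i, j, s in edges
        if s > min(u[i] + alpha * sd[i], u[j] + alpha * sd[j])
    ]
```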

fgregg commented May 31, 2022

Ultimately, we are going to be thresholding clusters, so let's think about how we can use that fact for sparsification.

Let $t$ be our distance threshold. Let's start with "average" (average-linkage) clustering, and let edge distances take a value between 0 and 1.

Pairwise

If two nodes $i, j$ form a dense component by themselves and $w_{i,j} > t$, we can discard this edge early.

Triangle

Three nodes $i, j, k$ form a dense component with weights $w_{i,j}, w_{i,k}, w_{j,k}$. What is the average value of the edges incident on $i$ such that all three records end up in one cluster? Assume $w_{j,k} = 0$ and that $j$ and $k$ are already in cluster $C_{j,k}$; let $C_i$ be the singleton cluster containing only node $i$. With average linkage, the merge happens when

$$ D(C_{j,k}, C_i) = u_i < t $$

where $u_i$ is the mean weight of the edges from $i$ to $C_{j,k}$.

(useful ref on linkage algos)

So, what is the largest value that a particular edge can have while still allowing the merge? Assume a node has $n$ incident edges, one of weight $x$ and the rest with a value of 0; then

$$ \frac{x}{n} < t $$

$$ x < nt $$

For example, with $t = 0.5$ and $n = 4$, a single edge could weigh as much as $2$, far above any distance we'd actually see, so long as the other edges were 0. This does not look very promising as a form of edge sparsification, but I wonder if it could be valuable as a form of node sparsification.

If every edge incident on a node $i$ has a weight greater than $t$, then $i$ can't be a part of any cluster.
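
A sketch of this node filter (the edgelist format and function name are my assumptions; weights are distances):

```python
from collections import defaultdict

def drop_unclusterable_nodes(edges, t):
    """Drop every node whose incident edges all have distance > t;
    under average linkage such a node can never merge into a cluster,
    so all of its edges can be discarded with it."""
    best = defaultdict(lambda: float("inf"))
    for i, j, w in edges:
        best[i] = min(best[i], w)
        best[j] = min(best[j], w)
    keep = {n for n, w in best.items() if w <= t}
    return [(i, j, w) for i, j, w in edges if i in keep and j in keep]
```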

fgregg commented Jun 1, 2022

Extracting the multiscale backbone of complex weighted networks is very local. Dyads that scored very low would not be removed, since a node's only edge is always locally significant under the filter. Not quite what we want.

fgregg commented Jun 1, 2022

> If every edge incident on a node $i$ has a weight greater than $t$, then $i$ can't be a part of any cluster.

Just did a quick and dirty check on the csv example, and filtering out nodes like this removed a bit over 10% of edges.

fgregg commented Jun 1, 2022

Another thing we could sparsify:

If an edge is the only path between two nodes, and the edge weight is greater than $t$ (i.e., $w_{c,d} > t$), then we can disregard that edge.

```mermaid
graph TD;
    A-->C;
    B-->C;
    C-->D;
    D-->E;
    D-->F;
```

I don't know if there's an efficient way to identify such single-path edges.
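
For what it's worth, such edges are exactly the graph's bridges, and Tarjan's bridge-finding algorithm identifies all of them in linear time. A sketch using networkx's implementation (the function name and edgelist format are my assumptions):

```python
import networkx as nx

def drop_heavy_bridges(edges, t):
    """Drop every bridge (an edge that is the only path between two
    parts of the graph) whose distance exceeds t."""
    G = nx.Graph()
    G.add_weighted_edges_from(edges)
    heavy = {
        frozenset(e) for e in nx.bridges(G)
        if G.edges[e]["weight"] > t
    }
    return [
        (i, j, w) for i, j, w in edges
        if frozenset((i, j)) not in heavy
    ]
```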

fgregg commented Jun 1, 2022

Actually, if all of the edges between two parts of a graph have weights greater than the threshold, we should be able to cut those edges.

```mermaid
graph TD;
    A-->C;
    B-->C;
    B-->F;
    C-->D;
    D-->E;
    D-->F;
```

If $w_{c,d}$ and $w_{b,f}$ are both greater than the threshold, we should be able to cut both of them.
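
A sketch of this idea (the function name and edgelist format are my assumptions): take the connected components of the subgraph that keeps only under-threshold edges; any edge whose endpoints land in different components is part of an over-threshold cut and can be dropped.

```python
import networkx as nx

def drop_cut_edges(edges, t):
    """Drop edges that cross between connected components of the
    subgraph containing only edges with distance <= t; every edge
    in such a cut has distance > t by construction."""
    G = nx.Graph()
    G.add_nodes_from(n for i, j, _ in edges for n in (i, j))
    G.add_edges_from((i, j) for i, j, w in edges if w <= t)
    label = {}
    for c, comp in enumerate(nx.connected_components(G)):
        for node in comp:
            label[node] = c
    return [(i, j, w) for i, j, w in edges if label[i] == label[j]]
```

Of course, taking connected components of the under-threshold edges is already single-linkage clustering at $t$, which is rather the point of the next comment.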

fgregg commented Jun 1, 2022

Keep going down this road, and we are basically re-implementing clustering...
