get_edges_prob for every possible nonedge?

I'm attempting to use get_edges_prob() to find the most likely missing edges
out of every possible non-edge. I know the number of possible edges is O(n^2).

Currently I'm sampling them like this:

    # one list of sampled log-probabilities per candidate non-edge
    non_edges_probs = [[] for _ in range(len(non_edges))]

    def collect_edge_probs(s):
        # work with the lowest (node-level) state of the hierarchy
        s = s.levels[0]

        for i, non_edge in enumerate(non_edges):
            p = s.get_edges_prob([non_edge], [],
                                 entropy_args=dict(partition_dl=False))
            non_edges_probs[i].append(p)

    gt.mcmc_equilibrate(nested_state,
                        force_niter=100,
                        mcmc_args=dict(niter=10),
                        callback=collect_edge_probs,
                        verbose=True)
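
(For completeness, one way to reduce each list to a single score afterwards,
treating the returned values as log-probabilities and averaging them in log
space, would be something like this:)

    import numpy as np
    from scipy.special import logsumexp

    # average each candidate's sampled log-probabilities in log space
    mean_log_probs = [logsumexp(ps) - np.log(len(ps))
                      for ps in non_edges_probs]

    # indices of candidates, from most to least likely missing edge
    order = np.argsort(mean_log_probs)[::-1]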

Is there a way to speed this up at all? If not, is there a heuristic I can
use to reduce the number of possibilities?

Currently I'm using vertex similarity measures to cut down the possibilities,
but I'm wondering if there is a heuristic involving just the NestedBlockState.

Any help appreciated!


There is no way to avoid looking at all possibilities, but you could
pass the actual list at once, instead of iterating through it and
passing lists of size 1. The reason get_edges_prob() exists and accepts
lists is precisely to speed things up in this case.
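
For instance (a sketch using the same names as above; note that the return
value is then a single number for the whole list, as discussed further below):

    # one call over the full list instead of len(non_edges) separate calls;
    # note: this returns a single joint value, not one value per edge
    p = s.get_edges_prob(non_edges, [],
                         entropy_args=dict(partition_dl=False))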

Best,
Tiago

I was under the impression that passing a list corresponds to getting the
probability that *all* the edges are missing. Indeed, when I try it out I
get back a scalar, not a NumPy array. I want to collect the probability that
each individual edge is missing.

Also, with respect to the heuristics I mention, I just saw this paper,
"Evaluating Overfit and Underfit in Models of Network Community Structure",
use s_ij = θ_i * θ_j * l_{g_i,g_j}.

If sampling is not computationally feasible, this is what I had in mind.
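
Roughly, something like the following sketch, computed from a fitted
degree-corrected state (the θ and group-level quantities here are my own
reading of that parameterization, using edge counts between groups; none of
this comes from the paper's code):

    import numpy as np

    # sketch: s_ij = θ_i * θ_j * l_{g_i,g_j} from a fitted (degree-corrected)
    # gt.BlockState; parameter names follow the DC-SBM convention
    def dcsbm_similarity(state, i, j):
        b = state.b.a                               # group memberships
        k = state.g.degree_property_map("total").a  # degrees
        ers = state.get_matrix()                    # edge counts between groups
        group_deg = np.bincount(b, weights=k, minlength=state.get_B())
        theta_i = k[i] / group_deg[b[i]]            # θ_i = k_i / K_{g_i}
        theta_j = k[j] / group_deg[b[j]]
        return theta_i * theta_j * ers[b[i], b[j]]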

1) Is there a way built into graph-tool to compute this similarity function
efficiently? (i.e., without Python slowing me down)
2) Is there a hierarchical analog, like just summing this similarity at
each level?

Thanks as always


I was under the impression that passing a list corresponds to getting the
probability that *all* the edges are missing. Indeed, when I try it out I
get back a scalar, not a NumPy array. I want to collect the probability that
each individual edge is missing.

Yes, this is true.

Also, with respect to the heuristics I mention, I just saw this paper,
"Evaluating Overfit and Underfit in Models of Network Community Structure",
use s_ij = θ_i * θ_j * l_{g_i,g_j}.

This is not substantially faster than what is actually computed in
graph-tool; it is just less accurate.

If sampling is not computationally feasible, this is what I had in mind.

1) Is there a way built into graph-tool to compute this similarity
function efficiently? (i.e., without Python slowing me down)

You should switch to using

    MeasuredBlockState.get_edge_prob()

which is implemented entirely in C++.

The whole approach described in the documentation on network reconstruction
should be preferred over the older binary classification scheme, as it
completely subsumes it.
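
A minimal sketch of that approach (the measurement counts n and x below are
just placeholders, assuming each observed edge was measured once and seen
once; see the reconstruction documentation for how to set them properly):

    import graph_tool.all as gt

    n = g.new_ep("int", 1)     # measurements per observed edge
    x = g.new_ep("int", 1)     # positive observations per observed edge
    state = gt.MeasuredBlockState(g, n=n, x=x, n_default=1, x_default=0)

    probs = [[] for _ in non_edges]

    def collect_edge_probs(s):
        # get_edge_prob() runs in C++, so this loop stays cheap
        for i, (u, v) in enumerate(non_edges):
            probs[i].append(s.get_edge_prob(u, v))

    gt.mcmc_equilibrate(state, force_niter=100, mcmc_args=dict(niter=10),
                        callback=collect_edge_probs)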

2) Is there a hierarchical analog, like just summing this similarity at
each level?

Yes, any of the above approaches work in exactly the same way with the
hierarchical model.
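
For instance, with MeasuredBlockState the hierarchical model is selected via
the nested argument (a sketch; nested=True should already be the default):

    # nested=True uses a hierarchical (nested) SBM internally;
    # state_args is forwarded to the underlying block state
    state = gt.MeasuredBlockState(g, n=n, x=x, n_default=1, x_default=0,
                                  nested=True,
                                  state_args=dict(deg_corr=True))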

Best,
Tiago