Link prediction as binary classification using graph-tool

I am playing around with link prediction methods and trying out the method explained in the Inferring modular network structure guide.

I’m working with a bipartite graph where one node set has approximately 7,000 nodes and the other has approximately 9,000 nodes. The network has about 50,000 edges, which means there are a little less than 54 million pairs of nodes not joined by an edge. (Note that I am only considering pairs of nodes belonging to different node sets, since this is a bipartite graph.)

I have created a “training” network, with 10% of the edges removed, that I will use to fit the SBM. The “test set” contains the 10% of edges I removed along with the ~54 million non-edges. My goal is to get a list of probabilities for every pair of nodes in the test set. Below is the code for how I am trying to approach this (sorry for typos, I tried to edit out irrelevant code; I welcome criticism of my code as well).

import graph_tool.all as gt
import numpy as np
import scipy.special as spsp

def prepare_data(bipartite_network, test_edges):
    # Build the training graph by filtering out the held-out test edges.
    test_edges_set = list(map(set, test_edges))
    train_graphview = gt.GraphView(bipartite_network,
                                   efilt=lambda e: set(e) not in test_edges_set)
    train_graph = gt.Graph(train_graphview, prune=True)
    return train_graph


def collect_edge_probs(s):
    
    for idx, pair in enumerate(node_pairs):
        p = s.get_edges_prob([pair], entropy_args=dict(partition_dl=False))
        probs[idx].append(p)


bipartite_network = # ... graph-tool graph object with a vertex property node_types which labels nodes by their corresponding node set
node_pairs = # ... code that returns a list of node pairs corresponding to the test edges and non-edges in the network

test_size = # ... number of actual edges in the test set

test_edges = node_pairs[:test_size] # node_pairs contains all test edges and non-edges; the test edges come first in the list.
train_graph = prepare_data(bipartite_network, test_edges)

state = gt.minimize_blockmodel_dl(train_graph,
                                  state_args=dict(deg_corr=True,
                                                  pclabel=train_graph.vp.node_types))

probs = tuple([] for _ in range(len(node_pairs)))
mcmc_itr = 10
sweeps_per_itr = 10

gt.mcmc_equilibrate(state, force_niter=mcmc_itr, mcmc_args=dict(niter=sweeps_per_itr),
                    callback=collect_edge_probs)

probas = []
for i in range(len(node_pairs)):
    probas.append(spsp.logsumexp(probs[i]) - np.log(len(probs[i])))

denom = spsp.logsumexp(probas)
for i in range(len(probas)):
    probas[i] = probas[i] - denom
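
(For reference, the elided construction of test_edges and node_pairs is conceptually along the lines of the sketch below. This is purely illustrative and assumes, for example, that node_types stores 0/1 node-set labels; it is not my exact code.)

import numpy as np

# Illustrative sketch only: sample 10% of the edges as the held-out test set.
edges = bipartite_network.get_edges()[:, :2]       # (E, 2) array of edge endpoints
rng = np.random.default_rng(0)
test_idx = rng.choice(len(edges), size=len(edges) // 10, replace=False)
test_edges = [tuple(e) for e in edges[test_idx]]

# Candidate non-edges: all cross-set pairs that are not edges of the full network.
# (For ~54 million pairs this enumeration would need to be vectorized or chunked.)
types = bipartite_network.vp.node_types.a          # assumed to be 0/1 node-set labels
left, right = np.where(types == 0)[0], np.where(types == 1)[0]
edge_set = set(frozenset(e) for e in map(tuple, edges))
non_edges = [(u, v) for u in left for v in right
             if frozenset((u, v)) not in edge_set]

node_pairs = test_edges + non_edges                # test edges come first
test_size = len(test_edges)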

I am finding that this block of code:

def collect_edge_probs(s):
    
    for idx, pair in enumerate(node_pairs):
        p = s.get_edges_prob([pair], entropy_args=dict(partition_dl=False))
        probs[idx].append(p)

takes a very long time to execute, since it needs to iterate through 54 million node_pairs. So my question is: is there a way to speed up this calculation? I know that many of the graph-tool loops are offloaded to C++ for performance, but in this specific instance I don’t see how I can take advantage of graph-tool to speed up this loop, and I am not sure that I can write a C++ extension given that this code depends on the get_edges_prob() method. Does anyone have any advice on how to speed up this calculation?

The answer is simple: just implement your function as:

def collect_edge_probs(s, node_pairs):
   return s.get_edges_prob(node_pairs, entropy_args=dict(partition_dl=False))

Thank you for your response, but I am not sure that I follow.

Wouldn’t

def collect_edge_probs(s, node_pairs):
   return s.get_edges_prob(node_pairs, entropy_args=dict(partition_dl=False))

return the joint log-probability of all pairs in node_pairs missing at once? That is, wouldn’t I get a single log-probability for my list of 54 million pairs?

What I am looking to do is akin to what you have written in your cookbook, where I can get the relative probability of each pair missing individually. This way I can, for example, calculate true and false positive rates and plot an ROC curve.
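
(Concretely, the evaluation I have in mind would look something like the sketch below, using scikit-learn purely for illustration; labels here is a hypothetical array marking which entries of node_pairs are the held-out true edges.)

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Sketch only: `probas` holds one score per pair, and the held-out true edges
# come first in node_pairs (as in my code above).
labels = np.zeros(len(node_pairs), dtype=int)
labels[:test_size] = 1
scores = np.asarray(probas)        # higher score = more likely to be a missing edge

fpr, tpr, _ = roc_curve(labels, scores)
print("AUC:", roc_auc_score(labels, scores))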

Indeed you’re right.

To obtain what you want, you should use a MeasuredBlockState:

mstate = gt.MeasuredBlockState(g, n=g.new_ep("int", val=1),
                               x=g.new_ep("int", val=1),
                               bstate=s, nested=False)
mstate.get_edges_prob(node_pairs)

It should be substantially faster.


I hadn’t considered using MeasuredBlockState. I would like to make sure I am using it properly, so I am attempting to reproduce the example from your cookbook on Edge prediction as binary classification.

I have made a slight change by adding the existing edge (55, 35) to the list of missing edges; otherwise, this is the reference code I am using to compare results:

g = gt.collection.data["football"]

missing_edges = [(55, 35), (101, 102), (17, 56)]

L = 10 # This is a direct copy-paste from the cookbook, but I don't think it is actually used in the example.

state = gt.minimize_nested_blockmodel_dl(g)

probs = ([], [], [])
def collect_edge_probs(s):
    p1 = s.get_edges_prob([missing_edges[0]], entropy_args=dict(partition_dl=False))
    p2 = s.get_edges_prob([missing_edges[1]], entropy_args=dict(partition_dl=False))
    p3 = s.get_edges_prob([missing_edges[2]], entropy_args=dict(partition_dl=False))
    probs[0].append(p1)
    probs[1].append(p2)
    probs[2].append(p3)

# Now we collect the probabilities for exactly 100,000 sweeps
gt.mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10),
                    callback=collect_edge_probs)


def get_avg(p):
    p = np.array(p)
    pmax = p.max()
    p -= pmax
    return pmax + np.log(np.exp(p).mean())

p1 = get_avg(probs[0])
p2 = get_avg(probs[1])
p3 = get_avg(probs[2])

p_sum = get_avg([p1, p2, p3]) + np.log(3)

l1 = p1 - p_sum
l2 = p2 - p_sum
l3 = p3 - p_sum

print("likelihood-ratio for %s: %g" % (missing_edges[0], np.exp(l1)))
print("likelihood-ratio for %s: %g" % (missing_edges[1], np.exp(l2)))
print("likelihood-ratio for %s: %g" % (missing_edges[2], np.exp(l3)))

Its output is:

likelihood-ratio for (55, 35): 0.342067
likelihood-ratio for (101, 102): 0.0132379
likelihood-ratio for (17, 56): 0.644695

To be honest, I’m not entirely sure I even understand this output since I find it a bit surprising that the non-existing edge (17,56) is so much more likely to be missing than the existing edge (55,35).

Below is my attempt to reproduce these results using the MeasuredBlockState:

g = gt.collection.data["football"]

node_pairs = [(55, 35), (101, 102), (17, 56)]

state = gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=True))
mstate = gt.MeasuredBlockState(g, n=g.new_ep("int", val=1), 
                            x=g.new_ep("int", val=1),
                            bstate=state, state_args=dict(deg_corr=True), nested=False)

eprobs = []

def collect_edge_probs(s):
    global eprobs
    eprobs.append(s.get_edges_prob(node_pairs, entropy_args=dict(partition_dl=False)))


gt.mcmc_equilibrate(mstate, force_niter=10000, mcmc_args=dict(niter=10),
                    callback=collect_edge_probs)


probas = []
for i in range(len(node_pairs)):
    probas.append(spsp.logsumexp(eprobs[i]) - np.log(len(eprobs[i])))

denom = spsp.logsumexp(probas)
for i in range(len(probas)):
    probas[i] = probas[i] - denom

for i in range(len(node_pairs)):
    print("likelihood-ratio for %s: %g" % (node_pairs[i], np.exp(probas[i])))

Its output is:

likelihood-ratio for (55, 35): 0.333083
likelihood-ratio for (101, 102): 0.333456
likelihood-ratio for (17, 56): 0.333462

Again, (17,56) seems to be more likely to be missing than (55,35), but in this case it really seems like all three edges are equally likely to be missing. Multiple runs of the above seem to have (17,56) and (55,35) switching for first place, but the differences are almost always in the 3rd or 4th digit after the decimal point. I’m honestly not sure that I am doing this correctly.

This is not really what I meant; I suggested you should just replace the computation of the scores in your original code with a MeasuredBlockState, and keep everything else, since it should give the same conditional posterior probabilities, up to normalization, but is faster.
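
Concretely, a rough sketch of that substitution, reusing the names from your original post (constructing the MeasuredBlockState inside the callback is just one way to wire it up):

eprobs = []

def collect_edge_probs(s):
    # Wrap the current partition and score all candidate pairs in a single call.
    mstate = gt.MeasuredBlockState(train_graph, n=train_graph.new_ep("int", val=1),
                                   x=train_graph.new_ep("int", val=1),
                                   bstate=s, nested=False)
    eprobs.append(mstate.get_edges_prob(node_pairs))

gt.mcmc_equilibrate(state, force_niter=mcmc_itr, mcmc_args=dict(niter=sweeps_per_itr),
                    callback=collect_edge_probs)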

Doing the full reconstruction using that method is not what I had in mind, but of course it is totally fine — and much better in many ways! But it seems you are just computing the probabilities wrong. Running your code, this is what I get:

import numpy as np
from scipy.special import logsumexp
for i in range(len(node_pairs)):
    p = logsumexp([ep[i] for ep in eprobs]) - np.log(len(eprobs))
    print("posterior prob for %s: %g" % (node_pairs[i], np.exp(p)))

which gives me:

posterior prob for (55, 35): 0.999111
posterior prob for (101, 102): 0.000241431
posterior prob for (17, 56): 0.0253524

Please note that the conditional log-probabilities given by MeasuredBlockState.get_edges_prob() are already normalized!

Of course you could also compute the ratio between them if you’d like…
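
For instance, a quick sketch reusing the eprobs collected above (this is not taken from the notebook):

import numpy as np
from scipy.special import logsumexp

# Average over sweeps per pair, then normalize across the candidate pairs.
log_p = np.array([logsumexp([ep[i] for ep in eprobs]) - np.log(len(eprobs))
                  for i in range(len(node_pairs))])
ratios = np.exp(log_p - logsumexp(log_p))   # relative ratios summing to 1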

Here’s a full Jupyter notebook if you want to check what I did: Google Colab


I understand now. Thank you for your help!