I am experimenting with link prediction and trying out the approach explained in the Inferring modular network structure guide.
I’m working with a bipartite graph where one node set has approximately 7,000 nodes and the other node set has approximately 9,000 nodes. The network has about 50,000 edges, which means there are a little less than 54 million pairs of nodes not joined by an edge. (Note that I am only considering pairs of nodes belonging to different node sets, since this is a bipartite graph.)
I have created a “training” network that I will use to fit the SBM, where I have removed 10% of the edges. The “test set” contains the 10% of edges I removed along with the ~54 million non-edges. My goal is to get a probability for every pair of nodes in the test set. Below is the code for how I am trying to approach this (sorry for any typos; I tried to edit out irrelevant code, and I welcome criticism of my code as well).
import graph_tool.all as gt
import numpy as np
import scipy.special as spsp

def prepare_data(bipartite_network, test_edges):
    # Remove the held-out test edges to obtain the training network
    test_edges_set = list(map(set, test_edges))
    train_graphview = gt.GraphView(bipartite_network,
                                   efilt=lambda e: set(e) not in test_edges_set)
    train_graph = gt.Graph(train_graphview, prune=True)
    return train_graph

def collect_edge_probs(s):
    # Called after each gt.mcmc_equilibrate() iteration; records the conditional
    # log-probability of every candidate pair under the current partition
    for idx, pair in enumerate(node_pairs):
        p = s.get_edges_prob([pair], entropy_args=dict(partition_dl=False))
        probs[idx].append(p)

bipartite_network = # ... graph-tool graph object with a vertex property node_type which labels nodes by their corresponding node set
node_pairs = # ... code that returns a list of node pairs corresponding to the test edges and non-edges in the network
test_size = # ... number of actual edges in the test set
test_edges = node_pairs[:test_size]  # node_pairs contains all test edges and non-edges; the test edges come first in the list

train_graph = prepare_data(bipartite_network, test_edges)

# Fit a degree-corrected SBM, constraining the partition to respect the two node sets
state = gt.minimize_blockmodel_dl(train_graph,
                                  state_args=dict(deg_corr=True,
                                                  pclabel=train_graph.vp.node_type))

# One list of sampled log-probabilities per candidate pair
probs = tuple([] for _ in range(len(node_pairs)))

mcmc_itr = 10
sweeps_per_itr = 10
gt.mcmc_equilibrate(state, force_niter=mcmc_itr, mcmc_args=dict(niter=sweeps_per_itr),
                    callback=collect_edge_probs)

# Average the samples for each pair (in log space), then normalize over all pairs
probas = []
for i in range(len(node_pairs)):
    probas.append(spsp.logsumexp(probs[i]) - np.log(len(probs[i])))
denom = spsp.logsumexp(probas)
for i in range(len(probas)):
    probas[i] = probas[i] - denom
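For reference, the elided placeholders above (bipartite_network, node_pairs, test_size, test_edges) stand for something roughly like the following simplified sketch. It is not my exact code, and it assumes the node_type property stores 0/1 labels for the two node sets and that the held-out 10% of edges is sampled uniformly at random:

import itertools
import numpy as np

rng = np.random.default_rng(0)

# Split vertices by node set using the node_type vertex property (0/1 assumed)
left = [int(v) for v in bipartite_network.vertices()
        if bipartite_network.vp.node_type[v] == 0]
right = [int(v) for v in bipartite_network.vertices()
         if bipartite_network.vp.node_type[v] == 1]

# Hold out 10% of the edges as the test edges
edges = [(int(e.source()), int(e.target())) for e in bipartite_network.edges()]
test_size = int(0.1 * len(edges))
test_idx = rng.choice(len(edges), size=test_size, replace=False)
test_edges = [edges[i] for i in test_idx]

# Non-edges: all cross-set pairs not present as edges in the original network
# (this list is huge, ~54 million entries)
edge_set = set(map(frozenset, edges))
non_edges = [(u, v) for u, v in itertools.product(left, right)
             if frozenset((u, v)) not in edge_set]

# Test edges first, followed by all non-edges
node_pairs = test_edges + non_edges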
I am finding that this block of code:
def collect_edge_probs(s):
    for idx, pair in enumerate(node_pairs):
        p = s.get_edges_prob([pair], entropy_args=dict(partition_dl=False))
        probs[idx].append(p)
takes a very long time to execute, since it needs to iterate through 54 million node_pairs. So my question is: is there a way to speed up this calculation? I know that many of the graph-tool loops are offloaded to C++ for performance, but in this specific instance I don’t see how I can take advantage of graph-tool to speed up this loop, and I am not sure that I can write a C++ extension given that this code depends on the get_edges_prob() method. Does anyone have any advice on how to speed up this calculation?
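To put the scale in perspective, here is a rough back-of-the-envelope estimate (the ~1 ms per get_edges_prob() call is just an assumed figure for illustration, not a measurement):

n_pairs = 54_000_000   # candidate pairs evaluated in each callback
n_callbacks = 10       # collect_edge_probs runs once per mcmc_equilibrate iteration (force_niter=10)
t_per_call = 1e-3      # assumed ~1 ms per get_edges_prob call
print(n_pairs * n_callbacks * t_per_call / 3600, "hours")  # about 150 hours

So even with optimistic per-call timings, the full loop is far beyond what is practical for me.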