Subsampling a network for inference

I’ve developed a python library that uses graph-tool to analyze some biological data and it works perfect, not because of my library, but because of the SBM model under the hood (thanks!).
The methods available to generate the data are continuously improved and, most important, their cost decreases with time. The direct consequence is that the size of the datasets increases as well. When I first approached the problem, I typically worked with graphs of, at most, 20k nodes. Nowadays we are already in the realm of 10^5 or even 10^6 nodes. I may be ok with the fact it may take days (or weeks) to analyze such networks, but it would be better if I could speed up (some collaborators also asked hints on that).
A while ago, I was reasoning I could infer the structure on (multiple) random subsamples of the original network, project the structure on the original one and possibly taking the consensus over multiple solutions.
To perform the projection I would go like this: given G (the original) and S (the subsample), I can find the solution B for S. Then I would create a BlockState P for G in which all nodes belong to the same partition, except for the ones in S, which will belong to the partition B. I would then use the P.virtual_vertex_move for every “unassigned” cell to each group in B, looking for the “best” move.

Of course I won’t be able to find small partitions in the original network and the final solution will be necessarily “coarse”, but in my case that may be ok for the majority of settings (which raises the question: why on earth you would generate 10x more data? but that’s another story). Having said that, do you think there’s something inherently wrong in what I just wrote? I never had time to test such approach



This may sound like an appealing idea at first, but my prediction is that it will not work very well.

Random sub-sampling of sparse networks will result in much sparser networks, or even completely empty ones. A rough guideline is that if you remove a fraction f of the nodes, then the average degree will also decrease at least by the same fraction, often more, and hence the detectability of the communities will suffer.

Having said that, I never experimented with this in practice, so it may be worthwhile to investigate.

There are other approaches that may be worth considering as well, it terms of speeding up the inference.

Have you compared the speed of the following four approaches:

  1. Using minimize_nested_blockmodel_dl()
  2. Starting with a NestedBlockState with one group and calling state.multiflip_mcmc_sweep(beta=np.infty, niter=100) until convergence.
  3. Like 2, but initializing the state with B=N.
  4. Like 2, but initializing the state with B=sqrt(N).

It may be worth trying these out if you have not.

We’ve made some tests in the past comparing 1 and 2 and in the end we went for 2. We actually perform it multiple times (using joblib) and compute the consensus at the end. I never explored 3 and 4, I will.
As for the subsampling, it is true that the network is sparse, the only thing I can add is that I work with kNN networks generated from data and I typically choose a n_neighbors that scales with N (sqrt(N)/2), so bigger networks are not necessarily sparser.
Having said that I will perform some tests and let you know.