[SBM on Dense Graphs]

Jean_Christophe_Caze · June 14, 2019, 7:59am

Hello, I intend to use graph_tool for a big network, +100k nodes and very
dense.

The dataset i'm working with at the moment is ~ 40/50 GB csv containing
vertices and edges as transactions.

Is it realistic to try SBM on such graph both computationnally and would
this be theoretically useful?

If it isnt computationnally, how big can my subgraph be in order to be
feasible?

Note: I will rent a Google Cloud Platform VM to do so.

Thank you

attachment.html (555 Bytes)

haiko.lietz · June 14, 2019, 8:04am

Hi Jean,

the answer also depends how complicated the desired SBM is. A layered model takes longer than an unlayered one.

Modeling a graph with 100k nodes should take very long. But I'd also be interested in a more informed answer...

Haiko

attachment.html (1.75 KB)

Felix_Victor_Munch · June 14, 2019, 9:56am

Hi Jean,

I did a hierarchical SBM on a Twitter follow network (pretty sparse actually in comparison to other kinds of networks) with ca. 250k accounts. It took about "11 days for one run, with 60 vCPUs and a peak usage of ca. 400 GB RAM" (my PhD thesis, https://eprints.qut.edu.au/125543/, p. 251). How useful it is theoretically depends on your goals and discipline. In my case, feel free to read the referenced chapter and decide for yourself.

I also did runs on smaller networks with ca 100k -150k nodes. Running time was still about in the same ball park, or even longer … I guess because the network structure was more complex.

The amount of RAM is pretty mandatory. Swap memory won't help you much, as it slows the algorithm down to a degree that makes the running time infeasible. Number of CPUs could be lower I guess, because much of the algo seemed to run serially and those parts made up most of the calculation time.

I had a university/QRIScloud (https://www.qriscloud.org.au/) provided VM with 60 cores and 900 GB RAM. On a Google VM this would have been pretty costly. I adjusted epsilon for less accuracy and greater speed:

state = gt.inference.minimize_nested_blockmodel_dl(core, verbose=True, mcmc_equilibrate_args={'epsilon': 1e-2}, )

(PhD-code/1000+ nested SBM.ipynb at master · FlxVctr/PhD-code · GitHub)

If I wouldn't have had the computing ressources for free I wouldn't have done it.

I'd recommend to test infomap if you're looking for a more efficient alternative (MapEquation) that also works with entropy minimization (even though it's more flow oriented). Just did a 181k network community detection (non-hierarchical) in a matter of seconds on a last-year's Macbook Pro yesterday. I don't know how long it takes for a hierarchical structure, but it can do this, so it's worth a try.

Also efficient, but with all the drawbacks that Tiago Peixoto elaborates on in his papers (which I also refer to in the chapter linked above), and not hierarchical, is the parallelised modularity maximisation (Parallelised Louvain Method) PLM in NetworKit (https://networkit.github.io/). Despite it's theoretical and statistical drawbacks it delivers good heuristical evidence for communities in networks imho. But that depends a lot on what you want to do

Cheers,

Felix

Dr. Felix Victor Münch Postdoc Researcher
Leibniz Institut for Media Research | Hans-Bredow-Institut (HBI), Hamburg
https://leibniz-hbi.de/
https://felixvictor.net/)

attachment.html (10.1 KB)

Jean_Christophe_Caze · June 14, 2019, 9:58am

Thank you so much for those quick answers !

attachment.html (10.3 KB)

Felix_Victor_Munch · June 14, 2019, 10:06am

No worries. I'd like to add, that, as Haiko suggests, the non-hierarchical SBM is much quicker and needs magnitudes less RAM. So it might also run on a recently bought laptop at very acceptable running times, even at your network scale. But not sure about how network density might affect this.

Cheers,

Felix

attachment.html (11.3 KB)

tiago · June 14, 2019, 11:46am

Hi,

The important quantity is the number of edges E, not the number of nodes.

The functions minimize_nested_blockmodel_dl() and minimize_blockmodel_dl()
have a complexity of O(E log² N), but the nested version takes longer.

As was already mentioned, the nested model has a large memory footprint,
since several state copies need to be kept.

There is an alternative approach that takes far less memory, which is to
instantiate a NestedBlockState object and just call multiflip_mcmc_sweep()
until equilibration. This is very competitive with the functions above, but
tend to finish much quicker and use less memory.

This is covered in the HOWTO:

https://graph-tool.skewed.de/static/doc/demos/inference/inference.html#id15

(Remember to replace mcmc_sweep() with multiflip_mcmc_sweep() in the code
above. I had forgotten to make this change the last release...)

Best,
Tiago

Jean_Christophe_Caze · June 14, 2019, 11:50am

Wow, i'll try and give my results and computing time if you want
Thank you for providing answers and for graph_tool !

attachment.html (12.3 KB)