G'day!

I'm struggling with a network with millions of nodes and billions of edges (the follow network of Australian Twitter users), trying to apply gt.inference.minimize_nested_blockmodel_dl().

I'm reducing the size by analysing k-cores, increasing k step-wise to test the limits. With 19k nodes and 12m edges I'm already exceeding the 64GB of RAM my machine currently has, so given that it's O(n ln^2(n)) I don't expect the whole network to fit, but I'd like to push the limit at least as far as possible.
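For context, the reduction step I'm doing is essentially k-core peeling. In practice I use graph-tool for this; the following is just a minimal pure-Python sketch of the idea (the function name and edge-list representation are my own, purely illustrative):

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the edge set of the k-core of an undirected graph:
    repeatedly peel vertices whose degree drops below k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Start with all vertices already below degree k.
    queue = [v for v in adj if len(adj[v]) < k]
    while queue:
        v = queue.pop()
        # Remove v and decrement its neighbours' degrees;
        # neighbours that fall below k get peeled in turn.
        for u in adj.pop(v, ()):
            adj[u].discard(v)
            if len(adj[u]) < k:
                queue.append(u)
    # Surviving edges, each reported once with u < v.
    return {(u, v) for u in adj for v in adj[u] if u < v}
```

Increasing k in this loop strictly shrinks the graph, which is what lets me probe where the memory limit sits.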

The only non-default parameter I've set so far is `mcmc_equilibrate_args={'epsilon': 1e-3}`.

I have four questions:

1. I've tried mcmc_args={'sequential': False, 'parallel': True}, but this did not seem to have much impact on speed for my smaller k-cores. Could it have more of an impact on bigger ones?

2. What does the warning regarding parallel sampling here (https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference.BlockState.mcmc_sweep) mean in the worst case, and when is it actually likely to apply?

3. I am considering either trying to fund a machine with more (expensive) RAM or just swapping memory to freely available disk space. What would you expect the impact on speed to be? From the behaviour of the algorithm so far it does not appear to be I/O-bound, but rather limited by available memory and by the sequential parts of the algorithm that run on one core only. Is swapping a viable option, or should I just rent a bigger machine straight away?

4. Is there any rule of thumb for epsilon? And what is its impact if it's too large?
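My current mental model of epsilon (which may well be wrong, hence the question) is a tolerance-based stopping rule, roughly like this toy loop. This is purely illustrative and not graph-tool's actual implementation; `step`, `entropy`, and `wait` are made-up names:

```python
def equilibrate(step, entropy, epsilon=1e-3, wait=10, max_iter=10_000):
    """Toy equilibration loop: run `step` until the change in
    `entropy` stays below `epsilon` for `wait` consecutive
    iterations. A too-large epsilon would stop the loop early,
    i.e. before the chain has actually equilibrated."""
    S = entropy()
    stable = 0
    for niter in range(1, max_iter + 1):
        step()
        S_new = entropy()
        stable = stable + 1 if abs(S_new - S) < epsilon else 0
        S = S_new
        if stable >= wait:
            break
    return S, niter
```

If that picture is right, a large epsilon mainly trades result quality for runtime, which is why I'd like a rule of thumb for choosing it.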

Thanks for any help (and also for any hints on how to reduce the network size while minimising the impact on the nested SBM results, using something more sophisticated than k-cores).

Cheers,

Felix