G'day!

I'm struggling with a network with millions of nodes and billions of edges (the follow network of Australian Twitter users), trying to apply gt.inference.minimize_nested_blockmodel_dl().

I'm reducing the size by analysing k-cores, increasing k step-wise to test the limits. With 19k nodes and 12m edges I'm already exceeding the 64GB of RAM my machine currently has, so given that it's O(n ln^2(n)) I don't expect the whole network to fit, but I'd like to push the limit at least as far as possible.
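For context, the reduction step I'm doing is essentially k-core peeling. In practice I use graph-tool for this; the following is just a minimal pure-Python sketch of the idea (the function name and edge-list representation are my own, purely illustrative):

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the edge set of the k-core of an undirected graph:
    repeatedly peel vertices whose degree drops below k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Start with all vertices already below degree k.
    queue = [v for v in adj if len(adj[v]) < k]
    while queue:
        v = queue.pop()
        # Remove v and decrement its neighbours' degrees;
        # neighbours that fall below k get peeled in turn.
        for u in adj.pop(v, ()):
            adj[u].discard(v)
            if len(adj[u]) < k:
                queue.append(u)
    # Surviving edges, each reported once with u < v.
    return {(u, v) for u in adj for v in adj[u] if u < v}
```

Increasing k in this loop strictly shrinks the graph, which is what lets me probe where the memory limit sits.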

The only non-default parameter I've set so far is `mcmc_equilibrate_args={'epsilon': 1e-3}`.

I have four questions:

1. I've tried mcmc_args={'sequential': False, 'parallel': True}, but this did not seem to have much impact on speed for my smaller k-cores. Could it have more of an impact on bigger ones?

2. What does the warning regarding parallel sampling here (https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference.BlockState.mcmc_sweep) mean in the worst case, and when is it actually likely to apply?

3. I am considering either trying to fund a machine with more (expensive) RAM or just swapping memory to freely available disk space. What would you expect the impact on speed to be? From the behaviour of the algorithm so far it does not appear to be I/O-bound, but rather limited by available memory and by the sequential parts of the algorithm that run on one core only. Is swapping a viable option, or should I just rent a bigger machine straight away?

4. Is there any rule of thumb for epsilon? And what is its impact if it's too large?
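My current mental model of epsilon (which may well be wrong, hence the question) is a tolerance-based stopping rule, roughly like this toy loop. This is purely illustrative and not graph-tool's actual implementation; `step`, `entropy`, and `wait` are made-up names:

```python
def equilibrate(step, entropy, epsilon=1e-3, wait=10, max_iter=10_000):
    """Toy equilibration loop: run `step` until the change in
    `entropy` stays below `epsilon` for `wait` consecutive
    iterations. A too-large epsilon would stop the loop early,
    i.e. before the chain has actually equilibrated."""
    S = entropy()
    stable = 0
    for niter in range(1, max_iter + 1):
        step()
        S_new = entropy()
        stable = stable + 1 if abs(S_new - S) < epsilon else 0
        S = S_new
        if stable >= wait:
            break
    return S, niter
```

If that picture is right, a large epsilon mainly trades result quality for runtime, which is why I'd like a rule of thumb for choosing it.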

Thanks for any help (and also for any hints on how to reduce the network size while minimising the impact on the nested SBM results, using something more sophisticated than k-cores).

Cheers,

Felix