Minimize_blockmodel algorithm takes an hour to finish using my network

with 21000 nodes (like the hierarchical version), and it spends two

days and a half with overlap. However, I have run an hierarchical

analysis with overlap, and it is still running since 14 days ago. So

my first question is: is this time normal, or maybe there is any

problem? Do you know how long could it ussually takes?

The overlapping model is indeed much more time consuming than the

non-overlapping one. This is because in the overlapping case we need to

partition the _half-edges_ instead of the nodes, as we do in the

non-overlapping case. The number of half-edges is twice the number of

edges, which is typically one or two orders of magnitude larger than the

number of nodes. Because of this, and the fact that group mixtures are

more expensive to represent algorithmically, the overlapping model tends

to be slower.

So, the time difference you are observing is expected. Although, it is

difficult to say how long it will in fact take, since it depends on the

number of edges, not nodes.

In my experience, I find that there are comparatively few networks that

are better modelled by the overlapping model (see

http://dx.doi.org/10.1103/PhysRevX.5.011033). Usually, the

non-overlapping model provides a shorter description length, indicating

that it has a better explanatory power. Because of this, I generally

tend to work with the non-overlapping model, unless there is a very

specific reason not to.

Secondly, I have repeated some of these analysis with exactly same

options but I get different solutions (similar but different), so I

wonder if the algorithm is heuristic (I thought it was exact).

No, it is not exact. This type of inference problem is NP-Hard, hence

exact algorithms are not feasible. In the default parametrization, the

algorithm implemented is an agglomerative heuristic. The algorithm is

probabilistic, so it will indeed return a slightly different fit every

time. Thus the typical practice is to perform several fits and choose

the one with the smallest description length.

If more precision is desired, the algorithm can also be configured via

the appropriate parameters to either run a unity-temperature MCMC, Gibbs

sampling or simulated annealing. These, of corse, are more expensive

numerically.

My last question question regards bipartite analysis. I have two types

of nodes in my network and I wonder if there are any analytical

difference when running these algorithms with the bipartite option

(clabel=True, and different labels in each group of nodes) or not,

because it seems that the program “knows” my network is bipartite in

any case. If there are differences between bipartite and “unipartite”

analysis (clabel=False), is it possible to compare description length

between them to model selection?

The clabel parameter should not be Boolean; it needs to be a property

map that marks the partition constraints for each node. Namely, vertices

with different label values will not be clustered in the same

group. This does not affect the computation of the description length;

it simply guides the algorithm so that violating moves are not

attempted.

Indeed, as you noticed, the algorithm will always learn that you have a

bipartite network if that is the case, hence the 'clabel' parameter does

not need to be used for that purpose. The clabel parameter is useful

only in situations where there are other types of forbidden partitions

that the algorithm would run into otherwise.

Note that if the agglomerative heuristic is turned off, and MCMC is

used instead, the algorithm may in fact put nodes belonging to different

partitions of a bipartite network in the same group. The probability of

this happening will depend on how "strongly" bipartite the network

is. In this case, the clabel parameter is useful to forbid those

non-bipartite divisions, if they are undesired.

Note that there is also a 'pclabel' parameter. This is almost identical

to 'clabel', with the only difference that it _is_ in fact used to

compute the description length. Namely, it assumes that this division is

known a priori, and hence the description length for the partition will

include only subdivisions thereof. This is indeed useful, for example,

when you know from context that your network is bipartite, and this is

not a fact that you wish to learn. This generalizes also to other types

of known divisions. I recommend using this, if this is the case.

Best,

Tiago