Hi! I recently learned about graph-tool and am especially excited about the discussion comparing “descriptive” and “inferential” approaches to community detection.
Here is a challenge I am facing when trying to apply the stochastic block model to my project:
- My network is quite dense (undirected, density 0.86), and its structure is expressed mainly through differences in edge weights.
- The edge weight is the number of interactions between two nodes in a given time period. In other words, a non-existent edge is conceptually the same as an edge with weight 0. (I have learned that in other settings this is not necessarily the case.)
- I applied minimize_blockmodel_dl(), but the result is not very good (see the “Supplementary information” section below for how I judge this).
- I have tried different assumptions about the weight distribution (i.e. normal and exponential for real-valued weights), as well as the “refinement” step mentioned in the documentation, without much improvement. A rough sketch of what I ran is shown right after this list.
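For reference, this is roughly the baseline I have been running. It is a minimal sketch, assuming the graph lives in G_gt with an internal edge property “weight”; the “refinement” part is my reading of the merge-split sweeps described in the inference HOWTO, so please correct me if I misunderstood:

import numpy as np
import graph_tool.all as gt

# Baseline weighted SBM fit; I swap "real-exponential" for "real-normal"
# to test the other weight assumption.
state = gt.minimize_blockmodel_dl(
    G_gt,
    state_args=dict(
        recs=[G_gt.ep.weight],
        rec_types=["real-exponential"],
    ),
)

# "Refinement": zero-temperature merge-split sweeps, as suggested in the docs.
for _ in range(100):
    state.multiflip_mcmc_sweep(niter=10, beta=np.inf)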
My question is: what is the best practice for applying minimize_blockmodel_dl() to a network like mine? I have some tentative solutions in mind:
- Model simplification: Part of the stochastic block model is no longer very meaningful in my case (namely the adjacency term P(A | b, \theta)), so perhaps it can be dropped so that the model focuses on the weight differences? For example, maybe the following code does the trick:
import graph_tool.all as gt

# drop the adjacency likelihood so that only the edge weights inform the partition
state = gt.minimize_blockmodel_dl(
    G_gt,
    state_args=dict(
        recs=[G_gt.ep.weight],
        rec_types=["real-exponential"],
    ),
    multilevel_mcmc_args=dict(
        entropy_args=dict(adjacency=False),
    ),
)
- Weight distribution tuning: We could try to give the model a better description of the weights by explicitly providing the distribution parameters. For example, in my case the weights are best fit by a gamma distribution (alpha ≈ 0.002, beta ≈ 400). Maybe these should be passed to the function via rec_params (a tentative sketch is shown after this list).
- Network filtering: Maybe it is not even a good idea to apply the SBM to such a dense network, since we lose a lot of statistical power by not really being able to model whether an edge exists or not?
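For the weight-tuning option, this is the kind of call I have in mind. It is only a tentative sketch: I am not sure the gamma parameters fitted to the raw weights are what the “real-exponential” hyperprior in rec_params actually expects, so the alpha/beta values below are just my fitted estimates carried over:

import graph_tool.all as gt

state = gt.minimize_blockmodel_dl(
    G_gt,
    state_args=dict(
        recs=[G_gt.ep.weight],
        rec_types=["real-exponential"],
        # fitted gamma parameters of the weights (assumption: these are the
        # hyperparameters rec_params expects for "real-exponential")
        rec_params=[dict(alpha=0.002, beta=400)],
    ),
)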
Let me know if you have any insights on the above, or if there is an even more significant flaw in how I am conducting this analysis.
Supplementary information
How do I know the “result is not very good”?
I built a synthetic benchmark B that is representative of the original network G:
- B has the same number of nodes, the same number of edges, and the same sequence of weights as G.
- I assume a certain number of communities C in B (an empirical estimate) and randomly generate the planted partition:
community_assignments = np.random.choice(range(1, C + 1), size=N, replace=True)
- I assume B is fully connected. I sort the weight sequence from largest to smallest and assign weights to internal (within-community) connections first; once every internal connection has received a weight, the remaining weights go to external (between-community) connections. (A rough sketch of this construction and of the evaluation is given after this list.)
- I apply the community detection algorithm to B and check whether it recovers the generated partition.
- The “descriptive” method can recover the planted partition (normalized mutual information = 1), but my attempt with minimize_blockmodel_dl() yields more communities than expected and a lower normalized mutual information (≈ 0.7).
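For completeness, here is a rough sketch of how I build B and compute the NMI. Assumptions: C is my empirical estimate of the community count, the original graph is G_gt with an internal edge property “weight”, and I use scikit-learn for the NMI; the shuffling of pairs is just my way of making the assignment within each group arbitrary:

import random
from itertools import combinations

import numpy as np
import graph_tool.all as gt
from sklearn.metrics import normalized_mutual_info_score

N = G_gt.num_vertices()
C = 5  # empirical estimate of the number of communities
weights = sorted(G_gt.ep.weight.a, reverse=True)  # same weight sequence as G

# random planted partition
planted = np.random.choice(range(1, C + 1), size=N, replace=True)

# split all vertex pairs into internal (same community) and external pairs
internal = [(u, v) for u, v in combinations(range(N), 2) if planted[u] == planted[v]]
external = [(u, v) for u, v in combinations(range(N), 2) if planted[u] != planted[v]]
random.shuffle(internal)
random.shuffle(external)

# largest weights go to internal pairs first, the rest to external pairs,
# keeping the same total number of edges as G
pairs = (internal + external)[:len(weights)]

B = gt.Graph(directed=False)
B.add_vertex(N)
w = B.new_edge_property("double")
for (u, v), wt in zip(pairs, weights):
    w[B.add_edge(u, v)] = wt
B.ep["weight"] = w

# fit the weighted SBM on the benchmark and compare with the planted partition
state = gt.minimize_blockmodel_dl(
    B,
    state_args=dict(recs=[B.ep.weight], rec_types=["real-exponential"]),
)
found = state.get_blocks().a
print("inferred groups:", state.get_nonempty_B(),
      "NMI:", normalized_mutual_info_score(planted, found))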