What is the best practice for a dense, weighted graph?

Hi! I recently learned about graph-tool and am especially excited about the discussion comparing “descriptive” and “inferential” approaches to community detection.

Here is the challenge I am facing when trying to apply the stochastic block model (SBM) approach to my project:

  • I have a network that is quite dense (undirected, with a density of 0.86), and its structure is expressed mainly through differences in edge weight.
  • In this network, the edge weight is the number of interactions between two nodes in a given time period. In other words, a non-existent edge means essentially the same thing as an edge with weight 0. (I have learned that in other settings this is not necessarily the case.)
  • I applied minimize_blockmodel_dl(), but the result is not very good.
  • I have tried different assumptions about the weight model (i.e. normal and exponential distributions for real-valued weights), and also the “refinement” step mentioned in the documentation (see the sketch after this list), but the result is still not very good.
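
To be concrete, my attempts look roughly like the sketch below (simplified; g stands for my graph and g.ep.weight for its edge-weight property, and the exact arguments varied between runs):

import numpy as np
import graph_tool.all as gt

# weighted fit, with "real-normal" or "real-exponential" as the weight model
state = gt.minimize_blockmodel_dl(
    g, state_args=dict(recs=[g.ep.weight], rec_types=["real-exponential"]))

# "refinement" as in the inference HOWTO: zero-temperature merge-split sweeps
for i in range(100):
    state.multiflip_mcmc_sweep(niter=10, beta=np.inf)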

My question is: what is the best practice for applying minimize_blockmodel_dl() to a network like mine? I have some tentative ideas in mind:

  • Model simplification: part of the stochastic block model is no longer very meaningful in my case (i.e. P(A | b, θ), the likelihood of the bare adjacency), so maybe we can avoid it and make the model focus more on the weight differences? For example, maybe something like the following does the trick?
state = gt.minimize_blockmodel_dl(g, state_args=dict(recs=[g.ep.weight], rec_types=["real-exponential"]))  # plus which option would drop the P(A | b, θ) term?
  • Weight distribution tuning: we could try to get a better treatment of the weights by explicitly providing the distribution parameters. For example, in my case the best fit of the weights is achieved with a gamma distribution (alpha ≈ 0.002, beta ≈ 400). Maybe we should pass that to the function as rec_params (see the sketch after this list)?
  • Network filtering: maybe it is not even a good idea to apply the SBM to such a dense network, since we are losing quite a lot of the statistical power that comes from modeling whether a link exists or not?
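
For the second idea, I imagine something like the sketch below; the hyperparameter names "alpha" and "beta" are just my guess at what rec_params expects for an exponential weight model, so please correct me if this is not how it is meant to be used:

# pass fitted hyperparameters instead of the default uninformative priors
state = gt.minimize_blockmodel_dl(
    g, state_args=dict(recs=[g.ep.weight],
                       rec_types=["real-exponential"],
                       rec_params=[{"alpha": 0.002, "beta": 400}]))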

Let me know if you have any insights on the above, or if there is an even more significant flaw in how I am conducting this analysis.

Supplementary Information
How do I know the “result is not very good”?
I built a synthetic benchmark B that is representative of the original network G (a simplified sketch of the construction appears after the list below).

  • B has the same number of nodes, the same number of edges, and the same sequence of weights as G.
  • I assume a certain number of communities C (an empirical estimate) in B and randomly generate the community partition: community_assignments = np.random.choice(range(1, C + 1), size=N, replace=True)
  • I assume B is fully connected. I sort the weight sequence from largest to smallest, assign the largest weights to internal (within-community) edges first, and once all internal edges have received a weight, assign the remaining weights to external (between-community) edges.
  • I apply the community detection algorithm to see whether it recovers the planted partition.
  • A “descriptive” method finds the planted partition (normalized mutual information = 1), but my attempt with minimize_blockmodel_dl() leads to more communities than expected and a lower normalized mutual information (≈ 0.7).
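
Here is a stripped-down sketch of the benchmark construction and scoring; the node and community counts are placeholders, and the weight sequence is a random stand-in for the real one taken from G:

import numpy as np
import graph_tool.all as gt
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

np.random.seed(42)
N, C = 50, 4                                                # placeholder sizes
b_true = np.random.choice(range(C), size=N, replace=True)   # planted partition

# B is fully connected: every pair of nodes gets an edge
edges = list(combinations(range(N), 2))
internal = [e for e in edges if b_true[e[0]] == b_true[e[1]]]
external = [e for e in edges if b_true[e[0]] != b_true[e[1]]]

# largest weights go to internal edges first, the rest to external edges
weights = np.sort(np.random.exponential(scale=1.0, size=len(edges)))[::-1]
ordered = internal + external

g = gt.Graph(directed=False)
g.add_vertex(N)
w = g.new_ep("double")
for (u, v), wt in zip(ordered, weights):
    w[g.add_edge(u, v)] = wt
g.ep["weight"] = w

state = gt.minimize_blockmodel_dl(
    g, state_args=dict(recs=[g.ep["weight"]], rec_types=["real-exponential"]))
b_found = state.get_blocks().a

print("NMI:", normalized_mutual_info_score(b_true, b_found))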

Unfortunately, it’s not possible to provide a solution to your problem or an answer to your question with the information provided.

To enable us to understand the situation, you need to provide a minimal working example that shows the problem.

This is very important! If you provide only the part of the code that you believe causes the problem, or a vague description, then it is not possible to understand the context that may have contributed to it.

You need to provide us a complete example that runs and reproduces the problem you are observing.

In particular, you gave only a very broad description of how your data was generated and what you considered to be a good result, but you have not explained how you tried to fit the SBM, i.e. what parameters you used.

You also make some statements that seem contradictory. You say that your weights are integer counts (i.e. “number of interactions”), but that you tried to use models for continuous (i.e. real-valued) weights. Why not use the discrete distributions?
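
For count data, the natural starting point would be something along these lines (a sketch, assuming the weights are stored in an edge property map g.ep.weight):

state = gt.minimize_blockmodel_dl(
    g, state_args=dict(recs=[g.ep.weight], rec_types=["discrete-poisson"]))

(or "discrete-geometric", depending on what is appropriate for your data).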

Please try to produce a minimal working example.