I am trying to fit a stochastic block model to a network in order to
ultimately do link prediction. I am reasonably happy comparing different
versions of the SBMs to each other by computing and comparing posterior odds
ratios so I can determine which configuration is best at describing my data.
What I am not yet sure about is how well the model actually describes the
data. After all, if all of the candidates fit my particular network poorly,
then the fact that one is better than the rest doesn't guarantee that it is
any good in absolute terms, and my later predictions based on it will also
be off. Is there some metric or procedure I could use to get an idea of
this? Does anybody have any advice?
Thank you for your help in advance!
With best wishes,
I'm not aware of any usable "absolute" quality of fit criterion for SBMs.
One can of course use a test statistic based on any particular metric (say
clustering coefficient, spectral gap, etc.) and test whether networks
generated from the fitted model match the data, by computing a p-value, for
example. But this should not be used as a criterion for model selection,
since models that overfit will also have a high p-value.
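
As a rough illustration of the check described above, here is a minimal
sketch in plain Python. It assumes the partition is already given, fits the
block-pair edge probabilities by maximum likelihood, and uses the global
clustering coefficient as the test statistic. All function names
(fit_block_probs, sample_sbm, transitivity, predictive_p_value) are
hypothetical helpers written for this example, not part of any library, and
the simple Bernoulli SBM used here is an assumption, not necessarily the
variant you are fitting.

```python
import random
from itertools import combinations

def fit_block_probs(edges, block_of):
    """Maximum-likelihood edge probability per block pair, for a fixed partition."""
    pair_count, edge_count = {}, {}
    for i, j in combinations(range(len(block_of)), 2):
        key = tuple(sorted((block_of[i], block_of[j])))
        pair_count[key] = pair_count.get(key, 0) + 1
        if (i, j) in edges:
            edge_count[key] = edge_count.get(key, 0) + 1
    return {key: edge_count.get(key, 0) / pair_count[key] for key in pair_count}

def sample_sbm(block_of, probs, rng):
    """Draw a simple undirected graph as a set of (i, j) pairs with i < j."""
    return {
        (i, j)
        for i, j in combinations(range(len(block_of)), 2)
        if rng.random() < probs[tuple(sorted((block_of[i], block_of[j])))]
    }

def transitivity(edges, n):
    """Global clustering coefficient: 3 * triangles / connected triples."""
    nbrs = [set() for _ in range(n)]
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    triples = sum(len(s) * (len(s) - 1) // 2 for s in nbrs)
    # Summing common neighbours over all edges counts every triangle 3 times.
    closed = sum(len(nbrs[i] & nbrs[j]) for i, j in edges)
    return closed / triples if triples else 0.0

def predictive_p_value(edges, block_of, n_samples=200, seed=0):
    """One-sided Monte Carlo p-value: how often simulated graphs reach the observed statistic."""
    rng = random.Random(seed)
    n = len(block_of)
    probs = fit_block_probs(edges, block_of)
    observed = transitivity(edges, n)
    hits = sum(
        transitivity(sample_sbm(block_of, probs, rng), n) >= observed
        for _ in range(n_samples)
    )
    return (hits + 1) / (n_samples + 1)

# Synthetic "data": a two-block graph with made-up parameters.
blocks = [0] * 20 + [1] * 20
data = sample_sbm(blocks, {(0, 0): 0.3, (0, 1): 0.05, (1, 1): 0.3}, random.Random(1))

p_two_blocks = predictive_p_value(data, blocks)
# Extreme overfit: one block per node, so the fitted model memorises the graph.
p_one_block_per_node = predictive_p_value(data, list(range(40)))
```

With one block per node the fitted probabilities are exactly 0 or 1, every
simulated graph is a copy of the data, and the p-value is trivially as high
as it can be. That is the reason a high p-value alone says nothing about
which model to prefer.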
Prediction quality is also a relative measure, since no probabilistic model
will predict edges with 100% precision, even if it is the actual model that
generated the data.
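
To make this concrete, here is a small sketch under made-up assumptions: a
two-block graph is drawn from known probabilities (p_in and p_out are
arbitrary values chosen for the example), and every node pair is then scored
with its true edge probability. The auc helper is a hypothetical function
written here, and scoring all pairs of a single sampled graph is a
simplification of a proper held-out link prediction experiment.

```python
import random
from itertools import combinations

def auc(pos_scores, neg_scores):
    """Probability that a random edge outscores a random non-edge (ties count half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(0)
block_of = [0] * 30 + [1] * 30
p_in, p_out = 0.3, 0.05  # assumed within-block and between-block probabilities

pos, neg = [], []  # scores of pairs that became edges / non-edges
for i, j in combinations(range(60), 2):
    prob = p_in if block_of[i] == block_of[j] else p_out
    # The score of each pair is the *true* probability that generated it.
    (pos if rng.random() < prob else neg).append(prob)

true_model_auc = auc(pos, neg)
```

Even though the scores come from the exact generative model, true_model_auc
stays clearly below 1, because many within-block pairs are non-edges and
some between-block pairs are edges. A prediction score is therefore only
meaningful relative to what other models achieve on the same data.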
In the end, we can only compare models to each other (via posterior
odds/description length, prediction quality, etc.). We can never really
compare them to the "truth", since we have no access to it. Remember that
data is never the "truth" since it contains noise.