It is so utterly simple, it is *exactly* like doing sums.
Oh, I see now. Thanks!
(I discuss this in the PRX paper, which I get the impression you did not
yet bother to read.)
Actually, I did read some of it before my last message (the terrorist
network example is pretty cool). But, no, I didn't read it front to back. I
apologize if it seems I'm not putting in effort trying to understand. I
just went ahead and read it. I understood more than I thought I would. I
found many minor typos, if you're interested in fixing them, but I also
found some more confusing errors (I *think* I'm reading the latest
version...):
In FIG 1 it says "probability q of missing edge", shouldn't that be
"spurious edge"?
At one point you say "If an entry is not observed, it has a value of
n_ij = 0, which is different from it being observed with n_ij > 0 as a
nonedge x_ij = 0." Shouldn't the terminology for n_ij = 0 be "unmeasured",
not "unobserved"? That really confused me in the following paragraphs.
"Inverson bracket" should be "Iverson bracket"
Not without making an _arbitrary_ choice of normalization, which results
in an _arbitrary_ number of edges being predicted.
Without saying how you normalized the scores to begin with, these values
are meaningless.
Okay, for clarity, here are two situations:
1) I take a similarity measure, say the "dice" similarity measure from
graph-tool, and compute it for every vertex pair. Then, I pick an arbitrary
cutoff and consider everything greater than that a "missing edge".
2) I turn this into a supervised classification problem by randomly
removing edges, computing whatever features for every vertex pair (say,
every similarity measure from graph-tool: dice, salton, hub-promoted, etc.),
and then training a binary classifier on the removed edges, the non-edges,
and the corresponding features (setting aside the tedious details of
properly constructing train/test/validation splits from the edge and
non-edge sets).
I am doing something like 2, not something like 1. I agree that 1 involves
an arbitrary choice; 2 doesn't. With 2, I can calibrate probabilities which
*do* have some meaning: they're based on my ability to predict the edges I
removed. This is what the Nearly Optimal Link Prediction paper does.
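To make 2 concrete, here is a minimal sketch of the kind of pipeline I
mean (the names `g`, `removed`, and `nonedges` are placeholders for my
actual graph, the held-out edges, and the sampled non-edges, and the
proper splitting is omitted, as noted above):

```python
import numpy as np
import graph_tool.all as gt
from lightgbm import LGBMClassifier

def pair_features(g, pairs):
    """Stack several graph-tool similarity measures into one
    feature matrix, one row per (u, v) vertex pair."""
    pairs = np.asarray(pairs)
    sims = ("dice", "salton", "hub-promoted", "hub-suppressed", "jaccard")
    return np.column_stack([gt.vertex_similarity(g, sim_type=s,
                                                 vertex_pairs=pairs)
                            for s in sims])

# g is the graph *after* removing the held-out edges; `removed` and
# `nonedges` are lists of (u, v) pairs (held-out edges / true non-edges)
X = np.vstack([pair_features(g, removed), pair_features(g, nonedges)])
y = np.concatenate([np.ones(len(removed)), np.zeros(len(nonedges))])

clf = LGBMClassifier()
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # raw, uncalibrated probabilities
```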
To be specific, the probabilities I showed you are being spit out of an
LGBMClassifier from LightGBM. The LightGBM docs recommend calibrating these
probabilities further using the procedure detailed in the scikit-learn
documentation section "1.16. Probability calibration", via
sklearn.calibration.CalibratedClassifierCV, i.e. Platt scaling or isotonic
regression. The references on the sklearn page are from 1999-2005, so I'm
assuming there are better ways to calibrate probabilities by now.
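In case it's useful, this is roughly what that calibration step looks
like (reusing the X and y from the sketch above):

```python
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV

# CalibratedClassifierCV fits the classifier on cross-validation folds
# and then fits a calibration map on the held-out folds
# (method="sigmoid" is Platt scaling; "isotonic" is isotonic regression)
calibrated = CalibratedClassifierCV(LGBMClassifier(),
                                    method="isotonic", cv=5)
calibrated.fit(X, y)
calibrated_probs = calibrated.predict_proba(X)[:, 1]
```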
From the paper,
Although this distribution can be used to
perform the same binary classification task, by using the
posterior marginal probabilities π_ij as the aforementioned
“scores,” it contains substantially more information. For
example, the number of missing and spurious edges (and,
hence, the inferred probabilities p and q) are contained in
this distribution and thus do not need to be prespecified.
Indeed, our method lacks any kind of ad hoc input, such as
a discrimination threshold [note that the threshold 1/2 in
the MMP estimator of Eq. (44) is a derived optimum, not an
input]. This property means that absolute assessments such
as the similarity of Eq. (45) can be computed instead of
relative ones such as the AUC.
Not sure what you mean by the AUROC being 'relative'. As you say, you need
a score threshold to make predictions in the end, but computing the AUROC
itself doesn't require a threshold. There are other non-threshold-based
metrics one can use.
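For instance (again assuming the y and probs from the earlier sketch),
both of these are computed directly from the raw scores, with no
threshold chosen anywhere:

```python
from sklearn.metrics import roc_auc_score, average_precision_score

auc = roc_auc_score(y, probs)           # AUROC: threshold-free
ap = average_precision_score(y, probs)  # average precision: also threshold-free
```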
Okay, I was curious, so I tested it out: I just removed 1000 edges (of 21k;
the structure was not destroyed) at random from the graph and ran this
marginal probability procedure with *defaults*: 1 measurement per edge, 1
observation per edge, 1 measurement per non-edge, 0 observations per
non-edge, alpha, beta, mu, nu all 1. For the equilibration I used
wait=1000, epsilon=1e-5, multiflip=True. For sampling I used 5000
iterations.
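Concretely, this is what I ran (modulo loading the graph), following the
reconstruction example in the graph-tool docs; the niter=10 per sweep is
just the value from that example, and if I've misused the API anywhere,
please correct me:

```python
import graph_tool.all as gt

# g is the graph with the 1000 edges removed; one measurement and one
# positive observation per remaining edge, defaults for non-edges
# (n_default=1, x_default=0) and hyperparameters (alpha=beta=mu=nu=1)
state = gt.MeasuredBlockState(g, n=g.new_ep("int", val=1),
                              x=g.new_ep("int", val=1),
                              n_default=1, x_default=0)

# equilibration
gt.mcmc_equilibrate(state, wait=1000, epsilon=1e-5, multiflip=True,
                    mcmc_args=dict(niter=10))

# sampling: collect the marginal graph over 5000 iterations
u = None
def collect_marginals(s):
    global u
    u = s.collect_marginal(u)

gt.mcmc_equilibrate(state, force_niter=5000, mcmc_args=dict(niter=10),
                    callback=collect_marginals)

eprob = u.ep.eprob  # posterior probability of each edge in the marginal graph
```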
In the resulting marginal graph, 2470 non-edges from the original graph
appear, and 224 edges from the original graph don't appear. Of the 1000
edges I removed, only 181 appear in the marginal graph. Okay, but for
those 181 edges, were the probabilities at least high? Here's the
histogram of probabilities: https://i.imgur.com/kZcR0W1.png As you can
see, almost all of the probabilities are low. How many exceed 0.5? 16.
So, of the original 1000 edges I removed, it correctly identified 16/1000 =
0.016 of them.
Given the true value of r, the optimal choice should be
beta = [(1-r)/r] * alpha, and making alpha -> infinity (in practice just a
large-enough value).
Okay, so even though this is not realistic in an exploratory context, I'll
try adjusting alpha and beta to be ideal. 1000 out of 21k is about 5%, so,
using your formula, I'll try alpha=100, beta=1900. Here are the results:
10850 non-edges from the original graph appear in the marginal graph, and
3 edges from the original graph don't appear. 450/1000 removed edges
appear in the marginal graph, and 25/1000 = 0.025 have probability > 0.5.
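(For reference, I passed these in through fn_params, which, if I'm
reading the graph-tool API correctly, holds the hyperparameters for the
missing-edge probability:)

```python
# same setup as before, but with informative hyperpriors for the
# missing-edge probability r: beta = [(1-r)/r]*alpha, r ~ 0.05, alpha = 100
state = gt.MeasuredBlockState(g, n=g.new_ep("int", val=1),
                              x=g.new_ep("int", val=1),
                              n_default=1, x_default=0,
                              fn_params=dict(alpha=100, beta=1900))
```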
So, slightly better, but still very poor. A binary classifier trained on
just the simple similarity measures implemented in graph-tool (as described
above) would give better results than this. Indeed, the conditional
probability method is much, much better (based on SHAP importance in my
binary classifier). How can that be? Surely I'm missing something or
making a mistake somewhere?
That is not what I meant. Setting n_default=0 means saying that every
non-edge is non-observed. You should set n=0 only for the edges
'removed', and for a comparable number of actual non-edges.
After reading the paper I see what you meant now. If I understand
correctly, this doesn't apply to my situation. Maybe I'm wrong? If someone
says "Here is a real-world graph. Tell me what the missing edges are", this
method doesn't appear applicable?
Although, I can test it out just to see if it performs any better:
I selected 1000 random edges and set their n and x to 0. Then, I selected
2000 random non-edges, added them to the graph, and set their n and x to 0
as well.
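Here's a sketch of that setup (assuming an undirected g without parallel
edges; the state construction is then the same as before):

```python
import numpy as np
import graph_tool.all as gt

n = g.new_ep("int", val=1)  # one measurement per observed edge
x = g.new_ep("int", val=1)  # one positive observation per observed edge

# mark 1000 random existing edges as unmeasured (n = x = 0)
edges = list(g.edges())
for i in np.random.choice(len(edges), 1000, replace=False):
    n[edges[i]] = 0
    x[edges[i]] = 0

# add 2000 random non-edges, also unmeasured
N = g.num_vertices()
count = 0
while count < 2000:
    s, t = np.random.randint(0, N, 2)
    if s != t and g.edge(s, t) is None:
        e = g.add_edge(s, t)
        n[e] = 0
        x[e] = 0
        count += 1

state = gt.MeasuredBlockState(g, n=n, x=x, n_default=1, x_default=0)
```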
Of the 1000 random edges, 981 were found in the marginal graph, and
271/1000 = 0.271 had probability > 0.5. Of the 2000 random non-edges, 952
were found in the marginal graph, and only 12 had probability > 0.5.
Assuming I did this right, this method seems to work much better! It didn't
appear to be fooled by the unmeasured non-edges.
Choosing alpha=beta=infinity means being 100% sure that the value of r
(probability of missing edges) is 1/2. Does that value make sense to
you? Did you remove exactly 1/2 of the edges?
I was playing with different values, as I showed you. I'll be more explicit
about my thought process:
Let's say I run this twice, once with alpha=50, beta=50 and once with
alpha=10, beta=90. The former run will have more possible non-edges,
right? Let's say that there are two non-edges, a and b, which were found
in both runs. Is it not the case that if P(a) < P(b) in the first run, it
will also be so in the second run? Is this an incorrect intuition? In other
words, will adjusting alpha and beta change the relative ordering of the
probabilities (for the common pairs)?
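If it helps, this is how I would check that empirically (a sketch; p1 and
p2 are hypothetical dicts mapping (u, v) pairs to the marginal
probabilities from the two runs):

```python
from scipy.stats import spearmanr

common = sorted(set(p1) & set(p2))
rho, _ = spearmanr([p1[k] for k in common],
                   [p2[k] for k in common])
print(rho)  # rho close to 1 would mean the relative ordering is preserved
```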
Thanks for your help, as always