Is Bayesian inference subjective? – Inverse Complexity Lab



This is a companion discussion topic for the original entry at https://skewed.de/lab/posts/objective-inference

This is an interesting post! Thanks for sharing. I hadn’t seen the description-length framing before. That’s a cool perspective.

At the same time, I’m not sure I understand how this defends Bayesian inference from the charge of being subjective. I was expecting it to be a post about selecting a prior, since this is the step in Bayesian inference that is most often accused of subjectivity. The post starts by mentioning the problem of prior selection, but quickly switches to talking about posterior maximization instead. You make a good argument that maximizing the posterior is well-motivated and not at all arbitrary. But I don’t think you defend against the more serious claim that the prior can be an arbitrary choice. Maybe I misunderstood the post?

I also had a question about the later discussion on comparing models through compression. Couldn’t we define a model to explain the protein interaction network that trivially has description length zero? We would define a model with zero parameters that outputs the pictured network every time. It seems to me that the description length of this model is zero. If so, what’s to stop us from declaring that this trivial model is the “true model” referenced in Figure 1?

You cannot simply declare the description length of some data to be zero by modelling it as a delta function.

The error in this logic is to forget that if you parametrize some data \boldsymbol x as a delta function,

P(\boldsymbol x | \boldsymbol x') = \delta_{\boldsymbol x , \boldsymbol x'},

then the description length would be

\begin{aligned} \Sigma(\boldsymbol x) &= -\log_2 P(\boldsymbol x | \boldsymbol x'=\boldsymbol x) - \log_2 P(\boldsymbol x'=\boldsymbol x),\\ &= - \log_2 P(\boldsymbol x'=\boldsymbol x). \end{aligned}

In other words, the location \boldsymbol x' of the delta function is a parameter of the distribution, which requires - \log_2 P(\boldsymbol x') bits to be described, where P(\boldsymbol x') is some prior. Therefore, with a delta function you have simply shifted all the encoding to the prior, and achieved nothing.
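To make this concrete, here is a minimal sketch (just an illustration of the point above, assuming for simplicity a uniform prior over all 2^N possible locations \boldsymbol x'): the delta-function model costs exactly as many bits as simply writing the data down.

```python
# Minimal sketch: with a uniform prior over all 2**N possible locations x',
# the delta-function "model" costs exactly as many bits as writing the data
# down directly. The encoding has just been shifted into the prior.
import math
import random

N = 64
x = [random.randint(0, 1) for _ in range(N)]  # the data: a random N-bit vector

# Delta model: P(x | x') = 1 when x' = x, so the likelihood term costs 0 bits,
# but the location x' must still be encoded under the prior P(x') = 2**-N.
delta_model_bits = -math.log2(1.0) - math.log2(2.0 ** -N)

# Direct encoding: P(x) = 2**-N, i.e. just write down the N bits.
direct_bits = -math.log2(2.0 ** -N)

print(delta_model_bits, direct_bits)  # 64.0 64.0
```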

Indeed, you cannot cheat with the description length: it needs to map to an encoding of the data, and arbitrary compression is neither trivial nor in fact possible.

Just think about this for a second: can you store all the contents of your hard drive using a single bit just by waving your hand and declaring them to come from some trivial distribution? Imagine how different our technology would be if we could achieve arbitrary compression like that.

And this is precisely my point about the objectivity of the description length (and Bayesian inference as a corollary): even though you can arbitrarily choose the likelihood and priors above (models are always choices we make), you cannot decide what description length they will have for some arbitrary data.

Ah, I see what you’re saying. If I do, I think I disagree. Here’s what I’m thinking.

There is famously no such thing as the Kolmogorov complexity of a given string. The complexity must be measured relative to a specific universal Turing machine. For any fixed string, one can dream up a universal Turing machine that outputs that string when fed no input, making the Kolmogorov complexity of that string zero.

This has (as far as I know) no real consequences in the practical theory of compression. It’s a cheat, and it won’t work for arbitrary data. As you point out, if one could cheat and infinitely compress any arbitrary string, the consequences would be absurd. But the cheat works just fine for any one fixed string. So yes, one can technically compress any one hard drive to a single bit if one is allowed to select the compression algorithm. This might seem like a quibble but I think it’s important here.

I don’t think the model I’m proposing is the same one you’re analyzing. The model you analyzed is x\mapsto \delta_x, and I agree this model achieves nothing if we’re trying to minimize description length. The model I’m proposing is analogous to the Kolmogorov complexity cheat strategy: it’s a model that takes no parameters (or a parameter in a singleton set \{\star\} if you prefer) and outputs the same \delta_{x_0} every time. With respect to this model (unless I’ve made a mistake here) I think the description length of x_0 would be

\Sigma(x_0) = -\log_2 P(x_0|\star) - \log_2 P(\star) = -\log_2(1) - \log_2(1) = 0

Have I computed this incorrectly?

This is of course a silly model. It’s only meant to show what I’m worried about: that I can hide the complexity of some data not in the model parameters, or in the output conditioned on the parameters, but in the internals of the model itself. Just as Kolmogorov complexity does not account for the “complexity” of the universal Turing machine, I fear the description length might not account for the “complexity” of a model if that complexity does not produce any parameters. Does this make sense?

Surely you understand that to realize such a specific Turing machine you will need to include the string in its specification. So, to begin with, we could easily rule out this kind of construction with the modest requirement that the reference Turing machine be data-independent. (In Bayesian terms, the prior cannot depend on the data.)

But more crucially, it turns out that KC is invariant in an important way: it is the same for all Turing machines up to a constant, which is independent of the data and depends only on the language used (see the invariance theorem in the Wikipedia article on Kolmogorov complexity). This constant is immaterial, since the relative compression between two programs will be identical in every language. The dependence on the Turing machine is therefore irrelevant for model selection: the outcome will be identical for every language, and so cannot account for “subjectivity”.

Note however that my argument did not rely on KC. We don’t really need to invoke KC or Turing machines for my main point.

The reason why what you computed is not quite a description length is that it does not correspond to an encoding.

You are simply not writing down all the parameters of your model: the location of your delta function needs to be specified. It’s impossible to describe the distribution otherwise. It’s not enough just to declare that it exists, you have to describe it! Once the location is specified, you have to encode it (put a prior on it).

If you actually try to write down the distributions you are describing for some data \boldsymbol x, in the case where it is a large random vector of size N, you will notice quite quickly that you will have to repeat that huge vector inside the function definition.

Go ahead and try to come up with an encoding (e.g. Huffman, arithmetic) that achieves zero bits for arbitrary data.

In other words, to obtain a description length, per definition, you have to describe the “internals of your model”, as you put it. This is your model specification and parameters. You just have to be honest! And the only way to be honest is to write down the actual encoding.

(There’s a reason why most compression competitions consider the size of the compressed file together with the program used to generate it. It’s precisely to avoid the trivial cheating you are describing, where the “compression” program has a single print() function that outputs the uncompressed file, and the “compressed” file has zero size. The definition of the description length prevents the exact same kind of cheating.)

I agree we can leave discussion of Kolmogorov complexity behind. It’s not that important here.

I think we’re at the core of my confusion. The things you’re saying are very reasonable. For instance, you’ve said that the model is dishonest, and it would be more honest to include the location of the delta as a parameter. I agree completely.

My confusion is that the formal definition of description length that you gave

-\log_2 P(A|b) - \log_2 P(b)

does not seem to include any such restrictions. For instance, while I agree that the model itself may have a very long description if x_0 is complicated, there is no term in the above expression for the “complexity” of the model itself, since I have chosen not to expose any parameters. Is there a mathematical way to decide that the model is cheating? That the location of the delta must be a parameter? Is there a definition that the model doesn’t satisfy?

Maybe the distinction isn’t mathematical and one must simply examine a model and make a judgement call about whether it’s dishonest? If so, that’s fine. Minimizing description length would still be a useful principle, assuming that one knows how to make those judgement calls.

The underlying assumption of the formula you quoted is that each term does not include more information about the data than what is explicitly specified by the parameters.

More formally, you have to show that each distribution can be mapped into an encoding (a table or a simple algorithm) that guarantees the data can in fact be described using the specified number of bits.

Here’s an example: N coin tosses, each with value x_i\in \{0, 1\}. We assume N is known. One possible model is to first sample the number M \in \{0, \dots, N\} of ones from a uniform distribution:

P(M|N) = \frac{1}{N+1}.

Conditioned on this value, we sample the final coin tosses also uniformly from

P(\boldsymbol x | M) = \frac{\delta_{\sum_i x_i, M}}{\binom{N}{M}}.

According to this model, the description length is

\Sigma(\boldsymbol x, M) = \log_2 {N\choose M} + \log_2(N + 1).

It’s easy to see that we can always use this many bits to encode any binary string, by doing, in sequence:

  1. Storing an index M \in \{0,\dots,N\}, which can always be done using \log_2 (N+1) bits.
  2. Storing an index between 1 and {N \choose M} corresponding to the specific string within the constrained set (e.g. ordered lexicographically), which always needs \log_2 {N\choose M} bits, as sketched below.
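If it helps, here is one way to implement step 2 (a quick sketch, ordering strings lexicographically with 0 before 1):

```python
# Quick sketch of step 2: the 0-based lexicographic rank of a binary string x
# among all strings of the same length with the same number of ones. Together
# with M from step 1, this rank identifies the string exactly.
import math

def rank(x):
    n, m = len(x), sum(x)
    r = 0
    for i, bit in enumerate(x):
        if bit == 1:
            # all strings sharing the prefix but with a 0 in position i come
            # first: their m remaining ones occupy the n-i-1 later positions
            r += math.comb(n - i - 1, m)
            m -= 1
    return r

print([rank(x) for x in ([0, 0, 1], [0, 1, 0], [1, 0, 0])])  # [0, 1, 2]
```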

Compare this now with the alternative encoding where the string is chosen uniformly at random

P(\boldsymbol x | N) = 2^{-N}

which gives a description length of N bits, arguably the most obvious encoding.

In both these cases the distribution amounts to a tangible, direct encoding, and it will always be like this if your parametrization is done explicitly.
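Here is a quick check of the two description lengths on a couple of concrete strings (a sketch; the two-part code wins for strings with few ones or few zeros, and pays only the small \log_2(N+1) overhead for “typical” ones):

```python
# Sketch: two-part code (M, then the rank within the constrained set) versus
# the plain N-bit uniform code.
import math

def two_part_bits(x):
    N, M = len(x), sum(x)
    return math.log2(N + 1) + math.log2(math.comb(N, M))

N = 100
sparse = [1] * 5 + [0] * (N - 5)     # 5 ones out of 100
typical = [i % 2 for i in range(N)]  # 50 ones out of 100

for x in (sparse, typical):
    print(f"M={sum(x):3d}  two-part: {two_part_bits(x):6.1f} bits  uniform: {len(x)} bits")
# M=  5  two-part:   32.8 bits  uniform: 100 bits
# M= 50  two-part:  103.0 bits  uniform: 100 bits
```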

Happy holidays! I hope you had a good New Year. Also, thank you for engaging with my questions at length. I appreciate your time.

Thanks for your example. It makes sense to me, but I’m not sure my question is resolved.

The formulas you give for the description lengths work for any string, and if I were choosing between your two models, they are exactly what I would want to see. My issue is that the blog post applies these same comparisons to specific outputs, such as the specific network in Figure 1.

Let x_0 be the complete works of Shakespeare in ASCII. Suppose I am interested in making the equivalent of your Figure 1 but for x_0 instead of the pictured network. I compute description lengths under both of the models you propose, along with various sophisticated language models.

I also try the following model. There is one parameter: \theta \in \{0,1\}. If \theta = 0, my model is the same as the second model you propose, in which the description length is always N. If \theta = 1, my distribution P(\cdot|\theta = 1) is a delta distribution at x_0. Our prior for \theta will be uniform.

For this model, the description length of any string is N+1 (with the one extra bit given for encoding the value of \theta), except for x_0, which has description length 1. This model, unsurprisingly, beats all of my other more sophisticated models.

I think this model corresponds to an encoding that is just as explicit as the ones you gave. In practice, when encoding a string x, the encoding could be 0 if x=x_0, and otherwise the encoding would be a 1 followed by the N bits necessary to encode x directly.
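To be fully explicit, here is the encoder/decoder I have in mind (just a sketch, with a toy string standing in for x_0):

```python
# Sketch of the "cheat" codec: one reserved 1-bit codeword for the privileged
# string x_0, and a 1-bit escape prefix for everything else. Note that x_0
# appears verbatim inside the codec itself -- that is the hidden complexity
# I'm worried about.
def make_cheat_codec(x0: str):
    def encode(x: str) -> str:
        return "0" if x == x0 else "1" + x   # 1 bit for x_0, N+1 bits otherwise
    def decode(code: str) -> str:
        return x0 if code == "0" else code[1:]
    return encode, decode

# Toy example (a short string standing in for the complete works of Shakespeare):
encode, decode = make_cheat_codec("101101")
assert encode("101101") == "0"           # description length 1
assert encode("000111") == "1000111"     # description length N + 1 = 7
assert decode(encode("000111")) == "000111"
```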

Somewhere I have made an illegal move. I think the options are as follows:

  1. We add a restriction that the model cannot be data-dependent. This seems difficult to enforce, since “data-dependent” is not defined if the “data” in question is a fixed string x_0. Indeed, I could argue that I simply find it highly likely to come across the complete works of Shakespeare. (The model I proposed is particularly brazen, but we could imagine softer examples where I post-hoc make reasonable-sounding modeling decisions that favor x_0 to give it a shorter description length.)
  2. We say that the model is cheating because I am mathematically required to expose x_0 as a parameter. I don’t see a way to mathematically enforce this either, since x_0 is a fixed string.
  3. We make a judgement call that the model is cheating because it is dishonest that I have not exposed x_0 as a parameter. I think this option is perfectly valid.
  4. We notice that the resulting explicit program is very long, since it contains a copy of x_0. We declare that the code itself must be included in the description length. I think this is a valid choice, though it raises other issues: the length of the decoding program will depend on the underlying universal Turing machine.

What are your thoughts on the options?

I feel I’m repeating myself, and we’re running around in circles.

The correct answer is point 4: the concrete encoding is not a subjective judgment call, it’s the mathematical definition of the description length. The Turing machine problem is a non-issue, due to the invariance I described already: translating an encoding from one machine to another just adds an unimportant data-independent constant to the description length. If we compare all our models using a specific Turing machine T1, then once we move to T2 our comparisons remain exactly the same. If you want to include in the comparison the best Turing machine for some data, this just becomes part of the encoding (and the generative model).

Maybe at this point it would be more productive to open a book on information theory (I recommend Cover and Thomas) and the works on MDL from Grünwald cited in the post above.

Thanks for clarifying. I think this is where we disagree: I don’t think it follows from your argument that we can brush aside the underlying UTM. The argument you gave provides upper bounds on the lengths of programs in one UTM given their lengths in another, but it does not preclude much shorter implementations.

One can easily imagine a (highly unusual) programming language in which my “cheat” decoding program is short, while the more reasonable programs we’ve been discussing are necessarily long. Writing our decoding programs in such a language would surely shuffle the results, putting the “cheat” model on top. This shows that the underlying UTM does matter for the rankings.

However, I expect the UTM does not matter much in practice. It’s probably rare to get such examples using popular general-purpose programming languages. One could select a popular language and start placing models on the kind of scale that appears in Figure 1, and I expect one could rest assured that the choice of language was unlikely to matter.

Also, if you feel the conversation is unproductive, feel free to hop out. Either way, thank you again for your time!

Yes, it does preclude exactly that. For two Turing machines u and v, and the same string s of arbitrary length N, we have that |K_u(s) - K_v(s)| < c_{u,v}, where K_u(s) is the description length using machine u, and c_{u,v}=O(1) is a CONSTANT that is independent of N. One can be pedantic and make this constant large in absolute terms, but never asymptotically.

That constant c_{u,v} can never be significantly large without including a random string in the specification of either u or v.

You already made this point before. Besides what I said above, namely that whatever difference between languages amounts only to a constant, please try to understand also the following point. It is impossible to define a Turing machine that outputs a large random string without including that string in its specification. Turing machines do not exist in a vacuum: you have to specify what the operations, the states, and the outputs are. For the purpose of computing description lengths, it only makes sense to talk about specifications that themselves have size O(1), while the data can have any size.

It is one of the most fundamental theorems of information theory that it is impossible to compress a random string of N bits, using whatever method, even crafting specific Turing machines. (That’s pretty much the definition of a random string.)
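The counting argument behind this is one line: there are 2^N binary strings of length N, but only

\sum_{k=0}^{N-1} 2^k = 2^N - 1

descriptions shorter than N bits, so no lossless code can shorten every string, and only an exponentially small fraction of strings can be compressed by any substantial amount.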

It’s not a matter of practice, or popular programming languages, it’s just theory. The statements I’m making are fundamental in algorithmic information theory, and can be made rigorous. I’m trying to give the gist, but I don’t believe I will convince you without giving a full blown lecture.

Let’s please stop this interchange here since I don’t think it makes sense to go much further, and it’s just polluting the comment section.

I apologize for the pollution.