Kolmogorov and Shannon Information (Pt.2)

rohola zandie
6 min read · Jul 3, 2021


An example of 425-bit universal Turing machine in pixels [1]

Information, according to Shannon, is an ensemble notion. The entropy of a random variable X with equally likely outcomes in an ensemble S is the quantity H(X) = log d(S), where d(S) is the number of elements of S and logarithms are taken base 2. This measures the uncertainty of the choice before we have selected a particular value for X, and the information produced if we assign a specific value to X. By choosing a particular message a from S, we remove the entropy from X through the assignment X = a and produce or transmit the information I = log d(S) by our selection of a. Since information is usually measured in whole bits, the number of bits I′ that needs to be transmitted from sender to receiver is I′ = ⌈log d(S)⌉.
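To make the arithmetic concrete, here is a tiny Python check (the set size d(S) = 1000 is an invented example):

```python
import math

# A hypothetical message set S with d(S) = 1000 equally likely messages.
d_S = 1000

H = math.log2(d_S)      # entropy of the uniform choice, about 9.97 bits
I_bits = math.ceil(H)   # whole bits needed to transmit one chosen message: 10

print(H, I_bits)
```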

Define the entropy of a random variable X with probabilities P(X = x_i) = p_i by

H(X) = \sum_i p_i \log \frac{1}{p_i} = -\sum_i p_i \log p_i \qquad (eq. 1)
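As a quick illustration of eq. 1, here is a minimal sketch in Python (the biased-coin distribution is just an invented example):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i log2 p_i, in bits (eq. 1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit, the maximum for two outcomes
print(entropy([0.9, 0.1]))   # biased coin: about 0.469 bits
```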

The development of this theory immediately gave rise to at least two different questions. The first observation is that the concept of information as used in the theory of communication is a probabilistic notion, which is natural for information transmission over communication channels.

The second observation is more important and is exemplified in Shannon’s statement, “Messages have meaning [ . . . however . . . ] the semantic aspects of communication are irrelevant to the engineering problem.” In other words, can we answer a question such as, “what is the information in this book” by viewing it as an element of a set of possible books with a probability distribution on it? Or that the individual sections in this book form a random sequence with stochastic relations that damp out rapidly over a distance of several pages? And how can we measure the quantity of hereditary information in biological organisms, as encoded in DNA? Again there is the possibility of seeing a particular form of animal as one of a set of possible forms with a probability distribution on it. This seems to be contradicted by the fact that the calculation of all possible life forms in existence at any one time on earth would give a ridiculously low figure like 2^100 .

We are interested in a measure of the information content of an individual finite object, and in the information conveyed about an individual finite object by another individual finite object. Here, we want the information content of an object x to be an attribute of x alone, and not to depend on, for instance, the means chosen to describe this information content. Making the natural restriction that the description method should be effective, the information content of an object should be computable invariant among the different description systems. Pursuing this thought leads straightforwardly to Kolmogorov complexity.

In his original paper, Shannon restricted attention to those codes for which no code word is the prefix of another, the so-called prefix-codes. This restriction is motivated by the implicit assumption that descriptions will be concatenated and must be uniquely decodable. Uniquely decodable codes and prefix-codes share the same sets of code-word lengths. Moreover, the minimal average code-word length L of a prefix-code encoding source-word sequences emitted by a random variable with probability distribution P satisfies:

H(P) \le L < H(P) + 1 \qquad (eq. 2)

where H(P) is the entropy of the random variable. This is obtained (up to at most one additional bit) by assigning code-word length l_i = ⌈log(1/p_i)⌉ to the i-th outcome of the random variable, where p_i is the probability of that outcome. In this way, H(P) may be interpreted as the minimal expected length of a description of an outcome, using a prefix-free code.
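A small sketch of this assignment, using an invented four-symbol distribution, checks both the Kraft inequality and the bound of eq. 2:

```python
import math

probs = [0.4, 0.3, 0.2, 0.1]   # an invented source distribution

# Shannon-Fano style code-word lengths l_i = ceil(log2(1/p_i))
lengths = [math.ceil(math.log2(1 / p)) for p in probs]

# Kraft inequality sum_i 2^{-l_i} <= 1: a prefix code with these lengths exists.
kraft = sum(2 ** -l for l in lengths)

H = -sum(p * math.log2(p) for p in probs)        # entropy H(P), about 1.85 bits
L = sum(p * l for p, l in zip(probs, lengths))   # expected code-word length, 2.4 bits

print(lengths, kraft)   # [2, 2, 3, 4], 0.6875
print(H <= L < H + 1)   # True, matching eq. 2
```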

To define Kolmogorov complexity, we fix a universal Turing machine U. The machine U reads a program p from its input tape, possibly together with a string y on an auxiliary tape, then outputs x and halts; we write U(p, y) = x.

The unconditional Kolmogorov complexity K(x) is, by definition, the length of a shortest self-delimiting binary program p for U, with the empty string ε on the auxiliary tape, that outputs x.

In the example:

x = 10101010101010101010101010

we can encode x by a small program, p = print("10" * 13). I chose Python, but the choice of language doesn't matter: switching to another language only changes the length by an additive constant.

Intuitively, the Kolmogorov complexity of a binary string cannot exceed its own length, because the string is obviously a (literal) description of itself.
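Kolmogorov complexity itself is not computable, but any off-the-shelf compressor gives a computable upper bound on description length. A rough sketch with Python's zlib (the particular strings are made up for illustration):

```python
import os
import zlib

def compressed_bits(data: bytes) -> int:
    # Length of the zlib-compressed data in bits: a crude upper bound on the
    # length of a description of `data` (up to the constant-size decompressor).
    return 8 * len(zlib.compress(data, 9))

regular = b"10" * 500_000          # highly regular: a short program generates it
random_ = os.urandom(1_000_000)    # typical random bytes: essentially incompressible

print(compressed_bits(regular), 8 * len(regular))   # far below the literal length
print(compressed_bits(random_), 8 * len(random_))   # close to (slightly above) the literal length
```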

The conditional prefix Kolmogorov complexity K(x|y) is the length of a shortest binary program p on the input tape of U with y on the auxiliary tape such that U outputs x and halts without reading the next symbol after p.

The prefix complexity K has more intuitive properties than the plain complexity C, which is what we get when we drop the prefix-coding requirement. For example, the measure K is subadditive, that is:

K(xy) \le K(x) + K(y) + O(1) \qquad (eq. 3)

which aligns with our intuition about complexity: if x and y have complexities K(x) and K(y), then the complexity of their concatenation should be no more than roughly the sum of the two.
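Real compressors show a rough analogue of eq. 3: compressing a concatenation typically costs no more than compressing the parts separately, plus a small constant. A hedged sketch (the strings are invented, and zlib's compressed lengths only stand in for K):

```python
import zlib

c = lambda s: len(zlib.compress(s, 9))   # compressed length in bytes, a crude proxy for K

x = b"the quick brown fox jumps over the lazy dog " * 200
y = b"lorem ipsum dolor sit amet, consectetur adipiscing elit " * 200

# Typically c(x + y) is no larger than c(x) + c(y) plus a small constant,
# mirroring K(xy) <= K(x) + K(y) + O(1).
print(c(x + y), c(x) + c(y))
```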

As we might expect, most strings have close to maximal complexity, i.e., they are incompressible. This is made precise by the following definition and bound (and by the counting argument sketched below):

Let x be a string of length n. If K(x) ≥ n, we call x incompressible. Also, the prefix complexity of a shortest program x* for x does not exceed its own length:

K(x^*) \le l(x^*) + O(1) = K(x) + O(1) \qquad (eq. 4)
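For completeness, here is the standard counting argument behind "most strings are incompressible" (a sketch, not spelled out in the text above):

```latex
% There are fewer than 2^{n-c} programs shorter than n - c bits:
\sum_{i=0}^{n-c-1} 2^{i} \;=\; 2^{n-c} - 1 \;<\; 2^{n-c},
% while there are 2^{n} strings of length n, and each program outputs at most
% one string. Hence the fraction of strings of length n with K(x) < n - c is below
\frac{2^{n-c}}{2^{n}} \;=\; 2^{-c},
% so, for example, fewer than one string in a thousand can be compressed by 10 bits or more.
```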

Since the set of values K(x) is the length set of a prefix code, the first inequality of the noiseless coding theorem applies. So, using the entropy of eq. 1, we can write:

\sum_x P(x) \log \frac{1}{P(x)} \;\le\; \sum_x P(x)\, K(x) \qquad (eq. 5)

Moreover, an effective version of the Shannon–Fano code guarantees that:

K(x) \le \log \frac{1}{P(x)} + c_P \qquad (eq. 6)

where c_P is a constant that depends on the distribution P but not on x.

Together this shows that the entropy:

H(P) = \sum_x P(x) \log \frac{1}{P(x)} \qquad (eq. 7)

of the distribution P is asymptotically equal to the expected complexity:

\sum_x P(x)\, K(x) \qquad (eq. 8)

with respect to the probability P(·).

The fact that the P-expectations of log 1/P(x) and K(x) are close does not necessarily mean that those quantities are close together for all arguments. However, the values log 1/P(x) and K(x) are close to each other with high probability.
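A rough way to see this empirically: sample strings from a simple computable source P and compare log 1/P(x) with a compressed length, which only upper-bounds K(x). The sketch below uses an invented Bernoulli(0.9) bit source and zlib, so the compressed lengths overshoot log 1/P(x) somewhat, but both stay far below the literal encoding:

```python
import math
import random
import zlib

random.seed(0)
p, n = 0.9, 100_000   # an invented Bernoulli(p) source emitting n bits per string

def sample():
    return "".join("1" if random.random() < p else "0" for _ in range(n))

def neg_log_prob(x):
    # log 1/P(x) in bits under the Bernoulli source
    k = x.count("1")
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p))

def compressed_bits(x):
    # zlib-compressed length in bits: a crude computable upper bound on K(x)
    return 8 * len(zlib.compress(x.encode(), 9))

for _ in range(3):
    x = sample()
    print(round(neg_log_prob(x)), compressed_bits(x), 8 * len(x))
```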

The additive constants in the plain Kolmogorov complexity and prefix complexity of an object depend on the particular choice of universal machine fixed as the reference machine.

In the next section, we delve deeper into the definition of probability and its relation to complexity, which leads to the notion of algorithmic probability.

[1] Li, M. and Vitányi, P. (2008). An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed. Springer, New York.
