Kolmogorov and Shannon Information (Pt.1)

Pi is the perfect example of a simple object that Shannon’s information theory fails to model

In this article, we used the concepts of “complexity”, “entropy”, and “information” interchangeably, in the next sections we make the distinctions more clear. This article mostly comes from this book

The notion of Kolmogorov complexity has its roots in probability theory, information theory, and philosophical notions of randomness, and came to fruition using the recent development of the theory of algorithms. The idea is intimately related to problems in both probability theory and information theory. These problems can be interpreted as saying that the related disciplines are not tight enough; they leave things unspecified that our intuition tells us should be dealt with.

An adversary claims to have a true fair coin. However, when he flips it 100 times, the coin comes up 100 heads in a row. Upon seeing this, we claim that the coin cannot be fair. The adversary, however, appeals to probability theory, which says that every sequence of outcomes of a hundred coin flips is equally likely having probability 1/2^100 , and one sequence had to come up.

Probability theory gives us no basis to challenge an outcome after it has happened. We could only exclude unfairness in advance by putting a penalty side bet on an outcome of 100 heads. But what about 1010 . . .? What about an initial segment of the binary expansion of π?

Regular sequence Pr(00000000000000000000000000) = 1/2^26,
Regular sequence Pr(01000110110000010100111001) = 1/2^26,
Random sequence Pr(10010011011000111011010000) = 1/2^26 .

The first sequence is regular, but what is the distinction between the second sequence and the third? The third sequence was generated by flipping a quarter. The second sequence is very regular: 0, 1, 00, 01, . . . . The third sequence will pass (pseudo)randomness tests.

In fact, classical probability theory cannot express the notion of the randomness of an individual sequence. It can only express expectations of
properties of outcomes of random processes, that is, the expectations of properties of the total set of sequences under some distribution.

Only relatively recently, this problem has found a satisfactory resolution
by combining notions of computability and statistics to express the complexity of a finite object. We call this the Kolmogorov complexity.

Kolmogorov Complexity: the length of the shortest binary program from which the object can be effectively reconstructed.

It may be called the algorithmic information content of the object. This quantity turns out to be an attribute of the object alone, and absolute (in the technical sense of being computably invariant). It is the Kolmogorov complexity of the object.

Shannon’s classical information theory assigns a quantity of information to an ensemble of possible messages. All messages in the ensemble being equally probable, this quantity is the number of bits needed to count all possibilities. This expresses the fact that each message in the ensemble can be communicated using this number of bits. However, it does not say anything about the number of bits needed to convey any individual message in the ensemble.

The biggest difference between Shannon and Kolmogorov complexity is that Shannon measures it at the ensemble level (expectation) while Kolmogorov measures it at the individual level.

So the shift from the Shannon information to Kolmogorov is the shift from random variables to individual objects. This can be shown in cases where the statistical properties of the object don’t matter. For example, storing or transmitting the first 1,000,000,000,000,000 bits of π = 3.1415 . . . can be done the hard way using information theory, or the easy way using a small program incorporating an approximation algorithm that generates the successive bits π. Listen to Shannon and Kolmogorov:

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem.The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.” [Shannon]

And Kolmogorov:

“The probabilistic approach is natural in the theory of information transmission over communication channels carrying ‘bulk’ information consisting of a large number of unrelated or weakly related messages obeying definite probabilistic laws. . . . But what real meaning is there, for example, in asking how much information is contained in ‘War and Peace’ ? Is it reasonable to include this novel in the set of ‘possible novels,’ or even to postulate some probability distribution for this set? Or, on the other hand, must we assume that the individual scenes in this book form a random sequence with ‘stochastic relations’ that damp out quite rapidly over a distance of several pages? [. . .] Our definition of the quantity of information has the advantage that it refers to
individual objects and not to objects treated as members of a set of objects
with a probability distribution given on it.
The probabilistic definition can be convincingly applied to the information contained, for example, in a stream of congratulatory telegrams. But it would not be clear how to apply it, for example, to an estimate of the quantity of information contained in a novel or in the translation of a novel into another language relative to the original. I think that the new definition is capable of introducing in similar applications of the theory at least clarity of principle.” [Kolmogorov]

Shannon ignores the object itself but considers only the characteristics of the random source of which the object is one of the possible outcomes (i.e. random variable), while Kolmogorov considers only the object itself to determine the number of bits in the ultimate compressed version irrespective of the manner in which the object arose. For almost every Shannon theory notion there turns out to be a Kolmogorov complexity theory notion that is equivalent in the sense that the expectation of the latter is closely related to the former.

For an example, we can consider the entropy measure of π and a random sequence. As the experiment shows they both have the same entropy.

The gives:

pi entropy: 3.3218434709282008
random sequence entropy: 3.321777392084359

Those are almost the same because the Shannon entropy just checks the frequencies and doesn’t care about the order or structure other than statistical regularities!

In the next section, we delve deeper into formalism and how these two concepts are related.