Complexity principles in a nutshell!

rohola zandie
12 min read · Mar 12, 2023


In a scene from the movie Amélie, the main character is waiting for a date with her beloved Nino, but Nino does not show up on time. Amélie, upset, starts to speculate about what might have happened to him. Two hypotheses come to her mind. The first is that Nino may have missed the address that Amélie had given him, and therefore did not show up for the date. The second hypothesis is that, before finding the address, a group of bank robbers took Nino hostage; when the police chased them, they fled but had an accident. When Nino regained consciousness, he had lost his memory, and a con artist, mistaking him for someone else, took him on a boat to Istanbul. There, he ran into some Afghan attackers who were trying to steal Russian missiles, and they captured him. But when their truck hit a mine in Tajikistan, he survived the incident and became a freedom fighter!

This part may be just a joke in the movie, but it raises a question: why is the second hypothesis so ridiculous and far-fetched? Why do we prefer simple hypotheses to explain an event? The problem of choosing between rival explanations of an event given the same set of evidence is called model selection. It is much like what detectives do: they have a set of evidence and data and are looking for an explanation of the events that have occurred. The scientific method is very similar to this process: we are looking for theories to explain the data. However, as we have said in previous sections, this search for explanations cannot be unbiased. Preferring a simple explanation to a complicated one is a type of bias! Why should we think that what happened behind the scenes requires a simple explanation?

Now let’s consider another angle. If our hypothesis is very complicated, it can explain more data, but it is also harder to understand and test. After seeing a crime scene, you can invent a story for every small part of the events and painstakingly stitch them together. But would anyone believe it?

William of Ockham was an English scholar and philosopher who proposed a principle that pervades much of the science and philosophy that came after him: in every situation we should seek the simplest explanation and cut away unnecessary details. That is why it is called Occam’s Razor, or the principle of parsimony.

The more organized a hypothesis is, the easier it is for it to explain the data, and the more likely it is that data points found later will also be consistent with it. This is exactly the concept of learning: drawing a hypothesis from a limited amount of data that also explains the rest of the data. If we have discovered a simple pattern, we do not need to invent new stories (or modify our hypothesis) for every new data point; the same simple pattern explains the rest of the data just as well. This is what all researchers are looking for.

The principle of minimum description length (MDL) is a mathematical and algorithmic way of quantifying Occam’s Razor. In this theory, “learning” is defined as finding patterns and regularities that compress the data for us. Here’s an example: say your friend’s phone number is 09123702469 and you want to read it to another friend over the phone. There doesn’t seem to be any way to shorten it; you have to read each digit individually. But suppose the number is 09123333555. This time it’s easier, because there is a pattern in the digits: you can simply say 0912, then four threes and three fives. The pattern itself gives us “compression of the description.” In other words, the second number is simpler than the first! But how can complexity be defined?
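To make this concrete, here is a minimal Python sketch; the run_length_description helper and the “count x digit” notation are just illustrative choices for this example, not a standard encoding. It describes a digit string by its runs, so the patterned number gets a shorter description:

```python
from itertools import groupby

def run_length_description(number: str) -> str:
    """Describe a digit string by its runs, e.g. '3333' -> '4x3'."""
    return " ".join(f"{len(list(group))}x{digit}" for digit, group in groupby(number))

print(run_length_description("09123702469"))
# 1x0 1x9 1x1 1x2 1x3 1x7 1x0 1x2 1x4 1x6 1x9  -> no shorter than the number itself
print(run_length_description("09123333555"))
# 1x0 1x9 1x1 1x2 4x3 3x5                      -> the pattern compresses the description
```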

This was the beginning of an extensive effort by scientists in the field of algorithmic information theory to quantify complexity. Consider the following three sequences:

001001001001001001001001…001001

011100110100110100110110…1001100

00001100000101010…0010000001000

The first sequence is a repetition of 001; the second comes from flipping a fair coin and is completely random; and the third comes from a biased coin that produces roughly four times as many 0s as 1s. In this sense, the first sequence is the most ordered, the second the most disordered, and the third somewhere in between.

Kolmogorov, the Russian mathematician, found an ingenious way to measure complexity:

The complexity of a sequence of numbers (or any object) is the length of the shortest program that generates it.

Here we don’t care about the specifics of the programming language (we can fix a single language such as C or Python; the difference measured across languages is just a constant). Looked at this way, the first sequence is the most ordered and hence has a very short program that generates it (a for loop that prints 001 n times). For the second sequence, however, there is no shortcut: since there is no order, we have to print the whole sequence, so the length of the program is of the same order as the length of the output (complete chaos).
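As a rough sketch of this contrast in Python (the lengths and the literal bits below are placeholders, not the actual sequences from above):

```python
# Sequence 1: a tiny program generates an arbitrarily long string,
# so its Kolmogorov complexity is small.
def ordered_sequence(n: int) -> str:
    return "001" * n          # the "for loop that prints 001 n times"

# Sequence 2: there is no pattern to exploit, so the shortest "program"
# is essentially the data itself; we must spell out every bit.
def random_sequence() -> str:
    return "0111001101001101001101101001100"   # the whole sequence, verbatim
```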

You might think the third sequence has the same nature, but in fact it contains a pattern that can be harnessed by a simpler program. To understand this, we have to go beyond Kolmogorov’s definition. The problem with that definition is that it is “exact”: it asks the program to reproduce the data completely, whereas in reality we usually only want to generate the data “approximately.” So we have two challenges:

1- Finding the simplest program (or model)

2- Making the program as accurate as possible (or as close to the data as possible)

Kolmogorov complexity is uncomputable, so it cannot even be used to find the simplest programs. We first have to find some other way to measure the complexity and accuracy of our hypotheses.

Data transmission

In the previous section, we defined the best hypothesis among competing hypotheses, for a fixed set of data, by the criterion of “the shortest program that generates that data.” But we saw that we need a framework for measuring the complexity of “real world” data under different hypotheses. One important application of this is data transmission, just like reading a phone number to someone on the other end of the line.

In the early years of electronic communication and sending digital messages, data compression was vital due to the low bandwidth of networks. If you wanted to send a message like:

001001001001001…001001

It is not very smart to send the whole sequence; instead we can send a compressed version of it, i.e., its minimum description. Concerns like this led people like Claude Shannon to think about efficient ways to send information over communication channels.

Information theory emerged from this very need. Shannon and others realized that the everyday data we use contains many patterns that can help transmission through data compression. However, these patterns are not always as simple as a straightforwardly repeating sequence; more often it is a pattern accompanied by randomness.

Let’s think about the example of sending text data over the internet. At the character level, languages have statistical patterns: in English, for example, the letter ‘e’ occurs with the highest frequency while ‘z’ has the lowest.

This “pattern” helps us see why it is smart to allocate fewer bits to ‘e’ and more to ‘z’. In other words, we don’t have to allocate the same number of bits (ceil(log2(26)) = 5) to every letter. Huffman coding is one way to do this uneven bit allocation.
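Here is a minimal sketch of Huffman coding in Python; the sample sentence and the implementation details are only illustrative. Frequent characters end up with shorter bit strings:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Greedily merge the two least frequent groups, prefixing '0'/'1' bits."""
    # Each heap entry: (total frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("minimum description length compresses everyday english text")
print(sorted(codes.items(), key=lambda item: len(item[1]))[:5])  # most frequent symbols get the shortest codes
```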

The next question that comes to mind is: is there a simpler or more efficient way to do this? Here we only took advantage of the raw letter frequencies, but there are richer patterns, such as correlations between letters or 2-grams (like ‘aa’, ‘ab’, …). So there is always a model, a hypothesis that represents the data as a probability distribution P(D) with some accuracy and some complexity, but we don’t know it! We also have to give up the Platonic mindset that such a true model exists: all models are wrong, but some are useful (George Box).

We have talked about models being simple, but how simple? There is a quote attributed to Albert Einstein:

Everything should be made as simple as possible but not simpler. (Albert Einstein)

Let’s take a look at the following graphs:

We have an independent variable x (x-axis) and a dependent variable y (y-axis). We can consider three models to explain the data. The first from the left is the simplest model (H1): the data comes from (is generated by) a linear relationship, and every deviation from it is noise. The next model (H3) is the most complex one (like Amélie’s second hypothesis): it goes into all the details of why each data point is exactly where it is, with almost no randomness (everything happens for a reason!). The last one (H2) is a compromise between the other two: the world is not as simple as a linear relationship, but we also have to accept that there is some randomness in the data.

In each case, we use the model to “explain” the data by giving a shorter description of it. We have to always minimize two things:

1- The length of the data given the hypothesis L(D|H) (accuracy)

2- The length of the hypothesis L(H) (complexity)

Here L is a function that measures the length of the data or the model in bits. For the line model above, L(H1) is very small, but the length of the data given this oversimplified model, L(D|H1), is too big! This means the model is not accurate, and we have to store a lot of extra noise to describe D. For the next one, L(D|H3) is very short, which means the model is accurate, but this comes at a cost: a large L(H3), i.e., a very complex model.

The final case is a compromise in which both L(H2) and L(D|H2) stay moderate, so their sum is the smallest.

In other words, we are minimizing the sum of these two values:

H* = argmin_H [ L(H) + L(D|H) ]

This means the minimum description length for the data D is obtained by minimizing both L(H) (the complexity, i.e., the length of the hypothesis) and L(D|H) (the length of the data given hypothesis H) over the space of all hypotheses.
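As a rough numerical sketch of this trade-off (the bit costs below, 32 bits per coefficient for L(H) and n/2 · log2(RSS/n) for L(D|H), are crude assumptions for illustration, not anything from the article), we can score polynomial fits of different degrees and pick the one with the shortest total description:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**2 - x + rng.normal(0, 1.0, size=x.shape)   # noisy quadratic data

def description_length(degree: int) -> float:
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    L_H = 32.0 * (degree + 1)                            # complexity: bits for the coefficients
    L_D_given_H = 0.5 * len(x) * np.log2(rss / len(x))   # accuracy: bits for the residual noise
    return L_H + L_D_given_H

for degree in (1, 2, 9):                                 # H1-like, H2-like, H3-like models
    print(degree, round(description_length(degree), 1))
# the quadratic usually wins: the line pays in L(D|H), the degree-9 model pays in L(H)
```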

From an information-theoretic perspective, the optimal code length for an outcome x under a probability distribution p is already defined as

L(x) = -log p(x)

Then we can write L(D) = -log p(D), L(H) = -log p(H) and L(D|H) = -log p(D|H), and the minimization becomes:

H* = argmin_H [ -log p(H) - log p(D|H) ] = argmax_H p(H) p(D|H) = argmax_H p(H|D)

which means the best model for the data is the one that maximizes the posterior probability of H given D. This is the principle of maximum a posteriori (MAP). In other words, we have derived this principle from MDL; however, MDL is much more general.
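For a concrete (assumed) instance of MAP at work: with a Gaussian likelihood p(D|H) = N(y | Xw, sigma^2 I) and a Gaussian prior p(H) = N(w | 0, tau^2 I) over the coefficients of a deliberately over-flexible polynomial, minimizing -log p(H) - log p(D|H) reduces to ridge regression, so the prior plays exactly the role of the complexity penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.5 * x - 0.8 * x**3 + rng.normal(0, 0.1, size=x.shape)

degree, sigma, tau = 9, 0.1, 1.0
X = np.vander(x, degree + 1)                 # over-flexible design matrix
lam = sigma**2 / tau**2                      # -log p(H) becomes the penalty lam * ||w||^2

w_map = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)   # MAP = ridge solution
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]                           # accuracy only, no prior

print(np.abs(w_map).max(), np.abs(w_mle).max())
# the MAP coefficients stay small, while the unpenalized fit tends to chase the noise
```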

Maximum Entropy Principle

Now we can look at this in terms of entropy (or disorder). If all the letters of the alphabet occurred with the same frequency, there would be no way to predict what is more likely to come next in a sequence of letters: everything would happen with the same probability. Any unevenness in the probability distribution (i.e., in the normalized frequencies) is a sign of lower entropy and more pattern.

We want the model with the “highest entropy” among all the descriptions that can “explain” the data. The highest entropy means the fewest assumptions about the data. When we know nothing about the data, we assign the highest entropy distribution: all letters get the same probability. This is also known as the principle of indifference, or Laplace’s non-informative prior.

This might seem like a paradox: how can the best model have the highest entropy when we are looking for the lowest entropy (the least disorder)? The key is that we only consider models that CAN already explain the data reasonably well (with good accuracy). As in the opening example, there can be very complicated explanations (or models) for the data, but we are looking for the simplest among them. In other words, the simplest model is the one with the highest entropy among all the models that explain the data.

The maximum entropy principle and the minimum description length principle are mathematically very close, and Jaynes showed how they can coincide.
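A classic illustration of the maximum entropy principle is Jaynes’ dice problem; here is a minimal sketch, where the constraint value 4.5 is just an assumed example. Among all distributions over the faces 1..6 with mean 4.5, the maximum entropy one has the exponential form p_i ∝ exp(lam · i), and we only need to solve for the multiplier lam:

```python
import numpy as np

faces = np.arange(1, 7)
target_mean = 4.5

def mean_for(lam: float) -> float:
    p = np.exp(lam * faces)
    p /= p.sum()
    return float((p * faces).sum())

# Bisection on the Lagrange multiplier: the tilted mean increases with lam.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) < target_mean:
        lo = mid
    else:
        hi = mid

p = np.exp(lo * faces)
p /= p.sum()
print(np.round(p, 3))            # probabilities tilted toward the high faces
print(-(p * np.log2(p)).sum())   # the largest entropy among all mean-4.5 distributions
```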

Free energy principle

We learned that finding P(D) is challenging, and that we should always look for a hypothesis H that is simple, yet not so simple that the model becomes inaccurate. But let’s focus more on H. We can introduce another probability distribution q that is close to p but is formulated as a parametric distribution q_phi; it can be a Gaussian or anything else that fits the problem at hand. To make things even more practical, note that any data point x \in D is high dimensional, so we introduce a hidden (latent) variable z of lower dimension that actually controls x. Introducing z is more than just dimensionality reduction: it lets us avoid computing p(x) itself, which is intractable.

Graphical model: the observed data x depends on the latent variable z.

Now we can think of q as the model and p as the true distribution (the one that comes from the data); what we are looking for are the parameters of q that make it as close as possible to p given the data x. To make “close” precise, we need a notion of distance between distributions; the standard choice is the KL divergence. So we are basically looking for:

phi* = argmin_phi KL( q_phi(z|x) || p(z|x) )

If we follow the definition of KL divergence, this can be expanded to:

KL( q_phi(z|x) || p(z|x) ) = E_{q_phi(z|x)} [ log q_phi(z|x) - log p(z|x) ]

and, using Bayes’ rule p(z|x) = p(x|z) p(z) / p(x):

= E_{q_phi(z|x)} [ log q_phi(z|x) - log p(x|z) - log p(z) ] + log p(x)

The term log p(x) does not depend on q_phi, so it comes out of the expectation; and since it contains no parameter phi, it is a constant and we can drop it:

F(phi) = E_{q_phi(z|x)} [ log q_phi(z|x) - log p(x|z) - log p(z) ]

and finally:

F(phi) = -E_{q_phi(z|x)} [ log p(x|z) ] + KL( q_phi(z|x) || p(z) )

This is known as the variational free energy (VFE) and was introduced by many researchers, most notably Karl Friston.

The nice thing about this formulation is that it decomposes the dissimilarity between the target and model distributions into two terms. The first is the likelihood term, which we want to maximize (note the negative sign); the second is the complexity of the model q with respect to the prior p(z). Here, complexity is the divergence between the variational density and our prior beliefs about the hidden states z. Two things are important in this derivation:

1- We get rid of the intractable p(x) of the target distribution by introducing z; since log p(x) does not depend on phi, it drops out as a constant.

2- We move log p(z) next to the second term, E_{q_phi(z|x)}[ log q_phi(z|x) ], which is basically the negative entropy of q, so that together they form the KL divergence between q_phi(z|x) and p(z), i.e., the complexity.

The negative of the VFE is also known as the evidence lower bound (ELBO) in machine learning, which we try to maximize.
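As a tiny worked example (the model below, p(z) = N(0, 1) and p(x|z) = N(z, sigma^2) with a single observation, is an assumption chosen for illustration, not something from the article), we can fit q_phi(z) = N(mu, s^2) by maximizing the ELBO and check that it recovers the exact posterior of this conjugate model:

```python
import numpy as np
from scipy.optimize import minimize

x_obs, sigma = 2.0, 0.5     # one observation and the likelihood noise scale

def negative_elbo(phi):
    mu, log_s = phi
    s2 = np.exp(2 * log_s)
    # Accuracy: E_q[log p(x|z)] in closed form for a Gaussian likelihood.
    accuracy = -0.5 * np.log(2 * np.pi * sigma**2) - ((x_obs - mu)**2 + s2) / (2 * sigma**2)
    # Complexity: KL( N(mu, s^2) || N(0, 1) ), the divergence from the prior.
    complexity = 0.5 * (s2 + mu**2 - 1 - np.log(s2))
    return -(accuracy - complexity)          # minimizing the VFE = maximizing the ELBO

phi_star = minimize(negative_elbo, x0=[0.0, 0.0]).x
mu_q, s2_q = phi_star[0], np.exp(2 * phi_star[1])
print(mu_q, s2_q)   # about 1.6 and 0.2, the exact posterior mean and variance for this model
```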

Putting it all together

Now let’s summarize all of these principles:

1- Occam’s Razor: prefer the simplest hypothesis that still explains the evidence.

2- Minimum description length: minimize L(H) + L(D|H).

3- Maximum a posteriori (MAP): minimize -log p(H) - log p(D|H), i.e., maximize the posterior p(H|D).

4- Maximum entropy: among the models that explain the data, pick the one with the highest entropy (the fewest assumptions).

5- Free energy (ELBO): trade off accuracy, E_q[log p(x|z)], against complexity, KL(q_phi(z|x) || p(z)).

We should note that, except for the last one, these principles are usually stated mostly in terms of the complexity of the model; however, for all of them we must also keep the likelihood (accuracy) term in mind.

These principles show up again and again in different fields, from information theory to thermodynamics to algorithmics, and they all point in the same direction. Recognizing them is always useful and can give better insight into every field of science.
