God doesn’t play dice! God creates the dice: Dirichlet distribution (Pt.2)

rohola zandie
Dec 23, 2019

As we saw in the previous part, the dice or coins that represent a probability distribution can themselves come from another distribution, because they, like anything else in this world, are imperfect and subject to random variation.

The Beta distribution beautifully models the uncertainty in the process of making coins, and the extension of this idea to more than two faces (heads and tails) seems straightforward. The Dirichlet distribution is exactly that extension. Following the same line of reasoning as in the previous part, we can write down the Dirichlet distribution as follows:
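In symbols, for a point x = (x_1, ..., x_K), the density is:

```latex
f(x_1, \ldots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}
```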

Here alpha is a vector of size K. The (multivariate) Beta function is defined as:
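In terms of the Gamma function:

```latex
B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}
```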

The x_i’s here follow the same pattern as before: together they form a probability distribution for a coin, a die, or anything with K faces. So the x_i’s are all positive and sum to 1; in mathematical terms, they lie on a (K-1)-simplex.

Just as before, the alpha values determine how the process of making K-faced dice works. Larger alpha values mean less variability and higher production quality (less bias), while a relatively large alpha_j indicates a bias toward the j-th face of the die.

We can draw samples from the Dirichlet distribution with parameter alpha, denoted:
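In symbols:

```latex
\mathbf{q} \sim \operatorname{Dir}(\boldsymbol{\alpha})
```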

So q is a probability mass function (pmf) that can easily be represented as a list of probabilities such as [0.1, 0.3, 0.5, 0.1].
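With NumPy this is a one-liner; a minimal sketch (the alpha values here are illustrative):

```python
import numpy as np

alpha = [1.0, 3.0, 5.0, 1.0]     # concentration parameters, one per face
q = np.random.dirichlet(alpha)   # one pmf of length 4, drawn from Dir(alpha)

print(q)          # four positive numbers (varies per draw)
print(q.sum())    # 1.0 up to floating point
```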

Properties

Now, assuming we can draw from the Dirichlet distribution (later we will go into the details of how to sample from it), we want to simulate the process of manufacturing dice. Here we produce 25 dice under different settings of alpha.

First, think about a high-quality production line that produces good, almost unbiased dice. We can write the code for it as follows:
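A minimal sketch of such a production line, assuming NumPy's built-in Dirichlet sampler (the variable names are illustrative):

```python
import numpy as np

alpha = [10, 10, 10, 10, 10, 10]             # large, symmetric: high quality
dice = np.random.dirichlet(alpha, size=25)   # 25 pmfs, one per die

for i, die in enumerate(dice):
    print(f"die {i}: {np.round(die, 3)}")
```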

And here is the output:

alpha=[10,10,10,10,10,10]

As you can see, the pmfs of the produced dice are not very biased, which means the dice are almost fair. Now let's try another value of alpha, for example alpha=[8, 1, 1, 1, 1, 1]. The output:

alpha=[8, 1, 1, 1, 1, 1]

Now, the produced dice are obviously biased toward the first face.

Perhaps the most interesting case is when the production line is at its worst and creates dice with random biases, for example alpha = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1].

alpha = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

Now the dice are heavily biased, but the biases themselves are scattered uniformly at random among the different faces.

If all alpha values are the same (a symmetric Dirichlet), we have a symmetric "way" of creating dice, but this does not mean the dice themselves are fair. It only means we are impartial in creating fair or unfair (good or bad) dice. If, in addition, the alpha values are large, there is little variance in the process: with a big alpha we produce very good, nearly fair dice.

Simplex

There is another way to look at the Dirichlet distribution. As discussed in the previous section, what the Dirichlet distribution produces are pmfs, and pmfs lie on a simplex. A simplex is a generalization of a triangle, and all points on a simplex share a simple property: their coordinates sum to a fixed constant. For our purposes, a point on the simplex is a list of non-negative numbers that add up to 1, and any such list can represent a pmf!

Ternary plots show what the Dirichlet distribution looks like for different parameter settings. For example, for alpha=[1,1,1] we have:

alpha=[1,1,1], var=0.0555

As you can see, the samples are spread evenly over the whole simplex. This is the uniform case of the Dirichlet distribution: in terms of dice, every kind of bias is produced with equal probability.

For alpha=[0.1, 0.1, 0.1] we have:

alpha=[0.1, 0.1, 0.1], var=0.1709

Here most samples lie in the corners, which means the distribution tends to produce dice biased toward one of the faces each time.

And finally, the distribution with alpha=[10, 10, 10]:

alpha=[10, 10, 10], var=0.0071

Here the samples are very concentrated (low variance), corresponding to a high-quality production line.

There are two ways to measure the variance of the Dirichlet distribution. One is the empirical variance computed from a set of samples; the other is the analytical variance formula:
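With alpha_0 the sum of the concentration parameters, the per-coordinate variance is:

```latex
\operatorname{Var}[X_i] = \frac{\alpha_i\,(\alpha_0 - \alpha_i)}{\alpha_0^2\,(\alpha_0 + 1)}, \qquad \alpha_0 = \sum_{j=1}^{K} \alpha_j
```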

Using this formula, you can verify the variances reported in the examples above.
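A quick sketch comparing the two measurements for the symmetric alpha=[10, 10, 10] case (the sample size is an arbitrary choice):

```python
import numpy as np

alpha = np.array([10.0, 10.0, 10.0])
a0 = alpha.sum()

# analytical variance of each coordinate: alpha_i(a0 - alpha_i) / (a0^2 (a0 + 1))
analytic = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))

# empirical variance from many samples
samples = np.random.dirichlet(alpha, size=100_000)
empirical = samples.var(axis=0)

print(analytic)   # about 0.00717 per coordinate, matching var=0.0071 above
print(empirical)  # close to the analytical values
```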

How to sample?

Up until now we have used NumPy's function for drawing samples from the Dirichlet distribution, but the natural question is how to sample from it with a well-defined algorithm. There are multiple ways to do this; we focus on three methods.

Polya Urn

This is a very simple and easy-to-understand way of drawing samples. To start, for each i=1..K, put alpha_i balls of color i in an urn (every i has a different color). At each iteration, draw a ball uniformly at random from the urn and return it together with an additional ball of the same color. As we go on, the proportions of the colors in the urn converge to a pmf that is a sample from Dir(alpha).

An example of drawing samples using a Polya urn. Credit: https://bit.ly/2ZcYwXx

The algorithm can be described as the following python code:
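A minimal sketch of the urn process, using the function name poly_urn_simulation that the text refers to (the num_draws parameter is an illustrative choice; more draws give a finer approximation):

```python
import random

def poly_urn_simulation(alpha, num_draws=10_000):
    """Approximate one sample from Dir(alpha) with a Polya urn."""
    counts = list(map(float, alpha))   # initial "balls"; fractions are allowed
    colors = range(len(counts))
    for _ in range(num_draws):
        # draw a ball with probability proportional to the current counts
        i = random.choices(colors, weights=counts)[0]
        counts[i] += 1.0               # put it back with one extra ball
    total = sum(counts)
    return [c / total for c in counts]  # color proportions ~ a Dir(alpha) draw
```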

There is a subtlety here: in this setting we can allow fractions of balls, not just whole numbers. To test the correctness of the above algorithm (experimentally), you can replace the call to NumPy's Dirichlet function in the previous section with the poly_urn_simulation function.

Note that this does NOT mean that in the limit as the number of balls in the urn goes to infinity the probability of drawing balls of each color is given by the pmf alpha/alpha_0. Instead, asymptotically, the probability of drawing balls of each color is given by a pmf that is a realization of the distribution Dir(alpha). Thus, asymptotically we have a sample from Dir(alpha).

The stick-breaking

This approach generates a random vector with a Dir(alpha) distribution by iteratively breaking a stick of length 1 into K pieces, in such a way that the lengths of the K pieces follow a Dir(alpha) distribution.

Stick-breaking can be described as the process of breaking a stick randomly and iteratively. For simplicity, consider K=3. We start with a stick of length 1.

1- Break the stick at position u_1, where u_1 ~ Beta(alpha_1, alpha_2+alpha_3). Set q_1 = u_1; the remaining stick has length (1-u_1).

2- Take the remaining stick and break it at u_2 ~ Beta(alpha_2, alpha_3). Now q_2 = u_2*(1-u_1), and finally the last piece is q_3 = 1-(q_1+q_2).

This process extends easily to general K. Here is the algorithm in Python:
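A minimal sketch with NumPy, following the generalization described above (at each step, break off a Beta(alpha_i, sum of the remaining alphas) fraction of what is left):

```python
import numpy as np

def stick_breaking(alpha):
    """Draw one sample from Dir(alpha) by breaking a unit stick."""
    alpha = list(alpha)
    q = []
    remaining = 1.0
    for i in range(len(alpha) - 1):
        # fraction of the remaining stick to break off at this step
        u = np.random.beta(alpha[i], sum(alpha[i + 1:]))
        q.append(u * remaining)
        remaining *= (1.0 - u)
    q.append(remaining)   # the last piece takes whatever is left
    return q
```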

We can show the simulation on a graph like below for 10 samples:

alpha=[0.1, 0.1, 0.1]

As you would expect, the result is very sparse. And for alpha=[10, 10, 10]:

which is much more centered and even.

The code for creating the above figures:

From Gamma Distribution

This approach is more computationally efficient than the other two, but it is less intuitive. It relies on the Gamma distribution. The algorithm has two steps:

1- Generate Gamma realizations: for i=1..K, draw a number z_i:
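In symbols (shape alpha_i, rate 1):

```latex
z_i \sim \operatorname{Gamma}(\alpha_i, 1)
```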

2- Normalize them to form a pmf:
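That is:

```latex
q_i = \frac{z_i}{\sum_{j=1}^{K} z_j}
```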

And that’s it! The Python code generates samples using this approach:
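A minimal sketch with NumPy (the function name dirichlet_from_gamma is illustrative):

```python
import numpy as np

def dirichlet_from_gamma(alpha):
    """Draw one sample from Dir(alpha) via Gamma random variables."""
    z = np.random.gamma(shape=alpha, scale=1.0)   # z_i ~ Gamma(alpha_i, 1)
    return z / z.sum()                            # normalize into a pmf
```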

If you are interested in more details of the math behind the Dirichlet distribution, I strongly recommend following the references.

[1] https://bit.ly/2ZcYwXx
