
Measuring information: Shannon's H

    In 1948 Claude Shannon developed a measure for the information content of a message or dataset: Shannon's information entropy (H):

    $$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$

    Although different applications use different log bases (b), here we will use base 2. Hence:

    $$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$

    To be clear, H measures entropy, i.e., uncertainty (the 'non-information', if you want) rather than the information itself. Here are some examples where we assume equal probability for each possibility $x_i$ within example X.

    Example (X)             Possibilities (x_i)   Probability of each (P(x_i))   H(X)
    One day we will die     1                     1                              0
    Coin toss               2                     .5                             1
    Rock, paper, scissors   3                     .333...                        1.585
    4-sided die roll        4                     .25                            2
    5-sided die roll        5                     .20                            2.322
    6-sided die roll        6                     .166...                        2.585

    Note that as the number of possibilities increases, so does the entropy. Assuming that we all will die one day (P(we will die) = 1.0), there is zero entropy (no uncertainty; H = 0) for that case. The entropy of a coin toss (2 possibilities) is 1, the entropy of a 4-sided die roll is 2, etc.

    This makes intuitive sense. The uncertainty of a case with two possibilities is less than that of a case with 3, 4, or more possibilities.
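    To make this concrete, here is a minimal Python sketch (the helper name shannon_entropy is ours, not from any standard library) that computes H for a list of probabilities and reproduces the uniform cases from the table:

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon's H for a list of probabilities that sum to 1."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

# Uniform cases from the table: n equally likely possibilities.
for n in (1, 2, 3, 4, 5, 6):
    print(n, round(shannon_entropy([1 / n] * n), 3))
# 1 0.0
# 2 1.0
# 3 1.585
# 4 2.0
# 5 2.322
# 6 2.585
```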

    All of the above cases, however, are rather trivial. Since the possibilities within each case all have the identical probability $P(x_i) = 1/n$ associated with them, the formula for H reduces to the simple negative of the log: $$H(X) = -\log_2 P(x_i) = -\log_2 \frac{1}{n} = \log_2 n$$ Below we have plotted $\log_2$ and H for 0 ≤ x ≤ 1.0.

    The choice of log2 is desirable because it indicates, for a dataset, how many binary (yes/no) choices we have to make to pin down a single case. Consider, for instance, a dataset with the following three variables:

    • Employment status: 50% of the cases (people) employed; 50% unemployed
    • Age: 50% old; 50% young
    • Hair color: 25% brown; 25% black; 25% blonde; 25% red
    The log2-based information entropy for this dataset equals 4.0, indicating that four yes/no questions are (on average) needed to determine the employment/age/hair-color profile of a case (person) in the dataset: one question for employment, one for age, and two for hair color.
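    If we assume the three variables are independent, the entropy of the joint profile is simply the sum of the per-variable entropies. A quick sketch (repeating the small helper from above):

```python
import math

def shannon_entropy(probs, base=2):
    return sum(-p * math.log(p, base) for p in probs if p > 0)

employment = [0.5, 0.5]                # employed / unemployed
age        = [0.5, 0.5]                # old / young
hair       = [0.25, 0.25, 0.25, 0.25]  # brown / black / blonde / red

# For independent variables, the joint entropy is the sum of the parts.
print(sum(shannon_entropy(v) for v in (employment, age, hair)))  # 1 + 1 + 2 = 4.0
```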

    This also means that in order to store the information of one person we will need at least four bits; i.e., four on/off or yes/no switches.
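    One hypothetical fixed-length code that achieves this four-bit minimum assigns one bit to employment, one bit to age, and two bits to hair color (the category-to-bits tables below are our own illustration):

```python
# Hypothetical 4-bit fixed-length code for one person's profile.
EMPLOYMENT = {"employed": "0", "unemployed": "1"}
AGE        = {"young": "0", "old": "1"}
HAIR       = {"brown": "00", "black": "01", "blonde": "10", "red": "11"}

def encode(employment, age, hair):
    return EMPLOYMENT[employment] + AGE[age] + HAIR[hair]

print(encode("employed", "old", "red"))  # 0111 -- exactly four bits
```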

    Things become more interesting when the probabilities are no longer equally divided, for instance when the dataset contains more employed than unemployed people, unequal percentages of hair colors, etc.
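    For instance, if 90% of the people in the dataset are employed, the employment variable carries far less than one bit of uncertainty. A quick check with the same helper:

```python
import math

def shannon_entropy(probs, base=2):
    return sum(-p * math.log(p, base) for p in probs if p > 0)

print(round(shannon_entropy([0.9, 0.1]), 3))  # 0.469 -- quite predictable
print(round(shannon_entropy([0.5, 0.5]), 3))  # 1.0   -- maximally uncertain
print(round(shannon_entropy([1.0]), 3))       # 0.0   -- no uncertainty at all
```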

    We have set this up below for you to play with. Play with the probabilities of the categories (make sure that they sum to 1.0 for each variable) and see what happens to H. You might even try to see what happens when the dataset contains only employed or only unemployed people, or only old or only young people, and what effect that has on H and thus on the number of bits needed to store the information of one person.

    [Interactive calculator: enter P(employed)/P(unemployed), P(young)/P(old), and P(brown)/P(black)/P(blonde)/P(red); the resulting H is displayed.]
