Information Theory - Entropy
Self-information
In information theory and statistics, “surprise” is quantified by self-information. For an event x with probability p(x), the amount of surprise (also called information content) is defined as
I(x) = -\log p(x) = \log \frac{1}{p(x)}.
This definition has several nice properties:
- Rarity gives more surprise: if p(x) is small, then I(x) is large; rare events are more surprising.
- Certainty gives no surprise: if p(x) = 1, then I(x) = 0. Something guaranteed to happen is not surprising at all.
- Additivity for independence: if two independent events occur, the total surprise is the sum of their individual surprises: I(x, y) = I(x) + I(y).
The logarithm base just changes the unit: base 2 gives “bits,” base e gives “nats.”
For example, if an event has probability 1/8, then
I(x) = -\log_2 \tfrac{1}{8} = 3 \text{ bits}
meaning the event carries 3 bits of surprise.
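To make the units concrete, here is a minimal Python sketch (standard library only); the helper name `self_information` is just illustrative, not a standard function.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Surprise I(x) = -log_base p(x) of an event with probability p."""
    return -math.log(p, base)

print(self_information(1 / 8))               # 3.0 bits
print(self_information(1 / 8, base=math.e))  # ~2.079 nats (= 3 * ln 2)
```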
Entropy
Formally, entropy is defined as the expected surprise (expected self-information) under a probability distribution. For a random variable X with distribution p(x):
H(X) = \mathbb{E}[I(X)] = \sum_{x} p(x)\,(-\log p(x)) = -\sum_{x} p(x)\log p(x).
Key points:
- Each outcome x carries a surprise I(x) = -\log p(x).
- We weight that surprise by how likely it is, p(x).
- The entropy is the average surprise if you repeatedly observe X.
For example:
- A fair coin (p(H) = p(T) = 0.5) has
  H = -[0.5\log_2 0.5 + 0.5\log_2 0.5] = 1 \text{ bit}.
  Meaning: on average, each coin flip carries 1 bit of surprise.
- A biased coin (p(H) = 0.9, p(T) = 0.1) has
  H = -[0.9\log_2 0.9 + 0.1\log_2 0.1] \approx 0.47 \text{ bits}.
  Less uncertainty → less average surprise.
So entropy is the expected amount of surprise you’d feel per observation, given your model.
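The two coin examples above can be reproduced with a few lines of Python; `entropy` here is an illustrative helper, not a library function.

```python
import math

def entropy(probs, base: float = 2.0) -> float:
    """H = -sum_x p(x) log p(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit    (fair coin)
print(entropy([0.9, 0.1]))  # ~0.469 bits (biased coin: less average surprise)
```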
Cross-entropy
The cross-entropy between the true distribution P and a model distribution Q is
H(P, Q) = -\sum_{x} p(x)\log q(x)
- The weighting comes from the true distribution P (because that’s what actually happens in the world).
- The surprise calculation -\log q(x) comes from the model Q (because that’s what you believe the probabilities are).
- Reality probability P:
  - In theory: it’s the true distribution of the world.
  - In practice: we don’t know it exactly, so we estimate it from data (observations, frequencies, the empirical distribution).
  - Example: if in 100 flips you saw 80 heads and 20 tails, then your empirical P is (0.8, 0.2). (See the sketch after this list.)
- Model probability Q:
  - This is your hypothesis or predictive model.
  - It gives probabilities for outcomes (e.g. “I think the coin is fair, so Q = (0.5, 0.5)”).
  - It can be parametric (like logistic regression, a neural network, etc.) or non-parametric.
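Here is a small Python sketch of the coin example, assuming the empirical P = (0.8, 0.2) and the fair-coin model Q = (0.5, 0.5) from the list above; the function name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(p, q, base: float = 2.0) -> float:
    """H(P, Q) = -sum_x p(x) log q(x): weights from reality P, surprise from model Q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

P = [0.8, 0.2]  # empirical distribution: 80 heads, 20 tails out of 100 flips
Q = [0.5, 0.5]  # model belief: "the coin is fair"

print(cross_entropy(P, Q))  # 1.0 bit: the model's average surprise on real data
print(cross_entropy(P, P))  # ~0.722 bits = H(P); H(P, Q) is never smaller than H(P)
```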
Properties of the information source
i.i.d. (independent and identically distributed) amounts to “memoryless + stationary” combined, but the two concepts are slightly different.
i.i.d. is a property of the information source: a sequence of random variables (X_1, X_2, \dots).
Stationary
- Means the distribution is the same over time.
- Example: P(X_t = 1) = 0.7 for all t.
- Does not require independence.
- You could have correlations (like a Markov chain) and still be stationary if the distribution doesn’t change with time.
Memoryless
- Means no dependence on the past (i.e. independence).
- Example: P(X_t \mid X_{t-1}, X_{t-2}, \dots) = P(X_t).
- Does not require the distribution to be identical over time. (E.g. each toss is independent but the bias slowly changes with time → memoryless but not stationary.)
i.i.d.
- Independent: no memory (memoryless).
- Identically distributed: stationary in the simplest sense (the same marginal distribution for all time steps).
- So i.i.d. = memoryless and stationary at once.
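As a rough illustration (not part of the original notes), this Python/numpy sketch simulates the three cases with arbitrary parameters: an i.i.d. source, a stationary-but-not-memoryless Markov chain, and a memoryless-but-not-stationary source with a drifting bias.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000

# (1) i.i.d.: memoryless AND stationary -- Bernoulli(0.7) at every time step.
iid = rng.binomial(1, 0.7, size=T)

# (2) Stationary but NOT memoryless: symmetric two-state Markov chain that stays
#     in its current state with probability 0.9. The marginal remains Bernoulli(0.5),
#     but consecutive samples are strongly correlated.
markov = np.empty(T, dtype=int)
markov[0] = rng.integers(0, 2)
for t in range(1, T):
    markov[t] = markov[t - 1] if rng.random() < 0.9 else 1 - markov[t - 1]

# (3) Memoryless but NOT stationary: independent tosses whose bias drifts over time.
drifting = rng.binomial(1, np.linspace(0.1, 0.9, T))

print(iid.mean(), markov.mean(), drifting.mean())
print(np.corrcoef(markov[:-1], markov[1:])[0, 1])  # ~0.8: memory despite stationarity
```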
Joint entropy
For two random variables X, Y with joint distribution p(x, y), the joint entropy is
H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\,\log p(x, y)
Interpretation: the average surprise when you observe the pair (X, Y).
- If X and Y are independent:
H(X, Y) = H(X) + H(Y)
Conditional entropy
The conditional entropy of Y given X is
H(Y|X) = -\sum_{x}\sum_{y} p(x, y)\,\log p(y|x)
Interpretation: the average surprise in Y after you already know X.
- If X and Y are independent: H(Y|X) = H(Y).
- If Y is fully determined by X: H(Y|X) = 0.
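A numpy sketch of both definitions on a hypothetical 2×2 joint table (the numbers are made up for illustration):

```python
import numpy as np

def H(p):
    """Entropy in bits of any probability array; zero entries contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution p(x, y): rows index X, columns index Y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

px = pxy.sum(axis=1)            # marginal p(x)
py = pxy.sum(axis=0)            # marginal p(y)
py_given_x = pxy / px[:, None]  # conditional p(y | x); each row sums to 1

H_XY = H(pxy)                                            # joint entropy
H_Y_given_X = float(-(pxy * np.log2(py_given_x)).sum())  # conditional entropy

print(H_XY)          # ~1.722 bits
print(H(px), H(py))  # 1.0 and 1.0
print(H_Y_given_X)   # ~0.722 bits: knowing X removes part of the uncertainty in Y
# Since X and Y are dependent here, H(X, Y) < H(X) + H(Y).
```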
Special case: H(X|X)
H(X|X) = -\sum_{x} p(x)\,\log p(x|x).
But p(x|x) = 1 (if you already know X, the probability that it equals itself is certain).
So
H(X|X) = -\sum_x p(x)\,\log 1 = 0.
Intuition
- Conditional entropy measures the remaining uncertainty in one variable after knowing another.
- If you condition on the same variable, there is no uncertainty left at all.
- Therefore: H(X|X) = 0.
Chain rule of entropy
For two random variables X and Y,
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
This is called the chain rule for entropy.
Why it works
Start from the definition of joint entropy:
H(X, Y) = -\sum_{x,y} p(x, y)\log p(x, y).
Factorize p(x,y):
p(x, y) = p(x)\,p(y|x)
So:
H(X, Y) = -\sum_{x,y} p(x, y)\log\big(p(x)\,p(y|x)\big)
Expand the log:
H(X, Y) = -\sum_{x,y} p(x, y)\log p(x) - \sum_{x,y} p(x, y)\log p(y|x)
- The first term simplifies to -\sum_{x} p(x)\log p(x) = H(X), because \sum_{y} p(x, y) = p(x).
- The second term is exactly H(Y|X):
  H(Y|X) = -\sum_{x,y} p(x, y)\,\log p(y|x)
This conditional entropy can itself be reduced to a simpler form. Let’s work it step by step.
Factor the sums
H(Y|X) = -\sum_x \sum_y p(x, y)\,\log p(y|x).
Recognize conditional distribution
Note that p(x, y) = p(x)\,p(y|x).
So:
H(Y|X) = -\sum_x p(x) \sum_y p(y|x)\,\log p(y|x).
The inner sum
-\sum_y p(y|x)\,\log p(y|x)
is just the entropy of Y for a fixed X = x. Call this H(Y|X = x).
So the whole expression becomes:
H(Y|X) = \sum_x p(x)\,H(Y|X = x).
Symmetry
H(X, Y) = H(X) + H(Y|X).
By symmetry, you can also write
H(X, Y) = H(Y) + H(X|Y)
Chain rule without conditioning
H(X, Y) = H(X) + H(Y|X)
- This is the basic chain rule of entropy.
- It says: the uncertainty in the pair (X, Y) equals the uncertainty in X plus the leftover uncertainty in Y once you know X.
- No external variable here.
In the general case:
H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})
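A quick numeric check of the general chain rule on an arbitrary three-variable joint distribution (a sketch, with each conditional entropy computed directly from its definition):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
pxyz = rng.random((2, 2, 2))
pxyz /= pxyz.sum()          # arbitrary joint distribution p(x1, x2, x3)

p1 = pxyz.sum(axis=(1, 2))  # p(x1)
p12 = pxyz.sum(axis=2)      # p(x1, x2)

H1 = H(p1)
H2_given_1 = float(-(p12 * np.log2(p12 / p1[:, None])).sum())         # H(X2 | X1)
H3_given_12 = float(-(pxyz * np.log2(pxyz / p12[:, :, None])).sum())  # H(X3 | X1, X2)

print(H(pxyz), H1 + H2_given_1 + H3_given_12)  # equal up to floating-point error
```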
Chain rule with conditioning on Z
H(X, Y \mid Z) = H(X|Z) + H(Y \mid X, Z).
- This is the conditional chain rule.
- It says: given knowledge of Z, the uncertainty in the pair (X, Y) equals
  - the uncertainty in X once you know Z, plus
  - the leftover uncertainty in Y once you know both X and Z.
- In other words, it splits up the uncertainty of (X, Y) once Z is already given.
In the general case:
H(X_1, X_2, \dots, X_n \mid Z) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}, Z)
H(p) is a concave function of the probability
Take the simplest case: a binary random variable with probabilities p and 1 - p.
Entropy is
H(p) = -p\log p - (1 - p)\log(1 - p)
- This function H(p) is concave in p.
- Concavity means: for any two probability values p_1, p_2 and any \lambda \in [0, 1],
  H(\lambda p_1 + (1 - \lambda)p_2) \ge \lambda H(p_1) + (1 - \lambda)H(p_2).
Graphically, the entropy curve is shaped like an “upside-down bowl.”
- Maximum at p = 0.5 (most uncertainty).
- Minimum at p = 0 or p = 1 (no uncertainty).
This concavity holds more generally: entropy is concave in the whole probability distribution p(x).
That’s why mixing distributions increases entropy (on average).
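A tiny numeric spot-check of the concavity inequality for the binary case (the particular values of p1, p2, and λ are arbitrary):

```python
import math

def H_bin(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p1, p2, lam = 0.1, 0.6, 0.3
mixed = H_bin(lam * p1 + (1 - lam) * p2)           # entropy at the mixed parameter
average = lam * H_bin(p1) + (1 - lam) * H_bin(p2)  # mixture of the entropies

print(mixed, average, mixed >= average)  # ~0.993, ~0.820, True
```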
Mutual Information
The mutual information between two random variables X and Y is
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
It can also be written as
I(X;Y) = H(X) + H(Y) - H(X, Y)
And in terms of distributions:
I(X;Y) = \sum_{x,y} p(x, y)\,\log \frac{p(x, y)}{p(x)p(y)}
1. Mutual Information as a measure of dependence
- If I(X;Y) = 0: the two variables are independent. Knowing one tells you nothing about the other.
- If I(X;Y) is large: there is a strong dependency. Knowing one variable reduces a lot of the uncertainty about the other.
2. How “large” relates to entropy
- The maximum possible mutual information is limited by the entropy:
  I(X;Y) \le \min\{H(X), H(Y)\}.
- This makes sense: you can’t learn more about Y from X than the total uncertainty H(Y) that Y has.
3. Intuition with examples
- Independent coin flips: I(X;Y) = 0.
- Perfect correlation (e.g. Y = X): I(X;Y) = H(X) = H(Y).
- Perfect anti-correlation (e.g. Y = \text{not } X): I(X;Y) = H(X) as well, because knowing one still completely determines the other.
So yes, larger MI = stronger relationship between the two variables.
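The three examples above can be checked numerically; this numpy sketch computes I(X;Y) straight from the distribution formula (the helper name is illustrative):

```python
import numpy as np

def mutual_information(pxy) -> float:
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]; zero cells are skipped."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)  # p(x) as a column
    py = pxy.sum(axis=0, keepdims=True)  # p(y) as a row
    indep = px @ py                      # product distribution p(x) * p(y)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / indep[mask])).sum())

independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # two independent fair coins
copy        = np.array([[0.5, 0.0], [0.0, 0.5]])      # Y = X
anti        = np.array([[0.0, 0.5], [0.5, 0.0]])      # Y = not X

print(mutual_information(independent))  # 0.0
print(mutual_information(copy))         # 1.0 = H(X)
print(mutual_information(anti))         # 1.0 as well
```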
Marginal probability
- “Marginal” means you’re looking at the distribution of one variable alone, ignoring the others.
- On a probability table, you literally get it by summing along the margins — that’s why it’s called marginal probability.
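For instance, on a hypothetical 2×2 joint table in numpy, the marginals are just row and column sums:

```python
import numpy as np

# Hypothetical joint table p(x, y): rows are values of X, columns are values of Y.
pxy = np.array([[0.10, 0.30],
                [0.25, 0.35]])

p_x = pxy.sum(axis=1)  # sum across each row   -> marginal of X: [0.40, 0.60]
p_y = pxy.sum(axis=0)  # sum down each column  -> marginal of Y: [0.35, 0.65]

print(p_x, p_y)
```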