Least Squares Prediction and Indicator Variables
Least squares prediction
CI and PI
- CI: “Where is the average wage for people with 16 years of education likely to be?”
- PI: “If I pick one random person with 16 years of education, what wage will they likely earn?”
Confidence Interval (CI)
- Target: the conditional mean $\mu_{Y|X_0} = \mathbb{E}[Y \mid X_0]$.
- This is a fixed but unknown number (not random once the data and $X_0$ are fixed).
- The randomness comes from the estimation process (sampling variation in $\hat{\beta}$).
- So the CI is telling you: “With 95% confidence, the true mean lies in this range.”
- CI → a fixed but unknown parameter (deterministic).
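For simple regression $y = \beta_1 + \beta_2 x + \varepsilon$, the standard form of this interval (with $t_c$ the $t$ critical value and $\hat{\sigma}^2$ the estimated error variance) is:

$$\hat{y}_0 \pm t_c\,\widehat{\mathrm{se}}(\hat{y}_0), \qquad \widehat{\mathrm{se}}(\hat{y}_0) = \hat{\sigma}\sqrt{\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}}$$

Both terms under the square root shrink as $N$ grows, so the CI narrows in large samples.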
Prediction Interval (PI)
- Target: a new observation $Y_0$ given $X_0$.
- Even if you knew the regression line exactly, $Y_0$ would still be random because of the error term $\varepsilon$.
- The PI accounts for both the estimation uncertainty and the irreducible randomness.
- So the PI is telling you: “With 95% probability, the next individual value will fall in this range.”
- PI → a random variable (a future realization of $Y_0$).
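For simple regression, the standard result: the forecast error $f = y_0 - \hat{y}_0$ has estimated variance

$$\widehat{\mathrm{var}}(f) = \hat{\sigma}^2\left[1 + \frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}\right]$$

so the PI is $\hat{y}_0 \pm t_c\sqrt{\widehat{\mathrm{var}}(f)}$. The leading $1$ is the irreducible error variance; it does not shrink with $N$, which is why the PI is always wider than the CI at the same $x_0$.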
Summary
- CI → a fixed but unknown mean; the interval comes from the randomness of the estimator.
- PI → an inherently random individual outcome; the interval accounts for both estimation error and noise.
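A minimal numerical sketch of the difference, using statsmodels on simulated data (the sample, coefficients, and seed below are all hypothetical, chosen only to mimic a wage–education regression):

```python
import numpy as np
import statsmodels.api as sm

# Simulate a hypothetical wage-education sample (not real data)
rng = np.random.default_rng(42)
educ = rng.integers(8, 21, size=200).astype(float)
wage = 2.0 + 1.2 * educ + rng.normal(0.0, 5.0, size=200)

# Fit WAGE on a constant and EDUC by least squares
X = sm.add_constant(educ)
results = sm.OLS(wage, X).fit()

# Predict at EDUC = 16; summary_frame reports both intervals:
#   mean_ci_lower / mean_ci_upper -> CI for E[WAGE | EDUC = 16] (narrow)
#   obs_ci_lower  / obs_ci_upper  -> PI for one new individual  (wider)
x0 = np.array([[1.0, 16.0]])  # [intercept, EDUC]
print(results.get_prediction(x0).summary_frame(alpha=0.05))
```

The obs_ci band comes out wider than the mean_ci band at the same $x_0$, which is the whole CI-vs-PI distinction in one table.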
Indicator Variables
Dummy variables
Indicator variables are also called dummy, binary, or dichotomous variables because they take just two values, usually one or zero, to indicate the presence or absence of a characteristic, or whether a condition is true or false. For variables that take on only two values, the mapping to {0, 1} is by far the most common; however, other numerical values can be adopted.
For example, we can define an indicator variable $D$:
- $D = 1$ if the characteristic is present;
- $D = 0$ if the characteristic is not present.
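A quick sketch of constructing such indicators in pandas (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["female", "male", "female"],
                   "race": ["black", "white", "black"]})

# Map each characteristic to a {0, 1} indicator
df["FEMALE"] = (df["sex"] == "female").astype(int)
df["BLACK"] = (df["race"] == "black").astype(int)
print(df)
```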
Intercept indicator variables
The reference group and the dummy variable trap
Slope indicator variables
Interactions between qualitative factors
Why do we need an extra coefficient ($\gamma$) for the interaction between being Black and being female, instead of just letting the two dummy variables (BLACK, FEMALE) both equal 1?
Two dummy variables only (no interaction term)
Suppose the regression is:
$$E(WAGE) = \beta_1 + \beta_2\,EDUC + \delta_1\,BLACK + \delta_2\,FEMALE$$
- If someone is a Black female, then $BLACK = 1$ and $FEMALE = 1$.
- Their expected wage would be:

$$\beta_1 + \beta_2\,EDUC + \delta_1 + \delta_2$$

So in this model, the effect of being both Black and female is just the sum of the two separate penalties.
With an interaction term
Now suppose the model is:
$$E(WAGE) = \beta_1 + \beta_2\,EDUC + \delta_1\,BLACK + \delta_2\,FEMALE + \gamma\,(BLACK \times FEMALE)$$
- If someone is a Black female, then $BLACK = 1$ and $FEMALE = 1$, so the interaction term $= 1$.
- Their expected wage is:

$$\beta_1 + \beta_2\,EDUC + \delta_1 + \delta_2 + \gamma$$

Here, $\gamma$ is an extra adjustment: it allows the Black-female group’s outcome to differ from the simple sum of “being Black” + “being female.” It is exactly the “extra piece”: the part of the joint Black–female effect that cannot be explained by just adding the two main effects.
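A sketch of estimating this model with the statsmodels formula API on simulated data with a known $\gamma$ (the data-generating values, seed, and sample size are all hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a hypothetical sample whose true gamma is -2
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "EDUC": rng.integers(8, 21, size=n).astype(float),
    "BLACK": rng.integers(0, 2, size=n),
    "FEMALE": rng.integers(0, 2, size=n),
})
df["WAGE"] = (4.0 + 1.1 * df["EDUC"] - 3.0 * df["BLACK"] - 4.0 * df["FEMALE"]
              - 2.0 * df["BLACK"] * df["FEMALE"] + rng.normal(0.0, 5.0, size=n))

# BLACK:FEMALE adds only the product column; main effects are listed separately
model = smf.ols("WAGE ~ EDUC + BLACK + FEMALE + BLACK:FEMALE", data=df).fit()
print(model.params)  # the BLACK:FEMALE coefficient is the estimate of gamma
```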
Why no perfect collinearity?
- The interaction column is not equal to either BLACK or FEMALE.
- It is also not equal to their sum or difference.
- It equals 1 in only one cell (Black female) and 0 everywhere else.
So mathematically:
$$BLACK \times FEMALE \neq a \cdot BLACK + b \cdot FEMALE + c$$

for any constants $a, b, c$.
Thus, the interaction term provides new independent information and does not cause perfect collinearity.
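This can be verified directly: stack the four (BLACK, FEMALE) cells as rows of the design matrix and check that it has full column rank (a small numpy sketch):

```python
import numpy as np

# Rows: White male, Black male, White female, Black female
# Cols: intercept, BLACK, FEMALE, BLACK*FEMALE
X = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
])
print(np.linalg.matrix_rank(X))  # prints 4: no column is a linear combination of the others
```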