What is the chi-square statistic?
The chi-square (chi, the Greek letter χ, pronounced "kye") statistic
is a nonparametric statistical technique used to determine whether a
distribution of observed frequencies differs from the theoretical
expected frequencies. Chi-square tests use nominal (categorical) or
ordinal level data, so instead of means and variances they work with
frequencies.
The value of the chi-square statistic is given by
$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (1)$$
where $\chi^2$ is the chi-square statistic, O is the observed
frequency and E is the expected frequency.
Generally the
chi-squared statistic summarizes the
discrepancies between the expected number of times each outcome
occurs (assuming that the model is true) and the observed number of
times each outcome occurs, by summing the squares of the
discrepancies, normalized by the expected numbers, over all the
categories (Dorak, 2006).
Data used in a chi-square analysis have to satisfy the following
conditions:
- randomly drawn from the population,
- reported in raw counts of frequency,
- measured variables must be independent,
- observed frequencies cannot be too small (a common rule of thumb
asks for an expected frequency of at least 5 in each category), and
- values of independent and dependent variables must be
mutually exclusive.
There are two types of chi-square test:
- The chi-square test for goodness of fit, which compares the
expected and observed values to determine how well an
experimenter's predictions fit the data.
- The chi-square test for independence, which compares two sets
of categories to determine whether the two groups are distributed
differently among the categories (McGibbon, 2006).
1. Chi-square test for Goodness of Fit
Goodness of fit describes how well a statistical model fits a set of
observations. A measure of goodness of fit typically summarizes the
discrepancy between observed values and the values expected under the
model in question. Such measures can be used in statistical hypothesis
testing, e.g., to test for normality of residuals or to test whether
two samples are drawn from identical distributions.
Suppose a coin is tossed 100 times. If the coin is fair, the expected
outcome is 50 heads and 50 tails. If 47 heads and 53 tails are
observed instead, is this deviation due to a biased coin, or is it
due to chance?
1.1 Establish Hypotheses
The null hypothesis for the above experiment is that the observed
values are close to the predicted values. The alternative hypothesis
is that they are not close to the predicted values. These hypotheses
hold for all chi-square goodness of fit tests. In this case the null
and alternative hypotheses correspond to:
Null hypothesis: The coin is fair
Alternative hypothesis: The coin is biased
Table 1: Tabulated results of Observed and Expected frequencies
| | Heads | Tails |
|---|---|---|
| Observed | 47 | 53 |
| Expected | 50 | 50 |
1.2 Calculate the chi-square statistic
We calculate chi-square by substituting the values of O and E from
Table 1 into Equation 1. Each category contributes one term:
For Heads: $(47 - 50)^2 / 50 = 0.18$
For Tails: $(53 - 50)^2 / 50 = 0.18$
The sum over these categories is $\chi^2 = 0.18 + 0.18 = 0.36$
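The calculation above can be sketched in a few lines of Python. This is only an illustration of Equation 1 applied to the coin data in Table 1; the function name `chi_square_statistic` is an assumption made for this example and does not come from any library.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories (Equation 1)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Coin example from Table 1, in the order heads, tails.
observed = [47, 53]
expected = [50, 50]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 2))  # 0.36
```

Each category contributes one term to the sum, so the same function applies unchanged to a goodness of fit test with more than two categories.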
1.3 Assessing significance levels
The significance of the chi-square goodness of fit value is
established by calculating the degrees of freedom $\nu$ (the Greek
letter nu) and by using the chi-square distribution table
(Bissonnette, 2006). In a chi-square goodness of fit test, $\nu$ is
equal to the number of categories, c, minus one ($\nu = c - 1$). To
check whether the null hypothesis can be rejected, the calculated
chi-square is compared with the critical chi-square value from the
table that corresponds to the calculated $\nu$ and the chosen
significance level $\alpha$. If the calculated chi-square is greater
than the value in the table, the null hypothesis is rejected and it
is concluded that the predictions made were incorrect.
In the above experiment, $\nu = 2 - 1 = 1$. The critical chi-square
value for this example at $\alpha = 0.05$ and $\nu = 1$ is 3.84, which
is greater than $\chi^2 = 0.36$. Therefore the null hypothesis is not
rejected: the data give no evidence that the coin is biased.
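The decision rule described above can be sketched as a table lookup. The critical values below are the usual tabulated ones at a significance level of 0.05 for 1 to 3 degrees of freedom; the names `CRITICAL_VALUES_005`, `degrees_of_freedom_gof` and `reject_null` are assumptions made for this illustration.

```python
# Critical chi-square values at a significance level of 0.05,
# taken from a standard chi-square table (degrees of freedom 1 to 3).
CRITICAL_VALUES_005 = {1: 3.841, 2: 5.991, 3: 7.815}


def degrees_of_freedom_gof(num_categories):
    """Degrees of freedom for a goodness of fit test: categories minus one."""
    return num_categories - 1


def reject_null(statistic, dof):
    """Reject the null hypothesis when the statistic exceeds the critical value."""
    return statistic > CRITICAL_VALUES_005[dof]


dof = degrees_of_freedom_gof(2)   # heads and tails
print(dof, reject_null(0.36, dof))  # 1 False
```

For the coin example the statistic 0.36 is well below 3.841, so the null hypothesis is not rejected.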
2. Chi-square test for Independence
The chi-square test for independence is used to determine whether
two variables measured on the same sample are related. In this
context, independence means that the two factors are not related.
Typically in social science research, we are interested in finding
factors which are related, e.g. education and income, occupation and
prestige, age and voting behaviour.
Example: We want to know whether boys or girls get into trouble more
often in school. The table below documents the frequency of boys and
girls who got into trouble in school.
Table 2: Tabulated results of the Observed and Expected frequency [QMSS, 2006]
| | Got into trouble (Observed) | Not in trouble (Observed) | Total | Got into trouble (Expected) | Not in trouble (Expected) |
|---|---|---|---|---|---|
| Boys | 46 | 71 | 117 | (40.97) | (76.03) |
| Girls | 37 | 83 | 120 | (42.03) | (77.97) |
| Total | 83 | 154 | 237 | | |
To examine statistically whether boys got in
trouble more often in school, we need to establish hypotheses for
the question.
2.1 Establish Hypotheses
The null hypothesis is that the two variables are independent; in
this particular case, that the likelihood of getting in trouble is
the same for boys and girls. The alternative hypothesis to be tested
is that the likelihood of getting in trouble is not the same for boys
and girls.
Cautionary Note
It is important to keep in mind that the chi-square test for
independence only tests whether two variables are independent or not;
it cannot address questions of which is greater or less. Using the
chi-square test for independence, it cannot be evaluated directly
from the hypotheses whether boys or girls get in trouble more often.
2.2 Calculate the expected value for each cell of the
table
As with the goodness of fit example described
earlier, the key idea of the chi-square test for independence is a
comparison of observed and expected values. In the case of tabular
data, however, we usually do not know what the distribution should
look like (as we did with tossing the coin). Rather, the expected
values are calculated from the row and column totals of the
table.
The expected value for each cell of the table can be calculated
using the following equation:
$$E = \frac{\text{row total} \times \text{column total}}{\text{total for table}} \qquad (2)$$
The expected values (in parentheses) for each cell are also
presented in Table 2.
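As a rough sketch, Equation 2 can be applied to every cell of the table at once. The helper name `expected_counts` and the list-of-rows layout are assumptions made for this example; the observed counts are those of Table 2.

```python
def expected_counts(observed):
    """Expected cell counts under independence (Equation 2)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from Table 2: rows are boys, girls;
# columns are "got into trouble", "not in trouble".
observed = [[46, 71], [37, 83]]

for row in expected_counts(observed):
    print([round(value, 2) for value in row])
# [40.97, 76.03]
# [42.03, 77.97]
```

Because only the row and column totals enter the formula, the expected counts in each row and column add up to the same totals as the observed counts.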
2.3 Calculate Chi-square statistic
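With the values in Table 2, the chi-square statistic can be
calculated using Equation 1 as follows:
$$\chi^2 = \frac{(46 - 40.97)^2}{40.97} + \frac{(37 - 42.03)^2}{42.03} + \frac{(71 - 76.03)^2}{76.03} + \frac{(83 - 77.97)^2}{77.97} \approx 1.87$$
(The value 1.87 is obtained when the expected values are carried
unrounded; with the two-decimal values shown, the sum is about 1.88.)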
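The same sum can be sketched in Python, keeping the expected values unrounded. The function name `chi_square_from_table` is an assumption made for this illustration; it combines Equation 2 for each expected count with Equation 1 for the statistic.

```python
def chi_square_from_table(observed):
    """Chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # Equation 2
            statistic += (o - e) ** 2 / e                     # Equation 1
    return statistic


# Observed counts from Table 2 (boys, girls by trouble, no trouble).
observed = [[46, 71], [37, 83]]
print(round(chi_square_from_table(observed), 2))  # 1.87
```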
2.4 Assessing significance levels
In the chi-square test for independence, the number of degrees of
freedom is equal to the number of rows in the table minus one,
multiplied by the number of columns in the table minus one,
i.e. $\nu = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$.
The value calculated from the formula above is then compared with
values in the chi-square distribution table (Bissonnette, 2006). For
$\chi^2 = 1.87$ and $\nu = 1$, the table gives $0.10 < p < 0.20$,
which is above the conventional 0.05 significance level. Therefore
the null hypothesis is not rejected: the data do not show that boys
are significantly more likely than girls to get in trouble in school.
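Instead of reading a bracket from the table, the p-value can be computed directly in the special case of one degree of freedom, where a chi-square variable is the square of a standard normal variable. The sketch below relies on that fact and on `math.erfc` from the Python standard library; the function names are assumptions made for this example, and the shortcut does not apply to larger tables.

```python
import math


def degrees_of_freedom(num_rows, num_cols):
    """Degrees of freedom for a test of independence: (r - 1)(c - 1)."""
    return (num_rows - 1) * (num_cols - 1)


def p_value_one_dof(statistic):
    """Upper-tail probability for a chi-square variable with 1 degree of freedom.

    Valid only for 1 degree of freedom, where P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


dof = degrees_of_freedom(2, 2)
p = p_value_one_dof(1.87)
print(dof, round(p, 2))  # 1 0.17
```

A p-value of about 0.17 is larger than 0.05, which agrees with the conclusion that the null hypothesis is not rejected.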
Exercise: In a certain city, there are about one million eligible
voters. A simple random sample of 10000 eligible voters was chosen to
study the relationship between sex and participation in the previous
elections.
Table 3: Tabulated results of Observed Frequency [Rodríguez, 2006]
| | Men | Women |
|---|---|---|
| Voted | 2792 | 3591 |
| Did not Vote | 1486 | 2131 |
Establish whether being a man or a woman is independent of having
voted in the previous elections. In other words, are sex and voting
independent?
Answer
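One way to work this exercise is sketched below, following the same steps as the boys and girls example: expected counts from Equation 2, the statistic from Equation 1, and (r - 1)(c - 1) degrees of freedom. The function name and the list-of-rows layout are assumptions made for this illustration, and the closed-form p-value is valid only because the table has one degree of freedom.

```python
import math


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # Equation 2
            statistic += (o - e) ** 2 / e                     # Equation 1
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Table 3: rows are voted, did not vote; columns are men, women.
observed = [[2792, 3591], [1486, 2131]]
statistic, dof = chi_square_independence(observed)

# With 1 degree of freedom the p-value has a closed form via erfc.
p_value = math.erfc(math.sqrt(statistic / 2.0))
print(round(statistic, 2), dof, p_value < 0.05)  # 6.66 1 True
```

Under these assumptions the statistic exceeds the critical value of 3.84 for one degree of freedom at the 0.05 level, so the sketch suggests rejecting the hypothesis that sex and voting are independent in this sample.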
Coauthor: Jones Kalunga