Contributed by: Netali Agrawal
The Chi-Square test is also known as the “goodness of fit” test. But the questions that first come to mind on hearing this name are: what is this test, when should it be used, and how is it performed?
This article addresses those questions and aims to give you a working knowledge of the Chi-Square test. Since the Chi-Square test is a hypothesis test, let’s first go through a few basics of Hypothesis Testing.
Fundamentals of Hypothesis Testing
Hypothesis Testing is used to draw conclusions about a population from its sample data. It helps in deciding which of two mutually exclusive statements about the population is better supported by the sample data.
Null Hypothesis (H0) – A statement that is commonly accepted or considered to be the status quo. It assumes that any observed effect is due to chance alone.
Alternate Hypothesis (H1 or Ha) – The statement that contradicts the Null Hypothesis; the two are mutually exclusive. Where the Null Hypothesis states the commonly accepted position, the Alternate Hypothesis states that the sample data reflects a real effect rather than chance.
There are various types of Hypothesis Testing. These include z-tests, one-sample t-tests, paired t-tests, 2 sample t-tests, ANOVA, and many more. Most of these are parametric tests of mean and variance. The Chi-Square test, which we are going to look at in detail, is a non-parametric test based on frequencies.
What is a Chi-Square Test?
The Chi-Square test checks how well the observed values fit the values that would be expected under an assumed distribution, typically the one that holds when the variables are independent. Because it measures how well the observed data fits that expected distribution, it is also known as the “goodness of fit” test.
This shows why it is a hypothesis test: the sample data is weighed against two mutually exclusive statements. The null hypothesis states that the data fits the distribution expected under independence, which implies that the observed data is not biased. The alternate hypothesis states that the observed data deviates from that distribution, meaning the data is biased or the variables are dependent. Later sections will discuss this in detail.
When to use the Chi-Square test?
The Chi-Square test is designed for a specific type of data: categorical variables. It cannot be applied directly to continuous data. To apply it to continuous data, the data must first be divided into buckets, with a frequency or count provided for each bucket. Let’s understand the difference between categorical and continuous data types.
- Continuous Data Type – Continuous data types can take infinitely many numerical values between any two values. For example, salary, time.
- Categorical Data Type – Categorical data types are ones that contain a finite set of distinct categories or groups. For example, gender, marital status.
If continuous data needs to be segregated into buckets/categories, choose the categories with care. If the categories are not selected carefully, the test results might not make any sense. The Chi-Square test will tell you whether the data follows the distribution expected under independence. It will not tell you whether the categories you created or chose are correct.
Let’s consider a scenario. Assume an app rates all restaurants under 3 categories: good, okay, and not recommended. We also want to group the restaurants by size, and the size categories can be derived from the seating capacity of each restaurant. This is how the table would look:
Rating | Small | Medium | Large |
Good | 30 | 10 | 20 |
Okay | 8 | 10 | 12 |
Not Recommended | 3 | 5 | 2 |
Total | 41 | 25 | 34 |
Small is for a restaurant with a seating capacity of up to 20 people, medium is for a seating capacity of up to 100 people, and large is for a seating capacity of more than 100 people.
Here we changed continuous data (seating capacity) into categorical data (size). Be careful when doing so; poorly chosen buckets can lead to misleading conclusions from the test.
For the case above, the null and alternative hypotheses can be formulated as below:
Null Hypothesis: Ratings for restaurants are independent of the size of a restaurant. In simple terms, the 2 variables, rating and size, are independent.
Alternate Hypothesis: Rating and size of a restaurant are dependent on one another, so one variable influences the other.
The test compares the observed data to a model that distributes the data according to the expectation that the variables are independent. If the observed data does not fit the model, the evidence that the variables are dependent becomes stronger. In that case, we reject the null hypothesis.
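The bucketing step described above can be sketched in a few lines of Python. This is only an illustration: the function name `size_category`, the exact cut-offs (up to 20, up to 100, above 100), and the sample capacities are assumptions made for the example, not data from the article.

```python
from collections import Counter


def size_category(seats: int) -> str:
    """Map a seating capacity to a size bucket (cut-offs are assumed)."""
    if seats <= 20:
        return "Small"
    if seats <= 100:
        return "Medium"
    return "Large"


# Hypothetical seating capacities, for illustration only
capacities = [12, 20, 45, 100, 101, 250]

# Count how many restaurants fall into each bucket
counts = Counter(size_category(seats) for seats in capacities)
print(counts)
```

Moving a cut-off changes which bucket a restaurant falls into, and therefore the counts the test sees, which is why the choice of buckets matters.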
What should be the data format for the Chi-Square test?
We now know that the data type should be categorical. The format of data input for this test should be tabular. The example here uses a 3×3 grid, but as long as the data is in a tabular format with proper categorization, it remains valid for the test. It could be a grid of any size – 2×2, 3×2, 4×4, 8×3 – all of these sizes are valid so long as they meet the aforementioned criteria.
Rating | Small | Medium | Large |
Good | 30 | 10 | 20 |
Okay | 8 | 10 | 12 |
Not Recommended | 3 | 5 | 2 |
Total | 41 | 25 | 34 |
Although the data is in a tabular format, it is incomplete for the test. Along with the counts for each cell, the total of each row and each column should be provided, as well as the grand total for the whole dataset:
Rating | Small | Medium | Large | Total |
Good | 30 | 10 | 20 | 60 |
Okay | 8 | 10 | 12 | 30 |
Not Recommended | 3 | 5 | 2 | 10 |
Total | 41 | 25 | 34 | 100 |
We now have a complete dataset on the distribution of 100 restaurants based on the categories of rating (good/okay/not recommended) and restaurant size category (small/medium/large). A Chi-Square test could be performed on this data to check whether the rating and size of the restaurant are completely independent or influencing one another.
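As a quick check, the row, column, and grand totals can be derived from the cell counts. The sketch below uses plain Python lists; the variable names are illustrative choices, and the counts are the ones from the table above.

```python
# Observed counts from the table: rows are ratings, columns are sizes
ratings = ["Good", "Okay", "Not Recommended"]
sizes = ["Small", "Medium", "Large"]
observed = [
    [30, 10, 20],
    [8, 10, 12],
    [3, 5, 2],
]

row_totals = [sum(row) for row in observed]        # one total per rating
col_totals = [sum(col) for col in zip(*observed)]  # one total per size
grand_total = sum(row_totals)

for rating, total in zip(ratings, row_totals):
    print(f"{rating}: {total}")
for size, total in zip(sizes, col_totals):
    print(f"{size}: {total}")
print(f"Total: {grand_total}")
```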
How to perform a Chi-Square test?
Now that we have learned what the test is and what form of data should be used as input, we can move on to performing the test. Here’s a list of everything we need:
- Observed values
- Estimated values (also called expected values)
These are the components required to perform the test and obtain the Chi-Square statistic. The table below shows the observed values for the problem at hand, the same values we have been looking at throughout the blog. Observed values are denoted by O.
Rating | Small | Medium | Large | Total |
Good | 30 | 10 | 20 | 60 |
Okay | 8 | 10 | 12 | 30 |
Not Recommended | 3 | 5 | 2 | 10 |
Total | 41 | 25 | 34 | 100 |
The question now is how to get the estimated values. The estimated value is denoted by E and can be calculated using a simple formula.
The estimated value for each cell is the total of its row multiplied by the total of its column, divided by the total for the table, or simply:
Estimated value in each cell = (Row total * Column total)/Table total
So, for cell (1,1) in the above table, the estimated value is (60*41)/100, or 24.6. Because this is an estimate and not a count, it does not need to be a whole number.
The estimated value for every other cell is calculated the same way. Let’s see what the table of estimated values looks like:
Rating | Small | Medium | Large | Total |
Good | 24.6 | 15 | 20.4 | 60 |
Okay | 12.3 | 7.5 | 10.2 | 30 |
Not Recommended | 4.1 | 2.5 | 3.4 | 10 |
Total | 41 | 25 | 34 | 100 |
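The estimated values above can be reproduced with a short Python sketch that applies the (Row total * Column total)/Table total formula to every cell. The variable names are illustrative.

```python
# Observed counts: rows are ratings, columns are sizes
observed = [
    [30, 10, 20],
    [8, 10, 12],
    [3, 5, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Estimated value for each cell = (row total * column total) / table total
expected = [[r * c / total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 1) for value in row])
```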
Let’s put both the observed and the estimated values in one table, with the estimated value in parentheses after each observed value. This gives a more compact view and makes the two easier to compare.
Rating | Small | Medium | Large | Total |
Good | 30 (24.6) | 10 (15) | 20 (20.4) | 60 |
Okay | 8 (12.3) | 10 (7.5) | 12 (10.2) | 30 |
Not Recommended | 3 (4.1) | 5 (2.5) | 2 (3.4) | 10 |
Total | 41 | 25 | 34 | 100 |
In the above table, the observed and estimated values look roughly in line. So can we straight away say that the null hypothesis holds and the variables are independent of each other? Not yet; we need to test it.
Let’s put our data to the test now. This can be done in Excel or any other tool used for statistical modeling. The formula below is the one that any such tool applies behind the scenes:
χ² = Σ (O − E)² / E
Let’s understand the formula first. We have already seen what O and E are: O stands for the observed value and E stands for the estimated value (the one we calculated above). We subtract the estimated value from the observed value to get the residual, or error, which measures how far the observed value deviates from the estimate. The residual is squared so that positive and negative deviations do not cancel each other out. Squaring inflates the values, so we divide by the estimated value to bring them back to a comparable scale.
Why do we need the summation? It simply says that you do this for every cell and then add the results up to get the Chi-Square statistic. The Greek letter χ is Chi; the statistic calculated with this formula is χ², which is where the test gets its name.
Using this formula, we calculate the Chi-Square value for the above example as ((30-24.6)^2/24.6) + ((10-15)^2/15) + ((20-20.4)^2/20.4) + ((8-12.3)^2/12.3) + ((10-7.5)^2/7.5) + ((12-10.2)^2/10.2) + ((3-4.1)^2/4.1) + ((5-2.5)^2/2.5) + ((2-3.4)^2/3.4), which comes out to approximately 8.88 (8.8857, truncated to two decimals).
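The same arithmetic can be sketched in Python, summing (O − E)²/E over all nine cells. The observed and estimated values are the ones from the tables above; the variable names are illustrative.

```python
observed = [
    [30, 10, 20],
    [8, 10, 12],
    [3, 5, 2],
]
expected = [
    [24.6, 15.0, 20.4],
    [12.3, 7.5, 10.2],
    [4.1, 2.5, 3.4],
]

# Sum (O - E)^2 / E over every cell of the grid
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
print(round(chi_square, 4))
```

The unrounded result is about 8.8857, which is the value the article quotes as 8.88.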
How do we conclude the result of the test? There is a final step we need to perform. Let’s recall the basic concept behind any hypothesis test: the p-value. It is the benchmark used to draw a conclusion from any hypothesis test. How do we generate this p-value? To calculate it, we need the following:
- The Chi-Square value
- The degrees of freedom
The Chi-Square value is already known, but you might be wondering what the degrees of freedom are.
Degrees of Freedom
The degrees of freedom are generally denoted by df or d. They tell you how many cells in a grid are independent: for the Chi-Square grid, how many cells you need to fill in before the rest can be derived from the row and column totals. Once the totals are given, you are free to choose values for only some of the cells; the remaining cells are then fixed, because every row and column must add up to its total.
For direct calculation, it is the number of rows minus one times the number of columns minus one, (R-1)*(C-1). If we apply this to our example, the degrees of freedom are (3-1)*(3-1) = 2*2 = 4. We can fill in 4 values freely, and the rest are determined by the totals.
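To make this concrete, the sketch below computes (R-1)*(C-1) and then shows that, once the totals and 4 cells are known, the remaining 5 cells of our 3×3 grid are forced. The helper name `degrees_of_freedom` and the choice of the top-left 2×2 block as the "free" cells are assumptions made for illustration.

```python
def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """Degrees of freedom for an R x C grid: (R-1)*(C-1)."""
    return (n_rows - 1) * (n_cols - 1)


row_totals = [60, 30, 10]
col_totals = [41, 25, 34]
df = degrees_of_freedom(len(row_totals), len(col_totals))

# The 4 cells we are free to choose (here: the top-left 2x2 block)
free_cells = [
    [30, 10],
    [8, 10],
]

# The last cell of each of these rows is forced by the row total
table = [row + [row_totals[i] - sum(row)] for i, row in enumerate(free_cells)]
# The entire last row is forced by the column totals
table.append(
    [col_totals[j] - sum(row[j] for row in table) for j in range(len(col_totals))]
)

print(df)
print(table)
```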
Now you have all the data points needed to proceed, along with a conceptual understanding of the Chi-Square test. There are Chi-Square tables, similar to z-score and F-statistic tables, but let’s stick to an Excel calculation here.
The formula to use in Excel is:
p-value = CHIDIST(x, degree_of_freedom)
Put in the values, and this will give you a p-value for the given data points mentioned above. In the above example, x is 8.88, and df is 4. Substituting the values in the above formula, you will get a p-value of 0.06417.
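Outside Excel, the same right-tail probability can be sketched with the Python standard library. The helper below is a hypothetical name and relies on a closed form that holds only when the degrees of freedom are even (as with df = 4 here): p = e^(−x/2) · Σ (x/2)^k / k! for k from 0 to df/2 − 1. It is a minimal sketch, not a general replacement for CHIDIST.

```python
import math


def chi_square_p_value_even_df(x: float, df: int) -> float:
    """Right-tail chi-square probability; valid only for positive even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2
    # Partial sum of the series for e^(x/2), truncated after df/2 terms
    partial_sum = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * partial_sum


p_value = chi_square_p_value_even_df(8.88, 4)
print(round(p_value, 5))
```

If SciPy is available, `scipy.stats.chi2.sf(x, df)` should give the same tail probability for any degrees of freedom, odd or even.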
What does the basic hypothesis rule say?
- Reject null hypothesis if p-value < alpha (0.05)
- Fail to reject null hypothesis if p-value >= alpha (0.05)
In the above example, the p-value is greater than alpha, so we fail to reject the null hypothesis: the data does not provide enough evidence to conclude that the ratings depend on the size of the restaurant.
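The decision rule above can be written as a small Python function. The name `decide` and the default alpha of 0.05 are illustrative choices that mirror the rule as stated.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject H0 only when p-value < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# p-value from the restaurant example
print(decide(0.06417))
```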
Important FAQs on Chi-Square Test
What is a chi-square test used for?
A Chi-Square test is used to determine whether the data you have obtained matches your expectations. It compares the observed values with the expected values to check whether the data is consistent with the null hypothesis.
What is the chi-square test in simple terms?
In simple terms, the Chi-Square test helps you decide whether or not to reject your null hypothesis. This is done by comparing the observed values with the expected values.
Where can we use the chi-square test?
We can use the Chi-Square test when the sample size is large. If the sample size is less than 50, using the Chi-Square test is not recommended.
How do you interpret chi-square results?
In a Chi-Square test, if the p-value is less than the significance level, the difference between the observed and the expected values is considered statistically significant, and the null hypothesis is rejected.
What are the two types of chi-square tests?
The two types of Chi-Square tests are the goodness of fit test and the test of independence.
What is the difference between chi-square and t-test?
A chi-square test checks the null hypothesis about the relationship between two categorical variables, whereas a t-test checks the null hypothesis about two means: whether they are equal, that is, whether the difference between them is zero.
What are the characteristics of the chi-square test?
The test is based on frequencies and not on standard deviations or means. It is not useful for estimation, and it makes no assumption about the distribution of the underlying population, which is why it is a non-parametric test. It is widely used in research work.
What is the p-value in Chi-Square?
The p-value is the probability of seeing a difference at least as large as the one observed if only random chance were at work, that is, if the null hypothesis were true. The lower the value of p, the greater the statistical significance of the observed difference.
How do I report chi-square?
This is the format for reporting the results from a Chi-Square test:
χ² (degrees of freedom, N = sample size) = chi-square statistic value, p = p-value.
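As an illustration, the reporting format can be filled in for the restaurant example with a small Python helper. The function name `report` and the rounding to 2 and 3 decimals are assumptions, not a prescribed standard.

```python
def report(chi_square: float, df: int, n: int, p_value: float) -> str:
    """Format a chi-square result in the reporting style shown above."""
    return f"χ²({df}, N = {n}) = {chi_square:.2f}, p = {p_value:.3f}"


# Values from the restaurant example
print(report(8.88, 4, 100, 0.06417))
```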
Conclusion
Now that you have gained a working knowledge of the Chi-Square test, remember that this test tells you how well the observed values agree with the estimated values. It tells you whether the variables appear to be independent or not, but it does not provide insights into how the variables are dependent or what kind of relationship exists between them.
To conclude, let’s look at a quote by Daniel Keys Moran:
“You can have data without information, but you cannot have information without data.”