Analyzing Categorical Data with Contingency Tables

The previous post discussed the concept of missing categorical data in datasets and highlighted its importance. The types of missing data, Missing Completely at Random (MCAR) and Missing at Random (MAR), were explained. Different methods of handling missing data were presented for MCAR and MAR scenarios, including mode imputation, creating an "Unknown" category, conditional imputation, hot deck imputation, multiple imputation, logistic regression imputation, propensity score matching, cluster analysis, and using specialized software. The validation of imputation methods and the role of domain knowledge were emphasized.

Additionally, the use of chi-square analyses to assess whether missingness is Missing at Random (MAR) was covered. The process of performing a chi-squared test and interpreting its results was explained, including the comparison of observed and expected frequencies in a contingency table and the consideration of the p-value. The application of the chi-squared test in Python with a mock dataset was provided as an example to illustrate the process.

The post emphasized that while statistical techniques like the chi-squared test provide evidence, making a conclusion about the randomness of missing data should also involve domain knowledge and other analyses.

In this week's post, we will delve into contingency tables.

A contingency table, also known as a cross-tabulation or crosstab, is a table in matrix format that displays the frequency distribution of two or more categorical variables. It is used to show the relationship between those variables.

Below is an example showing a small business's smartphone sales from the third quarter of 2017 through the third quarter of 2018.

In this example, we have a crosstab of Brand and Quarter. The quarters range from 2017 Q3 through 2018 Q3. For example, in the first quarter of 2018, thirty-seven Apple phones were sold.
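
A table of this kind can be built in Python with pandas. The snippet below is a minimal sketch: the six sales records are invented for illustration and are not the figures from the example above. It relies on `pd.crosstab`, which counts how often each combination of row and column category occurs.

```python
import pandas as pd

# Hypothetical sales records, one row per phone sold (made-up data).
sales = pd.DataFrame({
    "Brand":   ["Apple", "Samsung", "Apple", "Nokia", "Samsung", "Apple"],
    "Quarter": ["2018 Q1", "2018 Q1", "2018 Q2", "2018 Q1", "2018 Q2", "2018 Q1"],
})

# Rows are brands, columns are quarters, cells are counts of records.
table = pd.crosstab(sales["Brand"], sales["Quarter"])
print(table)

# margins=True appends row and column totals labelled "All".
table_with_totals = pd.crosstab(sales["Brand"], sales["Quarter"], margins=True)
print(table_with_totals)
```

A brand and quarter combination with no sales shows up as 0 rather than being left out, so the table always stays rectangular.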

Cross-tabulation can be used in market research, survey analysis, business intelligence, and social science research. It helps to:

  1. Explore relationships or associations between two or more categorical variables.
  2. Understand interactions between several factors or variables.
  3. Identify trends and patterns within data.
  4. Calculate joint and conditional probabilities and provide the input for statistical tests.
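
As a rough illustration of the last point, the `normalize` argument of `pd.crosstab` turns counts into proportions. The survey data and the column names `Gender` and `Owns` below are made up for this sketch.

```python
import pandas as pd

# Made-up survey responses: gender and whether the person owns the product.
survey = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "Owns":   ["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes"],
})

# Joint probabilities: each cell divided by the grand total.
joint = pd.crosstab(survey["Gender"], survey["Owns"], normalize="all")
print(joint)

# Conditional probabilities of ownership given gender:
# each cell divided by its row total, so every row sums to 1.
conditional = pd.crosstab(survey["Gender"], survey["Owns"], normalize="index")
print(conditional)
```

Using `normalize="columns"` instead would condition on ownership rather than on gender.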

Applications of Crosstabs

  • Data Exploration: To identify trends, patterns, relationships, or correlations between variables.
  • Data Cleaning: To identify errors or inconsistencies.
  • Feature Engineering: To create new features for machine learning models. The relationship between two variables might provide useful information to create additional variables.
  • Hypothesis Testing: For Chi-Square and other statistical tests.
  • Data Visualization: Crosstabs form the basis for many data visualizations, such as heatmaps and mosaic plots.
  • Reporting and Communication: To provide a summary of data for non-technical stakeholders.
  • Confusion Matrix: For evaluating classification models. This allows us to easily see the counts of true positives, true negatives, false positives, and false negatives. 
  • Comparison of Models: Cross-tabulation can also be used to compare the performance of different models or different configurations of a model. For example, the rows of the table could represent different models or configurations, while the columns represent various performance metrics. This makes it easy to compare the models' performance and choose the best one.
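
To make the confusion matrix application concrete, here is a small sketch that cross-tabulates actual against predicted labels. The labels are invented, and treating "spam" as the positive class is an assumption of this example.

```python
import pandas as pd

# Invented labels for eight messages; "spam" is treated as the positive class.
actual = pd.Series(
    ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"], name="Actual")
predicted = pd.Series(
    ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "spam"], name="Predicted")

# Rows are the true labels, columns are the model's predictions.
confusion = pd.crosstab(actual, predicted)
print(confusion)

true_pos = confusion.loc["spam", "spam"]
false_neg = confusion.loc["spam", "ham"]
false_pos = confusion.loc["ham", "spam"]
true_neg = confusion.loc["ham", "ham"]

# Accuracy: the share of all predictions that were correct.
accuracy = (true_pos + true_neg) / confusion.values.sum()
print(f"Accuracy: {accuracy:.2f}")
```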

Real-Life Examples:

Here are real-life examples of how contingency tables are useful in various fields, along with aspects of their construction, interpretation, and applications:

Example 1: Marketing Analysis

Scenario: A company is launching a new product and wants to understand how customer preferences for the product vary based on gender and age group.


In this contingency table, the rows represent different age groups, while the columns represent gender. The cell values represent the number of customers falling into each category. By analyzing this table, the company can identify trends in product preferences based on gender and age.

The company can use this information to tailor marketing strategies. For instance, if the product is more popular among females aged 18-30, they might consider launching targeted advertising campaigns aimed at this demographic.

Example 2: Social Science Research

Scenario: Researchers want to study the relationship between political affiliation and support for a specific policy.


In this case, the rows represent political affiliation, and the columns represent support or opposition to the policy. The cell values indicate the number of individuals with each combination of characteristics.

Analyzing this table could reveal whether there is an association between political affiliation and policy support. Such insights could help guide policymakers in understanding the potential challenges and opportunities for policy implementation.
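
One common way to check for such an association is the chi-squared test of independence. The sketch below assumes SciPy is installed and uses invented counts, with hypothetical group names "Party A", "Party B", and "Independent".

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: rows are political affiliation, columns are policy stance.
observed = pd.DataFrame(
    {"Support": [60, 30, 45], "Oppose": [40, 70, 55]},
    index=["Party A", "Party B", "Independent"],
)

# The test compares the observed counts with the counts expected
# if affiliation and policy stance were independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-squared statistic: {chi2:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.5f}")
print("Expected counts under independence:")
print(pd.DataFrame(expected, index=observed.index, columns=observed.columns))
```

A small p-value suggests that affiliation and policy support are not independent in this sample. As with the missing-data checks in the previous post, the result should be read alongside domain knowledge rather than on its own.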

Example 3: Medical Studies

Scenario: Researchers are investigating the relationship between smoking status and the occurrence of a specific medical condition.



Here, the rows represent the presence or absence of the medical condition, while the columns represent smoking status. The cell values indicate the number of individuals with each combination of attributes.

By analyzing this table, researchers can determine whether smoking status is associated with a higher likelihood of the medical condition. This information could have implications for public health campaigns and interventions to reduce the risk of the condition among smokers.
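
A simple first look at such a table is to compare the proportion of each smoking group that has the condition. The counts below are invented for illustration and do not come from a real study. The ratio of the two proportions (often called the relative risk) is a descriptive summary only and says nothing about statistical significance or causation.

```python
import pandas as pd

# Hypothetical counts (invented for illustration, not from a real study).
# Rows are the condition, columns are smoking status, as in the example above.
table = pd.DataFrame(
    {"Smoker": [30, 70], "Non-smoker": [10, 90]},
    index=["Condition present", "Condition absent"],
)

# Proportion of each smoking group that has the condition.
risk = table.loc["Condition present"] / table.sum(axis=0)
print(risk)

# Relative risk: how many times more common the condition is among smokers.
relative_risk = risk["Smoker"] / risk["Non-smoker"]
print(f"Relative risk: {relative_risk:.2f}")
```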

Using Contingency Tables to Graph Categorical Variables

When dealing with categorical variables, several types of graphs can effectively visualize and communicate various aspects of the data. Below is a summary of these graphs:

  • Bar Chart
        1. Represents the frequency or proportion of each category.
        2. Suitable for comparing categorical data across one variable.
        3. Can be stacked for comparing subcategories within each category.

Let's create some sample categorical data using Python and then create bar charts to demonstrate the three points above. We'll use fictional data related to customer preferences for different product categories. You may copy the code and follow along.

# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import random

# Generating sample data
categories = ['Electronics', 'Clothing',
              'Books', 'Home Appliances']
data = []

for i in range(100):
    category = random.choice(categories)
    data.append({'Category': category})

# Creating a DataFrame
df = pd.DataFrame(data)

# Counting the frequency of each category
category_counts = df['Category'].value_counts()
print(category_counts)

This code prints the frequency table below. Because the data are generated randomly, your counts will differ from run to run.

Let's create a bar chart for the counts.

plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar')
plt.title('Frequency of Customer Preferences for Product Categories')
plt.xlabel('Product Category')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

We get the bar chart below showing the frequency of preference for each product.





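Let's add a gender for each customer and create a stacked bar chart that breaks each product category down by gender.

# The sample data above has no 'Gender' column, so we add a random one here
df['Gender'] = [random.choice(['Male', 'Female']) for _ in range(len(df))]

# Counting the frequency of each category and gender combination
# (the result is itself a contingency table of Category by Gender)
gender_counts = df.groupby(['Category', 'Gender']).size().unstack(fill_value=0)

# Creating the stacked bar chart (DataFrame.plot opens its own figure,
# so the size is passed via figsize instead of calling plt.figure first)
gender_counts.plot(kind='bar', stacked=True, figsize=(10, 6))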
plt.title('Customer Preferences for Product Categories by Gender')
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Gender')
plt.tight_layout()
plt.show()


Below is the output:





Insight and Conclusions

  • Preference Ranking: In this sample, Electronics is the most preferred category, followed by Clothing, Books, and Home Appliances.
  • Gender Variations: Gender variations are observed in the Clothing and Home Appliances categories, where females have a slightly higher preference.
  • Consistency: The preferences across categories appear consistent, with only minor differences in distribution.
  • Market Insights: This data can help in making decisions about inventory, marketing strategies, and product promotions. The insights might vary depending on whether the goal is to maximize revenue or target specific gender-based preferences.
  • Targeted Marketing: The data indicates that gender preferences could influence marketing efforts. For instance, marketing campaigns for Home Appliances could be tailored to appeal more to females.
  • Future Investigation: Further analysis could involve examining the demographics of customers to see if age or other factors impact these preferences.

In addition to bar charts, categorical data can be visualized using the following:

  • Stacked Bar Charts
  • Mosaic Plots
  • Heatmaps
  • Clustered Bar Charts
  • Diverging Stacked Bar Charts
  • Pie Charts
  • Doughnut Charts
  • Pareto Charts
  • Dot Plots
  • Waterfall Charts (can be adapted to categorical variables for stepwise changes)
  • Waffle Charts
  • Radar Charts (Spider Charts)
  • Sankey Diagrams
  • Word Clouds
  • Chord Diagrams and Network Diagrams

We will investigate some of these charts and look in detail at other methods available for contingency table analysis.




