Exploratory Data Analysis with Seaborn (with a focus on data visualization) - Section 1

By Jia Liu in python data visualization EDA Seaborn

January 19, 2022

So why are we here?

Recently I’ve been using Python more often in my work. I started to learn about Seaborn, a Python visualization package built on top of Matplotlib. It generates pretty figures! So I plan to do some exploratory data analysis with a focus on visualization using Seaborn. The dataset we will be using is the US health insurance dataset from Kaggle.

In this section, we will focus on:

- Visualizing the distribution of a single variable
  - Continuous variable
  - Categorical variable
- Visualizing the bivariate distribution

(Future improvement: troubleshooting the figure size adjustment for Python-generated plots in R Markdown)

Prepare

We will first import the packages we need and set some basic parameters for the figures:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

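# Default size (width, height in inches) for figures created by matplotlib
plt.rcParams['figure.figsize'] = [12, 4]
#plt.rcParams['font.size'] = 14
#plt.rcParams['font.weight'] = 'bold'
# The 'seaborn-whitegrid' style name was removed from recent matplotlib releases,
# so set the white grid style through seaborn instead
sns.set_style('whitegrid')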

Read in the insurance data:

df = pd.read_csv("insurance.csv")

print('\nNumber of rows and columns in the data set: ', df.shape)

## 
## Number of rows and columns in the data set:  (1338, 7)

df.head()

##    age     sex     bmi  children smoker     region      charges
## 0   19  female  27.900         0    yes  southwest  16884.92400
## 1   18    male  33.770         1     no  southeast   1725.55230
## 2   28    male  33.000         3     no  southeast   4449.46200
## 3   33    male  22.705         0     no  northwest  21984.47061
## 4   32    male  28.880         0     no  northwest   3866.85520

This data set includes insurance information for $1338$ subjects. The first $6$ columns (age, sex, bmi, children, smoker, region) are the independent variables, or features. We want to see how these $6$ features influence the $7$th column, charges, which is the dependent variable.

Exploratory data analysis

Check for missing values

Check how many missing values each column has with the code below:

df.isnull().sum().to_frame(name = 'sum')

##           sum
## age         0
## sex         0
## bmi         0
## children    0
## smoker      0
## region      0
## charges     0

No column has missing values, so this seems to be quite a clean data set. Let’s move on to the next step.
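For comparison, here is a sketch of what the same check reports when something is missing. It assumes a stand-in frame (sample) built from the first five rows, with one bmi value blanked out on purpose; the share column and the dropna() step are additions for illustration, not part of the analysis above.

```python
import numpy as np
import pandas as pd

# First five rows of the data (three columns only), with one bmi value removed on purpose
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, np.nan, 22.705, 28.88],  # NaN injected for illustration
    "smoker": ["yes", "no", "no", "no", "no"],
})

# Count of missing values per column, as in the check above
missing = sample.isnull().sum().to_frame(name="sum")
# Fraction of rows that are missing in each column
missing["share"] = sample.isnull().mean()
print(missing)

# One simple way to handle the gap: keep only the complete rows
complete = sample.dropna()
print(len(complete))
```

Dropping rows is only one option. Whether to drop or impute depends on how many values are missing and why.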

Visualizing distributions

When looking at a data set, the first thing we may want to check is the distribution of some specific features. Taking the insurance data set as an example, we may want to know how the insurance charges are distributed, or what the age distribution looks like. Visualizing a distribution tells us what to expect for the given feature(s) and lets us compare the distributions of different variables.

Univariate distribution

1.1. Continuous variable

Histograms, KDE plots, and ECDF plots are commonly used to display the distribution of a variable. All three can be produced with the displot() function in seaborn by changing its kind argument. Take the insurance charges as an example:

# Histogram of charges, colored by smoker status
sns.displot(x = 'charges', data = df, kind = 'hist', bins = 40, hue = 'smoker')

# KDE plot for charges
sns.displot(data = df, x = 'charges', kind = 'kde')

# Include the histogram, KDE, and rug on one plot
sns.displot(data = df, x = 'charges', kind = 'hist', kde = True, rug = True, bins = 40)

# ecdf of charges
sns.displot(data = df, x = 'charges', kind = 'ecdf', rug = True)
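The ECDF is the least familiar of the three, so it may help to compute one by hand. The sketch below uses a hypothetical helper, ecdf(), and only the five charges values shown by df.head(): at the i-th smallest value, the curve reaches i / n.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the cumulative proportion i/n at the i-th smallest value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# The five charges values from df.head(), used as a tiny illustration
charges_head = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]
x, y = ecdf(charges_head)

for value, proportion in zip(x, y):
    print(f"{proportion:.0%} of these charges are <= {value:.2f}")
```

Reading the ECDF plot works the same way: pick a charge on the x-axis and the curve gives the proportion of subjects charged that amount or less.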

The insurance charges seem to follow a right-skewed, bimodal distribution: a large percentage of people are charged less than $20000$, while a small group of people are charged over $30000$ or $40000$.
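A few numbers can back up what the plots suggest. This is a minimal sketch with a hypothetical helper, charge_summary(); the 20000 threshold is taken from the reading above, and the demonstration uses only the five charges values from df.head(), so its output says nothing about the full data set.

```python
import pandas as pd

def charge_summary(charges, threshold=20000):
    """Summarize a charges column: share below a threshold, center, and skewness."""
    charges = pd.Series(charges, dtype=float)
    return {
        "share_below": (charges < threshold).mean(),  # fraction of values under the threshold
        "median": charges.median(),
        "mean": charges.mean(),
        "skew": charges.skew(),  # positive values indicate a longer right tail
    }

# Five charges values from df.head(), for illustration only
summary = charge_summary([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552])
print(summary)
```

On the full data, charge_summary(df['charges']) would give the same quantities. A mean well above the median is what we would expect from a long right tail.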

1.2. Categorical variable

The counts of the different values of a categorical variable can be displayed with the countplot() function. Let’s take the categorical variable ‘region’ as an example:

# Display the counts distribution for different regions
sns.countplot(x = 'region', hue = 'smoker', data = df)

From the plot above we can tell that the number of subjects is roughly the same in each region. In every region, there are far fewer smokers than non-smokers.
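The bar heights can also be read as a table. Below is a sketch using pd.crosstab() on a stand-in frame (sample) that holds only the region and smoker columns of the five rows shown by df.head(); on the real data the same calls would take df['region'] and df['smoker'].

```python
import pandas as pd

# region and smoker columns of the first five rows, for illustration only
sample = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# Raw counts: one row per region, one column per smoker status
counts = pd.crosstab(sample["region"], sample["smoker"])
print(counts)

# Row-wise proportions: share of smokers and non-smokers within each region
shares = pd.crosstab(sample["region"], sample["smoker"], normalize="index")
print(shares)
```

The normalized version is convenient when regions differ in size, since it compares smoking rates rather than raw counts.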

Bivariate distribution

Histograms and KDE plots can also be used to show the joint distribution of two variables.

# Bivariate histogram of bmi (x) and charges (y), colored by smoker status
sns.displot(y = 'charges', x = 'bmi', data = df, kind = 'hist', bins = 40, hue = 'smoker')

# Bivariate KDE of bmi (x) and charges (y), colored by smoker status
sns.displot(y = 'charges', x = 'bmi', data = df, kind = 'kde', hue = 'smoker')

From the above plots we can see that the bmi of both smokers and non-smokers ranges roughly from $16$ to $50$. For non-smokers, the insurance charges do not seem to be linearly correlated with bmi. But smokers with a high bmi seem to be charged more for insurance.
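One way to check this impression is to compute the correlation between bmi and charges separately for each group. The sketch below assumes a hypothetical helper, corr_by_group(), and a stand-in frame with the five rows from df.head(); the min_periods = 3 cutoff is an arbitrary choice that returns NaN for groups too small to say anything.

```python
import pandas as pd

def corr_by_group(df, group, x, y, min_periods=3):
    """Pearson correlation between x and y within each level of group."""
    return pd.Series({
        name: g[x].corr(g[y], min_periods=min_periods)  # NaN if the group is too small
        for name, g in df.groupby(group)
    })

# bmi, smoker, and charges from the first five rows, for illustration only
sample = pd.DataFrame({
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

r = corr_by_group(sample, "smoker", "bmi", "charges")
print(r)
```

Five rows are far too few to draw conclusions from. On the full data, corr_by_group(df, 'smoker', 'bmi', 'charges') would give one coefficient per group to compare against the plots.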
