One of the first steps in analysing a dataset is data exploration. The box plot is a standardized way of displaying the distribution of data. It gives information about the minimum, first quartile, median, third quartile, and maximum.
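To make those five numbers concrete, here is a minimal sketch that computes them with Python's standard library. The function name `five_number_summary` and the sample values are made up for illustration and are not taken from the dataset. The `"inclusive"` method is my choice; I believe it corresponds to the linear interpolation NumPy uses by default, but treat that as an assumption.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a collection of numbers."""
    data = sorted(float(v) for v in values)
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring sorted points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical sample values, not taken from the customer dataset
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(sample))
```

The box of a box plot spans Q1 to Q3, and the line inside the box marks the median.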
In this short post, I will focus on the visualization of the data and the outliers using box plots. To learn more about data exploration, check this very thorough post.
I will be using data from an assignment of the Machine Learning Nanodegree: customer segmentation. The dataset is made of 400 businesses (restaurants, retailers, etc.), with their purchasing patterns for Milk products, Grocery products, and more.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# log_data is assumed to be an existing DataFrame with one numeric column per product category
# create "melted" (long-format) dataframe: one row per (variable, value) pair
# (pd.melt replaces the DataFrame.append loop; append was removed in pandas 2.0)
df = pd.melt(log_data, var_name="variable", value_name="value")
print(df.shape)

# create the boxplot with data points overlay
plt.figure(figsize=(8, 6))
sns.boxplot(x="value", y="variable", data=df, color="c")
sns.stripplot(x="value", y="variable", data=df, jitter=True,
              size=4, alpha=.4, color=".3", linewidth=0)
sns.despine(trim=True)
plt.show()
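With seaborn's default settings, the whiskers do not always reach the minimum and maximum: they stop at the most extreme points within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an individual outlier point. If you want to list those points rather than just see them, the rule can be sketched as below. The helper name `tukey_outliers` and the sample values are hypothetical, and the quartile method is my choice, so the result may differ slightly from what the plot shows at the boundaries.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    data = [float(v) for v in values]
    q1, _, q3 = statistics.quantiles(sorted(data), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the input values
    return [v for v in data if v < lower or v > upper]

# Hypothetical values standing in for one column of the dataset
column = [8.1, 8.4, 8.6, 8.9, 9.0, 9.2, 9.5, 4.0, 12.9]
print(tukey_outliers(column))
```

Assuming the DataFrame has a column named `Milk` (the exact column names are not shown in this post), the call would look like `tukey_outliers(log_data["Milk"])`.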
You can also check out this post, where Matplotlib was used to generate the box plots.