Calculating Pearson Correlation

I’m going to be doing an analysis of the Gapminder dataset, looking at the relationship between the internet use rate and life expectancy variables, as well as the association between internet use rate and employment rate.

Here’s the code used to produce the visualizations and run the Pearson correlation on the chosen variables in the Gapminder dataset.

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

plot_regression('lifeexpectancy', 'internetuserate', df_clean, 'Life Expectancy', 'Internet Use Rate')
plot_regression('lifeexpectancy', 'employrate', df_clean, 'Life Expectancy', 'Employment Rate')

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))

Here are the plots that were rendered:

Here’s the output of the Pearson correlation:

Association between life expectancy and internet use rate
(0.77081050888289, 5.983388253650836e-33)


Association between employment rate and internet use rate
(-0.1950109538173115, 0.013175901971555317)

Based on the output above, life expectancy and internet use rate have a strong positive correlation (r ≈ 0.77) that is extremely unlikely to be due to chance. Employment rate and internet use rate show a weak negative correlation (r ≈ -0.20) that is less robust but still statistically significant (p ≈ 0.013).
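pearsonr returns a tuple of the correlation coefficient and the two-tailed p-value. As a small follow-up (not part of the original code), we can unpack those values and square r to get the coefficient of determination, an estimate of how much variance the two variables share:

# unpack r and p, and report r squared (the proportion of shared variance)
r, p = pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate'])
print("r = {}".format(r))
print("p = {}".format(p))
print("r squared = {}".format(r ** 2))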

Running a Chi-Square Test of Independence – Gapminder Dataset

For the Gapminder data I’m investigating whether there is a relationship between internet use rate and life expectancy. Because both variables are continuous, I start by binning them. Internet use rate is split into two equal-sized groups at its median (the bottom and top halves of countries by internet use), while life expectancy is divided into 10 decile bins.

import pandas as pd
import scipy.stats


# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 10, labels=["1=10%", "2=20%", "3=30%", "4=40%", "5=50%",
                                                                      "6=60%", "7=70%", "8=80", "9=90%", "10=100%"])

def bin_half(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins_2".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 2, labels=["1=50%", "2=100%"])

df3 = df2.copy()
bin(df3, cols)
bin(df3, norm_cols)
bin_half(df3, ['internetuserate'])

We now make recode dictionaries and set up crosstab calculations for four different comparisons. We’re only testing four comparisons here as a proof of concept, to demonstrate that we know how to carry out the calculations.

recode_2 = {"3=30%": "3=30%", "7=70%": "7=70%"}
recode_3 = {"2=20%": "2=20%", "8=80": "8=80"}
recode_4 = {"6=60%": "6=60%", "9=90%": "9=90%"}
recode_5 = {"4=40%": "4=40%", "7=70%": "7=70%"}

df3['Comp_3v7'] = df3['lifeexpectancy_bins'].map(recode_2)
df3['Comp_2v8'] = df3['lifeexpectancy_bins'].map(recode_3)
df3['Comp_6v9'] = df3['lifeexpectancy_bins'].map(recode_4)
df3['Comp_4v7'] = df3['lifeexpectancy_bins'].map(recode_5)

# get table of observed counts
count_table = pd.crosstab(df3['internetuserate_bins_2'], df3['lifeexpectancy_bins'])
print(count_table)

count_table_3 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_3v7'])
count_table_4 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_2v8'])
count_table_5 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_6v9'])
count_table_6 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_4v7'])

Now we’ll do the chi-square tests.

def chi_square_test(table):

    print("Results for:")
    print(str(table))

    # get column percentages
    col_sum = table.sum(axis=0)
    col_percents = table / col_sum
    print(col_percents)

    chi_square = scipy.stats.chi2_contingency(table)
    print("Chi-square value, p-value, expected_counts")
    print(chi_square)

    print()

print("Initial Chi-square:")
chi_square_test(count_table)
print(" ")

chi_square_test(count_table_3)
chi_square_test(count_table_4)
chi_square_test(count_table_5)
chi_square_test(count_table_6)

Here’s the initial test:

Initial Chi-square:
Results for:
lifeexpectancy_bins 1=10% 2=20% 3=30% 4=40% 5=50% 6=60% 7=70% 8=80 \
internetuserate_bins_2
1=50% 18 19 16 14 12 6 4 4
2=100% 0 0 1 4 6 13 15 11

lifeexpectancy_bins 9=90% 10=100%
internetuserate_bins_2
1=50% 1 0
2=100% 16 19
lifeexpectancy_bins 1=10% 2=20% 3=30% 4=40% 5=50% 6=60% \
internetuserate_bins_2
1=50% 1.0 1.0 0.941176 0.777778 0.666667 0.315789
2=100% 0.0 0.0 0.058824 0.222222 0.333333 0.684211

lifeexpectancy_bins 7=70% 8=80 9=90% 10=100%
internetuserate_bins_2
1=50% 0.210526 0.266667 0.058824 0.0
2=100% 0.789474 0.733333 0.941176 1.0
Chi-square value, p-value, expected_counts
(102.04563740451277, 6.064860600653971e-18, 9, array([[9.45251397, 9.97765363, 8.9273743 , 9.45251397, 9.45251397,
9.97765363, 9.97765363, 7.87709497, 8.9273743 , 9.97765363],
[8.54748603, 9.02234637, 8.0726257 , 8.54748603, 8.54748603,
9.02234637, 9.02234637, 7.12290503, 8.0726257 , 9.02234637]]))

The p-value for the overall test is significant, so we’ll run some post-hoc tests.

Results for:
Comp_2v8 2=20% 8=80
internetuserate_bins_2
1=50% 19 4
2=100% 0 11
Comp_2v8 2=20% 8=80
internetuserate_bins_2
1=50% 1.0 0.266667
2=100% 0.0 0.733333
Chi-square value, p-value, expected_counts
(17.382650301643437, 3.0560286589975315e-05, 1, array([[12.85294118, 10.14705882],
[ 6.14705882, 4.85294118]]))

Results for:
Comp_6v9 6=60% 9=90%
internetuserate_bins_2
1=50% 6 1
2=100% 13 16
Comp_6v9 6=60% 9=90%
internetuserate_bins_2
1=50% 0.315789 0.058824
2=100% 0.684211 0.941176
Chi-square value, p-value, expected_counts
(2.319693757720874, 0.12774517376836148, 1, array([[ 3.69444444, 3.30555556],
[15.30555556, 13.69444444]]))

Results for:
Comp_4v7 4=40% 7=70%
internetuserate_bins_2
1=50% 14 4
2=100% 4 15
Comp_4v7 4=40% 7=70%
internetuserate_bins_2
1=50% 0.777778 0.210526
2=100% 0.222222 0.789474
Chi-square value, p-value, expected_counts
(9.743247922437677, 0.0017998260000241526, 1, array([[8.75675676, 9.24324324],
[9.24324324, 9.75675676]]))

Notice that the p-values for the 3 vs. 7 and 2 vs. 8 comparisons are much smaller than the p-values for 6 vs. 9 and 4 vs. 7. The 4 vs. 7 comparison is still significant (p ≈ 0.0018), but the 6 vs. 9 comparison (p ≈ 0.128) is not significant at the 0.05 level.
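One caveat worth adding (it isn’t applied in the code above): when running several post-hoc chi-square comparisons like this, a common safeguard is the Bonferroni adjustment, which divides the significance threshold by the number of comparisons. Here is a minimal sketch using the crosstabs built above:

# Bonferroni adjustment: divide alpha by the number of post-hoc comparisons
post_hoc_tables = {'3 vs 7': count_table_3, '2 vs 8': count_table_4,
                   '6 vs 9': count_table_5, '4 vs 7': count_table_6}

adjusted_alpha = 0.05 / len(post_hoc_tables)  # 0.0125 for four comparisons

for label, table in post_hoc_tables.items():
    chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
    print("{}: p = {}, significant after adjustment: {}".format(label, p, p < adjusted_alpha))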

Running An Analysis Of Variance on Gapminder Data

In this post I run an analysis of variance (ANOVA) on the Gapminder dataset. Because the Gapminder data contains mainly continuous values rather than categorical ones, I had to create categorical variables using the “qcut” function from Python’s Pandas library. Let’s take a look at the code used to load, preprocess, and bin the data.

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import pandas as pd

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 10, labels=["1=10%", "2=20%", "3=30%", "4=40%", "5=50%",
                                                                      "6=60%", "7=70%", "8=80", "9=90%", "10=100%"])

df3 = df2.copy()
#bin(df3, cols)
bin(df3, norm_cols)

I’ll run an ANOVA with life expectancy as the response variable and the binned internet use rate as the explanatory variable. I make one frame for the ANOVA and another frame for checking the mean and standard deviation of each group. I then fit the statsmodels OLS model and print out the results, along with the group means and standard deviations for more context.

anova_frame = df3[['lifeexpectancy', 'breastcancerper100th', 'suicideper100th',
                   'internetuserate_bins', 'employrate_bins']].dropna()

relate_frame = df3[['lifeexpectancy', 'internetuserate_bins']]

anova_model = smf.ols(formula='lifeexpectancy ~ C(internetuserate_bins)', data=anova_frame).fit()
print(anova_model.summary())

mean = relate_frame.groupby("internetuserate_bins").mean()
sd = relate_frame.groupby("internetuserate_bins").std()
print(mean)
print(sd)

Here are the results of the OLS ANOVA.

Here are the mean and standard deviation of life expectancy for each internet use rate bin.

The F-statistic is 36.76 and the p-value is 8.78e-34, which is extremely small, so this looks like a significant relationship. There are 10 categories in the binned internet use rate variable, and since the overall p-value is significant, we want to run a post-hoc comparison to find out which groups actually differ.

Here’s the code to run the multi-comparison.

multi_comp = multi.MultiComparison(anova_frame["lifeexpectancy"], anova_frame["internetuserate_bins"])
results = multi_comp.tukeyhsd()
print(results)

And here are the results of the multi-comparison.

We can see that there are significant differences, and that we can reject the null hypothesis for the majority of the pairwise group comparisons.

Creating Graphs For Gapminder Data

In this post I’ll be using exploratory data analysis techniques on the Gapminder dataset. I’ll be graphing some of the features in the dataset, with the primary variables of interest being Internet Use Rate and Life Expectancy. I’ve also included some analysis of secondary variables of interest, including Breast Cancer Rates Per 100k and Employment Rates.

To begin with, here’s the code I used to make the visualizations:

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 10, labels=["1=10%", "2=20%", "3=30%", "4=40%", "5=50%",
                                                                      "6=60%", "7=70%", "8=80", "9=90%", "10=100%"])

df3 = df2.copy()
#bin(df3, cols)
bin(df3, norm_cols)

df_employ = df3[["employrate_bins"]]
print(df_employ)

sns.distplot(df3['lifeexpectancy'].dropna())
plt.title("Life Expectancy Distribution")
plt.show()
sns.distplot(df3['internetuserate'].dropna())
plt.title("Internet Use Rate Distribution")
plt.show()
sns.countplot(x = "employrate_bins", data=df_employ)
plt.title("Distribution of Employment Rates")
plt.show()

sns.lmplot(x = "internetuserate", y="breastcancerper100th", hue="internetuserate_bins", data=df3, fit_reg=False)
plt.title("Internet Use Rate and Breast Cancer Per 100k")
plt.show()

sns.lmplot(x = "internetuserate", y="lifeexpectancy", hue="internetuserate_bins", data=df3, fit_reg=False)
plt.title("Internet Use Rate and Life Expectancy")
plt.show()

sns.lmplot(x = "internetuserate", y="employrate", hue="internetuserate_bins", data=df3, fit_reg=False)
plt.title("Internet Use Rate and Employment Rate")
plt.show()

Now let’s take a look at the visualizations themselves.

To begin with, here’s the distribution of life expectancy for the roughly 210 countries tracked in the dataset.

The data is (arguably) bimodal, with two peaks: a large peak clustered around 70 years and a second, smaller cluster around 50 years.

Now let’s have a look at the distribution for internet use rates.

The largest peak by far is between 0 and 10% internet use rate; it accounts for roughly 25% of the distribution, so most countries have internet access for less than half their population. I’d argue the data is essentially unimodal, although there is arguably a second, smaller peak around the 40% mark.

Let’s try graphing some of the binned data, which includes binned employment rates.

If we do this, we can see that the data is fairly uniformly distributed, which we would expect given how the qcut function works (it creates equal-sized quantile bins). However, there is a small bump around the 30th percentile and a small dip near the 40th percentile.

Now let’s try creating scatter plots for the continuous variables. First, we’ll try plotting the relationship between breast cancer diagnosis per 100k and internet use rate.

There seems to be a strong positive correlation between internet use rate and breast cancer diagnosis rates. This is a good example of why we should be careful about inferring causation from correlation: it’s highly unlikely that there’s a direct causal mechanism between internet usage and breast cancer rates, and both are more plausibly tied to other country-level factors.

Now let’s try graphing the relationship between internet use rate and life expectancy.

There also seems to be a roughly positive correlation between internet use rate and life expectancy. There’s a massive increase in life expectancy as internet use rises from 0 to 20%, followed by a slower increase.

Now let’s try graphing the relationship between internet use and employment rates.

Here’s something I was a little surprised to find. The relationship between internet use and employment rate looks roughly parabolic: employment rate declines as internet saturation increases up to roughly the 50% mark, and then it increases again.
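One way to check that impression visually (an addition, not part of the original code) is to let seaborn fit a second-order polynomial rather than a straight line, using regplot’s order parameter:

# fit a quadratic trend line to visualize the roughly parabolic relationship
sns.regplot(x="internetuserate", y="employrate", order=2,
            data=df3.dropna(subset=["internetuserate", "employrate"]))
plt.xlabel("Internet Use Rate")
plt.ylabel("Employment Rate")
plt.title("Internet Use Rate and Employment Rate (quadratic fit)")
plt.show()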

Making Data Management Decisions

After defining the functions we need, we’ll start off by loading in the data.

Afterwards, we’ll check to see if there are any blank values.

import pandas as pd
import numpy as np

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

check_missing(df, cols)
check_missing(df, norm_cols)

There are quite a few blank values:

Column lifeexpectancy is missing:
22

Column breastcancerper100th is missing:
40

Column suicideper100th is missing:
22

Column internetuserate is missing:
21

Column employrate is missing:
35

Calling the to_numeric function converts the blanks into NaN values, which we can check by calling the frequency distribution function.

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

freq_dist(df2, cols, norm_cols)

NaN 22
63.125 1
74.576 1
62.475 1
74.414 1
79.977 1
58.199 1
75.670 1
81.012 1
72.283 1
55.442 1
81.855 1
48.398 1
68.944 1
75.133 1
76.126 1
69.317 1
65.193 1
75.057 1
77.685 1
68.498 1
62.465 1
79.634 1
73.911 1
80.499 1
61.597 1
79.591 1
71.017 1
82.759 1
68.978 1
..
76.954 1
73.703 1
79.839 1
48.718 1
71.172 1
73.456 1
48.397 1
81.439 1
75.246 1
55.377 1
74.788 1
74.402 1
82.338 1
79.499 1
81.539 1
54.210 1
67.017 1
61.452 1
73.373 1
73.127 1
69.245 1
68.795 1
79.341 1
76.918 1
57.937 1
73.126 1
64.666 1
75.956 1
57.379 1
50.239 1
Name: lifeexpectancy, Length: 190, dtype: int64
Freq dist for: breastcancerper100th
23.5 2
NaN 40
70.5 1
31.5 1
62.5 1
36.0 1
92.0 1
46.0 1
19.5 6
21.5 1
16.5 3
26.0 1
29.5 1
63.0 1
90.0 1
43.5 1
33.0 1
23.0 1
52.5 1
30.6 1
38.5 1
82.5 1
10.5 1
22.5 2
90.8 1
29.0 1
55.5 1
48.0 1
35.0 1
25.2 1
..
58.9 2
23.3 1
33.4 1
28.1 7
29.7 1
83.2 1
84.3 1
50.1 1
43.9 1
4.4 1
19.6 1
29.8 2
17.3 2
18.7 1
31.2 3
20.6 2
38.7 1
21.8 2
58.4 2
35.1 2
24.2 1
13.6 1
30.3 1
18.2 2
79.8 1
46.2 1
62.1 1
50.4 2
46.6 1
31.7 1
Name: breastcancerper100th, Length: 137, dtype: int64
Freq dist for: suicideper100th
NaN 22
9.847460 1
6.519537 1
11.980497 1
13.716340 1
33.341860 1
2.034178 1
14.554677 1
6.449157 1
18.954570 1
5.554276 1
16.959240 1
10.493150 1
11.956941 1
4.961071 1
11.426181 1
4.930045 1
10.645740 1
11.655210 1
2.234896 1
2.515721 1
10.550375 1
1.658908 1
10.365070 1
8.164005 1
3.108603 1
16.234370 1
3.563325 1
10.100990 1
4.217076 1
..
15.538490 1
9.875281 1
10.171735 1
6.087671 1
5.838315 1
25.404600 1
13.637060 1
1.498057 1
8.283071 1
16.913248 1
27.874160 1
15.542603 1
5.767406 1
9.127511 1
4.288574 1
6.882952 1
12.405918 1
10.114997 1
7.699330 1
10.129350 1
13.548420 1
12.289122 1
20.369590 1
8.211067 1
3.716739 1
9.507928 1
2.109414 1
13.089616 1
9.257976 1
6.811439 1
Name: suicideper100th, Length: 192, dtype: int64
Freq dist for: internetuserate
81.000000 0.004695
66.000000 0.004695
45.000000 0.004695
NaN 0.098592
2.100213 0.004695
65.000000 0.004695
80.000000 0.004695
5.001375 0.004695
51.914184 0.004695
42.984580 0.004695
25.000000 0.004695
39.820178 0.004695
9.999954 0.004695
44.989947 0.004695
2.300027 0.004695
5.999836 0.004695
71.849124 0.004695
44.001025 0.004695
7.232224 0.004695
90.703555 0.004695
1.259934 0.004695
69.339971 0.004695
35.850437 0.004695
26.477223 0.004695
2.199998 0.004695
16.780037 0.004695
40.122235 0.004695
15.999650 0.004695
70.028599 0.004695
81.590397 0.004695
..
2.699966 0.004695
12.006692 0.004695
1.280050 0.004695
74.247572 0.004695
28.731883 0.004695
81.338393 0.004695
43.366498 0.004695
6.497924 0.004695
65.387786 0.004695
12.500255 0.004695
47.867469 0.004695
12.349750 0.004695
43.055067 0.004695
40.650098 0.004695
45.986590 0.004695
54.992809 0.004695
33.382128 0.004695
65.808554 0.004695
44.570074 0.004695
7.499996 0.004695
88.770254 0.004695
1.700031 0.004695
4.170136 0.004695
56.300034 0.004695
51.280478 0.004695
36.000335 0.004695
41.000128 0.004695
33.616683 0.004695
60.119707 0.004695
1.699985 0.004695
Name: internetuserate, Length: 193, dtype: float64
Freq dist for: employrate
50.500000 0.004695
NaN 0.164319
61.500000 0.014085
46.000000 0.009390
64.500000 0.004695
63.500000 0.004695
51.000000 0.009390
68.000000 0.004695
56.000000 0.009390
56.500000 0.004695
59.000000 0.009390
53.500000 0.014085
81.500000 0.004695
66.000000 0.009390
64.599998 0.004695
64.199997 0.004695
83.000000 0.004695
60.500000 0.004695
42.500000 0.004695
54.500000 0.009390
77.000000 0.004695
42.000000 0.004695
65.000000 0.014085
61.700001 0.004695
50.900002 0.009390
61.000000 0.009390
66.199997 0.004695
57.900002 0.004695
76.000000 0.004695
49.500000 0.004695
..
63.700001 0.004695
52.700001 0.004695
78.900002 0.004695
70.400002 0.009390
71.699997 0.004695
41.599998 0.004695
80.699997 0.004695
54.599998 0.004695
42.799999 0.004695
50.700001 0.004695
57.200001 0.004695
66.800003 0.004695
54.400002 0.004695
63.099998 0.004695
63.900002 0.004695
55.099998 0.004695
67.300003 0.004695
44.799999 0.004695
64.300003 0.004695
47.099998 0.004695
41.099998 0.004695
44.299999 0.004695
62.400002 0.004695
51.299999 0.004695
79.800003 0.004695
58.599998 0.004695
63.799999 0.009390
63.200001 0.004695
65.599998 0.004695
68.300003 0.004695

The blank values have successfully been replaced with NaN. This is convenient because when we plot graphs later, we can simply have Python ignore those NaN values; the missing data is now coded as NaN.
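As a quick sanity check (not in the original post), we can also count the NaN values per column and confirm they match the blank counts found earlier:

# NaN counts per column; these should match the blank-string counts above
print(df2[cols + norm_cols].isnull().sum())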

I’ve also decided to experiment with binning. As one more step, we’ll create five quantile bins for each of our chosen features.

Here’s the code to do that:

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 5, labels=["1-20%", "21 to 40%", "40 to 60%", "60 to 80%", "80 to 100%"])

df3 = df2.copy()
bin(df3, cols)
bin(df3, norm_cols)

# print out the head and check the dataframe 
print(df3.head())

Let’s print out the head of the dataframe to see if the values exist. It looks a little messy, but we can see that the new columns have successfully been created.
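If the full head() output is too messy to read, one option (again, an addition rather than part of the original code) is to print only the new bin columns, using pandas’ filter method:

# show just the newly created quantile-bin columns
print(df3.filter(like="_bins").head())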

Running The First Program – Frequency Distribution





As mentioned in the last post, I’m doing an analysis of the GapMinder dataset. I’ve selected five variables of interest: life expectancy, breast cancer rate (breastcancerper100th), suicide rate (suicideper100th), internet use rate, and employment rate. The frequency distributions for the first three variables are shown as raw counts, while the last two are shown as normalized proportions. The code below loads in the data, splits the variables into those two lists, and then prints a frequency distribution for each variable along with its five most common values. Jump to the bottom for the analysis of missing values and most common values.

import pandas as pd

# load data
dataframe = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(dataframe.isnull().values.any())

# split into two lists
cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']

norm_cols = ['internetuserate', 'employrate']

# print frequency distribution and 5 most common values
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))

        count = dataframe[col].value_counts(sort=False)
        # take the five most frequent values
        top_5 = dataframe[col].value_counts()[:5].index.tolist()
        print(count)
        print("Top 5 most common:")
        print(top_5)
        print("-----")

    for col in norm_cols:
        print("Freq dist for: {}".format(col))

        count = dataframe[col].value_counts(sort=False, normalize=True)
        top_5 = dataframe[col].value_counts()[:5].index.tolist()
        print(count)
        print("Top 5 most common:")
        print(top_5)
        print("-----")


freq_dist(dataframe, cols, norm_cols)

Here are the results of the frequency distributions:

Freq dist for: lifeexpectancy
62.465 1
73.127 1
76.918 1
62.475 1
79.839 1
67.484 1
55.442 1
73.126 1
81.907 1
51.61 1
56.081 1
22
74.522 1
58.199 1
79.158 1
72.64 1
75.62 1
75.181 1
76.142 1
67.529 1
74.221 1
76.64 1
51.444 1
73.371 1
57.937 1
48.397 1
79.977 1
79.915 1
68.978 1
50.411 1
..
57.379 1
64.666 1
49.025 1
72.283 1
77.005 1
68.749 1
74.573 1
80.17 1
55.439 1
81.539 1
75.246 1
79.311 1
54.116 1
68.846 1
77.685 1
79.499 1
72.15 1
75.901 1
65.438 1
80.642 1
73.403 1
73.99 1
68.823 1
73.488 1
58.582 1
81.618 1
72.231 1
81.097 1
51.879 1
73.979 2
Name: lifeexpectancy, Length: 190, dtype: int64
Top 5 most common:

[' ', '72.974', '73.979', '73.911']

Freq dist for: breastcancerper100th
50.9 1
25.2 1
31.7 1
18.3 1
48 1
84.3 1
19.6 1
17.9 1
21.5 1
30 1
33.3 1
74.4 1
23.9 1
24 1
24.1 1
20.4 2
38.7 1
46.2 1
12.3 1
38.8 1
29.7 1
91.9 2
34.3 1
62.5 1
31.2 3
43.9 1
33.4 1
29 1
88.7 1
62.1 1
..
51.8 1
55.5 1
82.5 1
52.5 1
67.2 1
51.1 1
26.4 1
23 1
30.9 1
86.7 1
73.9 1
23.1 1
23.5 2
19.5 6
50.4 2
101.1 1
90 1
18.4 1
52.1 1
44.8 1
49.6 1
63 1
43.5 1
29.8 2
19 1
13.6 1
50.1 1
13.2 2
17.1 1
30.6 1
Name: breastcancerper100th, Length: 137, dtype: int64
Top 5 most common:

[' ', '28.1', '19.5', '24.7']

Freq dist for: suicideper100th
6.02188205718994 1
6.26578903198242 1
11.9804973602295 1
6.1052818998346 1
14.09153 1
6.44915676116943 1
2.0341784954071 1
7.765584 1
5.83525085449219 1
4.409532 1
14.77625 1
11.11583 1
35.752872467041 1
11.9569406509399 1
22
3.94025897979736 1
8.21094799041748 1
4.37336492538452 1
20.3179302215576 1
9.127511 1
14.4696483612061 1
9.211085 1
11.2139701843262 1
11.6533222198486 1
3.74158787727356 1
7.44382619857788 1
5.36217880249023 1
10.0719423294067 1
10.1149969100952 1
3.10860252380371 1
..
1.65890777111053 1
4.8487696647644 1
9.6331148147583 1
7.69932985305786 1
12.2167692184448 1
8.28307056427002 1
1.37000155448914 1
10.823 1
12.17976 1
7.87687826156616 1
.20144872367382 1
11.4261808395386 1
12.8698148727417 1
9.927033 1
6.68438529968262 1
20.16201 1
13.23981 1
7.06018447875976 1
5.931845 1
7.7450647354126 1
3.56332468986511 1
29.864164352417 1
9.70955562591553 1
15.5426025390625 1
4.7510838508606 1
4.52785158157349 1
5.55427646636963 1
4.41499042510986 1
6.3698878288269 1
13.1179485321045 1
Name: suicideper100th, Length: 192, dtype: int64
Top 5 most common:

[' ', '13.1179485321045', '1.57435011863708', '14.5546770095825']

Freq dist for: internetuserate
44.5853546903272 0.004695
47.2804360294118 0.004695
40.0200948796529 0.004695
62.8119000060846 0.004695
76.5875384615385 0.004695
90.0161900191939 0.004695
8.37020688381867 0.004695
31.0043782824698 0.004695
77.6385351546869 0.004695
0.098592
11.0000554403336 0.004695
42.7478120557293 0.004695
7.0002138207311 0.004695
12.5002554302298 0.004695
44.9899469578783 0.004695
69.3399707174231 0.004695
12.3348932627873 0.004695
47.8674686327078 0.004695
81.3383926859286 0.004695
2.25997588483994 0.004695
82.166659877332 0.004695
14.0002467348544 0.004695
15.9996499919575 0.004695
15.8999820280962 0.004695
95.6381132075472 0.004695
42.9845801749271 0.004695
29.999939516129 0.004695
2.10021270579814 0.004695
1.25993360916614 0.004695
38.2602335526316 0.004695
..
2.45036224422442 0.004695
90.0795266272189 0.004695
1.4000606995385 0.004695
20.0017101420083 0.004695
3.70000325975843 0.004695
25.8997967022931 0.004695
14.8307358837209 0.004695
61.987412863816 0.004695
51.2804783981952 0.004695
2.99980317919075 0.004695
65.3877859391396 0.004695
39.8201778851441 0.004695
40.7728505747126 0.004695
65.163250915 0.004695
16.7800370218845 0.004695
77.9967811501598 0.004695
9.00773590909091 0.004695
80 0.004695
73.7339344713656 0.004695
12.3497504635596 0.004695
.99995892606692 0.004695
77.4986193497188 0.004695
6.00343714285714 0.004695
49.0006318425088 0.004695
31.050012866438 0.004695
82.526897905279 0.004695
9.99995388324075 0.004695
83.0025842490842 0.004695
2.19999781832606 0.004695
36.5625529623661 0.004695
Name: internetuserate, Length: 193, dtype: float64
Top 5 most common:

[' ', '36.5625529623661', '36.4991147241897', '15.8999703410908']

Freq dist for: employrate
78.1999969482422 0.009390
57.5 0.009390
50.7000007629394 0.004695
48.7000007629394 0.009390
44.2999992370606 0.004695
47.0999984741211 0.004695
58.2000007629394 0.009390
55.9000015258789 0.014085
58.5 0.004695
52.0999984741211 0.004695
59.9000015258789 0.014085
0.164319
46.4000015258789 0.004695
58.4000015258789 0.009390
66.8000030517578 0.004695
78.9000015258789 0.004695
68.0999984741211 0.004695
65.0999984741211 0.004695
56 0.009390
57.2000007629394 0.004695
81.3000030517578 0.004695
50.5 0.004695
56.9000015258789 0.004695
41.5999984741211 0.004695
83 0.004695
42 0.004695
49.5 0.004695
39 0.004695
63.7999992370606 0.009390
63.7000007629394 0.004695
..
54.4000015258789 0.004695
56.2999992370606 0.009390
63.5 0.004695
42.7999992370606 0.004695
57.2999992370606 0.004695
46.7999992370606 0.004695
43.0999984741211 0.004695
80.6999969482422 0.004695
62.7000007629394 0.004695
65.6999969482422 0.004695
56.4000015258789 0.004695
72 0.004695
53.5 0.014085
71.6999969482422 0.004695
71.8000030517578 0.004695
41.0999984741211 0.004695
74.6999969482422 0.004695
51 0.009390
57.9000015258789 0.004695
45.7000007629394 0.004695
63.5999984741211 0.004695
71 0.004695
59.7999992370606 0.004695
63.0999984741211 0.004695
42.5 0.004695
83.1999969482422 0.009390
53.4000015258789 0.009390
76 0.004695
65.9000015258789 0.004695
46 0.009390
Name: employrate, Length: 140, dtype: float64
Top 5 most common:
[' ', '58.9000015258789', '55.9000015258789', '59.9000015258789']

Here are the five most common values for each category:

Freq dist for: lifeexpectancy
Top 5 most common:
[' ', '73.979', '72.974', '77.653']

Freq dist for: breastcancerper100th
Top 5 most common:
[' ', '28.1', '19.5', '24.7']

Freq dist for: suicideper100th
Top 5 most common:
[' ', '12.8722219467163', '3.1468141078949', '7.7450647354126']

Freq dist for: internetuserate
Top 5 most common:
[' ', '11.0907646263158', '82.526897905279', '42.7478120557293']

Freq dist for: employrate
Top 5 most common:
[' ', '65', '55.9000015258789', '61.5']

Notice that a blank string is the most common value for every variable, which means quite a few cells are filled in with blanks rather than real values. Finally, this checks how many cells contain blank values (a blank string is not the same as a missing NaN value):

for col in cols:
    print((dataframe[col].values == ' ').sum())

for col in norm_cols:
    print((dataframe[col].values == ' ').sum())

We do have a number of blank values for those columns:

22
40
22
21
35

Data Management and Visualization: Selecting a Research Topic

I recently finished Hans Rosling’s book Factfulness: Ten Reasons We’re Wrong About the World, and I would love to do a research project that pays tribute to Rosling and Gapminder. For that reason I’ll be using the Gapminder study.

The primary topic I’m interested in is the relationship between internet usage and life expectancy: I’m curious about whether or not access to the internet might play a role in increasing life expectancy. For that reason, the two variables I’ll focus on are “internetuserate” and “lifeexpectancy”. However, at this stage I’m not going to exclude any variables. I’m not sure yet which variables might be related in important ways to my chosen variables, so I’m going to keep everything for the moment. After some exploratory data analysis and more research, I’ll have a better idea of which variables to focus on.

I’m also curious about the relationship between internet usage and employment rates. There’s a lot of discussion right now about how the internet can help job seekers gain more skills and become more employable, and I’m curious to see how strong the relationship between these two things is. Because of this, I’ll also include the “employrate” variable in my analysis.

(Upon further reflection, “breastcancerper100th” and “suicideper100th” might be interesting variables to track as well, as they relate to physical and mental health.)

I used Google Scholar and my search terms were as follows:

“internet access and life expectancy”

“internet access and employment rates”

My initial search for relevant information about internet access and life expectancy yielded the following results:

A handful of articles finding a relationship between internet usage and life expectancy or general health:

https://www.researchgate.net/publication/268631883_Examining_the_Relationship_between_the_Internet_and_Life_Expectancy , 24th IBIMA Conference, Milan, Italy

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5642753/ , BMJ Open. 2017; 7(7): e015839. Published online 2017 Jul 21. doi: 10.1136/bmjopen-2017-015839

https://www.eurekalert.org/pub_releases/2016-10/uoc–lla102716.php

http://buscompress.com/uploads/3/4/9/8/34980536/riber_8-2_06_m18-053_70-80.pdf, Integrative Business and Economics Research, Vol. 8, Issue 2

https://link.springer.com/article/10.1007/s11205-015-1107-2 , Social Indicators Research, volume 129, pages 391–402 (2016)

There doesn’t seem to be a tremendous amount of research on the subject, which is disappointing, but I’m excited to see what results I get anyway.

In terms of internet access and employment rates, I was able to find a few studies that found a link between the two variables:

https://www.aeaweb.org/articles?id=10.1257/aer.20161385 , American Economic Review, Vol. 109, No. 3, March 2019 (pp. 1032-79)

https://www.sciencedirect.com/science/article/pii/S0167624516300282 , Information Economics and Policy, Volume 40, September 2017, Pages 21-25

As well as one that found a negligible effect:

https://www.aeaweb.org/articles?id=10.1257/000282804322970779 , American Economic Review, Vol. 94, No. 1, March 2004 (pp. 218-232)

To sum up:

My research questions are as follows:

“Is there a relationship between internet usage and life expectancy?” and, beyond that, “Is there a relationship between internet usage and employment rates?”

Based upon the sources I was able to find, I hypothesize there will be a positive relationship between internet usage and both life expectancy and employment rates. I would also expect breast cancer and suicide rates to trend down as internet access increases, since they should be negatively correlated with life expectancy.

Classifying Mushrooms With Python and Scikit-Learn

Mushrooms can be either toxic or edible. It takes trained mushroom hunters and mycologists to discern the toxic mushrooms from the edible mushrooms. Can machine learning algorithms also classify the edibility of mushrooms? We’ll visualize some of the data here, to get an idea of how the data is related, and then implement several different classifiers on the dataset.

This might go without saying, but don’t take advice about which mushrooms to eat from a random blog post. I do not condone eating any mushrooms based on the patterns revealed in this notebook or in the accompanying Python script or documentation.

To start with, we’ll import all the necessary modules that we need.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report, log_loss
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

We’ll load in the data for the mushrooms. Then we need to check if there is any null data in the dataset. There are only two classes: edible and poisonous. We can confirm that this is the case by printing the unique values of the class feature. We’re going to need to encode non-numerical data, that is, transform it into a form that our classifiers can use.

Scikit-Learn has a LabelEncoder that will represent our non-numerical values as numerical ones. So we’ll create an instance of the encoder and then apply the transformation to every column in the dataset. We can print out the dataframe to be sure the transformations were applied correctly.

m_data = pd.read_csv('mushrooms.csv')

# check for any null values
print(m_data.isnull().sum())

# see all unique values in a category
print(m_data['class'].unique())

# machine learning systems work with integers, we need to encode these
# string characters into ints

encoder = LabelEncoder()
# now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

# print the head of the encoded dataframe to check the transformation
print(m_data.head())
class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
['p' 'e']
   class  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment  \
0      1          5            2          4        1     6                1   
1      0          5            2          9        1     0                1   
2      0          0            2          8        1     3                1   
3      1          5            3          8        1     6                1   
4      0          5            2          3        0     5                1   

   gill-spacing  gill-size  gill-color  ...  stalk-surface-below-ring  \
0             0          1           4  ...                         2   
1             0          0           4  ...                         2   
2             0          0           5  ...                         2   
3             0          1           5  ...                         2   
4             1          0           4  ...                         2   

   stalk-color-above-ring  stalk-color-below-ring  veil-type  veil-color  \
0                       7                       7          0           2   
1                       7                       7          0           2   
2                       7                       7          0           2   
3                       7                       7          0           2   
4                       7                       7          0           2   

   ring-number  ring-type  spore-print-color  population  habitat  
0            1          4                  2           3        5  
1            1          4                  3           2        1  
2            1          4                  3           2        3  
3            1          4                  2           3        5  
4            1          0                  3           0        1  

[5 rows x 23 columns]

Now we can do some plotting of the data. Using Seaborn, let’s make a heatmap plot, which shows the correlation of every feature with every other feature.

# get a rough idea of the correlations between features
# (in the encoded class column, 1 is poisonous and 0 is edible)
correlations = m_data.corr()
plt.subplots(figsize=(20, 15))
data_corr = sns.heatmap(correlations, annot=True, linewidths=0.5, cmap="RdBu_r")
plt.show()

It looks like the features most closely associated with class are: cap-surface, gill-attachment, gill-size, veil-color, spore-print-color, population, and habitat. Let’s see if we can plot those variables against the class individually.

features = ["cap-surface", "gill-attachment", "gill-size", "veil-color", "spore-print-color", "population", "habitat"]

for feature in features:
    line_plot = sns.lineplot(x="class", y=feature, label=feature, data=m_data)
    
plt.legend(loc=1)    
plt.show()

If we are so inclined, we can also chart how the values of an individual attribute correlate with a mushroom’s likelihood of being poisonous. We just take the individual colors stored in the “gill-color” feature and plot them against class. We need to specify the names of the gill colors and choose colors to represent them in our plot, however.

gill_names =["buff", "red", "gray", "chocolate", "black", "brown", "orange", "pink", "green", "purple", "white", "yellow"]
gill_colors = ["khaki", "Red", "darkGrey", "chocolate", "Black", "saddleBrown", "orange", "lightpink", "darkGreen", "purple", "lightGrey", "Yellow"]

factor = sns.factorplot(x="gill-color", y="class", data=m_data, kind="bar", size=8,
                        palette=gill_colors)
factor.set_xticklabels(rotation=45)
factor.set(xticks=range(0, 12), xticklabels=gill_names)
factor = factor.set_ylabels("Prob. Poison")
plt.grid(axis='y')
plt.show()

Now we need to split our data into features and labels. This is easy because the “class” feature is the first in the dataframe, so all we need to do is cast the features as everything but the first column and do the opposite for the labels. We’ll also want to scale the data. Scaling our data is important as the data covers a wide range.

The sheer range of the data can throw off the accuracy of our classifier, so by standardizing the data we make sure our classifier performs optimally. We’ll create an instance of the scaler and then transform the data with it. We don’t need to scale the labels/targets, as they are only 0 or 1.

X_features = m_data.iloc[:,1:23]  
y_label = m_data.iloc[:, 0]  

# we should probably scale the features, so that SVM or gaussian NB can deliver better predictions
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)

Principal Component Analysis (PCA) is a method that can simplify the representation of the features, distilling them down into a smaller number of component combinations. PCA can help improve the outcome of a classifier, although it tends to help most on larger datasets, and since this dataset is relatively small we may not need it. We can plot the results of the PCA to see how many dimensions the data could be compressed to.

We fit the PCA on our features and plot the cumulative explained variance ratio, which measures how much of the variance in the data is captured by a given number of components. Plotting it gives an idea of how many components are needed to describe 90% or 100% of the data. Let’s plot the ratio.

# principal component analysis, may or may not want to do, small dataset
pca = PCA()
pca.fit_transform(X_features)
plt.figure(figsize=(10,10))
# plot the explained variance ratio
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()
plt.show()
# it looks like about 17 of the features account for about 95% of the variance in the dataset

Here’s another way that the explained variance can be plotted. Plotting it like this leads to the same conclusion: around 18 components describe 99% of the variance in the dataset.

# here's another way to visualize this
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()

Let’s go ahead and create a new PCA object and keep 17 components as our new feature set.

pca2 = PCA(n_components=17)
x_new = pca2.fit_transform(X_features)
X_train, X_test, y_train, y_test = train_test_split(x_new, y_label, test_size=0.20, random_state=2)

Now we can select our chosen classifiers and run GridSearchCV to find their best parameters. GridSearchCV takes a specified set of parameters and tests the classifier with the possible combinations of those parameters to see which combination is best. Doing this should improve the performance of our classifiers compared to just using them with their default settings. We have to fit GridSearchCV on each classifier and its chosen parameters, and then save the optimized settings as a new instance of the classifier.

logreg_clf = LogisticRegression()
parameters_logreg = {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
                     "max_iter": [25, 50, 100, 200, 400]}
grid_logreg = GridSearchCV(logreg_clf, parameters_logreg)
grid_logreg.fit(X_train, y_train)
logreg_opt = grid_logreg.best_estimator_

GNB_clf = GaussianNB()
GNB_clf.fit(X_train, y_train)

svc_clf = SVC()
svc_param = {"kernel": ["rbf", "linear"]}
grid_svc = GridSearchCV(svc_clf, svc_param)
grid_svc.fit(X_train, y_train)
svc_opt = grid_svc.best_estimator_

rf_clf = RandomForestClassifier()
parameters_rf = {"n_estimators": [4, 6, 8, 10, 12, 14, 16], "criterion": ["gini", "entropy"], "max_features": ["auto", "sqrt", "log2"],
                 "max_depth": [2, 3, 5, 10], "min_samples_split": [2, 3, 5, 10]}
grid_rf = GridSearchCV(rf_clf, parameters_rf)
grid_rf.fit(X_train, y_train)
rf_opt = grid_rf.best_estimator_

knn_clf = KNeighborsClassifier()
parameters_knn = {"n_neighbors": [3, 5, 10, 15, 20], "weights": ["uniform", "distance"],
                  "leaf_size": [10, 20, 30, 45, 60]}
grid_knn = GridSearchCV(knn_clf, parameters_knn)
grid_knn.fit(X_train, y_train)
knn_opt = grid_knn.best_estimator_

dt_clf = DecisionTreeClassifier()
parameters_dt = {"criterion": ["gini", "entropy"], "splitter": ["best", "random"], "max_features": ["auto", "log2", "sqrt"]}
grid_dt = GridSearchCV(dt_clf, parameters_dt)
grid_dt.fit(X_train, y_train)
dt_opt = grid_dt.best_estimator_

xgb_clf = XGBClassifier()
parameters_xg = {"objective" : ["reg:linear"], "n_estimators" : [5, 10, 15, 20]}
grid_xg = GridSearchCV(xgb_clf, parameters_xg)
grid_xg.fit(X_train, y_train)
xgb_opt = grid_xg.best_estimator_

chosen_metrics = [accuracy_score, log_loss, classification_report]

Finally, we can create a function that runs each classifier on the test set and prints our chosen metrics.

def get_reports(metrics, classifier):

    preds = classifier.predict(X_test)
    print("'{}' Performance: ".format(classifier.__class__.__name__))
    acc = accuracy_score(y_test, preds)
    l_loss = log_loss(y_test, preds)
    c_report = classification_report(y_test, preds)
    print("Accuracy: " + str(acc))
    print("Log Loss: " + str(l_loss))
    print("Classificaiton Report: ")
    print(c_report)
    print("----------")

get_reports(chosen_metrics, logreg_opt)
get_reports(chosen_metrics, GNB_clf)
get_reports(chosen_metrics, svc_opt)
get_reports(chosen_metrics, rf_opt)
get_reports(chosen_metrics, knn_opt)
get_reports(chosen_metrics, dt_opt)
get_reports(chosen_metrics, xgb_opt)

'LogisticRegression' Performance: 
Accuracy: 0.9427692307692308
Log Loss: 1.9766989475886843
Classification Report: 
              precision    recall  f1-score   support

           0       0.94      0.96      0.95       860
           1       0.95      0.93      0.94       765

    accuracy                           0.94      1625
   macro avg       0.94      0.94      0.94      1625
weighted avg       0.94      0.94      0.94      1625

----------
'GaussianNB' Performance: 
Accuracy: 0.9298461538461539
Log Loss: 2.423050640308682
Classification Report: 
              precision    recall  f1-score   support

           0       0.92      0.95      0.93       860
           1       0.94      0.91      0.92       765

    accuracy                           0.93      1625
   macro avg       0.93      0.93      0.93      1625
weighted avg       0.93      0.93      0.93      1625

----------
'SVC' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'RandomForestClassifier' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'KNeighborsClassifier' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'DecisionTreeClassifier' Performance: 
Accuracy: 0.992
Log Loss: 0.2763131635190287
Classification Report: 
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       860
           1       0.99      0.99      0.99       765

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

----------
'XGBClassifier' Performance: 
Accuracy: 0.968
Log Loss: 1.1052531461360688
Classification Report: 
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       860
           1       0.97      0.96      0.97       765

    accuracy                           0.97      1625
   macro avg       0.97      0.97      0.97      1625
weighted avg       0.97      0.97      0.97      1625

----------

I highly recommend experimenting with the different classifiers to get a better idea of which ones work better under different circumstances. You can check out the full list of supported classifiers in the scikit-learn documentation.
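If you want to compare classifiers a bit more rigorously than a single train/test split allows, one option (not part of the original post, although cross_val_score is already imported above) is k-fold cross-validation. Here is a minimal sketch, assuming the tuned estimators from the grid searches above:

# compare the tuned classifiers with 5-fold cross-validation on the PCA-reduced features
classifiers = [logreg_opt, GNB_clf, svc_opt, rf_opt, knn_opt, dt_opt, xgb_opt]

for clf in classifiers:
    scores = cross_val_score(clf, x_new, y_label, cv=5)
    print("{}: mean accuracy = {:.3f} (+/- {:.3f})".format(
        clf.__class__.__name__, scores.mean(), scores.std()))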

Understanding and Implementing Merge Sort In Python

Previously, we’ve covered binary search and selection sort. This time, we’ll cover another sorting algorithm: merge sort. Merge sort is a little different from selection sort: while selection sort can only be used to sort an unsorted list, merge sort can also be used to merge two already-sorted lists, in addition to sorting an unsorted list.

The core idea of merge sort is a divide-and-conquer approach. First, the entire list/array is split into left and right halves, and each half is split again. This recurs until each piece contains only one element and the halves cannot be split any further. Then the values are merged back together, compared as they go and placed in sorted order.

This process recurs, gradually sorting and merging groups, until there is only one group with all the elements in their proper order. It is critical to remember that every time a group is merged, the sorted order must be maintained.

Here’s an example of the technique, broken into dividing and merging phases.

Dividing Phase:

Assuming we have an array with the following values: [3, 8, 9, 2, 6, 7], the process will look something like this.

Division 1: [3, 8, 9][2, 6, 7]

Division 2: [3] [8, 9] [2] [6,7]

Division 3: [3] [8] [9] [2] [6] [7]

Now comes the merging phase.

Merging Phase:

Merge 1: [3, 8] [2,9] [6,7]

Merge 2: [2, 3, 8, 9][6, 7]

Merge 3: [2, 3, 6, 7, 8, 9]

If two lists are being merged, it is important to compare the elements before they are merged into a single list. Otherwise, the merged list will contain all the sorted elements of one list and then all the sorted elements of the other list.

So let’s make sure we understand the steps to implementing merge sort:

1. We start with an unsorted list and split it in half.

2. We split the left half of the array into halves.

3. We keep splitting until there is just one element per half.

4. We compare each of the elements.

5. The elements are merged into one group, in order.

6. The process continues until the entire array has been sorted and merged.

Now we can try writing merge sort in Python.

This implementation will be a recursive one. When writing a recursive algorithm, remember that the algorithm must have a base case, a condition under which the recursion ends.

In order to implement merge sort in Python, the algorithm can be structured as follows:

def merge_sort(array):
    # only if there are two or more elements
    if len(array) > 1:
        # get the midpoint of the array
        mid = len(array) // 2
        # left half is everything before the midpoint
        left = array[:mid]
        # right half is everything after the midpoint
        right = array[mid:]

        # recur the splits on the left and right halves
        merge_sort(left)
        merge_sort(right)

        # variables to track the iteration through the left and right halves
        x = 0
        y = 0

        # iterator for the main list
        z = 0

        # as long as there are still values in both the left and right halves,
        # compare the values and copy the smaller one back into the main array
        while x < len(left) and y < len(right):
            # if the left value is smaller than or equal to the right value
            if left[x] <= right[y]:
                # place the left value into the sorted array
                array[z] = left[x]
                # update the left position
                x = x + 1
            else:
                array[z] = right[y]
                y = y + 1
            # move to the next position in the main array
            z = z + 1

        # if there are any elements left over in the left half of the array,
        # fill in the main array with the remaining left values and move the iterators up
        while x < len(left):
            array[z] = left[x]
            x += 1
            z += 1

        # do the same thing for the right half of the array
        while y < len(right):
            array[z] = right[y]
            y += 1
            z += 1

listy = [35, 72, 29, 55, 41, 27, 79, 11, 23, 56, 50]
merge_sort(listy)
print(listy)

Here’s the sorted list:

[11, 23, 27, 29, 35, 41, 50, 55, 56, 72, 79]
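As mentioned at the top, the merge step can also be used on its own to combine two lists that are already sorted. Here’s a small sketch of that step pulled out into its own function (an addition for illustration, not part of the original post):

def merge(left, right):
    # merge two already-sorted lists into one sorted list
    merged = []
    x = 0
    y = 0
    while x < len(left) and y < len(right):
        if left[x] <= right[y]:
            merged.append(left[x])
            x += 1
        else:
            merged.append(right[y])
            y += 1
    # append whatever remains in either list
    merged.extend(left[x:])
    merged.extend(right[y:])
    return merged

print(merge([2, 3, 8, 9], [6, 7]))  # [2, 3, 6, 7, 8, 9]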

GameInformer Review Analysis

Analyzing reviews of media is a good way to get acquainted with data analysis and data science ideas and techniques. Because reviewers tend to give numerical scores to the products they review, it can be fairly easy to visualize how other features correlate with those scores. This blog post will serve as a demonstration of how to analyze and explore data, using review data gathered from GameInformer.

I scraped the review data from GameInformer myself, and what follows are the results of the visualization and analysis. If you are curious about how the scraping was done:

First, the review URLs were collected from the main site.

Next, the review data was collected. Following this, some preprocessing was done to remove duplicate entries and non-standard characters from the review data.
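The original scraping code isn’t included here, but the preprocessing step is easy to sketch with pandas. A minimal, hypothetical version (the raw file name is an assumption; the column names match the dataset used below) might look like this:

# hypothetical cleanup of the scraped reviews
reviews = pd.read_csv("GIreviews_raw.csv", encoding="ISO-8859-1")  # assumed raw file name

# drop exact duplicate rows, then duplicates of the same title/author pair
reviews = reviews.drop_duplicates()
reviews = reviews.drop_duplicates(subset=["title", "author"])

# strip non-standard (non-ASCII) characters from the text columns
for col in reviews.select_dtypes(include="object").columns:
    reviews[col] = reviews[col].str.encode("ascii", errors="ignore").str.decode("ascii")

reviews.to_csv("GIreviewcomplete.csv", index=False)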

This blog will also act as a quick primer on how to visualize data with Pandas and Seaborn.

Let’s start off by making all the necessary imports.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from collections import Counter
import csv

Now we need to load in the data. We also need to get dummy representations of any non-numeric values.

GI_data = pd.read_csv('GIreviewcomplete.csv', encoding = "ISO-8859-1")
GI_data2 = pd.get_dummies(GI_data)

We can get a violin plot for every reviewer. The width of a violin shows how often a particular score was assigned, while its vertical extent shows the range of scores that reviewer has given.

#---display violin plot for author and avg score

sns.lmplot(x='author', y='score', data=GI_data, hue='author', fit_reg=False)
sns.violinplot(x='author',y='score', data=GI_data)
plt.xticks(rotation=-90)
plt.show()

It’s possible to filter the dataframe by multiple criteria. For instance, we could return every review by a certain reviewer at or above a certain score.

EF_reviews = GI_data[(GI_data.author == 'Elise Favis') & (GI_data.score >= 8.0)]
print(EF_reviews)

     title                                             score  author
48   Heaven's Vault                                     8      Elise Favis
58   Photographs                                        8      Elise Favis
102  Bury Me, My Love                                   8      Elise Favis
158  Life Is Strange 2: Episode 1 - Roads               8      Elise Favis
243  Minit                                              8      Elise Favis
262  Where The Water Tastes Like Wine                   9      Elise Favis
266  Florence                                           8      Elise Favis
276  Subnautica                                         8      Elise Favis
286  The Red Strings Club                               8      Elise Favis
405  Tacoma                                             8      Elise Favis
442  Perception                                         8      Elise Favis
485  Thimbleweed Park                                   8      Elise Favis
506  Night In The Woods                                 8      Elise Favis
586  Phoenix Wright: Ace Attorney - Spirit of Justice   8      Elise Favis
678  Day of the Tentacle Remastered                     8      Elise Favis

(The output also shows each review's r_platform, o_platform, publisher, developer, release_date, and rating.)

Countplots and barplots are useful ways of visualizing data. A countplot simply plots how many times each value of your chosen variable occurs, whereas a barplot shows an aggregate (by default, the mean) of one variable for each value of another.

plt.figure(figsize=(10, 4))
scores = sns.countplot(x='score', data=GI_data2)
plt.show()

plt.figure(figsize=(10, 4))
r_platform = sns.countplot(x="r_platform", data=GI_data)
plt.xticks(rotation=-90)
plt.show()

plt.figure(figsize=(10, 4))
rating = sns.countplot(x="rating", data=GI_data)
plt.xticks(rotation=-90)
plt.show()

plt.figure(figsize=(10, 4))
plt.xticks(rotation=-90)
barplot_score = sns.barplot(x="rating", y="score", data=GI_data)
plt.show()

Seaborn also contains a handy distribution function, and in this case we can see the relative distribution of GameInformer review scores. Most of their assigned scores have been an 8.

sns.distplot(GI_data2['score'])
plt.show()

Swarmplots show the individual instances of Y given X, or in this case of a certain score given a certain rating.

plt.figure(figsize=(16, 6))
factorplot_score = sns.swarmplot(x="rating", y="score", data=GI_data, hue='rating')
plt.xticks(rotation=-90)
plt.show()

Let’s do some analysis of publisher and developer statistics. We’ll start by getting a list of publishers and developers; to keep things manageable, we’ll just take the 100 most common of each.

publishers = Counter(GI_data['publisher'])
developers = Counter(GI_data['developer'])

pub_list = []
dev_list = []

for item, freq in publishers.most_common(100):
    pub_list.append(item)

for item, freq in developers.most_common(100):
    dev_list.append(item)
    
print(pub_list[:10])
print("--------------")
print(dev_list[:10])
[' Nintendo', ' Telltale Games', ' Square Enix', ' Ubisoft', ' Sony Computer Entertainment', ' Devolver Digital', ' Warner Bros. Interactive', ' Microsoft Game Studios', ' Electronic Arts', ' Bandai Namco']
--------------
[' Telltale Games', ' Nintendo', ' Capcom', ' Square Enix', ' EA Tiburon', ' Atlus', ' Visual Concepts', ' EA Canada', ' Dontnod Entertainment', ' Ubisoft Montreal']

If we make separate publisher and developer dataframes, we can transform them by computing the mean and median score within each group. Joining those results back together gives a dataframe, grouped by publisher, that also carries each publisher’s mean and median score. We can do the same thing for developers.

pub_df = pd.DataFrame(index=None)
dev_df = pd.DataFrame(index=None)

# append rows for individual publishers to dataframe

def custom_mean(group):
    group['mean'] = group['score'].mean()
    return group

def custom_median(group):
    group['median'] = group['score'].median()
    return group

for pub in pub_list:
    scores = pd.DataFrame(GI_data[(GI_data.publisher == pub) & (GI_data.score)], index=None)
    pub_df = pub_df.append(scores, ignore_index=True)

pub_mean_df = pub_df.groupby('publisher').apply(custom_mean)
pub_median_df = pub_df.groupby('publisher').apply(custom_median)
pub_median = pub_median_df['median']
pub_merged_df = pub_mean_df.join(pub_median)
print(pub_merged_df.head(3))
pub_merged_df.to_csv('pub_merged.csv')

for dev in dev_list:
    scores = pd.DataFrame(GI_data[(GI_data.developer == dev) & (GI_data.score)], index=None)
    dev_df = dev_df.append(scores, ignore_index=True)

dev_mean_df = dev_df.groupby('developer').apply(custom_mean)
dev_median_df = dev_df.groupby('developer').apply(custom_median)
dev_median = dev_median_df['median']
dev_merged_df = dev_mean_df.join(dev_median)
dev_merged_df.to_csv('dev_merged.csv')
print(dev_merged_df.head(3))
                                               title  score  \
0                          Fire Emblem: Three Houses      9   
1        Marvel Ultimate Alliance 3: The Black Order      7   
2  Cadence of Hyrule - Crypt of the Necrodancer F...      7   

              author r_platform o_platform  publisher              developer  \
0  Kimberley Wallace     Switch        NaN   Nintendo    Intelligent Systems   
1      Andrew Reiner     Switch        NaN   Nintendo             Team Ninja   
2     Suriel Vazquez     Switch        NaN   Nintendo   Brace Yourself Games   

     release_date           rating     mean  median  
0   July 26, 2019              NaN  7.73913     7.0  
1   July 19, 2019             Teen  7.73913     7.0  
2   June 13, 2019   Rating Pending  7.73913     7.0  
                                              title  score             author  \
0                The Walking Dead: The Final Season      7  Kimberley Wallace   
1    The Walking Dead: The Final Season - Episode 1      7  Kimberley Wallace   
2  Batman: The Enemy Within Episode 5 - Same Stitch      9      Javy Gwaltney   

      r_platform                       o_platform        publisher  \
0  PlayStation 4             Xbox One, Switch, PC   Telltale Games   
1             PC          PlayStation 4, Xbox One   Telltale Games   
2  PlayStation 4  PlayStation 4, Xbox One, Switch   Telltale Games   

         developer      release_date           rating  mean  median  
0   Telltale Games    March 27, 2019   Rating Pending  7.16     7.0  
1   Telltale Games   August 14, 2018           Mature  7.16     7.0  
2   Telltale Games    March 27, 2018           Mature  7.16     7.0  
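
As an aside, pandas can compute the same per-group statistics more compactly with groupby() and transform(), which broadcasts each group's aggregate back onto the original rows. Here is a minimal sketch, assuming the same GI_data dataframe; the pub_mean and pub_median column names are just illustrative.

# compute each publisher's mean and median score and attach them to every row
GI_stats = GI_data.copy()
GI_stats['pub_mean'] = GI_stats.groupby('publisher')['score'].transform('mean')
GI_stats['pub_median'] = GI_stats.groupby('publisher')['score'].transform('median')
print(GI_stats[['title', 'publisher', 'score', 'pub_mean', 'pub_median']].head())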

Let’s select just the top developers and see what their mean and median scores are.

dev_data = pd.read_csv('dev_merged.csv')
dev_data = dev_data.drop(dev_data.columns[0], axis=1)
developer_names = list(dev_data['developer'].unique())
#print(developer_names)

dev_examples = pd.DataFrame(index=None)

for dev in developer_names:
    dev_examples = dev_examples.append(dev_data[dev_data.developer == dev].iloc[0])

#print(dev_examples.to_string())

dev_stats = dev_examples[['developer', 'mean', 'median']]
print(dev_stats.head(20))
                    developer      mean  median
0              Telltale Games  7.160000     7.0
25                   Nintendo  7.714286     7.0
39                     Capcom  7.857143     9.0
46                Square Enix  8.111111     9.0
55                 EA Tiburon  6.666667     7.0
61                      Atlus  7.666667     7.0
67            Visual Concepts  7.666667     7.0
76                  EA Canada  7.000000     7.0
80      Dontnod Entertainment  7.000000     7.0
83           Ubisoft Montreal  7.000000     7.0
87              Firaxis Games  8.428571     9.0
94                   TT Games  7.666667     7.0
97     Blizzard Entertainment  9.000000     9.0
104             From Software  8.714286     9.0
111                Game Freak  7.000000     7.0
113   Double Fine Productions  7.666667     7.0
116       Intelligent Systems  8.200000     9.0
121                      SEGA  8.600000     9.0
126            HAL Laboratory  7.000000     7.0
127            Spike Chunsoft  6.000000     6.0

Another interesting thing we could do is get the worst-reviewed games, those that received a score of 4 or lower.

# how you can filter on just one criterion and then pull out only the column you care about
# print(GI_data[GI_data.score <= 4].title)
# using .loc is actually the preferred method...
print(GI_data.loc[GI_data.score <= 4, 'title'])
69               R.B.I. Baseball 19
124     Overkill's The Walking Dead
134                   The Quiet Man
217               Tennis World Tour
220                           Agony
259               Fear Effect Sedna
301                  Hello Neighbor
343              Raid: World War II
483              R.B.I. Baseball 17
488                 Old Time Hockey
489                 Old Time Hockey
503                      1-2-Switch
589                    One Way Trip
612             Ghostbusters (2016)
637       Homefront: The Revolution
722                   Devil's Third
767                        Armikrog
815                        Godzilla
820     Payday 2: Crimewave Edition
933              Escape Dead Island
935       Sonic Boom: Rise of Lyric
1045          Rambo: The Video Game
1063                         Rekoil
Name: title, dtype: object
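
Since .loc accepts a list of column labels, the same filter can also pull out several columns at once, for example:

# pull the title, score, and author for every game scored 4 or lower
print(GI_data.loc[GI_data.score <= 4, ['title', 'score', 'author']])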

We could also get only the games published by Nintendo by using isin to keep just the rows whose ‘publisher’ value is ‘Nintendo’.

# can use "isin" in a series...
# spaces are in this, be sure to include them

r = GI_data[GI_data['publisher'].isin([' Nintendo'])][:40]
print(r)

[Output: the first 40 reviews of Nintendo-published games, from Fire Emblem: Three Houses and Marvel Ultimate Alliance 3: The Black Order (2019) back to Fire Emblem Echoes: Shadows of Valentia (2017), along with each game's score, author, platforms, developer, release date, and ESRB rating.]

Finally, let’s do a swarmplot of scores for Nintendo-associated developers. We’ll take the games published by Nintendo and plot the scores they received, grouped by developer.

plt.figure(figsize=(8, 10))
dev_score = sns.swarmplot(x="developer", y="score", data=r, hue='developer')
plt.grid()
plt.xticks(rotation=-90)
plt.show()

If you’d like to get better acquainted with visualizing data, I suggest checking out the documentation of Pandas and Seaborn and trying to visualize some simple datasets, such as the ones below:

Iris Dataset

Pokemon with Stats

Mushroom Dataset

Did It Rain In Seattle?
