Calculating Pearson Correlation

I’m going to be doing an analysis of the Gapminder dataset, looking at the relationship between the internet use rate and life expectancy variables, as well as the association between internet use rate and employment rate.

Here’s the code used to produce the visualizations and run the Pearson correlation on the chosen variables in the Gapminder dataset.

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

plot_regression('lifeexpectancy', 'internetuserate', df_clean, 'Life Expectancy', 'Internet Use Rate')
plot_regression('lifeexpectancy', 'employrate', df_clean, 'Life Expectancy', 'Employment Rate')

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))

Here are the plots that were rendered:

Here’s the output of the Pearson correlation:

Association between life expectancy and internet use rate
(0.77081050888289, 5.983388253650836e-33)


Association between employment rate and internet use rate
(-0.1950109538173115, 0.013175901971555317)

Based on the output above, life expectancy and internet use rate have a strong positive correlation (r ≈ 0.77) that is extremely unlikely to be due to chance. Employment rate and internet use rate show a weak negative correlation (r ≈ -0.20) that is less robust but still statistically significant (p ≈ 0.013).
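pearsonr returns a tuple of the correlation coefficient and the two-tailed p-value. As a small follow-up (not part of the original code), we can unpack those values and square r to get the coefficient of determination, an estimate of how much variance the two variables share:

# unpack r and p, and report r squared (the proportion of shared variance)
r, p = pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate'])
print("r = {}".format(r))
print("p = {}".format(p))
print("r squared = {}".format(r ** 2))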

Running a Chi-Square Test of Independence – Gapminder Dataset

For the Gapminder data I’m investigating whether there is a relationship between internet use rate and life expectancy. Because both variables are continuous, I start by binning them. Internet use rate is split into two equal-sized groups at its median (the bottom and top halves of countries by internet use), while life expectancy is divided into 10 decile bins.

import pandas as pd
import scipy.stats


# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 10, labels=["1=10%", "2=20%", "3=30%", "4=40%", "5=50%",
                                                                      "6=60%", "7=70%", "8=80", "9=90%", "10=100%"])

def bin_half(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins_2".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 2, labels=["1=50%", "2=100%"])

df3 = df2.copy()
bin(df3, cols)
bin(df3, norm_cols)
bin_half(df3, ['internetuserate'])

We now make recode dictionaries and set up crosstab calculations for four different comparisons. We’re only testing four comparisons here as a proof of concept, to demonstrate that we know how to carry out the calculations.

recode_2 = {"3=30%": "3=30%", "7=70%": "7=70%"}
recode_3 = {"2=20%": "2=20%", "8=80": "8=80"}
recode_4 = {"6=60%": "6=60%", "9=90%": "9=90%"}
recode_5 = {"4=40%": "4=40%", "7=70%": "7=70%"}

df3['Comp_3v7'] = df3['lifeexpectancy_bins'].map(recode_2)
df3['Comp_2v8'] = df3['lifeexpectancy_bins'].map(recode_3)
df3['Comp_6v9'] = df3['lifeexpectancy_bins'].map(recode_4)
df3['Comp_4v7'] = df3['lifeexpectancy_bins'].map(recode_5)

# get table of observed counts
count_table = pd.crosstab(df3['internetuserate_bins_2'], df3['lifeexpectancy_bins'])
print(count_table)

count_table_3 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_3v7'])
count_table_4 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_2v8'])
count_table_5 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_6v9'])
count_table_6 = pd.crosstab(df3['internetuserate_bins_2'], df3['Comp_4v7'])

Now we’ll do the chi-square tests.

def chi_square_test(table):

    print("Results for:")
    print(str(table))

    # get column percentages
    col_sum = table.sum(axis=0)
    col_percents = table / col_sum
    print(col_percents)

    chi_square = scipy.stats.chi2_contingency(table)
    print("Chi-square value, p-value, expected_counts")
    print(chi_square)

    print()

print("Initial Chi-square:")
chi_square_test(count_table)
print(" ")

chi_square_test(count_table_3)
chi_square_test(count_table_4)
chi_square_test(count_table_5)
chi_square_test(count_table_6)

Here’s the initial test:

Initial Chi-square:
Results for:
lifeexpectancy_bins 1=10% 2=20% 3=30% 4=40% 5=50% 6=60% 7=70% 8=80 \
internetuserate_bins_2
1=50% 18 19 16 14 12 6 4 4
2=100% 0 0 1 4 6 13 15 11

lifeexpectancy_bins 9=90% 10=100%
internetuserate_bins_2
1=50% 1 0
2=100% 16 19
lifeexpectancy_bins 1=10% 2=20% 3=30% 4=40% 5=50% 6=60% \
internetuserate_bins_2
1=50% 1.0 1.0 0.941176 0.777778 0.666667 0.315789
2=100% 0.0 0.0 0.058824 0.222222 0.333333 0.684211

lifeexpectancy_bins 7=70% 8=80 9=90% 10=100%
internetuserate_bins_2
1=50% 0.210526 0.266667 0.058824 0.0
2=100% 0.789474 0.733333 0.941176 1.0
Chi-square value, p-value, expected_counts
(102.04563740451277, 6.064860600653971e-18, 9, array([[9.45251397, 9.97765363, 8.9273743 , 9.45251397, 9.45251397,
9.97765363, 9.97765363, 7.87709497, 8.9273743 , 9.97765363],
[8.54748603, 9.02234637, 8.0726257 , 8.54748603, 8.54748603,
9.02234637, 9.02234637, 7.12290503, 8.0726257 , 9.02234637]]))

The p-value for the overall test is significant, so we’ll run some post-hoc tests.

Results for:
Comp_2v8 2=20% 8=80
internetuserate_bins_2
1=50% 19 4
2=100% 0 11
Comp_2v8 2=20% 8=80
internetuserate_bins_2
1=50% 1.0 0.266667
2=100% 0.0 0.733333
Chi-square value, p-value, expected_counts
(17.382650301643437, 3.0560286589975315e-05, 1, array([[12.85294118, 10.14705882],
[ 6.14705882, 4.85294118]]))

Results for:
Comp_6v9 6=60% 9=90%
internetuserate_bins_2
1=50% 6 1
2=100% 13 16
Comp_6v9 6=60% 9=90%
internetuserate_bins_2
1=50% 0.315789 0.058824
2=100% 0.684211 0.941176
Chi-square value, p-value, expected_counts
(2.319693757720874, 0.12774517376836148, 1, array([[ 3.69444444, 3.30555556],
[15.30555556, 13.69444444]]))

Results for:
Comp_4v7 4=40% 7=70%
internetuserate_bins_2
1=50% 14 4
2=100% 4 15
Comp_4v7 4=40% 7=70%
internetuserate_bins_2
1=50% 0.777778 0.210526
2=100% 0.222222 0.789474
Chi-square value, p-value, expected_counts
(9.743247922437677, 0.0017998260000241526, 1, array([[8.75675676, 9.24324324],
[9.24324324, 9.75675676]]))

Notice that the p-values for the 3 vs. 7 and 2 vs. 8 comparisons are much smaller than the p-values for 6 vs. 9 and 4 vs. 7. The 4 vs. 7 comparison is still significant (p ≈ 0.0018), but the 6 vs. 9 comparison (p ≈ 0.128) is not significant at the 0.05 level.
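One caveat worth adding (it isn’t applied in the code above): when running several post-hoc chi-square comparisons like this, a common safeguard is the Bonferroni adjustment, which divides the significance threshold by the number of comparisons. Here is a minimal sketch using the crosstabs built above:

# Bonferroni adjustment: divide alpha by the number of post-hoc comparisons
post_hoc_tables = {'3 vs 7': count_table_3, '2 vs 8': count_table_4,
                   '6 vs 9': count_table_5, '4 vs 7': count_table_6}

adjusted_alpha = 0.05 / len(post_hoc_tables)  # 0.0125 for four comparisons

for label, table in post_hoc_tables.items():
    chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
    print("{}: p = {}, significant after adjustment: {}".format(label, p, p < adjusted_alpha))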

Running An Analysis Of Variance on Gapminder Data

In this post I run an analysis of variance (ANOVA) on the Gapminder dataset. Because the Gapminder data contains mainly continuous values rather than categorical ones, I had to create categorical variables using the “qcut” function from Python’s Pandas library. Let’s take a look at the code used to load, preprocess, and bin the data.

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import pandas as pd

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 10, labels=["1=10%", "2=20%", "3=30%", "4=40%", "5=50%",
                                                                      "6=60%", "7=70%", "8=80", "9=90%", "10=100%"])

df3 = df2.copy()
#bin(df3, cols)
bin(df3, norm_cols)

I’ll run an ANOVA with life expectancy as the response variable and the binned internet use rate as the explanatory variable. I make one frame for the ANOVA and another frame for checking the mean and standard deviation of each group. I then fit the statsmodels OLS model and print out the results, along with the group means and standard deviations for more context.

anova_frame = df3[['lifeexpectancy', 'breastcancerper100th', 'suicideper100th',
                   'internetuserate_bins', 'employrate_bins']].dropna()

relate_frame = df3[['lifeexpectancy', 'internetuserate_bins']]

anova_model = smf.ols(formula='lifeexpectancy ~ C(internetuserate_bins)', data=anova_frame).fit()
print(anova_model.summary())

mean = relate_frame.groupby("internetuserate_bins").mean()
sd = relate_frame.groupby("internetuserate_bins").std()
print(mean)
print(sd)

Here are the results of the OLS ANOVA.

Here are the mean and standard deviation of life expectancy for each internet use rate bin.

The F-statistic is 36.76 and the p-value is 8.78e-34, which is extremely small, so this looks like a significant relationship. There are 10 categories in the binned internet use rate variable, and since the overall p-value is significant, we want to run a post-hoc comparison to find out which groups actually differ.

Here’s the code to run the multi-comparison.

multi_comp = multi.MultiComparison(anova_frame["lifeexpectancy"], anova_frame["internetuserate_bins"])
results = multi_comp.tukeyhsd()
print(results)

And here are the results of the multi-comparison.

We can see that there are significant differences, and that we can reject the null hypothesis for the majority of the pairwise group comparisons.

Creating Graphs For Gapminder Data

In this post I’ll be using exploratory data analysis techniques on the Gapminder dataset. I’ll be graphing some of the features in the dataset, with the primary variables of interest being Internet Use Rate and Life Expectancy. I’ve also included some analysis of secondary variables of interest, including Breast Cancer Rates Per 100k and Employment Rates.

To begin with, here’s the code I used to make the visualizations:

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 10, labels=["1=10%", "2=20%", "3=30%", "4=40%", "5=50%",
                                                                      "6=60%", "7=70%", "8=80", "9=90%", "10=100%"])

df3 = df2.copy()
#bin(df3, cols)
bin(df3, norm_cols)

df_employ = df3[["employrate_bins"]]
print(df_employ)

sns.distplot(df3['lifeexpectancy'].dropna())
plt.title("Life Expectancy Distribution")
plt.show()
sns.distplot(df3['internetuserate'].dropna())
plt.title("Internet Use Rate Distribution")
plt.show()
sns.countplot(x = "employrate_bins", data=df_employ)
plt.title("Distribution of Employment Rates")
plt.show()

sns.lmplot(x = "internetuserate", y="breastcancerper100th", hue="internetuserate_bins", data=df3, fit_reg=False)
plt.title("Internet Use Rate and Breast Cancer Per 100k")
plt.show()

sns.lmplot(x = "internetuserate", y="lifeexpectancy", hue="internetuserate_bins", data=df3, fit_reg=False)
plt.title("Internet Use Rate and Life Expectancy")
plt.show()

sns.lmplot(x = "internetuserate", y="employrate", hue="internetuserate_bins", data=df3, fit_reg=False)
plt.title("Internet Use Rate and Employment Rate")
plt.show()

Now let’s take a look at the visualizations themselves.

To begin with, here’s the distribution of life expectancy for the roughly 210 countries tracked in the dataset.

The data is (arguably) bimodal, with two peaks: a large peak clustered around 70 years and a second, smaller cluster around 50 years.

Now let’s have a look at the distribution for internet use rates.

The largest peak by far is between 0 and 10% internet use rate; it accounts for roughly 25% of the distribution, so most countries have internet access for less than half their population. I’d argue the data is essentially unimodal, although there is arguably a second, smaller peak around the 40% mark.

Let’s try graphing some of the binned data, which includes binned employment rates.

If we do this, we can see that the data is fairly uniformly distributed, which we would expect given how the qcut function works (it creates equal-sized quantile bins). However, there is a small bump around the 30th percentile and a small dip near the 40th percentile.

Now let’s try creating scatter plots for the continuous variables. First, we’ll try plotting the relationship between breast cancer diagnosis per 100k and internet use rate.

There seems to be a strong positive correlation between internet use rate and breast cancer diagnosis rates. This is a good example of why we should be careful about inferring causation from correlation: it’s highly unlikely that there’s a direct causal mechanism between internet usage and breast cancer rates, and both are more plausibly tied to other country-level factors.

Now let’s try graphing the relationship between internet use rate and life expectancy.

There also seems to be a roughly positive correlation between internet use rate and life expectancy. There’s a massive increase in life expectancy as internet use rises from 0 to 20%, followed by a slower increase.

Now let’s try graphing the relationship between internet use and employment rates.

Here’s something I was a little surprised to find. The relationship between internet use and employment rate looks roughly parabolic: employment rate declines as internet saturation increases up to roughly the 50% mark, and then it increases again.
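One way to check that impression visually (an addition, not part of the original code) is to let seaborn fit a second-order polynomial rather than a straight line, using regplot’s order parameter:

# fit a quadratic trend line to visualize the roughly parabolic relationship
sns.regplot(x="internetuserate", y="employrate", order=2,
            data=df3.dropna(subset=["internetuserate", "employrate"]))
plt.xlabel("Internet Use Rate")
plt.ylabel("Employment Rate")
plt.title("Internet Use Rate and Employment Rate (quadratic fit)")
plt.show()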

Making Data Management Decisions

After defining the functions we need, we’ll start off by loading in the data.

Afterwards, we’ll check to see if there are any blank values.

import pandas as pd
import numpy as np

# check for missing data (blank string values)
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric (pd.to_numeric replaces the deprecated convert_objects),
# coercing anything non-numeric, such as blank strings, to NaN
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate']

df2 = df.copy()

check_missing(df, cols)
check_missing(df, norm_cols)

There are quite a few blank values:

Column lifeexpectancy is missing:
22

Column breastcancerper100th is missing:
40

Column suicideper100th is missing:
22

Column internetuserate is missing:
21

Column employrate is missing:
35

Calling the to_numeric function converts the blanks into NaN values, which we can check by calling the frequency distribution function.

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

freq_dist(df2, cols, norm_cols)

NaN 22
63.125 1
74.576 1
62.475 1
74.414 1
79.977 1
58.199 1
75.670 1
81.012 1
72.283 1
55.442 1
81.855 1
48.398 1
68.944 1
75.133 1
76.126 1
69.317 1
65.193 1
75.057 1
77.685 1
68.498 1
62.465 1
79.634 1
73.911 1
80.499 1
61.597 1
79.591 1
71.017 1
82.759 1
68.978 1
..
76.954 1
73.703 1
79.839 1
48.718 1
71.172 1
73.456 1
48.397 1
81.439 1
75.246 1
55.377 1
74.788 1
74.402 1
82.338 1
79.499 1
81.539 1
54.210 1
67.017 1
61.452 1
73.373 1
73.127 1
69.245 1
68.795 1
79.341 1
76.918 1
57.937 1
73.126 1
64.666 1
75.956 1
57.379 1
50.239 1
Name: lifeexpectancy, Length: 190, dtype: int64
Freq dist for: breastcancerper100th
23.5 2
NaN 40
70.5 1
31.5 1
62.5 1
36.0 1
92.0 1
46.0 1
19.5 6
21.5 1
16.5 3
26.0 1
29.5 1
63.0 1
90.0 1
43.5 1
33.0 1
23.0 1
52.5 1
30.6 1
38.5 1
82.5 1
10.5 1
22.5 2
90.8 1
29.0 1
55.5 1
48.0 1
35.0 1
25.2 1
..
58.9 2
23.3 1
33.4 1
28.1 7
29.7 1
83.2 1
84.3 1
50.1 1
43.9 1
4.4 1
19.6 1
29.8 2
17.3 2
18.7 1
31.2 3
20.6 2
38.7 1
21.8 2
58.4 2
35.1 2
24.2 1
13.6 1
30.3 1
18.2 2
79.8 1
46.2 1
62.1 1
50.4 2
46.6 1
31.7 1
Name: breastcancerper100th, Length: 137, dtype: int64
Freq dist for: suicideper100th
NaN 22
9.847460 1
6.519537 1
11.980497 1
13.716340 1
33.341860 1
2.034178 1
14.554677 1
6.449157 1
18.954570 1
5.554276 1
16.959240 1
10.493150 1
11.956941 1
4.961071 1
11.426181 1
4.930045 1
10.645740 1
11.655210 1
2.234896 1
2.515721 1
10.550375 1
1.658908 1
10.365070 1
8.164005 1
3.108603 1
16.234370 1
3.563325 1
10.100990 1
4.217076 1
..
15.538490 1
9.875281 1
10.171735 1
6.087671 1
5.838315 1
25.404600 1
13.637060 1
1.498057 1
8.283071 1
16.913248 1
27.874160 1
15.542603 1
5.767406 1
9.127511 1
4.288574 1
6.882952 1
12.405918 1
10.114997 1
7.699330 1
10.129350 1
13.548420 1
12.289122 1
20.369590 1
8.211067 1
3.716739 1
9.507928 1
2.109414 1
13.089616 1
9.257976 1
6.811439 1
Name: suicideper100th, Length: 192, dtype: int64
Freq dist for: internetuserate
81.000000 0.004695
66.000000 0.004695
45.000000 0.004695
NaN 0.098592
2.100213 0.004695
65.000000 0.004695
80.000000 0.004695
5.001375 0.004695
51.914184 0.004695
42.984580 0.004695
25.000000 0.004695
39.820178 0.004695
9.999954 0.004695
44.989947 0.004695
2.300027 0.004695
5.999836 0.004695
71.849124 0.004695
44.001025 0.004695
7.232224 0.004695
90.703555 0.004695
1.259934 0.004695
69.339971 0.004695
35.850437 0.004695
26.477223 0.004695
2.199998 0.004695
16.780037 0.004695
40.122235 0.004695
15.999650 0.004695
70.028599 0.004695
81.590397 0.004695
..
2.699966 0.004695
12.006692 0.004695
1.280050 0.004695
74.247572 0.004695
28.731883 0.004695
81.338393 0.004695
43.366498 0.004695
6.497924 0.004695
65.387786 0.004695
12.500255 0.004695
47.867469 0.004695
12.349750 0.004695
43.055067 0.004695
40.650098 0.004695
45.986590 0.004695
54.992809 0.004695
33.382128 0.004695
65.808554 0.004695
44.570074 0.004695
7.499996 0.004695
88.770254 0.004695
1.700031 0.004695
4.170136 0.004695
56.300034 0.004695
51.280478 0.004695
36.000335 0.004695
41.000128 0.004695
33.616683 0.004695
60.119707 0.004695
1.699985 0.004695
Name: internetuserate, Length: 193, dtype: float64
Freq dist for: employrate
50.500000 0.004695
NaN 0.164319
61.500000 0.014085
46.000000 0.009390
64.500000 0.004695
63.500000 0.004695
51.000000 0.009390
68.000000 0.004695
56.000000 0.009390
56.500000 0.004695
59.000000 0.009390
53.500000 0.014085
81.500000 0.004695
66.000000 0.009390
64.599998 0.004695
64.199997 0.004695
83.000000 0.004695
60.500000 0.004695
42.500000 0.004695
54.500000 0.009390
77.000000 0.004695
42.000000 0.004695
65.000000 0.014085
61.700001 0.004695
50.900002 0.009390
61.000000 0.009390
66.199997 0.004695
57.900002 0.004695
76.000000 0.004695
49.500000 0.004695
..
63.700001 0.004695
52.700001 0.004695
78.900002 0.004695
70.400002 0.009390
71.699997 0.004695
41.599998 0.004695
80.699997 0.004695
54.599998 0.004695
42.799999 0.004695
50.700001 0.004695
57.200001 0.004695
66.800003 0.004695
54.400002 0.004695
63.099998 0.004695
63.900002 0.004695
55.099998 0.004695
67.300003 0.004695
44.799999 0.004695
64.300003 0.004695
47.099998 0.004695
41.099998 0.004695
44.299999 0.004695
62.400002 0.004695
51.299999 0.004695
79.800003 0.004695
58.599998 0.004695
63.799999 0.009390
63.200001 0.004695
65.599998 0.004695
68.300003 0.004695

The blank values have successfully been replaced with NaN. This is convenient because when we plot graphs later, we can simply have Python ignore those NaN values; the missing data is now coded as NaN.
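As a quick sanity check (not in the original post), we can also count the NaN values per column and confirm they match the blank counts found earlier:

# NaN counts per column; these should match the blank-string counts above
print(df2[cols + norm_cols].isnull().sum())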

I’ve also decided to experiment with binning. As one more step, we’ll create five quantile bins for each of our chosen features.

Here’s the code to do that:

def bin(dataframe, cols):

    for col in cols:
        new_col_name = "{}_bins".format(col)
        dataframe[new_col_name] = pd.qcut(dataframe[col], 5, labels=["1-20%", "21 to 40%", "40 to 60%", "60 to 80%", "80 to 100%"])

df3 = df2.copy()
bin(df3, cols)
bin(df3, norm_cols)

# print out the head and check the dataframe 
print(df3.head())

Let’s print out the head of the dataframe to see if the values exist. It looks a little messy, but we can see that the new columns have successfully been created.
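If the full head() output is too messy to read, one option (again, an addition rather than part of the original code) is to print only the new bin columns, using pandas’ filter method:

# show just the newly created quantile-bin columns
print(df3.filter(like="_bins").head())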

Running The First Program – Frequency Distribution





As mentioned in the last post, I’m doing an analysis of the GapMinder dataset. I’ve selected five variables of interest: life expectancy, breast cancer rate (breastcancerper100th), suicide rate (suicideper100th), internet use rate, and employment rate. The frequency distributions for the first three variables are shown as raw counts, while the last two are shown as normalized proportions. The code below loads in the data, splits the variables into those two lists, and then prints a frequency distribution for each variable along with its five most common values. Jump to the bottom for the analysis of missing values and most common values.

import pandas as pd

# load data
dataframe = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(dataframe.isnull().values.any())

# split into two lists
cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']

norm_cols = ['internetuserate', 'employrate']

# print frequency distribution and 5 most common values
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))

        count = dataframe[col].value_counts(sort=False)
        # take the five most frequent values
        top_5 = dataframe[col].value_counts()[:5].index.tolist()
        print(count)
        print("Top 5 most common:")
        print(top_5)
        print("-----")

    for col in norm_cols:
        print("Freq dist for: {}".format(col))

        count = dataframe[col].value_counts(sort=False, normalize=True)
        top_5 = dataframe[col].value_counts()[:5].index.tolist()
        print(count)
        print("Top 5 most common:")
        print(top_5)
        print("-----")


freq_dist(dataframe, cols, norm_cols)

Here are the results of the frequency distributions:

Freq dist for: lifeexpectancy
62.465 1
73.127 1
76.918 1
62.475 1
79.839 1
67.484 1
55.442 1
73.126 1
81.907 1
51.61 1
56.081 1
22
74.522 1
58.199 1
79.158 1
72.64 1
75.62 1
75.181 1
76.142 1
67.529 1
74.221 1
76.64 1
51.444 1
73.371 1
57.937 1
48.397 1
79.977 1
79.915 1
68.978 1
50.411 1
..
57.379 1
64.666 1
49.025 1
72.283 1
77.005 1
68.749 1
74.573 1
80.17 1
55.439 1
81.539 1
75.246 1
79.311 1
54.116 1
68.846 1
77.685 1
79.499 1
72.15 1
75.901 1
65.438 1
80.642 1
73.403 1
73.99 1
68.823 1
73.488 1
58.582 1
81.618 1
72.231 1
81.097 1
51.879 1
73.979 2
Name: lifeexpectancy, Length: 190, dtype: int64
Top 5 most common:

[' ', '72.974', '73.979', '73.911']

Freq dist for: breastcancerper100th
50.9 1
25.2 1
31.7 1
18.3 1
48 1
84.3 1
19.6 1
17.9 1
21.5 1
30 1
33.3 1
74.4 1
23.9 1
24 1
24.1 1
20.4 2
38.7 1
46.2 1
12.3 1
38.8 1
29.7 1
91.9 2
34.3 1
62.5 1
31.2 3
43.9 1
33.4 1
29 1
88.7 1
62.1 1
..
51.8 1
55.5 1
82.5 1
52.5 1
67.2 1
51.1 1
26.4 1
23 1
30.9 1
86.7 1
73.9 1
23.1 1
23.5 2
19.5 6
50.4 2
101.1 1
90 1
18.4 1
52.1 1
44.8 1
49.6 1
63 1
43.5 1
29.8 2
19 1
13.6 1
50.1 1
13.2 2
17.1 1
30.6 1
Name: breastcancerper100th, Length: 137, dtype: int64
Top 5 most common:

[' ', '28.1', '19.5', '24.7']

Freq dist for: suicideper100th
6.02188205718994 1
6.26578903198242 1
11.9804973602295 1
6.1052818998346 1
14.09153 1
6.44915676116943 1
2.0341784954071 1
7.765584 1
5.83525085449219 1
4.409532 1
14.77625 1
11.11583 1
35.752872467041 1
11.9569406509399 1
22
3.94025897979736 1
8.21094799041748 1
4.37336492538452 1
20.3179302215576 1
9.127511 1
14.4696483612061 1
9.211085 1
11.2139701843262 1
11.6533222198486 1
3.74158787727356 1
7.44382619857788 1
5.36217880249023 1
10.0719423294067 1
10.1149969100952 1
3.10860252380371 1
..
1.65890777111053 1
4.8487696647644 1
9.6331148147583 1
7.69932985305786 1
12.2167692184448 1
8.28307056427002 1
1.37000155448914 1
10.823 1
12.17976 1
7.87687826156616 1
.20144872367382 1
11.4261808395386 1
12.8698148727417 1
9.927033 1
6.68438529968262 1
20.16201 1
13.23981 1
7.06018447875976 1
5.931845 1
7.7450647354126 1
3.56332468986511 1
29.864164352417 1
9.70955562591553 1
15.5426025390625 1
4.7510838508606 1
4.52785158157349 1
5.55427646636963 1
4.41499042510986 1
6.3698878288269 1
13.1179485321045 1
Name: suicideper100th, Length: 192, dtype: int64
Top 5 most common:

[' ', '13.1179485321045', '1.57435011863708', '14.5546770095825']

Freq dist for: internetuserate
44.5853546903272 0.004695
47.2804360294118 0.004695
40.0200948796529 0.004695
62.8119000060846 0.004695
76.5875384615385 0.004695
90.0161900191939 0.004695
8.37020688381867 0.004695
31.0043782824698 0.004695
77.6385351546869 0.004695
0.098592
11.0000554403336 0.004695
42.7478120557293 0.004695
7.0002138207311 0.004695
12.5002554302298 0.004695
44.9899469578783 0.004695
69.3399707174231 0.004695
12.3348932627873 0.004695
47.8674686327078 0.004695
81.3383926859286 0.004695
2.25997588483994 0.004695
82.166659877332 0.004695
14.0002467348544 0.004695
15.9996499919575 0.004695
15.8999820280962 0.004695
95.6381132075472 0.004695
42.9845801749271 0.004695
29.999939516129 0.004695
2.10021270579814 0.004695
1.25993360916614 0.004695
38.2602335526316 0.004695
..
2.45036224422442 0.004695
90.0795266272189 0.004695
1.4000606995385 0.004695
20.0017101420083 0.004695
3.70000325975843 0.004695
25.8997967022931 0.004695
14.8307358837209 0.004695
61.987412863816 0.004695
51.2804783981952 0.004695
2.99980317919075 0.004695
65.3877859391396 0.004695
39.8201778851441 0.004695
40.7728505747126 0.004695
65.163250915 0.004695
16.7800370218845 0.004695
77.9967811501598 0.004695
9.00773590909091 0.004695
80 0.004695
73.7339344713656 0.004695
12.3497504635596 0.004695
.99995892606692 0.004695
77.4986193497188 0.004695
6.00343714285714 0.004695
49.0006318425088 0.004695
31.050012866438 0.004695
82.526897905279 0.004695
9.99995388324075 0.004695
83.0025842490842 0.004695
2.19999781832606 0.004695
36.5625529623661 0.004695
Name: internetuserate, Length: 193, dtype: float64
Top 5 most common:

[' ', '36.5625529623661', '36.4991147241897', '15.8999703410908']

Freq dist for: employrate
78.1999969482422 0.009390
57.5 0.009390
50.7000007629394 0.004695
48.7000007629394 0.009390
44.2999992370606 0.004695
47.0999984741211 0.004695
58.2000007629394 0.009390
55.9000015258789 0.014085
58.5 0.004695
52.0999984741211 0.004695
59.9000015258789 0.014085
0.164319
46.4000015258789 0.004695
58.4000015258789 0.009390
66.8000030517578 0.004695
78.9000015258789 0.004695
68.0999984741211 0.004695
65.0999984741211 0.004695
56 0.009390
57.2000007629394 0.004695
81.3000030517578 0.004695
50.5 0.004695
56.9000015258789 0.004695
41.5999984741211 0.004695
83 0.004695
42 0.004695
49.5 0.004695
39 0.004695
63.7999992370606 0.009390
63.7000007629394 0.004695
..
54.4000015258789 0.004695
56.2999992370606 0.009390
63.5 0.004695
42.7999992370606 0.004695
57.2999992370606 0.004695
46.7999992370606 0.004695
43.0999984741211 0.004695
80.6999969482422 0.004695
62.7000007629394 0.004695
65.6999969482422 0.004695
56.4000015258789 0.004695
72 0.004695
53.5 0.014085
71.6999969482422 0.004695
71.8000030517578 0.004695
41.0999984741211 0.004695
74.6999969482422 0.004695
51 0.009390
57.9000015258789 0.004695
45.7000007629394 0.004695
63.5999984741211 0.004695
71 0.004695
59.7999992370606 0.004695
63.0999984741211 0.004695
42.5 0.004695
83.1999969482422 0.009390
53.4000015258789 0.009390
76 0.004695
65.9000015258789 0.004695
46 0.009390
Name: employrate, Length: 140, dtype: float64
Top 5 most common:
[' ', '58.9000015258789', '55.9000015258789', '59.9000015258789']

Here are the five most common values for each category:

Freq dist for: lifeexpectancy
Top 5 most common:
[' ', '73.979', '72.974', '77.653']

Freq dist for: breastcancerper100th
Top 5 most common:
[' ', '28.1', '19.5', '24.7']

Freq dist for: suicideper100th
Top 5 most common:
[' ', '12.8722219467163', '3.1468141078949', '7.7450647354126']

Freq dist for: internetuserate
Top 5 most common:
[' ', '11.0907646263158', '82.526897905279', '42.7478120557293']

Freq dist for: employrate
Top 5 most common:
[' ', '65', '55.9000015258789', '61.5']

Notice that a blank string is the most common value for every variable, which means quite a few cells are filled in with blanks rather than real values. Finally, this checks how many cells contain blank values (a blank string is not the same as a missing NaN value):

for col in cols:
    print((dataframe[col].values == ' ').sum())

for col in norm_cols:
    print((dataframe[col].values == ' ').sum())

We do have a number of blank values for those columns:

22
40
22
21
35

Data Management and Visualization: Selecting a Research Topic

I recently finished Hans Rosling’s book Factfulness: Ten Reasons We’re Wrong About the World, and I would love to do a research project that pays tribute to Rosling and Gapminder. For that reason I’ll be using the Gapminder study.

The primary topic I’m interested in is the relationship between internet usage and life expectancy: I’m curious about whether or not access to the internet might play a role in increasing life expectancy. For that reason, the two variables I’ll focus on are “internetuserate” and “lifeexpectancy”. However, at this stage I’m not going to exclude any variables. I’m not sure yet which variables might be related in important ways to my chosen variables, so I’m going to keep everything for the moment. After some exploratory data analysis and more research, I’ll have a better idea of which variables to focus on.

I’m also curious about the relationship between internet usage and employment rates. There’s a lot of discussion right now about how the internet can help job seekers gain more skills and become more employable, and I’m curious to see how strong the relationship between these two things is. Because of this, I’ll also include the “employrate” variable in my analysis.

(Upon further reflection, “breastcancerper100th” and “suicideper100th” might be interesting variables to track as well, as they relate to physical and mental health.)

I used Google Scholar and my search terms were as follows:

“internet access and life expectancy”

“internet access and employment rates”

My initial search for relevant information about internet access and life expectancy yielded the following results:

A handful of articles finding a relationship between internet usage and life expectancy or general health:

https://www.researchgate.net/publication/268631883_Examining_the_Relationship_between_the_Internet_and_Life_Expectancy , 24th IBIMA Conference, Milan, Italy

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5642753/ , BMJ Open. 2017; 7(7): e015839. Published online 2017 Jul 21. doi: 10.1136/bmjopen-2017-015839

https://www.eurekalert.org/pub_releases/2016-10/uoc–lla102716.php

http://buscompress.com/uploads/3/4/9/8/34980536/riber_8-2_06_m18-053_70-80.pdf, Integrative Business and Economics Research, Vol. 8, Issue 2

https://link.springer.com/article/10.1007/s11205-015-1107-2 , Social Indicators Research, volume 129, pages 391–402 (2016)

There doesn’t seem to be a tremendous amount of research on the subject, which is disappointing, but I’m excited to see what results I get anyway.

In terms of internet access and employment rates, I was able to find a few studies that found a link between the two variables:

https://www.aeaweb.org/articles?id=10.1257/aer.20161385 , American Economic Review, Vol. 109, No. 3, March 2019 (pp. 1032-79)

https://www.sciencedirect.com/science/article/pii/S0167624516300282 , Information Economics and Policy, Volume 40, September 2017, Pages 21-25

As well as one that found a negligible effect:

https://www.aeaweb.org/articles?id=10.1257/000282804322970779 , American Economic Review, Vol. 94, No. 1, March 2004 (pp. 218-232)

To sum up:

My research questions are as follows:

“Is there a relationship between internet usage and life expectancy?” and, beyond that, “Is there a relationship between internet usage and employment rates?”

Based upon the sources I was able to find, I hypothesize there will be a positive relationship between internet usage and both life expectancy and employment rates. I would also expect breast cancer and suicide rates to trend down as internet access increases, since they should be negatively correlated with life expectancy.

Classifying Mushrooms With Python and Scikit-Learn

Mushrooms can be either toxic or edible. It takes trained mushroom hunters and mycologists to discern the toxic mushrooms from the edible mushrooms. Can machine learning algorithms also classify the edibility of mushrooms? We’ll visualize some of the data here, to get an idea of how the data is related, and then implement several different classifiers on the dataset.

This might go without saying, but don’t take advice about which mushrooms to eat from a random blog post. I do not condone eating any mushrooms based on the patterns revealed in this notebook or in the accompanying Python script or documentation.

To start with, we’ll import all the necessary modules that we need.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report, log_loss
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

We’ll load in the data for the mushrooms. Then we need to check if there is any null data in the dataset. There are only two classes: edible and poisonous. We can confirm that this is the case by printing the unique values of the class feature. We’re going to need to encode non-numerical data, that is, transform it into a form that our classifiers can use.

Scikit-Learn has a LabelEncoder that will represent our non-numerical values as numerical ones. So we’ll create an instance of the encoder and then apply the transformation to every column in the dataset. We can print out the dataframe to be sure the transformations were applied correctly.

m_data = pd.read_csv('mushrooms.csv')

# check for any null values
print(m_data.isnull().sum())

# see all unique values in a category
print(m_data['class'].unique())

# machine learning systems work with integers, we need to encode these
# string characters into ints

encoder = LabelEncoder()
# now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

# print the head of the encoded dataframe to check the transformation
print(m_data.head())
class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
['p' 'e']
   class  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment  \
0      1          5            2          4        1     6                1   
1      0          5            2          9        1     0                1   
2      0          0            2          8        1     3                1   
3      1          5            3          8        1     6                1   
4      0          5            2          3        0     5                1   

   gill-spacing  gill-size  gill-color  ...  stalk-surface-below-ring  \
0             0          1           4  ...                         2   
1             0          0           4  ...                         2   
2             0          0           5  ...                         2   
3             0          1           5  ...                         2   
4             1          0           4  ...                         2   

   stalk-color-above-ring  stalk-color-below-ring  veil-type  veil-color  \
0                       7                       7          0           2   
1                       7                       7          0           2   
2                       7                       7          0           2   
3                       7                       7          0           2   
4                       7                       7          0           2   

   ring-number  ring-type  spore-print-color  population  habitat  
0            1          4                  2           3        5  
1            1          4                  3           2        1  
2            1          4                  3           2        3  
3            1          4                  2           3        5  
4            1          0                  3           0        1  

[5 rows x 23 columns]

Now we can do some plotting of the data. Using Seaborn, let’s make a heatmap plot, which shows the correlation of every feature with every other feature.

# get a rough idea of the correlations between features
# (in the encoded class column, 1 is poisonous and 0 is edible)
correlations = m_data.corr()
plt.subplots(figsize=(20, 15))
data_corr = sns.heatmap(correlations, annot=True, linewidths=0.5, cmap="RdBu_r")
plt.show()

It looks like the features most closely associated with class are: cap-surface, gill-attachment, gill-size, veil-color, spore-print-color, population, and habitat. Let’s see if we can plot those variables against the class individually.

features = ["cap-surface", "gill-attachment", "gill-size", "veil-color", "spore-print-color", "population", "habitat"]

for feature in features:
    line_plot = sns.lineplot(x="class", y=feature, label=feature, data=m_data)
    
plt.legend(loc=1)    
plt.show()

If we are so inclined, we can also chart how the values of an individual attribute correlate with a mushroom’s likelihood of being poisonous. We just take the individual colors stored in the “gill-color” feature and plot them against class. We need to specify the names of the gill colors and choose colors to represent them in our plot, however.

gill_names =["buff", "red", "gray", "chocolate", "black", "brown", "orange", "pink", "green", "purple", "white", "yellow"]
gill_colors = ["khaki", "Red", "darkGrey", "chocolate", "Black", "saddleBrown", "orange", "lightpink", "darkGreen", "purple", "lightGrey", "Yellow"]

factor = sns.factorplot(x="gill-color", y="class", data=m_data, kind="bar", size=8,
                        palette=gill_colors)
factor.set_xticklabels(rotation=45)
factor.set(xticks=range(0, 12), xticklabels=gill_names)
factor = factor.set_ylabels("Prob. Poison")
plt.grid(axis='y')
plt.show()

Now we need to split our data into features and labels. This is easy because the “class” feature is the first in the dataframe, so all we need to do is cast the features as everything but the first column and do the opposite for the labels. We’ll also want to scale the data. Scaling our data is important as the data covers a wide range.

The sheer range of the data can throw off the accuracy of our classifier, so by standardizing the data we make sure our classifier performs optimally. We’ll create an instance of the scaler and then transform the data with it. We don’t need to scale the labels/targets, as they are only 0 or 1.

X_features = m_data.iloc[:,1:23]  
y_label = m_data.iloc[:, 0]  

# we should probably scale the features, so that SVM or gaussian NB can deliver better predictions
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)

Principal Component Analysis (PCA) is a method that can simplify the representation of the features, distilling them down into a smaller number of component combinations. PCA can help improve the outcome of a classifier, although it tends to help most on larger datasets, and since this dataset is relatively small we may not need it. We can plot the results of the PCA to see how many dimensions the data could be compressed to.

We fit the PCA on our features and plot the cumulative explained variance ratio, which measures how much of the variance in the data is captured by a given number of components. Plotting it gives an idea of how many components are needed to describe 90% or 100% of the data. Let’s plot the ratio.

# principal component analysis, may or may not want to do, small dataset
pca = PCA()
pca.fit_transform(X_features)
plt.figure(figsize=(10,10))
# plot the explained variance ratio
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()
plt.show()
# it looks like about 17 of the features account for about 95% of the variance in the dataset

Here’s another way that the explained variance can be plotted. Plotting it like this leads to the same conclusion: around 18 components describe 99% of the variance in the dataset.

# here's another way to visualize this
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()

Let’s go ahead and create a new PCA object and keep 17 components as our new feature set.

pca2 = PCA(n_components=17)
x_new = pca2.fit_transform(X_features)
X_train, X_test, y_train, y_test = train_test_split(x_new, y_label, test_size=0.20, random_state=2)

Now we can select our chosen classifiers and run GridSearchCV to find their best parameters. GridSearchCV takes a specified set of parameters and tests the classifier with the possible combinations of those parameters to see which combination is best. Doing this should improve the performance of our classifiers compared to just using them with their default settings. We have to fit GridSearchCV on each classifier and its chosen parameters, and then save the optimized settings as a new instance of the classifier.

logreg_clf = LogisticRegression()
parameters_logreg = {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
                     "max_iter": [25, 50, 100, 200, 400]}
grid_logreg = GridSearchCV(logreg_clf, parameters_logreg)
grid_logreg.fit(X_train, y_train)
logreg_opt = grid_logreg.best_estimator_

GNB_clf = GaussianNB()
GNB_clf.fit(X_train, y_train)

svc_clf = SVC()
svc_param = {"kernel": ["rbf", "linear"]}
grid_svc = GridSearchCV(svc_clf, svc_param)
grid_svc.fit(X_train, y_train)
svc_opt = grid_svc.best_estimator_

rf_clf = RandomForestClassifier()
parameters_rf = {"n_estimators": [4, 6, 8, 10, 12, 14, 16], "criterion": ["gini", "entropy"], "max_features": ["auto", "sqrt", "log2"],
                 "max_depth": [2, 3, 5, 10], "min_samples_split": [2, 3, 5, 10]}
grid_rf = GridSearchCV(rf_clf, parameters_rf)
grid_rf.fit(X_train, y_train)
rf_opt = grid_rf.best_estimator_

knn_clf = KNeighborsClassifier()
parameters_knn = {"n_neighbors": [3, 5, 10, 15, 20], "weights": ["uniform", "distance"],
                  "leaf_size": [10, 20, 30, 45, 60]}
grid_knn = GridSearchCV(knn_clf, parameters_knn)
grid_knn.fit(X_train, y_train)
knn_opt = grid_knn.best_estimator_

dt_clf = DecisionTreeClassifier()
parameters_dt = {"criterion": ["gini", "entropy"], "splitter": ["best", "random"], "max_features": ["auto", "log2", "sqrt"]}
grid_dt = GridSearchCV(dt_clf, parameters_dt)
grid_dt.fit(X_train, y_train)
dt_opt = grid_dt.best_estimator_

xgb_clf = XGBClassifier()
parameters_xg = {"objective" : ["reg:linear"], "n_estimators" : [5, 10, 15, 20]}
grid_xg = GridSearchCV(xgb_clf, parameters_xg)
grid_xg.fit(X_train, y_train)
xgb_opt = grid_xg.best_estimator_

chosen_metrics = [accuracy_score, log_loss, classification_report]

Finally, we can create a function that runs each classifier on the test set and prints our chosen metrics.

def get_reports(metrics, classifier):

    preds = classifier.predict(X_test)
    print("'{}' Performance: ".format(classifier.__class__.__name__))
    acc = accuracy_score(y_test, preds)
    l_loss = log_loss(y_test, preds)
    c_report = classification_report(y_test, preds)
    print("Accuracy: " + str(acc))
    print("Log Loss: " + str(l_loss))
    print("Classificaiton Report: ")
    print(c_report)
    print("----------")

get_reports(chosen_metrics, logreg_opt)
get_reports(chosen_metrics, GNB_clf)
get_reports(chosen_metrics, svc_opt)
get_reports(chosen_metrics, rf_opt)
get_reports(chosen_metrics, knn_opt)
get_reports(chosen_metrics, dt_opt)
get_reports(chosen_metrics, xgb_opt)

'LogisticRegression' Performance: 
Accuracy: 0.9427692307692308
Log Loss: 1.9766989475886843
Classification Report: 
              precision    recall  f1-score   support

           0       0.94      0.96      0.95       860
           1       0.95      0.93      0.94       765

    accuracy                           0.94      1625
   macro avg       0.94      0.94      0.94      1625
weighted avg       0.94      0.94      0.94      1625

----------
'GaussianNB' Performance: 
Accuracy: 0.9298461538461539
Log Loss: 2.423050640308682
Classification Report: 
              precision    recall  f1-score   support

           0       0.92      0.95      0.93       860
           1       0.94      0.91      0.92       765

    accuracy                           0.93      1625
   macro avg       0.93      0.93      0.93      1625
weighted avg       0.93      0.93      0.93      1625

----------
'SVC' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'RandomForestClassifier' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'KNeighborsClassifier' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'DecisionTreeClassifier' Performance: 
Accuracy: 0.992
Log Loss: 0.2763131635190287
Classification Report: 
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       860
           1       0.99      0.99      0.99       765

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

----------
'XGBClassifier' Performance: 
Accuracy: 0.968
Log Loss: 1.1052531461360688
Classification Report: 
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       860
           1       0.97      0.96      0.97       765

    accuracy                           0.97      1625
   macro avg       0.97      0.97      0.97      1625
weighted avg       0.97      0.97      0.97      1625

----------

I highly recommend experimenting with the different classifiers to get a better idea of which ones work better under different circumstances. You can check out the full list of supported classifiers in the scikit-learn documentation.
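If you want to compare classifiers a bit more rigorously than a single train/test split allows, one option (not part of the original post, although cross_val_score is already imported above) is k-fold cross-validation. Here is a minimal sketch, assuming the tuned estimators from the grid searches above:

# compare the tuned classifiers with 5-fold cross-validation on the PCA-reduced features
classifiers = [logreg_opt, GNB_clf, svc_opt, rf_opt, knn_opt, dt_opt, xgb_opt]

for clf in classifiers:
    scores = cross_val_score(clf, x_new, y_label, cv=5)
    print("{}: mean accuracy = {:.3f} (+/- {:.3f})".format(
        clf.__class__.__name__, scores.mean(), scores.std()))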

Understanding and Implementing Merge Sort In Python

Previously, we’ve covered binary search and selection sort. This time, we’ll cover another sorting algorithm: merge sort. Merge sort is a little different from selection sort: while selection sort can only be used to sort an unsorted list, merge sort can also be used to merge two already-sorted lists, in addition to sorting an unsorted list.

The core idea of merge sort is a divide-and-conquer approach. First, the entire list/array is split into left and right halves, and each half is split again. This recurs until each piece contains only one element and the halves cannot be split any further. Then the values are merged back together, compared as they go and placed in sorted order.

This process recurs, gradually sorting and merging groups, until there is only one group with all the elements in their proper order. It is critical to remember that every time a group is merged, the sorted order must be maintained.

Here’s an example of the technique, broken into dividing and merging phases.

Dividing Phase:

Assuming we have an array with the following values: [3, 8, 9, 2, 6, 7], the process will look something like this.

Division 1: [3, 8, 9][2, 6, 7]

Division 2: [3] [8, 9] [2] [6,7]

Division 3: [3] [8] [9] [2] [6] [7]

Now comes the merging phase.

Merging Phase:

Merge 1: [3, 8] [2,9] [6,7]

Merge 2: [2, 3, 8, 9][6, 7]

Merge 3: [2, 3, 6, 7, 8, 9]

If two lists are being merged, it is important to compare the elements before they are merged into a single list. Otherwise, the merged list will contain all the sorted elements of one list and then all the sorted elements of the other list.

So let’s make sure we understand the steps to implementing merge sort:

1. We start with an unsorted list and split it in half.

2. We split the left half of the array into halves.

3. We keep splitting until there is just one element per half.

4. We compare each of the elements.

5. The elements are merged into one group, in order.

6. The process continues until the entire array has been sorted and merged.

Now we can try writing merge sort in Python.

This implementation will be a recursive one. When writing a recursive algorithm, remember that the algorithm must have a base case, a condition under which the recursion ends.

In order to implement merge sort in Python, the algorithm can be structured as follows:

def merge_sort(array):
    # only if there are two or more elements
    if len(array) > 1:
        # get the midpoint of the array
        mid = len(array) // 2
        # left half is everything before the midpoint
        left = array[:mid]
        # right half is everything after the midpoint
        right = array[mid:]

        # recur the splits on the left and right halves
        merge_sort(left)
        merge_sort(right)

        # variables to track the iteration through the left and right halves
        x = 0
        y = 0

        # iterator for the main list
        z = 0

        # as long as there are still values in both the left and right halves,
        # compare the values and copy the smaller one back into the main array
        while x < len(left) and y < len(right):
            # if the left value is smaller than or equal to the right value
            if left[x] <= right[y]:
                # place the left value into the sorted array
                array[z] = left[x]
                # update the left position
                x = x + 1
            else:
                array[z] = right[y]
                y = y + 1
            # move to the next position in the main array
            z = z + 1

        # if there are any elements left over in the left half of the array,
        # fill in the main array with the remaining left values and move the iterators up
        while x < len(left):
            array[z] = left[x]
            x += 1
            z += 1

        # do the same thing for the right half of the array
        while y < len(right):
            array[z] = right[y]
            y += 1
            z += 1

listy = [35, 72, 29, 55, 41, 27, 79, 11, 23, 56, 50]
merge_sort(listy)
print(listy)

Here’s the sorted list:

[11, 23, 27, 29, 35, 41, 50, 55, 56, 72, 79]
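As mentioned at the top, the merge step can also be used on its own to combine two lists that are already sorted. Here’s a small sketch of that step pulled out into its own function (an addition for illustration, not part of the original post):

def merge(left, right):
    # merge two already-sorted lists into one sorted list
    merged = []
    x = 0
    y = 0
    while x < len(left) and y < len(right):
        if left[x] <= right[y]:
            merged.append(left[x])
            x += 1
        else:
            merged.append(right[y])
            y += 1
    # append whatever remains in either list
    merged.extend(left[x:])
    merged.extend(right[y:])
    return merged

print(merge([2, 3, 8, 9], [6, 7]))  # [2, 3, 6, 7, 8, 9]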

GameInformer Review Analysis

Analyzing reviews of media is a good way to get acquainted with data analysis and data science ideas and techniques. Because reviewers tend to give numerical scores to the products they review, it can be fairly easy to visualize how other features correlate with those scores. This blog post will serve as a demonstration of how to analyze and explore data, using review data gathered from GameInformer.

I scraped the review data from GameInformer myself, and what follows are the results of the visualization and analysis. If you are curious about how the scraping was done:

First, the review URLs were collected from the main site.

Next, the review data was collected. Following this, some preprocessing was done to remove duplicate entries and non-standard characters from the review data.
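The original scraping code isn’t included here, but the preprocessing step is easy to sketch with pandas. A minimal, hypothetical version (the raw file name is an assumption; the column names match the dataset used below) might look like this:

# hypothetical cleanup of the scraped reviews
reviews = pd.read_csv("GIreviews_raw.csv", encoding="ISO-8859-1")  # assumed raw file name

# drop exact duplicate rows, then duplicates of the same title/author pair
reviews = reviews.drop_duplicates()
reviews = reviews.drop_duplicates(subset=["title", "author"])

# strip non-standard (non-ASCII) characters from the text columns
for col in reviews.select_dtypes(include="object").columns:
    reviews[col] = reviews[col].str.encode("ascii", errors="ignore").str.decode("ascii")

reviews.to_csv("GIreviewcomplete.csv", index=False)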

This blog will also act as a quick primer on how to visualize data with Pandas and Seaborn.

Let’s start off by making all the necessary imports.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from collections import Counter
import csv

Now we need to load in the data. We also need to get dummy representations of any non-numeric values.

GI_data = pd.read_csv('GIreviewcomplete.csv', encoding = "ISO-8859-1")
GI_data2 = pd.get_dummies(GI_data)

We can get a violin plot for every reviewer. The width of a violin shows how often a particular score was assigned, while its vertical extent shows the range of scores that reviewer has given.

#---display violin plot for author and avg score

sns.lmplot(x='author', y='score', data=GI_data, hue='author', fit_reg=False)
sns.violinplot(x='author',y='score', data=GI_data)
plt.xticks(rotation=-90)
plt.show()

It’s possible to filter the dataframe by multiple criteria. For instance, we could return every review by a certain reviewer at or above a certain score.

EF_reviews = GI_data[(GI_data.author == 'Elise Favis') & (GI_data.score >= 8.0)]
print(EF_reviews)

     title                                             score  author
48   Heaven's Vault                                     8      Elise Favis
58   Photographs                                        8      Elise Favis
102  Bury Me, My Love                                   8      Elise Favis
158  Life Is Strange 2: Episode 1 - Roads               8      Elise Favis
243  Minit                                              8      Elise Favis
262  Where The Water Tastes Like Wine                   9      Elise Favis
266  Florence                                           8      Elise Favis
276  Subnautica                                         8      Elise Favis
286  The Red Strings Club                               8      Elise Favis
405  Tacoma                                             8      Elise Favis
442  Perception                                         8      Elise Favis
485  Thimbleweed Park                                   8      Elise Favis
506  Night In The Woods                                 8      Elise Favis
586  Phoenix Wright: Ace Attorney - Spirit of Justice   8      Elise Favis
678  Day of the Tentacle Remastered                     8      Elise Favis

(The output also shows each review's r_platform, o_platform, publisher, developer, release_date, and rating.)

Countplots and barplots are useful ways of visualizing data. A countplot simply plots how many times each value of your chosen variable occurs, whereas a barplot shows an aggregate (by default, the mean) of one variable for each value of another.

plt.figure(figsize=(10, 4))
scores = sns.countplot(x='score', data=GI_data2)
plt.show()

plt.figure(figsize=(10, 4))
r_platform = sns.countplot(x="r_platform", data=GI_data)
plt.xticks(rotation=-90)
plt.show()

plt.figure(figsize=(10, 4))
rating = sns.countplot(x="rating", data=GI_data)
plt.xticks(rotation=-90)
plt.show()

plt.figure(figsize=(10, 4))
plt.xticks(rotation=-90)
barplot_score = sns.barplot(x="rating", y="score", data=GI_data)
plt.show()

Seaborn also contains a handy distribution function, and in this case we can see the relative distribution of GameInformer review scores. Most of their assigned scores have been an 8.

sns.distplot(GI_data2['score'])
plt.show()

Swarmplots show the individual instances of Y given X, or in this case of a certain score given a certain rating.

plt.figure(figsize=(16, 6))
factorplot_score = sns.swarmplot(x="rating", y="score", data=GI_data, hue='rating')
plt.xticks(rotation=-90)
plt.show()

Let’s do some analysis of publisher and developer statistics. We’ll start by getting a list of publishers and developers; to keep things manageable, we’ll just take the 100 most common of each.

publishers = Counter(GI_data['publisher'])
developers = Counter(GI_data['developer'])

pub_list = []
dev_list = []

for item, freq in publishers.most_common(100):
    pub_list.append(item)

for item, freq in developers.most_common(100):
    dev_list.append(item)
    
print(pub_list[:10])
print("--------------")
print(dev_list[:10])
[' Nintendo', ' Telltale Games', ' Square Enix', ' Ubisoft', ' Sony Computer Entertainment', ' Devolver Digital', ' Warner Bros. Interactive', ' Microsoft Game Studios', ' Electronic Arts', ' Bandai Namco']
--------------
[' Telltale Games', ' Nintendo', ' Capcom', ' Square Enix', ' EA Tiburon', ' Atlus', ' Visual Concepts', ' EA Canada', ' Dontnod Entertainment', ' Ubisoft Montreal']

If we make separate publisher and developer dataframes, we can transform them by computing the mean and median score within each group. Joining those results back together gives a dataframe, grouped by publisher, that also carries each publisher’s mean and median score. We can do the same thing for developers.

pub_df = pd.DataFrame(index=None)
dev_df = pd.DataFrame(index=None)

# append rows for individual publishers to dataframe

def custom_mean(group):
    group['mean'] = group['score'].mean()
    return group

def custom_median(group):
    group['median'] = group['score'].median()
    return group

for pub in pub_list:
    scores = pd.DataFrame(GI_data[(GI_data.publisher == pub) & (GI_data.score)], index=None)
    pub_df = pub_df.append(scores, ignore_index=True)

pub_mean_df = pub_df.groupby('publisher').apply(custom_mean)
pub_median_df = pub_df.groupby('publisher').apply(custom_median)
pub_median = pub_median_df['median']
pub_merged_df = pub_mean_df.join(pub_median)
print(pub_merged_df.head(3))
pub_merged_df.to_csv('pub_merged.csv')

for dev in dev_list:
    scores = pd.DataFrame(GI_data[(GI_data.developer == dev) & (GI_data.score)], index=None)
    dev_df = dev_df.append(scores, ignore_index=True)

dev_mean_df = dev_df.groupby('developer').apply(custom_mean)
dev_median_df = dev_df.groupby('developer').apply(custom_median)
dev_median = dev_median_df['median']
dev_merged_df = dev_mean_df.join(dev_median)
dev_merged_df.to_csv('dev_merged.csv')
print(dev_merged_df.head(3))
                                               title  score  \
0                          Fire Emblem: Three Houses      9   
1        Marvel Ultimate Alliance 3: The Black Order      7   
2  Cadence of Hyrule - Crypt of the Necrodancer F...      7   

              author r_platform o_platform  publisher              developer  \
0  Kimberley Wallace     Switch        NaN   Nintendo    Intelligent Systems   
1      Andrew Reiner     Switch        NaN   Nintendo             Team Ninja   
2     Suriel Vazquez     Switch        NaN   Nintendo   Brace Yourself Games   

     release_date           rating     mean  median  
0   July 26, 2019              NaN  7.73913     7.0  
1   July 19, 2019             Teen  7.73913     7.0  
2   June 13, 2019   Rating Pending  7.73913     7.0  
                                              title  score             author  \
0                The Walking Dead: The Final Season      7  Kimberley Wallace   
1    The Walking Dead: The Final Season - Episode 1      7  Kimberley Wallace   
2  Batman: The Enemy Within Episode 5 - Same Stitch      9      Javy Gwaltney   

      r_platform                       o_platform        publisher  \
0  PlayStation 4             Xbox One, Switch, PC   Telltale Games   
1             PC          PlayStation 4, Xbox One   Telltale Games   
2  PlayStation 4  PlayStation 4, Xbox One, Switch   Telltale Games   

         developer      release_date           rating  mean  median  
0   Telltale Games    March 27, 2019   Rating Pending  7.16     7.0  
1   Telltale Games   August 14, 2018           Mature  7.16     7.0  
2   Telltale Games    March 27, 2018           Mature  7.16     7.0  
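
As an aside, pandas can compute the same per-group statistics more compactly with groupby() and transform(), which broadcasts each group's aggregate back onto the original rows. Here is a minimal sketch, assuming the same GI_data dataframe; the pub_mean and pub_median column names are just illustrative.

# compute each publisher's mean and median score and attach them to every row
GI_stats = GI_data.copy()
GI_stats['pub_mean'] = GI_stats.groupby('publisher')['score'].transform('mean')
GI_stats['pub_median'] = GI_stats.groupby('publisher')['score'].transform('median')
print(GI_stats[['title', 'publisher', 'score', 'pub_mean', 'pub_median']].head())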

Let’s select just the top developers and see what their mean and median scores are.

dev_data = pd.read_csv('dev_merged.csv')
dev_data = dev_data.drop(dev_data.columns[0], axis=1)
developer_names = list(dev_data['developer'].unique())
#print(developer_names)

dev_examples = pd.DataFrame(index=None)

for dev in developer_names:
    dev_examples = dev_examples.append(dev_data[dev_data.developer == dev].iloc[0])

#print(dev_examples.to_string())

dev_stats = dev_examples[['developer', 'mean', 'median']]
print(dev_stats.head(20))
                    developer      mean  median
0              Telltale Games  7.160000     7.0
25                   Nintendo  7.714286     7.0
39                     Capcom  7.857143     9.0
46                Square Enix  8.111111     9.0
55                 EA Tiburon  6.666667     7.0
61                      Atlus  7.666667     7.0
67            Visual Concepts  7.666667     7.0
76                  EA Canada  7.000000     7.0
80      Dontnod Entertainment  7.000000     7.0
83           Ubisoft Montreal  7.000000     7.0
87              Firaxis Games  8.428571     9.0
94                   TT Games  7.666667     7.0
97     Blizzard Entertainment  9.000000     9.0
104             From Software  8.714286     9.0
111                Game Freak  7.000000     7.0
113   Double Fine Productions  7.666667     7.0
116       Intelligent Systems  8.200000     9.0
121                      SEGA  8.600000     9.0
126            HAL Laboratory  7.000000     7.0
127            Spike Chunsoft  6.000000     6.0

Another interesting thing we could do is get the worst-reviewed games, those that received a score of 4 or lower.

# how you can filter on just one criterion and then pull out only the column you care about
# print(GI_data[GI_data.score <= 4].title)
# using .loc is actually the preferred method...
print(GI_data.loc[GI_data.score <= 4, 'title'])
69               R.B.I. Baseball 19
124     Overkill's The Walking Dead
134                   The Quiet Man
217               Tennis World Tour
220                           Agony
259               Fear Effect Sedna
301                  Hello Neighbor
343              Raid: World War II
483              R.B.I. Baseball 17
488                 Old Time Hockey
489                 Old Time Hockey
503                      1-2-Switch
589                    One Way Trip
612             Ghostbusters (2016)
637       Homefront: The Revolution
722                   Devil's Third
767                        Armikrog
815                        Godzilla
820     Payday 2: Crimewave Edition
933              Escape Dead Island
935       Sonic Boom: Rise of Lyric
1045          Rambo: The Video Game
1063                         Rekoil
Name: title, dtype: object
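
Since .loc accepts a list of column labels, the same filter can also pull out several columns at once, for example:

# pull the title, score, and author for every game scored 4 or lower
print(GI_data.loc[GI_data.score <= 4, ['title', 'score', 'author']])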

We could also get only the games published by Nintendo by using isin to keep just the rows whose ‘publisher’ value is ‘Nintendo’.

# can use "isin" in a series...
# spaces are in this, be sure to include them

r = GI_data[GI_data['publisher'].isin([' Nintendo'])][:40]
print(r)

[Output: the first 40 reviews of Nintendo-published games, from Fire Emblem: Three Houses and Marvel Ultimate Alliance 3: The Black Order (2019) back to Fire Emblem Echoes: Shadows of Valentia (2017), along with each game's score, author, platforms, developer, release date, and ESRB rating.]

Finally, let’s do a swarmplot of scores for Nintendo-associated developers. We’ll take the games published by Nintendo and plot the scores they received, grouped by developer.

plt.figure(figsize=(8, 10))
dev_score = sns.swarmplot(x="developer", y="score", data=r, hue='developer')
plt.grid()
plt.xticks(rotation=-90)
plt.show()

If you’d like to get better acquainted with visualizing data, I suggest checking out the documentation of Pandas and Seaborn and trying to visualize some simple datasets, such as the ones below:

Iris Dataset

Pokemon with Stats

Mushroom Dataset

Did It Rain In Seattle?
