Blog Feed

Data Analysis Capstone Proposal

The tentative title for my project is: 

Investigating Religion and Poverty as They Relate to the Spread of COVID-19

There is now a wealth of data online regarding the number of COVID-19 cases in different regions of the world. Many of these datasets and accompanying analyses use variables like population density and number of hospital beds to predict case counts and death counts. I'd like to examine less commonly studied variables, religion and poverty, and see whether they show any correlation with the spread of SARS-CoV-2 (as measured by relative case counts).

I hypothesize that both variables are associated with large gatherings of people and therefore could be tied to increased spread of the coronavirus. For instance, more religious areas may have more gatherings due to church attendance, while more poverty-stricken areas may have more people rooming together to afford housing, and both could conceivably play a role in the spread of SARS-CoV-2. For that reason, I want to see if there are any correlations between these two variables and coronavirus case counts. I'd also like to see if these variables have any potential moderating effects on each other.

I think that finding potentially interesting correlations between these variables and the spread of SARS-CoV-2 could help lay the groundwork for a better understanding of how diseases like COVID-19 spread through populations.

K-Means Clustering on the Gapminder Dataset

These are the results of a clustering analysis run on the Gapminder dataset. The features used to generate the clusters are: alcohol consumption, breast cancer rate, employment rate, internet use rate, life expectancy, and urbanization rate. 

After preprocessing the data, which includes scaling the features, it is separated into training and testing datasets.

from sklearn.preprocessing import scale, MinMaxScaler
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA

import statsmodels.api as sm

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
       print("Column {} is missing:".format(col))
       print((dataframe[col].values == ' ').sum())
       print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

scaler = MinMaxScaler()

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)

#print(X["internetuserate"].mean(axis=0))

X['internetuserate_scaled'] = scale(X['internetuserate'])
X['urbanrate_scaled'] = scale(X['urbanrate'])
X['lifeexpectancy_scaled'] = scale(X['lifeexpectancy'])
X['employrate_scaled'] = scale(X['employrate'])
X['alconsumption'] = scale(X['alcconsumption'])
X['breastcancerper100th'] = scale(X['breastcancerper100th'])

#print(X['internetuserate'].mean(axis=0))
#print(X['internetuserate_scaled'].mean(axis=0))

X_train, X_test = train_test_split(X, test_size=0.25)

Afterwards, the model is fit for a range of cluster counts to determine which number of clusters seems to give the best performance.

clusters = range(1, 10)
mean_dist = []

for i in clusters:
    model = KMeans(n_clusters=i)
    model.fit(X_train)
    clus_assign = model.predict(X_train)
    mean_dist.append(sum(np.min(cdist(X_train, model.cluster_centers_, 'euclidean'),
                                axis=1)) / X_train.shape[0])


plt.plot(clusters, mean_dist)
plt.xlabel("Num. Clusters")
plt.ylabel("Avg. Distance")
plt.title("Graph of Cluster Performance")
plt.show()

Around 4 clusters seems to give the best trade-off between performance and guarding against overfitting.

Afterwards, the model is run again with 4 clusters. To visualize the result, the features are reduced to two canonical variables with PCA and plotted, with each point colored by its cluster assignment. The cluster assignments are then joined back to the training data using unique identifiers, the number of observations in each cluster is counted, and finally the means of the clustering variables are calculated for each cluster so that the cluster profiles can be explored.

K_means_model = KMeans(n_clusters=4)
K_means_model.fit(X_train)
clus_assign = K_means_model.predict(X_train)

pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(X_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=K_means_model.labels_)
plt.xlabel("Canonical Variable 1")
plt.ylabel("Canonical Variable 2")
plt.title("Plot of Variables for 3 Clusters")
plt.show()

# check pattern of means by cluster to see if the means are distinct and meaningful
# create unique identifiers from the index of cluster training data
X_train.reset_index(level=0, inplace=True)
cluster_list = list(X_train['index'])
labels = list(K_means_model.labels_)
new_dict = dict(zip(cluster_list, labels))
print(new_dict)
new_cluster = pd.DataFrame.from_dict(new_dict, orient='index')
print(new_cluster)

# create unique identifiers for cluster assignment variable
# rename cluster assignment column
new_cluster.columns = ["cluster"]
new_cluster.reset_index(level=0, inplace=True)
merged_train = pd.merge(X_train, new_cluster, on='index')
print(merged_train.head(20))
print(merged_train.cluster.value_counts())

# we can now calculate means of variables for the different clusters
cluster_group = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(cluster_group)

Here are the results of the code:

{188: 2, 95: 3, 42: 0, 84: 1, 192: 2, 102: 2, 208: 2, 175: 1, 77: 0, 55: 0, 11: 3, 136: 1, 196: 3, 210: 2, 54: 3, 58: 2, 186: 0, 81: 0, 207: 3, 53: 3, 50: 1, 66: 0, 9: 1, 62: 0, 154: 3, 44: 3, 26: 3, 12: 3, 168: 2, 203: 1, 133: 2, 10: 1, 152: 0, 4: 0, 86: 2, 212: 2, 69: 1, 174: 1, 173: 1, 72: 3, 150: 0, 21: 2, 211: 2, 80: 2, 116: 3, 104: 1, 89: 0, 28: 2, 201: 1, 183: 2, 38: 0, 140: 0, 19: 2, 197: 0, 57: 2, 79: 2, 18: 0, 151: 3, 17: 1, 45: 0, 70: 0, 15: 1, 41: 2, 60: 2, 182: 3, 97: 2, 59: 1, 29: 2, 180: 2, 146: 2, 190: 2, 46: 3, 130: 3, 48: 3, 105: 3, 205: 2, 68: 0, 204: 3, 156: 1, 200: 3, 195: 3, 113: 3, 96: 0, 114: 2, 32: 1, 78: 2, 148: 3, 106: 2, 185: 1, 94: 1, 149: 2, 122: 2, 100: 1, 63: 1, 153: 3, 33: 0, 14: 2, 139: 1, 179: 1, 135: 2, 36: 2, 35: 2, 118: 2, 202: 1, 111: 1, 176: 2, 194: 2, 189: 2, 107: 0, 91: 1, 144: 1, 67: 0, 25: 3, 64: 1, 178: 0, 124: 3, 13: 3, 159: 3, 27: 3}
     0
188  2
95   3
42   0
84   1
192  2
..  ..
178  0
124  3
13   3
159  3
27   3

[119 rows x 1 columns]
    index alcconsumption  ...  alconsumption  cluster
0     188           3.39  ...      -0.695805        2
1      95            .65  ...      -1.245141        3
2      42           4.46  ...      -0.481284        0
3      84          16.12  ...       1.856402        1
4     192           1.92  ...      -0.990522        2
5     102           4.72  ...      -0.429157        2
6     208           3.91  ...      -0.591552        2
7     175          14.94  ...       1.619826        1
8      77            7.1  ...       0.048004        0
9      55            .32  ...      -1.311302        0
10     11          13.34  ...       1.299046        3
11    136           9.75  ...       0.579296        1
12    196           3.02  ...      -0.769986        3
13    210             .2  ...      -1.335361        2
14     54           9.43  ...       0.515140        3
15     58           1.64  ...      -1.046659        2
16    186           1.49  ...      -1.076732        0
17     81           5.92  ...      -0.188572        0
18    207            7.6  ...       0.148247        3
19     53           6.28  ...      -0.116396        3

[20 rows x 13 columns]
2    39
3    29
1    28
0    23
Name: cluster, dtype: int64
Clustering variable means by cluster
              index  breastcancerper100th  ...  employrate_scaled  alconsumption
cluster                                    ...                                  
0         87.869565             -0.456400  ...          -0.000201      -0.291431
1        108.535714              1.297299  ...           0.010051       0.732239
2        119.205128             -0.673682  ...           0.459523      -0.636122
3        105.620690              0.096920  ...          -0.479776       0.225055

[4 rows x 10 columns]

Ignoring the index column, we can look at the cluster means for the remaining variables. It looks like, for example, observations in cluster 1 have notably higher breast cancer rates than the other clusters, while observations in cluster 2 tended to have much lower alcohol consumption rates than average.

Logistic Regression: Gapminder Dataset

I ran a logistic regression on the Gapminder dataset. I chose to explore the relationship between life expectancy and internet use rate, as potentially moderated by employment rate. My hypothesis is that there is a positive correlation between life expectancy and internet use rate.

First, we’ll examine the relationship between life expectancy and internet use rate. After binning the life expectancy into two different categories, we can run the first logistic regression model.

from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import numpy as np

import statsmodels.api as sm

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

scaler = MinMaxScaler()

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)

print(X["internetuserate"].mean(axis=0))

X['internetuserate_scaled'] = scale(X['internetuserate'])
X['urbanrate_scaled'] = scale(X['urbanrate'])
X['lifeexpectancy_scaled'] = scale(X['lifeexpectancy'])
X['employrate_scaled'] = scale(X['employrate'])

print(X['internetuserate'].mean(axis=0))
print(X['internetuserate_scaled'].mean(axis=0))

def bin_half(dataframe):

    if dataframe['lifeexpectancy'] >= 50.0:
        return 1
    else:
        return 0

df3 = df2.copy()
df3['lifeexpectancy_bins'] = df3.apply(lambda x: bin_half(x), axis=1)

log_reg_1 = smf.logit(formula="lifeexpectancy_bins ~ internetuserate", data=df3).fit()
print(log_reg_1.summary())
print("Odd ratio:")
print(np.exp(log_reg_1.params))
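# The confidence intervals in the output below were printed separately; a minimal
# sketch of how they could be produced from the fitted model above:
conf_int = log_reg_1.conf_int()
conf_int['OR'] = log_reg_1.params
conf_int.columns = ['Lower CI', 'Upper CI', 'OR']
print("Confidence intervals:")
print(np.exp(conf_int))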
                            Logit Regression Results                           
===============================================================================
Dep. Variable:     lifeexpectancy_bins   No. Observations:                  192
Model:                           Logit   Df Residuals:                      190
Method:                            MLE   Df Model:                            1
Date:                 Mon, 25 May 2020   Pseudo R-squ.:                0.003350
Time:                         01:40:11   Log-Likelihood:                -66.058
converged:                        True   LL-Null:                       -66.280
Covariance Type:             nonrobust   LLR p-value:                    0.5052
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           2.3015      0.394      5.842      0.000       1.529       3.074
internetuserate    -0.0055      0.008     -0.670      0.503      -0.022       0.011
===================================================================================
Odds ratio:
Intercept          9.989521
internetuserate    0.994534
dtype: float64
Confidence intervals:
                 Lower CI   Upper CI        OR
Intercept        4.615573  21.620398  9.989521
internetuserate  0.978717   1.010607  0.994534

The results of the first model show a fairly large P-value of approximately 0.50, implying that there's no significant relationship between life expectancy and internet use rate. The odds ratio of approximately 0.99 shows that the odds of falling into the longer-lived group are roughly the same regardless of internet use rate. We can be 95% confident that the true population odds ratio falls somewhere between about 0.98 and 1.01, an interval that includes 1. It looks like the results of this first logistic regression model do not support my hypothesis.

After running the first regression model, I ran another model that checked for possible confounding from the “employment rate” variable.
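A minimal sketch of how this second model could have been fit (assuming the same df3 dataframe and binned life expectancy variable from above; the variable name log_reg_2 is just illustrative):

log_reg_2 = smf.logit(formula="lifeexpectancy_bins ~ internetuserate + employrate", data=df3).fit()
print(log_reg_2.summary())
print("Odds ratio:")
print(np.exp(log_reg_2.params))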

                           Logit Regression Results                           
===============================================================================
Dep. Variable:     lifeexpectancy_bins   No. Observations:                  167
Model:                           Logit   Df Residuals:                      164
Method:                            MLE   Df Model:                            2
Date:                 Mon, 25 May 2020   Pseudo R-squ.:                  0.3124
Time:                         01:40:11   Log-Likelihood:                -22.081
converged:                        True   LL-Null:                       -32.114
Covariance Type:             nonrobust   LLR p-value:                 4.393e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -1.3109      2.402     -0.546      0.585      -6.018       3.396
internetuserate     0.2311      0.105      2.209      0.027       0.026       0.436
employrate          0.0320      0.035      0.919      0.358      -0.036       0.100
===================================================================================

Possibly complete quasi-separation: A fraction 0.41 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Odds ratio:
Intercept          0.269569
internetuserate    1.259940
employrate         1.032473
dtype: float64
Confidence intervals:
                 Lower CI   Upper CI        OR
Intercept        0.002434  29.851462  0.269569
internetuserate  1.026374   1.546656  1.259940
employrate       0.964467   1.105275  1.032473

When this second regression model was run, there did appear to be a significant relationship between life expectancy and internet use rate, as the model returned a P-value of 0.027 for internet use rate. Meanwhile, the P-value for employment rate was approximately 0.36, indicating a non-significant relationship. That said, the relationship appears to be fairly weak: the odds ratio of roughly 1.26 means that each additional percentage point of internet use is associated with about 1.26 times greater odds of being in the longer life expectancy group. For employment rate, the odds ratio is very near 1 (1.03), which again implies little relationship between employment rate and life expectancy. Because the estimated effect of internet use rate changed so much once employment rate was added, the results suggest that employment rate confounds the relationship between internet use rate and life expectancy.

Running A Lasso Regression On The Gapminder Dataset

I’ll be carrying out lasso regression on the Gapminder dataset. The features we are investigating are the following: “alcohol consumption”, “breast cancer per 100k”, “employment rate”, “internet use rate”, “life expectancy”, and “urbanization rate”. We are hoping to predict which income group the data points (countries) fall into based on the provided features.

First, we’ll need to load in the data and preprocess it, scaling the features of interest.

from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
       print("Column {} is missing:".format(col))
       print((dataframe[col].values == ' ').sum())
       print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

scaler = MinMaxScaler()

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)

print(X["internetuserate"].mean(axis=0))


X['alcconsumption'] = scale(X['alcconsumption'])
X['breastcancerper100th'] = scale(X['breastcancerper100th'])
X['internetuserate'] = scale(X['internetuserate'])
X['urbanrate'] = scale(X['urbanrate'])
X['lifeexpectancy'] = scale(X['lifeexpectancy'])
X['employrate'] = scale(X['employrate'])

print(X['internetuserate'].mean(axis=0))

X.astype(float)
Y = df_clean['income_group']
Y.astype(float)

Now that the data has been preprocessed, we can fit the lasso regression model and examine its coefficients and errors.
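The model-fitting step can be sketched roughly as follows (this is an illustrative reconstruction, assuming the scaled feature matrix X and the income_group target Y built above, and the LassoLarsCV estimator imported earlier):

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

# lasso regression with least angle regression and k-fold cross-validation
lasso_model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

# regression coefficients for each feature
print("Influence:")
print(dict(zip(X.columns, lasso_model.coef_)))

# mean squared error on the training and test sets
print("Train error:")
print(mean_squared_error(y_train, lasso_model.predict(X_train)))
print("Test error:")
print(mean_squared_error(y_test, lasso_model.predict(X_test)))

# R-squared on the training and test sets
print("R-squared train:")
print(lasso_model.score(X_train, y_train))
print("R-squared test:")
print(lasso_model.score(X_test, y_test))

Here are the results of the code: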

Influence:
{'alcconsumption': 0.06989539588334298,
 'breastcancerper100th': -0.14994722784801043,
 'employrate': -0.1439463951270292,
 'internetuserate': 0.05671759315545794,
 'lifeexpectancy': 0.43140559604575485,
 'urbanrate': 0.2946352802900217}

Train error:
0.31411149133934124
Test error:
0.3410401541834608
R-squared train:
0.6057350783025499
R-squared test:
0.575668143518999

No features were dropped by the lasso regression model. Life expectancy and urbanization rate appeared to have the strongest associations with income level.

Accuracy on the test dataset was around 66%, which is better than chance guessing but still leaves much room for improvement. Adding more features to the model would probably improve its predictive accuracy. In terms of R-squared, the model performed slightly better on the training dataset, explaining approximately 60% of the variance there, compared with approximately 57% of the variance explained on the test set.

Multiple Regression on the Gapminder Dataset

I am running a multiple regression model on the Gapminder dataset. The dependent variable under study is life expectancy, while the independent variables we are investigating are “internet use rate”, “urbanization rate”, and “employment rate”.

First, we’re going to load in the data and do any necessary preprocessing of the dataset. This includes scaling the features of interest.

from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import statsmodels.api as sm

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
       print("Column {} is missing:".format(col))
       print((dataframe[col].values == ' ').sum())
       print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

scaler = MinMaxScaler()

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)

print(X["internetuserate"].mean(axis=0))

X['internetuserate_scaled'] = scale(X['internetuserate'])
X['urbanrate_scaled'] = scale(X['urbanrate'])
X['lifeexpectancy_scaled'] = scale(X['lifeexpectancy'])
X['employrate_scaled'] = scale(X['employrate'])

print(X['internetuserate'].mean(axis=0))
print(X['internetuserate_scaled'].mean(axis=0))

Now that we’ve scaled the features, we can run a multiple linear regression on the chosen features. We’ll also visualize the residuals by creating some plots: a QQ plot, a standardized residual plot, and an influence/leverage plot.

print("Mult. Regression model findings:")
mult_lin_reg = smf.ols("lifeexpectancy_scaled ~ internetuserate_scaled + urbanrate_scaled + employrate_scaled", data=X).fit()
print(mult_lin_reg.summary())

qq_plot = sm.qqplot(mult_lin_reg.resid, line='r')
plt.show()

stdres = pd.DataFrame(mult_lin_reg.resid_pearson)
stdres_fig = plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.xlabel('Observation Numbers')
plt.ylabel('Std. Residuals')
plt.show()

lev_plot = sm.graphics.influence_plot(mult_lin_reg, size=8)
plt.show()

1)  Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary. 

The response variable is life expectancy, while the explanatory variables are internet use rate, urbanization rate, and employment rate. All three explanatory variables have significant P-values: approximately 0, 0.003, and 0.012 for internet use rate, urbanization rate, and employment rate respectively. Checking the beta coefficients for the three explanatory variables gives some insight into the strength of each relationship. Internet use rate has a coefficient of approximately 0.60, while urbanization rate has about a third of the relationship strength that internet use rate does. Employment rate has a slightly negative coefficient, though when the two variables are graphed the relationship appears to be non-linear and roughly parabolic.

2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable. 

The hypothesis was that internet use rate was positively correlated with life expectancy, and the multiple regression model seems to support this hypothesis.

3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables); 

As none of the three tested variables had a P-value of greater than 0.05, it doesn’t seem that there is any evidence of confounding variables. 

4) Generate the following regression diagnostic plots:

a) q-q plot

b) standardized residuals for all observations

c) leverage plot

d) Write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.

The QQ plot shows a fairly tight fit to the distribution line, though the points deviate from it at the lower and upper ends. This suggests that the residuals are roughly, but not perfectly, normally distributed. Including other, higher order, variables in the regression model might help better capture the relationship (a sketch of this follows below).

The standardized residual graph shows that the vast majority of instances in the dataset have residuals that fall within two standard deviations from the mean, which fits the expectations for a normal distribution. It doesn’t look like any data points fall outside of three standard deviations from the mean, though there are some potential outliers. The model’s fit could potentially be improved by adding in new variables that might impact the relationships with the dependent variable.

The leverage plot suggests that there are a handful of outliers that have a notable, though not overwhelming, influence on the model’s estimations (points 183, 178, and 180). There also appear to be some instances which have a powerful impact on the model’s estimations, though in general they are not outliers as they do not fall outside of 1 standard deviation from the mean (the cluster on the far right of the graph).
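As a follow-up to the suggestion above about higher order terms, one way to probe the apparently parabolic relationship with employment rate would be to add a quadratic term to the model. This is a hypothetical sketch, not part of the original analysis, reusing the scaled variables in X:

mult_lin_reg_2 = smf.ols("lifeexpectancy_scaled ~ internetuserate_scaled + urbanrate_scaled"
                         " + employrate_scaled + I(employrate_scaled**2)", data=X).fit()
print(mult_lin_reg_2.summary())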

Running A Random Forest Classifier On The Gapminder Dataset

We’ll be running a Random Forest classifier on the Gapminder dataset, using a handful of variables to predict income levels, which have been grouped into three different categories.

We’ll start by loading in the data, preprocessing it, and creating a new feature: the income level divided into three groups.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

df_clean = df2.dropna()


def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

We’ll now use the Train-Test split function to divide the dataset into training and testing sets. After this, we’ll fit the classifier, get predictions and print some statistics about the performance of the model.

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)
Y = df_clean['income_group']
Y.astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

RF_clf = RandomForestClassifier(n_estimators=300, bootstrap = True, max_features = 'sqrt')
RF_clf.fit(X_train, y_train)
preds = RF_clf.predict(X_test)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

Here are the accuracy score and the confusion matrix:

0.85


[[ 8 0 1]
[ 2 0 0]
[ 3 0 26]]
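One natural follow-up, not shown in the original output, is to inspect which features the random forest relied on most. A minimal sketch using the fitted RF_clf and the feature matrix X from above:

# feature importances, sorted from most to least influential
importances = pd.Series(RF_clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)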

Running A Basic Linear Regression: Gapminder Dataset

We’ll be running a basic linear regression on the Gapminder dataset. To begin with, we’ll start by importing all the necessary libraries and transforming the data.

from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
       print("Column {} is missing:".format(col))
       print((dataframe[col].values == ' ').sum())
       print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

#print(df_clean.head())

Now let’s create an instance of the scaler from scikit-learn.

scaler = MinMaxScaler()

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)

We’ll now check the mean before we center/scale the data and after we do the scaling.

print(X["internetuserate"].mean(axis=0))

X['internetuserate_scaled'] = scale(X['internetuserate'])
#print(X_scaled.head())
print(X['internetuserate'].mean(axis=0))
print(X['internetuserate_scaled'].mean(axis=0))

Here are our results:

34.192929910121784
9.496247254655427e-17

Notice that the top number is the mean before scaling, while the bottom number is the mean after scaling. The bottom number is very close to zero.

Finally, we’ll run the regressions between the scaled internet use rate and our two variables of interest: life expectancy and employment rate.
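A minimal sketch of how these two simple regressions could be run with statsmodels (assuming the dataframe X built above; the result variable names are illustrative):

ols_life = smf.ols("internetuserate_scaled ~ lifeexpectancy", data=X).fit()
print(ols_life.summary())

ols_employ = smf.ols("internetuserate_scaled ~ employrate", data=X).fit()
print(ols_employ.summary())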

Here are the results of the first regression:

                           OLS Regression Results                              
==================================================================================
Dep. Variable:     internetuserate_scaled   R-squared:                       0.593
Model:                                OLS   Adj. R-squared:                  0.591
Method:                     Least Squares   F-statistic:                     229.0
Date:                    Tue, 21 Apr 2020   Prob (F-statistic):           1.77e-32
Time:                            21:56:40   Log-Likelihood:                -154.10
No. Observations:                     159   AIC:                             312.2
Df Residuals:                         157   BIC:                             318.3
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -5.4670      0.365    -14.984      0.000      -6.188      -4.746
lifeexpectancy     0.0786      0.005     15.132      0.000       0.068       0.089
==============================================================================
Omnibus:                       10.462   Durbin-Watson:                   1.879
Prob(Omnibus):                  0.005   Jarque-Bera (JB):                4.239
Skew:                          -0.001   Prob(JB):                        0.120
Kurtosis:                       2.200   Cond. No.                         503.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Here are the results of the second regression:

                              OLS Regression Results                              
==================================================================================
Dep. Variable:     internetuserate_scaled   R-squared:                       0.040
Model:                                OLS   Adj. R-squared:                  0.033
Method:                     Least Squares   F-statistic:                     6.468
Date:                    Tue, 21 Apr 2020   Prob (F-statistic):             0.0120
Time:                            21:56:40   Log-Likelihood:                -222.40
No. Observations:                     159   AIC:                             448.8
Df Residuals:                         157   BIC:                             454.9
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.1340      0.453      2.505      0.013       0.240       2.028
employrate    -0.0192      0.008     -2.543      0.012      -0.034      -0.004
==============================================================================
Omnibus:                       15.938   Durbin-Watson:                   1.710
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               14.963
Skew:                           0.682   Prob(JB):                     0.000563
Kurtosis:                       2.371   Cond. No.                         347.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The first regression finds a very strong relationship between the two variables, with an extremely small P-value of 1.77e-32. Meanwhile, the second regression still finds a significant, though considerably weaker, relationship, as indicated by its P-value of 0.0120 and much lower R-squared (0.040 versus 0.593).

Running a Classification Tree

This is an implementation of a decision tree classifier run on the GapMinder dataset.

Here’s the code used to run the Decision Tree classifier:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
       print("Column {} is missing:".format(col))
       print((dataframe[col].values == ' ').sum())
       print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)

df = pd.read_csv("gapminder.csv")

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

print(df_clean.head())

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)
Y = df_clean['income_group']
Y.astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

DT_clf = DecisionTreeClassifier()
DT_clf.fit(X_train, y_train)
preds = DT_clf.predict(X_test)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

The features used by the Decision Tree classifier are:

'alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate'

Meanwhile, the target variable is the income group that was created in the lesson on confounding variable comparisons.

Here are the shapes of the training and testing datasets:

(127, 6)
(127,)
(32, 6)
(32,)

Here’s the accuracy for the decision tree:

0.8125

And here’s the confusion matrix:
[[ 8 0 2]
[ 1 0 0]
[ 3 0 18]]

The correct predictions run along the diagonal from the top left to the bottom right. It looks like the model classified 8 instances of Group 1 correctly, 0 of Group 2, and 18 of Group 3. The model seems to be weakest at classifying Group 2 instances, which make up only a small fraction of the test set. Adding explanatory variables beyond the ones I selected would likely help improve the model’s predictive power.

Testing Potential Moderators

We’re going to be checking to see if income level might have any effect on the relationship between internet usage rates and life expectancy.

Here’s the code used to carry out the check for moderators.

from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import numpy as np

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

df_clean = df2.dropna()
df_clean['incomeperperson'] = df_clean['incomeperperson'].replace('', np.nan)

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

#plot_regression('lifeexpectancy', 'internetuserate', df_clean, 'Life Expectancy', 'Internet Use Rate')
#plot_regression('employrate', 'internetuserate', df_clean, 'Employment Rate', 'Internet Use Rate')

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)

subframe_1 = df_clean[(df_clean['income_group'] == 1)]
subframe_2 = df_clean[(df_clean['income_group'] == 2)]
subframe_3 = df_clean[(df_clean['income_group'] == 3)]

print('Association between life expectancy and internet use rate for low income countries')
print(pearsonr(subframe_1['lifeexpectancy'], subframe_1['internetuserate']))
print('Association between life expectancy and internet use rate for medium income countries')
print(pearsonr(subframe_2['lifeexpectancy'], subframe_2['internetuserate']))
print('Association between life expectancy and internet use rate for high income countries')
print(pearsonr(subframe_3['lifeexpectancy'], subframe_3['internetuserate']))

Three different income groups are created using a lambda function and the “group_incomes” function.

Here are the results of the code:

Association between life expectancy and internet use rate for low income countries
(0.38386370068495235, 0.010101223355274047)
Association between life expectancy and internet use rate for medium income countries
(0.9966009508278395, 0.05250454954743393)
Association between life expectancy and internet use rate for high income countries
(0.7019997488251704, 6.526819886007788e-18)

For low income countries, there is a moderate correlation between life expectancy and internet use rate, with a small P-value. For middle income countries, the correlation appears very strong, but the P-value (about 0.053) is just above the 0.05 threshold and so cannot be considered significant, likely reflecting the small number of countries that fall in that narrow income band. For high income countries, the correlation is fairly strong with a very small P-value, suggesting a significant relationship.

Writing About Your Data

This is a summary of the GapMinder data and the chosen attributes/features for the analysis project.

Step 1: Describe your sample. 

The dataset I am using is the GapMinder data, collected by the GapMinder foundation (created by Ola Rosling, Anna Rosling Rönnlund and Hans Rosling). The GapMinder dataset includes demographic data for all 192 member states of the United Nations as well as 24 other geographical regions.

The populations studied in the GapMinder data are the populations of the 215 geographic regions, and the level of analysis is the population of each region (aggregate statistics on that population). There are 215 observations in the dataset, the 192 UN members plus the 24 other regions. I am using all 215 observations in the dataset for my analysis, examining trends in their populations.

Step 2: Describe the procedures that were used to collect the data.

The GapMinder dataset contains information about a variety of population metrics like income per person, CO2 emissions, employment rate, internet use rate, and life expectancy. This data was collected from a variety of sources such as the US Census Bureau’s International Database, the World Bank, and the United Nations Statistics Division, mainly through data reporting and surveys. The data was collected from 2002 to 2011, aggregated through government reports and surveys done by independent organizations like the World Bank, and covers the 215 geographical regions mentioned above.

Step 3: Describe your variables.

I’m curious to see if there might be some relationship between internet use rate and life expectancy, or beyond that, a relationship between internet use rate and employment rate. This means that the explanatory variable is internet use rate, which tracks the percentage of a country’s population that has access to the internet. Meanwhile, life expectancy tracks the average lifespan of a person within the various countries, and employment rate tracks the percentage of the population that is employed. The scales for internet use rate and employment rate run from 0 to 100%, while life expectancy begins at 0 and runs upwards, with an upper bound of approximately 82 years. The explanatory and response variables were made numeric, and any blanks were replaced with NaN values.
