Logistic Regression: Gapminder Dataset

I ran a logistic regression on the Gapminder dataset. I chose to explore the relationship between life expectancy and internet use rate, and to check whether employment rate confounds that relationship. My hypothesis is that there is a positive association between life expectancy and internet use rate.

First, we’ll examine the relationship between life expectancy and internet use rate. After binning life expectancy into two categories (50 years or more versus under 50), we can run the first logistic regression model.
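The binning step can be illustrated on a toy frame. This is only a sketch: the values are made up, and the 50-year cutoff is the one used in the code below. It also shows one thing worth knowing about the comparison: a missing life expectancy compares as False, so a naive bin puts it in group 0 unless it is masked out explicitly.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), one of them missing
toy = pd.DataFrame({"lifeexpectancy": [48.7, 50.0, 73.1, np.nan]})

# Naive binning: NaN >= 50.0 evaluates to False, so the missing row becomes 0
naive_bins = (toy["lifeexpectancy"] >= 50.0).astype(int)

# Alternative: keep missing values missing so they are not counted as "short-lived"
masked_bins = naive_bins.where(toy["lifeexpectancy"].notna())

print(naive_bins.tolist())
print(masked_bins.tolist())
```

Whether missing rows should be dropped or kept depends on the analysis; the point here is only that the two choices give different group counts.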

from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import numpy as np

import statsmodels.api as sm

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(df.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

# explicit copy so that adding columns later does not trigger SettingWithCopyWarning
df_clean = df2.dropna().copy()

def plot_regression(x, y, data, label_1, label_2):

    reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()

def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(group_incomes, axis=1)

scaler = MinMaxScaler()

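# alcconsumption and urbanrate were not covered by the to_numeric calls above,
# so convert the whole subset here (the original X.astype(float) discarded its result)
X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']].apply(pd.to_numeric, errors='coerce')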

print(X["internetuserate"].mean(axis=0))

X['internetuserate_scaled'] = scale(X['internetuserate'])
X['urbanrate_scaled'] = scale(X['urbanrate'])
X['lifeexpectancy_scaled'] = scale(X['lifeexpectancy'])
X['employrate_scaled'] = scale(X['employrate'])

print(X['internetuserate'].mean(axis=0))
print(X['internetuserate_scaled'].mean(axis=0))

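def bin_half(row):

    # 1 = life expectancy of 50 years or more, 0 = otherwise
    # note: a missing life expectancy (NaN) fails the >= comparison and is also coded 0
    if row['lifeexpectancy'] >= 50.0:
        return 1
    else:
        return 0

df3 = df2.copy()
df3['lifeexpectancy_bins'] = df3.apply(bin_half, axis=1)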

log_reg_1 = smf.logit(formula="lifeexpectancy_bins ~ internetuserate", data=df3).fit()
print(log_reg_1.summary())
print("Odds ratio:")
print(np.exp(log_reg_1.params))

                            Logit Regression Results                           
===============================================================================
Dep. Variable:     lifeexpectancy_bins   No. Observations:                  192
Model:                           Logit   Df Residuals:                      190
Method:                            MLE   Df Model:                            1
Date:                 Mon, 25 May 2020   Pseudo R-squ.:                0.003350
Time:                         01:40:11   Log-Likelihood:                -66.058
converged:                        True   LL-Null:                       -66.280
Covariance Type:             nonrobust   LLR p-value:                    0.5052
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           2.3015      0.394      5.842      0.000       1.529       3.074
internetuserate    -0.0055      0.008     -0.670      0.503      -0.022       0.011
===================================================================================
Odds ratio:
Intercept          9.989521
internetuserate    0.994534
dtype: float64
Confidence intervals:
                 Lower CI   Upper CI        OR
Intercept        4.615573  21.620398  9.989521
internetuserate  0.978717   1.010607  0.994534
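The code that printed the confidence interval table is not shown above. A minimal sketch of one way to build such a table follows, assuming the coefficients and their confidence bounds are available as a pandas Series and DataFrame (with a fitted statsmodels result these would be `result.params` and `result.conf_int()`). The helper name `odds_ratio_table` is hypothetical, and the inputs are the rounded values from the summary, so the last digits differ slightly from the table above.

```python
import numpy as np
import pandas as pd

def odds_ratio_table(params, conf_int):
    """Exponentiate log-odds coefficients and their confidence bounds."""
    table = conf_int.copy()
    table.columns = ["Lower CI", "Upper CI"]
    table["OR"] = params          # aligned on the coefficient names
    return np.exp(table)

# Rounded coefficients and 95% bounds copied from the first model's summary
params = pd.Series({"Intercept": 2.3015, "internetuserate": -0.0055})
conf_int = pd.DataFrame(
    {0: [1.529, -0.022], 1: [3.074, 0.011]},
    index=params.index,
)

or_table = odds_ratio_table(params, conf_int)
print(or_table)
```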

The first model returns a fairly large P-value of approximately 0.50 for internet use rate, implying that there’s no significant relationship between life expectancy and internet use rate. The odds ratio of approximately 0.99 means that each one-unit increase in internet use rate multiplies the odds of a country being in the longer-lived group by about 0.99, which is essentially no change. The 95% confidence interval for the odds ratio runs from 0.98 to 1.01 and includes 1. The results of this logistic regression model do not support my hypothesis.
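To make the link between the coefficients and the odds ratios concrete, here is a small worked calculation using the rounded coefficients from the first summary. The 10-unit comparison and the baseline probability are illustrative additions, not figures from the original run.

```python
import math

intercept = 2.3015   # Intercept from the first model's summary
slope = -0.0055      # internetuserate coefficient from the first model's summary

# A logit coefficient is a change in log-odds; exponentiating gives an odds ratio
odds_ratio_per_unit = math.exp(slope)
odds_ratio_per_10_units = math.exp(10 * slope)

# Model-implied odds and probability of the longer-lived group when internet use is 0
baseline_odds = math.exp(intercept)
baseline_prob = baseline_odds / (1 + baseline_odds)

print(round(odds_ratio_per_unit, 4))
print(round(odds_ratio_per_10_units, 4))
print(round(baseline_prob, 3))
```

Even over a 10-unit difference in internet use rate, the odds ratio stays close to 1, which matches the non-significant result.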

After running the first regression model, I ran another model that checked for possible confounding from the “employment rate” variable.
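The code for this second model is not included in the post. A minimal sketch of how such a model could be fit with the same formula interface is shown here on synthetic data; the column names match the dataset, but the values, the sample size, and the explicit `dropna` step are assumptions. The output printed after it in this post comes from the original run, not from this sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Gapminder columns (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "internetuserate": rng.uniform(0, 100, n),
    "employrate": rng.normal(58, 10, n),
})
true_logit = (-1.0 + 0.03 * toy["internetuserate"]).to_numpy()
toy["lifeexpectancy_bins"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
toy.loc[:9, "employrate"] = np.nan   # pretend ten countries lack employment data

# Keep only rows where every model variable is present
model_cols = ["lifeexpectancy_bins", "internetuserate", "employrate"]
complete = toy.dropna(subset=model_cols)

# Add the potential confounder as a second explanatory variable
log_reg_2 = smf.logit(
    formula="lifeexpectancy_bins ~ internetuserate + employrate",
    data=complete,
).fit(disp=0)

print(log_reg_2.summary())
print("Odds ratio:")
print(np.exp(log_reg_2.params))
```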

                           Logit Regression Results                           
===============================================================================
Dep. Variable:     lifeexpectancy_bins   No. Observations:                  167
Model:                           Logit   Df Residuals:                      164
Method:                            MLE   Df Model:                            2
Date:                 Mon, 25 May 2020   Pseudo R-squ.:                  0.3124
Time:                         01:40:11   Log-Likelihood:                -22.081
converged:                        True   LL-Null:                       -32.114
Covariance Type:             nonrobust   LLR p-value:                 4.393e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -1.3109      2.402     -0.546      0.585      -6.018       3.396
internetuserate     0.2311      0.105      2.209      0.027       0.026       0.436
employrate          0.0320      0.035      0.919      0.358      -0.036       0.100
===================================================================================

Possibly complete quasi-separation: A fraction 0.41 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Odds ratio:
Intercept          0.269569
internetuserate    1.259940
employrate         1.032473
dtype: float64
Confidence intervals:
                 Lower CI   Upper CI        OR
Intercept        0.002434  29.851462  0.269569
internetuserate  1.026374   1.546656  1.259940
employrate       0.964467   1.105275  1.032473

In the second model, there does seem to be a significant relationship between life expectancy and internet use rate: the P-value for internet use rate is 0.027. The P-value for employment rate is approximately 0.36, indicating a non-significant relationship. The odds ratio for internet use rate is about 1.26 (95% confidence interval 1.03 to 1.55), meaning that each one-unit increase in internet use rate multiplies the odds of a country being in the longer-lived group by roughly 1.26. For employment rate, the odds ratio is very near 1 (1.03) and its confidence interval (0.96 to 1.11) includes 1, which is consistent with a non-significant relationship between employment rate and life expectancy. Because the internet use rate coefficient moves from non-significant to significant once employment rate is added, the results suggest that employment rate confounds the relationship between internet use rate and life expectancy. Two caveats apply: the second model was fit on 167 observations rather than 192, so part of the change may come from the different sample, and statsmodels warned of possible quasi-separation, so these estimates should be read with caution.
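One way to separate the effect of adding a covariate from the effect of a changing sample is to fit the model with and without the covariate on exactly the same rows and compare the coefficient of interest. The sketch below does this on synthetic data; the helper name `compare_on_same_rows`, the generated values, and the percent-change summary are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def compare_on_same_rows(data, outcome, exposure, covariate):
    """Fit crude and adjusted logit models on identical complete-case rows."""
    rows = data.dropna(subset=[outcome, exposure, covariate])
    crude = smf.logit(f"{outcome} ~ {exposure}", data=rows).fit(disp=0)
    adjusted = smf.logit(f"{outcome} ~ {exposure} + {covariate}", data=rows).fit(disp=0)
    b_crude = crude.params[exposure]
    b_adjusted = adjusted.params[exposure]
    return {
        "n": int(crude.nobs),
        "crude": b_crude,
        "adjusted": b_adjusted,
        "pct_change": 100 * (b_adjusted - b_crude) / b_crude,
    }

# Synthetic data in which employrate is related to both exposure and outcome
rng = np.random.default_rng(1)
n = 200
internet = rng.uniform(0, 100, n)
employ = 70 - 0.2 * internet + rng.normal(0, 5, n)
logit = -1.0 + 0.03 * internet + 0.05 * (employ - 60)
toy = pd.DataFrame({
    "internetuserate": internet,
    "employrate": employ,
    "lifeexpectancy_bins": rng.binomial(1, 1 / (1 + np.exp(-logit))),
})
toy.loc[:4, "employrate"] = np.nan   # five rows with missing employment data

summary = compare_on_same_rows(
    toy, "lifeexpectancy_bins", "internetuserate", "employrate"
)
print(summary)
```

A large shift in the internet use rate coefficient between the two fits, on the same rows, would point toward confounding rather than toward a change in which countries were included.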
