Running A Basic Linear Regression: Gapminder Dataset

We’ll be running a basic linear regression on the Gapminder dataset. To begin, we’ll import the necessary libraries and transform the data.

from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt

# check for missing data
def check_missing(dataframe, cols):

    for col in cols:
        # missing values in this dataset appear as blank (space) strings
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()

# convert to numeric
def to_numeric(dataframe, cols):

    for col in cols:
        # convert_objects was removed from pandas; pd.to_numeric coerces
        # unparseable values (e.g. blank strings) to NaN
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')

# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)

    for col in norm_cols:
        print("Fred dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)


df = pd.read_csv("gapminder.csv")

#print(df.head())
#print(df.isnull().values.any())

cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']

df2 = df.copy()

#check_missing(df, cols)
#check_missing(df, norm_cols)

to_numeric(df2, cols)
to_numeric(df2, norm_cols)

#freq_dist(df2, cols, norm_cols)

# drop any rows with missing values (including those coerced to NaN above)
df_clean = df2.dropna()

def plot_regression(x, y, data, label_1, label_2):

    regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()
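# plot_regression is never called in the original post; illustrative calls
# (with hypothetical axis labels) might look like:
#plot_regression('lifeexpectancy', 'internetuserate', df_clean, 'Life expectancy', 'Internet use rate')
#plot_regression('employrate', 'internetuserate', df_clean, 'Employment rate', 'Internet use rate')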

print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))

print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))
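# each pearsonr call above returns the correlation coefficient and the
# two-tailed p-value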

# bucket countries into three income groups using fixed cutoffs
def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3

df_clean['income_group'] = df_clean.apply(group_incomes, axis=1)

#print(df_clean.head())

Now let’s create an instance of MinMaxScaler from scikit-learn.

scaler = MinMaxScaler()

X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X = X.astype(float)
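As an aside, the MinMaxScaler instance created above never actually gets used in this post; the scaling step below calls the scale function instead. If we did want min-max normalization (rescaling each column to the [0, 1] range), a minimal sketch would look like this (X_minmax is a hypothetical name):

X_minmax = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
#print(X_minmax.head())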

We’ll now check the mean of the internet use rate before and after centering/scaling the data.

print(X["internetuserate"].mean(axis=0))

X['internetuserate_scaled'] = scale(X['internetuserate'])
#print(X_scaled.head())
print(X['internetuserate'].mean(axis=0))
print(X['internetuserate_scaled'].mean(axis=0))

Here are our results:

34.192929910121784
9.496247254655427e-17

Notice that the top number is the mean before scaling, while the bottom number is the mean after scaling. The scale function standardizes each column to zero mean and unit variance, so the post-scaling mean is effectively zero (within floating-point error).

Finally, we’ll run the regressions between the scaled internet user rate and our two variables of interest: life expectancy and employment rate.
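The model-fitting code itself isn’t shown in the post, but given the statsmodels.formula.api import (as smf) and the summary tables below, the models were presumably fit with the formula API. A minimal sketch under that assumption:

model1 = smf.ols('internetuserate_scaled ~ lifeexpectancy', data=X).fit()
print(model1.summary())

model2 = smf.ols('internetuserate_scaled ~ employrate', data=X).fit()
print(model2.summary())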

Here are the results for the first regression:

                           OLS Regression Results                              
==================================================================================
Dep. Variable:     internetuserate_scaled   R-squared:                       0.593
Model:                                OLS   Adj. R-squared:                  0.591
Method:                     Least Squares   F-statistic:                     229.0
Date:                    Tue, 21 Apr 2020   Prob (F-statistic):           1.77e-32
Time:                            21:56:40   Log-Likelihood:                -154.10
No. Observations:                     159   AIC:                             312.2
Df Residuals:                         157   BIC:                             318.3
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -5.4670      0.365    -14.984      0.000      -6.188      -4.746
lifeexpectancy     0.0786      0.005     15.132      0.000       0.068       0.089
==============================================================================
Omnibus:                       10.462   Durbin-Watson:                   1.879
Prob(Omnibus):                  0.005   Jarque-Bera (JB):                4.239
Skew:                          -0.001   Prob(JB):                        0.120
Kurtosis:                       2.200   Cond. No.                         503.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Here are the results for the second regression:

                              OLS Regression Results                              
==================================================================================
Dep. Variable:     internetuserate_scaled   R-squared:                       0.040
Model:                                OLS   Adj. R-squared:                  0.033
Method:                     Least Squares   F-statistic:                     6.468
Date:                    Tue, 21 Apr 2020   Prob (F-statistic):             0.0120
Time:                            21:56:40   Log-Likelihood:                -222.40
No. Observations:                     159   AIC:                             448.8
Df Residuals:                         157   BIC:                             454.9
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.1340      0.453      2.505      0.013       0.240       2.028
employrate    -0.0192      0.008     -2.543      0.012      -0.034      -0.004
==============================================================================
Omnibus:                       15.938   Durbin-Watson:                   1.710
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               14.963
Skew:                           0.682   Prob(JB):                     0.000563
Kurtosis:                       2.371   Cond. No.                         347.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The first regression finds a very strong relationship between the two variables: the p-value is an extremely small 1.77e-32, and life expectancy explains roughly 59% of the variance in the scaled internet use rate (R-squared = 0.593). The second regression shows a much weaker, though still statistically significant, relationship, with a p-value of 0.0120 and an R-squared of only 0.040.
