We’ll be running a basic linear regression on the Gapminder dataset. To begin, we’ll import the necessary libraries and transform the data.
from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
# check for missing data
def check_missing(dataframe, cols):
    for col in cols:
        print("Column {} is missing:".format(col))
        print((dataframe[col].values == ' ').sum())
        print()
# convert to numeric
def to_numeric(dataframe, cols):
    for col in cols:
        # invalid entries (e.g. blank strings) become NaN and are dropped later
        dataframe[col] = pd.to_numeric(dataframe[col], errors='coerce')
# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):
    for col in cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False)
        print(count)
    for col in norm_cols:
        print("Freq dist for: {}".format(col))
        count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
        print(count)
df = pd.read_csv("gapminder.csv")
#print(df.head())
#print(df.isnull().values.any())
cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']
df2 = df.copy()
#check_missing(df, cols)
#check_missing(df, norm_cols)
to_numeric(df2, cols)
to_numeric(df2, norm_cols)
#freq_dist(df2, cols, norm_cols)
df_clean = df2.dropna()
def plot_regression(x, y, data, label_1, label_2):
    regplot(x=x, y=y, fit_reg=True, data=data)
    plt.xlabel(label_1)
    plt.ylabel(label_2)
    plt.show()
print('Association between life expectancy and internet use rate')
print(pearsonr(df_clean['lifeexpectancy'], df_clean['internetuserate']))
print('Association between employment rate and internet use rate')
print(pearsonr(df_clean['employrate'], df_clean['internetuserate']))
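The plot_regression helper defined above isn’t called in the snippet, but here’s a quick sketch of how it could be used to visualize the first association (the axis labels are just illustrative):
plot_regression('lifeexpectancy', 'internetuserate', df_clean,
                'Life expectancy', 'Internet use rate')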
def group_incomes(row):
    if row['incomeperperson'] <= 744.23:
        return 1
    elif row['incomeperperson'] <= 942.32:
        return 2
    else:
        return 3
df_clean['income_group'] = df_clean.apply(group_incomes, axis=1)
#print(df_clean.head())
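As a quick sanity check (not part of the original script), we could confirm how the countries are distributed across the three income groups:
print(df_clean['income_group'].value_counts().sort_index())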
Now let’s create an instance of the scaler from scikit-learn.
scaler = MinMaxScaler()
X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X = X.astype(float)
We’ll now check the mean of the internet use rate column before and after centering/scaling the data.
print(X["internetuserate"].mean(axis=0)) X['internetuserate_scaled'] = scale(X['internetuserate']) #print(X_scaled.head()) print(X['internetuserate'].mean(axis=0)) print(X['internetuserate_scaled'].mean(axis=0))
Here are the results:
34.192929910121784
9.496247254655427e-17
Notice that the top number is the mean before scaling, while the bottom number is the mean after scaling. Because scale() standardizes the column (centering it at zero mean), the bottom number is effectively zero.
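Note that the MinMaxScaler instantiated earlier isn’t actually used in this step; scale() standardizes the column instead. A minimal sketch of applying that scaler to the same column, which rescales values into the [0, 1] range rather than centering them at zero (the internetuserate_minmax column name is just illustrative):
# MinMaxScaler expects 2D input, so pass a one-column DataFrame and flatten the result
X['internetuserate_minmax'] = scaler.fit_transform(X[['internetuserate']]).ravel()
print(X['internetuserate_minmax'].min(), X['internetuserate_minmax'].max())  # roughly 0.0 and 1.0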
Finally, we’ll run the regressions between the scaled internet user rate and our two variables of interest: life expectancy and employment rate.
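The regression calls themselves aren’t shown in the snippet above; here’s a minimal sketch of how they might look using the statsmodels formula API imported earlier (the model1/model2 names are just illustrative):
# regress the scaled internet use rate on each predictor in turn
model1 = smf.ols('internetuserate_scaled ~ lifeexpectancy', data=X).fit()
print(model1.summary())
model2 = smf.ols('internetuserate_scaled ~ employrate', data=X).fit()
print(model2.summary())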
Here are the results for the first regression:
OLS Regression Results
==================================================================================
Dep. Variable: internetuserate_scaled R-squared: 0.593
Model: OLS Adj. R-squared: 0.591
Method: Least Squares F-statistic: 229.0
Date: Tue, 21 Apr 2020 Prob (F-statistic): 1.77e-32
Time: 21:56:40 Log-Likelihood: -154.10
No. Observations: 159 AIC: 312.2
Df Residuals: 157 BIC: 318.3
Df Model: 1
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept -5.4670 0.365 -14.984 0.000 -6.188 -4.746
lifeexpectancy 0.0786 0.005 15.132 0.000 0.068 0.089
==============================================================================
Omnibus: 10.462 Durbin-Watson: 1.879
Prob(Omnibus): 0.005 Jarque-Bera (JB): 4.239
Skew: -0.001 Prob(JB): 0.120
Kurtosis: 2.200 Cond. No. 503.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here are the results for the second regression:
OLS Regression Results
==================================================================================
Dep. Variable: internetuserate_scaled R-squared: 0.040
Model: OLS Adj. R-squared: 0.033
Method: Least Squares F-statistic: 6.468
Date: Tue, 21 Apr 2020 Prob (F-statistic): 0.0120
Time: 21:56:40 Log-Likelihood: -222.40
No. Observations: 159 AIC: 448.8
Df Residuals: 157 BIC: 454.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.1340 0.453 2.505 0.013 0.240 2.028
employrate -0.0192 0.008 -2.543 0.012 -0.034 -0.004
==============================================================================
Omnibus: 15.938 Durbin-Watson: 1.710
Prob(Omnibus): 0.000 Jarque-Bera (JB): 14.963
Skew: 0.682 Prob(JB): 0.000563
Kurtosis: 2.371 Cond. No. 347.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Regression model findings:
The first regression finds a very strong relationship between the scaled internet use rate and life expectancy, with an R-squared of 0.593 and an extremely small p-value of 1.77e-32. The second regression, against employment rate, is also statistically significant (p = 0.0120), but the relationship is considerably weaker, as reflected in its much lower R-squared of 0.040.