I’ll be carrying out lasso regression on the Gapminder dataset. The features we are investigating are the following: “alcohol consumption”, “breast cancer per 100k”, “employment rate”, “internet use rate”, “life expectancy”, and “urbanization rate”. We are hoping to predict which income group the data points (countries) fall into based on the provided features.
First, we’ll need to load in the data and preprocesss the data, scaling the features of interest.
from sklearn.preprocessing import scale, MinMaxScaler
import statsmodels.formula.api as smf
from scipy.stats import pearsonr
import pandas as pd
from seaborn import regplot
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# check for missing data
def check_missing(dataframe, cols):
for col in cols:
print("Column {} is missing:".format(col))
print((dataframe[col].values == ' ').sum())
print()
# convert to numeric
def to_numeric(dataframe, cols):
for col in cols:
dataframe[col] = dataframe[col].convert_objects(convert_numeric=True)
# check frequency distribution
def freq_dist(dataframe, cols, norm_cols):
for col in cols:
print("Fred dist for: {}".format(col))
count = dataframe[col].value_counts(sort=False, dropna=False)
print(count)
for col in norm_cols:
print("Fred dist for: {}".format(col))
count = dataframe[col].value_counts(sort=False, dropna=False, normalize=True)
print(count)
df = pd.read_csv("gapminder.csv")
cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']
norm_cols = ['internetuserate', 'employrate', 'incomeperperson']
df2 = df.copy()
to_numeric(df2, cols)
to_numeric(df2, norm_cols)
df_clean = df2.dropna()
def plot_regression(x, y, data, label_1, label_2):
reg_plot = regplot(x=x, y=y, fit_reg=True, data=data)
plt.xlabel(label_1)
plt.ylabel(label_2)
plt.show()
def group_incomes(row):
if row['incomeperperson'] <= 744.23:
return 1
elif row['incomeperperson'] <= 942.32:
return 2
else:
return 3
df_clean['income_group'] = df_clean.apply(lambda row: group_incomes(row), axis=1)
scaler = MinMaxScaler()
X = df_clean[['alcconsumption','breastcancerper100th','employrate', 'internetuserate','lifeexpectancy','urbanrate']]
X.astype(float)
print(X["internetuserate"].mean(axis=0))
X['alcconsumption'] = scale(X['alcconsumption'])
X['breastcancerper100th'] = scale(X['breastcancerper100th'])
X['internetuserate'] = scale(X['internetuserate'])
X['urbanrate'] = scale(X['urbanrate'])
X['lifeexpectancy'] = scale(X['lifeexpectancy'])
X['employrate'] = scale(X['employrate'])
print(X['internetuserate'].mean(axis=0))
X.astype(float)
Y = df_clean['income_group']
Y.astype(float)
Now that the data has been preprocessed,
Influence:
{‘alcconsumption’: 0.06989539588334298,
‘breastcancerper100th’: -0.14994722784801043,
’employrate’: -0.1439463951270292,
‘internetuserate’: 0.05671759315545794,
‘lifeexpectancy’: 0.43140559604575485,
‘urbanrate’: 0.2946352802900217}
Train error:
0.31411149133934124
Test error:
0.3410401541834608
R-squared train:
0.6057350783025499
R-squared test:
0.575668143518999
No features were dropped by the Lasso regression model. Life Expectancy and Urbanization rate appeared to have the strongest associations with income level.
Accuracy on the test dataset was around 66%, which is better than chance guessing but still leaves much room for improvement. Adding more features to the model would probably improve the model’s estimation accuracy. In terms of the R-squared, the model performed just slightly better on the training dataset, explaining approximately 60% of the variance in the training dataset, contrasted with the approx. 57% of the variance explained for the testing set.