python – CriticallyCoding

Classifying Mushrooms With Python and Scikit-Learn

Mushrooms can be either toxic or edible. It takes trained mushroom hunters and mycologists to discern the toxic mushrooms from the edible mushrooms. Can machine learning algorithms also classify the edibility of mushrooms? We’ll visualize some of the data here, to get an idea of how the data is related, and then implement several different classifiers on the dataset.

This might go without saying, but don’t take advice about which mushrooms to eat from a random blog post. I do not condone eating any mushrooms based on the patterns revealed in this notebook or in the accompanying Python script or documentation.

To start with, we’ll import all the necessry modules that we need.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report, log_loss
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

We’ll load in the data for the mushrooms. Then we need to check if there is any null data in the dataset. There are only two classes: edible and poisonous. We can confirm that this is the case by printing the unique values of the class feature. We’re going to need to encode non-numerical data, that is, transform it into a form that our classifiers can use.

Scikit-Learn has a Label Encoder that will represent our non-numerical values as numerical ones. So we’ll create an instance of the encoder and then apply the transformation to every column in the dataset. We can print out the dataframe to be sure that there the transformations have been correct.

m_data = pd.read_csv('mushrooms.csv')

# check for any null values
print(m_data.isnull().sum())

# see all unique values in a category
print(m_data['class'].unique())

# machine learning systems work with integers, we need to encode these
# string characters into ints

encoder = LabelEncoder()
# now apply the transformation to all the columns:
for col in m_data.columns:
    m_data[col] = encoder.fit_transform(m_data[col])

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
['p' 'e']
   class  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment  \
0      1          5            2          4        1     6                1   
1      0          5            2          9        1     0                1   
2      0          0            2          8        1     3                1   
3      1          5            3          8        1     6                1   
4      0          5            2          3        0     5                1   

   gill-spacing  gill-size  gill-color  ...  stalk-surface-below-ring  \
0             0          1           4  ...                         2   
1             0          0           4  ...                         2   
2             0          0           5  ...                         2   
3             0          1           5  ...                         2   
4             1          0           4  ...                         2   

   stalk-color-above-ring  stalk-color-below-ring  veil-type  veil-color  \
0                       7                       7          0           2   
1                       7                       7          0           2   
2                       7                       7          0           2   
3                       7                       7          0           2   
4                       7                       7          0           2   

   ring-number  ring-type  spore-print-color  population  habitat  
0            1          4                  2           3        5  
1            1          4                  3           2        1  
2            1          4                  3           2        3  
3            1          4                  2           3        5  
4            1          0                  3           0        1  

[5 rows x 23 columns]

Now we can do some plotting of the data. Using Matplotlib, let’s do a heatmap plot, which compares the correlation of every feature with every other feature.

# let's see how many poisonous and edible there are of each, 1 is poisonous, 0 is edible
# check to get a rough idea of correlations
correlations = m_data.corr()
plt.subplots(figsize=(20, 15))
#plt.figure(figsize=(16, 14))
data_corr = sns.heatmap(correlations, annot=True, linewidths=0.5, cmap="RdBu_r")
plt.show(data_corr)

It looks like the features most closely associated with class are: cap-surface, gill-attachment, gill-size, veil-color, spore-print-color, population, and habitat. Why don’t we see if we can plot the correlations of those variables individually.

features = ["cap-surface", "gill-attachment", "gill-size", "veil-color", "spore-print-color", "population", "habitat"]

for feature in features:
    line_plot = sns.lineplot(x="class", y=feature, label=feature, data=m_data)
    
plt.legend(loc=1)    
plt.show()

If we are so inclined, we can chart how individual attributes are likely to correlate with a mushroom’s likelihood of beind poisonous. We just get the individual colors stored in the “gill-color” feature and we plot them against class. We need to specify the names of the gill-colors and choose colors to represent them in our plot, however.

gill_names =["buff", "red", "gray", "chocolate", "black", "brown", "orange", "pink", "green", "purple", "white", "yellow"]
gill_colors = ["khaki", "Red", "darkGrey", "chocolate", "Black", "saddleBrown", "orange", "lightpink", "darkGreen", "purple", "lightGrey", "Yellow"]

factor = sns.factorplot(x="gill-color",y="class",data=m_data, kind="bar", size = 8,
palette = gill_colors)
factor.set_xticklabels(rotation=45)
factor.set(xticks=range(0,14), xticklabels=gill_names)
factor = factor.set_ylabels("Prob. Poison")
plt.grid(axis='y')
plt.show()

Now we need to split our data into features and labels. This is easy because the “class” feature is the first in the dataframe, so all we need to do is cast the features as everything but the first column and do the opposite for the labels. We’ll also want to scale the data. Scaling our data is important as the data covers a wide range.

The sheer range of the data can throw off the accuracy of our classifier, so by standardizing the data we make sure our classifier performs optimally. We’ll create an instance of the classifier and then transform the data with it. We don’t need to scale the labels/targets as they are only 0 or 1.

X_features = m_data.iloc[:,1:23]  
y_label = m_data.iloc[:, 0]  

# we should probably scale the features, so that SVM or gaussian NB can deliver better predictions
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)

Principal Component Analysis is a method that can simplify the representation of features, distilling the most important features down into a combination of just a few features. PCA can help improve the outcome of a classifier. However, PCA works best when datasets are large, and since this dataset is relatively small, we may not want to do it. We can plot the conclusions of the PCA to see what kind of dimensions the data would be compressed to.

We fit the PCA on our features and plot the explained variance ratio. The explained variance ratio is a way of measuring how many features describe a given portion of the data. We can plot our this function to get an idea of how many features are needed to describe 90% or 100% of the data. Let’s plot the ratio.

# principal component analysis, may or may not want to do, small dataset
pca = PCA()
pca.fit_transform(X_features)
plt.figure(figsize=(10,10))
# plot the explained variance ratio
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()
plt.show()
# it looks like about 17 of the features account for about 95% of the variance in the dataset

Here’s another way that the explained variance can be plotted. Plotting it like this leads to the same conclusion: around 18 features describe 99% of the dataset.

# here's another way to visualize this
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()

Let’s go ahead and create a new PCA object and use 17 or 18 features to comprise our new featureset.

pca2 = PCA(n_components=17)
x_new = pca2.fit_transform(X_features)
X_train, X_test, y_train, y_test = train_test_split(x_new, y_label, test_size=0.20, random_state=2)

Now we can select our chosen classifiers and run GridSearchCV to find their best possible parameters. GridSearchCV takes a specified list of parameters and tests the classifier with the potential combinatiosn of your schosen parameters to see which combination is best. Doing this should dramatically improve the performance of our classifiers compared to just implementing them vanilla. We have to fit GridSearchCV on the classifiers and our chosen parameters, and then save the optimized settings as a new instance of the classifier.

logreg_clf = LogisticRegression()
parameters_logreg = {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
                     "max_iter": [25, 50, 100, 200, 400]}
grid_logreg = GridSearchCV(logreg_clf, parameters_logreg)
grid_logreg.fit(X_train, y_train)
logreg_opt = grid_logreg.best_estimator_

GNB_clf = GaussianNB()
GNB_clf.fit(X_train, y_train)

svc_clf = SVC()
svc_param = {"kernel": ["rbf", "linear"]}
grid_svc = GridSearchCV(svc_clf, svc_param)
grid_svc.fit(X_train, y_train)
svc_opt = grid_svc.best_estimator_

rf_clf = RandomForestClassifier()
parameters_rf = {"n_estimators": [4, 6, 8, 10, 12, 14, 16], "criterion": ["gini", "entropy"], "max_features": ["auto", "sqrt", "log2"],
                 "max_depth": [2, 3, 5, 10], "min_samples_split": [2, 3, 5, 10]}
grid_rf = GridSearchCV(rf_clf, parameters_rf)
grid_rf.fit(X_train, y_train)
rf_opt = grid_rf.best_estimator_

knn_clf = KNeighborsClassifier()
parameters_knn = {"n_neighbors": [3, 5, 10, 15, 20], "weights": ["uniform", "distance"],
                  "leaf_size": [10, 20, 30, 45, 60]}
grid_knn = GridSearchCV(knn_clf, parameters_knn)
grid_knn.fit(X_train, y_train)
knn_opt = grid_knn.best_estimator_

dt_clf = DecisionTreeClassifier()
parameters_dt = {"criterion": ["gini", "entropy"], "splitter": ["best", "random"], "max_features": ["auto", "log2", "sqrt"]}
grid_dt = GridSearchCV(dt_clf, parameters_dt)
grid_dt.fit(X_train, y_train)
dt_opt = grid_dt.best_estimator_

xgb_clf = XGBClassifier()
parameters_xg = {"objective" : ["reg:linear"], "n_estimators" : [5, 10, 15, 20]}
grid_xg = GridSearchCV(xgb_clf, parameters_xg)
grid_xg.fit(X_train, y_train)
xgb_opt = grid_xg.best_estimator_

chosen_metrics = [accuracy_score, log_loss, classification_report]

Finally, we can create a function to do the classification and return our chosen metrics.

def get_reports(metrics, classifier):

    preds = classifier.predict(X_test)
    print("'{}' Performance: ".format(classifier.__class__.__name__))
    acc = accuracy_score(y_test, preds)
    l_loss = log_loss(y_test, preds)
    c_report = classification_report(y_test, preds)
    print("Accuracy: " + str(acc))
    print("Log Loss: " + str(l_loss))
    print("Classificaiton Report: ")
    print(c_report)
    print("----------")

get_reports(chosen_metrics, logreg_opt)
get_reports(chosen_metrics, GNB_clf)
get_reports(chosen_metrics, svc_opt)
get_reports(chosen_metrics, rf_opt)
get_reports(chosen_metrics, knn_opt)
get_reports(chosen_metrics, dt_opt)
get_reports(chosen_metrics, xgb_opt)

'LogisticRegression' Performance: 
Accuracy: 0.9427692307692308
Log Loss: 1.9766989475886843
Classificaiton Report: 
              precision    recall  f1-score   support

           0       0.94      0.96      0.95       860
           1       0.95      0.93      0.94       765

    accuracy                           0.94      1625
   macro avg       0.94      0.94      0.94      1625
weighted avg       0.94      0.94      0.94      1625

----------
'GaussianNB' Performance: 
Accuracy: 0.9298461538461539
Log Loss: 2.423050640308682
Classificaiton Report: 
              precision    recall  f1-score   support

           0       0.92      0.95      0.93       860
           1       0.94      0.91      0.92       765

    accuracy                           0.93      1625
   macro avg       0.93      0.93      0.93      1625
weighted avg       0.93      0.93      0.93      1625

----------
'SVC' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classificaiton Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'RandomForestClassifier' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classificaiton Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'KNeighborsClassifier' Performance: 
Accuracy: 1.0
Log Loss: 9.992007221626413e-16
Classificaiton Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       860
           1       1.00      1.00      1.00       765

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

----------
'DecisionTreeClassifier' Performance: 
Accuracy: 0.992
Log Loss: 0.2763131635190287
Classificaiton Report: 
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       860
           1       0.99      0.99      0.99       765

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

----------
'XGBClassifier' Performance: 
Accuracy: 0.968
Log Loss: 1.1052531461360688
Classificaiton Report: 
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       860
           1       0.97      0.96      0.97       765

    accuracy                           0.97      1625
   macro avg       0.97      0.97      0.97      1625
weighted avg       0.97      0.97      0.97      1625

----------

I highly recommend experimenting with the different classifiers to get a better idea of which classifiers work better under different circumstances. You can check out a list of classifiers supported by Scikit-learn here or here.

Understanding and Implementing Merge Sort In Python

Previously, we’ve covered binary search and selection sort. This time, we’ll be covering another sorting algorithm – merge sort. Merge sort is a little different from selection sort. While selection sort can’ only be used to sort an unsorted list, merge sort can be utilized to merge two different sorted lists, in addition to sorting an unsorted list.

The core idea of merge sort is an instance of a divide and conquer approach. First, the entire list/array is split apart into left and right halves. Afterwards, the left half of the array is also divided in half. This recurs until there is only one element in the left and right halves, until the array halves cannot be split further. Afterwards, the values are merged together, being compared and place in sorted order.

This process recurs, gradually sorting and merging groups, until there is only one group with all the elements in their proper order. It is critical to remember that every-time a group is created the sorted order must be maintained.

Here’s an example of the technique, broken into dividing and merging phases.

Dividing Phase:

Assuming we have an array with the following values: [3, 8, 9, 2, 6, 7], the process will look something like this.

Division 1: [3, 8, 9][2, 6, 7]

Division 2: [3] [8, 9] [2] [6,7]

Division 3: [3] [8] [9] [2] [6] [7]

Now comes the merging phase.

Merging Phase:

Merge 1: [3, 8] [2,9] [6,7]

Merge 2: [2, 3, 8, 9][6, 7]

Merge 3: [2, 3, 6, 7, 8, 9]

If two lists are being merged, it is important to compare the elements before they are merged into a single list. Otherwise, the merged list will contain all the sorted elements of one list and then all the sorted elements of the other list.

So let’s make sure we understand the steps to implementing merge sort:

We start with an unsorted list and split it in half.

2. We split the left half of the array into halves.

3. We keep splitting until there is just one element per half.

4. We compare each of the elements

5. The elements are merged into one group, in order.

6. The process continues until the entire array has been sorted and merged.

Now we can try writing merge sort in Python.

This implementation will be a recursive implementation. When writing a recursive algorithm, remember that the algorithm must have a base-case, or a case where the recursion ends.

In order to implement merge sort in Python, the algorithm can be structured as follows:

def merge_sort(array):
    # Only if there are two or more elements
    if len(array) > 1:
        # get the midpoint in the array
        mid = len(array) // 2
        # left half is everything before the mid
        left = array[:mid]
        # right half is everything after the midpoint
        right = array[mid:]

        # recur the splits on the left and right halves
        merge_sort(left)
        merge_sort(right)

        # we need variables to track the iteration through the arrays
        x = 0
        y = 0

        # Iterator for the main list
        z = 0

        # here we copy the data into temporary arrays that store the left and right halves
        # as long as there are still values within the left and right half of the arrays
        # compare the values and swap positions

        while x < len(left) and y < len(right):
            # if left value is smaller than the right value
            if left[x] <= right[y]:
                # place the left position into the sorted array
                array[z] = left[x]
                # update left position
                x = x + 1
            else:
                array[z] = right[y]
                y = y + 1
            # move to the next portion of the array
            z = z + 1

        # If there are any elements left over in the left half of the array
        # We fill in the primary array with the remaining left values and then move the iterators up
        while x < len(left):
            array[z] = left[x]
            x += 1
            z += 1

        # Do the same thing for the right half of the array
        while y < len(right):
            array[z] = right[y]
            y += 1
            z += 1

listy = [35, 72, 29, 55, 41, 27, 79, 11, 23, 56, 50]
merge_sort(listy)
print(listy)

Here’s the sorted list:

[11, 23, 27, 29, 35, 41, 50, 55, 56, 72, 79]

GameInformer Review Analysis

Analyzing reviews of media is a good way to get acquainted with data analysis and data science ideas/techniques. Because of the tendency for reviewers to give numerical scores to produces they review, it can be fairly easy to visualize how certain other features may correlate with these scores. This blog post will serve as an demonstration of how to analyze and explore data, using review data gathered from GameInformer.

I scraped the review data from GameInformer myself, and what follows is the results of the visualization/analysis. If you are curious about how the scraping was done:

First, the review URLS were collected from the main site.

Next, the review data was collected. Following this, some preprocessing was done to remove duplicate entries and non-standard characters from the review data.

This blog will also act as a quick primer on how to visualize data with Pandas and Seaborn.

Let’s start off by making all the necessary imports.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from collections import Counter
import csv

Now we need to load in the data. We also need to get dummy representations of any non-numeric values.

GI_data = pd.read_csv('GIreviewcomplete.csv', encoding = "ISO-8859-1")
GI_data2 = pd.get_dummies(GI_data)

We can get a violin plot for every reviewer. The width of the violins shows how often a particular score was assigned, while the height and length of a violin show the possible range of scores.

#---display violin plot for author and avg score

sns.lmplot(x='author', y='score', data=GI_data, hue='author', fit_reg=False)
sns.violinplot(x='author',y='score', data=GI_data)
plt.xticks(rotation=-90)
plt.show()

It’s possible to filter the database by multiple criteria. For instance, we could return every review over a certain score by a certain reviewer.

EF_reviews = GI_data[(GI_data.author == 'Elise Favis') & (GI_data.score >= 8.0)]
print(EF_reviews)


                                                title  score       author  \
48                                     Heaven's Vault      8  Elise Favis   
58                                        Photographs      8  Elise Favis   
102                                  Bury Me, My Love      8  Elise Favis   
158              Life Is Strange 2: Episode 1 ? Roads      8  Elise Favis   
243                                             Minit      8  Elise Favis   
262                  Where The Water Tastes Like Wine      9  Elise Favis   
266                                          Florence      8  Elise Favis   
276                                        Subnautica      8  Elise Favis   
286                              The Red Strings Club      8  Elise Favis   
405                                            Tacoma      8  Elise Favis   
442                                        Perception      8  Elise Favis   
485                                  Thimbleweed Park      8  Elise Favis   
506                                Night In The Woods      8  Elise Favis   
586  Phoenix Wright: Ace Attorney - Spirit of Justice      8  Elise Favis   
678                    Day of the Tentacle Remastered      8  Elise Favis   

        r_platform                       o_platform  \
48              PC                    PlayStation 4   
58              PC                     iOS, Android   
102         Switch                 PC, iOS, Android   
158             PC          PlayStation 4, Xbox One   
243             PC  PlayStation 4, Xbox One, Switch   
262             PC                PlayStation 4, PC   
266            iOS                     Xbox One, PC   
276             PC        PlayStation 4, Switch, PC   
286             PC             Xbox One, Switch, PC   
405             PC                         Xbox One   
442             PC          PlayStation 4, Xbox One   
485             PC                         Xbox One   
506  PlayStation 4                               PC   
586            3DS                               PC   
678  PlayStation 4             PlayStation Vita, PC   

                         publisher                           developer  \
48                           Inkle                               Inkle   
58               EightyEight Games                   EightyEight Games   
102                       Playdius   The Pixel Hunt, Arte France, FIGS   
158                    Square Enix               Dontnod Entertainment   
243               Devolver Digital           JW, Kitty, Jukio, and Dom   
262     Good Shepard Entertainment      Dim Bulb Games, Serenity Forge   
266          Annapurna Interactive               Annapurna Interactive   
276   Unknown Worlds Entertainment        Unknown Worlds Entertainment   
286               Devolver Digital                      Deconstructeam   
405                     Fullbright                          Fullbright   
442             The Deep End Games                  The Deep End Games   
485                Terrible Toybox                     Terrible Toybox   
506                          Finji                       Infinite Fall   
586                         Capcom                              Capcom   
678        Double Fine Productions             Double Fine Productions   

            release_date           rating  
48        April 16, 2019             Teen  
58         April 3, 2019              NaN  
102     January 10, 2019     Everyone 10+  
158   September 27, 2018           Mature  
243        April 3, 2018         Everyone  
262    February 28, 2018        Not rated  
266    February 14, 2018         Everyone  
276     January 23, 2018     Everyone 10+  
286     January 22, 2018   Rating Pending  
405       August 2, 2017             Teen  
442         May 30, 2017           Mature  
485       March 30, 2017             Teen  
506    February 21, 2017             Teen  
586    September 8, 2016             Teen  
678       March 22, 2016             Teen

Countplots and barplots are useful ways of visualizing data. Countplots just plot the count of your chosen variable, whereas bar plots compare two chosen variables with each other.

plt.figure(figsize=(10, 4))
scores = sns.countplot(x='score', data=GI_data2)
plt.xticks()
plt.show(scores)

plt.figure(figsize=(10, 4))
r_platform = sns.countplot(x="r_platform", data=GI_data)
plt.xticks(rotation=-90)
plt.show(r_platform)

plt.figure(figsize=(10, 4))
rating = sns.countplot(x="rating", data=GI_data)
plt.xticks(rotation=-90)
plt.show(rating)

plt.figure(figsize=(10, 4))
plt.xticks(rotation=-90)
barplot_score = sns.barplot(x="rating", y="score", data=GI_data)
plt.show(barplot_score)

Seaborn also contains a handy distribution function, and in this case we can see the relative distribution of GameInformer review scores. Most of their assigned scores have been an 8.

sns.distplot(GI_data2['score'])
plt.show()

Swarmplots show the individual instances of Y given X, or in this case of a certain score given a certain rating.

plt.figure(figsize=(16, 6))
factorplot_score = sns.swarmplot(x="rating", y="score", data=GI_data, hue='rating')
plt.xticks(rotation= -90)
plt.show(factorplot_score)

Let’s do some analysis of publisher and developer statistics. We’d want to start by getting a list of publishers and developers. Let’s just get the 100 most common.

publishers = Counter(GI_data['publisher'])
developers = Counter(GI_data['developer'])

pub_list = []
dev_list = []

for item, freq in publishers.most_common(100):
    pub_list.append(item)

for item, freq in developers.most_common(100):
    dev_list.append(item)
    
print(pub_list[:10])
print("--------------")
print(dev_list[:10])

[' Nintendo', ' Telltale Games', ' Square Enix', ' Ubisoft', ' Sony Computer Entertainment', ' Devolver Digital', ' Warner Bros. Interactive', ' Microsoft Game Studios', ' Electronic Arts', ' Bandai Namco']
--------------
[' Telltale Games', ' Nintendo', ' Capcom', ' Square Enix', ' EA Tiburon', ' Atlus', ' Visual Concepts', ' EA Canada', ' Dontnod Entertainment', ' Ubisoft Montreal']

If we make a publisher and developer dataframe, we can transform those dataframes by getting the mean and median values of their scores. We can then merge those back into the original dataframe to get a dataframe sorted by publisher which also has the publisher’s mean and median scores. We could do the same thing for developers.

pub_df = pd.DataFrame(index=None)
dev_df = pd.DataFrame(index=None)

# append rows for individual publishers to dataframe

def custom_mean(group):
    group['mean'] = group['score'].mean()
    return group

def custom_median(group):
    group['median'] = group['score'].median()
    return group

for pub in pub_list:
    scores = pd.DataFrame(GI_data[(GI_data.publisher == pub) & (GI_data.score)], index=None)
    pub_df = pub_df.append(scores, ignore_index=True)

pub_mean_df = pub_df.groupby('publisher').apply(custom_mean)
pub_median_df = pub_df.groupby('publisher').apply(custom_median)
pub_median = pub_median_df['median']
pub_merged_df = pub_mean_df.join(pub_median)
print(pub_merged_df.head(3))
pub_merged_df.to_csv('pub_merged.csv')

for dev in dev_list:
    scores = pd.DataFrame(GI_data[(GI_data.developer == dev) & (GI_data.score)], index=None)
    dev_df = dev_df.append(scores, ignore_index=True)

dev_mean_df = dev_df.groupby('developer').apply(custom_mean)
dev_median_df = dev_df.groupby('developer').apply(custom_median)
dev_median = dev_median_df['median']
dev_merged_df = dev_mean_df.join(dev_median)
dev_merged_df.to_csv('dev_merged.csv')
print(dev_merged_df.head(3))

                                               title  score  \
0                          Fire Emblem: Three Houses      9   
1        Marvel Ultimate Alliance 3: The Black Order      7   
2  Cadence of Hyrule ? Crypt of the Necrodancer F...      7   

              author r_platform o_platform  publisher              developer  \
0  Kimberley Wallace     Switch        NaN   Nintendo    Intelligent Systems   
1      Andrew Reiner     Switch        NaN   Nintendo             Team Ninja   
2     Suriel Vazquez     Switch        NaN   Nintendo   Brace Yourself Games   

     release_date           rating     mean  median  
0   July 26, 2019              NaN  7.73913     7.0  
1   July 19, 2019             Teen  7.73913     7.0  
2   June 13, 2019   Rating Pending  7.73913     7.0  
                                              title  score             author  \
0                The Walking Dead: The Final Season      7  Kimberley Wallace   
1    The Walking Dead: The Final Season ? Episode 1      7  Kimberley Wallace   
2  Batman: The Enemy Within Episode 5 ? Same Stitch      9      Javy Gwaltney   

      r_platform                       o_platform        publisher  \
0  PlayStation 4             Xbox One, Switch, PC   Telltale Games   
1             PC          PlayStation 4, Xbox One   Telltale Games   
2  PlayStation 4  PlayStation 4, Xbox One, Switch   Telltale Games   

         developer      release_date           rating  mean  median  
0   Telltale Games    March 27, 2019   Rating Pending  7.16     7.0  
1   Telltale Games   August 14, 2018           Mature  7.16     7.0  
2   Telltale Games    March 27, 2018           Mature  7.16     7.0

Let’s select just the top publishers and see what their mean and median scores are.

dev_data = pd.read_csv('dev_merged.csv')
dev_data = dev_data.drop(dev_data.columns[0], axis=1)
developer_names = list(dev_data['developer'].unique())
#print(developer_names)

dev_examples = pd.DataFrame(index=None)

for dev in developer_names:
   dev_examples = dev_examples.append(dev_data[dev_data.developer == dev].iloc[0])

#print(dev_examples.to_string())

dev_stats = dev_examples[['developer', 'mean', 'median']]
print(dev_stats.head(20))

                    developer      mean  median
0              Telltale Games  7.160000     7.0
25                   Nintendo  7.714286     7.0
39                     Capcom  7.857143     9.0
46                Square Enix  8.111111     9.0
55                 EA Tiburon  6.666667     7.0
61                      Atlus  7.666667     7.0
67            Visual Concepts  7.666667     7.0
76                  EA Canada  7.000000     7.0
80      Dontnod Entertainment  7.000000     7.0
83           Ubisoft Montreal  7.000000     7.0
87              Firaxis Games  8.428571     9.0
94                   TT Games  7.666667     7.0
97     Blizzard Entertainment  9.000000     9.0
104             From Software  8.714286     9.0
111                Game Freak  7.000000     7.0
113   Double Fine Productions  7.666667     7.0
116       Intelligent Systems  8.200000     9.0
121                      SEGA  8.600000     9.0
126            HAL Laboratory  7.000000     7.0
127            Spike Chunsoft  6.000000     6.0

Another interesting thing we could to is get the worst reviewed games, games that have received less than a 4.

# how you can filter for just one criteria and then pull out only the columns you care about
# print(GI_data[GI_data.score <= 4].title)
# this is actually the preferred method...
print(GI_data.loc[GI_data.score <= 4, 'title'])

69               R.B.I. Baseball 19
124     Overkill's The Walking Dead
134                   The Quiet Man
217               Tennis World Tour
220                           Agony
259               Fear Effect Sedna
301                  Hello Neighbor
343              Raid: World War II
483              R.B.I. Baseball 17
488                 Old Time Hockey
489                 Old Time Hockey
503                      1-2-Switch
589                    One Way Trip
612             Ghostbusters (2016)
637       Homefront: The Revolution
722                   Devil's Third
767                        Armikrog
815                        Godzilla
820     Payday 2: Crimewave Edition
933              Escape Dead Island
935       Sonic Boom: Rise of Lyric
1045          Rambo: The Video Game
1063                         Rekoil
Name: title, dtype: object

We could also get only published by Nintendo by filtering out results where ‘Nintendo’ doesn’t appear in ‘publisher’ with isin.

# can use "isin" in a series...
# spaces are in this, be sure to include them

r = GI_data[GI_data['publisher'].isin([' Nintendo'])][:40]
print(r)


                                                 title  score  \
0                            Fire Emblem: Three Houses      9   
5          Marvel Ultimate Alliance 3: The Black Order      7   
6                                      Dr. Mario World      8   
19                                 Super Mario Maker 2      8   
23   Cadence of Hyrule ? Crypt of the Necrodancer F...      7   
43                                  BoxBoy! + BoxGirl!      8   
64                               Yoshi's Crafted World      8   
79                                           Tetris 99      8   
101  Mario & Luigi: Bowser's Inside Story + Bowser ...      8   
103                     New Super Mario Bros. U Deluxe      8   
112                         Super Smash Bros. Ultimate      9   
123                         Pokémon: Let's Go, Pikachu      8   
144               The World Ends With You: Final Remix      7   
151                                  Super Mario Party      7   
161  Xenoblade Chronicles 2: Torna - The Golden Cou...      7   
193                                          Tetris 99      8   
196                                     WarioWare Gold      8   
199                                  Octopath Traveler      8   
202                     Captain Toad: Treasure Tracker      8   
203                         Splatoon 2: Octo Expansion      8   
209                                  Mario Tennis Aces      8   
212                                      Pokémon Quest      6   
215                  Sushi Striker: The Way of Sushido      7   
216                        Dillon's Dead-Heat Breakers      7   
236               Donkey Kong Country: Tropical Freeze      9   
246                                  Detective Pikachu      7   
254                                  Kirby Star Allies      6   
300  The Legend Of Zelda: Breath Of The Wild ? The ...      8   
308                             Xenoblade Chronicles 2      7   
335                                Super Mario Odyssey      9   
338                               Fire Emblem Warriors      7   
377                             Metroid: Samus Returns      9   
378                             Monster Hunter Stories      8   
408                                           Miitopia      7   
410                                        Hey! Pikmin      6   
413                                         Splatoon 2      8   
418  The Legend Of Zelda: Breath Of The Wild ? Mast...      7   
426                                         Ever Oasis      8   
431                                               Arms      8   
447            Fire Emblem Echoes: Shadows of Valentia      7   

                author r_platform                                 o_platform  \
0    Kimberley Wallace     Switch                                        NaN   
5        Andrew Reiner     Switch                                        NaN   
6           Ben Reeves        iOS                                    Android   
19       Kyle Hilliard     Switch                                        NaN   
23      Suriel Vazquez     Switch                                        NaN   
43          Ben Reeves     Switch                                        NaN   
64          Brian Shea     Switch                                        NaN   
79       Kyle Hilliard     Switch                               Xbox One, PC   
101      Kyle Hilliard        3DS                                   Xbox One   
103         Brian Shea     Switch                           PC, iOS, Android   
112          Jeff Cork     Switch                          PlayStation 4, PC   
123         Brian Shea     Switch                              PlayStation 4   
144  Kimberley Wallace     Switch                                         PC   
151         Brian Shea     Switch            PlayStation 4, Xbox One, Switch   
161           Joe Juba     Switch             PlayStation 4, Switch, PC, Mac   
193      Kyle Hilliard     Switch            PlayStation 4, Xbox One, Switch   
196      Kyle Hilliard        3DS            PlayStation 4, PlayStation Vita   
199           Joe Juba     Switch  PlayStation 4, Xbox One, Switch, Mac, iOS   
202         Ben Reeves     Switch                                        3DS   
203         Brian Shea     Switch                                        3DS   
209      Kyle Hilliard     Switch                          PlayStation 4, PC   
212         Brian Shea     Switch                               iOS, Android   
215      Kyle Hilliard     Switch                                        3DS   
216      Kyle Hilliard        3DS                                        3DS   
236      Kyle Hilliard     Switch                                      Wii U   
246         Ben Reeves        3DS            PlayStation 4, Xbox One, Switch   
254      Kyle Hilliard     Switch                               Xbox One, PC   
300     Suriel Vazquez     Switch                               Xbox One, PC   
308           Joe Juba     Switch                    PlayStation 4, Xbox One   
335      Andrew Reiner     Switch                       Xbox One, Switch, PC   
338      Javy Gwaltney     Switch                          PlayStation 4, PC   
377         Ben Reeves        3DS                               Xbox One, PC   
378        Daniel Tack        3DS                               Xbox One, PC   
408          Jeff Cork        3DS                                   Xbox One   
410         Ben Reeves        3DS                              PlayStation 4   
413         Brian Shea     Switch                                         PC   
418      Javy Gwaltney     Switch            Xbox One, PC, Mac, iOS, Android   
426      Kyle Hilliard        3DS                 Xbox One, PlayStation Vita   
431         Brian Shea     Switch                              PlayStation 3   
447      Javy Gwaltney        3DS                       Xbox One, Switch, PC   

     publisher                       developer         release_date  \
0     Nintendo             Intelligent Systems        July 26, 2019   
5     Nintendo                      Team Ninja        July 19, 2019   
6     Nintendo                        Nintendo        July 10, 2019   
19    Nintendo                        Nintendo        June 28, 2019   
23    Nintendo            Brace Yourself Games        June 13, 2019   
43    Nintendo                  HAL Laboratory       April 26, 2019   
64    Nintendo                       Good Feel       March 29, 2019   
79    Nintendo                           Arika    February 13, 2019   
101   Nintendo                      AlphaDream     January 11, 2019   
103   Nintendo                        Nintendo     January 11, 2019   
112   Nintendo                       Sora, Ltd     December 7, 2018   
123   Nintendo                      Game Freak    November 16, 2018   
144   Nintendo           Square Enix, h.a.n.d.     October 12, 2018   
151   Nintendo                        Nintendo      October 5, 2018   
161   Nintendo                   Monolith Soft   September 14, 2018   
193   Nintendo                           Arika    February 13, 2019   
196   Nintendo   Nintendo, Intelligent Systems       August 3, 2018   
199   Nintendo            Square Enix, Acquire        July 13, 2018   
202   Nintendo                        Nintendo        July 13, 2018   
203   Nintendo                        Nintendo        June 13, 2018   
209   Nintendo                Camelot Software        June 22, 2018   
212   Nintendo                      Game Freak         May 29, 2018   
215   Nintendo           Nintendo, indies zero         June 8, 2018   
216   Nintendo                         Vanpool         May 24, 2018   
236   Nintendo                   Retro Studios          May 4, 2018   
246   Nintendo                  Creatures Inc.       March 23, 2018   
254   Nintendo                  HAL Laboratory       March 16, 2018   
300   Nintendo                        Nintendo     December 7, 2017   
308   Nintendo                   Monolith Soft     December 1, 2017   
335   Nintendo                        Nintendo     October 27, 2017   
338   Nintendo                      Koei Tecmo                  TBA   
377   Nintendo                    MercurySteam   September 15, 2017   
378   Nintendo                          Capcom    September 8, 2017   
408   Nintendo                        Nintendo        July 27, 2017   
410   Nintendo                        Nintendo        July 28, 2017   
413   Nintendo                        Nintendo        July 21, 2017   
418   Nintendo                        Nintendo        July 30, 2017   
426   Nintendo                          Grezzo        June 23, 2017   
431   Nintendo                        Nintendo        June 16, 2017   
447   Nintendo             Intelligent Systems         May 19, 2017   

              rating  
0                NaN  
5               Teen  
6           Everyone  
19               NaN  
23    Rating Pending  
43          Everyone  
64          Everyone  
79          Everyone  
101         Everyone  
103         Everyone  
112           Mature  
123         Everyone  
144             Teen  
151         Everyone  
161             Teen  
193             Teen  
196     Everyone 10+  
199             Teen  
202         Everyone  
203     Everyone 10+  
209         Everyone  
212         Everyone  
215         Everyone  
216         Everyone  
236         Everyone  
246         Everyone  
254     Everyone 10+  
300         Everyone  
308             Teen  
335     Everyone 10+  
338   Rating Pending  
377     Everyone 10+  
378     Everyone 10+  
408         Everyone  
410     Everyone 10+  
413     Everyone 10+  
418         Everyone  
426     Everyone 10+  
431     Everyone 10+  
447   Rating Pending

Finally, let’s do a swarmplot of scores by Nintendo associated developers. We’ll take games published by Nintendo and plot the score given some developers.

plt.figure(figsize=(8, 10))
dev_score = sns.swarmplot(x="developer", y="score", data=r, hue='developer')
plt.grid()
plt.xticks(range(50),rotation=-90)
plt.show(dev_score)

If you’d like to get better acquainted with visualizing data, I suggest checking out the documentation of Pandas and Seaborn and trying to visualize some simple datasets, such as the ones below:

Iris Dataset

Pokemon with Stats

Mushroom Dataset

Did It Rain In Seattle?

Carrying Out Transfer Learning On X-Ray Images

Training a deep neural network can take a substantial amount of time. Not only will you often need to create a model architecture and train the model, but you’ll likely need to tweak the model to get optimal performance out of the model. This can drastically increase the amount of time needed to train a model. In order to speed up the process of training and deploying a model, a technique called transfer learning is used. Transfer learning lets you take a model architecture and reuse it, potentially using the same weights the model has learned before.

You can take a model architecture than has already been defined and reuse it, either using the same weights or retaining the model and just using the architecture. You can also do something in between and retrain just some of the layers of the architecture. This blog post will demonstrate the creation of a deep learning model in Keras and then demonstrate how the model can be reused on another, similar dataset.

I’ll be creating a Convolutional Neural Network in Keras, saving the model and then reapplying it to another image classification problem. We’ll be using the Chest X-Ray Image Dataset, available HERE for this project. I’ll be splitting up the training data folder into a training and validation set. We’ll use this initial dataset for the training and testing of the model. After that, I’ll apply some perturbations to the test dataset to simulate image corruption and transfer the pre-trained model over, retraining it to classify the damaged images.

To begin with, here are the imports we’ll need.

from keras.preprocessing.image import ImageDataGenerator
import keras
from keras.models import Sequential
from keras.layers import Dense, Conv2D, BatchNormalization, Dropout, MaxPooling2D, Flatten, Activation
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
import matplotlib.pyplot as plt

Now we’ll define some variables we’ll need, as well as handle different possible input specifications for our model. Some models want the channels listed first, others don’t.

batch_size = 16
im_height = 150
im_width = 150

# handling the different possible input shapes for the model

if keras.backend.image_data_format() == 'channels_first':
    inp_shape = (3, im_width, im_height)
else:
    inp_shape = (im_width, im_height, 3)

We’ll now want to specify the directories and use the ImageDataGenerator to get the image data from the directories. After that, we’ll use the generators and make iterables from them using flow_from_directory.

train_dir = "chest_xray/train"
val_dir = "chest_xray/val"
test_dir = "chest_xray/test"

# generate the data for both training and validation data
# when we instantiate the data generator we pass in transformations to use
# rescaling and flipping here

train_1_datagen = ImageDataGenerator(rescale=1. / 255)

train_2_datagen = ImageDataGenerator(rescale=1. /255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, rotation_range=30,
                                     width_shift_range=0., channel_shift_range=0.9, brightness_range=[0.5, 1.5])

test_datagen = ImageDataGenerator(rescale=1. /255)

test_datagen_2 = ImageDataGenerator(rescale=1. /255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, rotation_range=30,
                                     width_shift_range=0., channel_shift_range=0.9, brightness_range=[0.5, 1.5])

# after creating the objects flow from directory
# declare what directory to flow from, as well as image size and batch size
# class mode is binary here, either normal xray or not normal

# we could do "class_mode = binary" here, if so, be sure to make the final output 1 and not 2

train_generator_1 = train_1_datagen.flow_from_directory(train_dir, target_size=(im_width, im_height),
                                                    batch_size=batch_size)

test_generator_1 = test_datagen.flow_from_directory(test_dir, target_size=(im_width, im_height),
                                                    batch_size = batch_size)

train_generator_2 = train_2_datagen.flow_from_directory(val_dir, target_size=(im_width, im_height),
                                                    batch_size = batch_size)

test_generator_2 = test_datagen_2.flow_from_directory(test_dir, target_size=(im_width, im_height),
                                                    batch_size = batch_size)

Let’s visualize some of the data to get a better idea of how the datasets will differ. First, we’ll visualize the data that is being used to train the first model.

def image_show(image_generator):
    x, y = image_generator.next()
    fig = plt.figure(figsize=(8, 8))
    columns = 3
    rows = 3
    for i in range(1, 10):
        # img = np.random.randint(10)
        image = x[i]
        fig.add_subplot(rows, columns, i)
        plt.imshow(image.transpose(0, 1, 2))
    plt.show()
    
image_show(train_generator_1)

Now we’ll visualize the second set of data.

We can see that our second set of data is a little different from the first set. It’s zoomed in, rotated randomly, has image artifacts and more, to simulate the kinds of damages that might happen to an image in the real world.

Now we can make a function to handle the creation of our model. This function will establish the sequential model form and add the layers of the convolutional network – with the convolutional layers, the Max Pooling, and some batch normalization.

We’ll have three of these “blocks” comprising the convolutional layers in our network. These will be followed by a flattening layer, which transforms the data into a long vector that the densely connected layers of the neural network will be able to analyze. We’ll then add in our densely connected layers and activation functions, along with some dropout to prevent overfitting. We then compile the model and return it.

def create_model():
    # first specify the sequential nature of the model

    model = Sequential()

    # second parameter is the size of the "window" you want the CNN to use

    # the shape of the data we are passing in, 3 x 150 x 150
    # last element is just the image in the series, others are pixel widths

    model.add(Conv2D(64, (3, 3), input_shape=(150, 150, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())

    model.add(Conv2D(128, (3, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())

    model.add(Conv2D(256, (3, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(BatchNormalization())

    model.add(Conv2D(256, (3, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(BatchNormalization())

    # you'll need to flatten the data again if you plan on having Dense layers in the model,
    # as it needs a 1d unlike a 2d CNN

    model.add(Flatten())

    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(2, activation='sigmoid'))

    # now compile the model, specify loss, optimization, etc

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # fit the model, specify batch size, validation split and epochs

    return model

model = create_model()

Now we’ll create the model by calling the function.

We need to use the fit_generator function to fit our model since we used a data generator to set up the image data. We’re also going to specify some callbacks here, which specify how we want our model to behave while training.

The callbacks we will use include:

Model Checkpoint – which saves weights occasionally when a specified event happens (in our case when the validation accuracy improves.

Reduce LR On Plateau, which reduces the learning rate of our classifier when it hits a plateau and stops improving on the loss. This helps to avoid getting stuck oscillating around minimum loss.

Finally, Early Stopping lets us stop training early if we’ve hit a point where validation loss stops decreasing for some given amount of time, which is defined by our “patience”.

We pass in these callbacks and then fit the model, saving it in a variable so we can access the training records later.

filepath = "weights_training_1.hdf5"
callbacks = [ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max'),
              ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, verbose=1, mode='min', min_lr=0.00001),
              EarlyStopping(monitor= 'val_loss', min_delta=1e-10, patience=15, verbose=1, restore_best_weights=True)]

records = model.fit_generator(train_generator_1, steps_per_epoch=100, epochs=25, validation_data=test_generator_1, validation_steps=7, verbose=1, callbacks=callbacks)

After our training is complete and the best weights have been saved, let’s evaluate the performance of our model. We’ll create a function that plots the loss and accuracy of the training and validation sets. We’ll also evaluate the model’s performance using the evaluation metric we specified in the generator, which we can save to a variable and print.

First, we’ll need to get the loss on the training and validation sets, which we can draw from the “records” variable.

t_loss = records.history['loss']
v_loss = records.history['val_loss']
t_acc = records.history['acc']
v_acc = records.history['val_acc']

# gets the lengt of how long the model was trained for
train_length = range(1, len(t_acc) + 1)

Now we’ll create the function to evaluate our model’s performance.

def evaluation(model, train_length, training_acc, val_acc, training_loss, validation_loss, generator):

    # plot the loss across the number of epochs
    plt.figure()
    plt.plot(train_length, training_loss, label='Training Loss')
    plt.plot(train_length, validation_loss, label='Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    plt.figure()
    plt.plot(train_length, training_acc, label='Training Accuracy')
    plt.plot(train_length, val_acc, label='Validation Accuracy')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

    # compare against the test training set
    # get the score/accuracy for the current model
    scores = model.evaluate_generator(generator)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

We have around 83% accuracy already. Not bad.

acc: 83.09%

Now we can call the function and evaluate how our model trained along with its performance on the dataset.

 evaluation(model, train_length, t_acc, v_acc, t_loss, v_loss, test_generator_1)

We’re now going to train a second model that uses the weights and architecture from our first model. We’ll load in the trained weights into a second instance of our model. Then we’ll specify that we only want to train the last five layers of our model, a portion of the densely connected layers. We can make sure these are set up to train by printing out the trainable layers of the model.

model_2 = create_model()
model_2.load_weights("weights_training_1.hdf5")

for layer in model_2.layers[:-5]:
    layer.trainable = False

for layer in model_2.layers:
    print(layer, layer.trainable)

Now we just need to compile and fit the model.

# now compile the model, specify loss, optimization, etc
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit the model, specify batch size, validation split and epochs

filepath = "C:/Users/Daniel/Downloads/chest-xray-pneumonia/chest_xray/weights_training_2.hdf5"
callbacks = [ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max'),
              ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, verbose=1, mode='min', min_lr=0.00001),
              EarlyStopping(monitor= 'val_loss', min_delta=1e-10, patience=15, verbose=1, restore_best_weights=True)]

records = model_2.fit_generator(train_generator_2, steps_per_epoch=85, epochs=20, validation_data=test_generator_2, validation_steps=7, verbose=1, callbacks=callbacks)

Finally, let’s check to see how the second model performed by getting its metrics.

t_loss = records.history['loss']
v_loss = records.history['val_loss']
t_acc = records.history['acc']
v_acc = records.history['val_acc']

# gets the length of how long the model was trained for
train_length = range(1, len(t_acc) + 1)

evaluation(model_2, train_length, t_acc, v_acc, t_loss, v_loss, 80, test_generator_2)

acc: 73.48%

We can see that after training it for only two epochs, and despite the perturbations applied to the dataset, performance quickly hit over 70%.

Understanding and Implementing Quick Sort in Python

Last time I covered the theory and implementation of binary search, but now let’s turn to sort algorithms. Before an array can be searched, it must be sorted. One of the most efficient sorting algorithms is Quick Sort, and we’ll be exploring how to implement Quick Sort in Python below. However, let’s go over the theory behind Quick Sort first.

Quicksort is an example of applying a divide-and-conquer approach to solving a problem. We can make sorting a large array of unsorted value easy by dividing the problem down into smaller steps and just applying these steps again and again until the array is sorted.

The primary concept employed in Quick Sort is partitioning. As we partition, we divide the array into smaller portions and then sort these small chunks. How do we sort the small portions of the array? We do this by first selecting a pivot, or a point in the array that the rest of the array will be shifted around.

After we select the pivot, we need to put the pivot in its correct position in the array. This means making sure that all values in the array smaller than the pivot are to the left and all values that are larger are to the right. We then recur on the right and left half of the array until everything is sorted.

Essentially, we can describe the algorithm as the following steps:

Choose a value as the pivot
Iterate through the array and place all values smaller than the pivot the left, while all the values larger are to the right. This is done by comparing every value to the pivot. If a value is found that is out of order, the current value and the last sorted element are swapped.
After the above process is completer, the process is carried out on all the values to the left of the pivot value.
The process is carried out again on all values to the right of the pivot.

How do you go about selecting a pivot value? There are multiple ways to select a pivot value. You can select the first value in the array, the last value in the array, the median value, or a random value.

The most common way that Quick Sort is implemented is by using the last value in the array as the pivot. The example implementation below will use the method where the final element in the array is chosen as the pivot.

Now we’ll take a look at how to implement QuickSort in Python.

To begin with, we’ll create a function that partitions the array. We’ll then use this function within another function to carry out the actual sorting.

def partition(arr, low, high):

    # idx is the current index of the smaller array
    idx = (low - 1)
    pivot = arr[high]

    # c is current value in the loop
    for c in range(low, high):
        if arr[c] < pivot:
            idx = idx + 1

            arr[idx], arr[c] = arr[c], arr[idx]

    arr[idx + 1], arr[high] = arr[high], arr[idx + 1]
    return (idx + 1)

def quick_sort(arr, low, high):

    if low < high:
        p_idx = partition(arr, low, high)

        quick_sort(arr, low, p_idx-1)
        quick_sort(arr, p_idx+1, high)

arr = [12, 28, 33, 11, 38, 49, 36, 19, 100, 21, 5, 15, 3]
length = len(arr)
quick_sort(arr, 0, length - 1)
print("Array after sorting:")
for i in range(length):
    print(arr[i])

Here’s the results of running the program:

Array after sorting:
3
5
11
12
15
19
21
28
33
36
38
49
100

I suggest you experiment with different sorting algorithms as well as see how run times can vary when sorting arrays under different conditions.

Video Game Sales Analysis – Visualization and Regression

In machine learning, most problems can put in one of two categories: unsupervised learning and supervised learning. In supervised learning tasks, you know what classes your data points belong to, which means that you can check the performance of your classifier on your dataset. In contrast, in an unsupervised learning task you don’t have specific class labels for your data and it is often up to the researcher to come up with meaningful correlations and explore potential patterns in the dataset using statistical techniques like regression.

In this post, I’m going to demonstrate the process of taking a dataset and carrying out regression on the dataset in order to predict some possible trends using Scikit-learn in Python. The post will also demonstrate the process of visualizing data with Pandas, Seaborn, and Matplotlib.

For this post, we’ll be using the video game sales dataset, available HERE.

As you might expect, we’ll start off by importing all the libraries we will need.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

Let’s start off by loading the data and checking the number of occurrences for features we might be interested in.

df = pd.read_csv("vgsales.csv")

# get an idea of the total number of occurences for important features

publishers = df['Publisher'].unique()
platforms = df['Platform'].unique()
genres = df['Genre'].unique()

print("Number of games: ", len(df))
print("Number of publishers: ", len(publishers))
print("Number of platforms: ", len(platforms))
print("Number of genres: ", len(genres))

Number of games:  16598
Number of publishers:  579
Number of platforms:  31
Number of genres:  12

We need to be sure that our data is free of any null values, so we’ll check for them and drop them if there are any.

print(df.isnull().sum())

# drop them if there are any
df = df.dropna()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

One of the first things we may try doing is checking to see how many global game sales there are a year. We can count how many years there are in the database, then we can plot the years against the global sales.

# if we wanted the counts instead, we could just use Count. Count returns the number of instances,
# not the sums of the values like above
x = df.groupby(['Year']).count()
x = x['Global_Sales']
y = x.index.astype(int)

plt.figure(figsize=(12,8))
colors = sns.color_palette("muted")
ax = sns.barplot(y = y, x = x, orient='h', palette=colors)
ax.set_xlabel(xlabel='Number of releases', fontsize=16)
ax.set_ylabel(ylabel='Year', fontsize=16)
ax.set_title(label='Game Releases Per Year', fontsize=20)

plt.show()

Let’s get an idea of how many games are published by specific publishers. There’s a lot of publishers in this list, so we want to drop any publishers that have published fewer than a chosen number of games. Let’s set 75 as a threshhold. We’ll also apply this same method to the platforms the games are published on.

After dropping much of the data, we can try plotting the remaining data that we’ve put into a new dataframe. We’ll plot the number of games published by both the most prolific publishers and the number published on different consoles.

vg_data = pd.read_csv('vgsales.csv')

print(vg_data.info())
print(vg_data.describe())

# let's choose a cutoff and drop any publishers that have published less than X games

for i in vg_data['Publisher'].unique():
    if vg_data['Publisher'][vg_data['Publisher'] == i].count() < 60:
        vg_data['Publisher'][vg_data['Publisher'] == i] = 'Other'

for i in vg_data['Platform'].unique():
    if vg_data['Platform'][vg_data['Platform'] == i].count() < 100:
        vg_data['Platform'][vg_data['Platform'] == i] = 'Other'

#try plotting the new publisher and platform data
sns.countplot(x='Publisher', data=vg_data)
plt.title("# Games Published By Publisher")
plt.xticks(rotation=-90)
plt.show()

plat_data = vg_data['Platform'].value_counts(sort=False)
sns.countplot(y='Platform', data=vg_data)
plt.title("# Games Published Per Console")
plt.xticks(rotation=-90)
plt.show()

RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
Rank            16598 non-null int64
Name            16598 non-null object
Platform        16598 non-null object
Year            16327 non-null float64
Genre           16598 non-null object
Publisher       16540 non-null object
NA_Sales        16598 non-null float64
EU_Sales        16598 non-null float64
JP_Sales        16598 non-null float64
Other_Sales     16598 non-null float64
Global_Sales    16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
None
               Rank          Year      NA_Sales      EU_Sales      JP_Sales  \
count  16598.000000  16327.000000  16598.000000  16598.000000  16598.000000   
mean    8300.605254   2006.406443      0.264667      0.146652      0.077782   
std     4791.853933      5.828981      0.816683      0.505351      0.309291   
min        1.000000   1980.000000      0.000000      0.000000      0.000000   
25%     4151.250000   2003.000000      0.000000      0.000000      0.000000   
50%     8300.500000   2007.000000      0.080000      0.020000      0.000000   
75%    12449.750000   2010.000000      0.240000      0.110000      0.040000   
max    16600.000000   2020.000000     41.490000     29.020000     10.220000   

        Other_Sales  Global_Sales  
count  16598.000000  16598.000000  
mean       0.048063      0.537441  
std        0.188588      1.555028  
min        0.000000      0.010000  
25%        0.000000      0.060000  
50%        0.010000      0.170000  
75%        0.040000      0.470000  
max       10.570000     82.740000

We can also try plotting variables against each other, like getting the global sales of games by their genre.

sns.barplot(x='Genre', y='Global_Sales', data=vg_data)
plt.title("Total Sales Per Genre")
plt.xticks(rotation=-45)
plt.show()

We can filter and plot by multiple criteria. If we wanted to check and see how many games are published in a given genre AND filter by platform we can do that. We just need to get the individual platforms, which we can do by filtering the “platform” feature with a “unique” function. Then we just have to plot the platform and genre data for each of those platforms.

# try visualizing the number of games in a specific genre
for i in vg_data['Platform'].unique():
    vg_data['Genre'][vg_data['Platform'] == i].value_counts().plot(kind='line', label=i, figsize=(20, 10), grid=True)

# set the legend and ticks

plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=20, borderaxespad=0.)
plt.xticks(np.arange(12), tuple(vg_data['Genre'].unique()))
plt.tight_layout()
plt.show()

Now that we’ve plotted some of the data, let’s try predicting some trends based off of the data. We can carry out linear regression to get an idea of how global sales figures could end up based on North American sales figures. First we need to separate our data into train and test sets. We’ll start by setting North American sales as our X variable and global sales as our Y variable, and then do train/test split.

# going to attempt to carry out linear regression and predict the global sales of games
# based off of the sales in North America

X = vg_data.iloc[:, 6].values
y = vg_data.iloc[:, 10].values

# train test split and split the dataframe

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)

The data needs to be reshaped in order to be compatible with Linear Regression, so we’ll do that with the following commands. We’re reshaping them into two long 2D arrays that have as many rows as necessary and a single column. After that we can fit the data in the Linear Regression function.

# reshape the data into long 2D arrays with 1 column and as many rows as necessary
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Let’s check to see how our regression algorithm performed. We should plot the correlation between the variable’s training data, and plot the line of best fit from our regressor. We’ll then do the same thing for our testing data. Essentially we’re looking to see how the regression line fits both the training and testing data.

The regression lines should look approximately the same, and indeed they look fairly similar. The training set regression shows approximately 70 million sales for 40 million North American sales, while the test set regression may be just a little higher. We’ll also print the scores on the training and test sets, and see that our Linear Regression implementation had similar, though slightly worse accuracy on the testing set.

Let’s make a function to handle the plotting.

def plot_regression(classifier):

    plt.scatter(X_train, y_train,color='blue')
    plt.plot(X_train, classifier.predict(X_train), color='red')
    plt.title('(Training set)')
    plt.xlabel('North America Sales')
    plt.ylabel('Global Sales')
    plt.show()

    plt.scatter(X_test, y_test,color='blue')
    plt.plot(X_train, classifier.predict(X_train), color='red')
    plt.title('(Testing set)')
    plt.xlabel('North America Sales')
    plt.ylabel('Global Sales')
    plt.show()
    
plot_regression(lin_reg)
print("Training set score: {:.2f}".format(lin_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lin_reg.score(X_test, y_test)))

Training set score: 0.89

Test set score: 0.87

We can now implement some other regression algorithms and see how they perform. Let’s try using a Decision Tree regressor.

DTree_regressor = DecisionTreeRegressor(random_state=5)
DTree_regressor.fit(X_train, y_train)
plot_regression(DTree_regressor)

print("Training set score: {:.2f}".format(DTree_regressor.score(X_train, y_train)))
print("Test set score: {:.2f}".format(DTree_regressor.score(X_test, y_test)))

Training set score: 0.96 
Test set score: 0.81

Now let’s try a Random Forest regressor algorithm.

RF_regressor = RandomForestRegressor(n_estimators=300, random_state=5)
RF_regressor.fit(X_train, y_train)
plot_regression(RF_regressor)

print("Training set score: {:.2f}".format(RF_regressor.score(X_train, y_train)))
print("Test set score: {:.2f}".format(RF_regressor.score(X_test, y_test)))

Training set score: 0.94 
Test set score: 0.84

It looks like Random Forest and plain Linear Regression have comparable performance. However, we might be able to find a regression algorithm that performs better than these two. We’ll use a type of dimensionality reduction called Principal Component Analysis, which tries to distill the important features of a training set down to just the features that have the most influence on the labels/outcome. By reducing the dimensionality/complexity of a featureset, a representation that contains the features with the most predictive power is created. This can improve the predictive power of a regressor.

We’ll create a Scikit-learn Pipeline, which allows us to specify what kind of regression algorithm we want to use (Linear Regression) and how we want to set up the features for it (use the Standard Scaler and PCA).

Note that there’s only one feature we’re predicting off of here, North American sales, so PCA can’t simplify the representation anymore. But if we had more features we were doing regression on PCA could be useful.

components = [
    ('scaling', StandardScaler()),
    ('PCA', PCA()),
    ('regression', LinearRegression())
]

pca = Pipeline(components)
pca.fit(X_train, y_train)
plot_regression(pca)
print("Training set score: {:.2f}".format(pca.score(X_train, y_train)))
print("Test set score: {:.2f}".format(pca.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.87

We’re now going to try using different regression algorithms to see what kinds of results we get. Let’s try an Elastic Net regressor.

elastic = ElasticNet()
elastic.fit(X_train, y_train)
plot_regression(elastic)
print("Training set score: {:.2f}".format(elastic.score(X_train, y_train)))
print("Test set score: {:.2f}".format(elastic.score(X_test, y_test)))

Training set score: 0.54
Test set score: 0.51

Now let’s try Ridge regression.

ridge_reg = Ridge()
ridge_reg.fit(X_train, y_train)
plot_regression(ridge_reg)
print("Training set score: {:.2f}".format(ridge_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge_reg.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.87

Here’s a Lasso regression implementation.

lasso_reg = Lasso()
lasso_reg.fit(X_train, y_train)
plot_regression(lasso_reg)
print("Training set score: {:.2f}".format(lasso_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso_reg.score(X_test, y_test)))

Training set score: 0.38
Test set score: 0.36

Finally, let’s try using AdaBoost regression.

# ADA Boost regressor
ada_reg = AdaBoostRegressor()
ada_reg.fit(X_train, y_train)
plot_regression(ada_reg)

print("Training set score: {:.2f}".format(ada_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ada_reg.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.81

It looks like Ridge Regression and AdaBoost did the best at predicting the trend.

Thank you for reading through this demonstration of visualizing data and predicting data trends. If you’d like to go further and enhance your understanding of regression algorithms, I suggest checking the documentation for each of the algorithms in Scikit-learn. You can experiment with implementing these techniques on another dataset and altering the regression arguments.

Understanding And Implementing Binary Search In Python

Studying data structures and algorithms is often a frustrating experience for those looking to get into a software engineering role. However, learning the ins and outs of these algorithms can make you a better programmer. Understanding how basic algorithms like binary search and selection sort operate can aid you in thinking in terms of algorithms, giving you a better intuition for how to break complex problems down into simple, solvable steps.

In this blog series, I plan on doing a dive into many of the data structures and algorithms programmers need to know. There’s a wide variety of data structures/algorithms to learn about, but we all have to start somewhere. I’ll be starting with a breakdown of a common searching algorithm: binary search.

Understanding Binary Search

Binary search, sometimes called logarithmic search (in reference to its average run-time), is a searching algorithm designed to find items within an array. Binary search assumes that the array in question is sorted. The “binary” in binary search comes from the fact that it operates by diving an array up into two parts and then recurring this division until the specified value is found or until the no more splits are possible (meaning the item is not found in the array).

The steps of implementing binary search can be broken down as follows:

Start by selecting the middle value in the array.
When given some target value – X, compare X with the middle element of the array.
If target value X matches the middle value, return it.
If the middle vale and X aren’t equal, we check the target value against the middle value. If X is a greater value than the middle element, it follows that X can only be found in the right half of the array, which means that we recur on the right half.
If X is smaller than the middle value, we carry out the actions in step 4, but for the left half of the array instead.

If the number of elements in the array is even instead of odd, this means there isn’t a middle value. So instead, we select the left value plus the right value minus one, and divide everything by 2. Doing this ensures that there is always a value that can be selected as the middle value.

Implementation of Binary Search In Python

There are two ways to implement a binary search algorithm in Python: a recursive implementation and an iterative implementation.

Iterative implementations are actually preferred in Python, as Python has a maximum recursion depth, a point at which Python will cease recurring as a guard against stack overflows.

Let’s cover the recursive implementation first.

To start off with, we’ll create a function that takes in an array, a left value, a right value, and a target X value.

First, we’ll check the right half of the array, selecting the middle value. If the middle value happens to be the target value, we can just return it and end there.

Otherwise, we need to compare the value and find out if it is bigger or smaller than he middle value. If it’s bigger, it can only be in the right half, and if smaller it can only be in the left half. We can make another call and carry out the same function recursively.

Finally, if the we’ve run through the whole array and none of the values matched our target, we conclude that the value isn’t in the array and end the search.

def binary_search(arr, left, right, target):

    not_found = "null"

    if right >= left:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] > target:
            return binary_search(arr, left, mid - 1, target)
        else:
            return binary_search(arr, mid + 1, right, target)
    else:
        return not_found

Now let’s run the code and check the output.

arr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

def run_binary_search(arr, target):
    result = binary_search(arr, 0, len(arr) - 1, target)

    if result == "null":
        print("Target not in array.")
    else:
        print("Target is in array - position: {}".format(result))

run_binary_search(arr, 2)
run_binary_search(arr, 12)
run_binary_search(arr, 19)
run_binary_search(arr, 25)

Output is:

Target is in array – position: 0
Target is in array – position: 10
Target not in array.
Target not in array.

I mentioned earlier that Python actually prefers an iterative approach to searching, avoiding recursion where possible. Let’s go over the iterative approach for Binary search in Python now.

It’s more or less just the same as the recursive approach, except that instead of calling the function recursively we use a “while” loop, so that as long as the array has values which haven’t been searched, it carries out the binary searching process.

def binary_search_iterative(arr, left, right, target):

    not_found = "null"

    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    else:
        return not_found

arr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

def run_binary_search_iterative(arr, target):
    result = binary_search_iterative(arr, 0, len(arr) - 1, target)

    if result == "null":
        print("Target not in array.")
    else:
        print("Target is in array - position: {}".format(result))

run_binary_search_iterative(arr, 25)
run_binary_search_iterative(arr, 3)

Output is:

Target not in array.
Target is in array – position: 1