Carrying Out Transfer Learning On X-Ray Images

Training a deep neural network can take a substantial amount of time. Not only do you need to design a model architecture and train it, you’ll likely also need to tweak the model several times to get optimal performance out of it, which can drastically increase the total training time. A technique called transfer learning is used to speed up the process of training and deploying a model. Transfer learning lets you take a model architecture and reuse it, potentially with the same weights the model has already learned.

You can take a model architecture that has already been defined and reuse it, either keeping the learned weights or retraining the model from scratch and just reusing the architecture. You can also do something in between and retrain just some of the layers. This blog post will demonstrate the creation of a deep learning model in Keras and then demonstrate how the model can be reused on another, similar dataset.

I’ll be creating a Convolutional Neural Network in Keras, saving the model and then reapplying it to another image classification problem. We’ll be using the Chest X-Ray Image Dataset, available HERE for this project. I’ll be splitting up the training data folder into a training and validation set. We’ll use this initial dataset for the training and testing of the model. After that, I’ll apply some perturbations to the test dataset to simulate image corruption and transfer the pre-trained model over, retraining it to classify the damaged images.

To begin with, here are the imports we’ll need.

from keras.preprocessing.image import ImageDataGenerator
import keras
from keras.models import Sequential
from keras.layers import Dense, Conv2D, BatchNormalization, Dropout, MaxPooling2D, Flatten, Activation
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
import matplotlib.pyplot as plt

Now we’ll define some variables we’ll need, as well as handle the different possible input specifications for our model. Depending on how the Keras backend is configured, image data is expected with the channels listed either first or last.

batch_size = 16
im_height = 150
im_width = 150

# handling the different possible input shapes for the model

if keras.backend.image_data_format() == 'channels_first':
    inp_shape = (3, im_width, im_height)
else:
    inp_shape = (im_width, im_height, 3)

We’ll now want to specify the directories and use the ImageDataGenerator to get the image data from the directories. After that, we’ll use the generators and make iterables from them using flow_from_directory.

train_dir = "chest_xray/train"
val_dir = "chest_xray/val"
test_dir = "chest_xray/test"

# generate the data for both training and validation data
# when we instantiate the data generator we pass in transformations to use
# rescaling and flipping here

train_1_datagen = ImageDataGenerator(rescale=1. / 255)

train_2_datagen = ImageDataGenerator(rescale=1. /255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, rotation_range=30,
                                     width_shift_range=0., channel_shift_range=0.9, brightness_range=[0.5, 1.5])

test_datagen = ImageDataGenerator(rescale=1. /255)

test_datagen_2 = ImageDataGenerator(rescale=1. /255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, rotation_range=30,
                                     width_shift_range=0., channel_shift_range=0.9, brightness_range=[0.5, 1.5])

# after creating the objects flow from directory
# declare what directory to flow from, as well as image size and batch size
# class_mode is left at its default ("categorical") here, so the labels are one-hot encoded
# over the two classes: normal xray or not normal

# we could set class_mode="binary" instead; if so, be sure to make the final output layer size 1 and not 2

train_generator_1 = train_1_datagen.flow_from_directory(train_dir, target_size=(im_width, im_height),
                                                    batch_size=batch_size)

test_generator_1 = test_datagen.flow_from_directory(test_dir, target_size=(im_width, im_height),
                                                    batch_size = batch_size)

train_generator_2 = train_2_datagen.flow_from_directory(val_dir, target_size=(im_width, im_height),
                                                    batch_size = batch_size)

test_generator_2 = test_datagen_2.flow_from_directory(test_dir, target_size=(im_width, im_height),
                                                    batch_size = batch_size)

Let’s visualize some of the data to get a better idea of how the datasets will differ. First, we’ll visualize the data that is being used to train the first model.

def image_show(image_generator):
    # pull one batch of images (x) and labels (y) from the generator
    x, y = next(image_generator)
    fig = plt.figure(figsize=(8, 8))
    columns = 3
    rows = 3
    # show the first nine images of the batch in a 3 x 3 grid
    for i in range(rows * columns):
        image = x[i]
        fig.add_subplot(rows, columns, i + 1)
        plt.imshow(image)
    plt.show()
    
image_show(train_generator_1)

Now we’ll visualize the second set of data.

image_show(train_generator_2)

We can see that our second set of data is a little different from the first set. It’s zoomed in, rotated randomly, has image artifacts and more, to simulate the kinds of damages that might happen to an image in the real world.

Now we can make a function to handle the creation of our model. This function will establish the sequential model form and add the layers of the convolutional network – with the convolutional layers, the Max Pooling, and some batch normalization.

We’ll have four of these “blocks” comprising the convolutional layers in our network. These will be followed by a flattening layer, which transforms the data into a long vector that the densely connected layers of the neural network will be able to analyze. We’ll then add in our densely connected layers and activation functions, along with some dropout to prevent overfitting. We then compile the model and return it.

def create_model():
    # first specify the sequential nature of the model

    model = Sequential()

    # the second argument is the size of the "window" (kernel) the CNN slides over the image

    # inp_shape was set above from the backend's image data format;
    # for channels_last data it is 150 x 150 x 3 (height, width, colour channels)

    model.add(Conv2D(64, (3, 3), input_shape=inp_shape))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())

    model.add(Conv2D(128, (3, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())

    model.add(Conv2D(256, (3, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(BatchNormalization())

    model.add(Conv2D(256, (3, 3)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(BatchNormalization())

    # Dense layers expect a 1D vector per sample, so the feature maps
    # coming out of the convolutional blocks need to be flattened first

    model.add(Flatten())

    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(2, activation='sigmoid'))

    # now compile the model, specify loss, optimization, etc

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # return the compiled model; it is fit later with fit_generator

    return model

Now we’ll create the model by calling the function.

model = create_model()
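As a quick sanity check on the architecture, here is a small sketch in plain Python (no Keras needed) that traces how the spatial size of the feature maps shrinks through the four convolutional blocks. It assumes the 150 x 150 input used above and the Keras defaults for these layers: "valid" padding with a stride of 1 for Conv2D, and a stride equal to the pool size for MaxPooling2D. The helper names conv_out and pool_out are made up for this illustration.

```python
def conv_out(size, kernel=3):
    # 'valid' padding with stride 1: the kernel fits (size - kernel + 1) times
    return size - kernel + 1


def pool_out(size, pool=2):
    # 2 x 2 max pooling with stride 2 halves the size, rounding down
    return size // pool


size = 150                      # assumed input height/width
filters = [64, 128, 256, 256]   # filters per block, as in create_model
shapes = []
for n_filters in filters:
    # each block is Conv2D followed by MaxPooling2D;
    # Activation and BatchNormalization do not change the shape
    size = pool_out(conv_out(size))
    shapes.append((size, size, n_filters))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)      # [(74, 74, 64), (36, 36, 128), (17, 17, 256), (7, 7, 256)]
print(flattened)   # 12544
```

Under these assumptions, the Flatten layer hands a vector of 7 * 7 * 256 = 12544 values to the first Dense layer.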

Because we used a data generator to set up the image data, we fit our model with the fit_generator function. (In newer versions of Keras, fit_generator is deprecated and fit accepts generators directly.) We’re also going to specify some callbacks here, which control how our model behaves while training.

The callbacks we will use include:

Model Checkpoint, which saves the weights whenever a specified event happens (in our case, when the validation accuracy improves).

Reduce LR On Plateau, which reduces the learning rate of our classifier when the monitored loss hits a plateau and stops improving. This helps to avoid getting stuck oscillating around a minimum of the loss.

Early Stopping, which lets us stop training early if the validation loss stops decreasing for a given number of epochs, defined by our “patience”.
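To make the “patience” idea concrete, here is a simplified sketch in plain Python of the bookkeeping behind Early Stopping and Reduce LR On Plateau. It is not the Keras implementation (which also handles details such as cooldown periods and restoring weights); the function names and the list of validation losses are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


def reduce_lr_schedule(val_losses, lr, factor, patience, min_lr):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # plateau detected: shrink the learning rate, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# made-up validation losses: improving for three epochs, then flat
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.68, 0.69]

stop_epoch = early_stopping_epoch(val_losses, patience=4)
schedule = reduce_lr_schedule(val_losses, lr=0.001, factor=0.2,
                              patience=3, min_lr=0.00001)
print(stop_epoch)   # 7: four epochs in a row without a new best loss
print(schedule)     # the rate drops by a factor of 5 at the sixth epoch
```

With a patience of 15, as in the callbacks used in this post, the same losses would not trigger an early stop at all, since there are only five epochs without improvement.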

We pass in these callbacks and then fit the model, saving it in a variable so we can access the training records later.

filepath = "weights_training_1.hdf5"
callbacks = [ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max'),
              ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, verbose=1, mode='min', min_lr=0.00001),
              EarlyStopping(monitor= 'val_loss', min_delta=1e-10, patience=15, verbose=1, restore_best_weights=True)]

records = model.fit_generator(train_generator_1, steps_per_epoch=100, epochs=25, validation_data=test_generator_1, validation_steps=7, verbose=1, callbacks=callbacks)

After our training is complete and the best weights have been saved, let’s evaluate the performance of our model. We’ll create a function that plots the loss and accuracy of the training and validation sets. We’ll also evaluate the model’s performance using the metric we specified when compiling the model (accuracy), which we can save to a variable and print.

First, we’ll need to get the loss and accuracy on the training and validation sets, which we can draw from the “records” variable. Depending on your Keras version, the accuracy keys may be named 'accuracy' and 'val_accuracy' rather than 'acc' and 'val_acc'.

t_loss = records.history['loss']
v_loss = records.history['val_loss']
t_acc = records.history['acc']
v_acc = records.history['val_acc']

# gets the length of how long the model was trained for
train_length = range(1, len(t_acc) + 1)

Now we’ll create the function to evaluate our model’s performance.

def evaluation(model, train_length, training_acc, val_acc, training_loss, validation_loss, generator):

    # plot the loss across the number of epochs
    plt.figure()
    plt.plot(train_length, training_loss, label='Training Loss')
    plt.plot(train_length, validation_loss, label='Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    plt.figure()
    plt.plot(train_length, training_acc, label='Training Accuracy')
    plt.plot(train_length, val_acc, label='Validation Accuracy')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

    # evaluate on the generator that was passed in (the test set here)
    # and get the score/accuracy for the current model
    scores = model.evaluate_generator(generator)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

Now we can call the function and evaluate how our model trained along with its performance on the dataset.

evaluation(model, train_length, t_acc, v_acc, t_loss, v_loss, test_generator_1)

acc: 83.09%

We have around 83% accuracy already. Not bad.

We’re now going to train a second model that uses the weights and architecture from our first model. We’ll load the trained weights into a second instance of our model. Then we’ll specify that we only want to train the last five layers of our model, which are the densely connected layers and their dropout layers. We can make sure these are set up to train by printing out whether each layer of the model is trainable.

model_2 = create_model()
model_2.load_weights("weights_training_1.hdf5")

for layer in model_2.layers[:-5]:
    layer.trainable = False

for layer in model_2.layers:
    print(layer, layer.trainable)
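If it is not obvious which layers the [:-5] slice leaves trainable, the following sketch reproduces the slicing on a plain list of layer names. The names are written out by hand to mirror the order of the layers in create_model; no Keras objects are involved.

```python
# layer order as added in create_model: four convolutional blocks,
# then Flatten and the densely connected part
layer_names = (
    ["Conv2D", "Activation", "MaxPooling2D", "BatchNormalization"] * 4
    + ["Flatten", "Dense(128)", "Dropout", "Dense(64)", "Dropout", "Dense(2)"]
)

frozen = layer_names[:-5]      # everything except the last five layers
trainable = layer_names[-5:]   # the last five layers

print(len(layer_names), len(frozen))   # 22 17
print(trainable)   # ['Dense(128)', 'Dropout', 'Dense(64)', 'Dropout', 'Dense(2)']
```

So the convolutional blocks and the Flatten layer keep their weights from the first training run, and only the dense layers are updated. Dropout has no weights of its own, so marking it trainable has no effect.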

Now we just need to compile and fit the model.

# now compile the model, specify loss, optimization, etc
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit the model, specify batch size, validation split and epochs

filepath = "weights_training_2.hdf5"
callbacks = [ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max'),
              ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, verbose=1, mode='min', min_lr=0.00001),
              EarlyStopping(monitor= 'val_loss', min_delta=1e-10, patience=15, verbose=1, restore_best_weights=True)]

records = model_2.fit_generator(train_generator_2, steps_per_epoch=85, epochs=20, validation_data=test_generator_2, validation_steps=7, verbose=1, callbacks=callbacks)

Finally, let’s check to see how the second model performed by getting its metrics.

t_loss = records.history['loss']
v_loss = records.history['val_loss']
t_acc = records.history['acc']
v_acc = records.history['val_acc']

# gets the length of how long the model was trained for
train_length = range(1, len(t_acc) + 1)

evaluation(model_2, train_length, t_acc, v_acc, t_loss, v_loss, test_generator_2)

acc: 73.48%

We can see that after training it for only two epochs, and despite the perturbations applied to the dataset, performance quickly hit over 70%.

Understanding and Implementing Quick Sort in Python

Last time I covered the theory and implementation of binary search, but now let’s turn to sorting algorithms. Before an array can be searched with binary search, it must be sorted. One of the most efficient sorting algorithms is Quick Sort, and we’ll be exploring how to implement Quick Sort in Python below. First, though, let’s go over the theory behind Quick Sort.

Quick Sort is an example of applying a divide-and-conquer approach to solving a problem. We can make sorting a large array of unsorted values easy by breaking the problem down into smaller steps and applying these steps again and again until the array is sorted.

The primary concept employed in Quick Sort is partitioning. As we partition, we divide the array into smaller portions and then sort these small chunks. How do we sort the small portions of the array? We do this by first selecting a pivot, or a point in the array that the rest of the array will be shifted around.

After we select the pivot, we need to put the pivot in its correct position in the array. This means making sure that all values in the array smaller than the pivot are to the left and all values that are larger are to the right. We then recur on the right and left half of the array until everything is sorted.

Essentially, we can describe the algorithm as the following steps:

  1. Choose a value as the pivot
  2. Iterate through the array and place all values smaller than the pivot to the left, while all the values larger go to the right. This is done by comparing every value to the pivot. If a value is found that is smaller than the pivot, it is swapped with the first element after the run of smaller values found so far.
  3. After the above process is complete, the process is carried out on all the values to the left of the pivot value.
  4. The process is carried out again on all values to the right of the pivot.

How do you go about selecting a pivot value? There are multiple ways to select a pivot value. You can select the first value in the array, the last value in the array, the median value, or a random value.

A common way to implement Quick Sort is to use the last value in the array as the pivot, and that is the method the example implementation below uses.

Now we’ll take a look at how to implement QuickSort in Python.

To begin with, we’ll create a function that partitions the array. We’ll then use this function within another function to carry out the actual sorting. 

def partition(arr, low, high):

    # idx is the index of the last element known to be smaller than the pivot
    idx = (low - 1)
    pivot = arr[high]

    # c is the index of the current element in the loop
    for c in range(low, high):
        if arr[c] < pivot:
            idx = idx + 1

            arr[idx], arr[c] = arr[c], arr[idx]

    # move the pivot into its final position, just after the smaller values
    arr[idx + 1], arr[high] = arr[high], arr[idx + 1]
    return (idx + 1)

def quick_sort(arr, low, high):

    if low < high:
        p_idx = partition(arr, low, high)

        quick_sort(arr, low, p_idx-1)
        quick_sort(arr, p_idx+1, high)

arr = [12, 28, 33, 11, 38, 49, 36, 19, 100, 21, 5, 15, 3]
length = len(arr)
quick_sort(arr, 0, length - 1)
print("Array after sorting:")
for i in range(length):
    print(arr[i])

Here’s the results of running the program:

Array after sorting:
3
5
11
12
15
19
21
28
33
36
38
49
100

I suggest you experiment with different sorting algorithms as well as see how run times can vary when sorting arrays under different conditions.
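As one possible starting point for that kind of experiment, here is a sketch that counts comparisons instead of measuring wall-clock time, so the result does not depend on your machine. It compares the last-element pivot with a randomly chosen pivot on an array that is already sorted, which is a bad case for the last-element strategy. The counter argument, the rng argument and the count_comparisons helper are additions made for this illustration.

```python
import random


def partition(arr, low, high, counter):
    pivot = arr[high]
    idx = low - 1
    for c in range(low, high):
        counter[0] += 1               # one comparison against the pivot
        if arr[c] < pivot:
            idx += 1
            arr[idx], arr[c] = arr[c], arr[idx]
    arr[idx + 1], arr[high] = arr[high], arr[idx + 1]
    return idx + 1


def quick_sort(arr, low, high, counter, rng=None):
    if low < high:
        if rng is not None:
            # random pivot: swap a randomly chosen element into the last slot
            r = rng.randint(low, high)
            arr[r], arr[high] = arr[high], arr[r]
        p_idx = partition(arr, low, high, counter)
        quick_sort(arr, low, p_idx - 1, counter, rng)
        quick_sort(arr, p_idx + 1, high, counter, rng)


def count_comparisons(values, rng=None):
    arr = list(values)                # sort a copy, leave the input alone
    counter = [0]
    quick_sort(arr, 0, len(arr) - 1, counter, rng)
    assert arr == sorted(values)      # sanity check: the result is sorted
    return counter[0]


n = 200                               # kept small to stay under the recursion limit
already_sorted = list(range(n))
last_pivot = count_comparisons(already_sorted)
random_pivot = count_comparisons(already_sorted, rng=random.Random(0))
print("last-element pivot:", last_pivot)
print("random pivot:", random_pivot)
```

On sorted input the last-element pivot never splits the array, so it performs n * (n - 1) / 2 comparisons (19900 for n = 200), while the random pivot needs far fewer.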

Video Game Sales Analysis – Visualization and Regression

In machine learning, most problems can be put in one of two categories: unsupervised learning and supervised learning. In supervised learning tasks, you know the labels or target values for your data points, which means that you can check the performance of your model on your dataset. In contrast, in an unsupervised learning task you don’t have specific labels for your data, and it is often up to the researcher to come up with meaningful correlations and explore potential patterns in the dataset. Regression, the technique used in this post, is a supervised task: we predict a numeric target from known examples.

In this post, I’m going to demonstrate the process of taking a dataset and carrying out regression on the dataset in order to predict some possible trends using Scikit-learn in Python. The post will also demonstrate the process of visualizing data with Pandas, Seaborn, and Matplotlib.

For this post, we’ll be using the video game sales dataset, available HERE.

As you might expect, we’ll start off by importing all the libraries we will need.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

Let’s start off by loading the data and checking the number of occurrences for features we might be interested in.

df = pd.read_csv("vgsales.csv")

# get an idea of the total number of occurrences for important features

publishers = df['Publisher'].unique()
platforms = df['Platform'].unique()
genres = df['Genre'].unique()

print("Number of games: ", len(df))
print("Number of publishers: ", len(publishers))
print("Number of platforms: ", len(platforms))
print("Number of genres: ", len(genres))

Number of games:  16598
Number of publishers:  579
Number of platforms:  31
Number of genres:  12

We need to be sure that our data is free of any null values, so we’ll check for them and drop them if there are any.

print(df.isnull().sum())

# drop them if there are any
df = df.dropna()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

One of the first things we may try doing is checking how many games were released each year. We can group the data by year, count the entries in each group, and then plot the years against those counts.

# count() returns the number of rows in each group, not the sum of the values;
# use sum() instead if you want the total sales per year
x = df.groupby(['Year']).count()
x = x['Global_Sales']
y = x.index.astype(int)

plt.figure(figsize=(12,8))
colors = sns.color_palette("muted")
ax = sns.barplot(y = y, x = x, orient='h', palette=colors)
ax.set_xlabel(xlabel='Number of releases', fontsize=16)
ax.set_ylabel(ylabel='Year', fontsize=16)
ax.set_title(label='Game Releases Per Year', fontsize=20)

plt.show()


Let’s get an idea of how many games are published by specific publishers. There are a lot of publishers in this list, so we want to group together any publishers that have published fewer than a chosen number of games. Let’s set 60 as the threshold and relabel those publishers as “Other”. We’ll also apply this same method to the platforms the games are published on, with a threshold of 100.

After grouping the rarer values, we can try plotting the data, which we’ve put into a new dataframe. We’ll plot the number of games published by the most prolific publishers and the number published on different consoles.

vg_data = pd.read_csv('vgsales.csv')

print(vg_data.info())
print(vg_data.describe())

# let's choose a cutoff and relabel any publishers that have published fewer than 60 games as 'Other'

for i in vg_data['Publisher'].unique():
    if (vg_data['Publisher'] == i).sum() < 60:
        # .loc assigns on the dataframe itself and avoids chained-assignment problems
        vg_data.loc[vg_data['Publisher'] == i, 'Publisher'] = 'Other'

# do the same for platforms with fewer than 100 games
for i in vg_data['Platform'].unique():
    if (vg_data['Platform'] == i).sum() < 100:
        vg_data.loc[vg_data['Platform'] == i, 'Platform'] = 'Other'

#try plotting the new publisher and platform data
sns.countplot(x='Publisher', data=vg_data)
plt.title("# Games Published By Publisher")
plt.xticks(rotation=-90)
plt.show()

plat_data = vg_data['Platform'].value_counts(sort=False)
sns.countplot(y='Platform', data=vg_data)
plt.title("# Games Published Per Console")
plt.xticks(rotation=-90)
plt.show()

RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
Rank            16598 non-null int64
Name            16598 non-null object
Platform        16598 non-null object
Year            16327 non-null float64
Genre           16598 non-null object
Publisher       16540 non-null object
NA_Sales        16598 non-null float64
EU_Sales        16598 non-null float64
JP_Sales        16598 non-null float64
Other_Sales     16598 non-null float64
Global_Sales    16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
None
               Rank          Year      NA_Sales      EU_Sales      JP_Sales  \
count  16598.000000  16327.000000  16598.000000  16598.000000  16598.000000   
mean    8300.605254   2006.406443      0.264667      0.146652      0.077782   
std     4791.853933      5.828981      0.816683      0.505351      0.309291   
min        1.000000   1980.000000      0.000000      0.000000      0.000000   
25%     4151.250000   2003.000000      0.000000      0.000000      0.000000   
50%     8300.500000   2007.000000      0.080000      0.020000      0.000000   
75%    12449.750000   2010.000000      0.240000      0.110000      0.040000   
max    16600.000000   2020.000000     41.490000     29.020000     10.220000   

        Other_Sales  Global_Sales  
count  16598.000000  16598.000000  
mean       0.048063      0.537441  
std        0.188588      1.555028  
min        0.000000      0.010000  
25%        0.000000      0.060000  
50%        0.010000      0.170000  
75%        0.040000      0.470000  
max       10.570000     82.740000  

We can also try plotting variables against each other, like getting the average global sales of games by their genre. Note that Seaborn’s barplot shows the mean of each group by default.

sns.barplot(x='Genre', y='Global_Sales', data=vg_data)
plt.title("Average Global Sales Per Genre")
plt.xticks(rotation=-45)
plt.show()

We can filter and plot by multiple criteria. If we wanted to check and see how many games are published in a given genre AND filter by platform we can do that. We just need to get the individual platforms, which we can do by filtering the “platform” feature with a “unique” function. Then we just have to plot the platform and genre data for each of those platforms.

# try visualizing the number of games in a specific genre
for i in vg_data['Platform'].unique():
    vg_data['Genre'][vg_data['Platform'] == i].value_counts().plot(kind='line', label=i, figsize=(20, 10), grid=True)

# set the legend and ticks

plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=20, borderaxespad=0.)
plt.xticks(np.arange(12), tuple(vg_data['Genre'].unique()))
plt.tight_layout()
plt.show()


Now that we’ve plotted some of the data, let’s try predicting some trends based off of the data. We can carry out linear regression to get an idea of how global sales figures could end up based on North American sales figures. First we need to separate our data into train and test sets. We’ll start by setting North American sales as our X variable and global sales as our Y variable, and then do train/test split.

# going to attempt to carry out linear regression and predict the global sales of games
# based off of the sales in North America

X = vg_data.iloc[:, 6].values
y = vg_data.iloc[:, 10].values

# train test split and split the dataframe

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)

The data needs to be reshaped in order to be compatible with Linear Regression, so we’ll do that with the following commands. We’re reshaping them into two long 2D arrays that have as many rows as necessary and a single column. After that we can fit the data in the Linear Regression function.

# reshape the data into long 2D arrays with 1 column and as many rows as necessary
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Let’s check to see how our regression algorithm performed. We’ll make a scatter plot of the training data and draw the line of best fit from our regressor over it. We’ll then do the same thing for our testing data. Essentially we’re looking to see how well the regression line fits both the training and testing data.

The regression lines should look approximately the same, and indeed they look fairly similar. The training set regression shows approximately 70 million global sales for 40 million North American sales, while the test set regression may be just a little higher. We’ll also print the scores on the training and test sets, and see that our Linear Regression implementation had a similar, though slightly worse, score on the testing set.

Let’s make a function to handle the plotting.

def plot_regression(classifier):

    plt.scatter(X_train, y_train,color='blue')
    plt.plot(X_train, classifier.predict(X_train), color='red')
    plt.title('(Training set)')
    plt.xlabel('North America Sales')
    plt.ylabel('Global Sales')
    plt.show()

    plt.scatter(X_test, y_test,color='blue')
    plt.plot(X_train, classifier.predict(X_train), color='red')
    plt.title('(Testing set)')
    plt.xlabel('North America Sales')
    plt.ylabel('Global Sales')
    plt.show()
    
plot_regression(lin_reg)
print("Training set score: {:.2f}".format(lin_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lin_reg.score(X_test, y_test)))


Training set score: 0.89

Test set score: 0.87
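The score printed by Scikit-learn regressors is the coefficient of determination, R², rather than a classification accuracy. As a rough illustration of what Linear Regression computes for a single feature, here is a plain-Python sketch of the least-squares line and its R². The data points are made up for the example and are not taken from vgsales.csv; fit_line and r_squared are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# made-up numbers for illustration only
na_sales = [0.0, 1.0, 2.0, 3.0, 4.0]
global_sales = [0.1, 1.9, 4.1, 5.9, 8.0]

slope, intercept = fit_line(na_sales, global_sales)
score = r_squared(na_sales, global_sales, slope, intercept)
print("slope: {:.2f}, intercept: {:.2f}".format(slope, intercept))
print("R^2: {:.3f}".format(score))
```

A score of 1.0 means the line explains all of the variation in the target, while a score of 0 means it does no better than always predicting the mean.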

We can now implement some other regression algorithms and see how they perform. Let’s try using a Decision Tree regressor.

DTree_regressor = DecisionTreeRegressor(random_state=5)
DTree_regressor.fit(X_train, y_train)
plot_regression(DTree_regressor)

print("Training set score: {:.2f}".format(DTree_regressor.score(X_train, y_train)))
print("Test set score: {:.2f}".format(DTree_regressor.score(X_test, y_test)))

Training set score: 0.96
Test set score: 0.81 

Now let’s try a Random Forest regressor algorithm.

RF_regressor = RandomForestRegressor(n_estimators=300, random_state=5)
RF_regressor.fit(X_train, y_train)
plot_regression(RF_regressor)

print("Training set score: {:.2f}".format(RF_regressor.score(X_train, y_train)))
print("Test set score: {:.2f}".format(RF_regressor.score(X_test, y_test)))

Training set score: 0.94
Test set score: 0.84

It looks like Random Forest and plain Linear Regression have comparable performance. However, we might be able to find a regression approach that performs better than these two. We’ll use a type of dimensionality reduction called Principal Component Analysis, which re-expresses the features of a training set as a smaller number of components that capture most of the variance in the data. By reducing the dimensionality/complexity of a featureset, a more compact representation is created. This can sometimes improve the predictive power of a regressor.

We’ll create a Scikit-learn Pipeline, which allows us to specify what kind of regression algorithm we want to use (Linear Regression) and how we want to set up the features for it (use the Standard Scaler and PCA).

Note that there’s only one feature we’re predicting off of here, North American sales, so PCA can’t simplify the representation any further. But if we were doing regression on more features, PCA could be useful.
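For reference, the scaling step of the pipeline standardizes each feature to zero mean and unit variance. A minimal plain-Python sketch of that transformation, assuming a single feature and using made-up values (standard_scale is a hypothetical helper, not part of Scikit-learn):

```python
from statistics import mean, pstdev


def standard_scale(values):
    # subtract the mean and divide by the population standard deviation
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


na_sales = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up numbers for illustration
scaled = standard_scale(na_sales)
print(scaled)
print(mean(scaled), pstdev(scaled))    # approximately 0 and 1
```

Because this is only a shift and a rescaling of the single input feature, an ordinary least-squares fit on the scaled values makes the same predictions as a fit on the raw values, so the pipeline should score about the same as plain Linear Regression here.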

components = [
    ('scaling', StandardScaler()),
    ('PCA', PCA()),
    ('regression', LinearRegression())
]

pca = Pipeline(components)
pca.fit(X_train, y_train)
plot_regression(pca)
print("Training set score: {:.2f}".format(pca.score(X_train, y_train)))
print("Test set score: {:.2f}".format(pca.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.87

We’re now going to try using different regression algorithms to see what kinds of results we get. Let’s try an Elastic Net regressor.

elastic = ElasticNet()
elastic.fit(X_train, y_train)
plot_regression(elastic)
print("Training set score: {:.2f}".format(elastic.score(X_train, y_train)))
print("Test set score: {:.2f}".format(elastic.score(X_test, y_test)))

Training set score: 0.54
Test set score: 0.51

Now let’s try Ridge regression.

ridge_reg = Ridge()
ridge_reg.fit(X_train, y_train)
plot_regression(ridge_reg)
print("Training set score: {:.2f}".format(ridge_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge_reg.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.87

Here’s a Lasso regression implementation.

lasso_reg = Lasso()
lasso_reg.fit(X_train, y_train)
plot_regression(lasso_reg)
print("Training set score: {:.2f}".format(lasso_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso_reg.score(X_test, y_test)))

Training set score: 0.38
Test set score: 0.36

Finally, let’s try using AdaBoost regression.

# ADA Boost regressor
ada_reg = AdaBoostRegressor()
ada_reg.fit(X_train, y_train)
plot_regression(ada_reg)

print("Training set score: {:.2f}".format(ada_reg.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ada_reg.score(X_test, y_test)))

Training set score: 0.89
Test set score: 0.81

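Judging by the test set scores, Ridge Regression and plain Linear Regression did the best at predicting the trend (both 0.87), with Random Forest (0.84) close behind and AdaBoost and the Decision Tree at 0.81.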

Thank you for reading through this demonstration of visualizing data and predicting data trends. If you’d like to go further and enhance your understanding of regression algorithms, I suggest checking the documentation for each of the algorithms in Scikit-learn. You can experiment with implementing these techniques on another dataset and altering the regression arguments.

Understanding And Implementing Binary Search In Python

Studying data structures and algorithms is often a frustrating experience for those looking to get into a software engineering role. However, learning the ins and outs of these algorithms can make you a better programmer. Understanding how basic algorithms like binary search and selection sort operate can aid you in thinking in terms of algorithms, giving you a better intuition for how to break complex problems down into simple, solvable steps. 

In this blog series, I plan on doing a deep dive into many of the data structures and algorithms programmers need to know. There’s a wide variety of data structures/algorithms to learn about, but we all have to start somewhere. I’ll be starting with a breakdown of a common searching algorithm: binary search.

Understanding Binary Search

Binary search, sometimes called logarithmic search (in reference to its run-time, which grows logarithmically with the size of the array), is a searching algorithm designed to find items within an array. Binary search assumes that the array in question is sorted. The “binary” in binary search comes from the fact that it operates by dividing an array up into two parts and then repeating this division until the specified value is found or until no more splits are possible (meaning the item is not found in the array).

The steps of implementing binary search can be broken down as follows:

  1. Start by selecting the middle value in the array.
  2. When given some target value – X, compare X with the middle element of the array.
  3. If target value X matches the middle value, return its position.
  4. If the middle value and X aren’t equal, we check whether X is greater or smaller than the middle value. If X is greater than the middle element, X can only be found in the right half of the array, which means that we recur on the right half.
  5. If X is smaller than the middle value, we carry out the actions in step 4, but for the left half of the array instead.

If the number of elements in the array is even instead of odd, there isn’t a single middle value. So instead, we compute the middle index as the left index plus half the distance between the left and right indices, rounding down. For an even number of elements this selects the lower of the two central values, which ensures that there is always a value that can be selected as the middle value.
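As a rough sketch of how these steps play out, the snippet below records the left, right, and middle indices at each step of a search, using floor division on the indices to pick the midpoint. The function name trace_binary_search is hypothetical and exists only for this illustration.

```python
def trace_binary_search(arr, target):
    """Return the (left, right, mid) index triples inspected while searching for target."""
    left, right = 0, len(arr) - 1
    steps = []
    while left <= right:
        # Floor division rounds down, so an even-sized range uses the
        # lower of its two central indices as the midpoint.
        mid = left + (right - left) // 2
        steps.append((left, right, mid))
        if arr[mid] == target:
            break
        elif arr[mid] < target:
            left = mid + 1    # target can only be in the right half
        else:
            right = mid - 1   # target can only be in the left half
    return steps

arr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
for left, right, mid in trace_binary_search(arr, 12):
    print("left={} right={} mid={} value={}".format(left, right, mid, arr[mid]))
```

For a target of 12, the search inspects indices 6, 9, 11, and finally 10. The second step covers indices 7 through 12, an even-sized range, and picks index 9, the lower of its two central positions.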

Implementation of Binary Search In Python

There are two ways to implement a binary search algorithm in Python: a recursive implementation and an iterative implementation.

Iterative implementations are often preferred in Python, as Python has a maximum recursion depth. Once a chain of recursive calls goes deeper than this limit, Python raises a RecursionError as a guard against stack overflows.
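To see the recursion limit in action, here is a minimal sketch that uses sys.getrecursionlimit() from the standard library. The helpers count_down and exceeds_limit are made up for this demonstration and have nothing to do with binary search itself.

```python
import sys

def count_down(n):
    """Recurse n levels deep; a hypothetical helper used only to reach the limit."""
    return 0 if n == 0 else 1 + count_down(n - 1)

def exceeds_limit(depth):
    """Return True if recursing to the given depth raises RecursionError."""
    try:
        count_down(depth)
    except RecursionError:
        return True
    return False

limit = sys.getrecursionlimit()    # the exact value depends on your environment
print("Recursion limit:", limit)
print("Depth 10 exceeds limit:", exceeds_limit(10))
print("Depth limit + 100 exceeds limit:", exceeds_limit(limit + 100))
```

Keep in mind that a recursive binary search only goes about log2(n) levels deep, so it is unlikely to hit this limit on realistic inputs. The iterative version simply sidesteps the question and avoids the overhead of repeated function calls.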

Let’s cover the recursive implementation first.

To start off with, we’ll create a function that takes in an array, a left index, a right index, and a target value X.

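First, we’ll check that the right index is not smaller than the left index, which means there is still a range left to search, and then select the middle value of that range. If the middle value happens to be the target value, we can just return its index and end there.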

Otherwise, we need to compare the target and find out if it is bigger or smaller than the middle value. If it’s bigger, it can only be in the right half, and if it’s smaller it can only be in the left half. We can make another call and carry out the same function recursively on that half.

Finally, if the range left to search becomes empty and none of the values we checked matched our target, we conclude that the value isn’t in the array and end the search.

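def binary_search(arr, left, right, target):

    # Sentinel returned when the target is not in the array.
    not_found = "null"

    if right >= left:
        # Middle index of the current range, rounded down.
        mid = left + (right - left) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] > target:
            # Target is smaller, so search the left half.
            return binary_search(arr, left, mid - 1, target)
        else:
            # Target is bigger, so search the right half.
            return binary_search(arr, mid + 1, right, target)
    else:
        # The range is empty, so the target is not present.
        return not_found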

Now let’s run the code and check the output.

arr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

def run_binary_search(arr, target):
    result = binary_search(arr, 0, len(arr) - 1, target)

    if result == "null":
        print("Target not in array.")
    else:
        print("Target is in array - position: {}".format(result))

run_binary_search(arr, 2)
run_binary_search(arr, 12)
run_binary_search(arr, 19)
run_binary_search(arr, 25)

Output is:

Target is in array - position: 0
Target is in array - position: 10
Target not in array.
Target not in array.

I mentioned earlier that an iterative approach is often preferred in Python, avoiding recursion where possible. Let’s go over the iterative approach for binary search in Python now.

It’s more or less the same as the recursive approach, except that instead of calling the function recursively we use a “while” loop. As long as part of the array hasn’t been searched yet, the loop keeps carrying out the binary searching process on the remaining range.

def binary_search_iterative(arr, left, right, target):

    not_found = "null"

    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    # The loop ended without a match, so the target is not in the array.
    return not_found

arr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

def run_binary_search_iterative(arr, target):
    result = binary_search_iterative(arr, 0, len(arr) - 1, target)

    if result == "null":
        print("Target not in array.")
    else:
        print("Target is in array - position: {}".format(result))

run_binary_search_iterative(arr, 25)
run_binary_search_iterative(arr, 3)

Output is:

Target not in array.
Target is in array - position: 1
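If you want to cross-check a hand-written binary search, Python’s standard library includes the bisect module, which performs binary search on sorted lists. Below is a minimal sketch of a lookup built on bisect_left. The wrapper name bisect_search is my own, and it returns the same "null" marker used in this post purely so the results are easy to compare.

```python
from bisect import bisect_left

def bisect_search(arr, target):
    """Return the index of target in the sorted list arr, or "null" if it is absent."""
    # bisect_left returns the leftmost position where target could be inserted
    # while keeping arr sorted; the target is present only if that slot holds it.
    index = bisect_left(arr, target)
    if index < len(arr) and arr[index] == target:
        return index
    return "null"

arr = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
print(bisect_search(arr, 3))
print(bisect_search(arr, 25))
```

With the array used throughout this post, the two calls print 1 and null, matching the results of the iterative search.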

