import pandas as pd
bike_rentals = pd.read_csv("bike_rental_hour.csv")
print(bike_rentals.head())

   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  \
0        1  2011-01-01       1   0     1   0        0        6           0   
1        2  2011-01-01       1   0     1   1        0        6           0   
2        3  2011-01-01       1   0     1   2        0        6           0   
3        4  2011-01-01       1   0     1   3        0        6           0   
4        5  2011-01-01       1   0     1   4        0        6           0   

   weathersit  temp   atemp   hum  windspeed  casual  registered  cnt  
0           1  0.24  0.2879  0.81        0.0       3          13   16  
1           1  0.22  0.2727  0.80        0.0       8          32   40  
2           1  0.22  0.2727  0.80        0.0       5          27   32  
3           1  0.24  0.2879  0.75        0.0       3          10   13  
4           1  0.24  0.2879  0.75        0.0       0           1    1

Each row represents one hour. Our target column is "cnt", the total number of bikes rented during that hour.

# Plotting "cnt" column
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(bike_rentals["cnt"])

(array([ 6972.,  3705.,  2659.,  1660.,   987.,   663.,   369.,   188.,
          139.,    37.]),
 array([   1. ,   98.6,  196.2,  293.8,  391.4,  489. ,  586.6,  684.2,
         781.8,  879.4,  977. ]),
 <a list of 10 Patch objects>)

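# Printing out how each numeric column correlates with the "cnt" column.
# "dteday" is a string column, so restrict to numeric columns first (recent pandas versions raise an error otherwise).
bike_rentals.select_dtypes(include="number").corr()["cnt"]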

instant       0.278379
season        0.178056
yr            0.250495
mnth          0.120638
hr            0.394071
holiday      -0.030927
weekday       0.026900
workingday    0.030284
weathersit   -0.142426
temp          0.404772
atemp         0.400929
hum          -0.322911
windspeed     0.093234
casual        0.694564
registered    0.972151
cnt           1.000000
Name: cnt, dtype: float64
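
The very high correlations for "casual" and "registered" have a simple explanation in the rows printed earlier: the two columns add up to "cnt". The sketch below checks this on those five rows only; whether it holds for the whole file is an assumption worth verifying on the full dataframe. If it does, both columns leak the target and should stay out of the feature list.

```python
import pandas as pd

# The five rows shown by bike_rentals.head(), restricted to the rental counts.
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# True when casual + registered equals cnt in every row of the sample.
components_match = (sample["casual"] + sample["registered"] == sample["cnt"]).all()
print(components_match)
```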

# Creating "time_label" column, which groups hours into parts of the day
# so our algorithm knows how certain hours are related:
# 1 = morning (6-11), 2 = afternoon (12-17), 3 = evening (18 onward), 4 = night (0-5).
def assign_label(hr):
    if 6 <= hr < 12:
        return 1
    elif 12 <= hr < 18:
        return 2
    elif 18 <= hr <= 24:
        return 3
    elif 0 <= hr < 6:
        return 4

bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
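
As a possible alternative to the hand-written function, the same binning can be sketched with pd.cut. This is only an illustration on a stand-alone Series of the hours 0 to 23 (an assumption about the range of "hr"); the bin edges and labels mirror assign_label.

```python
import pandas as pd

# Stand-in for bike_rentals["hr"], assuming hours run from 0 to 23.
hours = pd.Series(range(24), name="hr")

# Left-closed bins [0, 6), [6, 12), [12, 18), [18, 24) mapped to labels 4, 1, 2, 3.
time_label = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=[4, 1, 2, 3],
).astype(int)

print(time_label.tolist())
```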

Error Metric:

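Our target is numeric, so Mean Squared Error (MSE) will work well here. It averages the squared differences between predicted and actual values, so large misses are penalized more heavily than small ones.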
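
To make the metric concrete, here is a minimal sketch that computes MSE by hand and compares it with scikit-learn's mean_squared_error. The actual values are the first five "cnt" values shown earlier; the predictions are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([16, 40, 32, 13, 1])  # first five "cnt" values from the head
y_pred = np.array([20, 35, 30, 10, 5])  # hypothetical predictions

# MSE is the mean of the squared differences between actual and predicted values.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```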

# Splitting dataframe into train and test sets.
train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]

# Selecting columns to use in algorithm
cols = ["season", "yr", "mnth", "hr", "time_label", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"]

# Training and testing a Linear Regression model, and then determining error metric.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[cols], train["cnt"])
predictions = lr.predict(test[cols])

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test["cnt"], predictions)
print(mse)

17054.9594635

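This is a fairly high mean squared error: it corresponds to a root mean squared error of roughly 131 rentals per hour, while about 60% of the hours in the histogram above have fewer than 200 rentals. Linear Regression is probably not our best option, so next we'll try a decision tree.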

# Training and testing a Decision Tree model, and then determining error metric.
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(train[cols], train["cnt"])
predictions_dtr = dtr.predict(test[cols])

mse_dtr = mean_squared_error(test["cnt"], predictions_dtr)
print(mse_dtr)

3342.90491945

# Adjusting parameters of the DecisionTreeRegressor class to minimize model error.
dtr2 = DecisionTreeRegressor(max_depth=15, min_samples_leaf=3)
dtr2.fit(train[cols], train["cnt"])
predictions_dtr2 = dtr2.predict(test[cols])
mse_dtr2 = mean_squared_error(test["cnt"], predictions_dtr2)
print(mse_dtr2)

3170.10390078
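
The parameter values above were picked by hand. One common alternative, sketched here under stated assumptions, is a cross-validated grid search with scikit-learn's GridSearchCV. The synthetic data from make_regression and the candidate grid are placeholders for illustration, not the values or data used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train[cols] and train["cnt"].
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Hypothetical candidate values; adjust to the problem at hand.
param_grid = {"max_depth": [5, 10, 15], "min_samples_leaf": [1, 3, 5]}

# scikit-learn maximizes scores, so MSE is exposed as its negative.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE of the best combination
```

Because the search scores each combination on held-out folds of the training data, the test set stays untouched until the final evaluation.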

The Decision Tree model performed much better than the Linear Regression model, cutting the mean squared error from about 17,055 to about 3,170. Now we will try to create an even better model using Random Forest.

# Training and testing a Random Forest model, and then determining error metric.
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(train[cols], train["cnt"])
predictions_rfr = rfr.predict(test[cols])
mse_rfr = mean_squared_error(test["cnt"], predictions_rfr)
print(mse_rfr)

2203.00476058

# Adjusting parameters of the RandomForestRegressor class to minimize model error.
rfr2 = RandomForestRegressor(max_depth=17, min_samples_leaf=2)
rfr2.fit(train[cols], train["cnt"])
predictions_rfr2 = rfr2.predict(test[cols])
mse_rfr2 = mean_squared_error(test["cnt"], predictions_rfr2)
print(mse_rfr2)

2149.20005548

Random Forest models are typically among the more accurate models for making predictions, and as expected, our Random Forest model with adjusted parameters performed best, with a mean squared error of about 2,149.
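
To compare the models on the scale of the target, the reported MSE values can be converted to root mean squared error (RMSE), which is expressed in rentals per hour. The sketch below only reuses the numbers printed in this notebook; the model labels are names chosen here for readability.

```python
import math

# MSE values copied from the outputs printed earlier in this notebook.
mse_by_model = {
    "Linear Regression": 17054.9594635,
    "Decision Tree (default)": 3342.90491945,
    "Decision Tree (tuned)": 3170.10390078,
    "Random Forest (default)": 2203.00476058,
    "Random Forest (tuned)": 2149.20005548,
}

# RMSE is the square root of MSE, so it shares the units of "cnt".
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}
best_model = min(rmse_by_model, key=rmse_by_model.get)

# Print models from lowest to highest error.
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.1f}")
```

On this scale the tuned Random Forest is off by roughly 46 rentals per hour, compared with roughly 131 for Linear Regression.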