In this project, I assume the role of a data analytics consultant whose client is the New York City Taxi & Limousine Commission (New York City TLC). The New York City TLC has asked me to build a machine learning model that predicts whether a customer will not leave a tip. They want to use the model in an app that alerts taxi drivers to customers who are unlikely to tip, since drivers depend on tips. The overall purpose of the model is to find ways to generate more revenue for taxi cab drivers; after the ethical review below, the modeling goal becomes predicting whether or not a customer is a generous tipper.
To accomplish the project goal, I use two tree-based modeling techniques (random forest and XGBoost) to predict a binary target class.
As is standard practice for me, I follow a problem-solving framework called PACE, an acronym for Plan, Analyze, Construct, and Execute. Embedded within the PACE framework are three project tasks that I will be following.
Part 1: Ethical considerations
I'd like to think about the ethical implications of the project request.
Should the objective of the model be adjusted?
Part 2: Feature engineering
Part 3: Modeling
In the plan stage of the PACE framework, I wish to reflect on the following questions:
What am I being asked to do?
What are the ethical implications of the model? What are the consequences of my model making errors?
What is the likely effect of the model when it predicts a false negative (i.e., when the model says a customer will give a tip, but they actually won't)?
What is the likely effect of the model when it predicts a false positive (i.e., when the model says a customer will not give a tip, but they actually will)?
Do the benefits of such a model outweigh the potential problems?
Should I proceed with the request to build this model?
Can the objective be modified to make it less problematic?
Possible responses to these questions may be:
Question 1:
Predict if a customer will not leave a tip.
Question 2:
Drivers who didn't receive tips will probably be upset that the app told them a customer would leave a tip. If it happened often, drivers might not trust the app. Drivers are unlikely to pick up people who are predicted to not leave tips. Customers will have difficulty finding a taxi that will pick them up, and might get angry at the taxi company. Even when the model is correct, people who can't afford to tip will find it more difficult to get taxis, which limits the accessibility of taxi service to those who pay extra.
Question 3:
It's not good to disincentivize drivers from picking up customers. It could also cause a customer backlash. The problems seem to outweigh the benefits.
Question 4:
No. Effectively limiting equal access to taxis is ethically problematic, and carries a lot of risk.
Question 5:
We can build a model that predicts the most generous customers. This could accomplish the goal of helping taxi drivers increase their earnings from tips while preventing the wrongful exclusion of certain people from using taxis.
Suppose we were to modify the modeling objective so that, instead of predicting people who won't tip at all, we predict people who are particularly generous—those who will tip 20% or more. Consider the following questions:
Question 1: What features (variables) do you need to make this prediction?
Ideally, we'd have behavioral history for each customer, so we could know how much they tipped on previous taxi rides. We'd also want times, dates, and locations of both pickups and dropoffs, estimated fares, and payment method.
Question 2: What would be the target variable?
The target variable would be a binary variable (1 or 0) that indicates whether or not the customer is expected to tip ≥ 20%.
Question 3: What kind of machine learning task is this, and how could it be evaluated?
This is a supervised learning, classification task. We could use accuracy, precision, recall, F-score, area under the ROC curve, or a number of other metrics. However, we don't have enough information at this time to know which are most appropriate. We need to know the class balance of the target variable.
The data for this project comes from the New York City Taxi & Limousine Commission OpenData site and contains trip record submissions made by yellow taxi Technology Service Providers (TSPs). The TLC oversees over 200,000 taxi and limousine licensees, who make approximately one million combined trips per day. Each row of the data represents a single trip in a yellow taxi. The trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off taxi zone locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
Check the TLC OpenData site to learn more about the data and access the data file.
To begin, I import python packages and libraries needed to build and evaluate random forest and XGBoost classification models.
# Packages for manipulating data
import numpy as np
import pandas as pd
# Packages for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning packages for preparing data, building, and evaluating a classification model
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# This is the function that helps plot feature importance
from xgboost import plot_importance
# package to suppress unnecessary python warnings
import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
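# Note: in newer pandas releases, SettingWithCopyWarning is exposed in pandas.errors
# rather than pandas.core.common, so the import above may instead need to be:
# from pandas.errors import SettingWithCopyWarning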
# Use set_option to display all of the columns, preventing Jupyter from truncating them
pd.set_option('display.max_columns', None)
Next, I read in the data. There are two dataframes: one is the original dataset and the other is a dataset that I created from the original in a separate, previous project. The latter contains features like mean durations, mean distances, and predicted fares, which I believe are relevant to this modeling process.
# Load the original dataset into a dataframe called `df0`
path1 = "/Users/mluri/Library/CloudStorage/OneDrive-TheCollegeofWooster/Desktop/Projects/automatidata/2017_Yellow_Taxi_Trip_Data.csv"
df0 = pd.read_csv(path1)
# load the second dataset into a dataframe called 'nyc_preds_means'
path2 = "/Users/mluri/Library/CloudStorage/OneDrive-TheCollegeofWooster/Desktop/Projects/automatidata/nyc_preds_means.csv"
nyc_preds_means = pd.read_csv(path2)
Inspect the first few rows of df0.
# Inspect the first few rows of `df0`
df0.head()
| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 |
| 1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 |
| 2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 |
| 3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 |
| 4 | 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 17.80 |
Inspect the first few rows of nyc_preds_means.
# Inspect the first few rows of `nyc_preds_means`
nyc_preds_means.head()
| | mean_duration | mean_distance | predicted_fare |
|---|---|---|---|
| 0 | 22.847222 | 3.521667 | 16.434245 |
| 1 | 24.470370 | 3.108889 | 16.052218 |
| 2 | 7.250000 | 0.881429 | 7.053706 |
| 3 | 30.250000 | 3.700000 | 18.731650 |
| 4 | 14.616667 | 4.435000 | 15.845642 |
Next, I join the two dataframes on their indices.
# Merge datasets
df0 = df0.merge(nyc_preds_means, left_index=True, right_index=True)
df0.head()
| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 |
| 1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 |
| 2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 |
| 3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 |
| 4 | 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 17.80 | 14.616667 | 4.435000 | 15.845642 |
This concludes the Plan stage of the PACE Framework. Next, I'll proceed to the Analyze stage.
The Analyze stage of the PACE framework entails data preparation and exploration. This may involve feature engineering, which is what I do next. I will not perform a detailed exploratory data analysis (EDA) here because I already did that on this data in a previous project.
# Let's start by getting some general information about the data
df0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Unnamed: 0             22699 non-null  int64
 1   VendorID               22699 non-null  int64
 2   tpep_pickup_datetime   22699 non-null  object
 3   tpep_dropoff_datetime  22699 non-null  object
 4   passenger_count        22699 non-null  int64
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64
 7   store_and_fwd_flag     22699 non-null  object
 8   PULocationID           22699 non-null  int64
 9   DOLocationID           22699 non-null  int64
 10  payment_type           22699 non-null  int64
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  tolls_amount           22699 non-null  float64
 16  improvement_surcharge  22699 non-null  float64
 17  total_amount           22699 non-null  float64
 18  mean_duration          22699 non-null  float64
 19  mean_distance          22699 non-null  float64
 20  predicted_fare         22699 non-null  float64
dtypes: float64(11), int64(7), object(3)
memory usage: 3.6+ MB
There are no missing values in the data. In addition, the data contains 22,699 rows and 21 columns. Of the 21 features, 11 are floats, 7 are integers, and 3 are strings (objects).
# Let's take a second to look at the `payment_type` variable to understand the types of payments customers make
df0['payment_type'].unique()
array([1, 2, 3, 4])
Note: In the dataset, payment_type is encoded as integers: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown.
From the EDA I previously performed, I discovered that customers who pay cash generally have a tip amount of $0. Therefore, to meet the modeling objective, I'll subset the data to include only the customers who pay with a credit card.
# Subset the data to isolate only customers who paid by credit card
df1 = df0[df0['payment_type']==1]
df1.head()
| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 |
| 1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 |
| 2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 |
| 3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 |
| 5 | 23345809 | 2 | 03/25/2017 8:34:11 PM | 03/25/2017 8:42:11 PM | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 |
Notice that there isn't a column that indicates tip percent, which is what we need to create the target variable, so I'll have to engineer it.
I'll do this by adding a tip_percent column to the dataframe, calculated as the tip amount divided by the pre-tip cost of the trip (total_amount minus tip_amount):
# Create tip % column
df1['tip_percent'] = round(df1['tip_amount'] / (df1['total_amount'] - df1['tip_amount']), 3)
Now I create another column called generous. This will be the target variable: a binary indicator of whether or not a customer tipped ≥ 20% (0 = no, 1 = yes). I start by making generous a copy of the tip_percent column, then convert it to binary.
# Create 'generous' col (target)
df1['generous'] = df1['tip_percent']
df1['generous'] = (df1['generous'] >= 0.2)
df1['generous'] = df1['generous'].astype(int)
Next, I'll convert the tpep_pickup_datetime and tpep_dropoff_datetime columns to datetime.
# Convert pickup and dropoff cols to datetime
df1['tpep_pickup_datetime'] = pd.to_datetime(df1['tpep_pickup_datetime'], format='%m/%d/%Y %I:%M:%S %p')
df1['tpep_dropoff_datetime'] = pd.to_datetime(df1['tpep_dropoff_datetime'], format='%m/%d/%Y %I:%M:%S %p')
Next, I'll create a day column that contains only the day of the week when each passenger was picked up.
# Create a 'day' col
df1['day'] = df1['tpep_pickup_datetime'].dt.day_name().str.lower()
Next, I engineer four new columns that represent time of day bins. Each column will contain binary values (0=no, 1=yes) that indicate whether a trip began (picked up) during the following times:
am_rush = [06:00–10:00)
daytime = [10:00–16:00)
pm_rush = [16:00–20:00)
nighttime = [20:00–06:00)
To do this, first I create the four columns. For now, each of the new columns will be identical and contain the same information: the hour (only) from the tpep_pickup_datetime column.
# Create 'am_rush' col
df1['am_rush'] = df1['tpep_pickup_datetime'].dt.hour
# Create 'daytime' col
df1['daytime'] = df1['tpep_pickup_datetime'].dt.hour
# Create 'pm_rush' col
df1['pm_rush'] = df1['tpep_pickup_datetime'].dt.hour
# Create 'nighttime' col
df1['nighttime'] = df1['tpep_pickup_datetime'].dt.hour
I'll now write four functions to convert each new column to binary (0/1), beginning with am_rush. If the pickup hour falls in [06:00–10:00), the function returns 1; otherwise, it returns 0.
# Define 'am_rush()' conversion function [06:00–10:00)
def am_rush(row):
    # 'row' is a full dataframe row; the 'am_rush' column currently holds the pickup hour
    if 6 <= row['am_rush'] < 10:
        val = 1
    else:
        val = 0
    return val
Next, I'll apply the am_rush() function row-wise to the dataframe to convert the am_rush column. I print the first five values of the column to make sure it did what I expected it to do.
# Apply 'am_rush' function to the 'am_rush' series
df1['am_rush'] = df1.apply(am_rush, axis=1)
df1['am_rush'].head()
0    1
1    0
2    1
3    0
5    0
Name: am_rush, dtype: int64
Finally, I repeat the process by writing similar functions for the three remaining columns and applying them row-wise to convert each one.
# Define 'daytime()' conversion function [10:00–16:00)
def daytime(row):
    if 10 <= row['daytime'] < 16:
        val = 1
    else:
        val = 0
    return val
# Apply 'daytime' function to the 'daytime' series
df1['daytime'] = df1.apply(daytime, axis=1)
# Define 'pm_rush()' conversion function [16:00–20:00)
def pm_rush(row):
    if 16 <= row['pm_rush'] < 20:
        val = 1
    else:
        val = 0
    return val
# Apply 'pm_rush' function to the 'pm_rush' series
df1['pm_rush'] = df1.apply(pm_rush, axis=1)
# Define 'nighttime()' conversion function [20:00–06:00)
def nighttime(row):
    if 20 <= row['nighttime'] < 24:
        val = 1
    elif 0 <= row['nighttime'] < 6:
        val = 1
    else:
        val = 0
    return val
# Apply 'nighttime' function to the 'nighttime' series
df1['nighttime'] = df1.apply(nighttime, axis=1)
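As an aside, the same four time-of-day flags can be produced without a row-wise apply. A minimal vectorized sketch with equivalent logic, assuming the pickup hour is recomputed from tpep_pickup_datetime:
# Vectorized alternative to the four apply() calls above
hour = df1['tpep_pickup_datetime'].dt.hour
df1['am_rush'] = ((hour >= 6) & (hour < 10)).astype(int)
df1['daytime'] = ((hour >= 10) & (hour < 16)).astype(int)
df1['pm_rush'] = ((hour >= 16) & (hour < 20)).astype(int)
df1['nighttime'] = ((hour >= 20) | (hour < 6)).astype(int)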
Next, I'll create a month column that contains only the abbreviated name of the month when each passenger was picked up, then convert the result to lowercase.
# Create 'month' colunm
df1['month'] = df1['tpep_pickup_datetime'].dt.strftime('%b').str.lower()
Examine the first five rows of the dataframe up to this point.
df1.head()
| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | tip_percent | generous | day | am_rush | daytime | pm_rush | nighttime | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24870114 | 2 | 2017-03-25 08:55:43 | 2017-03-25 09:09:47 | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 | 0.200 | 1 | saturday | 1 | 0 | 0 | 0 | mar |
| 1 | 35634249 | 1 | 2017-04-11 14:53:28 | 2017-04-11 15:19:58 | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 | 0.238 | 1 | tuesday | 0 | 1 | 0 | 0 | apr |
| 2 | 106203690 | 1 | 2017-12-15 07:26:56 | 2017-12-15 07:34:08 | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 | 0.199 | 0 | friday | 1 | 0 | 0 | 0 | dec |
| 3 | 38942136 | 2 | 2017-05-07 13:17:59 | 2017-05-07 13:48:14 | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 | 0.300 | 1 | sunday | 0 | 1 | 0 | 0 | may |
| 5 | 23345809 | 2 | 2017-03-25 20:34:11 | 2017-03-25 20:42:11 | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 | 0.200 | 1 | saturday | 0 | 0 | 0 | 1 | mar |
I'll drop redundant and irrelevant columns as well as those that would not be available when the model is deployed. This includes information like payment type, trip distance, tip amount, tip percentage, total amount, toll amount, etc. The target variable (generous) must remain in the data because it will get isolated as the y data for modeling.
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15265 entries, 0 to 22698
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Unnamed: 0             15265 non-null  int64
 1   VendorID               15265 non-null  int64
 2   tpep_pickup_datetime   15265 non-null  datetime64[ns]
 3   tpep_dropoff_datetime  15265 non-null  datetime64[ns]
 4   passenger_count        15265 non-null  int64
 5   trip_distance          15265 non-null  float64
 6   RatecodeID             15265 non-null  int64
 7   store_and_fwd_flag     15265 non-null  object
 8   PULocationID           15265 non-null  int64
 9   DOLocationID           15265 non-null  int64
 10  payment_type           15265 non-null  int64
 11  fare_amount            15265 non-null  float64
 12  extra                  15265 non-null  float64
 13  mta_tax                15265 non-null  float64
 14  tip_amount             15265 non-null  float64
 15  tolls_amount           15265 non-null  float64
 16  improvement_surcharge  15265 non-null  float64
 17  total_amount           15265 non-null  float64
 18  mean_duration          15265 non-null  float64
 19  mean_distance          15265 non-null  float64
 20  predicted_fare         15265 non-null  float64
 21  tip_percent            15262 non-null  float64
 22  generous               15265 non-null  int64
 23  day                    15265 non-null  object
 24  am_rush                15265 non-null  int64
 25  daytime                15265 non-null  int64
 26  pm_rush                15265 non-null  int64
 27  nighttime              15265 non-null  int64
 28  month                  15265 non-null  object
dtypes: datetime64[ns](2), float64(12), int64(12), object(3)
memory usage: 3.5+ MB
# Drop columns
drop_cols = ['Unnamed: 0', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
             'payment_type', 'trip_distance', 'store_and_fwd_flag',
             'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
             'improvement_surcharge', 'total_amount', 'tip_percent']
df1 = df1.drop(drop_cols, axis=1)
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15265 entries, 0 to 22698
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   VendorID         15265 non-null  int64
 1   passenger_count  15265 non-null  int64
 2   RatecodeID       15265 non-null  int64
 3   PULocationID     15265 non-null  int64
 4   DOLocationID     15265 non-null  int64
 5   mean_duration    15265 non-null  float64
 6   mean_distance    15265 non-null  float64
 7   predicted_fare   15265 non-null  float64
 8   generous         15265 non-null  int64
 9   day              15265 non-null  object
 10  am_rush          15265 non-null  int64
 11  daytime          15265 non-null  int64
 12  pm_rush          15265 non-null  int64
 13  nighttime        15265 non-null  int64
 14  month            15265 non-null  object
dtypes: float64(3), int64(10), object(2)
memory usage: 1.9+ MB
Many of the columns are categorical and will need to be dummied (converted to binary columns). Some columns are numeric but actually encode categorical information, such as RatecodeID and the pickup and dropoff location IDs. To make these columns recognizable to the get_dummies() function as categorical variables, I'll first convert them to strings.
First, I define cols_to_str, a list of the numeric columns that contain categorical information and must be converted to string: RatecodeID, PULocationID, DOLocationID, and VendorID. Then I convert each column in cols_to_str to string.
# 1. Define list of cols to convert to string
cols_to_str = ['RatecodeID', 'PULocationID', 'DOLocationID', 'VendorID']
# 2. Convert each column to string
for col in cols_to_str:
df1[col] = df1[col].astype('str')
Next, I convert all the categorical columns to binary by calling get_dummies() on the dataframe and assigning the result to a new dataframe called df2.
# Convert categoricals to binary
df2 = pd.get_dummies(df1, drop_first=True)
df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15265 entries, 0 to 22698
Columns: 347 entries, passenger_count to month_sep
dtypes: float64(3), int64(6), uint8(338)
memory usage: 6.1 MB
Before modeling, I must decide on an evaluation metric.
# Get class balance of 'generous' col
df2['generous'].value_counts(normalize=True)
1    0.526368
0    0.473632
Name: generous, dtype: float64
A little over half of the customers in this dataset were "generous" (tipped ≥ 20%). The dataset is very nearly balanced.
To determine a metric, it is important to consider the cost of two kinds of model error:
False positives are worse for cab drivers, because they would pick up a customer expecting a good tip and then not receive one, frustrating the driver.
False negatives are worse for customers, because a cab driver would likely pick up a different customer who was predicted to tip more—even when the original customer would have tipped generously.
The stakes are relatively even. We want to help taxi drivers make more money, but we don't want this to anger customers. Our metric should therefore weigh precision and recall equally. Which metric is this? The F1 score: it is the harmonic mean of precision and recall, so it places equal weight on both.
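For reference, a minimal sketch on toy labels (not this project's data) showing that relationship with scikit-learn's metrics:
from sklearn.metrics import precision_score, recall_score, f1_score
# Toy labels purely for illustration
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
# F1 equals the harmonic mean of precision and recall
assert abs(f1_score(y_true, y_pred) - 2 * p * r / (p + r)) < 1e-12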
I am now ready to move on to the Construct stage of the PACE workflow framework, which entails model building and evaluation.
The only remaining preprocessing step is to split the data into features/target variable and training/testing data.
I create y to isolate the target variable (generous), X to isolate the features, and then split both into training and test sets.
# Isolate target variable (y)
y = df2['generous']
# Isolate the features (X)
X = df2.drop('generous', axis=1)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
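As a quick optional sanity check, stratifying the split should preserve the class balance in both the training and test sets:
# Class proportions should be nearly identical in the train and test splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))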
I'll build a random forest model and an XGBoost model and compare their performance.
I'll begin by using GridSearchCV to tune a random forest model. Here are the steps:
1. Instantiate the random forest classifier rf and set the random state.
2. Create a dictionary cv_params of hyperparameters and corresponding values to tune: max_depth, max_features, max_samples, min_samples_leaf, min_samples_split, and n_estimators. The more values we search over, the better the model is likely to fit, but the longer the search will take.
3. Define a set scoring of scoring metrics for GridSearchCV to capture (precision, recall, F1 score, and accuracy).
4. Instantiate the GridSearchCV object rf1, passing it rf, cv_params, scoring, cv=4, and refit='f1'.
# 1. Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)
# 2. Create a dictionary of hyperparameters to tune
# Note that this example only contains 1 value for each parameter for simplicity,
# but you should assign a dictionary with ranges of values
cv_params = {'max_depth': [None],
'max_features': [1.0],
'max_samples': [0.7],
'min_samples_leaf': [1],
'min_samples_split': [2],
'n_estimators': [300]
}
# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}
# 4. Instantiate the GridSearchCV object
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='f1')
Now we fit the model to the training data.
Note: Depending on how many options we include in our search grid and the number of cross-validation folds we select, this could take a very long time—even hours. If we use 4-fold validation, include only one possible value for each hyperparameter, and grow 300 trees to full depth, it should take about 5 minutes. If we added a second value for GridSearch to check, say for min_samples_split (so every hyperparameter has one value except min_samples_split, which has two), the search time would double to ~10 minutes. Each additional parameter value roughly doubles the time; the sketch below makes this scaling concrete.
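Here is a small illustration, using a purely hypothetical larger grid, of how the number of model fits grows with the grid size and the number of CV folds:
from itertools import product
# Hypothetical, larger grid; values are for illustration only
cv_params_full = {'max_depth': [None, 10, 20],
                  'max_features': [0.5, 1.0],
                  'max_samples': [0.7, 1.0],
                  'min_samples_leaf': [1, 2],
                  'min_samples_split': [2, 4],
                  'n_estimators': [300]}
n_candidates = len(list(product(*cv_params_full.values())))  # 3*2*2*2*2*1 = 48 combinations
print(n_candidates * 4)  # with cv=4, GridSearchCV would perform 192 fits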
%%time
rf1.fit(X_train, y_train)
CPU times: user 2min 6s, sys: 427 ms, total: 2min 6s Wall time: 2min 6s
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [None], 'max_features': [1.0],
                         'max_samples': [0.7], 'min_samples_leaf': [1],
                         'min_samples_split': [2], 'n_estimators': [300]},
             refit='f1', scoring={'precision', 'recall', 'accuracy', 'f1'})
If I want, I can use pickle to save my models and read them back in. This is particularly helpful when performing a search over many possible hyperparameter values.
import pickle
# Define a path to the folder where you want to save the model
path = '/Users/mluri/Library/CloudStorage/OneDrive-TheCollegeofWooster/Desktop/Projects/automatidata'
def write_pickle(path, model_object, save_name:str):
    '''
    Save model_object to <path>/<save_name>.pickle.
    save_name is a string (file name without extension).
    '''
    with open(path + '/' + save_name + '.pickle', 'wb') as to_write:
        pickle.dump(model_object, to_write)

def read_pickle(path, saved_model_name:str):
    '''
    Read a saved model object back in from <path>/<saved_model_name>.pickle.
    saved_model_name is a string (file name without extension).
    '''
    with open(path + '/' + saved_model_name + '.pickle', 'rb') as to_read:
        model = pickle.load(to_read)
    return model
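A quick usage example of these helpers, assuming the path variable defined above (the file name rf1_cv is arbitrary):
# Save the fitted GridSearchCV object, then load it back later instead of re-fitting
write_pickle(path, rf1, 'rf1_cv')
rf1 = read_pickle(path, 'rf1_cv')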
Examine the best average score across all the validation folds.
# Examine best score
rf1.best_score_
0.7133701077670043
Examine the best combination of hyperparameters.
rf1.best_params_
{'max_depth': None,
'max_features': 1.0,
'max_samples': 0.7,
'min_samples_leaf': 1,
'min_samples_split': 2,
'n_estimators': 300}
I'll use the make_results() function to output all of the scores of our model. Note that it accepts three arguments.
def make_results(model_name:str, model_object, metric:str):
'''
Arguments:
model_name (string): what you want the model to be called in the output table
model_object: a fit GridSearchCV object
metric (string): precision, recall, f1, or accuracy
Returns a pandas df with the F1, recall, precision, and accuracy scores
for the model with the best mean 'metric' score across all validation folds.
'''
# Create dictionary that maps input metric to actual metric name in GridSearchCV
metric_dict = {'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1',
'accuracy': 'mean_test_accuracy',
}
# Get all the results from the CV and put them in a df
cv_results = pd.DataFrame(model_object.cv_results_)
# Isolate the row of the df with the max(metric) score
best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
# Extract Accuracy, precision, recall, and f1 score from that row
f1 = best_estimator_results.mean_test_f1
recall = best_estimator_results.mean_test_recall
precision = best_estimator_results.mean_test_precision
accuracy = best_estimator_results.mean_test_accuracy
# Create table of results
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy],
},
)
return table
# Call 'make_results()' on the GridSearch object
results = make_results('RF CV', rf1, 'f1')
results
| | model | precision | recall | F1 | accuracy |
|---|---|---|---|---|---|
| 0 | RF CV | 0.675349 | 0.756223 | 0.71337 | 0.680314 |
This is an acceptable model across the board. Typically, scores of 0.65 or better are considered acceptable, but this always depends on the use case. I could try to improve the scores further, particularly by searching over a wider range of hyperparameter values.
I'll now use the model to predict on the test data, assigning the results to a variable called rf_preds.
# Get scores on test data
rf_preds = rf1.best_estimator_.predict(X_test)
I'll use the get_test_scores() function below to output the scores of the model on the test data.
def get_test_scores(model_name:str, preds, y_test_data):
'''
Generate a table of test scores.
In:
model_name (string): Your choice: how the model will be named in the output table
preds: numpy array of test predictions
y_test_data: numpy array of y_test data
Out:
table: a pandas df of precision, recall, f1, and accuracy scores for your model
'''
accuracy = accuracy_score(y_test_data, preds)
precision = precision_score(y_test_data, preds)
recall = recall_score(y_test_data, preds)
f1 = f1_score(y_test_data, preds)
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy]
})
return table
I use the get_test_scores() function to generate the scores on the test data, assign the results to rf_test_scores, and append them to the results table.
# Get scores on test data
rf_test_scores = get_test_scores('RF test', rf_preds, y_test)
results = pd.concat([results, rf_test_scores], axis=0)
results
| | model | precision | recall | F1 | accuracy |
|---|---|---|---|---|---|
| 0 | RF CV | 0.675349 | 0.756223 | 0.713370 | 0.680314 |
| 0 | RF test | 0.674419 | 0.775980 | 0.721644 | 0.684900 |
The table above shows the tuned random forest model evaluated two ways: with four-fold cross-validation (RF CV) and on the held-out test set (RF test). The test scores are very close to the cross-validation scores (within ~0.02 on every metric), which suggests the model generalizes well to unseen data.
I'll now try to improve my model scores using an XGBoost model. Here are the steps:
1. Instantiate the XGBoost classifier xgb, set objective='binary:logistic', and set the random state.
2. Create a dictionary cv_params of hyperparameters and corresponding values to tune: learning_rate, max_depth, min_child_weight, and n_estimators.
3. Define a set scoring of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).
4. Instantiate the GridSearchCV object xgb1, passing it xgb, cv_params, scoring, cv=4, and refit='f1'.
# 1. Instantiate the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=0)
# 2. Create a dictionary of hyperparameters to tune
# Note that this example only contains 1 value for each parameter for simplicity,
# but you should assign a dictionary with ranges of values
cv_params = {'learning_rate': [0.1],
'max_depth': [8],
'min_child_weight': [2],
'n_estimators': [500]
}
# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}
# 4. Instantiate the GridSearchCV object
xgb1 = GridSearchCV(xgb, cv_params, scoring=scoring, cv=4, refit='f1')
Next, I fit the model to the X_train and y_train data.
%%time
xgb1.fit(X_train, y_train)
CPU times: user 57.9 s, sys: 30.3 s, total: 1min 28s Wall time: 17.1 s
GridSearchCV(cv=4,
             estimator=XGBClassifier(base_score=None, booster=None, callbacks=None,
                                     colsample_bylevel=None, colsample_bynode=None,
                                     colsample_bytree=None, device=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     feature_types=None, gamma=None, ...,
                                     random_state=0, ...),
             param_grid={'learning_rate': [0.1], 'max_depth': [8],
                         'min_child_weight': [2], 'n_estimators': [500]},
             refit='f1', scoring={'precision', 'recall', 'accuracy', 'f1'})
Get the best score from this model.
# Examine best score
xgb1.best_score_
0.6955124635485908
And the best parameters.
# Examine best parameters
xgb1.best_params_
{'learning_rate': 0.1,
'max_depth': 8,
'min_child_weight': 2,
'n_estimators': 500}
I use the make_results() function again to output all of the scores of the XGBoost model.
# Call 'make_results()' on the GridSearch object
xgb1_cv_results = make_results('XGB CV', xgb1, 'f1')
results = pd.concat([results, xgb1_cv_results], axis=0)
results
| | model | precision | recall | F1 | accuracy |
|---|---|---|---|---|---|
| 0 | RF CV | 0.675349 | 0.756223 | 0.713370 | 0.680314 |
| 0 | RF test | 0.674419 | 0.775980 | 0.721644 | 0.684900 |
| 0 | XGB CV | 0.669726 | 0.723553 | 0.695512 | 0.666557 |
I'll now use this model to predict on the test data, assigning the results to a variable called xgb_preds.
# Get scores on test data
xgb_preds = xgb1.best_estimator_.predict(X_test)
I use the get_test_scores() function to generate the scores on the test data, assign the results to xgb_test_scores, and append them to the results table.
# Get scores on test data
xgb_test_scores = get_test_scores('XGB test', xgb_preds, y_test)
results = pd.concat([results, xgb_test_scores], axis=0)
results
| | model | precision | recall | F1 | accuracy |
|---|---|---|---|---|---|
| 0 | RF CV | 0.675349 | 0.756223 | 0.713370 | 0.680314 |
| 0 | RF test | 0.674419 | 0.775980 | 0.721644 | 0.684900 |
| 0 | XGB CV | 0.669726 | 0.723553 | 0.695512 | 0.666557 |
| 0 | XGB test | 0.677219 | 0.745488 | 0.709716 | 0.679004 |
The F1 score is ~0.01 lower for the XGBoost model than for the random forest model. Both models are acceptable, but the random forest model performs slightly better.
Now, I'll plot a confusion matrix of my top model's predictions on the test data.
# Generate array of values for confusion matrix
cm = confusion_matrix(y_test, rf_preds, labels=rf1.classes_)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=rf1.classes_,
)
disp.plot(values_format='');
The model is almost twice as likely to predict a false positive as a false negative, so type I errors are more common. This is less desirable, because it's better for a driver to be pleasantly surprised by a generous tip when they weren't expecting one than to be disappointed by a low tip when they were expecting a generous one. However, the overall performance of this model is satisfactory.
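For reference, the false positive and false negative counts can be read directly from the confusion matrix array computed above; scikit-learn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]:
# Unpack the 2x2 confusion matrix computed above
tn, fp, fn, tp = cm.ravel()
print(f'False positives: {fp}, false negatives: {fn}, ratio: {fp / fn:.2f}')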
Next, I'll use the feature_importances_ attribute of the best estimator object to inspect the features of my final model. I can then sort them and plot the most important ones.
importances = rf1.best_estimator_.feature_importances_
rf_importances = pd.Series(importances, index=X_test.columns)
rf_importances = rf_importances.sort_values(ascending=False)[:15]
fig, ax = plt.subplots(figsize=(8,5))
rf_importances.plot.bar(ax=ax)
ax.set_title('Feature importances')
ax.set_ylabel('Mean decrease in impurity')
fig.tight_layout();
The Execute stage of the PACE workflow framework involves presenting the model to stakeholders, receiving feedback and possibly reworking the model.
In this step, I use the results of the models above to formulate a conclusion.
Would I recommend using this model?
Yes, this model performs acceptably. Its test F1 score was 0.7216 and its overall accuracy was 0.6849. It correctly identified ~78% of the actual generous tippers in the test set (recall of 0.776), which is roughly 48% better than a random guess at the ~53% base rate of generous tippers. It may be worthwhile to test the model with a select group of taxi drivers to get feedback.
What was my highest scoring model doing? Or how was it making predictions?
Unfortunately, the random forest model, which was my highest-performing model, is not the most transparent machine learning algorithm. I know that VendorID, predicted_fare, mean_duration, and mean_distance are the most important features, but I don't know how they influence tipping; this would require further exploration. It is interesting that VendorID is the most predictive feature. This seems to indicate that one of the two vendors tends to attract more generous customers. It may be worth performing statistical tests on the different vendors to examine this further, as sketched below.
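As a sketch of such a test (assuming the pre-dummied dataframe df1, where VendorID is still a single column, is available, and that scipy is installed even though it was not imported earlier), a chi-square test of independence between vendor and the generous flag could look like this:
from scipy.stats import chi2_contingency
# Contingency table of vendor vs. whether the customer tipped >= 20%
contingency = pd.crosstab(df1['VendorID'], df1['generous'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f'chi2 = {chi2:.2f}, p-value = {p_value:.4f}')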
Are there new features that I could engineer that may improve model performance?
There are almost always additional features that can be engineered, but hopefully the most obvious ones were generated during the first round of modeling. In my case, I could try creating three new columns that indicate if the trip distance is short, medium, or far. I could also engineer a column that gives a ratio representing (the amount of money from the fare amount to the nearest higher multiple of $5) / fare amount. For example, if the fare were $12, the value in this column would be 0.25, because the amount from $12 to the nearest higher multiple of $5 ($15) is $3, and $3 divided by $12 is 0.25. The intuition for this feature is that people might be likely to simply round up their tip, so journeys with fares that have values just under a multiple of $5 may have lower tip percentages than those with fare values just over a multiple of $5. I could also do the same thing for fares to the nearest $10.
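A minimal sketch of how that rounding-gap feature could be engineered (the column name round_up_gap_5 is hypothetical; it would need to be computed from fare_amount before that column is dropped, e.g. on df0):
# Hypothetical feature: fraction of the fare needed to reach the next higher multiple of $5
fare = df0['fare_amount']
next_multiple_of_5 = (np.floor(fare / 5) + 1) * 5           # e.g. $12 -> $15, $15 -> $20
df0['round_up_gap_5'] = (next_multiple_of_5 - fare) / fare  # e.g. ($15 - $12) / $12 = 0.25
# Note: zero or negative fares would need to be filtered out before dividing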
What features would I want to have that would likely improve the performance of my model?
It would probably be very helpful to have past tipping behavior for each customer. It would also be valuable to have accurate tip values for customers who pay with cash.
It would be helpful to have a lot more data. With enough data, I could create a unique feature for each pickup/dropoff combination.