Car Insurance Prediction Modeling using Python

Descriptive Analysis, Statistics and Machine Learning Models for prediction using Python on an Insurance Company Dataset.

Posted Aug 13, 2024 Updated Aug 19, 2024

By Prasanna Venkatesan

24 min read

Objective

An insurance company that has been providing health insurance to its customers now wants to build a model to predict whether its past policyholders would also be interested in the Vehicle Insurance it offers.

Building this model would help the company in its communication strategy to reach out to those customers and optimize its business model and revenue.

To predict whether a customer would be interested in Vehicle Insurance, the following fields are available in the dataset:

id
Gender
Age
Driving_License - 0 : Customer does not have DL, 1 : Customer already has DL
Region_Code
Previously_Insured - 1 : Customer already has Vehicle Insurance, 0 : Customer doesn’t have Vehicle Insurance
Vehicle_Age
Vehicle_Damage - 1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn’t get his/her vehicle damaged in the past
Annual_Premium - The amount customer needs to pay as premium in the year
Policy_Sales_Channel - Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
Vintage - Number of Days, Customer has been associated with the company
Response - 1 : Customer is interested, 0 : Customer is not interested

In this report, descriptive analytics and statistics are produced along with data preprocessing for the dataset. You can find the GitHub repo for this project here.

1. Load the data

  
#the required libraries for the tasks are imported
import numpy as np           #for efficient numerical operations
import pandas as pd          #for manipulating and visualising data
import matplotlib.pyplot as plt   #for data visualization
import seaborn as sns             #for data visualization

#load the dataset
dataset = pd.read_csv('dataset.csv')
dataset.head(5)

	id	Gender	Age	Driving_License	Region_Code	Previously_Insured	Vehicle_Age	Vehicle_Damage	Annual_Premium	Policy_Sales_Channel	Vintage	Response
0	1	Male	44	1	28	0	> 2 Years	Yes	40454	26	217	1
1	2	Male	76	1	3	0	1-2 Year	No	33536	26	183	0
2	3	Male	47	1	28	0	> 2 Years	Yes	38294	26	27	1
3	4	Male	21	1	11	1	< 1 Year	No	28619	152	203	0
4	5	Female	29	1	41	1	< 1 Year	No	27496	152	39	0

2. Split the data into Training and Testing Dataset

  
#import scikit-learn library for machine learning and also for sampling the dataset
from sklearn.model_selection import train_test_split

#the dataset is split into train and test dataset using random sampling
train, test = train_test_split(dataset, test_size=0.4, random_state = 7)

print(f"{train.shape[0]} train and {test.shape[0]} test instances")

228665 train and 152444 test instances

The dataset is split into a train set and a test set. Only the train set is explored in this report, to avoid data snooping.
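
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the share of interested customers the same in both parts. The sketch below is only an illustration on a small synthetic frame: the names `toy`, `toy_train` and `toy_test` and the data are assumptions, not part of the project, and the split used in this post is the plain random one shown above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the real dataset: 100 rows, 10% positive responses
toy = pd.DataFrame({
    "Age": list(range(20, 120)),
    "Response": [1] * 10 + [0] * 90,
})

# stratify on the target so both splits keep the same share of positives
toy_train, toy_test = train_test_split(
    toy, test_size=0.4, random_state=7, stratify=toy["Response"]
)

print(toy_train["Response"].mean(), toy_test["Response"].mean())
```

With `stratify` set, the two printed means should match each other closely, whereas a plain random split can drift on small or very imbalanced data.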

  
#to provide information on the train dataset
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 228665 entries, 301388 to 61615
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   id                    228665 non-null  int64 
 1   Gender                228665 non-null  object
 2   Age                   228665 non-null  int64 
 3   Driving_License       228665 non-null  int64 
 4   Region_Code           228665 non-null  int64 
 5   Previously_Insured    228665 non-null  int64 
 6   Vehicle_Age           228665 non-null  object
 7   Vehicle_Damage        228665 non-null  object
 8   Annual_Premium        228665 non-null  int64 
 9   Policy_Sales_Channel  228665 non-null  int64 
 10  Vintage               228665 non-null  int64 
 11  Response              228665 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 22.7+ MB

  
#to provide information on the test dataset
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 152444 entries, 87112 to 221223
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   id                    152444 non-null  int64 
 1   Gender                152444 non-null  object
 2   Age                   152444 non-null  int64 
 3   Driving_License       152444 non-null  int64 
 4   Region_Code           152444 non-null  int64 
 5   Previously_Insured    152444 non-null  int64 
 6   Vehicle_Age           152444 non-null  object
 7   Vehicle_Damage        152444 non-null  object
 8   Annual_Premium        152444 non-null  int64 
 9   Policy_Sales_Channel  152444 non-null  int64 
 10  Vintage               152444 non-null  int64 
 11  Response              152444 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 15.1+ MB

Information on the train and test dataset can be seen above.

  
#to check for null values
train.isnull().sum()

id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

  
#to check for null values
test.isnull().sum()

id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

There are no missing values in the train and test dataset.

3. Exploratory Data Analysis

  
#descriptive statistics on the train dataset
train.describe()

	id	Age	Driving_License	Region_Code	Previously_Insured	Annual_Premium	Policy_Sales_Channel	Vintage	Response
count	228665.000000	228665.000000	228665.000000	228665.000000	228665.000000	228665.000000	228665.000000	228665.000000	228665.000000
mean	190652.209538	38.828557	0.997853	26.397398	0.458269	30595.860070	112.027350	154.434920	0.122940
std	110061.420927	15.535094	0.046289	13.222463	0.498257	17292.567811	54.194537	83.736668	0.328369
min	1.000000	20.000000	0.000000	0.000000	0.000000	2630.000000	1.000000	10.000000	0.000000
25%	95377.000000	25.000000	1.000000	15.000000	0.000000	24432.000000	29.000000	82.000000	0.000000
50%	190495.000000	36.000000	1.000000	28.000000	0.000000	31692.000000	131.000000	154.000000	0.000000
75%	286007.000000	49.000000	1.000000	35.000000	1.000000	39432.000000	152.000000	227.000000	0.000000
max	381107.000000	85.000000	1.000000	52.000000	1.000000	540165.000000	163.000000	299.000000	1.000000

Descriptive statistics for the train dataset are shown above.

  
#plot correlation matrix of the numeric columns (text columns are excluded)
plt.figure(figsize=(16,6))
ht=sns.heatmap(train.select_dtypes('number').corr(), vmin=-1, vmax=1, annot=True)
ht.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

The attributes most strongly correlated with “Response” are “Previously_Insured”, “Policy_Sales_Channel” and “Age”, in that order.
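
The ranking can also be read off numerically instead of from the heatmap. The following is a minimal sketch on made-up values (the `toy` frame is an assumption for illustration; on the real data the same chain would start from `train`). It ranks predictors by the absolute value of their correlation with the target, since a strong negative correlation is as informative as a strong positive one.

```python
import pandas as pd

# toy frame; column names follow the dataset, values are made up
toy = pd.DataFrame({
    "Previously_Insured": [1, 1, 0, 0, 1, 0, 0, 1],
    "Age":                [22, 25, 47, 51, 30, 44, 39, 23],
    "Vintage":            [10, 200, 150, 30, 90, 60, 250, 120],
    "Gender":             ["Male", "Female"] * 4,
    "Response":           [0, 0, 1, 1, 0, 1, 0, 0],
})

# keep numeric columns only, then take each predictor's correlation with the target
corr_with_target = (
    toy.select_dtypes("number")
       .corr()["Response"]
       .drop("Response")
)

# rank by absolute correlation, strongest first
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking)
```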

  
#plot distribution of age as a histogram
sns.histplot(x=train["Age"], bins=20)

The histogram above shows that the age distribution is right-skewed.

  
#plot distribution of age as a boxplot
sns.boxplot(y=train["Age"])

From the above boxplot, it can be seen that there are no outliers present in the Age distribution.

  
#plot distribution of annual premium as a histogram
sns.histplot(x=train["Annual_Premium"], bins=20)

The histogram for Annual_Premium is also right-skewed, like that of Age.

  
#plot distribution of annual premium as a boxplot
sns.boxplot(y=train["Annual_Premium"])

From the above boxplot on Annual premium, it can be seen that there are a large number of outliers present in the annual premium distribution.
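
The points a boxplot draws as outliers are, by seaborn's default, those lying more than 1.5 times the interquartile range beyond the quartiles. A rough way to count them is sketched below; the helper `iqr_outlier_mask` and the sample premiums are hypothetical and only illustrate the rule.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside the boxplot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# made-up premiums: mostly typical values plus a few extreme ones
premiums = [2630, 24000, 26000, 30000, 31000, 33000, 39000, 41000, 250000, 540165]
mask = iqr_outlier_mask(premiums)
print(mask.sum(), "of", len(premiums), "values flagged")
```

On the real data, the same function could be applied to `train["Annual_Premium"]` to put a number on "a large number of outliers".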

  
#plotting the distribution of gender based on customer's response
sns.catplot(x="Gender",  col="Response", kind="count", data=train)

From the graph above, it can be seen that there are more male than female customers in the dataset, and that more male than female customers are interested in vehicle insurance.
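
Raw counts mix two effects: how many customers of each gender there are, and how often each group responds positively. A per-group response rate separates the two. Here is a minimal sketch on made-up data (the `toy` frame is an illustrative assumption; on the real data the same call would be made on `train`):

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender":   ["Male"] * 6 + ["Female"] * 4,
    "Response": [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
})

# share of interested customers within each gender
rate_by_gender = toy.groupby("Gender")["Response"].mean()
print(rate_by_gender)
```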

  
#plotting the distribution of vehicle age based on customer's response
sns.catplot(x="Vehicle_Age",  col="Response", kind="count", data=train)

The above graph shows the distribution of Vehicle_Age among the customers in the dataset. Most customers have vehicles that are 1-2 years old or less than 1 year old, while far fewer have vehicles older than 2 years. The same pattern holds among the customers who said they would be interested in vehicle insurance.

  
#total number of responses under each category in vehicle age
train['Vehicle_Age'].value_counts()

1-2 Year     120131
< 1 Year      98879
> 2 Years      9655
Name: Vehicle_Age, dtype: int64

From the above values, it can be seen that far fewer customers have vehicles more than 2 years old.

  
#plotting the distribution of vehicle damage based on customer's response
sns.catplot(x="Vehicle_Damage",  col="Response", kind="count", data=train)

The above graph describes the number of customers that have had their vehicle damaged.

  
#plotting the distribution of previously owned vehicle insurance based on customer's response
sns.catplot(x="Previously_Insured",  col="Response", kind="count", data=train)

The above graph shows the number of customers in the dataset who already have vehicle insurance. Overall, more customers do not have vehicle insurance than do. Among the customers who are interested in vehicle insurance, it is evident that those who do not already have it are the more interested group.

4. Data Cleaning and Transformation

The analysis in the previous sections found no missing values in the dataset. Therefore, the only preprocessing steps needed are dropping the id column, encoding the categorical columns as numbers, log-transforming the right-skewed columns (Age and Annual_Premium), removing outliers from the train dataset and scaling the predictors in both the train and test datasets.

  
#display all numerical values in a histogram
dummy = train.hist(bins=50, figsize=(16,12))

  
#drop the id-column in the train dataset
train = train.drop(['id'], axis = 1)

  
#drop the id-column in the test dataset
test = test.drop(['id'], axis = 1)

  
#changing categorical variables to numerical
train.loc[train['Gender'] == 'Male', 'Gender'] = 1
train.loc[train['Gender'] == 'Female', 'Gender'] = 0
test.loc[test['Gender'] == 'Male', 'Gender'] = 1
test.loc[test['Gender'] == 'Female', 'Gender'] = 0

train.loc[train['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
train.loc[train['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
train.loc[train['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0
test.loc[test['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
test.loc[test['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
test.loc[test['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0

train.loc[train['Vehicle_Damage'] == 'Yes', 'Vehicle_Damage'] = 1
train.loc[train['Vehicle_Damage'] == 'No', 'Vehicle_Damage'] = 0
test.loc[test['Vehicle_Damage'] == 'Yes', 'Vehicle_Damage'] = 1
test.loc[test['Vehicle_Damage'] == 'No', 'Vehicle_Damage'] = 0
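
The same encoding can be written more compactly with one dictionary per column and `Series.map`. This is only a sketch of an equivalent formulation on a toy frame (the `toy` and `encodings` names are assumptions); the label-to-number assignments are the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["> 2 Years", "< 1 Year", "1-2 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# one dictionary per column; the labels are the ones seen in the dataset
encodings = {
    "Gender": {"Male": 1, "Female": 0},
    "Vehicle_Age": {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2},
    "Vehicle_Damage": {"Yes": 1, "No": 0},
}

for col, mapping in encodings.items():
    # .map returns NaN for any label missing from the dictionary,
    # which makes unexpected categories easy to spot
    toy[col] = toy[col].map(mapping)

print(toy)
```

Applying the same `encodings` dictionary to both `train` and `test` would guarantee that the two datasets are encoded identically.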

  
#changing all the dtypes to int in train dataset
for col in train.columns:
    train[col] = train[col].astype(np.int32)

  
#changing all the dtypes to int in test dataset
for col in test.columns:
    test[col] = test[col].astype(np.int32)

  
#log transformation of Age column
train['Age_log'] = np.log(train['Age'])

  
#plot distribution of Age after log transformation in a histogram
sns.histplot(x=train["Age_log"], bins=20)

  
#log transformation of Age column in the test dataset
test['Age_log'] = np.log(test['Age'])

  
#log transformation of Annual_Premium column in train dataset
train['Annual_Premium_log'] = np.log(train['Annual_Premium'])

  
#plot distribution of Annual_Premium after log transformation in a histogram
sns.histplot(x=train["Annual_Premium_log"], bins=20)

  
#removing outliers from Annual_Premium_log using z_score
from scipy import stats
train['z_score']=stats.zscore(train['Annual_Premium_log'])

  
#checking the distribution of z_score so that values with high z_score can be removed
sns.histplot(x=train['z_score'], bins=20)

  
#removing values with z_score greater than 2
train = train.loc[train['z_score']<=2]

  
#removing values with z_score less than -1
train = train.loc[train['z_score']>=-1]
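
The two filters above keep only rows whose z-score lies between -1 and 2. As a rough illustration of what that does, the sketch below computes z-scores by hand with the population standard deviation, which is also the default of `scipy.stats.zscore` (`ddof=0`). The helper `keep_within_z` and the sample values are hypothetical.

```python
import numpy as np

def keep_within_z(values, low=-1.0, high=2.0):
    """Return the values whose z-score lies in [low, high]."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # population std (ddof=0)
    return values[(z >= low) & (z <= high)]

# made-up log-premiums with one very low and one very high value
log_premium = np.array([7.9, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 13.2])
kept = keep_within_z(log_premium)
print(len(log_premium), "->", len(kept))
```

Note that the bounds are asymmetric, so the lower tail is trimmed more aggressively than the upper tail.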

  
#plot distribution of Annual_Premium after removal of outliers in a histogram
sns.histplot(x=train["Annual_Premium_log"], bins=20)

  
#log transformation of Annual_Premium column in the test dataset
test['Annual_Premium_log'] = np.log(test['Annual_Premium'])

  
#drop the age column in test and train dataset
train = train.drop(['Age'], axis = 1)
test = test.drop(['Age'], axis = 1)

  
#drop the Annual_Premium column from the train and test dataset
train = train.drop(['Annual_Premium'], axis = 1)
test = test.drop(['Annual_Premium'], axis = 1)

  
#drop the z_score column from train dataset
train = train.drop(['z_score'], axis = 1)

  
#data scaling needs to be done for both train and test dataset using standard scaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

train_target = train['Response'].values
train_predictors = train.drop(['Response'], axis=1)

#fit_transform returns a NumPy array, so need to put it back 
#into a Pandas dataframe
scaled_vals = scaler.fit_transform(train_predictors)
train = pd.DataFrame(scaled_vals, columns=train_predictors.columns)

#put the non-scaled target back in
train['Response'] = train_target

  
#repeat the same steps for the test dataset
test_target = test['Response'].values
test_predictors = test.drop(['Response'], axis=1)

scaled_vals = scaler.fit_transform(test_predictors)
test = pd.DataFrame(scaled_vals, columns=test_predictors.columns)

test['Response'] = test_target
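
The code above calls `fit_transform` on the test predictors, so the test set is standardized with its own mean and standard deviation. A commonly recommended variant is to fit the scaler on the train set only and reuse those statistics for the test set, so that no information from the test data enters preprocessing. Here is a minimal sketch with made-up matrices (`X_train` and `X_test` are assumptions, not the project's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up predictor matrices
X_train = np.array([[20.0, 1.0], [40.0, 0.0], [60.0, 1.0], [80.0, 0.0]])
X_test = np.array([[30.0, 1.0], [90.0, 0.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(scaler.mean_)
print(X_test_scaled)
```

The results reported in the rest of this post come from the code as originally run, so switching to this variant would change the scaled test values slightly.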

  
#plotting the distribution of values in each column after data preprocessing
dummy = train.hist(bins=50, figsize=(16,12))

The train and test datasets are now clean and ready to be used for building the predictive models.

  
train.to_csv('train.csv', index=False)

  
test.to_csv('test.csv', index=False)

5. Modeling

The aim now is to implement and evaluate several alternative predictive models to predict whether an existing customer would be interested or not in Vehicle Insurance with the help of the predictors: Gender, Age, Driving License, Region Code, Previously Insured, Vehicle Age, Vehicle Damage, Annual Premium, Policy Sales Channel and Vintage.

  
import time                  #for measuring how long model training takes

#load the required train and test datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

  
train.head()

	Gender	Driving_License	Region_Code	Previously_Insured	Vehicle_Age	Vehicle_Damage	Policy_Sales_Channel	Vintage	Age_log	Annual_Premium_log	Response
0	0.930654	0.047053	-1.197614	-0.956976	0.724263	1.022372	0.228267	0.210398	0.871617	-0.345003	0
1	-1.074513	0.047053	-0.654205	1.044959	-1.025121	-0.978117	0.747745	-0.852524	-0.785843	-0.331879	0
2	0.930654	0.047053	-0.654205	-0.956976	0.724263	1.022372	-1.589904	-1.461615	0.489337	-0.459352	1
3	0.930654	0.047053	-1.818652	1.044959	-1.025121	-0.978117	0.896167	-0.780867	-0.785843	-1.036554	0
4	-1.074513	0.047053	-1.896282	1.044959	-1.025121	-0.978117	0.747745	-1.354128	-0.598007	1.223449	0

  
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189674 entries, 0 to 189673
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Gender                189674 non-null  float64
 1   Driving_License       189674 non-null  float64
 2   Region_Code           189674 non-null  float64
 3   Previously_Insured    189674 non-null  float64
 4   Vehicle_Age           189674 non-null  float64
 5   Vehicle_Damage        189674 non-null  float64
 6   Policy_Sales_Channel  189674 non-null  float64
 7   Vintage               189674 non-null  float64
 8   Age_log               189674 non-null  float64
 9   Annual_Premium_log    189674 non-null  float64
 10  Response              189674 non-null  int64  
dtypes: float64(10), int64(1)
memory usage: 15.9 MB

  
train.shape[0]

189674

  
test.shape[0]

152444

Sampling

The train and test datasets are too large, and training models on them in full would take a very long time. Therefore, a random sample is taken from each.

  
#sampling from train dataset
ftrain = train.sample(n=50000, random_state=7)

  
ftrain.head()

	Gender	Driving_License	Region_Code	Previously_Insured	Vehicle_Age	Vehicle_Damage	Policy_Sales_Channel	Vintage	Age_log	Annual_Premium_log	Response
50183	0.930654	0.047053	-1.430503	1.044959	-1.025121	-0.978117	0.729192	1.416637	-1.096595	0.815633	0
46028	0.930654	0.047053	-1.430503	1.044959	-1.025121	-0.978117	0.747745	-0.374806	-0.988722	2.021025	0
41101	-1.074513	0.047053	0.587873	-0.956976	-1.025121	-0.978117	0.747745	0.735888	-0.509064	-0.685612	0
39742	-1.074513	0.047053	0.122094	-0.956976	0.724263	1.022372	-1.589904	0.819489	0.547607	2.429836	1
187634	0.930654	0.047053	-0.343685	1.044959	-1.025121	-0.978117	0.896167	0.640345	-1.327174	0.145065	0

  
#sampling from test dataset
ftest = test.sample(n=35000, random_state=7)

  
ftest.head()

	Gender	Driving_License	Region_Code	Previously_Insured	Vehicle_Age	Vehicle_Damage	Policy_Sales_Channel	Vintage	Age_log	Annual_Premium_log	Response
140791	0.922772	0.045936	0.122655	-0.919475	-1.074838	0.991209	0.220503	-0.864108	-1.257842	-0.084741	1
21139	-1.083691	0.045936	-1.161235	-0.919475	0.689124	0.991209	-1.587009	1.170042	0.518490	1.032980	1
12854	-1.083691	0.045936	-0.028391	1.087577	-1.074838	-1.008869	0.884487	-0.253863	-0.829732	-2.127575	0
13704	0.922772	0.045936	1.482068	1.087577	-1.074838	-1.008869	0.884487	-0.026517	-1.377059	-0.049361	0
21554	0.922772	0.045936	-0.406006	-0.919475	0.689124	0.991209	-1.587009	-1.306835	0.896837	0.597470	0

Train Models

The first step is to create separate arrays for the predictors (Xtrain) and for the target (ytrain):

  
from sklearn.model_selection import GridSearchCV

#separating the predictors and target variable
Xtrain = ftrain.drop('Response', axis=1)

ytrain = ftrain['Response'].copy()

Baseline Model

A majority class classifier is used as the baseline: it finds the most common class label in the training set and always predicts that label.

  
#count the number of instances
ftrain["Response"].value_counts()

0    43867
1     6133
Name: Response, dtype: int64

0: Not interested, 1: Interested

  
#train set size
ftrain.shape[0]

50000

According to the baseline classifier, the output will be “Not interested” for all predictions. In this project, macro-averaging is used (precision, recall and F-score are evaluated for each class separately and then averaged across classes).

The baseline classifier is therefore applied to the whole train sample.

For the “Not interested” class, the accuracy measures are:

Precision: 43867/50000 = 0.877
Recall: 43867/43867 = 1.0
F-score: 2/(1/precision+1/recall) = 0.935

For the “Interested” class, the accuracy measures are:

Precision: 0/0, taken as 0.0 (the class is never predicted)
Recall: 0/6133 = 0.0
F-score: 0.0

The averages over the two classes, which are the eventual baseline scores, are:

Precision: 0.439
Recall: 0.5
F-score: 0.467
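
The baseline arithmetic can be reproduced in a few lines of plain Python. This is a sketch under the convention that an undefined ratio (0/0) counts as 0; the helper `prf` is hypothetical, and the class counts are the ones printed for `ftrain` above.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from counts; 0/0 is treated as 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

n_neg, n_pos = 43867, 6133  # class counts from ftrain

# the baseline predicts "Not interested" (0) for every customer
not_interested = prf(tp=n_neg, fp=n_pos, fn=0)
interested = prf(tp=0, fp=0, fn=n_pos)

# macro average: unweighted mean of the per-class scores
macro = [(a + b) / 2 for a, b in zip(not_interested, interested)]
print([round(m, 3) for m in macro])
```

Because macro-averaging weights both classes equally, a classifier that never predicts the minority class cannot score above 0.5 on recall, however large the majority class is.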

Random Forest

  
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

#put in the hyperparameters
param_grid = {
    'n_estimators': [10, 100, 200, 1000],
    'max_depth': [3, 5, 15],
    'min_samples_split': [5, 10],
    'random_state': [7]
}

#5-fold cross-validation is used
grid_search = GridSearchCV(rf, param_grid, cv=5,
                          scoring='f1_macro',
                          return_train_score=True)

start = time.time()
grid_search.fit(Xtrain, ytrain)
end = time.time() - start
print(f"Took {end} seconds")

Took 899.0751824378967 seconds

  
grid_search.best_estimator_

RandomForestClassifier(max_depth=15, min_samples_split=5, n_estimators=10,
                       random_state=7)

  
grid_search.best_score_

0.5077910772485159

The best hyperparameters prove to be n_estimators = 10, max_depth = 15 and min_samples_split = 5. Together they achieve an F-score of 0.51, which is only slightly above the baseline.

The scores of the best model are recorded for each split. The command below gives the index of the best-performing model:

  
grid_search.cv_results_['rank_test_score'].tolist().index(1)

16

  
rf_split_test_scores = []
for x in range(5):
    #extract f-score of the best model (best_index_, i.e. 16 as found above) from each of the 5 splits
    val = grid_search.cv_results_[f"split{x}_test_score"][grid_search.best_index_]
    rf_split_test_scores.append(val)

The scores achieved by all the models for different hyperparameter are reviewed:

  
val_scores = grid_search.cv_results_['mean_test_score']
train_scores = grid_search.cv_results_['mean_train_score']
params = [str(x) for x in grid_search.cv_results_["params"]]

for val_score, train_score, param in sorted(zip(val_scores, train_scores, params), reverse=True):
    print(val_score, train_score, param)

0.5077910772485159 0.6144937097675991 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 10, 'random_state': 7}
0.500160391905794 0.5800966185190405 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 10, 'random_state': 7}
0.4901043794684668 0.5983222542940497 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 7}
0.489769462731983 0.5974611399846733 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 200, 'random_state': 7}
0.488091595400079 0.5962736744561885 {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 1000, 'random_state': 7}
0.48802775984303126 0.565265257292125 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 200, 'random_state': 7}
0.4879105683806045 0.5624624248068428 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 1000, 'random_state': 7}
0.487125083987973 0.5651251866402451 {'max_depth': 15, 'min_samples_split': 10, 'n_estimators': 100, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 200, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 1000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 10, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 200, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 1000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 100, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 10, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 200, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 1000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 100, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 10, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 10, 'n_estimators': 200, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 10, 'n_estimators': 1000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 10, 'n_estimators': 100, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'max_depth': 3, 'min_samples_split': 10, 'n_estimators': 10, 'random_state': 7}

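The performance of Random Forest varies between 0.47 and 0.51. A better score is achieved with a greater max_depth: with max_depth of 3 or 5, every configuration scores 0.467, the same as the baseline, which suggests those forests predict “Not interested” for every customer. Even the best score is only slightly better than the baseline model. Therefore, more models need to be evaluated for a better understanding.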

  
# put them into a separate variable for convenience
feature_importances = grid_search.best_estimator_.feature_importances_

# the order of the features in `feature_importances` is the same as in the Xtrain dataframe,
# so we can "zip" the two and print in the descending order:

for k, v in sorted(zip(feature_importances, Xtrain.columns), reverse=True):
    print(f"{v}: {k}")

Vehicle_Damage: 0.23034578200483757
Age_log: 0.15910325280867196
Annual_Premium_log: 0.1399068056819435
Previously_Insured: 0.13163255776225508
Vintage: 0.11583783732321544
Policy_Sales_Channel: 0.08475334501851194
Region_Code: 0.06984894911839537
Vehicle_Age: 0.05473726453721193
Gender: 0.012907136629646949
Driving_License: 0.0009270691153103203

Vehicle damage, age, annual premium, previously insured and vintage are the most predictive of whether a customer would be interested in vehicle insurance or not.

The remaining variables contribute less, and Gender and Driving_License contribute almost nothing.

Following on, the model is saved to the disk so that it can be used in the future directly for testing instead of re-training the model.

  
import os
from joblib import dump

#creating a folder to save all the models
if not os.path.exists('ML models'):
    os.makedirs('ML models')

dump(grid_search.best_estimator_, 'ML models/rf-clf.joblib')

['ML models/rf-clf.joblib']

The model will be loaded later on using joblib’s load function.

Support Vector Machines

Linear SVMs

  
from sklearn.svm import LinearSVC

lsvm = LinearSVC()

# specify the hyperparameters
param_grid = {
    'C': [0.1, 1, 3, 5],
    'max_iter': [5000],
    'random_state': [7]
}

#5-fold cross-validation is used
grid_search = GridSearchCV(lsvm, param_grid, cv=5,
                           scoring='f1_macro', 
                           return_train_score=True) 

start = time.time()
grid_search.fit(Xtrain, ytrain)
end = time.time() - start
print(f"Took {end} seconds")

Took 498.01508927345276 seconds

  
grid_search.best_estimator_

LinearSVC(C=0.1, max_iter=5000, random_state=7)

  
grid_search.best_score_

0.4673314366705239

There is no meaningful difference between the F-score of the linear SVM (0.467) and that of the baseline model. Therefore, this model turns out to be very poor.

  
val_scores = grid_search.cv_results_["mean_test_score"]
train_scores = grid_search.cv_results_["mean_train_score"]
params = [str(x) for x in grid_search.cv_results_["params"]]

for val_score, train_score, param in sorted(zip(val_scores, train_scores, params), reverse=True):
    print(val_score, train_score, param)

0.4673314366705239 0.46733143701057855 {'C': 3, 'max_iter': 5000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 1, 'max_iter': 5000, 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 0.1, 'max_iter': 5000, 'random_state': 7}
0.46732576201501475 0.4673662960221964 {'C': 5, 'max_iter': 5000, 'random_state': 7}

From the above results, it can be seen that there is practically no difference in the F-score as the C value changes.

Nevertheless, this model is saved for future reference.

  
import os
from joblib import dump

# create a folder where all trained models will be kept
if not os.path.exists("ML models"):
    os.makedirs("ML models")
    
dump(grid_search.best_estimator_, 'ML models/svm-lnr-clf.joblib')

['ML models/svm-lnr-clf.joblib']

Radial Basis Function

  
from sklearn.svm import SVC

svm = SVC()

#put in the parameters
param_grid = {
    'C': [1, 10, 100],
    'gamma': ["scale", "auto"],
    'kernel': ["rbf"],
    'random_state': [7]
}

#5-fold cross-validation is used
grid_search = GridSearchCV(svm, param_grid, cv=5,
                           scoring='f1_macro', 
                           return_train_score=True) 

start = time.time()
grid_search.fit(Xtrain, ytrain)
end = time.time() - start
print(f"Took {end} seconds")

Took 5844.312863826752 seconds

This model took a significant amount of time to train, which makes it impractical for large datasets.

  
grid_search.best_estimator_

SVC(C=100, gamma='auto', random_state=7)

  
grid_search.best_score_

0.4752675855876845

The F-score of this model is approximately 0.475, which is about 0.01 higher than that of the baseline model. This model therefore also turns out to be no better than the baseline in practice.

  
# obtain the f-scores of the best model in each split

svmrbf_split_test_scores = []
for x in range(5):
    # extract f-score of the best model (at grid_search.best_index_) from each of the 5 splits
    val = grid_search.cv_results_[f"split{x}_test_score"][grid_search.best_index_]
    svmrbf_split_test_scores.append(val)

  
val_scores = grid_search.cv_results_["mean_test_score"]
train_scores = grid_search.cv_results_["mean_train_score"]
params = [str(x) for x in grid_search.cv_results_["params"]]

for val_score, train_score, param in sorted(zip(val_scores, train_scores, params), reverse=True):
    print(val_score, train_score, param)

0.4752675855876845 0.49304582093749144 {'C': 100, 'gamma': 'auto', 'kernel': 'rbf', 'random_state': 7}
0.4749530799050228 0.49179605580100966 {'C': 100, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 7}
0.4682958744065912 0.47032797495572937 {'C': 10, 'gamma': 'auto', 'kernel': 'rbf', 'random_state': 7}
0.4682958744065912 0.470081072861702 {'C': 10, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 1, 'gamma': 'scale', 'kernel': 'rbf', 'random_state': 7}
0.4673314366705239 0.46733143701057855 {'C': 1, 'gamma': 'auto', 'kernel': 'rbf', 'random_state': 7}

From the above results, it can be seen that the F-score of the model increases as the C value increases. This is similar to the random forest, where higher values of max_depth produced better results.

Polynomial SVM was ignored as it took a significant amount of time to train.

The SVM rbf model is saved.

  
import os
from joblib import dump

# create a folder where all trained models will be kept
if not os.path.exists("ML models"):
    os.makedirs("ML models")
    
dump(grid_search.best_estimator_, 'ML models/svm-rbf-clf.joblib')

['ML models/svm-rbf-clf.joblib']

For the two SVM models above, the F-scores are lower than what was observed for the random forest and are not significantly different from the baseline model. Compared to the SVM models, the random forest was slightly better with an F-score of 0.51. However, this is still an extremely poor score in practice. A model with such a poor score has very little predictive power.

Test the Models

Even though the trained models have poor F-scores, the random forest, which has the highest F-score of the three, will be evaluated on the test dataset.

The model is loaded from the local disk:

  
from joblib import load

best_rf = load("ML models/rf-clf.joblib")

  
# drop labels for the test set, but keep all others
Xtest = ftest.drop("Response", axis=1)
ytest = ftest["Response"].copy()

  
from sklearn.metrics import precision_recall_fscore_support

# rf
yhat = best_rf.predict(Xtest)

# macro-averaged precision, recall and f-score
p, r, f, s = precision_recall_fscore_support(ytest, yhat, average="macro")
print("Random Forest:")
print(f"Precision: {p}")
print(f"Recall: {r}")
print(f"F score: {f}")

Random Forest:
Precision: 0.6461335174851621
Recall: 0.5258023688231142
F score: 0.5218418455618304

Thus, the random forest classifier achieves a similar F-score on the test sample (0.52) to the one observed during cross-validation (0.51).

6. Future Improvements and Business Scenario

There is considerable room for future improvement, as the models' accuracy is very poor. Different steps need to be taken to overcome this problem. One reason for the poor score could be the fact that far fewer customers were interested in vehicle insurance than were not, which could have biased what the models learned. Another problem is that the predictors may not be good indicators of the target variable. In support of this, the correlation matrix showed very poor correlation between the target variable and the predictors. Addressing these problems could be potential future improvements.
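
One possible way to act on the class-imbalance point is to reweight the classes during training. The sketch below compares a random forest with and without `class_weight="balanced"` on synthetic data; the data, the chosen hyperparameters and the `scores` dictionary are assumptions for illustration, and nothing here shows that the same setting would help on the insurance dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data (about 12% positives), standing in for the real set
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=7, stratify=y
)

scores = {}
for class_weight in (None, "balanced"):
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=5, class_weight=class_weight, random_state=7
    )
    clf.fit(X_tr, y_tr)
    # macro-averaged F-score, the same metric used throughout this post
    scores[str(class_weight)] = f1_score(y_te, clf.predict(X_te), average="macro")

print(scores)
```

Resampling the training data or tuning the decision threshold are other options in the same spirit, and each would need to be checked with cross-validation before drawing conclusions.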

Currently, this model cannot be used in real-world scenarios due to its low accuracy, but similar models with high accuracy can be used in various business scenarios. For example, banks can use this type of model to predict who would be interested in a certain type of credit or debit card.

This post is licensed under CC BY 4.0 by the author.