Churn Prediction for Fintech Data
Brief: The main idea of the project is to predict churn, and to find out which features are most relevant, using classic supervised techniques (LR, DT, SVM, RF, KNN) tuned with GridSearchCV.
Objective
Predict user churn for a fintech app using classical supervised models.
Project Overview
- Get the dataset from the fintech app and understand the key features.
- Explore and analyze the behaviour of the data: null values, duplicated indexes, class balance, outliers, categorical and numerical analysis, plus insights from comparing churn against key features.
- Based on the exploration phase, pre-process the dataset so it is suitable for training, testing, and later prediction.
- Model and evaluate 5 models, using GridSearchCV for hyperparameter tuning, and compare metrics to find the best model.
- Save the model with the pickle package in order to use it later for a prediction test.
- Create an app where we can use the model.
- Note: for more details, see HERE.
To get a big picture of the project, here is a diagram of each phase:
```mermaid
graph LR;
A[Data set<br>for training/testing] --> B[Exploration]
B --> C{Pre-Processing}
C -->|Training| D[Fitting]
D -->|Feature Selection, Tuning| F[Model]
F --> E[Evaluation]
F --> H[Prediction]
C -->|Testing| F
G[Data set<br>for predicting] -->|Pre-Processing| F
```
1. Dataset
1.1. Dataset Understanding
The business challenge: a fintech company runs a subscription model, and its app yields information about customer behavior. Companies normally try to minimize churn, i.e. subscription cancellations. The objective is to predict when a customer is about to churn, and which features are the most relevant.
Here are some of the features:
- userid - user ID
- churn - Active = No; Suspended < 30 = No; else = Yes (churn)
- age - age of the customer
- zodiac_sign - zodiac sign of the customer
- housing - rent_or_own - does the customer rent or own a house
- withdrawn_application - has the customer withdrawn the loan application
- used_ios - has the user used an iPhone
- used_android - has the user used an Android-based phone
- has_used_web - has the user used the MoneyLion web app
- has_used_mobile - has the user used the MoneyLion app
- has_reffered - has the user referred someone
- credit_score - customer's credit score
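To make the labeling rule concrete, here is a minimal sketch of how such a churn flag could be derived from a raw account status; the `status` and `suspended_days` columns are hypothetical and are not part of the dataset above.

```python
import pandas as pd

# Hypothetical raw columns (not in the real dataset), used only to illustrate the rule.
raw = pd.DataFrame({'status': ['Active', 'Suspended', 'Suspended'],
                    'suspended_days': [0, 12, 45]})

def churn_label(row):
    # Active accounts and short suspensions (< 30) count as "no churn".
    if row['status'] == 'Active':
        return 0
    if row['status'] == 'Suspended' and row['suspended_days'] < 30:
        return 0
    return 1  # everything else counts as churn

raw['churn'] = raw.apply(churn_label, axis=1)
print(raw)
```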
```python
# Core imports used throughout the notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

# Loading the data
df = pd.read_csv('../projectChurnRate/Data/churn_data.csv', index_col=0)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])
if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')
df.head()
```

Output:

```
Size of the dataset: 21600
Number of variables: 30
There are duplicated indexes.

user churn age housing credit_score deposits withdrawal purchases_partners purchases cc_taken cc_recommended ... waiting_4_loan cancelled_loan received_loan rejected_loan zodiac_sign left_for_two_month_plus left_for_one_month rewards_earned reward_rate is_referred
55409 0 37.0 na NaN 0 0 0 0 0 0 ... 0 0 0 0 Leo 1 0 NaN 0.00 0
23547 0 28.0 R 486.0 0 0 1 0 0 96 ... 0 0 0 0 Leo 0 0 44.0 1.47 1
58313 0 35.0 R 561.0 47 2 86 47 0 285 ... 0 0 0 0 Capricorn 1 0 65.0 2.17 0
8095 0 26.0 R 567.0 26 3 38 25 0 74 ... 0 0 0 0 Capricorn 0 0 33.0 1.10 1
61353 1 27.0 na NaN 0 0 2 0 0 0 ... 0 0 0 0 Aries 1 0 1.0 0.03 0

5 rows × 30 columns
```
2. Exploration and Analysis
For the exploration, we will focus on checking the behaviour of the data with respect to the following topics:
- Checking null and na values
- Checking whether the index is duplicated
- Verifying class balance in the dataset
- Dataframe data type analysis -> categorical and numerical, separately
- Checking outliers
- Analysis of churn over features
- Conclusions after exploration and analysis
2.1. Checking Null and na values
As we can see, there are many null values in the credit_score and rewards_earned columns. If we removed the affected rows, we would lose a lot of data and the dataset size would shrink considerably, so the best option is to remove those two columns instead. On the other hand, the age column only has 4 null values, so there we can drop the null rows (axis=0).
```python
null_finder = df.isnull().sum()
print(" ***** Number of Null Values by row: ***** ")
null_finder.where(null_finder > 0).dropna()
```

Output:

```
 ***** Number of Null Values by row: ***** 
age                  4.0
credit_score      6436.0
rewards_earned    2569.0
dtype: float64
```
2.2. Check if index is duplicated
There are duplicated indexes.
```python
if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')
```

Output:

```
There are duplicated indexes.
```
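To quantify the duplication before fixing it, a quick check along these lines (not part of the original notebook) can be used:

```python
# Count index labels that appear more than once
print("Duplicated index entries:", df.index.duplicated().sum())
```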
2.3. Verify class balance in the dataset
Before continuing the EDA, we have to remove the null/na values and duplicated indexes in order to have a clean exploration.
```python
sns.countplot(x='churn', data=df)
class_ratio_0 = Counter(df['churn'])[0]/df['churn'].shape[0]
class_ratio_1 = Counter(df['churn'])[1]/df['churn'].shape[0]
print('Class Ratio of {:} is {:.2f}'.format(list(Counter(df['churn']).keys())[0], class_ratio_0))
print('Class Ratio of {:} is {:.2f}'.format(list(Counter(df['churn']).keys())[1], class_ratio_1))
print('Class Ratio of 1 over 0 is {:.2f}'.format(class_ratio_1/class_ratio_0))
```

Output:

```
Class Ratio of 0 is 0.56
Class Ratio of 1 is 0.44
Class Ratio of 1 over 0 is 0.80
```
As we can see, the class variable is almost balanced, so we don't need any balancing technique.
- Class 1 -> churn
- Class 0 -> no churn
2.4. Dataframe data type analysis
For this analysis, we have to separate the variables into categorical and numerical ones. Hence, we define two lists: cat_features and num_features.
```python
# Separation of data types for EDA
cat_features = df.select_dtypes(exclude=np.number).columns
num_features = df.select_dtypes(include=np.number).columns
print("Quantity of Categorical features: ", len(cat_features), "\nCategorical features: ", cat_features)
print("\nQuantity of Numerical features: ", len(num_features), "\nNumerical features: ", num_features)
```

Output:

```
Quantity of Categorical features:  3
Categorical features:  Index(['housing', 'payment_type', 'zodiac_sign'], dtype='object')

Quantity of Numerical features:  25
Numerical features:  Index(['churn', 'age', 'deposits', 'withdrawal', 'purchases_partners',
       'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked',
       'cc_application_begin', 'app_downloaded', 'web_user', 'app_web_user',
       'ios_user', 'android_user', 'registered_phones', 'waiting_4_loan',
       'cancelled_loan', 'received_loan', 'rejected_loan',
       'left_for_two_month_plus', 'left_for_one_month', 'reward_rate',
       'is_referred'],
      dtype='object')
```
2.4.1. Categorical analysis:
These values cannot simply be binarized because they have more than 2 distinct values, so we need the one-hot encoding technique. Also, before removing the 'na' values, we have to check how many of them there are.
```python
print("Different quantity values: ", list(map(lambda col: (col, len(df[col].value_counts())), df.select_dtypes(exclude=np.number).columns)))
print("\nhousing feature values: ", df['housing'].unique(),
      "\npayment_type feature values: ", df['payment_type'].unique(),
      "\nzodiac_sign feature values: ", df['zodiac_sign'].unique())
```

Output:

```
Different quantity values:  [('housing', 3), ('payment_type', 5), ('zodiac_sign', 13)]

housing feature values:  ['na' 'R' 'O'] 
payment_type feature values:  ['Bi-Weekly' 'Weekly' 'na' 'Monthly' 'Semi-Monthly'] 
zodiac_sign feature values:  ['Aquarius' 'Scorpio' 'Capricorn' 'Pisces' 'Gemini' 'Cancer' 'Virgo'
 'Libra' 'na' 'Leo' 'Aries' 'Taurus' 'Sagittarius']
```
As we can see, there are a lot of 'na' values. We can remove them later during pre-processing, as part of the one-hot (dummy) encoding step.
```python
# Counting "na" values:
print("na values from housing: ", len(df[df['housing'] == 'na']))
print("na values from payment_type: ", len(df[df['payment_type'] == 'na']))
print("na values from zodiac_sign: ", len(df[df['zodiac_sign'] == 'na']))
```

Output:

```
na values from housing:  10373
na values from payment_type:  2883
na values from zodiac_sign:  1570
```

```python
# Analyze each categorical feature by churn count, to see which one is more relevant than the others.
fig, axs = plt.subplots(1, len(cat_features), figsize=(28, 10))
for i, ax in enumerate(axs.flatten()):
    sns.countplot(x=cat_features[i], hue='churn', dodge=False, data=df, ax=ax)
    ax.set_title(cat_features[i])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=8)
```
As we can see, these values should be translated into numbers with the one-hot encoding technique in order to keep the size of the dataset. The categorical values should be kept, because they affect churn behavior. Here are some insights:
- People with the Bi-Weekly payment_type churn more than the others.
- People who rent (R) churn more than people who own (O) their home.
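As a quick illustration of the encoding step (a sketch; the actual encoding happens in section 3.7), pd.get_dummies turns each category, including 'na', into its own binary column, and the *_na columns can then simply be dropped:

```python
# Sketch: one-hot encoding the housing column; 'na' becomes its own indicator column.
demo = pd.get_dummies(df[['housing']])    # -> housing_O, housing_R, housing_na
demo = demo.drop(columns=['housing_na'])  # dropping the na indicator removes the 'na' category
print(demo.head())
```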
2.4.2. Numerical analysis:
From the correlation matrix, we can get insights such as the following:
- Correlation of 0.91 between app_web_user and web_user: they are strongly positively correlated, so we can omit one of them.
- Correlation of 0.88 between reward_rate and cc_recommended: strongly positively correlated, so we can omit one of them.
- Correlation of -0.84 between android_user and ios_user: strongly negatively correlated, so we can omit one of them.
- Correlation of 1 between purchases and deposits: perfectly correlated, so we can omit one of them.
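The correlation matrix itself is not reproduced above; a minimal sketch to compute and plot it, assuming the imports from section 1, would be:

```python
# Correlation matrix over the numerical features, rendered as a heatmap.
plt.figure(figsize=(20, 16))
sns.heatmap(df[num_features].corr(), cmap='coolwarm', center=0)
plt.show()
```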
```python
# Quantity of numerical values:
list(map(lambda col: (col, len(df[col].value_counts())), df.select_dtypes(include=np.number).columns))
```

Output:

```
[('churn', 2),
 ('age', 69),
 ('deposits', 66),
 ('withdrawal', 20),
 ('purchases_partners', 281),
 ('purchases', 64),
 ('cc_taken', 11),
 ('cc_recommended', 324),
 ('cc_disliked', 20),
 ('cc_liked', 8),
 ('cc_application_begin', 123),
 ('app_downloaded', 2),
 ('web_user', 2),
 ('app_web_user', 2),
 ('ios_user', 2),
 ('android_user', 2),
 ('registered_phones', 5),
 ('waiting_4_loan', 2),
 ('cancelled_loan', 2),
 ('received_loan', 2),
 ('rejected_loan', 2),
 ('left_for_two_month_plus', 2),
 ('left_for_one_month', 2),
 ('reward_rate', 186),
 ('is_referred', 2)]
```
As we can see, some features are binary while others take more than 2 values.
First, we will explore the features with more than 2 distinct values; several of them, like ('registered_phones', 5), ('cc_disliked', 20), ('cc_taken', 11) and ('withdrawal', 20), take only a few specific values.
```python
num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, num_features))
num_features_bin = list(filter(lambda col: len(df[col].value_counts()) == 2, num_features))
print("Quantity of features greater than 2:", len(num_features_new))
print(num_features_new)
print()
print("Quantity of features equal to 2 (binary):", len(num_features_bin))
print(num_features_bin)
```

Output:

```
Quantity of features greater than 2: 12
['age', 'deposits', 'withdrawal', 'purchases_partners', 'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked', 'cc_application_begin', 'registered_phones', 'reward_rate']

Quantity of features equal to 2 (binary): 13
['churn', 'app_downloaded', 'web_user', 'app_web_user', 'ios_user', 'android_user', 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']
```
In the next plots, we can see that some features, like 'deposits', 'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked' and 'registered_phones', are overwhelmingly zero. Later, when removing outliers, all of their values will become 0, so those features are not useful and we have to remove them.
```python
# Exploring the distribution of the num_features_new values: 'age', 'credit_score', 'deposits',
# 'purchases_partners', 'purchases', 'cc_recommended', 'cc_application_begin', 'rewards_earned', 'reward_rate'
for col in num_features_new:
    sns.displot(df[col], kde=True, height=5)
    plt.xticks(rotation=90, fontsize=9)
    plt.xlabel(col, labelpad=10)
    plt.ylabel('Counts')
    plt.show()
```
To continue the exploration, we will omit 'churn', 'app_downloaded', 'web_user', 'app_web_user', 'ios_user' and 'android_user', because they have already been analyzed, and focus on the rest. There are 2 main insights:
- 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'left_for_one_month' and 'rejected_loan' have almost the same distribution with respect to churn, so we will only keep one of them.
- The others seem to behave differently.
```python
num_features_binary = ['waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']
fig, axs = plt.subplots(1, len(num_features_binary), figsize=(28, 10))
for i, ax in enumerate(axs.flatten()):
    sns.countplot(x=num_features_binary[i], hue='churn', dodge=False, data=df, ax=ax)
    ax.set_title(num_features_binary[i])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=8)
```
2.5. Check outliers
For this part, we have to detect outliers in the features which are not binary, because outliers make no sense for binary features. We flag a feature as having outliers when its values fall outside the interval $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, where $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range.
```python
def detect_outlier(X):
    """
    X: dataframe
    """
    for i in range(5):  # Only print the first 5
        first_q = np.percentile(X[X.columns[i]], 25)
        third_q = np.percentile(X[X.columns[i]], 75)
        IQR = 1.5 * (third_q - first_q)  # note: this variable already includes the 1.5 factor
        minimum = first_q - IQR
        maximum = third_q + IQR
        if minimum > np.min(X[X.columns[i]]) or maximum < np.max(X[X.columns[i]]):
            print(X.columns[i], "There is Outlier")

detect_outlier(df[num_features_new])
```

Output:

```
age There is Outlier
deposits There is Outlier
withdrawal There is Outlier
purchases_partners There is Outlier
purchases There is Outlier
```
As we can see, 'age', 'purchases_partners' and 'cc_application_begin' have genuine outliers. For the other flagged features, the mean sits at 0, which means most of their values are 0; after outlier removal those features would contain only the value '0'. Hence, they should be removed.
2.6. Analysis of Churn over features
Analysis 1:
- Question: What is the main age group of customers, and which age group churns more?
- Results: From the plot we can see that people between 21 and 32 churn more than older people.
```python
age_group = df[(df.age > 20) & (df.age < 80)]
df_filtered = age_group.groupby(['age', 'churn'])['deposits'].count().reset_index()
b = pd.pivot_table(df_filtered, values='deposits', index='age', columns=['churn']).reset_index()
b.head()
```

Output:

```
churn   age      0      1
0      21.0  373.0  397.0
1      22.0  409.0  410.0
2      23.0  519.0  437.0
3      24.0  573.0  449.0
4      25.0  580.0  478.0
```

```python
sns.barplot(x='age', y=1, data=b)
plt.xlabel('Age')
plt.ylabel('Churn Quantity')
plt.title('Age Churn Rate')
plt.xticks(rotation=90)
```
Analysis 2:
- Question: Do people who make deposits churn? Do people who make purchases churn?
- Results: We can see from the plot that people who do not make any deposit or purchase churn more, which is logical.
```python
df_filtered = df.groupby(['deposits', 'churn'])['housing'].count().reset_index()
b = pd.pivot_table(df_filtered, values='housing', index='deposits', columns=['churn']).reset_index()
b.head()
```

```python
plt.figure(figsize=(18, 10))
sns.barplot(x='deposits', y=1, data=b)
plt.xlabel('Deposits')
plt.ylabel('Churn Quantity')
plt.title('Deposits Churn Rate')
plt.xticks(rotation=90)
```
2.7. Conclusions after exploration and analysis
After the EDA, we have some insights and conclusions that must be applied in the pre-processing phase to prepare the data for modeling:
- People who do not make any deposit or purchase in the app churn more.
- People between 21 and 32 churn more than older people.
- 'age', 'purchases_partners' and 'cc_application_begin' have genuine outliers. The other flagged features have a mean of 0, meaning most of their values are 0; after outlier removal they would contain only 0, so they should be removed.
- 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan' and 'left_for_one_month' have almost the same distribution with respect to churn; we will keep only one of them.
- From the correlation matrix: 0.91 between app_web_user and web_user (strongly positively correlated), so we can omit one of them.
- From the correlation matrix: 0.88 between reward_rate and cc_recommended (strongly positively correlated), so we can omit one of them.
- From the correlation matrix: -0.84 between android_user and ios_user (strongly negatively correlated), so we can omit one of them.
- From the correlation matrix: 1 between purchases and deposits (perfectly correlated), so we can omit one of them.
- From the categorical values: people with the Bi-Weekly payment_type churn more than the others.
- From the categorical values: people who rent (R) churn more than people who own (O) their home.
- The categorical values should be translated into numbers with one-hot encoding to keep the dataset size; there are a lot of 'na' values, which we can remove via the one-hot (dummy) encoding.
- The class variable is almost balanced, so we won't need any balancing technique.
- Features like 'deposits', 'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked' and 'registered_phones' are overwhelmingly zero; after outlier removal all of their values will be 0, so they are not useful and have to be removed.
3. Pre-Processing
Based on what we found in the exploration phase, we will pre-process the dataset to make it suitable for modelling. We save each of these steps in a pipeline list for later pre-processing runs. Below is the list of tasks:
- Remove null/NaN values
- Remove duplicated indexes
- Remove non-useful features: the app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan and waiting_4_loan columns
- Update the numerical and categorical feature lists for modeling
- Remove outliers
- Num-Val: standardize the data for training (excluding binary features)
- Cat-Val: one-hot encoding
- Create a pre-processing pipeline function for later pre-processing of test data
Before starting, we create the pipeline list. At the end of the process, we will test this pipeline with a new dataset:
```python
# Pipeline for pre-processing
pipeline_preprocess = []
```
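As a design note: keeping a plain Python list of functions is simple and transparent. The same idea could also be expressed with scikit-learn's Pipeline machinery (a hypothetical alternative, not what this notebook does):

```python
# Hypothetical alternative using scikit-learn's Pipeline machinery.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Each pre-processing function takes a DataFrame and returns a DataFrame,
# so each one could be wrapped in a FunctionTransformer, e.g.:
# sk_pipe = make_pipeline(*[FunctionTransformer(f) for f in pipeline_preprocess])
# df_processed = sk_pipe.fit_transform(df)
```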
3.1. Remove Null/Nan values
```python
def dropnull(df):
    print("Removing columns credit_score and rewards_earned ...(1)")
    df = df.drop(columns=['credit_score', 'rewards_earned'])
    print("Drop null values from age column ...(2)")
    df = df[pd.notnull(df['age'])]
    return df

df = dropnull(df)
null_finder = df.isnull().sum()
print(" ***** Number of Null Values by row: ***** ")
null_finder.where(null_finder > 0).dropna()

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(dropnull)
```

Output:

```
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
 ***** Number of Null Values by row: ***** 
Series([], dtype: float64)
```
3.2. Remove duplicated indexes
```python
# This function will be added to the pre-processing pipeline:
def dropduplicated(df):
    if df.index.is_unique:
        print('Indexes are unique.')
        return df
    else:
        print('There are duplicated indexes....So removing duplicated indexes ...(3)')
        return df[~df.index.duplicated(keep='first')]

df = dropduplicated(df)
if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(dropduplicated)
```

Output:

```
There are duplicated indexes....So removing duplicated indexes ...(3)
Indexes are unique.
```
3.3. Remove non-useful features
During the EDA, we concluded that the following features are not useful:
- app_web_user
- deposit
- ios_user
- cc_recommended
- cancelled_loan
- received_loan
- rejected_loan
- waiting_4_loan
- withdrawal
- purchases
- cc_taken
- cc_disliked
- cc_liked
- registered_phones
```python
# Remove the non-useful columns identified during the EDA
def dropcolumns(df):
    print("Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)")
    df = df.drop(columns=['app_web_user', 'deposits', 'ios_user', 'cc_recommended', 'cancelled_loan', 'received_loan', 'rejected_loan', 'waiting_4_loan',
                          'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked', 'registered_phones'])
    return df

df = dropcolumns(df)

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(dropcolumns)
```

Output:

```
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
```
3.4. Update numerical and categorical feature lists for modeling
We need to update the numerical and categorical feature lists before starting to model, because some features were removed from the dataset.
```python
# Update numerical and categorical feature lists for later modeling
cat_features = df.select_dtypes(exclude=np.number).columns
num_features = df.select_dtypes(include=np.number).columns
print("Quantity of Categorical features: ", len(cat_features), "\nCategorical features: ", cat_features)
print("\nQuantity of Numerical features: ", len(num_features), "\nNumerical features: ", num_features)
```

Output:

```
Quantity of Categorical features:  3
Categorical features:  Index(['housing', 'payment_type', 'zodiac_sign'], dtype='object')

Quantity of Numerical features:  11
Numerical features:  Index(['churn', 'age', 'purchases_partners', 'cc_application_begin',
       'app_downloaded', 'web_user', 'android_user', 'left_for_two_month_plus',
       'left_for_one_month', 'reward_rate', 'is_referred'],
      dtype='object')
```

```python
# Quantity of numerical values:
num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, df.select_dtypes(include=np.number).columns))
num_features_bin = list(filter(lambda col: len(df[col].value_counts()) == 2, df.select_dtypes(include=np.number).columns))
print("Quantity of features greater than 2:", len(num_features_new))
print(num_features_new)
print()
print("Quantity of features equal to 2 (binary):", len(num_features_bin))
print(num_features_bin)
```

Output:

```
Quantity of features greater than 2: 4
['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']

Quantity of features equal to 2 (binary): 7
['churn', 'app_downloaded', 'web_user', 'android_user', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']
```
So now we have the following feature counts:
- Quantity of categorical features: 3
- Quantity of numerical features: 11 (4 non-binary, 7 binary)
3.5. Remove outliers
For removing outliers, we focus on the numerical features which are not binary.
```python
# Removing outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are replaced by the column median
def remove_outlier(df):
    """
    df: dataframe
    """
    num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, df.select_dtypes(include=np.number).columns))
    outliers = []
    for col in num_features_new:
        first_q = np.percentile(df[col], 25)
        third_q = np.percentile(df[col], 75)
        IQR = 1.5 * (third_q - first_q)
        minimum = first_q - IQR
        maximum = third_q + IQR
        if minimum > np.min(df[col]) or maximum < np.max(df[col]):
            outliers.append(col)
    print("The outliers are :", outliers)
    print("Removing outliers ...(5)")
    for col in outliers:
        first_q = np.percentile(df[col], 25)
        third_q = np.percentile(df[col], 75)
        IQR = 1.5 * (third_q - first_q)
        minimum = first_q - IQR
        maximum = third_q + IQR
        median = df[col].median()
        # Replace out-of-range values with the median
        df.loc[df[col] < minimum, col] = median
        df.loc[df[col] > maximum, col] = median
    return df

df = remove_outlier(df)

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(remove_outlier)
```

Output:

```
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
```
3.6. Num-Val: Standardize data
For the numerical values, we standardize the data for training, without considering the binary features.
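For reference, StandardScaler applies the usual z-score transform to each selected column: $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the column's mean and standard deviation.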
```python
from sklearn.preprocessing import StandardScaler

def standardizeNum(df):
    standardize = StandardScaler()
    num_features = df.select_dtypes(include=np.number).columns
    num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, num_features))
    print("Standardizing Numerical variables ...(6)")
    df[num_features_new] = standardize.fit_transform(df[num_features_new])
    return df

df = standardizeNum(df)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(standardizeNum)
```

Output:

```
Standardizing Numerical variables ...(6)
Size of the dataset: 20095
Number of variables: 14
```
3.7. Cat-Val: One-hot encoding
Here we perform the one-hot encoding for the categorical features, and also remove the 'na' values by dropping the resulting *_na columns.
```python
def removedummy(df):
    print("Convert categorical values into numbers ...(7)")
    df_new = pd.get_dummies(df)
    print("Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)")
    df_new = df_new.drop(columns=['housing_na', 'zodiac_sign_na', 'payment_type_na'])
    return df_new

df = removedummy(df)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(removedummy)
```

Output:

```
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset: 20095
Number of variables: 29
```
3.8. Create the pre-processing pipeline function
During the pre-processing phase, we used one function per specific task. These functions will later be reused on new datasets, such as test data or entirely new data; hence we keep them in a pipeline list for later pre-processing. Here is the list of all the functions we used:
```python
print("\nSteps for pre-processing: ")
for step, function in enumerate(pipeline_preprocess):
    print("\t {:d}: {:s}".format(step, function.__name__))
```

Output:

```
Steps for pre-processing: 
	 0: dropnull
	 1: dropduplicated
	 2: dropcolumns
	 3: remove_outlier
	 4: standardizeNum
	 5: removedummy
```
In this way, we create the function 'preprocess_data_pipeline' for later pre-processing of a new dataset (prediction):
```python
# Definition of preprocess_data for a specific dataset:
def preprocess_data_pipeline(df, pipeline_preprocess):
    for step, function in enumerate(pipeline_preprocess):
        df = function(df)
    print("Size of the dataset: %d" % df.shape[0])
    print("Number of variables: %d" % df.shape[1])
    df.head(10)
    return df
```
Testing the 'preprocess_data_pipeline' function with a new dataset:
```python
df_aux = pd.read_csv('../projectChurnRate/Data/churn_data.csv', index_col=0).sample(n=2100, random_state=0)
preprocess_data_pipeline(df_aux, pipeline_preprocess)
```

Output:

```
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
There are duplicated indexes....So removing duplicated indexes ...(3)
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
Standardizing Numerical variables ...(6)
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset: 2086
Number of variables: 29

user churn age housing purchases_partners cc_application_begin app_downloaded web_user android_user payment_type zodiac_sign left_for_two_month_plus left_for_one_month reward_rate is_referred
50488 0 20.0 R 29 8 1 0 1 Bi-Weekly Virgo 0 0 0.44 1
53603 0 38.0 na 28 5 1 0 1 Monthly Sagittarius 0 0 0.67 1
42289 1 40.0 R 9 4 1 1 1 na Aries 0 0 0.63 1
4185 0 34.0 na 0 4 1 1 0 Bi-Weekly Scorpio 0 0 2.07 0
12436 1 24.0 O 38 7 1 1 1 Weekly Scorpio 0 0 0.73 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23350 1 22.0 R 31 7 1 1 1 Weekly Virgo 1 0 0.00 1
53737 0 43.0 na 23 1 1 1 0 Bi-Weekly Gemini 0 0 0.50 0
56724 1 40.0 R 0 0 1 1 0 Monthly Taurus 0 0 0.30 0
17737 1 23.0 na 8 1 1 1 1 Semi-Monthly Pisces 0 0 0.65 0
46045 1 27.0 R 0 3 1 0 0 Bi-Weekly Virgo 0 0 0.53 0

2086 rows × 14 columns
```
4. Model and Evaluation
For modeling, we go through the following phases:
- Separate the dataset:
  - Into X (features) and y (label)
  - Split the dataset into training and testing sets
- Feature selection:
  - Use a logistic regression model
  - Use other techniques: RFE, Random Forest and Mutual Information
  - Then sum up the results and choose the most important features
- Modeling to find the best model:
  - Hyperparameter tuning using GridSearchCV to find the best hyperparameters of each model
  - Fitting and evaluation using the following models: Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, K-Nearest Neighbors
  - Choose the best model according to the metrics
  - Save the model to use with new values
For evaluation, we use the following metrics (defined after this list):
- Confusion matrix
- Accuracy as the classification metric
- Precision, Recall, F1-score
- Some graphs: ROC, Precision vs Recall, KS Statistic Test, Cumulative Gain, Lift Curve
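For reference, the point metrics are defined from the confusion-matrix counts (TP, FP, TN, FN):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad \mathrm{Precision} = \frac{TP}{TP + FP},\qquad \mathrm{Recall} = \frac{TP}{TP + FN},\qquad F_1 = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$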
```mermaid
graph LR;
A[Training Dataset] --> B[Feature Selection]
B --> D[Hyperparameter Tuning]
D --> E[Fitting]
E --> F[Model]
F -->|Tuning| D
```

```mermaid
graph LR;
C[Testing Dataset] --> F[Model Fitted]
F --> G[Evaluation]
G --> H[Metrics Insights]
```
4.1. Dataset Separation
First, we split the data into features and label (the churn variable); then we split it into training and testing sets.
```python
from sklearn.model_selection import train_test_split

X = df.drop('churn', axis=1)
y = df['churn']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=10)
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))
```

Output:

```
x_train shape: (14066, 28)
y_train shape: (14066,)
x_test shape: (6029, 28)
y_test shape: (6029,)
```
4.2. Feature Selection
For feature selection, we will use the following techniques:
- A logistic regression model
- Other techniques:
  - Recursive Feature Elimination (RFE) as a wrapper method
  - Random Forest as an embedded method
  - Mutual Information classification as an entropy-based method
- Then we sum up the results and choose the most important features
4.2.1. Logistic Regression
Using this model, the most important features are those with coef < -0.2 or coef > 0.2; there are 7 such features out of 28. We can see that the most important variables are:
- left_for_one_month
- web_user
- app_downloaded
- payment_type_Weekly
- housing_O
- reward_rate
- purchases_partners
```python
# feature_importance_lr is a DataFrame of (feature, coef) pairs built from the
# coefficients of a LogisticRegression fitted on x_train (construction not shown).
feature_importance_lr.where((feature_importance_lr["coef"] < -0.2) | (feature_importance_lr["coef"] > 0.2)).dropna()
```

Output:

```
                feature      coef
7    left_for_one_month  0.438657
4              web_user  0.289608
3        app_downloaded  0.268084
15  payment_type_Weekly  0.218277
10            housing_O -0.204331
8           reward_rate -0.298566
1    purchases_partners -0.538643
```
4.2.2. Recursive Feature Elimination (RFE)
As a wrapper method, RFE uses an external estimator (LogisticRegression in our case) to recursively search for and select a small subset of features by evaluating their importance. Using RFE, the 12 most important features are:
- purchases_partners
- app_downloaded
- web_user
- left_for_two_month_plus
- left_for_one_month
- reward_rate
- housing_O
- payment_type_Weekly
- zodiac_sign_Aquarius
- zodiac_sign_Capricorn
- zodiac_sign_Libra
- zodiac_sign_Taurus
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Number of features to keep: 12
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=12, step=10, verbose=5)
rfe_selector.fit(x_train, y_train)
rfe_support = rfe_selector.get_support()
feature_rfe = x_train.loc[:, rfe_support].columns.tolist()
print(str(len(feature_rfe)), 'selected features')
print(feature_rfe)
```

Output:

```
Fitting estimator with 28 features.
Fitting estimator with 18 features.
12 selected features
['purchases_partners', 'app_downloaded', 'web_user', 'left_for_two_month_plus', 'left_for_one_month', 'reward_rate', 'housing_O', 'payment_type_Weekly', 'zodiac_sign_Aquarius', 'zodiac_sign_Capricorn', 'zodiac_sign_Libra', 'zodiac_sign_Taurus']
```
4.2.3. Random Forest
In a Random Forest, node purity tells us about feature importance: a feature whose splits produce nodes with low (Gini) impurity across all trees is considered the most important. Using the Random Forest model, only 4 features are selected:
- age
- purchases_partners
- cc_application_begin
- reward_rate
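For reference, the Gini impurity of a node with class proportions $p_k$ is

$$G = 1 - \sum_k p_k^2,$$

so a pure node has $G = 0$, and a feature's importance accumulates the impurity decrease of the splits it drives.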
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=12)
embeded_rf_selector.fit(x_train, y_train)
embeded_rf_support = embeded_rf_selector.get_support()
feature_embeded_rf = x_train.loc[:, embeded_rf_support].columns.tolist()
print(str(len(feature_embeded_rf)), 'selected features')
feature_embeded_rf
```

Output:

```
4 selected features
['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
```
4.2.4. Mutual Information
Mutual Information comes from information theory and applies information gain (as in decision trees) to feature selection. The gain is calculated between two variables and measures the reduction in uncertainty about one given the other. Using this method, the 12 most important features are:
- purchases_partners
- reward_rate
- zodiac_sign_Leo
- is_referred
- cc_application_begin
- android_user
- zodiac_sign_Scorpio
- housing_O
- zodiac_sign_Aquarius
- housing_R
- app_downloaded
- zodiac_sign_Virgo
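For reference, the mutual information between two discrete variables $X$ and $Y$ is

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},$$

which is zero exactly when $X$ and $Y$ are independent.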
```python
from sklearn.feature_selection import mutual_info_classif

importances = mutual_info_classif(x_train, y_train)
feature_mutual_importance = pd.Series(importances, index=x_train.columns)
feature_mutual_importance.sort_values(ascending=False).plot.bar(figsize=(20, 8))
feature_mutual = list(feature_mutual_importance.where(feature_mutual_importance > 0.002).sort_values(ascending=False).dropna().index)
print("There are {} important features and they are: \n".format(len(feature_mutual)))
feature_mutual
```

Output:

```
There are 12 important features and they are: 

['purchases_partners',
 'reward_rate',
 'zodiac_sign_Leo',
 'is_referred',
 'cc_application_begin',
 'android_user',
 'zodiac_sign_Scorpio',
 'housing_O',
 'zodiac_sign_Aquarius',
 'housing_R',
 'app_downloaded',
 'zodiac_sign_Virgo']
```
In the bar plot we can see the most important features vs churn; "purchases_partners" has the most influence.
4.2.5. Summary
Summing up all the results, we get the 12 most important features to start modeling with.
```python
# Put all selections together. lr_support and mutual_support are the boolean
# masks from the Logistic Regression and Mutual Information selections above;
# num_feats is the number of features to keep (12).
feature_selection_df = pd.DataFrame({'Feature': x_train.columns,
                                     'Logistics': lr_support,
                                     'RFE': rfe_support,
                                     'Random Forest': embeded_rf_support,
                                     'Mutual Classif': mutual_support})
# Count how many times each feature was selected
feature_selection_df['Total'] = np.sum(feature_selection_df, axis=1)
# Display the top features
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.head(num_feats)
```

Output:

```
                 Feature  Logistics    RFE  Random Forest  Mutual Classif  Total
1     purchases_partners       True   True           True            True      4
8            reward_rate      False   True           True            True      3
2   cc_application_begin       True  False           True            True      3
3         app_downloaded       True   True          False            True      3
27     zodiac_sign_Virgo       True  False          False            True      2
26    zodiac_sign_Taurus       True   True          False           False      2
25   zodiac_sign_Scorpio       True  False          False            True      2
16  zodiac_sign_Aquarius      False   True          False            True      2
10             housing_O      False   True          False            True      2
0                    age       True  False           True           False      2
22     zodiac_sign_Libra      False   True          False           False      1
21       zodiac_sign_Leo      False  False          False            True      1
```
Here they are:
- purchases_partners
- reward_rate
- cc_application_begin
- app_downloaded
- zodiac_sign_Virgo
- zodiac_sign_Taurus
- zodiac_sign_Scorpio
- zodiac_sign_Aquarius
- housing_O
- age
- zodiac_sign_Libra
- zodiac_sign_Leo
Now we are ready to start modeling. For this, we split the dataset again, keeping only the features in the new_num_features list.
```python
# new_num_features holds the 12 selected features listed above
X = X[new_num_features]
y = df['churn']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=10)
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))
```

Output:

```
x_train shape: (14066, 12)
y_train shape: (14066,)
x_test shape: (6029, 12)
y_test shape: (6029,)
```
4.3. Modeling & Evaluation
Modeling to find the best model:
- Hyperparameter tuning using GridSearchCV to find the best hyperparameters of each model
- Fitting and evaluation using the following models:
  - Logistic Regression Model
  - Decision Tree Model
  - Support Vector Machine Model
  - Random Forest Model
  - K-Nearest Neighbors Model
- Choose the best model according to the metrics
- Save the model to use with new values
For evaluation, we use the following metrics:
- Confusion matrix
- Accuracy as the classification metric
- Precision, Recall, F1-score
- Some graphs: ROC, Precision vs Recall, KS Statistic Test, Cumulative Gain, Lift Curve
Most of the steps in this phase are mechanical, so here are the key points to consider:
```mermaid
graph LR;
A[Training Dataset] --> B[GridSearchCV]
B -->|Hyperparameter Tuning| C[Fitting/Predicting]
C --> B
C -->|Evaluation Metrics| D[Save Metrics]
```
We also create 2 helper functions to summarize all the results during hyperparameter tuning and modeling (fit/predict):
```python
def GridSearchResults(grid_clf, num_results=10, display_all_params=True):
    """
    Summarize all the results of a fitted GridSearchCV object.
    """
    ...

def evaluationMetricsGCV(x_test, y_test, model_fit):
    """
    Summarize all the results of fitting and predicting.
    """
    ...
```
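The bodies of these helpers are elided in the write-up. A minimal sketch consistent with the outputs shown below, assuming scikit-learn's standard metrics API, might look like this:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

def GridSearchResults(grid_clf, num_results=10, display_all_params=True):
    """Summarize a fitted GridSearchCV object (sketch)."""
    print("Best parameters:")
    print(grid_clf.best_params_)
    best_std = grid_clf.cv_results_['std_test_score'][grid_clf.best_index_]
    print("Best score:\n{:.5f} (+/-{:.5f})".format(grid_clf.best_score_, best_std))
    if display_all_params:
        print("All parameters:")
        print(grid_clf.best_estimator_.get_params())
    # Table of the top-ranked parameter combinations
    results = pd.DataFrame(grid_clf.cv_results_)
    cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
    return results[cols].sort_values('rank_test_score').head(num_results)

def evaluationMetricsGCV(x_test, y_test, model_fit):
    """Compute the evaluation metrics of a fitted model on the test set (sketch)."""
    y_pred = model_fit.predict(x_test)
    y_prob = model_fit.predict_proba(x_test)[:, 1]
    Accuracy = accuracy_score(y_test, y_pred)
    Precision = precision_score(y_test, y_pred)
    Recall = recall_score(y_test, y_pred)
    F1 = f1_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_prob)
    print("Results:")
    print("+++++ Accuracy Score {:.3f}".format(Accuracy))
    print("+++++ Precision Score {:.3f}".format(Precision))
    print("+++++ Recall Score {:.3f}".format(Recall))
    print("+++++ F1 Score {:.3f}".format(F1))
    print(classification_report(y_test, y_pred, target_names=['No churn', 'Churn']))
    print("+++++ AUC (Area under the ROC Curve) : {:.3f}".format(auc_score))
    return Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob
```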
To save the metrics, we create a dataframe for later use when choosing the best model:
```python
# Dataframe for statistics
model_stats = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1-Score", "AUC-Score"])
model_stats.head()
```

Output:

```
Model Accuracy Precision Recall F1-Score AUC-Score
```
4.3.1. Logistic Regression Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.model_selection import GridSearchCV

# GridSearchCV for Logistic Regression
parameters = {}
parameters['C'] = [10e-3, 10e-2, 10e-1, 1, 10, 100, 1000]
parameters['class_weight'] = [None, 'balanced']
parameters['penalty'] = ["l1", "l2", 'elasticnet']
parameters['solver'] = ['newton-cg', 'lbfgs', 'liblinear', 'sag']

GS_log = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_log.fit(x_train, y_train)
GridSearchResults(GS_log)
```

Output:

```
Best parameters:
{'C': 0.01, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Best score:
0.63038 (+/-0.01300)
All parameters:
{'C': 0.01,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l1',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

params mean_test_score std_test_score rank_test_score
2 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.630383 0.012998 1
6 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627824 0.013089 2
26 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627611 0.012696 3
4 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627469 0.012713 4
5 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627469 0.012713 4
7 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627469 0.012713 4
29 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627184 0.013151 7
31 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627184 0.013151 7
28 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627184 0.013151 7
30 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.626971 0.013364 10
```
Here are the results of fitting and evaluation:

```python
logr_model = LogisticRegression(C=0.01, class_weight=None, max_iter=1000, penalty='l1', random_state=1000, solver='liblinear')
logr_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, logr_model)
model_stats = model_stats.append({"Model": "Logistic model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.639
+++++ Precision Score 0.598
+++++ Recall Score 0.570
+++++ F1 Score 0.584
              precision    recall  f1-score   support

    No churn       0.67      0.69      0.68      3354
       Churn       0.60      0.57      0.58      2675

    accuracy                           0.64      6029
   macro avg       0.63      0.63      0.63      6029
weighted avg       0.64      0.64      0.64      6029

+++++ AUC (Area under the ROC Curve) : 0.632
```
As we can see, the accuracy is 0.64.
4.3.2. Decision Tree Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.tree import DecisionTreeClassifier

# GridSearchCV for the Decision Tree Model
parameters = {}
parameters['max_depth'] = [i for i in range(1, 11)]
parameters['class_weight'] = [None, 'balanced']
parameters['max_features'] = [i for i in range(1, 8)]
parameters['min_samples_leaf'] = [i for i in range(1, 11)]

GS_tree = GridSearchCV(DecisionTreeClassifier(random_state=1000), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_tree.fit(x_train, y_train)
GridSearchResults(GS_tree)
```

Output:

```
Best parameters:
{'class_weight': None, 'max_depth': 8, 'max_features': 7, 'min_samples_leaf': 5}
Best score:
0.66344 (+/-0.00662)
All parameters:
{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 8,
 'max_features': 7,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 1000,
 'splitter': 'best'}

params mean_test_score std_test_score rank_test_score
554 {'class_weight': None, 'max_depth': 8, 'max_fe... 0.663444 0.006620 1
474 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.663371 0.009664 2
482 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.661808 0.006527 3
485 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.661666 0.008073 4
487 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.661594 0.008310 5
481 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660883 0.008289 6
483 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660670 0.007874 7
553 {'class_weight': None, 'max_depth': 8, 'max_fe... 0.660669 0.010791 8
473 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660385 0.008892 9
480 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660314 0.009318 10
```
Here are the results of fitting and evaluation:

```python
dt_model = DecisionTreeClassifier(max_depth=8, max_features=7, min_samples_leaf=5, random_state=1000)
dt_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, dt_model)
model_stats = model_stats.append({"Model": "Decision Tree model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.673
+++++ Precision Score 0.647
+++++ Recall Score 0.582
+++++ F1 Score 0.612
              precision    recall  f1-score   support

    No churn       0.69      0.75      0.72      3354
       Churn       0.65      0.58      0.61      2675

    accuracy                           0.67      6029
   macro avg       0.67      0.66      0.67      6029
weighted avg       0.67      0.67      0.67      6029

+++++ AUC (Area under the ROC Curve) : 0.664
```
As we can see, the accuracy is 0.67.
4.3.3. Support Vector Machine Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.svm import SVC

# GridSearchCV for the Support Vector Machine Model
parameters = {}
parameters['C'] = [10e-2, 1, 100]
parameters['kernel'] = ['linear', 'poly', 'rbf']
#parameters['gamma'] = np.arange(0.01, 0.4, 0.1)

GS_SVM = GridSearchCV(SVC(random_state=1000, probability=True), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_SVM.fit(x_train, y_train)
GridSearchResults(GS_SVM)
```

Output:

```
Best parameters:
{'C': 1, 'kernel': 'rbf'}
Best score:
0.64545 (+/-0.01170)
All parameters:
{'C': 1,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': True,
 'random_state': 1000,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

params mean_test_score std_test_score rank_test_score
5 {'C': 1, 'kernel': 'rbf'} 0.645455 0.011701 1
7 {'C': 100, 'kernel': 'poly'} 0.644317 0.010408 2
8 {'C': 100, 'kernel': 'rbf'} 0.643818 0.012859 3
4 {'C': 1, 'kernel': 'poly'} 0.639838 0.010713 4
2 {'C': 0.1, 'kernel': 'rbf'} 0.637066 0.013727 5
1 {'C': 0.1, 'kernel': 'poly'} 0.632587 0.010276 6
0 {'C': 0.1, 'kernel': 'linear'} 0.612112 0.012761 7
6 {'C': 100, 'kernel': 'linear'} 0.612041 0.013000 8
3 {'C': 1, 'kernel': 'linear'} 0.611970 0.012831 9
```
Here are the results of fitting and evaluation:

```python
svm_model = SVC(C=1, gamma=0.31, kernel='rbf', random_state=1000, probability=True)
svm_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, svm_model)
model_stats = model_stats.append({"Model": "SVM model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.664
+++++ Precision Score 0.631
+++++ Recall Score 0.587
+++++ F1 Score 0.608
              precision    recall  f1-score   support

    No churn       0.69      0.73      0.71      3354
       Churn       0.63      0.59      0.61      2675

    accuracy                           0.66      6029
   macro avg       0.66      0.66      0.66      6029
weighted avg       0.66      0.66      0.66      6029

+++++ AUC (Area under the ROC Curve) : 0.656
```
As we can see, the accuracy is 0.66.
4.3.4. Random Forest Model
Here are the results of the hyperparameter tuning:
```python
# GridSearchCV for the Random Forest Model
parameters = {}
#parameters['max_features'] = ['auto', 'sqrt', 'log2', None]
parameters['n_estimators'] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300]
#parameters['criterion'] = ['entropy', 'gini']
parameters['max_depth'] = [7, 8, 9, 10, 11, 12, 13, 14, 15, None]

GS_rf_0 = GridSearchCV(RandomForestClassifier(), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_rf_0.fit(x_train, y_train)
GridSearchResults(GS_rf_0)
```

Output:

```
Best parameters:
{'max_depth': 9, 'n_estimators': 200}
Best score:
0.66863 (+/-0.00920)
All parameters:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 9,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 200,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

params mean_test_score std_test_score rank_test_score
29 {'max_depth': 9, 'n_estimators': 200} 0.668632 0.009199 1
8 {'max_depth': 7, 'n_estimators': 900} 0.668418 0.009838 2
19 {'max_depth': 8, 'n_estimators': 600} 0.668348 0.008300 3
10 {'max_depth': 7, 'n_estimators': 1000} 0.668347 0.010234 4
49 {'max_depth': 10, 'n_estimators': 800} 0.668064 0.008299 5
15 {'max_depth': 8, 'n_estimators': 200} 0.667992 0.008415 6
23 {'max_depth': 8, 'n_estimators': 1100} 0.667921 0.008200 7
38 {'max_depth': 9, 'n_estimators': 1000} 0.667779 0.007937 8
6 {'max_depth': 7, 'n_estimators': 700} 0.667779 0.010037 9
28 {'max_depth': 9, 'n_estimators': 100} 0.667566 0.008903 10
```
Here are the results of fitting and evaluation:

```python
rf_model = RandomForestClassifier(criterion='entropy', max_depth=9, n_estimators=170, max_features=None)
rf_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, rf_model)
model_stats = model_stats.append({"Model": "Random Forest model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.688
+++++ Precision Score 0.663
+++++ Recall Score 0.603
+++++ F1 Score 0.631
              precision    recall  f1-score   support

    No churn       0.70      0.76      0.73      3354
       Churn       0.66      0.60      0.63      2675

    accuracy                           0.69      6029
   macro avg       0.68      0.68      0.68      6029
weighted avg       0.69      0.69      0.69      6029

+++++ AUC (Area under the ROC Curve) : 0.679
```
As we can see, the accuracy is 0.69.
4.3.5. K-Nearest Neighbors Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.neighbors import KNeighborsClassifier

# GridSearchCV for KNN
parameters = {}
parameters['n_neighbors'] = [i for i in range(1, 50)]
#parameters['weights'] = ['uniform', 'distance']
#parameters['algorithm'] = ['auto', 'ball_tree', 'kd_tree', 'brute']

GS_knn_0 = GridSearchCV(KNeighborsClassifier(), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_knn_0.fit(x_train, y_train)
GridSearchResults(GS_knn_0)
```

Output:

```
Best parameters:
{'n_neighbors': 37}
Best score:
0.63380 (+/-0.00678)
All parameters:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 37,
 'p': 2,
 'weights': 'uniform'}

params mean_test_score std_test_score rank_test_score
36 {'n_neighbors': 37} 0.633796 0.006776 1
37 {'n_neighbors': 38} 0.632374 0.009099 2
38 {'n_neighbors': 39} 0.631309 0.004983 3
35 {'n_neighbors': 36} 0.630597 0.007942 4
34 {'n_neighbors': 35} 0.630454 0.008985 5
32 {'n_neighbors': 33} 0.630241 0.008958 6
30 {'n_neighbors': 31} 0.629885 0.010527 7
22 {'n_neighbors': 23} 0.629033 0.010570 8
33 {'n_neighbors': 34} 0.629032 0.009545 9
31 {'n_neighbors': 32} 0.628677 0.009892 10
```
Here are the results of fitting and evaluation:

```python
knn_model = KNeighborsClassifier(algorithm='auto', n_neighbors=37, weights='uniform')
knn_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, knn_model)
model_stats = model_stats.append({"Model": "KNN model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.638
+++++ Precision Score 0.606
+++++ Recall Score 0.524
+++++ F1 Score 0.562
              precision    recall  f1-score   support

    No churn       0.66      0.73      0.69      3354
       Churn       0.61      0.52      0.56      2675

    accuracy                           0.64      6029
   macro avg       0.63      0.63      0.63      6029
weighted avg       0.63      0.64      0.63      6029

+++++ AUC (Area under the ROC Curve) : 0.626
```
As we can see, the accuracy is 0.64.
4.3.6. Summary and save the model
Now we have the statistics and results of all the models. As we can see, the best model is the Random Forest, which has a better accuracy (**approx. 0.69**) than the others.
```python
model_stats.head()
```

Output:

| | Model | Accuracy | Precision | Recall | F1-Score | AUC-Score |
|---:|:--------------------|-----------:|------------:|---------:|-----------:|------------:|
| 0 | Logistic model | 0.639244 | 0.598039 | 0.570093 | 0.583732 | 0.632244 |
| 1 | Decision Tree model | 0.673412 | 0.646717 | 0.581682 | 0.612478 | 0.664127 |
| 2 | SVM model | 0.664123 | 0.630522 | 0.586916 | 0.607938 | 0.656308 |
| 3 | Random Forest model | 0.687842 | 0.663102 | 0.602617 | 0.631414 | 0.679215 |
| 4 | KNN model | 0.637917 | 0.606494 | 0.523738 | 0.562086 | 0.626359 |
| 5 | KNN model | 0.637917 | 0.606771 | 0.522617 | 0.561559 | 0.626246 |
| 6 | KNN model | 0.637751 | 0.6066 | 0.522243 | 0.56127 | 0.626059 |
| 7 | KNN model | 0.637917 | 0.606494 | 0.523738 | 0.562086 | 0.626359 |

(The repeated KNN rows are an artifact of re-running the KNN evaluation cell, which appends to model_stats each time.)
So our final model is a Random Forest with the following hyperparameters:
```python
# GS_rf_2 is a second, finer GridSearchCV round over the Random Forest (search code not shown)
GridSearchResults(GS_rf_2)
```

Output:

```
All parameters:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 9,
 'max_features': None,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 170,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
```
We save the model with pickle so that we can use it later for prediction on a new dataset.
```python
import pickle

# Saving the model:
filename = 'modelChurn.pickle'
pickle.dump(rf_model, open(filename, 'wb'))
```
5. Prediction
We will test the model with a new dataset. At the beginning of the project, we split the data in two parts: one for training/testing and one for prediction; here we use the prediction part. These are the key tasks for this phase:
```mermaid
graph LR;
A[New Dataset] -->|Preprocess Pipeline| B[Dataset ready<br>for Prediction]
B -->|Load Model| C[Prediction]
```
5.1. Dataset Preparation
```python
# Load the new dataset for prediction
df = pd.read_csv('./Data/df_prediction.csv', index_col=0)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])
df.head()
```

Output:

```
Size of the dataset: 5400
Number of variables: 30

churn age housing credit_score deposits withdrawal purchases_partners purchases cc_taken cc_recommended ... waiting_4_loan cancelled_loan received_loan rejected_loan zodiac_sign left_for_two_month_plus left_for_one_month rewards_earned reward_rate is_referred
user
53131 0 37.0 O 588.0 5 0 19 5 0 58 ... 0 0 0 0 Gemini 0 0 11.0 0.92 1
23310 1 31.0 na 546.0 0 0 67 0 0 144 ... 0 0 0 0 Sagittarius 0 0 17.0 0.57 1
29996 0 51.0 na 508.0 0 0 7 0 0 15 ... 0 0 0 0 Aries 0 0 6.0 0.20 1
60425 0 25.0 na NaN 0 0 0 0 0 0 ... 0 0 0 0 Pisces 1 0 NaN 0.00 0
22972 1 28.0 na NaN 0 0 3 0 0 5 ... 0 0 0 0 Scorpio 1 0 2.0 0.07 0

5 rows × 30 columns
```
During the pre-processing phase, we created the pipeline list "pipeline_preprocess", which holds all the functions needed to perform the pre-processing tasks.
```python
print("\nSteps for pre-processing: ")
for step, function in enumerate(pipeline_preprocess):
    print("\t {:d}: {:s}".format(step, function.__name__))
```

Output:

```
Steps for pre-processing: 
	 0: dropnull
	 1: dropduplicated
	 2: dropcolumns
	 3: remove_outlier
	 4: standardizeNum
	 5: removedummy
```
So now we apply this process to the new dataset:
```python
def preprocess_data_pipeline(df, pipeline_preprocess):
    for step, function in enumerate(pipeline_preprocess):
        df = function(df)
    print("Size of the dataset: %d" % df.shape[0])
    print("Number of variables: %d" % df.shape[1])
    display(df.head(10))
    return df

df_new = preprocess_data_pipeline(df, pipeline_preprocess)
```

Output:

```
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
There are duplicated indexes....So removing duplicated indexes ...(3)
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
Standardizing Numerical variables ...(6)
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset: 5309
Number of variables: 29

churn age purchases_partners cc_application_begin app_downloaded web_user android_user left_for_two_month_plus left_for_one_month reward_rate ... zodiac_sign_Cancer zodiac_sign_Capricorn zodiac_sign_Gemini zodiac_sign_Leo zodiac_sign_Libra zodiac_sign_Pisces zodiac_sign_Sagittarius zodiac_sign_Scorpio zodiac_sign_Taurus zodiac_sign_Virgo
user
53131 0 0.669618 -0.036352 1.610414 1 1 1 0 0 0.026999 ... 0 0 1 0 0 0 0 0 0 0
23310 1 -0.037564 1.761699 1.302417 1 0 0 0 0 -0.440722 ... 0 0 0 0 0 0 1 0 0 0
29996 0 2.319710 -0.485864 0.840422 1 1 0 0 0 -0.935169 ... 0 0 0 0 0 0 0 0 0 0
60425 0 -0.744746 -0.748080 -0.853562 0 1 0 1 0 -1.202438 ... 0 0 0 0 0 1 0 0 0 0
22972 1 -0.391155 -0.635702 -0.853562 1 0 1 1 0 -1.108894 ... 0 0 0 0 0 0 0 1 0 0
1195 0 0.080300 2.286131 1.610414 1 1 1 0 0 1.563795 ... 0 0 0 0 0 0 0 0 0 0
41350 1 -0.626882 -0.748080 -0.853562 1 0 1 0 0 -1.202438 ... 0 0 0 0 0 0 0 0 0 1
29695 0 -0.273291 -0.748080 -0.853562 1 1 0 0 0 -1.202438 ... 1 0 0 0 0 0 0 0 0 0
15739 1 -0.980473 -0.748080 0.840422 1 0 1 0 0 1.376707 ... 0 0 0 0 0 0 0 0 0 0
51516 0 -0.980473 0.825214 -0.391566 1 0 1 0 0 -0.400631 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 29 columns
```
Now we are ready to filter the dataset down to the "new_num_features":
```python
# Using the feature selection for this dataset
new_num_features = ['purchases_partners', 'reward_rate', 'cc_application_begin', 'app_downloaded', 'zodiac_sign_Virgo', 'zodiac_sign_Taurus', 'zodiac_sign_Scorpio', 'zodiac_sign_Aquarius', 'housing_O', 'age', 'zodiac_sign_Libra', 'zodiac_sign_Leo']
X = df_new[new_num_features]
y = df_new['churn']
print("X shape: {}".format(X.shape))
print("y shape: {}".format(y.shape))
```

Output:

```
X shape: (5309, 12)
y shape: (5309,)
```
5.2. Make Predictions
```python
# Load the model:
filename = 'modelChurn.pickle'
model_Loaded = pickle.load(open(filename, 'rb'))

# Make predictions
prediction = model_Loaded.predict(X)
results = []
for i in list(zip(y, prediction)):
    if i[0] == i[1]:
        res = 1
    else:
        res = 0
    results.append(res)
print("% of successful prediction: {}".format(np.sum(results)/len(results)))
```

Output:

```
% of successful prediction: 0.6824260689395366
```
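Note that this fraction of correct predictions is exactly the accuracy, so the loop could be replaced by a one-liner (a sketch using scikit-learn):

```python
# Equivalent computation with scikit-learn:
from sklearn.metrics import accuracy_score
print("% of successful prediction: {}".format(accuracy_score(y, prediction)))
```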
6. Conclusions and Lessons Learned
- As expected, the prediction accuracy on the new data is around $0.7$, which reflects the model's test accuracy.
- As reviewed, the 4 most important features are purchases_partners, reward_rate, cc_application_begin and app_downloaded. This seems logical, because these features reflect customer behavior inside the app. Others, like zodiac_sign or housing, have low importance.
- During development, we faced processing-time issues with GridSearchCV. To speed it up, we split the hyperparameter tuning into phases, each with its own GridSearchCV run.
- We tried to improve accuracy, but most models land in the range $[0.64, 0.69]$, so the model is not especially strong. We chose 12 features; it might be better to choose fewer.
- During exploration, we needed to separate the analysis into categorical and numerical features, and, within the numerical ones, into binary and non-binary.