Churn Prediction for Fintech Data
Brief: The main idea of the project is to predict churn, and to find out which features are most relevant, using classic supervised techniques (LR, DT, SVM, RF, KNN) tuned with GridSearchCV.
Objective
Predict user churn for a fintech app using classical supervised models.
Project Overview
- Get the dataset from the fintech app and understand the key features.
- Explore and analyze the behaviour of the data: null values, duplicated indexes, class balance, outliers, categorical and numerical analysis, plus insights from comparing churn against key features.
- Based on the exploration phase, pre-process the dataset so it is suitable for training, testing, and later prediction.
- Model and evaluate 5 models, using GridSearchCV for hyperparameter tuning, and compare metrics to find the best model.
- Save the model with the pickle package in order to use it later for a prediction test.
- Create an app where we can use the model.
- Note: for more details, see HERE.
To get a big picture of the project, here is a diagram of each phase:
```mermaid
graph LR;
A[Data set<br>for training/testing] --> B[Exploration]
B --> C{Pre-Processing}
C -->|Training| D[Fitting]
D -->|Feature Selection, Tuning| F[Model]
F --> E[Evaluation]
F --> H[Prediction]
C -->|Testing| F
G[Data set<br>for predicting] -->|Pre-Processing| F
```
1. Dataset
1.1. Dataset Understanding
The business challenge: a fintech company runs a subscription model, and its app yields information about customer behavior. Companies normally try to minimize churn, i.e. subscription cancellations. The objective is to predict when a customer is about to churn, and which features are the most relevant.
Here are some of the features:
- userid - user ID
- churn - Active = No; Suspended < 30 = No; else = Yes (churn)
- age - age of the customer
- zodiac_sign - zodiac sign of the customer
- housing - rent_or_own - does the customer rent or own a house
- withdrawn_application - has the customer withdrawn the loan application
- used_ios - has the user used an iPhone
- used_android - has the user used an Android-based phone
- has_used_web - has the user used the MoneyLion web app
- has_used_mobile - has the user used the MoneyLion app
- has_reffered - has the user referred someone
- credit_score - customer's credit score
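To make the labeling rule concrete, here is a minimal sketch of how such a churn flag could be derived from a raw account status; the `status` and `suspended_days` columns are hypothetical and are not part of the dataset above.

```python
import pandas as pd

# Hypothetical raw columns (not in the real dataset), used only to illustrate the rule.
raw = pd.DataFrame({'status': ['Active', 'Suspended', 'Suspended'],
                    'suspended_days': [0, 12, 45]})

def churn_label(row):
    # Active accounts and short suspensions (< 30) count as "no churn".
    if row['status'] == 'Active':
        return 0
    if row['status'] == 'Suspended' and row['suspended_days'] < 30:
        return 0
    return 1  # everything else counts as churn

raw['churn'] = raw.apply(churn_label, axis=1)
print(raw)
```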
```python
# Core imports used throughout the notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

# Loading the data
df = pd.read_csv('../projectChurnRate/Data/churn_data.csv', index_col=0)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])
if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')
df.head()
```

Output:

```
Size of the dataset: 21600
Number of variables: 30
There are duplicated indexes.

user churn age housing credit_score deposits withdrawal purchases_partners purchases cc_taken cc_recommended ... waiting_4_loan cancelled_loan received_loan rejected_loan zodiac_sign left_for_two_month_plus left_for_one_month rewards_earned reward_rate is_referred
55409 0 37.0 na NaN 0 0 0 0 0 0 ... 0 0 0 0 Leo 1 0 NaN 0.00 0
23547 0 28.0 R 486.0 0 0 1 0 0 96 ... 0 0 0 0 Leo 0 0 44.0 1.47 1
58313 0 35.0 R 561.0 47 2 86 47 0 285 ... 0 0 0 0 Capricorn 1 0 65.0 2.17 0
8095 0 26.0 R 567.0 26 3 38 25 0 74 ... 0 0 0 0 Capricorn 0 0 33.0 1.10 1
61353 1 27.0 na NaN 0 0 2 0 0 0 ... 0 0 0 0 Aries 1 0 1.0 0.03 0

5 rows × 30 columns
```
2. Exploration and Analysis
For the exploration, we will focus on checking the behaviour of the data with respect to the following topics:
- Checking null and na values
- Checking whether the index is duplicated
- Verifying class balance in the dataset
- Dataframe data type analysis -> categorical and numerical, separately
- Checking outliers
- Analysis of churn over features
- Conclusions after exploration and analysis
2.1. Checking Null and na values
As we can see, there are many null values in the credit_score and rewards_earned columns. If we removed the affected rows, we would lose a lot of data and the dataset size would shrink considerably, so the best option is to remove those two columns instead. On the other hand, the age column only has 4 null values, so there we can drop the null rows (axis=0).
```python
null_finder = df.isnull().sum()
print(" ***** Number of Null Values by row: ***** ")
null_finder.where(null_finder > 0).dropna()
```

Output:

```
 ***** Number of Null Values by row: ***** 
age                  4.0
credit_score      6436.0
rewards_earned    2569.0
dtype: float64
```
2.2. Check if index is duplicated
There are duplicated indexes.
```python
if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')
```

Output:

```
There are duplicated indexes.
```
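To quantify the duplication before fixing it, a quick check along these lines (not part of the original notebook) can be used:

```python
# Count index labels that appear more than once
print("Duplicated index entries:", df.index.duplicated().sum())
```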
2.3. Verify class balance in the dataset
Before continuing the EDA, we have to remove the null/na values and duplicated indexes in order to have a clean exploration.
```python
sns.countplot(x='churn', data=df)
class_ratio_0 = Counter(df['churn'])[0]/df['churn'].shape[0]
class_ratio_1 = Counter(df['churn'])[1]/df['churn'].shape[0]
print('Class Ratio of {:} is {:.2f}'.format(list(Counter(df['churn']).keys())[0], class_ratio_0))
print('Class Ratio of {:} is {:.2f}'.format(list(Counter(df['churn']).keys())[1], class_ratio_1))
print('Class Ratio of 1 over 0 is {:.2f}'.format(class_ratio_1/class_ratio_0))
```

Output:

```
Class Ratio of 0 is 0.56
Class Ratio of 1 is 0.44
Class Ratio of 1 over 0 is 0.80
```
As we can see, the class variable is almost balanced, so we don't need any balancing technique.
- Class 1 -> churn
- Class 0 -> no churn
2.4. Dataframe data type analysis
For this analysis, we have to separate the variables into categorical and numerical ones. Hence, we define two lists: cat_features and num_features.
```python
# Separation of data types for EDA
cat_features = df.select_dtypes(exclude=np.number).columns
num_features = df.select_dtypes(include=np.number).columns
print("Quantity of Categorical features: ", len(cat_features), "\nCategorical features: ", cat_features)
print("\nQuantity of Numerical features: ", len(num_features), "\nNumerical features: ", num_features)
```

Output:

```
Quantity of Categorical features:  3
Categorical features:  Index(['housing', 'payment_type', 'zodiac_sign'], dtype='object')

Quantity of Numerical features:  25
Numerical features:  Index(['churn', 'age', 'deposits', 'withdrawal', 'purchases_partners',
       'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked',
       'cc_application_begin', 'app_downloaded', 'web_user', 'app_web_user',
       'ios_user', 'android_user', 'registered_phones', 'waiting_4_loan',
       'cancelled_loan', 'received_loan', 'rejected_loan',
       'left_for_two_month_plus', 'left_for_one_month', 'reward_rate',
       'is_referred'],
      dtype='object')
```
2.4.1. Categorical analysis:
These values cannot simply be binarized because they have more than 2 distinct values, so we need the one-hot encoding technique. Also, before removing the 'na' values, we have to check how many of them there are.
```python
print("Different quantity values: ", list(map(lambda col: (col, len(df[col].value_counts())), df.select_dtypes(exclude=np.number).columns)))
print("\nhousing feature values: ", df['housing'].unique(),
      "\npayment_type feature values: ", df['payment_type'].unique(),
      "\nzodiac_sign feature values: ", df['zodiac_sign'].unique())
```

Output:

```
Different quantity values:  [('housing', 3), ('payment_type', 5), ('zodiac_sign', 13)]

housing feature values:  ['na' 'R' 'O'] 
payment_type feature values:  ['Bi-Weekly' 'Weekly' 'na' 'Monthly' 'Semi-Monthly'] 
zodiac_sign feature values:  ['Aquarius' 'Scorpio' 'Capricorn' 'Pisces' 'Gemini' 'Cancer' 'Virgo'
 'Libra' 'na' 'Leo' 'Aries' 'Taurus' 'Sagittarius']
```
As we can see, there are a lot of 'na' values. We can remove them later during pre-processing, as part of the one-hot (dummy) encoding step.
```python
# Counting "na" values:
print("na values from housing: ", len(df[df['housing'] == 'na']))
print("na values from payment_type: ", len(df[df['payment_type'] == 'na']))
print("na values from zodiac_sign: ", len(df[df['zodiac_sign'] == 'na']))
```

Output:

```
na values from housing:  10373
na values from payment_type:  2883
na values from zodiac_sign:  1570
```

```python
# Analyze each categorical feature by churn count, to see which one is more relevant than the others.
fig, axs = plt.subplots(1, len(cat_features), figsize=(28, 10))
for i, ax in enumerate(axs.flatten()):
    sns.countplot(x=cat_features[i], hue='churn', dodge=False, data=df, ax=ax)
    ax.set_title(cat_features[i])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=8)
```
As we can see, these values should be translated into numbers with the one-hot encoding technique in order to keep the size of the dataset. The categorical values should be kept, because they affect churn behavior. Here are some insights:
- People with the Bi-Weekly payment_type churn more than the others.
- People who rent (R) churn more than people who own (O) their home.
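As a quick illustration of the encoding step (a sketch; the actual encoding happens in section 3.7), pd.get_dummies turns each category, including 'na', into its own binary column, and the *_na columns can then simply be dropped:

```python
# Sketch: one-hot encoding the housing column; 'na' becomes its own indicator column.
demo = pd.get_dummies(df[['housing']])    # -> housing_O, housing_R, housing_na
demo = demo.drop(columns=['housing_na'])  # dropping the na indicator removes the 'na' category
print(demo.head())
```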
2.4.2. Numerical analysis:
From the correlation matrix, we can get insights such as the following:
- Correlation of 0.91 between app_web_user and web_user: they are strongly positively correlated, so we can omit one of them.
- Correlation of 0.88 between reward_rate and cc_recommended: strongly positively correlated, so we can omit one of them.
- Correlation of -0.84 between android_user and ios_user: strongly negatively correlated, so we can omit one of them.
- Correlation of 1 between purchases and deposits: perfectly correlated, so we can omit one of them.
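The correlation matrix itself is not reproduced above; a minimal sketch to compute and plot it, assuming the imports from section 1, would be:

```python
# Correlation matrix over the numerical features, rendered as a heatmap.
plt.figure(figsize=(20, 16))
sns.heatmap(df[num_features].corr(), cmap='coolwarm', center=0)
plt.show()
```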
```python
# Quantity of numerical values:
list(map(lambda col: (col, len(df[col].value_counts())), df.select_dtypes(include=np.number).columns))
```

Output:

```
[('churn', 2),
 ('age', 69),
 ('deposits', 66),
 ('withdrawal', 20),
 ('purchases_partners', 281),
 ('purchases', 64),
 ('cc_taken', 11),
 ('cc_recommended', 324),
 ('cc_disliked', 20),
 ('cc_liked', 8),
 ('cc_application_begin', 123),
 ('app_downloaded', 2),
 ('web_user', 2),
 ('app_web_user', 2),
 ('ios_user', 2),
 ('android_user', 2),
 ('registered_phones', 5),
 ('waiting_4_loan', 2),
 ('cancelled_loan', 2),
 ('received_loan', 2),
 ('rejected_loan', 2),
 ('left_for_two_month_plus', 2),
 ('left_for_one_month', 2),
 ('reward_rate', 186),
 ('is_referred', 2)]
```
As we can see, some features are binary while others take more than 2 values.
First, we will explore the features with more than 2 distinct values; several of them, like ('registered_phones', 5), ('cc_disliked', 20), ('cc_taken', 11) and ('withdrawal', 20), take only a few specific values.
```python
num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, num_features))
num_features_bin = list(filter(lambda col: len(df[col].value_counts()) == 2, num_features))
print("Quantity of features greater than 2:", len(num_features_new))
print(num_features_new)
print()
print("Quantity of features equal to 2 (binary):", len(num_features_bin))
print(num_features_bin)
```

Output:

```
Quantity of features greater than 2: 12
['age', 'deposits', 'withdrawal', 'purchases_partners', 'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked', 'cc_application_begin', 'registered_phones', 'reward_rate']

Quantity of features equal to 2 (binary): 13
['churn', 'app_downloaded', 'web_user', 'app_web_user', 'ios_user', 'android_user', 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']
```
In the next plots, we can see that some features, like 'deposits', 'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked' and 'registered_phones', are overwhelmingly zero. Later, when removing outliers, all of their values will become 0, so those features are not useful and we have to remove them.
```python
# Exploring the distribution of the num_features_new values: 'age', 'credit_score', 'deposits',
# 'purchases_partners', 'purchases', 'cc_recommended', 'cc_application_begin', 'rewards_earned', 'reward_rate'
for col in num_features_new:
    sns.displot(df[col], kde=True, height=5)
    plt.xticks(rotation=90, fontsize=9)
    plt.xlabel(col, labelpad=10)
    plt.ylabel('Counts')
    plt.show()
```
To continue the exploration, we will omit 'churn', 'app_downloaded', 'web_user', 'app_web_user', 'ios_user' and 'android_user', because they have already been analyzed, and focus on the rest. There are 2 main insights:
- 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'left_for_one_month' and 'rejected_loan' have almost the same distribution with respect to churn, so we will only keep one of them.
- The others seem to behave differently.
```python
num_features_binary = ['waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']
fig, axs = plt.subplots(1, len(num_features_binary), figsize=(28, 10))
for i, ax in enumerate(axs.flatten()):
    sns.countplot(x=num_features_binary[i], hue='churn', dodge=False, data=df, ax=ax)
    ax.set_title(num_features_binary[i])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=8)
```
2.5. Check outliers
For this part, we have to detect outliers in the features which are not binary, because outliers make no sense for binary features. We flag a feature as having outliers when its values fall outside the interval $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, where $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range.
```python
def detect_outlier(X):
    """
    X: dataframe
    """
    for i in range(5):  # Only print the first 5
        first_q = np.percentile(X[X.columns[i]], 25)
        third_q = np.percentile(X[X.columns[i]], 75)
        IQR = 1.5 * (third_q - first_q)  # note: this variable already includes the 1.5 factor
        minimum = first_q - IQR
        maximum = third_q + IQR
        if minimum > np.min(X[X.columns[i]]) or maximum < np.max(X[X.columns[i]]):
            print(X.columns[i], "There is Outlier")

detect_outlier(df[num_features_new])
```

Output:

```
age There is Outlier
deposits There is Outlier
withdrawal There is Outlier
purchases_partners There is Outlier
purchases There is Outlier
```
As we can see, 'age', 'purchases_partners' and 'cc_application_begin' have genuine outliers. For the other flagged features, the mean sits at 0, which means most of their values are 0; after outlier removal those features would contain only the value '0'. Hence, they should be removed.
2.6. Analysis of Churn over features
Analysis 1:
- Question: What is the main age group of customers, and which age group churns more?
- Results: From the plot we can see that people between 21 and 32 churn more than older people.
```python
age_group = df[(df.age > 20) & (df.age < 80)]
df_filtered = age_group.groupby(['age', 'churn'])['deposits'].count().reset_index()
b = pd.pivot_table(df_filtered, values='deposits', index='age', columns=['churn']).reset_index()
b.head()
```

Output:

```
churn   age      0      1
0      21.0  373.0  397.0
1      22.0  409.0  410.0
2      23.0  519.0  437.0
3      24.0  573.0  449.0
4      25.0  580.0  478.0
```

```python
sns.barplot(x='age', y=1, data=b)
plt.xlabel('Age')
plt.ylabel('Churn Quantity')
plt.title('Age Churn Rate')
plt.xticks(rotation=90)
```
Analysis 2:
- Question: Do people who make deposits churn? Do people who make purchases churn?
- Results: We can see from the plot that people who do not make any deposit or purchase churn more, which is logical.
```python
df_filtered = df.groupby(['deposits', 'churn'])['housing'].count().reset_index()
b = pd.pivot_table(df_filtered, values='housing', index='deposits', columns=['churn']).reset_index()
b.head()
```

```python
plt.figure(figsize=(18, 10))
sns.barplot(x='deposits', y=1, data=b)
plt.xlabel('Deposits')
plt.ylabel('Churn Quantity')
plt.title('Deposits Churn Rate')
plt.xticks(rotation=90)
```
2.7. Conclusions after exploration and analysis
After the EDA, we have some insights and conclusions that must be applied in the pre-processing phase to prepare the data for modeling:
- People who do not make any deposit or purchase in the app churn more.
- People between 21 and 32 churn more than older people.
- 'age', 'purchases_partners' and 'cc_application_begin' have genuine outliers. The other flagged features have a mean of 0, meaning most of their values are 0; after outlier removal they would contain only 0, so they should be removed.
- 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan' and 'left_for_one_month' have almost the same distribution with respect to churn; we will keep only one of them.
- From the correlation matrix: 0.91 between app_web_user and web_user (strongly positively correlated), so we can omit one of them.
- From the correlation matrix: 0.88 between reward_rate and cc_recommended (strongly positively correlated), so we can omit one of them.
- From the correlation matrix: -0.84 between android_user and ios_user (strongly negatively correlated), so we can omit one of them.
- From the correlation matrix: 1 between purchases and deposits (perfectly correlated), so we can omit one of them.
- From the categorical values: people with the Bi-Weekly payment_type churn more than the others.
- From the categorical values: people who rent (R) churn more than people who own (O) their home.
- The categorical values should be translated into numbers with one-hot encoding to keep the dataset size; there are a lot of 'na' values, which we can remove via the one-hot (dummy) encoding.
- The class variable is almost balanced, so we won't need any balancing technique.
- Features like 'deposits', 'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked' and 'registered_phones' are overwhelmingly zero; after outlier removal all of their values will be 0, so they are not useful and have to be removed.
3. Pre-Processing
Based on what we found in the exploration phase, we will pre-process the dataset to make it suitable for modelling. We save each of these steps in a pipeline list for later pre-processing runs. Below is the list of tasks:
- Remove null/NaN values
- Remove duplicated indexes
- Remove non-useful features: the app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan and waiting_4_loan columns
- Update the numerical and categorical feature lists for modeling
- Remove outliers
- Num-Val: standardize the data for training (excluding binary features)
- Cat-Val: one-hot encoding
- Create a pre-processing pipeline function for later pre-processing of test data
Before starting, we create the pipeline list. At the end of the process, we will test this pipeline with a new dataset:
```python
# Pipeline for pre-processing
pipeline_preprocess = []
```
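As a design note: keeping a plain Python list of functions is simple and transparent. The same idea could also be expressed with scikit-learn's Pipeline machinery (a hypothetical alternative, not what this notebook does):

```python
# Hypothetical alternative using scikit-learn's Pipeline machinery.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Each pre-processing function takes a DataFrame and returns a DataFrame,
# so each one could be wrapped in a FunctionTransformer, e.g.:
# sk_pipe = make_pipeline(*[FunctionTransformer(f) for f in pipeline_preprocess])
# df_processed = sk_pipe.fit_transform(df)
```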
3.1. Remove Null/Nan values
```python
def dropnull(df):
    print("Removing columns credit_score and rewards_earned ...(1)")
    df = df.drop(columns=['credit_score', 'rewards_earned'])
    print("Drop null values from age column ...(2)")
    df = df[pd.notnull(df['age'])]
    return df

df = dropnull(df)
null_finder = df.isnull().sum()
print(" ***** Number of Null Values by row: ***** ")
null_finder.where(null_finder > 0).dropna()

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(dropnull)
```

Output:

```
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
 ***** Number of Null Values by row: ***** 
Series([], dtype: float64)
```
3.2. Remove duplicated indexes
```python
# This function will be added to the pre-processing pipeline:
def dropduplicated(df):
    if df.index.is_unique:
        print('Indexes are unique.')
        return df
    else:
        print('There are duplicated indexes....So removing duplicated indexes ...(3)')
        return df[~df.index.duplicated(keep='first')]

df = dropduplicated(df)
if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(dropduplicated)
```

Output:

```
There are duplicated indexes....So removing duplicated indexes ...(3)
Indexes are unique.
```
3.3. Remove non-useful features
During the EDA, we concluded that the following features are not useful:
- app_web_user
- deposit
- ios_user
- cc_recommended
- cancelled_loan
- received_loan
- rejected_loan
- waiting_4_loan
- withdrawal
- purchases
- cc_taken
- cc_disliked
- cc_liked
- registered_phones
```python
# Remove the non-useful columns identified during the EDA
def dropcolumns(df):
    print("Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)")
    df = df.drop(columns=['app_web_user', 'deposits', 'ios_user', 'cc_recommended', 'cancelled_loan', 'received_loan', 'rejected_loan', 'waiting_4_loan',
                          'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked', 'registered_phones'])
    return df

df = dropcolumns(df)

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(dropcolumns)
```

Output:

```
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
```
3.4. Update numerical and categorical feature lists for modeling
We need to update the numerical and categorical feature lists before starting to model, because some features were removed from the dataset.
```python
# Update numerical and categorical feature lists for later modeling
cat_features = df.select_dtypes(exclude=np.number).columns
num_features = df.select_dtypes(include=np.number).columns
print("Quantity of Categorical features: ", len(cat_features), "\nCategorical features: ", cat_features)
print("\nQuantity of Numerical features: ", len(num_features), "\nNumerical features: ", num_features)
```

Output:

```
Quantity of Categorical features:  3
Categorical features:  Index(['housing', 'payment_type', 'zodiac_sign'], dtype='object')

Quantity of Numerical features:  11
Numerical features:  Index(['churn', 'age', 'purchases_partners', 'cc_application_begin',
       'app_downloaded', 'web_user', 'android_user', 'left_for_two_month_plus',
       'left_for_one_month', 'reward_rate', 'is_referred'],
      dtype='object')
```

```python
# Quantity of numerical values:
num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, df.select_dtypes(include=np.number).columns))
num_features_bin = list(filter(lambda col: len(df[col].value_counts()) == 2, df.select_dtypes(include=np.number).columns))
print("Quantity of features greater than 2:", len(num_features_new))
print(num_features_new)
print()
print("Quantity of features equal to 2 (binary):", len(num_features_bin))
print(num_features_bin)
```

Output:

```
Quantity of features greater than 2: 4
['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']

Quantity of features equal to 2 (binary): 7
['churn', 'app_downloaded', 'web_user', 'android_user', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']
```
So now we have the following feature counts:
- Quantity of categorical features: 3
- Quantity of numerical features: 11 (4 non-binary, 7 binary)
3.5. Remove outliers
For removing outliers, we focus on the numerical features which are not binary.
```python
# Removing outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are replaced by the column median
def remove_outlier(df):
    """
    df: dataframe
    """
    num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, df.select_dtypes(include=np.number).columns))
    outliers = []
    for col in num_features_new:
        first_q = np.percentile(df[col], 25)
        third_q = np.percentile(df[col], 75)
        IQR = 1.5 * (third_q - first_q)
        minimum = first_q - IQR
        maximum = third_q + IQR
        if minimum > np.min(df[col]) or maximum < np.max(df[col]):
            outliers.append(col)
    print("The outliers are :", outliers)
    print("Removing outliers ...(5)")
    for col in outliers:
        first_q = np.percentile(df[col], 25)
        third_q = np.percentile(df[col], 75)
        IQR = 1.5 * (third_q - first_q)
        minimum = first_q - IQR
        maximum = third_q + IQR
        median = df[col].median()
        # Replace out-of-range values with the median
        df.loc[df[col] < minimum, col] = median
        df.loc[df[col] > maximum, col] = median
    return df

df = remove_outlier(df)

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(remove_outlier)
```

Output:

```
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
```
3.6. Num-Val: Standardize data
For the numerical values, we standardize the data for training, without considering the binary features.
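For reference, StandardScaler applies the usual z-score transform to each selected column: $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the column's mean and standard deviation.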
```python
from sklearn.preprocessing import StandardScaler

def standardizeNum(df):
    standardize = StandardScaler()
    num_features = df.select_dtypes(include=np.number).columns
    num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, num_features))
    print("Standardizing Numerical variables ...(6)")
    df[num_features_new] = standardize.fit_transform(df[num_features_new])
    return df

df = standardizeNum(df)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(standardizeNum)
```

Output:

```
Standardizing Numerical variables ...(6)
Size of the dataset: 20095
Number of variables: 14
```
3.7. Cat-Val: One-hot encoding
Here we perform the one-hot encoding for the categorical features, and also remove the 'na' values by dropping the resulting *_na columns.
```python
def removedummy(df):
    print("Convert categorical values into numbers ...(7)")
    df_new = pd.get_dummies(df)
    print("Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)")
    df_new = df_new.drop(columns=['housing_na', 'zodiac_sign_na', 'payment_type_na'])
    return df_new

df = removedummy(df)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

# Adding to the pre-processing pipeline:
pipeline_preprocess.append(removedummy)
```

Output:

```
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset: 20095
Number of variables: 29
```
3.8. Create the pre-processing pipeline function
During the pre-processing phase, we used one function per specific task. These functions will later be reused on new datasets, such as test data or entirely new data; hence we keep them in a pipeline list for later pre-processing. Here is the list of all the functions we used:
```python
print("\nSteps for pre-processing: ")
for step, function in enumerate(pipeline_preprocess):
    print("\t {:d}: {:s}".format(step, function.__name__))
```

Output:

```
Steps for pre-processing: 
	 0: dropnull
	 1: dropduplicated
	 2: dropcolumns
	 3: remove_outlier
	 4: standardizeNum
	 5: removedummy
```
In this way, we create the function 'preprocess_data_pipeline' for later pre-processing of a new dataset (prediction):
```python
# Definition of preprocess_data for a specific dataset:
def preprocess_data_pipeline(df, pipeline_preprocess):
    for step, function in enumerate(pipeline_preprocess):
        df = function(df)
    print("Size of the dataset: %d" % df.shape[0])
    print("Number of variables: %d" % df.shape[1])
    df.head(10)
    return df
```
Testing the 'preprocess_data_pipeline' function with a new dataset:
```python
df_aux = pd.read_csv('../projectChurnRate/Data/churn_data.csv', index_col=0).sample(n=2100, random_state=0)
preprocess_data_pipeline(df_aux, pipeline_preprocess)
```

Output:

```
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
There are duplicated indexes....So removing duplicated indexes ...(3)
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
Standardizing Numerical variables ...(6)
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset: 2086
Number of variables: 29

user churn age housing purchases_partners cc_application_begin app_downloaded web_user android_user payment_type zodiac_sign left_for_two_month_plus left_for_one_month reward_rate is_referred
50488 0 20.0 R 29 8 1 0 1 Bi-Weekly Virgo 0 0 0.44 1
53603 0 38.0 na 28 5 1 0 1 Monthly Sagittarius 0 0 0.67 1
42289 1 40.0 R 9 4 1 1 1 na Aries 0 0 0.63 1
4185 0 34.0 na 0 4 1 1 0 Bi-Weekly Scorpio 0 0 2.07 0
12436 1 24.0 O 38 7 1 1 1 Weekly Scorpio 0 0 0.73 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23350 1 22.0 R 31 7 1 1 1 Weekly Virgo 1 0 0.00 1
53737 0 43.0 na 23 1 1 1 0 Bi-Weekly Gemini 0 0 0.50 0
56724 1 40.0 R 0 0 1 1 0 Monthly Taurus 0 0 0.30 0
17737 1 23.0 na 8 1 1 1 1 Semi-Monthly Pisces 0 0 0.65 0
46045 1 27.0 R 0 3 1 0 0 Bi-Weekly Virgo 0 0 0.53 0

2086 rows × 14 columns
```
4. Model and Evaluation
For modeling, we go through the following phases:
- Separate the dataset:
  - Into X (features) and y (label)
  - Split the dataset into training and testing sets
- Feature selection:
  - Use a logistic regression model
  - Use other techniques: RFE, Random Forest and Mutual Information
  - Then sum up the results and choose the most important features
- Modeling to find the best model:
  - Hyperparameter tuning using GridSearchCV to find the best hyperparameters of each model
  - Fitting and evaluation using the following models: Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, K-Nearest Neighbors
  - Choose the best model according to the metrics
  - Save the model to use with new values
For evaluation, we use the following metrics (defined after this list):
- Confusion matrix
- Accuracy as the classification metric
- Precision, Recall, F1-score
- Some graphs: ROC, Precision vs Recall, KS Statistic Test, Cumulative Gain, Lift Curve
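For reference, the point metrics are defined from the confusion-matrix counts (TP, FP, TN, FN):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad \mathrm{Precision} = \frac{TP}{TP + FP},\qquad \mathrm{Recall} = \frac{TP}{TP + FN},\qquad F_1 = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$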
```mermaid
graph LR;
A[Training Dataset] --> B[Feature Selection]
B --> D[Hyperparameter Tuning]
D --> E[Fitting]
E --> F[Model]
F -->|Tuning| D
```

```mermaid
graph LR;
C[Testing Dataset] --> F[Model Fitted]
F --> G[Evaluation]
G --> H[Metrics Insights]
```
4.1. Dataset Separation
First, we split the data into features and label (the churn variable); then we split it into training and testing sets.
```python
from sklearn.model_selection import train_test_split

X = df.drop('churn', axis=1)
y = df['churn']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=10)
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))
```

Output:

```
x_train shape: (14066, 28)
y_train shape: (14066,)
x_test shape: (6029, 28)
y_test shape: (6029,)
```
4.2. Feature Selection
For feature selection, we will use the following techniques:
- A logistic regression model
- Other techniques:
  - Recursive Feature Elimination (RFE) as a wrapper method
  - Random Forest as an embedded method
  - Mutual Information classification as an entropy-based method
- Then we sum up the results and choose the most important features
4.2.1. Logistic Regression
Using this model, the most important features are those with coef < -0.2 or coef > 0.2; there are 7 such features out of 28. We can see that the most important variables are:
- left_for_one_month
- web_user
- app_downloaded
- payment_type_Weekly
- housing_O
- reward_rate
- purchases_partners
```python
# feature_importance_lr is a DataFrame of (feature, coef) pairs built from the
# coefficients of a LogisticRegression fitted on x_train (construction not shown).
feature_importance_lr.where((feature_importance_lr["coef"] < -0.2) | (feature_importance_lr["coef"] > 0.2)).dropna()
```

Output:

```
                feature      coef
7    left_for_one_month  0.438657
4              web_user  0.289608
3        app_downloaded  0.268084
15  payment_type_Weekly  0.218277
10            housing_O -0.204331
8           reward_rate -0.298566
1    purchases_partners -0.538643
```
4.2.2. Recursive Feature Elimination (RFE)
As a wrapper method, RFE uses an external estimator (LogisticRegression in our case) to recursively search for and select a small subset of features by evaluating their importance. Using RFE, the 12 most important features are:
- purchases_partners
- app_downloaded
- web_user
- left_for_two_month_plus
- left_for_one_month
- reward_rate
- housing_O
- payment_type_Weekly
- zodiac_sign_Aquarius
- zodiac_sign_Capricorn
- zodiac_sign_Libra
- zodiac_sign_Taurus
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Number of features to keep: 12
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=12, step=10, verbose=5)
rfe_selector.fit(x_train, y_train)
rfe_support = rfe_selector.get_support()
feature_rfe = x_train.loc[:, rfe_support].columns.tolist()
print(str(len(feature_rfe)), 'selected features')
print(feature_rfe)
```

Output:

```
Fitting estimator with 28 features.
Fitting estimator with 18 features.
12 selected features
['purchases_partners', 'app_downloaded', 'web_user', 'left_for_two_month_plus', 'left_for_one_month', 'reward_rate', 'housing_O', 'payment_type_Weekly', 'zodiac_sign_Aquarius', 'zodiac_sign_Capricorn', 'zodiac_sign_Libra', 'zodiac_sign_Taurus']
```
4.2.3. Random Forest
In a Random Forest, node purity tells us about feature importance: a feature whose splits produce nodes with low (Gini) impurity across all trees is considered the most important. Using the Random Forest model, only 4 features are selected:
- age
- purchases_partners
- cc_application_begin
- reward_rate
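For reference, the Gini impurity of a node with class proportions $p_k$ is

$$G = 1 - \sum_k p_k^2,$$

so a pure node has $G = 0$, and a feature's importance accumulates the impurity decrease of the splits it drives.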
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=12)
embeded_rf_selector.fit(x_train, y_train)
embeded_rf_support = embeded_rf_selector.get_support()
feature_embeded_rf = x_train.loc[:, embeded_rf_support].columns.tolist()
print(str(len(feature_embeded_rf)), 'selected features')
feature_embeded_rf
```

Output:

```
4 selected features
['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
```
4.2.4. Mutual Information
Mutual Information comes from information theory and applies information gain (as in decision trees) to feature selection. The gain is calculated between two variables and measures the reduction in uncertainty about one given the other. Using this method, the 12 most important features are:
- purchases_partners
- reward_rate
- zodiac_sign_Leo
- is_referred
- cc_application_begin
- android_user
- zodiac_sign_Scorpio
- housing_O
- zodiac_sign_Aquarius
- housing_R
- app_downloaded
- zodiac_sign_Virgo
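For reference, the mutual information between two discrete variables $X$ and $Y$ is

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},$$

which is zero exactly when $X$ and $Y$ are independent.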
```python
from sklearn.feature_selection import mutual_info_classif

importances = mutual_info_classif(x_train, y_train)
feature_mutual_importance = pd.Series(importances, index=x_train.columns)
feature_mutual_importance.sort_values(ascending=False).plot.bar(figsize=(20, 8))
feature_mutual = list(feature_mutual_importance.where(feature_mutual_importance > 0.002).sort_values(ascending=False).dropna().index)
print("There are {} important features and they are: \n".format(len(feature_mutual)))
feature_mutual
```

Output:

```
There are 12 important features and they are: 

['purchases_partners',
 'reward_rate',
 'zodiac_sign_Leo',
 'is_referred',
 'cc_application_begin',
 'android_user',
 'zodiac_sign_Scorpio',
 'housing_O',
 'zodiac_sign_Aquarius',
 'housing_R',
 'app_downloaded',
 'zodiac_sign_Virgo']
```
In the bar plot we can see the most important features vs churn; "purchases_partners" has the most influence.
4.2.5. Summary
Summing up all the results, we get the 12 most important features to start modeling with.
```python
# Put all selections together. lr_support and mutual_support are the boolean
# masks from the Logistic Regression and Mutual Information selections above;
# num_feats is the number of features to keep (12).
feature_selection_df = pd.DataFrame({'Feature': x_train.columns,
                                     'Logistics': lr_support,
                                     'RFE': rfe_support,
                                     'Random Forest': embeded_rf_support,
                                     'Mutual Classif': mutual_support})
# Count how many times each feature was selected
feature_selection_df['Total'] = np.sum(feature_selection_df, axis=1)
# Display the top features
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.head(num_feats)
```

Output:

```
                 Feature  Logistics    RFE  Random Forest  Mutual Classif  Total
1     purchases_partners       True   True           True            True      4
8            reward_rate      False   True           True            True      3
2   cc_application_begin       True  False           True            True      3
3         app_downloaded       True   True          False            True      3
27     zodiac_sign_Virgo       True  False          False            True      2
26    zodiac_sign_Taurus       True   True          False           False      2
25   zodiac_sign_Scorpio       True  False          False            True      2
16  zodiac_sign_Aquarius      False   True          False            True      2
10             housing_O      False   True          False            True      2
0                    age       True  False           True           False      2
22     zodiac_sign_Libra      False   True          False           False      1
21       zodiac_sign_Leo      False  False          False            True      1
```
Here they are:
- purchases_partners
- reward_rate
- cc_application_begin
- app_downloaded
- zodiac_sign_Virgo
- zodiac_sign_Taurus
- zodiac_sign_Scorpio
- zodiac_sign_Aquarius
- housing_O
- age
- zodiac_sign_Libra
- zodiac_sign_Leo
Now we are ready to start modeling. For this, we split the dataset again, keeping only the features in the new_num_features list.
```python
# new_num_features holds the 12 selected features listed above
X = X[new_num_features]
y = df['churn']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=10)
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))
```

Output:

```
x_train shape: (14066, 12)
y_train shape: (14066,)
x_test shape: (6029, 12)
y_test shape: (6029,)
```
4.3. Modeling & Evaluation
Modeling to find the best model:
- Hyperparameter tuning using GridSearchCV to find the best hyperparameters of each model
- Fitting and evaluation using the following models:
  - Logistic Regression Model
  - Decision Tree Model
  - Support Vector Machine Model
  - Random Forest Model
  - K-Nearest Neighbors Model
- Choose the best model according to the metrics
- Save the model to use with new values
For evaluation, we use the following metrics:
- Confusion matrix
- Accuracy as the classification metric
- Precision, Recall, F1-score
- Some graphs: ROC, Precision vs Recall, KS Statistic Test, Cumulative Gain, Lift Curve
Most of the steps in this phase are mechanical, so here are the key points to consider:
```mermaid
graph LR;
A[Training Dataset] --> B[GridSearchCV]
B -->|Hyperparameter Tuning| C[Fitting/Predicting]
C --> B
C -->|Evaluation Metrics| D[Save Metrics]
```
We also create 2 helper functions to summarize all the results during hyperparameter tuning and modeling (fit/predict):
```python
def GridSearchResults(grid_clf, num_results=10, display_all_params=True):
    """
    Summarize all the results of a fitted GridSearchCV object.
    """
    ...

def evaluationMetricsGCV(x_test, y_test, model_fit):
    """
    Summarize all the results of fitting and predicting.
    """
    ...
```
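The bodies of these helpers are elided in the write-up. A minimal sketch consistent with the outputs shown below, assuming scikit-learn's standard metrics API, might look like this:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

def GridSearchResults(grid_clf, num_results=10, display_all_params=True):
    """Summarize a fitted GridSearchCV object (sketch)."""
    print("Best parameters:")
    print(grid_clf.best_params_)
    best_std = grid_clf.cv_results_['std_test_score'][grid_clf.best_index_]
    print("Best score:\n{:.5f} (+/-{:.5f})".format(grid_clf.best_score_, best_std))
    if display_all_params:
        print("All parameters:")
        print(grid_clf.best_estimator_.get_params())
    # Table of the top-ranked parameter combinations
    results = pd.DataFrame(grid_clf.cv_results_)
    cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
    return results[cols].sort_values('rank_test_score').head(num_results)

def evaluationMetricsGCV(x_test, y_test, model_fit):
    """Compute the evaluation metrics of a fitted model on the test set (sketch)."""
    y_pred = model_fit.predict(x_test)
    y_prob = model_fit.predict_proba(x_test)[:, 1]
    Accuracy = accuracy_score(y_test, y_pred)
    Precision = precision_score(y_test, y_pred)
    Recall = recall_score(y_test, y_pred)
    F1 = f1_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_prob)
    print("Results:")
    print("+++++ Accuracy Score {:.3f}".format(Accuracy))
    print("+++++ Precision Score {:.3f}".format(Precision))
    print("+++++ Recall Score {:.3f}".format(Recall))
    print("+++++ F1 Score {:.3f}".format(F1))
    print(classification_report(y_test, y_pred, target_names=['No churn', 'Churn']))
    print("+++++ AUC (Area under the ROC Curve) : {:.3f}".format(auc_score))
    return Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob
```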
To save the metrics, we create a dataframe for later use when choosing the best model:
```python
# Dataframe for statistics
model_stats = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1-Score", "AUC-Score"])
model_stats.head()
```

Output:

```
Model Accuracy Precision Recall F1-Score AUC-Score
```
4.3.1. Logistic Regression Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.model_selection import GridSearchCV

# GridSearchCV for Logistic Regression
parameters = {}
parameters['C'] = [10e-3, 10e-2, 10e-1, 1, 10, 100, 1000]
parameters['class_weight'] = [None, 'balanced']
parameters['penalty'] = ["l1", "l2", 'elasticnet']
parameters['solver'] = ['newton-cg', 'lbfgs', 'liblinear', 'sag']

GS_log = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_log.fit(x_train, y_train)
GridSearchResults(GS_log)
```

Output:

```
Best parameters:
{'C': 0.01, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Best score:
0.63038 (+/-0.01300)
All parameters:
{'C': 0.01,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l1',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

params mean_test_score std_test_score rank_test_score
2 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.630383 0.012998 1
6 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627824 0.013089 2
26 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627611 0.012696 3
4 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627469 0.012713 4
5 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627469 0.012713 4
7 {'C': 0.01, 'class_weight': None, 'penalty': '... 0.627469 0.012713 4
29 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627184 0.013151 7
31 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627184 0.013151 7
28 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.627184 0.013151 7
30 {'C': 0.1, 'class_weight': None, 'penalty': 'l... 0.626971 0.013364 10
```
Here are the results of fitting and evaluation:

```python
logr_model = LogisticRegression(C=0.01, class_weight=None, max_iter=1000, penalty='l1', random_state=1000, solver='liblinear')
logr_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, logr_model)
model_stats = model_stats.append({"Model": "Logistic model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.639
+++++ Precision Score 0.598
+++++ Recall Score 0.570
+++++ F1 Score 0.584
              precision    recall  f1-score   support

    No churn       0.67      0.69      0.68      3354
       Churn       0.60      0.57      0.58      2675

    accuracy                           0.64      6029
   macro avg       0.63      0.63      0.63      6029
weighted avg       0.64      0.64      0.64      6029

+++++ AUC (Area under the ROC Curve) : 0.632
```
As we can see, the accuracy is 0.64.
4.3.2. Decision Tree Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.tree import DecisionTreeClassifier

# GridSearchCV for the Decision Tree Model
parameters = {}
parameters['max_depth'] = [i for i in range(1, 11)]
parameters['class_weight'] = [None, 'balanced']
parameters['max_features'] = [i for i in range(1, 8)]
parameters['min_samples_leaf'] = [i for i in range(1, 11)]

GS_tree = GridSearchCV(DecisionTreeClassifier(random_state=1000), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_tree.fit(x_train, y_train)
GridSearchResults(GS_tree)
```

Output:

```
Best parameters:
{'class_weight': None, 'max_depth': 8, 'max_features': 7, 'min_samples_leaf': 5}
Best score:
0.66344 (+/-0.00662)
All parameters:
{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 8,
 'max_features': 7,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 1000,
 'splitter': 'best'}

params mean_test_score std_test_score rank_test_score
554 {'class_weight': None, 'max_depth': 8, 'max_fe... 0.663444 0.006620 1
474 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.663371 0.009664 2
482 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.661808 0.006527 3
485 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.661666 0.008073 4
487 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.661594 0.008310 5
481 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660883 0.008289 6
483 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660670 0.007874 7
553 {'class_weight': None, 'max_depth': 8, 'max_fe... 0.660669 0.010791 8
473 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660385 0.008892 9
480 {'class_weight': None, 'max_depth': 7, 'max_fe... 0.660314 0.009318 10
```
Here are the results of fitting and evaluation:

```python
dt_model = DecisionTreeClassifier(max_depth=8, max_features=7, min_samples_leaf=5, random_state=1000)
dt_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, dt_model)
model_stats = model_stats.append({"Model": "Decision Tree model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.673
+++++ Precision Score 0.647
+++++ Recall Score 0.582
+++++ F1 Score 0.612
              precision    recall  f1-score   support

    No churn       0.69      0.75      0.72      3354
       Churn       0.65      0.58      0.61      2675

    accuracy                           0.67      6029
   macro avg       0.67      0.66      0.67      6029
weighted avg       0.67      0.67      0.67      6029

+++++ AUC (Area under the ROC Curve) : 0.664
```
As we can see, the accuracy is 0.67.
4.3.3. Support Vector Machine Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.svm import SVC

# GridSearchCV for the Support Vector Machine Model
parameters = {}
parameters['C'] = [10e-2, 1, 100]
parameters['kernel'] = ['linear', 'poly', 'rbf']
#parameters['gamma'] = np.arange(0.01, 0.4, 0.1)

GS_SVM = GridSearchCV(SVC(random_state=1000, probability=True), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_SVM.fit(x_train, y_train)
GridSearchResults(GS_SVM)
```

Output:

```
Best parameters:
{'C': 1, 'kernel': 'rbf'}
Best score:
0.64545 (+/-0.01170)
All parameters:
{'C': 1,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': True,
 'random_state': 1000,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

params mean_test_score std_test_score rank_test_score
5 {'C': 1, 'kernel': 'rbf'} 0.645455 0.011701 1
7 {'C': 100, 'kernel': 'poly'} 0.644317 0.010408 2
8 {'C': 100, 'kernel': 'rbf'} 0.643818 0.012859 3
4 {'C': 1, 'kernel': 'poly'} 0.639838 0.010713 4
2 {'C': 0.1, 'kernel': 'rbf'} 0.637066 0.013727 5
1 {'C': 0.1, 'kernel': 'poly'} 0.632587 0.010276 6
0 {'C': 0.1, 'kernel': 'linear'} 0.612112 0.012761 7
6 {'C': 100, 'kernel': 'linear'} 0.612041 0.013000 8
3 {'C': 1, 'kernel': 'linear'} 0.611970 0.012831 9
```
Here are the results of fitting and evaluation:

```python
svm_model = SVC(C=1, gamma=0.31, kernel='rbf', random_state=1000, probability=True)
svm_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, svm_model)
model_stats = model_stats.append({"Model": "SVM model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.664
+++++ Precision Score 0.631
+++++ Recall Score 0.587
+++++ F1 Score 0.608
              precision    recall  f1-score   support

    No churn       0.69      0.73      0.71      3354
       Churn       0.63      0.59      0.61      2675

    accuracy                           0.66      6029
   macro avg       0.66      0.66      0.66      6029
weighted avg       0.66      0.66      0.66      6029

+++++ AUC (Area under the ROC Curve) : 0.656
```
As we can see, the accuracy is 0.66.
4.3.4. Random Forest Model
Here are the results of the hyperparameter tuning:
```python
# GridSearchCV for the Random Forest Model
parameters = {}
#parameters['max_features'] = ['auto', 'sqrt', 'log2', None]
parameters['n_estimators'] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300]
#parameters['criterion'] = ['entropy', 'gini']
parameters['max_depth'] = [7, 8, 9, 10, 11, 12, 13, 14, 15, None]

GS_rf_0 = GridSearchCV(RandomForestClassifier(), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_rf_0.fit(x_train, y_train)
GridSearchResults(GS_rf_0)
```

Output:

```
Best parameters:
{'max_depth': 9, 'n_estimators': 200}
Best score:
0.66863 (+/-0.00920)
All parameters:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 9,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 200,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

params mean_test_score std_test_score rank_test_score
29 {'max_depth': 9, 'n_estimators': 200} 0.668632 0.009199 1
8 {'max_depth': 7, 'n_estimators': 900} 0.668418 0.009838 2
19 {'max_depth': 8, 'n_estimators': 600} 0.668348 0.008300 3
10 {'max_depth': 7, 'n_estimators': 1000} 0.668347 0.010234 4
49 {'max_depth': 10, 'n_estimators': 800} 0.668064 0.008299 5
15 {'max_depth': 8, 'n_estimators': 200} 0.667992 0.008415 6
23 {'max_depth': 8, 'n_estimators': 1100} 0.667921 0.008200 7
38 {'max_depth': 9, 'n_estimators': 1000} 0.667779 0.007937 8
6 {'max_depth': 7, 'n_estimators': 700} 0.667779 0.010037 9
28 {'max_depth': 9, 'n_estimators': 100} 0.667566 0.008903 10
```
Here are the results of fitting and evaluation:

```python
rf_model = RandomForestClassifier(criterion='entropy', max_depth=9, n_estimators=170, max_features=None)
rf_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, rf_model)
model_stats = model_stats.append({"Model": "Random Forest model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.688
+++++ Precision Score 0.663
+++++ Recall Score 0.603
+++++ F1 Score 0.631
              precision    recall  f1-score   support

    No churn       0.70      0.76      0.73      3354
       Churn       0.66      0.60      0.63      2675

    accuracy                           0.69      6029
   macro avg       0.68      0.68      0.68      6029
weighted avg       0.69      0.69      0.69      6029

+++++ AUC (Area under the ROC Curve) : 0.679
```
As we can see, the accuracy is 0.69.
4.3.5. K-Nearest Neighbors Model
Here are the results of the hyperparameter tuning:
```python
from sklearn.neighbors import KNeighborsClassifier

# GridSearchCV for KNN
parameters = {}
parameters['n_neighbors'] = [i for i in range(1, 50)]
#parameters['weights'] = ['uniform', 'distance']
#parameters['algorithm'] = ['auto', 'ball_tree', 'kd_tree', 'brute']

GS_knn_0 = GridSearchCV(KNeighborsClassifier(), parameters, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
GS_knn_0.fit(x_train, y_train)
GridSearchResults(GS_knn_0)
```

Output:

```
Best parameters:
{'n_neighbors': 37}
Best score:
0.63380 (+/-0.00678)
All parameters:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 37,
 'p': 2,
 'weights': 'uniform'}

params mean_test_score std_test_score rank_test_score
36 {'n_neighbors': 37} 0.633796 0.006776 1
37 {'n_neighbors': 38} 0.632374 0.009099 2
38 {'n_neighbors': 39} 0.631309 0.004983 3
35 {'n_neighbors': 36} 0.630597 0.007942 4
34 {'n_neighbors': 35} 0.630454 0.008985 5
32 {'n_neighbors': 33} 0.630241 0.008958 6
30 {'n_neighbors': 31} 0.629885 0.010527 7
22 {'n_neighbors': 23} 0.629033 0.010570 8
33 {'n_neighbors': 34} 0.629032 0.009545 9
31 {'n_neighbors': 32} 0.628677 0.009892 10
```
Here are the results of fitting and evaluation:

```python
knn_model = KNeighborsClassifier(algorithm='auto', n_neighbors=37, weights='uniform')
knn_model.fit(x_train, y_train)
Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, knn_model)
model_stats = model_stats.append({"Model": "KNN model",
                                  "Accuracy": Accuracy,
                                  "Precision": Precision,
                                  "Recall": Recall,
                                  "F1-Score": F1,
                                  "AUC-Score": auc_score}, ignore_index=True)
```

Output:

```
Results:
+++++ Accuracy Score 0.638
+++++ Precision Score 0.606
+++++ Recall Score 0.524
+++++ F1 Score 0.562
              precision    recall  f1-score   support

    No churn       0.66      0.73      0.69      3354
       Churn       0.61      0.52      0.56      2675

    accuracy                           0.64      6029
   macro avg       0.63      0.63      0.63      6029
weighted avg       0.63      0.64      0.63      6029

+++++ AUC (Area under the ROC Curve) : 0.626
```
As we can see, the accuracy is 0.64.
4.3.6. Summary and save the model
Now we have the statistics and results of all the models. As we can see, the best model is the Random Forest, which has a better accuracy (**approx. 0.69**) than the others.
```python
model_stats.head()
```

Output:

| | Model | Accuracy | Precision | Recall | F1-Score | AUC-Score |
|---:|:--------------------|-----------:|------------:|---------:|-----------:|------------:|
| 0 | Logistic model | 0.639244 | 0.598039 | 0.570093 | 0.583732 | 0.632244 |
| 1 | Decision Tree model | 0.673412 | 0.646717 | 0.581682 | 0.612478 | 0.664127 |
| 2 | SVM model | 0.664123 | 0.630522 | 0.586916 | 0.607938 | 0.656308 |
| 3 | Random Forest model | 0.687842 | 0.663102 | 0.602617 | 0.631414 | 0.679215 |
| 4 | KNN model | 0.637917 | 0.606494 | 0.523738 | 0.562086 | 0.626359 |
| 5 | KNN model | 0.637917 | 0.606771 | 0.522617 | 0.561559 | 0.626246 |
| 6 | KNN model | 0.637751 | 0.6066 | 0.522243 | 0.56127 | 0.626059 |
| 7 | KNN model | 0.637917 | 0.606494 | 0.523738 | 0.562086 | 0.626359 |

(The repeated KNN rows are an artifact of re-running the KNN evaluation cell, which appends to model_stats each time.)
So our final model is a Random Forest with the following hyperparameters:
```python
# GS_rf_2 is a second, finer GridSearchCV round over the Random Forest (search code not shown)
GridSearchResults(GS_rf_2)
```

Output:

```
All parameters:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 9,
 'max_features': None,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 170,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
```
We save the model with pickle so that we can use it later for prediction on a new dataset.
```python
import pickle

# Saving the model:
filename = 'modelChurn.pickle'
pickle.dump(rf_model, open(filename, 'wb'))
```
5. Prediction
We will test the model with a new dataset. At the beginning of the project, we split the data in two parts: one for training/testing and one for prediction; here we use the prediction part. These are the key tasks for this phase:
```mermaid
graph LR;
A[New Dataset] -->|Preprocess Pipeline| B[Dataset ready<br>for Prediction]
B -->|Load Model| C[Prediction]
```
5.1. Dataset Preparation
```python
# Load the new dataset for prediction
df = pd.read_csv('./Data/df_prediction.csv', index_col=0)
print("Size of the dataset: %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])
df.head()
```

Output:

```
Size of the dataset: 5400
Number of variables: 30

churn age housing credit_score deposits withdrawal purchases_partners purchases cc_taken cc_recommended ... waiting_4_loan cancelled_loan received_loan rejected_loan zodiac_sign left_for_two_month_plus left_for_one_month rewards_earned reward_rate is_referred
user
53131 0 37.0 O 588.0 5 0 19 5 0 58 ... 0 0 0 0 Gemini 0 0 11.0 0.92 1
23310 1 31.0 na 546.0 0 0 67 0 0 144 ... 0 0 0 0 Sagittarius 0 0 17.0 0.57 1
29996 0 51.0 na 508.0 0 0 7 0 0 15 ... 0 0 0 0 Aries 0 0 6.0 0.20 1
60425 0 25.0 na NaN 0 0 0 0 0 0 ... 0 0 0 0 Pisces 1 0 NaN 0.00 0
22972 1 28.0 na NaN 0 0 3 0 0 5 ... 0 0 0 0 Scorpio 1 0 2.0 0.07 0

5 rows × 30 columns
```
During the pre-processing phase, we created the pipeline list "pipeline_preprocess", which holds all the functions needed to perform the pre-processing tasks.
```python
print("\nSteps for pre-processing: ")
for step, function in enumerate(pipeline_preprocess):
    print("\t {:d}: {:s}".format(step, function.__name__))
```

Output:

```
Steps for pre-processing: 
	 0: dropnull
	 1: dropduplicated
	 2: dropcolumns
	 3: remove_outlier
	 4: standardizeNum
	 5: removedummy
```
So now we apply this process to the new dataset:
```python
def preprocess_data_pipeline(df, pipeline_preprocess):
    for step, function in enumerate(pipeline_preprocess):
        df = function(df)
    print("Size of the dataset: %d" % df.shape[0])
    print("Number of variables: %d" % df.shape[1])
    display(df.head(10))
    return df

df_new = preprocess_data_pipeline(df, pipeline_preprocess)
```

Output:

```
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
There are duplicated indexes....So removing duplicated indexes ...(3)
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
Standardizing Numerical variables ...(6)
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset: 5309
Number of variables: 29

churn age purchases_partners cc_application_begin app_downloaded web_user android_user left_for_two_month_plus left_for_one_month reward_rate ... zodiac_sign_Cancer zodiac_sign_Capricorn zodiac_sign_Gemini zodiac_sign_Leo zodiac_sign_Libra zodiac_sign_Pisces zodiac_sign_Sagittarius zodiac_sign_Scorpio zodiac_sign_Taurus zodiac_sign_Virgo
user
53131 0 0.669618 -0.036352 1.610414 1 1 1 0 0 0.026999 ... 0 0 1 0 0 0 0 0 0 0
23310 1 -0.037564 1.761699 1.302417 1 0 0 0 0 -0.440722 ... 0 0 0 0 0 0 1 0 0 0
29996 0 2.319710 -0.485864 0.840422 1 1 0 0 0 -0.935169 ... 0 0 0 0 0 0 0 0 0 0
60425 0 -0.744746 -0.748080 -0.853562 0 1 0 1 0 -1.202438 ... 0 0 0 0 0 1 0 0 0 0
22972 1 -0.391155 -0.635702 -0.853562 1 0 1 1 0 -1.108894 ... 0 0 0 0 0 0 0 1 0 0
1195 0 0.080300 2.286131 1.610414 1 1 1 0 0 1.563795 ... 0 0 0 0 0 0 0 0 0 0
41350 1 -0.626882 -0.748080 -0.853562 1 0 1 0 0 -1.202438 ... 0 0 0 0 0 0 0 0 0 1
29695 0 -0.273291 -0.748080 -0.853562 1 1 0 0 0 -1.202438 ... 1 0 0 0 0 0 0 0 0 0
15739 1 -0.980473 -0.748080 0.840422 1 0 1 0 0 1.376707 ... 0 0 0 0 0 0 0 0 0 0
51516 0 -0.980473 0.825214 -0.391566 1 0 1 0 0 -0.400631 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 29 columns
```
Now we are ready to filter the dataset down to the "new_num_features":
```python
# Using the feature selection for this dataset
new_num_features = ['purchases_partners', 'reward_rate', 'cc_application_begin', 'app_downloaded', 'zodiac_sign_Virgo', 'zodiac_sign_Taurus', 'zodiac_sign_Scorpio', 'zodiac_sign_Aquarius', 'housing_O', 'age', 'zodiac_sign_Libra', 'zodiac_sign_Leo']
X = df_new[new_num_features]
y = df_new['churn']
print("X shape: {}".format(X.shape))
print("y shape: {}".format(y.shape))
```

Output:

```
X shape: (5309, 12)
y shape: (5309,)
```
5.2. Make Predictions
```python
# Load the model:
filename = 'modelChurn.pickle'
model_Loaded = pickle.load(open(filename, 'rb'))

# Make predictions
prediction = model_Loaded.predict(X)
results = []
for i in list(zip(y, prediction)):
    if i[0] == i[1]:
        res = 1
    else:
        res = 0
    results.append(res)
print("% of successful prediction: {}".format(np.sum(results)/len(results)))
```

Output:

```
% of successful prediction: 0.6824260689395366
```
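Note that this fraction of correct predictions is exactly the accuracy, so the loop could be replaced by a one-liner (a sketch using scikit-learn):

```python
# Equivalent computation with scikit-learn:
from sklearn.metrics import accuracy_score
print("% of successful prediction: {}".format(accuracy_score(y, prediction)))
```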
6. Conclusions and Lessons Learned
- As expected, the prediction accuracy on the new data is around $0.7$, which reflects the model's test accuracy.
- As reviewed, the 4 most important features are purchases_partners, reward_rate, cc_application_begin and app_downloaded. This seems logical, because these features reflect customer behavior inside the app. Others, like zodiac_sign or housing, have low importance.
- During development, we faced processing-time issues with GridSearchCV. To speed it up, we split the hyperparameter tuning into phases, each with its own GridSearchCV run.
- We tried to improve accuracy, but most models land in the range $[0.64, 0.69]$, so the model is not especially strong. We chose 12 features; it might be better to choose fewer.
- During exploration, we needed to separate the analysis into categorical and numerical features, and, within the numerical ones, into binary and non-binary.