
Churn Prediction for Fintech Data

Brief: The goal of this project is to predict churn and identify which features are most relevant, using classic supervised techniques (LR, DT, SVM, RF, KNN) tuned with GridSearchCV.

Objective

Predict user churn for a fintech app using classical supervised models.

Project Overview

To give a big picture of the project, here is a diagram of each phase:

graph LR;
  A[Dataset for training/testing] --> B[Exploration]
  B --> C{Pre-Processing}
  C -->|Training| D[Fitting]
  D -->|Feature Selection, Tuning| F[Model]
  F --> E[Evaluation]
  F --> H[Prediction]
  C -->|Testing| F
  G[Dataset for predicting] -->|Pre-Processing| F

Index

  1. Dataset
  2. Exploration and Analysis
  3. Pre-Processing
  4. Model and Evaluation
  5. Prediction
  6. Conclusions and Lessons Learned


1. Dataset

1.1. Dataset Understanding

The business challenge: a fintech runs a subscription model, and usage of its app yields information about customer behavior. Companies normally try to minimize churn, which means subscription cancellation. The objective is to predict when a customer is about to churn and which features are the most relevant.

Here are some of the features:

# Imports used throughout the notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

# Loading the data
df = pd.read_csv('../projectChurnRate/Data/churn_data.csv', index_col=0)

print("Size of the dataset:  %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')

df.head()

# Output:
Size of the dataset:  21600
Number of variables: 30
There are duplicated indexes.

user  churn	age	housing	credit_score	deposits	withdrawal	purchases_partners	purchases	cc_taken	cc_recommended	...	waiting_4_loan	cancelled_loan	received_loan	rejected_loan	zodiac_sign	left_for_two_month_plus	left_for_one_month	rewards_earned	reward_rate	is_referred
																					
55409	0	37.0	na	NaN	0	0	0	0	0	0	...	0	0	0	0	Leo	1	0	NaN	0.00	0
23547	0	28.0	R	486.0	0	0	1	0	0	96	...	0	0	0	0	Leo	0	0	44.0	1.47	1
58313	0	35.0	R	561.0	47	2	86	47	0	285	...	0	0	0	0	Capricorn	1	0	65.0	2.17	0
8095	0	26.0	R	567.0	26	3	38	25	0	74	...	0	0	0	0	Capricorn	0	0	33.0	1.10	1
61353	1	27.0	na	NaN	0	0	2	0	0	0	...	0	0	0	0	Aries	1	0	1.0	0.03	0
5 rows × 30 columns


2. Exploration and Analysis

For exploration, we will focus on how the data behaves with respect to the following topics:

  1. Checking null and 'na' values
  2. Checking whether the index contains duplicates
  3. Verifying the class balance of the dataset
  4. Data type analysis -> categorical and numerical features separately
  5. Checking for outliers
  6. Analysis of churn over features
  7. Conclusions after exploration and analysis

2.1. Checking Null and na values

As we can see, there are many null values in the credit_score and rewards_earned columns. If we dropped those nulls row-wise we would lose a lot of data and the dataset would shrink considerably, so the best option is to drop these two columns instead. On the other hand, the age column has only 4 null values, so there we can drop the rows (axis=0).

null_finder = df.isnull().sum()

print(" ***** Number of Null Values by row: ***** ")
null_finder.where(null_finder > 0).dropna()

# Output:
 ***** Number of Null Values by row: ***** 
age                  4.0
credit_score      6436.0
rewards_earned    2569.0
dtype: float64

2.2. Check if index is duplicated

There are duplicated indexes.

if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')

# Output:
There are duplicated indexes.

2.3. Verify balance data on dataset

To continue the EDA, we have to remove nulls/NAs and duplicated indexes in order to have a sound exploration.

sns.countplot(x='churn', data=df)

class_ratio_0 = Counter(df['churn'])[0]/df['churn'].shape[0]
class_ratio_1 = Counter(df['churn'])[1]/df['churn'].shape[0]

print('Class Ratio of {:} is {:.2f}'.format(list(Counter(df['churn']).keys())[0], class_ratio_0))
print('Class Ratio of {:} is {:.2f}'.format(list(Counter(df['churn']).keys())[1], class_ratio_1))
print('Class Ratio of 1 over 0 is {:.2f}'.format(class_ratio_1/class_ratio_0))

# Output:
Class Ratio of 0 is 0.56
Class Ratio of 1 is 0.44
Class Ratio of 1 over 0 is 0.80


As we can see, the class variable is almost balanced (0.56 vs 0.44), so we don't need any balancing technique.
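As a side note, the same ratios can be obtained in one line with pandas (equivalent to the Counter-based code above):

# Equivalent one-liner for the class ratios
print(df['churn'].value_counts(normalize=True).round(2))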

2.4. Dataframe data type analysis

For this kind of analysis, we separate the features into categorical and numerical. Hence, we define two lists: cat_features and num_features.

# Separation of data types for EDA

cat_features = df.select_dtypes(exclude = np.number).columns
num_features = df.select_dtypes(include = np.number).columns
print( "Quantity of Categorical features: ", len(cat_features),"\nCategorical features: ", cat_features)
print( "\nQuantity of Numerical features: ", len(num_features),"\nNumerical features: ", num_features)

# Output:
Quantity of Categorical features:  3 
Categorical features:  Index(['housing', 'payment_type', 'zodiac_sign'], dtype='object')

Quantity of Numerical features:  25 
Numerical features:  Index(['churn', 'age', 'deposits', 'withdrawal', 'purchases_partners',
       'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked',
       'cc_application_begin', 'app_downloaded', 'web_user', 'app_web_user',
       'ios_user', 'android_user', 'registered_phones', 'waiting_4_loan',
       'cancelled_loan', 'received_loan', 'rejected_loan',
       'left_for_two_month_plus', 'left_for_one_month', 'reward_rate',
       'is_referred'],
      dtype='object')

2.4.1. Categorical analysis:

We can see that these features cannot be binarized because they have more than 2 distinct values, so we need the one-hot encoding technique. On the other hand, before handling the 'na' values, we have to check how many of them there are.


print("Different quantity values: " ,list(map(lambda col: (col,len(df[col].value_counts())), df.select_dtypes(exclude = np.number).columns)))

print("\nhousing feature values: ", df['housing'].unique() ,
      "\npayment_type feature value: ", df['payment_type'].unique() ,
      "\nzodiac_sign feature values: ", df['zodiac_sign'].unique())
# Output:
Different quantity values:  [('housing', 3), ('payment_type', 5), ('zodiac_sign', 13)]

housing feature values:  ['na' 'R' 'O'] 
payment_type feature value:  ['Bi-Weekly' 'Weekly' 'na' 'Monthly' 'Semi-Monthly'] 
zodiac_sign feature values:  ['Aquarius' 'Scorpio' 'Capricorn' 'Pisces' 'Gemini' 'Cancer' 'Virgo'
 'Libra' 'na' 'Leo' 'Aries' 'Taurus' 'Sagittarius']

As we can see, there are a lot of 'na' values. We will handle them through the one-hot (dummy) encoding later during pre-processing, where each 'na' level becomes its own column that we can simply drop.
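As a minimal illustration (a toy example, not the project data), pd.get_dummies turns each category, including 'na', into its own indicator column, which is why the 'na' columns can simply be dropped afterwards:

import pandas as pd

# Toy example: one-hot encode a 3-valued categorical column
demo = pd.DataFrame({'housing': ['na', 'R', 'O', 'R']})
print(pd.get_dummies(demo, columns=['housing']))
# Produces housing_O, housing_R, housing_na indicator columns;
# dropping housing_na discards the missing-value level.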

# Counting "na" values:

print("na values from housing: ", len(df[df['housing'] ==  'na']) )
print("na values from payment_type: ", len(df[df['payment_type'] ==  'na']) )
print("na values from zodiac:sign: ", len(df[df['zodiac_sign'] ==  'na']) )
# Output:
na values from housing:  10373
na values from payment_type:  2883
na values from zodiac_sign:  1570


####
# Analyze each categorical feature by churn count to see which ones are most relevant.

fig, axs = plt.subplots(1, len(cat_features), figsize=(28, 10))

for col, ax in enumerate(axs.flatten()):
    sns.countplot(x=cat_features[col], hue='churn', dodge=False, data=df, ax = axs[col])
    ax.set_title(cat_features[col])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=8)


As we can see, these values should be translated to numbers with the one-hot encoding technique in order to keep the dataset size. The categorical values should be kept because they affect churn behavior. Here are some insights:

2.4.2. Numerical analysis:

From the correlation matrix, we can get some insights like the following:

# Quantity of numerical values:

list(map(lambda col: (col,len(df[col].value_counts())), df.select_dtypes(include = np.number).columns))

# Output:
[('churn', 2),
 ('age', 69),
 ('deposits', 66),
 ('withdrawal', 20),
 ('purchases_partners', 281),
 ('purchases', 64),
 ('cc_taken', 11),
 ('cc_recommended', 324),
 ('cc_disliked', 20),
 ('cc_liked', 8),
 ('cc_application_begin', 123),
 ('app_downloaded', 2),
 ('web_user', 2),
 ('app_web_user', 2),
 ('ios_user', 2),
 ('android_user', 2),
 ('registered_phones', 5),
 ('waiting_4_loan', 2),
 ('cancelled_loan', 2),
 ('received_loan', 2),
 ('rejected_loan', 2),
 ('left_for_two_month_plus', 2),
 ('left_for_one_month', 2),
 ('reward_rate', 186),
 ('is_referred', 2)]

As we can see, some features are binary and others take more than 2 distinct values.

First, we will explore the features with more than 2 distinct values, since several of them, like ('registered_phones', 5), ('cc_disliked', 20), ('cc_taken', 11) and ('withdrawal', 20), take only a handful of specific values.

num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, num_features))
num_features_bin = list(filter(lambda col: len(df[col].value_counts()) == 2, num_features))

print("Quantity of features greater than 2:", len(num_features_new))
print(num_features_new)
print()
print("Quantity of features equal to 2(binary):", len(num_features_bin))
print(num_features_bin)

# Output:
Quantity of features greater than 2: 12
['age', 'deposits', 'withdrawal', 'purchases_partners', 'purchases', 'cc_taken', 'cc_recommended', 'cc_disliked', 'cc_liked', 'cc_application_begin', 'registered_phones', 'reward_rate']

Quantity of features equal to 2(binary): 13
['churn', 'app_downloaded', 'web_user', 'app_web_user', 'ios_user', 'android_user', 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']

In the next plots, we can see that some features, like 'deposits', 'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked' and 'registered_phones', are overwhelmingly 0 compared with the other values. Later, during outlier removal, all their values would become 0, so these features are not useful and we have to remove them.

# Exploring the distribution of each feature in num_features_new

for col in num_features_new:
    #plt.figure(figsize=(5,2))
    #sns.countplot(x=col, data=df)
    sns.displot(df[col], kde = True, height = 5 )
    plt.xticks(rotation=90, fontsize=9)
    plt.xlabel(col, labelpad=10)
    plt.ylabel('Counts')
    plt.show()

To continue the exploration, we will omit 'churn', 'app_downloaded', 'web_user', 'app_web_user', 'ios_user' and 'android_user' because they have already been analyzed, and focus on the rest. Two main insights stand out:

num_features_binary = ['waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']

fig, axs = plt.subplots(1, len(num_features_binary), figsize=(28, 10))

for col, ax in enumerate(axs.flatten()):
    sns.countplot(x=num_features_binary[col], hue='churn', dodge=False, data=df, ax = axs[col])
    ax.set_title(num_features_binary[col])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=8)


2.5. Check outliers

For this part, we have to detect outliers in the non-binary features, since outliers make no sense for binary ones. We use the IQR fence: with $\mathrm{IQR} = Q_3 - Q_1$, any value outside $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$ counts as an outlier.

def detect_outlier(X):
    """
    X: dataframe
    """
    #for i in range(len(X.columns)):
    for i in range(5): # Only check the first 5 columns
        first_q = np.percentile(X[X.columns[i]], 25)
        third_q = np.percentile(X[X.columns[i]], 75)
        fence = 1.5*(third_q - first_q)   # 1.5 x IQR
        minimum = first_q - fence
        maximum = third_q + fence

        if(minimum > np.min(X[X.columns[i]]) or maximum < np.max(X[X.columns[i]])):
            print(X.columns[i], "There is Outlier")
detect_outlier(df[num_features_new])

# Output:
age There is Outlier
deposits There is Outlier
withdrawal There is Outlier
purchases_partners There is Outlier
purchases There is Outlier

As we can see, 'age', 'purchases_partners' and 'cc_application_begin' have outliers. For the other flagged features, the distribution is concentrated at 0, which means most of their values are 0; after outlier removal they would hold only the value '0'. Hence, they should be removed.

2.6. Analysis of Churn over features

Analysis 1:

age_group = df[(df.age>20)& (df.age<80)]
df_filtered = age_group.groupby(['age', 'churn'])['deposits'].count().reset_index()
b = pd.pivot_table(df_filtered, values='deposits', index='age',columns=['churn']).reset_index()
b.head()

# Output:
churn	age	0	1
0	21.0	373.0	397.0
1	22.0	409.0	410.0
2	23.0	519.0	437.0
3	24.0	573.0	449.0
4	25.0	580.0	478.0

#####
sns.barplot(x = 'age', y = 1 , data = b)
plt.xlabel('Age')
plt.ylabel('Churn Quantity')
plt.title('Age Churn Rate')
plt.xticks(rotation=90)

Analysis 2:

df_filtered = df.groupby(['deposits', 'churn'])['housing'].count().reset_index()
b = pd.pivot_table(df_filtered, values='housing', index='deposits',columns=['churn']).reset_index()
b.head()

# Output:

#####
plt.figure(figsize=(18,10))
sns.barplot(x = 'deposits', y = 1 , data = b)
plt.xlabel('Deposits')
plt.ylabel('Churn Quantity')
plt.title('Deposits Churn Rate')
plt.xticks(rotation=90)

2.7. Conclusions after exploration and analysis

After the EDA, we have some insights and conclusions that must be applied in the pre-processing phase to prepare the data for modeling:


3. Pre-Processing

As found in the exploration phase, we will pre-process the dataset to make it fit for modeling. We save all of these steps in a pipeline list for later pre-processing runs. Below is the task list:

  1. Remove null/NaN values
  2. Remove duplicated indexes
  3. Remove unhelpful features: the app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan and waiting_4_loan columns
  4. Update the numerical and categorical feature lists for modeling
  5. Remove outliers
  6. Num-Val: standardize the data for training (excluding binary features)
  7. Cat-Val: one-hot encoding
  8. Create a pre-processing pipeline function to reuse on test data

Before starting, we create the pipeline list. At the end of the process, we will test this pipeline with a new dataset:

# Pipeline for pre-processing
pipeline_preprocess = []

3.1. Remove Null/Nan values

def dropnull(df):
    print("Removing columns credit_score and rewards_earned ...(1)")
    df = df.drop(columns=['credit_score','rewards_earned'])
    print("Drop null values from age column ...(2)")
    df = df[pd.notnull(df['age'])]
    return df

df = dropnull(df)

null_finder = df.isnull().sum()
print(" ***** Number of Null Values by row: ***** ")
null_finder.where(null_finder > 0).dropna()

# Output:
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
 ***** Number of Null Values by row: ***** 
Series([], dtype: float64)

# Adding to pre-processing pipeline:

pipeline_preprocess.append(dropnull)

3.2. Remove duplicated indexes

# This function will be added to the pre-processing pipeline:
def dropduplicated(df):
    if df.index.is_unique:
        print('Indexes are unique.')
        return df
    else:
        print('There are duplicated indexes....So removing duplicated indexes ...(3)')
        return df[~df.index.duplicated(keep='first')]

df = dropduplicated(df)


if df.index.is_unique:
    print('Indexes are unique.')
else:
    print('There are duplicated indexes.')

# Output:
There are duplicated indexes....So removing duplicated indexes ...(3)
Indexes are unique.

# Adding to pre-processing pipeline:

pipeline_preprocess.append(dropduplicated)

3.3. Remove unhelpful features

During the EDA, we concluded that the following features are not useful:

# Remove app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan (and other near-constant) columns

def dropcolumns(df):
    print("Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)")
    df = df.drop(columns=['app_web_user','deposits', 'ios_user','cc_recommended', 'cancelled_loan', 'received_loan', 'rejected_loan', 'waiting_4_loan',
                          'withdrawal', 'purchases', 'cc_taken', 'cc_disliked', 'cc_liked', 'registered_phones'])
    return df
df = dropcolumns(df)

# Output:
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)

# Adding to pre-processing pipeline:

pipeline_preprocess.append(dropcolumns)

3.4. Update numerical and categorical values list for modeling

We need to update the numerical and categorical feature lists before modeling, because some features were removed from the dataset.

# Update numerical and categorical values for later modeling

cat_features = df.select_dtypes(exclude = np.number).columns
num_features = df.select_dtypes(include = np.number).columns
print( "Quantity of Categorical features: ", len(cat_features),"\nCategorical features: ", cat_features)
print( "\nQuantity of Numerical features: ", len(num_features),"\nNumerical features: ", num_features)

# Output:
Quantity of Categorical features:  3 
Categorical features:  Index(['housing', 'payment_type', 'zodiac_sign'], dtype='object')

Quantity of Numerical features:  11 
Numerical features:  Index(['churn', 'age', 'purchases_partners', 'cc_application_begin',
       'app_downloaded', 'web_user', 'android_user', 'left_for_two_month_plus',
       'left_for_one_month', 'reward_rate', 'is_referred'],
      dtype='object')

# Quantity of numerical values:

num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, df.select_dtypes(include = np.number).columns))
num_features_bin = list(filter(lambda col: len(df[col].value_counts()) == 2, df.select_dtypes(include = np.number).columns))

print("Quantity of features greater than 2:", len(num_features_new))
print(num_features_new)
print()
print("Quantity of features equal to 2(binary):", len(num_features_bin))
print(num_features_bin)

# Output:
Quantity of features greater than 2: 4
['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']

Quantity of features equal to 2(binary): 7
['churn', 'app_downloaded', 'web_user', 'android_user', 'left_for_two_month_plus', 'left_for_one_month', 'is_referred']

So now we have the following feature counts:

3.5. Remove outliers

To handle outliers, we focus on the non-binary numerical features and replace any value outside the IQR fence with the column median.

# Removing outliers
def remove_outlier(df):
    """
    df: dataframe
    """

    num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, df.select_dtypes(include = np.number).columns))
    outliers = []
    for col in num_features_new:
        first_q = np.percentile(df[col], 25)
        third_q = np.percentile(df[col], 75)
        fence = 1.5*(third_q - first_q)   # 1.5 x IQR
        minimum = first_q - fence
        maximum = third_q + fence

        if(minimum > np.min(df[col]) or maximum < np.max(df[col])):
            outliers.append(col)
    print("The outliers are :", outliers)

    print("Removing outliers ...(5)")
    for col in outliers:
        first_q = np.percentile(df[col], 25)
        third_q = np.percentile(df[col], 75)
        fence = 1.5*(third_q - first_q)   # 1.5 x IQR
        minimum = first_q - fence
        maximum = third_q + fence

        median = df[col].median()

        # Replace values outside the fence with the column median
        df.loc[df[col] < minimum, col] = median
        df.loc[df[col] > maximum, col] = median
    return df
df = remove_outlier(df)

# Output:
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)

# Adding to pre-processing pipeline:

pipeline_preprocess.append(remove_outlier)

3.6. Num-Val: Standardize data

For the numerical values, we standardize the non-binary features for training.
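For reference, StandardScaler applies the z-score transform $z = (x - \mu) / \sigma$ to each selected column, where $\mu$ and $\sigma$ are the column mean and standard deviation.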

def standardizeNum(df):
    standardize = StandardScaler()
    num_features = df.select_dtypes(include = np.number).columns
    num_features_new = list(filter(lambda col: len(df[col].value_counts()) > 2, num_features))
    print("Standarizing Numerical variables ...(6)")
    df[num_features_new] = standardize.fit_transform(df[num_features_new])
    return df
df = standardizeNum(df)
print("Size of the dataset:  %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

# Output:
Standardizing Numerical variables ...(6)
Size of the dataset:  20095
Number of variables: 14

# Adding to pre-processing pipeline:

pipeline_preprocess.append(standardizeNum)

3.7. Cat-Val: One hot encoding

We apply one-hot encoding to the categorical features and drop the 'na' dummy columns.

def removedummy(df):
    print("Convert categorical values into numbers ...(7)")
    df_new = pd.get_dummies(df)
    print("Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)")
    df_new = df_new.drop(columns = ['housing_na', 'zodiac_sign_na', 'payment_type_na'])
    return df_new
df = removedummy(df)
print("Size of the dataset:  %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])

# Output:
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset:  20095
Number of variables: 29

# Adding to pre-processing pipeline:

pipeline_preprocess.append(removedummy)

3.8. Create pre-process pipeline function

During the pre-processing phase, we wrapped each specific task in a function. These functions will later be applied to new datasets (test data or entirely new data), so we store them in a pipeline for later pre-processing. Here is the list of all the functions we used:

print("\nSteps for pre-processing: ")
for step, function in enumerate(pipeline_preprocess):
    print("\t {:d}: {:s}".format(step, function.__name__))

# Output:
Steps for pre-processing: 
	 0: dropnull
	 1: dropduplicated
	 2: dropcolumns
	 3: remove_outlier
	 4: standardizeNum
	 5: removedummy

In this way, we create the function 'preprocess_data_pipeline' to pre-process any new dataset (e.g., for prediction) later:

# Definition of preprocess_data_pipeline for a specific dataset:

def preprocess_data_pipeline(df, pipeline_preprocess):
    for step, function in enumerate(pipeline_preprocess):
        df = function(df)
    print("Size of the dataset:  %d" % df.shape[0])
    print("Number of variables: %d" % df.shape[1])
    display(df.head(10))
    return df

Testing the 'preprocess_data_pipeline' function with a new dataset:

df_aux = pd.read_csv('../projectChurnRate/Data/churn_data.csv', index_col=0).sample(n=2100, random_state=0)
preprocess_data_pipeline(df_aux, pipeline_preprocess)

# Output:
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
There are duplicated indexes....So removing duplicated indexes ...(3)
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
Standardizing Numerical variables ...(6)
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset:  2086
Number of variables: 29
user  churn	age	housing	purchases_partners	cc_application_begin	app_downloaded	web_user	android_user	payment_type	zodiac_sign	left_for_two_month_plus	left_for_one_month	reward_rate	is_referred
														
50488	0	20.0	R	29	8	1	0	1	Bi-Weekly	Virgo	0	0	0.44	1
53603	0	38.0	na	28	5	1	0	1	Monthly	Sagittarius	0	0	0.67	1
42289	1	40.0	R	9	4	1	1	1	na	Aries	0	0	0.63	1
4185	0	34.0	na	0	4	1	1	0	Bi-Weekly	Scorpio	0	0	2.07	0
12436	1	24.0	O	38	7	1	1	1	Weekly	Scorpio	0	0	0.73	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
23350	1	22.0	R	31	7	1	1	1	Weekly	Virgo	1	0	0.00	1
53737	0	43.0	na	23	1	1	1	0	Bi-Weekly	Gemini	0	0	0.50	0
56724	1	40.0	R	0	0	1	1	0	Monthly	Taurus	0	0	0.30	0
17737	1	23.0	na	8	1	1	1	1	Semi-Monthly	Pisces	0	0	0.65	0
46045	1	27.0	R	0	3	1	0	0	Bi-Weekly	Virgo	0	0	0.53	0
2086 rows × 14 columns


4. Model and Evaluation

For modeling, we went through the following phases:

  1. Dataset separation
  2. Feature selection
  3. Modeling to find the best model

For evaluation, we use the following metrics: Accuracy, Precision, Recall, F1-Score and AUC.

graph LR;
  A[Training Dataset] --> B[Feature Selection]
  B --> D[Hyper-parameter Tuning]
  D --> E[Fitting]
  E --> F[Model]
  F -->|Tuning| D

graph LR;
  C[Testing Dataset] --> F[Model Fitted]
  F --> G[Evaluation]
  G --> H[Metrics Insights]

4.1. Dataset Separation

First, we split the data into features and label (the churn variable), and then into training and testing sets.

X = df.drop('churn', axis = 1)
y = df['churn']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=10)
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))

# Output:
x_train shape: (14066, 28)
y_train shape: (14066,)
x_test shape: (6029, 28)
y_test shape: (6029,)

4.2. Feature Selection

For feature selection, we will use the following techniques:

  1. Logistic Regression coefficients
  2. Other techniques: Recursive Feature Elimination (RFE), Random Forest importance and Mutual Information
  3. Sum up the results and choose the most important features

4.2.1. Logistic Regression

Using this model, we keep the features with coef < -0.2 or coef > 0.2; that leaves 7 features out of 28. The most important variables are shown below.
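The dataframe feature_importance_lr used below is not constructed anywhere in the post; a plausible sketch (an assumption, since the fitting step is not shown) pairs each column with its coefficient from a fitted LogisticRegression:

from sklearn.linear_model import LogisticRegression
import pandas as pd

# Assumed reconstruction (the original fitting step is not shown in the post)
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
feature_importance_lr = pd.DataFrame({'feature': x_train.columns,
                                      'coef': lr.coef_[0]})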

feature_importance_lr.where((feature_importance_lr["coef"] < -0.2)|(feature_importance_lr["coef"] > 0.2)).dropna()

# Output:
                    feature	     coef
 7	 left_for_one_month      0.438657
 4	           web_user      0.289608
 3           app_downloaded      0.268084
15	payment_type_Weekly      0.218277
10	          housing_O     -0.204331
 8              reward_rate     -0.298566
 1	 purchases_partners     -0.538643
    

4.2.2. Recursive Feature Elimination (RFE)

As a wrapper method, RFE uses an external estimator (LogisticRegression in our case) to recursively search for a small subset of features by evaluating their importance. Using RFE, the 12 most important features are:

# Number of features as 12
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=12, step=10, verbose=5)
rfe_selector.fit(x_train, y_train)
rfe_support = rfe_selector.get_support()
feature_rfe = x_train.loc[:,rfe_support].columns.tolist()
print(str(len(feature_rfe)), 'selected features')
print(feature_rfe)

# Output:
Fitting estimator with 28 features.
Fitting estimator with 18 features.
12 selected features
['purchases_partners', 'app_downloaded', 'web_user', 'left_for_two_month_plus', 'left_for_one_month', 'reward_rate', 'housing_O', 'payment_type_Weekly', 'zodiac_sign_Aquarius', 'zodiac_sign_Capricorn', 'zodiac_sign_Libra', 'zodiac_sign_Taurus']

4.2.3. Random Forest

In a Random Forest, node purity tells us about feature importance: a feature that yields low-impurity (Gini) splits across all the trees is considered important. Using the Random Forest selector, only 4 features are kept:
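For reference, the Gini impurity of a node whose samples fall into classes with proportions $p_k$ is $G = 1 - \sum_k p_k^2$; a pure node has $G = 0$, and features that produce low-impurity splits across the forest rank as more important.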

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=12)
embeded_rf_selector.fit(x_train, y_train)

embeded_rf_support = embeded_rf_selector.get_support()
feature_embeded_rf = x_train.loc[:,embeded_rf_support].columns.tolist()
print(str(len(feature_embeded_rf)), 'selected features')
feature_embeded_rf

# Output:
4 selected features
['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']

4.2.4. Mutual Information classification

Mutual information comes from information theory and applies information gain (as used in decision trees) to feature selection. The gain is computed between two variables and measures the reduction in uncertainty about one given the other. Using this technique, we get the top 12 features:
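For reference, the mutual information between a feature $X$ and the label $Y$ can be written as $I(X;Y) = H(Y) - H(Y \mid X)$: the reduction in uncertainty (entropy) about $Y$ once $X$ is known, with $I(X;Y) = 0$ when the two are independent.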

importances =  mutual_info_classif(x_train,y_train)
feature_mutual_importance = pd.Series(importances, index = x_train.columns)
feature_mutual_importance.sort_values(ascending=False).plot.bar(figsize=(20, 8))

feature_mutual = list(feature_mutual_importance.where(feature_mutual_importance > 0.002).sort_values(ascending=False).dropna().index)
print("There are {} important features and they are: \n".format(len(feature_mutual)))
feature_mutual

# Output:
There are 12 important features and they are: 

['purchases_partners',
 'reward_rate',
 'zodiac_sign_Leo',
 'is_referred',
 'cc_application_begin',
 'android_user',
 'zodiac_sign_Scorpio',
 'housing_O',
 'zodiac_sign_Aquarius',
 'housing_R',
 'app_downloaded',
 'zodiac_sign_Virgo']


Here we can see the most important features versus churn; 'purchases_partners' is the most influential.

4.2.5. Summary

Summing up all the results, we get the 12 most important features to start modeling.
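The boolean masks lr_support and mutual_support (and the constant num_feats) used below are not defined earlier in the post; a plausible reconstruction, assuming the thresholds from sections 4.2.1 and 4.2.4, is:

# Assumed reconstruction of the selection masks used in the summary table
num_feats = 12

# Logistic Regression: features whose |coef| passed the 0.2 threshold (4.2.1)
lr_support = feature_importance_lr['coef'].abs().gt(0.2).values

# Mutual Information: features kept by the 0.002 threshold (4.2.4)
mutual_support = x_train.columns.isin(feature_mutual)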

# put all selection together
feature_selection_df = pd.DataFrame({'Feature':x_train.columns, 
                                     'Logistics':lr_support,
                                     'RFE':rfe_support,
                                     'Random Forest':embeded_rf_support, 'Mutual Classif':mutual_support})
# count the selected times for each feature
feature_selection_df['Total'] = np.sum(feature_selection_df, axis=1)

# display the top num_feats rows
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
# feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

# Output:
	Feature	Logistics	RFE	Random Forest	Mutual Classif	Total
1	purchases_partners	True	True	True	True	        4
8	reward_rate	        False	True	True	True	        3
2	cc_application_begin	True	False	True	True	        3
3	app_downloaded	        True	True	False	True	        3
27	zodiac_sign_Virgo	True	False	False	True	        2
26	zodiac_sign_Taurus	True	True	False	False	        2
25	zodiac_sign_Scorpio	True	False	False	True	        2
16	zodiac_sign_Aquarius	False	True	False	True	        2
10	housing_O	        False	True	False	True	        2
0	age	                True	False	True	False	        2
22	zodiac_sign_Libra	False	True	False	False	        1
21	zodiac_sign_Leo	        False	False	False	True	        1


The selected features are the ones at the top of the table above.

Now we are ready to start modeling. For this we split the dataset again, filtering the columns with the new_num_features list (derived below).
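A sketch of how new_num_features can be derived from the summary table (the post later hard-codes the same 12 names in section 5.1, so treat this derivation as an assumption):

# Top 12 features by vote count in the summary table above
new_num_features = feature_selection_df.head(12)['Feature'].tolist()
print(new_num_features)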

X = X[new_num_features]
y = df['churn']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=10)
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))

# Output:
x_train shape: (14066, 12)
y_train shape: (14066,)
x_test shape: (6029, 12)
y_test shape: (6029,)


4.3. Modeling & Evaluation

Modeling to find the best model:

  1. Hyperparameter tuning with GridSearchCV to find the best hyperparameters for each model
  2. Fitting and evaluation with the following models: Logistic Regression, Decision Tree, SVM, Random Forest and KNN
  3. Choosing the best model according to the metrics
  4. Saving the best model for use with new values

For evaluation, we use the following metrics: Accuracy, Precision, Recall, F1-Score and AUC.

Most of the steps in this phase are mechanical, so here is the key loop to keep in mind:

graph LR;
  A[Training Dataset] --> B[GridSearchCV]
  B -->|Hyperparameter Tuning| C[Fitting/Predicting]
  C --> B
  C -->|Evaluation Metrics| D[Save Metrics]

We also create 2 functions to summarize the results of hyperparameter tuning and of modeling (fit/predict):

def GridSearchResults(grid_clf, num_results=10, display_all_params=True):
    """
    Function to summarize all the results of GridSearchCV.
    """
def evaluationMetricsGCV(x_test, y_test, model_fit):
    """
    Function to summarize all the results of fitting and predicting.
    """


To save the metrics, we create a dataframe that we will use later to choose the best model:

# Dataframe for statistics

model_stats = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1-Score", "AUC-Score"])
model_stats.head()

# Output:
Model	Accuracy	Precision	Recall	F1-Score	AUC-Score

4.3.1. Logistic Regression Model

Here are the results of hyperparameter tuning:

# GridSearchCV for logistic Regression

parameters = {}
parameters['C'] = [10e-3, 10e-2, 10e-1, 1, 10, 100, 1000]
parameters['class_weight'] = [None, 'balanced']
parameters['penalty'] = ["l1","l2",'elasticnet']
parameters['solver'] = ['newton-cg', 'lbfgs', 'liblinear', 'sag']

GS_log = GridSearchCV(LogisticRegression(), parameters , scoring = 'accuracy', cv = 10, verbose=1, n_jobs=-1)
GS_log.fit(x_train, y_train)

GridSearchResults(GS_log)

# Output:
Best parameters: 
 {'C': 0.01, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Best score: 
 0.63038 (+/-0.01300)
All parameters: 
{'C': 0.01,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l1',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
                                                params	    mean_test_score	std_test_score	rank_test_score
2	{'C': 0.01, 'class_weight': None, 'penalty': '...	0.630383	0.012998	1
6	{'C': 0.01, 'class_weight': None, 'penalty': '...	0.627824	0.013089	2
26	{'C': 0.1, 'class_weight': None, 'penalty': 'l...	0.627611	0.012696	3
4	{'C': 0.01, 'class_weight': None, 'penalty': '...	0.627469	0.012713	4
5	{'C': 0.01, 'class_weight': None, 'penalty': '...	0.627469	0.012713	4
7	{'C': 0.01, 'class_weight': None, 'penalty': '...	0.627469	0.012713	4
29	{'C': 0.1, 'class_weight': None, 'penalty': 'l...	0.627184	0.013151	7
31	{'C': 0.1, 'class_weight': None, 'penalty': 'l...	0.627184	0.013151	7
28	{'C': 0.1, 'class_weight': None, 'penalty': 'l...	0.627184	0.013151	7
30	{'C': 0.1, 'class_weight': None, 'penalty': 'l...	0.626971	0.013364	10


Here are the results of modeling and evaluation:

logr_model = LogisticRegression(C=0.01, class_weight=None, max_iter=1000, penalty='l1', random_state=1000, solver='liblinear')
logr_model.fit(x_train, y_train)

Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, logr_model)
model_stats = model_stats.append({"Model": "Logistic model",
                                    "Accuracy": Accuracy,
                                    "Precision": Precision,
                                    "Recall": Recall,
                                    "F1-Score": F1,
                                    "AUC-Score": auc_score}, ignore_index=True)

# Output:
Results: 
+++++ Accuracy Score 0.639
+++++ Precision Score 0.598
+++++ Recall Score 0.570
+++++ F1 Score 0.584


              precision    recall  f1-score   support

    No churn       0.67      0.69      0.68      3354
       Churn       0.60      0.57      0.58      2675

    accuracy                           0.64      6029
   macro avg       0.63      0.63      0.63      6029
weighted avg       0.64      0.64      0.64      6029

+++++ AUC (Area under the ROC Curve) : 0.632


As we can see, the accuracy is 0.64.

4.3.2. Decision Tree Model

Here are the results of hyperparameter tuning:

# GridSearchCV for Decision Tree Model

parameters = {}
parameters['max_depth'] = [i for i in range(1, 11)]
parameters['class_weight'] = [None, 'balanced']
parameters['max_features'] = [i for i in range(1, 8)]
parameters['min_samples_leaf'] = [i for i in range(1, 11)]

#
GS_tree = GridSearchCV(DecisionTreeClassifier(random_state = 1000), parameters , scoring = 'accuracy', cv = 10, verbose=1, n_jobs=-1)
GS_tree.fit(x_train, y_train)

GridSearchResults(GS_tree)

# Output:
Best parameters: 
 {'class_weight': None, 'max_depth': 8, 'max_features': 7, 'min_samples_leaf': 5}
Best score: 
 0.66344 (+/-0.00662)
All parameters: 
{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 8,
 'max_features': 7,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 1000,
 'splitter': 'best'}
                                            params	    mean_test_score	std_test_score	rank_test_score
554	{'class_weight': None, 'max_depth': 8, 'max_fe...	0.663444	0.006620	1
474	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.663371	0.009664	2
482	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.661808	0.006527	3
485	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.661666	0.008073	4
487	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.661594	0.008310	5
481	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.660883	0.008289	6
483	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.660670	0.007874	7
553	{'class_weight': None, 'max_depth': 8, 'max_fe...	0.660669	0.010791	8
473	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.660385	0.008892	9
480	{'class_weight': None, 'max_depth': 7, 'max_fe...	0.660314	0.009318	10


Here are the results of modeling and evaluation:

dt_model = DecisionTreeClassifier(max_depth=8,max_features=7, min_samples_leaf=5, random_state = 1000);
dt_model.fit(x_train, y_train)

Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, dt_model)
model_stats = model_stats.append({"Model": "Decision Tree model",
                                    "Accuracy": Accuracy,
                                    "Precision": Precision,
                                    "Recall": Recall,
                                    "F1-Score": F1,
                                    "AUC-Score": auc_score}, ignore_index=True)

# Output:
Results: 
+++++ Accuracy Score 0.673
+++++ Precision Score 0.647
+++++ Recall Score 0.582
+++++ F1 Score 0.612


              precision    recall  f1-score   support

    No churn       0.69      0.75      0.72      3354
       Churn       0.65      0.58      0.61      2675

    accuracy                           0.67      6029
   macro avg       0.67      0.66      0.67      6029
weighted avg       0.67      0.67      0.67      6029

+++++ AUC (Area under the ROC Curve) : 0.664


As we can see, the accuracy is 0.67.

4.3.3. Support Vector Machine Model

Here are the results of hyperparameter tuning:

parameters = {}
parameters['C'] = [10e-2, 1, 100]
parameters['kernel'] = ['linear','poly', 'rbf']
#parameters['gamma'] = np.arange(0.01, 0.4, 0.1)


#
GS_SVM = GridSearchCV(SVC(random_state = 1000, probability=True), parameters , scoring = 'accuracy', cv = 10, verbose=1, n_jobs=-1)
GS_SVM.fit(x_train, y_train)

GridSearchResults(GS_SVM)

# Output:
Best parameters: 
 {'C': 1, 'kernel': 'rbf'}
Best score: 
 0.64545 (+/-0.01170)
All parameters: 
{'C': 1,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': True,
 'random_state': 1000,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
                        params	    mean_test_score	std_test_score	rank_test_score
5	{'C': 1, 'kernel': 'rbf'}	0.645455	0.011701	1
7	{'C': 100, 'kernel': 'poly'}	0.644317	0.010408	2
8	{'C': 100, 'kernel': 'rbf'}	0.643818	0.012859	3
4	{'C': 1, 'kernel': 'poly'}	0.639838	0.010713	4
2	{'C': 0.1, 'kernel': 'rbf'}	0.637066	0.013727	5
1	{'C': 0.1, 'kernel': 'poly'}	0.632587	0.010276	6
0	{'C': 0.1, 'kernel': 'linear'}	0.612112	0.012761	7
6	{'C': 100, 'kernel': 'linear'}	0.612041	0.013000	8
3	{'C': 1, 'kernel': 'linear'}	0.611970	0.012831	9


Here are the results of modeling and evaluation:

svm_model = SVC(C=1, gamma = 0.31, kernel = 'rbf', random_state = 1000, probability=True);
svm_model.fit(x_train, y_train)

Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, svm_model)
model_stats = model_stats.append({"Model": "SVM model",
                                    "Accuracy": Accuracy,
                                    "Precision": Precision,
                                    "Recall": Recall,
                                    "F1-Score": F1,
                                    "AUC-Score": auc_score}, ignore_index=True)

# Output:
Results: 
+++++ Accuracy Score 0.664
+++++ Precision Score 0.631
+++++ Recall Score 0.587
+++++ F1 Score 0.608


              precision    recall  f1-score   support

    No churn       0.69      0.73      0.71      3354
       Churn       0.63      0.59      0.61      2675

    accuracy                           0.66      6029
   macro avg       0.66      0.66      0.66      6029
weighted avg       0.66      0.66      0.66      6029

+++++ AUC (Area under the ROC Curve) : 0.656


As we can see, the accuracy is 0.66.

4.3.4. Random Forest Model

Here are the results of hyperparameter tuning:

# GridSearchCV

parameters = {}
#parameters['max_features'] = ['auto', 'sqrt', 'log2', None]
parameters['n_estimators'] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300]
#parameters['criterion'] = ['entropy', 'gini']
parameters['max_depth'] = [7, 8, 9, 10, 11, 12, 13, 14, 15, None]

#
GS_rf_0 = GridSearchCV(RandomForestClassifier(), parameters , scoring = 'accuracy', cv = 10, verbose=1, n_jobs=-1)
GS_rf_0.fit(x_train, y_train)

GridSearchResults(GS_rf_0)

# Output.
Best parameters: 
 {'max_depth': 9, 'n_estimators': 200}
Best score: 
 0.66863 (+/-0.00920)
All parameters: 
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 9,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 200,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
                                params	    mean_test_score	std_test_score	rank_test_score
29	{'max_depth': 9, 'n_estimators': 200}	0.668632	0.009199	1
8	{'max_depth': 7, 'n_estimators': 900}	0.668418	0.009838	2
19	{'max_depth': 8, 'n_estimators': 600}	0.668348	0.008300	3
10	{'max_depth': 7, 'n_estimators': 1000}	0.668347	0.010234	4
49	{'max_depth': 10, 'n_estimators': 800}	0.668064	0.008299	5
15	{'max_depth': 8, 'n_estimators': 200}	0.667992	0.008415	6
23	{'max_depth': 8, 'n_estimators': 1100}	0.667921	0.008200	7
38	{'max_depth': 9, 'n_estimators': 1000}	0.667779	0.007937	8
6	{'max_depth': 7, 'n_estimators': 700}	0.667779	0.010037	9
28	{'max_depth': 9, 'n_estimators': 100}	0.667566	0.008903	10


Here are the results of modeling and evaluation:

# Hyperparameters taken from GS_rf_2, a later and finer tuning phase (its parameters are shown in section 4.3.6)
rf_model = RandomForestClassifier(criterion = 'entropy', max_depth =9, n_estimators = 170, max_features = None)
rf_model.fit(x_train, y_train)

Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, rf_model)
model_stats = model_stats.append({"Model": "Random Forest model",
                                    "Accuracy": Accuracy,
                                    "Precision": Precision,
                                    "Recall": Recall,
                                    "F1-Score": F1,
                                    "AUC-Score": auc_score}, ignore_index=True)

# Output:
Results: 
+++++ Accuracy Score 0.688
+++++ Precision Score 0.663
+++++ Recall Score 0.603
+++++ F1 Score 0.631


              precision    recall  f1-score   support

    No churn       0.70      0.76      0.73      3354
       Churn       0.66      0.60      0.63      2675

    accuracy                           0.69      6029
   macro avg       0.68      0.68      0.68      6029
weighted avg       0.69      0.69      0.69      6029

+++++ AUC (Area under the ROC Curve) : 0.679


As we can see, the accuracy is 0.69.

4.3.5. K-Nearest Neighbors Model

Here are the results of hyperparameter tuning:

# GridSearchCV for KNN

parameters = {}
parameters['n_neighbors'] = [i for i in range(1,50)]
#parameters['weights'] = ['uniform', 'distance']
#parameters['algorithm'] = ['auto', 'ball_tree', 'kd_tree', 'brute']

#
GS_knn_0 = GridSearchCV(KNeighborsClassifier(), parameters , scoring = 'accuracy', cv = 10, verbose=1, n_jobs=-1)
GS_knn_0.fit(x_train, y_train)

GridSearchResults(GS_knn_0)

# Output:
Best parameters: 
 {'n_neighbors': 37}
Best score: 
 0.63380 (+/-0.00678)
All parameters: 
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 37,
 'p': 2,
 'weights': 'uniform'}
                    params	mean_test_score	std_test_score	rank_test_score
36	{'n_neighbors': 37}	0.633796	0.006776	1
37	{'n_neighbors': 38}	0.632374	0.009099	2
38	{'n_neighbors': 39}	0.631309	0.004983	3
35	{'n_neighbors': 36}	0.630597	0.007942	4
34	{'n_neighbors': 35}	0.630454	0.008985	5
32	{'n_neighbors': 33}	0.630241	0.008958	6
30	{'n_neighbors': 31}	0.629885	0.010527	7
22	{'n_neighbors': 23}	0.629033	0.010570	8
33	{'n_neighbors': 34}	0.629032	0.009545	9
31	{'n_neighbors': 32}	0.628677	0.009892	10


Here are the results of modeling and evaluation:

knn_model = KNeighborsClassifier(algorithm='auto', n_neighbors = 37, weights = 'uniform')
knn_model.fit(x_train, y_train)

Accuracy, Precision, Recall, F1, auc_score, y_pred, y_prob = evaluationMetricsGCV(x_test, y_test, knn_model)
model_stats = model_stats.append({"Model": "KNN model",
                                    "Accuracy": Accuracy,
                                    "Precision": Precision,
                                    "Recall": Recall,
                                    "F1-Score": F1,
                                    "AUC-Score": auc_score}, ignore_index=True)

# Output:
Results: 
+++++ Accuracy Score 0.638
+++++ Precision Score 0.606
+++++ Recall Score 0.524
+++++ F1 Score 0.562


              precision    recall  f1-score   support

    No churn       0.66      0.73      0.69      3354
       Churn       0.61      0.52      0.56      2675

    accuracy                           0.64      6029
   macro avg       0.63      0.63      0.63      6029
weighted avg       0.63      0.64      0.63      6029

+++++ AUC (Area under the ROC Curve) : 0.626


As we can see, the accuracy is 0.64.


4.3.6. Summary and save the model

Now we have the statistics and results for all the models. As we can see, the best model is Random Forest, which has better accuracy (**0.69 approx.**) than the others.
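To pick the winner programmatically, one line over the model_stats dataframe built above is enough:

# Best model by accuracy
print(model_stats.sort_values('Accuracy', ascending=False).head(1))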

model_stats.head()

# Output:
|    | Model               |   Accuracy |   Precision |   Recall |   F1-Score |   AUC-Score | 
|---:|:--------------------|-----------:|------------:|---------:|-----------:|------------:|
|  0 | Logistic model      |   0.639244 |    0.598039 | 0.570093 |   0.583732 |    0.632244 |
|  1 | Decision Tree model |   0.673412 |    0.646717 | 0.581682 |   0.612478 |    0.664127 |
|  2 | SVM model           |   0.664123 |    0.630522 | 0.586916 |   0.607938 |    0.656308 |
|  3 | Random Forest model |   0.687842 |    0.663102 | 0.602617 |   0.631414 |    0.679215 |
|  4 | KNN model           |   0.637917 |    0.606494 | 0.523738 |   0.562086 |    0.626359 |
|  5 | KNN model           |   0.637917 |    0.606771 | 0.522617 |   0.561559 |    0.626246 |
|  6 | KNN model           |   0.637751 |    0.6066   | 0.522243 |   0.56127  |    0.626059 |
|  7 | KNN model           |   0.637917 |    0.606494 | 0.523738 |   0.562086 |    0.626359 |

So our final model is a Random Forest with the following hyperparameters (GS_rf_2 is the grid search from a second, finer tuning phase; see the conclusions about splitting the tuning into phases):

GridSearchResults(GS_rf_2)

# Output:

All parameters: 
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 9,
 'max_features': None,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 170,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We save the model with pickle in order to use it later for prediction on a new dataset.

# Saving the model:
filename = 'modelChurn.pickle'
pickle.dump(rf_model, open(filename,'wb'))

5. Prediction

We will test the model with a new dataset. At the beginning, we split the data in two: one part for train/test and another for prediction. Here we use the prediction part. The key tasks for this section:

graph LR;
  A[New Dataset] -->|Preprocess Pipeline| B[Dataset ready for Prediction]
  B -->|Load Model| C[Prediction]

5.1. Dataset Preparation

# Load the new dataset for prediction
df = pd.read_csv('./Data/df_prediction.csv', index_col=0)
print("Size of the dataset:  %d" % df.shape[0])
print("Number of variables: %d" % df.shape[1])
df.head()

# Output:
Size of the dataset:  5400
Number of variables: 30

    	churn	age	housing	credit_score	deposits	withdrawal	purchases_partners	purchases	cc_taken	cc_recommended	...	waiting_4_loan	cancelled_loan	received_loan	rejected_loan	zodiac_sign	left_for_two_month_plus	left_for_one_month	rewards_earned	reward_rate	is_referred
user																					
53131	0	37.0	O	588.0	5	0	19	5	0	58	...	0	0	0	0	Gemini	0	0	11.0	0.92	1
23310	1	31.0	na	546.0	0	0	67	0	0	144	...	0	0	0	0	Sagittarius	0	0	17.0	0.57	1
29996	0	51.0	na	508.0	0	0	7	0	0	15	...	0	0	0	0	Aries	0	0	6.0	0.20	1
60425	0	25.0	na	NaN	0	0	0	0	0	0	...	0	0	0	0	Pisces	1	0	NaN	0.00	0
22972	1	28.0	na	NaN	0	0	3	0	0	5	...	0	0	0	0	Scorpio	1	0	2.0	0.07	0
5 rows × 30 columns


During the pre-processing phase, we created the pipeline list "pipeline_preprocess", which holds all the functions needed for the pre-processing tasks.

print("\nSteps for pre-processing: ")
for step, function in enumerate(pipeline_preprocess):
    print("\t {:d}: {:s}".format(step, function.__name__))

# Output:
Steps for pre-processing: 
	 0: dropnull
	 1: dropduplicated
	 2: dropcolumns
	 3: remove_outlier
	 4: standardizeNum
	 5: removedummy


So now we use the pipeline function to process the new dataset:

def preprocess_data_pipeline(df, pipeline_preprocess):
    for step, function in enumerate(pipeline_preprocess):
        df = function(df)
    print("Size of the dataset:  %d" % df.shape[0])
    print("Number of variables: %d" % df.shape[1])
    display(df.head(10))
    return df

df_new = preprocess_data_pipeline(df, pipeline_preprocess)

# Output:
Removing columns credit_score and rewards_earned ...(1)
Drop null values from age column ...(2)
There are duplicated indexes....So removing duplicated indexes ...(3)
Drop app_web_user, deposits, ios_user, cc_recommended, cancelled_loan, received_loan, rejected_loan, waiting_4_loan columns ...(4)
The outliers are : ['age', 'purchases_partners', 'cc_application_begin', 'reward_rate']
Removing outliers ...(5)
Standardizing Numerical variables ...(6)
Convert categorical values into numbers ...(7)
Remove Categorical Variables housing_na, zodiac_sign_na, payment_type_na ...(8)
Size of the dataset:  5309
Number of variables: 29

        churn	age	purchases_partners	cc_application_begin	app_downloaded	web_user	android_user	left_for_two_month_plus	left_for_one_month	reward_rate	...	zodiac_sign_Cancer	zodiac_sign_Capricorn	zodiac_sign_Gemini	zodiac_sign_Leo	zodiac_sign_Libra	zodiac_sign_Pisces	zodiac_sign_Sagittarius	zodiac_sign_Scorpio	zodiac_sign_Taurus	zodiac_sign_Virgo
user																					
53131	0	0.669618	-0.036352	1.610414	1	1	1	0	0	0.026999	...	0	0	1	0	0	0	0	0	0	0
23310	1	-0.037564	1.761699	1.302417	1	0	0	0	0	-0.440722	...	0	0	0	0	0	0	1	0	0	0
29996	0	2.319710	-0.485864	0.840422	1	1	0	0	0	-0.935169	...	0	0	0	0	0	0	0	0	0	0
60425	0	-0.744746	-0.748080	-0.853562	0	1	0	1	0	-1.202438	...	0	0	0	0	0	1	0	0	0	0
22972	1	-0.391155	-0.635702	-0.853562	1	0	1	1	0	-1.108894	...	0	0	0	0	0	0	0	1	0	0
1195	0	0.080300	2.286131	1.610414	1	1	1	0	0	1.563795	...	0	0	0	0	0	0	0	0	0	0
41350	1	-0.626882	-0.748080	-0.853562	1	0	1	0	0	-1.202438	...	0	0	0	0	0	0	0	0	0	1
29695	0	-0.273291	-0.748080	-0.853562	1	1	0	0	0	-1.202438	...	1	0	0	0	0	0	0	0	0	0
15739	1	-0.980473	-0.748080	0.840422	1	0	1	0	0	1.376707	...	0	0	0	0	0	0	0	0	0	0
51516	0	-0.980473	0.825214	-0.391566	1	0	1	0	0	-0.400631	...	0	0	0	0	0	0	0	0	0	0
10 rows × 29 columns


Now we are ready to filter the dataset with the "new_num_features" list:

# Using feature selection for dataset

new_num_features = ['purchases_partners', 'reward_rate', 'cc_application_begin', 'app_downloaded', 'zodiac_sign_Virgo', 'zodiac_sign_Taurus', 'zodiac_sign_Scorpio', 'zodiac_sign_Aquarius', 'housing_O', 'age', 'zodiac_sign_Libra', 'zodiac_sign_Leo']

X = df_new[new_num_features]
y = df_new['churn']
print("X shape: {}".format(X.shape))
print("y shape: {}".format(y.shape))

# Output:
X shape: (5309, 12)
y shape: (5309,)

5.2. Make Predictions

# Load the model:
filename = 'modelChurn.pickle'
model_Loaded = pickle.load(open(filename, 'rb'))
# Make predictions

prediction = model_Loaded.predict(X)

results = []
for i in list(zip(y, prediction)):
    if i[0] == i[1]:
        res = 1
    else:
        res = 0
    results.append(res)
    #print(i, ' ', res)
    
print("% of succesful prediction: {}".format(np.sum(results)/len(results)))

# Output:
% of successful prediction: 0.6824260689395366
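Note that this "% of successful prediction" is exactly the accuracy on the new data; the manual loop above is equivalent to scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Equivalent one-liner to the manual comparison above
print("% of successful prediction: {}".format(accuracy_score(y, model_Loaded.predict(X))))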


6. Conclusions and Lessons learned

  1. As expected, the prediction accuracy on new data is around $0.7$, which reflects the model's accuracy.
  2. As reviewed, the 4 most important features are purchases_partners, reward_rate, cc_application_begin and app_downloaded. This seems logical because these features capture customer behavior inside the app. Others, like zodiac_sign or housing, have low importance.
  3. During development, we ran into processing-time issues with GridSearchCV. To speed it up, we split the hyperparameter tuning into phases, each with its own GridSearchCV.
  4. We tried to improve accuracy, but most models stayed in the range $[0.64, 0.69]$, so the model is not very strong. On the other hand, we chose 12 features; it might be better to choose fewer.
  5. During exploration, we needed to separate the analysis into categorical and numerical features, and, within the numerical ones, into binary and non-binary.