YonKeenn </>

Hello, I'm Jhon Vargas - a Data Scientist & Technical Lead

Curious. Self Learning. Proactive.

Car Plate Automatic Detection

Brief: The main idea of the project is to automate car plate detection using the LBP (Local Binary Patterns) algorithm. To do this, we create a dataset from raw data using a set of computer vision techniques.

Objective

Automate car plate detection using the LBP algorithm and compare supervised machine learning models such as Decision Tree, Support Vector Machine, and Logistic Regression.

Project Overview


1. Dataset

The main idea for generating the dataset is to take the original grayscale images and cut them into small images the size of the car plate bounding box, producing two classes: plate images and background images.

1.1. Raw data pre-processing

For pre-processing, we need to convert one image into many small images that can cover the car plate. Here are the main steps:

import cv2
import imageio
import numpy as np
import matplotlib.pyplot as plt

# preprocess, extract_contours and find_plates are helper functions of this project.
img1 = imageio.imread('../orig_0018.png')
morph_n_thresholded_img_open = preprocess(img1)
contour = extract_contours(morph_n_thresholded_img_open)
# Draw the first contour on an empty image and save it
img_contours = np.zeros(img1.shape)
cv2.drawContours(img_contours, contour, 0, (255,255,25), 3)
cv2.imwrite('contours.png', img_contours)
# Locate the plate and show the crop with a small vertical margin
plate_img, plate_coordinates = find_plates(img1)
x, y, w, h = plate_coordinates[0][:4]
plt.imshow(img1[y-7:y-7+h+10, x:x+w], cmap="gray")
plt.show()
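
The idea of turning one image into many small candidate images can be sketched as a simple non-overlapping tiling. This is only an illustration under stated assumptions: tile_image, the 43x113 window and the synthetic image are hypothetical (the window size is guessed from offsets such as 43_113 in the prediction file names at the end of this post), and it is not the project's find_plates implementation.

```python
import numpy as np

def tile_image(img, tile_h, tile_w):
    """Yield (row, col, patch) for every full non-overlapping tile of a 2-D image."""
    height, width = img.shape
    for row in range(0, height - tile_h + 1, tile_h):
        for col in range(0, width - tile_w + 1, tile_w):
            yield row, col, img[row:row + tile_h, col:col + tile_w]

# Synthetic grayscale image standing in for a real photo (hypothetical size).
image = (np.arange(215 * 339) % 256).astype(np.uint8).reshape(215, 339)

tiles = list(tile_image(image, tile_h=43, tile_w=113))
tile_offsets = [(row, col) for row, col, _ in tiles]
print(len(tiles), tile_offsets[:4])
```

Each patch can then be labelled as plate or background, depending on whether it covers the detected bounding box.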

1.2. Dataset creation

To create the dataset, we pre-process each image in the raw data and extract the car plate images and the background images. In the end we have two classes for training the model.

img_LBP = algLBPnxn(img, 3)
plt.imshow(img_LBP, cmap="gray")
plt.show()
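
The algLBPnxn function itself is not shown in this post. As a rough sketch of what a basic 3x3 LBP does, the code below compares each interior pixel with its 8 neighbours, packs the comparisons into an 8-bit code, and counts how often each of the 256 codes occurs. The names lbp_3x3 and lbp_histogram, the neighbour order and the >= comparison are assumptions; the 256-bin histogram is one plausible reading of the Bit0 to Bit255 columns used later.

```python
import numpy as np

def lbp_3x3(gray):
    """Return the 8-bit LBP code of every interior pixel of a 2-D grayscale array."""
    gray = np.asarray(gray, dtype=np.int32)
    rows, cols = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left corner; the order fixes the bit weights.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes += (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def lbp_histogram(gray):
    """Count how often each of the 256 LBP codes occurs (one feature per code)."""
    return np.bincount(lbp_3x3(gray).ravel(), minlength=256)

peak = np.array([[5, 5, 5],
                 [5, 9, 5],
                 [5, 5, 5]])     # centre brighter than every neighbour
flat = np.full((4, 4), 7)        # constant image, 2x2 interior pixels

print(lbp_3x3(peak))
print(lbp_histogram(flat)[255])
```

A bright isolated pixel yields code 0, while a flat region yields code 255 everywhere, so the histogram summarises the local texture of a patch.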

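# Label the feature rows: 1 for plate images, 0 for background ("fondo") images
plate_arr = np.c_[ent_arr_p, np.ones(ent_arr_p.shape[0])]
fondo_arr = np.c_[ent_arr_f, np.zeros(ent_arr_f.shape[0])]
dataset_arr = np.vstack((plate_arr, fondo_arr))
# Shuffle the rows so that both classes are mixed
randomize = np.arange(len(dataset_arr))
np.random.shuffle(randomize)
dataset_arr = dataset_arr[randomize]
# 256 feature columns (Bit0..Bit255) plus the class label
col = ["Bit" + str(i) for i in range(256)]
col.append("class")
# Note: wrapping col in a list creates one-level MultiIndex columns, so labels
# later appear as tuples such as ('Bit0',); pass columns=col for plain labels.
df = pd.DataFrame(dataset_arr, columns = [col])
df.head()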

# Output:
        Bit0	Bit1	Bit2	Bit3	Bit4	Bit5	Bit6	Bit7	Bit8	Bit9	...	Bit247	Bit248	Bit249	Bit250	Bit251	Bit252	Bit253	Bit254	Bit255	class
0	376.0	12.0	8.0	12.0	15.0	3.0	16.0	63.0	79.0	21.0	...	106.0	59.0	23.0	5.0	20.0	21.0	19.0	24.0	204.0	0.0
1	362.0	7.0	11.0	6.0	11.0	5.0	13.0	33.0	91.0	12.0	...	123.0	85.0	19.0	2.0	25.0	33.0	25.0	34.0	494.0	0.0
2	328.0	5.0	10.0	7.0	5.0	0.0	14.0	29.0	61.0	9.0	...	99.0	143.0	19.0	1.0	19.0	21.0	12.0	21.0	289.0	0.0
3	401.0	10.0	5.0	9.0	18.0	4.0	16.0	32.0	62.0	21.0	...	121.0	57.0	36.0	7.0	50.0	35.0	23.0	50.0	370.0	0.0
4	382.0	3.0	7.0	9.0	8.0	3.0	13.0	97.0	155.0	13.0	...	131.0	95.0	20.0	4.0	16.0	14.0	12.0	9.0	249.0	0.0

2. Exploration and Analysis

For exploration, we check how the data behaves with respect to the following topics:

  1. Checking for nulls
  2. Reviewing data information to check for outliers and skewness
  3. Reviewing the target value
  4. Checking correlation
  5. Checking data distribution

# Checking correlation. The dataset is split into column subsets because there are too many variables for a single heatmap.
split1 = list(range(7, 22))
split2 = list(range(50, 65))

df1 = df.iloc[:, split1]
df2 = df.iloc[:, split2]

plt.subplots(figsize=(12,8))
sns.heatmap(df1.corr(), annot=True)


# Reviewing data information to check for outliers and skewness.
plt.subplots(figsize=(15,6))
df1.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)
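
The remaining checks from the list above can be sketched with a few pandas calls. The toy frame below is hypothetical (made-up values and only two feature columns), so it only illustrates the calls, not the results obtained on the real dataset.

```python
import pandas as pd

# Tiny hypothetical frame with two feature columns and the class label.
toy = pd.DataFrame({
    "Bit0": [376.0, 362.0, 328.0, 401.0, None],
    "Bit1": [12.0, 7.0, 5.0, 10.0, 90.0],
    "class": [0.0, 0.0, 0.0, 0.0, 1.0],
})

null_counts = toy.isnull().sum()               # 1. missing values per column
skewness = toy.drop(columns="class").skew()    # 2. asymmetry of each feature
class_counts = toy["class"].value_counts()     # 3. how balanced the target is
correlation = toy.corr()                       # 4. pairwise Pearson correlation

print(null_counts["Bit0"], class_counts[0.0], class_counts[1.0])
print(skewness["Bit1"] > 0)
```

A clearly positive skew, as in the made-up Bit1 column, usually points to a few very large values, which is what the boxplot makes visible.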



3. Pre-Processing

Based on the findings of the exploration phase, we pre-process the dataset so that it is suitable for training, testing and prediction. We focus on the following topics:

  1. Removing outliers
  2. Scaling and normalization
  3. Feature selection
  4. Balancing data
  5. Splitting data for training and testing

# 1. Removing outliers
# Outliers are detected with the interquartile range (IQR),
# the distance between the third quartile and the first quartile.
# detect_outlier and remove_outlier are helper functions of this project.
X = df_new.iloc[:, :-1]
detect_outlier(X)
remove_outlier(X)
# Output:
('Bit0',) There is Outlier
('Bit1',) There is Outlier
('Bit2',) There is Outlier
('Bit3',) There is Outlier
('Bit4',) There is Outlier
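
The detect_outlier and remove_outlier helpers are not listed in this post. A minimal sketch of the IQR rule they are described as using could look like the following. The function names, the usual factor k=1.5 and the choice to cap values rather than drop rows are all assumptions.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def has_outlier(values, k=1.5):
    """True if any value lies outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    values = np.asarray(values, dtype=float)
    return bool(np.any((values < lower) | (values > upper)))

def clip_outliers(values, k=1.5):
    """One possible treatment: cap values at the fences instead of dropping rows."""
    lower, upper = iqr_bounds(values, k)
    return np.clip(values, lower, upper)

# First five values come from the Bit0 column shown earlier; the last one is made up.
bit0 = [376.0, 362.0, 328.0, 401.0, 382.0, 900.0]

print(iqr_bounds(bit0))
print(has_outlier(bit0), clip_outliers(bit0).max())
```

Capping keeps the number of rows unchanged, which matters when the minority class is already small.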

For variable selection, we chose a logistic regression model (log_rg below) and kept the features with the largest coefficients (> 0.75):

# Variable importance is ranked using the model coefficients:

feature_importance_lr = pd.DataFrame(zip(X.columns.values, log_rg.coef_.ravel()))
feature_importance_lr.columns = ['feature', 'coef']
feature_importance_lr.sort_values("coef", ascending=False, inplace=True)
feature_importance_lr.where(feature_importance_lr["coef"] >0.75).head(10)
# Output:
      feature     coef
244   (Bit244,)   1.200516
35    (Bit35,)    1.178940
8     (Bit8,)     1.090945
255   (Bit255,)   1.039968
47    (Bit47,)    0.774656
157   (Bit157,)   0.755275
230   (Bit230,)   0.754994
100   NaN         NaN
129   NaN         NaN
3     NaN         NaN
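
The NaN rows in the output above come from DataFrame.where, which keeps the original shape and blanks out the rows that fail the condition. A small sketch of the difference with boolean indexing follows; the first three coefficients are copied from the output above and the last two are made up for illustration.

```python
import pandas as pd

# Three coefficients from the output above plus two hypothetical small ones.
coefs = pd.DataFrame({
    "feature": ["Bit244", "Bit35", "Bit8", "Bit100", "Bit129"],
    "coef": [1.200516, 1.178940, 1.090945, 0.31, -0.42],
})

masked = coefs.where(coefs["coef"] > 0.75)   # keeps every row, failing rows become NaN
selected = coefs[coefs["coef"] > 0.75]       # keeps only the rows that pass

print(len(masked), len(selected))
print(selected["feature"].tolist())
```

Boolean indexing returns just the selected features, which is convenient when the list is passed on to build the reduced feature matrix.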

During EDA, we found that the class variable is imbalanced, which can bias the model toward the majority class and could make it overfit.

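from collections import Counter
from imblearn.over_sampling import SMOTE

# 4. Balancing the dataset:
# Oversample the minority class with SMOTE up to 80% of the majority class size
smote_model = SMOTE(sampling_strategy = 0.8, k_neighbors = 5)
X, y = smote_model.fit_resample(X_imb, y_imb)
# Shapes before resampling and class counts after resampling
print(X_imb.shape, y_imb.shape, Counter(y))
# Class ratio (majority/minority) after resampling:
counter = Counter(y)
class_ratio = counter[0.0]/counter[1.0]
print("Class Ratio: ", class_ratio)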
# Output:
(1937, 8) (1937,) Counter({0.0: 1837, 1.0: 1469})
Class Ratio:  1.250510551395507

# After Balancing:
counter = Counter(y)
class_ratio = counter[0.0]/counter[1.0]
print("Class Ratio: ", class_ratio)
y.value_counts().sort_index().plot.bar()
# Output:
Class Ratio:  1.250510551395507
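
The reported counts can be reproduced with a little arithmetic. The sketch below assumes that the number of synthetic samples is truncated to an integer, which matches the numbers above; oversampled_counts is a hypothetical helper, and the 100 minority samples are inferred as 1937 - 1837 rather than printed in the post.

```python
from collections import Counter

def oversampled_counts(counts, sampling_strategy):
    """Class counts after oversampling the minority up to sampling_strategy * majority."""
    majority_label, n_majority = max(counts.items(), key=lambda kv: kv[1])
    minority_label, n_minority = min(counts.items(), key=lambda kv: kv[1])
    # Number of synthetic minority samples to create (assumed to be truncated).
    n_new = int(n_majority * sampling_strategy - n_minority)
    return Counter({majority_label: n_majority, minority_label: n_minority + n_new})

before = Counter({0.0: 1837, 1.0: 100})   # 1937 rows in total
after = oversampled_counts(before, sampling_strategy=0.8)
ratio = after[0.0] / after[1.0]

print(after, ratio)
```

With sampling_strategy = 0.8 the classes are therefore not fully balanced: the majority class stays about 1.25 times larger than the minority class.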

from sklearn.model_selection import train_test_split

# 5. Splitting data for training and testing:
# Imbalanced data:
Xm_train, Xm_test, ym_train, ym_test = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42)
# Balanced data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(":::::ImBalanced Data:::::")
print("Train shape: ", Xm_train.shape, ym_train.shape)
print("Testing shape: ", Xm_test.shape, ym_test.shape)
print("\n")
print(":::::Balanced Data:::::")
print("Train shape: ", X_train.shape, y_train.shape)
print("Testing shape: ", X_test.shape, y_test.shape)

# Output:
:::::ImBalanced Data:::::
Train shape:  (1549, 8) (1549,)
Testing shape:  (388, 8) (388,)


:::::Balanced Data:::::
Train shape:  (2644, 8) (2644,)
Testing shape:  (662, 8) (662,)
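
As a quick sanity check, the shapes above follow from the row counts if the test share is rounded up, which is how train_test_split treats a fractional test_size. The split_sizes helper is hypothetical and only redoes that arithmetic.

```python
import math

def split_sizes(n_samples, test_size):
    """Train/test row counts when the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

imbalanced = split_sizes(1937, 0.2)   # rows before resampling
balanced = split_sizes(3306, 0.2)     # 1837 + 1469 rows after resampling

print(imbalanced)   # (1549, 388)
print(balanced)     # (2644, 662)
```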

4. Model and Evaluation

We use 4 models in order to compare their metrics and evaluate which one performs best. We also use GridSearchCV to find the best parameters.
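
To give a feeling for the cost of an exhaustive grid search, the sketch below expands the decision tree grid used in section 4.1 with plain Python. The expand_grid helper is hypothetical; GridSearchCV does the equivalent enumeration internally and fits every candidate once per fold.

```python
from itertools import product

param_grid = {
    "max_depth": list(range(1, 11)),
    "max_features": list(range(1, 8)),
    "min_samples_leaf": list(range(1, 11)),
}

def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = expand_grid(param_grid)
n_folds = 10   # cv=10 in the searches below

print(len(candidates), len(candidates) * n_folds)   # 700 7000
print(candidates[0])
```

With 700 candidates and 10 folds, the search trains 7000 trees, which is why n_jobs=-1 is useful here.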

Supervised models

  1. Decision Tree - with Imbalanced/Balanced data in order to check behavior.
  2. Logistic Regression
  3. Support Vector Machine
  4. K-Nearest Neighbors

Evaluation metrics

  1. Confusion matrix
  2. ROC, AUC
  3. Precision
  4. Recall
  5. F1-score
  6. MSE, RMSE

4.1. Decision Tree - with Imbalanced Data

param_grid = {'max_depth': [i for i in range(1, 11)], 
              'max_features': [i for i in range(1, 8)], 
              'min_samples_leaf': [i for i in range(1, 11)]}
dt_grid_imb = GridSearchCV(DecisionTreeClassifier(random_state = 1000), param_grid=param_grid, cv=10, return_train_score = True, n_jobs=-1)
# Fit the search on the imbalanced training split before evaluating it
dt_grid_imb.fit(Xm_train, ym_train)
evaluationMetricsGCV(Xm_test, ym_test, dt_grid_imb)
# Output:
+++++ Accuracy score 0.972%
+++++ Precision score 0.824%
+++++ Recall score 0.636%
+++++ F1 score 0.718%


              precision    recall  f1-score   support

         0.0       0.98      0.99      0.99       366
         1.0       0.82      0.64      0.72        22

    accuracy                           0.97       388
   macro avg       0.90      0.81      0.85       388
weighted avg       0.97      0.97      0.97       388


+++++ Best Score: 97.93%
+++++ Best params: {'max_depth': 9, 'max_features': 6, 'min_samples_leaf': 5}
+++++ AUC (Area under the ROC Curve) : 0.814

+++++ GridSearchCV MSE : 0.028
+++++ GridSearchCV RMSE : 0.168
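
The metrics above are all derived from the same confusion matrix. The sketch below recomputes them from counts that are inferred from the report above (support 366/22, recall 0.64, precision 0.82), not taken from the project's output. The binary_metrics helper is hypothetical, and the AUC formula assumes the ROC curve was built from hard 0/1 predictions.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    error_rate = (fp + fn) / total     # equals the MSE when labels are 0/1
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "auc_from_labels": (recall + specificity) / 2,
        "mse": error_rate,
        "rmse": math.sqrt(error_rate),
    }

# Inferred counts: 14 plates found, 8 missed, 3 false alarms, 363 true negatives.
metrics = binary_metrics(tp=14, fp=3, fn=8, tn=363)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The high accuracy is driven by the 366 background samples, while recall shows that about a third of the plates are missed, so accuracy alone is misleading on the imbalanced data.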

4.2. Decision Tree - with Balanced Data

param_grid = {'max_depth': [i for i in range(1, 11)], 
              'max_features': [i for i in range(1, 8)], 
              'min_samples_leaf': [i for i in range(1, 11)]}
dt_grid_bal = GridSearchCV(DecisionTreeClassifier(random_state = 1000), param_grid=param_grid, cv=10, return_train_score = True, n_jobs=-1)
dt_grid_bal.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, dt_grid_bal)
# Output:

+++++ Accuracy score 0.980%
+++++ Precision score 0.964%
+++++ Recall score 0.989%
+++++ F1 score 0.977%


              precision    recall  f1-score   support

         0.0       0.99      0.97      0.98       388
         1.0       0.96      0.99      0.98       274

    accuracy                           0.98       662
   macro avg       0.98      0.98      0.98       662
weighted avg       0.98      0.98      0.98       662


+++++ Best Score: 98.26%
+++++ Best params: {'max_depth': 8, 'max_features': 7, 'min_samples_leaf': 1}
+++++ AUC (Area under the ROC Curve) : 0.982

+++++ GridSearchCV MSE : 0.020
+++++ GridSearchCV RMSE : 0.140

As we can see, the model trained on balanced data gives better results than the one trained on imbalanced data, so from now on we will use the balanced data.

4.3. Logistic Regression

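# Note: the lbfgs solver only supports the 'l2' penalty (or none), so the 'l1' and
# 'elasticnet' candidates fail to fit and are scored as NaN by GridSearchCV.
param_grid = {'penalty': ['l1', 'l2','elasticnet']}
lr_grid = GridSearchCV(LogisticRegression(random_state = 1000, solver = 'lbfgs'), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
lr_grid.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, lr_grid)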
# Output:
+++++ Accuracy score 0.831%
+++++ Precision score 0.798%
+++++ Recall score 0.792%
+++++ F1 score 0.795%


              precision    recall  f1-score   support

         0.0       0.85      0.86      0.86       388
         1.0       0.80      0.79      0.79       274

    accuracy                           0.83       662
   macro avg       0.83      0.83      0.83       662
weighted avg       0.83      0.83      0.83       662


+++++ Best Score: 82.71%
+++++ Best params: {'penalty': 'l2'}
+++++ AUC (Area under the ROC Curve) : 0.825

+++++ GridSearchCV MSE : 0.169
+++++ GridSearchCV RMSE : 0.411

4.4. Support Vector Machine

param_grid = {
            'gamma': np.arange(0.01, 0.4, 0.1),
            'kernel': ['linear','poly','rbf']
             }
SVC_grid = GridSearchCV(SVC(random_state = 1000, probability=True), param_grid=param_grid, cv=10, return_train_score=True,n_jobs=-1)
SVC_grid.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, SVC_grid)
# Output:
+++++ Accuracy score 0.903%
+++++ Precision score 0.880%
+++++ Recall score 0.887%
+++++ F1 score 0.884%


              precision    recall  f1-score   support

         0.0       0.92      0.91      0.92       388
         1.0       0.88      0.89      0.88       274

    accuracy                           0.90       662
   macro avg       0.90      0.90      0.90       662
weighted avg       0.90      0.90      0.90       662


+++++ Best Score: 88.09%
+++++ Best params: {'gamma': 0.31000000000000005, 'kernel': 'rbf'}
+++++ AUC (Area under the ROC Curve) : 0.901

+++++ GridSearchCV MSE : 0.097
+++++ GridSearchCV RMSE : 0.311

4.5. K-Nearest Neighbors

op_knn = KNeighborsClassifier(n_neighbors  = 31)
op_knn.fit(X_train,y_train)
y_pred = op_knn.predict(X_test)
print_accuracy(accuracy_score(y_test, y_pred), 'KNN accuracy:')
# Output:
KNN accuracy: 89.88%

print(classification_report(y_test,y_pred))
# Output:
              precision    recall  f1-score   support

         0.0       0.99      0.83      0.91       388
         1.0       0.81      0.99      0.89       274

    accuracy                           0.90       662
   macro avg       0.90      0.91      0.90       662
weighted avg       0.92      0.90      0.90       662

print("Auc Score: {}".format(roc_auc_score(y_test,y_pred)))
# Output:
Auc Score: 0.9125874783655656


5. Prediction

We saved the model with pickle in order to use it later in the prediction test. For prediction, we use the serialized model and new images for testing.

import pickle

# Load the model:
filename = 'modelLBPDT.pickle'
with open(filename, 'rb') as f:
    modelLBPDT_Loaded = pickle.load(f)

# Use the loaded model to make predictions:
results = modelLBPDT_Loaded.predict(X_test_pred)
for i in zip(imagesBB, results):
    print(i, ' ')
# Output:
('./Prediction/ImagesBB/orig_0006_129_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0004_172_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0002_172_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0002_129_0.png', 0.0)  
('./Prediction/ImagesBB/orig_0005_129_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0002_86_0.png', 0.0)  
('./Prediction/ImagesBB/orig_0004_129_0.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_43_113.png', 1.0)  
('./Prediction/ImagesBB/orig_0005_172_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0004_43_226.png', 1.0)  
('./Prediction/ImagesBB/orig_0009_172_0.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_43_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0005_86_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_86_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0006_0_0.png', 0.0)  
('./Prediction/ImagesBB/orig_0002_0_0.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_172_226.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_0_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_86_0.png', 1.0)  
('./Prediction/ImagesBB/orig_0002_43_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0001_43_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0002_172_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0009_86_113.png', 0.0)  
('./Prediction/ImagesBB/orig_0005_129_0.png', 0.0) 
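
To turn the raw list above into something easier to read, the positive tiles can be grouped by source image. The sketch below assumes that the file names follow the pattern orig_<image id>_<offset>_<offset>.png and that the two numbers are the tile offsets; both are guesses from the names, and positive_tiles is a hypothetical helper. The sample rows are copied from the output above.

```python
import re
from collections import defaultdict

TILE_NAME = re.compile(r"orig_(\d+)_(\d+)_(\d+)\.png$")

def positive_tiles(predictions):
    """Group tiles predicted as plate (label 1.0) by source image id."""
    hits = defaultdict(list)
    for path, label in predictions:
        match = TILE_NAME.search(path)
        if match and label == 1.0:
            image_id, first, second = match.groups()
            hits[image_id].append((int(first), int(second)))
    return dict(hits)

sample = [
    ('./Prediction/ImagesBB/orig_0006_129_226.png', 0.0),
    ('./Prediction/ImagesBB/orig_0009_43_113.png', 1.0),
    ('./Prediction/ImagesBB/orig_0004_43_226.png', 1.0),
    ('./Prediction/ImagesBB/orig_0009_86_0.png', 1.0),
]

plates_by_image = positive_tiles(sample)
print(plates_by_image)
```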


6. Conclusions and Lessons learned