Car Plate Automatic Detection
Brief: The main idea of the project is to automate car plate detection using the LBP (Local Binary Patterns) algorithm. To do this, we create a dataset from raw images using a set of Computer Vision techniques.
Objective
Automate car plate detection using the LBP algorithm and compare supervised machine learning models such as Decision Tree, Support Vector Machine, and Logistic Regression.
Project Overview
- Generate a dataset from the raw original images in gray scale in order to create two classes: plate and background.
- Explore and analyze the behavior of the data with respect to outliers, class balance, and variable selection.
- Based on the exploration phase, pre-process the dataset so it is suitable for training, testing, and prediction.
- Build and evaluate four models with GridSearchCV in order to find the best parameters and compare metrics to select the best one.
- Save the model with the pickle package so it can be used later for prediction tests.
- Note: For more details HERE
Index
1. Dataset
2. Exploration and Analysis
3. Pre-Processing
4. Model and Evaluation
5. Prediction
6. Conclusions and Lessons learned
1. Dataset
The main idea for generating the dataset is to use the original gray scale images to produce small images of bounding-box size around the car plate, creating two classes: plate images and background images.
1.1. Raw data pre-processing
For pre-processing, we need to convert each image into many small images that can cover the car plate. The main steps are listed here; a minimal OpenCV sketch of them follows the list.
- Convert to gray scale if it is not already done.
- Apply a Gaussian filter to remove noise and smooth the image.
- Apply a vertical Sobel filter to detect vertical edges.
- Apply Otsu thresholding to segment the image.
- Apply a morphological closing operation to enhance the car plate region.
- Get the contours of the car plate in order to obtain the mask that will be used as the bounding box.
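The preprocess and extract_contours helpers used in the next snippet belong to the project code and are not reproduced in this document. The following is only a minimal sketch of what they could look like with OpenCV, following the steps above; the kernel sizes and the names preprocess_sketch and extract_contours_sketch are illustrative assumptions.

import cv2
import numpy as np

def preprocess_sketch(img):
    # Hedged sketch of the pipeline described above; parameters are assumptions.
    # 1. Gray scale (imageio loads RGB, so convert only if there are 3 channels)
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) if img.ndim == 3 else img
    # 2. Gaussian filter to remove noise and smooth the image
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # 3. Vertical Sobel filter to detect vertical edges
    sobel_v = cv2.Sobel(blurred, cv2.CV_8U, 1, 0, ksize=3)
    # 4. Otsu threshold to segment the image
    _, thresh = cv2.threshold(sobel_v, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # 5. Morphological closing to join the plate characters into a single blob
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 5))
    closed = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
    return closed

def extract_contours_sketch(binary_img):
    # 6. Contours of the candidate regions, used later as bounding-box masks
    # (OpenCV 4.x return signature)
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours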
import imageio
import numpy as np
import cv2
import matplotlib.pyplot as plt

# Load an example image and run the pre-processing pipeline
img1 = imageio.imread('../orig_0018.png')
morph_n_thresholded_img_open = preprocess(img1)
contour = extract_contours(morph_n_thresholded_img_open)
img_contours = np.zeros(img1.shape)
# Draw the contours on the empty image
cv2.drawContours(img_contours, contour, 0, (255, 255, 25), 3)
# Save the image
cv2.imwrite('contours.png', img_contours)
# Locate the plate and crop it from the original image using its bounding box (x, y, w, h)
plate_img, plate_coordinates = find_plates(img1)
x, y, w, h = plate_coordinates[0]
plt.imshow(img1[y - 7:y - 7 + h + 10, x:x + w], cmap="gray")
plt.show()
1.2. Dataset creation
To build the dataset we pre-process each image in the raw data and extract the car plate images and the background images. At the end we have two classes for training the model.
- Apply the pre-processing steps to each image and get the bounding box of each car plate.
- Use the bounding-box limits to create the background images.
- Use the bounding-box limits to create the plate images.
- Use the LBP algorithm to create a descriptor for each class and save it to a file for later usage (an illustrative sketch is shown below).
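algLBPnxn (used in the next snippet) is the project's own LBP implementation and is not reproduced here. Purely as an illustration, a basic 3x3 LBP descriptor could be computed as follows; the helper name lbp_3x3 is hypothetical.

import numpy as np

def lbp_3x3(gray):
    # Basic LBP: compare each pixel with its 8 neighbours, build an 8-bit code
    # per pixel, and summarise the image as a 256-bin histogram (the descriptor).
    gray = gray.astype(np.uint8)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # 8 neighbour offsets, clockwise starting at the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << bit
    histogram, _ = np.histogram(codes, bins=256, range=(0, 256))
    return codes, histogram  # the 256-bin histogram corresponds to "Bit0".."Bit255"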
# Compute the LBP image with the project's algLBPnxn helper (3x3 neighbourhood)
img_LBP = algLBPnxn(img, 3)
plt.imshow(img_LBP, cmap="gray")
plt.show()
import numpy as np
import pandas as pd

# Label the descriptors: 1 for plate, 0 for background ("fondo")
plate_arr = np.c_[ent_arr_p, np.ones(ent_arr_p.shape[0])]
fondo_arr = np.c_[ent_arr_f, np.zeros(ent_arr_f.shape[0])]
dataset_arr = np.vstack((plate_arr, fondo_arr))
# Shuffle the rows so plate and background samples are mixed
randomize = np.arange(len(dataset_arr))
np.random.shuffle(randomize)
dataset_arr = dataset_arr[randomize]
# One column per LBP histogram bin, plus the class label
col = ["Bit" + str(i) for i in range(256)]
col.append("class")
df = pd.DataFrame(dataset_arr, columns=col)
df.head()
# Output:
Bit0 Bit1 Bit2 Bit3 Bit4 Bit5 Bit6 Bit7 Bit8 Bit9 ... Bit247 Bit248 Bit249 Bit250 Bit251 Bit252 Bit253 Bit254 Bit255 class
0 376.0 12.0 8.0 12.0 15.0 3.0 16.0 63.0 79.0 21.0 ... 106.0 59.0 23.0 5.0 20.0 21.0 19.0 24.0 204.0 0.0
1 362.0 7.0 11.0 6.0 11.0 5.0 13.0 33.0 91.0 12.0 ... 123.0 85.0 19.0 2.0 25.0 33.0 25.0 34.0 494.0 0.0
2 328.0 5.0 10.0 7.0 5.0 0.0 14.0 29.0 61.0 9.0 ... 99.0 143.0 19.0 1.0 19.0 21.0 12.0 21.0 289.0 0.0
3 401.0 10.0 5.0 9.0 18.0 4.0 16.0 32.0 62.0 21.0 ... 121.0 57.0 36.0 7.0 50.0 35.0 23.0 50.0 370.0 0.0
4 382.0 3.0 7.0 9.0 8.0 3.0 13.0 97.0 155.0 13.0 ... 131.0 95.0 20.0 4.0 16.0 14.0 12.0 9.0 249.0 0.0
2. Exploration and Analysis
During exploration we focus on how the data behaves with respect to the following topics (a short pandas sketch of these checks follows the list):
- Checking for nulls
- Reviewing the data statistics to check for outliers and skewness
- Reviewing the target variable
- Checking correlations
- Checking the data distribution
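Most of these checks are standard pandas calls. A short sketch, assuming the df built in section 1.2, could be:

# Quick EDA sketch (assumes df from section 1.2)
print("Missing values:", df.isnull().sum().sum())                            # nulls
print(df["class"].value_counts())                                            # target value revision
print(df.describe().T[["mean", "std", "min", "max"]].head())                 # ranges, possible outliers
print(df.drop(columns="class").skew().sort_values(ascending=False).head())   # most skewed bins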
import seaborn as sns
import matplotlib.pyplot as plt

# Checking correlation. The dataset is split into two smaller subsets
# because there are too many variables for a single readable heatmap.
split1 = [i for i in range(7, 22)]
split2 = [i for i in range(50, 65)]
df1 = df.iloc[:, split1]
df2 = df.iloc[:, split2]
plt.subplots(figsize=(12, 8))
sns.heatmap(df1.corr(), annot=True)
# Box plots to review the data and check for outliers and skewness
plt.subplots(figsize=(15, 6))
df1.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)
3. Pre-Processing
As found in the exploration phase, we pre-process the dataset so it is suitable for training, testing, and prediction. We focus on the following topics:
- Removing outliers
- Scaling and normalization
- Feature selection
- Balancing the data
- Splitting the data into training and test sets
# 1. Removing outliers
# Detecting outliers: the interquartile range (IQR) is the distance between
# the third quartile and the first quartile.
X = df_new.iloc[:, :-1]
detect_outlier(X)
remove_outlier(X)
# Output:
Bit0 There is Outlier
Bit1 There is Outlier
Bit2 There is Outlier
Bit3 There is Outlier
Bit4 There is Outlier
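detect_outlier and remove_outlier are project helpers. A hedged sketch of what they could do with the IQR rule described in the comment above (the actual implementation may differ, for example by dropping rows instead of clipping values):

def detect_outlier_iqr(X):
    # Flag values outside the 1.5 * IQR fences, column by column
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    mask = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
    for column in X.columns[mask.any()]:
        print(column, "There is Outlier")
    return mask

def remove_outlier_iqr(X):
    # Clip values to the IQR fences
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    return X.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)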
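The scaling and normalization step (topic 2 of this section) is not shown in this section's code. A minimal sketch with scikit-learn, assuming a MinMaxScaler is acceptable (the scaler actually used in the project may differ):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 2. Scaling / normalization: scale the LBP features to [0, 1], keeping the class apart
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)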
For variable selection, we chose a logistic regression model and kept the features with the largest coefficients (> 0.75):
# Variable importance / selection based on the logistic regression coefficients:
feature_importance_lr = pd.DataFrame(list(zip(X.columns.values, log_rg.coef_.ravel())))
feature_importance_lr.columns = ['feature', 'coef']
feature_importance_lr.sort_values("coef", ascending=False, inplace=True)
# Note: .where() keeps non-matching rows as NaN instead of dropping them
feature_importance_lr.where(feature_importance_lr["coef"] > 0.75).head(10)
# Output:
feature coef
244 Bit244 1.200516
35 Bit35 1.178940
8 Bit8 1.090945
255 Bit255 1.039968
47 Bit47 0.774656
157 Bit157 0.755275
230 Bit230 0.754994
100 NaN NaN
129 NaN NaN
3 NaN NaN
During EDA, we found that the class variable is imbalanced, which can bias the model towards the majority class and lead to overfitting.
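A quick check of the class counts before resampling makes the imbalance explicit; X_imb and y_imb are assumed to hold the features and labels produced by the previous pre-processing steps.

from collections import Counter

# Class distribution before balancing (0.0 = background, 1.0 = plate)
print(Counter(y_imb))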
from collections import Counter
from imblearn.over_sampling import SMOTE

# 4. Balancing the dataset with SMOTE (oversample the minority class up to a 0.8 ratio):
smote_model = SMOTE(sampling_strategy=0.8, k_neighbors=5)
X, y = smote_model.fit_resample(X_imb, y_imb)
print(X_imb.shape, y_imb.shape, Counter(y))
# Class ratio (background / plate) after balancing:
counter = Counter(y)
class_ratio = counter[0.0] / counter[1.0]
print("Class Ratio: ", class_ratio)
y.value_counts().sort_index().plot.bar()
# Output:
(1937, 8) (1937,) Counter({0.0: 1837, 1.0: 1469})
Class Ratio:  1.250510551395507
from sklearn.model_selection import train_test_split

# 5. Splitting data for training and testing:
# Imbalanced data:
Xm_train, Xm_test, ym_train, ym_test = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42)
# Balanced data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(":::::Imbalanced Data:::::")
print("Train shape: ", Xm_train.shape, ym_train.shape)
print("Test shape: ", Xm_test.shape, ym_test.shape)
print("\n")
print(":::::Balanced Data:::::")
print("Train shape: ", X_train.shape, y_train.shape)
print("Test shape: ", X_test.shape, y_test.shape)
# Output:
:::::Imbalanced Data:::::
Train shape:  (1549, 8) (1549,)
Test shape:  (388, 8) (388,)
:::::Balanced Data:::::
Train shape:  (2644, 8) (2644,)
Test shape:  (662, 8) (662,)
4. Model and Evaluation
We use four models in order to compare metrics and evaluate which one performs best. We also use GridSearchCV to find the best hyperparameters; the metrics in the subsections below are printed by the project helper evaluationMetricsGCV (a sketch of it follows the metrics list).
Supervised models
- Decision Tree - trained on both imbalanced and balanced data in order to compare behavior.
- Logistic Regression
- Support Vector Machine
- K-Nearest Neighbors
Evaluation metrics
- Confusion matrix
- ROC, AUC
- Precision
- Recall
- F1-score
- MSE, RMSE
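The helper evaluationMetricsGCV used in the following subsections belongs to the project code and is not reproduced here. A hedged sketch of what it could compute, given the metrics listed above (the name evaluation_metrics_gcv and the exact print format are assumptions):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, roc_auc_score,
                             mean_squared_error)

def evaluation_metrics_gcv(X_test, y_test, grid):
    # Hypothetical re-implementation (sketch only); the project's helper may differ.
    y_pred = grid.predict(X_test)
    print("+++++ Accuracy score %.3f" % accuracy_score(y_test, y_pred))
    print("+++++ Precision score %.3f" % precision_score(y_test, y_pred))
    print("+++++ Recall score %.3f" % recall_score(y_test, y_pred))
    print("+++++ F1 score %.3f" % f1_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print("+++++ Best Score: %.2f%%" % (100 * grid.best_score_))
    print("+++++ Best params:", grid.best_params_)
    print("+++++ AUC (Area under the ROC Curve) :", round(roc_auc_score(y_test, y_pred), 3))
    mse = mean_squared_error(y_test, y_pred)
    print("+++++ GridSearchCV MSE :", round(mse, 3))
    print("+++++ GridSearchCV RMSE :", round(np.sqrt(mse), 3))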
4.1. Decision Tree - with imbalanced data
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [i for i in range(1, 11)],
              'max_features': [i for i in range(1, 8)],
              'min_samples_leaf': [i for i in range(1, 11)]}
dt_grid_imb = GridSearchCV(DecisionTreeClassifier(random_state=1000), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
dt_grid_imb.fit(Xm_train, ym_train)
evaluationMetricsGCV(Xm_test, ym_test, dt_grid_imb)
# Output:
+++++ Accuracy score 0.972%
+++++ Precision score 0.824%
+++++ Recall score 0.636%
+++++ F1 score 0.718%
precision recall f1-score support
0.0 0.98 0.99 0.99 366
1.0 0.82 0.64 0.72 22
accuracy 0.97 388
macro avg 0.90 0.81 0.85 388
weighted avg 0.97 0.97 0.97 388
+++++ Best Score: 97.93%
+++++ Best params: {'max_depth': 9, 'max_features': 6, 'min_samples_leaf': 5}
+++++ AUC (Area under the ROC Curve) : 0.814
+++++ GridSearchCV MSE : 0.028
+++++ GridSearchCV RMSE : 0.168
4.2. Decision Tree - with balanced data
param_grid = {'max_depth': [i for i in range(1, 11)],
              'max_features': [i for i in range(1, 8)],
              'min_samples_leaf': [i for i in range(1, 11)]}
dt_grid_bal = GridSearchCV(DecisionTreeClassifier(random_state=1000), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
dt_grid_bal.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, dt_grid_bal)
# Output:
+++++ Accuracy score 0.980%
+++++ Precision score 0.964%
+++++ Recall score 0.989%
+++++ F1 score 0.977%
precision recall f1-score support
0.0 0.99 0.97 0.98 388
1.0 0.96 0.99 0.98 274
accuracy 0.98 662
macro avg 0.98 0.98 0.98 662
weighted avg 0.98 0.98 0.98 662
+++++ Best Score: 98.26%
+++++ Best params: {'max_depth': 8, 'max_features': 7, 'min_samples_leaf': 1}
+++++ AUC (Area under the ROC Curve) : 0.982
+++++ GridSearchCV MSE : 0.020
+++++ GridSearchCV RMSE : 0.140
As we can see, the model trained on balanced data gives better results than the one trained on imbalanced data, so from here on we use the balanced data.
4.3. Logistic Regression
from sklearn.linear_model import LogisticRegression

# Note: only the 'l2' penalty is compatible with the 'lbfgs' solver,
# so the other candidates cannot be fitted by this search.
param_grid = {'penalty': ['l1', 'l2', 'elasticnet']}
lr_grid = GridSearchCV(LogisticRegression(random_state=1000, solver='lbfgs'), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
lr_grid.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, lr_grid)
# Output:
+++++ Accuracy score 0.831%
+++++ Precision score 0.798%
+++++ Recall score 0.792%
+++++ F1 score 0.795%
precision recall f1-score support
0.0 0.85 0.86 0.86 388
1.0 0.80 0.79 0.79 274
accuracy 0.83 662
macro avg 0.83 0.83 0.83 662
weighted avg 0.83 0.83 0.83 662
+++++ Best Score: 82.71%
+++++ Best params: {'penalty': 'l2'}
+++++ AUC (Area under the ROC Curve) : 0.825
+++++ GridSearchCV MSE : 0.169
+++++ GridSearchCV RMSE : 0.411
4.4. Support Vector Machine
import numpy as np
from sklearn.svm import SVC

param_grid = {
    'gamma': np.arange(0.01, 0.4, 0.1),
    'kernel': ['linear', 'poly', 'rbf']
}
SVC_grid = GridSearchCV(SVC(random_state=1000, probability=True), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
SVC_grid.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, SVC_grid)
# Output:
+++++ Accuracy score 0.903%
+++++ Precision score 0.880%
+++++ Recall score 0.887%
+++++ F1 score 0.884%
precision recall f1-score support
0.0 0.92 0.91 0.92 388
1.0 0.88 0.89 0.88 274
accuracy 0.90 662
macro avg 0.90 0.90 0.90 662
weighted avg 0.90 0.90 0.90 662
+++++ Best Score: 88.09%
+++++ Best params: {'gamma': 0.31000000000000005, 'kernel': 'rbf'}
+++++ AUC (Area under the ROC Curve) : 0.901
+++++ GridSearchCV MSE : 0.097
+++++ GridSearchCV RMSE : 0.311
4.5. K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# KNN with a fixed number of neighbours (k = 31)
op_knn = KNeighborsClassifier(n_neighbors=31)
op_knn.fit(X_train, y_train)
y_pred = op_knn.predict(X_test)
print_accuracy(accuracy_score(y_test, y_pred), 'KNN accuracy:')
# Output:
KNN accuracy: 89.88%
print(classification_report(y_test,y_pred))
# Output:
precision recall f1-score support
0.0 0.99 0.83 0.91 388
1.0 0.81 0.99 0.89 274
accuracy 0.90 662
macro avg 0.90 0.91 0.90 662
weighted avg 0.92 0.90 0.90 662
print("Auc Score: {}".format(roc_auc_score(y_test,y_pred)))
# Output:
Auc Score: 0.9125874783655656
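The value k = 31 is taken as given here. One hedged way it could be selected is a simple cross-validated sweep over odd values of k; this sketch assumes the balanced training split from section 3:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hedged sketch: pick k by mean cross-validated accuracy on the training set
ks = list(range(1, 52, 2))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=10).mean() for k in ks]
best_k = ks[int(np.argmax(scores))]
print("Best k:", best_k)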
5. Prediction
We saved the model with pickle so it can be reused later for prediction tests. For prediction we load the serialized model and apply it to new images.
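The serialization step itself is not shown in this excerpt. A minimal sketch of how the selected Decision Tree could have been saved, assuming the grid-searched model dt_grid_bal from section 4.2 and the file name used below:

import pickle

# Save the best estimator found by the grid search for later prediction
with open('modelLBPDT.pickle', 'wb') as f:
    pickle.dump(dt_grid_bal.best_estimator_, f)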
import pickle

# Load the serialized Decision Tree model:
filename = 'modelLBPDT.pickle'
with open(filename, 'rb') as f:
    modelLBPDT_Loaded = pickle.load(f)
# Use the loaded model to make predictions on the LBP descriptors of the new crops:
results = modelLBPDT_Loaded.predict(X_test_pred)
for i in list(zip(imagesBB, results)):
    print(i, ' ')
# Output:
('./Prediction/ImagesBB/orig_0006_129_226.png', 0.0)
('./Prediction/ImagesBB/orig_0004_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0002_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0002_129_0.png', 0.0)
('./Prediction/ImagesBB/orig_0005_129_113.png', 0.0)
('./Prediction/ImagesBB/orig_0002_86_0.png', 0.0)
('./Prediction/ImagesBB/orig_0004_129_0.png', 0.0)
('./Prediction/ImagesBB/orig_0009_43_113.png', 1.0)
('./Prediction/ImagesBB/orig_0005_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0004_43_226.png', 1.0)
('./Prediction/ImagesBB/orig_0009_172_0.png', 0.0)
('./Prediction/ImagesBB/orig_0009_43_226.png', 0.0)
('./Prediction/ImagesBB/orig_0005_86_113.png', 0.0)
('./Prediction/ImagesBB/orig_0009_86_226.png', 0.0)
('./Prediction/ImagesBB/orig_0006_0_0.png', 0.0)
('./Prediction/ImagesBB/orig_0002_0_0.png', 0.0)
('./Prediction/ImagesBB/orig_0009_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0009_0_113.png', 0.0)
('./Prediction/ImagesBB/orig_0009_86_0.png', 1.0)
('./Prediction/ImagesBB/orig_0002_43_113.png', 0.0)
('./Prediction/ImagesBB/orig_0001_43_113.png', 0.0)
('./Prediction/ImagesBB/orig_0002_172_113.png', 0.0)
('./Prediction/ImagesBB/orig_0009_86_113.png', 0.0)
('./Prediction/ImagesBB/orig_0005_129_0.png', 0.0)
6. Conclusions and Lessons learned
- During dataset creation, we faced many issues with bounding boxes that did not fit the car plate area well. We had to experiment with different values for the width and height (mean, max, incrementing by a few pixels).
- To explore the many variables, we split them into two smaller sets, analyzed each one, and extracted some insights.
- During evaluation, we found that the Decision Tree model performs better than the others, although there are models we did not test, such as ensembles and CNNs.
- During the project, we realized that LBP descriptors help reduce computational resources because, instead of working with N×M×Z arrays, the descriptors let us work with N×M.