Car Plate Automatic Detection
Brief: The main idea of the project is to automate car plate detection using the LBP (Local Binary Patterns) algorithm. To do this, we create a dataset from raw images using a set of Computer Vision techniques.
Objective
Automate car plate detection using the LBP algorithm and compare supervised machine learning models such as Decision Tree, Support Vector Machine, and Logistic Regression.
Project Overview
- Generate a dataset from the raw original images in gray scale in order to create two classes: plate and background.
- Explore and analyze the behavior of the data with respect to outliers, class balance, and variable selection.
- Based on the exploration phase, pre-process the dataset so it is suitable for training, testing, and prediction.
- Build and evaluate four models with GridSearchCV in order to find the best parameters and compare metrics to select the best one.
- Save the model with the pickle package so it can be used later for prediction tests.
- Note: For more details HERE
Index
1. Dataset
2. Exploration and Analysis
3. Pre-Processing
4. Model and Evaluation
5. Prediction
6. Conclusions and Lessons learned
1. Dataset
The main idea for generating the dataset is to use the original gray scale images to produce small images of bounding-box size around the car plate, creating two classes: plate images and background images.
1.1. Raw data pre-processing
For pre-processing, we need to convert each image into many small images that can cover the car plate. The main steps are listed here; a minimal OpenCV sketch of them follows the list.
- Convert to gray scale if it is not already done.
- Apply a Gaussian filter to remove noise and smooth the image.
- Apply a vertical Sobel filter to detect vertical edges.
- Apply Otsu thresholding to segment the image.
- Apply a morphological closing operation to enhance the car plate region.
- Get the contours of the car plate in order to obtain the mask that will be used as the bounding box.
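The preprocess and extract_contours helpers used in the next snippet belong to the project code and are not reproduced in this document. The following is only a minimal sketch of what they could look like with OpenCV, following the steps above; the kernel sizes and the names preprocess_sketch and extract_contours_sketch are illustrative assumptions.

import cv2
import numpy as np

def preprocess_sketch(img):
    # Hedged sketch of the pipeline described above; parameters are assumptions.
    # 1. Gray scale (imageio loads RGB, so convert only if there are 3 channels)
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) if img.ndim == 3 else img
    # 2. Gaussian filter to remove noise and smooth the image
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # 3. Vertical Sobel filter to detect vertical edges
    sobel_v = cv2.Sobel(blurred, cv2.CV_8U, 1, 0, ksize=3)
    # 4. Otsu threshold to segment the image
    _, thresh = cv2.threshold(sobel_v, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # 5. Morphological closing to join the plate characters into a single blob
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 5))
    closed = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
    return closed

def extract_contours_sketch(binary_img):
    # 6. Contours of the candidate regions, used later as bounding-box masks
    # (OpenCV 4.x return signature)
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours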
import imageio
import numpy as np
import cv2
import matplotlib.pyplot as plt

# Load an example image and run the pre-processing pipeline
img1 = imageio.imread('../orig_0018.png')
morph_n_thresholded_img_open = preprocess(img1)
contour = extract_contours(morph_n_thresholded_img_open)
img_contours = np.zeros(img1.shape)
# Draw the contours on the empty image
cv2.drawContours(img_contours, contour, 0, (255, 255, 25), 3)
# Save the image
cv2.imwrite('contours.png', img_contours)
# Locate the plate and crop it from the original image using its bounding box (x, y, w, h)
plate_img, plate_coordinates = find_plates(img1)
x, y, w, h = plate_coordinates[0]
plt.imshow(img1[y - 7:y - 7 + h + 10, x:x + w], cmap="gray")
plt.show()
1.2. Dataset creation
To build the dataset we pre-process each image in the raw data and extract the car plate images and the background images. At the end we have two classes for training the model.
- Apply the pre-processing steps to each image and get the bounding box of each car plate.
- Use the bounding-box limits to create the background images.
- Use the bounding-box limits to create the plate images.
- Use the LBP algorithm to create a descriptor for each class and save it to a file for later usage (an illustrative sketch is shown below).
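algLBPnxn (used in the next snippet) is the project's own LBP implementation and is not reproduced here. Purely as an illustration, a basic 3x3 LBP descriptor could be computed as follows; the helper name lbp_3x3 is hypothetical.

import numpy as np

def lbp_3x3(gray):
    # Basic LBP: compare each pixel with its 8 neighbours, build an 8-bit code
    # per pixel, and summarise the image as a 256-bin histogram (the descriptor).
    gray = gray.astype(np.uint8)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # 8 neighbour offsets, clockwise starting at the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << bit
    histogram, _ = np.histogram(codes, bins=256, range=(0, 256))
    return codes, histogram  # the 256-bin histogram corresponds to "Bit0".."Bit255"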
# Compute the LBP image with the project's algLBPnxn helper (3x3 neighbourhood)
img_LBP = algLBPnxn(img, 3)
plt.imshow(img_LBP, cmap="gray")
plt.show()
import numpy as np
import pandas as pd

# Label the descriptors: 1 for plate, 0 for background ("fondo")
plate_arr = np.c_[ent_arr_p, np.ones(ent_arr_p.shape[0])]
fondo_arr = np.c_[ent_arr_f, np.zeros(ent_arr_f.shape[0])]
dataset_arr = np.vstack((plate_arr, fondo_arr))
# Shuffle the rows so plate and background samples are mixed
randomize = np.arange(len(dataset_arr))
np.random.shuffle(randomize)
dataset_arr = dataset_arr[randomize]
# One column per LBP histogram bin, plus the class label
col = ["Bit" + str(i) for i in range(256)]
col.append("class")
df = pd.DataFrame(dataset_arr, columns=col)
df.head()
# Output:
Bit0 Bit1 Bit2 Bit3 Bit4 Bit5 Bit6 Bit7 Bit8 Bit9 ... Bit247 Bit248 Bit249 Bit250 Bit251 Bit252 Bit253 Bit254 Bit255 class
0 376.0 12.0 8.0 12.0 15.0 3.0 16.0 63.0 79.0 21.0 ... 106.0 59.0 23.0 5.0 20.0 21.0 19.0 24.0 204.0 0.0
1 362.0 7.0 11.0 6.0 11.0 5.0 13.0 33.0 91.0 12.0 ... 123.0 85.0 19.0 2.0 25.0 33.0 25.0 34.0 494.0 0.0
2 328.0 5.0 10.0 7.0 5.0 0.0 14.0 29.0 61.0 9.0 ... 99.0 143.0 19.0 1.0 19.0 21.0 12.0 21.0 289.0 0.0
3 401.0 10.0 5.0 9.0 18.0 4.0 16.0 32.0 62.0 21.0 ... 121.0 57.0 36.0 7.0 50.0 35.0 23.0 50.0 370.0 0.0
4 382.0 3.0 7.0 9.0 8.0 3.0 13.0 97.0 155.0 13.0 ... 131.0 95.0 20.0 4.0 16.0 14.0 12.0 9.0 249.0 0.0
2. Exploration and Analysis
During exploration we focus on how the data behaves with respect to the following topics (a short pandas sketch of these checks follows the list):
- Checking for nulls
- Reviewing the data statistics to check for outliers and skewness
- Reviewing the target variable
- Checking correlations
- Checking the data distribution
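Most of these checks are standard pandas calls. A short sketch, assuming the df built in section 1.2, could be:

# Quick EDA sketch (assumes df from section 1.2)
print("Missing values:", df.isnull().sum().sum())                            # nulls
print(df["class"].value_counts())                                            # target value revision
print(df.describe().T[["mean", "std", "min", "max"]].head())                 # ranges, possible outliers
print(df.drop(columns="class").skew().sort_values(ascending=False).head())   # most skewed bins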
import seaborn as sns
import matplotlib.pyplot as plt

# Checking correlation. The dataset is split into two smaller subsets
# because there are too many variables for a single readable heatmap.
split1 = [i for i in range(7, 22)]
split2 = [i for i in range(50, 65)]
df1 = df.iloc[:, split1]
df2 = df.iloc[:, split2]
plt.subplots(figsize=(12, 8))
sns.heatmap(df1.corr(), annot=True)
# Box plots to review the data and check for outliers and skewness
plt.subplots(figsize=(15, 6))
df1.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)
3. Pre-Processing
As found in the exploration phase, we pre-process the dataset so it is suitable for training, testing, and prediction. We focus on the following topics:
- Removing outliers
- Scaling and normalization
- Feature selection
- Balancing the data
- Splitting the data into training and test sets
# 1. Removing outliers
# Detecting outliers: the interquartile range (IQR) is the distance between
# the third quartile and the first quartile.
X = df_new.iloc[:, :-1]
detect_outlier(X)
remove_outlier(X)
# Output:
Bit0 There is Outlier
Bit1 There is Outlier
Bit2 There is Outlier
Bit3 There is Outlier
Bit4 There is Outlier
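detect_outlier and remove_outlier are project helpers. A hedged sketch of what they could do with the IQR rule described in the comment above (the actual implementation may differ, for example by dropping rows instead of clipping values):

def detect_outlier_iqr(X):
    # Flag values outside the 1.5 * IQR fences, column by column
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    mask = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
    for column in X.columns[mask.any()]:
        print(column, "There is Outlier")
    return mask

def remove_outlier_iqr(X):
    # Clip values to the IQR fences
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    return X.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)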
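The scaling and normalization step (topic 2 of this section) is not shown in this section's code. A minimal sketch with scikit-learn, assuming a MinMaxScaler is acceptable (the scaler actually used in the project may differ):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 2. Scaling / normalization: scale the LBP features to [0, 1], keeping the class apart
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)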
For variable selection, we chose a logistic regression model and kept the features with the largest coefficients (> 0.75):
# Variable importance / selection based on the logistic regression coefficients:
feature_importance_lr = pd.DataFrame(list(zip(X.columns.values, log_rg.coef_.ravel())))
feature_importance_lr.columns = ['feature', 'coef']
feature_importance_lr.sort_values("coef", ascending=False, inplace=True)
# Note: .where() keeps non-matching rows as NaN instead of dropping them
feature_importance_lr.where(feature_importance_lr["coef"] > 0.75).head(10)
# Output:
feature coef
244 Bit244 1.200516
35 Bit35 1.178940
8 Bit8 1.090945
255 Bit255 1.039968
47 Bit47 0.774656
157 Bit157 0.755275
230 Bit230 0.754994
100 NaN NaN
129 NaN NaN
3 NaN NaN
During EDA, we found that the class variable is imbalanced, which can bias the model towards the majority class and lead to overfitting.
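A quick check of the class counts before resampling makes the imbalance explicit; X_imb and y_imb are assumed to hold the features and labels produced by the previous pre-processing steps.

from collections import Counter

# Class distribution before balancing (0.0 = background, 1.0 = plate)
print(Counter(y_imb))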
from collections import Counter
from imblearn.over_sampling import SMOTE

# 4. Balancing the dataset with SMOTE (oversample the minority class up to a 0.8 ratio):
smote_model = SMOTE(sampling_strategy=0.8, k_neighbors=5)
X, y = smote_model.fit_resample(X_imb, y_imb)
print(X_imb.shape, y_imb.shape, Counter(y))
# Class ratio (background / plate) after balancing:
counter = Counter(y)
class_ratio = counter[0.0] / counter[1.0]
print("Class Ratio: ", class_ratio)
y.value_counts().sort_index().plot.bar()
# Output:
(1937, 8) (1937,) Counter({0.0: 1837, 1.0: 1469})
Class Ratio:  1.250510551395507
from sklearn.model_selection import train_test_split

# 5. Splitting data for training and testing:
# Imbalanced data:
Xm_train, Xm_test, ym_train, ym_test = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42)
# Balanced data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(":::::Imbalanced Data:::::")
print("Train shape: ", Xm_train.shape, ym_train.shape)
print("Test shape: ", Xm_test.shape, ym_test.shape)
print("\n")
print(":::::Balanced Data:::::")
print("Train shape: ", X_train.shape, y_train.shape)
print("Test shape: ", X_test.shape, y_test.shape)
# Output:
:::::Imbalanced Data:::::
Train shape:  (1549, 8) (1549,)
Test shape:  (388, 8) (388,)
:::::Balanced Data:::::
Train shape:  (2644, 8) (2644,)
Test shape:  (662, 8) (662,)
4. Model and Evaluation
We use four models in order to compare metrics and evaluate which one performs best. We also use GridSearchCV to find the best hyperparameters; the metrics in the subsections below are printed by the project helper evaluationMetricsGCV (a sketch of it follows the metrics list).
Supervised models
- Decision Tree - trained on both imbalanced and balanced data in order to compare behavior.
- Logistic Regression
- Support Vector Machine
- K-Nearest Neighbors
Evaluation metrics
- Confusion matrix
- ROC, AUC
- Precision
- Recall
- F1-score
- MSE, RMSE
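The helper evaluationMetricsGCV used in the following subsections belongs to the project code and is not reproduced here. A hedged sketch of what it could compute, given the metrics listed above (the name evaluation_metrics_gcv and the exact print format are assumptions):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, roc_auc_score,
                             mean_squared_error)

def evaluation_metrics_gcv(X_test, y_test, grid):
    # Hypothetical re-implementation (sketch only); the project's helper may differ.
    y_pred = grid.predict(X_test)
    print("+++++ Accuracy score %.3f" % accuracy_score(y_test, y_pred))
    print("+++++ Precision score %.3f" % precision_score(y_test, y_pred))
    print("+++++ Recall score %.3f" % recall_score(y_test, y_pred))
    print("+++++ F1 score %.3f" % f1_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print("+++++ Best Score: %.2f%%" % (100 * grid.best_score_))
    print("+++++ Best params:", grid.best_params_)
    print("+++++ AUC (Area under the ROC Curve) :", round(roc_auc_score(y_test, y_pred), 3))
    mse = mean_squared_error(y_test, y_pred)
    print("+++++ GridSearchCV MSE :", round(mse, 3))
    print("+++++ GridSearchCV RMSE :", round(np.sqrt(mse), 3))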
4.1. Decision Tree - with imbalanced data
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [i for i in range(1, 11)],
              'max_features': [i for i in range(1, 8)],
              'min_samples_leaf': [i for i in range(1, 11)]}
dt_grid_imb = GridSearchCV(DecisionTreeClassifier(random_state=1000), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
dt_grid_imb.fit(Xm_train, ym_train)
evaluationMetricsGCV(Xm_test, ym_test, dt_grid_imb)
# Output:
+++++ Accuracy score 0.972%
+++++ Precision score 0.824%
+++++ Recall score 0.636%
+++++ F1 score 0.718%
precision recall f1-score support
0.0 0.98 0.99 0.99 366
1.0 0.82 0.64 0.72 22
accuracy 0.97 388
macro avg 0.90 0.81 0.85 388
weighted avg 0.97 0.97 0.97 388
+++++ Best Score: 97.93%
+++++ Best params: {'max_depth': 9, 'max_features': 6, 'min_samples_leaf': 5}
+++++ AUC (Area under the ROC Curve) : 0.814
+++++ GridSearchCV MSE : 0.028
+++++ GridSearchCV RMSE : 0.168
4.2. Decision Tree - with balanced data
param_grid = {'max_depth': [i for i in range(1, 11)],
              'max_features': [i for i in range(1, 8)],
              'min_samples_leaf': [i for i in range(1, 11)]}
dt_grid_bal = GridSearchCV(DecisionTreeClassifier(random_state=1000), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
dt_grid_bal.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, dt_grid_bal)
# Output:
+++++ Accuracy score 0.980%
+++++ Precision score 0.964%
+++++ Recall score 0.989%
+++++ F1 score 0.977%
precision recall f1-score support
0.0 0.99 0.97 0.98 388
1.0 0.96 0.99 0.98 274
accuracy 0.98 662
macro avg 0.98 0.98 0.98 662
weighted avg 0.98 0.98 0.98 662
+++++ Best Score: 98.26%
+++++ Best params: {'max_depth': 8, 'max_features': 7, 'min_samples_leaf': 1}
+++++ AUC (Area under the ROC Curve) : 0.982
+++++ GridSearchCV MSE : 0.020
+++++ GridSearchCV RMSE : 0.140
As we can see, the model trained on balanced data gives better results than the one trained on imbalanced data, so from here on we use the balanced data.
4.3. Logistic Regression
from sklearn.linear_model import LogisticRegression

# Note: only the 'l2' penalty is compatible with the 'lbfgs' solver,
# so the other candidates cannot be fitted by this search.
param_grid = {'penalty': ['l1', 'l2', 'elasticnet']}
lr_grid = GridSearchCV(LogisticRegression(random_state=1000, solver='lbfgs'), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
lr_grid.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, lr_grid)
# Output:
+++++ Accuracy score 0.831%
+++++ Precision score 0.798%
+++++ Recall score 0.792%
+++++ F1 score 0.795%
precision recall f1-score support
0.0 0.85 0.86 0.86 388
1.0 0.80 0.79 0.79 274
accuracy 0.83 662
macro avg 0.83 0.83 0.83 662
weighted avg 0.83 0.83 0.83 662
+++++ Best Score: 82.71%
+++++ Best params: {'penalty': 'l2'}
+++++ AUC (Area under the ROC Curve) : 0.825
+++++ GridSearchCV MSE : 0.169
+++++ GridSearchCV RMSE : 0.411
4.4. Support Vector Machine
import numpy as np
from sklearn.svm import SVC

param_grid = {
    'gamma': np.arange(0.01, 0.4, 0.1),
    'kernel': ['linear', 'poly', 'rbf']
}
SVC_grid = GridSearchCV(SVC(random_state=1000, probability=True), param_grid=param_grid, cv=10, return_train_score=True, n_jobs=-1)
SVC_grid.fit(X_train, y_train)
evaluationMetricsGCV(X_test, y_test, SVC_grid)
# Output:
+++++ Accuracy score 0.903%
+++++ Precision score 0.880%
+++++ Recall score 0.887%
+++++ F1 score 0.884%
precision recall f1-score support
0.0 0.92 0.91 0.92 388
1.0 0.88 0.89 0.88 274
accuracy 0.90 662
macro avg 0.90 0.90 0.90 662
weighted avg 0.90 0.90 0.90 662
+++++ Best Score: 88.09%
+++++ Best params: {'gamma': 0.31000000000000005, 'kernel': 'rbf'}
+++++ AUC (Area under the ROC Curve) : 0.901
+++++ GridSearchCV MSE : 0.097
+++++ GridSearchCV RMSE : 0.311
4.5. K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# KNN with a fixed number of neighbours (k = 31)
op_knn = KNeighborsClassifier(n_neighbors=31)
op_knn.fit(X_train, y_train)
y_pred = op_knn.predict(X_test)
print_accuracy(accuracy_score(y_test, y_pred), 'KNN accuracy:')
# Output:
KNN accuracy: 89.88%
print(classification_report(y_test,y_pred))
# Output:
precision recall f1-score support
0.0 0.99 0.83 0.91 388
1.0 0.81 0.99 0.89 274
accuracy 0.90 662
macro avg 0.90 0.91 0.90 662
weighted avg 0.92 0.90 0.90 662
print("Auc Score: {}".format(roc_auc_score(y_test,y_pred)))
# Output:
Auc Score: 0.9125874783655656
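The value k = 31 is taken as given here. One hedged way it could be selected is a simple cross-validated sweep over odd values of k; this sketch assumes the balanced training split from section 3:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hedged sketch: pick k by mean cross-validated accuracy on the training set
ks = list(range(1, 52, 2))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=10).mean() for k in ks]
best_k = ks[int(np.argmax(scores))]
print("Best k:", best_k)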
5. Prediction
We saved the model with pickle so it can be reused later for prediction tests. For prediction we load the serialized model and apply it to new images.
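The serialization step itself is not shown in this excerpt. A minimal sketch of how the selected Decision Tree could have been saved, assuming the grid-searched model dt_grid_bal from section 4.2 and the file name used below:

import pickle

# Save the best estimator found by the grid search for later prediction
with open('modelLBPDT.pickle', 'wb') as f:
    pickle.dump(dt_grid_bal.best_estimator_, f)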
import pickle

# Load the serialized Decision Tree model:
filename = 'modelLBPDT.pickle'
with open(filename, 'rb') as f:
    modelLBPDT_Loaded = pickle.load(f)
# Use the loaded model to make predictions on the LBP descriptors of the new crops:
results = modelLBPDT_Loaded.predict(X_test_pred)
for i in list(zip(imagesBB, results)):
    print(i, ' ')
# Output:
('./Prediction/ImagesBB/orig_0006_129_226.png', 0.0)
('./Prediction/ImagesBB/orig_0004_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0002_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0002_129_0.png', 0.0)
('./Prediction/ImagesBB/orig_0005_129_113.png', 0.0)
('./Prediction/ImagesBB/orig_0002_86_0.png', 0.0)
('./Prediction/ImagesBB/orig_0004_129_0.png', 0.0)
('./Prediction/ImagesBB/orig_0009_43_113.png', 1.0)
('./Prediction/ImagesBB/orig_0005_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0004_43_226.png', 1.0)
('./Prediction/ImagesBB/orig_0009_172_0.png', 0.0)
('./Prediction/ImagesBB/orig_0009_43_226.png', 0.0)
('./Prediction/ImagesBB/orig_0005_86_113.png', 0.0)
('./Prediction/ImagesBB/orig_0009_86_226.png', 0.0)
('./Prediction/ImagesBB/orig_0006_0_0.png', 0.0)
('./Prediction/ImagesBB/orig_0002_0_0.png', 0.0)
('./Prediction/ImagesBB/orig_0009_172_226.png', 0.0)
('./Prediction/ImagesBB/orig_0009_0_113.png', 0.0)
('./Prediction/ImagesBB/orig_0009_86_0.png', 1.0)
('./Prediction/ImagesBB/orig_0002_43_113.png', 0.0)
('./Prediction/ImagesBB/orig_0001_43_113.png', 0.0)
('./Prediction/ImagesBB/orig_0002_172_113.png', 0.0)
('./Prediction/ImagesBB/orig_0009_86_113.png', 0.0)
('./Prediction/ImagesBB/orig_0005_129_0.png', 0.0)
6. Conclusions and Lessons learned
- During dataset creation, we faced many issues with bounding boxes that did not fit the car plate area well. We had to experiment with different values for the width and height (mean, max, incrementing by a few pixels).
- To explore the many variables, we split them into two smaller sets, analyzed each one, and extracted some insights.
- During evaluation, we found that the Decision Tree model performs better than the others, although there are models we did not test, such as ensembles and CNNs.
- During the project, we realized that LBP descriptors help reduce computational resources because, instead of working with N×M×Z arrays, the descriptors let us work with N×M.