Identification of glucose levels in urine based on classification using K-nearest neighbor algorithm method

Diabetes mellitus is one of the diseases with the highest number of sufferers in the world. According to the World Health Organization (WHO), as many as 415 million people suffered from this disease in 2019 [1]. Early diagnosis and ongoing control are important for diabetics to ensure a healthy life and to avoid complications and death [2].

Glucose levels are measured in urine rather than in blood to avoid introducing bacteria into the body through a syringe [3]. Obesity triggers an increase in glucose due to the accumulation of fatty tissue in the body [4]. When glucose levels in the blood are >180 mg/100 mL, glucose is extracted by the kidneys and flushed out through the urine [5,6]. The level of glucose contained in the urine is thereafter determined.

This study designed a prototype to detect glucose in urine using the AS7262 sensor as the main component for detecting the color of urine specimens. These specimens are grouped into five classes, namely, Normal, Positive 1, Positive 2, Positive 3, and Positive 4. Urine produces color changes according to glucose levels based on Benedict's test [7]. The analysis of this research is strengthened using machine learning (ML) with the K-nearest neighbor (KNN) algorithm.

Materials and Methods

2.1. System design

This system was designed by the researchers to detect the color of the specimen using the AS7262 sensor, based on the intensities of six colors: violet, blue, green, yellow, orange, and red. The supporting components installed in this system include the DS18B20 sensor to monitor the temperature, a heater to process the specimen, a fan as a coolant to lower the temperature when the system overheats, a buzzer and a light-emitting diode (LED) as markers of each process, a printed circuit board (PCB) carrying the microcontroller, a liquid crystal display (LCD) to show the menu and the result, a button to control the system, and a switch to turn the system on or off. A block diagram of the system design, which explains the overall concept of the prototype, is shown in Figure 1.

Block diagram. PLN, State Electricity Company; I2C, Inter Integrated Circuit System.

Figure 1 shows that this prototype uses a digital communication line, while the AS7262 sensor uses the Inter Integrated Circuit System (I2C). The AS7262 and DS18B20 sensors are the inputs. The value generated by the DS18B20 is used to set the heater temperature automatically, and the AS7262 values are used as the dataset in the next process. The wiring diagram is shown in Figure 2.

Figure 2 shows that each pin on the Arduino Nano is connected to each component in order to control the performance of each component through a program using digital communication. The flow of data collection specifically can be seen in Figure 3.

Figure 3 shows that after the initialization process, the temperature is set and the heating process is started. Heating produces a color in the specimen, which is detected using the AS7262 sensor; the resulting data are displayed on the LCD and sent to the serial monitor. These data are used as the dataset to be processed in the next step.

The data generated by the sensor are processed during ML using the KNN algorithm. The data process in ML is shown in Figure 4.

ML is used to extract relevant data [8]. The dataset contains the readings of the AS7262 sensor, converted into comma-separated value (CSV) form so that it can be read by a Python program. The data are processed for classification to determine the glucose level based on the color of the specimen. After the data are classified, the algorithm makes predictions, and the accuracy of the classification process is determined.
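The CSV step described above can be sketched as follows. This is only an illustration: the column names and the `load_readings` helper are assumptions, since the article does not give the file layout, and the three rows are sample readings taken from Table 2.

```python
import csv
import io

# Hypothetical CSV layout (an assumption): six AS7262 channel readings
# followed by a class label. The rows are sample readings from Table 2.
SAMPLE_CSV = """violet,blue,green,yellow,orange,red,label
21.34,23.36,18.68,19.64,14.02,6.66,Normal
14.93,13.53,24.27,21.48,19.23,6.53,Positive 1
12.19,10.07,22.63,24.14,21.87,9.05,Positive 2
"""

CHANNELS = ["violet", "blue", "green", "yellow", "orange", "red"]


def load_readings(handle):
    """Parse a CSV file-like object into (features, labels) lists."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        # Convert the six channel columns to floats, in a fixed order.
        features.append([float(row[name]) for name in CHANNELS])
        labels.append(row["label"])
    return features, labels


# io.StringIO stands in for an opened CSV file exported from the prototype.
features, labels = load_readings(io.StringIO(SAMPLE_CSV))
```

With a real export, the same function could be called on `open(path, newline="")` instead of the in-memory string.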

2.2. Procedure of collecting data

2.2.1. Sample preparation

The specimens in this study were mixtures of urine processed by Benedict's method to produce a colored precipitate, which was assayed according to the grade of the total glucose level. Benedict's test follows the procedure described in a previous study (Pratiwi et al. 2020), using five urine samples with different glucose levels. The difference between that study and the current one lies in the sensor: the previous study used photodiode sensors to detect colors, while the current study uses the AS7262 sensor [9]. Benedict's test sample was prepared by using 10 mL of Benedict's solution and 20 mL of urine specimen (containing 0.4 mL, which is equivalent to eight drops). This test is done to detect the glucose content in the urine. The concentration is calculated using Eq. (1).

(1) $m = \frac{(\%)\,\mathrm{Solute}}{100\%} \times V_{\mathrm{solution}}$

Here, m is the total amount of glucose, ‘(%) Solute’ is the glucose level in percent, and V solution is the total volume of the specimen solution [5].
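As a worked instance of Eq. (1), the short sketch below evaluates the formula for one input. The function name `glucose_mass` is hypothetical, and the example assumes that the percentage is a mass/volume figure (grams per 100 mL) and that the volume is given in millilitres, so the result is in grams; the article does not state these units explicitly.

```python
def glucose_mass(percent_solute, solution_volume):
    """Eq. (1): m = (% solute / 100 %) * V_solution."""
    return percent_solute / 100.0 * solution_volume


# Assumed example: a 1 % glucose level in 20 mL of solution.
example_mass = glucose_mass(1.0, 20.0)  # about 0.2 (grams, under the stated assumptions)
```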

2.2.2. K-Nearest Neighbor Algorithm

The working principle of KNN is to make predictions based on the proximity of the object's characteristics to those of the neighborhood training data closest to the object [10, 11]. This algorithm works by determining the input in the form of training data, testing data, and the value of K [12]. The value of K is the number of nearest neighbors of the data to be classified to determine the class label [13,14].

The advantage of this method is that data are classified directly from the classes of their nearest neighbors; K is the number of neighbors taken into account when determining the class [15]. KNN is widely used in ML because of its simplicity [16]. The KNN algorithm is visualized in Figure 5.

Figure 5 shows data consisting of two classes, namely, the red class and the green class. The test datum is indicated by a blue arrow, and K = 5. With this value of K, the test datum is assigned to the green class because the majority of its nearest neighbors are green. A high K value can result in lower accuracy because it blurs the boundaries between classes [17]. The neighbor distance is calculated using the Euclidean distance, shown in Eq. (2).

(2) $euc = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

Here, $p_i$ is the sample data, $q_i$ is the test data, $i$ is the variable index, and $n$ is the data dimension [18]. Tests were carried out to determine the accuracy of the KNN classification. Accuracy is the level of closeness of the predicted values to the actual values, and it is reported together with precision, recall, and F-measure [19]. Accuracy is calculated as shown in Eq. (3).

(3) $Accuracy = \frac{\mathrm{Valid\ value}}{\mathrm{Total\ data}} \times 100\%$
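The distance of Eq. (2), the majority vote among the K nearest neighbors, and the accuracy of Eq. (3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation; the function names and the two-dimensional toy points (chosen in the spirit of Figure 5) are made up for the example.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Eq. (2): Euclidean distance between sample p and test point q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_x, train_y, query, k):
    """Label `query` by majority vote among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(predicted, actual):
    """Eq. (3): share of correct predictions, in percent."""
    valid = sum(1 for p, a in zip(predicted, actual) if p == a)
    return valid / len(actual) * 100


# Toy two-class data (made-up values, for illustration only).
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
           (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["green", "green", "green", "red", "red", "red"]

# With k = 5 the neighbors are three green and two red points.
prediction = knn_predict(train_x, train_y, (2.0, 2.0), k=5)
```

In the study itself, each point would have six coordinates (one per color channel) rather than two.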

Precision measures how many of the samples predicted as belonging to a class actually belong to it [20]. The precision value is calculated by using Eq. (4).

(4) $Precision = \frac{TP}{TP + FP}$

Here, TP indicates true positive, and FP indicates false positive.

Recall measures how many of the samples that actually belong to a class are correctly identified [20]. The recall value is calculated by using Eq. (5).

(5) $Recall = \frac{TP}{TP + FN}$

Here, FN indicates false negative. F-measure is the harmonic mean of precision and recall [21]. The F-measure value is calculated by using Eq. (6).

(6) $F\text{-}measure = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

Description of the variables:

TP = true positive;

TN = true negative;

FP = false positive;

FN = false negative.

For a given class, TP counts the samples correctly assigned to that class, TN counts the samples correctly assigned to other classes, FP counts the samples wrongly assigned to that class, and FN counts the samples of that class wrongly assigned to other classes.
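Eqs. (4)–(6) can be checked numerically with a short sketch. The counts below are hypothetical and do not come from the study; they only show how the three metrics follow from TP, FP, and FN.

```python
def precision(tp, fp):
    """Eq. (4): TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Eq. (5): TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Eq. (6): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts for a single class (not results from the study).
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)  # 90 / 100
r = recall(tp, fn)     # 90 / 120
f = f_measure(p, r)
```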

Results and Discussion

3.1. Sample test

Tests carried out on urine samples containing glucose with Benedict's test resulted in five specimen colors, classified as Normal, Positive 1, Positive 2, Positive 3, and Positive 4. Classes were characterized by grade, color yield, and glucose levels. The results of the characterization can be seen in Table 1.

Table 1. Glucose characterization

No.	Urine sample	Glucose level (%)	Glucose rate (g)	Color result	Urine positive rate (class)
1.	Sample 1	0–0.5	0	Slightly greenish blue and a bit cloudy	Normal
2.	Sample 2	0.5–1	0.2	Yellowish green	Positive 1
3.	Sample 3	1–1.5	0.3	Greenish yellow	Positive 2
4.	Sample 4	2–3.5	0.4	Slightly brownish orange	Positive 3
5.	Sample 5	>3.5	1	Slightly brownish brick red	Positive 4

Based on this classification, it can be observed in Table 2 that the urine specimens in each class have different characteristics. This is caused by the amount of glucose contained in the urine. The glucose concentration is calculated using Eq. (1).

Table 2. Results of the data from the sensor

Sensor color indicator
Violet	Blue	Green	Yellow	Orange	Red	Class category
21.34	23.36	18.68	19.64	14.02	6.66	Normal
14.93	13.53	24.27	21.48	19.23	6.53	Positive 1
12.19	10.07	22.63	24.14	21.87	9.05	Positive 2
11.12	8.21	16.97	21.91	26.02	15.74	Positive 3
13.68	9.93	22.58	20.77	22.58	10.43	Positive 4

3.2. Data collection

Data collection is carried out using a prototype designed to make it easier for users to acquire specimen data regardless of time and place. Its components are chosen so that the prototype can carry out these functions properly. The prototype in this study is shown in Figure 6.

Prototype: (a) front view, (b) inside view, and (c) upper view.

Figure 6(a) shows the output results for each input displayed on the LCD. Figure 6(b) shows the layout of the components used in the data collection process. Figure 6(c) shows the main component, the AS7262 sensor.

3.3. Data testing

Data testing is carried out to test the accuracy of the sensor by identifying the color intensity of the specimen based on six indicators. Testing this prototype resulted in 1,200 data points for each specimen, for a total of 6,000 data points across the five specimens. The data results are shown in Table 2.

Table 2 shows some of the data results obtained from the AS7262 sensor. The resulting data are quite stable because the sensor produces color intensity data that matches the color of the specimen. The data are displayed in graphical form in Figure 7.

Graphs of the data results: (a) Normal, (b) Positive 1, (c) Positive 2, (d) Positive 3, and (e) Positive 4.

3.4. Classification of glucose levels using KNN

Classification is the process of analyzing a set of data and assigning its items to different classes [22,23]. The data are classified into five classes based on glucose level, with a K value of 45; of the 6,000-item dataset, 70% was used as training data and 30% as testing data. Prediction on the data led to an accuracy of 96.33%, calculated using Eq. (3). The classification uses Eq. (2) to compute the distance to each neighbor. The purpose of this process is to obtain a classification into five classes based on six color indicators. The accuracy value is shown in the form of a graph in Figure 8.
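The 70/30 split described above can be illustrated with a small standard-library sketch. The helper `split_dataset` and the seed are assumptions made for the example (the article does not say how the split was drawn), and the integers are stand-ins for the 6,000 sensor readings. The last line rests on a further assumption, namely that the reported 96.33% refers to the test portion only.

```python
import random


def split_dataset(rows, test_fraction, seed):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(6000))  # stand-ins for the 6,000 sensor readings
train, test = split_dataset(rows, test_fraction=0.30, seed=0)

# Under the assumption above, 96.33 % of the test items would be
# roughly this many correct predictions.
implied_correct = round(0.9633 * len(test))
```

This gives 4,200 training items and 1,800 testing items; a stratified split, which keeps the five classes in equal proportion in both parts, would be a reasonable alternative.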

In Figure 8, a higher curve corresponds to more misclassified data. The curve rises and falls because the number of errors varies with the value of K. This instability has several causes, including the suitability of the data [24]. The classification produced by the KNN algorithm is evaluated on the dataset using precision, recall, and F-measure. The resulting value for each metric is shown in Figure 9.

KNN classification results. KNN, K-nearest neighbor.

Figure 9 shows the results of the predicted values generated from the test data, training data, and the value of K. The figure presents the precision, recall, and F-measure values, which are calculated using Eqs. (4)–(6), together with the support values for each class. To analyze the classification performance of the KNN algorithm, the confusion matrix shown in Figure 10 can be used.

The confusion matrix compares the classification results produced by the system with the actual classes [25].
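A confusion matrix for the five classes can be built from paired actual and predicted labels as sketched below. The labels in the example are made up for illustration and are not the study's results; the helper names are hypothetical. The second function shows how the TP, FP, and FN counts used in Eqs. (4) and (5) are read off the matrix for one class.

```python
from collections import Counter

CLASSES = ["Normal", "Positive 1", "Positive 2", "Positive 3", "Positive 4"]


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def per_class_counts(matrix, index):
    """Return (TP, FP, FN) for the class at position `index`."""
    tp = matrix[index][index]
    fp = sum(row[index] for row in matrix) - tp  # rest of the column
    fn = sum(matrix[index]) - tp                 # rest of the row
    return tp, fp, fn


# Made-up labels, for illustration only.
actual = ["Normal", "Normal", "Positive 1",
          "Positive 1", "Positive 2", "Positive 4"]
predicted = ["Normal", "Positive 1", "Positive 1",
             "Positive 1", "Positive 2", "Positive 2"]

matrix = confusion_matrix(actual, predicted, CLASSES)
tp, fp, fn = per_class_counts(matrix, CLASSES.index("Positive 1"))
```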

Conclusions

The KNN algorithm is effective in classifying the glucose class. In this study, each glucose level was determined by color matching based on the intensities of violet, blue, green, yellow, orange, and red. Comparing the classification produced by the system with the actual classes showed a high accuracy of 96.33%.

This system is designed to be used by many people, not only diabetics, because the examination causes no physical harm and no appliance comes into direct contact with the body. In the future, we aim to create a noninvasive blood sugar–checking system with smaller and lighter hardware so that it is easier to carry anywhere.

eISSN: 1178-5608

Published Online: Aug 03, 2023

Received: Aug 15, 2022

DOI: https://doi.org/10.2478/ijssis-2023-0006

Keywords
Glucose, urine, AS7262, classification, K-nearest neighbor

© 2023 Anton Yudhana et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
