Deep learning for daily care: medicine recognition and reminder systems for the visually impaired
Published Online: Jun 10, 2025
Received: Mar 18, 2025
DOI: https://doi.org/10.2478/ijssis-2025-0025
© 2025 Uttam Waghmode et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The visually impaired community faces numerous challenges in daily life, one of which is the identification and management of medications. Solutions to these everyday problems lie in embedding advanced technology, artificial intelligence (AI), and deep learning to ease their lives. The proof of concept demonstrated in this study shows that AI, and more specifically deep learning models, can be used to build an intelligent system that assists blind persons in identifying and managing their medications more effectively.
Innovative applications developed in the fields of AI, machine learning, and deep learning demonstrate the capabilities of computer vision, natural language processing (NLP), robotics, and related areas. A future intelligent system with features such as voice recognition, sentiment recognition, and NLP-based scene description in text could handle the complex environments that visually impaired people encounter. The system whose proof of concept is presented in this research work uses a deep neural network inception model to identify medications based on their visual appearance.
By preprocessing the images of the medicines, they are converted into the format expected by the model. This improves the reliability of the system in accurately predicting the medicines given as input images from the camera during testing.
The system also incorporates a user-friendly interface, allowing visually impaired individuals to interact with it easily. The concept is demonstrated through this user interface, making it easy for a sighted person to visualize exactly where the proposed system creates impact. As an add-on, the system can provide additional information about the medicines, such as their usage and the purpose for which they are generally recommended to patients. The laptop speakers produce the audio output, enabling visually impaired individuals to hear about and manage their medications independently.
Features such as medicine recommendations, medicine timing reminders, a medicine information system, and an easy UI for caretakers make the proposed system impactful for visually impaired people.
Overall, a system is needed to tackle the issues faced by visually impaired people, such as being reminded to take medicines, identifying the exact medicine among the different medicines in a box, and understanding the uses of those medicines. Advanced algorithms embedded in new applications are helpful and effective in easing the lives of visually impaired persons. A detailed understanding of the images is built up by the neural network layers through the use of filters. The various filters extract features and provide the input to the fully connected neural network. Functions such as softmax are also used to create the vectors for further processing. The dense layer is where the features extracted by the convolution layers are combined into a single vector; this one-dimensional vector has a very large number of parameters, which are discussed in this study.
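For reference, the softmax function mentioned above maps the raw scores of the final dense layer to a probability vector over the medicine classes (a standard definition, stated here for completeness):

```latex
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \ldots, K
```

Here, z is the vector of raw outputs of the dense layer and K is the number of classes (four in this work).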
The literature survey demonstrates a wide range of methodologies and techniques intended to support blind individuals, with an emphasis on strengthening their independence and quality of life.
One approach integrates deep learning and image processing to detect items and offer speech responses for safe navigation [1]. Another study emphasizes the identification of money and clothing for visually impaired individuals in daily activities [2]. These innovations highlight important advances in assistive tools for the visually impaired. Computer vision plays a vital role in understanding photographic data, with applications in aiding visually impaired individuals. For example, one study concentrated on using a convolutional neural network (CNN) in a mobile banknote recognition app for visually impaired users [3]. That study underlines the significance of dataset composition and augmentation in achieving high accuracy rates for real-time money identification. Another key capability is contact-free heart rate measurement, which is vital in several fields. One study proposes a multi-person approach using CNNs for heart rate estimation, improving runtime and accuracy for large groups [4]. This method demonstrates how deep learning can provide contactless health monitoring.
Building on major developments in assistive technology, a compact, computationally constrained portable aid was developed using a Raspberry Pi to guide blind individuals in detecting objects and recognizing information about them [5].
This development showcases the real-world applications of deep learning in aiding visually impaired persons in their everyday lives. There is also a strong need for solutions to the communication challenges faced by individuals with auditory as well as visual impairments. One study proposed an original method combining a CNN and a long short-term memory (LSTM) network on a Raspberry Pi to facilitate communication for the visually and hearing impaired [6].
Benchmark evaluations are essential for measuring progress in building deep learning systems that can reason about and respond to queries on visual input. Visual question answering (VQA) tackles the challenge of providing descriptions of web images to visually impaired individuals by responding to their questions [7]. This study explores VQA using a CNN and LSTM with the VQA-2.0 dataset, highlighting its strengths and weaknesses relative to other models.
For national security, border and surveillance systems are critical for detecting hazardous events. Deep learning provides an intelligent and autonomous approach to object detection using visual data [8]. This study explores a dataset for training and testing YOLO, attaining high accuracy in detecting ground pits, which can serve as hidden locations for illegal activities.
In the field of mobile robot navigation, precise pedestrian trajectory forecasting is vital for improving obstacle-avoidance performance [9]. This study presents a new algorithm for pedestrian trajectory prediction based on a panoramic camera, demonstrating its potential for mobile robot navigation and obstacle avoidance in environments where humans and robots coexist.
From the above studies, we can conclude that a variety of deep learning and machine learning solutions are available as part of assistive technology for visually impaired persons.
Table 1 shows the approaches and systems used in recent studies to support visually impaired persons. These methods employ deep learning, computer vision, and innovative technologies to improve navigation, object recognition, communication, and healthcare for the visually impaired.
Table 1. Literature review

Reference | Techniques used | Advantages | Limitations
[1] | YOLO and OpenCV | Provides real-time object detection and visual replacement for the blind. | YOLO may struggle with small or distant objects, and OpenCV's accuracy can vary based on lighting conditions and object complexity.
[2] | TensorFlow API, CNN, SSD, and MobileNet V2 | Achieves high accuracy without needing an Internet connection. | May require significant computational resources, especially for training the model.
[3] | CNN | Utilizes data augmentation to achieve a 94% accuracy rate for banknote recognition. | Edge-detected images negatively affect accuracy, indicating a need for larger datasets and varied lighting conditions.
[4] | CNN | Improves runtime and accuracy for heart rate estimation in large groups. | Specific details about the adapted algorithm and its implementation are needed for a thorough evaluation.
[5] | YOLO and SSD | Uses Raspberry Pi devices for a compact travel aid, demonstrating real-world implementation. | The system may face limitations in detecting objects in complex environments or under varying lighting conditions.
[6] | CNN and LSTM | Uses Braille and sound bite hearing devices for communication, achieving high accuracy. | The system's effectiveness may depend on the user's familiarity and comfort with Braille.
[9] | FSP algorithm | Utilizes a panoramic camera for pedestrian trajectory prediction, improving real-time performance. | The algorithm's accuracy and performance in dynamic or crowded environments need further evaluation.
[10] | CNN | Achieves 98% accuracy in diagnosing glaucoma using retinal images. | The effectiveness of the technique in clinical settings and its generalizability to diverse populations need further validation.
[11] | Four-layered CNN | Detects and classifies objects with high accuracy and low response time. | The device's performance may vary based on the complexity of the environment and the types of objects present.
[12] | CNN and fuzzy logic | Provides auditory feedback for obstacle detection, enhancing interaction with surroundings. | The computational complexity of the algorithms may affect real-time performance.
[13] | CNN | Uses smart glasses to identify medicine, showing promise for real-world implementation. | The system's accuracy and reliability in identifying specific medications need further validation.

CNN, convolutional neural network; LSTM, long short-term memory.
Visually impaired persons face many issues in performing daily activities, so there is scope for building innovative solutions that make their lives more independent and improve their quality of life.
Tools such as deep learning, computer vision, and smart devices can help resolve many of these complications. These technologies make it easier for people to identify objects, navigate their environment, communicate well, and access healthcare. They also help create inexpensive assistive devices for persons in need. The main challenges associated with these solutions are discussed below:
• Object detection and recognition: Developing efficient algorithms for real-time object detection for visually impaired persons, ensuring that the most common daily-life objects are recognized across different environments and scenarios.
• Navigation in unfamiliar environments: Developing a system that can guide blind persons in new places through audio cues and object identification mechanisms.
• Communication solutions: Creating solutions to communication difficulties, such as converting SMS and emails into Braille for blind people and using special audible-range devices for phone calls, making it easier for individuals to interact with their environment.
• Healthcare and diagnostic tools: Using deep learning models to accurately diagnose eye disorders, such as glaucoma, from retinal images, which is helpful for early detection and treatment outcomes.
• Cost-effective assistive devices: Developing affordable solutions, such as smart glasses for recognizing medicines, ensuring accessibility for visually impaired persons without depending on costly technologies.
• Enhancing navigation aids: Refining existing navigation supports for the blind, such as guide dogs, by incorporating deep learning algorithms for object recognition and classification, enhancing safety and independence.
In this research work, a dataset is curated specifically for medications such as Dolo, Strepsils, Volini gel, and Jovees shampoo. The dataset is vital for training the proposed model, which uses state-of-the-art CNN models for image classification. To generate the training dataset, videos capturing the medicines from multiple viewpoints (front, side, and back) were recorded. These videos were then processed to extract images at intervals of milliseconds, resulting in about 4000+ images for each medicine. This dataset augmentation strategy improves the model's ability to identify and categorize these medicines precisely.
This study presents a system for medicine recognition employing deep learning. Initially, the system augments the medicine image dataset using a probabilistic method for multilevel classification. It then extracts features from the augmented images through convolutional layers and a rectified linear unit (ReLU) activation function. Finally, the system classifies the medicine using fully connected layers and a softmax function. In addition to this baseline CNN, the inception model, which has a greater number of layers, is used.
The proposed system employs the inception CNN architecture for medicine recognition. A custom dataset of approximately 4000 images per medicine was created by recording videos from multiple angles (front, side, and top views) of Dolo, Strepsils, Volini gel, and Jovees shampoo. The videos were processed using OpenCV's cv2.VideoCapture() method to extract frames at 200-millisecond intervals, generating consistent and diverse training images for each medicine. These extracted frames were then resized to 150 × 150 pixels and normalized to ensure uniformity before being fed into the model. The system was trained using TensorFlow on an NVIDIA GTX 1080 Ti GPU for 50 epochs with a batch size of 32, utilizing the RMSprop optimizer and categorical cross-entropy as the loss function.
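As an illustration of this frame-extraction step, the following is a minimal sketch using OpenCV; the file paths, function name, and directory layout are hypothetical, while the 200 ms interval and 150 × 150 target size follow the description above:

```python
import os
import cv2

def extract_frames(video_path, out_dir, interval_ms=200, size=(150, 150)):
    """Extract a frame every `interval_ms` milliseconds from a medicine video,
    resize it to `size`, and save it as a training image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count, t = 0, 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t)   # seek to timestamp t (in ms)
        ok, frame = cap.read()
        if not ok:                          # end of video reached
            break
        frame = cv2.resize(frame, size)     # uniform 150 x 150 input
        cv2.imwrite(os.path.join(out_dir, f"frame_{count:05d}.jpg"), frame)
        count += 1
        t += interval_ms
    cap.release()
    return count

# Hypothetical usage: one video per medicine and viewpoint
# extract_frames("videos/dolo_front.mp4", "dataset/train/dolo")
```

Repeating this over the videos of each medicine from every viewpoint yields the roughly 4000 images per class described above.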
A schematic representation of the system workflow is depicted in Figure 1. A CNN comprises input, hidden, and output layers, with millions of parameters for learning complex patterns. The network subsamples the input through convolution and pooling, followed by activation functions in the hidden layers. These layers are partially connected, leading to a fully connected layer at the end that produces the final classification output.

Figure 1. System block diagram.
Figure 1 shows the blocks used to build the system. Data augmentation is performed on the video shots taken for each of the medicines, creating a large number of dataset images for training the model. Hyperparameter tuning is also needed, covering the number of layers, kernel size, activation function, loss function, optimizer, learning rate, etc. Preprocessing of the images is performed after creating the dataset of medicine images.
In this preprocessing step, the dimensions of the input image must be decided, and any noise present must be removed. The third and fourth steps involve model training and classification, where feature extraction, parameter processing, batch processing, pooling, and multilevel classification are performed.
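A minimal sketch of how this augmentation and preprocessing could be wired up in Keras follows; the directory layout and the specific augmentation parameters are assumptions, apart from the 150 × 150 input size and batch size of 32 stated earlier:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] and apply simple augmentations
# (the augmentation settings here are illustrative, not the paper's exact values).
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

train_generator = train_datagen.flow_from_directory(
    "dataset/train",           # hypothetical path: one subfolder per medicine
    target_size=(150, 150),    # input size used in this study
    batch_size=32,
    class_mode="categorical",  # four-class (multilevel) classification
)
```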
Figure 2 provides a detailed flowchart depicting the sequential processes involved in the CNN-based recognition system.

Figure 2. Flowchart of the system (CNN layers). CNN, convolutional neural network.

Figure 3. Dataset of Dolo medicine.

Figure 4. Dataset of Volini medicine.

Figure 5. Dataset of Strepsils medicine.

Figure 6. Epoch results for the inception model.
Figure 7 illustrates the architecture of the CNN model implemented for medicine classification. For the convolution and pooling layers, the output-size formula used in this study is as follows:

Output size = [(W − K + 2P)/S] + 1

where W is the input dimension, K is the kernel size, P is the padding, and S is the stride.

Figure 7. CNN architecture. CNN, convolutional neural network.
Output of the first convolution layer: {[(150 − 3 + 2 × 0)/1] + 1} = 148
Output of the max pooling layer: {[(148 − 2 + 2 × 0)/2] + 1} = 74
In this study, the spatial dimension decreases after each convolution layer and max pooling layer, as shown in the sketch below.
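To make the repeated shrinkage explicit, a short helper (the names are illustrative) applies the output-size formula across three convolution/pooling stages:

```python
def conv_out(w, k, p=0, s=1):
    """Output size of a convolution or pooling layer: [(W - K + 2P) / S] + 1."""
    return (w - k + 2 * p) // s + 1

size = 150
for stage in range(1, 4):
    size = conv_out(size, k=3, p=0, s=1)  # 3 x 3 convolution, stride 1
    size = conv_out(size, k=2, p=0, s=2)  # 2 x 2 max pooling, stride 2
    print(f"after stage {stage}: {size} x {size}")
# after stage 1: 74 x 74, after stage 2: 36 x 36, after stage 3: 17 x 17
```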
The output of the flatten layer is a one-dimensional vector of the features extracted through the various convolution and max pooling layers.
The flatten layer output is 17 × 17 × 64 = 18,496 values, the first dense layer output is 512, and the second dense layer output is 4.
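The exact layer configuration is not fully spelled out in the text, but a baseline CNN consistent with the dimensions reported above can be sketched as follows; the first two filter counts (32 and 64) are assumptions, while the final 17 × 17 × 64 feature map, the 512-unit dense layer, and the 4-way softmax follow the numbers given:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(150, 150, 3)),     # -> 148 x 148 x 32
    layers.MaxPooling2D((2, 2)),                  # -> 74 x 74 x 32
    layers.Conv2D(64, (3, 3), activation="relu"), # -> 72 x 72 x 64
    layers.MaxPooling2D((2, 2)),                  # -> 36 x 36 x 64
    layers.Conv2D(64, (3, 3), activation="relu"), # -> 34 x 34 x 64
    layers.MaxPooling2D((2, 2)),                  # -> 17 x 17 x 64
    layers.Flatten(),                             # -> 18,496 features
    layers.Dense(512, activation="relu"),
    layers.Dense(4, activation="softmax"),        # four medicine classes
])
```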
At the output, we obtain a four-level classification of the results with a probabilistic approach using the softmax function. The inception model used in this work comprises 50 layers in total.
Training this model takes a long time, but its validation accuracy is very good.
In this study, the input image size is 150 × 150, the kernel size is 3 × 3, and the activation function used is ReLU. The loss function is categorical cross-entropy, the optimizer is RMSprop, and the learning rate is set to 0.01. Training with these hyperparameters produces the summary shown in Figure 8. Sample recognition results for different medicines processed by the system are illustrated in Figures 9–12.
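Putting the stated hyperparameters together, training could be configured roughly as follows (train_generator refers to the preprocessing sketch earlier; the values mirror those reported above):

```python
from tensorflow.keras.optimizers import RMSprop

model.compile(
    optimizer=RMSprop(learning_rate=0.01),  # learning rate stated above
    loss="categorical_crossentropy",        # cross-entropy loss
    metrics=["accuracy"],
)

history = model.fit(
    train_generator,  # augmented 150 x 150 medicine images, batch size 32
    epochs=50,        # training duration reported in this study
)
```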

Figure 8. Summary of parameters of the trained CNN model. CNN, convolutional neural network.

Figure 9. Output results for Strepsils.

Figure 10. Output results for Volini gel.

Figure 11. Output results for Dolo.

Figure 12. Output results for Jovees shampoo.
Table 2 summarizes the success rate of the proposed system across the tested medicines, achieving an overall success rate of 94%.
Table 2. Success rate of the proposed system

Sr. no. | Medicine | Correct predictions (out of 50) | Incorrect predictions | Success rate (%)
1 | Strepsils | 46 | 4 | 92
2 | Volini gel | 48 | 2 | 96
3 | Dolo | 47 | 3 | 94
Average success rate (%) | 94
Table 3 presents a comparative analysis between the proposed system and three recent similar works in the domain of assistive technologies using deep learning. The comparison focuses on key parameters, such as the method used, dataset size, achieved accuracy, number of medicines or categories handled, and distinct advantages offered by each system. The proposed system demonstrates superior accuracy and practical utility through audio-based feedback and an integrated medication reminder feature. Unlike prior works, which either focus on a limited number of items or on different domains (e.g., currency recognition), this system directly addresses the medication needs of visually impaired individuals.
Table 3. Comparative study of the success rate of the proposed system with existing work

Reference | Method | Dataset size | Accuracy (%) | Medicines/categories | Key advantage
[13] | CNN | ∼2000 images | 92 | 3 | Smart glasses integration
[6] | CNN + LSTM | ∼3000 images | 94 | 5 | Includes Braille output
[3] | CNN | Augmented banknote data | 94 | Not medicine-specific | Focused on money recognition
Proposed system | Inception CNN | 16,000 images | 96.98 | 4 | Audio output + medication reminder

CNN, convolutional neural network; LSTM, long short-term memory.
In this study, we proposed a computer vision and audio feedback-based, deep learning-assisted guide system for visually impaired people to identify and manage medicines. Using the inception model [12] trained on 16,141 image samples across four categories of medicine (Dolo, Strepsils, Volini, and Jovees shampoo), 98% training accuracy and 96.98% validation accuracy were achieved. The locally hosted user interface enables users to activate the system with a key press, taking pictures with the camera in real time and recognizing them. Feedback on the identified medicine is given via audio, stating the name and use of the detected medicine; moreover, the system enables caregivers to help remind patients by setting each medicine's name and schedule.
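As an illustration of the capture-recognize-speak loop described above, here is a hedged sketch; the text-to-speech library (pyttsx3), the label order, the usage texts, and the model file name are assumptions, while the key-press trigger, camera capture, and spoken name-and-usage feedback follow the description:

```python
import cv2
import numpy as np
import pyttsx3
from tensorflow.keras.models import load_model

LABELS = ["Dolo", "Jovees shampoo", "Strepsils", "Volini gel"]  # assumed class order
USAGE = {"Dolo": "commonly used for fever and mild pain"}       # illustrative entry

model = load_model("medicine_model.h5")  # hypothetical saved model file
tts = pyttsx3.init()                     # drives the laptop speakers
cam = cv2.VideoCapture(0)

while True:
    ok, frame = cam.read()
    if not ok:
        break
    cv2.imshow("Medicine recognition", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord(" "):                              # key press triggers recognition
        img = cv2.resize(frame, (150, 150)) / 255.0  # match training preprocessing
        probs = model.predict(np.expand_dims(img, axis=0))[0]
        name = LABELS[int(np.argmax(probs))]
        tts.say(f"{name}. {USAGE.get(name, '')}")    # speak name and usage
        tts.runAndWait()
    elif key == ord("q"):
        break

cam.release()
cv2.destroyAllWindows()
```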
Future work will extend the system by adding a larger variety of medicines to the dataset to improve generalizability. There is also potential for deeper integration with mobile apps, wearable devices, and real-time voice interaction for greater autonomy. Usability studies with visually impaired users will inform further iterations and refinements.