Introduction

Fine-grained image classification [1, 28] is an emerging real-world problem that has received great attention from research communities around the globe. In computer vision, fine-grained classification [1] is the problem of assigning images to classes whose instances differ only slightly in appearance, e.g., flower species, animal species, or product/place types. Fine-grained image classification [28] is therefore a challenging task, because the variations among highly confusable categories of object instances are subtle and hard to distinguish. Moreover, in some cases human intervention or specific domain knowledge is required to perform precise fine-grained image classification [1].

In recent years, fine-grained image classification [1] has also been applied to natural scene classification, which involves natural images of a wide and diverse nature. Classification of shops, of the variety of products within shops, and similar settings are areas where fine-grained image classification is already in use [4]. Applying fine-grained image classification [28] techniques to a soft drink dataset is an area that has received limited attention from researchers, even though it has exciting applications in restaurants and shops, where an automated order could be placed once a specific brand of soft drink is about to run out of stock.

In this work, we exploit text and visual cues, in the form of features fed to a convolutional neural network, to attain good performance in the fine-grained classification [1, 28] of soft drinks. To the best of our knowledge, this is a unique application of a classification technique in the domain of fine-grained image classification of soft drink datasets.

The second section of this paper discusses the proposed classification technique, the third section presents the results and discussion, and conclusions are drawn in the last section.

Figure 1. t-SNE plot of the word embeddings.

Literature Survey
Text detection and recognition

Text in images carries a high level of information, which makes it a rich cue both for computer vision systems and for humans. The information encoded in text can be highly beneficial for many computer vision applications. Text detection and recognition [7] face challenges such as the diversity of natural scenes, the complexity of backgrounds, and various interference factors. The first real end-to-end model for text detection and recognition in scenes was proposed in 2010 by Neumann et al. [8]. It achieved a significant increase in recognition rate, from 53% to 72%, on the Chars74K dataset (de Campos et al.), but its weakness is that it was only applicable to horizontal text. Later, in 2011, Coates et al. [9] applied large-scale algorithms to build a highly effective classification model for both detection and recognition. Their end-to-end system has high accuracy and performs well on complex natural images, but it requires a relatively large volume of training data.

In 2014, Jaderberg et al. [10] addressed the problem of text detection and recognition by generating text proposals with a CNN and provided an end-to-end system for reading text in natural scene images. That system was capable of both text spotting and image retrieval and performed excellently on complex natural images. Yao et al. [11] presented a unified framework for detecting and recognizing text in images that handles text of different orientations, and also provided a 'search dictionary' method to correct recognition errors. Their system achieves highly competitive performance, especially on multi-oriented text. The works of Shivakumara et al. [12] and C. Yao et al. [13] likewise highlight the significance of multi-oriented text detection and recognition to the research community. Yingying Zhu et al. [14] discuss in detail the recent advances and future trends in scene text detection and recognition.

Fine-Grained Classification:

Fine-grained classification [28] aims for a deep insight into the image, which is why this problem has received a lot of attention from researchers around the globe. Many approaches have been developed to address it, with room for improvement remaining.

Existing deep learning-based fine-grained image classification approaches [15] can be sub-classified as follows, according to their use of additional information or human intervention:

1) Approaches that directly use the general deep CNNs for image fine-grained classification [16].

2) Part detection and alignment-based approaches [17].

3) Ensemble of network-based approaches [18].

4) Approaches based on attention mechanisms [19].

Prior work in fine-grained classification [28] can broadly be divided into two tracks. The first is to detect the discriminative object parts in the image to compensate for nuisance variations such as pose; many parts-based methods with geometric constraints have been proposed for the classification of birds [16] and dogs [20].

The second track is to derive discriminative and robust features. Classic hand-crafted feature descriptors include the Scale-Invariant Feature Transform (SIFT) [21], Histogram of Oriented Gradients (HoG) [22], and color histograms [23], while other methods such as Part-based One-vs-One Features (POOFs) [24] focus on modeling the activation of corresponding parts. Deep convolutional neural network (DCNN) approaches developed for general object classification achieve state-of-the-art performance on fine-grained classification when transfer learning is applied [25].

Attention Mechanism

The idea of attention, one of the most influential ideas in deep learning, allows a network to focus on specific aspects of a complex input. The main idea of the attention mechanism is to allow the decoder to "look back" at the original input and extract only the information that is important for decoding [27].

Consider machine translation of the sentence "The cat is beautiful." If you ask someone to pick out the keywords of the sentence, i.e., the words that carry the most meaning, they would likely say "cat" and "beautiful." Articles like "the" and "is" are not as relevant for translation as those words (though they are not completely insignificant). We therefore focus our attention on the important words. We apply the attention mechanism to text to obtain the most relevant words from the textual features, and we apply attention to the visual features to obtain the most relevant visual properties such as edges, color, and object size.

The attention mechanism [27] scores each input word (via a dot product with the attention weights); the scores are then passed through a softmax function to create a distribution. An attention vector is produced by multiplying this distribution with the context vector and is then passed to the decoder. The advantage of attention is its ability to identify the information in an input most pertinent to accomplishing a task, increasing performance especially in natural language processing, although, unlike human attention, it increases the amount of computation [30].
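As a toy illustration of this score-softmax-weight scheme (not the exact formulation used in our model; the array sizes, variable names, and random inputs are purely hypothetical), the following sketch scores each word embedding against a query vector, normalizes the scores with a softmax, and forms the weighted attention vector:

```python
import numpy as np

def soft_attention(word_vectors, query):
    """Score each word embedding against a query, softmax the scores,
    and return the weighted sum (attention vector) plus the weights."""
    scores = word_vectors @ query                # dot-product score per word
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    attended = weights @ word_vectors            # weighted combination of the words
    return attended, weights

# toy example: 4 words embedded in 8 dimensions
rng = np.random.default_rng(0)
words = rng.normal(size=(4, 8))
query = rng.normal(size=8)
vec, w = soft_attention(words, query)
print(w)          # distribution over the 4 words
print(vec.shape)  # (8,)
```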

Multimodal Fusion:

Multimodal processing [35] significantly enhances the understanding, modeling, and performance of human-computer interaction. In multimodal fusion [31], the user interacts with the system through various input modalities such as speech, gesture, and eye gaze. In our context, multimedia researchers have presented different fusion strategies for combining multiple modalities, typically within an encoder-decoder architecture, to accomplish various tasks [28, 29, 33].

The literature on multimodal fusion [31] can be organized into several classifications based on the fusion methodology. The methods can be described in terms of their advantages, weaknesses, basic concepts, and usage in various analysis tasks, but multimodal fusion also faces several issues that influence the process, such as contextual information, confidence levels, and synchronization between modalities. In 2016, [32] used multilayer and multimodal fusion of deep neural networks for video classification. In 2017, [33] used weakly paired multimodal fusion for object recognition. [34] used multimodal deep networks for image-based document and text classification, introducing an end-to-end learnable multimodal deep network that jointly learns text and image features and performs the final classification on a fused heterogeneous representation of the document; they validated their approach on the Tobacco3482 and RVL-CDIP datasets. In 2020, [28] performed fine-grained classification by combining visual and locally pooled textual features.

Proposed methodology

Our approach classifies soft drink images into their respective classes with the help of text and visual features. We extract textual and visual features from the input image using different models [37, 40] and feed them to our multimodal model [31], which combines both inputs to predict the class of the given image. Textual cues play a key role in fine-grained classification [28], especially for business places such as bakeries, cafés, and bookstores, and for daily-use products. Multiple models have been developed to extract such textual information, which is highly useful for image classification; we adopt word2vec [40].

Visual features are the second input to our model. A visual feature is information about the content of an image that describes specific structures such as shapes, edges, objects, patterns, and colors. In our case, we extract the visual features (the second input) with a VGG model [37] pre-trained on ImageNet [36] and fine-tuned on our dataset. These visual features work as building blocks alongside the textual input to produce the desired results.

The 224x224 input images are passed to the VGG model [37] for visual feature extraction. We import the VGG [37] model from TensorFlow Keras with its pre-trained weights, set the top layer to false when loading the model, and unfreeze the last five layers to fine-tune it. After defining the model, the PIL image object is converted into a NumPy array of pixel data forming a single-sample batch, and the values are appropriately scaled before feature extraction. We obtain the feature vector 'yf' from the last max-pooling layer as a 4096-dimensional feature vector; the feature-extraction part spans the input layer to the last max-pooling layer.
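A minimal sketch of this feature-extraction step, assuming the standard tf.keras VGG16 API (the image path is hypothetical, and the exact layer from which the 4096-dimensional vector is taken may differ from this simplified version):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# Load VGG16 pre-trained on ImageNet without the classification head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze everything except the last five layers, which are fine-tuned on our dataset.
for layer in base.layers[:-5]:
    layer.trainable = False

def extract_visual_features(img_path):
    """Convert a PIL image into a scaled single-sample NumPy batch and run it through VGG."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)            # PIL object -> pixel array
    x = np.expand_dims(x, axis=0)          # one-sample batch
    x = preprocess_input(x)                # scale values as VGG expects
    return base.predict(x)                 # features from the last convolutional block

y_f = extract_visual_features("bottle.jpg")  # hypothetical input image
```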

At the same time, the input is sent to OCR for character recognition. The OCR localizes and recognizes the text and saves the recognized text in a file, which is then passed to the word2vec [40] model to obtain the textual representation 'xf'. Word2vec [40] is a two-layer neural network trained to reconstruct the linguistic contexts of words. It takes a large corpus as input and produces a vector space of a given dimensionality, with each unique word in the corpus assigned a corresponding vector in that space. The purpose of word2vec [40] is to group the vectors of similar words together in the vector space; similarity between vectors is measured with the cosine similarity function. Let 'a' be the first vector and 'b' be the second vector:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}, \qquad x_f = \cos(a, b)$$
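As a small sketch of this similarity computation (the vectors below are illustrative, not real word embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| |b|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])
print(cosine_similarity(a, b))   # close to 1.0 for similar word vectors
```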

After that, we apply an attention mechanism [30] to both inputs of the multimodal model [31], i.e., the visual features and the textual features. Some of the recognized text may be more relevant than the rest when discriminating between similar classes, so we need to capture the inner correlation between the textual and visual features. The attention mechanism learns a tensor of weights used between the visual and textual features. Let 'xf' be the extracted textual features, 'yf' the visual features, and 'W' the weights. The attention is computed by:

$$w_a = \operatorname{softmax}\!\big(\tanh\big(y_f^{\top} W x_f\big)\big), \qquad x_{fa} = w_a \cdot x_f$$
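A minimal NumPy sketch of this weighting step, under one plausible reading of the formula in which the product is taken element-wise over the textual dimension (the feature dimensions, random inputs, and the learned matrix W shown here are illustrative only):

```python
import numpy as np

def attend_text(y_f, x_f, W):
    """w_a = softmax(tanh(y_f^T W x_f)) taken element-wise over the textual
    dimension, followed by x_fa = w_a * x_f."""
    scores = np.tanh((y_f @ W) * x_f)     # interaction score per textual dimension
    w_a = np.exp(scores - scores.max())
    w_a /= w_a.sum()                      # softmax normalization
    return w_a * x_f                      # attended textual features x_fa

d_vis, d_txt = 512, 100                   # illustrative dimensions
rng = np.random.default_rng(1)
y_f = rng.normal(size=d_vis)              # visual features
x_f = rng.normal(size=d_txt)              # textual features
W = rng.normal(size=(d_vis, d_txt))       # in practice learned jointly with the network
x_fa = attend_text(y_f, x_f, W)
```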

Figure 2. Proposed model pipeline: text and visual features are attended and combined for fine-grained classification.

The resulting normalized attention vector 'wa' is multiplied with the textual features 'xf' to obtain the final attended textual features 'xfa'. The attended textual features 'xfa' and the visual features 'yfa' are then concatenated in the multimodal model to form the final feature vector:

$$Z = [\, x_{fa} \,;\, y_{fa} \,]$$

Finally, the resulting vector serves as input to a final classification layer that outputs the probability of each class based on a low-rank bilinear pooling operation.
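A simplified Keras sketch of this fusion-and-classification head, assuming plain concatenation followed by a dense softmax layer (the low-rank bilinear pooling used in our model is not reproduced here, and the feature dimensions are illustrative):

```python
import tensorflow as tf

num_classes = 10                      # soft drink classes in our dataset
d_txt, d_vis = 100, 4096              # illustrative feature dimensions

text_in = tf.keras.Input(shape=(d_txt,), name="attended_text_features")
vis_in = tf.keras.Input(shape=(d_vis,), name="visual_features")

z = tf.keras.layers.Concatenate()([text_in, vis_in])        # Z = [x_fa ; y_fa]
z = tf.keras.layers.Dense(256, activation="relu")(z)        # joint representation
out = tf.keras.layers.Dense(num_classes, activation="softmax")(z)

model = tf.keras.Model(inputs=[text_in, vis_in], outputs=out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```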

Experiments and results
Dataset:

We collected a new dataset of soft drink bottles from 10 classes available in Pakistan, with 375 original images. The dataset contains several occluded, rotated, low-quality, and blurred text instances, which increases the difficulty of successful text recognition. Due to limited resources, the dataset is not fully organized and has many limitations. The images are divided into two sets: 200 training images and 175 test images.

Implementation Details:

We start by augmenting the training images, flipping each image vertically and horizontally to create two additional images per original and reduce overfitting of the feature-extraction model, bringing the total number of training images to 600. We load each image and convert it to an array using Keras preprocessing, expand its dimensions, and apply a Keras preprocessing function to fit the image to our model. The predicted output is then sent on for attention.
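A minimal sketch of this flip-based augmentation and Keras preprocessing (the file path is hypothetical, and only the augmentation step itself is shown):

```python
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input

def load_and_augment(img_path):
    """Return the original image plus its horizontal and vertical flips,
    each preprocessed into a VGG-ready batch of shape (1, 224, 224, 3)."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    variants = [x, np.fliplr(x), np.flipud(x)]        # original + two flipped copies
    return [preprocess_input(np.expand_dims(v, axis=0)) for v in variants]

augmented = load_and_augment("training/pepsi_001.jpg")  # hypothetical path
```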

We extract text from all of the images using a combination of Tesseract [39] and EasyOCR [38]: EasyOCR is used to localize the text and Tesseract to recognize it. The extracted text is saved in two files, one for the training set and one for the test set, and these sets are then loaded for word embedding [40]. To capture contextual meaning, we use a two-layer shallow neural network for the word embeddings, importing word2vec [40] from the gensim library.
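A sketch of this two-stage text pipeline, assuming the standard easyocr, pytesseract, and gensim APIs (the crop handling, file paths, and embedding hyper-parameters are simplified illustrations, not our exact configuration):

```python
import easyocr
import pytesseract
from PIL import Image
from gensim.models import Word2Vec

reader = easyocr.Reader(["en"])                      # EasyOCR localization model

def extract_text(img_path):
    """Localize text regions with EasyOCR, then recognize each crop with Tesseract."""
    img = Image.open(img_path)
    words = []
    for (box, _, _) in reader.readtext(img_path):
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        crop = img.crop((int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))))
        words += pytesseract.image_to_string(crop).split()
    return words

# Build word embeddings from the text extracted over the training set (paths hypothetical).
sentences = [extract_text(p) for p in ["train/img1.jpg", "train/img2.jpg"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
```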

We apply attention to both the textual and visual features for the fine-grained classification [28] of our dataset. The network is trained for 5 epochs with the Adam optimizer. The batch size employed in all our experiments is pre-defined in a 'config' library, with a learning rate of 0.0001 and a momentum of 0.9.

The experiments were implemented with the TensorFlow deep learning framework on an ordinary laptop with 8 GB of RAM and a 2.7 GHz Intel i7 processor.

Results:

Sample results for some classes are shown below.

Poor results are mainly due to heavy noise or low image quality.

Comparison Graphs:

Figure 3. Visualization of epoch vs. loss for different window sizes and dimensions.

Figure 4. Parameter visualization for different window sizes and dimensions.

Conclusion

In this paper we demonstrated the importance of textual and visual cues for fine-grained classification. We developed a framework for the precise classification of soft drink bottles by combining pre-trained models. The results show that textual features play a more important role than visual features in the classification of real-life products.

Furthermore, since the system was created with limited resources, it can be modified and enhanced for better performance on large datasets and real-world classification applications.
