Open access

A novel approach to capture the similarity in summarized text using embedded model



Introduction

Near duplicate documents are similar in content but not bitwise identical (Xiao et al., 2008). To make searching faster, duplicated content on the World Wide Web (WWW) needs to be removed. Near duplicates among text documents also degrade performance when data from different sources are integrated. Several text extraction techniques, namely topic modeling, key phrase extraction and text summarization (Mishra et al., 2019), are available to fetch relevant information from unstructured text data. Different text extraction techniques can give different results even when applied to the same document. Text summarization generates a concise and coherent summary from a large piece of text, preserving the key content of the original text without modifying it. For text documents, the near duplicate detection task is particularly challenging: two documents that share a large proportion of words, but in a different order, are not considered identical. Synonyms are another important issue that needs to be addressed.

Traditional techniques such as Bag of Words (BOW), shingling and hashing (MinHash and SimHash) are good at identifying duplicate documents but are not efficient for detecting near duplicates. Commonly used approaches for duplicate detection are shown in Table 1.

Table 1. Conventional near duplicate detection techniques.

Category | Approach | Characteristics | Merits
Keyword based | BOW (Bag of Words) | Compares words and the frequency of words with respect to other documents | Used for large documents; uses Term Frequency-Inverse Document Frequency (TF-IDF) to create fingerprints. Reduces storage space
Fingerprint based | Shingling | Compares short phrases, adding context to the word | Fingerprints are created from tokenized documents by using overlapping substrings and consecutive words. Statistical concepts are used to find near duplicates
Fingerprint based | SimHash | Generates fixed length hashes for each document, which are stored for duplication detection | Obtains an ‘f’ bit fingerprint for each document. Used as dimension reduction
Hash based | MinHash | Phrases are hashed into numbers for comparison to identify duplication, and content hashes are stored | Stores a small amount of information for each document for effective comparison
Hash based | Locality Sensitive Hashing (LSH) | Probabilistic approach to detect similar documents. The hash function generates similar hashes for similar shingles | The search space contains only those documents which tend to be similar, which maximizes the probability of collision for similar content
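
To make the difference between the keyword based and the shingling approaches of Table 1 concrete, the following is a minimal Python sketch. It assumes word-level shingles of length 3 and the Jaccard coefficient as the comparison measure; the function names and the sample sentences are illustrative assumptions and are not taken from the cited works.

```python
def word_shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_c = "the lazy dog jumps over the quick brown fox"  # same words, new order

# Bag of words ignores order, so the two texts look identical.
bow_similarity = jaccard(set(doc_a.split()), set(doc_c.split()))

# Shingles keep local word order, so the reordering lowers the score.
shingle_similarity = jaccard(word_shingles(doc_a), word_shingles(doc_c))

print(bow_similarity, shingle_similarity)
```

In this sketch the bag of words comparison cannot distinguish the reordered text from the original, while the shingle comparison can, which is the "adding context to the word" property listed for shingling.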

Available research mostly treats near duplicate detection as the detection of an intermediate level of similarity, and similarity is mostly estimated with statistical techniques such as hashing, shingling and signature-based methods. With recent Artificial Intelligence tools such as Machine Learning, Deep Learning and Natural Language Processing, text embedding models can be used to generate vectors that capture more semantic similarity during similarity estimation. Text embedding models capture semantics that commonly used approaches such as shingling and hashing often miss. Because a summary is generated by extracting the relevant content, it can represent the whole document, and capturing similarity on the summary instead of the whole document can save both time and storage.

This research presents a text summarization-based near duplicate detection approach with efficient text representation using text embedding models. The section “Text embedding in text representation and text similarity” discusses the role of text embedding in text representation and commonly used text similarity techniques. The section “Related work” elaborates on related work on near duplicate detection, text embedding models and text summarization. The section “Proposed methodology” discusses the proposed approach. The section “Experimental results and discussion” presents the experimental results, followed by the conclusion and future scope in the last section.

Text embedding in text representation and text similarity

Text similarity estimation is an active research area that underlies various Natural Language Processing (NLP) tasks and is used in many research domains, including near duplicate detection, where it plays an important role in document matching (Wang and Dong, 2020). To label two entities as near duplicates in a quantitative manner, a similarity function can be used whose value lies in the interval [0, 1]. A higher similarity score indicates greater similarity. Any text similarity technique first maps the input documents into vectors of real-valued numbers. Next, a suitable similarity measure is applied to these vectors. The performance of a text similarity algorithm therefore depends on two aspects: efficient text representation and the choice of similarity measure. The objective of a text similarity algorithm is to determine the commonness between two input documents, since the similarity score increases with commonness. Traditional similarity measurement methods (statistical, corpus based and knowledge based) consider only text representation. In the traditional approach, the first way is to divide the text into overlapping groups of sequential words called shingles; similarity is then measured by the proportion of identical shingles found in the pair of text documents. In the second way, a vector of words is defined to represent each document, and similarity is computed by comparing the vectors. With the growth of modern Artificial Intelligence tools, integrating the semantic aspect can increase the efficiency of text representation techniques. Various text representation techniques are shown in Table 2.

Table 2. Text representation techniques.

Text representation method | Concept used | Characteristics | Merits | Demerits
Vector Space Model | Word count/BOW model | Uses the concept of linear algebra to compute similarity | Simple to compute based on the frequency of words | Ignores the importance of rare words
Document vectors | TF-IDF vectors | Also computes the count of documents in which a particular word is present, along with its significance | Does not give importance to the most frequent words in the document, which do not contribute much to similarity computation | Does not consider the semantic aspect
Embedding model | Word embedding | High dimensional representations of words | Handles words having similar meaning, i.e., synonyms. Does not require any feature engineering | Cannot be applied directly in the computation of text similarity
Topic modeling | Latent Dirichlet Allocation (LDA) | Documents are represented by inherent latent topics, where each topic is a probability distribution over words | Probabilistic model for defining the feature matrix of a document based on semantics | Requires prior knowledge of the number of topics and does not capture correlation
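
The two-step procedure described above (map each document to a real-valued vector, then apply a similarity measure) can be sketched for the TF-IDF row of Table 2 as follows. This is a minimal illustration using only the Python standard library; the smoothed IDF formula, the function names and the three sample documents are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dict: term -> weight) from token lists."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "near duplicate detection removes redundant web pages".split(),
    "near duplicate detection removes redundant documents".split(),
    "text summarization keeps key sentences".split(),
]
v1, v2, v3 = tfidf_vectors(docs)
print(cosine(v1, v2), cosine(v1, v3))
```

Because the weights are non-negative, the cosine score stays in the interval [0, 1] used throughout this section. The third document shares no term with the first, so its score is zero even if the texts were related in meaning, which is the "does not consider the semantic aspect" demerit listed in the table.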

Text embedding models represent words as numeric vectors based on their context and order in a document. These models are used for text representation and can be utilized to find the similarity between documents (Khattak et al., 2019). Text embedding models can detect similarity even when the text is mixed or modified. Each document is mapped to a low-dimensional, dense vector in a continuous vector space. While word embedding considers only the word, text embedding considers phrases and paragraphs. It can be used in several ways when computing text similarity (Tan and Phienthrakul, 2019). Related words lie closer together in the vector space. Various embedding models are listed in Table 3. Commonly used text similarity measurement techniques and the metrics used in this regard, whose values lie in the range [0, 1], are shown in Tables 4 and 5, respectively.

Table 3. Different embedding models for text representation (Khattak et al., 2019; Mishra et al., 2020).

Embedding model | Characteristics | Merits | Demerits | Variants
One hot encoding | Maps each word from the vocabulary to a unique index in vector space | Learn dense representation of words | Dependent on corpus knowledge |
Word2Vec | Maps each word to a point in vector space, e.g., Continuous Bag of Words (CBOW), Skip Gram | Used in neural networks for predicting focus words as prediction-based models | Dimension is between 50 and 500. Context window is between 5 and 10 | Doc2Vec, paragraph2vec, e.g., Distributed Memory Model of Paragraph Vectors (PV-DM), Paragraph Vector Continuous Bag of words (PV-CBOW)
GloVe | Term co-occurrence matrix based on vocabulary size is used | Minimized reconstruction error; captures larger dependency due to larger context window; count based model | Order of dependencies is not preserved; performance depends on data type | GloVe with skip gram window
FastText | Sub words are also considered | Extends the functionality of Word2Vec skip gram to handle out of vocabulary (OOV) words | Longer time to train | Probabilistic FastText
Embedding from Language Models (ELMo) | Captures context at both word and character level. The same word can be used for different contexts | Performs sentence level embedding by using bidirectional Recurrent Neural Networks (RNN); can be used in transfer learning | Unable to use left to right and right to left context at the same time |
Bidirectional Encoder Representations from Transformers (BERT) | Considers bidirectional representations in unsupervised mode | Can be pre-trained using one extra output layer | Random sentence is replaced by special tokens (‘Mask’) to consider both left to right and right to left information at the same time | Robustly Optimized BERT Pre Training Approach (RoBERTa), A Lite BERT (ALBERT), Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA), Generalized Autoregressive Pre Training for Language Understanding (XLNet), Distilled version of BERT (DistilBERT), BERT for Summarization (BERTSUM)
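
The statement that related words lie closer together in the vector space can be illustrated with a toy example. The sketch below assumes a hypothetical table of hand-picked 3-dimensional word vectors and averages them to embed a short text; a trained model such as Word2Vec or BERT would supply the vectors in practice, and none of the values shown come from such a model.

```python
import math

# Hypothetical word vectors, hand-picked for illustration only.
TOY_VECTORS = {
    "car": (0.90, 0.10, 0.00),
    "automobile": (0.85, 0.15, 0.00),
    "fast": (0.10, 0.90, 0.00),
    "quick": (0.15, 0.85, 0.00),
    "banana": (0.00, 0.10, 0.90),
    "ripe": (0.05, 0.00, 0.80),
}


def embed(text):
    """Embed a text as the mean of its known word vectors (mean pooling)."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


synonym_score = cosine(embed("fast car"), embed("quick automobile"))
unrelated_score = cosine(embed("fast car"), embed("ripe banana"))
print(synonym_score, unrelated_score)
```

The two phrases "fast car" and "quick automobile" share no word, so token based measures score them as unrelated, whereas the pooled vectors are almost parallel. Mean pooling is only one simple way of turning word vectors into a text vector.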

Table 4. Categorization of text similarity measurement techniques.

Text similarity measure | Category | Considers semantic? | Approach used | Characteristics
String based | Character based | No | Hamming distance, Levenshtein distance, Damerau-Levenshtein, Needleman-Wunsch, Longest Common Subsequence, Smith-Waterman, Jaro, Jaro-Winkler and N-gram | Used to find typographical mistakes, but less efficient for text analytics and computationally less effective for large text documents. Used in approximate string matching
String based | Token/term based | No | Jaccard similarity, Dice’s coefficient, Cosine similarity, Manhattan distance and Euclidean distance | Useful for recognizing term rearrangement
Statistics based | Corpus/knowledge base | Yes | TF-IDF, Latent Semantic Indexing (LSI), word2Vec, GloVe, Bidirectional Encoder Representations from Transformers (BERT), Latent Semantic Analysis (LSA), LDA | Uses only text representation and does not consider distance between texts

Table 5. Popular text similarity metrics (Pamulaparty et al., 2014, 2015; Gali et al., 2016; Yung-Shen et al., 2013).

Similarity measurement method | Highlights
Euclidean distance | Considers the distance between texts in vector form. Uses the frequency of tokens to generate feature vectors
Cosine | Considers the angle between two vectors. Fails to capture variations of the representation for unstructured/semi-structured text
Manhattan | Considers the distance between two real vectors
Hamming | Counts the positions in which two bit strings differ. The binary strings must be of the same length
Jaro distance | Computes the lengths of the two strings and then finds the common characters that occur at nearby positions. Matching characters that appear in a different order are counted as transpositions
Jaro Winkler | Extends the Jaro distance metric with a prefix scaling value (p = 0.1). This gives a higher weight to strings having a common prefix, whose length lies in the range [1, 4]
Cosine similarity with k shingles/k gram | Shingling a document means taking consecutive words and grouping them as a single object. In general, the set of all 1-shingles represents the ‘bag of words’ model
TF-IDF | Based on the concept of term frequency (TF), which is the count of occurrences of a token in a document. The inverse document frequency (IDF) is a way to find the relevance of unique or rare words. Cosine similarity with TF-IDF is used to find similarity scores
Normalized Levenshtein | Based on the minimum number of edit operations
Soft-TFIDF | TF-IDF and Jaro Winkler are combined to measure similarity. First, Jaro Winkler finds pairs of tokens common to both strings, and then TF-IDF is used to find similarity scores for pairs exceeding the threshold set for Jaro Winkler
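
Several of the metrics in Table 5 are short enough to state directly in code. The following sketch gives plain Python versions of the Euclidean, Manhattan, Hamming and Levenshtein measures; normalizing the edit distance by the length of the longer string is one common convention and is an assumption here, as are the function names.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two real vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences between two real vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein_similarity(a, b):
    """Map the edit distance to a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_levenshtein_similarity("kitten", "sitting"))
```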

Related work

Pamulaparty et al. (2014): The research involves initial pre-processing of documents, including stop word removal and stemming. The keywords generated are passed as input to the near duplicate detection algorithm. Near duplicate documents are determined using a similarity hash (SimHash) function with respect to various thresholds (<60%, 60–70%, 70–80%, >80%).
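
As a rough illustration of how a SimHash fingerprint can be compared against such threshold bands, consider the sketch below. It is a generic SimHash over unweighted tokens and not the implementation used in the cited work; the 64-bit size, the use of MD5 as the token hash, the bit-agreement similarity and the handling of band boundaries are all assumptions.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint from a list of tokens."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def bit_similarity(fp_a, fp_b, bits=64):
    """Fraction of bit positions on which two fingerprints agree."""
    return 1.0 - bin(fp_a ^ fp_b).count("1") / bits


def band(similarity):
    """Assign a similarity score to one of the four threshold bands."""
    if similarity < 0.60:
        return "<60%"
    if similarity < 0.70:
        return "60-70%"
    if similarity <= 0.80:
        return "70-80%"
    return ">80%"


keywords_a = "near duplicate detection web crawler keyword extraction".split()
keywords_b = "near duplicate detection web crawler keyword stemming".split()
score = bit_similarity(simhash(keywords_a), simhash(keywords_b))
print(score, band(score))
```

Token lists that share most of their entries tend to produce fingerprints that differ in few bits, so the comparison reduces to a Hamming distance between two fixed-length integers.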

Pamulaparty et al. (2015): Proposed a framework for near duplicate document detection using machine learning models. In phase 1, fuzzy C-means clustering is performed on the documents before near duplicate detection, which reduces the number of document comparisons. In phase 2, a discriminative function is used for classification, exploiting the inherent features of the documents computed as weighted terms. The function makes its decision by examining the similarity vector created from the features.

Yung-Shen et al. (2013): Proposed a method for detecting duplicate documents using three key components. The first is pre-processing of the input document for feature selection, in which highly weighted features are selected. The second uses similarity metrics to find the degree of similarity between the input and the set of all pairs of documents. The third learns a discriminant function using a Support Vector Machine (SVM) classifier.

Gali et al. (2016): Evaluated 21 measures to find similarity between two titles. Damerau-Levenshtein performed well by detecting changes in character/token and real data. Smith-Waterman performed well in case of character change while Bi-Jaccard worked well for both character/token and real data.

Hassanian-esfahania and Karga (2018): Due to the unordered nature of sets, the MinHash algorithm does not cover all near duplication properties. Even when the documents share many attributes, the position of each attribute also matters. A MinHash (min-wise) algorithm is proposed that enhances the data structures of traditional MinHash algorithms for a better representation of near duplicates. This approach gave an unbiased estimate of the Jaccard coefficient with less variance.
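
For reference, the traditional MinHash estimate of the Jaccard coefficient that this work builds on can be sketched as follows. This is the standard set-based scheme, not the enhanced variant proposed by the authors; the use of seeded SHA-1 hashes, the signature length of 128 and the function names are assumptions.

```python
import hashlib


def seeded_hash(seed, item):
    """Hash an item with a seed, standing in for one random hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value of a non-empty set under each hash."""
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


set_a = {f"shingle-{i}" for i in range(0, 100)}
set_b = {f"shingle-{i}" for i in range(50, 150)}

true_jaccard = len(set_a & set_b) / len(set_a | set_b)
estimated = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(true_jaccard, estimated)
```

Each document is reduced to a fixed number of integers, so two signatures can be compared without access to the original shingle sets. Because the signature is computed from an unordered set, the position of each attribute is lost, which is the limitation noted above.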

Feng and Wu (2015): In this paper, the authors improved the work of Wang and Chang (2009) by using a suffix tree for comparing two documents instead of fixed-size sliding windows. By using the suffix tree, all possible pairs of identical sentences were found. They also added a validation step that compares selected terms at specified patterns in all matched sentences. The algorithm “SL + ST” (sentence length + suffix tree) is compared with SpotSigs and 3-Shingles.

Rodier and Carter (2020): In this paper, the authors proposed an online system to detect near duplicate documents in a dataset of web-based news articles by adapting the shingling algorithm (Broder, 2000). They further used this system in a situational awareness tool to increase the efficiency of human analysts. The system works in two phases. In the first phase, it determines whether a new document is a near duplicate of a previously processed document. Each document is represented as a sketch consisting of a set of 8-byte numbers. For two similar documents, it generates sets of 8-byte numbers that overlap in proportion to their similarity. This method results in very high precision scores with increased recall and F1 scores.

Hajishirzi et al. (2010): In this paper, the authors proposed an algorithm for near duplicate document detection in which each document is represented as a sparse k-gram vector. The weights of the vector are learned to optimize a similarity function (cosine or Jaccard coefficient), and the vectors are then mapped to hash values using locality sensitive hashing. These hash values are used as document signatures and contribute to calculating similarity. News articles and email messages are used as target domains. This method was found to be more accurate than Shingles and I-Match.

Arun and Sumesh (2015): In this paper, four-phase sentence level features, a word mapping technique, a term document weighting scheme and a modified similarity technique are used, which gives improved precision and recall.

Yandrapally et al. (2020): A study of near duplicate algorithms based on state pairs is presented for web app model inference. Webpages were divided into three categories-clone, near duplicate and distinct. Threshold values were systematically computed and used by 10 near duplicate detection techniques for three different domains.

Pamulaparty et al. (2017): Proposed random forest methods, Streaming Random Forest (SRF) and Oblique Random Forest (ORF), which showed better accuracy than other algorithms when detecting near duplicates in the context of web crawling. Keyword extraction, URL indexing and similarity computation were the three phases used to distinguish between near duplicate and non-duplicate web pages.

Do and LongVan (2015): Proposed an algorithm for detection of near duplicates in articles by extracting key phrases based on ontology and matching signatures. Similarity is calculated between extracted key phrases. A set of characteristic key phrases present in the articles were used to find near duplicates. Proposed algorithm showed good precision and recall.

Al-Subaihin et al. (2019): Analysed different text representation techniques for describing the textual content of mobile applications. A Vector Space Model (VSM) using TF-IDF frequency weighting combined with Latent Semantic Indexing (LSI) was used and compared with other text feature extraction techniques such as topic modeling. Results showed that the cluster quality of the topic modelling approach was more favourable, as it captures more similarity.

Jain et al. (2017): Proposed a text summarization approach in which extractive text summary is generated by calculating similarity score between the abstractive summary and original sentences of text data using neural network approach for feature extraction.

El-Kassas et al. (2021): Explained different applications, approaches (Extractive, Abstractive and Hybrid), methods used in these approaches, building blocks—text summarization operations, text representation models and statistical and linguistic features. Also, it discusses various datasets, automatic evaluation tools.

Hendre et al. (2021): Highlights the relevance of semantic similarity when analysing text data by using a neural embedding model for text data representation. The sentence embedding models ELMo, GloVe and Google Sentence Encoder were combined with TF-IDF and Jaccard similarity for experimental purposes. ELMo and Google Sentence Encoder showed the best results by capturing maximum similarity.

Albalawi et al. (2020): Provides a detailed description of applications, methodology and tools for topic modelling, which is used for finding important topics in short text such as comments, reviews and short text messages. Five topic modelling methods, Latent Semantic Analysis (LSA), LDA, Non-negative Matrix Factorization (NMF), Principal Component Analysis (PCA) and Random Projection, were compared on two textual datasets on the basis of standard statistical evaluation metrics: precision, recall, F score and topic coherence. The LDA and NMF topic modelling methods produced valuable output by extracting more meaningful topics.

Alqahtani et al. (2021): Generating patterns from text efficiently involves several processes such as text mining, clustering, natural language processing and text similarity. String based tools are suitable for lexical similarity. LCS, Jaro, N-gram and Damerau-Levenshtein (character-based algorithms) and Cosine similarity, Euclidean distance, Jaccard similarity, Block Distance and Matching Coefficient (term-based algorithms) are popular techniques for measuring lexical similarity. LSA is a popular corpus-based technique which is not suitable for nonlinear text distribution. WordNet is a knowledge-based tool built on a semantic network.

Chandrasekaran and Mago (2021): Semantic textual similarity is one of the most challenging NLP tasks. Techniques for measuring semantic similarity can be knowledge based, corpus based, deep neural network based or hybrid. Knowledge-based methods include edge counting methods (LCS), feature based methods (WordNet), information content and word embedding based methods (GloVe, FastText, BERT, word2vec). Corpus-based methods include LSA, Hyperspace Analogue to Language (HAL), Explicit Semantic Analysis (ESA), word-alignment models, Latent Dirichlet Allocation (LDA), Normalised Google Distance (NGD), dependency-based models and kernel-based models. In addition to these methods, deep learning based models include Convolutional Neural Networks (CNN), Long Short Term Memory (LSTM), Bidirectional Long Short Term Memory (Bi-LSTM) and Recursive Tree LSTM, which can be used to measure semantic similarity.

Roul and Sahoo (2020): Semantic content based near duplicate detection is a relevant research aspect in information retrieval, as it avoids redundancy in the search results during query processing, and the removal of near duplicate pages improves page ranking. The authors proposed a novel method for detecting near duplicate documents in a corpus based on the semantic similarity score. A heuristic based method is used to rank the documents according to their semantic similarity scores. This is achieved by applying an averaging method on DUC datasets which associates a similarity score with each individual document in the corpus based on semantic content. Word2Vec, WordNet, Normalized Google Distance and Latent Dirichlet Allocation (LDA) are used for computing the similarity scores between pairs of documents in the corpus. The computed scores are used as features for training classifiers to generate document semantic similarity scores for document pairs. Experiments showed improved performance on DUC datasets.

Mansoor et al. (2020): Proposed a deep learning-based method to compute semantic similarity using Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN), to capture the sequence among the elements of a sentence, combined with Convolutional Neural Networks (CNN) for extracting local features. The proposed model used word2vec for text representation. Experiments carried out on the Quora dataset showed better F score, precision and recall compared with other text similarity methods (Naïve Bayes, Decision Tree, CNN, LSTM with word2vec and LSTM with GloVe).

Peters et al. (2018): Introduced a deep context-based learning model for word representation (ELMo). Word vectors are derived from the internal states of a deep bidirectional model. In this model, each token representation is a function of the entire input sentence, computed with a bidirectional LSTM. Higher-level LSTM states capture context aspects of words, while lower-level states model aspects of syntax. The performance of the model was analysed across six challenging NLP tasks, including question answering, and showed a relative error reduction in the range of 6–20% over other models.

Shashavali et al. (2019): Proposed a method for measuring sentence similarity scores using weighted N-grams, a sliding window, cosine similarity and FastText embeddings. Accuracy, precision and recall improved by 6%, 2% and 80%, respectively, compared with the Universal Sentence Encoder technique. The proposed work performs well for small training datasets. A sliding window was used because cosine similarity with weighted average word embeddings does not perform well when computing sentence similarity between short and long sentences.

Stefanovič et al. (2019): Proposed a method to calculate similarity between two texts using word level n-gram to form a bag of n-gram combined with self-organising map (SOM). For evaluation Dice, Cosine, Overlap and extended Jaccard similarity measures were considered. N gram frequency is used to generate a frequency matrix of a dataset (A corpus of plagiarized short answers). Highest similarity was captured by using overlap measure.
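
The set-based measures named in this study are related by a simple ordering, which the sketch below illustrates on word bigrams. It uses the plain set forms of the Dice, Overlap and Jaccard coefficients rather than the frequency-matrix and SOM pipeline of the cited work; the function names and the two sample answers are assumptions.

```python
def word_ngrams(text, n=2):
    """Return the set of word-level n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dice(a, b):
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0


def overlap(a, b):
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b)) if (a and b) else 0.0


def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


short_answer = word_ngrams("the cat sat on the mat")
long_answer = word_ngrams("the cat sat on the mat near the door")

scores = {
    "overlap": overlap(short_answer, long_answer),
    "dice": dice(short_answer, long_answer),
    "jaccard": jaccard(short_answer, long_answer),
}
print(scores)
```

For any two non-empty sets the overlap coefficient is at least as large as the Dice coefficient, which in turn is at least as large as the Jaccard coefficient. When a short text is fully contained in a longer one, as in this example, the overlap coefficient reaches 1 while the other two are penalized by the extra content.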

Han et al. (2021): Presented a survey based on semantic similarity measurement for short text. The study categorizes the techniques into three categories–Corpus based (LSA, LDA, word2Vec, para2Vec, VSM), knowledge based (shortest path, Resnik, ESA) and deep learning based (CNN, LSTM, BERT).

Li and Gong (2021): Used four embedding models, i.e., word2Vec, doc2Vec, TF-IDF and an embedding layer, for text classification on a Chinese news dataset. Deep learning models (CNN, LSTM, GRU, MLP, 2-layer GRU, CNNGRU, CNNGRU_Merge and TextCNN) are used for classification. The 2-layer GRU model with word2Vec embedding showed the highest accuracy.

Wang et al. (2019): Proposed a text summarization technique combining extractive and abstractive approaches. A BERT text embedding model is used to capture semantic features. Important sentences are first selected from the input sentences (corpus), and an abstractive model then generates the summary. Two sub-models (extractive and abstractive) are used, and reinforcement learning updates them in end-to-end training. The proposed method achieved better accuracy.

Ajees et al. (2021): Machine learning based deep level tagging is used to provide more context to each noun and verb in Malayalam text. Two methods are combined for this: word embedding, which uses word2Vec with the skip gram variant, and suffix stripping with SVM classification to identify animate nouns. This method exploits the morphological features of the input text document.

Table 6 highlights recent research studies on text mining tasks, including near duplicate detection, that use text similarity measurement techniques and text embedding models for text representation.

Table 6. Recent research studies on text similarity and representation.

Concept/algorithm/method used | Author(s) | Usage
Text similarity (SimHash, MinHash), Text clustering | Pamulaparty et al. (2014, 2015, 2017); Hassanian-esfahania and Karga (2018) | Near duplicate detection on the basis of keywords generated from text, fuzzy C-means clustering with a discriminant function, random forest method for classification of near duplicates
Text similarity | Yung-Shen et al. (2013); Gali et al. (2016) | Near duplicate detection on the basis of 21 similarity metrics computed between a pair of documents or two titles
Signature based text similarity measurement | Mohammadi and Khasteh (2020); Hajishirzi et al. (2010) | Reference texts are generated using genetic algorithms to obtain signatures for text documents as a sequence of 3-grams for detection of duplicate and near duplicate documents. For generating signatures, the cosine text similarity measure is used on the CiteseerX, Enron and Gold Set of Near-duplicate News Articles datasets
Text similarity | Do and LongVan (2015) | Near duplicate detection by applying signatures generated based on ontology on extracted key phrases
Text representation methods | Al-Subaihin et al. (2019); Mishra (2019) | TF-IDF combined with LSI for topic modeling, spam classification
Text mining, clustering, natural language processing and text similarity | Alqahtani et al. (2021) | Text matching methods
Semantic similarity | Chandrasekaran and Mago (2021) | Any NLP task which involves semantic textual similarity
Semantic similarity | Roul and Sahoo (2020) | Near duplicate detection of web pages on the DUC dataset
Deep learning based semantic similarity | Mansoor et al. (2020) | Sentence similarity using LSTM and CNN pre-trained with word2vec on the Quora dataset
Text representation using ELMo model | Peters et al. (2018) | Question answering, textual entailment, semantic role labelling, named entity extraction, sentiment analysis
Text representation using FastText model | Shashavali et al. (2019) | Goal oriented conversational agents (chatbots)
Text similarity based on distance | Stefanovič et al. (2019) | Plagiarism detection
Semantic similarity for short text based on corpus, knowledge and deep learning model | Han et al. (2021) | Text classification and text clustering, sentiment analysis, information retrieval, social networks, plagiarism detection
Text classification based on text embedding method | Li and Gong (2021) | Deep learning text classification on the Sohu news dataset
Text similarity based on text distance and text representation | Wang and Dong (2020) | Information retrieval, machine translation, question answering, document matching
Text representation using BERT model | Wang et al. (2019) | Extractive-abstractive text summarization with BERT embedding model and reinforcement learning on the CNN/Daily Mail and DUC2002 datasets
Word embedding model, text classification, word tagging | Ajees et al. (2021); Alqrainy and Alawairdhi (2021) | SVM classification to classify animate nouns for Malayalam text, comprehensive tag for Arabic language
Lexical taxonomy | Nazar et al. (2021) | Elimination of incorrect hypernym links, taxonomy with new relations in Spanish, English and French

A critical look at the available literature reveals that the following issues need to be addressed:

Need to reduce the summarization latency in text summarization tasks.

Need to generate an open summarization framework, since existing work is mostly domain specific.

Need to increase the accuracy of the framework for capturing similarity with the help of emerging AI tools.

Since an efficient summary can be generated with proper feature representation and better semantic understanding with the help of advanced AI tools, it can play an important role in the detection of near duplicates by taking summarized text as input, with the objective of reducing both time and storage.

Proposed methodology

In the proposed approach, similarity metrics are applied to the summarized text to find the degree of relatedness. To generate the text summary, the LSA method is used as an extractive text summarizer. To better capture the semantic aspect, text embedding models are used for vector representation. Extractive text summarization is a technique used in various domains of text analytics to extract meaningful textual content by keeping only the important sentences, without any modification of the original content. Figure 1 shows a generic approach for detecting near duplicates in a pair of input texts. For better utilization of time and storage during near duplicate detection, a summary of the original content is generated first. Moreover, to capture semantic similarity, a text embedding model is applied to the generated summary before a suitable text similarity algorithm calculates similarity scores on the vector representation of the text. The detailed working approach is shown with the help of a flowchart in Figure 2.

Figure 1

Block diagram for proposed approach.

Figure 2

Workflow of proposed approach of near duplicate detection.

Algorithms 1, 2, 3 and 4 present the phases of the proposed method and the sequence of steps involved in each.

Near duplicate detection using summarized text

1. document_set := {Text 1, Text 2}, threshold := ø // Initialize
2. function Near_Duplicate_Detection(document_set)
  Input: Pair of text documents
  returns labeled documents as near duplicate or non-duplicate
3. output_set = Generate_Summary(document_set); // Phase 1: Generation of summary
4. vector_set = Generate_vector(output_set); // Phase 2: Text representation
5. similarity_score = calculate_similarity_score(vector_set); // Similarity score calculation
6. if similarity_score > ø then // comparison with threshold
7.   label ‘Near Duplicate’
8. else
9.   label ‘Non Duplicate’
10. end function
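
The control flow of the algorithm above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the three phases are passed in as functions, and the stand-ins used for demonstration (`first_sentence`, `letter_counts`, `cosine`) as well as the default threshold of 0.5 are assumptions chosen only to make the example runnable.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def near_duplicate_detection(
    text_1: str,
    text_2: str,
    summarize: Callable[[str], str],
    embed: Callable[[str], Vector],
    similarity: Callable[[Vector, Vector], float],
    threshold: float = 0.5,  # assumed value; the paper keeps the threshold symbolic
) -> str:
    """Label a pair of texts by summarizing, embedding and scoring them."""
    # Phase 1: generate a summary of each document.
    summaries = [summarize(text_1), summarize(text_2)]
    # Phase 2: represent each summary as a vector.
    vectors = [embed(summary) for summary in summaries]
    # Similarity score calculation and comparison with the threshold.
    score = similarity(vectors[0], vectors[1])
    return "Near Duplicate" if score > threshold else "Non Duplicate"


# Hypothetical stand-ins for the three phases, for illustration only.
def first_sentence(text: str) -> str:
    return text.split(".")[0]


def letter_counts(text: str) -> List[float]:
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(near_duplicate_detection(
        "Text data is large. It must be stored.",
        "Text data is large. It is hard to read.",
        first_sentence, letter_counts, cosine,
    ))
```

Passing the phases as parameters keeps the skeleton independent of the summarizer, embedding model and similarity function, which is how the experiments later swap several of each.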

Generation of summary for the input documents present in document_set using Extractive approach

1. function Generate_Summary(document_set)
  Input: pair of text documents
  returns generated summary
2. for all text documents in document_set do
3.   Pre-processing: block-level breaking of text into key phrases or sentences, tokenization (sentences), lemmatization, stemming, stop word removal, POS tagging, Named Entity Recognition
4.   Identification of interrelated sentences: similarity measuring functions are used to find related sentences to be included in the summary
5.   Weighting and ranking of selected sentences: numeric values are assigned to find important features; higher ranked sentences are selected for the summary
6. output_set := {Text 1_summary, Text 2_summary};
7. return output_set // pair of summarized text
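
The three steps of the summary generation algorithm (pre-processing, identification of interrelated sentences, weighting and ranking) can be illustrated with a small extractive summarizer. Note that this sketch is not the LSA summarizer used in the experiments: it substitutes a simple centrality score based on Jaccard overlap between sentences, and the stop word list, the sentence splitter and the `max_sentences` parameter are all assumptions made for illustration.

```python
import re
from typing import List, Set

# Tiny illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text: str) -> List[str]:
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence: str) -> Set[str]:
    # Pre-processing: lowercase, tokenize, remove stop words.
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_summary(text: str, max_sentences: int = 2) -> str:
    sentences = split_sentences(text)
    words = [content_words(s) for s in sentences]
    n = len(sentences)
    # Interrelated sentences: each sentence is weighted by its total
    # similarity to all other sentences.
    weights = [sum(jaccard(words[i], words[j]) for j in range(n) if j != i)
               for i in range(n)]
    # Ranking: keep the highest-weighted sentences, restored to document order.
    ranked = sorted(range(n), key=lambda i: weights[i], reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    doc = ("Cats chase mice. Dogs chase cats. Mice eat cheese. "
           "The weather is nice today.")
    print(generate_summary(doc, max_sentences=2))
```

Because the selected sentences are copied verbatim, the output stays extractive: no sentence in the summary is modified relative to the original text.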

Text representation using embedding model to generate vectors

1. function Generate_vector(output_set)
  Input: Pair of summarized text documents
  returns vector representation for input document pairs
2. for all summarized text documents in output_set do
3.   VText = embedding_model(summarized text document);
4. vector_set = {VText1, VText2};
5. return vector_set // pair of vectors
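
As an illustration of the text representation step, the sketch below turns each summary into a single vector by averaging word vectors. The paper does not state how word-level vectors are pooled into a document vector, so mean pooling is an assumption here, and `TOY_EMBEDDINGS` is an invented three-dimensional lookup table standing in for a pretrained model such as Word2Vec, GloVe or FastText.

```python
import re
from typing import Dict, List

# Toy 3-dimensional word vectors, invented for illustration only.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "text": [0.9, 0.1, 0.0],
    "document": [0.8, 0.2, 0.1],
    "summary": [0.7, 0.3, 0.0],
    "data": [0.2, 0.9, 0.1],
    "storage": [0.1, 0.8, 0.3],
}


def generate_vector(summary: str,
                    embeddings: Dict[str, List[float]] = TOY_EMBEDDINGS) -> List[float]:
    """Mean-pool the vectors of all known words in the summary."""
    dim = len(next(iter(embeddings.values())))
    tokens = re.findall(r"[a-z]+", summary.lower())
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known word: fall back to the zero vector.
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def generate_vector_set(output_set: List[str]) -> List[List[float]]:
    """Embed every summarized document in the pair."""
    return [generate_vector(summary) for summary in output_set]


if __name__ == "__main__":
    print(generate_vector_set(["Text data storage.", "Document summary."]))
```

Sentence-level models such as Universal Sentence Encoder return one vector per input directly, so the pooling step would not be needed for them.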

Similarity score calculation for summarized text vectors

1. function calculate_similarity_score (vector_set)
  Input: pair of vectors
  returns similarity scores of the summarized text documents
2. similarity_score = similarity_function(vector_set)
3. return similarity_score

Experimental results and discussion

For the experiments, the abstracts of two research articles (Elrefaiy et al., 2018; Mishra et al., 2019) are used as Text 1 and Text 2, as shown in Table 7. Table 8 shows the summaries generated from the input documents with LSA, an extractive text summarization method. Text embedding models are used to obtain better vector representations of the text, and these vectors serve as the inputs to the similarity functions. To analyse the performance of text similarity functions with embedding models, we considered six models: Word2Vec, Universal Sentence Encoder, FastText, ELMo, GloVe and BERT. Similarity scores are calculated with various similarity functions on both the original and the summarized text, with and without embedding models. For further insight, similarity functions are also applied to the output of two other text extraction strategies, topic modelling and key phrase extraction. Table 9 and Table 10 show the topics generated when the LDA method is applied to the original text pair and to the summarized text pair, respectively. The high-weighted topics from the original text also appear as topics in the summary. Table 11 shows the key phrases generated using the TF-IDF method.

Input texts.

| Input | Original text |
| --- | --- |
| Text 1 | “Everyday large volume of data is gathered from different sources and are stored since they contain valuable piece of information. The storage of data must be done in efficient manner since it leads in difficulty during retrieval. Text data are available in the form of large documents. Understanding large text documents and extracting meaningful information out of it is time-consuming tasks. To overcome these challenges, text documents are summarized in with an objective to get related information from a large document or a collection of documents. Text mining can be used for this purpose. Summarized text will have reduced size as compare to original one. In this review, we have tried to evaluate and compare different techniques of Text summarization.” |
| Text 2 | “In the view of a significant increase in the burden of information over and over the limit by the amount of information available on the internet, there is a huge increase in the amount of information overloading and redundancy contained in each document. Extracting important information in a summarized format would help a number of users. It is therefore necessary to have proper and properly prepared summaries. Subsequently, many research papers are proposed continuously to develop new approaches to automatically summarize the text. “Automatic Text Summarization” is a process to create a shorter version of the original text (one or more documents) which conveys information present in the documents. In general, the summary of the text can be categorized into two types: Extractive-based and Abstractive-based. Abstractive-based methods are very complicated as they need to address a huge-scale natural language. Therefore, research communities are focusing on extractive summaries, attempting to achieve more consistent, non-recurring and meaningful summaries. This review provides an elaborative survey of extractive text summarization techniques. Specifically, it focuses on unsupervised techniques, providing recent efforts and advances on them and list their strengths and weaknesses points in a comparative tabular manner. In addition, this review highlights efforts made in the evaluation techniques of the summaries and finally deduces some possible future trends.” |

Text summarization on original text.

| Text summarization (using LSA method) on | Generated summary |
| --- | --- |
| Text 1 | “Everyday large volume of data is gathered from different sources and are stored since they contain valuable piece of information. The storage of data must be done in efficient manner since it leads in difficulty during retrieval. To overcome these challenges, text documents are summarized in with an objective to get related information from a large document or a collection of documents.” |
| Text 2 | “In the view of a significant increase in the burden of information over and over the limit by the amount of information available on the internet, there is a huge increase in the amount of information overloading and redundancy contained in each document. Specifically, it focuses on unsupervised techniques, providing recent efforts and advances on them and list their strengths and weaknesses points in a comparative tabular manner. In addition, this review highlights efforts made in the evaluation techniques of the summaries and finally deduces some possible future trends.” |

Topic modeling on original text.

| Topic modelling (using LDA method) on | Topic | Topics with weights |
| --- | --- | --- |
| Text 1 | Topic #1 | [(‘different’, 1.06), (‘since’, 1.03), (‘data’, 0.97), (‘try’, 0.88), (‘evaluate’, 0.88), (‘technique’, 0.88), (‘review’, 0.88), (‘summarization’, 0.88)] |
| Text 1 | Topic #2 | [(‘text’, 1.42), (‘document’, 1.39), (‘large’, 1.16), (‘form’, 1.01), (‘available’, 1.01), (‘summarize’, 0.91), (‘information’, 0.9), (‘meaningful’, 0.85)] |
| Text 2 | Topic #1 | [(‘information’, 1.24), (‘summary’, 1.1), (‘summarize’, 1.05), (‘research’, 1.0), (‘amount’, 0.9), (‘increase’, 0.9), (‘help’, 0.84), (‘would’, 0.84)] |
| Text 2 | Topic #2 | [(‘text’, 1.36), (‘based’, 1.34), (‘provide’, 1.08), (‘extractive’, 1.07), (‘summarization’, 1.07), (‘abstractive’, 1.06), (‘technique’, 1.02), (‘summary’, 1.01)] |

Topic modeling on summary of original text.

| Topic modelling (using LDA method) applied on | Topic | Topics with weights |
| --- | --- | --- |
| Text 1 Summary | Topic #1 | [(‘document’, 0.091), (‘data’, 0.065), (‘information’, 0.065), (‘piece’, 0.039), (‘contain’, 0.039), (‘summarize’, 0.039), (‘manner’, 0.039), (‘do’, 0.039), (‘must’, 0.039), (‘large’, 0.039)] |
| Text 1 Summary | Topic #2 | [(‘document’, 0.044), (‘information’, 0.044), (‘data’, 0.044), (‘source’, 0.044), (‘different’, 0.043), (‘valuable’, 0.043), (‘lead’, 0.043), (‘challenge’, 0.043), (‘collection’, 0.043), (‘relate’, 0.043)] |
| Text 2 Summary | Topic #1 | [(‘information’, 0.056), (‘increase’, 0.040), (‘effort’, 0.040), (‘amount’, 0.040), (‘technique’, 0.040), (‘specifically’, 0.024), (‘unsupervised’, 0.024), (‘future’, 0.024), (‘overload’, 0.024), (‘comparative’, 0.024)] |
| Text 2 Summary | Topic #2 | [(‘information’, 0.027), (‘technique’, 0.027), (‘amount’, 0.027), (‘effort’, 0.026), (‘increase’, 0.026), (‘possible’, 0.026), (‘redundancy’, 0.026), (‘make’, 0.026), (‘summary’, 0.026), (‘strength’, 0.026)] |

Key phrase extraction on Text 1 and Text 2 using weighted TF-IDF method.

| Key phrase extraction method applied on | Key phrases with weights |
| --- | --- |
| Text 1 | [(‘form’, 0.577), (‘large documents’, 0.577), (‘text data’, 0.577), (‘large text documents’, 0.577), (‘meaningful information’, 0.577), (‘time-consuming tasks’, 0.577), (‘different techniques’, 0.577), (‘review’, 0.577), (‘text summarization’, 0.577), (‘different sources’, 0.476)] |
| Text 2 | [(‘prepared summaries’, 1.0), (‘abstractive-based methods’, 0.707), (‘huge-scale natural language’, 0.707), (‘documents’, 0.667), (‘summary’, 0.632), (‘types’, 0.632), (‘elaborative survey’, 0.577), (‘extractive text summarization techniques’, 0.577), (‘review’, 0.577), (‘many research papers’, 0.534)] |

Table 12 shows the similarity scores obtained when text similarity functions based on traditional distance-based metrics are applied to the original pair of texts and to their topics, extracted key phrases and summaries. Figure 3 shows that the similarity values obtained from the extractive approaches almost match the scores obtained when the same algorithm is applied to the original text.

Figure 3

Similarity scores for the various text extraction methods.

Similarity scores using traditional similarity metrics on original texts, topics, keyword extracted and summary.

| Text similarity measure | Similarity score (in %) between Text 1 and Text 2 | Similarity score (in %) between topics of Text 1 and Text 2 | Similarity score (in %) between key words extracted from Text 1 and Text 2 | Similarity score (in %) between Text 1 summary and Text 2 summary |
| --- | --- | --- | --- | --- |
| Euclidean distance [ED] | 23.70 | 22.40 | 20.03 | 15.36 |
| Normalized Levenshtein [NL] | 27.80 | 26.43 | 33.69 | 29.08 |
| Hamming Distance [HD] | 40.0 | 7.14 | 10.8 | 27.0 |
| Term Frequency-Inverse Document Frequency [TF-IDF] | 53.71 | 55.90 | 38.86 | 41.11 |
| Jaccard Distance [JD] | 56.23 | 38.2 | 42.75 | 48.97 |
| Cosine Similarity [CS] | 63.0 | 29.46 | 30.15 | 41.86 |
| Jaro Winkler [JW] | 68.0 | 76.8 | 70.0 | 72.80 |
| Cosine similarity with k shingles [CS_kshingles] | 89.0 | 62.5 | 61.92 | 81.30 |
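
Two of the traditional measures in the table above, Jaccard similarity and cosine similarity with k-shingles, can be sketched as follows. The paper does not specify the tokenization or the value of k, so word-level sets for Jaccard, character-level shingles and `k=2` are assumptions; the scores this sketch produces are therefore not expected to reproduce the values in the table.

```python
import math
from collections import Counter


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct words that the two texts have in common."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def char_shingles(text: str, k: int = 2) -> Counter:
    """Count every overlapping substring of length k (a k-shingle profile)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + k] for i in range(len(normalized) - k + 1))


def cosine_kshingles(text_a: str, text_b: str, k: int = 2) -> float:
    """Cosine similarity between the k-shingle count profiles of two texts."""
    profile_a, profile_b = char_shingles(text_a, k), char_shingles(text_b, k)
    dot = sum(profile_a[s] * profile_b[s] for s in profile_a.keys() & profile_b.keys())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    a = "text documents are summarized"
    b = "summarized text documents help users"
    print(round(100 * jaccard_similarity(a, b), 2))
    print(round(100 * cosine_kshingles(a, b), 2))
```

Shingle profiles retain local character order, which a bag of words discards; this is one reason the two measures can rank the same pair of texts differently.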

Table 13 shows the results obtained when text embedding models are used to generate the vectors for similarity calculation. Figure 4 shows that better text representations yield better similarity scores even when applied to summarized text. Figures 5 and 6 show a graphical comparison and the similarity distribution, respectively, of the similarity scores from both the traditional and the embedding-based approaches, on the original text pair and on its summary.

Figure 4

Impact of text representation on similarity calculation.

Figure 5

Graphical representation of similarity scores using various similarity measure techniques.

Figure 6

Similarity Score distribution using Various Similarity Search Techniques on original and summarized text.

Similarity scores using text embedding models on original and summarized document.

| Embedding model | Similarity score (in %) between Text 1 and Text 2 | Similarity score (in %) between Text 1 summary and Text 2 summary |
| --- | --- | --- |
| Word2Vec | 5.28 | 14.26 |
| Universal Sentence Encoder [USE] | 81.36 | 69.39 |
| FastText with soft cosine similarity [FT_SoftCS] | 81.76 | 92.40 |
| ELMo with cosine similarity [ELMo_CS] | 88.59 | 76.32 |
| GloVe with cosine similarity [GloVe_CS] | 97.89 | 95.60 |
| BERT with cosine similarity [BERT_CS] | 72.28 | 82.29 |
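
Soft cosine similarity, used with FastText in the table above, extends ordinary cosine similarity with a term-similarity matrix, so that two texts sharing no words can still score above zero when their words are related. The sketch below implements the standard formula; the two-word vocabulary and the similarity value of 0.5 are made up for illustration, whereas in practice the matrix would be derived from the word embeddings (for example, from cosine similarity between word vectors).

```python
import math
from typing import List, Sequence


def soft_cosine(a: Sequence[float], b: Sequence[float],
                sim: Sequence[Sequence[float]]) -> float:
    """Soft cosine between bag-of-words vectors a and b.

    sim[i][j] is the similarity between vocabulary terms i and j;
    with the identity matrix this reduces to ordinary cosine similarity.
    """
    def bilinear(x: Sequence[float], y: Sequence[float]) -> float:
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical vocabulary: ["summary", "abstract"].
    doc_a = [1.0, 0.0]  # contains only "summary"
    doc_b = [0.0, 1.0]  # contains only "abstract"
    identity: List[List[float]] = [[1.0, 0.0], [0.0, 1.0]]
    related: List[List[float]] = [[1.0, 0.5], [0.5, 1.0]]  # assumed term similarity
    print(soft_cosine(doc_a, doc_b, identity))  # ordinary cosine: no shared words
    print(soft_cosine(doc_a, doc_b, related))   # related terms now contribute
```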

From the above experimental details, it can be seen that, among the traditional similarity measures in Table 12, Jaro Winkler gives the highest scores for topic modelling (76.8%) and keyword extraction (70.0%) and remains above 70% on the text summary (72.80%), where only cosine similarity with k shingles scores higher (81.30%). Embedding models provide an efficient text representation that improves the performance of the similarity algorithms, as shown in Table 13. Soft cosine similarity using FastText [FT_SoftCS] shows the largest improvement on summarized text (from 81.76% to 92.40%), while GloVe with cosine similarity captures the highest degree of similarity on both the original and the summarized text, as shown in Table 14. A heat map for the GloVe embedding model is shown in Figure 7. Table 15 gives an overall analysis of the proposed methodology. Figure 8 shows a graphical comparison of the similarity scores for the original and the summarized text, with and without a text embedding model.

Figure 7

Heat map (GloVe) using both approaches.

Figure 8

Comparison of similarity score of original vs. summarized text.

Result analysis.

| Approach | Similarity function | Original text | Summarized text |
| --- | --- | --- | --- |
| Without embedding model | Jaro Winkler [JW] | 68.0 | 72.80 |
| Without embedding model | Cosine similarity with k shingles [CS_kshingles] | 89.0 | 81.30 |
| With embedding model | Soft cosine similarity using FastText [FT_SoftCS] | 81.76 | 92.40 |
| With embedding model | Cosine similarity with GloVe [GloVe_CS] | 97.89 | 95.60 |

Analysis of impact of embedding models on Text similarity measurement.

| No. of text similarity algorithms | Approach used | Average similarity score (in %) between Text 1 and Text 2 | Average similarity score (in %) between Text 1 summary and Text 2 summary | Difference (in %) |
| --- | --- | --- | --- | --- |
| 8 | Without text embedding models | 52.68 | 44.685 | 7.995 |
| 6 | With text embedding models | 71.19 | 71.71 | 0.52 |
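
The averages in the table above follow directly from the per-measure scores reported for the eight traditional measures and the six embedding models. The short sketch below recomputes them; the dictionary and function names are illustrative, and the numbers are copied from the two score tables.

```python
from typing import Dict, Tuple

# (original text score, summarized text score) in %, from the traditional-measure table.
TRADITIONAL: Dict[str, Tuple[float, float]] = {
    "ED": (23.70, 15.36), "NL": (27.80, 29.08), "HD": (40.0, 27.0),
    "TF-IDF": (53.71, 41.11), "JD": (56.23, 48.97), "CS": (63.0, 41.86),
    "JW": (68.0, 72.80), "CS_kshingles": (89.0, 81.30),
}

# (original text score, summarized text score) in %, from the embedding-model table.
EMBEDDING: Dict[str, Tuple[float, float]] = {
    "Word2Vec": (5.28, 14.26), "USE": (81.36, 69.39), "FT_SoftCS": (81.76, 92.40),
    "ELMo_CS": (88.59, 76.32), "GloVe_CS": (97.89, 95.60), "BERT_CS": (72.28, 82.29),
}


def average_scores(scores: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Return the mean original-text score and the mean summary score."""
    originals = [original for original, _ in scores.values()]
    summaries = [summary for _, summary in scores.values()]
    return sum(originals) / len(originals), sum(summaries) / len(summaries)


if __name__ == "__main__":
    for name, table in (("Without embedding", TRADITIONAL), ("With embedding", EMBEDDING)):
        original_avg, summary_avg = average_scores(table)
        gap = abs(original_avg - summary_avg)
        print(f"{name}: original {original_avg:.2f}, summary {summary_avg:.3f}, gap {gap:.3f}")
```

The gap between original and summarized text is about 8 points without embedding models and about 0.5 points with them, which is the basis for the claim that summaries can stand in for full documents when embeddings are used.
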
Conclusion and future scope

An extractive approach to summary generation is used to make the proposed approach independent of domain knowledge, and this paper uses it to design and develop a near duplicate detection algorithm. The proposed approach performs reasonably well even for a higher threshold value (>50%). Based on the results obtained, it is possible to use the summary instead of the whole document, together with text embeddings, to capture similarity: the average similarity score over the six embedding models is 0.52% higher on the summarized text than on the original text (71.71% versus 71.19%). With a suitable embedding model this gain could be considerably larger, as the performance of Word2Vec was poor.

The text summarization algorithm can be extended by adding other cohesive elements such as synonymy, antonymy, collocation, calculation, similarity and transformation. To work more efficiently, sentence syntax should be handled in a more mathematical and linguistic way. The integration method consists of grammatical and lexical linking within the text and between sentences, and provides important details. In future work, alternatives may be used that create an internal semantic representation and apply natural language generation strategies to produce a summary. Deep learning can also be used to develop generalized text embedding models that handle insufficient data and add deeper context to POS tagging. Abstractive text summarization, which generates a summary on the basis of hidden text, can be used as well.

eISSN: 1178-5608
Language: English
Publication frequency: Volume Open
Journal subjects: Engineering, Introductions and Overviews, other