A novel approach to capture the similarity in summarized text using embedded model

Near-duplicate textual content poses a major challenge for information extraction, which makes the detection of near duplicates a prime research concern. Existing research mostly relies on text clustering, classification, and retrieval algorithms to detect near duplicates. Text summarization, an important text mining tool, has not yet been explored for this purpose. Instead of using the whole document, the proposed method uses its summary, which saves both time and storage. Experimental results show that traditional similarity algorithms captured similarity to a great extent even on the summarized text, with a similarity score of 44.685%. Moreover, embedding models, which offer better text representation, captured a greater degree of similarity (0.52%) than traditional methods. This paper also reviews the research status of various similarity measures in terms of the concepts involved, their merits, and their demerits.
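As a rough illustration of the idea of comparing summaries rather than whole documents, the following is a minimal sketch only. The abstract does not name the summarizer or the similarity measures used, so everything here is an assumption: `summarize` is a hypothetical frequency-based extractive summarizer, and `cosine_similarity` is a plain term-frequency cosine standing in for a "traditional" measure. An embedding model would replace the term-frequency vectors with learned vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(text, k=2):
    """Hypothetical extractive summarizer (an assumption, not the paper's):
    keep the k highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average document-level frequency of the words in the sentence.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Illustrative inputs only; these are not data from the paper.
doc_a = "Cats sleep. Cats eat fish. Dogs bark."
doc_b = "Cats sleep all day. Cats eat fish daily. Birds sing."
print(cosine_similarity(summarize(doc_a), summarize(doc_b)))
```

Comparing summaries means each document is reduced to `k` sentences before any vectors are built, which is where the time and storage savings described above would come from.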

eISSN: 1178-5608
Language: English

Publication timeframe: Volume Open
Journal Subjects: Engineering, Introductions and Overviews, other

Published Online: Apr 16, 2022

Page range: 1 - 20

Received: Oct 25, 2021

DOI: https://doi.org/10.21307/ijssis-2022-0002

Keywords
Embedding models, Extractive text summarization, Near duplicate, Similarity measures, Text representation

© 2022 Asha Rani Mishra et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
