Medical nearest-word embedding technique implemented using an unsupervised machine learning approach for Bengali language

Kailash Pati Mandal; Prasenjit Mukherjee; Devraj Vishnu; Baisakhi Chakraborty; Tanupriya Choudhury; Pradeep Kumar Arya

Open Access

International Journal on Smart Sensing and Intelligent Systems

Volume 17 (2024): Issue 1 (January 2024)

Article Category: Article

Published online: 12 Jun 2024

Received: 13 May 2023

DOI: https://doi.org/10.2478/ijssis-2024-0018

Keywords
natural language processing, skip-gram model, noise-contrastive estimation, stochastic gradient descent, structured query language

© 2024 Kailash Pati Mandal et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Workflow diagram of the proposed system.

Semantically closer words of কিডনি (kidney).

Semantically closer words of পাথর (stone).

Semantically closer words of চিকিৎসা (treatment).

Semantically closer words of হাসপাতাল (hospital).

Execution time with respect to number of tokens.

Graphical representation of the confusion matrix.

Comparative study of similar existing systems and the proposed work.

Author(s)	Methodology applied in the comparable system	Methodology applied in the proposed system
Arefin et al. [28]	(1) The pre-processing stage in [28] uses lowercase conversion, removal of escaped words, tokenization, and PoS tagging. (2) Word similarity is used to select closely related words. (3) The Jaro-Winkler matching algorithm and the Naive Bayes method are used to develop the system in [28].	(1) The proposed system employs tokenization, stop-word removal, and stemming in the pre-processing stage. (2) The system uses semantic similarity to select the entities and attributes that help to frame the SQL. (3) The skip-gram model is used to detect closely related words, whereas Noise Contrastive Estimation is used to remove irrelevant words. Stochastic Gradient Descent is employed to optimize the proposed model.
Liu et al. [29]	(1) The model in [29] contains three layers: an encoder layer, a parse layer, and an output layer. (2) The encoder encodes the input information, while the parse layer is responsible for connecting the vectors of the various parts. (3) The output layer handles the logit output of the model.	(1) The machine learning model is divided into three layers: an input layer, a hidden layer, and an output layer. (2) Each word is assigned a unique value to encode the information. (3) Cosine similarity captures semantically close words.
Sanyal et al. [30]	(1) This system converts queries from English to SQL. (2) The work [30] is concerned with high-performance SQL generation. (3) Tokenization, stop-word removal, parsing, lexical analysis, synonym detection and formation, and a filtered-word mechanism from NLP are used to implement the work [30].	(1) The proposed system converts Bengali queries to SQL. (2) An unsupervised machine learning model is used to generate SQL from NL queries. (3) NLP-based data pre-processing and the skip-gram model are applied to implement the proposed system.
Sugandhika and Ahangama [31]	(1) An XML file contains the metadata of a particular database, and the XML extractor reads this file to obtain the metadata. The metadata is used to generate SQL because it contains information about table names, column names, operator details, etc. (2) The BASIC clause generator extracts the elements needed to form the BASIC clause of the SQL. (3) Column names, column values, relational operators, and concatenating operators are used to frame the SQL.	(1) Unlabelled Bengali text is used to train the skip-gram model to find related words. (2) A set of predefined rules is applied to form SQL with ‘SELECT’, ‘WHERE’, ‘IN’, ‘AND’, ‘OR’, etc. clauses. (3) Semantically close words are used to predict the entity and attribute of a particular table.
Pal et al. [32]	(1) The model in [32] is a deep learning-based model that handles natural language questions and generates valid SQL. (2) Because of the sensitive nature of some databases, a data privacy approach is included in the model [32]. (3) The model [32] uses RoBERTa embeddings and data-agnostic knowledge vectors. These vectors are passed through LSTM-based sub-models to predict the final SQL query.	(1) The proposed system uses the unsupervised skip-gram model to convert Bengali NL queries to SQL. (2) Noise Contrastive Estimation (NCE) is used to discriminate between actual data and noise. (3) A unique value is assigned to every unique word to form a vector. The vectors are used as input to the skip-gram model to identify close words.
Huo et al. [33]	(1) In SyntaxSQLNet, a bi-directional LSTM (BiLSTM) is used to encode a natural language phrase, as in [33]. (2) This approach incorporates both global table information and local column information as input to the BiLSTM. (3) An SQL-specific tree-based decoder is used in the model to understand the SQL structure. The COL module is used to predict the column, and this model has been improved in [33].	(1) The proposed system encodes Bengali phrases with the help of a word embedding method. (2) The created dictionary is employed to train the skip-gram model. (3) The output of the skip-gram model is used to identify the entity and attributes in a predefined healthcare database.
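
The proposed-system column above mentions three steps that can be illustrated together: assigning a unique value to each word, forming skip-gram training pairs, and ranking words by cosine similarity. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the window size, and the two-dimensional toy vectors are hypothetical, and the vectors are hand-picked rather than learned; training with Noise Contrastive Estimation and Stochastic Gradient Descent is not shown.

```python
import math


def build_vocab(sentences):
    """Assign a unique integer id to every distinct token, in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def skipgram_pairs(sentence, vocab, window=2):
    """Return (center_id, context_id) pairs for one tokenized sentence."""
    ids = [vocab[token] for token in sentence]
    pairs = []
    for i, center in enumerate(ids):
        start, stop = max(0, i - window), min(len(ids), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, ids[j]))
    return pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, embeddings, k=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy corpus built from words that appear in the figure captions.
kidney, stone, treatment, hospital = "কিডনি", "পাথর", "চিকিৎসা", "হাসপাতাল"
sentences = [[kidney, stone, treatment], [hospital, treatment]]
vocab = build_vocab(sentences)
pairs = skipgram_pairs(sentences[0], vocab, window=1)

# Hand-picked vectors for illustration only; a trained model would supply these.
toy_embeddings = {kidney: [0.9, 0.1], stone: [0.8, 0.3], hospital: [0.1, 0.9]}
closest = nearest_words(kidney, toy_embeddings, k=1)
print(pairs)
print(closest)
```

In this toy setup the vector for পাথর (stone) points in nearly the same direction as the vector for কিডনি (kidney), so it ranks first; with learned embeddings the ranking would depend on the training corpus.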

Confusion matrix.

Expected output vs. select output	Select output: Positive	Select output: Negative
Expected output: Positive	True positives (TP): 65	False positives (FP): 17
Expected output: Negative	False negatives (FN): 25	True negatives (TN): 49

Structure of department table.

id_dept	name_dept	id_hos
70	নবজাতক (neonate)	7
170	মনোরোগ (psychiatry)	12

Structure of hospital table.

id_hos	name_hos	add_hos	district_hos	State_hos
1	মুর্শিদাবাদ জেলা হাসপাতাল (murʃid̪abad̪ ɟela haʃpat̪al) (Murshidabad District Hospital)	লালগোলা (lalgola) (Lalgola) (proper noun)	মুর্শিদাবাদ (murʃid̪abad̪) (Murshidabad) (proper noun)	পশ্চিমবঙ্গ (poʃcimbɔŋgo) (West Bengal) (proper noun)
8	হাওড়া জেলা হাসপাতাল (ha͡o̯ɽa ɟela haʃpat̪al) (Howrah District Hospital)	আমতা (amot̪a) (Amta) (proper noun)	হাওড়া (ha͡o̯ɽa) (Howrah) (proper noun)	পশ্চিমবঙ্গ (poʃcimbɔŋgo) (West Bengal) (proper noun)

Structure of doctor table.

id_doc	name_doc	qualification_doc	specialist_doc	id_hos	id_dept
1070	সুজন দাশগুপ্ত (ʃuɟon d̪aʃgupt̪o) (Sujon Dasgupta)	এম.বি.বি.এস. (emo.bi.bi.es.) (M.B.B.S.)	নেফ্রোলজিষ্ট (nepʰrolɟiʃto) (nephrologist)	8	80
1080	সোমেন দে (ʃomen d̪e) (Somen De)	ডি.এম. (di.em.) (D.M.)	নিউরোলজিস্ট (ni͡u̯rolɟist) (neurologist)	9	90
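
To make the table structures concrete, the sketch below loads the hospital and doctor rows shown above into an in-memory SQLite database and forms a simple query from an entity, a list of attributes, and a condition. It is an illustration under assumptions, not the authors' rule set: the helper `build_select`, the choice of SQLite, and the column types are hypothetical, while the table names, column names, and sample rows come from the tables above.

```python
import sqlite3


def build_select(entity, attributes, conditions):
    """Form a parameterized SELECT statement.

    entity: table name; attributes: column names; conditions: {column: value}
    pairs combined with AND. Hypothetical helper: identifiers are interpolated
    directly, so they must come from a trusted schema, never from user input.
    """
    sql = f"SELECT {', '.join(attributes)} FROM {entity}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in conditions)
    return sql, list(conditions.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hospital (id_hos INTEGER PRIMARY KEY, name_hos TEXT, "
    "add_hos TEXT, district_hos TEXT, State_hos TEXT)"
)
conn.execute(
    "CREATE TABLE doctor (id_doc INTEGER PRIMARY KEY, name_doc TEXT, "
    "qualification_doc TEXT, specialist_doc TEXT, id_hos INTEGER, id_dept INTEGER)"
)

nephrologist = "নেফ্রোলজিষ্ট"
conn.execute(
    "INSERT INTO hospital VALUES (?, ?, ?, ?, ?)",
    (8, "হাওড়া জেলা হাসপাতাল", "আমতা", "হাওড়া", "পশ্চিমবঙ্গ"),
)
conn.executemany(
    "INSERT INTO doctor VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1070, "সুজন দাশগুপ্ত", "এম.বি.বি.এস.", nephrologist, 8, 80),
        (1080, "সোমেন দে", "ডি.এম.", "নিউরোলজিস্ট", 9, 90),
    ],
)

# Entity and attributes would be predicted from the query; here they are fixed.
sql, params = build_select("doctor", ["name_doc", "id_hos"],
                           {"specialist_doc": nephrologist})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Values are passed as bound parameters rather than concatenated into the statement, which keeps Bengali strings intact and avoids quoting problems.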

Execution time with respect to number of tokens.

Query	Number of tokens	Execution time units
Q1	4	4
Q2	5	5
Q3	7	7
Q4	9	9
Q5	12	12

Performance statistics of the proposed system.

Precision	Recall	Accuracy	F1 score
79%	72%	73%	75%
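
The reported statistics can be checked against the counts in the confusion matrix (TP = 65, FP = 17, FN = 25, TN = 49) using the standard definitions of precision, recall, accuracy, and F1 score. The short sketch below does this; the function name is hypothetical. The computed F1 score is about 0.756, so the reported 75% appears to be truncated rather than rounded.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Counts taken from the confusion matrix table.
metrics = classification_metrics(tp=65, fp=17, fn=25, tn=49)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```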

eISSN: 1178-5608
Language: English

Publication timeframe: Volume Open
Journal subjects: Engineering, Introductions and Overviews, other
