Parsing Korean Classical Literature by Integrating Text Mining and Semantic Analysis

19 March 2025

Introduction

Korean literature belongs to the family of East Asian literatures; the interpenetration of regional cultures and the frequent exchanges between scholars of Korean native culture and Chinese scholars have promoted mutual influence among the cultures of different countries. Scholars hold differing views of the Korean Peninsula and of Korean literature. For Korean scholars, the concept of Korean literature rests on the notion of “Beta literature” [1], a notion grounded in the categorization of linguistic conventions and expressions in literature [2]. It takes no account of the time and place of creation or of the nationality of the creator: a literary work is assigned to the literature of the country whose language it is written in [3, 4]. That is, literature written in Chinese is categorized as Chinese literature, literature written in English as English literature, and literature written in Korean as Korean literature.

Literature is, to some extent, a subjective description of objective things that follows its own law of development, and its connotations reflect the spirit of the times and people's level of understanding of the world [5–7]. As the organizing principle of a literary work, the theme leads and runs through the whole text and is emphasized in the characters and storylines [8, 9]. The themes of literary works are diverse: they can highlight truth, goodness, and beauty, reflect family and friendship, or pursue equality, freedom, and happiness [10, 11]. Drawing on mature literary and linguistic theories and methods, the creation of themes in Korean literature is worth exploring. Taking classical works as text mining objects, a network of literary themes based on the semantic square is constructed, and the deep structure of antagonistic, contradictory, and complementary relationships between meanings is analyzed to provide data support for the study of Korean classical literature [12–16].

In this paper, we obtain the text data of Korean classical literature through a web crawler. After word segmentation and part-of-speech annotation, we conduct text mining and semantic analysis using the TF-IDF algorithm and the LDA topic model: we extract keywords from the texts, count their frequencies, analyze keyword co-occurrence, and generate a keyword co-occurrence matrix to study the connections between keywords. The associations between keywords in Korean classical literature are then explored by calculating the TF-IDF value and Lift value of each keyword, and based on these values a semantic network analysis map of Korean classical literature is drawn to visualize its semantic relationships.

Natural Language Processing of Korean Classics
Text data acquisition and processing

This section introduces the technologies and theoretical background used in the text mining experiments on Korean classical literature, covering web crawlers, word segmentation, stop word removal, TF-IDF, and related techniques.

Introduction to Web Crawlers

The most basic step in text mining is the acquisition of text data, and the publicly available datasets on the Internet are few in kind and often insufficient for researchers' needs [17]. A web crawler is therefore the first choice for obtaining data: it is a program that, following the developer's specification, mimics a browser and automatically, efficiently, and accurately retrieves content from target web pages. It frees users from browsing pages manually, saving time and effort and avoiding collection errors caused by lapses of human attention. As crawler technology develops, the anti-crawling measures of websites also keep improving, and users must abide by the law when collecting data. A minimal sketch of such a crawler is given below.
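
As an illustration only, the following sketch uses the widely used requests and BeautifulSoup libraries; the paper does not specify its crawler implementation, and the URL and CSS selector here are placeholder assumptions.

```python
import requests
from bs4 import BeautifulSoup

def fetch_text(url: str) -> str:
    """Download one page and return the text of its main content area."""
    headers = {"User-Agent": "Mozilla/5.0"}  # mimic a browser, as described above
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # "div.article-body p" is a placeholder selector; a real site needs its own.
    paragraphs = soup.select("div.article-body p")
    return "\n".join(p.get_text(strip=True) for p in paragraphs)
```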

Segmentation and lexical annotation

A word is the smallest unit that expresses a meaningful idea; it can consist of one or more consecutive characters and is the basis of NLP. It can be said that the quality of word segmentation directly affects all subsequent analysis results. Because of the difference between phonographic (alphabetic) and ideographic writing, phonographic text can be segmented well with the help of the spaces between words, while ideographic text is harder to segment because the relationships between adjacent characters must be resolved. Current ideographic segmentation methods fall into three main categories: dictionary-based methods, word frequency-based methods, and machine learning-based methods, described as follows:

Dictionary-based segmentation methods. This approach compares and matches the text against an existing dictionary. Matching can be further categorized into the forward maximum matching method, the reverse maximum matching method, and the bidirectional maximum matching method.

Word frequency-based segmentation methods. This approach treats any two adjacent characters in the text as a candidate word, counts how often each candidate occurs, and proceeds likewise for longer combinations; the candidates are then sorted by frequency, and when a candidate's frequency reaches a preset threshold it is accepted as a word.

Machine learning-based segmentation methods. This approach is relatively cutting-edge at home and abroad, and many large Internet companies use it. Its main idea is to train and optimize an appropriate algorithm continuously on an existing, mature corpus until segmentation is achieved. In this paper, the three kinds of segmentation methods are used in combination, depending on the situation; a sketch of the dictionary-based forward maximum matching method follows.
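
As an illustration of the first category, here is a minimal forward maximum matching segmenter; the dictionary and the window size max_len are toy assumptions, not the paper's actual lexicon.

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Dictionary-based forward maximum matching: try the longest candidate
    window first and shrink it until a dictionary hit (or a single character)."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Toy usage with a hypothetical dictionary:
print(forward_max_match("abcd", {"ab", "cd"}))  # ['ab', 'cd']
```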

Lexical annotation, also known as part-of-speech tagging, is the procedure of marking each token in the segmentation result according to a systematic word package, determining whether it is a noun, a negative word, a verb, an adjective, or another part of speech; punctuation marks also receive a unified annotation. Part-of-speech tagging for ideographic text is relatively simple compared with tagging for phonographic text, because it is very rare for an ideographic word to have more than one part-of-speech attribute: most words possess only one, and of the small number of words with two attributes, the first occurs much more frequently than the second. A small tagging sketch is given below.
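
For illustration, part-of-speech tagging of Korean text can be done with the open-source KoNLPy toolkit; this is an assumed tooling choice, as the paper does not name its tagger.

```python
# Requires the konlpy package and its Java backend to be installed.
from konlpy.tag import Okt

tagger = Okt()
# Tags each token with its part of speech, e.g. Noun, Josa (particle), Verb.
print(tagger.pos("하늘과 땅"))  # roughly: [('하늘', 'Noun'), ('과', 'Josa'), ('땅', 'Noun')]
```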

Stop words and TF-IDF

Stop words

Stop word removal is a very important tool in the text mining process. Stop words are high-frequency, low-value words in text data, and removing them reduces the dimensionality of the text features and improves the quality of the feature set. In text mining tasks, some words exist only to satisfy Korean grammatical norms and to link other words together so that an utterance reads fluently; they carry no valuable information about the text data. Removing them does not affect the general meaning of the text, only its fluency, whereas retaining them produces spuriously high word frequencies that introduce interference noise into the analysis.

Researchers can build their own stop word list, which requires the support of a very important statistical method: TF-IDF. TF-IDF is a weighting technique commonly used in information retrieval and data mining; the algorithm is simple and efficient and is often used for text data cleaning [18]. TF-IDF has two components: term frequency (TF) and inverse document frequency (IDF). TF is the frequency of a word in a single document, calculated as the number of occurrences of the word in the document divided by the total number of words in the document; IDF reflects how rare the word is across all documents, calculated as log(total number of documents in the corpus / (number of documents containing the word + 1)). Once TF and IDF are computed, their product gives the TF-IDF value of the word. The larger a word's TF-IDF in a text, the more important the word is to that document; conversely, the smaller the TF-IDF, the less important the word. The k words with very low TF-IDF values are then collected as the prototype of a stop word list, and after screening out any valuable words among them, a usable stop word list is obtained. During text mining, accuracy can be improved by comparing each segmented word against the stop word list and eliminating it if it appears there, as in the sketch below.
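
A minimal sketch of the stop word list construction just described, assuming tfidf_scores is a precomputed mapping from each word to its corpus-level TF-IDF value:

```python
def candidate_stopwords(tfidf_scores: dict, k: int) -> list:
    """Return the k lowest-TF-IDF words as a stop word list prototype,
    to be screened manually before use (valuable words are removed)."""
    ranked = sorted(tfidf_scores.items(), key=lambda item: item[1])
    return [word for word, _ in ranked[:k]]

def remove_stopwords(tokens: list, stopwords: set) -> list:
    """Drop any segmented word that appears in the stop word list."""
    return [t for t in tokens if t not in stopwords]
```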

TF-IDF algorithm

Natural language text is usually represented as a string of characters, numbers, punctuation marks, and special symbols. The most basic text elements, such as characters or words, build up from the bottom into phrases, sentences, paragraphs, and chapters. To process natural text effectively, a more reasonable formal representation is needed: it should reflect the content of the document as completely as possible (its subject, domain, or structural information) while also being able to distinguish between different documents. Representing documents as vectors makes it possible to attack information retrieval and other text processing problems with modern mathematical tools.

TF-IDF is derived from the classical Vector Space Model (VSM), which was originally applied to intelligent information retrieval systems and is one of the classical models for natural language processing.

In the VSM, a text is given as $D = \{(x_1, w_1), (x_2, w_2), \dots, (x_n, w_n)\}$, where $(x_k, w_k)$ denotes the text feature $x_k$ and its corresponding weight $w_k$. The text $D$ must satisfy the following conditions:

The features $x_k$ $(1 \le k \le n)$ of the text are mutually exclusive.

No internal structure is considered between the features $x_k$ $(1 \le k \le n)$ of the text.

With the above qualifications, the feature terms $\{x_1, x_2, \dots, x_n\}$ can be viewed as an $n$-dimensional coordinate system with the weights $\{w_1, w_2, \dots, w_n\}$ as the corresponding coordinate values. A text is then a vector in the $n$-dimensional space, and $D = (w_1, w_2, \dots, w_n)$ is its concrete vector representation.

Some common words appear frequently in the corpus, but their ability to influence the subject of a text is small. To address this problem, following the vector space modeling idea, feature weights are used to measure the importance of a feature item in the document representation, yielding an advanced variant of the bag-of-words method, term frequency-inverse document frequency (TF-IDF), which assesses the importance of words to the documents in a document set or corpus. For a text $\{x_1, x_2, \dots, x_n\}$, the importance of a word is first measured by its term frequency (TF), the number of times the word appears in the text: $$\mathrm{tf}(x, d) = \frac{\mathrm{count}(x, d)}{\mathrm{size}(d)}$$ where $\mathrm{count}(x, d)$ denotes the total number of occurrences of word $x$ in document $d$, and $\mathrm{size}(d)$ denotes the total number of word occurrences in document $d$. Stop words must be filtered out so that only words with actual meaning are considered. For the remaining words, an importance metric is needed: a non-stop word that appears many times in an article reflects the character of that article and is exactly the kind of keyword we need. In statistical terms, each word is assigned a weight: common stop words receive the smallest weight, general words a small weight, and uncommon words a larger weight, with the weight inversely proportional to the word's document frequency. This is the inverse document frequency (IDF): $$\mathrm{idf}(x, D) = \log_{10} \frac{N}{N_x + 1}$$ where $N$ is the total number of documents in the corpus $D$ and $N_x$ is the number of documents in which word $x$ occurs. If a word does not appear in the corpus the denominator would be zero, so it is usually written as $N_x + 1$. After determining TF and IDF, their product is taken as the word's TF-IDF value; the more important the word is to the content of the article, the larger the value: $$\mathrm{tfidf}(x, d, D) = \mathrm{tf}(x, d) \times \mathrm{idf}(x, D)$$

In this scheme, a word's weight is proportional to its frequency in the document and inversely proportional to the number of documents in the whole corpus that contain it, balancing high-frequency and rare words. A direct transcription of the formulas above is sketched below.
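
The following is a minimal transcription of the three formulas above, with documents represented as lists of tokens:

```python
import math

def tf(word: str, doc: list) -> float:
    """Term frequency: occurrences of the word / total words in the document."""
    return doc.count(word) / len(doc)

def idf(word: str, corpus: list) -> float:
    """Inverse document frequency with the +1 smoothing used in the text."""
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log10(len(corpus) / (n_containing + 1))

def tfidf(word: str, doc: list, corpus: list) -> float:
    """TF-IDF is the product of term frequency and inverse document frequency."""
    return tf(word, doc) * idf(word, corpus)
```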

Semantic analysis

A semantic network is a method of expressing the concepts and relations of a language in network form; it was first applied to natural language processing in 1972 [19]. Unlike vector space modeling, semantic network analysis can mine the semantic associations between word items and, to a certain extent, reassemble the textual structure scattered by word segmentation, thus restoring part of the original information that individual word items cannot express.

Semantic networks are usually represented as graphs called semantic network graphs: directed graphs describing the relationships between lexical items, consisting of nodes representing lexical items and directed arcs representing semantic associations. The direction of an arc represents the direction of the semantic association: if lexical item A points to lexical item B, there is a semantic association between them in which A is the active party and B the passive party. As the number of lexical items increases, the arcs between nodes multiply, eventually forming a complex semantic network, which can be built as sketched below.
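
A small sketch with the networkx library (an assumed tooling choice); the edges are illustrative, reusing co-occurrence counts from Table 2 as weights:

```python
import networkx as nx

G = nx.DiGraph()
# Illustrative associations; weights reuse co-occurrence counts from Table 2.
G.add_edge("heaven", "earth", weight=685)
G.add_edge("heaven", "folklore", weight=823)
G.add_edge("dynasty", "morality", weight=221)
print(list(G.successors("heaven")))  # lexical items that "heaven" points to
```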

Topic models

Topic models, widely used in text mining, can uncover hidden information in text. Their basic idea is to regard a document as a mixture of multiple topics, and each topic as a collection of topic words obeying a certain probability distribution. A topic model describes a probabilistic generative process for documents: first a topic is drawn from the document's topic distribution, then each word in the document is generated from that topic's word distribution; finally, statistical inference reverses this generative process to recover the latent topic set of the document, together with the most probable topic words for each topic.

Among the many topic models, LDA (Latent Dirichlet Allocation) is a very effective one [20]. Its innovation is to introduce Dirichlet prior distributions on top of the PLSA model: the topic distribution $\theta_i$ of document $i$ and the word distribution $\varphi_k$ of topic $k$ follow Dirichlet priors with parameters $\alpha$ and $\beta$, respectively [21]: $$\mathrm{Dir}(\theta_i \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{ik}^{\alpha_k - 1}, \qquad \sum_{k=1}^{K} \theta_{ik} = 1$$ $$\mathrm{Dir}(\varphi_k \mid \beta) = \frac{\Gamma\left(\sum_{j=1}^{V} \beta_j\right)}{\prod_{j=1}^{V} \Gamma(\beta_j)} \prod_{j=1}^{V} \varphi_{kj}^{\beta_j - 1}, \qquad \sum_{j=1}^{V} \varphi_{kj} = 1$$ where $K$ denotes the total number of topics, $\theta_{ik}$ the probability of topic $k$ in document $i$, $V$ the total number of words, and $\varphi_{kj}$ the probability of word $j$ under topic $k$.
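
For illustration, an LDA model with Dirichlet priors of this form can be fitted with the gensim library (an assumed tooling choice; the paper does not name its implementation):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus of pre-segmented documents (placeholder content).
docs = [["heaven", "earth", "legend"], ["confucian", "morality", "dynasty", "heaven"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

K = 2  # number of topics
lda = LdaModel(corpus, num_topics=K, id2word=dictionary,
               alpha=[50.0 / K] * K, eta=0.01)  # priors as in the parameter section below
print(lda.print_topics())
```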

LDA Topic Modeling

The structure of the LDA topic model is shown in Fig. 1, where $\theta$ is the $M \times K$ document-topic distribution matrix, $\varphi$ is the $K \times V$ topic-word distribution matrix, $z$ represents topics, $w$ represents words, $N$ is the number of words in a given document, $M$ is the number of documents in the document set, and $K$ is the number of topics.

Figure 1.

LDA topic model structure

In the LDA topic model, the probability of generating each word of a particular document can be expressed as: $$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$$

Its matrix form is shown in Figure 2.

Figure 2.

Matrix representation of the word generation formula

The LDA topic model document generation process is shown in Figure 3.

Figure 3.

Topic document generation process

Parameter Estimation of the LDA Topic Model

In the LDA document generation process, $\alpha$ is usually taken as $50/K$ (where $K$ is the number of topics), and $\beta$ is usually fixed at 0.01. Since the value of $\alpha$ depends on the number of topics $K$, the optimal number of topics should be determined before parameter estimation. Three methods are commonly used:

Manual judgment based on experience.

Calculation by maximum likelihood estimation.

Calculation using the perplexity, defined as: $$\mathrm{Perplexity}(D) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right\}$$

where $M$ is the number of documents, $N_d$ is the number of words in document $d$, and $p(w_d)$ is the probability of the words $w_d$ occurring in document $d$. The number of topics that minimizes the perplexity is generally taken as optimal; in practical applications, cross-validation should also be used alongside perplexity to prevent overfitting, as in the sketch below.
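
A sketch of perplexity-based selection of the number of topics, again assuming gensim; note that gensim's log_perplexity returns a per-word likelihood bound, so the perplexity itself is exp(-bound):

```python
import numpy as np
from gensim.models import LdaModel

def best_num_topics(corpus, dictionary, candidates=range(2, 21)):
    """Fit an LDA model per candidate K and pick the minimum-perplexity one."""
    scores = {}
    for k in candidates:
        lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
        scores[k] = np.exp(-lda.log_perplexity(corpus))  # perplexity = exp(-bound)
    return min(scores, key=scores.get)  # minimum perplexity is taken as optimal
```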

As for the other parameters $\varphi$ and $\theta$, exact inference in the LDA model is so complex that a precise solution is difficult, so approximate methods are usually adopted, chiefly variational expectation maximization (EM) and Gibbs sampling; of the two, Gibbs sampling is more widely used because it gives good solutions and is easy to derive.

The Gibbs sampling method is an approximate inference algorithm based on Markov chain Monte Carlo (MCMC). Its basic idea is to string together all the word occurrences in the document collection, assign each a topic so as to construct a Markov chain over the state space, and then repeatedly resample the topic of each word with Gibbs sampling until the chain converges to a steady state, which is taken as an approximation of the LDA probability distribution over the document set [22]. A minimal collapsed Gibbs sampler is sketched below.
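
The following minimal collapsed Gibbs sampler is illustrative only, assuming documents are lists of word ids in [0, V); library implementations are preferable in practice.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns estimates of theta and phi."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkv = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial topics
    for d, doc in enumerate(docs):   # initialize counts from the assignments
        for w, t in zip(doc, z[d]):
            ndk[d, t] += 1; nkv[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]          # remove the current assignment
                ndk[d, t] -= 1; nkv[t, w] -= 1; nk[t] -= 1
                # Full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkv[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t          # resample the topic and restore counts
                ndk[d, t] += 1; nkv[t, w] += 1; nk[t] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkv + beta) / (nkv + beta).sum(axis=1, keepdims=True)
    return theta, phi
```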

Text and semantic analysis results
Keyword extraction results
Keyword word frequency

In this paper, the preprocessed text data were segmented and the resulting words and their frequencies counted. After excluding stop words and similar items, the higher-frequency words were selected; since these words come from all of the collected Korean classical literature, the high-frequency words with more than 500 occurrences are listed in Table 1.

High frequency keywords in Korean classical literature works

Keywords Frequency Keywords Frequency
Heaven 2909 Mythology 857
Earth 1965 Official 790
Folklore 1594 Poetry 759
Confucian 1403 Goryeo 748
Buddhism 1392 Three Kingdoms 743
Taoism 1287 Chinese 732
Morality 1266 Benevolence 726
Dynasty 1149 Politeness 716
Legend 1101 Filial piety 703
Aristocracy 986 Loyalty 684
Drama 956 Ethics 675
Art 947 North 621
Qu Yuan 943 Religious belief 579
Identity 942 Analects 569
Maidservant 931 Compassion 568
Imperial envoy 916 Tao Yuanming 557
People 912 Happiness 551
Pariah 903 Disaster 529
Court 901 Elegance 516
China 881 Root 500

As can be seen from Table 1, among the keywords extracted from the collected Korean classical literature, “heaven” appears most frequently, with 2,909 occurrences, and a total of 9 keywords occur more than 1,000 times: “heaven”, “earth”, “folklore”, “Confucian”, “Buddhism”, “Taoism”, “morality”, “dynasty”, and “legend”.

Keyword co-occurrence matrix

Taking the keywords with frequency above 1,000 as an example, the co-occurrence matrix generated from their co-occurrence relationships is shown in Table 2, where each upper- or lower-triangle cell gives the number of times the two keywords appear together in the same work, and K1~K9 refer to “heaven”, “earth”, “folklore”, “Confucian”, “Buddhism”, “Taoism”, “morality”, “dynasty”, and “legend”, respectively. For example, the co-occurrence frequency of “heaven” and “earth” is 685, meaning that these two literary images appear together 685 times in the same works. The more often two keywords appear together in the co-word matrix, the closer their relationship. A sketch of how such a matrix can be computed follows the table.

Keyword co-occurrence matrix

K1 K2 K3 K4 K5 K6 K7 K8 K9
K1 0 685 823 542 293 77 112 302 20
K2 685 0 326 187 83 227 101 128 27
K3 823 326 0 152 54 89 115 378 23
K4 542 187 152 0 85 26 31 11 43
K5 293 83 54 85 0 38 62 48 29
K6 77 227 89 26 38 0 53 92 4
K7 112 101 115 31 62 53 0 221 9
K8 302 128 378 11 48 92 221 0 3
K9 20 27 23 43 29 4 9 3 0
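
For illustration, a co-occurrence matrix like Table 2 can be computed as follows; works is an assumed list of keyword collections, one per literary work, and each pair is counted once per work, since the paper does not specify its exact counting unit:

```python
import numpy as np

def cooccurrence_matrix(works, keywords):
    """Count, for each keyword pair, how often both occur in the same work."""
    idx = {k: i for i, k in enumerate(keywords)}
    m = np.zeros((len(keywords), len(keywords)), dtype=int)
    for work in works:
        present = sorted({idx[k] for k in work if k in idx})
        for a in present:
            for b in present:
                if a != b:
                    m[a, b] += 1  # symmetric counts, diagonal stays zero
    return m
```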

Taking the Korean classic “Samguk Yusa” as an example: in one volume of this well-preserved classical work one finds complete accounts of the founding myths of ancient Korea, such as “The Myth of Dangun” and “The Myth of Jumong”. Korean classical novels divide the entire universe into two spatial structures, heaven and earth, and their narrative style is considered a unique form in which mythological stories combine reality and fiction across the heavenly and earthly realms. “The Myth of Dangun”, for instance, depicts the gods in the sky and the places where they live, giving it a legendary character similar to that of ancient Chinese myths, while scenes such as Hwan-woong's wish to descend from the sky to the human world, and the girl on earth who prays to Hwan-woong that she may bear him children, reflect real life and labor in ancient Korean society. This is why the co-occurrence of “heaven” and “earth” is high in this work.

Keyword association analysis
Calculation of keyword TF-IDF values

For the word set obtained after data cleaning, the TF-IDF statistic of each word is calculated to extract features, so that the importance of each word in the word set is reflected by the magnitude of its statistic, which can also serve as a keyword weighting factor. To extract the words (i.e., keywords) that reflect readers' attention, word frequency is the most suitable base metric, so the TF-IDF value is an appropriate keyword weight. Term frequency TF is simply the number of times a word appears in a text. Inverse document frequency IDF considers how many documents the keyword appears in; unlike a plain count, a keyword that appears in fewer documents is more distinctive, and hence has a higher IDF. When a word has a high TF within one text and a low document frequency across the whole collection, it receives a high TF-IDF weight, allowing important keywords to be screened out, as in the sketch below.
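
For illustration, this weighting can be computed with scikit-learn's TfidfVectorizer (an assumed tooling choice whose built-in formula differs slightly from the log10 variant given earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["heaven earth legend", "confucian morality dynasty heaven"]  # toy corpus
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
# Sum each word's TF-IDF over all documents and rank candidate keywords.
weights = dict(zip(vectorizer.get_feature_names_out(), matrix.sum(axis=0).A1))
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```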

After calculation and screening, this paper selected 15,984 keywords from the 121,585 words contained in the Korean classical literature lexicon, i.e., the words in the top 13% by TF-IDF value. Some of these keywords, ordered by weight, are shown in Table 3.

Keywords TF-IDF weights

Number Keywords TF-IDF Number Keywords TF-IDF
1 Legend 0.0359 11 Earth 0.0200
2 Folklore 0.0347 12 Buddhism 0.0196
3 Dynasty 0.0345 13 Morality 0.0182
4 Goryeo 0.0341 14 Maidservant 0.0160
5 Aristocracy 0.0310 15 Pariah 0.0150
6 Confucian 0.0305 16 Taoism 0.0144
7 Heaven 0.0300 17 Court 0.0132
8 Imperial envoy 0.0280 18 People 0.0126
9 China 0.0246 19 Benevolence 0.0109
10 Identity 0.0211 20 Analects 0.0088

As can be seen from Table 3, among the keywords of classical Korean literary works, “legend” receives the highest reader attention, with a TF-IDF value of 0.0359, and the most mentioned keywords after it are “folklore”, “dynasty”, “Goryeo”, “aristocracy”, “Confucian”, and so on.

The screened keywords were further analyzed to select the five Korean classical works with the highest weights. The TF-IDF values of these works are shown in Table 4, where A~E represent “Tale of Chun-hyang”, “Tale of Heung-bu”, “Samguk Yusa”, “Yong Bi Eo Cheon Ga”, and “Tale of Hong Gil-dong”, respectively.

Korean classical literature works TF-IDF weights

Weight rank Literature works TF-IDF weight
1 A 0.0342
7 C 0.0179
28 B 0.0113
103 E 0.0098
112 D 0.0082

As shown in Table 4, the Korean classical work with the highest weight is “Tale of Chun-hyang”, with a TF-IDF value of 0.0342. The weight rankings of “Tale of Hong Gil-dong” and “Yong Bi Eo Cheon Ga” fall outside the top 100, so these two works are not considered in the subsequent analysis.

Keyword Lift Value Calculation

After identifying the keywords and the major works, the correlation between keywords is examined further by calculating the Lift value between each pair. The Lift values between each work and some keywords are shown in Table 5. The analysis shows that the keyword receiving the highest attention in both “Tale of Chun-hyang” and “Tale of Heung-bu” is “legend”, with Lift values of 0.4093 and 0.3246 respectively, while the keyword with the highest attention in “Samguk Yusa” is “heaven”, with a Lift value of 0.3458. A sketch of the lift computation follows the table.

Attribute keyword lift value analysis results

No. Work Keyword Lift Work Keyword Lift Work Keyword Lift
1 A Legend 0.4093 B Legend 0.3246 C Heaven 0.3458
2 A Morality 0.3637 B Identity 0.3176 C Earth 0.3152
3 A Maidservant 0.3052 B Folklore 0.3102 C Morality 0.2891
4 A Benevolence 0.2519 B Morality 0.2551 C Goryeo 0.2675
5 A Folklore 0.2003 B Aristocracy 0.2173 C Identity 0.2326
6 A Court 0.1833 B Confucian 0.1878 C People 0.2188
7 A People 0.1284 B Pariah 0.1803 C Dynasty 0.2019
8 A Confucian 0.0721 B People 0.1093 C Pariah 0.1476
9 A Goli 0.0504 B Earth 0.0491 C Court 0.1463
10 A Pariah 0.0333 B Dynasty 0.0383 C Legend 0.0963
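
For illustration, lift can be estimated over a collection of works as below; the paper does not give its exact Lift formulation, so this sketch follows the conventional association rule definition, with works as an assumed list of keyword sets:

```python
def lift(works, a: str, b: str) -> float:
    """Standard lift: P(A and B) / (P(A) * P(B)), estimated over works."""
    n = len(works)
    p_a = sum(a in w for w in works) / n
    p_b = sum(b in w for w in works) / n
    p_ab = sum(a in w and b in w for w in works) / n
    return p_ab / (p_a * p_b) if p_a and p_b else 0.0
```
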
Semantic network analysis

Combining the keywords' TF-IDF and Lift values, the entries with the higher values of both were subjected to semantic network analysis, and the resulting semantic analysis map of Korean classical literature is shown in Fig. 4, in which A, B, and C stand for “Tale of Chun-hyang”, “Tale of Heung-bu”, and “Samguk Yusa”, respectively.

Figure 4.

Semantic analysis diagram of keywords of Korean classical literature works

Conclusion

This article uses text mining technology to obtain text data of Korean classical literature, adopts the TF-IDF algorithm and the LDA topic model to process and semantically analyze the texts, and, after keyword extraction and association analysis, visualizes the semantic network of Korean classical literature.

Among the extracted keywords, the most frequent is “heaven” with 2,909 occurrences, followed by “earth” with 1,965, and 9 keywords occur more than 1,000 times. The co-occurrence frequency of “heaven” and “earth” is 685, meaning these two literary images appear together 685 times in the same works.

The keyword with the highest TF-IDF value in Korean classical literature is “legend” (0.0359). The work with the highest TF-IDF value is “Tale of Chun-hyang” (0.0342), ranked first by weight. The other Korean classics ranked within the top 100 are “Samguk Yusa” (0.0179) and “Tale of Heung-bu” (0.0113). In “Tale of Chun-hyang” and “Tale of Heung-bu” the keyword receiving the most attention is “legend”, while in “Samguk Yusa” it is “heaven”.