Open Access

Identification of Keywords for Legal Documents Categories Using SOM

March 31, 2025


Figure 1.

Counts of example words “Może” (maybe), “Stażysta” (intern) and “Lekarz” (doctor) in each category

Figure 2.

SOM visualization using GloVe embeddings

Figure 3.

SOM visualization of decision borders, built using NER model

Figure 4.

All keywords displayed on SOM

Figure 5.

Strong keywords displayed on SOM

Figure 6.

SOM obtained from RoBERTa embeddings

Figure 7.

Strongest keyword per query from RoBERTa model

Comparison of accuracy between the NER model and Polish RoBERTa in each category of document

Category            RoBERTa [%]  NER [%]
Civil law                 68.88    72.00
Administrative law        31.40    55.73
Pharmaceutical law        67.80    74.36
Labor law                 62.88    59.15
Medical law               65.56    77.05
Criminal law              55.77    62.22
International law          0.00    35.29
Tax law                   12.77    77.27
Constitutional law        13.33    50.00

Number of words in each category at different acceptance thresholds for how unique a keyword must be to that category

                      Share of words
Category                >50%    >90%    100%
Civil law                587     470     466
Administrative law       209     192     191
Pharmaceutical law       253     210     210
Labor law                358     264     263
Medical law              834     599     595
Criminal law             105      90      89
International law         14      12      12
Tax law                   42      34      34
Constitutional law         2       2       2
Total                  2,404   1,873   1,862
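
The threshold columns above can be illustrated with a minimal sketch. It assumes that a word's "share" in a category is the fraction of that word's total occurrences that fall in the category; the excerpt does not state the exact definition, so this is an assumption. The function names and the toy corpus are hypothetical and are not the authors' data or code.

```python
from collections import Counter, defaultdict

def category_shares(docs):
    """Return {word: {category: share}} for (category, tokens) pairs.

    Assumption: share = occurrences of the word in the category divided by
    its occurrences across all categories.
    """
    counts = defaultdict(Counter)
    for category, tokens in docs:
        for token in tokens:
            counts[token.lower()][category] += 1
    shares = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        shares[word] = {cat: n / total for cat, n in per_category.items()}
    return shares

def count_keywords(shares, accept):
    """Count words per category whose top share satisfies accept(share)."""
    result = Counter()
    for word, per_category in shares.items():
        category, share = max(per_category.items(), key=lambda kv: kv[1])
        if accept(share):
            result[category] += 1
    return result

# Made-up toy corpus, for illustration only.
toy_docs = [
    ("Medical law", ["lekarz", "stażysta", "może"]),
    ("Medical law", ["lekarz", "może"]),
    ("Labor law", ["stażysta", "może", "stażysta"]),
    ("Civil law", ["może"]),
]

toy_shares = category_shares(toy_docs)
over_50 = count_keywords(toy_shares, lambda s: s > 0.5)
over_90 = count_keywords(toy_shares, lambda s: s > 0.9)
exactly_100 = count_keywords(toy_shares, lambda s: s == 1.0)
```

In this toy corpus "lekarz" occurs only in Medical law, so it passes every threshold; "stażysta" has a two-thirds share in Labor law and passes only the >50% threshold; "może" is spread across three categories with a top share of exactly one half and passes none.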

Coherence (C) and descriptiveness (D) of classes using RoBERTa embeddings

Category              Count     C      D
Civil law               231  0.60  0.055
Administrative law      151  0.53  0.080
Pharmaceutical law      298  0.68  0.048
Labor law               583  0.60  0.023
Medical law             994  0.59  0.027
Criminal law             20  0.53  0.235
International law       137  0.61  0.076
Tax law                  25  0.45  0.195
Constitutional law      280  0.64  0.056

Number of words in each category during each step of data preparation

Category              Before  Step 1  Step 2  Result
Civil law             10,787   2,764   1,023     470
Administrative law     5,182   1,786     562     192
Pharmaceutical law     4,742   1,489     558     210
Labor law              9,707   2,081     828     264
Medical law           17,503   3,447   1,407     599
Criminal law           4,579   1,347     418      90
International law        374     199      40      12
Tax law                1,197     468     145      34
Constitutional law        73      40       6       2
Total                 54,144  13,621   4,987   1,873

Number, coherence and descriptiveness of documents before and after using exclusively strong keywords

                      Before changes                         After changes
Category              Documents  Coherence  Descriptiveness  Documents  Coherence  Descriptiveness
Civil law                   863       0.51            0.031        223       0.74            0.041
Administrative law          164       0.85            0.054         84       0.85            0.062
Pharmaceutical law          212       0.79            0.043        101       0.80            0.064
Labor law                   327       0.79            0.043        154       0.84            0.055
Medical law                 846       0.68            0.023        377       0.71            0.034
Criminal law                192       0.82            0.288         78       0.83            0.325
International law             7       0.94            0.090          3       0.95            0.123
Tax law                      61       0.90            0.577         34       0.89            0.550
Constitutional law            2       1.00            0.038          2       1.00            0.066
Sum                       4,402                                  1,056
Language:
English
Publication frequency:
4 issues per year
Journal subjects:
Computer Science, Artificial Intelligence, Engineering, Electrical Engineering, Measurement and Control Engineering, Mechanical Engineering, Fundamentals of Mechanical Engineering