Identification of Keywords for Legal Documents Categories Using SOM
31 March 2025
About this article
Published online: 31 March 2025
Pages: 33 - 41
Received: 27 April 2024
Accepted: 1 November 2024
DOI: https://doi.org/10.14313/jamris-2025-004
© 2025 Paulina Puchalska et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Figures 1 to 7.

Comparison of accuracy between the NER model and Polish RoBERTa in each category of document
Category | RoBERTa [%] | NER [%] |
---|---|---|
Civil law | 68.88 | 72.00 |
Administrative law | 31.40 | 55.73 |
Pharmaceutical law | 67.80 | 74.36 |
Labor law | 62.88 | 59.15 |
Medical law | 65.56 | 77.05 |
Criminal law | 55.77 | 62.22 |
International law | 0.00 | 35.29 |
Tax law | 12.77 | 77.27 |
Constitutional law | 13.33 | 50.00 |
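
The per-category gap between the two models can be read directly from the table above. The following Python sketch copies the table values and ranks the categories by the NER advantage in percentage points. The names `ACCURACY` and `ner_advantage` are illustrative and do not come from the article.

```python
# Accuracy values [%] copied from the comparison table: (RoBERTa, NER).
ACCURACY = {
    "Civil law": (68.88, 72.00),
    "Administrative law": (31.40, 55.73),
    "Pharmaceutical law": (67.80, 74.36),
    "Labor law": (62.88, 59.15),
    "Medical law": (65.56, 77.05),
    "Criminal law": (55.77, 62.22),
    "International law": (0.00, 35.29),
    "Tax law": (12.77, 77.27),
    "Constitutional law": (13.33, 50.00),
}


def ner_advantage(accuracy):
    """Return (category, NER minus RoBERTa) pairs, largest gap first."""
    gaps = [
        (category, round(ner - roberta, 2))
        for category, (roberta, ner) in accuracy.items()
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for category, gap in ner_advantage(ACCURACY):
        print(f"{category:20s} {gap:+7.2f} pp")
```

In this ranking, Labor law is the only category with a negative gap, that is, the only one where RoBERTa scores higher than the NER model.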
Number of words in each category with different thresholds of acceptance for how unique a keyword must be to a category
Category | Share of words >50% | Share of words >90% | Share of words 100% |
---|---|---|---|
Civil law | 587 | 470 | 466 |
Administrative law | 209 | 192 | 191 |
Pharmaceutical law | 253 | 210 | 210 |
Labor law | 358 | 264 | 263 |
Medical law | 834 | 599 | 595 |
Criminal law | 105 | 90 | 89 |
International law | 14 | 12 | 12 |
Tax law | 42 | 34 | 34 |
Constitutional law | 2 | 2 | 2 |
Total | 2,404 | 1,873 | 1,862 |
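
The thresholds in the table above can be illustrated with a minimal Python sketch. It assumes that the "share" of a word is the fraction of its total occurrences that fall within a single category; this definition is an assumption, since the table caption does not state it. The function name `keywords_by_share` and the toy counts are hypothetical.

```python
from collections import Counter


def keywords_by_share(counts, min_share, inclusive=False):
    """Map each category to the words whose share in it passes the threshold.

    counts: {category: Counter(word -> occurrences in that category)}
    min_share: threshold in [0, 1]; compared strictly unless inclusive=True.
    """
    # Total occurrences of each word across all categories.
    totals = Counter()
    for word_counts in counts.values():
        totals.update(word_counts)

    selected = {category: [] for category in counts}
    for category, word_counts in counts.items():
        for word, n in word_counts.items():
            share = n / totals[word]
            if share > min_share or (inclusive and share == min_share):
                selected[category].append(word)
    return selected


if __name__ == "__main__":
    # Toy data, not taken from the article.
    toy = {
        "Civil law": Counter({"lease": 9, "contract": 6, "court": 5}),
        "Tax law": Counter({"vat": 4, "contract": 1, "court": 5}),
    }
    print(keywords_by_share(toy, 0.5))                  # the ">50%" column
    print(keywords_by_share(toy, 0.9))                  # the ">90%" column
    print(keywords_by_share(toy, 1.0, inclusive=True))  # the "100%" column
```

Under this assumed definition, a word split evenly between two categories ("court" in the toy data) is not a keyword for either at any of the three thresholds.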
Coherence (C) and descriptiveness (D) of classes using RoBERTa embeddings
Category | Number of documents | C | D |
---|---|---|---|
Civil law | 231 | 0.6 | 0.055 |
Administrative law | 151 | 0.53 | 0.08 |
Pharmaceutical law | 298 | 0.68 | 0.048 |
Labor law | 583 | 0.6 | 0.023 |
Medical law | 994 | 0.59 | 0.027 |
Criminal law | 20 | 0.53 | 0.235 |
International law | 137 | 0.61 | 0.076 |
Tax law | 25 | 0.45 | 0.195 |
Constitutional law | 280 | 0.64 | 0.056 |
Number of words in each category during each step of data preparation
Category | Before | Step 1 | Step 2 | Result |
---|---|---|---|---|
Civil law | 10787 | 2764 | 1023 | 470 |
Administrative law | 5182 | 1786 | 562 | 192 |
Pharmaceutical law | 4742 | 1489 | 558 | 210 |
Labor law | 9707 | 2081 | 828 | 264 |
Medical law | 17503 | 3447 | 1407 | 599 |
Criminal law | 4579 | 1347 | 418 | 90 |
International law | 374 | 199 | 40 | 12 |
Tax law | 1197 | 468 | 145 | 34 |
Constitutional law | 73 | 40 | 6 | 2 |
Total | 54,144 | 13,621 | 4,987 | 1,873 |
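
To see how strongly each preparation step shrinks the vocabulary, the counts in the table above can be converted to retention percentages. This Python sketch copies the table values; the names `WORD_COUNTS` and `retention` are illustrative and not part of the article.

```python
# Word counts copied from the table: (before, step 1, step 2, result).
WORD_COUNTS = {
    "Civil law": (10787, 2764, 1023, 470),
    "Administrative law": (5182, 1786, 562, 192),
    "Pharmaceutical law": (4742, 1489, 558, 210),
    "Labor law": (9707, 2081, 828, 264),
    "Medical law": (17503, 3447, 1407, 599),
    "Criminal law": (4579, 1347, 418, 90),
    "International law": (374, 199, 40, 12),
    "Tax law": (1197, 468, 145, 34),
    "Constitutional law": (73, 40, 6, 2),
}


def retention(counts):
    """Percentage of the initial word count left after each later stage."""
    before = counts[0]
    return [round(100 * n / before, 1) for n in counts[1:]]


if __name__ == "__main__":
    for category, counts in WORD_COUNTS.items():
        print(f"{category:20s} {retention(counts)}")
    # Column-wise totals, matching the last row of the table.
    totals = tuple(sum(column) for column in zip(*WORD_COUNTS.values()))
    print(f"{'Total':20s} {retention(totals)}")
```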
Number, coherence and descriptiveness of documents before and after using exclusively strong keywords
Category | Documents (before) | Coherence (before) | Descriptiveness (before) | Documents (after) | Coherence (after) | Descriptiveness (after) |
---|---|---|---|---|---|---|
Civil law | 863 | 0.51 | 0.031 | 223 | 0.74 | 0.041 |
Administrative law | 164 | 0.85 | 0.054 | 84 | 0.85 | 0.062 |
Pharmaceutical law | 212 | 0.79 | 0.043 | 101 | 0.8 | 0.064 |
Labor law | 327 | 0.79 | 0.043 | 154 | 0.84 | 0.055 |
Medical law | 846 | 0.68 | 0.023 | 377 | 0.71 | 0.034 |
Criminal law | 192 | 0.82 | 0.288 | 78 | 0.83 | 0.325 |
International law | 7 | 0.94 | 0.09 | 3 | 0.95 | 0.123 |
Tax law | 61 | 0.9 | 0.577 | 34 | 0.89 | 0.550 |
Constitutional law | 2 | 1 | 0.038 | 2 | 1.00 | 0.066 |
Total | 4,402 | | | 1,056 | | |