Identification of Keywords for Legal Documents Categories Using SOM
31 Mar 2025
About the article
Publication date: 31 Mar 2025
Page range: 33 - 41
Received: 27 Apr 2024
Accepted: 01 Nov 2024
DOI: https://doi.org/10.14313/jamris-2025-004
© 2025 Paulina Puchalska et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Comparison of accuracy between the NER model and Polish RoBERTa for each document category
Category | RoBERTa [%] | NER [%] |
---|---|---|
Civil law | 68.88 | 72.00 |
Administrative law | 31.40 | 55.73 |
Pharmaceutical law | 67.80 | 74.36 |
Labor law | 62.88 | 59.15 |
Medical law | 65.56 | 77.05 |
Criminal law | 55.77 | 62.22 |
International law | 0.00 | 35.29 |
Tax law | 12.77 | 77.27 |
Constitutional law | 13.33 | 50.00 |
Number of words in each category with different thresholds of acceptance for how unique a keyword must be to a category
Category | Share of words >50% | Share of words >90% | Share of words 100% |
---|---|---|---|
Civil law | 587 | 470 | 466 |
Administrative law | 209 | 192 | 191 |
Pharmaceutical law | 253 | 210 | 210 |
Labor law | 358 | 264 | 263 |
Medical law | 834 | 599 | 595 |
Criminal law | 105 | 90 | 89 |
International law | 14 | 12 | 12 |
Tax law | 42 | 34 | 34 |
Constitutional law | 2 | 2 | 2 |
Total | 2,404 | 1,873 | 1,862 |
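
The thresholds in the table above can be illustrated with a minimal sketch. It assumes that a word's "share" in a category is its number of occurrences in that category divided by its occurrences across all categories; this definition is not stated on this page, and the function name `keywords_by_share` and the toy counts are hypothetical, not taken from the article.

```python
from collections import Counter


def keywords_by_share(counts_per_category, threshold):
    """Select, per category, the words whose share exceeds the threshold.

    Assumption (not confirmed by the source): share = occurrences of the
    word in the category / occurrences of the word in all categories.
    A word found only in one category (share of 100%) is always kept.
    """
    # Total occurrences of every word across all categories.
    totals = Counter()
    for counts in counts_per_category.values():
        totals.update(counts)

    selected = {category: set() for category in counts_per_category}
    for category, counts in counts_per_category.items():
        for word, n in counts.items():
            share = n / totals[word]
            # Strictly greater than the threshold, or exclusive to the category.
            if share > threshold or n == totals[word]:
                selected[category].add(word)
    return selected


# Hypothetical toy counts, for illustration only.
toy_counts = {
    "Civil law": {"contract": 19, "court": 5, "tenant": 4},
    "Criminal law": {"contract": 1, "court": 5, "offence": 7},
}

for threshold in (0.5, 0.9, 1.0):
    print(threshold, keywords_by_share(toy_counts, threshold))
```

Raising the threshold can only shrink each category's keyword set, which matches the non-increasing counts from the >50% column to the 100% column.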
Coherence (C) and descriptiveness (D) of classes using RoBERTa embeddings
Category | | C | D |
---|---|---|---|
Civil law | 231 | 0.6 | 0.055 |
Administrative law | 151 | 0.53 | 0.08 |
Pharmaceutical law | 298 | 0.68 | 0.048 |
Labor law | 583 | 0.6 | 0.023 |
Medical law | 994 | 0.59 | 0.027 |
Criminal law | 20 | 0.53 | 0.235 |
International law | 137 | 0.61 | 0.076 |
Tax law | 25 | 0.45 | 0.195 |
Constitutional law | 280 | 0.64 | 0.056 |
Number of words in each category during each step of data preparation
Category | Before | Step 1 | Step 2 | Result |
---|---|---|---|---|
Civil law | 10787 | 2764 | 1023 | 470 |
Administrative law | 5182 | 1786 | 562 | 192 |
Pharmaceutical law | 4742 | 1489 | 558 | 210 |
Labor law | 9707 | 2081 | 828 | 264 |
Medical law | 17503 | 3447 | 1407 | 599 |
Criminal law | 4579 | 1347 | 418 | 90 |
International law | 374 | 199 | 40 | 12 |
Tax law | 1197 | 468 | 145 | 34 |
Constitutional law | 73 | 40 | 6 | 2 |
Total | 54,144 | 13,621 | 4,987 | 1,873 |
Number, coherence and descriptiveness of documents before and after using exclusively strong keywords
Category | Number of documents (before changes) | Coherence score (before changes) | Descriptiveness score (before changes) | Number of documents (after changes) | Coherence score (after changes) | Descriptiveness score (after changes) |
---|---|---|---|---|---|---|
Civil law | 863 | 0.51 | 0.031 | 223 | 0.74 | 0.041 |
Administrative law | 164 | 0.85 | 0.054 | 84 | 0.85 | 0.062 |
Pharmaceutical law | 212 | 0.79 | 0.043 | 101 | 0.8 | 0.064 |
Labor law | 327 | 0.79 | 0.043 | 154 | 0.84 | 0.055 |
Medical law | 846 | 0.68 | 0.023 | 377 | 0.71 | 0.034 |
Criminal law | 192 | 0.82 | 0.288 | 78 | 0.83 | 0.325 |
International law | 7 | 0.94 | 0.09 | 3 | 0.95 | 0.123 |
Tax law | 61 | 0.9 | 0.577 | 34 | 0.89 | 0.550 |
Constitutional law | 2 | 1 | 0.038 | 2 | 1.00 | 0.066 |
 | 4,402 | | | 1,056 | | |