Open access

Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning

01 Sep 2025


Figure 1.

Research framework for financial fraud detection.

Figure 2.

Annual report semantic feature extraction framework.

Figure 3.

SHAP feature attribution diagram.

Figure 4.

Feature importance analysis (Top 10% Features).

Figure 5.

The relationship between the top 10% most important features and corporate fraud risk (beeswarm plot).

Figure 6.

Relationship between annual report semantic features (f119, f11) and corporate fraud risk.

Corporate Governance Indicators (11 Items)

Indicator Type | Indicator Variable | Variable Definition and Value
Ownership Structure | ShrHolder1 | First Largest Shareholder Ownership Ratio
Ownership Structure | ShrHolder3 | Sum of Ownership Ratios of Top Three Shareholders
Ownership Structure | ShrHolder5 | Sum of Ownership Ratios of Top Five Shareholders
Ownership Structure | ShrHolder10 | Sum of Ownership Ratios of Top Ten Shareholders
Ownership Structure | StOwRt | State-Owned Share Ratio
Internal Executive Characteristics | Cmceo_Dum | Whether serving as both Chairman and CEO (1=Yes, 0=No)
Internal Executive Characteristics | Cmgm_Dum | Whether serving as both Chairman and General Manager (1=Yes, 0=No)
Internal Executive Characteristics | MShrRat | Management Ownership Ratio: Percentage of company shares held by management
Internal Executive Characteristics | BShrRat | Board of Directors Ownership Ratio: Percentage of company shares held by all board members
Internal Executive Characteristics | SShrRat | Supervisory Board Ownership Ratio: Percentage of company shares held by all supervisory board members
Internal Executive Characteristics | InDrcRat | Proportion of Independent Directors: Ratio of independent directors to the total number of directors

Annual report feature indicators

Indicator Type | Indicator Variable | Variable Definition
Tone | LM_tone | Tone value of the full annual report, calculated based on the LM dictionary.
Tone | MDA_tone | Tone value of the annual report’s MD&A section, calculated based on the LM dictionary.
Semantic Features | f1, f2, ..., fn | Semantic feature vector extracted from the MD&A (n-dimensional).
Readability | Size | Character count of the annual report’s MD&A section
Readability | Ls | Proportion of long sentences
Readability | Finance_density | Density of financial terms: the number of financial and accounting terms per hundred characters
Readability | Obscure_density | Density of rare words: the number of less frequently used Chinese words per hundred characters
Readability | Transition_density | Density of adversative conjunctions: the number of Chinese adversative conjunctions per hundred characters
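
The readability indicators in the table above are counts and densities per hundred characters, so they can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the term lists, the sentence delimiters, and the 50-character long-sentence threshold are hypothetical placeholders, because the source does not specify them. Obscure_density is left out because it would need a word-frequency lexicon that the source does not name.

```python
import re
from typing import Dict, List

# Hypothetical placeholder vocabularies; the source does not list the actual terms.
FINANCE_TERMS = ["营业收入", "净利润", "现金流"]
TRANSITION_TERMS = ["但是", "然而", "不过"]


def density_per_hundred(text: str, terms: List[str]) -> float:
    """Number of term occurrences per hundred characters of text."""
    if not text:
        return 0.0
    hits = sum(text.count(term) for term in terms)
    return 100.0 * hits / len(text)


def long_sentence_ratio(text: str, threshold: int = 50) -> float:
    """Share of sentences longer than `threshold` characters (threshold is an assumption)."""
    sentences = [s for s in re.split(r"[。！？!?]", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s) > threshold for s in sentences) / len(sentences)


def readability_features(mda_text: str) -> Dict[str, float]:
    """Collect the readability indicators that need no external lexicon."""
    return {
        "Size": float(len(mda_text)),
        "Ls": long_sentence_ratio(mda_text),
        "Finance_density": density_per_hundred(mda_text, FINANCE_TERMS),
        "Transition_density": density_per_hundred(mda_text, TRANSITION_TERMS),
    }


if __name__ == "__main__":
    print(readability_features("营业收入下降。但是净利润增长。"))
```

Counting by substring keeps the sketch dependency-free; a production version would more likely segment the text into words first so that overlapping terms are not double counted.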

Ablation analysis on multi-feature fusion

Features | AUC | ΔAUC | Accuracy | ΔAccuracy | F1 | ΔF1
Financial | 0.820 | - | 0.740 | - | 0.694 | -
Financial + Textual | 0.880 | +6.0% | 0.798 | +5.8% | 0.780 | +8.6%
Financial + Textual + Corporate governance | 0.894 | +1.4% | 0.812 | +1.4% | 0.796 | +1.6%
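
The Δ columns in the table above match the difference between each row and the row directly before it, expressed in percentage points (for example 0.880 - 0.820 = 0.060, reported as +6.0%). The short sketch below recomputes them from the reported metrics; the function name `stepwise_deltas` is introduced here for illustration only.

```python
from typing import Dict, List, Tuple

# Metric values copied from the ablation table above.
ROWS: List[Tuple[str, Dict[str, float]]] = [
    ("Financial", {"AUC": 0.820, "Accuracy": 0.740, "F1": 0.694}),
    ("Financial + Textual", {"AUC": 0.880, "Accuracy": 0.798, "F1": 0.780}),
    ("Financial + Textual + Corporate governance",
     {"AUC": 0.894, "Accuracy": 0.812, "F1": 0.796}),
]


def stepwise_deltas(rows: List[Tuple[str, Dict[str, float]]]) -> List[Tuple[str, Dict[str, float]]]:
    """Gain of each row over the previous row, in percentage points."""
    deltas = []
    for (_, prev), (name, curr) in zip(rows, rows[1:]):
        gains = {metric: round((curr[metric] - prev[metric]) * 100, 1) for metric in curr}
        deltas.append((name, gains))
    return deltas


if __name__ == "__main__":
    for name, gains in stepwise_deltas(ROWS):
        print(name, gains)
```

Reading the deltas as percentage points rather than relative changes matters when comparing rows: the gain from adding textual features is measured on the same absolute scale as the later gain from governance indicators.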

Experimental results of word embedding models (Input Dimension 256)

Embedding model | Introduction | Accuracy | AUC | F1
Doubao-embedding_v240715 | A high-performance text embedding model developed by ByteDance’s Volcano Engine, part of the Doubao large model ecosystem, specializing in generating high-quality semantic vectors. | 0.742 | 0.817 | 0.733
Embedding-V1 | Baidu’s general-purpose text embedding model, incorporating the company’s NLP expertise. It supports Chinese and multilingual tasks, generating high-dimensional dense vectors (1024 dimensions). | 0.738 | 0.812 | 0.715
BGE-large-zh | An open-source Chinese text embedding model from Beijing Academy of Artificial Intelligence (BAAI). Fine-tuned on RoBERTa-large architecture, it delivers excellent performance on Chinese text. | 0.739 | 0.802 | 0.709
Text2Vec-base-chinese | A classic open-source Chinese text embedding model based on transformer architecture, providing static contextual embeddings. | 0.609 | 0.642 | 0.577

Sample summaries of corporate annual reports

Chinese | English
藏格控股 2019 年年度报告显示, 管理层围绕年度目标努力基本完成任务。当年钾肥生产 108.29 万吨, 营收 20.18 亿元, 净利润 3.08 亿元, 业绩下降因产品销量、价格、产量及成本、存货跌价等因素。 | The 2019 annual report of Zangge Holdings indicates that the management team largely accomplished its annual objectives. Potassium fertilizer production reached 1.0829 million tons, with revenue of 2.018 billion yuan and net profit of 308 million yuan. The decline in performance was attributed to factors such as product sales volume, price, production volume, costs, and inventory depreciation.
报告期主要工作包括推进电池级碳酸锂项目投产、改造技术工艺降成本、坚持创新发展获补助资金。多项财务数据有变动, 如营业收入合计同比减 24.70%, 销售、管理、财务、研发费用有不同变化。 | Key initiatives during the reporting period included advancing the commissioning of the battery-grade lithium carbonate project, improving technical processes to reduce costs, and securing subsidy funds through innovation-driven development. Several financial metrics showed changes, such as a 24.70% year-on-year decrease in total operating revenue, with variations in sales, management, financial, and R&D expenses.
研发投入大幅增长, 现金流方面各活动现金流量有升降。非主营业务投资收益为负, 还涉及资产减值、营业外收支等。资产及负债有比重增减, 获取重大股权投资, 收购西藏巨龙铜业 37%股权, 本期投资亏损 1.68 亿元。 | R&D investment increased significantly, while cash flows from various activities fluctuated. Non-core business investment returns were negative, and the report also covered asset impairments, non-operating income, and expenses. Assets and liabilities experienced shifts in proportions, with significant equity investments, including the acquisition of a 37% stake in Xizang Julong Copper for a loss of 168 million yuan during the period.
募集资金使用有进展, 曾自筹资金投入募投项目, 2019 年募投项目结项节余补流。公司积累了经验和技术储备, 目前面临钾肥价格、新业务开拓、环保等风险,将采取相应措施应对。还存在股份质押、业绩补偿股份回购注销风险, 补偿义务人正积极处置。 | Progress was made in the use of raised funds, with self-raised capital allocated to fundraising projects. In 2019, surplus funds from completed fundraising projects were redirected to working capital. The company has accumulated experience and technical reserves but faces risks such as potassium fertilizer price fluctuations, new business expansion, and environmental challenges, for which corresponding measures will be implemented. Risks related to share pledges, performance compensation share repurchases, and cancellations remain, with obligors actively addressing them.
未来公司将在稳步发展氯化钾基础上开发高附加值产品, 利用老卤生产新能源材料, 开发锂系列产品, 加强技术和人才储备等。2019 年多次通过电话沟通方式接待个人咨询, 内容涉及公司多个方面情况, 多数未提供书面报告。 | Moving forward, the company will focus on developing high-value-added products based on stable potassium chloride production, utilizing brine for new energy materials, expanding lithium-based products, and strengthening technical and talent reserves. In 2019, the company responded to numerous individual inquiries via phone consultations, covering various aspects of the company, most of which did not result in written reports.

Performance comparison of financial fraud prediction models with multi-dimensional embeddings

Semantic features | Dimension | Accuracy | Precision | Recall | F1 | AUC
Embedding vectors | 1024 | 0.732 | 0.715 | 0.693 | 0.704 | 0.811
Embedding vectors | 512 | 0.732 | 0.706 | 0.714 | 0.710 | 0.805
Embedding vectors | 256 | 0.742 | 0.713 | 0.733 | 0.733 | 0.817
Embedding vectors | 128 | 0.695 | 0.656 | 0.705 | 0.679 | 0.762

Model performance under imbalanced data

Fraud to Non-Fraud Sample Ratio | Accuracy | AUC | F1 | Precision | Recall
1:1 | 0.742 | 0.817 | 0.713 | 0.713 | 0.733
1:2 | 0.732 | 0.770 | 0.568 | 0.564 | 0.571
1:5 | 0.856 | 0.755 | 0.470 | 0.511 | 0.430
1:10 | 0.819 | 0.613 | 0.222 | 0.273 | 0.188
1:20 | 0.946 | 0.586 | 0.167 | 0.273 | 0.120
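
In the table above, accuracy tends to rise as the classes become more imbalanced while AUC, F1, and recall fall. A short calculation suggests why accuracy alone is hard to interpret here: a trivial classifier that labels every firm as non-fraud already reaches the baseline computed below. This is an illustrative sketch, not part of the source analysis, and it assumes that the evaluation split keeps the same class ratio as the row label, which the source does not state.

```python
def majority_baseline_accuracy(fraud: int, non_fraud: int) -> float:
    """Accuracy of always predicting the larger class for a fraud:non-fraud ratio."""
    return max(fraud, non_fraud) / (fraud + non_fraud)


# Reported accuracies copied from the table above, keyed by k in the ratio 1:k.
REPORTED_ACCURACY = {1: 0.742, 2: 0.732, 5: 0.856, 10: 0.819, 20: 0.946}

if __name__ == "__main__":
    for k, reported in REPORTED_ACCURACY.items():
        baseline = majority_baseline_accuracy(1, k)
        print(f"1:{k}  baseline={baseline:.3f}  reported={reported:.3f}")
```

Under that assumption, the reported accuracy at 1:10 and 1:20 sits below the all-non-fraud baseline, so AUC, F1, and recall are the more informative columns for judging the effect of imbalance.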

Comparative performance of ML algorithms in fraud detection task

ML Algorithms | AUC | Accuracy | F1
LR | 0.750 | 0.685 | 0.679
LightGBM | 0.868 | 0.791 | 0.787
XGBoost | 0.871 | 0.792 | 0.789
CatBoost | 0.852 | 0.776 | 0.773

Impact of textual features: Ablation analysis

Features | AUC | ΔAUC | Accuracy | ΔAccuracy | F1 | ΔF1
Financial | 0.820 | - | 0.740 | - | 0.694 | -
Financial+Tone | 0.828 | +0.8% | 0.749 | +0.9% | 0.727 | +3.3%
Financial+Tone+Readability | 0.841 | +1.3% | 0.766 | +1.7% | 0.745 | +1.8%
Financial+Tone+Readability+Semantics | 0.880 | +3.9% | 0.798 | +3.2% | 0.780 | +3.5%

The ML models’ configuration for fraud detection

Algorithms | Parameters | Value | Meaning
LogisticRegression | C; max_iter; penalty; solver; tol | 100; 300; ‘l1’; ‘liblinear’; 1.00E-06 | The reciprocal of regularization strength; the maximum number of iterations for solver convergence; the type of regularization (‘l1’ denotes L1 regularization); the optimization algorithm; the error tolerance for the stopping criterion
XGBoost | subsample; n_estimators; min_child_weight; max_depth; learning_rate; gamma; colsample_bytree | 0.899449490.10.0002780.6 | The sample proportion per tree during training; the boosting tree count; the minimum child node weight fraction; the tree’s maximum depth; the learning rate; the minimum loss reduction for node splits; the feature proportion per tree during construction
LightGBM | colsample_bytree; max_depth; min_child_samples; n_estimators; num_leaves; random_state; reg_alpha; reg_lambda; subsample | 0.6692804000.010.0010.8 | Feature proportion per tree; maximum tree depth; minimum child node samples; number of boosting trees; maximum leaf nodes; random seed for reproducibility; L1 regularization coefficient; L2 regularization coefficient; sample proportion per tree
CatBoost | subsample; min_data_in_leaf; l2_leaf_reg; learning_rate; iterations; depth; colsample_bylevel; border_count | 0.810.010.128060.840 | The proportion of samples for training each tree; minimum samples per leaf node; L2 regularization coefficient; learning rate for step size control; number of boosting trees; maximum tree depth; feature proportion per level; number of bins for numerical features

Financial Indicators (19 Items)

Indicator Type | Indicator Variable | Variable Definition and Operation
Solvency | Aslbrt | Asset Liability Ratio; Total Liabilities / Total Assets
Solvency | Curtrt | Current Ratio; Total Current Assets / Total Current Liabilities
Solvency | Qikrt | Quick Ratio; (Current Assets - Inventory) / Current Liabilities
Solvency | Equrt | Equity Ratio; Total Liabilities / Total Shareholders’ Equity
Solvency | Pmcptdbrt | Long-term Debt Ratio; Total Non-current Liabilities / (Total Shareholders’ Equity + Total Non-current Liabilities)
Profitability | Roe_1 | Return on Equity (ROE); Net Profit / Total Shareholders’ Equity
Profitability | Salnpm | Net Profit Margin; Net Profit / Revenue
Profitability | Salgm | Gross Profit Margin; Gross Profit / Revenue
Profitability | Salpm | Profit Margin; Profit / Revenue
Growth Potential | Atrt | Total Assets Growth Rate; (Current Period Adjusted Figure - Prior Year Same Period Adjusted Figure) / ABS(Prior Year Same Period Adjusted Figure)
Growth Potential | Opicrt | Operating Revenue Growth Rate; (Current Period Adjusted Figure - Prior Year Same Period Adjusted Figure) / ABS(Prior Year Same Period Adjusted Figure)
Growth Potential | Oirt | Operating Profit Growth Rate; (Current Period Adjusted Figure - Prior Year Same Period Adjusted Figure) / ABS(Prior Year Same Period Adjusted Figure)
Operational Efficiency | Actrcbto | Accounts Receivable Turnover; Revenue / ((Beginning Net Accounts Receivable + Ending Net Accounts Receivable) / 2)
Operational Efficiency | Fxastto | Fixed Asset Turnover; Total Revenue / ((Beginning Fixed Assets + Ending Fixed Assets) / 2)
Operational Efficiency | Totastto | Total Asset Turnover; Total Revenue / ((Beginning Total Assets + Ending Total Assets) / 2)
Operational Efficiency | Actpayto | Accounts Payable Turnover; Cost of Goods Sold (COGS) / ((Beginning Accounts Payable + Ending Accounts Payable) / 2)
Cash Flow Adequacy | Opncf_rev | Net operating cash flow / Total Revenue
Cash Flow Adequacy | Opncfrt | Percentage of net cash flow from operating activities; Net cash flow from operating activities / (Net cash flow from operating activities + Net cash flow from investing activities + Net cash flow from financing activities)
Cash Flow Adequacy | Csopindex | Cash flow to sales ratio; Net cash flow from operating activities / Cash from operations
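
Several of the indicator definitions in the table above reduce to simple ratios of statement items, which the sketch below works through for a handful of them. It is an illustration under stated assumptions rather than the authors' pipeline: the dictionary keys are hypothetical names for statement items, and the growth rate divides by the absolute value of the prior-period figure, which is one reading of the ABS note in the table.

```python
from typing import Dict


def average(begin: float, end: float) -> float:
    """Average of the beginning and ending balance, used by the turnover ratios."""
    return (begin + end) / 2


def growth_rate(current: float, prior: float) -> float:
    """(current - prior) / |prior|; the absolute value keeps the sign meaningful when prior < 0."""
    return (current - prior) / abs(prior)


def financial_indicators(fs: Dict[str, float]) -> Dict[str, float]:
    """Compute a few indicators from a flat dict of statement items (hypothetical keys)."""
    return {
        "Aslbrt": fs["total_liabilities"] / fs["total_assets"],
        "Qikrt": (fs["current_assets"] - fs["inventory"]) / fs["current_liabilities"],
        "Roe_1": fs["net_profit"] / fs["shareholders_equity"],
        "Salnpm": fs["net_profit"] / fs["revenue"],
        "Atrt": growth_rate(fs["total_assets"], fs["total_assets_prior"]),
        "Totastto": fs["revenue"] / average(fs["total_assets_begin"], fs["total_assets"]),
    }


# Made-up figures purely to exercise the functions; they do not come from the source.
EXAMPLE_STATEMENT = {
    "total_liabilities": 40.0,
    "total_assets": 100.0,
    "total_assets_prior": 80.0,
    "total_assets_begin": 80.0,
    "current_assets": 50.0,
    "inventory": 10.0,
    "current_liabilities": 20.0,
    "net_profit": 5.0,
    "shareholders_equity": 60.0,
    "revenue": 80.0,
}

if __name__ == "__main__":
    for name, value in financial_indicators(EXAMPLE_STATEMENT).items():
        print(f"{name}: {value:.4f}")
```

Dividing by the absolute value matters for firms with a negative prior-period figure, such as an operating loss: without it, an improvement from -100 to 90 would be reported as a negative growth rate.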

Corporate fraud type frequency and incidence statistics

Violation Type | Description | Count | Frequency
Delayed disclosure | Failure to disclose material information within the prescribed timeframe, violating the timely disclosure requirements under securities regulations. | 72 | 16%
False statement | Intentional misrepresentation or falsification of financial statements, reports, or other disclosures, including misleading statements that distort the true financial condition or operational performance of the company. | 57 | 13%
Significant omissions | Deliberate or negligent failure to disclose material information that could significantly impact investors’ decisions, thereby violating the principle of completeness in disclosure. | 51 | 11%
Inflated assets | Misstating the value or existence of assets in financial statements, including overstating asset values or recording non-existent assets. | 28 | 6%
Fabricated profits | Artificially inflating or fabricating profits through improper accounting practices, such as recognizing revenue prematurely or manipulating expense records. | 9 | 2%
Guarantee breaches | A listed company or its affiliates providing guarantees in violation of regulations or beyond authorized limits, harming the interests of the company and investors. | 8 | 2%
Misleading disclosures | Providing inaccurate, incomplete, or misleading information in disclosures, whether in financial reports, announcements, or other regulatory filings, which misleads investors or regulators. | 5 | 1%
Others | - | 78 | 17%

Semantic feature extraction based on the Doubao LLMs

LLM | Prompt | Input | Output
Doubao-pro-32k_v241215 General Model | You are an outstanding natural language processing assistant. Please summarize the input text, ensuring that no critical information is lost or misrepresented. The summary should be fluent, logically clear, and complete and not exceed 256 characters. | Chunked text from MD&A (Original text) | Chunked text summary
Doubao-pro-32k_v24l412 General Model | You are an outstanding natural language processing assistant. Please summarize the input text, ensuring that no critical information is lost or misrepresented. The summary should be fluent, logically clear, and complete and not exceed 1024 characters. | Joined summaries of chunked text | Full-text summary
Doubao-embedding v240405 Embedding Vector Model | None | Full-text summary | Embedding Vector
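
The three rows of the table above describe a pipeline: summarize each MD&A chunk to at most 256 characters, join the chunk summaries and summarize them to at most 1024 characters, then embed the full-text summary. The sketch below shows that control flow only. The `summarize` and `embed` callables are hypothetical stand-ins for the LLM and embedding services (no real API is called), and the 4000-character chunk size is an assumption, since the source does not state how the text is chunked.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def extract_semantic_vector(
    mda_text: str,
    summarize: Callable[[str, int], str],
    embed: Callable[[str], List[float]],
    chunk_size: int = 4000,  # assumed value; not given in the source
) -> List[float]:
    """Chunk -> per-chunk summary (<=256 chars) -> full summary (<=1024 chars) -> embedding."""
    chunk_summaries = [summarize(chunk, 256) for chunk in chunk_text(mda_text, chunk_size)]
    full_summary = summarize("\n".join(chunk_summaries), 1024)
    return embed(full_summary)


# Stand-ins so the sketch runs offline; a real pipeline would call the model services here.
def fake_summarize(text: str, max_chars: int) -> str:
    return text[:max_chars]


def fake_embed(text: str) -> List[float]:
    return [float(len(text))]


if __name__ == "__main__":
    print(extract_semantic_vector("x" * 10000, fake_summarize, fake_embed))
```

Passing the model calls in as functions keeps the two-stage summarization logic testable without network access and makes it easy to swap in a different summarizer or embedding model.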
Language:
English
Publication frequency:
4 times per year
Journal subjects:
Computer Science, Information Technology, Project Management, Databases and Data Mining