Research on tobacco composition analysis and ratio optimization strategy using data mining technology

Tobacco quality and style characteristics are intrinsically linked to the chemical composition of the leaf, which is their material basis [1–2]. Tobacco is a complex mixture of many compounds, and the flavor components in the smoke mainly come from the tobacco itself. Consumer tastes have diversified: consumers no longer want only the roasted tobacco type, nor are they satisfied with low-tar, light-flavored blends [3–4]. They place higher demands on the content and quality of products. High-flavor, low-harm cigarettes will become the mainstream, and the market share of blended cigarettes and low-tar “Chinese cigarettes” will gradually expand, which places higher demands on tobacco planting, cigarette formulation, and cigarette processing [5–6].

At present, the cigarette factory's management information systems mainly serve manufacturing, purchasing, warehousing, sales, and finance. There is no information system for product design, and no product data platform that integrates the research institutes into the enterprise-wide management information system. At the raw tobacco bases, staff formulate technical programs, supervise and guide tobacco production, check the quality of cultivation and curing, and summarize and analyze the results, but the large amount of knowledge and information accumulated in these processes remains scattered among individuals and has not become shared knowledge of the enterprise [7–9]. A large amount of raw production data already exists in the current information systems, but it is used only for quality management and tracking and is poorly suited to formula design, formulation research, and analysis. The product data in the existing management systems therefore need to be integrated to serve formula design, formulation research, and analysis. Likewise, the large volume of laboratory test data is not managed centrally and cannot be shared, so a laboratory management system is needed to standardize laboratory management, collect and integrate experimental data, and make those data available for product design [10–13].

Given the current state of the tobacco industry and these internal problems, the enterprise needs a comprehensive information management system built on the tobacco value chain, with formulation design as the core, the process of realizing tobacco's value as the workflow, and the product as the link, so that product-related data, processes, and resources are managed in an integrated way [14–15]. Data, processes, and resources are the three major elements of information management, and the static product structure and the dynamic product design process are its two main lines. All information organization and resource management revolve around recipe design. At the same time, on the basis of fine-grained management, with business standardization as the precondition and support for strategic goals (here embodied in quality and cost) as the aim, business processes are standardized and the corresponding relational data models are established [16–17]. These include a model relating tobacco grade, origin, and other attributes to the leaf group formula; a model relating the main chemical components of tobacco to product quality or indexes; a model relating processing parameters to product quality or cost; and a model relating auxiliary material parameters to product quality. By establishing a sound data analysis and mining mechanism, drawing on routinely recorded data and test data, and using fuzzy spatial information technology and other technical means with a degree of human intervention, a data warehouse and data analysis system with a learning function is built, forming an integrated product information service platform that incorporates knowledge management [18–19].
Through the operation of the integrated product management system, the data relationships across the whole value chain and their corresponding formulas can be identified, yielding design and development programs that meet market requirements. The relationships among tobacco, formula, process, and product data are made clear to R&D and management personnel. Data of all kinds are centrally managed, processed, integrated, and correlated according to a unified data model consistent with enterprise standards, and are provided through a centralized, consistent data center. The system also provides query, statistics, mining, and analysis methods to address enterprise-wide R&D data management, exchange, computation, and decision support. The ultimate aim is an integrated product management information service system that unites formula design, process design and control, and analysis and evaluation [20].

This article proposes a tobacco composition detection model based on big data mining, and collects and analyzes tobacco data through a big data mining platform. A clustering algorithm is used to distinguish differences in intrinsic chemical quality between different parts. C3F-grade roasted tobacco from 13 provinces is then selected, and its physical properties and chemical composition are analyzed with the proposed model. Next, based on the correlation between the physical and chemical composition of tobacco leaves and the sensory and smoke indexes of cigarettes, constraints are determined and a genetic-algorithm optimization model is designed with minimum cost as the objective. Running the model yields the optimal combination of ratios for the desired cigarette style, and it can also predict the behavior of a designed leaf group formula in advance so that key indexes can be controlled. Finally, the tobacco formula optimization strategy designed in this paper is tested empirically.

2

Construction of tobacco composition analysis model based on big data mining

2.1

Architecture of the Big Data Mining Platform

The automatic tobacco composition detection model based on big data mining is built on a cloud computing platform, which supports flexible data collection and real-time analysis. The platform has the following attributes: (1) it can ingest data at high throughput from different data sources and store all of it in the big data center; (2) it can analyze data streams; (3) it supports efficient data export. The framework of the big data mining platform is shown in Fig. 1.

As can be seen from Fig. 1, the platform architecture consists of a data layer, a big data analysis and processing layer, and a management layer. Once the platform is running, the data layer transmits the various cigarette data sources to the analysis and processing layer as data streams, and the processed data are presented to users through the management layer [21].

1)

Data Collection

Tobacco companies store data in a traditional relational database and use a Redis database for data collection. In line with the goal of testing the intrinsic quality of cigarettes, and taking the sensory quality evaluation of raw materials as the starting point, data on first-roasted cigarettes and single-ingredient cigarettes are collected and analyzed. The single-ingredient cigarette data come from two sources: data on ingredient cigarettes with typical characteristics accumulated over a long period by the formulation system, and physicochemical indexes and sensory evaluation data for single-ingredient cigarettes of different origins provided by the tobacco company.

2)

Data analysis

After integration, it was found that the line data of first-roasted cigarettes had certain characteristics, with line 2 integrating all the historical data of first-roasted cigarettes. The main research objects are the physical characteristics, appearance quality, chemical composition, and sensory quality of first-roasted cigarettes. There are 7 physical indicators: thickness (mm), tensile force (N), tensile strength (kN/m), filling value, equilibrium moisture content (%), stem ratio (%), and leaf surface density. There are 6 appearance quality indicators: leaf structure, identity, oil, color, length, and residual injury. There are 3 identity indicators: variety, grade, and origin. There are 12 chemical indicators: nicotine, total nitrogen, total sugar, reducing sugar, chlorine, potassium, pH value, starch, volatile alkali, sugar-alkali ratio, nitrogen-alkali ratio, and potassium-chlorine ratio. The sensory evaluation indexes mainly include fresh aroma, sweet aroma, licorice aroma, pleasantness, aroma volume, translucency, fineness, extension, sweetness, miscellaneous gas, concentration, strength, softness, doughiness, irritation, cleanliness, zinginess, and aftertaste.

The data on single-ingredient cigarettes come mainly from the formulation system or are provided directly by tobacco companies. The single-ingredient data from the formulation system are used mainly to study the relationship between identity characteristics, chemical indicators, and sensory quality. The identity characteristics of single-ingredient tobacco are mainly variety, part, large grade, first-grade origin, second-grade origin, third-grade origin, and year. The chemical indicators are mainly total nicotine, total nitrogen, nicotine nitrogen, other nitrogen, total sugar, reducing sugar, smoke value, sugar/alkali ratio, sugar/nitrogen ratio, and protein. The sensory evaluation indexes are translucency, pleasantness, richness, aroma amount, fineness, sweetness, stretch, doughiness, softness, concentration, strength, miscellaneous gas, stimulation, and aftertaste.

2.2

Sensory quality analysis based on big data mining

Big data mining technology can be used to obtain sensory-quality data sets. The key step is cluster analysis of the data with a data mining algorithm, from whose results the quality data to be tested are selected. Cluster analysis is applied as a dynamic clustering technique to the training sample set, and similarity is evaluated with the Euclidean distance: the smaller the distance, the higher the similarity. The specific mining steps are as follows:

1)

From the established data set of n data points, select K points located where the data are relatively concentrated as the initial cluster centers, each representing the center of a different category.

2)

Calculate the Euclidean distance from each data point to each center and assign the point to the most similar (nearest) cluster center, forming K clusters, each of which represents one class. Then calculate the sum of squared distances from each point to the center of its class.

3)

Recalculate each cluster center as the mean of all objects in its category, so that the total sum of squared distances from the samples to their category centers is driven toward its minimum.

4)

Determine whether the cluster centers have changed. If they have, return to step 2). If they have not, the clustering ends.
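The four steps above correspond to the standard K-means procedure, which can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name `kmeans`, the random choice of initial centers (the text instead picks points from dense regions), and the sample data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: choose k initial centers. Random data points are used here
    # (an assumption); the text selects points from dense regions instead.
    centers = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Sum of squared distances from each point to its cluster center.
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse


# Hypothetical two-dimensional samples forming two well-separated groups.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, sse = kmeans(samples, k=2)
```

Because the result of K-means depends on the initial centers, running it from several starting points and keeping the solution with the smallest sum of squared distances is a common safeguard.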

3

Tobacco composition analysis experiments

3.1

Physical Characterization of Tobacco Leaves

3.1.1

Materials and Methods

1)

Materials

A total of 250 samples of 2023 C3F-grade roasted tobacco were selected from 13 origins, labeled A through M.

2)

Methods

Single-leaf mass, thickness, leaf surface density, stem content, equilibrium moisture content, filling value, tensile strength, and elongation were determined for the 250 tobacco samples, and the results were analyzed statistically. SPSS was used to perform principal component analysis and cluster analysis on the physical characteristics of roasted tobacco leaves from the different origins. Single-leaf mass: 50 leaves were randomly selected from each sample, their moisture content was adjusted to 16.0%–18.0%, they were weighed, and the single-leaf mass was calculated.

3.1.2

Analysis of the physical characteristics of tobacco from different origins

The correlation coefficient matrix of the physical characteristic indexes of tobacco from different origins is shown in Table 1, the eigenvalues and contribution rates of the principal components in Table 2, and the eigenvectors of each index in the first 5 principal components in Table 3. The cumulative contribution rate of the first 5 principal components reaches 85.36% (>80%), indicating that they reflect most of the information in the data. The first principal component has a contribution rate of 32.6% and is dominated by four indexes: thickness, leaf surface density, single-leaf mass, and stem content. The second has a contribution rate of 19.65% and is dominated by tensile strength and elongation. The third has a contribution rate of 14.69% and is dominated by filling value and equilibrium moisture content. The fourth and fifth principal components contribute 11.18% and 7.24%, respectively.

Table 1.

Correlation coefficient matrix of the physical characteristic indexes

Index	Thickness	Leaf surface density	Single leaf mass	Equilibrium moisture content	Tensile strength	Elongation	Filling value	Stem content
Thickness	1	0.682**	0.494**	−0.023	0.062	−0.054	−0.068	−0.441**
Leaf surface density	0.685	1	0.524**	0.144*	0.031	−0.165*	−0.102	−0.488**
Single leaf mass	0.506**	0.502**	1	0.032	−0.041	−0.169*	−0.206*	−0.461**
Equilibrium moisture content	−0.024	0.166*	0.044	1	0.053	0.144	−0.123	−0.131
Tensile strength	0.078	0.042	−0.042	0.051	1	0.453**	0.035	−0.085
Elongation	−0.074	−0.161*	−0.162*	0.134	0.463**	1	0.086	−0.021
Filling value	−0.082	−0.116	−0.202*	−0.116	0.025	0.098	1	0.194*
Stem content	−0.439**	−0.489**	−0.449**	−0.123	−0.113	−0.024	0.207*	1

Table 2.

Eigenvalues and contribution rates of the principal components

Principal component	eigenvalue	Contribution rate	Cumulative contribution rate
1	2.608	32.60%	32.60%
2	1.572	19.65%	52.25%
3	1.175	14.69%	66.94%
4	0.894	11.18%	78.12%
5	0.579	7.24%	85.36%
6	0.416	5.20%	90.56%
7	0.41	5.13%	95.69%
8	0.346	4.31%	100%
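The contribution rates in Table 2 follow directly from the eigenvalues: each rate is the eigenvalue divided by the sum of all eigenvalues (8 here, one per standardized index). The short sketch below recomputes them and finds how many components are needed to pass the 80% threshold used in the text. The function names are chosen for illustration only.

```python
# Eigenvalues copied from Table 2.
eigenvalues = [2.608, 1.572, 1.175, 0.894, 0.579, 0.416, 0.410, 0.346]


def contribution_rates(eigs):
    """Return (individual rates, cumulative rates) as fractions of the total."""
    total = sum(eigs)
    rates = [e / total for e in eigs]
    cumulative = []
    running = 0.0
    for r in rates:
        running += r
        cumulative.append(running)
    return rates, cumulative


def components_needed(cumulative, threshold=0.80):
    """Smallest number of leading components whose cumulative rate meets the threshold."""
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    return len(cumulative)


rates, cumulative = contribution_rates(eigenvalues)
n_components = components_needed(cumulative)
```

Small differences in the last decimal place relative to the table are to be expected, since the tabulated percentages are rounded.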

Table 3.

Eigenvectors of each index in the first five principal components

Index	Principal component
Index	1	2	3	4	5
Thickness	0.81	0.047	0.303	0.033	0.206
Leaf surface density	0.851	0.003	0.116	0.224	0.24
Single leaf mass	0.769	−0.102	0.033	−0.096	−0.173
Equilibrium moisture content	0.154	0.319	−0.676	0.608	0.075
Tensile strength	0.028	0.818	0.182	−0.214	0.322
Elongation	−0.197	0.84	0.039	−0.065	−0.259
Filling value	−0.3	0.085	0.674	0.62	−0.174
Stem content	−0.732	−0.183	0.078	0.093	0.505

3.1.3

Cluster analysis

Using the contribution rates of the first five principal components as weights gives the composite score F=0.326PC₁+0.1965PC₂+0.1469PC₃+0.1118PC₄+0.0724PC₅; the composite scores of the samples from each origin fell in the range 13–18. Taking the composite principal component score as the new evaluation index, the squared Euclidean distance was used to measure differences in the physical traits of tobacco between origins, and the shortest distance (single linkage) method was used for clustering. The clustering results for the physical traits of tobacco from different origins are shown in Fig. 2. At a squared Euclidean distance of 5.0, the samples fall into four categories: the first is A and B; the second is C and D; the third is E, F, and G; and the fourth is H, I, J, K, L, and M. Tobacco leaves from A and B have similar physical characteristics: thicker leaves, larger leaf surface density and single-leaf mass, and stem content below the national average, with the stem content of B being relatively small. Tobacco leaves from C and D are also similar to each other, with thickness and leaf surface density above the national average and second only to A and B. Tobacco leaves from E, F, and G form one category, characterized by thin leaves, small leaf surface density and single-leaf mass (relatively small among all origins), and stem content above the national average. Tobacco leaves from H, I, J, K, L, and M have similar physical characteristics, with indexes lying between those of the first three categories.
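To illustrate the clustering step, the sketch below applies single linkage (shortest distance) agglomeration with squared Euclidean distance to one-dimensional composite scores. The scores and the cut value are hypothetical, chosen only to fall in the 13–18 range mentioned above; they are not the paper's data, and the cut is not comparable to the value of 5.0 read from Fig. 2, whose scale is not stated here.

```python
def single_linkage(scores, cut):
    """Merge clusters while the smallest squared distance between
    members of two different clusters does not exceed `cut`."""
    clusters = [[name] for name in scores]

    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min((scores[x] - scores[y]) ** 2 for x in a for y in b)

    while len(clusters) > 1:
        pairs = [
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        if d > cut:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical composite scores for six origins (illustrative values only).
example_scores = {"A": 17.8, "B": 17.5, "C": 16.0, "D": 15.8, "E": 13.2, "F": 13.4}
groups = single_linkage(example_scores, cut=0.5)
```

With these illustrative values the origins pair up as A–B, C–D, and E–F; raising the cut merges the groups progressively, which is what reading a dendrogram at different heights does.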

3.2

Analysis of the content of the main chemical components of tobacco leaves

In this section, five tobacco-producing areas in Province A are taken as the objects of empirical research, and the contents of the major chemical components of their tobacco are tested with the method proposed in this paper. The descriptive analysis of the chemical composition of tobacco leaves in the different production areas is shown in Table 4 (capital letters indicate significance at the 0.01 level, and lowercase letters at the 0.05 level), and the correlation coefficients between chemical indicators are shown in Table 5. Table 4 shows that the mean values of total soluble sugar, total nitrogen, nicotine, potassium, sugar-alkali ratio, and potassium-chlorine ratio did not differ significantly among the tobacco-producing areas. Only reducing sugar showed a highly significant difference, between areas d and e; the other indicators were not significant. Chlorine content was highest in area a, and starch content was lowest in area b. Except for the coefficient of variation of the potassium-chlorine ratio in area c (50.78%), which reached medium variation, the coefficients of variation of the chemical components were low or weak. Those of total nitrogen, nicotine, potassium, chlorine, and nitrogen-alkali ratio were all below 1%, indicating good balance and stability of chemical composition across the tobacco areas of Province A.

Table 4.

Descriptive analysis of chemical constituents in different producing areas

Index		a	b	c	d	e
Soluble sugar	M ± SD	30.54±1.89	32.72±4.96	31.73±4.43	30.38±2.79	33.62±2.03
Soluble sugar	CV	12.34	18.35	34.57	9.47	28.29
Reducing sugar	M ± SD	26.65±4.86	28.34±4.03	26.74±3.87	26.3±0.92	30.64±4.77
Reducing sugar	CV	17.6	21.18	17.14	11.54	11.97
Total nitrogen	M ± SD	1.93±1.79	1.88±0.17	1.5±0.24	1.68±0.55	1.79±3.36
Total nitrogen	CV	0.06	0.16	0.05	0.01	0.05
Nicotine	M ± SD	1.8±1.47	2.51±1.33	2.81±3.38	2.74±3.36	2.04±3.34
Nicotine	CV	0.14	0.38	0.22	0.05	0.16
Potassium	M ± SD	1.92±2.2	2.36±2.85	2.4±0.17	2.29±3.45	2.79±3.31
Potassium	CV	0.61	0.15	0.23	0.2	0.28
Chlorine	M ± SD	0.37±4.76	0.4±4.81	0.22±2.35	0.23±0.6	0.61±4.66
Chlorine	CV	0.04	0.04	0.01	0.07	0.05
Starch	M ± SD	3.33±1.3	2.53±0.55	4.42±2.43	2.75±2.12	3.93±1.37
Starch	CV	0.22	0.4	1.17	0.76	0.42
Sugar-alkali ratio	M ± SD	14.69±4.98	15.3±0.27	13.66±4.09	12.53±3.98	14.62±4.61
Sugar-alkali ratio	CV	24.04	33.89	21.59	6.33	17.09
Nitrogen-alkali ratio	M ± SD	1.33±4.98	1.15±1.45	0.68±3.79	1.07±1.01	1.07±0.78
Nitrogen-alkali ratio	CV	0.03	0.08	0.04	0.01	0.09
Potassium-chlorine ratio	M ± SD	5.53±2.58	10.01±4.8	8.33±1.79	7.56±3.69	9.97±4.36
Potassium-chlorine ratio	CV	3.26	23.08	50.78	14.56	13.19

Table 5.

Correlation coefficients between chemical indexes

Index	1	2	3	4	5	6	7	8	9	10
1. Total sugar	1
2. Reducing sugar	0.707	1
3. Total nitrogen	−0.168	−0.312	1
4. Nicotine	−0.499	−0.452	0.343	1
5. Potassium	0.367	−0.01	0.395	−0.292	1
6. Chlorine	−0.157	−0.079	0.298	−0.073	0.13	1
7. Starch	−0.101	−0.038	−0.242	0.11	−0.28	−0.054	1
8. Sugar-alkali ratio	0.779	0.641	−0.272	−0.859	0.334	0.029	−0.128	1
9. Nitrogen-alkali ratio	0.377	0.239	0.306	−0.789	0.533	0.213	−0.265	0.747	1
10. Potassium-chlorine ratio	0.273	−0.027	0.063	−0.135	0.418	−0.666	−0.088	0.16	0.142	1

3.2.1

Chemical composition analysis

The principal component loadings for each chemical component are shown in Figure 3, and the principal component factor loading matrix is shown in Table 6. Extracting the three principal components with eigenvalues greater than 1 gives a cumulative variance contribution rate of 93.9%, indicating that the first three principal components reflect nearly all the information in the original data. From Figure 3 and Table 6, the variance contribution rate of the first principal component was 46.319%, with total nitrogen, chlorine, and total sugar having the higher loadings. The variance contribution rate of the second principal component was 30.937%, with nicotine, sugar-alkali ratio, nitrogen-alkali ratio, and potassium having the higher loadings. The third principal component had a variance contribution rate of 16.644%, with reducing sugar, potassium-chlorine ratio, and two-sugar ratio having the higher loadings.

Table 6.

Principal component factor load matrix

Constituent	Principal component factor loading
Constituent	1	2	3
Total nitrogen / %	0.922	0.079	−0.097
Chlorine / %	0.848	−0.092	−0.257
Total sugar	0.859	0.499	0.21
Reducing sugar / %	0.781	0.235	0.593
Potassium-chlorine ratio	−0.761	0.378	0.497
Potassium / %	−0.746	0.586	0.265
Nicotine / %	0.197	1.004	−0.312
Sugar-alkali ratio	0.472	0.881	−0.094
Nitrogen-alkali ratio	−0.621	0.764	−0.116
Two-sugar ratio	0.387	−0.169	0.943
Eigenvalue	4.624	3.1	1.69
Variance contribution rate / %	46.319	30.937	16.644
Cumulative contribution rate / %	46.319	77.256	93.900

Weighting the three principal components by their contribution rates gives the composite score of the chemical components of roasted tobacco in each county and district of Province A: F=0.463PC₁+0.309PC₂+0.166PC₃

The principal component scores and composite scores of the chemical components of tobacco leaves in the counties and districts of Province A are shown in Table 7. In descending order of composite score, the counties and districts rank i, c, d, a, f, g, h, e, and b.

Table 7.

Principal component scores and composite scores of tobacco chemical composition

	PC1	PC2	PC3	F
a	-1.107	7.974	-6.422	0.883
b	-1.644	7.71	-6.57	0.516
c	-1.32	8.651	-6.281	1.02
d	-1.069	8.09	-6.163	0.981
e	-1.285	7.363	-6.007	0.682
f	-1.52	8.417	-6.178	0.864
g	-1.785	8.587	-6.298	0.771
h	-2.161	9.382	-6.926	0.746
i	-2.428	11.858	-6.711	1.434
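The composite score and ranking can be reproduced from the principal component scores in Table 7 using the weights in the formula F=0.463PC₁+0.309PC₂+0.166PC₃. The sketch below does this; the variable and function names are illustrative. The recomputed F values differ slightly from the tabulated ones in the second or third decimal place, presumably because the published weights and scores are rounded, but the ranking is unchanged.

```python
# Weights from the composite score formula (rounded contribution rates).
WEIGHTS = (0.463, 0.309, 0.166)

# (PC1, PC2, PC3) scores copied from Table 7.
pc_scores = {
    "a": (-1.107, 7.974, -6.422),
    "b": (-1.644, 7.710, -6.570),
    "c": (-1.320, 8.651, -6.281),
    "d": (-1.069, 8.090, -6.163),
    "e": (-1.285, 7.363, -6.007),
    "f": (-1.520, 8.417, -6.178),
    "g": (-1.785, 8.587, -6.298),
    "h": (-2.161, 9.382, -6.926),
    "i": (-2.428, 11.858, -6.711),
}


def composite(pcs, weights=WEIGHTS):
    """Weighted sum of principal component scores."""
    return sum(w * p for w, p in zip(weights, pcs))


composite_scores = {name: composite(pcs) for name, pcs in pc_scores.items()}
# Counties and districts sorted from highest to lowest composite score.
ranking = sorted(composite_scores, key=composite_scores.get, reverse=True)
```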

3.2.2

Cluster analysis

Based on the contents of the main chemical components, the nine counties and districts of Province A were clustered hierarchically using the between-groups linkage method. The resulting clustering is shown in Figure 4. At a similarity of 90.04, Province A divides into three categories of tobacco areas: the first comprises a, b, and c; the second comprises d, e, f, g, and h; and the third comprises i. The three categories are located in the northeast, center, and southwest of Province A, respectively, indicating that the chemical composition of its tobacco is clearly regional. Nicotine content does not differ significantly among the three categories and is within the range of high-quality tobacco in all of them, indicating that the nicotine content of the tobacco is moderate and relatively stable. Total nitrogen content in the first category was significantly lower than in the second and third categories, but all were within the range of high-quality tobacco. The nitrogen-alkali ratio was significantly higher in the third category than in the first, but all three categories were within the range of high-quality tobacco. Reducing sugar, total sugar, and saccharine were significantly lower in the second category than in the first, but all three categories were at relatively high levels, indicating that the sugar content of roasted tobacco in the Yuxi Tobacco Region deserves attention.
From the first to the third category, chlorine content decreased significantly, while potassium content and the potassium-chlorine ratio increased gradually, with significant differences among areas. Province A thus shows a significant decrease in chlorine content and a significant increase in potassium content and potassium-chlorine ratio from northeast to southwest.

4

Genetic algorithms and formulation optimization

4.1

Genetic Algorithm for Cigarette Formulation Design Problems

A genetic algorithm is an optimization method that draws on the principles of natural selection and evolution and simulates the evolutionary process of organisms in nature. Its search space is large and its search direction is relatively diffuse, and it can handle complex nonlinear combinatorial optimization problems. By comparison, optimization based on expert experience is constrained by the expert's knowledge structure, subjective awareness, and objective conditions, so its search space is relatively small, but its search direction is clear and its search is fast [22–23]. Cigarette formula design is such a complex combinatorial optimization problem. Combining the two approaches enlarges the search space while keeping the search direction relatively clear, which accelerates the search, improves system performance, and overcomes the knowledge-acquisition bottleneck of traditional expert systems.

Leaf group formula design is a key step in cigarette product design and plays a decisive role in the color, aroma, and taste of the product. A leaf group formula combines tobaccos of different types, aromas, origins, grades, and properties in suitable ratios according to the quality requirements of the product (type, aroma, grade, style, and so on), while keeping the quality, style, and cost of the formula stable. When designing a leaf group formula, the region, grade, and part of the tobacco, together with certain characteristic chemical, sensory, or smoke indexes, are generally taken as constraints, and the formula cost is the target variable. The formulation problem is: subject to the constraints b₁,b₂,…,b_m, find the number of packs (i.e., the proportion) of each tobacco leaf in the formulation that minimizes the objective function Y. The genetic algorithm recommends several formula programs that meet the requirements; the formula designer either chooses one of them or modifies a recommended formula to form a new leaf group formula.

The expression is as follows:

Objective function: $\min Y = \sum_{i = 1}^{n} c_{i} X_{i}$, the cost of the leaf group formulation, subject to the constraints: $\begin{array}{l} b_{0} \leq a_{11} X_{1} + a_{12} X_{2} + \dots + a_{1 n} X_{n} \leq b_{1} \\ a_{21} X_{1} + a_{22} X_{2} + \dots + a_{2 n} X_{n} = b_{2} \\ \dots \\ a_{m 1} X_{1} + a_{m 2} X_{2} + \dots + a_{m n} X_{n} \geq b_{m} \end{array}$ (1)

Where: X = (X₁,X₂,…,X_i,…,X_n) is the leaf group formulation, and X_i (i = 1,2,…,n) is the proportion (%) of the i-th tobacco leaf in the tobacco leaf pool.

c₁,c₂,…,c_n are the prices ($/kg) of the tobacco leaves in the pool, with c_i the price of the i-th tobacco leaf.

b₀,b₁,…,b_m are the bounds of the constraints. a₁₁,a₁₂,…,a_mn are the relevant indicators of the tobacco, such as chemical, sensory, smoke, grade, origin, and total number of packs.

The proportions must also satisfy: $\sum_{i = 1}^{n} X_{i} = 100, \quad 0 \leq X_{i} \leq 10$ (2)
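To make the optimization model concrete, the sketch below evaluates the cost objective and checks the constraints for a candidate formulation. It is only an illustration: the function names, the tolerance, the prices, and the single indicator row (a hypothetical nicotine contribution per percentage point of each leaf) are assumptions, not values from this paper. An equality constraint can be expressed with equal lower and upper bounds, and a one-sided constraint with an infinite bound.

```python
import math


def formulation_cost(prices, proportions):
    """Objective Y: sum of price times proportion over all leaves."""
    return sum(c * x for c, x in zip(prices, proportions))


def is_feasible(proportions, indicator_rows, bounds, max_share=10.0, tol=1e-9):
    """Check the proportion constraints and the linear indicator constraints.

    indicator_rows[k][i] is the coefficient a_ki of leaf i in constraint k,
    and bounds[k] is its (lower, upper) pair.
    """
    # Proportions sum to 100 and each lies between 0 and max_share.
    if abs(sum(proportions) - 100.0) > tol:
        return False
    if any(x < 0 or x > max_share for x in proportions):
        return False
    # Each weighted indicator sum must lie within its bounds.
    for row, (lower, upper) in zip(indicator_rows, bounds):
        value = sum(a * x for a, x in zip(row, proportions))
        if value < lower - tol or value > upper + tol:
            return False
    return True


# Hypothetical pool of 10 leaves, each used at 10% of the blend.
prices = [30, 28, 26, 25, 24, 22, 21, 20, 19, 18]
blend = [10.0] * 10
nicotine_row = [0.02] * 10           # assumed contribution per percentage point
constraints = [(1.5, 2.5)]           # assumed acceptable range for the blend
cost = formulation_cost(prices, blend)
ok = is_feasible(blend, [nicotine_row], constraints)
unbounded_above = is_feasible(blend, [nicotine_row], [(1.5, math.inf)])
```

Note that the upper limit of 10 on each proportion, together with the requirement that the proportions sum to 100, means a feasible formulation needs at least ten leaves.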

4.2

Genetic Algorithm for Cigarette Formulation Optimization and Analysis

4.2.1

Chromosome coding design

According to business needs, chromosomes are encoded as real numbers, which facilitates searching a large space, gives strong local search ability, makes the search less prone to local extrema, and offers high precision and fast convergence. A chromosome corresponds to the selected initial set of tobacco leaves and is written R(r₁,r₂,…,r_i,…,r_n), where i is the position of each tobacco leaf in the set and the chromosome length equals the number of tobacco leaves in the set. Each chromosome represents one leaf group recipe: r_i is the number of packs of the i-th tobacco in that recipe, and its value lies between the minimum and maximum numbers of packs set by the user.

4.2.2

Initializing populations

According to the constraints, N real numbers within the specified range are randomly generated and arranged to form an individual; repeating this produces a population of the specified size, i.e., the initial population. Evolution is then simulated from the initial population, selecting the best and eliminating the worst, until a highly fit population is reached whose best individual is taken as the solution to the problem. In this paper, the size of the initial population is set to 50.
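A minimal sketch of this initialization is given below. The function name, the bounds, and the number of leaves are hypothetical; only the population size of 50 comes from the text. The sketch draws each gene uniformly within the user-set range and does not by itself enforce any further constraints, which would have to be handled by rejection or repair.

```python
import random


def init_population(n_leaves, lower, upper, size=50, seed=None):
    """Create `size` chromosomes, each a list of n_leaves real numbers
    drawn uniformly from [lower, upper]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(lower, upper) for _ in range(n_leaves)]
        for _ in range(size)
    ]


# Hypothetical setting: 12 candidate leaves, 0 to 10 packs each.
population = init_population(n_leaves=12, lower=0.0, upper=10.0, seed=1)
```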

4.2.3

Constructing the fitness function

In the process of chromosome evolution, if the leaf composition is more than 10% higher than the original leaf composition, the chromosome is rejected directly and a new chromosome is generated. The fitness function therefore takes the user-set index parameters as its independent variables, with the chemical composition indicators, the three major smoke indicators, and the four major origins acting as constraints. The fitness function is constructed as:

$\begin{array}{l} f (u_{1}, u_{2}, \dots, u_{10}, v_{1}, v_{2}, v_{3}, d_{1}, d_{2}, \dots, d_{4}) \\ = \left(\sum_{i = 1}^{10} w_{i} | u_{i} - u_{i}^{'} |^{2}\right)^{- 1} + \left(\sum_{i = 1}^{3} \eta_{i} | v_{i} - v_{i}^{'} |^{2}\right)^{- 1} + \left(\sum_{i = 1}^{4} \delta_{i} | d_{i} - d_{i}^{'} |^{2}\right)^{- 1} \end{array}$ (3)

Define the weighted deviation terms for the chemical indicators, the three main smoke indicators, and the four main origins as:

$g (u_{i}) = w_{i} | u_{i} - u_{i}^{'} |^{2}, \quad h (v_{i}) = \eta_{i} | v_{i} - v_{i}^{'} |^{2}, \quad k (d_{i}) = \delta_{i} | d_{i} - d_{i}^{'} |^{2}$ (4)

The formula then becomes:

$\begin{array}{l} f (u_{1}, u_{2}, \dots, u_{10}, v_{1}, v_{2}, v_{3}, d_{1}, d_{2}, \dots, d_{4}) \\ = \left(\sum_{i = 1}^{10} g (u_{i})\right)^{- 1} + \left(\sum_{i = 1}^{3} h (v_{i})\right)^{- 1} + \left(\sum_{i = 1}^{4} k (d_{i})\right)^{- 1} \end{array}$ (5)

where 0 < w_i ≤ 1 is the weight of the i th chemical indicator on the whole chromosome (formulation scheme), 0 < η_i ≤ 1 is the weight of the i th smoke indicator on the whole chromosome, 0 < δ_i ≤ 1 is the weight of the i th origin indicator on the whole chromosome, and $\sum_{i = 1}^{10} w_{i} + \sum_{j = 1}^{3} \eta_{j} + \sum_{l = 1}^{4} \delta_{l} = 1$. The exact values are set by the formulation designer.
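A minimal sketch of this fitness function follows. It assumes that the primed symbols are the user-set target values for each indicator, which the text does not state explicitly. The small `eps` term is also an assumption, added only to avoid division by zero when a group matches its targets exactly. All numeric values are illustrative.

```python
def weighted_sq_error(values, targets, weights):
    """Sum of weight * |value - target|^2 over one indicator group."""
    return sum(w * abs(v - t) ** 2 for v, t, w in zip(values, targets, weights))

def fitness(u, u_target, w, v, v_target, eta, d, d_target, delta, eps=1e-12):
    """Sum of inverse weighted squared deviations over the three indicator groups.

    eps is not part of the paper's formula; it guards against division by zero.
    """
    groups = [(u, u_target, w), (v, v_target, eta), (d, d_target, delta)]
    return sum(1.0 / (weighted_sq_error(x, t, wt) + eps) for x, t, wt in groups)

# Illustrative weights: 10 chemical, 3 smoke, 4 origin indicators, summing to 1.
w, eta, delta = [0.05] * 10, [0.1] * 3, [0.05] * 4
zeros10, zeros3, zeros4 = [0.0] * 10, [0.0] * 3, [0.0] * 4

# A scheme close to its targets versus one twice as far away.
f_near = fitness([0.1] * 10, zeros10, w, [0.1] * 3, zeros3, eta, [0.1] * 4, zeros4, delta)
f_far = fitness([0.2] * 10, zeros10, w, [0.2] * 3, zeros3, eta, [0.2] * 4, zeros4, delta)
```

Because each term is an inverse of a squared deviation, doubling every deviation divides the fitness by roughly four, so schemes closer to the targets score higher.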

4.2.4

Population evolution

1)

Selection operation

The purpose of the selection operation is to select superior individuals from the current population based on evolutionary principles. Selection is based on fitness: individuals with high fitness have a higher chance of reproducing in the next generation and thus have more offspring, while individuals with lower fitness produce fewer offspring and are eventually eliminated. A method is therefore needed that makes the probability of an individual being selected proportional to its fitness. Existing methods for this purpose include proportional selection, elitist preservation, and rank-based selection. In this paper, the roulette wheel method is used.

The selection probability and cumulative probability of each chromosome are calculated from the fitness f_i of each individual and the total fitness $F = \sum_{i = 1}^{m} f_{i}$ of the population. The roulette wheel is divided into m unequal sectors, with the i th sector occupying the fraction P_i of the wheel, and m individuals are selected by spinning the wheel m times. The larger the fitness, the larger the sector on the roulette wheel and the higher the corresponding probability of selection. This simulates the natural law of survival of the fittest.

$P_{i} = \frac{f_{i}}{F}, \quad i = 1, 2, \dots, m$ (6)

$q_{i} = \sum_{j = 1}^{i} P_{j}, \quad i = 1, 2, \dots, m$ (7)
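Roulette wheel selection based on equations (6) and (7) can be sketched as below. The function names are hypothetical. Each spin draws a random number r in [0, 1) and picks the first individual whose cumulative probability q_i is at least r.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def selection_probabilities(fitnesses):
    """Return P_i = f_i / F and the cumulative q_i (equations 6 and 7)."""
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    cumulative = list(accumulate(probs))
    return probs, cumulative

def roulette_select(population, fitnesses, rng):
    """Spin the wheel once per individual, so m individuals are selected."""
    _, cumulative = selection_probabilities(fitnesses)
    selected = []
    for _ in population:
        r = rng.random()
        # First index whose cumulative probability is >= r; the min() guards
        # against q_m falling slightly below 1 through floating-point rounding.
        idx = min(bisect_left(cumulative, r), len(population) - 1)
        selected.append(population[idx])
    return selected

rng = random.Random(0)
individuals = ["A", "B", "C"]          # stand-ins for chromosomes
fitness_values = [1.0, 3.0, 6.0]       # illustrative fitness values
probs, cumulative = selection_probabilities(fitness_values)
selected = roulette_select(individuals, fitness_values, rng)
```

Selection is with replacement, so a fit individual can appear several times in the result and a weak one may not appear at all.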

2)

Crossover operation

Two chromosomes x = (x₁,x₂,…,x_n) and y = (y₁,y₂,…,y_n) chosen by the selection operation are crossed at a randomly selected location with crossover rate p_c. After crossover, the two individuals are:

$\begin{array}{l} x_{i}^{'} = (1 - a)\, x_{i} + b\, y_{i} \\ y_{i}^{'} = (1 - b)\, x_{i} + a\, y_{i} \end{array}$ (8)

where a and b are random numbers with 0 ≤ a, b < 1.

To ensure the feasibility of the formulation scheme, each new individual is checked to see whether it satisfies the constraints of the leaf group formulation. If it does, the new individual becomes a member of the new generation; otherwise, it is discarded and a new individual is generated.
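The crossover step with its feasibility check can be sketched as follows. Several details are assumptions rather than the paper's method: equation (8) is applied to every gene instead of a single location, the default `pc=0.8` and the retry limit are illustrative, and falling back to the parents after repeated failures is a simplification.

```python
import random

def arithmetic_crossover(x, y, rng):
    """Blend two parents gene by gene, following equation (8) as printed."""
    a, b = rng.random(), rng.random()  # 0 <= a, b < 1
    child_x = [(1 - a) * xi + b * yi for xi, yi in zip(x, y)]
    child_y = [(1 - b) * xi + a * yi for xi, yi in zip(x, y)]
    return child_x, child_y

def crossover_with_retry(x, y, is_feasible, pc=0.8, max_tries=100, seed=None):
    """Cross with probability pc, redrawing a and b until both children are feasible."""
    rng = random.Random(seed)
    if rng.random() >= pc:
        return list(x), list(y)  # no crossover for this pair
    for _ in range(max_tries):
        child_x, child_y = arithmetic_crossover(x, y, rng)
        if is_feasible(child_x) and is_feasible(child_y):
            return child_x, child_y
    return list(x), list(y)  # assumed fallback: keep the parents

parent_x, parent_y = [1.0, 2.0], [3.0, 4.0]
children = crossover_with_retry(parent_x, parent_y, lambda c: True, pc=1.0, seed=1)
```

Since a and b are drawn independently, the children do not in general preserve the sum of the parents' genes, which is one reason the feasibility check after crossover matters.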

3)

Mutation operation

Chromosome mutation differs from chromosome replication because the chromosome constraints may no longer be satisfied after mutation. In this paper, a combination of guided mutation and genetic mutation is used to perform the mutation operation. The mutation process is divided into two steps, the first of which is guided mutation using the cumulative mean. After mutation is completed, each chromosome must still satisfy the predefined chromosome constraints.

Sort the 40 chromosomes G(R₁,R₂,…,R₄₀) and let the sorted set be $G (R_{1}^{'}, R_{2}^{'}, \dots, R_{40}^{'})$. Guided mutation is applied at random with probability p_m = 0.3. Select a chromosome locus at random from these 16 chromosomes and generate a random number r between 0 and 1. If r ≥ p_m, mutate the current chromosome such that:

${\bar{R}}_{i} = \frac{\sum_{j = 1}^{i} R_{j}^{'}}{i}$ (9)

Otherwise, move on to the next chromosome, until all chromosomes have been processed. The above mutation gives ${\bar{R}}_{1} = R_{1}^{'}$, i.e., the best chromosome is preserved.

To ensure that the formulation scheme satisfies the constraints, the mutated individual is checked against them. If it satisfies them, the new individual becomes a member of the new generation; otherwise, it is discarded and regenerated.

$\sum_{i = 1}^{n} X_{i} = 100, \quad 0 \leq X_{i} \leq 100$ (10)
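Guided mutation with the cumulative mean of equation (9) and the check of equation (10) can be sketched as below. The sketch assumes that chromosomes are sorted by descending fitness and that the cumulative mean is taken over the sorted, unmutated chromosomes. It also keeps the original chromosome when a mutant is infeasible, whereas the text discards and regenerates it. Function names are hypothetical.

```python
import random

def satisfies_eq10(x, tol=1e-9):
    """Equation (10): proportions sum to 100 and each lies in [0, 100]."""
    return abs(sum(x) - 100.0) <= tol and all(-tol <= xi <= 100.0 + tol for xi in x)

def guided_mutation(population, fitnesses, p_m=0.3, seed=None):
    """Replace chromosomes by the cumulative mean of the sorted population."""
    rng = random.Random(seed)
    order = sorted(range(len(population)), key=lambda k: fitnesses[k], reverse=True)
    ranked = [population[k] for k in order]  # ranked[0] is the fittest chromosome
    running_sum = [0.0] * len(ranked[0])
    result = []
    for i, chrom in enumerate(ranked, start=1):
        running_sum = [s + g for s, g in zip(running_sum, chrom)]
        if rng.random() >= p_m:  # mutation condition as stated in the text
            candidate = [s / i for s in running_sum]
            # Assumed simplification: fall back to the original if infeasible.
            result.append(candidate if satisfies_eq10(candidate) else list(chrom))
        else:
            result.append(list(chrom))
    return result

recipes = [[50.0, 50.0], [100.0, 0.0], [0.0, 100.0]]  # illustrative proportions
scores = [3.0, 2.0, 1.0]                               # illustrative fitness values
mutated = guided_mutation(recipes, scores, p_m=0.0, seed=0)  # p_m=0 mutates all
```

The mean of chromosomes that each satisfy equation (10) also satisfies it, so in this sketch the fallback only triggers when the inputs themselves violate the constraint. Any additional formulation constraints would still need their own check.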

4.3

Optimized design of leaf group formulation

Taking a cigarette product as an example, the author sets formulation constraints within a wide range, including the ranges of tobacco types, regions, and parts that may be used, as well as the ranges of the various chemical indicators. From 40 selected sets of tobacco leaves of different origins, the optimization design model built by the system from genetic algorithms and neural networks recommends leaf group formulation schemes that meet the constraints.

5

Tobacco ratio optimization experiment

5.1

Validation of the optimal raw material formulation

The validation test of the optimal raw material formula for the characterization of reclaimed tobacco is shown in Table 8. The average sensory evaluation score in the validation test was 84.5, with an RSD of 0.13%. The test repeatability was good and basically consistent with the results of the orthogonal test, indicating that the proportions of the raw material formula are feasible.

Table 8.

Validation test of the optimal raw material formula for tobacco

Serial Number	Odor	Tune	Volume Of Aroma	Aroma Quality	Offensive Odor	Aftertaste	Irritating	Total Score
1	4	19	8	7	19.5	15	16	86
2	4	17.5	9.5	7	18.5	15	16	85.5
3	4	16.5	10.5	7	17	15	16	82.5
Average								84.5

5.2

Chemical content and sensory evaluation of smoking quality

The conventional chemical composition of the tobacco designed using the strategy of this paper is shown in Table 9, and the harmful components of its mainstream smoke are shown in Table 10. Table 9 shows that the total sugar content of the tobacco designed using the strategy of this paper differs little from that of conventional tobacco, while the content of the remaining components is significantly lower. Because the tobacco designed using the strategy of this paper is composed of natural plant raw materials rather than raw tobacco materials, the samples contain essentially no total phytochemicals (the detection limit of total phytochemicals is 0.21%). In the mainstream smoke, the contents of benzo(a)pyrene, phenol, and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK) increased relative to conventional tobacco, while the contents of the remaining components decreased. Hydrocyanic acid and crotonaldehyde showed the greatest decreases, of 79 and 16.55, respectively. The hazard index of the tobacco designed using the strategy of this paper decreased significantly, by 1.44.

In the sensory evaluation of smoking quality, the smoke concentration of conventional tobacco was relatively high, but the aroma was rougher, the irritation was more obvious, the smoke had a burning sensation, there were withered, charred, and woody odors, and the residual sensation in the mouth was more obvious. The tobacco designed using the strategy of this paper was mainly clean and floral, supplemented by burnt and sweet aroma, with a fresh and translucent aroma, uplifting and relaxing smoke, rich aroma, and no obvious off-odor. As a result, the smoking quality of the tobacco designed using the strategy of this paper is better than that of conventional recycled tobacco and meets the requirements of product development.

Table 9.

Conventional chemical composition content

Sample	Total Sugar	Reducing Sugar	Total Nitrogen	Total Vegetable Base	Total Vegetable Base	Chloride Ion
Conventional Tobacco	9.25	7.84	1.54	1.15	2.54	0.88
Ours	7.88	7.65	0.33	0.15	0.75	0.36

Table 10.

Harmful component content of mainstream smoke

Sample	Carbon monoxide	Hydrocyanic acid	Benzo(a)pyrene	Crotonaldehyde	Phenol	Ammonia	NNK	Hazard index
Conventional Tobacco	12.2	95	4.21	18.65	3.08	3.9	7.05	6.65
Ours	10.5	16	5	2.1	3.27	3.2	9.45	5.21
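The changes quoted in the text can be reproduced from the Table 10 values with the short sketch below. The dictionary layout and variable names are illustrative, and the values carry whatever units the table uses, which the source does not state.

```python
# Values copied from Table 10 (units as in the source table).
conventional = {
    "Carbon monoxide": 12.2, "Hydrocyanic acid": 95, "Benzo(a)pyrene": 4.21,
    "Crotonaldehyde": 18.65, "Phenol": 3.08, "Ammonia": 3.9,
    "NNK": 7.05, "Hazard index": 6.65,
}
ours = {
    "Carbon monoxide": 10.5, "Hydrocyanic acid": 16, "Benzo(a)pyrene": 5,
    "Crotonaldehyde": 2.1, "Phenol": 3.27, "Ammonia": 3.2,
    "NNK": 9.45, "Hazard index": 5.21,
}

# Positive values are reductions relative to conventional tobacco.
reduction = {name: round(conventional[name] - ours[name], 2) for name in conventional}
increased = [name for name, delta in reduction.items() if delta < 0]
largest_two = sorted(
    (name for name in reduction if name != "Hazard index"),
    key=lambda name: reduction[name],
    reverse=True,
)[:2]

for name, delta in reduction.items():
    print(f"{name}: {delta:+.2f}")
```

Rounding to two decimals avoids floating-point noise in the differences.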

6

Conclusion

The study used principal component analysis and clustering methods to analyze tobacco composition, combined with a genetic algorithm to intelligently optimize blend ratios. The experimental results show that, when the big data mining-based tobacco composition model constructed in this paper was used to analyze the composition of five tobacco-producing regions, the chemical composition indicators of tobacco in the different regions were relatively stable in terms of total nitrogen, nicotine, potassium, chlorine, and nitrogen-alkali ratio. Total soluble sugar, total nitrogen, nicotine, potassium, sugar-alkali ratio, and potassium-chlorine ratio were similar across regions, and the differences were not significant. The correlations between total sugar and reducing sugar, sugar-alkali ratio, nitrogen-alkali ratio, and nicotine reached a highly significant level, as did the correlation between nicotine and nitrogen-alkali ratio. After the tobacco ratio optimization strategy designed in this paper was used for tobacco product development, the tobacco hazard index decreased significantly, by 1.44. This indicates that products developed using the strategy are of better quality than conventional tobacco products and that the quality of cigarette products has been effectively improved.

Research on tobacco composition analysis and ratio optimization strategy using data mining technology

Xingliang Li

Weixian Ren

Guangwei Liu

Published online: 19 Mar 2025

Received: 02 Oct 2024

Accepted: 29 Jan 2025

DOI: https://doi.org/10.2478/amns-2025-0481

Keywords: Data mining techniques, Tobacco composition, Chemical composition, Genetic algorithm, Tobacco rationing

© 2025 Xingliang Li et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
