Open Access

Analysis and prediction of second-hand house price based on random forest



Introduction

In recent years, as the price of new houses has soared, people's living pressure has been increasing. To reduce the pressure of buying a house, some people consider buying second-hand houses. However, given the large volume of second-hand housing data and the many factors related to housing price, identifying the most satisfactory house to purchase has become a challenging problem. By analysing the factors affecting the price of second-hand housing and predicting the house price, this paper provides a reference for second-hand housing price prediction and helps people to identify a house for purchase. Zhang [1] uses a linear regression model to analyse the factors of housing price. Sun [2] uses a support vector machine model to evaluate the second-hand housing price, and finds that the prediction effect is better than that of the regression model. Xu [3] and Wang [4] apply a neural network model to predict house prices. These studies indicate that the accuracy of linear regression is low, that the support vector machine is sensitive to noise and that the stability of the neural network is poor.

As a new and highly flexible machine learning algorithm, Random Forest (RF) has the advantages of good tolerance to noise and outliers, high prediction accuracy and low susceptibility to overfitting; thus, it has been widely used in the fields of economic management, ecology, biomedicine, credit evaluation and so on [5,6,7]. Taking Chengdu second-hand housing as an example, this paper uses web crawler technology to capture the second-hand housing listings of Chengdu from Beike Network, cleans the data and then visually analyses the impact of 11 factors, including area, orientation, layout and building structure, on house prices. An RF model is established based on 38,363 second-hand housing records. According to the visual analysis results, the model variables are re-encoded based on the impact of each factor on house prices; the higher the impact, the greater the value. The seven most critical factors affecting house prices are found through feature importance. Grid search is used to tune the parameters of the RF model, and finally, the optimised model is used to predict house prices.

Introduction to Python Data Analysis
Crawler Technology

A web crawler, also known as a web spider [8], is a program that automatically browses web pages, grabs their information, and analyses and stores it. The key technologies of a web crawler are mainly reflected in three aspects: web page acquisition, web page analysis and data storage [9].

Web page acquisition. Requests is an HTTP library written in Python and built on urllib3. This paper uses the requests library to simulate a browser sending a request to the web server. After the server-side response is obtained, the target web page resources are downloaded.

Web page analysis. When the web crawler obtains the web page information, it needs to denoise the web page. Web page denoising is mainly carried out to extract the required effective data and eliminate redundant files. This paper uses the Beautiful Soup library to parse the HTML structure, extract the web page data and finally obtain the formatted text data.

Data storage. The obtained data can be stored locally or in a database. In this paper, the obtained data are saved in the CSV file format and stored directly locally.

Data Analysis Technology

This is the era of big data, and data analysis is applied to all aspects of society. Data analysis is generally embodied in three aspects [10]. (1) Data pre-processing. The raw data obtained may have irregularities, such as missing or duplicated data, and thus data pre-processing is required before data analysis. (2) Descriptive analysis. Some regularities in the data can be displayed intuitively through graphs, tables, etc. Matplotlib and Pandas are used in the analysis performed in the current study. Matplotlib provides plotting modules for data visualisation. The Pandas data analysis package provides structured data types for Python. (3) Predictive analysis. The regularities discovered through descriptive analysis are processed by feature engineering, and statistical algorithms, machine learning models, deep learning models, etc. are selected to make predictions.

Data Collection and Pre-processing
Website Data Crawling

The data come from Beike Network (https://www.cd.ke.com). Beike Network is an online real-estate trading platform for new houses, rented houses and second-hand houses. The housing data that need to be collected mainly include basic information (house type, floor, building area, housing orientation, etc.), transaction attributes (transaction ownership, housing use, housing age), housing characteristics, etc.

There are only 30 listings on each page, so obtaining more listings requires jumping to the next page. Observing the composition of the URL, we find that the pg value increases by 1 with each page, and thus the URL can be spliced as 'https://www.cd.ke.com/pg'+str(page)+'/' inside a for loop. The 30 detail-page URLs on each page are located through beautiful.find_all('li') and then saved in a list. Through the URLs in the list, the crawler enters the details page of each house, locates the required fields and saves them in the CSV file. The specific process of the crawler is shown in Figure 1.

Fig. 1

The specific process of the crawler
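
The crawling steps described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: it assumes that every `<li>` element wraps one link to a detail page, and the function names (`page_url`, `detail_links`, `crawl`) and the output file name are hypothetical. The real site may need request headers and a more specific selector.

```python
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.cd.ke.com/pg"  # URL prefix quoted in the text


def page_url(page):
    """Splice the listing-page URL for a given page number."""
    return BASE_URL + str(page) + "/"


def detail_links(html):
    """Collect the href of the first link inside every <li> element (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("li"):
        anchor = item.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


def crawl(pages, out_path="listings.csv"):
    """Fetch each listing page and save the detail-page URLs to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["page", "detail_url"])
        for page in range(1, pages + 1):
            response = requests.get(page_url(page), timeout=10)
            response.raise_for_status()
            for link in detail_links(response.text):
                writer.writerow([page, link])
```

Calling `crawl(3)` would request the first three listing pages; the parsing step can be tried offline by passing any HTML string to `detail_links`.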

Data Pre-processing

The data are crawled by region, one table per region, so the regional tables are first combined into one table using pandas.concat(). After the data are merged, the basic information column of each house contains multiple pieces of information, such as the house type, floor, building area, building structure, decoration and orientation; these are extracted using the re.findall() function to form new columns. The same method is applied to the community profile column to extract the building age as a new column. Regular expressions are also matched against the basic information column and the house characteristics column to determine whether the house has an elevator and a nearby subway; a match is recorded as 1 and no match as 0. Finally, we delete the unnecessary columns such as community name, house number, basic house information, house transaction information, community brief introduction and house characteristics. The processed data are shown in Table 1.

Table 1

The processed data

| District | Total price (10⁴ yuan/set) | Unit price (yuan/m²) | Bedroom | Hall | Floor | Elevator | Subway | Direction | Area (m²) | Building structure | Renovation condition | Building age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jinjiang | 95.0 | 18497 | 2 | 1 | middle | 1 | 1 | northeast | 51.36 | frame | other | 2008 |
| Jinjiang | 205.0 | 22410 | 2 | 2 | middle | 1 | 0 | southwest | 91.48 | frame | simple | 2005 |
| Jinjiang | 73.0 | 15049 | 2 | 1 | low | 0 | 0 | southwest | 48.51 | brick concrete | other | 1990 |
| Jinjiang | 260.0 | 35887 | 2 | 2 | high | 0 | 1 | southeast | 72.45 | brick concrete | simple | 1989 |
| Jinjiang | 105.0 | 12920 | 3 | 1 | middle | 0 | 1 | southeast | 81.27 | brick concrete | simple | 1988 |
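
The pre-processing steps (merging the regional tables, extracting fields with re.findall() and coding presence as 1 or 0) might look like the sketch below. The column names (`basic_info`, `features`), the text layout of the raw strings and the helper `first_match` are all assumptions made for illustration; the actual crawled text is not given in the paper.

```python
import re

import pandas as pd

# Two hypothetical per-region tables; column names and text layout are assumed.
region_a = pd.DataFrame({
    "basic_info": ["2 bedrooms 1 hall | middle floor | 51.36 m2 | frame"],
    "features": ["elevator; near subway"],
})
region_b = pd.DataFrame({
    "basic_info": ["3 bedrooms 1 hall | low floor | 81.27 m2 | brick concrete"],
    "features": ["near subway"],
})

# Merge the regional tables into one.
houses = pd.concat([region_a, region_b], ignore_index=True)


def first_match(pattern, text):
    """Return the first regex match in text, or None when nothing matches."""
    found = re.findall(pattern, text)
    return found[0] if found else None


# Extract individual fields from the combined text column.
houses["bedroom"] = houses["basic_info"].map(lambda s: int(first_match(r"(\d+) bedrooms", s)))
houses["area_m2"] = houses["basic_info"].map(lambda s: float(first_match(r"([\d.]+) m2", s)))

# Presence flags: 1 when the keyword appears, otherwise 0.
houses["elevator"] = houses["features"].str.contains("elevator").astype(int)
houses["subway"] = houses["features"].str.contains("subway").astype(int)

# Drop the raw text columns once the fields are extracted.
houses = houses.drop(columns=["basic_info", "features"])
print(houses)
```
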
Data Analysis
Descriptive Analysis

We use the pandas describe() function to perform descriptive analysis on the data set, and the result is shown in Table 2. It can be seen from the table that the average total price of second-hand houses in Chengdu is 1.49 million yuan, and the average area is 94 m². Dividing the former by the latter gives an average price of about 15,900 yuan/m², which is not much different from the average unit price in the table (about 15,700 yuan/m²). This suggests that the data contain no obvious outliers, and the next step of data visualisation analysis can be carried out.

Table 2

Descriptive analysis

| Statistic | Total price (10⁴ yuan/set) | Unit price (yuan/m²) | Area (m²) |
| --- | --- | --- | --- |
| count | 38363.000000 | 38363.000000 | 38363.000000 |
| mean | 148.570649 | 15703.469723 | 93.787843 |
| std | 98.264107 | 6679.299573 | 41.674855 |
| min | 13.500000 | 3403.000000 | 15.880000 |
| 25% | 86.000000 | 11062.000000 | 69.490000 |
| 50% | 123.000000 | 13862.000000 | 87.480000 |
| 75% | 180.000000 | 18952.000000 | 112.490000 |
| max | 3110.000000 | 39300.000000 | 980.000000 |
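
The consistency check between the mean total price, the mean area and the mean unit price can be reproduced with a few lines of arithmetic. The sketch below only reuses the three means reported by describe(); the variable names are illustrative.

```python
# Means copied from the descriptive statistics of the data set.
mean_total_price = 148.570649   # 10^4 yuan per house
mean_area = 93.787843           # m^2
mean_unit_price = 15703.469723  # yuan/m^2

# Ratio of the means, converted from 10^4 yuan to yuan.
implied_unit_price = mean_total_price * 1e4 / mean_area
relative_gap = abs(implied_unit_price - mean_unit_price) / mean_unit_price

print(f"implied unit price: {implied_unit_price:.0f} yuan/m^2")
print(f"relative gap: {relative_gap:.2%}")
```

With the unrounded means the ratio comes out slightly below the figure obtained from the rounded values, and the gap to the tabulated mean unit price stays under 1%. A ratio of means is not the same quantity as a mean of ratios, so exact agreement is not expected.
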
Visual Analysis

Housing prices are used as dependent variables. Also, area, bedroom, hall, floor, elevator, subway, orientation, building structure, decoration and construction age are used as independent variables. We study the influence of independent variables on dependent variables.
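
Each comparison in this section amounts to grouping the records by one independent variable and averaging the unit price within each group. A minimal sketch with pandas is given below, using only the five sample rows of the processed data (floor level and unit price); five rows are far too few to support any conclusion and serve only to show the mechanics.

```python
import pandas as pd

# Five sample rows: floor level and unit price in yuan/m^2.
sample = pd.DataFrame({
    "floor": ["middle", "middle", "low", "high", "middle"],
    "unit_price": [18497, 22410, 15049, 35887, 12920],
})

# Mean unit price per category, highest first.
mean_by_floor = (
    sample.groupby("floor")["unit_price"].mean().sort_values(ascending=False)
)
print(mean_by_floor)

# mean_by_floor.plot(kind="bar") would draw the corresponding bar chart with Matplotlib.
```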

Regional Impact on House Prices

The average price of houses in each area is shown in Figure 2. As can be seen from the figure, the top three areas by average unit price are the High-tech Zone, Tianfu New District and Jinjiang District. Tianfu New District has great potential for development; it is a national-level new district with the theme of a park city, which attracts a steady stream of people, and thus its second-hand housing market is also developing. The High-tech Zone is a bridge connecting the old city and Tianfu New District; new houses there are in short supply, and the second-hand housing market is also rising. There are many scenic spots near Jinjiang District, and life and entertainment are very convenient. Xinjin District is far from the main urban area and its current development is not prominent among the districts, and thus its housing prices are the lowest.

Fig. 2

Average price of houses in various regions

The Impact of the Number of Bedrooms on House Prices

The price per unit area for various numbers of bedrooms is shown in Figure 3. From the figure, we can see that, up to six bedrooms, the more bedrooms there are, the higher the house price. The price is highest at six bedrooms and falls at seven. Most houses have fewer than six bedrooms; those that exceed this number are generally villas or old houses, and old houses are cheaper. In the data collected here, most of the seven-bedroom houses are old houses, which explains why the price falls even though the number of bedrooms increases.

Fig. 3

Price per unit area of various number of bedrooms

The Impact of the Number of Halls on House Prices

The price per unit area for various numbers of halls is shown in Figure 4. Houses with one or two halls have the highest price per unit area, and houses with five halls have the lowest; the other hall counts have less impact on housing prices. The vast majority of second-hand houses on the market have one or two halls; layouts with more halls are less acceptable to buyers.

Fig. 4

Price per unit area of various number of halls

The Impact of Floors on House Prices

The floor level has a certain impact on house prices, but the impact is not particularly significant. It is generally thought that houses on middle floors are the most expensive, but the data here show that the price of houses on low floors is the highest. The price per unit area of various floors is shown in Figure 5.

Fig. 5

Price per unit area of various floors

The Impact of Orientation on House Prices

Figure 6 shows the number of houses with various orientations. It can be seen from the figure that houses facing south and southeast are the most numerous: they constitute more than half of the market. Because China is in the northern hemisphere, a house facing south receives more sunlight and for a longer time.

Fig. 6

Number of houses with various orientations

The housing prices per unit area of various orientations are shown in Figure 7. The housing prices for houses facing north and west are the most expensive, and the difference in housing prices between houses facing other directions is not obvious.

Fig. 7

House prices per unit area in various orientations

The Impact of Decoration on House Prices

The decoration of second-hand houses can be categorised mainly into simple decoration, hard decoration and rough decoration. Hard-decorated houses have the greatest impact on housing prices, and there is also a certain difference between rough and simple decoration. The price per unit area of various decoration conditions is shown in Figure 8.

Fig. 8

House prices per unit area in various decoration conditions

The Impact of Building Structure on House Prices

The housing prices per unit area of various building structures are shown in Figure 9. Steel-concrete and frame structure houses are the most expensive. Steel-concrete (reinforced concrete) structures have good seismic performance and durability, but their cost is higher; many high-rise and century-old buildings use this structure. Frame structure houses are also very good in terms of seismic and impact resistance, and their cost is somewhat lower. Therefore, the housing on the market consists largely of steel-concrete and frame structures. The remaining building structures have a small market share and a small impact on housing prices.

Fig. 9

House prices per unit area of various building structures

The Impact of Elevators and Subways on House Prices

Whether a house has a nearby subway and an elevator affects the price per unit area, as shown in Figures 10 and 11. As can be seen from the figures, elevators and subways have a great impact on housing prices. Elevators and subways provide great convenience to people's daily lives and transportation. People now take these two factors into consideration when buying a house, and the same is true for the second-hand housing market.

Fig. 10

The impact of proximity to the subway on housing prices

Fig. 11

Whether there are elevators affecting house prices

The Impact of Construction Age on House Prices

The influence of construction age on the price per unit area is shown in Figure 12. From the figure, we can see that since 2000, housing prices have shown a basically linear upward trend, except for 2016–2018, when prices were on a downward trend. During this period, there was a steady supply of new housing, and the price-limited housing under policy control was also cheaper, so the demand for second-hand housing was small and its price fell. Since 2018, housing prices have continued to rise, and more and more people buy houses by lottery. Those who are anxious to buy a house but do not have a sufficient deposit turn to second-hand houses. As a result, market demand for second-hand housing has increased, and housing prices have continued to rise.

Fig. 12

House prices per unit area in various years

The Impact of Area on House Prices

The relationship between the area and the total price of second-hand houses is shown in Figure 13. House area and total price are positively correlated: the larger the area, the higher the total price. The data points are relatively concentrated in the region with a housing area of 0–300 m² and a total price of 0–10 million yuan.

Fig. 13

The relationship between area and total price

Predictive Analysis
Random Forest

RF applies the idea of ensemble learning: it builds a forest of multiple decision trees at random, and each decision tree is built independently of the others, so that the trees are uncorrelated [11, 12]. These decision trees are all built for the same task, and their results are finally averaged.

The key to RF is randomness [13, 14]. If every tree were the same model, the averaged result would be the same as that of a single tree, so the trees need to be diversified. The first source of randomness is the data set, for which sampling with replacement is adopted. For example, the first decision tree randomly selects 80% of the data as input data, and the second decision tree also randomly selects 80% of the data as input data, so that the input data of each tree is different. The second is the randomisation of features: for example, each tree randomly extracts 60% of the features. The flow of the RF model is shown in Figure 14.

Fig. 14

RF model process. RF, Random Forest.
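
The two sources of randomness can be illustrated with NumPy. The sketch below draws, for one tree, 80% of the row indices with replacement and 60% of the feature indices without replacement, using the example fractions from the text; the function name `draw_tree_inputs` and its parameters are hypothetical. Library implementations may apply feature sampling at each split rather than once per tree.

```python
import numpy as np


def draw_tree_inputs(n_rows, n_features, row_frac=0.8, feat_frac=0.6, seed=0):
    """Return the row and feature indices that one tree would be trained on."""
    rng = np.random.default_rng(seed)
    # Rows: sampling with replacement, so some rows repeat and others are left out.
    rows = rng.choice(n_rows, size=int(row_frac * n_rows), replace=True)
    # Features: a random subset without repetition.
    n_keep = max(1, int(feat_frac * n_features))
    feats = np.sort(rng.choice(n_features, size=n_keep, replace=False))
    return rows, feats


# Different seeds stand in for different trees.
rows_a, feats_a = draw_tree_inputs(1000, 11, seed=1)
rows_b, feats_b = draw_tree_inputs(1000, 11, seed=2)
print(len(rows_a), feats_a)
print(len(rows_b), feats_b)
```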

This paper uses the function train_test_split() to divide the data set into a training set and a test set at a ratio of 3:1. The training set data are used to train the RF model, which then predicts the unit housing price in the test set. The evaluation standard of this model is the mean absolute percentage error (MAPE), and the calculation formula is as follows:

$$\mathrm{MAPE} = \sum_{i=1}^{N} \left| \frac{y_i - \mathrm{predicted}_i}{y_i} \right| \times \frac{100\%}{N}$$

where $y_i$ represents the actual unit price, $\mathrm{predicted}_i$ represents the predicted unit price and $N$ is the number of samples. The smaller the value of MAPE, the better the prediction effect of the model.
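
The MAPE formula translates directly into a few lines of NumPy. This is a minimal sketch; the function name `mape` and the two example prices are illustrative, and the measure is undefined whenever an actual value is zero.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


# Two hypothetical unit prices (yuan/m^2), each predicted with a 5% error.
example_error = mape([20000, 10000], [19000, 10500])
print(example_error)
```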

Feature Engineering

Feature engineering is the process of transforming the original data into model training data, and this process involves extracting features from the original data to the greatest extent. Applying these features to the prediction model can improve the prediction accuracy of the model [14]. The better the feature selection and preparation, the better the results obtained.

Values of Characteristic Variables

According to the results of the previous visual analysis, the values of each independent variable are ranked by their degree of influence on the housing price and re-encoded as numbers: the higher the influence, the greater the value. The processed characteristic variables are shown in Table 3.

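Table 3

Characteristic variables after processing

| District | Total price (10⁴ yuan/set) | Unit price (yuan/m²) | Bedroom | Hall | Floor | Elevator | Subway | Direction | Area (m²) | Building structure | Renovation condition | Building age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11 | 95.0 | 18497 | 2 | 1 | 2 | 1 | 1 | 5 | 51.36 | frame | other | 13 |
| 11 | 205.0 | 22410 | 2 | 2 | 2 | 1 | 0 | 2 | 91.48 | frame | simple | 16 |
| 11 | 73.0 | 15049 | 2 | 1 | 3 | 0 | 0 | 2 | 48.51 | brick concrete | other | 31 |
| 11 | 260.0 | 35887 | 2 | 2 | 1 | 0 | 1 | 1 | 72.45 | brick concrete | simple | 32 |
| 11 | 105.0 | 12920 | 3 | 1 | 2 | 0 | 1 | 1 | 81.27 | brick concrete | simple | 33 |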
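
One way to realise this kind of encoding is to rank the categories of a variable by their mean unit price and map each category to its rank. The sketch below is an assumption about how such a mapping could be built, not the authors' procedure; the function `encode_by_influence` and the toy prices are made up so that low floors come out highest, in line with the visual analysis.

```python
import pandas as pd


def encode_by_influence(df, column, target):
    """Map each category to its rank by mean target value (1 = lowest mean)."""
    means = df.groupby(column)[target].mean()
    ranks = means.rank(method="dense").astype(int)
    return df[column].map(ranks)


# Toy data with made-up unit prices (yuan/m^2).
toy = pd.DataFrame({
    "floor": ["low", "low", "middle", "middle", "high"],
    "unit_price": [16000, 17000, 15000, 15500, 14000],
})
toy["floor_code"] = encode_by_influence(toy, "floor", "unit_price")
print(toy)
```
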
Feature Selection

After feature processing, feature selection is required [15]. The 11 characteristic variables were selected manually based on experience, and they do not all necessarily improve the accuracy of model prediction. In fact, some variables have little effect on model prediction but increase the modelling time. This paper therefore selects feature variables by calculating feature importance.

The importance of a feature is obtained by computing its contribution in each decision tree of the RF, averaging these contributions and then comparing the averages across features. The contribution is measured with the Gini index [16], which is calculated as follows:

$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^2$$

where $K$ is the number of categories and $p_{mk}$ is the proportion of category $k$ in node $m$.

The change in the Gini index of feature $x_j$ before and after branching at node $m$ is

$$VIM_{jm} = GI_m - GI_A - GI_B$$

where $GI_A$ and $GI_B$ represent the Gini indices of the two child nodes produced by the branch at node $m$.

Let $M$ be the set of nodes at which feature $x_j$ is used in decision tree $i$, and let there be $N$ decision trees in the RF; then, the importance of $x_j$ is

$$VIM_j = \frac{1}{N} \sum_{i=1}^{N} \sum_{m \in M} VIM_{jm}$$
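
The three formulas can be followed step by step in plain Python. The sketch below implements them exactly as written in the text (without weighting the child nodes by their sample counts); the function names and the example numbers are illustrative only.

```python
def gini_index(proportions):
    """GI_m = 1 - sum of squared category proportions in a node."""
    return 1.0 - sum(p * p for p in proportions)


def node_importance(gi_parent, gi_child_a, gi_child_b):
    """VIM_jm: change in the Gini index when node m is split into A and B."""
    return gi_parent - gi_child_a - gi_child_b


def feature_importance(per_tree_node_importances):
    """VIM_j: sum VIM_jm over a feature's nodes in each tree, then average over trees."""
    n_trees = len(per_tree_node_importances)
    return sum(sum(nodes) for nodes in per_tree_node_importances) / n_trees


# A node with two equally frequent categories, split into two pure children.
parent = gini_index([0.5, 0.5])
split_gain = node_importance(parent, gini_index([1.0, 0.0]), gini_index([0.0, 1.0]))

# Hypothetical VIM_jm values of one feature in a forest of two trees.
vim_j = feature_importance([[0.5, 0.1], [0.3]])
print(parent, split_gain, vim_j)
```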

We use the feature_importances_ attribute of the fitted scikit-learn RF model to find the importance of each feature variable, as shown in Figure 15.

Fig. 15

The importance of each feature variable
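
Reading the importances from a fitted model might look like the sketch below. It runs on synthetic data, since the housing data set is not reproduced here; the feature names `f0` to `f3`, the coefficients and the number of trees are arbitrary assumptions, and the 3:1 split corresponds to `test_size=0.25`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: f0 matters most, f2 and f3 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 15000 + 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(scale=100, size=400)
feature_names = ["f0", "f1", "f2", "f3"]

# A 3:1 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Importances are normalised to sum to 1; list them from largest to smallest.
importances = model.feature_importances_
for name, score in sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```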

Feature Combination

Based on the calculation results of the above feature importance, three feature selection schemes are set up next:

(1) Select all characteristic variables;

(2) Select the top six feature variables of importance;

(3) Select feature variables whose feature importance exceeds 94% after accumulation.

We use the cumsum() function to calculate the accumulated value. It can be seen from Figure 16 that the feature variables whose accumulated importance exceeds 94% are district, elevator, area, building age, bedroom, direction and building structure.

Fig. 16

The cumulative value of the importance of feature variables
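
Selecting features by accumulated importance can be sketched as follows: sort the importances in descending order, accumulate them and keep features until the running total first reaches the threshold. The function name and the importance values in the example are hypothetical and are not the values behind Figure 16.

```python
import numpy as np


def select_by_cumulative_importance(names, importances, threshold=0.94):
    """Keep the most important features until their cumulative share reaches the threshold."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]        # indices from most to least important
    cumulative = np.cumsum(importances[order])   # running total of importance
    keep = int(np.searchsorted(cumulative, threshold)) + 1
    keep = min(keep, len(order))
    return [names[i] for i in order[:keep]]


# Hypothetical importances for seven features.
names = ["a", "b", "c", "d", "e", "f", "g"]
scores = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]
selected = select_by_cumulative_importance(names, scores)
print(selected)
```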

The modelling time of the three schemes on the training set and their prediction errors on the test set are shown in Table 4. It can be seen from Table 4 that Scheme 3 greatly shortens the modelling time and also reduces the error. Therefore, we can conclude that the most critical factors affecting housing prices are district, elevator, area, building age, bedroom, direction and building structure.

Table 4

Comparison of three schemes

| Adopted scheme | Time (s) | Predicted error values (%) |
| --- | --- | --- |
| Scheme 1 | 37.37 | 4.83 |
| Scheme 2 | 37.24 | 4.84 |
| Scheme 3 | 21.91 | 2.62 |

Model Tuning

In the RF model, there are a total of 17 parameters. Optimising the parameters can improve the accuracy of the model. Here, we select several parameters that have a great impact on the model for tuning: n_estimators (the number of trees), max_features, max_depth, min_samples_split (the minimum number of samples required for node splitting) and min_samples_leaf (the minimum number of samples in a leaf node). For time and performance reasons, we limit the value range of these parameters in this paper. The value range is shown in Table 5.

Table 5

Parameter value range

| Parameter | Value |
| --- | --- |
| n_estimators | 200–2000 |
| max_features | 'auto', 'sqrt' |
| max_depth | 10–20 |
| min_samples_split | 2, 5, 10 |
| min_samples_leaf | 1, 2, 4 |

The experiment uses the grid search algorithm (GridSearchCV) in the sklearn library to tune the parameters. Grid search builds a model for every possible parameter combination, evaluates each combination through cross-validation and returns the combination with the best score. The optimal parameter combination obtained by the grid search algorithm is shown in Table 6.

Table 6

Optimal parameter combination

| Parameter | Value |
| --- | --- |
| n_estimators | 1400 |
| max_features | 'auto' |
| max_depth | 20 |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
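
A grid search over the five tuned parameters might be set up as in the sketch below. Several points are assumptions: the data are synthetic, the grid is much smaller than the ranges used in the paper so that it runs in seconds, the scoring metric and the three-fold cross-validation are guesses, and `1.0` is used for max_features because recent scikit-learn releases no longer accept 'auto' (for a regressor, 'auto' meant using all features).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data (strictly positive targets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 15000 + 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(scale=100, size=200)

# Reduced grid: same parameters as in the text, far fewer values.
param_grid = {
    "n_estimators": [20, 40],
    "max_features": ["sqrt", 1.0],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_percentage_error",  # assumed metric, mirrors MAPE
    cv=3,                                          # assumed number of folds
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAPE as a fraction, not a percentage
```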

The characteristic variable combination of Scheme 3 in Section 5.2 is used as the input of the RF model, and the optimal parameter combination and the default parameter combination are, respectively, used for comparative experiments. The error results of prediction on the test set are shown in Table 7.

Table 7

Comparison of two parameter combinations

| Adopted parameters | Predicted error values (%) |
| --- | --- |
| Default parameter combination | 2.62 |
| Optimal parameter combination | 2.16 |

It can be seen from the results that the optimal parameter combination further improves the prediction effect of the model.

House Price Forecast

The tuned RF model is used to predict second-hand housing prices in Chengdu. For example, suppose a user wants to buy a second-hand house built in 2015, with an area of 95 m², with an elevator and a nearby subway, located in Jinjiang District, with three bedrooms and two halls, west-facing, of steel-concrete structure, on a middle floor and hard-decorated. The model predicts a unit price of 27,065.8 yuan/m² and a total price of 2,571,300 yuan. Suppose the user has rigid requirements for the area, house type, elevator and subway, but the budget is not enough. The predictive analysis shows that building structure, orientation and building age have a considerable impact on the housing price, and the visual analysis shows that the attributes of frame structure, south-facing and built in 2018 correspond to lower average unit prices. The user can therefore adjust these attributes according to their own situation and choose the most cost-effective second-hand house.
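
As a check on the figures in this example, the total price follows from the predicted unit price and the area:

$$27{,}065.8\ \text{yuan/m}^2 \times 95\ \text{m}^2 = 2{,}571{,}251\ \text{yuan} \approx 2{,}571{,}300\ \text{yuan}$$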

Conclusions

At present, with rapid economic development, housing prices are getting higher and higher. Many factors affect housing prices, and they can be categorised into intrinsic factors and external factors. External factors such as population, policy and other macro conditions cannot be judged or controlled, and thus we can only analyse the aspects that can be known. In this paper, we take Chengdu second-hand housing as an example. We use Python to crawl the relevant housing information from Beike Network. After cleaning and filtering, the data are visually analysed and an RF model is established for predictive analysis. The most critical factors affecting house prices are identified through feature importance: district, elevator, area, building age, bedroom, orientation and building structure. The use of grid search improves the accuracy of model prediction. Finally, the model is used to predict the prices of second-hand houses; the prediction results are more accurate and have good application value.

eISSN:
2444-8656
Language:
English
Publication timeframe:
Volume Open
Journal subjects:
Biology, other, Mathematics, Applied Mathematics, General, Physics