With the rapid development of information technology, society's demand for information has grown enormously. Modern society has become, in effect, a vast collection of information, and much of that information takes the form of data. Most of this data is hidden in the network, and it is complex and diverse. Extracting such data with traditional processing methods, and then analyzing it to obtain useful information, is very difficult.
This project takes the HOME LINK (Lianjia) property information platform as the object of crawler research. By crawling second-hand housing listings in different districts, including their prices and layouts, and then analyzing the data by area, house type, and price, useful information is extracted. Through this practical process, some conclusions are drawn about web crawling and basic data analysis methods, and a summary is given.
In the system design, the main technologies used are Python, Django, Scrapy, and WordCloud. Scrapy is an open-source web crawler framework written in Python. It was originally designed for web scraping, but it can also be used to build data-extraction APIs or general-purpose crawlers. The framework provides a series of efficient and powerful components, so developers can quickly build a crawler program, and even complex applications can be assembled through various plug-ins and middleware. The basic principles of the Scrapy framework are shown in Figure 1.
Basic principles of the Scrapy framework
BeautifulSoup is a library for parsing HTML or XML text. It tolerates malformed markup, builds a parse tree from the input, and provides an interface that lets developers easily navigate, search, and modify that tree. Compared with other HTML/XML parsing tools, BeautifulSoup has the advantages of simplicity, high error tolerance, and developer friendliness.
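As a small illustration of this tolerance, the sketch below (with made-up HTML, not the real Lianjia markup) parses a snippet whose second `</li>` tag is missing and still navigates and searches the tree:

```python
from bs4 import BeautifulSoup

# A snippet with a missing </li>; BeautifulSoup still builds a usable tree.
html = '<ul><li><a href="/h/1">Two-bedroom</a><li><a href="/h/2">Three-bedroom</a></ul>'
soup = BeautifulSoup(html, "html.parser")

# Searching: find_all returns every matching tag despite the broken markup.
links = [a["href"] for a in soup.find_all("a")]
titles = [a.get_text() for a in soup.find_all("a")]
print(links)   # ['/h/1', '/h/2']
print(titles)  # ['Two-bedroom', 'Three-bedroom']
```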
WordCloud is a third-party library that treats a word cloud as an object. It draws the cloud using the frequency of words in a text as a parameter, and the size, color, and shape of the cloud can be configured.
The Scrapy framework is mainly made up of four parts: items, spiders, pipelines, and middlewares. Following this structure, the housing-information crawler is divided into four modules: the item definition module, the crawling module, the configuration module, and the data processing module. In items.py, the item entity to be crawled is defined; in this program it is LianjiaItem, which includes the price, orientation, location, name, floor, area, and layout of each house.
In Python, Django is a framework based on the MVC pattern. In Django, however, the framework itself handles the controller part that accepts user input, so Django focuses on models, templates, and views, known as the MTV pattern. Their respective responsibilities are as follows:
Django responsibilities table

| Layer | Responsibilities |
|---|---|
| Model, the data access layer | Handles everything related to the data: how it is accessed, how it is validated, which behaviors it involves, and the relationships between the data. |
| Template, the presentation layer | Handles presentation-related decisions: how things are displayed on a page or in another type of document. |
| View, the business logic layer | Contains the logic that accesses the model and fetches the appropriate template; the bridge between model and template. |
The spider defines the crawling logic and the parsing rules for web content, and is responsible for parsing responses and generating both result items and new requests. The spider is the key point of this design: it defines how the item entities are extracted, and both the initial dynamic page link and the static page crawl are defined in this file. The key code is as follows:
The second-hand housing listing pages on the HOME LINK platform publish each listing's information (price, layout, floor, location, and so on) on the web page. When a listing page first loads, only 30 houses are shown as candidates. Selenium can simulate a user's drop-down (scroll) operation on the page, but each scroll loads only 30 more entries on top of the existing ones. The crawler therefore first reads the total number of listings, divides it by 30 and rounds up, performs that many scrolls so that every entry is loaded, and closes the browser after reading the page.
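The scroll logic above can be sketched as follows. The pure helper computes the number of scrolls; the Selenium part assumes a `driver` created elsewhere (for example `webdriver.Chrome()`) and a page that lazy-loads 30 entries per scroll:

```python
import math
import time


def scrolls_needed(total_entries, per_scroll=30):
    """Each scroll loads 30 more listings, so round total/30 upwards."""
    return math.ceil(total_entries / per_scroll)


def load_all_listings(driver, total_entries):
    """Scroll a Selenium-driven page until every listing is loaded."""
    for _ in range(scrolls_needed(total_entries)):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # give the page time to load the next 30 entries
    html = driver.page_source
    driver.quit()  # close the browser after reading
    return html
```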
The crawling process is shown in the figure:
Crawling process
Data analysis refers to the process of examining large amounts of collected data with appropriate statistical methods, extracting useful information, and forming conclusions. Often the raw data must be further processed before it becomes information that is useful to people. Data analysis helps people make judgments so that they can take appropriate action. Its mathematical foundations were well established by the early 20th century, but only with the advent of computers did data analysis become practical and widespread; data analysis is therefore a combination of mathematics and computer science.
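A tiny example of the kind of analysis meant here is computing an average price per district with pandas. The district names and numbers below are fabricated placeholders, not results from the actual crawl:

```python
import pandas as pd

# Hypothetical listings: district and unit price (placeholder values).
df = pd.DataFrame({
    "district": ["Chaoyang", "Haidian", "Chaoyang", "Fengtai"],
    "price": [650, 820, 700, 450],
})

# Group by district and take the mean: a basic statistical summary.
avg = df.groupby("district")["price"].mean()
print(avg)
```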
Data visualization is one of the most representative parts of data analysis: it makes the trends in data visible to the human eye. Depending on the need, there are many approaches, ranging from training AI models to learn patterns in data and make predictions, down to basic functions in an Excel sheet; all of these can be part of the data analysis process.
a) Using ECharts to display data: to make the data look orderly, this paper adopts ECharts to visualize the data and make it more concise and objective.
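ECharts itself consumes a JSON "option" object. A sketch that builds such an object for a bar chart in plain Python follows; the field names follow the ECharts option schema, and the districts and numbers are placeholders:

```python
import json

# Hypothetical per-district average prices (placeholder values).
districts = ["Chaoyang", "Haidian", "Fengtai"]
avg_prices = [650, 820, 450]

# Build the ECharts "option" object for a simple bar chart.
option = {
    "title": {"text": "Average price per district"},
    "xAxis": {"type": "category", "data": districts},
    "yAxis": {"type": "value"},
    "series": [{"type": "bar", "data": avg_prices}],
}

# The front end passes this JSON to echarts' setOption().
option_json = json.dumps(option)
print(option_json)
```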
The data must be representative and the sample large enough, otherwise the results will not be convincing. A large data set must also be clean: enough time and effort should be spent on preprocessing the data, or problems may occur later.
Full information display
Pie Chart of rent per district
Pie chart of rental housing by orientation
Line chart of the average rent price in Beijing
Bar chart of renting in Beijing
Generated word cloud presentation
Through the research and analysis of HOME LINK's second-hand housing data for Beijing, this paper has studied how to build a Scrapy-based crawler to collect listing information from the HOME LINK website, how the Scrapy framework is structured, and how a listing's orientation and location affect housing prices in Beijing.