With the rapid development of information technology, society's demand for information has grown enormously. Modern society has become, in effect, a vast collection of information, and much of that information takes the form of data. Most of this data is hidden in the network, and it is complex and diverse. Extracting such data with traditional processing methods, and then analyzing it to obtain useful information, is very difficult.
This project takes the HOME LINK (Lianjia) property information platform as the object of crawler research. By crawling second-hand housing listings in different districts, including their prices and layouts, and then analyzing the data by area, house type, and price, useful information is extracted. Through this practical process, some conclusions are drawn about web crawling and basic data analysis methods, and a summary is given.
In the system design, the main technologies used are Python, Django, Scrapy, and WordCloud. Scrapy is an open-source web crawler framework written in Python. It was originally designed for web scraping, but it can also be used to build data-extraction APIs or general-purpose crawlers. The framework provides a series of efficient and powerful components, so developers can quickly build a crawler program, and even complex applications can be assembled through various plug-ins and middleware. The basic principles of the Scrapy framework are shown in Figure 1.
Basic principles of the Scrapy framework
BeautifulSoup is a library for parsing HTML or XML text. It tolerates malformed markup, builds a parse tree from the input, and provides an interface that lets developers easily navigate, search, and modify that tree. Compared with other HTML/XML parsing tools, BeautifulSoup has the advantages of simplicity, high error tolerance, and developer friendliness.
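As a small illustration of this tolerance, the sketch below (with made-up HTML, not the real Lianjia markup) parses a snippet whose second `</li>` tag is missing and still navigates and searches the tree:

```python
from bs4 import BeautifulSoup

# A snippet with a missing </li>; BeautifulSoup still builds a usable tree.
html = '<ul><li><a href="/h/1">Two-bedroom</a><li><a href="/h/2">Three-bedroom</a></ul>'
soup = BeautifulSoup(html, "html.parser")

# Searching: find_all returns every matching tag despite the broken markup.
links = [a["href"] for a in soup.find_all("a")]
titles = [a.get_text() for a in soup.find_all("a")]
print(links)   # ['/h/1', '/h/2']
print(titles)  # ['Two-bedroom', 'Three-bedroom']
```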
WordCloud is a third-party library that treats a word cloud as an object. It draws the cloud using the frequency of words in a text as a parameter, and the size, color, and shape of the cloud can be configured.
The Scrapy framework is mainly made up of four parts: items, spiders, pipelines, and middlewares. Following this structure, the housing-information crawler is divided into four modules: the item definition module, the crawling module, the configuration module, and the data processing module. In items.py, the item entity to be crawled is defined; in this program it is LianjiaItem, which includes the price, orientation, location, name, floor, area, and layout of each house.
In Python, Django is a framework based on the MVC pattern. In Django, however, the framework itself handles the controller part that accepts user input, so Django focuses on models, templates, and views, known as the MTV pattern. Their respective responsibilities are as follows:
Django responsibilities table

| Layer | Responsibilities |
|---|---|
| Model, the data access layer | Handles everything related to the data: how it is accessed, how it is validated, which behaviors it involves, and the relationships between the data. |
| Template, the presentation layer | Handles presentation-related decisions: how things are displayed on a page or in another type of document. |
| View, the business logic layer | Contains the logic that accesses the model and fetches the appropriate template; the bridge between model and template. |
The spider defines the crawling logic and the parsing rules for web content, and is responsible for parsing responses and generating both result items and new requests. The spider is the key point of this design: it defines how the item entities are extracted, and both the initial dynamic page link and the static page crawl are defined in this file. The key code is as follows:
The second-hand housing listing pages on the HOME LINK platform publish each listing's information (price, layout, floor, location, and so on) on the web page. When a listing page first loads, only 30 houses are shown as candidates. Selenium can simulate a user's drop-down (scroll) operation on the page, but each scroll loads only 30 more entries on top of the existing ones. The crawler therefore first reads the total number of listings, divides it by 30 and rounds up, performs that many scrolls so that every entry is loaded, and closes the browser after reading the page.
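The scroll logic above can be sketched as follows. The pure helper computes the number of scrolls; the Selenium part assumes a `driver` created elsewhere (for example `webdriver.Chrome()`) and a page that lazy-loads 30 entries per scroll:

```python
import math
import time


def scrolls_needed(total_entries, per_scroll=30):
    """Each scroll loads 30 more listings, so round total/30 upwards."""
    return math.ceil(total_entries / per_scroll)


def load_all_listings(driver, total_entries):
    """Scroll a Selenium-driven page until every listing is loaded."""
    for _ in range(scrolls_needed(total_entries)):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # give the page time to load the next 30 entries
    html = driver.page_source
    driver.quit()  # close the browser after reading
    return html
```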
The crawling process is shown in the figure:
Crawling process
Data analysis refers to the process of examining large amounts of collected data with appropriate statistical methods, extracting useful information, and forming conclusions. Often the raw data must be further processed before it becomes information that is useful to people. Data analysis helps people make judgments so that they can take appropriate action. Its mathematical foundations were well established by the early 20th century, but only with the advent of computers did data analysis become practical and widespread; data analysis is therefore a combination of mathematics and computer science.
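A tiny example of the kind of analysis meant here is computing an average price per district with pandas. The district names and numbers below are fabricated placeholders, not results from the actual crawl:

```python
import pandas as pd

# Hypothetical listings: district and unit price (placeholder values).
df = pd.DataFrame({
    "district": ["Chaoyang", "Haidian", "Chaoyang", "Fengtai"],
    "price": [650, 820, 700, 450],
})

# Group by district and take the mean: a basic statistical summary.
avg = df.groupby("district")["price"].mean()
print(avg)
```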
Data visualization is one of the most representative parts of data analysis: it makes the trends in data visible to the human eye. Depending on the need, there are many approaches, ranging from training AI models to learn patterns in data and make predictions, down to basic functions in an Excel sheet; all of these can be part of the data analysis process.
a) Using ECharts to display data: to make the data look orderly, this paper adopts ECharts to visualize the data and make it more concise and objective.
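ECharts itself consumes a JSON "option" object. A sketch that builds such an object for a bar chart in plain Python follows; the field names follow the ECharts option schema, and the districts and numbers are placeholders:

```python
import json

# Hypothetical per-district average prices (placeholder values).
districts = ["Chaoyang", "Haidian", "Fengtai"]
avg_prices = [650, 820, 450]

# Build the ECharts "option" object for a simple bar chart.
option = {
    "title": {"text": "Average price per district"},
    "xAxis": {"type": "category", "data": districts},
    "yAxis": {"type": "value"},
    "series": [{"type": "bar", "data": avg_prices}],
}

# The front end passes this JSON to echarts' setOption().
option_json = json.dumps(option)
print(option_json)
```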
The data must be representative and the sample large enough, otherwise the results will not be convincing. A large data set must also be clean: enough time and effort should be spent on preprocessing the data, or problems may occur later.
Full information display
Pie Chart of rent per district
Pie chart of rental housing by orientation
Line chart of the average rent price in Beijing
Bar chart of renting in Beijing
Generated word cloud presentation
Through the research and analysis of HOME LINK's second-hand housing data for Beijing, this paper has studied how to build a Scrapy-based crawler to collect listing information from the HOME LINK website, how the Scrapy framework is structured, and how a listing's orientation and location affect housing prices in Beijing.