Modern machine learning techniques for computer vision, such as deep learning, have provided unprecedented opportunities for academic research and industrial applications. Examples include using satellite images for deforestation monitoring in South America (Finer et al., 2018) or poverty estimation in Africa (Jean et al., 2016), prediction of skin cancer from skin lesion images (Esteva et al., 2017), and automatic detection of pulmonary tuberculosis from chest radiographs (Lakhani & Sundaram, 2017).
One of the resources recently leveraged for research is Google Street View, a Google platform for which images of buildings are taken by cars equipped with a set of cameras (Anguelov et al., 2010). Researchers have recently used this data source to answer questions in social science, for example to estimate the demographic makeup of neighbourhoods across the US (Gebru et al., 2017), city-level travel patterns in Great Britain (Goel et al., 2018) or crime rates in Brazil (Andersson, Birck, & Araujo, 2017).
Our work explores whether Google Street View images of houses are predictive of their residents’ risk of car accident. So far, researchers have looked for determinants of car accidents among characteristics more directly related to driving, for example, driving experience (McCartt, Shabanova, & Leaf, 2003), drunk driving (Bingham, Shope, & Zhu, 2008) and using cell phones while driving (Strayer, Drews, & Crouch, 2003). There are also studies of the road and environmental conditions influencing car accidents (Karlaftis & Golias, 2002; Shankar, Mannering, & Barfield, 1995). We are not aware of any study exploring a direct link between housing conditions and car accident risk; however, a handful of studies have shown that neighbourhood and house characteristics are correlated with health risk behaviours (Spilkova, Dzúrova, & Pitonak, 2014), which in turn correlate with driving behaviours (Rolison, Hanoch, Wood, & Liu, 2014).
The correlation we aim to verify in this article might be particularly interesting to insurers. It is essential for insurers to estimate the risk of each client accurately and to price accordingly in order to avoid adverse selection (Gogol, 1993). For this purpose, they search for systematic and time-invariant client characteristics that are observable at the moment of issuing a policy and correlate with the number of claims incurred during the insurance cover period. For example, the classical motor insurance risk factors identified worldwide are the age of the driver, the characteristics of the car, the occurrence of car accidents in the past and geography (Werner & Modlin, 2016, p. 159). For this reason, insurers tend to ask for these and other details before making a motor insurance offer.
Although insurers often collect address information from the client, they typically use only the zip-code for risk modelling and pricing purposes. Claims data aggregated to zip-codes are still too volatile and require spatial smoothing (Taylor, 2001) and further aggregation to larger geographical zones (Yao, 2008). This commonly used methodology rests on the assumption that neighbours drive in a similar manner. In this article, we challenge this assumption and show that volatility can be explained at the granularity of individual addresses. Moreover, we show that this information can be extracted from publicly available Google Street View images (Figure 1).
Studying this insurance problem led to the following sociological and methodological findings: (1) features of the house correlate with the car accident risk of its resident, (2) compared to other uses of Google Street View for research, our variables are sourced from the individual address rather than aggregated by zip-code or district, which allows for new sociological discoveries at a very granular level, (3) variables extracted from the address (the image of a house) can be used in insurance and other industries, notably for price discrimination, (4) modern data collection and computational techniques, which allow for unprecedented exploitation of personal data, can outpace the development of legislation and raise privacy threats.
We examine a motor insurance dataset of 20,000 records, a random sample of an insurer’s portfolio collected in Poland from January 2012 to December 2015. Each record represents the characteristics of an insurance policy covering motor third party liability (MTPL), including the address of the policyholder, the risk exposure (defined as the fraction of a year in which the policy was active from 2013 to 2015) and the corresponding count of incurred property damage claims, defined as events where any third-party property (car, motorbike, bicycle, as well as fence, house, tree, etc.) was damaged through the partial or full fault of the driver of the insured car. The insurer also provided us with the expected frequency of property damage claims for those policies, estimated by its current best-in-class risk model, which includes zoning based on the client’s zip-code.
We collected Google Satellite View and Google Street View images for the addresses provided in the database. Six experts annotated the following features of the houses visible in the images: their type, age and condition, the estimated wealth of their residents, and the type and density of other buildings in the neighbourhood (Figure 2). Four out of six annotators gave moderately consistent answers for the common subsample of 500 addresses: Fleiss’ kappa statistics indicate mostly moderate agreement among them (Table 1). These four annotators continued annotating the remaining 19,371 addresses (we removed 129 addresses from the scope of this study as they were either foreign or could not be found by Google Maps), but this time each annotator was given a separate, randomly selected set of addresses. We compared the distributions of the collected annotations and finally applied small corrections to match the mean and standard deviation among all four annotators.
Table 1. Statistics for the seven newly created variables: original granularity, inter-rater reliability of the 4 selected annotators on the common set of 500 observations, and significance in our risk model after applying necessary simplifications.
Variable | Original granularity | Fleiss’ kappa | Inter-rater agreement | Levels in risk model | p-value
---|---|---|---|---|---
Neighbourhood type | Seven types, multi-choice | 0.52 | Moderate agreement | 2 | 0.01
Building density | Scale 1–5 | 0.50 | Moderate agreement | Not significant |
Street View quality | Good/bad/missing | 0.79 | Substantial agreement | 2 | 0.02
House type | Five types, single-choice | 0.69 | Substantial agreement | 2 | 0.01
House age | Scale 1–3 | 0.51 | Moderate agreement | 2 | 0.03
House condition | Scale 1–3 | 0.54 | Moderate agreement | 2 | 0.04
Wealth of residents | Scale 1–10 | 0.32 | Fair agreement | Not significant |
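As an illustration of how kappa values of the kind reported in Table 1 can be obtained, the following is a minimal sketch of Fleiss' kappa for single-choice ratings. It is not the authors' code: the function names and the toy ratings are hypothetical, the multi-choice neighbourhood variable would need extra handling, and the verbal labels follow the Landis and Koch scale, which is consistent with Table 1 but is assumed here rather than stated in the text.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must carry the same number of ratings (here: 4 annotators).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    categories = sorted({label for item in ratings for label in item})
    # counts[i][c] = number of raters who assigned item i to category c
    counts = [Counter(item) for item in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_chance = sum(p ** 2 for p in p_cats)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


def agreement_label(kappa):
    """Verbal label on the Landis and Koch scale (assumed, see lead-in)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical toy data: 5 houses, 4 annotators, house condition labels
toy = [
    ["good", "good", "good", "medium"],
    ["bad", "bad", "medium", "bad"],
    ["medium", "medium", "medium", "medium"],
    ["good", "medium", "good", "good"],
    ["bad", "bad", "bad", "bad"],
]
k = fleiss_kappa(toy)
print(round(k, 2), agreement_label(k))
```

On the real data the input would be the 500 commonly annotated addresses, with one list of four labels per address and per variable.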
Next, we estimated a generalized linear model (GLM) to investigate the importance of newly created variables for risk prediction (Kolyshkina, Wong, & Lim, 2004; Spedicato, Dutang, & Petrini, 2018; Werner & Modlin, 2016, p.176–183). We assume the following probabilistic model of claim frequency
For relative evaluation of the value added by our approach, we introduce three models:
Model A (null model), where vector
Model B (best-in-class insurer’s model), where vector
Model C (our model), where vector
The insurer provided us with the realisation of the model B for each record from the dataset. That model was estimated on a larger undisclosed dataset and contains
Intuitively, in this representation, the estimated coefficients
To do so, we refit each of models A, B and C on an 80% train sample and check its predictive power on a 20% test sample through the corresponding Gini coefficient. We observe significant variability of the Gini coefficient on the test sample: in particular, for model A (null model with intercept only and no other variables selected) it varies from 20% to 38% within 20 resampling trials. We interpret this as evidence that the dataset provided (20,000 records) is very small for modelling events as rare as property damage claims within MTPL insurance (average frequency of 5%).
Despite the high volatility of the data, adding our five simple variables to the insurer’s model improves its performance in 18 out of 20 resampling trials, and the average improvement of the Gini coefficient is nearly 2 percentage points (from 38.2% to 40.1%). To put this value into perspective, the best-in-class insurer’s model, fitted on a much bigger dataset and including a broad selection of variables (e.g. driver characteristics, car characteristics, claim history and geographical zones based on the client’s zip-code), improves the Gini coefficient versus the null model by 8 percentage points, from ~30% to ~38% (Figure 3).
We found that features visible in a picture of a house can be predictive of car accident risk, independently of classically used variables such as age or zip-code. This finding is not only a step towards more granular risk prediction models, but also illustrates a novel approach to social science, where real-world granular data are collected and analysed at scale.
From the practical perspective of insurance companies, the results we present are remarkably strong when compared with the best-in-class insurance model. Our five variables, which already contain some bias from the imperfect annotation, improve the Gini coefficient by nearly 2 percentage points, roughly a quarter of the 8 percentage point improvement brought by the numerous variables that the insurer already uses in its best-in-class risk model. The insurance industry could be quickly followed by banks, as there is a proven correlation between insurance risk models and credit risk scoring (Golden, Brockett, Ai, & Kellison, 2016). The approach of extracting valuable information from Google Street View opens a variety of opportunities, and not only for the financial sector. Any company that collects clients’ addresses could adapt our methodology, and deep learning technology makes it possible to do so automatically and on a massive scale (Zhou, Lapedriza, Xiao, Torralba, & Oliva, 2014).
Such a practice, however, raises concerns about the privacy of data stored in publicly available Google Street View, Microsoft Bing Streetside, Mapillary, or equivalent privately held datasets like CycloMedia. The consent given by clients to a company to store their addresses does not necessarily mean consent to store information about the appearance of their houses. In particular, features of the house may be a proxy for ethnicity, religion or other characteristics associated with the social status of a person (Braver, 2003; Gillis, 1974), whose use for discrimination, for example price discrimination, is forbidden by law in certain jurisdictions (Gaulding, 1994). The fast development of modern data collection and computational techniques allows for the unprecedented exploitation of various client data without the clients even being aware of it (Blitz, 2012), and the development of corresponding legislation seems to be outpaced.
The methods we present could be substantially improved by employing more annotators for the same set of images. Potentially, the average or ensemble of their answers would match reality better than the annotation of a single person (Levenson, Krupinski, Navarro, & Wasserman, 2015; Tran-Thanh, Stein, Rogers, & Jennings, 2014). Another limitation is the small size of the dataset provided by the insurance company, but we mitigated this problem by using bootstrapping and elementary modelling techniques such as GLMs.
It is an open question whether our findings can be extrapolated to countries other than Poland. For historical reasons, close neighbours in Poland may have very different socio-economic profiles, and there is significant heterogeneity of house types and conditions within the same zip-code. In Western countries, the architecture might be more homogeneous; therefore, the granular information at the address level might not add much value on top of the statistics aggregated at the zip-code level.
In this article, we describe a sample dataset obtained from the insurance company and the methodology for creating new variables from Google Street View and checking their impact on the MTPL risk prediction.
We examine a motor insurance dataset of 20,000 records—a random sample of an insurer’s portfolio written in Poland from January 2012 to December 2015.
One record represents one insurance policy covering MTPL. Each record has the following characteristics attached:
- its risk exposure from 2013 to 2015 (fraction of a year that the policy was active during the period 2013–2015)
- expected property damage claim frequency for that policy, estimated by the current, best-in-class risk model of the insurance company
- zip-code of the declared main driver of the car (used by the insurer to derive geographical zones)
- a set of four addresses:
  - A: registered address of the policyholder
  - B: mail address of the policyholder
  - C: registered address of the car owner
  - D: mail address of the car owner
- the property damage claim count incurred in 2013–2015 from that policy and reported before 28 February 2016.
Note that there is a natural lag in reporting insurance claims to the insurer, but property damage claims from MTPL cover are reported rather quickly in Poland: 95% of property damage claims are notified within the first 3 months of the accident, so we may assume that the observed claim count in our dataset is very close to the ultimate one.
MTPL insurance in Poland is attached to the car, not to the driver. The policy could also be purchased by a person who is neither a driver nor a car owner. Therefore, in theory, all four addresses could be different and could have a different zip-code than the one taken for the geographical zone. In practice, however, they have a lot in common:
- 84% of policies have all A, B, C and D addresses the same
- 88% of policies have common A and B addresses
- 96% of policies have common A and C addresses
- 96% of policies have common B and D addresses
- 75% of policies have common zip-code A and zip-code of the main driver
- 77% of policies have common zip-code B and zip-code of the main driver
For this study, we needed to select one address as the primary one, so we decided to select address B for the following reasons:
- The policyholder is most likely a person who is responsible for maintenance of the car and is actively using it (apart from the main driver).
- The mail address is most likely the up-to-date address of residence, while the registered address is often the one declared in the person’s ID (not updated often, as there is no legal obligation for it to reflect the actual residence).
Data cleansing was performed on the basis of address B: 129 records out of 20,000 were removed from the sample because the address was either foreign or could not be found on Google Maps (Table 2).
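Table 2. Summary statistics of the dataset before and after cleansing.
Statistic | Before cleansing | After cleansing
---|---|---
Number of records | 20,000 | 19,871
Risk exposure (policy-years) | 11,349 | 11,209
Property damage claim count | 571 | 570
Claim frequency | 5.03% | 5.09%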
In addition, we checked the claims data for outliers: there is only one record with three claims (where earned exposure is 0.2), and there are no records with four or more claims. Such a thin tail of the claim count distribution, along with a high share of claim-free policies, lets us assume that our claims data follow a Poisson distribution, a classical distribution assumed in the actuarial literature for rare events like car accidents (Goldburd, Khare, & Tevet, 2016). To confirm this, we conducted a formal test (a chi-squared goodness-of-fit test). The test statistic $\chi^2$ is 0.08, which determines a
Data for calculating the $\chi^2$ statistic used to test whether claims in our dataset follow the Poisson distribution. On average
Claim count | Observed | Share | Expected | $\chi^2$ term
---|---|---|---|---
0 | 10,784 | 96% | 10,785 | 0.00
1 | 417 | 4% | 416 | 0.01
2 | 7 | 0% | 8 | 0.08
All | 11,209 | | |
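The mechanics of such a goodness-of-fit check can be sketched as follows. This is an illustrative sketch, not the authors' calculation: the function names are hypothetical, the observed and expected counts are the rounded figures printed in the table (so the statistic will not reproduce the reported value exactly), and using one degree of freedom (three cells, minus one, minus one estimated parameter) is an assumption.

```python
import math


def poisson_pmf(k, lam):
    """P(N = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_counts(exposures, frequency, max_k):
    """Expected number of policies with 0..max_k claims.

    Each policy is assumed to have Poisson mean frequency * exposure.
    """
    return [
        sum(poisson_pmf(k, frequency * e) for e in exposures)
        for k in range(max_k + 1)
    ]


def pearson_chi2(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Rounded counts as printed in the table above (0, 1 and 2 claims)
observed = [10784, 417, 7]
expected = [10785, 416, 8]

statistic = pearson_chi2(observed, expected)
p_value = chi2_sf_1df(statistic)  # degrees of freedom = 1 is an assumption
print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")
```

With policy-level data, `expected_counts` would be called with the individual exposures and the portfolio claim frequency instead of relying on the rounded table entries.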
The dataset examined in this article is a random sample of the insurer’s portfolio; therefore, the geographical distribution of our addresses reflects the footprint of the insurer. It covers the whole territory of Poland with certain concentrations of policies in the big cities—Warszawa, Katowice, Kraków, Gdańsk, Szczecin, Poznań, Wrocław and Łódź (Figure 4).
For each of 19,871 addresses from the dataset, we have collected an image from Google Satellite View and an image from Google Street View (when available). We selected a random subsample of 500 addresses and asked 6 experts to annotate images from this subsample independently. They were supposed to annotate the following characteristics:
From Google Satellite View:
- Types of houses and greenery prevailing in the neighbourhood (detached houses, terraced houses, blocks of flats, fields and forest)
- Building density (on a scale 1–5)

From Google Street View:
- Street view quality (OK; not provided by Google; provided but its quality does not allow for annotation)
- Type of the house (detached house, terraced house, low/medium/high-rise block of flats)
- Age of the house (old, medium and new)
- The condition of the house (good, medium and bad)
- Wealth of the residents (on a scale 1–10)
Four out of six annotators gave quite consistent answers for the common subsample of 500 addresses. Fleiss’ kappa statistics (Table 1) indicate mostly moderate agreement among them. We asked these four annotators to continue annotating the remaining 19,371 addresses, but this time each annotator was given a separate set of addresses, not overlapping with the addresses of other annotators. After collecting all annotations, we compared the distributions of labels among annotators. Assessing the wealth of a house’s residents appears to be too subjective, as its distribution varies significantly among annotators. Small differences identified in two other variables, namely house age and house condition, were corrected by normalising the distributions among the annotators to match the mean and the standard deviation. Basic statistics of the variables after all corrections are shown in Figure 5.
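The text does not spell out how the distributions were normalised. One simple way to match the mean and standard deviation across annotators is a linear rescaling of each annotator's scores to the pooled mean and standard deviation, sketched below. The function name, the annotator keys and the toy scores are hypothetical, and treating the ordinal scales as numeric is an assumption of this sketch.

```python
from statistics import mean, pstdev


def align_annotators(scores_by_annotator):
    """Rescale each annotator's scores to the pooled mean and standard deviation.

    scores_by_annotator maps an annotator id to a list of numeric scores.
    """
    pooled = [s for scores in scores_by_annotator.values() for s in scores]
    target_mean, target_std = mean(pooled), pstdev(pooled)
    aligned = {}
    for name, scores in scores_by_annotator.items():
        m, sd = mean(scores), pstdev(scores)
        if sd == 0:
            # A constant annotator carries no spread to rescale
            aligned[name] = [target_mean for _ in scores]
        else:
            # Standardise, then map onto the pooled location and scale
            aligned[name] = [target_mean + (s - m) / sd * target_std for s in scores]
    return aligned


# Hypothetical house-age scores (1 = old, 2 = medium, 3 = new)
raw = {"annotator_1": [1, 2, 2, 3], "annotator_2": [2, 3, 3, 3]}
aligned = align_annotators(raw)
for name, scores in aligned.items():
    print(name, [round(s, 2) for s in scores])
```

After the rescaling, every annotator's scores share the same mean and standard deviation, so a systematically stricter or more lenient annotator no longer shifts the variable.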
It is worth noting that for 22% of addresses, no Google Street View was available. These addresses were either in very remote locations or the road leading to them was not open to the public. Another 16% of addresses had a Google Street View image that did not allow for proper annotation of the house, for various reasons: the Google camera was directed at the wrong side of the road, or an obstacle (e.g. a tree, a fence or an overtaking bus) covered the house. As a result, only 63% of all addresses had a usable Google Street View image and thus properly annotated values of house type, house age, house condition and wealth of the residents. Neighbourhood type and building density are populated for 100% of addresses, as they are based on the Google Satellite View, which was available for all observations in the dataset.
In the previous section, we presented the univariate claim frequency variable by variable (Figure 5). Some of the segments appear to have a claim frequency that stands out from the population average, for example, relatively new houses, houses in bad condition, mid-rise blocks of flats, or houses surrounded by blocks of flats with no signs of greenery.
The outstanding claim frequency can, however, be driven by another variable that is already controlled for by the risk model of the insurance company. For example, people living in new houses can be relatively young, and the driver’s age is a classical ratemaking variable for motor insurance. There could also be some correlations among the newly created variables themselves, for example, a mid-rise block of flats is more likely to be surrounded by other blocks of flats than by detached houses and fields. To fairly assess the impact of the newly created variables on risk prediction, we need to use a multivariate method that considers all selected variables simultaneously and automatically adjusts for exposure correlations between them.
Such a method is the GLM, which has been widely adopted by insurance pricing practitioners around the world (Cizek, Härdle, & Weron, 2005; Werner & Modlin, 2016, pp. 176–183). GLMs extend linear models by allowing distributions of error terms other than Gaussian. In particular, the response in insurance models is typically assumed to follow a Poisson or Gamma distribution. Despite this relaxation of assumptions on error terms, classical maximum likelihood estimates can be computed after transforming the model with a so-called link function. Moreover, applying the log link function makes GLM coefficients interpretable as multiplicative effects, so that they can be used directly for risk premium calculation. For these reasons, GLMs remain the most prevalent statistical tool in insurance, despite the growing popularity of complex machine learning models in other disciplines of science.
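To make the multiplicative reading of a log link concrete, here is a minimal sketch of the simplest special case: a Poisson GLM with a log link, a fixed offset and a single categorical variable. In that case the maximum likelihood estimate has a closed form, namely observed claims divided by offset-expected claims per level. This is not the multivariate model of the article nor the authors' R code; the function names and the toy data are hypothetical, and using exposure times the insurer's expected frequency as the offset is an assumption of the sketch.

```python
import math
from collections import defaultdict


def fit_relativities(levels, claims, offsets):
    """MLE of exp(beta_level) in a one-factor Poisson GLM with log link.

    offsets[i] is the claim count already expected for policy i, assumed here
    to be exposure * baseline frequency. The likelihood equations for this
    special case give: relativity = sum(claims) / sum(offsets) per level.
    """
    observed = defaultdict(float)
    expected = defaultdict(float)
    for level, y, off in zip(levels, claims, offsets):
        observed[level] += y
        expected[level] += off
    return {level: observed[level] / expected[level] for level in observed}


def predicted_claims(level, exposure, base_frequency, relativities):
    """Log link: exp(log(exposure * base) + beta) = exposure * base * relativity."""
    return exposure * base_frequency * relativities[level]


# Hypothetical aggregated toy data: one row per group of similar policies
levels = ["good", "good", "bad", "bad"]          # house condition
claims = [40, 55, 30, 32]                        # observed claim counts
exposure = [1000.0, 1000.0, 500.0, 500.0]        # policy-years
base_freq = [0.045, 0.055, 0.050, 0.050]         # baseline expected frequency
offsets = [e * f for e, f in zip(exposure, base_freq)]

rel = fit_relativities(levels, claims, offsets)
for level, r in rel.items():
    print(f"{level}: relativity {r:.3f}, coefficient {math.log(r):+.3f}")
print(predicted_claims("bad", 1.0, 0.05, rel))
```

A relativity above 1 means the level has more claims than the baseline expects; with several correlated variables the relativities no longer have a closed form and must be estimated jointly, which is what the GLM fit does.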
We assume the following probabilistic model of claim frequency (defined as the number of claims divided by risk exposure):
An analogous formula is assumed for the best-in-class insurer’s model, and its realisation was provided for each of the records in our dataset. We can then replace a part of the model formula by the provided expected frequency, which does not require estimation. Assuming the insurer uses
We estimate such a model formula in R package. Variables are added to the model step by step, with levels grouped as needed along the way to achieve the most robust results. The modelling process is iterated until all factors used in the model appear significant (
Once the modelling process is finished, we validate the model by refitting it on an 80% train sample and checking its performance on a 20% test sample through the Gini coefficient. The Gini coefficient is most commonly known as a measure of income inequality, but it has been adopted by insurance practitioners as a metric for model validation and model comparison (Frees, Meyers, & Cummings, 2011). It is computed as follows:
1. The policies in the 20% test sample are sorted from the lowest to the highest claim frequency expected by the model fitted on the 80% train sample.
2. The cumulative observed claim count of the sorted policies in the test sample is plotted on a graph (representing inequality of risk distribution in the population, analogously to inequality of wealth distribution in the Lorenz curve (Lorenz, 1905)).
3. The Gini coefficient is computed as the area between the Lorenz curve and the no-discrimination line, multiplied by 2 (where the Lorenz curve is described in point 2 and illustrated in Figure 6) (Gini, 1921).
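The three steps above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name and the toy inputs are hypothetical, and putting the cumulative share of exposure on the horizontal axis is an assumption (the text only specifies the sorting and the cumulative claim count).

```python
def gini_coefficient(predicted_freq, claims, exposure):
    """Gini coefficient of a risk model on a test sample.

    Policies are sorted by predicted frequency (ascending); the Lorenz curve
    plots the cumulative share of observed claims against the cumulative share
    of exposure; Gini = 2 * area between the diagonal and the curve.
    """
    total_claims = sum(claims)
    total_exposure = sum(exposure)
    if total_claims == 0 or total_exposure == 0:
        raise ValueError("need positive total claims and exposure")
    # Step 1: sort policies from lowest to highest predicted frequency
    order = sorted(range(len(claims)), key=lambda i: predicted_freq[i])
    # Step 2: build the Lorenz curve and integrate it with trapezoids
    area_under_curve = 0.0
    x_prev = y_prev = 0.0
    cum_exposure = cum_claims = 0.0
    for i in order:
        cum_exposure += exposure[i]
        cum_claims += claims[i]
        x = cum_exposure / total_exposure
        y = cum_claims / total_claims
        area_under_curve += (x - x_prev) * (y + y_prev) / 2.0
        x_prev, y_prev = x, y
    # Step 3: twice the area between the no-discrimination line and the curve
    return 2.0 * (0.5 - area_under_curve)


# Hypothetical toy test sample of four policies with one claim
predicted = [0.01, 0.02, 0.03, 0.04]
observed_claims = [0, 0, 0, 1]
policy_years = [1.0, 1.0, 1.0, 1.0]
print(gini_coefficient(predicted, observed_claims, policy_years))
```

A model that ranks the risky policies last pushes the Lorenz curve away from the diagonal and yields a high Gini coefficient, a model with no ranking power stays near 0, and a model that ranks policies in the wrong order produces a negative value.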
Our preliminary analysis has shown the variability of the Gini coefficient due to the small size of the dataset provided. To reduce this variability in the analysis of model performance, we compute the estimates of the Gini coefficient from 20 resampling trials, each time randomly assigning observations to train and test set from the beginning.
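The resampling procedure can be sketched as a loop over random 80/20 splits. The sketch below is a hedged illustration: `resampled_scores`, `paired_improvements` and the `fit_and_score` callback are hypothetical names, the callback stands in for refitting a model on the train indices and returning its Gini coefficient on the test indices, and the fixed seed is only there to make the toy run repeatable.

```python
import random
from statistics import mean


def resampled_scores(n_records, fit_and_score, n_trials=20, train_share=0.8, seed=0):
    """Repeat a random train/test split and collect one score per trial.

    fit_and_score(train_idx, test_idx) is a user-supplied callback that fits
    the model on the train indices and returns a score on the test indices.
    """
    rng = random.Random(seed)
    indices = list(range(n_records))
    cut = int(train_share * n_records)
    scores = []
    for _ in range(n_trials):
        rng.shuffle(indices)  # fresh random assignment in every trial
        train_idx, test_idx = indices[:cut], indices[cut:]
        scores.append(fit_and_score(train_idx, test_idx))
    return scores


def paired_improvements(scores_base, scores_new):
    """Number of trials in which the new model beats the base model."""
    return sum(1 for b, n in zip(scores_base, scores_new) if n > b)


# Toy callback: returns the test share, standing in for a real Gini computation
def toy_score(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))


scores = resampled_scores(1000, toy_score)
print(len(scores), mean(scores))
```

Calling `resampled_scores` with the same seed for two models evaluates them on identical splits, so the per-trial differences are paired and can be summarised with `paired_improvements`.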