Cite

Introduction

Modern machine learning techniques for computer vision, like Deep Learning, provided unprecedented opportunities for academic research and industrial applications. Examples include using satellite images for deforestation monitoring in South America (Finer et al., 2018) or poverty estimation in Africa (Jean et al., 2016), prediction of skin cancer from skin lesion images (Esteva et al., 2017), or automatic detection of pulmonary tuberculosis from a chest radiograph (Lakhani & Sundaram, 2017).

One of the resources recently leveraged for research is Google Street View—a platform from Google—where images of buildings are taken using cars equipped with a set of cameras (Anguelov et al., 2010). This data source has recently been explored by researchers to answer questions in social science, for example demographic makeup of neighbourhoods across the US (Gebru et al., 2017), estimating city-level travel patterns in Great Britain (Goel et al., 2018) or crime rate in Brazil (Andersson, Birck, & Araujo, 2017).

Our work explores whether Google Street View images of houses are predictive of their residents’ risk of car accident. So far, researchers were looking for determinants of car accidents among characteristics more directly related to driving, for example, driving experience (McCartt, Shabanova, & Leaf, 2003), drunk driving (Bingham, Shope, & Zhu, 2008) and using cell phones while driving (Strayer, Drews, & Crouch, 2003). There are also studies about the road and environmental conditions influencing car accidents (Karlaftis & Golias, 2002; Shankar, Mannering, & Barfield, 1995). We are not aware of any study exploring a direct link between housing conditions and car accident risk; however, a handful of research studies have proved that neighbourhood and house characteristics are correlated with health risk behaviours (Spilkova, Dzúrova, & Pitonak, 2014), which in turn correlate with driving behaviours (Rolison, Hanoch, Wood, & Liu, 2014).

The correlation we aim to verify in this article might be particularly interesting to the insurers. It is essential for the insurers to accurately estimate the risk of the client and set up a proper pricing in order to avoid adverse selection (Gogol, 1993). For this purpose, they search for systematic and time-invariant clients’ characteristics that are observable at the moment of issuing a policy and correlate with the number of claims incurred during the insurance cover period. For example, the classical motor insurance risk factors identified worldwide are the age of the driver, the characteristics of his car, the occurrence of car accidents in the past and geography (Werner & Modlin, 2016, p.159). For this reason, the insurers tend to ask for these and other details before providing the motor insurance offer.

Although insurers often collect address information from the client, they typically use only zip-code for risk modelling and pricing purposes. Claims data aggregated to zip-codes are still too volatile and require spatial smoothing (Taylor, 2001) and further aggregation to larger geographical zones (Yao, 2008). Such a commonly used methodology is based on the assumption that neighbours are driving in a similar manner. In this article, we challenge this assumption and show that volatility can be explained at the granularity of individual addresses. Moreover, we show that this information can be extracted from publicly available images from the Google Street View (Figure 1).

Fig. 1

Examples of extremely different houses located in the same zip code and residents of which have the same expected claim frequency by the current insurer’s model.

Study of this insurance problem enabled following sociological and methodological discoveries: (1) features of the house correlate with the car accident risk of its resident, (2) compared to other uses of Google Street View for research, our variables are sourced from the address rather than aggregated by zip-code or district and they allow for new sociological discoveries at a very granular level, (3) variables extracted from the address (the image of a house) can be used in insurance and other industries, notably for price discrimination, (4) modern data collection and computational techniques, which allow for unprecedented exploitation of personal data, can outpace the development of legislation and raise privacy threats.

Results

We examine a motor insurance dataset of 20,000 records—a random sample of an insurer’s portfolio collected in Poland from January 2012 to December 2015. Each record represents the characteristics of an insurance policy covering motor third party liability (MTPL) including the address of the policyholder, risk exposure defined as a fraction of the year in which the policy was active from 2013 to 2015 and the corresponding count of incurred property damage claims defined as events where any third-party property (car, motorbike, bicycle, as well as fence, house, tree, etc.) has been damaged from partial or full fault of the driver of the insured car. The insurer provided us also with the expected frequency of property damage claims for those policies, estimated by their current best-in-class risk model that includes zoning based on the client’s zip-code.

We collect Google Satellite View and Google Street View images for the addresses provided in the database. Six experts annotated the following features of the houses visible in the images: their type, age, condition, estimated wealth of its residents, as well as type and density of other buildings from the neighbourhood (Figure 2). Four out of six annotators gave moderately consistent answers for the common subsample of 500 addresses—Fleiss’ kappa statistics indicate mostly moderate agreement among them (Table 1). These four annotators continued annotating remaining 19,371 addresses (we removed 129 addresses from the scope of this study as they were either foreign or could not be found by Google Maps), but this time each annotator was given a separate, randomly selected, set of addresses. We compared distributions of collected annotations and finally applied small corrections to match the mean and standard deviation among all four annotators.

Fig. 2

Features annotated from Google Satellite View and Google Street View image of a particular address.

Statistics for seven newly created variables—original granularity, inter-rater reliability of 4 selected annotators on the common set of 500 observations and significance in our risk model after applying necessary simplifications.

VariableOriginal granularityInter-rater reliabilityRisk model
Fleiss’ kappaInterpretationGranularity after simplificationp-value
Neighbourhood typeSeven types, multi-choice0.52Moderate agreement200.01
Building densityScale 1–50.50Moderate agreementNot significant
Street View qualityGood/bad/missing0.79Substantial agreement200.02
House typeFive types, single-choice0.69Substantial agreement200.01
House ageScale 1–30.51Moderate agreement200.03
House conditionScale 1–30.54Moderate agreement200.04
Wealth of residentsScale 1–100.32Fair agreementNot significant

Next, we estimated a generalized linear model (GLM) to investigate the importance of newly created variables for risk prediction (Kolyshkina, Wong, & Lim, 2004; Spedicato, Dutang, & Petrini, 2018; Werner & Modlin, 2016, p.176–183). We assume the following probabilistic model of claim frequency f, defined as the number of claims divided by risk exposure: log(E(f))=log(E(Y/exposure))=βX\log \left( {{\rm{E}}\left( f \right)} \right) = \log \left( {{\rm{E}}\left( {Y/exposure} \right)} \right) = \beta X where Y is a number of property damage claims within MTPL insurance following Poisson distribution, X is a vector of independent variables and β is the vector of coefficients.

For relative evaluation of the value added by our approach, we introduce three models:

odel A (null model), where vector X is [1]

Model B (best-in-class insurer’s model): where vector X is [1,X1,...,Xj]

Model C (our model): where vector X is [1, X1,...,Xj,Xj+1,...,XN]

The insurer provided us with the realisation of the model B for each record from the dataset. That model was estimated on a larger undisclosed dataset and contains j predictive variables (driver characteristics, vehicle characteristics, claim history, geographical zone, etc.). Using properties of GLMs we can decompose Model C into two parts: one corresponding to Model B and one incorporating the new variables. We refer to the realisation of the Model B multiplied by exposure as an offset (Yan, Guszcza, Flynn, & Wu, 2009) and do not estimate it. Therefore, Model C takes form log(E(Y))=β0+βj+1Xj+1++βNXN+log(offset)\log \left( {{\rm{E}}\left( Y \right)} \right) = \beta _0 + \beta _{j + 1} X_{j + 1} + \ldots + \beta _N X_N + \log \left( {offset} \right)

Intuitively, in this representation, the estimated coefficientsβj+1,...,βN explain the signal that is not explained by the best-in-class risk model of the insurer (model B) and will also adjust for the earned exposure of the policy shorter than 1 year. We investigate if the values of these coefficients are non-zero, indicating that the variables we constructed provide additional predictive power to the model. We found that five out of seven newly created variables within this research were significant for predicting property damage MTPL claim frequency model (on top of many other rating variables used in the best-in-class insurer’s model). We report their p-values in Table 1; unfortunately, the data provider did not authorise us to publish the significant levels or corresponding estimated model coefficients. Definitely more research would be needed for testing joint significance of all five selected variables, but the aim of the article is more general—to verify if there is any information in the satellite and Street View images that can be predictive of motor claim frequency and is not being captured by the existing risk factors.

To do so, we refit each of A, B and C models on an 80% train sample and check its predictive power on a 20% test sample through the corresponding Gini coefficient. We observe a significant variability of Gini coefficient on test sample—in particular for model A (null model with intercept only and no other variables selected) it varies from 20 to 38% within 20 resampling trials. We interpret it as the evidence that the dataset provided is extremely small (20,000 records) for modelling such rare events as property damage claims within MTPL insurance (average frequency of 5%).

Despite the high volatility of data, adding our five simple variables to the insurer’s model improves its performance in 18 out of 20 resampling trials and the average improvement of the Gini coefficient is nearly 2 percentage points (from 38.2% to 40.1%). To put this value into perspective, the best-in-class insurer’s model fitted on much bigger dataset and including a broad selection of variables (e.g. driver characteristics, car characteristics, claim history and geographical zones based on the client’s zip-code) improves the Gini coefficient versus null model by 8 percentage points from ~30% to ~38% (Figure 3).

Fig. 3

Gini coefficients obtained on 20% test sample in 20 bootstrapping trials from the null model (A), the best-in-class insurer’s model (B) and our model with newly created variables (C).

Discussion

We found that features visible on a picture of a house can be predictive of car accident risk, independently from classically used variables such as age, or zip code. This finding is not only a step towards more granular risk prediction models, but also illustrates a novel approach to social science, where the real-world granular data is collected and analysed at scale.

From the practical perspective of insurance companies, the results we present are remarkably powerful, when compared to the best-in-class insurance model. Our 5 variables, containing already some bias from the imperfect annotation, improve Gini coefficient by nearly 2 percentage points, which is massive, compared to the improvement of 8 percentage points brought by numerous variables that the insurer has already been using in his best-in-class risk model. The insurance industry could be quickly followed by the banks, as there is a proven correlation between insurance risk models and credit risk scoring (Golden, Brockett, Ai, & Kellison, 2016). The approach itself to extract valuable information from Google Street View opens a variety of opportunities not only for the financial sector. Any company that collects clients’ addresses could adapt our methodology, and the deep learning technology enables to make it in an automated way on a massive scale (Zhou, Lapedriza, Xiao, Torralba, & Oliva, 2014).

Such a practice, however, raises the concerns about the privacy of data stored in publicly available Google Street View, Microsoft Bing Streetside, Mapillary, or equivalent privately held datasets like CycloMedia. The consent given by the clients to the company to store their addresses does not necessarily mean a consent to store information about the appearance of their houses. In particular, features of the house may be a proxy of ethnicity, religion or other characteristics associated with a social status of a person (Braver, 2003; Gillis, 1974), which are forbidden by the law to be used for any discrimination, for example, price discrimination in certain jurisdictions (Gaulding, 1994). Fast development of modern data collection and computational techniques allows for the unprecedented exploitation of various data of clients being not even aware of it (Blitz, 2012), and the development of corresponding legislation in this matter seems to be outpaced.

The methods we present could be substantially improved by employing more annotators for the same set of the images. Potentially, the average or ensemble of their answers would match the reality better than an annotation of a single person (Levenson, Krupinski, Navarro, & Wasserman, 2015; Tran-Thanh, Stein, Rogers, & Jennings, 2014). Another limitation is the small size of the dataset provided by the insurance company, but we reduced this problem using bootstrapping and by using elementary modelling techniques, like the GLMs.

There is a question if we could extrapolate our findings on countries other than Poland. Because of historical reasons, the close neighbours in Poland may have a very different socio-economic profile, and there is a significant heterogeneity of house types and conditions within the same zip-code. In the Western countries, the architecture might be more homogeneous; therefore, the granular information at the address level might not add much value on top of the statistics aggregated at the zip-code level.

Materials and methods

In this article, we describe a sample dataset obtained from the insurance company and the methodology for creating new variables from Google Street View and checking their impact on the MTPL risk prediction.

Dataset provided by the insurer

We examine a motor insurance dataset of 20,000 records—a random sample of an insurer’s portfolio written in Poland from January 2012 to December 2015.

One record represents one insurance policy covering MTPL. Each record has the following characteristics attached:

its risk exposure from 2013 to 2015 (fraction of a year that policy was active during the period 2013–2015)

expected property damage claim frequency for that policy, estimated by the current, best-in-class risk model of the insurance company

zip-code of the declared main driver of the car (used by the insurer to derive geographical zones)

a set of four various addresses:

registered address of the policyholder

mail address of the policyholder

registered address of the car owner

mail address of the car owner

the property damage claim count incurred in 2013–2015 from that policy and reported before 28 February, 2016.

Note, that there is a natural lag of reporting insurance claims to the insurer, but property damage claims from MTPL cover are rather quickly reported in Poland—95% of property damage claim are notified within first 3 months from accident occurrence, so we may assume that the observed claim count in our dataset is very close to the ultimate one.

MTPL insurance in Poland is attached to the car, not to the driver. The policy could also be purchased by a person who is neither a driver nor a car owner. Therefore, in theory, all four addresses could be different and could have a different zip-code than the one taken for the geographical zone. In practice, however, they have a lot in common:

84% of policies have all A, B, C and D addresses the same

88% of policies have common A and B addresses

96% of policies have common A and C addresses

96% of policies have common B and D addresses

75% of policies have common zip-code A and zip-code of the main driver

77% of policies have common zip-code B and zip-code of the main driver

For this study, we needed to select one address as the primary one, so we decided to select address B for the following reasons:

The policyholder is most likely a person who is responsible for maintenance of the car and is actively using it (apart from the main driver)

Mail address is most likely the up-to-date address of residence, while the registered address is often the one declared in the person’s ID (not updated often as there is no legal obligation for it to reflect the actual residence)

On the basis of the address B, some data cleansing has been done—129 records out of 20,000 were removed from the sample as the address was either foreign or could not be found on Google Maps (Table 2).

Summary statistics of the dataset—before and after cleansing.

Original databaseAfter data cleansing
Number of polices20,00019,871
Risk exposure11,34911,209
MTPL PD claim count571570
Observed MTPL PD frequency5,03%5,09%

In addition, we checked claims data for any outliers—there is only one record with three claims (where earned exposure is 0.2), and there are no records of four or more claims. Such a thin tail of our claim count distribution along with a high representation of no claim policies, and let us assume that our claims data follow Poisson distribution—a classical distribution assumed in the actuarial literature for rare events like car accidents (Goldburd, Khare, & Tevet, 2016). To confirm it, we conducted a formal test (a Chi-squared goodness-offit test). The test statistic X2 is 0.08, which determines a p-value above 50% (from the distribution of the X2 statistic with 1 degree of freedom). We cannot, therefore, reject the null hypothesis that the claims follow the Poisson distribution (Table 3).

Data for calculation of X2 statistic for hypothesis verification whether claims in our dataset follow the Poisson distribution. On average λ = 3.9% and the corresponding X2 = 0.08 with 1 degree of freedom.

Number of claimsObserved exposure (O)Expected prob.P(X = k)Expected exposure (E)(E – O)2/E
010,78496%10,7850,00
14174%4160,01
270%80,08
All11,209
The process of creating new variables from Google Satellite View and Google Street View based on the address provided

The dataset examined in this article is a random sample of the insurer’s portfolio; therefore, the geographical distribution of our addresses reflects the footprint of the insurer. It covers the whole territory of Poland with certain concentrations of policies in the big cities—Warszawa, Katowice, Kraków, Gdańsk, Szczecin, Poznań, Wrocław and Łódź (Figure 4).

Fig. 4

Geolocation of the addresses from the dataset examined in this paper.

For each of 19,871 addresses from the dataset, we have collected an image from Google Satellite View and an image from Google Street View (when available). We selected a random subsample of 500 addresses and asked 6 experts to annotate images from this subsample independently. They were supposed to annotate the following characteristics:

From Google Satellite View:

Types of houses and greenery prevailing in the neighbourhood (detached houses, terraced houses, blocks of flats, fields and forest)

Building density (on a scale 1–5)

From Google Street View:

Street view quality (OK; not provided by Google; provided but its quality does not allow for annotation)

Type of the house (detached house, terraced house, low/medium/high-rise block of flats)

Age of the house (old, medium and new)

The condition of the house (good, medium and bad)

Wealth of the residents (on a scale 1–10)

Four out of six annotators gave quite consistent answers for the common subsample of 500 addresses. Fleiss’ kappa statistics (Table 1) indicate mostly moderate agreement among them. We asked these four annotators to continue annotating remaining 19,371 addresses, but this time each annotator was given a separate set of addresses, not overlapping with the addresses of other annotators. After collecting all annotations, we compared the distributions of labels among annotators. Assessing the wealth of house’ residents must be too subjective as its distribution varies significantly among annotators. Small differences identified in the two other variables, namely house age and house condition, were corrected by normalising the distributions among the annotators to match the mean and the standard deviation. Basic statistics of the variables after all corrections is shown in Figure 5.

Fig. 5

Distribution of labels and corresponding observed claim frequency for the variables generated for this study.

It is worth noting that for 22% of addresses, there was no Google Street View available. These addresses were either in very remote locations or the road leading to them was not open to the public. Other 16% of addresses had Google Street View that did not allow for proper annotation of the house, for various reasons: the Google camera was directed at the wrong side of the road, there was an obstacle (e.g. a tree, a fence and an overtaking bus) that covered the house. As a result, only 63% of all addresses had proper Google Street View, and thus, variables, such as house type, the age of the house, condition of the house and wealth of the residents of the house, were properly annotated. Variables, such as neighbourhood type and building density, are fulfilled in 100% as they are based on the Google Satellite View that was available for all observations from the dataset.

Methodology for checking the significance of the newly created variables in the risk model

In the previous section, we have presented the univariate claim frequency variable by variable (Figure 5). Some of the segments appear to have claim frequency outstanding from the population average, for example, relatively new houses, houses in bad condition, mid-rise blocks of flats, or houses surrounded by blocks of flats with no signs of greenery.

The outstanding claim frequency can be, however, driven by another variable that is already controlled by the risk model of the insurance company. For example, people living in the new houses can be relatively young, and driver’s age is a classical ratemaking variable for motor insurance. There could be also some correlations among the newly created variables themselves, for example, a mid-rise block of flats is more likely to be surrounded by other blocks of flats rather than detached houses and fields. To fairly assess the impact of the newly created variables for risk prediction, we need to use a multivariate method that considers all selected variables simultaneously and automatically adjusts for exposure correlations between them.

Such a method is the GLM that has been widely adopted by the insurance pricing practitioners around the world (Cizek, Härdle, & Weron, 2005; Werner & Modlin, 2016, 176–183). GLMs extend linear models by allowing distributions of error terms other than Gaussian. In particular, residuals of models in insurance are typically assumed to follow Poisson or Gamma distributions. Despite this relaxation of assumptions on error terms, classical maximum likelihood estimates can be computed, after transforming the model with a so-called link function. Moreover, the application of log link function makes GLM coefficients interpretable and could be directly used for risk premium calculation. For these reasons, GLMs remain the most prevalent statistical tool in insurance, despite the growing popularity of complex machine learning models in other disciplines of science.

Generalized linear models

We assume following the probabilistic model of claim frequency (defined as the number of claims divided by risk exposure): log(E(f))=log(E(Y/exposure))=βX\log \left( {{\rm{E}}\left( f \right)} \right) = \log \left( {{\rm{E}}\left( {Y/exposure} \right)} \right) = \beta X where Y is a number of property damage claims within the MTPL insurance that follows the Poisson distribution, X is a vector of independent variables and Beta’s are corresponding coefficients to be estimated. Assuming N independent variables, the model formula can be written as follows: log(E(Y))=β0+β1*X1++βNXN+log(exposure)\log \left( {E\left( Y \right)} \right) = \beta _0 + \beta _1 *X_1 + \ldots + \beta _N X_N + \log \left( {exposure} \right)

An analogical formula is assumed for the best-in-class insurer’s model, and its realisation was provided for each of the records from our dataset. We can then replace a part of the model formula by provided expected frequency that does not require estimation. Assuming the insurer uses j independent variables in the model (where j<N), our model formula transforms as follows: log(E(Y))=β0+βj+1Xj+1++βNXN+log(offset)\log \left( {{\rm{E}}\left( Y \right)} \right) = \beta _0 + \beta _{j + 1} X_{j + 1} + \ldots + \beta _N X_N + \log \left( {offset} \right)

We estimate such a model formula in R package. The variables are being added to the model step by step, and the necessary grouping of levels is being made meanwhile to achieve the most robust results. The modelling process is iterated until all factors used in the model appear significant (p-value < 0.05).

Model evaluation

Once the modelling process is finished, we validate the model by refitting it on 80% train sample and checking its performance on 20% test sample through the Gini coefficient. Gini coefficient is most commonly known as a measure of the inequality of income, but it has been adopted by insurance practitioners as a metric for model validation and model comparison (Frees, Meyers, & Cummings, 2011). It is computed as follows:

The policies in 20% of test sample are sorted from the lowest to the highest claim frequency expected by the model fitted on the 80% of train sample

The cumulative observed claim count from sorted policies in 20% of test sample is plotted on the graph (representing inequality of risk distribution in the population, analogically to inequality of wealth distribution in Lorenz curve (Lorenz, 1905))

Gini coefficient is computed as the area between the Lorenz curve and the no-discrimination line multiplied by 2 (where the Lorenz curve is described in point 2 and illustrated in Figure 6) (Gini, 1921)

Fig. 6

Illustration of the Gini coefficient computation for one of the bootstrapping trials.

Our preliminary analysis has shown the variability of the Gini coefficient due to the small size of the dataset provided. To reduce this variability in the analysis of model performance, we compute the estimates of the Gini coefficient from 20 resampling trials, each time randomly assigning observations to train and test set from the beginning.

eISSN:
2543-6821
Language:
English