EXPLORING EXPLORATORY DATA ANALYSIS: AN EMPIRICAL TEST OF RUN CHART UTILITY

ABSTRACT
This paper explores Exploratory Data Analysis (EDA). In EDA, graphical methods are used to gain insights, and these insights can be useful for forming tentative hypotheses when performing a root cause analysis (RCA). The topic of EDA is well addressed in the literature; however, empirical studies of its efficacy are lacking. We therefore evaluate EDA by comparing one group of students identifying salient features in a table of data against a second group attempting to identify salient features in the same data presented as a run chart, and extracting relevant conclusions from the comparison. Two groups of students were randomly selected to receive the data, either in the form of a table or a run chart, and were tasked with visually identifying any data points that stood out as interesting. The number of correctly identified values and the time needed to find the values were both evaluated with two-sample hypothesis tests to determine whether there was a statistically significant difference. The participants with a graph found the correct values that stood out in the data much more quickly than those using a table; those using the data in table form took much longer and generally failed to identify the values that stood out. However, those with a graph also produced far more false positives. This paper thus provides empirical evidence for the effectiveness of EDA.


INTRODUCTION
Using simple graphical tools and expertise is the key to effective problem solving [1]. One possible approach for applying such graphical tools is Exploratory Data Analysis (EDA). Tukey explains that EDA is performed to find questions, which can later be confirmed with confirmatory data analysis; data is viewed graphically to find insights, which can then be further investigated using other statistical methods such as hypothesis testing [2]. The part of an analysis that uses EDA is both speculative and open and is intended for the identification of salient features within the data [3], and EDA is useful for hypothesis generation [4]. Unlike Confirmatory Data Analysis (CDA), such as statistical hypothesis testing, EDA seeks mainly to generate ideas [5]. The tools used in EDA may often be simple, yet they are needed to form the right questions early in the problem-solving process [6], as learning involves alternating cycles of applying subject matter knowledge to data or facts to form theories, hypotheses, conjectures, ideas, or models [7]. The use of EDA is a critical aspect of the scientific process [8], and one area where EDA can be useful is root cause analysis (RCA), where data is explored graphically to gain insights into possible root causes. These insights are then evaluated in detail. Hypotheses about causes and effects based on EDA may arise from the observation of gaps in the data, the presence of outliers, or patterns [9] observed when depicting the data graphically. Patterns and features in the data can be detected using EDA and then used as a basis for further analysis [10].
The absence of a pattern in the data can also be viewed as potentially interesting, and in EDA the well-informed human eye is the main method used [11]. This research attempts to determine whether the human eye, when using a graphical method, can detect anomalous data in a large data set better than the human eye simply observing the data set in table form. The run chart has been around long enough for its origin to be lost in time [12], yet the effectiveness of this EDA tool has not yet been empirically evaluated. A run chart can be useful for forming hypotheses during a failure analysis [13], and this study evaluates the effectiveness of run charts for supporting EDA.

LITERATURE REVIEW
The concept of EDA was introduced by the statistician John W. Tukey in the book Exploratory Data Analysis [14]. According to Tukey [15 p. 329], "the purpose of display is comparison (recognition of phenomena), not numbers." The three key steps of EDA are the following: "(1) display the data; (2) identify salient features; (3) interpret salient features" [16 p. 366]. In other words, data is displayed graphically, the graphical display is observed to see what, if anything, stands out, and then possible meaning is assigned to the resulting observation. During an RCA, instead of looking at data and attempting to come to a concrete conclusion, EDA is used to generate tentative ideas which can then be evaluated through additional statistical methods. For example, an investigator analyzing the cause of a failure may use EDA to gain insights that can be used to form an explanatory hypothesis, which can then be further investigated and tested. Although the ideas generated through EDA may not lead directly to the root cause of a failure, they may point the investigator in the right direction [17]. Often, EDA is performed at an early stage of problem solving to identify relevant variables [18] and is well suited to the analysis of historical data for the generation of questions [19]. In addition to being used for RCA, EDA is also performed prior to statistical analysis of data [20] and is a part of the Six Sigma quality improvement methodology [21]. Often, EDA is used in the analyze phase of a Six Sigma project [22,23]. According to Tukey [24], graphs are used to search for both the presence and the absence of phenomena in data. Although EDA relies heavily upon graphs, subject matter experts (SMEs) are also needed to interpret the data in graphs [25], and EDA is more of an approach and mindset than a tool set. However, there are many associated tools, such as box-and-whisker plots, stem-and-leaf diagrams [26], histograms, multi-vari charts [27], and run charts [28]. Box-and-whisker plots, also known as box plots, are useful for comparing multiple data sets and can also be used for making observations of a single data set. A box-and-whisker plot displays the median of the data set as well as a box containing half of the values and whiskers containing the remaining values, with the exception of outliers, which are depicted beyond the limits of the whiskers [29].
Gojanovic [30] used box-and-whisker plots to display the amount of dissolved oxygen in a beverage that had unacceptably high average values. Batch data was plotted by month for a year's worth of data, and it was quickly determined that the mean value was unexpectedly high due to the presence of outliers in the data and not due to a process that was actually running too high. In this case, more quantitative methods gave an inaccurate picture because the outliers that were easily visible in a box-and-whisker plot had originally been missed. A stem-and-leaf plot displays the shape of the data with stems containing the first digits of the values and leaves containing the remaining digits of each value. This is helpful for looking at the centering of the data and its spread, and for looking for patterns in the data [31]. Histograms can display attribute or variable data. For variable data, the x-axis consists of bins of equal size corresponding to specific values and the y-axis consists of bars indicating the number of values in each bin. The histogram provides information on the spread of the data as well as its shape. In one case study, a histogram showed evidence of bimodality, so operators were asked about potential causes; parts came from two molds, which may have explained the bimodality of the data. In this case, there was no way to tell which mold the parts came from [32]; however, a planned study could have been performed to compare parts from the two molds. This situation illustrates well the use of EDA, where insights from a graphical depiction of the data yield hints which can be further investigated. Another graphical method is the multi-vari chart. A multi-vari chart presents a graphical depiction of sources of variation, such as variation from day to day, from production line to production line, or between locations [33]. Although the multi-vari chart has been around since 1950, it is not often used. One situation in which it was used was to identify variation in the grinding operation of an aviation component. Variation was observed originating from a part that was processed on a grinding wheel, which led the investigators to look at the dressing tool; the grinding variation was happening around the times the grinding wheel was dressed, and the dressing tool was then found to be worn [34]. When possible, data should be viewed in time intervals [35], displaying the order in which it was created. Here, a run chart can be a useful tool for visualizing data in time order, which, for example, is useful for detecting trends [36]. This can be done using statistical software such as Python, R, and MATLAB, or by simply looking at the data in an Excel spreadsheet to see if there is structure in the data [37]. In one example of a simple run chart, Dooly [38] plotted the percentage of time a baby cried during diaper changes versus the number of diaper changes and quickly determined there was no pattern visible in the data. Although EDA was introduced in the 1970s for use with pen and paper [39], it retains its relevance for the software-supported analysis of big data [40]. Big data is often described as having a large volume of data, increasing at a quick velocity, and having a wide variability [41]. Big data has come to prominence due to Industry 4.0, with Cyber-Physical Systems (CPS) linked together through the Internet of Things (IoT) [42]. Industry 4.0 has also led to Quality 4.0, where quality professionals must concern themselves with Artificial Intelligence (AI) [43] and machine learning [44] for the analysis of big data [45]. Ou et al. [46] have proposed a process for the analysis of big data in Quality 4.0 consisting of data collection, preparation of data, EDA and modeling of data, visualization of results and creation of an analysis report, and then decision making based on the results. Bou and Satorra [47] suggest the use of multivariate EDA with software programs such as SPSS, Stata, and R for the exploration of databases with big data. Escobar et al. [48] propose curricula for adapting Six Sigma Green Belt, Black Belt, and Master Black Belt training and certification to Quality 4.0, with EDA for the analysis of big data. Given this overview of EDA and some of its tools, the paper will now describe a case study in which the possible gains from using run charts to identify abnormal occurrences in a data set were assessed empirically.
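The trend detection that run charts support can be made concrete with a common heuristic: counting runs above and below the median of the time-ordered series. The sketch below is an illustration added for this discussion, not part of the cited studies; too few runs hints at clustering or a trend, too many at oscillation.

```python
def runs_about_median(x):
    """Count runs above/below the median of a time-ordered series.

    This is a common run-chart check: a stable process produces a
    moderate number of runs, clustering or trends produce few runs,
    and oscillation produces many.
    """
    xs = sorted(x)
    n = len(xs)
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    # Points exactly on the median are conventionally skipped.
    signs = [v > median for v in x if v != median]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

runs_about_median([1, 1, 1, 5, 5, 5])  # clustered series -> 2 runs
runs_about_median([1, 2, 1, 2, 1, 2])  # oscillating series -> 6 runs
```

Formal run-chart rules compare the observed run count against limits derived from the number of points, but even the raw count is a useful screening aid.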

METHODOLOGY
Two groups of students were presented with the same data. One group received the data in the form of a run chart with production date and measurement results (see Figure 1) and the other received the same data in table form. Table 1 depicts the first ten rows of data. There are six sets of data, labelled dimension 1 through dimension 6, and each was created using a random distribution generator and the standard normal distribution. There are 28 days' worth of data with five values per day. The five values for dimension number 4 on date 20.02 were increased by 0.00181, which was the standard deviation of all values before the change was made. All six data sets were evaluated using an individual chart to ensure that only the values for dimension number 4 on date 20.02 stood out (see Figure 2). An individual chart is a type of statistical process control chart used when the sample size is one; it provides control limits to identify values that are not in a state of statistical control [49]. The control chart in Figure 3 shows six values outside the upper control limit. Five of these are the values that should stand out as salient and one is a value from dimension number 3 that just crosses the upper control limit. The false alarm rate for a stable and normally distributed process using an individual chart is 0.27% [50], so this out-of-control point is not unexpected. Furthermore, it is a single value just crossing the upper control limit and does not stand out as much as the planned out-of-control values for dimension number 4.
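The individual-chart screening described above can be sketched in a few lines. The limit formula (mean ± 2.66 × average moving range) is the standard constant for subgroup size one; the series below is made-up illustrative data, not the study's data set.

```python
import numpy as np

def individual_chart_limits(x):
    """Center line and control limits for an individual (I) chart.

    Limits are mean +/- 2.66 * average moving range, the usual
    constants when the sample size is one.
    """
    x = np.asarray(x, dtype=float)
    mr_bar = np.mean(np.abs(np.diff(x)))  # average moving range
    center = x.mean()
    half_width = 2.66 * mr_bar
    return center - half_width, center, center + half_width

# Hypothetical series in the spirit of the study: stable noise with a
# single shifted value that should fall outside the limits.
series = [0, 0.1, -0.1, 0.05, -0.05, 0.1, 0, -0.1, 0.05, 10.0, 0.0]
lcl, center, ucl = individual_chart_limits(series)
flagged = [i for i, v in enumerate(series) if v > ucl or v < lcl]
# Only the shifted value at index 9 is out of control.
```

Screening the generated data this way, before handing it to participants, is what ensures that only the intended values stand out as salient.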

Fig. 1 Run charts of data provided to a group of students
The study participants were bachelor students of Information Management attending a Machine Learning course at NOVA IMS (Information Management School), located in Lisbon, Portugal. The study was performed at the beginning of the fall semester and involved students who already had a sound statistical background.
The questions were made available using an online platform that automatically timed the students from start to finish. Students were instructed to "Look at the data and identify as fast as you can the value or group of values that stand out." The students were randomly assigned to either seeing the data as a table or as a graph. Twenty students were assigned data in a graph and 29 were assigned data in a table.

RESULTS
Only two out of the 20 study participants in the graph group failed to identify the salient features in the data. In contrast, only one study participant in the table group successfully identified the salient features. Minitab Statistical Software® was used to assess these results using a two-sample proportion test (see Table 2), which is used to compare the proportions of two samples [51]. The hypothesis test found the difference in proportions to be statistically significant, with a p-value much less than 0.05, as expected given the differences described above.

Null hypothesis: H0: P1 - P2 = 0
Alternative hypothesis: H1: P1 - P2 ≠ 0

Method                 Z-Value   P-Value
Normal approximation   11.52     0.000
Fisher's exact                   0.000

Out of the 18 students who correctly identified the salient features in the graph group, only five did not also incorrectly identify additional values as salient; in other words, only about 28% produced no false positives. The study participants with a graph did much better at identifying the salient features, but they also produced false positives in the form of values identified as standing out when the values did not actually stand out. The group that received the data as a table produced only 15 false positives, while the graph group produced 156.
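The normal-approximation result in Table 2 can be reproduced with a short calculation. This is a sketch, not the paper's code: it assumes the unpooled standard error (a common default for the two-proportion test) and the counts reported above, 18 of 20 successes in the graph group and 1 of 29 in the table group.

```python
from math import sqrt, erfc

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sample test of proportions using the normal approximation
    with an unpooled standard error. Returns (z, two_sided_p)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal tail
    return z, p

# Counts from the study: 18/20 correct with the run chart, 1/29 with the table.
z, p = two_proportion_ztest(18, 20, 1, 29)  # z rounds to 11.52, as in Table 2
```

That the computed z-statistic matches the tabled value of 11.52 suggests these are indeed the counts behind Table 2.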

Fig. 2 Control charts of study data
There was also a difference in the time needed to complete the task between the two groups of students. The table group took a mean of 487 seconds to complete the task and the graph group took a mean of 275 seconds to complete the same task. A hypothesis test for the two means, performed with Minitab Statistical Software®, also shows a statistically significant difference between the two groups, with a p-value much lower than 0.05 (see Table 3).
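The two-means comparison can be sketched with Welch's t statistic, a standard form of the two-sample t-test when equal variances are not assumed. The per-student times below are hypothetical (the paper reports only the group means, which these made-up values were chosen to match).

```python
from math import sqrt

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / sqrt(va / na + vb / nb)

# Hypothetical completion times in seconds, matching the reported
# group means of 487 s (table) and 275 s (graph).
table_times = [510, 460, 495, 470, 500]
graph_times = [280, 265, 290, 270, 270]
t = welch_t(table_times, graph_times)
```

With real per-student data, the t statistic would be compared against a t distribution with Welch-Satterthwaite degrees of freedom to obtain the p-value reported in Table 3.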

DISCUSSION
The group of students with a run chart correctly identified the salient features much more often than the group with just a table of data, and identified them much more quickly: the graph group took a mean of 4.6 minutes to complete the task while the table group needed a mean of 8.1 minutes. This empirical evidence confirmed strong, statistically significant differences, showing that in this case a run chart leads to both higher efficiency and higher efficacy (better and faster decisions). This paper has implications for managers. Using EDA is a natural first step in problem solving, whether to improve the quality of a process or to identify the cause of a failure. In both cases, EDA tools can lead to a quicker identification of salient features, which can then be further investigated. Although this has been argued for a long time, there is a need for statistically based evidence of these positive impacts of EDA and of the use of appropriate EDA tools, as provided by our study. However, it is also important to mention that study participants with a graphical depiction of the data also identified many values that were not actual salient features. Investigating false positives can consume resources better used elsewhere, so when using graphical methods it is important to keep in mind the need to avoid flagging many false positives. To counter this risk, we propose the use of graphical methods with built-in rules for identifying values that differ greatly from the others.
Traditional control charts identify values that exceed the three-standard-deviation control limits and should be considered for data collected in time order, as a compromise between the proper identification of abnormal situations and the avoidance of false positives. However, control charts require either knowing how to construct one or having a statistical software package with a control chart option, as well as knowledge of the underlying assumptions and their validity. Control charts also require sufficient data for calculating the control limits.
A box-and-whisker plot can be a good EDA option in situations where a control chart is not the most appropriate solution. The box-and-whisker plot in Figure 3 depicts the data used in this study. According to this tool, there are outliers in five of the six data sets; however, only dimension 4 has outliers far removed from the rest of the data.
A box-and-whisker plot identifies outliers and can be created using a spreadsheet program or even hand calculations if a statistical software package is unavailable.Such a data representation can be used to quickly identify the salient features in the data, which could then, if collected in time order, be viewed in a run chart to identify the time of abnormal occurrences.These salient features could then be investigated to determine if they are related to or even causing the problem under investigation.
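The hand calculation mentioned above is the 1.5 × IQR whisker rule (Tukey's fences). The sketch below is one possible implementation; note that quantile interpolation conventions vary between spreadsheet programs and statistical packages, so borderline points may be classified differently elsewhere.

```python
def iqr_outliers(x):
    """Flag outliers with the 1.5 * IQR rule used in box-and-whisker
    plots: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    xs = sorted(x)
    n = len(xs)

    def quartile(q):
        # Linear-interpolation quantile, one of several common conventions.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in x if v < lo_fence or v > hi_fence]

iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # -> [100]
```

Applied column by column to a data set like the one in this study, such a rule gives the "built-in" identification of extreme values proposed in the discussion.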

CONCLUSION
A study was performed using two groups of students. One group was assigned a run chart and the other was given the same data in a table. The data contained salient features, which were values that stood out. The group with the run chart identified far more of the salient features than the group with a table. However, the group with the run chart also often falsely identified values that should not have stood out. In addition, the group with a table took much longer to complete the task.
The study showed the utility of using a graph to view data. However, the study has a limitation: only one type of graph was studied. As a final note, the authors hope this study shows how practical empirical comparisons can provide additional scientific evidence of the possible benefits of EDA and of the proper application of some of its most popular tools, an area that should be further explored through similar and additional studies.