Monday, April 11, 2016

Anscombe's Quartet: The significance of visualizing your data

Relying only on summary statistics, without looking at how the data is laid out, can lead to erroneous assumptions. Frank Anscombe set out to emphasize this in his paper Graphs in Statistical Analysis http://www.jstor.org/stable/2682899. It is a good read for every budding data scientist and should probably be on any essential reading list.

In his paper, Anscombe presents four different datasets, each with nearly identical statistical summaries, but each producing a very different graph:

Anscombe's Quartet from Wikipedia
Statistical summary
Number of observations: 11
Mean of x: 9.0
Mean of y: 7.5
Sample variance of x: 11
Sample variance of y (to 3 decimal places): between 4.122 and 4.127
Pearson's correlation between x and y of each set: 0.816
Linear regression of each set (to 2 decimal places): y = 3.00 + 0.50x
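
The figures above can be checked directly. What follows is a minimal sketch in Python using only the standard library. The data values are the ones published in Anscombe's paper; the helper name summarize and the dictionary layout are illustrative choices of mine, not anything from the paper.

```python
from statistics import mean, variance

# Anscombe's four datasets; sets 1-3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics listed above for one dataset."""
    mx, my = mean(xs), mean(ys)
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # least-squares slope
    intercept = my - slope * mx       # least-squares intercept
    r = sxy / (sxx * syy) ** 0.5      # Pearson's correlation
    return {
        "n": len(xs),
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),        # sample variance (n - 1 denominator)
        "var_y": variance(ys),
        "r": r,
        "slope": slope,
        "intercept": intercept,
    }


for label, (xs, ys) in QUARTET.items():
    s = summarize(xs, ys)
    print(
        f"set {label}: n={s['n']} mean_x={s['mean_x']:.1f} mean_y={s['mean_y']:.2f} "
        f"var_x={s['var_x']:.1f} var_y={s['var_y']:.3f} r={s['r']:.3f} "
        f"y = {s['intercept']:.2f} + {s['slope']:.2f}x"
    )
```

Running it prints one line per set, and the lines are hard to tell apart, which is exactly the problem the plots expose.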

Of particular interest is the dramatic effect of outliers. Some measures of central tendency, such as the arithmetic mean, are quite sensitive to outliers and, as the Quartet's summaries show, can be quite misleading. Nonetheless, the presence of an outlier may indicate a data entry error or a spike caused by some change in the environment, and may warrant a closer look to assess its merit: the outlier in x3, y3 may be the result of a spike, whereas the one in x4, y4 is most likely an input error.
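
To get a feel for how much a single point can matter, the sketch below refits set 3 with and without its outlier. It assumes, purely for illustration, that the outlier is the point with the largest residual from the full fit; that is a convenience, not a rigorous outlier test, and whether such a point should really be dropped is the judgment call described above. The helper name fit_line is hypothetical.

```python
# Set 3 of Anscombe's quartet.
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope, Pearson's r)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / (sxx * syy) ** 0.5


# Fit on all eleven points.
full_intercept, full_slope, full_r = fit_line(X3, Y3)

# Illustrative assumption: treat the point with the largest absolute
# residual from the full fit as "the" outlier.
residuals = [abs(y - (full_intercept + full_slope * x)) for x, y in zip(X3, Y3)]
outlier_index = residuals.index(max(residuals))

# Refit on the remaining ten points.
trimmed_x = [x for i, x in enumerate(X3) if i != outlier_index]
trimmed_y = [y for i, y in enumerate(Y3) if i != outlier_index]
trim_intercept, trim_slope, trim_r = fit_line(trimmed_x, trimmed_y)

print(f"outlier: x={X3[outlier_index]}, y={Y3[outlier_index]}")
print(f"all points:      y = {full_intercept:.2f} + {full_slope:.2f}x, r = {full_r:.3f}")
print(f"outlier removed: y = {trim_intercept:.2f} + {trim_slope:.2f}x, r = {trim_r:.3f}")
```

With the single point removed, the remaining ten lie almost exactly on a straight line with a noticeably shallower slope than the full fit reports.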

In data science, or any statistical analysis field where raw data is manipulated to unearth insight, make projections, or serve as an intermediate step that informs other processes, the importance of visualizing your data cannot be stressed enough. Visualizing the data informs our intuition and guides the selection and development of statistical models. As is always said: a picture is worth a thousand words...of statistical summaries?
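
As a rough illustration of that point, even a crude text plot exposes structure that the summaries hide. The sketch below uses only the Python standard library and a hypothetical helper, ascii_scatter, that I made up for this post; in practice a proper plotting library would be the usual choice.

```python
# Sets 2 and 4 of Anscombe's quartet.
X2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def ascii_scatter(xs, ys, width=40, height=12):
    """Render points as a crude text scatter plot.

    The top row holds the largest y, the left column the smallest x.
    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the character grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    # Prefix each row with "|" as a minimal y axis.
    return "\n".join("|" + "".join(line) for line in grid)


print("Set 2")
print(ascii_scatter(X2, Y2))
print()
print("Set 4")
print(ascii_scatter(X4, Y4))
```

Set 2 shows up as a curve rather than a line, and set 4 as a vertical stack of points with one lone point far to the right, neither of which the regression line y = 3.00 + 0.50x would suggest.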