In his paper, Anscombe presents four different datasets, each with nearly identical statistical summaries, but each producing a very different graph:
Anscombe's Quartet (image from Wikipedia)
Statistical summary
Number of observations: 11
Mean of x's: 9.0
Mean of y's: 7.5
Sample variance of x: 11
Sample variance of y: 4.125 ± 0.003
Pearson's Correlation between x and y of each set: 0.816
Linear regression of each set (to 2 decimal places): y = 3.00 + 0.50x
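The summary above can be checked with a short script. This is a minimal sketch using only Python's standard library; the data values are the quartet as commonly reproduced from Anscombe's 1973 paper, and the names QUARTET and summarize are illustrative, not part of any library.

```python
from statistics import mean, variance

# Anscombe's quartet as commonly reproduced; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics listed above for one (x, y) dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # sum of squared deviations of x
    syy = sum((y - my) ** 2 for y in ys)  # sum of squared deviations of y
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # ordinary least-squares slope
    return {
        "n": len(xs),
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "var_y": variance(ys),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson's correlation
        "intercept": my - slope * mx,
        "slope": slope,
    }


for name, (xs, ys) in QUARTET.items():
    s = summarize(xs, ys)
    print(
        f"Set {name}: n={s['n']} mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"var_x={s['var_x']:.3f} var_y={s['var_y']:.3f} r={s['r']:.3f} "
        f"fit: y = {s['intercept']:.2f} + {s['slope']:.2f}x"
    )
```

The four printed lines should agree with each other to roughly two decimal places, even though the datasets look nothing alike when plotted.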
Of particular interest is the dramatic effect of outliers. Some measures of central tendency, such as the arithmetic mean, are quite sensitive to outliers and, as the summaries of the Quartet show, may be quite misleading. Nonetheless, the presence of an outlier may indicate an error in data input or a spike caused by some change in the environment, and it warrants a closer look to decide how it should be treated. The outlier in (x3, y3) may be the result of a spike, whereas the one in (x4, y4) is most likely an input error.
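To make the effect of a single outlier concrete, here is a small sketch on the third dataset. It assumes the outlier is simply the point with the largest y value, which happens to hold for this dataset but is not a general detection rule. The helper pearson_r is illustrative and written out by hand to keep the example standard-library only.

```python
from statistics import mean, median

# Third dataset of the quartet, as commonly reproduced.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


# Assumption: the outlier is the point with the largest y value.
i = y3.index(max(y3))
x_trim = x3[:i] + x3[i + 1:]
y_trim = y3[:i] + y3[i + 1:]

print(f"outlier: ({x3[i]}, {y3[i]})")
print(f"mean of y:   {mean(y3):.2f} with outlier, {mean(y_trim):.2f} without")
print(f"median of y: {median(y3):.2f} with outlier, {median(y_trim):.2f} without")
print(f"correlation: {pearson_r(x3, y3):.3f} with outlier, "
      f"{pearson_r(x_trim, y_trim):.3f} without")
```

Dropping that one point moves the mean more than the median and pushes the correlation very close to 1, since the remaining ten points lie almost exactly on a straight line.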
In data science, or any field of statistical analysis that takes raw data and manipulates it to unearth insight, make projections, or serve as an intermediate step that informs other processes, the importance of visualizing your data cannot be stressed enough. Visualizing the data informs our intuition and guides the selection and development of statistical models. As is always said: a picture is worth a thousand words... of statistical summaries?
Read more:
https://www.quora.com/What-is-the-significance-of-Anscombes-quartet
http://data.heapanalytics.com/anscombes-quartet-and-why-summary-statistics-dont-tell-the-whole-story
https://en.wikipedia.org/wiki/Anscombe%27s_quartet