dots of data adventures<br />Kwamena Appiah-Kubi<br />2016-04-11<br /><br />Anscombe's Quartet: The significance of visualizing your data<br /><br />Relying only on summary statistics, without looking at how the data is laid out, can lead to erroneous assumptions. Frank Anscombe set out to emphasize this in his paper <i>Graphs in Statistical Analysis</i> (<a href="http://www.jstor.org/stable/2682899">http://www.jstor.org/stable/2682899</a>). It is a good read for every budding data scientist and should probably be part of any essential reading list.<br /><br />In his paper Anscombe presents four different datasets, each with nearly identical statistical summaries, but each producing a very different graph:<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-gvcf7kcXW24/Vwt76qqehlI/AAAAAAAAPwo/uwP8b8VMS2UA2lWinndc_Ctdch-17_WWA/s1600/Anscombe%2527s_quartet.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Anscombe's Quartet" border="0" height="290" src="https://3.bp.blogspot.com/-gvcf7kcXW24/Vwt76qqehlI/AAAAAAAAPwo/uwP8b8VMS2UA2lWinndc_Ctdch-17_WWA/s400/Anscombe%2527s_quartet.jpg" title="Anscombe's Quartet" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>Anscombe's Quartet</i> from <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet" target="_blank">Wikipedia</a></td></tr></tbody></table><div style="text-align: center;">Statistical summary</div><div style="text-align: right;">Number of observations: 11</div><div style="text-align: right;">Mean of x's: 9.0</div><div style="text-align: right;">Mean of y's: 7.5</div><div style="text-align: 
right;">Sample variance of x: 11</div><div style="text-align: right;">Sample variance of y (to 2 decimal places): between 4.12 and 4.13</div><div style="text-align: right;">Pearson's correlation between x and y of each set: 0.816</div><div style="text-align: right;">Linear regression of each set (to 2 decimal places): y = 3.00 + 0.50x</div><br /><br />Of particular interest is the dramatic effect of outliers. Some summary measures, such as the arithmetic mean, are quite sensitive to outliers and, as the quartet's summaries show, can be quite misleading. Nonetheless, the presence of an outlier may indicate a data-entry error or a spike caused by some change in the environment, so it warrants a closer look to assess its merit. The outlier in (x3, y3) may be the result of a spike, whereas the one in (x4, y4) is more likely an input error.<br /><br />In data science, or any field of statistical analysis where raw data is manipulated to unearth insight, make projections or serve as an intermediate step that informs other processes, the importance of visualizing your data cannot be stressed enough. Visualizing the data informs our intuition and guides the selection and development of statistical models. As the saying goes, a picture is worth a thousand words... of statistical summaries?<br /><br />Read more:<br /><a href="https://www.quora.com/What-is-the-significance-of-Anscombes-quartet">https://www.quora.com/What-is-the-significance-of-Anscombes-quartet</a><br /><a href="http://data.heapanalytics.com/anscombes-quartet-and-why-summary-statistics-dont-tell-the-whole-story">http://data.heapanalytics.com/anscombes-quartet-and-why-summary-statistics-dont-tell-the-whole-story</a><br /><a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">https://en.wikipedia.org/wiki/Anscombe%27s_quartet</a>
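<br /><br />The statistical summary quoted earlier can be checked with a short Python sketch. It is not part of the original post: the data values are the quartet as commonly reproduced (for example in the Wikipedia article linked above), and the names <i>quartet</i> and <i>summarize</i> are made up for this illustration. It uses only the standard library and computes the covariance by hand so that it does not depend on a recent Python version.

```python
from statistics import mean, variance

# Anscombe's quartet as commonly reproduced; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed in the post for one dataset."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    var_x, var_y = variance(x), variance(y)  # sample variances (divide by n - 1)
    # Sample covariance, using the same n - 1 divisor as the variances.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    slope = cov / var_x  # least-squares slope of y on x
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": var_x,
        "var_y": var_y,
        "r": cov / (var_x * var_y) ** 0.5,  # Pearson's correlation
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (x, y) in quartet.items():
    s = summarize(x, y)
    print(
        f"{name:>3}: n={s['n']} mean_x={s['mean_x']:.1f} mean_y={s['mean_y']:.2f} "
        f"var_x={s['var_x']:.1f} var_y={s['var_y']:.3f} r={s['r']:.3f} "
        f"y = {s['intercept']:.2f} + {s['slope']:.2f}x"
    )
```

The four printed rows should agree to the precision quoted in the summary, which is exactly the point: none of these numbers reveals the curve in set II or the outliers in sets III and IV. Only a plot does.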