Summary: Many people perform data analysis, but few have offered a theoretical model for the process. The descriptions that have been offered disagree with each other and appear to be based on personal intuition. This module examines whether data analysis can be accurately conceptualized as a sensemaking process, as described in the cognitive science literature. A review of 11 articles that feature data analysis tasks suggests that a sensemaking model for data analysis would be accurate. Future work will examine whether and how statistical data analysis safeguards itself against the sources of bias contained in the sensemaking process.
MOTIVATION
Data analysis is the process by which we glean understanding from data. While the origins of data analysis extend at least as far back as Francis Bacon and certainly further, the term “Data Analysis” was first introduced as a field of academic study in 1962 by John Tukey.
Improvements in technology have increased both the amount of data that we can store and the speed with which we can analyze it (Friedman 1997). With each improvement, data analysis becomes more relevant. Modern commentators now claim we live in the midst of a “data deluge,” where we no longer have the cognitive power to understand all of the data available (Hey & Trefethen 2003). Further advances in data collection technology will require further advances in data analysis methods.
The fields of Machine Learning, Data Mining, InfoVis, and Visual Analytics are all attempts to improve upon Data Analysis to better meet our analytical needs. But even with the research already done in these areas, scientists claim that there is very little Data Analysis theory to build upon, and that the theory that is available is hard to access (Unwin 2001, Mallows 2006, Cox 2007). This lack of theoretical understanding stymies improvement in the field. Many academic disciplines create innovations by extending existing theory in new ways. Data analysis, by contrast, appears to proceed by trial and error.
Researchers have offered multiple suggestions to remedy this. Cox and Mallows propose reviewing data analysis case studies to induce a general pattern of analysis. Unwin suggests creating a pattern language of Data Analysis similar to the pattern language first proposed by architects Alexander, Ishikawa, and Silverstein (1977), and used successfully in the field of software engineering (Coplien 1996). While we are intrigued by Unwin’s proposition, we do not presently have the resources to define a complete pattern language. However, we begin our examination of data analysis by reviewing the data analysis case studies that exist in the literature of statistical consulting, as suggested by Cox and Mallows.
RESEARCH QUESTION
Can the sensemaking model of cognitive science provide a theoretical model for data analysis?
PREVIOUS MODELS OF DATA ANALYSIS
Past efforts to describe data analysis reveal a lack of consensus about the process. Below are three illustrations of the process provided by Box (1976), Box, Hunter, and Hunter (1978), and Wild and Pfannkuch (1999).
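[Figures: three diagrams of the data analysis process, from Box (1976), Box, Hunter, and Hunter (1978), and Wild and Pfannkuch (1999)]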
While different, the three diagrams suggest some salient aspects of the data analysis process:
Data analysis shares these features with a process that cognitive scientists have studied extensively: sensemaking.
SENSE-MAKING
Sensemaking is an area of cognitive science that examines how the human brain creates understanding from its surroundings. It began in the 1970s as an extension of communication theory, but was then adopted by experimental and theoretical psychologists. According to sensemaking research, the human brain continuously scans its environment for data and builds this data into a mental model that explains its surroundings. Several sensemaking models exist to explain how this occurs (e.g., the cost structure model, the data-frame model), but each has the same basic components.
The brain begins with a tentative theory, which is also called a model, a schema, or a frame. This theory suggests to the brain what is and what is not relevant data. The brain then constructs this data from the external stimuli it receives through the sense organs. An important facet of sensemaking is that the mind does not automatically accept all present stimuli as data. It instead decides which stimuli would be relevant, searches for them, and then synthesizes them into a piece of data.
The brain compares its currently held theory to the data it has collected. It confirms the theory if the theory accurately fits the data. Otherwise, it will modify the theory to better fit the data or completely reject the theory in favor of a new one. The process occurs continuously; the brain constantly refines existing theories against new data.
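The confirm, modify, or reject cycle described above can be sketched as a simple loop. This is only a toy illustration, not a model taken from the sensemaking literature: the `Theory` class, the function names, the tolerance values, and the observations are all hypothetical, and the "theory" is reduced to a single belief about where observations should fall.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    """A toy 'frame': the belief that observations cluster around `center`."""
    center: float


def update(theory, observation, fit_tol=1.0, reject_tol=5.0):
    """Compare one observation with the theory and return (theory, action).

    The tolerances are arbitrary, hypothetical thresholds chosen for
    illustration only.
    """
    error = abs(observation - theory.center)
    if error <= fit_tol:
        # The theory fits the data: keep it unchanged.
        return theory, "confirm"
    if error <= reject_tol:
        # The theory is somewhat off: move it halfway toward the data.
        return Theory(center=(theory.center + observation) / 2), "modify"
    # The theory is badly off: discard it and start from the new data.
    return Theory(center=observation), "reject"


def make_sense(theory, observations):
    """Refine a theory against a stream of observations, one at a time."""
    actions = []
    for observation in observations:
        theory, action = update(theory, observation)
        actions.append(action)
    return theory, actions


# Made-up observations, chosen so that each branch is exercised.
final_theory, actions = make_sense(Theory(center=0.0), [0.5, 3.0, 20.0, 20.4])
print(actions)
print(final_theory)
```

The loop never terminates on its own accord in the sense the text describes: each new observation is checked against whatever theory is currently held, which mirrors the continuous refinement of theories against new data.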
A theory provides understanding by describing the relationships between data. These relationships assign meaning to the data points and also allow predictions of unobserved data from observed data. A theory also allows the mind to encode data more efficiently than just storing the raw bits. In this way, sensemaking resembles parametric modeling. The brain retains the theory instead of the raw data, but retains the information contained in the data in the parameters of the theory. Different types of theories can describe different types of relationships among data. Mental maps describe spatial relationships, stories describe temporal and causal relationships, scripts describe roles, plans describe an intended sequence of events, etc. (Klein et al. 2003)
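The analogy with parametric modeling can be made concrete with a small sketch. Assuming, purely for illustration, that the "theory" is a straight line, the code below fits the line by ordinary least squares, keeps only its two parameters, and uses them to predict an unobserved value. The data and the function names are made up for this example and do not come from the reviewed papers.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Predict an unobserved y from the retained parameters alone."""
    slope, intercept = params
    return slope * x + intercept


# Made-up observations that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# The "theory" is stored as two numbers instead of ten raw values.
params = fit_line(xs, ys)

# The retained parameters still allow prediction at an unobserved x.
predicted = predict(params, 10.0)
print(params, predicted)
```

Here the two parameters summarize the relationship among the ten raw numbers, and the raw data could be discarded without losing the ability to predict. With noisy data the summary would be lossy, which is one way to think about how a retained theory can differ from the observations that produced it.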
WHY SENSE-MAKING?
Sensemaking shares all of the salient features of data analysis noted above, but there are other reasons to suspect that cognitive science may offer a theoretical foundation for data analysis.
Almost all data analysis is conducted by humans in order to improve their understanding of the world. Hence, data analysis extends the sensemaking process. Moreover, data analysts may use their internal reasoning processes as a model for their data analysis.
As Velleman (1997) points out, data analysis is a revival of Francis Bacon’s scientific method and could be considered the modern incarnation of that method. The history of this method resembles a movement from an internal sensemaking process, which can often be subjective, to an external sensemaking process that tries to be objective. If so, we should expect data analysis to display a foundation based on sensemaking with added safeguards against the biases that sensemaking is vulnerable to.
PRELIMINARY RESULTS
I followed the suggestions of Cox and Mallows and compared the data analysis case studies and recommendations available in the statistical literature to the sensemaking model. In all cases, most of the data analysis prescriptions fell into one of the four sensemaking steps. The remaining prescriptions were all “meta-steps” that dealt with the data analysis process itself (e.g., plan, understand the problem). These meta-steps may be evidence that data analysis has incorporated safeguards against the vulnerabilities of the internal sensemaking process. A visual description of the compliance of 11 papers:
[Figure: compliance of the 11 reviewed papers with the sensemaking model]
LOOKING FORWARD
This preliminary analysis supports the hypothesis that sensemaking may provide a theoretical model for data analysis. Further study must address the question, “How can we provide a rigorous demonstration that data analysis follows a sensemaking model?” As Cox points out, only a small number of data analysis case studies are available in the statistical literature. Future research may employ more direct methods such as observing actual data analyses or scouring computer code used to perform data analyses.
If a cognitive basis is demonstrated, cognitive science may provide opportunities to improve the activity of data analysis. Do current data analysis methods provide adequate safeguards against the well-documented list of sensemaking biases?
Finally, a firmly established model for data analysis can be used to expand the academic understanding of the sub-field. The author originally embarked on this study to address the lack of well defined objectives for data visualization techniques. A better definition of the purpose of data analysis methods may provide new opportunities to optimize data analysis techniques.
REFERENCES
Alexander, et al. (1977). A pattern language: towns, buildings, construction. Oxford University Press, USA.
Bailyn (1977). ‘Research as a cognitive process: Implications for data analysis’. Quality and Quantity 11(2):97–117.
Becker, et al. (1987). ‘Dynamic Graphics for Data Analysis’. Statistical Science 2(4):355–383.
Box (1976). ‘Science and Statistics’. Journal of the American Statistical Association 71 (356):791–799.
Box, et al. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley & Sons.
Cabrera & McDougall (2002). Statistical consulting. Springer Verlag.
Chatfield (1995). Problem solving: a statistician’s guide. Chapman & Hall/CRC.
Coplien (1996). Software patterns. Citeseer.
Cox (2007). ‘Applied statistics: A review’. Annals of Applied Statistics 1(1):1–16.
Friedman (1997). ‘Data mining and statistics: what’s the connection?’ Computing Science and Statistics: Proceedings of the 29th Symposium on the interface.
Hey & Trefethen (2003). ‘The Data Deluge: An e-Science Perspective’ pp. 809–824.
Klein, et al. (2003). ‘A Data/Frame Theory of Sense Making’. In Expertise out of context: proceedings of the sixth International Conference on Naturalistic Decision Making, pp. 113–155.
Mallows (2006). ‘Tukey’s Paper after 40 years (with discussion)’. Technometrics 48(3):319–325.
Pirolli & Card (2005). ‘The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis’. Proceedings of International Conference on Intelligence.
Ribarsky, et al. (2009). ‘Science of analytical reasoning’. Information Visualization 8(4):254–262.
Tukey & Wilk (1966). ‘Data analysis and statistics: an expository overview’. In Proceedings of the November 7-10, 1966, fall joint computer conference, pp. 695–709. ACM.
Tukey (1962). ‘The Future of Data Analysis’. The Annals of Mathematical Statistics 33(1):1–67.
Velleman (1997). The Philosophical Past and the Digital Future of Data Analysis. Princeton University Press.
Wild & Pfannkuch (1999). ‘Statistical thinking in empirical enquiry’. International Statistical Review/Revue Internationale de Statistique 67(3):223–248.