Susan Halford (University of Southampton)
The emergence of ‘big data’ is rapidly becoming one of the hallmarks of the 21st Century. It’s almost a cliché to say this, which is perhaps the point. Following earlier observations about the emergent dominance of ‘information’ in the evolution of contemporary Western societies (Bell 1973; Toffler 1980; Castells 1996, 1997, 1998), the scale and pace of recent digital data accumulation have outstripped all expectations. By August 2013, it was estimated that 90% of all the data in the world had been generated in the preceding 24 months and predicted that this scale and rate of data accumulation would persist into the near future, at least.
In the wake of this radical expansion of data, specifically of digital data that can be easily shared and computationally processed, have come some extraordinary claims. In 2008, the highly respected technology magazine Wired claimed that: ‘[this] is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behaviour, from linguistics to sociology. Forget taxonomy, ontology and psychology. Who knows why people do what they do? The point is they do it, and we can track it and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.’
Now, in 2015, to present this as the dominant discourse on Big Data would be to set up a straw man. Despite some persistence in claims such as these, it has become increasingly clear that, although there are a few examples that might justify this position – the Google search engine algorithm, for instance – the numbers rarely speak for themselves. The Web is awash with examples of bad big data analytics, echoing earlier generations of spurious correlation (one of my own recent favourites being the correlation between sociology PhDs awarded and world-wide non-commercial space launches) and otherwise poor analytical practice. By 2013 Wired magazine was reporting that ‘… an increasing number of experts are saying more insistently … Big Data does not automatically yield good analytics’ and insisting that ‘Big Data is a tool, but should not be considered the solution’.
In short, the jury is still out on Big Data. Or to be more sociological about it: the field is in formation. Not least, there are disputes about its definition. In common usage for business and government, it is simply the quantity of data that counts: the label ‘Big Data’ kicks in at the point that the data are too large to be stored locally and analyzed by standard computers and software (Manovich 2011). As computer storage and processing capacities continue on a steep upward curve, so too ‘Big Data’ becomes a moving target. From a more computational perspective, relying on this definition alone misses some of the more specific qualities of Big Data. In addition to their volume, it is increasingly common to point to their variety and velocity as well (Laney 2001). ‘True’ Big Data are unstructured, or present multiple incommensurate formats, and are dynamic: generated at speed and constantly changing. That is to say, not all data are equal and size isn’t the only qualifier for the label; the qualitative nature of the data matters too. Meanwhile, from a more substantive position – that of those concerned with using data to address particular real-world problems – the practical definition of Big Data might mean all the data we can harness, whatever its size in relation to computing capacity or its particular structure and means of generation.
How we define Big Data matters because it shapes our understanding of the expertise that is required to engage with it – to extract the value and deliver the promise. Is this the job for mathematicians and statisticians? Computer scientists? Or ‘domain experts’ – economists, sociologists or geographers – as appropriate to the real-world problems at hand? As the Big Data field forms we see the processes of occupational closure at play: who does this field belong to, who has the expertise, the right to practice? This is of observational interest for those of us who research professions, knowledge and the labour market, as we see how claims to expert knowledge are made by competing disciplines. But it is also of broader interest for those of us concerned with the future of Big Data: the outcome will shape the epistemological foundations of the field. Whether or not it is acknowledged, the disciplinary carve-up of big data will have profound consequences for the questions that are asked, the claims that are made and – ultimately – the value that is derived from this ‘new oil’ in the global economy.
We are not on a level playing field. Whilst the promise of Big Data may be more carefully calibrated now than it was in the early days, the appeal to foundational modernist discourses of progress through Science, and the discovery of ‘truth’, persists. For business and government, the promise of finite knowledge harnessing the power of prediction has obvious appeal. This privileges those disciplines that can accept this epistemological position (more easily) and makes it harder for others to take a position in the field. Partiality, contingency and uncertainty, or recognizing the vested power of different interpretations and solutions, are not the order of the day.
But however important the politics of discipline may be – and it really does matter as more and more funding appears to be targeted to fewer and fewer disciplines in an academy where ‘publish or perish’ is (at least) being joined by ‘generate income or perish’ – to frame the problem in this way may be to miss the point. There is nothing essential or absolute about our academic disciplines or the boundaries that define them at present. Rather, these are inherited from the codification of knowledge through the Enlightenment, on which model of disciplinary expertise the social sciences developed and flourished in the twentieth century. Whilst the discourses of Big Data are deeply shaped by this history as it lives on in the academic divisions and practices of the 21st century, we should try to think about what good Big Data analysis might look like outside of those inherited boxes.
The current development of Data Science, Computational Social Science and Web Science (amongst others perhaps) promises such intervention. But just because they have new labels doesn’t mean that they will overcome the epistemological and disciplinary politics described above or solve the problem, and there is no point in being naïve about this. From the perspective of Web Science – an interdisciplinary approach to the Web, with a particular focus on Web data and analytics – there is no doubt that the tensions described above continue to play out. Indeed, it would be a surprise if they did not, so fundamental are the epistemological differences at times. But working together – designing curricula, teaching, working up funding bids and articles – provides the everyday opportunities to question each other’s position, to find compromises and sometimes new positions that work across disciplines. Whether this is so for Data Science or Computational Social Science I cannot say. My guess is that neither of these describes homogeneous activities but, rather, that there will be considerable diversity under the (new) disciplinary umbrella that each provides. At least I hope so. A more cynical view would be that the labels are more colonizing in their effect, if not their intent: capturing the ground of social science for the mission of computational/data science.
More important than the labels, we might begin to conceptualize the way forward for interdisciplinary Big Data analytics as operating on (at least) three different levels. At the most minimalist, Big Data analytics follow the mathematical model, using computational techniques for data processing, but seek interpretation from substantive disciplines. Theory and models still matter but only in evaluating outputs. This might be the most common scenario at present, perhaps even the best-case scenario in the current climate. But at least that way we avoid some of the more obvious pitfalls of Big Data analytics (see Gary King’s discussion of a project on work and unemployment that analyzed social media data using sentiment analysis – based on identification of key words – the week that Steve Jobs died and might have drawn some very wrong conclusions if Jobs and ‘jobs’ had not been distinguished in the analysis) and start to draw on subject experts for deeper interpretation.
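The Jobs/‘jobs’ pitfall can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the four posts are invented, the function names are hypothetical, and the capitalisation rule is a deliberately crude stand-in for the judgement a subject expert would bring. It is not the method used in the project King describes.

```python
import re

# Hypothetical posts, invented purely for illustration.
POSTS = [
    "So sad about Steve Jobs today",
    "RIP Jobs, you changed everything",
    "No jobs in my town, feeling hopeless",
    "Lost my job last week",
]

# Matches 'job' or 'jobs' as whole words, ignoring case.
KEYWORD = re.compile(r"\bjobs?\b", re.IGNORECASE)


def naive_matches(posts):
    """Keep every post that mentions the keyword, whatever the case."""
    return [p for p in posts if KEYWORD.search(p)]


def employment_matches(posts):
    """Keep posts with at least one mention that is not a likely surname.

    Crude assumption: a capitalised 'Jobs' that does not open the post
    is treated as the surname rather than the common noun.
    """
    kept = []
    for post in posts:
        for m in KEYWORD.finditer(post):
            looks_like_surname = m.group(0) == "Jobs" and m.start() > 0
            if not looks_like_surname:
                kept.append(post)
                break
    return kept


print(len(naive_matches(POSTS)), "posts matched by keyword alone")
print(len(employment_matches(POSTS)), "posts kept after the surname check")
```

On these invented posts the keyword-only filter treats all four as being about employment, while the surname check keeps only the last two. The rule itself would fail on plenty of real text, which is rather the point: deciding what counts as a mention of work is an interpretive question, not only a computational one.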
To add more value, we might work to shape informed research questions to drive our data analytics, rather than assuming that the patterns that emerge are an organic representation of what it is worth knowing. This is not to deny that established research agendas might blinker the definition of questions or to suggest that the unexpected does not lie in Big Data. But it is to recognize that significant value will come from Big Data if we use the insights it might offer to address intransigent social problems, to push forward our understanding of these, rather than suggesting that the data should just ‘speak for themselves’. And we might think more critically about this final point too. Big Data are not ‘raw’ data: they are thoroughly mediated by the mechanisms that generate them – whether these are concepts, methods, platforms, scientific instruments, etc. – and by the way that these data are harvested, stored, analyzed and visualized. If we are really to build Big Data analytics on the most appropriate forms of expertise we must understand this and develop robust methodologies accordingly.
Daniel Bell (1974) The Coming of Post-Industrial Society. New York: Harper Colophon Books.
Manuel Castells (1996) The Rise of the Network Society: The Information Age: Economy, Society and Culture. Oxford: Blackwell.
Manuel Castells (1997) The Power of Identity: The Information Age: Economy, Society and Culture. Oxford: Blackwell.
Manuel Castells (1998) End of the Millennium: The Information Age: Economy, Society and Culture. Oxford: Blackwell.
Doug Laney (2001) 3D Data Management: Controlling Data Volume, Velocity, and Variety. Blogpost. Meta Group Inc. February, 2001.
Lev Manovich (2011) ‘Trending: The Promises and the Challenges of Big Social Data’ Debates in the Digital Humanities, 1–17.
Alvin Toffler (1980) The Third Wave. New York: Random House.
Susan Halford is Professor of Sociology and Head of the Division of Sociology and Social Policy at the University of Southampton. She is Co-Director of the Work Futures Research Centre and Co-Director of the Web Science Institute. @susanjhalford