Big Data. Computational social science. Data science. Analytics. These buzzwords are everywhere these days—business, government, the nonprofit sector, you name it—and the social sciences are no different. The question of what to do about the explosion of datasets and -sources far larger than the local norm has moved to the center of many disciplines. Recently in my own field of communication, we’ve seen special issues of several journals devoted to computational social science (my term of choice; hereafter CSS), a growing number of faculty openings, and enough panels at various annual meetings to fill out a mid-size conference unto itself.
But it’ll take more than this to do CSS effectively, and by “do” I mean “conduct research and teach at a world-class level.” These are goals that can, and should, be implemented at the departmental level—just as some communication departments are known for their expertise in survey or rhetorical methods, an enterprising upstart could become the first to gain fame for excellence in CSS. Such a department would need to build strength in at least four distinct areas: faculty, curriculum, hardware, and data. A few departments have addressed one or two of these at some level, but I don’t know of a single one (at least in the US) that shines in all four. It can’t all be done cheaply, but if you believe as I do that CSS looms large in comm’s future, it’ll be well worth it.
But I don’t expect you to take me at my word, so before I go into detail on the four areas, I’d like to justify the enterprise a bit. Lots of strong rationales come to mind but here are three of the most important:
There are some kinds of analysis you can’t do any other way.
Computational skills open an entirely new dimension of empirical possibilities to their practitioners. This dimension holds the potential to radically transform every step of the research process—data acquisition, preprocessing, analysis, visualization, and interpretation. Here I’ll offer two specific examples demonstrating different aspects of this general point.
First, CSS practitioners often denigrate the act of preprocessing raw data into more manipulable forms as “data janitorial” work. This metaphor is extremely misleading: preprocessing determines which analytical methods can be applied to a given dataset, and therefore an expert “data janitor” has many more analytical options at her disposal than one who lacks such skills. For example, one of the first steps in network analysis of Twitter data is converting tweet text into a format suitable for that analysis. NodeXL, which offers perhaps the user-friendliest means of doing so, automatically creates network edges between tweet authors and any usernames included in their tweets. The program can distinguish between “replies” (created by using Twitter’s “reply-to” function) and “mentions” (created by simply including another user’s name in a tweet), but not retweets, modified tweets, CCs, or other referential conventions. The ability to make such distinctions is important given research that shows meaningful differences in how these conventions are used (e.g. Conover et al., 2011). I don’t know of any off-the-shelf software that can do this, but it’s a trivial task in most programming languages. The broader point is that relying on off-the-shelf software tends to sharply limit researchers’ data manipulation options.
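To give a sense of how little code such distinctions require, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how NodeXL or any other tool works: the function name `extract_edges`, the edge labels, and the text patterns ("RT", "MT", and "cc" placed before a username; a leading @username treated as a reply) are my own simplifications. In particular, a true reply is recorded in tweet metadata rather than in the text, so the text-only "reply" rule here is only an approximation.

```python
import re

# Assumed conventions (hypothetical simplifications, not an official spec):
#   "RT @user" -> retweet, "MT @user" -> modified tweet, "cc @user" -> CC,
#   a tweet that starts with "@user" -> reply, any other "@user" -> mention.
USERNAME = r"@(\w{1,15})"
PREFIXED = re.compile(r"\b(RT|MT|CC)\s*:?\s*" + USERNAME, re.IGNORECASE)
ANY_USER = re.compile(USERNAME)
LABELS = {"rt": "retweet", "mt": "modified_tweet", "cc": "cc"}


def extract_edges(author, text):
    """Return a list of (source, target, edge_type) tuples for one tweet."""
    edges = []
    claimed = set()  # start positions of usernames already labeled

    # First pass: usernames introduced by RT, MT, or cc.
    for match in PREFIXED.finditer(text):
        label = LABELS[match.group(1).lower()]
        edges.append((author, match.group(2).lower(), label))
        claimed.add(match.start(2))

    # Second pass: every remaining username is a reply or a plain mention.
    for match in ANY_USER.finditer(text):
        if match.start(1) in claimed:
            continue
        label = "reply" if match.start() == 0 else "mention"
        edges.append((author, match.group(1).lower(), label))

    return edges


if __name__ == "__main__":
    print(extract_edges("alice", "RT @bob: great point cc @Carol"))
    print(extract_edges("dave", "@erin thanks, and hi @frank"))
```

The resulting tuples form a typed edge list that could then be loaded into a network library such as NetworkX, keeping the edge type as an attribute so that retweets, CCs, and mentions can be analyzed separately.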
The second point can be explained very briefly. Many of the most powerful tools for analyzing digital data are modules or libraries for use within different programming environments. A few Python libraries comm researchers might find useful include pandas, scikit-learn, statsmodels, NetworkX, and my own TSM. But working knowledge of the language is a prerequisite for their use.
The field of communication is uniquely positioned to apply CSS in innovative ways.
Computer science and information science already have long head starts on CSS compared to the social sciences. Many of the best CSS tools were created by the students, graduates, and faculty of such departments, some of whom already study communication phenomena such as the flow of news memes online (Leskovec et al., 2009) and partisan polarization in social media (Conover et al., 2011). So one possible response to proposals to build CSS strength in comm departments is: well, CS and IS are the experts here—how could we do better than them? The answer is: in areas of relevance to communication theory and practice, we have a couple distinct advantages.
First, most computer and information scientists lack the theoretical background to explain the meaning and significance behind their findings. Their research orientation is informed primarily by the priorities of engineering, which include speed, accuracy, efficiency, and algorithmic elegance (Freelon, in press). As such, many are more concerned with chasing the cutting edge of software development than with explaining social phenomena. (I’m talking about general trends here—I don’t want to dismiss those CS and IS scholars who have reached across disciplinary lines to produce excellent social scientific research.) In contrast, we marshal our methods in service of communication theory and practice—CSS is no different in this than depth interviews, surveys, or ethnographies. In short, their comparative CSS advantage is in the development of new software and techniques, whereas ours lies in using those tools to analyze and explain communication phenomena.
Second, our capacity for methodological pluralism, particularly the combination of CSS and qualitative methods, is greater than in the engineering sciences. While pluralism is by no means unknown among them, as a group they strongly privilege algorithmic and automated methods. Communication researchers are comparatively more comfortable mixing methods and can more easily apply qualitative and CSS methods to complex research questions. A couple of my own forthcoming papers offer examples of how this can be done (Freelon & Karpf, in press; Freelon, Lynch, & Aday, in press). As a field we are uniquely positioned to cultivate a strong dialectic between macro (CSS) and micro (qualitative) empirical levels that raises the quality of our theoretical explanations.
PhD (and master’s) graduates will be strong candidates for both academic and non-academic positions.
Tenure-track faculty positions are in short supply across academia. Comm is actually doing better than some fields in this regard, but there still aren’t nearly enough TT jobs for all qualified candidates. Training comm master’s and PhD graduates in CSS can be one part of the solution. More than other methodological specializations, CSS training prepares students for jobs outside the academy. The end of this article includes several lists of essential skills for industry-focused data scientists, and many of these include some variant of “communication skills,” “storytelling,” “curiosity,” “visualization,” and/or “domain knowledge.” These non-technical capabilities are already part of most decent PhD programs—add key technical components and you’ve got most of the skills employers are looking for in a data scientist. Comm graduates would obviously be best suited to working in communication-related industries such as PR, journalism, advertising, and social media. Indeed, a handful of comm PhDs have already been hired by major social media and tech companies (e.g. David Huffaker, Lauren Scissors, and Loi Sessions Goulet), although not all are in CSS. We could make this a more common occurrence.
Comm also has an opportunity to make some of its unique insights relevant to industry. For example, to avoid the problematic assumption that digital traces such as Facebook likes and retweets have fixed meanings (authority, influence, endorsement, etc.), we can point out when such assumptions are more and less likely to hold (Freelon, 2014). Similarly, we can help hold a critical eye to companies like Klout that claim to measure concepts such as “influence” using proprietary formulas of unknown validity. Closely scrutinizing such practices holds real business value: it’s important to know whether a given product actually measures what it claims to measure before buying or using it.
All right; now that I’ve sold you on the general prospect, let’s move to the four key areas for CSS.
Faculty is the most obvious of the four: several comm departments have recently hired in CSS (UPenn, UW-Seattle, and UMD-College Park, among others), and this will likely continue into the foreseeable future. Trouble is, you can’t just hire one prof and call yourself a CSS powerhouse. Critical mass is needed (probably at least three faculty and preferably more) to support multiple courses, advisees, and research projects. Eventually, you want students to look at your department and think “wow, look at all the CSS faculty they have; seems like a really supportive place for that kind of work.” Ideally your faculty would specialize in diverse areas of CSS such as machine learning, network analysis, visualization, predictive modeling, etc. But all should be ready, willing, and able to apply these skills to communication research questions. That doesn’t necessarily go without saying: most CSS PhDs don’t have a comm background, and many don’t care much about doing comm research. But giving those who do a supportive work environment will be critical in nurturing the next generation of comm CSS scholars.
Exactly how much these contributions should count is up for debate. I certainly don’t think anyone should be able to earn tenure on visualizations alone, but if they provide scholarly value, they should count for something. This is part and parcel of signaling to CSS faculty that their work is valued—and we all know what happens to talented researchers who don’t get that message.
CSS faculty must be given the latitude to teach in their area(s) of methodological expertise. But our department needs more than just a single introductory-level CSS course. Quantitatively-oriented comm grad students often take three or more stats courses, and those who want to learn research-grade CSS should have similar options. One effective way to start would be to offer a multi-course CSS track similar to the statistics tracks many departments currently offer. Such a track could start with an introduction to Python or R and continue with courses in data manipulation, visualization, machine learning, and/or statistical modeling. Successful completion of the track could earn the student a master’s or PhD certificate in CSS.
It bears emphasizing that any comprehensive CSS curriculum needs to start by teaching students how to code. Our department will not be able to assume that students will enter knowing how to code, just as most currently don’t assume any particular level of statistical knowledge. This isn’t something that can simply be outsourced to the computer science department—communication students will use code for very specific purposes that computer scientists don’t always understand. In addition, learning how to apply computer programming to communication research questions from the start will help keep students motivated and stem the high attrition rates that plague traditional CS education.
Like video and sound production, CSS is an infrastructure-intensive enterprise. Small-scale projects can be executed cheaply on repurposed in-house servers or low-capacity virtual cloud servers, but our lofty goals require a much more substantial capital investment. There are two general directions we could go here: the first is to commit to paying a company like Amazon a monthly fee for a dedicated chunk of virtual computing resources for data collection, analysis, and storage. The major advantage of this approach is convenience: the cloud provider handles all the administrative details so that all our faculty and students need to do is log in and get to work. But going the cloud route is like paying for web hosting: you lock yourself into a long-term relationship with your provider, which means we need to be rich enough to pay it indefinitely. And deciding to switch providers or move to an in-house option down the road is a logistical nightmare that grows with the amount of time spent with our original provider.
The other option is to use university-hosted hardware. The biggest advantage here is cost—the initial capital investment on the machines is a one-time expenditure. This consideration alone may make it the only feasible option for less wealthy departments. There are a number of ways to self-host, each with its own set of issues. Some universities make high-performance computing clusters (HPCCs) available to the entire campus—depending on the exact setup, our department could outsource some or possibly all of its computing needs to it. Obviously this would be very attractive from a budget perspective, but other departments will almost certainly be using the cluster already, which will limit available processing capacity. There may also be other limits on who is allowed to use it, what kinds of software can be installed, how much data can be stored, and how the system is allowed to access the Internet, among others. We would need to have a long conversation (probably several) with the HPCC administrator to determine the extent to which it will suit our needs.
The other self-hosted option would be for the department to build its own small server cluster. This would maximize control and configurability but also require active management and monitoring. Ideally this could be done by someone on the department’s IT staff; it’s not the sort of thing faculty or students should spend their time on. But that probably means adding to an existing staff person’s workload, which may entail a pay raise. Alternatively, if there’s room in the budget, the department could hire a full-time staff person to handle things like cluster administration, purchasing, keeping the disk images up to date, troubleshooting, user management, basic training, etc.
(A quick note about software before I move on to data: most CSS software is FLOSS, and your faculty will know what’s best to use, so it’s not a major planning concern. But if there are specific packages that need to be purchased, those can be added to the data budget, which will almost certainly be much larger.)
There are three basic ways to obtain CSS data: you can collect it yourself, you can buy it, or you can make it. Collecting data in-house is cheaper but more time-consuming and error-prone, while buying it costs money but usually results in better quality. To take social media data as an example, many platforms restrict the amount of data that can be extracted from their public APIs as a quality-of-service measure. As a result it’s difficult to know just how representative self-collected samples are. By contrast, purchasing data from an authorized data vendor such as Gnip buys you some degree of assurance that you’re actually getting all data relevant to your sample frame. For example, if you were to collect tweets from the #Ferguson hashtag using a harvesting server like 140dev, you’d have no idea whether your data were representative or how many tweets you were leaving behind. But purchasing the data allows you to obtain all of the relevant data for whatever time period you’re interested in (at least in theory).
There are also many non-social media types of data of interest to communication researchers that can be purchased. Companies like Nielsen, Comscore, and Alexa sell high-quality audience measurement data for the non-social web. Nielsen sells comparable data for TV (as they have for decades), books (Nielsen BookScan), and music (Nielsen SoundScan). Many TV news transcripts are available through a pay source most comm departments already have access to—LexisNexis. I’m sure there are many other sources I’m not aware of, but this brief list conveys a sense of what’s available to departments with research budgets.
Lastly, some CSS researchers generate their own data by measuring user interaction with bespoke sociotechnical systems. The tradition of computer-based experiments actually has a longer history in communication than many realize (e.g. Sundar & Nass, 2000). Probably the main logistical issue here is the provision of lab space for small-scale, in-house computational experiments. Such resources can also be used to pre-test measures and instruments for later use in online experiments where many factors lie outside the researchers’ control.
As noted earlier, checking all these boxes can’t be done on the cheap. The total cost must be tallied not only in money but also in non-monetary transition costs and (potentially) resistance from skeptical colleagues. There are no guarantees when it comes to shifts of this magnitude—failure’s always a possibility, especially if all the necessary resources don’t come through for one of the four areas. Moreover, there’s other important work that needs to occur at the disciplinary and interdisciplinary levels, including the establishment of official sections in the major professional orgs, specific initiatives to increase CSS visibility in top research outlets, and discipline-spanning institutes that bring together practitioners from across campus. All that said, it seems extremely unlikely to me that the importance of analyzing digital communication data through programming will wane in the near future. If I’m correct, the first comm department to do CSS effectively will emerge as a nationwide model for the discipline and beyond. Sounds like a place I’d like to work.
Thanks to Brian Keegan for his helpful comments on an earlier draft.