A few months back I started a crowdsourced list of social media data collection tools. A few folks tweeted it today, which reminded me that I never posted it to my website. At the time I had visions of lovingly curating it and maybe even making it searchable based on one’s specific research criteria and programming abilities… but the standard tenure-track priorities quickly caught up with me. Still, some folks seem to find it useful, so here it is:
Big Data. Computational social science. Data science. Analytics. These buzzwords are everywhere these days—business, government, the nonprofit sector, you name it—and the social sciences are no different. The question of what to do about the explosion of datasets and -sources far larger than the local norm has moved to the center of many disciplines. Recently in my own field of communication, we’ve seen special issues of several journals devoted to computational social science (my term of choice; hereafter CSS), a growing number of faculty openings, and enough panels at various annual meetings to fill out a mid-size conference unto itself.
But it’ll take more than this to do CSS effectively, and by “do” I mean “conduct research and teach at a world-class level.” These are goals that can, and should, be implemented at the departmental level—just as some communication departments are known for their expertise in survey or rhetorical methods, an enterprising upstart could become the first to gain fame for excellence in CSS. Such a department would need to build strength in at least four distinct areas: faculty, curriculum, hardware, and data. A few departments have addressed one or two of these at some level, but I don’t know of a single one (at least in the US) that shines in all four. It can’t all be done cheaply, but if you believe as I do that CSS looms large in comm’s future, it’ll be well worth it.
But I don’t expect you to take me at my word, so before I go into detail on the four areas, I’d like to justify the enterprise a bit. Lots of strong rationales come to mind but here are three of the most important:
There are some kinds of analysis you can’t do any other way.
Computational skills open an entirely new dimension of empirical possibilities to their practitioners. This dimension holds the potential to radically transform every step of the research process—data acquisition, preprocessing, analysis, visualization, and interpretation. Here I’ll offer two specific examples demonstrating different aspects of this general point.
First, CSS practitioners often denigrate the act of preprocessing raw data into more manipulable forms as “data janitorial” work. This metaphor is extremely misleading: preprocessing determines which analytical methods can be applied to a given dataset, and therefore an expert “data janitor” has many more analytical options at her disposal than one who lacks such skills. For example, one of the first steps in network analysis of Twitter data is converting tweet text into formats suitable for network analysis. NodeXL, which offers perhaps the user-friendliest means of doing so, automatically creates network edges between tweet authors and any usernames included in their tweets. The program can distinguish between “replies” (created by using Twitter’s “reply-to” function) and “mentions” (created by simply including another user’s name in a tweet), but not retweets, modified tweets, CCs, or other referential conventions. The ability to make such distinctions is important given research that shows meaningful differences in how these conventions are used (e.g. Conover et al., 2011). I don’t know of any off-the-shelf software that can do this, but it’s a trivial task in most programming languages. The broader point is that relying on off-the-shelf software tends to sharply limit researchers’ data manipulation options.
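To give a sense of how little code such distinctions take, here is a minimal Python sketch. It is not the method of any particular tool: the function name, the label names, and the regular expressions are assumptions about common conventions (RT, MT, cc), and treating a leading @name as a reply is only a text-based stand-in for Twitter's reply metadata.

```python
import re

# Assumed text conventions; real tweets are messier than these patterns.
SPECIAL_PATTERNS = [
    ("retweet", re.compile(r"^\s*RT\s+@(\w+)", re.IGNORECASE)),
    ("modified", re.compile(r"\bMT\s+@(\w+)", re.IGNORECASE)),
    ("cc", re.compile(r"\bcc:?\s+@(\w+)", re.IGNORECASE)),
]
LEADING_RE = re.compile(r"^\s*@(\w+)")   # heuristic stand-in for a reply
MENTION_RE = re.compile(r"@(\w+)")       # any remaining @name


def classify_references(author, text):
    """Return (source, target, kind) edges for a single tweet.

    Each target gets one label; the most specific convention wins.
    """
    author = author.lower()
    labelled = {}  # target -> kind, in order of first appearance per pass
    for kind, pattern in SPECIAL_PATTERNS:
        for match in pattern.finditer(text):
            labelled.setdefault(match.group(1).lower(), kind)
    leading = LEADING_RE.match(text)
    if leading:
        labelled.setdefault(leading.group(1).lower(), "reply")
    for match in MENTION_RE.finditer(text):
        labelled.setdefault(match.group(1).lower(), "mention")
    return [(author, target, kind) for target, kind in labelled.items()]


if __name__ == "__main__":
    # Hypothetical tweet text, for illustration only.
    for edge in classify_references("dfreelon", "RT @alexhanna: great paper cc @cfwells"):
        print(edge)
```

Keeping the edge type as a third column means the same output can later be filtered into retweet-only or mention-only networks without re-parsing the tweets.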
The second point can be explained very briefly. Many of the most powerful tools for analyzing digital data are modules or libraries for use within different programming environments. A few Python libraries comm researchers might find useful include pandas, scikit-learn, statsmodels, NetworkX, and my own TSM. But working knowledge of the language is a prerequisite for their use.
The field of communication is uniquely positioned to apply CSS in innovative ways.
Computer science and information science already have long head starts on CSS compared to the social sciences. Many of the best CSS tools were created by the students, graduates, and faculty of such departments, some of whom already study communication phenomena such as the flow of news memes online (Leskovec et al., 2009) and partisan polarization in social media (Conover et al., 2011). So one possible response to proposals to build CSS strength in comm departments is: well, CS and IS are the experts here—how could we do better than them? The answer is: in areas of relevance to communication theory and practice, we have a couple distinct advantages.
First, most computer and information scientists lack the theoretical background to explain the meaning and significance behind their findings. Their research orientation is informed primarily by the priorities of engineering, which include speed, accuracy, efficiency, and algorithmic elegance (Freelon, in press). As such, many are more concerned with chasing the cutting edge of software development than with explaining social phenomena. (I’m talking about general trends here—I don’t want to dismiss those CS and IS scholars who have reached across disciplinary lines to produce excellent social scientific research.) In contrast, we marshal our methods in service of communication theory and practice—CSS is no different in this than depth interviews, surveys, or ethnographies. In short, their comparative CSS advantage is in the development of new software and techniques, whereas ours lies in using those tools to analyze and explain communication phenomena.
Second, our capacity for methodological pluralism, particularly the combination of CSS and qualitative methods, is greater than in the engineering sciences. While pluralism is by no means unknown among them, as a group they strongly privilege algorithmic and automated methods. Communication researchers are comparatively more comfortable mixing methods and can more easily apply qualitative and CSS methods to complex research questions. A couple of my own forthcoming papers offer examples of how this can be done (Freelon & Karpf, in press; Freelon, Lynch, & Aday, in press). As a field we are uniquely positioned to cultivate a strong dialectic between macro (CSS) and micro (qualitative) empirical levels that raises the quality of our theoretical explanations.
PhD (and master’s) graduates will be strong candidates for both academic and non-academic positions.
Tenure-track faculty positions are in short supply across academia. Comm is actually doing better than some fields in this regard, but there still aren’t nearly enough TT jobs for all qualified candidates. Training comm master’s and PhD graduates in CSS can be one part of the solution. More than other methodological specializations, CSS training prepares students for jobs outside the academy. The end of this article includes several lists of essential skills for industry-focused data scientists, and many of these include some variant of “communication skills,” “storytelling,” “curiosity,” “visualization,” and/or “domain knowledge.” These non-technical capabilities are already part of most decent PhD programs—add key technical components and you’ve got most of the skills employers are looking for in a data scientist. Comm graduates would obviously be best suited to working in communication-related industries such as PR, journalism, advertising, and social media. Indeed, a handful of comm PhDs have already been hired by major social media and tech companies (e.g. David Huffaker, Lauren Scissors, and Loi Sessions Goulet), although not all are in CSS. We could make this a more common occurrence.
Comm also has an opportunity to make some of its unique insights relevant to industry. For example, to avoid the problematic assumption that digital traces such as Facebook likes and retweets have fixed meanings (authority, influence, endorsement, etc.), we can point out when such assumptions are more and less likely to hold (Freelon, 2014). Similarly, we can help hold a critical eye to companies like Klout that claim to measure concepts such as “influence” using proprietary formulas of unknown validity. Closely scrutinizing such practices holds real business value: it’s important to know whether a given product actually measures what it claims to measure before buying or using it.
All right; now that I’ve sold you on the general prospect, let’s move to the four key areas for CSS.
Faculty
This one’s pretty obvious—several comm departments have recently hired in CSS (UPenn, UW-Seattle, and UMD-College Park, among others), and this will likely continue into the foreseeable future. Trouble is, you can’t just hire one prof and call yourself a CSS powerhouse. Critical mass is needed—probably at least three faculty and preferably more—to support multiple courses, advisees, and research projects. Eventually, you want students to look at your department and think “wow, look at all the CSS faculty they have; seems like a really supportive place for that kind of work.” Ideally your faculty would specialize in diverse areas of CSS such as machine learning, network analysis, visualization, predictive modeling, etc. But all should be ready, willing, and able to apply these skills to communication research questions. That doesn’t necessarily go without saying: most CSS PhDs don’t have a comm background, and many don’t care much about doing comm research. But giving those that do a supportive work environment will be critical in nurturing the next generation of comm CSS scholars.
Exactly how much these contributions should count is up for debate. I certainly don’t think anyone should be able to earn tenure on visualizations alone, but if they provide scholarly value, they should count for something. This is part and parcel of signaling to CSS faculty that their work is valued—and we all know what happens to talented researchers who don’t get that message.
Curriculum
CSS faculty must be given the latitude to teach in their area(s) of methodological expertise. But our department needs more than just a single introductory-level CSS course. Quantitatively-oriented comm grad students often take three or more stats courses, and those who want to learn research-grade CSS should have similar options. One effective way to start would be to offer a multi-course CSS track similar to the statistics tracks many departments currently offer. Such a track could start with an introduction to Python or R and continue with courses in data manipulation, visualization, machine learning, and/or statistical modeling. Successful completion of the track could earn the student a master’s or PhD certificate in CSS.
It bears emphasizing that any comprehensive CSS curriculum needs to start by teaching students how to code. Our department will not be able to assume that students will enter knowing how to code, just as most currently don’t assume any particular level of statistical knowledge. This isn’t something that can simply be outsourced to the computer science department—communication students will use code for very specific purposes that computer scientists don’t always understand. In addition, learning how to apply computer programming to communication research questions from the start will help keep students motivated and stem the high attrition rates that plague traditional CS education.
Hardware
Like video and sound production, CSS is an infrastructure-intensive enterprise. Small-scale projects can be executed cheaply on repurposed in-house servers or low-capacity virtual cloud servers, but our lofty goals require a much more substantial capital investment. There are two general directions we could go here: the first is to commit to paying a company like Amazon a monthly fee for a dedicated chunk of virtual computing resources for data collection, analysis, and storage. The major advantage of this approach is convenience: the cloud provider handles all the administrative details so that all our faculty and students need do is log in and get to work. But going the cloud route is like paying for web hosting: you lock yourself into a long-term relationship with your provider, which means we need to be rich enough to pay it indefinitely. And deciding to switch providers or move to an in-house option down the road is a logistical nightmare proportional to the amount of time spent with our original provider.
The other option is to use university-hosted hardware. The biggest advantage here is cost—the initial capital investment on the machines is a one-time expenditure. This consideration alone may make it the only feasible option for less wealthy departments. There are a number of ways to self-host, each with its own set of issues. Some universities make high-performance computing clusters (HPCCs) available to the entire campus—depending on the exact setup, our department could outsource some or possibly all of its computing needs to it. Obviously this would be very attractive from a budget perspective, but other departments will almost certainly be using the cluster already, which will limit available processing capacity. There may also be other limits on who is allowed to use it, what kinds of software can be installed, how much data can be stored, and how the system is allowed to access the Internet, among others. We would need to have a long conversation (probably several) with the HPCC administrator to determine the extent to which it will suit our needs.
The other self-hosted option would be for the department to build its own small server cluster. This would maximize control and configurability but also require active management and monitoring. Ideally this could be done by someone on the department’s IT staff; it’s not the sort of thing faculty or students should spend their time on. But that probably means adding to an existing staff person’s workload, which may entail a pay raise. Alternatively, if there’s room in the budget, the department could hire a full-time staff person to handle things like cluster administration, purchasing, keeping the disk images up to date, troubleshooting, user management, basic training, etc.
(A quick note about software before I move on to data: most CSS software is FLOSS, and your faculty will know what’s best to use, so it’s not a major planning concern. But if there are specific packages that need to be purchased, those can be added to the data budget, which will almost certainly be much larger.)
Data
There are three basic ways to obtain CSS data: you can collect it yourself, you can buy it, or you can make it. Collecting data in-house is cheaper but more time-consuming and error-prone, while buying it costs money but usually results in better quality. To take social media data as an example, many platforms restrict the amount of data that can be extracted from their public APIs as a quality-of-service measure. As a result it’s difficult to know just how representative self-collected samples are. Purchasing data from an authorized data vendor such as Gnip also buys you some degree of assurance that you’re actually getting all data relevant to your sample frame. For example, if you were to collect tweets from the #Ferguson hashtag using a harvesting server like 140dev, you’d have no idea whether your data were representative or how many tweets you were leaving behind. But purchasing the data allows you to obtain all of the relevant data for whatever time period you’re interested in (at least in theory).
There are also many non-social media types of data of interest to communication researchers that can be purchased. Companies like Nielsen, Comscore, and Alexa sell high-quality audience measurement data for the non-social web. Nielsen sells comparable data for TV (as they have for decades), books (Nielsen BookScan), and music (Nielsen SoundScan). Many TV news transcripts are available through a pay source most comm departments already have access to—LexisNexis. I’m sure there are many other sources I’m not aware of, but this brief list conveys a sense of what’s available to departments with research budgets.
Lastly, some CSS researchers generate their own data by measuring user interaction with bespoke sociotechnical systems. The tradition of computer-based experiments actually has a longer history in communication than many realize (e.g. Sundar & Nass, 2000). Probably the main logistical issue here is the provision of lab space for small-scale, in-house computational experiments. Such resources can also be used to pre-test measures and instruments for later use in online experiments where many factors lie outside the researchers’ control.
As noted earlier, checking all these boxes can’t be done on the cheap. The total cost must be tallied not only in money but also in non-monetary transition costs and (potentially) resistance from skeptical colleagues. There are no guarantees when it comes to shifts of this magnitude—failure’s always a possibility, especially if all the necessary resources don’t come through for one of the four areas. Moreover, there’s other important work that needs to occur at the disciplinary and interdisciplinary levels, including the establishment of official sections in the major professional orgs, specific initiatives to increase CSS visibility in top research outlets, and discipline-spanning institutes that bring together practitioners from across campus. All that said, it seems extremely unlikely to me that the importance of analyzing digital communication data through programming will wane in the near future. If I’m correct, the first comm department to do CSS effectively will emerge as a nationwide model for the discipline and beyond. Sounds like a place I’d like to work.
Thanks to Brian Keegan for his helpful comments on an earlier draft.
I just completed a new version of the Python version of T2G which adds a few new features. Most prominent among these is the ability to extract only retweets or only mentions for visualization in Gephi. Recent research has shown substantive differences between networks based on these behaviors, so it is important for researchers to be able to distinguish between them. The new version also fixes a bug that halted processing whenever two @s appeared adjacent to one another in a tweet (i.e. “@@”).
T2G 0.3 features four extraction modes, each of which yields different sets of network edges:
1. Extracts all edges, does not differentiate between retweets and mentions, and includes singletons (users who are mentioned by no one and mention no one)
2. Extracts all edges, does not differentiate between retweets and mentions, and excludes singletons
3. Extracts retweets only and excludes singletons
4. Extracts mentions only (i.e. non-retweets) and excludes singletons
To try the different modes, change the value of the variable ‘extmode’ on line 16 to a number from 1 to 4 corresponding to the numbered modes above. Easy!
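For readers curious how the four modes differ in practice, here is a short illustrative sketch. It is not T2G's actual code: the function name `extract_edges`, the rule that a name directly following "RT" counts as a retweet, and the use of an empty target to mark singletons are all assumptions made for the example. Only the `extmode` values 1 to 4 come from the description above.

```python
import re

RT_RE = re.compile(r"\bRT\s+@(\w+)", re.IGNORECASE)  # assumed retweet convention
MENTION_RE = re.compile(r"@(\w+)")                    # also copes with "@@name"


def extract_edges(rows, extmode=1):
    """Turn (author, tweet_text) rows into (source, target) edges.

    extmode 1: all edges plus singletons, returned as (author, "")
    extmode 2: all edges, no singletons
    extmode 3: retweet edges only
    extmode 4: non-retweet mention edges only
    """
    if extmode not in (1, 2, 3, 4):
        raise ValueError("extmode must be 1, 2, 3, or 4")
    edges = []
    authors = []
    for author, text in rows:
        author = author.lower()
        authors.append(author)
        # Simplification: every occurrence of a retweeted name in this
        # tweet is treated as part of the retweet.
        retweeted = {name.lower() for name in RT_RE.findall(text)}
        for name in MENTION_RE.findall(text):
            target = name.lower()
            is_retweet = target in retweeted
            if extmode == 3 and not is_retweet:
                continue
            if extmode == 4 and is_retweet:
                continue
            edges.append((author, target))
    if extmode == 1:
        connected = {user for edge in edges for user in edge}
        for author in dict.fromkeys(authors):  # de-duplicate, keep order
            if author not in connected:
                edges.append((author, ""))
    return edges


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    demo = [("a", "RT @b: hello @c"), ("d", "no mentions here"), ("b", "thanks @a")]
    for mode in (1, 2, 3, 4):
        print(mode, extract_edges(demo, extmode=mode))
```

Because the pattern simply searches for the next valid @name, adjacent "@@" characters are skipped rather than causing an error.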
EDIT 07/10/13: A new version of T2G for Python has been posted here–it has options to extract only retweets or only mentions, among other new features.
A few weeks ago I posted a spreadsheet that converted tweet mention data into Gephi format for social network analysis. A key limitation of that spreadsheet is that it only converts the first name mentioned in each tweet, discarding the rest. For example, for the following tweet:
— Deen Freelon (@dfreelon) May 12, 2013
that spreadsheet would pull @alexhanna into the Gephi file as one of my mentions but not @cfwells or @kbculver.
To remedy this issue, I’ve created T2G, a solution that converts all Twitter mention data fed into it to Gephi format. T2G comes in two flavors, Python and PHP, each of which does the same thing. The PHP edition is more user-friendly, while the Python edition is faster and easier to set up. All you need to do is supply a CSV file containing two columns: the first (leftmost) filled with the tweet authors’ usernames, and the second filled with their corresponding tweets. You’ll find additional instructions in an extended comment at the top of each script. Please ensure that you have the appropriate interpreter installed (PHP or Python) before trying to use either of these scripts.
- T2G 0.1 (PHP edition) — rename “t2g.txt” to “t2g.php” to activate the script
- T2G 0.1 (Python edition)
Both of these scripts produce equivalent output, albeit in a slightly different order (you can rank the data in alphabetical order to check if you like).
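The basic conversion described above (a two-column CSV in, one edge per mentioned name out) can be sketched in a few lines of Python. This is a simplified illustration rather than the T2G source: the function name is hypothetical, the sample tweet text is invented, and the "Source"/"Target" header is used on the assumption that the output will be loaded as an edge list in Gephi.

```python
import csv
import io
import re

MENTION_RE = re.compile(r"@(\w+)")


def tweets_to_edge_csv(infile, outfile):
    """Read (author, tweet) rows and write one Source,Target row per mention.

    Returns the number of edges written.
    """
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    count = 0
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        author = row[0].strip().lstrip("@").lower()
        # findall returns every mentioned name, not just the first one
        for name in MENTION_RE.findall(row[1]):
            writer.writerow([author, name.lower()])
            count += 1
    return count


if __name__ == "__main__":
    # Invented sample row; in-memory files stand in for files on disk.
    sample = io.StringIO('dfreelon,"Thanks @alexhanna, @cfwells and @kbculver!"\n')
    out = io.StringIO()
    print(tweets_to_edge_csv(sample, out), "edges")
    print(out.getvalue())
```

With real data, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, which is what the `csv` module expects.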
You can test a “lite” version of the PHP edition below–it will convert only the first 100 tweets in your file. Feel free to test it using this sample file, which contains some of my own recent tweets formatted to the above specifications.
T2G Lite, PHP Edition 0.1
A couple recent articles have gotten me thinking about methods for geolocating Twitter users: Kalev Leetaru et al.’s recent double-sized piece in First Monday explaining how to identify the locations of users in the absence of GPS data; and the Floating Sheep collective’s new Twitter “hate map,” which has received a fair amount of media attention. The ability to know where social media users are located is pretty valuable: among other things, it promises to help us understand the role of geography in predicting or explaining different outcomes of interest. But we need to adjust our enthusiasm about these methods to fit their limitations. They have great potential, but (like most research methods) they’re not as complete as we might like them to be.
Let’s start with the gold standard: latitude/longitude coordinates. When Twitter users grant the service access to their GPS devices and/or cellular location info, their current latitude and longitude coordinates are beamed out as metadata attached to every tweet. Because these data are generated automatically via very reliable hardware and software, we can be reasonably certain of their accuracy. But according to Leetaru et al., only about 1.6 percent of users have this functionality turned on. Due to privacy concerns, Twitter offers it on an opt-in basis, which partly explains the low level of uptake.
For the social science researcher, relying on lat/long for geolocation in Twitter raises a major sampling bias issue: what if users who have this feature turned on differ systematically in certain ways from those who don’t? Here are a few plausible albeit untested (as far as I know) characteristics that geolocatable social media users may be more likely to exhibit:
- High levels of formal education
- General adeptness/comfort with digital technologies
- Living in a politically stable country
- An ideological commitment to “publicness”
I’m sure you could probably come up with more. The point is, we cannot assume that, for example, a map containing geotagged racist/homophobic/ableist tweets faithfully represents the broader Twitter hate community. If all hate-tweets came geotagged, the map might look very different, especially since as things stand now many are smart and motivated enough to render their hate less visible.
Fortunately, lat/long coordinates are not our only options for trying to figure out where tweeps are. Leetaru et al. helpfully offer a series of methods for doing so in cases where this information is absent (other stabs at this include Cheng et al., 2010 and Hecht et al., 2011). Leetaru et al.’s most effective methods focus on the freetext “Bio” and “Profile” fields, which when combined increase the number of correctly IDed locations at the city level to 34%. This represents an increase of more than an order of magnitude over what lat/long alone allow as well as a very cool research finding in its own right. However, the sampling bias problem applies with nearly equal force to this enhanced data: the strong possibility still exists that the nearly two-thirds of unlocatable tweeps differ in critical ways from those whose locations can be identified.
So, what to do? Ideally, to be able to generalize effectively, we want to be able to say that the individual-level characteristics and overall geographic distribution of our geoidentified users resemble those of a representative sample of all users within our sampling frame. But this is a very tall order methodologically, and even if we could accomplish it, the results would likely disappoint us.
Our options at this point depend upon the required level of location granularity: the coarser this is, the better we’ll be able to do. If, for example, we only need country-level data for a fairly small N of countries, we can take advantage of the fact that it is easier to identify a user’s country than her city. One strategy here would be to start with string-matching methods like those used by Leetaru et al. and Hecht et al., which attempt to identify locations listed in various Twitter fields using dictionaries of place names. Next, for users whose locations can’t be identified this way, a less definitive machine-learning method could be substituted to guess locations based on tweet text. This second method has the notable disadvantage of forcing a location guess for each user, introducing the issue of misidentification, whereas the first simply leaves unlabeled all users that don’t yield conclusive dictionary matches. (It is also more computationally intensive due to the higher volume of data required and the complexity of most machine-learning algorithms.) Nevertheless, Hecht et al. achieve accuracy rates between 73% and 89% using this method at the country level (depending on how the data are sampled), suggesting that it could help researchers address the sampling bias issue in some scenarios. It would probably suffice to identify relatively small randomly-selected subsamples for each country of interest using machine learning, compare them to those IDed via string-matching, and search for major differences between each pair of groups.
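The two-stage strategy can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure used by Leetaru et al. or Hecht et al.: the gazetteer, the country codes, the function names, and especially the keyword-counting fallback (a stand-in for a real machine-learning classifier) are all hypothetical.

```python
import re

# Hypothetical mini-gazetteer mapping place names to country codes.
PLACE_TO_COUNTRY = {
    "paris": "FR", "france": "FR",
    "nairobi": "KE", "kenya": "KE",
    "chicago": "US", "usa": "US",
}

# Hypothetical hint words used by the toy fallback below.
HINT_WORDS = {
    "FR": {"bonjour", "merci"},
    "KE": {"matatu", "nairobi"},
    "US": {"superbowl", "thanksgiving"},
}


def match_country(profile_text):
    """Stage 1: return a country only on an unambiguous dictionary match."""
    tokens = re.findall(r"[a-z]+", profile_text.lower())
    countries = {PLACE_TO_COUNTRY[t] for t in tokens if t in PLACE_TO_COUNTRY}
    return countries.pop() if len(countries) == 1 else None


def toy_guess(tweet_text):
    """Stage 2 stand-in: always returns a guess, even with no evidence."""
    words = set(re.findall(r"[a-z]+", tweet_text.lower()))
    return max(sorted(HINT_WORDS), key=lambda c: len(words & HINT_WORDS[c]))


def locate_users(users, guess_country=toy_guess):
    """users maps a username to (profile_text, tweet_text).

    Returns username -> (country, method), where method records which
    stage produced the label so the two groups can be compared later.
    """
    located = {}
    for user, (profile, tweets) in users.items():
        country = match_country(profile)
        if country is not None:
            located[user] = (country, "dictionary")
        else:
            located[user] = (guess_country(tweets), "guess")
    return located


if __name__ == "__main__":
    # Invented users, for illustration only.
    demo = {
        "u1": ("Living in Paris", "another rainy day"),
        "u2": ("citizen of the world", "merci, bonjour!"),
        "u3": ("Paris / Chicago", "stuck on a matatu again"),
    }
    for user, label in locate_users(demo).items():
        print(user, label)
```

Recording the method alongside each label is the design choice that matters here: it keeps the forced guesses separate from the conclusive dictionary matches, which is what makes a comparison between the two groups possible.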
The prospects for determining the representativeness of geolocated users at more specific locations than their countries are much slimmer. The accuracy of Hecht et al.’s machine-learning geolocation technique drops from 89%-73% at the country level to 30%-27% at the US state level. Extending this logic, it’s probably safe to assume an inverse correlation between algorithm accuracy and location specificity when 1) location info is sparse in the data (as with tweets) and 2) the set of possible locations is very large or unbounded. Under these conditions I can’t think of how one might go about measuring how representative the known locations are of the unknowns. (If you have any ideas, leave a comment!) At that point you might simply have to grant that you can’t say much about how representative your sample is, and justify your study’s contributions on other grounds.
As the nation waits to find out who our next president will be, I thought it would be interesting to take a quick look at how Obama’s and Romney’s Facebook followers reacted to content posted to the candidates’ official Facebook walls. As part of a larger research project, I’m extracting all public comments posted to both walls between April 25 (the day the RNC endorsed Romney) and November 2, 2012. While doing so, I noticed some clear patterns in the kinds of content each group of followers showed most interest in. By charting the numbers of likes, shares, and comments for each message during the aforementioned time period, we can get a sense of when attention spiked and how much. Examining the top five most liked, shared, and commented-on posts reveals what topics attracted the most Facebook attention during the final leg of the campaign. Let’s start with Obama:
Likes, shares, and comments on Obama’s official Facebook wall, 4/25/12 – 11/02/12
As you’ve probably already noticed, I’ve included the associated text for the top five most-liked posts in the dataset and the images associated with the first four in chronological order. (I didn’t have room for the fifth image, but the text speaks for itself best among the five.) The first thing that jumped out at me here was how none of the top five most-liked posts had anything to do with politics–they were scenes from the Obamas’ family life, the kinds of moments that could be found in any American family photo album. The wholesome sentiments these shots convey couldn’t be farther from the knock-down drag-out negativity flooding the airwaves and the Internet throughout the timeframe, which may explain why they were so popular among Obama fans.
Romney’s fans, however, show a very different pattern:
Likes, shares, and comments on Romney’s official Facebook wall, 4/25/12 – 11/02/12
All of Romney’s top five most-liked posts were direct calls to push their “like” count over some numerical threshold. Romney’s fans seem to be more goal-oriented than Obama’s: rather than reveling in idyllic family scenes, they were most interested in showing off their support for Romney to their Facebook friends. One broader interpretation here is that Romney’s Facebook fans were more engaged in the campaign than Obama’s, who seemed less inclined to get political. This is also reflected in the fact that although Obama had much higher median numbers of likes (111,231 vs. 64,182), shares (11,753 vs. 3,644), and comments (7,309 vs. 4,376) than Romney during this period, Romney had much higher “like” peaks. (Romney posted over twice as many messages as Obama, so his “like” totals are higher: 58.5M to 42.7M.)
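It may seem odd that one page can have the higher median while the other has the higher total and the higher peaks. A quick sketch with made-up numbers shows how this happens when one page posts more often and has a few runaway hits. The like counts and the function name below are invented purely for illustration and are not drawn from the dataset.

```python
from statistics import median


def summarize(likes):
    """Summarize a list of per-post like counts."""
    return {
        "posts": len(likes),
        "median": median(likes),
        "total": sum(likes),
        "peak": max(likes),
    }


# Invented like counts: page_b posts twice as often and has one runaway hit.
page_a = [100, 110, 120, 130, 140]
page_b = [40, 50, 60, 60, 60, 60, 70, 70, 80, 500]

if __name__ == "__main__":
    print("page_a", summarize(page_a))
    print("page_b", summarize(page_b))
```

The median describes the typical post, while the total and the peak are driven by posting volume and by outliers, so the three figures answer different questions.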
Likes vs. shares vs. comments
How did the most-liked messages stack up against the most-shared and most-commented-on messages? Let’s have a look, starting with Obama:
| Rank | Most-liked post | Likes | Most-shared post | Shares | Most-commented-on post | Comments |
|---|---|---|---|---|---|---|
| 1 | 20 years ago today. | 674,164 | If you’re on Team Obama, let him know. | 170,046 | It’s Barack’s birthday today–wish him a happy one by signing his card! http://OFA.BO/HHfZev | 75,876 |
| 2 | The most important meeting of the day. | 657,501 | Share if you agree: President Obama won the final debate because his leadership has made America stronger, safer, and more secure than we were four years ago. | 95,212 | President Obama believes everybody deserves a fair shot–not just some: http://OFA.BO/8L6BWy | 53,537 |
| 3 | Summer. | 615,734 | Add your name, then pass it on: http://OFA.BO/8UHQcn | 91,234 | Add your name, then pass it on: http://OFA.BO/8UHQcn | 45,674 |
| 4 | “Being married to Michelle, and having these tall, beautiful, strong-willed girls in my house, never allows me to underestimate women.” –President Obama | 573,270 | http://OFA.BO/HxgaZz [link goes to a signup page for the OFA mailing list] | 85,407 | [photo of Obama captioned “Same-sex couples should be able to get married.”] | 41,685 |
| 5 | Michelle’s biggest fans watching her convention speech from home last night. | 554,713 | Share this with your friends and family if you support this plan to keep us moving forward. | 63,551 | Share if you agree: President Obama won the final debate because his leadership has made America stronger, safer, and more secure than we were four years ago. | 39,469 |
My quick read on this table is that Obama supporters use shares and comments much more politically than they use likes. The top five most-shared messages are all about general support for Obama as opposed to specific policy issues. Comments are a mixed bag, with the top spot going to a call for birthday wishes and the fourth spot containing the sole policy statement in the entire table. The remaining most-commented posts are similar in nature to the most-shared.
Romney’s most-shared and -commented posts are both similar to and different from Obama’s:
| Rank | Most-liked post | Likes | Most-shared post | Shares | Most-commented-on post | Comments |
|---|---|---|---|---|---|---|
| 1 | We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt! | 1,167,589 | We don’t belong to government, the government belongs to us. | 93,329 | We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt! | 105,839 |
| 2 | We’re almost there – Help us get to 10 million Likes! | 1,112,300 | It’s the people of America that make it the unique nation that it is. ‘Like’ if you agree that entrepreneurs, not government, create successful businesses. | 70,373 | The American people know we’re on the wrong track, but how will President Obama get us on the right track? http://mi.tt/S8WQWZ | 66,622 |
| 3 | Stand with Mitt. ‘Like’ and share to help us get to 6 million Likes! | 986,653 | We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt! | 62,905 | Like and share to help us get to 8 million likes! | 62,691 |
| 4 | Help us get to 7 million likes! ‘Like’ and share to show you’re with Mitt. | 719,837 | Like and share to help us get to 8 million likes! | 46,524 | The path we’re taking is not working. It is time for a new path. Donate today and help us get America back on track http://mi.tt/QZkDpL | 52,059 |
| 5 | Like and share to help us get to 8 million likes! | 614,492 | I intend to lead and to have an America that’s strong and helps lead the world. ‘Like’ and share if you will stand with me. | 39,652 | We don’t have to settle. America needs a new path to a real recovery. Contribute $15 and help us deliver it. http://mi.tt/Tacap0 | 47,732 |
Romney’s most-shared messages are similar to Obama’s in their lack of specificity. Unlike with Obama’s posts, there is some overlap between the three modes of interaction–at least one “help us get to X million likes” post shows up on each list. The most interesting thing about the most-commented posts is that three of the five are pretty clear attacks on Obama, while I see a couple of Obama’s most-commented posts as indirect attacks at best. The idea that “everybody deserves a fair shot, not just some” could be a shot at Romney’s supposed elitism, but the claim that Obama’s “leadership has made America stronger, safer, and more secure than we were four years ago” is more about Obama than about Romney.
I think these data show some definite patterns in the types of engagement the Romney and Obama Facebook pages elicited. One important point about these data I want to stress is that they say much more about each campaign’s supporters than they do about the candidates. For example, Obama asked his supporters to like and share content, and Romney talked about his family, but those posts didn’t resonate as much with their followers. I also find the contrast with Twitter quite instructive–many studies, including my own research, have found Twitter activity to be highly event-driven, spiking when big stories break. Activity on the candidates’ walls looks to be much less so–few of the top five messages reference time-specific events, and few were posted on milestone days for either campaign. So it looks to me like the campaigns have a much greater capacity to drive attention with particular types of content on Facebook than on Twitter, which functions as more of a real-time information distribution network.
If anything in particular jumps out at you in these data or you disagree with any of my interpretations, I’d love to hear about it in the comments.
Update 2/21/2012: As my colleague Alex Hanna recently informed me, up to 2% of the archives below may consist of duplicate tweet IDs. If you intend to work with this data, I highly recommend removing all the duplicates first.
Last May I posted some very basic descriptive statistics and charts on the usage of Arab Spring-related Twitter hashtags to this blog. Some of these findings were later included in a report titled “Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?” published by the Project on Information Technology and Political Islam of the University of Washington. The data all came from the free online service formerly known as TwapperKeeper (now a part of HootSuite), which prior to March 20, 2011 allowed users to publicly initiate tweet archiving jobs and download the archived content at their convenience. Unfortunately for Internet researchers, Twitter decided on that date to forbid the public distribution of Twitter content, significantly diminishing the utility of TwapperKeeper and other services like it.
Since the release of the PITPI report in September 2011, several scholars have expressed interest in obtaining the Twitter data for their own research. I refused these requests categorically on the grounds that they would violate Twitter’s terms of service. However, one of these scholars asked Twitter directly about what data is allowed to be shared and what is not. He discovered that distributing numerical Twitter object IDs is not a violation of the TOS. I quote in full the message Twitter sent him below:
Under our API Terms of Service (https://dev.twitter.com/terms/api-terms), you may not resyndicate or share Twitter content, including datasets of Tweet text and follow relationships. You may, however, share datasets of Twitter object IDs, like a Tweet ID or a user ID. These can be turned back into Twitter content using the statuses/show and users/lookup API methods, respectively. You may also share derivative data, such as the number of Tweets with a positive sentiment.
Thanks, Twitter API Policy
Consistent with this official message, I have bundled up the numerical IDs for all of the users and individual tweets contained in all of the archives I have. They can be downloaded from the links below. The files are in CSV format, and each row contains both a tweet ID (left column) and the ID of the user who posted it (right column). As the message above notes, these IDs can be used to retrieve the complete datasets directly from Twitter via its API. All of these archives are based on Twitter keyword searches (except for those containing hash marks, which are hashtag archives) and were amassed between January and March 2011, except for “bahrain” which spans nearly a year beginning at the end of March 2010. None of these archives should be considered exhaustive for the months they cover, as TwapperKeeper was limited in its tweet collection capacity both by its own hardware and by Twitter’s API query restrictions. With those caveats out of the way, here are the data:
- algeria (86,844 tweets)
- bahrain (362,128 tweets)
- egypt (2,364,133 tweets)
- #feb14 (Bahrain) (48,698 tweets)
- #feb17 (Libya) (907,962 tweets)
- #jan25 (Egypt) (671,417 tweets)
- libya (2,745,912 tweets)
- morocco (85,542 tweets)
- #sidibouzid (Tunisia) (79,166 tweets)
- yemen (479,456 tweets)
These data will be all but useless to anyone without at least a basic understanding of all of the following:
- APIs and how to retrieve data from them,
- a programming language like PHP or Python,
- and a relational database system such as MySQL.
And even with this knowledge, recreating the full data sets would still take months of 24/7 automated querying given Twitter’s API limits.
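As a rough illustration of the duplicate-removal step recommended in the update above, here is a minimal Python sketch. It assumes only what is described in this post: each row of a CSV file holds a tweet ID in the left column and a user ID in the right. The function name and any file names are hypothetical, and I have not assumed anything about whether the files carry a header row (a header, if present, would simply pass through as the first row).

```python
import csv


def dedupe_tweet_ids(in_path, out_path):
    """Copy rows from in_path to out_path, keeping only the first row
    seen for each tweet ID (the left-hand column).

    Returns a (kept, dropped) tuple of row counts.
    """
    seen = set()      # tweet IDs already written to the output
    kept = 0
    dropped = 0
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            tweet_id = row[0].strip()
            if tweet_id in seen:
                dropped += 1  # duplicate tweet ID: skip this row
                continue
            seen.add(tweet_id)
            writer.writerow(row)
            kept += 1
    return kept, dropped
```

Calling something like `dedupe_tweet_ids("egypt.csv", "egypt_dedup.csv")` (file names hypothetical) would write the deduplicated rows to the second file and report how many rows were kept and dropped. The IDs are compared as strings, so nothing is lost to numeric conversion, and the set of IDs for even the largest archive here should fit comfortably in memory on a typical machine.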
Like many Twitter researchers, I reacted with dismay when Twitter changed its TOS last year to sharply restrict data sharing. To this day I struggle to think of a valid reason for the change, especially since their APIs remain open to anyone with the skills to query them. Nevertheless I feel bound to respect Twitter’s TOS, not so much out of fear of the consequences (although recent machinations at the DOJ may soon change that), but because so much social research depends critically upon the assumption that researchers will act according to the wishes of their subjects. The principle is similar to source confidentiality in journalism: if it became common practice for reporters to publish “off-the-record” information, sources would stop talking to them after a while.
I have no idea what the prospects for getting Twitter to revert their TOS are since I don’t know why they changed it in the first place. However, if you would like to see this happen, you might consider leaving a comment to that effect on this blog post detailing why you think it’s important. If nothing else, such comments might convey a sense of some of the different ways Twitter’s API policy is hampering research, and may also start a conversation about possible workarounds or other ways of resolving the situation.
Just a brief announcement—shortly before New Year’s I earned my Ph.D. I have also updated this site’s front page to indicate that I am now an assistant professor at the School of Communication at American University in Washington, DC. Thanks for all your support.