dfreelon.org
Assistant professor, American University School of Communication

Co-citation map of 9 comm journals, 2003-2013
Thu, 05 Sep 2013

[View the map. You'll need an up-to-date browser with Javascript enabled.]

In an effort to better understand the theoretical landscape of my chosen academic field, I have created a co-citation network visualization based on bibliographies found in nine major communication journals. The nine journals chosen are some of the best-known and longest-running in the field:

  • Communication Research
  • Communication Theory
  • Critical Studies in Media Communication
  • Human Communication Research
  • Journal of Broadcasting & Electronic Media
  • Journal of Communication
  • Journal of Computer-Mediated Communication
  • Journalism & Mass Communication Quarterly
  • Political Communication

Co-citation is a well-established technique for mapping academic disciplines (among other applications). The basic idea is that two publications are considered linked or “co-cited” when they appear in the same article’s reference list. After a basic co-citation network is created, a community-detection algorithm can be run to generate an organic impression of a discipline’s major subtopics and authors. In this map, the co-citation communities identified by the algorithm are grouped together by color. I doubt the specific groupings will surprise any seasoned scholars, but they will certainly help beginners (like me) get a sense of what our colleagues in other divisions have been thinking about over the past decade.
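As a minimal sketch of the counting step behind such a map (the bibliographies below are hypothetical), co-citation links can be tallied like this:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count how often each pair of publications appears together
    on the same article's reference list."""
    pairs = Counter()
    for refs in reference_lists:
        # Sort so (a, b) and (b, a) count as the same link
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical bibliographies from three seed articles
bibs = [
    ["Putnam 2000", "Norris 2001", "Coleman 1988"],
    ["Putnam 2000", "Norris 2001"],
    ["Putnam 2000", "Entman 1993"],
]
counts = cocitation_counts(bibs)
```

The resulting pair counts become weighted edges; a community-detection algorithm then groups densely interlinked publications into the colored clusters described below.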

Maps similar to this one have been created for sociology and philosophy, and I credit those authors for giving me the idea to create this one. In doing so I relied heavily on Neal Caren’s excellent Python script for scraping citation data from Web of Knowledge (WoK). In the next section I give a guided tour of the map, after which I provide additional methodological details.

The map

One of the first things you’ll notice about the map is that publications are listed by first author only. This is how WoK stores references, but in most cases it shouldn’t be too hard to figure out which article or book is intended. Also, a few very popular articles probably have at least one duplicate node–I did not attempt to clean this dataset because I couldn’t figure out a non-manual way to do so.

Only highly-cited items appear on this map, a decision driven by both parsimony and technical limitations. To make the initial cut, a publication had to 1) have at least ten citations according to WoK and 2) be co-cited on at least five reference lists with another publication meeting the first criterion. In this way, a network of 80,880 unique cited publications* and 3,878,211 co-citation links drawn from 2,834 seed articles was whittled down to 1,124 pubs and 6,092 links.
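The two inclusion criteria can be expressed as a simple filter (a sketch with toy counts, not the actual processing code):

```python
def filter_network(cite_counts, cocite_counts, min_cites=10, min_cocites=5):
    """Apply the two inclusion criteria: publications need at least
    min_cites citations, and links need at least min_cocites
    co-citations between two qualifying publications."""
    eligible = {p for p, n in cite_counts.items() if n >= min_cites}
    links = {(a, b): n for (a, b), n in cocite_counts.items()
             if n >= min_cocites and a in eligible and b in eligible}
    # Keep only publications that retain at least one qualifying link
    nodes = {p for pair in links for p in pair}
    return nodes, links

# Toy counts: C fails the citation threshold, so only the A-B link survives
cites = {"A": 12, "B": 15, "C": 3}
cocites = {("A", "B"): 6, ("A", "C"): 9, ("B", "C"): 2}
nodes, links = filter_network(cites, cocites)
```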

If you mouse over a given publication you’ll see the others to which it is connected. A link between two publications means that the two are co-cited at least five times. Thicker links mean more co-citations. Intra-community links share the community’s color; inter-community links take on one of the two community’s colors at random. A publication’s node size reflects the number of bibliographies in which it appears.

The nine colored communities in this network represent the nine most densely-interlinked subtopics addressed in the journals. The community detection algorithm identified a total of 28 link clusters, so nine is an arbitrary number (I had to stop somewhere). These top nine represent about a third of the communities found, but this third contains 89.5% (1,006/1,124) of all the pubs that met the initial inclusion criteria.

Here I give each community a label and a short description, but I can’t claim expertise on all of them, so corrections and suggestions are welcome.

  • __ Interpersonal communication, offline and on. Unsurprisingly, this community was well-represented in JCMC. It incorporates pieces from both the digital age and long before, with Walther, Berger, and Knobloch being especially prominent. Classic works by Goffman, Spears, Altman & Taylor, and Parks & Floyd can also be seen.
  • __ Race and media. One of the smaller communities, this one builds on foundational work in both media studies and effects (e.g. Entman, Dixon, Gilliam, Valentino) and psychology (Fiske, Devine). Much of it focuses on pejorative perceptions of African Americans by whites.
  • __ Parasocial interaction/uses & gratifications. Drawing heavily on psychologists such as Bandura and Fishbein, this cluster examines how and why people consume media (especially popular media) as well as their relationships with the characters on the screen. (This is one of the ones I know less well, so let me know if there’s a better way to describe it.)
  • __ Selective exposure. From the foundational work of Festinger and Sears & Freedman in the 1950s and 60s to Sunstein’s Republic.com, this community focuses on how people select, reject, and justify media content and the consequences for their opinions, beliefs, and emotions.
  • __ Multimedia information processing/knowledge gap. This cluster is heavily anchored in the work of Lang, Grabe, Reeves, and Newhagen. Its objects of study are the influence of multimedia on cognition, specifically memory, emotion, and knowledge. The knowledge gap concept is also prominent here. (Again, I’m not an expert here, so please correct as appropriate!)
  • __ Civic engagement/political participation/deliberation/social capital. This cluster is concerned with the roles of media and communication in citizens’ engagement with politics and their communities. The second largest community by internal links, it incorporates leading research from sociology (Coleman, Wellman, Granovetter) and political science (Putnam, Norris, Huckfeldt) in addition to communication.
  • __ Psychology of communication/cultivation theory/statistical methods. This cluster shares a few links with the “visual images” and “parasocial interaction” clusters but is distinct from both. With Petty & Cacioppo’s classic book on the elaboration likelihood model as its primary anchor, this research investigates concepts such as information processing, emotion, persuasion, influence, and attitudes as they pertain to communication. Interestingly, major pieces on statistical analysis by Holbert & Stephenson, Bollen, and Baron & Kenny are also included here.
  • __ Third-person effect/hostile media effect. This community is home to the closely-related hostile media and third-person effects, both of which involve people’s beliefs about how media messages relate to others. Though its originator (Davison) was a scholar of journalism and sociology, later third-person effect research increasingly relies on concepts borrowed from psychology (e.g. Eveland, Nathanson, Detenber, & McLeod, 1999; Henriksen & Flora, 1999; Hoffner et al., 1999).
  • __ Agenda-setting/framing/priming. In a development that will surprise no one, the largest cluster by far is devoted to the study of three interrelated media effects: framing, priming, and agenda-setting. The major works and authors here will be known to nearly all students of mass communication: Iyengar, Entman, McCombs, Zaller, Gamson, Shoemaker, Bennett, Price, Scheufele, and many more…

There is much to say about these clusters–much more than I have time to articulate–so I’ll limit myself to an observation and a related caveat. First, critical theory is conspicuous in its absence from these clusters. Marx, Foucault, Adorno, Williams, Baudrillard, Butler, and other critical stalwarts are nowhere to be found among this list of landmark works. Among those critical theorists who do make the cut are Chomsky, Habermas, Hall, and Bourdieu, though I leave to the reader the exercise of finding them on the map.

One reason for the omission may be the use of the journal as the sampling unit. Much critical work is published in books, and while many books appear on the map, it is clear that journal articles largely tend to cite other journal articles. And in communication, the better-known journals tend to publish work that is quantitative, empirical, epistemologically social-scientific, and American in focus. So the major caveat for this map is that it almost certainly underrepresents work that is qualitative, purely theoretical, critical, and non-American. Unfortunately, there is no easy way to integrate books into it, and even if there were, there is no preexisting list of the most-cited books in communication.

Additional method notes

From each journal, all reference lists from all research articles (specifically excluding book reviews and similar) available in WoK between 2003 and 2013 were extracted on September 3, 2013. A few items from 2002 were included for some journals.

For those who are interested, here is a quick summary of how I created this map:

  • Downloaded full reference lists from WoK for all articles (excluding book reviews etc.) published in the above journals from 2003 through September 2013, in plain-text format
  • Used Neal Caren’s Python script to create a network edgelist based on the criteria above
  • Opened edgelist in Gephi and ran the “fast unfolding” community detection algorithm (Blondel, 2008) to identify network clusters
  • Rearranged graph layout to color and group together network communities
  • Exported final graph file in GEXF format
  • Created web visualization with GEXF.js
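To make the pipeline concrete, here is a sketch of the intermediate step that turns co-citation counts into a file Gephi's spreadsheet importer accepts (this is an illustration of the format, not Neal Caren's script):

```python
import csv
import io

def write_gephi_edgelist(links, f):
    """Write a weighted, undirected edge list in the CSV layout
    Gephi's spreadsheet importer accepts."""
    w = csv.writer(f)
    w.writerow(["Source", "Target", "Weight", "Type"])
    for (a, b), n in sorted(links.items()):
        w.writerow([a, b, n, "Undirected"])

# Demonstrate with a single hypothetical co-citation link
buf = io.StringIO()
write_gephi_edgelist({("Putnam 2000", "Norris 2001"): 7}, buf)
out = buf.getvalue()
```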

The raw data for the map (1,124 nodes/6,092 edges) can be downloaded here.

If you have any questions about how I made the map, I’d be happy to answer them. Also, if you have suggestions for additional journals to add, let me know and I may be able to do it–but GEXF.js is limited in the amount of network data it can display so there’s no guarantee.

*The true number is somewhat less than this, as some pubs are listed under different names due to incompatible citation practices and miscellaneous citation errors.

T2G 0.3: Visualize only RTs or mentions in Gephi
Wed, 10 Jul 2013

I just completed a new version of T2G for Python, which adds a few new features. Most prominent among these is the ability to extract only retweets or only mentions for visualization in Gephi. Recent research has shown substantive differences between networks based on these behaviors, so it is important for researchers to be able to distinguish between them. The new version also fixes a bug that halted processing whenever two @s appeared adjacent to one another in a tweet (i.e. “@@”).

Download T2G 0.3 for Python (7/09/13)

T2G 0.3 features four extraction modes, each of which yields different sets of network edges:

  1. Extracts all edges, does not differentiate between retweets and mentions, and includes singletons (users who are mentioned by no one and mention no one)
  2. Extracts all edges, does not differentiate between retweets and mentions, and excludes singletons
  3. Extracts retweets only and excludes singletons
  4. Extracts mentions only (i.e. non-retweets) and excludes singletons

To try the different modes, change the value of the variable ‘extmode’ on line 16 to a number from 1 to 4 corresponding to the numbered modes above. Easy!
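The mode logic can be sketched roughly like this (an illustrative reimplementation, not T2G's actual code; singleton filtering for modes 2-4 happens at the dataset level and is omitted here):

```python
import re

# '@+' tolerates the adjacent-@ case ("@@user") that broke earlier versions
MENTION = re.compile(r"@+(\w+)")

def extract_edges(author, tweet, extmode=1):
    """Return (author, mentioned user) edges for one tweet,
    filtered according to the extraction mode."""
    is_rt = tweet.lstrip().lower().startswith("rt @")
    if extmode == 3 and not is_rt:
        return []  # retweets only
    if extmode == 4 and is_rt:
        return []  # mentions (non-retweets) only
    return [(author, m) for m in MENTION.findall(tweet)]
```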

T2G: Convert (all) Twitter mentions to Gephi format
Wed, 15 May 2013

EDIT 07/10/13: A new version of T2G for Python has been posted here–it has options to extract only retweets or only mentions, among other new features.

A few weeks ago I posted a spreadsheet that converted tweet mention data into Gephi format for social network analysis. A key limitation of that spreadsheet is that it only converts the first name mentioned in each tweet, discarding the rest. For example, given a tweet of mine mentioning @alexhanna, @cfwells, and @kbculver in that order, that spreadsheet would pull @alexhanna into the Gephi file as one of my mentions but not @cfwells or @kbculver.

To remedy this issue, I’ve created T2G, a solution that converts all Twitter mention data fed into it to Gephi format. T2G comes in two flavors, Python and PHP, each of which does the same thing. The PHP edition is more user-friendly, while the Python edition is faster and easier to set up. All you need to do is supply a CSV file containing two columns: the first (leftmost) filled with the tweet authors’ usernames, and the second filled with their corresponding tweets. You’ll find additional instructions in an extended comment at the top of each script. Please ensure that you have the appropriate interpreter installed (PHP or Python) before trying to use either of these scripts.
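The core of the conversion can be sketched in a few lines (the sample tweet text below is invented for illustration; the column layout matches the CSV format just described):

```python
import csv
import io
import re

def all_mentions(csv_text):
    """Read a two-column CSV (author username, tweet text) and emit
    one (author, mentioned user) edge per @mention -- all of them,
    not just the first."""
    edges = []
    for author, tweet in csv.reader(io.StringIO(csv_text)):
        edges += [(author, m) for m in re.findall(r"@(\w+)", tweet)]
    return edges

# Hypothetical input row
sample = 'dfreelon,"On a panel with @alexhanna @cfwells @kbculver"\n'
```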

Both of these scripts produce equivalent output, albeit in a slightly different order (you can rank the data in alphabetical order to check if you like).

You can test a “lite” version of the PHP edition below–it will convert only the first 100 tweets in your file. Feel free to test it using this sample file, which contains some of my own recent tweets formatted to the above specifications.

T2G Lite, PHP Edition 0.1

Twitter geolocation and its limitations
Sun, 12 May 2013

A couple of recent articles have gotten me thinking about methods for geolocating Twitter users: Kalev Leetaru et al.’s recent double-sized piece in First Monday explaining how to identify the locations of users in the absence of GPS data; and the Floating Sheep collective’s new Twitter “hate map,” which has received a fair amount of media attention. The ability to know where social media users are located is pretty valuable: among other things, it promises to help us understand the role of geography in predicting or explaining different outcomes of interest. But we need to adjust our enthusiasm about these methods to fit their limitations. They have great potential, but (like most research methods) they’re not as complete as we might like them to be.

Let’s start with the gold standard: latitude/longitude coordinates. When Twitter users grant the service access to their GPS devices and/or cellular location info, their current latitude and longitude coordinates are beamed out as metadata attached to every tweet. Because these data are generated automatically via very reliable hardware and software, we can be reasonably certain of their accuracy. But according to Leetaru et al., only about 1.6 percent of users have this functionality turned on. Due to privacy concerns, Twitter offers it on an opt-in basis, which partly explains the low level of uptake.
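Reading these coordinates out of a tweet's JSON is straightforward; the sketch below assumes the GeoJSON-style "coordinates" field Twitter attaches to opted-in tweets, which stores longitude before latitude:

```python
import json

def tweet_latlong(raw_tweet):
    """Return (lat, lon) from a tweet's GeoJSON 'coordinates' field,
    or None when the user hasn't opted in. Note the field stores
    [longitude, latitude], in that order."""
    tweet = json.loads(raw_tweet)
    point = tweet.get("coordinates")
    if not point:
        return None
    lon, lat = point["coordinates"]
    return lat, lon

# Minimal example payloads (most fields omitted)
geotagged = '{"text": "hi", "coordinates": {"type": "Point", "coordinates": [-77.03, 38.89]}}'
untagged = '{"text": "hi", "coordinates": null}'
```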

For the social science researcher, relying on lat/long for geolocation in Twitter raises a major sampling bias issue: what if users who have this feature turned on differ systematically in certain ways from those who don’t? Here are a few plausible albeit untested (as far as I know) characteristics that geolocatable social media users may be more likely to exhibit:

  • High levels of formal education
  • Maleness
  • General adeptness/comfort with digital technologies
  • Living in a politically stable country
  • Extroversion
  • An ideological commitment to “publicness”

I’m sure you could come up with more. The point is, we cannot assume that, for example, a map containing geotagged racist/homophobic/ableist tweets faithfully represents the broader Twitter hate community. If all hate-tweets came geotagged, the map might look very different, especially since, as things stand now, many haters are smart and motivated enough to render their hate less visible.

Fortunately, lat/long coordinates are not our only options for trying to figure out where tweeps are. Leetaru et al. helpfully offer a series of methods for doing so in cases where this information is absent (other stabs at this include Cheng et al., 2010 and Hecht et al., 2011). Leetaru et al.’s most effective methods focus on the freetext “Bio” and “Profile” fields, which when combined increase the share of correctly IDed locations at the city level to 34%. This represents an increase of more than an order of magnitude over what lat/long alone allow, and it’s a very cool research finding in its own right. However, the sampling bias problem applies with nearly equal force to this enhanced data: the strong possibility still exists that the nearly two-thirds of unlocatable tweeps differ in critical ways from those whose locations can be identified.

So, what to do? Ideally, to be able to generalize effectively, we want to be able to say that the individual-level characteristics and overall geographic distribution of our geoidentified users resemble those of a representative sample of all users within our sampling frame. But this is a very tall order methodologically, and even if we could accomplish it, the results would likely disappoint us.

Our options at this point depend upon the required level of location granularity: the coarser this is, the better we’ll be able to do. If, for example, we only need country-level data for a fairly small N of countries, we can take advantage of the fact that it is easier to identify a user’s country than her city. One strategy here would be to start with string-matching methods like those used by Leetaru et al. and Hecht et al., which attempt to identify locations listed in various Twitter fields using dictionaries of place names. Next, for users whose locations can’t be thusly identified, a less definitive machine-learning method could be substituted to guess locations based on tweet text. This second method has the notable disadvantage of forcing a location guess for each user, introducing the issue of misidentification, whereas the first simply leaves unlabeled all users that don’t yield conclusive dictionary matches. (It is also more computationally intensive due to the higher volume of data required and the complexity of most machine-learning algorithms.) Nevertheless, Hecht et al. achieve accuracy rates between 73% and 89% using this method at the country level (depending on how the data are sampled), suggesting that it could help researchers address the sampling bias issue in some scenarios. It would probably suffice to identify relatively small randomly-selected subsamples for each country of interest using machine learning, compare them to those IDed via string-matching, and search for major differences between each pair of groups.
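The dictionary-based first pass might look something like this (a sketch with a toy gazetteer; real place-name dictionaries of the kind Leetaru et al. and Hecht et al. use hold many thousands of entries):

```python
def match_country(location_field, gazetteer):
    """Dictionary-based string matching on a freetext location field:
    return a country only on an unambiguous match, otherwise None,
    leaving the user for a machine-learning fallback."""
    text = location_field.lower()
    hits = {country for name, country in gazetteer.items() if name in text}
    return hits.pop() if len(hits) == 1 else None

# Toy gazetteer mapping place names to countries
gaz = {"cairo": "Egypt", "alexandria": "Egypt", "tunis": "Tunisia"}
```

Note the conservative design: ambiguous fields ("Cairo or Tunis") and unmatchable ones ("the moon") both return None rather than a forced guess, mirroring the tradeoff discussed above.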

The prospects for determining the representativeness of geolocated users at more specific locations than their countries are much slimmer. The accuracy of Hecht et al.’s machine-learning geolocation technique drops from 73-89% at the country level to 27-30% at the US state level. Extending this logic, it’s probably safe to assume an inverse correlation between algorithm accuracy and location specificity when 1) location info is sparse in the data (as with tweets) and 2) the set of possible locations is very large or unbounded. Under these conditions I can’t think of how one might go about measuring how representative the known locations are of the unknowns. (If you have any ideas, leave a comment!) At that point you might simply have to grant that you can’t say much about how representative your sample is, and justify your study’s contributions on other grounds.

Spreadsheet converts tweets for social network analysis in Gephi
Fri, 26 Apr 2013

EDIT 05/15/13: I’ve posted two scripts, one in PHP and one in Python, that overcome the main limitation of this spreadsheet–they pull in all mentioned names rather than just the first one. Download one or both here.

If you’ve ever wanted to visualize Twitter networks but weren’t sure how to get the tweets into the right format, this spreadsheet I’ve been using in my classes might be worth a try. It prepares Twitter data for importing into Gephi, an open-source network visualization platform. It requires a little cutting and pasting, but once you get the hang of it you’ll be visualizing social network data in no time. Here’s the link:


Download the file and open it locally in Excel or OpenOffice to add your own data (right now it uses some of my recent tweets as example data). Prep your data in 4 steps:

  1. Add the username(s) of your tweet author(s) to column A of the “code lives here” worksheet.
  2. Add your author(s)’ tweets to column B.
  3. Copy columns C through H as far down as your tweets go.
  4. Export the “output lives here” worksheet as a CSV and open it in Gephi (you may need to copy the formulae in columns A and B as far down as your data go).

Here is a network graph of the example data. Each tie represents the first person I mentioned in one of my past 200 tweets as of today.


If you need tweet data to feed the spreadsheet, TAGS is a free and fairly easy way to pull Twitter data into Google Spreadsheets.

What resonated with Obama’s and Romney’s Facebook followers?
Mon, 05 Nov 2012

As the nation waits to find out who our next president will be, I thought it would be interesting to take a quick look at how Obama’s and Romney’s Facebook followers reacted to content posted to the candidates’ official Facebook walls. As part of a larger research project, I’m extracting all public comments posted to both walls between April 25 (the day the RNC endorsed Romney) and November 2, 2012. While doing so, I noticed some clear patterns in the kinds of content each group of followers showed most interest in. By charting the numbers of likes, shares, and comments for each message during the aforementioned time period, we can get a sense of when attention spiked and how much. Examining the top five most liked, shared, and commented-on posts reveals what topics attracted the most Facebook attention during the final leg of the campaign. Let’s start with Obama:

Likes, shares, and comments on Obama’s official Facebook wall, 4/25/12 – 11/02/12

As you’ve probably already noticed, I’ve included the associated text for the top five most-liked posts in the dataset and the images associated with the first four in chronological order. (I didn’t have room for the fifth image, but the text speaks for itself best among the five.) The first thing that jumped out at me here was how none of the top five most-liked posts had anything to do with politics–they were scenes from the Obamas’ family life, the kinds of moments that could be found in any American family photo album. The wholesome sentiments these shots convey couldn’t be farther from the knock-down drag-out negativity flooding the airwaves and the Internet throughout the timeframe, which may explain why they were so popular among Obama fans.

Romney’s fans, however, show a very different pattern:

Likes, shares, and comments on Romney’s official Facebook wall, 4/25/12 – 11/02/12

All of Romney’s top five most-liked posts were direct calls to push their “like” count over some numerical threshold. Romney’s fans seem to be more goal-oriented than Obama’s: rather than reveling in idyllic family scenes, they were most interested in showing off their support for Romney to their Facebook friends. One broader interpretation here is that Romney’s Facebook fans were more engaged in the campaign than Obama’s, who seemed less inclined to get political. This is also reflected in the fact that although Obama had much higher median numbers of likes (111,231 vs. 64,182), shares (11,753 vs. 3,644), and comments (7,309 vs. 4,376) than Romney during this period, Romney had much higher “like” peaks. (Romney posted over twice as many messages as Obama, so his “like” totals are higher: 58.5M to 42.7M.)

Likes vs. shares vs. comments

How did the most-liked messages stack up against the most-shared and most-commented-on messages? Let’s have a look, starting with Obama:

Top five most-liked, most-shared, and most-commented posts (Obama):

  1. Liked: “20 years ago today.” (674,164) | Shared: “If you’re on Team Obama, let him know.” (170,046) | Commented: “It’s Barack’s birthday today–wish him a happy one by signing his card! http://OFA.BO/HHfZev” (75,876)
  2. Liked: “The most important meeting of the day.” (657,501) | Shared: “Share if you agree: President Obama won the final debate because his leadership has made America stronger, safer, and more secure than we were four years ago.” (95,212) | Commented: “President Obama believes everybody deserves a fair shot–not just some: http://OFA.BO/8L6BWy” (53,537)
  3. Liked: “Summer.” (615,734) | Shared: “Add your name, then pass it on: http://OFA.BO/8UHQcn” (91,234) | Commented: “Add your name, then pass it on: http://OFA.BO/8UHQcn” (45,674)
  4. Liked: “Being married to Michelle, and having these tall, beautiful, strong-willed girls in my house, never allows me to underestimate women.” –President Obama (573,270) | Shared: http://OFA.BO/HxgaZz [link goes to a signup page for the OFA mailing list] (85,407) | Commented: [photo of Obama captioned "Same-sex couples should be able to get married."] (41,685)
  5. Liked: “Michelle’s biggest fans watching her convention speech from home last night.” (554,713) | Shared: “Share this with your friends and family if you support this plan to keep us moving forward.” (63,551) | Commented: “Share if you agree: President Obama won the final debate because his leadership has made America stronger, safer, and more secure than we were four years ago.” (39,469)

My quick read on this table is that Obama supporters use shares and comments much more politically than they use likes. The top five most-shared messages are all about general support for Obama as opposed to specific policy issues. Comments are a mixed bag, with the top spot going to a call for birthday wishes and the fourth spot containing the sole policy statement in the entire table. The remaining most-commented posts are similar in nature to the most-shared.

Romney’s most-shared and -commented posts are both similar to and different from Obama’s:

Top five most-liked, most-shared, and most-commented posts (Romney):

  1. Liked: “We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt!” (1,167,589) | Shared: “We don’t belong to government, the government belongs to us.” (93,329) | Commented: “We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt!” (105,839)
  2. Liked: “We’re almost there – Help us get to 10 million Likes!” (1,112,300) | Shared: “It’s the people of America that make it the unique nation that it is. ‘Like’ if you agree that entrepreneurs, not government, create successful businesses.” (70,373) | Commented: “The American people know we’re on the wrong track, but how will President Obama get us on the right track? http://mi.tt/S8WQWZ” (66,622)
  3. Liked: “Stand with Mitt. ‘Like’ and share to help us get to 6 million Likes!” (986,653) | Shared: “We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt!” (62,905) | Commented: “Like and share to help us get to 8 million likes!” (62,691)
  4. Liked: “Help us get to 7 million likes! ‘Like’ and share to show you’re with Mitt.” (719,837) | Shared: “Like and share to help us get to 8 million likes!” (46,524) | Commented: “The path we’re taking is not working. It is time for a new path. Donate today and help us get America back on track http://mi.tt/QZkDpL” (52,059)
  5. Liked: “Like and share to help us get to 8 million likes!” (614,492) | Shared: “I intend to lead and to have an America that’s strong and helps lead the world. ‘Like’ and share if you will stand with me.” (39,652) | Commented: “We don’t have to settle. America needs a new path to a real recovery. Contribute $15 and help us deliver it. http://mi.tt/Tacap0” (47,732)

Romney’s most-shared messages are similar to Obama’s in their lack of specificity. Unlike with Obama, there is some overlap between the three modes of interaction–at least one “help us get to X million likes” post shows up on each list. The most interesting thing about the most-commented posts is that three of the five are pretty clear attacks on Obama, while I see a couple of Obama’s most-commented as indirect attacks at best. The idea that “everybody deserves a fair shot, not just some” could be a shot at Romney’s supposed elitism, but the claim that Obama’s “leadership has made America stronger, safer, and more secure than we were four years ago” is more about Obama than about Romney.

Closing thoughts

I think these data show some definite patterns in the types of engagement the Romney and Obama Facebook pages elicited. One important point about these data I want to stress is that they say much more about each campaign’s supporters than they do about the candidates. For example, Obama asked his supporters to like and share content, and Romney talked about his family, but those posts didn’t resonate as much with their followers. I also find the contrast with Twitter quite instructive–many studies, including my own research, have found Twitter activity to be highly event-driven, spiking when big stories break. Activity on the candidates’ walls looks to be much less so–few of the top five messages reference time-specific events, and few were posted on milestone days for either campaign. So it looks to me like the campaigns have a much greater capacity to drive attention with particular types of content on Facebook than on Twitter, which functions as more of a real-time information distribution network.

If anything in particular jumps out at you in this data or you disagree with any of my interpretations, I’d love to hear about it in comments.

Arab Spring Twitter data now available (sort of)
Sat, 11 Feb 2012

Update 2/21/2012: As my colleague Alex Hanna recently informed me, up to 2% of the archives below may consist of duplicate tweet IDs. If you intend to work with this data, I highly recommend removing all the duplicates first.
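Removing those duplicates can be a one-pass job (a sketch assuming the two-column tweet-ID/user-ID CSV format described below; the IDs here are made up):

```python
import csv
import io

def drop_duplicates(csv_text):
    """Keep only the first row for each tweet ID in a two-column
    (tweet ID, user ID) archive."""
    seen, rows = set(), []
    for tweet_id, user_id in csv.reader(io.StringIO(csv_text)):
        if tweet_id not in seen:
            seen.add(tweet_id)
            rows.append((tweet_id, user_id))
    return rows

# Made-up archive: the third row duplicates the first
raw = "111,9001\n222,9002\n111,9001\n"
```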

Last May I posted some very basic descriptive statistics and charts on the usage of Arab Spring-related Twitter hashtags to this blog. Some of these findings were later included in a report titled “Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?” published by the Project on Information Technology and Political Islam of the University of Washington. The data all came from the free online service formerly known as TwapperKeeper (now a part of HootSuite), which prior to March 20, 2011 allowed users to publicly initiate tweet archiving jobs and download the archived content at their convenience. Unfortunately for Internet researchers, Twitter decided on that date to forbid the public distribution of Twitter content, significantly diminishing the utility of TwapperKeeper and other services like it.

Since the release of the PITPI report in September 2011, several scholars have expressed interest in obtaining the Twitter data for their own research. I refused these requests categorically on the grounds that they would violate Twitter’s terms of service. However, one of these scholars asked Twitter directly about what data is allowed to be shared and what is not. He discovered that distributing numerical Twitter object IDs is not a violation of the TOS. I quote in full the message Twitter sent him below:


Under our API Terms of Service (https://dev.twitter.com/terms/api-terms), you may not resyndicate or share Twitter content, including datasets of Tweet text and follow relationships. You may, however, share datasets of Twitter object IDs, like a Tweet ID or a user ID. These can be turned back into Twitter content using the statuses/show and users/lookup API methods, respectively. You may also share derivative data, such as the number of Tweets with a positive sentiment.

Thanks, Twitter API Policy

Consistent with this official message, I have bundled up the numerical IDs for all of the users and individual tweets contained in all of the archives I have. They can be downloaded from the links below. The files are in CSV format, and each row contains both a tweet ID (left column) and the ID of the user who posted it (right column). As the message above notes, these IDs can be used to retrieve the complete datasets directly from Twitter via its API. All of these archives are based on Twitter keyword searches (except for those containing hash marks, which are hashtag archives) and were amassed between January and March 2011, except for “bahrain” which spans nearly a year beginning at the end of March 2010. None of these archives should be considered exhaustive for the months they cover, as TwapperKeeper was limited in its tweet collection capacity both by its own hardware and by Twitter’s API query restrictions. With those caveats out of the way, here are the data:
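To make the retrieval process concrete, here is a minimal Python sketch of how one of these ID files might be parsed and batched for hydration. The sample IDs below are made up, and the actual users/lookup request (which needs authentication and rate-limit handling) is only indicated in comments:

```python
import csv
import io

def read_id_pairs(csv_text):
    """Parse (tweet_id, user_id) rows from one of the archive CSVs."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(row[0], row[1]) for row in reader if row]

def batch(ids, size=100):
    """Chunk an ID list; users/lookup accepts up to 100 IDs per request."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# Two made-up archive rows: tweet ID in the left column, user ID in the right
sample = "29700859247513600,12345678\n29700859247513601,87654321\n"
pairs = read_id_pairs(sample)
user_ids = sorted({u for _, u in pairs})

# Each chunk would then be passed to users/lookup as a comma-separated
# user_id parameter (authentication and rate-limit handling omitted):
queries = [",".join(chunk) for chunk in batch(user_ids)]
```

Individual tweets would be fetched analogously through statuses/show, which takes one tweet ID at a time.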

These data will be all but useless to anyone without at least a basic understanding of all of the following:

And even with this knowledge, recreating the full data sets would still take months of 24/7 automated querying given Twitter’s API limits.
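A back-of-the-envelope calculation illustrates why. The rate limit here is an assumption on my part rather than an official figure, and the tweet total is the one reported elsewhere on this blog:

```python
# Rough check on the "months of querying" claim. Assumes one tweet per
# statuses/show call at roughly 350 authenticated requests per hour
# (a hypothetical figure, not Twitter's published limit).
total_tweets = 5_888_641      # tweet total reported elsewhere on this blog
requests_per_hour = 350       # assumed rate limit
days = total_tweets / requests_per_hour / 24
print(round(days))            # on the order of 700 days of nonstop querying
```

Even if the true limit were several times more generous, full recreation would still be a matter of months.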

Like many Twitter researchers, I reacted with dismay when Twitter changed its TOS last year to sharply restrict data sharing. To this day I struggle to think of a valid reason for the change, especially since its APIs remain open to anyone with the skills to query them. Nevertheless I feel bound to respect Twitter’s TOS, not so much out of fear of the consequences (although recent machinations at the DOJ may soon change that), but because so much social research depends critically upon the assumption that researchers will act according to the wishes of their subjects. The principle is similar to source confidentiality in journalism: if it became common practice for reporters to publish “off-the-record” information, sources would stop talking to them after a while.

I have no idea what the prospects are for getting Twitter to revert its TOS, since I don’t know why it was changed in the first place. However, if you would like to see this happen, you might consider leaving a comment to that effect on this blog post detailing why you think it’s important. If nothing else, such comments might convey some of the different ways Twitter’s API policy is hampering research, and might also start a conversation about possible workarounds or other ways of resolving the situation.

I’m a new Ph.D.
http://dfreelon.org/2012/01/08/im-a-new-phd/
Sun, 08 Jan 2012 18:28:32 +0000

Just a brief announcement: shortly before New Year’s I earned my Ph.D. I have also updated this site’s front page to indicate that I am now an assistant professor at the School of Communication at American University in Washington, DC. Thanks for all your support.

The MENA protests on Twitter: Some empirical data
http://dfreelon.org/2011/05/19/the-mena-protests-on-twitter-some-empirical-data/
Thu, 19 May 2011 18:20:38 +0000

If you’ve been following the online commentary about the ongoing protests in the Middle East and North Africa (MENA), you know there’s been plenty of speculation about how digital communication technologies have aided, hindered, or failed to influence events on the ground. Opining without systematic evidence is all well and good; indeed, when done well it yields testable predictions about real-world outcomes. But at some point actual data must be brought to bear on these questions. One question I have found both interesting and testable is: to what extent are social media used by individuals in Arab countries experiencing political unrest? A corollary question is: to what extent do social media serve as a conversation platform for a broader Arabic-speaking online public during times of widespread unrest?

To begin to address these questions, I focus in this post on Twitter, both because its ostensible revolutionary power has been widely discussed and because data from it is fairly easy to collect and manipulate. Country-specific hashtags such as #egypt conveniently collect relevant tweets, and until recently it was possible to create and save public hashtag archives using free tools like TwapperKeeper. Unfortunately, on March 20, 2011 Twitter changed its terms of service to disallow public sharing of tweet archives. So, shortly before the change went into effect, I exported archives of several MENA-related hashtags from TwapperKeeper for analysis. The subset of the data presented in this post totals over 5 million tweets, with each entry including the author’s username, the full text of the tweet, the date and time posted, and other metadata. They do not, however, include the user’s location field, which I had to collect separately based on lists of unique users posting to each hashtag. Combining the chronologically-ordered hashtag dataset with the location data allows me to plot in time series the number of tweets in each hashtag whose authors claimed to be in the country in question. A little additional filtering helps me capture the extent to which each hashtag was used by individuals located in other Arabic-speaking countries.

But we’ll get to that in a second. First, let’s have a look at total tweet counts over time for the TwapperKeeper archives of seven major MENA hashtags: #egypt, #libya, #sidibouzid (Tunisia), #feb14 (Bahrain), #morocco, #yemen, and #algeria. Each data line begins on the date of the archive’s earliest tweet. The total N of tweets represented in this chart is 5,888,641.
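As a sketch of how a chart like this is assembled, the snippet below buckets tweet timestamps by calendar day to produce one hashtag's data line; the timestamps are hypothetical stand-ins, and the plotting step (matplotlib assumed) is only indicated in comments:

```python
from collections import Counter
from datetime import datetime

def daily_counts(timestamps):
    """Bucket tweet timestamps by calendar day for a per-hashtag series."""
    return Counter(ts.date() for ts in timestamps)

# A few hypothetical timestamps standing in for one hashtag archive
tweets = [
    datetime(2011, 2, 11, 9, 30),
    datetime(2011, 2, 11, 16, 5),
    datetime(2011, 2, 12, 8, 0),
]
series = daily_counts(tweets)

# The resulting series could then be plotted per hashtag, e.g.:
# import matplotlib.pyplot as plt
# days = sorted(series)
# plt.plot(days, [series[d] for d in days], label="#egypt")
```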


A couple of things jump out at me looking at this plot. First, Libya and Egypt clearly grabbed the lion’s share of the attention, attracting several hundred thousand more tweets on their respective peak days than the next most popular hashtag. Both peaks were pegged to significant events on the ground: Mubarak’s resignation in Egypt’s case, and the taking of Benghazi and a major speech by Saif al-Islam Gadhafi in Libya’s. The other hashtags register far less overall activity in comparison. One hypothesis is that tweet volume for each country may be driven by the amount of news coverage devoted to events there by major outlets (CNN, the NYT, al-Jazeera, etc.), but more data would be needed to verify that.

The next logical question here concerns where these authors are located. Are they primarily residents of the countries in which the events are unfolding, concerned observers from culturally and physically neighboring states, or international spectators (perhaps including diasporic populations) commenting from afar? Answering this question entailed creating an automated word filter that placed each user-provided location into one of four categories: 1) in the hashtag country; 2) in the greater Arabic region (defined as the following countries: Algeria, Bahrain, Djibouti, Egypt, Iran, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Palestinian Territories, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, United Arab Emirates, and Yemen)[1]; 3) outside of both the hashtag country and the Arabic region; and 4) no given location. The filter counted both instances of the country name and cities in each country. (If anyone is interested in specific details on what the country filters included, let me know and I’ll write up another post on it.) The total N of tweets analyzed for all six countries profiled below is 3,142,621 (#libya is not ready yet for reasons I explain later), some of which overlap due to the presence of multiple hashtags.
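A heavily simplified sketch of such a filter might look like the following. The region list comes from this post; the city-to-country mapping is a tiny hypothetical placeholder, since the real filters included many more cities:

```python
ARAB_REGION = {
    "algeria", "bahrain", "djibouti", "egypt", "iran", "iraq", "jordan",
    "kuwait", "lebanon", "libya", "mauritania", "morocco", "oman",
    "palestinian territories", "qatar", "saudi arabia", "somalia", "sudan",
    "syria", "tunisia", "united arab emirates", "yemen",
}
CITIES = {  # illustrative placeholder; the actual filters were far larger
    "cairo": "egypt", "alexandria": "egypt", "tunis": "tunisia",
    "sanaa": "yemen", "manama": "bahrain",
}

def classify_location(location, hashtag_country):
    """Assign a user-supplied location string to one of the four categories."""
    loc = (location or "").strip().lower()
    if not loc:
        return "none"                      # category 4: no given location
    # Resolve a known city to its country first
    country = next((c for city, c in CITIES.items() if city in loc), None)
    if country is None:
        # Naive substring match on country names; a real filter would need
        # to guard against collisions (e.g. "oman" inside "romania")
        country = next((name for name in ARAB_REGION if name in loc), None)
    if country == hashtag_country:
        return "local"                     # category 1: in the hashtag country
    if country in ARAB_REGION:
        return "regional"                  # category 2: in the Arabic region
    return "international"                 # category 3: everywhere else
```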

Between 25% and 40% of unique names in each hashtag lacked any location information. These include users who left the field blank, deleted their own accounts, or had their accounts suspended. Being essentially unclassifiable data, tweets by such users are excluded from the following charts.

First we’ll have a look at #egypt:

Here we can see a pattern that will recur throughout most of the hashtags: the major spikes are driven by individuals from outside of both the country and the broader Arabic region (who are almost certainly responding to media reports). It is only when outside attention dies down that local and regional voices even begin to achieve parity with their international peers. (Note that the conspicuous gaps between 1/27 and 1/29 and between 2/4 and 2/8 are due to TwapperKeeper archival overload. Some of the other hashtag archives feature similar gaps. The missing data are frustrating, but what is present is valuable nevertheless.)

Next is #sidibouzid (Tunisia):

A pattern similar to Egypt prevails here, wherein outsiders usually dominate when the total N of tweets tops about 1,000.


Next is #yemen:

This archive again displays the by-now familiar pattern, with the major difference being that regional tweets often exceed local tweets. This is most likely due to Yemen’s low internet penetration (1.8%).

#algeria and #morocco are very similar, so I’ll present them next:

The final hashtag archive in this post is for #feb14 (Bahrain), which unfortunately is rather incomplete. But once again, outsiders outnumber locals and regionals at higher total Ns, while locals take over at lower Ns.

I am still working on creating a comparable chart for the #libya archive, but it is difficult to apply the country filters to such a large N of unique users. A preliminary analysis of its first five days (2/16-2/20) that I presented at the Theorizing the Web conference last month showed that as the total N of tweets increased, the proportion of tweets from Libya decreased. With a net penetration rate of only 5.5%, it would not be surprising to discover that the entire hashtag followed the established pattern.

What does it all mean?

The evidence from the hashtags analyzed here indicates that, at least in the early days of the Arab Spring, Twitter served primarily as a platform for communication by international observers about the events. There is also limited evidence of a pan-Arabic public conversation within these hashtags, but this is not their primary purpose. Both phenomena are definitely episodic and appear strongly event-driven. As in the Iranian protests of 2009, Twitter seems to fall into Aday et al.’s (2010) “external attention” category of new media roles.

Of course, this doesn’t necessarily mean that Twitter use is politically inconsequential. Attentive global citizens and diasporic populations could, for example, use it to promote action opportunities to sympathetic followers. They may also retweet content from local users liberally, thus amplifying the latter’s voices beyond what the above charts imply. For that matter, local users themselves may find these hashtags useful for sharing and verifying local news at times when they are not swamped by outsiders. Answering questions like these will require textual analysis, and it is unlikely that automated methods will suffice (except for the RT question). I’m envisioning lots of content analysis, translation from Arabic and French, and input from subject matter experts in my future…

A few caveats about these data are in order. First, they do not include all tweets posted to the hashtags for the given time periods. TwapperKeeper functions by drawing samples from the Twitter search API, so there is no way to know exactly how many tweets were posted without access to the definitive Twitter-hosted databases. Second, like any continuously running software program, TwapperKeeper can fail, as can be seen in the chart gaps above. The reason I chose to analyze #feb14 and not #bahrain is that the TK archive for the latter contains a two-week gap that includes the “official” start date of the protest, February 14. Third, it is possible that other hashtags not analyzed here served different functions. Some MENA hashtags have Arabic titles, and it seems unlikely that these would fall under the external attention banner. I have archives of some of these for Egypt and am interested in collaborating with an Arabic-speaking expert to interpret them. Fourth, other social media services such as Facebook may serve different protest-related functions, depending on a country’s level of net penetration and service diffusion.

Then there is the question of whether the authors’ stated locations are accurate. Critics of my method will probably hasten to point out that many Twitter users changed their locations to “Iran” during the 2009 protests in that country. If this phenomenon occurred to any great extent during the Arab Spring protests, it would significantly reduce the value of the current research enterprise. However, there are several reasons I doubt this to be the case. For one, to my knowledge there was no high-profile campaign to convince Twitter users to change their locations to any given Arab Spring country. With simultaneous protests ongoing in multiple countries, such a campaign (if it existed) would either have had to target one country or spread itself among several, either way diluting its overall impact. Also, my country filters included many city names, which outsiders would be unlikely to know offhand. Finally, if large numbers of international users had changed their locations to the protest countries, the filters probably would have identified far more users as local than they did. The fact that comparatively few users self-identified as local strongly suggests that the Iran strategy was not widespread in this case.

I am interested in your questions and suggestions about my methods and interpretations, so please let me know what you think in comments. If there are other analyses you’d like to see, I might be able to pull them together.


[1] Credit for this list of countries goes to Phil Howard, who recently published a book on digital communication technologies and politics in the Islamic world.

Causality, politics, and the net
http://dfreelon.org/2011/04/23/causality-politics-and-the-net/
Sat, 23 Apr 2011 20:16:35 +0000

Henry Farrell recently declared himself against studying the internet, and while that headline oversells his argument a bit, compelling turns of phrase are a large part of what gets good online conversations started. His basic thesis is twofold: we should not study “the internet” as a system isolated from the rest of society, and we should trade analyses of specific online platforms (Facebook, Twitter, YouTube, etc.) for analyses of abstract causal mechanisms that contribute to various sociopolitical outcomes, some of which may flourish on those platforms but are almost certainly not limited to them. This perspective is more or less a direct application of one of the most fundamental normative stances of mainstream political science (among other branches of social science), namely that causality is the gold standard of social research. (This position is not universal, as the existence of antipositivism attests.) I agree with Henry that causality is wonderful if you can demonstrate it, but I think we need to get a bit more specific about exactly what we’re talking about before we venture too far.

Explaining my reservations will require shedding a bit of light on three interrelated questions. The first of these is: what do we mean by “causality” in this context? Secondly, what factors can and cannot be causes of political outcomes? Finally, what are the prospects of causal analysis of ICT-augmented politics?

The term “causality” carries multiple definitions in different contexts. For the purposes of this blog post, I intend the nomothetic and probabilistic sense of the term that is used widely throughout the social sciences. “Nomothetic” simply means covering a wide variety of cases, as opposed to “idiographic” causes which only apply to a single case. Babbie (2008) lists three widely-used criteria for nomothetic causality: correlation, time precedence, and nonspuriousness. (Alternative criteria for nomothetic causality are offered by Brady [2008] and Rubin [1980].) Correlation and time precedence should be self-explanatory for anyone with a passing familiarity with social science, and nonspuriousness simply means having eliminated most major alternative explanations and potential hidden variables. Probabilism refers to causes that increase the likelihood of a given outcome rather than guaranteeing it. Probabilistic causes are neither necessary nor sufficient, but their effects are robust enough for them to serve as meaningful predictors of the social outcome(s) in question.

It is difficult to see how ICTs by themselves could serve as causes of any given political outcome in this sense. Correlation is difficult to demonstrate because technologies are often associated with wildly divergent social outcomes in different social contexts (Markus & Robey, 1988). This is the main reason why there are very few scholarly technological determinists working today. Time precedence is also hard to straighten out in societies suffused with ICTs and proficient users. The question of which came first—political action or ICT use—will increasingly yield a single, unenlightening answer: the latter, as more and more people begin using digital technologies at early ages. Nonspuriousness presents probably the strongest objection of the three, as net skeptics have marshaled various alternative explanations for ostensibly net-driven political participation (e.g. Hindman, 2008, Margolis & Resnick, 2000).

But this point doesn’t really dent Henry’s argument at all, because he doesn’t posit technology as a cause. Rather, he focuses on social processes such as peer-to-peer information sharing and social influence as potential causes of political phenomena. It seems clear that these variables could in principle function as causes, but if they’re doing all the work, what do we need the internet for? One possibility is that online access or the use of specific services are effects rather than causes: this is the position of the normalization hypothesis, which holds that preexisting political interests cause political uses of ICTs. Another is that the role of technology is simply too complex to theorize as nomothetically as we might like, as Markus and Robey’s empirical review suggests.

In any event, the nomothetic approach requires that the social processes of interest retain their predictive power across a wide array of cases. Thus it is not enough that social influence, for example, might be linked with revolutionary activities in a few countries or situations—the two would need to be correlated, properly time-sequenced, and spuriousness-tested in many if not most cases to support a strong general theory of political action. Failing this rather lofty empirical standard, we might profitably settle for devising theories of smaller subsets of cases that are conceptually linked in some way. So it might be possible to develop theories of (for example) protest activity in advanced democracies, developing countries, or Islamic countries that include distinctive sets of roles for ICTs (e.g. Howard, 2010). But notice how far we have moved from macro-level theories that posit context-independent relationships between social processes and politics, whose breadth makes them unlikely to accrue consistent empirical support. By bounding our theoretical scopes with contextual qualifiers like culture, country, time period, and level of technological development, it is possible to develop mid-level theories that strike a balance between explaining It All and simply describing reality.

In sum, all of this is to say: yes to mechanisms, but yes also to both tightly circumscribed contextual caveats and the possibility of significant roles for online platforms. In the end I think my position is compatible with a version of Henry’s; my primary concerns are about the scope of generalizability of causal claims. Big, broad, parsimonious theory is always attractive, but it may not always be possible.
