Arab Spring Twitter data now available (sort of)

Update 2/21/2012: As my colleague Alex Hanna recently informed me, up to 2% of the archives below may consist of duplicate tweet IDs. If you intend to work with this data, I highly recommend removing all the duplicates first.

Last May I posted some very basic descriptive statistics and charts on the usage of Arab Spring-related Twitter hashtags to this blog. Some of these findings were later included in a report titled “Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?” published by the Project on Information Technology and Political Islam of the University of Washington. The data all came from the free online service formerly known as TwapperKeeper (now a part of HootSuite), which prior to March 20, 2011 allowed users to publicly initiate tweet archiving jobs and download the archived content at their convenience. Unfortunately for Internet researchers, Twitter decided on that date to forbid the public distribution of Twitter content, significantly diminishing the utility of TwapperKeeper and other services like it.

Since the release of the PITPI report in September 2011, several scholars have expressed interest in obtaining the Twitter data for their own research. I refused these requests categorically on the grounds that fulfilling them would violate Twitter’s terms of service. However, one of these scholars asked Twitter directly which data may be shared and which may not. He discovered that distributing numerical Twitter object IDs is not a violation of the TOS. I quote the message Twitter sent him in full below:

Hello,

Under our API Terms of Service (https://dev.twitter.com/terms/api-terms), you may not resyndicate or share Twitter content, including datasets of Tweet text and follow relationships. You may, however, share datasets of Twitter object IDs, like a Tweet ID or a user ID. These can be turned back into Twitter content using the statuses/show and users/lookup API methods, respectively. You may also share derivative data, such as the number of Tweets with a positive sentiment.

Thanks, Twitter API Policy

Consistent with this official message, I have bundled up the numerical IDs for all of the users and individual tweets contained in all of the archives I have. They can be downloaded from the links below. The files are in CSV format, and each row contains both a tweet ID (left column) and the ID of the user who posted it (right column). As the message above notes, these IDs can be used to retrieve the complete datasets directly from Twitter via its API. All of these archives are based on Twitter keyword searches (except for those containing hash marks, which are hashtag archives) and were amassed between January and March 2011, except for “bahrain,” which spans nearly a year beginning at the end of March 2010. None of these archives should be considered exhaustive for the months they cover, as TwapperKeeper was limited in its tweet collection capacity both by its own hardware and by Twitter’s API query restrictions. With those caveats out of the way, here are the data:

algeria (86,844 tweets)
bahrain (362,128 tweets)
egypt (2,364,133 tweets)
#feb14 (Bahrain) (48,698 tweets)
#feb17 (Libya) (907,962 tweets)
#jan25 (Egypt) (671,417 tweets)
libya (2,745,912 tweets)
morocco (85,542 tweets)
#sidibouzid (Tunisia) (79,166 tweets)
yemen (479,456 tweets)
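
Given the duplicate tweet IDs mentioned in the update at the top of this post, a deduplication pass is a sensible first step. What follows is only a minimal Python sketch, not a tested pipeline. It assumes each file has no header row and holds the tweet ID in the first column, as described above; the function names and the file names in the usage note are hypothetical. IDs are kept as strings so that long numbers are never rounded or reformatted.

```python
import csv


def dedupe_rows(rows):
    """Yield only the first row seen for each tweet ID (first column)."""
    seen = set()
    for row in rows:
        if not row:  # skip blank lines
            continue
        tweet_id = row[0].strip()  # keep the ID as text, never as a float
        if tweet_id in seen:
            continue
        seen.add(tweet_id)
        yield row


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path without repeated tweet IDs.

    Returns the number of rows written.
    """
    kept = 0
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in dedupe_rows(csv.reader(src)):
            writer.writerow(row)
            kept += 1
    return kept
```

For example, `dedupe_file("algeria.csv", "algeria_unique.csv")` would write a copy of a hypothetical `algeria.csv` that keeps the first row for each tweet ID and drops later repeats.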

These data will be all but useless to anyone without at least a basic understanding of all of the following:

APIs and how to retrieve data from them,
a programming language like PHP or Python,
and a relational database system such as MySQL.

And even with this knowledge, recreating the full data sets would still take months of 24/7 automated querying given Twitter’s API limits.
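
To get a rough sense of the time involved, the arithmetic can be sketched as below. The archive sizes are the counts listed above; everything else is an assumption. In particular, `requests_per_hour` and `ids_per_request` are placeholders that you should replace with whatever limits apply to your own API access, and the function name is hypothetical.

```python
import math

# Tweet counts per archive, copied from the list in this post.
ARCHIVE_SIZES = {
    "algeria": 86_844,
    "bahrain": 362_128,
    "egypt": 2_364_133,
    "#feb14": 48_698,
    "#feb17": 907_962,
    "#jan25": 671_417,
    "libya": 2_745_912,
    "morocco": 85_542,
    "#sidibouzid": 79_166,
    "yemen": 479_456,
}


def estimate_days(total_ids, ids_per_request, requests_per_hour):
    """Days of nonstop querying needed to look up total_ids tweet IDs.

    ids_per_request and requests_per_hour are assumptions supplied by the
    caller, not values taken from Twitter's documentation.
    """
    requests_needed = math.ceil(total_ids / ids_per_request)
    return requests_needed / requests_per_hour / 24


total = sum(ARCHIVE_SIZES.values())  # 7,831,258 IDs across all ten archives
```

Calling `estimate_days(total, ids_per_request=1, requests_per_hour=N)` with your own value of `N` gives the number of days of uninterrupted querying, before accounting for failed requests or deleted tweets.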

Like many Twitter researchers, I reacted with dismay when Twitter changed its TOS last year to sharply restrict data sharing. To this day I struggle to think of a valid reason for the change, especially since their APIs remain open to anyone with the skills to query them. Nevertheless I feel bound to respect Twitter’s TOS, not so much out of fear of the consequences (although recent machinations at the DOJ may soon change that), but because so much social research depends critically upon the assumption that researchers will act according to the wishes of their subjects. The principle is similar to source confidentiality in journalism: if it became common practice for reporters to publish “off-the-record” information, sources would stop talking to them after a while.

I have no idea what the prospects for getting Twitter to revert their TOS are since I don’t know why they changed it in the first place. However, if you would like to see this happen, you might consider leaving a comment to that effect on this blog post detailing why you think it’s important. If nothing else, such comments might convey a sense of some of the different ways Twitter’s API policy is hampering research, and may also start a conversation about possible workarounds or other ways of resolving the situation.

5 comments

Joyce says:

March 18, 2013 at 8:21 am

Hello,
Thank you very much for your post and sharing this data with us.
I am currently doing research on social media and the 2011 social movements for my thesis (I am a 2nd-year PhD student). I am desperately looking for ways to download content from Facebook (public pages) and Twitter (tweets) that was used in Tunisia (the Jasmine Revolution) and the UK riots. I have specific dates and a specific hashtag list (for tweets). Do you think you could guide me? I don’t know how to get my hands on this data, and my whole doctorate is blocked without it. I share your discontent regarding Twitter’s blocking of access to data in March 2011. THANK YOU again for your post and the data; I hope you will answer my email.
Best Regards,
Joyce.

1. Pablo says:
  
  October 7, 2014 at 2:22 am
  
  Hi, have you solved your problem? I’m in the same situation.
  Best regards
  
Julien says:

February 4, 2015 at 7:03 am

Hi,

Same situation here. Any solution / evolution?

Best regards

dfreelon says:

February 4, 2015 at 8:44 am

In a nutshell, you need to know how to code to extract and analyze lots of social media data rigorously. Some of the tools here may help: http://dfreelon.org/2015/01/22/social-media-collection-tools-a-curated-list/ but you’ll always be limited until you can manipulate the data with code. I suggest Python as a good language to start with.

Paul says:

November 1, 2017 at 8:39 am

Not long ago I came across a wonderful tool to rehydrate files like these with no programming knowledge needed. Hydrator offers a GUI and runs on Windows, macOS and Linux. It can be found at https://github.com/DocNow/hydrator. Hydrator only expects the tweet IDs, so for the files in the post above, remove the 2nd column. Also make sure the first column is formatted as numbers, not general, which is what Excel showed it as when I opened one of the files.

So that’s the good news. The bad news is that in the one file I tested, Algeria, only 2,288 of the 86,844 tweets could be rehydrated. That’s 3%. The other 97%, according to Hydrator, have been deleted. I don’t know if that means the individual tweets have been deleted, or the full accounts that originally created them. It’s also possible that Hydrator screwed up. I did get a couple of JavaScript errors along the way. It’d be great if someone could compare results with another tool!
