Update 2/21/2012: As my colleague Alex Hanna recently informed me, up to 2% of the archives below may consist of duplicate tweet IDs. If you intend to work with this data, I highly recommend removing all the duplicates first.
Last May I posted some very basic descriptive statistics and charts on the usage of Arab Spring-related Twitter hashtags to this blog. Some of these findings were later included in a report titled “Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?” published by the Project on Information Technology and Political Islam of the University of Washington. The data all came from the free online service formerly known as TwapperKeeper (now a part of HootSuite), which prior to March 20, 2011 allowed users to publicly initiate tweet archiving jobs and download the archived content at their convenience. Unfortunately for Internet researchers, Twitter decided on that date to forbid the public distribution of Twitter content, significantly diminishing the utility of TwapperKeeper and other services like it.
Since the release of the PITPI report in September 2011, several scholars have expressed interest in obtaining the Twitter data for their own research. I refused these requests categorically on the grounds that they would violate Twitter’s terms of service. However, one of these scholars asked Twitter directly about what data is allowed to be shared and what is not. He discovered that distributing numerical Twitter object IDs is not a violation of the TOS. I quote in full the message Twitter sent him below:
Under our API Terms of Service (https://dev.twitter.com/terms/api-terms), you may not resyndicate or share Twitter content, including datasets of Tweet text and follow relationships. You may, however, share datasets of Twitter object IDs, like a Tweet ID or a user ID. These can be turned back into Twitter content using the statuses/show and users/lookup API methods, respectively. You may also share derivative data, such as the number of Tweets with a positive sentiment.
Thanks, Twitter API Policy
Consistent with this official message, I have bundled up the numerical IDs for all of the users and individual tweets contained in all of the archives I have. They can be downloaded from the links below. The files are in CSV format, and each row contains both a tweet ID (left column) and the ID of the user who posted it (right column). As the message above notes, these IDs can be used to retrieve the complete datasets directly from Twitter via its API. All of these archives are based on Twitter keyword searches (except for those containing hash marks, which are hashtag archives) and were amassed between January and March 2011, except for “bahrain” which spans nearly a year beginning at the end of March 2010. None of these archives should be considered exhaustive for the months they cover, as TwapperKeeper was limited in its tweet collection capacity both by its own hardware and by Twitter’s API query restrictions. With those caveats out of the way, here are the data:
- algeria (86,844 tweets)
- bahrain (362,128 tweets)
- egypt (2,364,133 tweets)
- #feb14 (Bahrain) (48,698 tweets)
- #feb17 (Libya) (907,962 tweets)
- #jan25 (Egypt) (671,417 tweets)
- libya (2,745,912 tweets)
- morocco (85,542 tweets)
- #sidibouzid (Tunisia) (79,166 tweets)
- yemen (479,456 tweets)
These data will be all but useless to anyone without at least a basic understanding of all of the following:
- APIs and how to retrieve data from them,
- a programming language like PHP or Python,
- and a relational database system such as MySQL.
And even with this knowledge, recreating the full data sets would still take months of 24/7 automated querying given Twitter’s API limits.
Like many Twitter researchers, I reacted with dismay when Twitter changed its TOS last year to sharply restrict data sharing. To this day I struggle to think of a valid reason for the change, especially since their APIs remain open to anyone with the skills to query them. Nevertheless I feel bound to respect Twitter’s TOS, not so much out of fear for the consequences (although recent machinations at the DOJ may soon change that), but because so much social research depends critically upon the assumption that researchers will act according to the wishes of their subjects. The principle is similar to source confidentiality in journalism: if it became common practice for reporters to publish “off-the-record” information, sources would stop talking to them after awhile.
I have no idea what the prospects for getting Twitter to revert their TOS are since I don’t know why they changed it in the first place. However, if you would like to see this happen, you might consider leaving a comment to that effect on this blog post detailing why you think it’s important. If nothing else, such comments might convey a sense of some of the different ways Twitter’s API policy is hampering research, and may also start a conversation about possible workarounds or other ways of resolving the situation.