In Appendix A of the public report “Beyond the Hashtags: #Ferguson, #Blacklivesmatter, and the Online Struggle for Offline Justice,” my coauthors and I promised to release our Twitter data publicly in 2017. The time has come to make good on that promise. Unfortunately, Twitter’s Terms of Service restricts users from publishing any Twitter data except tweet IDs. However, these IDs can be programmatically “hydrated,” which recreates the original dataset minus any tweets that have been deleted or removed from public view since the dataset was generated. This blog post contains all the original tweet IDs separated by day along with sample code for hydrating them.
Our Twitter dataset contains all 40,815,975 tweets matching at least one of the following 45 keywords that were posted between June 1, 2014 and May 31, 2015 and had not been deleted or protected as of July 2015:
- “akai gurley”
- “black lives matter”
- “dante parker”
- “eric garner”
- “eric harris”
- “ezell ford”
- “freddie gray”
- “jerame reid”
- “john crawford”
- “jordan baker”
- “kajieme powell”
- “mckenzie cochran”
- “michael brown”
- “mike brown”
- “phillip white”
- “tamir rice”
- “tanisha anderson”
- “tony robinson”
- “tyree woodson”
- “victor white”
- “walter scott”
- “yvette smith”
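Once a day's file has been hydrated (the steps are described further down), you may want only the tweets that mention one particular keyword. The following is a minimal sketch, not the procedure we used to build the dataset. It assumes a plain case-insensitive substring match and that each tweet's text sits under either `full_text` or `text`. The function name `tweets_matching` is hypothetical.

```python
import json


def tweets_matching(jsonl_path, keyword):
    """Yield tweets from a line-delimited JSON file whose text contains keyword.

    Assumption: a simple case-insensitive substring test. This is an
    illustration only and may not reproduce how the tweets were collected.
    """
    needle = keyword.lower()
    with open(jsonl_path, encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # The text may be stored under 'full_text' or 'text',
            # depending on how the tweet was retrieved.
            text = tweet.get("full_text") or tweet.get("text") or ""
            if needle in text.lower():
                yield tweet


# Example usage, assuming bth_data_2014-06-01.json already exists:
# for tweet in tweets_matching("bth_data_2014-06-01.json", "eric garner"):
#     print(tweet["id_str"])
```

Because the function is a generator, it reads one tweet at a time and does not load the whole day into memory.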
The following zip file contains 365 text files, each of which contains the tweet IDs of all tweets posted on one of the days covered by our dataset.
ZIP of BTH Twitter ID text files (259 MB)
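After downloading the archive, it can be worth checking that it is complete before hydrating anything. The sketch below counts the IDs in each text file without extracting the archive to disk. It assumes one tweet ID per line, and the archive name `bth_ids.zip` in the usage comment is a placeholder for whatever you saved the download as.

```python
import zipfile


def count_ids_per_file(zip_path):
    """Return {member name: number of non-blank lines} for each .txt file."""
    counts = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".txt"):
                continue  # ignore folders and any non-text members
            with archive.open(name) as member:
                # Assumption: one tweet ID per line
                counts[name] = sum(1 for line in member if line.strip())
    return counts


# Example usage ('bth_ids.zip' is a placeholder file name):
# counts = count_ids_per_file("bth_ids.zip")
# print(len(counts), "files,", sum(counts.values()), "tweet IDs")
```

If the download is intact, the file count should be 365, and the ID total should be comparable to the 40,815,975 tweets in the dataset.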
All that is required to hydrate the data is a properly formatted set of instructions for the Twitter API. To do this I recommend the Python module twarc, for which I provide sample code below. Before you can use twarc, you’ll need to install it and then create a Twitter app if you haven’t already. Both are free and more or less instantaneous.
Here is the (Python 3) code. The only change you’ll need to make is to enter the consumer key, consumer secret, etc. from your Twitter app between the single quotes on lines 4 – 7.
```python
from twarc import Twarc
import json

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

# Authenticate against the Twitter API with your app's credentials
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Hydrate each tweet ID and keep the result as one JSON string per tweet
data = []
for tweet in t.hydrate(open('bth_ids_2014-06-01.txt')):
    data.append(json.dumps(tweet))

# Write one tweet per line
with open('bth_data_2014-06-01.json', 'w') as outfile:
    outfile.write("\n".join(data) + '\n')
```
This will create a JSON file in your working directory containing the public tweets and associated metadata matching the IDs in the bth_ids_2014-06-01.txt file. You can repeat this process for whichever dates/files you’re interested in.
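To repeat the process for many days at once, a loop along the following lines may help. It is a sketch under a few assumptions: that every ID file follows the `bth_ids_YYYY-MM-DD.txt` naming pattern of the example file, that each holds one ID per line, and that you pass in a hydrating function such as the `hydrate` method of the `Twarc` object created above. The function name `hydrate_directory` and the folder names in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def hydrate_directory(hydrate, id_dir, out_dir):
    """Hydrate every bth_ids_*.txt file in id_dir, one output file per day.

    `hydrate` is any callable that takes an iterable of tweet IDs and
    yields tweet dicts (for example, t.hydrate from a configured Twarc).
    Returns {date: (ids_requested, tweets_returned)}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    summary = {}
    for id_path in sorted(Path(id_dir).glob("bth_ids_*.txt")):
        # 'bth_ids_2014-06-01' -> '2014-06-01' (assumed naming pattern)
        date = id_path.stem.replace("bth_ids_", "")
        ids = [line.strip() for line in id_path.read_text().splitlines()
               if line.strip()]
        out_path = out_dir / f"bth_data_{date}.json"
        returned = 0
        with open(out_path, "w", encoding="utf-8") as outfile:
            # Write each tweet as soon as it arrives, one per line
            for tweet in hydrate(ids):
                outfile.write(json.dumps(tweet) + "\n")
                returned += 1
        summary[date] = (len(ids), returned)
    return summary


# Example usage ('ids' and 'hydrated' are placeholder folder names):
# summary = hydrate_directory(t.hydrate, "ids", "hydrated")
# for date, (requested, returned) in summary.items():
#     print(date, requested - returned, "tweets no longer available")
```

The gap between the two counts for a given day shows how many tweets have been deleted or removed from public view since the dataset was generated.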
Finally, if you use this data in any public writings, we ask that you please cite the original report.
Thanks for releasing the dataset, Deen!
One thing to note is that twarc’s interface has changed slightly. The command people will want to use is:
Also, people might want to check out the Hydrator desktop application, which now has Windows and Linux installers in addition to the OS X one that has been available for a few months.
Thanks, Ed. I changed the code to work within Python for simplicity’s sake (I can’t get the generic twarc command to work in the Windows command line).
If you installed twarc on Windows recently the interface has changed. It’s no longer
and is now:
That being said, I haven’t actually tested the twarc install on Windows … I’ll add that to the todo list!
Hmm, I guess I didn’t add the Hydrator link correctly. Here it is as plain text: