Twitter geolocation and its limitations

A couple of recent articles have gotten me thinking about methods for geolocating Twitter users: Kalev Leetaru et al.’s recent double-sized piece in First Monday explaining how to identify the locations of users in the absence of GPS data; and the Floating Sheep collective’s new Twitter “hate map,” which has received a fair amount of media attention. The ability to know where social media users are located is pretty valuable: among other things, it promises to help us understand the role of geography in predicting or explaining different outcomes of interest. But we need to adjust our enthusiasm about these methods to fit their limitations. They have great potential, but (like most research methods) they’re not as complete as we might like them to be.

Let’s start with the gold standard: latitude/longitude coordinates. When Twitter users grant the service access to their GPS devices and/or cellular location info, their current latitude and longitude coordinates are beamed out as metadata attached to every tweet.  Because these data are generated automatically via very reliable hardware and software, we can be reasonably certain of their accuracy. But according to Leetaru et al., only about 1.6 percent of users have this functionality turned on. Due to privacy concerns, Twitter offers it on an opt-in basis, which partly explains the low level of uptake.
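For concreteness, here’s a minimal sketch of what consuming that metadata looks like. It assumes the classic Twitter REST API tweet object, in which the “coordinates” field (present only when the user has opted in) holds a GeoJSON Point with coordinates ordered [longitude, latitude]; this reflects my reading of the API format, not anything specific to Leetaru et al.

```python
import json

def extract_latlong(raw_tweet):
    """Pull GPS coordinates from a tweet's metadata, if present.

    Assumes the classic Twitter REST API tweet object, whose
    "coordinates" field (when the user has opted in) holds a GeoJSON
    Point with coordinates ordered [longitude, latitude].
    """
    tweet = json.loads(raw_tweet)
    point = tweet.get("coordinates")
    if point is None:
        return None  # the common case: no GPS metadata attached
    lon, lat = point["coordinates"]  # GeoJSON order: longitude first
    return (lat, lon)
```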

For the social science researcher, relying on lat/long for geolocation in Twitter raises a major sampling bias issue: what if users who have this feature turned on differ systematically in certain ways from those who don’t? Here are a few plausible albeit untested (as far as I know) characteristics that geolocatable social media users may be more likely to exhibit:

  • High levels of formal education
  • Maleness
  • General adeptness/comfort with digital technologies
  • Living in a politically stable country
  • Extroversion
  • An ideological commitment to “publicness”

I’m sure you could come up with more. The point is, we cannot assume that, for example, a map of geotagged racist/homophobic/ableist tweets faithfully represents the broader Twitter hate community. If all hate-tweets came geotagged, the map might look very different, especially since, as things stand now, many haters are savvy and motivated enough to render their hate less visible.

Fortunately, lat/long coordinates are not our only option for trying to figure out where tweeps are. Leetaru et al. helpfully offer a series of methods for doing so in cases where this information is absent (other stabs at this include Cheng et al., 2010 and Hecht et al., 2011). Leetaru et al.’s most effective methods focus on the free-text “Bio” and “Profile” fields, which, when combined, increase the proportion of correctly IDed locations at the city level to 34%. This represents an increase of more than an order of magnitude over what lat/long alone allows, and it’s a very cool research finding in its own right. However, the sampling bias problem applies with nearly equal force to this enhanced data: the strong possibility remains that the nearly two-thirds of tweeps who can’t be located differ in critical ways from those whose locations can be identified.
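To make the string-matching idea concrete, here’s a minimal sketch. The tiny gazetteer is a placeholder for a real place-name resource like GeoNames, and the field names mirror the Twitter user object’s `location` and `description` fields; the matching logic is my own toy version, not Leetaru et al.’s actual implementation.

```python
# Toy gazetteer mapping lowercase place names to (city, country).
# A stand-in for a real resource like GeoNames; ambiguous names
# (London, UK vs. London, Ontario) would need real disambiguation.
GAZETTEER = {
    "toronto": ("Toronto", "Canada"),
    "nairobi": ("Nairobi", "Kenya"),
    "portland": ("Portland", "United States"),  # ambiguous in practice
}

def match_location(user):
    """Return (city, country) if a free-text profile field names a
    known place; return None (leave the user unlabeled) rather than guess."""
    for field in (user.get("location", ""), user.get("description", "")):
        for token in field.lower().replace(",", " ").split():
            if token in GAZETTEER:
                return GAZETTEER[token]
    return None
```

Note the design choice embodied in the final line: a dictionary matcher stays conservative by declining to label anyone it can’t match, which is exactly the property contrasted with machine-learning guessers below.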

So, what to do? Ideally, to generalize effectively, we want to be able to say that the individual-level characteristics and overall geographic distribution of our geoidentified users resemble those of a representative sample of all users within our sampling frame. But this is a very tall order methodologically, and even if we could accomplish it, the results would likely disappoint us.

Our options at this point depend upon the required level of location granularity: the coarser it is, the better we’ll be able to do. If, for example, we only need country-level data for a fairly small N of countries, we can take advantage of the fact that it is easier to identify a user’s country than her city. One strategy here would be to start with string-matching methods like those used by Leetaru et al. and Hecht et al., which attempt to identify locations listed in various Twitter fields using dictionaries of place names. Next, for users whose locations can’t be identified this way, a less definitive machine-learning method could be substituted to guess locations based on tweet text. This second method has the notable disadvantage of forcing a location guess for every user, introducing the risk of misidentification, whereas the first simply leaves unlabeled all users who don’t yield conclusive dictionary matches. (It is also more computationally intensive due to the higher volume of data required and the complexity of most machine-learning algorithms.) Nevertheless, Hecht et al. achieve accuracy rates between 73% and 89% with this method at the country level (depending on how the data are sampled), suggesting that it could help researchers address the sampling bias issue in some scenarios. It would probably suffice to identify relatively small, randomly selected subsamples for each country of interest using machine learning, compare them to those IDed via string-matching, and search for major differences between each pair of groups.
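Here’s a rough sketch of how such a two-stage pipeline might fit together. The naive Bayes classifier is my own stand-in for whatever machine-learning method one prefers (Hecht et al.’s actual technique differs), it assumes a training set of users whose countries are already known, and the user-dict keys (`id`, `tweet_text`) are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def build_country_guesser(train_texts, train_countries):
    """Train a bag-of-words country classifier on tweet text from
    users with known countries (e.g., from geotags or dictionary hits)."""
    vectorizer = CountVectorizer(min_df=5)
    model = MultinomialNB().fit(vectorizer.fit_transform(train_texts),
                                train_countries)

    def guess(text):
        # Always produces a label -- the misidentification risk noted above.
        return model.predict(vectorizer.transform([text]))[0]

    return guess

def locate_users(users, matcher, guesser):
    """Stage 1: conservative dictionary match; stage 2: ML fallback."""
    results = {}
    for user in users:
        hit = matcher(user)  # e.g., match_location() from the earlier sketch
        if hit is not None:
            results[user["id"]] = ("dictionary", hit[1])  # keep country only
        else:
            results[user["id"]] = ("ml-guess", guesser(user["tweet_text"]))
    return results
```

Tagging each result with its source makes the subsample comparison suggested above straightforward: pull the “ml-guess” users for each country of interest and test whether their characteristics diverge from those of the “dictionary” group.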

The prospects for determining the representativeness of geolocated users at anything more specific than the country level are much slimmer. The accuracy of Hecht et al.’s machine-learning geolocation technique drops from 73%-89% at the country level to 27%-30% at the US state level. Extending this logic, it’s probably safe to assume an inverse correlation between algorithm accuracy and location specificity when 1) location info is sparse in the data (as with tweets) and 2) the set of possible locations is very large or unbounded. Under these conditions I can’t think of how one might go about measuring how representative the known locations are of the unknowns. (If you have any ideas, leave a comment!) At that point you might simply have to grant that you can’t say much about how representative your sample is, and justify your study’s contributions on other grounds.

2 comments

  1. It became painfully obvious today that using Perl Net::Twitter to find geocoded tweets matching search terms was missing the vast majority of messages. I have to thank you for posting this excellent analysis of the technical challenges and, subsequently, for saving me from a time-consuming failure.

    Although not exactly helpful for your stated goal, using regional hashtags may help zero in on users (e.g., #airport_code or #area_code tags such as #yyz #416 #toronto). Some study would have to be done to see how users employ tags related to high-profile news, but it would seem those codes are mostly familiar to locals. Out of curiosity, I took a glance at Ferguson and found none of these (see http://hashtagify.me/hashtag/ferguson ) while the city name was extensively used.

    My new strategy may be to add every account tweeting these tags to a list (following not required), and then search against the list. Fun with Perl and cron!

  2. It is also possible to evaluate GPS data in photos. You can even find a non-photo tweet and then review the user’s previous media posts. Heavy lifting for fast searches, however.
