Is the Twitter language detection library publicly available?



Both my colleague and I collected tweets using the streaming API. Unfortunately, my colleague only saved part of the metadata, which didn’t include language. I’m only interested in English tweets.

Does Twitter have a publicly available version of its language detection algorithm that I can apply to my colleague’s tweets? If not, does anyone know of a good language detection library for tweets?


No, that code is not public.


Thanks. Is there a public language detection library you recommend?


This is not an area I’m personally familiar with, but maybe another member of the community will have some ideas to share with you!


A quick Google search provided:

You could always roll your own as well. Grab a dictionary file and look up each word of the tweet in it. If a significant share of the words are in the dictionary, there’s a high chance the tweet is English. This is probabilistic, of course, so you’ll need to account for a few things, such as ignoring hashtags entirely, since a non-English user may use English hashtags to convey their message. You’ll have to settle on a confidence threshold you’re comfortable with. (For fun, you may also want to pick up non-English dictionaries and look for words that don’t match the English dictionary to see which language the tweet is actually in, and then build a list of known non-English words to help filter faster.)
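To make the roll-your-own idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the tiny `SAMPLE_ENGLISH_WORDS` set (a stand-in for a real dictionary file), and the 0.6 threshold, which you would want to tune on your own tweets.

```python
import re

# Tiny stand-in vocabulary for illustration only (assumption); in practice,
# use load_dictionary() with a full word list, one word per line.
SAMPLE_ENGLISH_WORDS = {
    "i", "love", "the", "new", "phone", "is", "this", "a", "and", "to",
    "my", "of", "in", "it", "you", "for", "on", "great", "day", "what",
}

# URLs, @mentions and #hashtags are dropped before counting words.
NOISE_RE = re.compile(r"https?://\S+|[@#]\w+")
# ASCII letters and apostrophes only, so accented words break into
# fragments that will usually miss an English dictionary.
WORD_RE = re.compile(r"[a-z']+")


def load_dictionary(path):
    """Read a word list (one word per line) into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def tweet_words(text):
    """Lowercase the tweet, strip noise, and return its word tokens."""
    cleaned = NOISE_RE.sub(" ", text.lower())
    tokens = (w.strip("'") for w in WORD_RE.findall(cleaned))
    return [w for w in tokens if w]


def english_ratio(text, dictionary):
    """Fraction of the tweet's words found in the dictionary (0.0 if none)."""
    words = tweet_words(text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


def looks_english(text, dictionary, threshold=0.6):
    """True if the dictionary hit ratio reaches the (hypothetical) threshold."""
    return english_ratio(text, dictionary) >= threshold
```

With a real word list you would call something like `looks_english(tweet_text, load_dictionary(path_to_word_list))`. Very short tweets give noisy ratios, so you may want to treat tweets with only a couple of words as undecided rather than trusting the score.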

Depending on how many tweets you’re looking to check, option 1 is most likely the simplest.