Hebrew and Indonesian getting incorrect 'lang' codes


#1

While going through some tweets I had collected through the REST API, I found that the ‘lang’ field had tagged several tweets as ‘iw’ and ‘in’. The blog post that introduces the ‘lang’ metadata cites BCP 47 (http://tools.ietf.org/html/bcp47) as the basis for the two-letter codes being used, and BCP 47 notes:

"By contrast, the subtags ‘he’ and ‘iw’ share a ‘Description’ value of “Hebrew”; this is permitted because ‘iw’ is deprecated and its ‘Preferred-Value’ is ‘he’."

However, I could not find ‘in’ in the list of ISO 639-1 two-letter codes. From my dataset’s context, I’m guessing that it refers to Indonesian, but the correct code should be ‘id’.

Just wanted to let the devs know that these tags are incorrect. Thanks!


#2

Thanks - I will look into this.


#3

Hello.
Any news on this topic?
I also ran into this problem with Indonesian.
The problem also occurs when filtering the stream by language.
Filtering by ‘id’ (the actual code for Indonesian) returns no results.

Regards


#4

There are a number of language tags in our code which are still returned as deprecated (e.g. pre-BCP 47) values. Unfortunately, Indonesian is one of them and may be returned as “in”.
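
Until that changes, one way to cope on the client side is to map the deprecated codes to their preferred values after the tweets are fetched. The following is only a minimal sketch: the function name `normalize_lang`, the mapping name, and the sample tweet dicts are made up for illustration and are not part of the API. The three mappings shown (‘iw’ to ‘he’, ‘in’ to ‘id’, ‘ji’ to ‘yi’) are the deprecated ISO 639 codes listed in the IANA language subtag registry; other deprecated tags may also turn up.

```python
# Deprecated two-letter codes and their preferred replacements.
# Assumption: these are the only deprecated codes you expect to see;
# extend the table if your data contains others.
DEPRECATED_LANG_CODES = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_lang(code):
    """Return the preferred code for a deprecated tag; pass anything else through."""
    if code is None:
        return None
    code = code.lower()
    return DEPRECATED_LANG_CODES.get(code, code)


# Hypothetical tweet payloads, reduced to the fields used here.
tweets = [
    {"text": "first sample tweet", "lang": "in"},
    {"text": "second sample tweet", "lang": "iw"},
    {"text": "third sample tweet", "lang": "en"},
]

for tweet in tweets:
    tweet["lang"] = normalize_lang(tweet.get("lang"))

normalized = [tweet["lang"] for tweet in tweets]
```

Note that this only cleans up the values after they are returned. It does not change which code the API expects when filtering.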