Occasionally you might notice tweets in socialbearing.com and other Twitter tools tagged with the wrong language.
Twitter automatically tags each tweet with a language based entirely on the words within the tweet, rather than on any other indicator such as your profile location, country of origin or geo-location. With so many multilingual Twitter users and accounts, those other indicators would be a poor guide to the language of any individual tweet, so the text itself is the most reliable signal available. It does mean, however, that Twitter occasionally gets things wrong.
Incorrect language identification happens most frequently when a tweet contains only a handful of words, which is not enough text to identify a language correctly. For example, the word ‘haha’ is often tagged as ‘Wikang Tagalog’ (Tagalog), one of the main languages of the Philippines and the basis of Filipino, the national language. The Twitter API returns such a tweet with the ISO code ‘tl’.
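As a rough illustration of why short tweets are a problem, the sketch below reads the `lang` field from a tweet object and flags tweets whose text is probably too short for the tag to be trusted. The sample tweet is hand-written rather than real API output, and both the `MIN_WORDS` threshold and the `is_language_tag_reliable` helper are hypothetical names chosen for this example, not anything Twitter or Social Bearing provides.

```python
# Hypothetical threshold: tweets with fewer words than this are treated as
# too short for the language tag to be trusted. The value is an assumption.
MIN_WORDS = 4


def is_language_tag_reliable(tweet: dict, min_words: int = MIN_WORDS) -> bool:
    """Return False when the tweet text has too few words to trust its tag."""
    words = tweet.get("text", "").split()
    return len(words) >= min_words


# Hand-written sample shaped like a tweet object (illustrative only).
sample = {"text": "haha", "lang": "tl"}

print(sample["lang"], is_language_tag_reliable(sample))
```

A word count is only a crude proxy. It says nothing about how Twitter itself detects language, but it is enough to mark one-word tweets such as ‘haha’ as doubtful.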
Twitter does a particularly poor job of identifying Welsh tweets. Instead of tagging them with the wrong language, the Twitter API returns them as ‘und’ (unidentified). For example, a search for the word ‘Diolch’ (Thank you) returns the majority of tweets as unidentified.
While there’s nothing Social Bearing can do to correct mis-tagged tweets, an ‘Unidentified’ language option has been added so that visitors can at least see that some tweets are not being identified correctly.
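One way a tool could surface this is to group tweets by their `lang` code and show ‘und’ under its own label. This is a minimal sketch, not Social Bearing's actual implementation: it assumes tweets are plain dictionaries with a `lang` key, the `LANGUAGE_NAMES` table and `count_by_language` helper are hypothetical, and the sample tweets are made up for illustration.

```python
from collections import Counter

# Hypothetical display names for a few language codes; 'und' gets its own label.
LANGUAGE_NAMES = {
    "en": "English",
    "tl": "Tagalog",
    "cy": "Welsh",
    "und": "Unidentified",
}


def count_by_language(tweets):
    """Tally tweets per language label, treating a missing tag as 'und'."""
    counts = Counter()
    for tweet in tweets:
        code = tweet.get("lang") or "und"
        # Fall back to the raw code when no display name is known.
        counts[LANGUAGE_NAMES.get(code, code)] += 1
    return counts


# Made-up sample tweets (illustrative only, not real API output).
tweets = [
    {"text": "Diolch", "lang": "und"},
    {"text": "Diolch yn fawr", "lang": "und"},
    {"text": "haha", "lang": "tl"},
    {"text": "Thanks everyone for coming along", "lang": "en"},
]

print(count_by_language(tweets).most_common())
```

Keeping ‘Unidentified’ as a separate bucket, rather than folding it into another language, makes it visible how many tweets Twitter could not classify.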