Hi,
I am setting up a filtered stream using a list of keywords in various languages. As I understand it, Twitter tokenizes tweets and matches keywords against the resulting tokens. That brings me to the following question:
How does tokenization work in languages like Japanese, Thai, or Chinese, where words are not (always) separated by spaces? Is there a resource that explains this?
Best regards,
Björn
I’d love to know too - I haven’t seen any notes on Twitter’s tokenization of tweets for search, but I assume it’s done per character for those scripts.
There are some clues in the test cases of the twitter-text extraction library: twitter-text/extract.yml at master · twitter/twitter-text · GitHub. They also define their own Unicode character ranges to keep character-set handling consistent across programming languages: twitter-text/unicode_regex at master · twitter/twitter-text · GitHub
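To make the problem concrete, here is a minimal sketch (my own assumption, not Twitter’s actual implementation) of why keyword matching behaves differently for scripts without word separators: a whitespace tokenizer produces nothing useful for CJK text, so a matcher has to fall back to substring (effectively per-character) comparison.

```python
def space_tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer: fine for English, useless for CJK,
    where a whole sentence comes back as a single 'token'."""
    return text.split()

def keyword_matches(text: str, keyword: str) -> bool:
    """Hypothetical matcher: try an exact token match first (works for
    space-delimited scripts), then fall back to a raw substring match
    (needed for scripts like Japanese or Chinese)."""
    if keyword in space_tokenize(text):
        return True           # whole-token match, e.g. English
    return keyword in text    # substring match, e.g. CJK

# English: "cat" is its own whitespace-delimited token
print(keyword_matches("the cat sat", "cat"))   # True
# Japanese: 猫 ("cat") is embedded with no surrounding spaces,
# so only the substring fallback can find it
print(keyword_matches("黒い猫が好き", "猫"))     # True
print(keyword_matches("黒い犬", "猫"))          # False
```

Note the trade-off this sketch exposes: substring matching on space-delimited languages causes false positives (e.g. "cat" inside "concatenate"), which is presumably why a real system would apply different matching strategies per script.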