Excluding tweets with certain words does not work as expected


#1

While trying to clean a stream of tweets with our hashtag, we encountered an odd issue with the exclude filter.

A search query like this: #hashtag -cam -sexy -dating -naked filters out the tweets with "naked" in them, but not the cam/sexy/dating ones.

Is this an issue with our query, or are we missing something?


#2

Can you provide details of the actual request you’re making to the API?

Note that the track= parameter of the statuses/filter streaming API doesn’t support negation.
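Since track= has no negation, one workaround is to drop unwanted tweets on the client after they arrive. Below is a minimal sketch of that idea, assuming you already have each tweet's text as a string. The names EXCLUDED and is_clean are made up for illustration and are not part of any Twitter library.

```python
import re

# Hypothetical exclusion list, taken from the query in post #1.
EXCLUDED = {"cam", "sexy", "dating", "naked"}

def is_clean(text, excluded=EXCLUDED):
    """Return True if none of the excluded words appear as whole words."""
    # Lowercase the text and split it into word tokens.
    words = set(re.findall(r"\w+", text.lower()))
    return words.isdisjoint(excluded)

# Example usage with hand-written sample texts (not real tweets).
samples = [
    "Sunset at the beach #loveoostende",
    "Live cam now #loveoostende",
]
kept = [t for t in samples if is_clean(t)]
```

This matches whole words only, so "camera" would not be dropped by the "cam" entry. Whether that is what you want depends on your use case.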


#3

It actually happens on the advanced Twitter search too, so I might as well give that as an example.

https://twitter.com/search?f=realtime&q=%23loveoostende%20-naked%20-cam%20-sexy%20-dating&src=typd

There are at least two “cam” tweets and a “dating” tweet in the first few results.


#4

Thanks, that’s a great reproducible example. Checking internally; stay tuned.


#5

On closer inspection, it looks like some of these tweets are using homoglyphs to escape filtering.

For instance, in this tweet the “c” of “cam” is actually Unicode U+0441, “CYRILLIC SMALL LETTER ES” and not a lowercase “c” at all… which is why it’s not filtered out by “-cam”.

We’ll need to look into homoglyph canonicalization in our search pipeline.
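In the meantime, here is a rough client-side sketch of what such folding could look like. It is not how our pipeline works. The CONFUSABLES table is a small hand-picked assumption covering a few Cyrillic lookalikes, and the function names are hypothetical. Note that Unicode NFKC normalization alone does not map Cyrillic letters to Latin ones, so an explicit table is needed.

```python
import unicodedata

# Partial, hand-picked table (an assumption for illustration only).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}
_TABLE = str.maketrans(CONFUSABLES)

def fold_homoglyphs(text):
    """Normalize, lowercase, then replace known lookalikes with Latin letters."""
    return unicodedata.normalize("NFKC", text).lower().translate(_TABLE)

def scripts(word):
    """Collect the script prefix of each letter's Unicode name, e.g. LATIN."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in word
        if ch.isalpha()
    }

def mixed_script_words(text):
    """Return words that mix letters from more than one script."""
    return [w for w in text.split() if len(scripts(w)) > 1]

disguised = "\u0441am"  # Cyrillic ES followed by Latin "am"
folded = fold_homoglyphs(disguised)
suspicious = mixed_script_words("live \u0441am tonight")
```

Folding text this way would also rewrite legitimate Cyrillic words, so flagging mixed-script words, as mixed_script_words does, may be the safer first step.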


#6

At least we can filter these specific tweets out for now by excluding each specific spelling variant, but I suppose the bots have outsmarted us again here.

Thanks for the quick support and looking into this.