Does the search API support special characters?


#1

Hi all,

Does the REST API support special Latin characters? No matter what I do, it doesn’t seem to, although it really should, given that the Twitter website search does.

For example, say I want to search for accounts related to my football club São Paulo.

  • On the Twitter website, the address https://twitter.com/search?f=users&q=S%C3%A3o%20Paulo is perfectly valid and returns the expected list of results.

  • Using the REST API, the equivalent search would be https://api.twitter.com/1.1/users/search.json?q=S%C3%A3o%20Paulo. However, this does not work for me: the API returns an HTTP 401 status code and error code 32, “Could not authenticate you”. Please note this is the exact same search term that was used for the website search. (Calling https://api.twitter.com/1.1/users/search.json?q=São%20Paulo without escaping the special character leads to the exact same result.)

  • This behaviour occurs only when there are special Latin characters in the search term. For instance, I am able to search for another football club, “Corinthians”, through the API using the call https://api.twitter.com/1.1/users/search.json?q=Corinthians and get the correct results. The same goes for any search term without special Latin characters that needs to be URL-escaped; these calls are always successful and return the expected results. So this is not an OAuth problem.

My questions are:

  • Is there a way to search for terms including special Latin characters through the API?

  • If not, why not? Surely the search functions behind the website search and the API should be equivalent? Is this a bug? Is it something the Twitter team could fix?

Thanks in advance,

Luis


#2

That’s odd. What library/language are you using?

Twitter search normalizes UTF-8, so a search for “São Paulo” and “Sao Paulo” should give you the exact same results.

Good example here: https://twitter.com/FakeUnicode/status/700557070584557568 (the funky fonts are just different Unicode letters that all end up equivalent to #JustUnicodeThings, the same way “ã” is normalized to “a” in search).

I just tried these two in Twitter4J:

/users/search.json?q=S%C3%A3o%20Paulo&count=20&page=1

/users/search.json?q=Sao%20Paulo&count=20&page=1

Both returned the same 20 accounts.


#3

Thanks Igor!

I am running the API calls from within R/RStudio (but I thought that shouldn’t matter, as I am just sending HTTP requests directly to the API?).

It confuses me that you were able to call /users/search.json?q=S%C3%A3o%20Paulo&count=20&page=1 and I wasn’t (and still am not!).

So are you saying that if I normalize Latin characters by replacing them with their non-diacritic forms, the search results will always be exactly the same? Interesting…

I think that solves my immediate problem, but regardless, this still seems weird and is something the Twitter team should be aware of.

Cheers,

Luis


#4

Igor,

You’re right that the two searches:

  • q=S%C3%A3o%20Paulo
  • q=Sao%20Paulo

return the same results (at least in the website search; I cannot run them through the API, as explained above).

However, that doesn’t hold true in general. For instance, the search results for:

  • q=Gazi%20%C3%9Cniversitesi (Gazi Üniversitesi)
  • q=Gazi%20Universitesi

are completely different. The website search function redirects the second search term to the first and so shows the same results initially, but the REST API does not.

I am running my API calls using the httr R package, as GET(<api url>, config(token = <OAuth token>)), following https://github.com/hadley/httr/blob/master/demo/oauth1-twitter.r

I thought that, since I can run all sorts of other API calls without any problems, it would be an API-side bug, but since you were able to run calls with special characters in them, there must be a problem with my setup. Are other users able to run search queries with special characters, such as https://api.twitter.com/1.1/users/search.json?q=S%C3%A3o%20Paulo?
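For completeness, here is roughly how I am making the calls, following the linked demo. Treat it as a sketch rather than my exact script: the keys are elided, and URLencode just stands in for however the %-escaped query gets produced.

library(httr)

# OAuth 1.0a setup as in the httr oauth1-twitter demo (consumer key/secret elided)
app <- oauth_app("twitter", key = "<consumer key>", secret = "<consumer secret>")
token <- oauth1.0_token(oauth_endpoints("twitter"), app)

# A query without special characters works fine and returns the expected accounts
ok <- GET("https://api.twitter.com/1.1/users/search.json?q=Corinthians",
          config(token = token))
status_code(ok)   # 200

# The same call with a special Latin character fails for me with 401 / error 32,
# whether I pre-escape the query (as here) or pass the raw characters
bad <- GET(paste0("https://api.twitter.com/1.1/users/search.json?q=",
                  URLencode("São Paulo", reserved = TRUE)),
           config(token = token))
status_code(bad)  # 401, “Could not authenticate you”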


#5

No idea why using httr would throw those errors, unfortunately.

This might be of some use to you: http://www.r-bloggers.com/icu-unicode-text-transforms-in-the-r-package-stringi/
Twitter uses Normalization Form C (NFC) as far as I know: https://dev.twitter.com/overview/api/counting-characters. Worth checking whether it works if you use stri_trans_nfc("Gazi Üniversitesi") in R?
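Something along these lines (just a sketch of the idea; I haven’t run it myself):

library(stringi)

# Normalize the query to NFC before building the request URL,
# then check whether it differs from what you were sending
q <- "Gazi Üniversitesi"
q_nfc <- stri_trans_nfc(q)
identical(q, q_nfc)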

The thing with Gazi Üniversitesi is different because of the Ü, which should become Ue, so these return the exact same results:

/users/search.json?q=Gazi%20%C3%9Cniversitesi&count=20&page=1
/users/search.json?q=Gazi%20Ueniversitesi&count=20&page=1

while q=Gazi%20Universitesi returned different results (Twitter4J again).


#6

Thanks for the tips; unfortunately, stri_trans_nfc returns the string as-is, with the diaeresis. I tried every transform available in the stringi package and not a single one returns Ue for Ü; the result is always either U or just Ü, unchanged. I think I might flag the issue with the stringi developer.
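For example, these two are representative of everything I tried (with the Latin-ASCII transform standing in for the various transliteration attempts):

library(stringi)

# NFC leaves the precomposed Ü untouched
stri_trans_nfc("Gazi Üniversitesi")                      # "Gazi Üniversitesi"

# Transliterating to ASCII drops the diaeresis instead of expanding it to Ue
stri_trans_general("Gazi Üniversitesi", "Latin-ASCII")   # "Gazi Universitesi"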

It’s also unfortunate that httr can’t handle these requests properly; I simply don’t understand how or why they behave differently from requests without special characters. Maybe there’s some redirecting going on behind the scenes that messes up the authorization?

Anyway, much appreciated.