Preventing Twitterbot from accessing a website, and UTM parameters


#1

I have a Twitter application whose users share links to my webpage in their tweets. It seems that bots follow these links, and some of them cause high bandwidth usage. Most of them don’t bring me any hits. So I want to disallow them with robots.txt or an .htaccess file.

My concern is whether banning Twitterbot will cause problems. Who owns this bot: Twitter.com or another website? What would be the drawbacks of disallowing it?


#2

You can use robots.txt to tell Twitterbot not to crawl your pages, but if you do, you won’t be able to use Twitter Cards and the tweet button counts might not be updated.
You can find some info at https://dev.twitter.com/docs/cards/getting-started#crawling
But Twitterbot won’t crawl the same page more than once a week, and it won’t download images, so I’m surprised it uses a lot of bandwidth.


#3

Vincent, thank you for the reply.

The problem is that I use UTM parameters in my URLs: utm_source, utm_medium, utm_campaign and utm_content.
So Twitterbot treats every UTM-tagged URL as a new URL and crawls each one.
For example, Twitterbot crawls these URLs within the same hour:

http://example.com/?utm_source=aaa1&utm_medium=twitter&utm_campaign=mycampaign&utm_content=somekey1
http://example.com/?utm_source=aaa2&utm_medium=twitter&utm_campaign=mycampaign&utm_content=somekey2
http://example.com/?utm_source=aaa3&utm_medium=twitter&utm_campaign=mycampaign&utm_content=somekey3

Is it possible to tell Twitterbot to treat these URLs as duplicates of the same page? With robots.txt, or some other way?


#4

I’m not completely sure, but I thought that the UTM parameters were ignored, meaning that the 3 URLs above should result in only one page request. Does your web server log show requests from Twitterbot with all the different UTM parameters?
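
To make "ignoring the UTM parameters" concrete, here is a minimal Python sketch of that kind of normalization. It only illustrates the idea; nothing in this thread confirms that Twitterbot does this, and the helper name strip_utm is made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url: str) -> str:
    """Return url without any query parameter whose name starts with 'utm_'.

    Hypothetical helper for illustration; not Twitterbot's documented behavior.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "http://example.com/?utm_source=aaa1&utm_medium=twitter&utm_campaign=mycampaign&utm_content=somekey1",
    "http://example.com/?utm_source=aaa2&utm_medium=twitter&utm_campaign=mycampaign&utm_content=somekey2",
    "http://example.com/?utm_source=aaa3&utm_medium=twitter&utm_campaign=mycampaign&utm_content=somekey3",
]

# A crawler that normalized URLs this way would see a single page.
canonical = {strip_utm(u) for u in urls}
print(canonical)
```

If a crawler keyed its "already crawled" check on the normalized URL, the three example URLs would count as one page. The log is the way to find out whether that actually happens.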


#5

I just took the latest access.log from my server and pasted the last requests from Twitterbot:

199.16.156.124 - - [14/Jan/2014:09:42:29 +0200] "GET /?utm_source=Celikoza&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.42 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:44:50 +0200] "GET /?utm_source=demokrat16&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.44 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:45:11 +0200] "GET /?utm_source=hoyranreklam&utm_medium=twitter&utm_campaign=notif1&utm_content=14.01.2014.09.45 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:46:43 +0200] "GET /?utm_source=cymrt&utm_medium=twitter&utm_campaign=unlock&utm_content=09.12.2013.20.32 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.124 - - [14/Jan/2014:09:47:09 +0200] "GET /?utm_source=OsmanlOsman1AK&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.47 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.125 - - [14/Jan/2014:09:49:24 +0200] "GET /robots.txt HTTP/1.1" 200 933 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:49:24 +0200] "GET /?utm_source=Karizmatik____&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.49 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.59.148.209 - - [14/Jan/2014:09:49:51 +0200] "GET /robots.txt HTTP/1.1" 200 933 "-" "Twitterbot/1.0"
199.59.148.209 - - [14/Jan/2014:09:49:51 +0200] "GET /robots.txt HTTP/1.1" 200 933 "-" "Twitterbot/1.0"
199.59.148.211 - - [14/Jan/2014:09:49:51 +0200] "GET /?utm_source=denizdogan22&utm_medium=twitter&utm_campaign=teasetweet&utm_content=14.01.2014.09.49 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.59.148.209 - - [14/Jan/2014:09:49:51 +0200] "GET /?utm_source=denizdogan22&utm_medium=twitter&utm_campaign=teasetweet&utm_content=14.01.2014.09.49 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.124 - - [14/Jan/2014:09:50:04 +0200] "GET /?utm_source=ibrahim2122&utm_medium=twitter&utm_campaign=notif1&utm_content=14.01.2014.09.50 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.124 - - [14/Jan/2014:09:50:54 +0200] "GET /eng/?utm_source=Brazuqa&utm_medium=twitter&utm_campaign=logs&utm_content=14.01.2014.09.50 HTTP/1.1" 200 8020 "-" "Twitterbot/1.0"
199.16.156.125 - - [14/Jan/2014:09:51:05 +0200] "GET /?utm_source=hazalkaya_news&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.51 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:52:20 +0200] "GET /?utm_source=_bertaraf&utm_medium=twitter&utm_campaign=support&utm_content=15.12.2013.00.43 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.125 - - [14/Jan/2014:09:53:16 +0200] "GET /?utm_source=emine19971905&utm_medium=twitter&utm_campaign=unlock&utm_content=14.12.2013.15.48 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:54:06 +0200] "GET /?utm_source=kotukizcandy&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.53 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.125 - - [14/Jan/2014:09:54:59 +0200] "GET /?utm_source=nsoyat&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.54 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.59.148.211 - - [14/Jan/2014:09:55:06 +0200] "GET /?utm_source=xahsenx&utm_medium=twitter&utm_campaign=notif1&utm_content=14.01.2014.09.55 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.125 - - [14/Jan/2014:09:55:38 +0200] "GET /?utm_source=kadircavgin&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.55 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:56:15 +0200] "GET /?utm_source=mrtozil&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.56 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.125 - - [14/Jan/2014:09:56:42 +0200] "GET /?utm_source=siverekname1&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.56 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.59.148.211 - - [14/Jan/2014:09:57:15 +0200] "GET /robots.txt HTTP/1.1" 200 933 "-" "Twitterbot/1.0"
199.59.148.211 - - [14/Jan/2014:09:57:15 +0200] "GET /eng/?utm_source=denizdogan22&utm_medium=twitter&utm_campaign=teasetweet&utm_content=14.01.2014.09.56&utm_lang=en HTTP/1.1" 200 8020 "-" "Twitterbot/1.0"
199.59.148.209 - - [14/Jan/2014:09:57:16 +0200] "GET /eng/?utm_source=denizdogan22&utm_medium=twitter&utm_campaign=teasetweet&utm_content=14.01.2014.09.56&utm_lang=en HTTP/1.1" 200 8020 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:58:48 +0200] "GET /?utm_source=sezercanbar&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.58 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:09:59:39 +0200] "GET /?utm_source=RTsupermarket&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.59 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.124 - - [14/Jan/2014:10:00:04 +0200] "GET /?utm_source=E_EsraG&utm_medium=twitter&utm_campaign=notif1&utm_content=14.01.2014.10.00 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.124 - - [14/Jan/2014:10:01:02 +0200] "GET /?utm_source=AzizoluBerrak&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.10.00 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:10:01:48 +0200] "GET /?utm_source=_ffirst_&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.10.01 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.59.148.210 - - [14/Jan/2014:10:02:28 +0200] "GET /?utm_source=snsz_olmazz&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.10.02 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
199.16.156.126 - - [14/Jan/2014:10:02:57 +0200] "GET /?utm_source=Hayaller_kenti1&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.10.02 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"
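
To put numbers on a log like this, a small Python sketch can group Twitterbot requests by path (query string dropped) and add up the response sizes. The regular expression assumes the combined log format seen in the entries above; the names LOG_RE and summarize are made up for the example, and the sample reuses three entries from the log.

```python
import re
from collections import Counter

# Assumes the combined log format used in the entries above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(log_text, agent_prefix="Twitterbot"):
    """Count requests and response bytes per path for one user agent."""
    hits = Counter()
    bytes_sent = Counter()
    for m in LOG_RE.finditer(log_text):
        if not m["agent"].startswith(agent_prefix):
            continue
        path = m["target"].split("?", 1)[0]  # drop the query string
        hits[path] += 1
        if m["size"] != "-":
            bytes_sent[path] += int(m["size"])
    return hits, bytes_sent


sample = (
    '199.16.156.124 - - [14/Jan/2014:09:42:29 +0200] "GET /?utm_source=Celikoza&utm_medium=twitter&utm_campaign=unlock&utm_content=14.01.2014.09.42 HTTP/1.1" 200 11255 "-" "Twitterbot/1.0"\n'
    '199.16.156.125 - - [14/Jan/2014:09:49:24 +0200] "GET /robots.txt HTTP/1.1" 200 933 "-" "Twitterbot/1.0"\n'
    '199.16.156.124 - - [14/Jan/2014:09:50:54 +0200] "GET /eng/?utm_source=Brazuqa&utm_medium=twitter&utm_campaign=logs&utm_content=14.01.2014.09.50 HTTP/1.1" 200 8020 "-" "Twitterbot/1.0"\n'
)

hits, bytes_sent = summarize(sample)
for path, count in hits.most_common():
    print(path, count, bytes_sent[path])
```

Running this over the whole access.log would show how many times the same page is fetched under different UTM tags and roughly how much bandwidth that adds up to.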

#6

Any news, please?


#7

Sorry, I need time to investigate.
In the meantime, the only workarounds I can offer are either to prevent Twitterbot from crawling your pages using robots.txt or to generate fewer distinct URLs…


#8

Thank you, Vincent.

Normally I use a Twitter card for my Android application.
Yesterday I added this to my robots.txt:

User-agent: Twitterbot
Allow: /index.html
Disallow: /

Today I see that tweets no longer include the Twitter card for my app.
So I reverted robots.txt.
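
One possible reading of why the card disappeared: the links in the tweets point to / and /eng/ with a query string, not to /index.html, so the Allow line never matches them. The Python sketch below checks the rules above with the standard library's urllib.robotparser. This parser is only a stand-in; Twitterbot's own rule matching may differ.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above.
ROBOTS_TXT = """\
User-agent: Twitterbot
Allow: /index.html
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# URLs shaped like the ones in the access log (example.com is a placeholder).
candidates = [
    "http://example.com/index.html",
    "http://example.com/?utm_source=aaa1&utm_medium=twitter",
    "http://example.com/eng/?utm_source=aaa1&utm_medium=twitter",
]

for url in candidates:
    print(url, parser.can_fetch("Twitterbot", url))
```

With this parser, only /index.html is allowed. The UTM-tagged / and /eng/ URLs fall under Disallow: /, which would be consistent with the card no longer being shown.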


#9

I just found another solution that might work better: put your UTM parameters after a # instead of after a ?.
Google Analytics supports this use case: search for _setAllowAnchor to see how to enable it.
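
A rough Python sketch of what that change looks like on the URL side. The function names are made up for the example. The one thing it relies on is that the part after # (the fragment) is not sent to the server in the HTTP request; whether Twitterbot then treats all the tagged links as the same page is not confirmed in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def utm_query_to_fragment(url: str) -> str:
    """Move utm_* parameters from the query string into the fragment.

    Hypothetical helper; any existing fragment is replaced when UTM
    parameters are present.
    """
    parts = urlsplit(url)
    params = parts.query.split("&") if parts.query else []
    utm = [p for p in params if p.startswith("utm_")]
    other = [p for p in params if not p.startswith("utm_")]
    fragment = "&".join(utm) if utm else parts.fragment
    return urlunsplit(parts._replace(query="&".join(other), fragment=fragment))


def request_target(url: str) -> str:
    """Path plus query as it appears in the request line (no fragment)."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


tagged = utm_query_to_fragment(
    "http://example.com/?utm_source=aaa1&utm_medium=twitter"
)
print(tagged)
print(request_target(tagged))
```

Every tagged link then requests the same path from the server, while the UTM values stay in the URL for the analytics script in the browser to read.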


#10