A new crawler that is not Twitterbot

media
cards

#1

Recently we started to get traffic to our shared resources from a new crawler that identifies itself not as Twitterbot but with a random user agent:

::ffff:10.0.3.70 - - [06/Sep/2017:13:42:14 +0000] "GET /nt8T HTTP/1.1" 301 53 "-" "Twitterbot/1.0"
::ffff:10.0.3.70 - - [06/Sep/2017:13:42:14 +0000] "GET /nt8T HTTP/1.1" 301 53 "-" "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en-US) AppleWebKit/534.8+ (KHTML, like Gecko) Version/6.0.0.701 Mobile Safari/534.8+"

As you can see, both requests come from the same IP at the same time. It happens every time we share a new resource, and it seems to be new, undocumented behaviour (https://dev.twitter.com/cards/getting-started#crawling).

We need to be able to identify the crawler to avoid counting its visits. Until now that was easy, but with random user agents we’ll need to use an IP blacklist (we can’t use robots.txt, because we don’t want to block the crawler).

Is this new behaviour intended, or is it just a test? Is there any safe way to identify the crawler now?

Thanks in advance


#2

The IP address indicated is not from our crawler (our IP ranges are listed in the cards documentation and in this announcement). Seems very strange!


#3

That’s an internal network IP address. Have you got a load balancer or a caching layer between your servers and the internet that could be masking the true client IP?

The 10.x.x.x range is reserved for private (internal) networks (RFC 1918).
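
This is easy to check with a few lines of Python. The sketch below uses only the standard `ipaddress` module; the helper name `unwrap` is made up for illustration. The logged address `::ffff:10.0.3.70` is an IPv4-mapped IPv6 address, so it is unwrapped first and then tested for being private.

```python
import ipaddress

def unwrap(addr: str):
    """Return the embedded IPv4 address of an IPv4-mapped IPv6 address,
    or the parsed address itself if it is not IPv4-mapped."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return ip.ipv4_mapped
    return ip

logged = unwrap("::ffff:10.0.3.70")
print(logged, logged.is_private)            # 10.0.3.70 True
print(unwrap("199.16.157.182").is_private)  # False
```

If a load balancer or proxy is in front of the servers, the original client address is often passed along in a header such as `X-Forwarded-For`, but that depends on how the proxy is configured.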


#4

Good spot @richardhyland! :slight_smile:


#5

That’s totally true!

I’m going to look for the real IPs and will come back if they are still from Twitter.

Thanks,


#6

These are the real IPs:

199.16.157.182 - - [08/Sep/2017:10:50:51 +0000] "GET /nH1Z HTTP/1.1" 301 74 "-" "Twitterbot/1.0"
52.21.176.42   - - [08/Sep/2017:10:50:51 +0000] "GET /nH1Z HTTP/1.1" 301 74 "-" "Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) Presto/2.5.25 Version/10.54"

So the Twitterbot request is within the documented range, but we have no idea about the second one. It appears to be an Amazon IP, but it is not inside our network.
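
A check like this can be scripted against the access log. The following is a minimal Python sketch, not a definitive implementation: the two CIDR blocks are an assumption based on what the cards documentation listed for the crawler and should be verified against the current documentation, and the names `TWITTER_RANGES`, `LOG_RE` and `classify` are hypothetical.

```python
import ipaddress
import re

# Assumption: crawler IP ranges as listed in the cards documentation;
# verify them against the current documentation before relying on this.
TWITTER_RANGES = [
    ipaddress.ip_network("199.16.156.0/22"),
    ipaddress.ip_network("199.59.148.0/22"),
]

# Matches the access log lines shown above:
# ip, two placeholder fields, [timestamp], "request", status, size, "referer", "user agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"[^"]*"\s+\d{3}\s+\S+\s+'
    r'"[^"]*"\s+"(?P<agent>[^"]*)"$'
)

def classify(line: str) -> str:
    """Label a log line by whether its IP and user agent look like Twitterbot."""
    m = LOG_RE.match(line)
    if m is None:
        return "unparsed"
    ip = ipaddress.ip_address(m.group("ip"))
    in_range = any(ip in net for net in TWITTER_RANGES)
    claims_bot = m.group("agent").startswith("Twitterbot")
    if in_range and claims_bot:
        return "twitterbot"
    if in_range:
        return "twitter range, other agent"
    if claims_bot:
        return "claims Twitterbot, outside range"
    return "unknown"

lines = [
    '199.16.157.182 - - [08/Sep/2017:10:50:51 +0000] "GET /nH1Z HTTP/1.1" 301 74 "-" "Twitterbot/1.0"',
    '52.21.176.42   - - [08/Sep/2017:10:50:51 +0000] "GET /nH1Z HTTP/1.1" 301 74 "-" "Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) Presto/2.5.25 Version/10.54"',
]
for line in lines:
    print(classify(line))
# twitterbot
# unknown
```

Checking the IP range as well as the user agent means a request is only treated as the crawler when both agree, so the second request above stays in the "unknown" bucket.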

We just want to confirm that this is not a new service from Twitter…


#7

It is not anything that we are aware of.


#8

Thank you, we’ll keep investigating.


#9

Hey Andy, here I am with new info:

To make sure that the connection doesn’t come from any service in our system, we ran a test using a requestb.in link and traced the connections.

Steps to reproduce:

  1. Create a new link in requestb.in
  2. Share that link via DM, don’t open it
  3. Inspect the link requests.

You will see that there are 2 connections as I explained before. In my case:

Cf-Connecting-Ip: 52.21.137.163
Cf-Ipcountry: US
Cf-Visitor: {"scheme":"https"}
Accept-Encoding: gzip
Accept: */*
X-B3-Spanid: e0ce6366661dc41d
Connection: close
X-B3-Traceid: 06c63f692e6697ac
Total-Route-Time: 0
Cache-Control: max-age=259200
X-B3-Flags: 2
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1
X-B3-Sampled: false
Cf-Ray: 39b2305c9fc60f03-IAD
X-B3-Parentspanid: a61f820f6d558b70
Host: requestb.in
Via: 1.1 vegur
Connect-Time: 0
X-Request-Id: 281660ca-3d11-4214-9e66-c085f69be538

Cf-Connecting-Ip: 199.16.157.182
Cf-Ipcountry: US
Cf-Ray: 39b230596e0e3894-ATL
Cf-Visitor: {"scheme":"https"}
Accept-Encoding: gzip
Accept: */*
X-B3-Spanid: 20628b0e47b43021
Host: requestb.in
X-B3-Traceid: 004e01df00604491
Total-Route-Time: 0
X-B3-Flags: 2
User-Agent: Twitterbot/1.0
X-B3-Sampled: false
Connection: close
X-B3-Parentspanid: 0ad57fbe1d658aea
Via: 1.1 vegur
Connect-Time: 1
X-Request-Id: 6f1ff792-671b-4328-bf39-d924b0a9e26e

This test rules out a problem on our side and points to some misbehaviour in the Twitterbot crawler.


#10

I’m not sure I come to the same conclusion - the second IP is not owned by the Twitter network, and I also note that in the three different posts above, the “extra” request has a different user-agent string every time. I’m still at a loss to explain how this is happening.


#11

Hi,
To add some info to what @ivanguardado posted, I can confirm that in all the tests we have performed, the user-agent of the extra request seems totally randomized, which reminds me of the behaviour some crawlers use to bypass protections and pretend to be legit requests.
Since this reproduces with an external tool (requestbin), it at least seems to rule out an issue with our infrastructure.


#12

Absolutely - to be clear I’m not suggesting that this is an infrastructure issue on your side or that your tools are not correctly reporting the behaviour… but I do not believe that Twitter’s network or services are directly responsible for the additional requests. This leaves us with a mystery :eyes:


#13

Hi!
Just wanted to ping this thread. Any new info about this issue?

Thanks!


#14

Not from our side.


#15

We’re seeing the same thing happen – we send links via Twitter DM and they are being requested without user action from a bot that’s using a morphing UserAgent (initially it was Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) Presto/2.5.25 Version/10.54), but it seems like we can’t rely on that).

Like @ivanguardado we’re seeing the request come from an AWS IP address block: 52.21.178.185


#16

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.