Connections dropping/timing out starting Aug 16/17 after 503s subsided

mobile

#1

We have code in production pushed a week ago that started experiencing 503 errors on the 16th. Once the 503 errors subsided, we started seeing connections being dropped (no response, no response headers, no status). Edit: On Android, we’re receiving no response at all (and no 503 Timeout after 5 seconds).

It happens across various normal & analytics Ads API requests. Wish we could provide more info, but this is all we’ve got.

Info:

Thu Aug 18 2016 13:48:04 GMT-0700 (PDT)
GET https://ads-api.twitter.com/0/accounts/18ce54a29n9/promotable_users?with_deleted=false

Accept:"*/*"
Authorization:"Oauth oauth_consumer_key="~",oauth_nonce="eLOYtiCHTq5LFwyOFpFmk16LUozLKKLN",oauth_token="~",oauth_signature="BustEiTGpkYE0DhVYd9FORgH8dY%3D",oauth_signature_method="HMAC-SHA1",oauth_timestamp="1471553283",oauth_version="1.0""
Connection:"close"
User-Agent:"twitter4rct-js"

Additional Info:
Using react-native fetch on iOS (XMLHttpRequest wrapper over NSURLRequest)
Error Code: -998 (kCFURLErrorUnknown = -998)
No response headers
No response body

It looks like ~20 other requests (same token, same endpoint, different ad account) failed at that time


#2

Did some debugging on the Android client. The errors manifest themselves a bit differently.

On Android (wrapping okhttp3) no error is thrown. Instead, there is no response from Twitter and our internal timeout of 6 seconds is hit (we would expect a 503 at 5 seconds).
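
For anyone trying to tell a client-side timeout apart from a server 503, here is a minimal TypeScript sketch of racing a request against a local timer. The names `Timed` and `withClientTimeout` are hypothetical, and this is not the wrapper used in the app; it only illustrates the idea of an internal timeout like the 6 second one mentioned above.

```typescript
// Hypothetical result type: either the wrapped request settled in time,
// or our own client-side timer fired first (no response from the server).
type Timed<T> =
  | { kind: "settled"; value: T }
  | { kind: "client-timeout"; afterMs: number };

// Race a request promise against a local timer.
// A rejection of the request (e.g. a network error) is passed through unchanged.
function withClientTimeout<T>(request: Promise<T>, timeoutMs: number): Promise<Timed<T>> {
  return new Promise<Timed<T>>((resolve, reject) => {
    const timer = setTimeout(
      () => resolve({ kind: "client-timeout", afterMs: timeoutMs }),
      timeoutMs,
    );
    request.then(
      (value) => {
        clearTimeout(timer);
        resolve({ kind: "settled", value });
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Something like `withClientTimeout(fetch(url), 6000)` would then report `client-timeout` when nothing comes back at all, and `settled` with the response (including a 503) when the server does answer. Note that this does not cancel the underlying request; it only stops waiting for it.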

Again, no client code has changed for a week and these issues started popping up at the same time.


#3

Hi,

We haven’t heard of this sort of issue happening recently, so I’m inclined to think that your request is not making it over the wire or is hitting some network issue on your end (you would receive some sort of response if it were an OAuth problem).

Would you be able to create a standalone unit test that you can debug in a dev environment? You should be able to install and run Twurl inside a terminal and compare the raw headers being sent for requests.

Otherwise you might have to focus on response debugging with debug bridge tooling. You should be able to simulate the raw bytes being sent with an appropriate network wire debugging tool.

It would be a good idea in any case to have Twurl available to test with, because that would rule out any issues with both the API and your particular keys/access permissions.

Thanks,

John


#4

@yagottahavehart have you been able to use any of the advice from @JBabichJapan to continue to troubleshoot? As he stated, we’re not seeing dropped calls or increases in client errors on our side. Are these calls being made server-to-server, or are they being made from a mobile app directly to the API?


#5

@jaakkosf
The headers from Twurl and from our client are matching up (no gzip, but enabling it didn’t seem to fix the problem).
We have not had a chance to debug the raw bytes being sent/received yet.

We’re going straight from Mobile client -> API using a React Native Fetch wrapper over NSURLRequest & OkHttp.
We could not repro it using Twurl from the command line, a Twitter Node client, or a client-side Java library, so it appears the solution we were using is alone in being broken (good for Twitter, bad for us haha).
The solution did work flawlessly for a couple months, so we’re hoping it’s just a small change needed on our side to bring it back up to speed.


#6

For checking raw bytes - Have you tried a network socket debugging tool like https://www.charlesproxy.com/ ? I imagine that as long as you can debug or log the correct bytes it should be possible to figure out what’s going on.

Is there a pattern to the connections which succeed? Have you tried hitting the REST API instead of Ads API?
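
One way to look for such a pattern is to log every request outcome and compare failure rates per ad account or per endpoint. A rough TypeScript sketch follows; the `RequestLog` fields and the `tallyBy` helper are made up for illustration and do not come from any real client or from the app in this thread.

```typescript
// Hypothetical log record; field names are illustrative only.
interface RequestLog {
  endpoint: string;   // e.g. "promotable_users"
  accountId: string;  // ad account the request was made for
  ok: boolean;        // true if a usable response body arrived
}

interface Tally {
  total: number;
  failed: number;
  failureRate: number; // failed / total
}

// Group log records by an arbitrary key and compute the failure rate per group.
function tallyBy(
  logs: RequestLog[],
  keyOf: (log: RequestLog) => string,
): Record<string, Tally> {
  const out: Record<string, Tally> = {};
  for (const log of logs) {
    const key = keyOf(log);
    const tally = out[key] || { total: 0, failed: 0, failureRate: 0 };
    tally.total += 1;
    if (!log.ok) tally.failed += 1;
    tally.failureRate = tally.failed / tally.total;
    out[key] = tally;
  }
  return out;
}

// Made-up sample data, only to show the call shape.
const sampleLogs: RequestLog[] = [
  { endpoint: "promotable_users", accountId: "acct-a", ok: true },
  { endpoint: "promotable_users", accountId: "acct-a", ok: false },
  { endpoint: "promotable_users", accountId: "acct-b", ok: true },
];
const byAccount = tallyBy(sampleLogs, (log) => log.accountId);
```

Grouping the same logs by `endpoint`, or by a timestamp bucket, would show whether failures cluster around particular calls or bursts of concurrent requests.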

I think at this point we are curious what is going on and how to prevent this sort of problem from occurring with community libraries and so on, so we are willing to look at anything you have for us to see offline if you continue to be stuck on this issue for much longer.


#7

@JBabichJapan I did spend some time trying to debug with Charles, but didn’t continue in that direction because it could not proxy the SSL requests (due to certificate pinning).

I was able to peek at the raw requests via CFNetwork Diagnostics on iOS:
It appears to actually receive the correct number of raw bytes and the correct headers, but the error still occurs (and it’s not a helpful one) before the client code receives the response.

Received: request GET https://ads-api.twitter.com/0/accounts/18ce53v8qej/promotable_users?with_deleted=false HTTP/1.1
	         Response: HTTP/2.0 200

Response Error
Request: <CFURLRequest 0x7fa522b51710 [0x10e8d2a40]> {url = https://ads-api.twitter.com/0/accounts/18ce53v8qej/promotable_users?with_deleted=false, cs = 0x0}
  Error: Error Domain=kCFErrorDomainCFNetwork Code=-998 "(null)" UserInfo={_kCFStreamErrorCodeKey=0, _kCFStreamErrorDomainKey=2}
} [3:20570]

#8

Some more debugging information: @jaakkosf @JBabichJapan

For iOS, the error appears to happen using NSURLSession & NSURLSessionDataTask with a standard NSMutableURLRequest.

NSURLSession *session = [NSURLSession sharedSession];
  NSURLSessionDataTask *task = [session dataTaskWithRequest:request
                                         completionHandler:
                               ^(NSData *data, NSURLResponse *response, NSError *error) {
                                 // ...
                               }];
  [task resume]; // data tasks start suspended

This simplified example will throw the unknown error pasted above when repeated requests are made.

However, using NSURLConnection w/ sendAsynchronousRequest (deprecated iOS 9.0) does work. EDIT: It falls victim to the same error as described in a later post
https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSURLConnection_Class/

NSOperationQueue *queue = [[NSOperationQueue alloc] init]; // or [NSOperationQueue mainQueue] / [NSOperationQueue currentQueue]
  [NSURLConnection sendAsynchronousRequest:request queue:queue completionHandler:^(NSURLResponse *response, NSData *data, NSError *error)
  {
    // ...
  }];

Note: The URLRequest is just a URL w/ some manually generated oauth headers


#9

Can you check if the issue is related to encoding a la this thread on SO? http://stackoverflow.com/questions/35580600/getting-json-data-using-nsurlsession

I am kind of seeing enough smoke for there to be a fire with this issue doing searches, but unfortunately don’t see a clear pattern for what the root cause is. If the call is actually async but you’re treating it like sync, that could explain getting null even though it’s HTTP 200 OK.


#10

Checked on the encoding. Everything looks OK (and 95% of identical requests are working).

One amendment to my previous post is that I may have been too hasty with calling the ‘working’ solution ‘working.’ It appears to occasionally also respond with a 200 and no response data (though this time no error either).

EDIT: Confirmed we’re seeing Status code 200, no error, and 0 bytes read for:

NSOperationQueue *queue = [NSOperationQueue currentQueue];
  [NSURLConnection sendAsynchronousRequest:request queue:queue completionHandler:^(NSURLResponse *response, NSData *data, NSError *error)
  {
    NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *) response;
    NSString *statusCode = [NSString stringWithFormat:@"%ld", (long)[httpResponse statusCode]]; // == @"200"
    if (error == nil) // enters here
    {
      NSString *str = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
      if ([str isEqualToString:@""]) {
        NSLog(@"sad face %@", str); // This hits
      }
    }
  }];

(Note: It still happens much less frequently than the other method with NSURLSession, in particular when there are many concurrent requests)

Will continue to debug & let you know what we find.

Thanks for looking into this


#11

Andy, how’s this debugging going on your end?


#12

@jaakkosf For the time being we threw in some retries so we can get something in place to fix our production apps. Since confirming neither attempted work-around was fully successful, we haven’t had the resources to debug it more.
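
For readers hitting the same symptoms, here is a minimal TypeScript sketch of that kind of retry wrapper. It is not the production code from this thread. The `Outcome` shape, the helper names, and the backoff numbers are all assumptions, and it treats a 200 with an empty body as a failure on the assumption that the endpoints being called always return a JSON body.

```typescript
// Hypothetical shape of one attempt's outcome, modelled on the symptoms reported above.
interface Outcome {
  status?: number;    // HTTP status, if any response arrived
  bodyLength: number; // bytes of response body actually read
  error?: string;     // client-side error text, e.g. "kCFURLErrorUnknown (-998)"
}

// Retry on: a client error, no status at all, a 503, or a 200 with 0 bytes read.
function isRetryable(o: Outcome): boolean {
  if (o.error !== undefined) return true;
  if (o.status === undefined) return true;
  if (o.status === 503) return true;
  return o.status === 200 && o.bodyLength === 0;
}

// Exponential backoff with a cap; attempt is 0-based. The numbers are illustrative.
function backoffMs(attempt: number, baseMs: number = 250, capMs: number = 4000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Call attempt() until it yields a non-retryable outcome or maxAttempts is reached.
// Returns the last outcome either way, so the caller can still inspect a final failure.
async function withRetries(
  attempt: () => Promise<Outcome>,
  maxAttempts: number = 3,
  delay: (attempt: number) => number = backoffMs,
): Promise<Outcome> {
  let last: Outcome = { bodyLength: 0, error: "not attempted" };
  for (let i = 0; i < maxAttempts; i++) {
    last = await attempt();
    if (!isRetryable(last)) return last;
    if (i < maxAttempts - 1) await sleep(delay(i));
  }
  return last;
}
```

Blind retries like this are only safe for idempotent requests such as the GET calls in this thread; retrying a POST that may have succeeded server-side could create duplicates.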

I will hopefully have some time later this week to create a standalone project that reproduces it that I can send over.
I should also have some time to dig deeper on Android to help paint a fuller picture.

Was there anything that went out in the Aug 16 fix, or anything funky with the logs tied to the originally posted request that you think might have led to this sort of situation?


#13

The Aug 16th fix only impacted the Tweet endpoint, and specifically just toggled the default encoding behavior for text passed that had been deployed for many months. There was an incident that caused a backend service to time out for a period, but again, no deploys on the Ads API in or around that date.


#14

Hmm yeah, was worried that was the case. Thought I’d mention it in case there might be a server tweak related to the backend service timeout.
I’ll see if I can expedite the test project to keep the ball moving on solving this one.

Thanks for helping,
Andy


#15

This is just another guess, but having some 503s immediately followed by connection issues somehow smells like a cache issue to me. If you have some server-side caching or functions available to flush the cache, it might be worth a try just to eliminate more possibilities. I would also just try it again since time has passed and I’m curious if it still shows the exact same behavior!


#16

Hey Andy,

How goes the debugging on this? Let us know if we can do anything to help this week.

Thanks,

John


#17

@JBabichJapan
Thanks for checking in.

Unfortunately, we had to pause development on it for the #GoLive submission & support. We have retries implemented as an interim solution, but we’re definitely going to revisit it soon.

Some potentially good news:
Looking at some metrics now, it appears that the retries we have in place haven’t been needed since at least early this week (could be longer). It looks like whatever was causing the issue may have been fixed/tweaked on your end. Note: This is for iOS only (haven’t taken a look at Android).

I’ll do some testing and get back to you by early next week on the current status of the bug & how it’s affecting us.

Thanks again,
Andy