Reconciling differences with SSL timeouts

Will Platnick

16 Jul, 2012 05:46 PM

Hello,
I am using blitz.io to load test our configuration, which looks like this:
HTTPS -> nginx -> HTTP -> HAProxy -> HTTP -> app servers

Before I used blitz.io, I tested locally using ab from 4 different beefy servers using:
ab -n 100000 -c 1000 "request"

I was able to get 4500 requests/second from ab using this configuration from 4 servers and there were no errors or timeouts in ab.

Testing with blitz.io, I see timeouts start to occur at around 500-1000 users and keep increasing from there. Overall, 10-20% of the requests in my rushes timed out.

What would cause such a huge difference? Our Internet connection is 1 Gig and is nowhere near saturated. When I load tested the HTTP layer directly, I got no timeouts up to 5000 clients.

In the nginx logs, I get tons of 400 errors while blitz.io is running. I get none while ab is running, even though ab puts far more load on the server.

  1. 2 Posted by Michael Smith on 16 Jul, 2012 07:29 PM

    Hi Will,

    There are a few differences between Blitz and ab:

    Blitz has a rate-limiting delay of 1 second at the beginning of each request. This allows the Blitz clients to be used by multiple users so that we can keep our prices low. The delay happens after the TCP connection is created and before the HTTP request is sent. So, the maximum requests/second will be equal to the number of concurrent users.
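The ceiling described above can be written out as simple arithmetic. This is only a sketch under the assumption that each user issues requests one after another and pays the 1 second delay every time; the function name `max_hit_rate` and the example numbers are made up for illustration and are not part of Blitz.

```python
def max_hit_rate(users, response_seconds, delay_seconds=1.0):
    """Upper bound on hits/second when every request is preceded by a fixed delay.

    Assumes each user sends requests strictly one after another, so one
    request costs delay_seconds + response_seconds of that user's time.
    """
    return users / (delay_seconds + response_seconds)


# With an instant response, the ceiling equals the number of users.
ceiling = max_hit_rate(1000, 0.0)
# Any real response time pushes the achievable rate below that ceiling.
with_latency = max_hit_rate(1000, 0.25)
print(ceiling, with_latency)
```

With 1000 users the bound is 1000 hits/second, and a 250 ms response time lowers it to 800.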

    By default, Blitz uses a 1 second timeout on each request (you can increase it with the -T option). If the server takes longer than this amount of time to respond, then we abort the request, close the connection, and count it as a "timeout". I suspect that the SSL handshake is taking longer than one second to complete in some cases, and so the connection is closed before sending the HTTP request.

    Nginx reports a 400 (Bad Request) error when a TCP connection is terminated before the HTTP request is sent. I think this is probably what's happening in your case. It's also possible for 400 errors to happen at the very end of a rush, because we immediately terminate the open connections.

    If you set a longer timeout, then I expect that you'll get fewer timeouts and 400 errors. For example, -T 2000
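One way to check the handshake suspicion from any client machine is to time the TCP connect plus the TLS handshake against the same deadline. The following is a minimal Python sketch, not how Blitz measures it; the function name `tls_handshake_seconds` and its defaults are assumptions for illustration.

```python
import socket
import ssl
import time


def tls_handshake_seconds(host, port=443, timeout=1.0):
    """Return seconds spent on TCP connect plus TLS handshake, or None on timeout.

    Note: the timeout applies to each blocking socket operation, not to the
    total elapsed time, so this is only an approximation of a hard deadline.
    """
    context = ssl.create_default_context()
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake before returning.
            with context.wrap_socket(raw, server_hostname=host):
                return time.monotonic() - start
    except socket.timeout:
        # The connect or the handshake did not finish within the timeout.
        return None
```

Calling it in a loop while the server is under load (for example `tls_handshake_seconds("example.com")`, with the hostname replaced by your own) would show whether handshakes start to exceed one second. Certificate or connection errors other than a timeout are deliberately left to propagate.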

  2. 3 Posted by Will Platnick on 16 Jul, 2012 07:43 PM

    Hi Michael,
    I tried various timeouts, and things improved slightly, but I'm still seeing huge differences.

    Here's the ab output from a server in another datacenter:
    ab -n 100000 -c 2000 "https request"
    Requests per second: 542.71 [#/sec] (mean)
    Time per request: 3685.234 [ms] (mean)
    Time per request: 1.843 [ms] (mean, across all concurrent requests)

    I received no timeouts or errors.

    Here's the output from a server inside the network:

    Requests per second: 1226.08 [#/sec] (mean)
    Time per request: 1631.215 [ms] (mean)
    Time per request: 0.816 [ms] (mean, across all concurrent requests)

    No errors, no timeouts

    SSL handshake does not appear to be taking longer than 1 second anywhere else.

    During the blitz sessions, I was getting huge sections of 400 errors during the rush, not just at the end like I get with ab.
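The ab figures above can be related to the 1 second default timeout mentioned earlier in the thread. ab's mean time per request is roughly concurrency divided by requests per second. The sketch below assumes the second run also used `-c 2000` (its command line is not shown, but the numbers are consistent with that); the names are made up for illustration.

```python
BLITZ_DEFAULT_TIMEOUT_MS = 1000.0  # default timeout quoted earlier in the thread


def mean_request_ms(concurrency, requests_per_second):
    """Mean time per request in ms, as ab derives it from its summary figures."""
    return concurrency / requests_per_second * 1000.0


remote_ms = mean_request_ms(2000, 542.71)   # run from the other datacenter
local_ms = mean_request_ms(2000, 1226.08)   # run from inside the network (assumed -c 2000)

for label, ms in (("remote", remote_ms), ("local", local_ms)):
    over = ms > BLITZ_DEFAULT_TIMEOUT_MS
    print(f"{label}: {ms:.0f} ms mean, exceeds 1 s timeout: {over}")
```

Both means come out well above 1000 ms. ab reported no errors because it kept waiting, but a client that gives up after one second would likely have counted many of these same requests as timeouts.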

  3. 4 Posted by Michael Smith on 18 Jul, 2012 04:58 AM

    Hi Will,

    The timeout (-T) setting tells Blitz how long to wait on each request before aborting. If the request takes longer than this amount of time, then it will immediately close the connection, which may result in a 400 error on your server if the HTTP request was not sent.

    As far as I know, ab doesn't have an equivalent timeout setting. It will continue to wait for each request to finish, no matter how long it takes.

  4. 5 Posted by Nathan Vander Wilt on 05 Mar, 2013 12:05 AM

    I'm seeing a huge difference in results depending on whether I use HTTP or HTTPS to connect. My app is on Heroku.

    No matter how many dynos I had going (1 vs. 4), the results of my rush to HTTPS would switch completely to "errors" (that I wasn't really seeing in my logs) after 416 users.

    When I switched to HTTP, I saw a much more interesting curve: it went up to 919 users (almost 10 second response times) and then I saw timeouts slope up rather than just errors.

    Perhaps Heroku has some HTTPS bottleneck, but it wouldn't be my first guess.

    Any explanation for this, other than that HTTPS connections take longer to set up? Is your TCP connection timeout really controlled by -T? And why do you log TCP timeouts as errors instead of timeouts?

  5. 6 Posted by John on 28 May, 2013 07:02 AM

    +1 Nathan
    I'm noticing the same issue on Heroku for SSL vs No-SSL.

  6. Support Staff 7 Posted by Guilherme Hermeto on 24 Jun, 2013 07:22 PM

    Hi Nathan,

    This problem is not specific to Heroku, although the hit rate on Heroku does seem to have a lower limit.

    We ran some internal tests and compared our engine's HTTPS performance with Apache ab. Here are the results we got:

    blitz engine + nginx (-p 200-200:10):
    HTTP: 12000 hit/sec
    HTTPS: 200 hit/sec

    ab (keep-alive connections) + nginx (ab -k -t 10 -c 200):
    HTTP: 33000 hit/sec
    HTTPS: 6400 hit/sec

    ab (without keep-alive) + nginx (ab -t 10 -c 200):
    HTTP: 9000 hit/sec
    HTTPS: 120 hit/sec

    Since Blitz doesn't keep connections alive, our engines are best compared with Apache ab without keep-alive. The difference in hit rate between HTTP and HTTPS is indeed caused by the SSL handshake, which takes time to complete. However, we are still investigating why our HTTPS hit rate on Heroku is more limited than usual.
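To make the comparison concrete, the HTTP-to-HTTPS slowdown can be computed from the figures quoted in this post. This is only arithmetic on those numbers; the dictionary layout and names are illustrative.

```python
# (HTTP hit/sec, HTTPS hit/sec) as quoted in the post above.
results = {
    "blitz engine": (12000, 200),
    "ab with keep-alive": (33000, 6400),
    "ab without keep-alive": (9000, 120),
}

# How many times slower HTTPS is than HTTP for each client.
slowdown = {name: http / https for name, (http, https) in results.items()}

for name, factor in slowdown.items():
    print(f"{name}: HTTPS is {factor:.1f}x slower than HTTP")
```

The two clients that open a new connection per request slow down by a similar factor (60x and 75x), while keep-alive, which pays for the handshake once per connection, slows down by only about 5x.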

  7. Guilherme Hermeto closed this discussion on 19 Oct, 2013 05:23 PM.