Linux network troubleshooting with the weirdest case scenario


So my server crashed the other day, for the third time in a short period. Hopefully it won’t happen again.

However, after rebooting the server following a few hours of downtime, I had unreliable access to the VizReader application. A disaster, as I’ve come to rely on it a lot. HTTP requests to it would succeed at random, with roughly 40% of them being dropped somewhere. But where and why?

Must be the browser, I thought, and switched from FF to Chrome, but no, same problem. My second thought was that there was something wrong with my router, so I switched to a backup but still experienced the problem. By coincidence I noticed that everything was OK if I accessed the site through Windows, but not in Ubuntu! WTF?

Tip: netstat -rn is a good way of finding the IP of your router if you’ve forgotten it; it is the gateway listed for the default route (destination 0.0.0.0).
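
If you would rather pull the gateway out in a script than read it by eye, here is a minimal Python sketch. It assumes the usual Linux netstat -rn layout (destination in the first column, gateway in the second). The sample table and the function name default_gateway are made up for illustration, not captured from my machine.

```python
def default_gateway(netstat_output):
    """Return the gateway of the default route from `netstat -rn` text, or None."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # The default route is listed as 0.0.0.0 (with -n) or "default" (without).
        if len(fields) >= 2 and fields[0] in ("0.0.0.0", "default"):
            return fields[1]
    return None


# Hypothetical sample in the Linux net-tools layout (an assumption, not real output).
SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG        0 0          0 eth0
"""

print(default_gateway(SAMPLE))  # → 192.168.1.1
```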

It was time for some network debugging. tshark confirmed the behavior reported by Firebug. Trying to access the server via ssh would report back “no route to host” in the same random fashion: sometimes it would work, sometimes not. I also got the same error message when trying to download the first page with wget.

Pinging vizreader.com resulted in the following:

henrik@henrik-laptop:~$ ping vizreader.com
PING vizreader.com (208.94.245.186) 56(84) bytes of data.
From 154.54.42.65 icmp_seq=1 Time to live exceeded
From 154.54.42.65 icmp_seq=2 Time to live exceeded
From 154.54.42.65 icmp_seq=3 Time to live exceeded
From 154.54.42.65 icmp_seq=4 Time to live exceeded
^CFrom 154.54.42.65 icmp_seq=6 Time to live exceeded

--- vizreader.com ping statistics ---
6 packets transmitted, 0 received, +5 errors, 100% packet loss, time 22372ms
, pipe 2

Weird to say the least. The host at 154.54.42.65 tells me that the TTL of my packets has expired, yet this is something I get back virtually instantly! And how come 60% of my HTTP requests are getting through just fine if the TTL of every packet I send to the server expires? So I tried it in Windows and got the following:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Administrator>ping vizreader.com

Pinging vizreader.com [208.94.245.186] with 32 bytes of data:

Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.

Ping statistics for 208.94.245.186:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\Documents and Settings\Administrator>

Basically the same problem reported back with different wording, but observe the conclusion: Lost = 0 (0% loss). WTF again? Judging by the output, Windows counts the “TTL expired” messages as received replies, whereas Linux counts them as errors. Maybe that has something to do with the fact that the site is working flawlessly in Windows but not in Ubuntu?
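
To take the summary lines out of the equation, here is a rough Python sketch that counts genuine echo replies separately from ICMP errors. The function name summarize_ping is my own, the two samples are trimmed from the outputs above, and the patterns for a successful reply are the usual Linux and Windows formats as I know them, so treat it as a sketch rather than a complete parser.

```python
import re

# Wording used by Linux and Windows ping for an expired TTL.
ERROR_MARKERS = ("Time to live exceeded", "TTL expired in transit")


def summarize_ping(output):
    """Return (echo_replies, ttl_errors) counted from raw ping output."""
    replies = errors = 0
    for line in output.splitlines():
        if any(marker in line for marker in ERROR_MARKERS):
            errors += 1
        # Linux: "64 bytes from ..."; Windows: "Reply from ...: bytes=32 ..."
        elif re.search(r"\d+ bytes from |Reply from .*bytes=", line):
            replies += 1
    return replies, errors


LINUX_PING = """\
PING vizreader.com (208.94.245.186) 56(84) bytes of data.
From 154.54.42.65 icmp_seq=1 Time to live exceeded
From 154.54.42.65 icmp_seq=2 Time to live exceeded
From 154.54.42.65 icmp_seq=3 Time to live exceeded
From 154.54.42.65 icmp_seq=4 Time to live exceeded
^CFrom 154.54.42.65 icmp_seq=6 Time to live exceeded
"""

WINDOWS_PING = """\
Pinging vizreader.com [208.94.245.186] with 32 bytes of data:

Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
"""

print(summarize_ping(LINUX_PING))    # → (0, 5)
print(summarize_ping(WINDOWS_PING))  # → (0, 4)
```

Counted this way, neither machine got a single real reply from 208.94.245.186, so the “0% loss” in Windows says more about how the summary is computed than about the network.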

A dig didn’t help either:

henrik@henrik-laptop:~$ dig vizreader.com

; <<>> DiG 9.6.1-P2 <<>> vizreader.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27649
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;vizreader.com.			IN	A

;; ANSWER SECTION:
vizreader.com.		17832	IN	A	208.94.245.186

;; Query time: 71 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 18 12:24:10 2010
;; MSG SIZE  rcvd: 47

The above queries my router, which again confirms that the problem is not on my end. Let’s try the server that is reporting TTL failure in the pings:

henrik@henrik-laptop:~$ dig @154.54.42.65 vizreader.com

; <<>> DiG 9.6.1-P2 <<>> @154.54.42.65 vizreader.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

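Looks like we’ve found the culprit (although, to be fair, a router in the middle of the path would not be expected to answer DNS queries in the first place). Let’s check the road from here to vizreader.com with traceroute: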

 1  192.168.1.1 (192.168.1.1)  2.445 ms  2.732 ms  3.140 ms
 2  118.173.72.1.adsl.dynamic.totbb.net (118.173.72.1)  31.726 ms  33.591 ms  35.774 ms
 3  172.17.24.253 (172.17.24.253)  40.510 ms  42.883 ms  44.117 ms
 4  118.174.235.137.static.totbb.net (118.174.235.137)  47.023 ms  48.496 ms  50.496 ms
 5  203.114.127.49.static.totisp.net (203.114.127.49)  61.041 ms  62.956 ms  64.684 ms
 6  te-1-3.kkm-gw-01.totisp.net (203.113.127.29)  66.867 ms  41.739 ms  42.853 ms
 7  203.190.251.73 (203.190.251.73)  44.781 ms  46.953 ms  48.383 ms
 8  203.190.250.49 (203.190.250.49)  50.587 ms  52.494 ms  54.236 ms
 9  203.190.251.226 (203.190.251.226)  72.089 ms  73.625 ms  74.873 ms
10  203.190.251.37 (203.190.251.37)  78.090 ms  79.513 ms  81.031 ms
11  ET-Network.asianetcom.net (202.147.17.253)  294.653 ms  295.750 ms  297.466 ms
12  gi3-0-0.cr3.hkg3.asianetcom.net (203.192.134.65)  267.545 ms  267.507 ms  269.466 ms
13  po13-0-0.cr1.nrt1.asianetcom.net (202.147.16.110)  270.989 ms  271.937 ms  273.993 ms
14  po7-0-0.gw1.sjc1.asianetcom.net (202.147.0.34)  277.877 ms  278.839 ms  281.252 ms
15  te3-4.mpd01.sjc03.atlas.cogentco.com (154.54.11.13)  281.699 ms  283.636 ms  284.853 ms
16  te8-2.mpd01.sjc01.atlas.cogentco.com (154.54.6.237)  463.673 ms te2-4.mpd01.sjc01.atlas.cogentco.com (154.54.41.201)  463.182 ms te3-2.mpd01.sjc01.atlas.cogentco.com (154.54.6.81)  462.293 ms
17  te0-0-0-4.mpd22.sfo01.atlas.cogentco.com (66.28.4.149)  292.477 ms  294.651 ms  296.563 ms
18  154.54.42.65 (154.54.42.65)  266.953 ms  266.968 ms  266.970 ms
19  154.54.42.66 (154.54.42.66)  272.769 ms  275.296 ms  276.403 ms
20  154.54.42.65 (154.54.42.65)  274.892 ms te0-2-0-0.mpd22.mci01.atlas.cogentco.com (154.54.30.158)  316.074 ms  317.968 ms
21  te4-4.mpd01.mci01.atlas.cogentco.com (154.54.30.166)  317.435 ms  319.157 ms te0-1-0-0.mpd21.sfo01.atlas.cogentco.com (154.54.30.50)  287.845 ms
22  te0-1-0-0.ccr21.sfo01.atlas.cogentco.com (154.54.30.49)  286.352 ms  266.516 ms  267.862 ms
23  38.104.86.74 (38.104.86.74)  307.340 ms  303.753 ms  305.914 ms
24  CORE1.KCMODATACENTER.COM (96.43.134.6)  313.323 ms  314.295 ms te0-1-0-0.ccr21.sfo01.atlas.cogentco.com (154.54.30.49)  275.431 ms
25  208.94.245.186 (208.94.245.186)  313.699 ms te0-1-0-0.mpd21.sfo01.atlas.cogentco.com (154.54.30.50)  281.885 ms 208.94.245.186 (208.94.245.186)  317.395 ms

That bastard server is involved in steps #18 and #20, with 154.54.42.66 in between, which looks like packets bouncing back and forth between the two. Let’s see what we get in Windows with tracert:
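
Spotting an address that shows up twice is easy to automate. Below is a small Python sketch, assuming the hops have already been reduced to (hop number, address) pairs. The function name find_revisit is mine, and the list is hand-copied from the first responder at hops 15 to 22 of the traceroute above.

```python
def find_revisit(hops):
    """Return (first_hop, later_hop, address) for the first address seen twice.

    hops is a list of (hop_number, address) pairs in path order.
    Returns None when no address repeats.
    """
    seen = {}
    for number, address in hops:
        if address in seen:
            return seen[address], number, address
        seen[address] = number
    return None


# First responder per hop, copied by hand from the Linux traceroute.
LINUX_HOPS = [
    (15, "154.54.11.13"),
    (16, "154.54.6.237"),
    (17, "66.28.4.149"),
    (18, "154.54.42.65"),
    (19, "154.54.42.66"),
    (20, "154.54.42.65"),
    (21, "154.54.30.166"),
    (22, "154.54.30.49"),
]

print(find_revisit(LINUX_HOPS))  # → (18, 20, '154.54.42.65')
```

A path should normally never visit the same router twice, so a hit here is a fair hint that something upstream is looping, at least for some of the probes.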

C:\Documents and Settings\Administrator>tracert vizreader.com

Tracing route to vizreader.com [208.94.245.186]
over a maximum of 30 hops:

  1    <1 ms    <1 ms     2 ms  10.0.2.2
  2     1 ms     1 ms    <1 ms  192.168.1.1
  3   500 ms    30 ms    30 ms  118.173.72.1.adsl.dynamic.totbb.net [118.173.72.1]
  4   501 ms    34 ms    32 ms  172.17.24.253
  5   501 ms    34 ms    33 ms  118.174.235.137.static.totbb.net [118.174.235.137]
  6   501 ms    41 ms    52 ms  203.114.127.49.static.totisp.net [203.114.127.49]
  7   501 ms    41 ms    43 ms  te-1-3.kkm-gw-01.totisp.net [203.113.127.29]
  8   501 ms    42 ms    36 ms  203.190.251.73
  9   507 ms    40 ms    42 ms  203.190.250.49
 10   501 ms    58 ms    59 ms  203.190.251.226
 11   501 ms    56 ms    57 ms  203.190.251.37
 12   502 ms   270 ms   307 ms  ET-Network.asianetcom.net [202.147.17.253]
 13   499 ms   268 ms   263 ms  ge-2-0-0-0.cr4.hkg3.asianetcom.net [203.192.134.69]
 14   510 ms   265 ms   265 ms  po15-0-0.cr1.nrt1.asianetcom.net [202.147.0.65]
 15   501 ms   351 ms   265 ms  po7-0-0.gw1.sjc1.asianetcom.net [202.147.0.34]
 16   501 ms   394 ms   323 ms  te3-4.mpd01.sjc03.atlas.cogentco.com [154.54.11.13]
 17   509 ms   347 ms   265 ms  te4-2.mpd01.sjc01.atlas.cogentco.com [154.54.6.105]
 18   495 ms   265 ms   268 ms  te0-0-0-4.mpd22.sfo01.atlas.cogentco.com [66.28.4.149]
 19   507 ms   262 ms   266 ms  154.54.42.65
 20   509 ms   268 ms   268 ms  154.54.42.66
 21   502 ms   252 ms   272 ms  154.54.42.65
 22   496 ms   268 ms   265 ms  154.54.42.66
 23   495 ms   266 ms   265 ms  154.54.42.65
 24   502 ms   269 ms   268 ms  154.54.42.66
 25   502 ms   267 ms   266 ms  154.54.42.65
 26   500 ms   268 ms   267 ms  154.54.42.66
 27   508 ms   258 ms   269 ms  154.54.42.65
 28   503 ms   275 ms   269 ms  154.54.42.66
 29   501 ms   266 ms   263 ms  154.54.42.65
 30   494 ms   268 ms   267 ms  154.54.42.66

Trace complete.

It looks different but 154.54.42.66 shows up there too. In fact the Windows trace never reaches vizreader.com at all: it just bounces between 154.54.42.65 and 154.54.42.66 until the 30-hop limit runs out. Maybe the difference is the reason why it works OK in Windows and not in Ubuntu. It’s either that or how Linux and Windows treat whatever error messages they get back, where in this case Windows simply ignores them and goes ahead with the request or something, and Linux doesn’t.
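
One detail does fit a loop between those two hosts: the Linux pings were rejected by 154.54.42.65 and the Windows pings by 154.54.42.66. Here is a back-of-the-envelope Python sketch of why that could be. It assumes the default initial TTLs (64 on Linux, 128 on Windows XP), assumes the packets stay in the loop once they enter it, and uses the hop where the loop starts in each trace above (18 on Linux, 19 on Windows because of the extra 10.0.2.2 hop). The function name is made up.

```python
def ttl_expiry_router(initial_ttl, loop_start_hop, loop_routers):
    """Return the router inside a routing loop where a packet's TTL runs out.

    Every router decrements the TTL by one, and the router that brings it to
    zero drops the packet and reports the expiry. A packet sent with TTL n
    therefore dies at hop n, which is exactly what traceroute exploits.
    """
    if initial_ttl < loop_start_hop:
        raise ValueError("packet expires before it reaches the loop")
    # Position within the loop once the remaining hops are used up.
    offset = (initial_ttl - loop_start_hop) % len(loop_routers)
    return loop_routers[offset]


LOOP = ["154.54.42.65", "154.54.42.66"]

# Assumed defaults: Linux sends with TTL 64, Windows XP with TTL 128.
print(ttl_expiry_router(64, 18, LOOP))   # → 154.54.42.65
print(ttl_expiry_router(128, 19, LOOP))  # → 154.54.42.66
```

Both results match the addresses in the respective ping outputs, which is at least consistent with the packets ping-ponging between the two until the TTL is used up.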

Pinging vizreader from a server in London worked just fine, doing a traceroute there gave me the following:

 1  serverxxx-xxx-xxx-xxx-1.live-servers.net (xxx-xxx-xxx-xxx)  0.313 ms  0.370 ms  0.449 ms
 2  88.208.255.101 (88.208.255.101)  0.370 ms  0.463 ms  0.542 ms
 3  88.208.255.62 (88.208.255.62)  4.996 ms  5.072 ms  5.121 ms
 4  10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21)  13.755 ms  13.999 ms  14.107 ms
 5  10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241)  73.139 ms  73.183 ms  73.228 ms
 6  10gigabitethernet1-2.core1.chi1.he.net (72.52.92.102)  105.240 ms  102.798 ms  98.300 ms
 7  10gigabitethernet1-1.core1.mci1.he.net (72.52.92.2)  105.493 ms  105.567 ms  105.463 ms
 8  10gigabitethernet1-1.core1.mci2.he.net (184.105.213.2)  105.519 ms  105.469 ms  105.515 ms
 9  joe_s-data-center-llc.10gigabitethernet1-4.core1.mci1.he.net (216.66.79.14)  113.126 ms  113.080 ms  113.224 ms
10  96.43.134.62 (96.43.134.62)  114.341 ms  114.327 ms  114.379 ms
11  208.94.245.186 (208.94.245.186)  114.461 ms  114.471 ms  114.460 ms

Note how 154.54.42.65 is absent from the route.

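I still have the problem, but I now know there’s probably nothing I can do about it. I simply have to wait until the issue gets resolved upstream. Since the name resolves to the same IP that the London server reaches just fine, it looks more like a routing problem around those two hosts than a DNS propagation issue.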
