Linux network troubleshooting with the weirdest case scenario
So my server crashed the other day, for the third time in a short period. Hopefully it won’t happen again.
However, after rebooting the server following a few hours of downtime, I had unreliable access to the VizReader application. A disaster, as I’ve come to rely on it a lot. HTTP requests to it would succeed at random, with roughly 40% of them being dropped somewhere. But where and why?
Must be the browser, I thought, and switched from FF to Chrome, but no, same problem. My second thought was that something was wrong with my router, so I switched to a backup but still experienced the problem. By coincidence I noticed that everything was OK if I accessed the site through Windows, but not in Ubuntu! WTF?
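Tip: Running netstat -r is a good way of finding the IP of your router if you’ve forgotten it: it is the Gateway entry on the default route. Use netstat -rn if you want numeric addresses instead of hostnames.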
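If you want to pull the router IP out of that table in a script, something like the following should do. It is only a sketch: the sample text is a made-up routing table in the usual Linux netstat -rn layout (the wlan0 interface is an assumption, only the 192.168.1.1 address comes from this post), and default_gateway is a name I picked for illustration.

```python
# Hypothetical `netstat -rn` output in the usual Linux column layout.
SAMPLE_NETSTAT_RN = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 wlan0
0.0.0.0         192.168.1.1     0.0.0.0         UG        0 0          0 wlan0
"""


def default_gateway(netstat_output):
    """Return the gateway of the default route, or None if there is none."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # The default route has destination 0.0.0.0 (or "default" without -n)
        # and carries the G (gateway) flag in the fourth column.
        if (len(fields) >= 4
                and fields[0] in ("0.0.0.0", "default")
                and "G" in fields[3]):
            return fields[1]
    return None


print(default_gateway(SAMPLE_NETSTAT_RN))
```

To use it on a real machine you would feed it the text that netstat -rn prints instead of the sample.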
It was time for some network debugging. tshark confirmed the behavior reported by Firebug. Trying to access the server via ssh would report back “no route to host” in the same random fashion: sometimes it would work, sometimes not. I also got the same error message when trying to download the first page with wget.
Pinging vizreader.com resulted in the following:
henrik@henrik-laptop:~$ ping vizreader.com
PING vizreader.com (208.94.245.186) 56(84) bytes of data.
From 154.54.42.65 icmp_seq=1 Time to live exceeded
From 154.54.42.65 icmp_seq=2 Time to live exceeded
From 154.54.42.65 icmp_seq=3 Time to live exceeded
From 154.54.42.65 icmp_seq=4 Time to live exceeded
^CFrom 154.54.42.65 icmp_seq=6 Time to live exceeded
--- vizreader.com ping statistics ---
6 packets transmitted, 0 received, +5 errors, 100% packet loss, time 22372ms
, pipe 2
Weird to say the least: the router at 154.54.42.65 tells me that the TTL of my packets has expired. However, this is something I get back virtually instantly! And how come 60% of my HTTP requests get through just fine if the TTL of every packet I send to the server expires? So I tried it in Windows and got the following:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\Administrator>ping vizreader.com
Pinging vizreader.com [208.94.245.186] with 32 bytes of data:
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Ping statistics for 208.94.245.186:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
C:\Documents and Settings\Administrator>
Basically the same problem reported with different wording, but observe the conclusion: Lost = 0 (0% loss). WTF again? For some reason Windows interprets the situation differently: judging by the statistics, it counts the “TTL expired” messages as received replies. Maybe that has something to do with the fact that the site is working flawlessly in Windows but not in Ubuntu?
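To compare the two outputs on equal terms, here is a rough sketch that ignores each tool’s own summary line and counts the reply lines itself. The regular expressions are my assumption of what genuine echo replies look like on each system (“64 bytes from … ttl=…” on Linux, “Reply from …: bytes=… TTL=…” on Windows), and summarize is just an illustrative name.

```python
import re

# Assumed shapes of a genuine echo reply on Linux and on Windows.
ECHO_REPLY = re.compile(
    r"bytes from .*ttl=\d+|Reply from .*bytes=\d+.*TTL=\d+", re.IGNORECASE)
# ICMP "time exceeded" errors as worded by Linux ping and Windows ping.
TTL_EXPIRED = re.compile(
    r"time to live exceeded|ttl expired in transit", re.IGNORECASE)

LINUX_PING = """\
PING vizreader.com (208.94.245.186) 56(84) bytes of data.
From 154.54.42.65 icmp_seq=1 Time to live exceeded
From 154.54.42.65 icmp_seq=2 Time to live exceeded
From 154.54.42.65 icmp_seq=3 Time to live exceeded
From 154.54.42.65 icmp_seq=4 Time to live exceeded
^CFrom 154.54.42.65 icmp_seq=6 Time to live exceeded
"""

WINDOWS_PING = """\
Pinging vizreader.com [208.94.245.186] with 32 bytes of data:
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
Reply from 154.54.42.66: TTL expired in transit.
"""


def summarize(ping_output):
    """Count real echo replies and TTL-expired errors in ping output."""
    counts = {"echo_replies": 0, "ttl_expired": 0}
    for line in ping_output.splitlines():
        if TTL_EXPIRED.search(line):
            counts["ttl_expired"] += 1
        elif ECHO_REPLY.search(line):
            counts["echo_replies"] += 1
    return counts


print("linux:  ", summarize(LINUX_PING))
print("windows:", summarize(WINDOWS_PING))
```

Counted this way both systems agree: not a single echo reply came back from vizreader.com, so the 0% loss in Windows is a matter of bookkeeping rather than a sign that its pings got through.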
A dig didn’t help either:
henrik@henrik-laptop:~$ dig vizreader.com
; <<>> DiG 9.6.1-P2 <<>> vizreader.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27649
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;vizreader.com. IN A
;; ANSWER SECTION:
vizreader.com. 17832 IN A 208.94.245.186
;; Query time: 71 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 18 12:24:10 2010
;; MSG SIZE rcvd: 47
The above queries my router and resolves the name to the same IP that ping used, which is further confirmation that the problem is not on my end. Let’s try that server which is reporting TTL failure in the pings:
henrik@henrik-laptop:~$ dig @154.54.42.65 vizreader.com
; <<>> DiG 9.6.1-P2 <<>> @154.54.42.65 vizreader.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
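No answer, although 154.54.42.65 is most likely a backbone router rather than a name server, so a timeout here doesn’t prove much by itself. Still, it looks like we’ve found the culprit. Let’s check the road from here to vizreader.com with traceroute: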
1 192.168.1.1 (192.168.1.1) 2.445 ms 2.732 ms 3.140 ms
2 118.173.72.1.adsl.dynamic.totbb.net (118.173.72.1) 31.726 ms 33.591 ms 35.774 ms
3 172.17.24.253 (172.17.24.253) 40.510 ms 42.883 ms 44.117 ms
4 118.174.235.137.static.totbb.net (118.174.235.137) 47.023 ms 48.496 ms 50.496 ms
5 203.114.127.49.static.totisp.net (203.114.127.49) 61.041 ms 62.956 ms 64.684 ms
6 te-1-3.kkm-gw-01.totisp.net (203.113.127.29) 66.867 ms 41.739 ms 42.853 ms
7 203.190.251.73 (203.190.251.73) 44.781 ms 46.953 ms 48.383 ms
8 203.190.250.49 (203.190.250.49) 50.587 ms 52.494 ms 54.236 ms
9 203.190.251.226 (203.190.251.226) 72.089 ms 73.625 ms 74.873 ms
10 203.190.251.37 (203.190.251.37) 78.090 ms 79.513 ms 81.031 ms
11 ET-Network.asianetcom.net (202.147.17.253) 294.653 ms 295.750 ms 297.466 ms
12 gi3-0-0.cr3.hkg3.asianetcom.net (203.192.134.65) 267.545 ms 267.507 ms 269.466 ms
13 po13-0-0.cr1.nrt1.asianetcom.net (202.147.16.110) 270.989 ms 271.937 ms 273.993 ms
14 po7-0-0.gw1.sjc1.asianetcom.net (202.147.0.34) 277.877 ms 278.839 ms 281.252 ms
15 te3-4.mpd01.sjc03.atlas.cogentco.com (154.54.11.13) 281.699 ms 283.636 ms 284.853 ms
16 te8-2.mpd01.sjc01.atlas.cogentco.com (154.54.6.237) 463.673 ms te2-4.mpd01.sjc01.atlas.cogentco.com (154.54.41.201) 463.182 ms te3-2.mpd01.sjc01.atlas.cogentco.com (154.54.6.81) 462.293 ms
17 te0-0-0-4.mpd22.sfo01.atlas.cogentco.com (66.28.4.149) 292.477 ms 294.651 ms 296.563 ms
18 154.54.42.65 (154.54.42.65) 266.953 ms 266.968 ms 266.970 ms
19 154.54.42.66 (154.54.42.66) 272.769 ms 275.296 ms 276.403 ms
20 154.54.42.65 (154.54.42.65) 274.892 ms te0-2-0-0.mpd22.mci01.atlas.cogentco.com (154.54.30.158) 316.074 ms 317.968 ms
21 te4-4.mpd01.mci01.atlas.cogentco.com (154.54.30.166) 317.435 ms 319.157 ms te0-1-0-0.mpd21.sfo01.atlas.cogentco.com (154.54.30.50) 287.845 ms
22 te0-1-0-0.ccr21.sfo01.atlas.cogentco.com (154.54.30.49) 286.352 ms 266.516 ms 267.862 ms
23 38.104.86.74 (38.104.86.74) 307.340 ms 303.753 ms 305.914 ms
24 CORE1.KCMODATACENTER.COM (96.43.134.6) 313.323 ms 314.295 ms te0-1-0-0.ccr21.sfo01.atlas.cogentco.com (154.54.30.49) 275.431 ms
25 208.94.245.186 (208.94.245.186) 313.699 ms te0-1-0-0.mpd21.sfo01.atlas.cogentco.com (154.54.30.50) 281.885 ms 208.94.245.186 (208.94.245.186) 317.395 ms
That bastard server is involved in steps #18 and #20. Let’s see what we get in Windows with tracert:
C:\Documents and Settings\Administrator>tracert vizreader.com
Tracing route to vizreader.com [208.94.245.186]
over a maximum of 30 hops:
1 <1 ms <1 ms 2 ms 10.0.2.2
2 1 ms 1 ms <1 ms 192.168.1.1
3 500 ms 30 ms 30 ms 118.173.72.1.adsl.dynamic.totbb.net [118.173.72.1]
4 501 ms 34 ms 32 ms 172.17.24.253
5 501 ms 34 ms 33 ms 118.174.235.137.static.totbb.net [118.174.235.137]
6 501 ms 41 ms 52 ms 203.114.127.49.static.totisp.net [203.114.127.49]
7 501 ms 41 ms 43 ms te-1-3.kkm-gw-01.totisp.net [203.113.127.29]
8 501 ms 42 ms 36 ms 203.190.251.73
9 507 ms 40 ms 42 ms 203.190.250.49
10 501 ms 58 ms 59 ms 203.190.251.226
11 501 ms 56 ms 57 ms 203.190.251.37
12 502 ms 270 ms 307 ms ET-Network.asianetcom.net [202.147.17.253]
13 499 ms 268 ms 263 ms ge-2-0-0-0.cr4.hkg3.asianetcom.net [203.192.134.69]
14 510 ms 265 ms 265 ms po15-0-0.cr1.nrt1.asianetcom.net [202.147.0.65]
15 501 ms 351 ms 265 ms po7-0-0.gw1.sjc1.asianetcom.net [202.147.0.34]
16 501 ms 394 ms 323 ms te3-4.mpd01.sjc03.atlas.cogentco.com [154.54.11.13]
17 509 ms 347 ms 265 ms te4-2.mpd01.sjc01.atlas.cogentco.com [154.54.6.105]
18 495 ms 265 ms 268 ms te0-0-0-4.mpd22.sfo01.atlas.cogentco.com [66.28.4.149]
19 507 ms 262 ms 266 ms 154.54.42.65
20 509 ms 268 ms 268 ms 154.54.42.66
21 502 ms 252 ms 272 ms 154.54.42.65
22 496 ms 268 ms 265 ms 154.54.42.66
23 495 ms 266 ms 265 ms 154.54.42.65
24 502 ms 269 ms 268 ms 154.54.42.66
25 502 ms 267 ms 266 ms 154.54.42.65
26 500 ms 268 ms 267 ms 154.54.42.66
27 508 ms 258 ms 269 ms 154.54.42.65
28 503 ms 275 ms 269 ms 154.54.42.66
29 501 ms 266 ms 263 ms 154.54.42.65
30 494 ms 268 ms 267 ms 154.54.42.66
Trace complete.
It looks different, but 154.54.42.66 shows up there too. In fact, from hop 19 onwards the trace just bounces between 154.54.42.65 and 154.54.42.66 until it runs out of hops. Maybe the difference is the reason why it works OK in Windows and not in Ubuntu. It’s either that or how Linux and Windows treat whatever error messages they get back, where in this case Windows simply ignores them and goes ahead with the request or something, and Linux doesn’t.
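Spotting addresses that come back at several hops is easy to automate. A minimal sketch, assuming each trace line starts with the hop number and that the last IPv4 address on the line is the responding router (true for the tracert output above, not necessarily for every traceroute variant); find_revisited_hops is a name made up for this example.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hops 18 to 24 copied from the tracert output above.
TRACERT_SAMPLE = """\
18 495 ms 265 ms 268 ms te0-0-0-4.mpd22.sfo01.atlas.cogentco.com [66.28.4.149]
19 507 ms 262 ms 266 ms 154.54.42.65
20 509 ms 268 ms 268 ms 154.54.42.66
21 502 ms 252 ms 272 ms 154.54.42.65
22 496 ms 268 ms 265 ms 154.54.42.66
23 495 ms 266 ms 265 ms 154.54.42.65
24 502 ms 269 ms 268 ms 154.54.42.66
"""


def find_revisited_hops(trace_text):
    """Map each router IP seen at more than one hop to its hop numbers."""
    seen = {}
    for line in trace_text.splitlines():
        fields = line.split()
        ips = IPV4.findall(line)
        # Skip headers and any line that does not start with a hop number.
        if not ips or not fields or not fields[0].isdigit():
            continue
        seen.setdefault(ips[-1], []).append(int(fields[0]))
    return {ip: hops for ip, hops in seen.items() if len(hops) > 1}


for ip, hops in find_revisited_hops(TRACERT_SAMPLE).items():
    print(ip, "answers at hops", hops)
```

A router that answers at several different hops is the classic signature of packets going in circles, which is what the two 154.54.42.x addresses do here.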
Pinging vizreader from a server in London worked just fine, doing a traceroute there gave me the following:
1 serverxxx-xxx-xxx-xxx-1.live-servers.net (xxx-xxx-xxx-xxx) 0.313 ms 0.370 ms 0.449 ms
2 88.208.255.101 (88.208.255.101) 0.370 ms 0.463 ms 0.542 ms
3 88.208.255.62 (88.208.255.62) 4.996 ms 5.072 ms 5.121 ms
4 10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21) 13.755 ms 13.999 ms 14.107 ms
5 10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241) 73.139 ms 73.183 ms 73.228 ms
6 10gigabitethernet1-2.core1.chi1.he.net (72.52.92.102) 105.240 ms 102.798 ms 98.300 ms
7 10gigabitethernet1-1.core1.mci1.he.net (72.52.92.2) 105.493 ms 105.567 ms 105.463 ms
8 10gigabitethernet1-1.core1.mci2.he.net (184.105.213.2) 105.519 ms 105.469 ms 105.515 ms
9 joe_s-data-center-llc.10gigabitethernet1-4.core1.mci1.he.net (216.66.79.14) 113.126 ms 113.080 ms 113.224 ms
10 96.43.134.62 (96.43.134.62) 114.341 ms 114.327 ms 114.379 ms
11 208.94.245.186 (208.94.245.186) 114.461 ms 114.471 ms 114.460 ms
Note how 154.54.42.65 is absent from the route.
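I still have the problem, but I now know there’s probably nothing I can do about it. The name resolves to the right IP, so DNS is unlikely to be to blame; it looks more like a routing problem around 154.54.42.65 and 154.54.42.66, and I simply have to wait until it gets resolved upstream.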
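As a sanity check, here is a small simulation of a packet that gets trapped between 154.54.42.65 and 154.54.42.66 the way the tracert output suggests. It is a sketch under assumptions, not a proven diagnosis: I assume the usual default initial TTLs (64 for Linux ping, 128 for Windows XP ping), that every router decrements the TTL by one, and that the loop starts after 17 hops from Ubuntu and 18 hops from Windows, as in the two traces above. The function name is made up.

```python
from itertools import chain, cycle

LOOP = ["154.54.42.65", "154.54.42.66"]


def router_that_drops(hops_before_loop, loop, initial_ttl):
    """Return the router where the TTL reaches zero for a looping packet."""
    if initial_ttl < 1:
        raise ValueError("initial_ttl must be at least 1")
    # Placeholder names for the routers before the loop, then the loop forever.
    path = chain((f"hop{n}" for n in range(1, hops_before_loop + 1)),
                 cycle(loop))
    ttl = initial_ttl
    for router in path:
        ttl -= 1  # every router decrements the TTL by one
        if ttl == 0:
            return router  # this router sends back "time exceeded"


# Ubuntu: 17 hops before the loop, assumed initial TTL of 64.
print("linux:  ", router_that_drops(17, LOOP, 64))
# Windows XP: 18 hops before the loop, assumed initial TTL of 128.
print("windows:", router_that_drops(18, LOOP, 128))
```

Under those assumptions the Linux packet dies at 154.54.42.65 and the Windows packet at 154.54.42.66, which matches the addresses in the two ping outputs. It also fits the fact that only some requests fail: the traceroute from Ubuntu shows packets sometimes escaping past hop 20 towards the real destination.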
Tags: debugging, dig, Linux, networking, traceroute, tshark