Discussion:
Losing IPv6 connectivity after 20min, why?
Olaf Schreck
2016-09-19 11:10:02 UTC
Permalink
I have configured a Debian 7 server for IPv6 (in addition to IPv4).
I can ping6 www.google.com and other addresses, fine. BUT the server
reproducibly loses IPv6 connectivity after roughly 20min, and I can't
figure out why this happens. Clues anyone?

My hoster (Hetzner) routes the 2a01:4f8:191:XXXX::/64 network to the
server. Following their instructions, I assign a static address from that
block and set the default route to fe80::1, either manually
ip -6 addr add 2a01:4f8:191:XXXX::5/64 dev eth0
ip -6 route add default via fe80::1 dev eth0

or in /etc/network/interfaces like this
iface eth0 inet6 static
address 2a01:4f8:191:XXXX::5
netmask 64
gateway fe80::1

This works, I can ping6. But it reproducibly stops working after 20min,
confirmed using this command
while true; do date; ping6 -c3 -w5 www.google.com; sleep 10; done

I'm sure it's not Google rate-limiting my pings, I get the same results with
various IPv6 addresses that I'm authorized to ping.

To restore IPv6 connectivity (IPv4 is still working), I can either reboot or
re-add the default route with these commands (order is important):
ip -6 route del default
ip -6 route del fe80::1 dev eth0
ip -6 route add fe80::1 dev eth0
ip -6 route add default via fe80::1 dev eth0

Which will give another 20min of IPv6. 100% reproducible. And it's just the
routing that needs to be fixed.

I have excluded:
- no router advertisements used by the hoster, no such packets seen with
tcpdump, server is not configured to accept RAs
- no ip6tables rules, and default ACCEPT everywhere
- no cronjob or other periodical script that could be responsible
- no "security software" or similar that would interfere
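In case it helps anyone debugging the same thing: this is roughly how I watched for neighbour discovery traffic (eth0 is assumed to be the uplink interface; needs root):

```shell
# Capture all ICMPv6 on the uplink interface; neighbour solicitations
# and advertisements for fe80::1 should show up here whenever the
# kernel tries to refresh the gateway's neighbour cache entry.
tcpdump -n -i eth0 icmp6
```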

Important data point: This server has 2 ethernet interfaces, so there are
2 link-local fe80::/64 routes, to eth0 and eth1. I suspected the
problem might be related, so I disabled IPv6 on the second interface
completely with sysctl net.ipv6.conf.eth1.disable_ipv6 = 1.

And that resulted in stable and flawless IPv6 connectivity!

While this workaround is ok for this server, I have another one that shows
the same symptoms. But for that server I need IPv6 on the other interfaces,
so the workaround does not apply.

I'd rather learn why this happens, or what config part I may be
missing. Clues or further debugging hints are very welcome. Thanks!

Olaf
Marc Haber
2016-09-19 12:10:02 UTC
Permalink
Post by Olaf Schreck
My hoster (Hetzner) routes the 2a01:4f8:191:XXXX::/64 network to the
server. Following their instructions, I assign a static address from that
block and set the default route to fe80::1, either manually
ip -6 addr add 2a01:4f8:191:XXXX::5/64 dev eth0
ip -6 route add default via fe80::1 dev eth0
or in /etc/network/interfaces like this
iface eth0 inet6 static
address 2a01:4f8:191:XXXX::5
netmask 64
gateway fe80::1
This works, I can ping6. But it reproducibly stops working after 20min,
confirmed using this command
while true; do date; ping6 -c3 -w5 www.google.com; sleep 10; done
Check ip neigh output. Does the entry for your default gateway go
STALE after those 20 minutes?

Also check the lifetime of any SLAAC ip addresses given in ip addr
output.
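Concretely, something like this (eth0 assumed to be your uplink):

```shell
# NUD state (REACHABLE/STALE/...) of the fe80::1 gateway entry:
ip -6 neigh show dev eth0
# valid_lft / preferred_lft of each configured address:
ip -6 addr show dev eth0
```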
Post by Olaf Schreck
I'm sure it's not Google rate-limiting my pings, I get the same results with
various IPv6 addresses that I'm authorized to ping.
fyi, google does not rate-limit pings against 8.8.8.8, 8.8.4.4 and
their IPv6 counterparts which I simply cannot memorize.
Post by Olaf Schreck
To restore IPv6 connectivity (IPv4 is still working), I can either reboot or
ip -6 route del default
ip -6 route del fe80::1 dev eth0
ip -6 route add fe80::1 dev eth0
ip -6 route add default via fe80::1 dev eth0
Do you really need to meddle with the fe80::1 route? Do you really
need an explicit route for fe80::1%eth0? Will it work without?

Does adding a route for 2000::/3 via fe80::1 dev eth0 help, or is it
really necessary to remove the default route and to re-add it?
Post by Olaf Schreck
Important data point: This server has 2 ethernet interfaces, so there are
2 link-local fe80::/64 routes to eth0 and eth1. I was suspicious that the
problem might be related, so I disabled IPv6 on the second interface
completely with sysctl net.ipv6.conf.eth1.disable_ipv6 = 1.
No need, an fe80::/64 IP address is only valid when an interface scope
is given:

[2/501]***@parada:~$ ping6 fe80::1
connect: Invalid argument
[3/502]***@parada:~$ ping6 fe80::1%eth0
PING fe80::1%eth0(fe80::1) 56 data bytes
64 bytes from fe80::1: icmp_seq=1 ttl=64 time=2.25 ms
^C
--- fe80::1%eth0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.252/2.252/2.252/0.000 ms
[4/503]***@parada:~$

Same reason why I think that your explicit fe80::1 route is unnecessary.
Post by Olaf Schreck
And that resulted in stable and flawless IPv6 connectivity!
Is the other interface connected? eth1 should not play a role here at
all.

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Olaf Schreck
2016-09-19 21:00:02 UTC
Permalink
Post by Marc Haber
Check ip neigh output. Does the entry for your default gateway go
STALE after those 20 minutes?
Yes, exactly:

# ip -6 nei
fe80::1 dev eth0 lladdr 0c:86:10:ed:31:ca router STALE
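If it helps, I've been checking this in a loop; a rough parsing sketch (the field layout is taken from the output above, and the function name is mine):

```python
def gateway_nud_state(neigh_output, gateway="fe80::1"):
    """Return the NUD state (REACHABLE, STALE, ...) of the gateway's
    entry in `ip -6 neigh` output, or None if there is no entry."""
    for line in neigh_output.splitlines():
        fields = line.split()
        # Line format: <addr> dev <if> lladdr <mac> [router] <STATE>
        if fields and fields[0] == gateway:
            return fields[-1]  # the state is the last column
    return None

# With the line from above:
print(gateway_nud_state(
    "fe80::1 dev eth0 lladdr 0c:86:10:ed:31:ca router STALE"))  # STALE
```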
Post by Marc Haber
Also check the lifetime of any SLAAC ip addresses given in ip addr
output.
forever
Post by Marc Haber
Do you really need to meddle with the fe80::1 route?
I had no plans to do so. I just learned during debugging that this would
reestablish IPv6 connectivity without reboot.
Post by Marc Haber
Do you really
need an explicit route for fe80::1%eth0? Will it work without?
My hoster's docs (Hetzner) recommend that. They don't specify an IPv6
default gateway.
Post by Marc Haber
Does adding a route for 2000/3 via fe80::1 dev eth0 help,
nope (just tried it)
Post by Marc Haber
or is it
really necessary to remove the default route and to re-add it?
yep, connectivity is back
Post by Marc Haber
No need, an fe80::/64 IP address is only valid when an interface is
[...]
Post by Marc Haber
Is the other interface connected? eth1 should not play a role here at
all.
I had no plans to fiddle with link-local addresses, and of course eth1
settings should not matter. I just disabled IPv6 on eth1 for debugging,
and suddenly IPv6 worked >20min. Maybe coincidence rather than causality.

I'd like to learn what's going on here, thanks for your comments.


Olaf
Kenyon Ralph
2016-09-19 22:20:01 UTC
Permalink
Post by Olaf Schreck
I had no plans to fiddle with link-local addresses, and of course eth1
settings should not matter. I just disabled IPv6 on eth1 for debugging,
and suddenly IPv6 worked >20min. Maybe coincidence rather than causality.
I'd like to learn what's going on here, thanks for your comments.
As someone else asked, showing us the output of 'ip -6 a' and 'ip -6
r' could be helpful.
--
Kenyon Ralph
Olaf Schreck
2016-09-19 22:40:01 UTC
Permalink
Post by Kenyon Ralph
As someone else asked, showing us the output of 'ip -6 a' and 'ip -6
r' could be helpful.
Ok, sorry. I don't see anything special here:

# ip -6 ro
2a01:4f8:191:XXXX::/64 dev eth0 proto kernel metric 256
fe80::/64 dev eth0 proto kernel metric 256
fe80::/64 dev eth1 proto kernel metric 256
fe80::/64 dev vif1.0 proto kernel metric 256
fe80::/64 dev vif2.0 proto kernel metric 256
fe80::/64 dev vif3.0 proto kernel metric 256
fe80::/64 dev vif4.0 proto kernel metric 256
default via fe80::1 dev eth0 metric 1024

# ip -6 ad
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
inet6 2a01:4f8:191:XXXX::4/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::5246:5dff:fe9f:f752/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
inet6 fe80::6a05:caff:fe18:596/64 scope link
valid_lft forever preferred_lft forever
4: vif1.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
5: vif2.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
6: vif3.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
7: vif4.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
Kenyon Ralph
2016-09-19 22:50:02 UTC
Permalink
Post by Olaf Schreck
Post by Kenyon Ralph
As someone else asked, showing us the output of 'ip -6 a' and 'ip -6
r' could be helpful.
Thanks. You said it breaks when you configure eth1 though. What is the
configuration you are putting on eth1 that causes the breakage?
Post by Olaf Schreck
# ip -6 ro
2a01:4f8:191:XXXX::/64 dev eth0 proto kernel metric 256
fe80::/64 dev eth0 proto kernel metric 256
fe80::/64 dev eth1 proto kernel metric 256
fe80::/64 dev vif1.0 proto kernel metric 256
fe80::/64 dev vif2.0 proto kernel metric 256
fe80::/64 dev vif3.0 proto kernel metric 256
fe80::/64 dev vif4.0 proto kernel metric 256
default via fe80::1 dev eth0 metric 1024
# ip -6 ad
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
inet6 2a01:4f8:191:XXXX::4/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::5246:5dff:fe9f:f752/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
inet6 fe80::6a05:caff:fe18:596/64 scope link
valid_lft forever preferred_lft forever
4: vif1.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
5: vif2.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
6: vif3.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
7: vif4.0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 32
inet6 fe80::fcff:ffff:feff:ffff/64 scope link
valid_lft forever preferred_lft forever
--
Kenyon Ralph
Marc Haber
2016-09-20 10:40:02 UTC
Permalink
Post by Olaf Schreck
Post by Marc Haber
Check ip neigh output. Does the entry for your default gateway go
STALE after those 20 minutes?
# ip -6 nei
fe80::1 dev eth0 lladdr 0c:86:10:ed:31:ca router STALE
So it is the neighbor table entry going stale. Does your system retry
neighbor solicitation (see tcpdump), or does it sit quietly with the
STALE entry? I had this behavior in similarly styled hosting networks
a few years ago, and since it has fixed itself in the meantime, I
suspect a kernel issue that has since been fixed. I do use more recent
kernels than Debian stable, though.
Post by Olaf Schreck
Post by Marc Haber
Do you really need to meddle with the fe80::1 route?
I had no plans to do so. I just learned during debugging that this would
reestablish IPv6 connectivity without reboot.
It would be interesting to see whether this causes your system to do a
neighbor solicitation for the default gateway.
Post by Olaf Schreck
Post by Marc Haber
Do you really
need an explicit route for fe80::1%eth0? Will it work without?
My hoster's docs (Hetzner) recommend that. They don't specify an IPv6
default gateway.
They're unfortunately not very clueful with regard to IPv6. I haven't
been on their network for years though.
Post by Olaf Schreck
Post by Marc Haber
or is it
really necessary to remove the default route and to re-add it?
yep, connectivity is back
Interesting.
Post by Olaf Schreck
Post by Marc Haber
No need, an fe80::/64 IP address is only valid when an interface is
[...]
Post by Marc Haber
Is the other interface connected? eth1 should not play a role here at
all.
I had no plans to fiddle with link-local addresses, and of course eth1
settings should not matter. I just disabled IPv6 on eth1 for debugging,
and suddenly IPv6 worked >20min. Maybe coincidence rather than causality.
Check whether your system re-solicits for the default gateway with
eth1 down and/or up. If its re-solicitation behavior on eth0 differs
depending on eth1's state, we have a clear system software issue here.

If we're not talking about jessie but something more recent,
systemd-networkd might play a role. systemd-networkd has recently
started to take over IPv6 mechanics, and does so in a quite broken way.
Post by Olaf Schreck
I'd like to learn what's going on here,
Me too
Post by Olaf Schreck
thanks for your comments.
You're welcome

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Gerdriaan Mulder
2016-09-19 12:10:02 UTC
Permalink
Hi Olaf,

Depending on whether you need the link-local on the other interface
(e.g. eth1), you could try a couple of things:
* remove that address from the interface (which also removes the
fe80::/64 route on that interface)
* remove the fe80::/64 route on eth1 (although the OS might add it
again at some point)

Either of those could mean that things like neighbour discovery fail,
but I guess you have to check that in your specific situation.

Could you post the (relevant) output of `ip -6 route` and `ip -6 addr`
during the 20 minutes of 'working' IPv6 (i.e. with 2 fe80::/64 routes
and whether there are temporary addresses assigned to eth0 and eth1)?

Could you also check whether privacy extensions are enabled on eth0
and eth1 (/proc/sys/net/ipv6/conf/*/use_tempaddr)? I have a hunch that
this might explain the 20 minutes lifetime.

~ Gerdriaan
Post by Olaf Schreck
I have configured a Debian 7 server for IPv6 (in addition to IPv4).
I can ping6 www.google.com and other addresses, fine. BUT the server
reproducibly loses IPv6 connectivity after roughly 20min, and I can't
figure why this happens. Clues anyone?
My hoster (Hetzner) routes the 2a01:4f8:191:XXXX::/64 network to the
server. Following their instructions, I assign a static address from that
block and set the default route to fe80::1, either manually
ip -6 addr add 2a01:4f8:191:XXXX::5/64 dev eth0
ip -6 route add default via fe80::1 dev eth0
or in /etc/network/interfaces like this
iface eth0 inet6 static
address 2a01:4f8:191:XXXX::5
netmask 64
gateway fe80::1
This works, I can ping6. But it reproducibly stops working after 20min,
confirmed using this command
while true; do date; ping6 -c3 -w5 www.google.com; sleep 10; done
I'm sure it's not Google rate-limiting my pings, I get the same results with
various IPv6 addresses that I'm authorized to ping.
To restore IPv6 connectivity (IPv4 is still working), I can either reboot or
ip -6 route del default
ip -6 route del fe80::1 dev eth0
ip -6 route add fe80::1 dev eth0
ip -6 route add default via fe80::1 dev eth0
Which will give another 20min of IPv6. 100% reproducible. And it's just the
routing that needs to be fixed.
- no router advertisements used by the hoster, no such packets seen with
tcpdump, server is not configured to accept RAs
- no ip6tables rules, and default ACCEPT everywhere
- no cronjob or other periodical script that could be responsible
- no "security software" or similar that would interfere
Important data point: This server has 2 ethernet interfaces, so there are
2 link-local fe80::/64 routes to eth0 and eth1. I was suspicious that the
problem might be related, so I disabled IPv6 on the second interface
completely with sysctl net.ipv6.conf.eth1.disable_ipv6 = 1.
And that resulted in stable and flawless IPv6 connectivity!
While this workaround is ok for this server, I have another one that shows
the same symptoms. But for that server I need IPv6 on the other interfaces,
so the workaround does not apply.
I'd rather like to learn why this happens, or what config part I may be
missing. Clues or further debugging hints very welcome. Thanks!
Olaf
Marc Haber
2016-09-19 12:20:01 UTC
Permalink
Hi,
Post by Gerdriaan Mulder
Depending on whether you need the link-local on the other interface
* remove that address from the interface (which also removes the
fe80::/64 route on that interface)
* remove the fe80::/64 route on eth1 (although the OS might add it
again at some point)
This should not matter. It should actually make things worse.
Generally, do not remove fe80::/64 addresses from interfaces.
Post by Gerdriaan Mulder
Could you also check whether privacy extensions are enabled on eth0
and eth1 (/proc/sys/net/ipv6/conf/*/use_tempaddr)? I have a hunch that
this might explain the 20 minutes lifetime.
Good point. Yes.

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Matthew Hall
2016-09-19 17:40:02 UTC
Permalink
Post by Gerdriaan Mulder
Could you also check whether privacy extensions are enabled on eth0
and eth1 (/proc/sys/net/ipv6/conf/*/use_tempaddr)? I have a hunch that
this might explain the 20 minutes lifetime.
~ Gerdriaan
I have DEFINITELY run into that problem. To the point where I generally
disable the privacy extension stuff to prevent it. A whole lot of equipment
doesn't handle it right and breaks everything.

Matthew.
Olaf Schreck
2016-09-19 18:50:02 UTC
Permalink
I didn't reply yet because I'm still testing stuff.

But privacy extensions are not the problem; they're turned off:

# sysctl -a | grep tempad
net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.default.use_tempaddr = 0
net.ipv6.conf.lo.use_tempaddr = -1
net.ipv6.conf.eth0.use_tempaddr = 0
net.ipv6.conf.eth1.use_tempaddr = 0
net.ipv6.conf.vif1/0.use_tempaddr = 0
net.ipv6.conf.vif2/0.use_tempaddr = 0
net.ipv6.conf.vif3/0.use_tempaddr = 0
net.ipv6.conf.vif4/0.use_tempaddr = 0
net.ipv6.conf.vif5/0.use_tempaddr = 0
Post by Matthew Hall
I have DEFINITELY run into that problem. To the point where I generally
disable the privacy extension stuff to prevent it. A whole lot of equipment
doesn't handle it right and breaks everything.
Olaf Schreck
2016-10-02 14:00:01 UTC
Permalink
SOLVED. Short version: my hosting provider (Hetzner) obviously fixed
something *silently*. Everything works as initially configured, no
workarounds required.

Please disregard any speculations I made in this thread, especially that
disabling IPv6 on some interfaces or fiddling with link-local fe80:: routes
might be useful. These were conclusions based on coincidence rather than
causality.

Many thanks to Marc for very useful debugging hints, to Gerdriaan for
pushing me to read the NDP specs, and to everyone else who bothered to
reply to this dull problem.

No thanks to Hetzner, for either lying ("we didn't change anything") or not
having a clue.

I did a lot of debugging with ping6, tcpdump, rtmon and syslog. It became
quite obvious that the problem was at the uplink router which seemed to
"forget" the /64 route after 20min. I provided some debugging hints in a
support ticket, and suddenly, magically everything worked without further
changes on my end.
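For anyone repeating this kind of debugging: besides rtmon, the live view of route and neighbour table changes was the most useful part for me (iproute2, interface names as above):

```shell
# Print IPv6 route and neighbour table changes as they happen;
# the gateway entry going STALE, and any route add/del events,
# show up here with timestamps from the accompanying `date` loop.
ip -6 monitor route neigh
```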

"We didn't change anything". Yeah, sure.


Olaf
Matthew Hall
2016-10-03 04:30:02 UTC
Permalink
Post by Olaf Schreck
"We didn't change anything". Yeah, sure.
s/we didn't change anything/we have no clue what we're doing/g
Post by Olaf Schreck
Olaf
Matthew
