I just posted the following question to ServerFault….and then realised there might be people out there in magical internetland who know the answer but never visit any of the SO sites, so I’ve posted it here too. Feel free to respond here or on ServerFault.
In a recent upgrade (from Openstack Diablo on Ubuntu Lucid to Openstack Essex on Ubuntu Precise), we found that DNS packets were frequently (almost always) dropped on the bridge interface (br100). For our compute-node hosts, that’s a Mellanox MT26428 using the mlx4_en driver module.
1. Use an old lucid kernel (e.g. 2.6.32-41-generic). This causes other problems, in particular the lack of cgroups and the old version of the kvm and kvm_amd modules (we suspect the kvm module version is the source of a bug we’re seeing where occasionally a VM will use 100% CPU). We’ve been running with this for the last few months, but can’t stay here forever.
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
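To confirm which of these settings is actually in effect on a given host, something like the following could be used. It is only a sketch: it assumes the three keys above show up as files under /proc/sys/net/bridge (the usual location once the bridge netfilter code is loaded), and the function name `read_bridge_nf` is made up for this example.

```python
from pathlib import Path

# The three sysctl names from the post, without the "net.bridge." prefix.
BRIDGE_NF_KEYS = (
    "bridge-nf-call-arptables",
    "bridge-nf-call-iptables",
    "bridge-nf-call-ip6tables",
)


def read_bridge_nf(base="/proc/sys/net/bridge"):
    """Return {key: int value}, or None for a key whose file is missing.

    `base` is an assumption about where the kernel exposes these values;
    it is a parameter so the function can be pointed at another directory.
    """
    values = {}
    for key in BRIDGE_NF_KEYS:
        try:
            values[key] = int((Path(base) / key).read_text().strip())
        except FileNotFoundError:
            # Bridge netfilter support is not loaded, or the key is absent.
            values[key] = None
    return values
```

A value of 0 means bridged traffic of that type is not passed to the corresponding tables, which is the workaround state; 1 is the state in which the DNS drops were seen.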
Something I should have mentioned earlier – this happens even on machines that don’t have any openstack or even libvirt packages installed. Same hardware, same everything, but with not much more than the Ubuntu 12.04 base system installed.
On kernel 2.6.32-41-generic, the bridge works as expected.
On kernel 3.2.0-29-generic, using the ethernet interface, it works perfectly.
Using a bridge on that same NIC fails unless net.bridge.bridge-nf-call-iptables=0
So, it seems pretty clear that the problem is either in the mellanox driver, the updated kernel’s bridging code, the netfilter code, or some interaction between them.
Interestingly, I have other machines (without a mellanox card) with a bridge interface that don’t exhibit this problem, with NICs ranging from cheap r8169 cards to better quality broadcom tg3 Gbit cards in some Sun Fire X2200 M2 servers and intel gb cards in supermicro motherboards. Like our openstack compute nodes, they all use the bridge interfaces as their primary (or sometimes only) interface with an IP address – they’re configured that way so we can run VMs using libvirt & kvm with real IP addresses rather than NAT.
So, that indicates that the problem is specific to the mellanox driver, although the blog post I mentioned above had a similar problem with some broadcom NICs that used the bnx2 driver.
There’s a stack of changes to the mellanox driver in the 3.6rc releases. Have you tried the latest version of the driver from the kernel git HEAD?
The latest i’ve tried is the mlx4 driver in the linux-image-3.5.0-13 (based on kernel version 3.5.3), mlx4_core v1.1 and mlx4_en v2.0. Is the version in 3.6rc any newer than that?
the 3.5.0-13 kernel results in the following:
For what it is worth, we had an identical problem without the mlx4_en driver involved (using bnx2, iirc), so I’m not sure the issue is in a single network driver, unless there is a systematic bug that exists across several.
3.6.0rc5 works exactly as well as 3.5 – i.e. not at all on our hardware.
i’d investigate further, but i really don’t want to get sidetracked into chasing problems in kernel versions that we aren’t likely to be running on production hw for at least 6 or 12 months.
(also, it looks like the driver author rarely ever bumps the version number of the driver)
We had an identical problem without mellanox cards in the mix (at least as an ethernet interface), so it isn’t solely that. The bug is somewhere in filtering rules on the nova-compute side of the house. In our configuration, we’re using dedicated hosts for n-net, so bypassing the rules isn’t such a big deal; metadata services aren’t local to the hypervisor, so remapping can happen elsewhere.
As an awful hack, you could add an alias on lo to 169.254.169.254; that would make the metadata work without any need for rewriting. This is, of course, an awful awful hack, and I feel dirty for even suggesting it, but it should do the trick.
We’re running nova-network and nova-api-metadata on the compute nodes (currently 84 compute nodes, with another 84 waiting for us to work out the best way to configure cells) as well as nova-compute, as we had scaling issues when we had only 3 dedicated nova-network servers. now we get 1 nova-network server per compute node.
I don’t think it would make any difference if we had dedicated host(s) for the nova-api-metadata server, either – we’d still need a DNAT rule to redirect the requests from the VM to the metadata server’s IP address(es)….and the DNAT rule is bypassed with net.bridge.bridge-nf-call-iptables=0
BTW, nova-network already adds 169.254.169.254 to the lo interface.
The metadata service listens on port 8775, but the request from the VM goes to 169.254.169.254:80 – i’ve tried using a simple port proxy, but it didn’t help. my next step is to get nova-api-metadata to listen on 169.254.169.254:80 instead of (or as well as) 0.0.0.0:8775….but even if that works, it’s not ideal.
nope, that won’t help. not without running nova-api-metadata as root so it can bind to port 80. also, when i’ve tried with simpleproxy (169.254.169.254:80 -> localip:8775), the packets don’t even go near the 169.254.169.254 address on the loopback interface, traceroute shows packets going all the way out to our router before being blocked with no-route-to-host. guess there’s a reason for that DNAT rule.
i’m going round in circles on this one. we need the 3.x kernels, and we need the DNAT rule for the metadata server, but net.bridge.bridge-nf-call-iptables=1 stops DNS from working on the compute node host. i’m still not sure at this stage whether it’s a bug in the kernel or the mellanox driver, or if it’s a configuration issue or bug with nova-network.
Where are you running the metadata service? On the n-net server or on the n-api server? During our build using Essex bits, we ended up putting nova-api-metadata directly on the n-net server. (the metadata service didn’t exist as a separate component in diablo, so we had to do some weird tricks (firing up OSPF) in order to be able to use multiple nova-network servers. In Essex, it was much easier to just put down nova-api-metadata on each n-net server)
I can verify that we were having the same problem, we’ve applied the sysctls on only the nova-network servers, and our hosts can definitely get their metadata.
One other data point that might help to get to root cause: we are running the VlanManager, and making nova-network use the vlanXXX interface instead of the brXXX interface makes the issue go away. So it is some interaction between bridging and netfilter rules directly. Presumably this is some sort of misapplication of netfilter rules as opposed to a bug in the rules themselves, as a bug in the rules would affect use of both the vlan and br interfaces.
Each compute node is running nova-compute, nova-network, and nova-api-metadata.
I don’t think it matters where nova-api-metadata (or nova-api) are running for this problem – a DNAT rule to redirect from 169.254.169.254:80 to metadata-server-ip:8775 is still required….and with net.bridge.bridge-nf-call-iptables=0, the DNAT rule is completely bypassed.
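For anyone following along, the kind of DNAT rule being discussed looks roughly like the one built below. This is a sketch, not the rule nova-network actually installs: the plain PREROUTING chain, the helper name `metadata_dnat_rule`, and the 192.0.2.10 address are assumptions made for illustration. Only the 169.254.169.254:80 source and the 8775 target port come from the thread.

```python
def metadata_dnat_rule(metadata_ip, metadata_port=8775, chain="PREROUTING"):
    """Build an iptables argv that redirects metadata requests.

    Traffic to 169.254.169.254:80 is rewritten to metadata_ip:metadata_port.
    The chain name is an assumption; nova-network manages its own chains.
    """
    return [
        "iptables", "-t", "nat", "-A", chain,
        "-d", "169.254.169.254/32",
        "-p", "tcp", "--dport", "80",
        "-j", "DNAT",
        "--to-destination", f"{metadata_ip}:{metadata_port}",
    ]


# 192.0.2.10 is a documentation address standing in for the real server.
print(" ".join(metadata_dnat_rule("192.0.2.10")))
```

Because the rule lives in the nat table, bridged VM traffic only reaches it when bridged packets are handed to iptables, which is why setting bridge-nf-call-iptables to 0 bypasses it.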
Ah. In our setup, we have just nova-compute on some hosts, where all of the netfilter rules do fire, and a dedicated nova-network server. I suspect things are working in our case because all of the DNAT rewriting is happening before the packets are sent from the hypervisor to the nodes running nova-network (and nova-api-metadata).
Actually, I just double-checked, and it looks like I was misremembering. The issue is in the rules on the nova-network server, but metadata seems to be working fine for us because we have re-writing happening on the nova-compute nodes as well. Sorry for the mixup in the earlier post.
Can you bisect (e.g., through the prebuilt kernels at http://snapshot.debian.org/package/linux-2.6/ to start)?
not likely. to start with, those are debian kernel packages, not ubuntu. more importantly, there’s over 100 kernel versions between 2.6.32-41 (where the bridging is known to work without problems) and the 3.x versions we’ve tried. needle in a hundred haystacks.
if we knew what we were looking for, it might be possible (to find what we needed to be looking for :) but if we knew that, we’d probably be able to figure out a fix or at least a workaround.
You can give a list of specific paths to git bisect start & the bisection will only look at changes to files in those paths.
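As a rough illustration of that suggestion, the snippet below builds a path-limited `git bisect start` command. The endpoints and paths are assumptions, not something tested on the hardware in question: v2.6.32 and v3.2 are mainline tags standing in for the Ubuntu kernels mentioned in the post, and the directory list is a guess at where the bridge, netfilter and mlx4 code lives in those trees.

```python
import shlex


def bisect_start_command(bad, good, paths):
    """Build the argv for `git bisect start <bad> <good> -- <paths...>`."""
    return ["git", "bisect", "start", bad, good, "--", *paths]


# Hypothetical endpoints and paths; adjust to the tree actually being bisected.
cmd = bisect_start_command(
    "v3.2",
    "v2.6.32",
    [
        "net/bridge",
        "net/netfilter",
        "drivers/net/mlx4",                    # assumed older driver location
        "drivers/net/ethernet/mellanox/mlx4",  # assumed newer driver location
    ],
)
print(shlex.join(cmd))
```

With the paths given after `--`, only commits touching those directories are candidates, which cuts the number of kernels to build and boot considerably.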
I’ve seen other similar sounding problems with DHCP. Try this hack:
I guess you mean with the bridge-nf sysctls at their default of 1. worth a shot…
… just tried it, doesn’t make any difference. most DNS requests still time out and fail.
here’s our Q&D test for the most obvious symptom:
Note how much longer the second one takes, with bridge-nf-call-iptables=1. many/most of the DNS requests are timing out.
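For anyone wanting to reproduce the comparison, a timing check along these lines should show the same symptom. It is not the test referred to above, just a minimal sketch: it assumes the system resolver is the thing being exercised, and the function name `time_lookups` is invented for this example.

```python
import socket
import time


def time_lookups(names, resolve=socket.gethostbyname):
    """Resolve each name once and return (elapsed_seconds, failure_count).

    `resolve` defaults to the system resolver; it is a parameter so a
    different lookup function can be substituted.
    """
    failures = 0
    start = time.monotonic()
    for name in names:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror (including lookup timeouts) is an OSError.
            failures += 1
    return time.monotonic() - start, failures
```

Running it over the same list of names once with bridge-nf-call-iptables=0 and once with it set to 1 should give a much larger elapsed time and failure count in the second case if the host is affected. A local cache such as nscd could mask the effect, so use names that have not been looked up recently.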