I just posted the following question to ServerFault… and then realised there might be people out there in magical internetland who know the answer but never visit any of the SO sites, so I've posted it here too. Feel free to respond here or on ServerFault.
In a recent upgrade (from Openstack Diablo on Ubuntu Lucid to Openstack Essex on Ubuntu Precise), we found that DNS packets were frequently (almost always) dropped on the bridge interface (br100). For our compute-node hosts, that's a Mellanox MT26428 using the mlx4_en driver module. The workarounds we've found so far are:
1. Use an old lucid kernel (e.g. 2.6.32-41-generic). This causes other problems, in particular the lack of cgroups and the old version of the kvm and kvm_amd modules (we suspect the kvm module version is the source of a bug we’re seeing where occasionally a VM will use 100% CPU). We’ve been running with this for the last few months, but can’t stay here forever.
2. Disable netfilter processing on the bridge by setting the following sysctls to 0:

net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
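A minimal sketch of setting these at runtime (they'd also need to go into /etc/sysctl.conf or an /etc/sysctl.d/ snippet to survive a reboot):

$ sudo sysctl -w net.bridge.bridge-nf-call-arptables=0
$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=0
$ sudo sysctl -w net.bridge.bridge-nf-call-ip6tables=0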
Update 2012-09-12:
Something I should have mentioned earlier – this happens even on machines that don’t have any openstack or even libvirt packages installed. Same hardware, same everything, but with not much more than the Ubuntu 12.04 base system installed.
On kernel 2.6.32-41-generic, the bridge works as expected.
On kernel 3.2.0-29-generic, using the ethernet interface, it works perfectly.
Using a bridge on that same NIC fails unless net.bridge.bridge-nf-call-iptables=0
So, it seems pretty clear that the problem is either in the mellanox driver, the updated kernel's bridging code, the netfilter code, or some interaction between them.
Interestingly, I have other machines (without a mellanox card) with a bridge interface that don't exhibit this problem, with NICs ranging from cheap r8169 cards to better-quality broadcom tg3 Gbit cards in some Sun Fire X2200 M2 servers and intel Gbit cards in supermicro motherboards. Like our openstack compute nodes, they all use the bridge interface as their primary (or sometimes only) interface with an IP address – they're configured that way so we can run VMs using libvirt & kvm with real IP addresses rather than NAT.
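For clarity, the bridge setup on those machines amounts to something like this (interface names and addresses here are placeholders, and the real config lives in /etc/network/interfaces rather than being typed by hand):

$ sudo brctl addbr br0
$ sudo brctl addif br0 eth0
$ sudo ip link set eth0 up
$ sudo ip addr add 192.0.2.10/24 dev br0
$ sudo ip link set br0 up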
So, that indicates that the problem is specific to the mellanox driver, although the blog post I mentioned above had a similar problem with some broadcom NICs that used the bnx2 driver.
There’s a stack of changes to the mellanox driver in the 3.6rc releases. Have you tried the latest version of the driver from the kernel git HEAD?
The latest I've tried is the mlx4 driver in linux-image-3.5.0-13 (based on kernel 3.5.3): mlx4_core v1.1 and mlx4_en v2.0. Is the version in 3.6rc any newer than that?

The 3.5.0-13 kernel results in the following:
$ git log --oneline v3.5..v3.6-rc5 drivers/net/ethernet/mellanox
8f8ba75 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
846b999 Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
96f17d5 mlx4_core: Clean up buddy bitmap allocation
3de819e mlx4_core: Fix integer overflow issues around MTT table
89dd86d mlx4_core: Allow large mlx4_buddy bitmaps
499b95f drivers/net/ethernet/mellanox/mlx4/mcg.c: fix error return code
2207b60 net/mlx4_core: Remove port type restrictions
c18520b net/mlx4_en: Fixing TX queue stop/wake flow
c8c40b7 net/mlx4_en: loopbacked packets are dropped when SMAC=DMAC
1e30c1b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
ee64c0e net/mlx4_en: Limit the RFS filter IDs to be < RPS_NO_FILTER
57dbf29 mlx4: Add support for EEH error recovery
5dedb9f Merge tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
089117e Merge branches 'cma', 'cxgb4', 'misc', 'mlx4-sriov', 'mlx-cleanups', 'ocrdma' and 'qib' into for-linus
4cce66c mlx4_en: map entire pages to increase throughput
1eb8c69 net/mlx4_en: Add accelerated RFS support
d9236c3 {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
af22d9d net/mlx4: Move MAC_MASK to a common place
9c64508 net/mlx4_en: dereferencing freed memory
447458c net/mlx4: off by one in parse_trans_rule()
6634961 mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct and paravirtualize them
105c320 mlx4_core: Allow guests to have IB ports
396f2fe mlx4_core: Implement mechanism for reserved Q_Keys
240a920 net/mlx4_core: Free ICM table in case of error
f457ce4 mlx4_core: Remove double function declarations
2aca117 net/mlx4_core: Initialize IB port capabilities for all slaves
00f5ce9 mlx4: Use port management change event instead of smp_snoop
752a50c mlx4_core: Pass an invalid PCI id number to VFs
cabdc8e net/mlx4_en: Add support for drop action through ethtool
8206728 net/mlx4_en: Manage flow steering rules with ethtool
592e49d net/mlx4: Implement promiscuous mode with device managed flow-steering
1b9c6b0 net/mlx4_core: Add resource tracking for device managed flow steering rules
0ff1fb6 {NET, IB}/mlx4: Add device managed flow steering firmware API
8fcfb4d net/mlx4_core: Add firmware commands to support device managed flow steering
c96d97f net/mlx4: Set steering mode according to device capabilities
6d19993 net/mlx4_en: Re-design multicast attachments flow
aa1ec3d net/mlx4_core: Change resource tracking ID to be 64 bit
4af1c04 net/mlx4_core: Change resource tracking mechanism to use red-black tree
90b1ebe mlx4: set maximal number of default RSS queues
b26d344 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
7e52b33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
6469933 ethernet: Remove casts to same type

For what it is worth, we had an identical problem without the mlx4_en driver (using bnx2, iirc), so I'm not sure the issue is in a single network driver, unless there is a systematic bug that exists across several.
3.6.0-rc5 works exactly as well as 3.5 – i.e. not at all on our hardware.

I'd investigate further, but I really don't want to get sidetracked into chasing problems in kernel versions that we aren't likely to be running on production hardware for at least 6 or 12 months.

(Also, it looks like the driver author rarely, if ever, bumps the version number of the driver.)
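For completeness, this is roughly how I'm checking which driver version a given kernel is actually using (eth0 is a placeholder for whichever interface the mellanox card shows up as):

$ modinfo mlx4_en | grep -i '^version'
$ ethtool -i eth0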
We had an identical problem without mellanox cards in the mix (at least as an ethernet interface), so it isn’t solely that. The bug is somewhere in filtering rules on the nova-compute side of the house. In our configuration, we’re using dedicated hosts for n-net, so bypassing the rules isn’t such a big deal; metadata services aren’t local to the hypervisor, so remapping can happen elsewhere.
As an awful hack, you could add an alias on lo to 169.254.169.254; that would make the metadata work without any need for rewriting. This is, of course, an awful awful hack, and I feel dirty for even suggesting it, but it should do the trick.
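For the record, the hack amounts to a one-liner (the /32 prefix is just what seems sensible to me, not something mandated anywhere):

$ sudo ip addr add 169.254.169.254/32 dev lo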
We're running nova-network and nova-api-metadata on the compute nodes (currently 84 compute nodes, with another 84 waiting for us to work out the best way to configure cells) as well as nova-compute, as we had scaling issues when we had only 3 dedicated nova-network servers. Now we get one nova-network server per compute node.
I don't think it would make any difference if we had dedicated host(s) for the nova-api-metadata server, either – we'd still need a DNAT rule to redirect the requests from the VM to the metadata server's IP address(es)… and the DNAT rule is bypassed with net.bridge.bridge-nf-call-iptables=0.
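To illustrate the kind of rule I mean (nova-network manages the real one itself – the 10.1.2.3 here is only a placeholder for the metadata server's IP):

$ sudo iptables -t nat -A PREROUTING -d 169.254.169.254/32 -p tcp --dport 80 -j DNAT --to-destination 10.1.2.3:8775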
BTW, nova-network already adds 169.254.169.254 to the lo interface.
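Easy enough to confirm with a quick look at the loopback interface:

$ ip addr show dev lo | grep 169.254.169.254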
The metadata service listens on port 8775, but the request from the VM goes to 169.254.169.254:80 – I've tried using a simple port proxy, but it didn't help. My next step is to get nova-api-metadata to listen on 169.254.169.254:80 instead of (or as well as) 0.0.0.0:8775… but even if that works, it's not ideal.
Nope, that won't help – not without running nova-api-metadata as root so it can bind to port 80. Also, when I've tried with simpleproxy (169.254.169.254:80 -> localip:8775), the packets don't even go near the 169.254.169.254 address on the loopback interface; traceroute shows packets going all the way out to our router before being blocked with no-route-to-host. Guess there's a reason for that DNAT rule.
I'm going round in circles on this one. We need the 3.x kernels, and we need the DNAT rule for the metadata server, but net.bridge.bridge-nf-call-iptables=1 stops DNS from working on the compute node host. I'm still not sure at this stage whether it's a bug in the kernel or the mellanox driver, or whether it's a configuration issue or bug with nova-network.
Where are you running the metadata service? On the n-net server or on the n-api server? During our build using Essex bits, we ended up putting nova-api-metadata directly on the n-net server. (The metadata service didn't exist as a separate component in Diablo, so we had to do some weird tricks (firing up OSPF) in order to be able to use multiple nova-network servers. In Essex, it was much easier to just put nova-api-metadata on each n-net server.)
I can verify that we were having the same problem: we've applied the sysctls on only the nova-network servers, and our hosts can definitely get their metadata.
One other data point that might help to get to the root cause: we are running the VlanManager, and making nova-network use the vlanXXX interface instead of the brXXX interface makes the issue go away. So it is some interaction between bridging and the netfilter rules directly. Presumably this is some sort of misapplication of the netfilter rules as opposed to a bug in the rules themselves, as a bug in the rules would affect use of both the vlan and br interfaces.
Each compute node is running nova-compute, nova-network, and nova-api-metadata.
I don't think it matters where nova-api-metadata (or nova-api) is running for this problem – a DNAT rule to redirect from 169.254.169.254:80 to metadata-server-ip:8775 is still required… and with net.bridge.bridge-nf-call-iptables=0, the DNAT rule is completely bypassed.
Ah. In our setup, we have just nova-compute on some hosts, where all of the netfilter rules do fire, and a dedicated nova-network server. I suspect things are working in our case because all of the DNAT rewriting is happening before the packets are sent from the hypervisor to the nodes running nova-network (and nova-api-metadata).
Actually, I just double-checked, and it looks like I was misremembering. The issue is in the rules on the nova-network server, but metadata seems to be working fine for us because we have re-writing happening on the nova-compute nodes as well. Sorry for the mixup in the earlier post.
Can you bisect (e.g., through the prebuilt kernels at http://snapshot.debian.org/package/linux-2.6/ to start)?
Not likely. To start with, those are Debian kernel packages, not Ubuntu. More importantly, there are over 100 kernel versions between 2.6.32-41 (where the bridging is known to work without problems) and the 3.x versions we've tried. Needle in a hundred haystacks.

If we knew what we were looking for, it might be possible (to find what we needed to be looking for :) – but if we knew that, we'd probably be able to figure out a fix or at least a workaround.
You can give a list of specific paths to git bisect start, and the bisection will only look at changes to files in those paths.
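For example (the good/bad tags below are just illustrative starting points, not the exact revisions in question):

$ git bisect start -- drivers/net/ethernet/mellanox net/bridge
$ git bisect bad v3.2
$ git bisect good v2.6.32
# build and boot each kernel git checks out, test DNS through the bridge, then run:
$ git bisect good
# or
$ git bisect bad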
I’ve seen other similar sounding problems with DHCP. Try this hack:
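For example, a common hack of this kind for bridged DHCP trouble is a checksum-fill mangle rule – I'm not certain it's exactly what you need here, but it's the sort of thing I mean:

$ sudo iptables -t mangle -A POSTROUTING -p udp --dport 68 -j CHECKSUM --checksum-fill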
I guess you mean with the bridge-nf sysctls at their default of 1. Worth a shot…
… just tried it – doesn't make any difference. Most DNS requests still time out and fail.
Here's our Q&D (quick and dirty) test for the most obvious symptom:
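In essence it's something like this – time a batch of lookups with the sysctl at 0, then the same batch again with it at 1 (the hostnames below are placeholders):

$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=0
$ time for h in www.example.com mail.example.com ns1.example.com; do host $h >/dev/null; done
$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=1
$ time for h in www.example.com mail.example.com ns1.example.com; do host $h >/dev/null; done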
Note how much longer the second one takes, with bridge-nf-call-iptables=1 – many (if not most) of the DNS requests are timing out.