Tech Notes And Miscellaneous Thoughts
 

openstack, bridging, netfilter and dnat

I just posted the following question to ServerFault… and then realised there might be people out there in magical internetland who know the answer but never visit any of the SO sites, so I’ve posted it here too.  Feel free to respond here or on ServerFault.


In a recent upgrade (from Openstack Diablo on Ubuntu Lucid to Openstack Essex on Ubuntu Precise), we found that DNS packets were frequently (almost always) dropped on the bridge interface (br100). On our compute-node hosts, the NIC behind that bridge is a Mellanox MT26428 using the mlx4_en driver module.
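
The symptom is easy to see with tcpdump on the bridge and on the underlying port (a diagnostic sketch rather than anything from the original question; eth0 is a placeholder for whatever name the Mellanox port gets):

# run in separate terminals: DNS traffic on the bridge vs. on the physical NIC
tcpdump -ni br100 udp port 53
tcpdump -ni eth0 udp port 53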

We’ve found two workarounds for this:

1. Use an old Lucid kernel (e.g. 2.6.32-41-generic). This causes other problems, in particular the lack of cgroups and the old versions of the kvm and kvm_amd modules (we suspect the kvm module version is the source of a bug we’re seeing where occasionally a VM will use 100% CPU). We’ve been running with this for the last few months, but we can’t stay here forever.

2.  With the newer Ubuntu Precise kernels (3.2.x), we found that if we use sysctl to disable netfilter on the bridge (see the sysctl settings below), DNS starts working perfectly again. We thought this was the solution to our problem until we realised that turning off netfilter on the bridge interface will, of course, mean that the DNAT rule that redirects VM requests to the nova-api-metadata server (i.e. redirects packets destined for 169.254.169.254:80 to compute-node’s-IP:8775 – see the sketch below) is completely bypassed.
Long story short: with 3.x kernels, we can have reliable networking and a broken metadata service, or we can have broken networking and a metadata service that would work fine if there were any VMs to service. We haven’t yet found a way to have both.
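
For reference, the metadata redirect in question is a nat PREROUTING rule along these lines (a sketch of the kind of rule nova-network installs, not copied from one of our hosts – the real rule lives in a nova-specific chain, and 10.1.2.3 stands in for the compute node’s own address):

# rewrite VM requests for the magic metadata address to the local nova-api-metadata port
iptables -t nat -A PREROUTING -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 10.1.2.3:8775

With net.bridge.bridge-nf-call-iptables=0, bridged traffic from the VMs never traverses that rule at all.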
Anyone seen this problem, or anything like it, before? Got a fix? Or a pointer in the right direction?
Our suspicion is that it’s specific to the Mellanox driver, but we’re not sure of that (we’ve tried several different versions of the mlx4_en driver, starting with the version built into the 3.2.x kernels, all the way up to the latest 1.5.8.3 driver from the Mellanox web site; the mlx4_en driver in the 3.5.x kernel from Quantal doesn’t work at all).
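
(To check which driver build is actually loaded on a given host, the usual commands are enough – again, eth0 stands in for whatever the Mellanox port is called:)

ethtool -i eth0                     # driver, version and firmware of the NIC behind the bridge
modinfo mlx4_en | grep -i version   # version of the module on disk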
BTW, our compute nodes have Supermicro H8DGT motherboards with a built-in Mellanox NIC:
02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s – IB QDR / 10GigE] (rev b0)
We’re not using the other two NICs in the system; only the Mellanox and the IPMI card are connected.
Bridge netfilter sysctl settings:
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
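
To make the workaround persist across reboots, the settings can go into /etc/sysctl.conf or a file under /etc/sysctl.d/ in the usual way – with the caveat that these keys only exist once the bridge module is loaded:

# /etc/sysctl.d/90-bridge-nf.conf (filename is arbitrary)
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0

# reload by hand:
sysctl -p /etc/sysctl.d/90-bridge-nf.conf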
Since discovering this bridge-nf sysctl workaround, we’ve found a few pages on the net recommending exactly this (including Openstack’s latest network troubleshooting page and a launchpad bug report that linked to a blog post with a great description of the problem and the solution)… it’s easier to find stuff when you know what to search for :). But we haven’t found anything on the DNAT issue that it causes.

Update 2012-09-12:

Something I should have mentioned earlier – this happens even on machines that don’t have any openstack or even libvirt packages installed.  Same hardware, same everything, but with not much more than the Ubuntu 12.04 base system  installed.

On kernel 2.6.32-41-generic, the bridge works as expected.

On kernel 3.2.0-29-generic, using the ethernet interface directly, it works perfectly.
Using a bridge on that same NIC fails unless net.bridge.bridge-nf-call-iptables=0 (a minimal reproduction of that setup is sketched below).
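
For completeness, reproducing the bridge setup on a bare 12.04 install takes nothing more than the following (a sketch – it assumes the Mellanox port shows up as eth0, and is best run from the console since it re-plumbs the primary interface):

apt-get install bridge-utils
brctl addbr br100
brctl addif br100 eth0              # eth0 = the Mellanox port (placeholder name)
ip addr flush dev eth0
dhclient br100                      # or assign the host's usual static address to br100
time for i in $(seq 1 100); do host google.com > /dev/null; done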

So, it seems pretty clear that the problem is either in the Mellanox driver, the updated kernel’s bridging code, the netfilter code, or some interaction between them.

Interestingly, I have other machines (without a Mellanox card) with a bridge interface that don’t exhibit this problem, with NICs ranging from cheap r8169 cards to better-quality Broadcom tg3 Gbit cards in some Sun Fire X2200 M2 servers and Intel Gbit cards in Supermicro motherboards.  Like our openstack compute nodes, they all use the bridge interface as their primary (or sometimes only) interface with an IP address – they’re configured that way so we can run VMs using libvirt & kvm with real IP addresses rather than NAT.
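
That bridge-as-primary-interface setup looks roughly like this in /etc/network/interfaces (a sketch with placeholder names and addresses, not copied from a real host):

auto br0
iface br0 inet static
    # 192.0.2.x and eth0 are placeholders for the real address and physical NIC
    address 192.0.2.10
    netmask 255.255.255.0
    gateway 192.0.2.1
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0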

So, that indicates that the problem is specific to the Mellanox driver, although the blog post I mentioned above described a similar problem with some Broadcom NICs that used the bnx2 driver.

17 Comments

  1. cas

    The latest i’ve tried is the mlx4 driver in the  linux-image-3.5.0-13 (based on kernel version 3.5.3), mlx4_core v1.1 and mlx4_en v2.0.  Is the version in 3.6rc any newer than that?

    the 3.5.0-13 kernel results in the following:

    [    2.796656] mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)
    [    2.796658] mlx4_core: Initializing 0000:02:00.0
    [    5.393495] mlx4_core 0000:02:00.0: irq 59 for MSI/MSI-X
    [    5.393507] mlx4_core 0000:02:00.0: irq 60 for MSI/MSI-X
    [    5.393515] mlx4_core 0000:02:00.0: irq 61 for MSI/MSI-X
    [    5.393525] mlx4_core 0000:02:00.0: irq 62 for MSI/MSI-X
    [    5.393533] mlx4_core 0000:02:00.0: irq 63 for MSI/MSI-X
    [    5.393541] mlx4_core 0000:02:00.0: irq 64 for MSI/MSI-X
    [    5.393549] mlx4_core 0000:02:00.0: irq 65 for MSI/MSI-X
    [    5.393557] mlx4_core 0000:02:00.0: irq 66 for MSI/MSI-X
    [    5.393566] mlx4_core 0000:02:00.0: irq 67 for MSI/MSI-X
    [    5.393575] mlx4_core 0000:02:00.0: irq 68 for MSI/MSI-X
    [    5.393584] mlx4_core 0000:02:00.0: irq 69 for MSI/MSI-X
    [    5.393592] mlx4_core 0000:02:00.0: irq 70 for MSI/MSI-X
    [    5.393600] mlx4_core 0000:02:00.0: irq 71 for MSI/MSI-X
    [    5.393608] mlx4_core 0000:02:00.0: irq 72 for MSI/MSI-X
    [    5.393616] mlx4_core 0000:02:00.0: irq 73 for MSI/MSI-X
    [    5.393624] mlx4_core 0000:02:00.0: irq 74 for MSI/MSI-X
    [    5.393633] mlx4_core 0000:02:00.0: irq 75 for MSI/MSI-X
    [    5.393642] mlx4_core 0000:02:00.0: irq 76 for MSI/MSI-X
    [    5.393650] mlx4_core 0000:02:00.0: irq 77 for MSI/MSI-X
    [    5.393658] mlx4_core 0000:02:00.0: irq 78 for MSI/MSI-X
    [    5.393666] mlx4_core 0000:02:00.0: irq 79 for MSI/MSI-X
    [    5.467933] mlx4_core 0000:02:00.0: command 0xc failed: fw status = 0x40
    [    7.077533] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.0 (Dec 2011)
    
    1. Phil
      $ git log --oneline v3.5..v3.6-rc5 drivers/net/ethernet/mellanox
      8f8ba75 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
      846b999 Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
      96f17d5 mlx4_core: Clean up buddy bitmap allocation
      3de819e mlx4_core: Fix integer overflow issues around MTT table
      89dd86d mlx4_core: Allow large mlx4_buddy bitmaps
      499b95f drivers/net/ethernet/mellanox/mlx4/mcg.c: fix error return code
      2207b60 net/mlx4_core: Remove port type restrictions
      c18520b net/mlx4_en: Fixing TX queue stop/wake flow
      c8c40b7 net/mlx4_en: loopbacked packets are dropped when SMAC=DMAC
      1e30c1b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
      ee64c0e net/mlx4_en: Limit the RFS filter IDs to be < RPS_NO_FILTER
      57dbf29 mlx4: Add support for EEH error recovery
      5dedb9f Merge tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
      089117e Merge branches 'cma', 'cxgb4', 'misc', 'mlx4-sriov', 'mlx-cleanups', 'ocrdma' and 'qib' into for-linus
      4cce66c mlx4_en: map entire pages to increase throughput
      1eb8c69 net/mlx4_en: Add accelerated RFS support
      d9236c3 {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
      af22d9d net/mlx4: Move MAC_MASK to a common place
      9c64508 net/mlx4_en: dereferencing freed memory
      447458c net/mlx4: off by one in parse_trans_rule()
      6634961 mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct and paravirtualize them
      105c320 mlx4_core: Allow guests to have IB ports
      396f2fe mlx4_core: Implement mechanism for reserved Q_Keys
      240a920 net/mlx4_core: Free ICM table in case of error
      f457ce4 mlx4_core: Remove double function declarations
      2aca117 net/mlx4_core: Initialize IB port capabilities for all slaves
      00f5ce9 mlx4: Use port management change event instead of smp_snoop
      752a50c mlx4_core: Pass an invalid PCI id number to VFs
      cabdc8e net/mlx4_en: Add support for drop action through ethtool
      8206728 net/mlx4_en: Manage flow steering rules with ethtool
      592e49d net/mlx4: Implement promiscuous mode with device managed flow-steering
      1b9c6b0 net/mlx4_core: Add resource tracking for device managed flow steering rules
      0ff1fb6 {NET, IB}/mlx4: Add device managed flow steering firmware API
      8fcfb4d net/mlx4_core: Add firmware commands to support device managed flow steering
      c96d97f net/mlx4: Set steering mode according to device capabilities
      6d19993 net/mlx4_en: Re-design multicast attachments flow
      aa1ec3d net/mlx4_core: Change resource tracking ID to be 64 bit
      4af1c04 net/mlx4_core: Change resource tracking mechanism to use red-black tree
      90b1ebe mlx4: set maximal number of default RSS queues
      b26d344 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
      7e52b33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
      6469933 ethernet: Remove casts to same type
      1. cas

        3.6.0rc5 works exactly as well as 3.5 – i.e. not at all on our hardware.

        i’d investigate further, but i really don’t want to get sidetracked into chasing problems in kernel versions that we aren’t likely to be running on production hw for at least 6 or 12 months.

        (also, it looks like the driver author rarely ever bumps the version number of the driver)

        # grep mlx4\\\|Linux.version dmesg.0
        [    0.000000] Linux version 3.6.0-030600rc5-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201209082035 SMP Sun Sep 9 00:36:02 UTC 2012
        [    3.153754] mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)
        [    3.153860] mlx4_core: Initializing 0000:02:00.0
        [    5.732519] mlx4_core 0000:02:00.0: irq 58 for MSI/MSI-X
        [    5.732535] mlx4_core 0000:02:00.0: irq 59 for MSI/MSI-X
        [    5.732544] mlx4_core 0000:02:00.0: irq 60 for MSI/MSI-X
        [    5.732552] mlx4_core 0000:02:00.0: irq 61 for MSI/MSI-X
        [    5.732560] mlx4_core 0000:02:00.0: irq 62 for MSI/MSI-X
        [    5.732568] mlx4_core 0000:02:00.0: irq 63 for MSI/MSI-X
        [    5.732577] mlx4_core 0000:02:00.0: irq 64 for MSI/MSI-X
        [    5.732585] mlx4_core 0000:02:00.0: irq 65 for MSI/MSI-X
        [    5.732593] mlx4_core 0000:02:00.0: irq 66 for MSI/MSI-X
        [    5.732600] mlx4_core 0000:02:00.0: irq 67 for MSI/MSI-X
        [    5.732608] mlx4_core 0000:02:00.0: irq 68 for MSI/MSI-X
        [    5.732615] mlx4_core 0000:02:00.0: irq 69 for MSI/MSI-X
        [    5.732623] mlx4_core 0000:02:00.0: irq 70 for MSI/MSI-X
        [    5.772658] mlx4_core 0000:02:00.0: command 0xc failed: fw status = 0x40
        [    7.516755] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.0 (Dec 2011)
        
  2. We had an identical problem without mellanox cards in the mix (at least as an ethernet interface), so it isn’t solely that. The bug is somewhere in filtering rules on the nova-compute side of the house. In our configuration, we’re using dedicated hosts for n-net, so bypassing the rules isn’t such a big deal; metadata services aren’t local to the hypervisor, so remapping can happen elsewhere.

    As an awful hack, you could add an alias on lo to 169.254.169.254; that would make the metadata work without any need for rewriting. This is, of course, an awful awful hack, and I feel dirty for even suggesting it, but it should do the trick. 
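
    (Concretely, that alias would be something like the following – just the standard iproute2 command; the caveat is that whatever serves the metadata still has to answer on port 80 at that address:)

    # add the magic metadata address as an alias on the loopback interface
    ip addr add 169.254.169.254/32 dev lo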

    1. cas

      We’re running nova-network and nova-api-metadata on the compute nodes (currently 84 compute nodes, with another 84 waiting for us to work out the best way to configure cells) as well as nova-compute, as we had scaling issues when we had only 3 dedicated nova-network servers.  Now we get one nova-network server per compute node.

      I don’t think it would make any difference if we had dedicated host(s) for the nova-api-metadata server, either – we’d still need a DNAT rule to redirect the requests from the VM to the metadata server’s IP address(es)… and the DNAT rule is bypassed with net.bridge.bridge-nf-call-iptables=0

      BTW, nova-network already adds 169.254.169.254 to the lo interface.

      The metadata service listens on port 8775, but the request from the VM goes to 169.254.169.254:80 – i’ve tried using a simple port proxy, but it didn’t help.  my next step is to get nova-api-metadata to listen on 169.254.169.254:80 instead of (or as well as) 0.0.0.0:8775….but even if that works, it’s not ideal.

      1. cas

        nope, that won’t help – not without running nova-api-metadata as root so it can bind to port 80.  also, when i tried it with simpleproxy (169.254.169.254:80 -> localip:8775), the packets didn’t even go near the 169.254.169.254 address on the loopback interface; traceroute showed packets going all the way out to our router before being blocked with no-route-to-host.  guess there’s a reason for that DNAT rule.

        i’m going round in circles on this one.  we need the 3.x kernels, and we need the DNAT rule for the metadata server, but net.bridge.bridge-nf-call-iptables=1 stops DNS from working on the compute node host.  i’m still not sure at this stage whether it’s a bug in the kernel or the mellanox driver, or if it’s a configuration issue or bug with nova-network.

        1. Where are you running the metadata service? On the n-net server or on the n-api server? During our build using Essex bits, we ended up putting nova-api-metadata directly on the n-net server. (The metadata service didn’t exist as a separate component in Diablo, so we had to do some weird tricks (firing up OSPF) in order to be able to use multiple nova-network servers. In Essex, it was much easier to just put nova-api-metadata on each n-net server.)

          I can verify that we were having the same problem; we’ve applied the sysctls on only the nova-network servers, and our hosts can definitely get their metadata.

          One other data point that might help to get to root cause: we are running the VlanManager, and making nova-network use the vlanXXX interface instead of the brXXX interface makes the issue go away. So it is some interaction between bridging and the netfilter rules directly. Presumably this is the netfilter rules being misapplied on bridged traffic rather than a bug in the rules themselves, as a bug in the rules would affect both the vlan and br interfaces.

          1. cas

            Each compute node is running nova-compute, nova-network, and nova-api-metadata.

            I don’t think it matters where nova-api-metadata (or nova-api) are running for this problem – a DNAT rule to redirect from 169.254.169.254:80 to metadata-server-ip:8775 is still required… and with net.bridge.bridge-nf-call-iptables=0, the DNAT rule is completely bypassed.
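
            (For what it’s worth, an easy way to see the bypass directly – a generic diagnostic, not specific to the nova chains – is to watch the packet counters on the nat PREROUTING rules while a VM asks for its metadata; with bridge-nf-call-iptables=0 the counters on the DNAT rule never move:)

            iptables -t nat -L PREROUTING -v -n
            watch -n1 'iptables -t nat -L PREROUTING -v -n | grep 169.254.169.254'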

          2. Ah. In our setup, we have just nova-compute on some hosts, where all of the netfilter rules do fire, and a dedicated nova-network server. I suspect things are working in our case because all of the DNAT rewriting is happening before the packets are sent from the hypervisor to the nodes running nova-network (and nova-api-metadata).

  3. Actually, I just double-checked, and it looks like I was misremembering. The issue is in the rules on the nova-network server, but metadata seems to be working fine for us because we have re-writing happening on the nova-compute nodes as well. Sorry for the mixup in the earlier post.

    1. cas

      not likely.  to start with, those are Debian kernel packages, not Ubuntu.  more importantly, there are over 100 kernel versions between 2.6.32-41 (where the bridging is known to work without problems) and the 3.x versions we’ve tried.  needle in a hundred haystacks.

      if we knew what we were looking for, it might be possible (to find what we needed to be looking for :) but if we knew that, we’d probably be able to figure out a fix or at least a workaround.

  4. Stephen Gran

    I’ve seen other similar sounding problems with DHCP.  Try this hack:

    iptables -A POSTROUTING -t mangle -p udp --dport 53 -j CHECKSUM --checksum-fill
    iptables -A POSTROUTING -t mangle -p udp --sport 53 -j CHECKSUM --checksum-fill
    
    1. cas

      I guess you mean with the bridge-nf sysctls at their default of 1.  Worth a shot…

      … just tried it; it doesn’t make any difference.  Most DNS requests still time out and fail.

      Here’s our Q&D test for the most obvious symptom:

      # sysctl net.bridge.bridge-nf-call-iptables=0; time for i in $(seq 1 100) ; do host google.com > /dev/null ; done
      net.bridge.bridge-nf-call-iptables = 0
      
      real 0m1.652s
      user 0m0.572s
      sys 0m0.696s
      
      # sysctl net.bridge.bridge-nf-call-iptables=1; time for i in $(seq 1 100) ; do host google.com > /dev/null ; done
      net.bridge.bridge-nf-call-iptables = 1
      
      real 2m33.171s
      user 0m0.500s
      sys 0m0.752s

       

      Note how much longer the second run takes, with bridge-nf-call-iptables=1 – many/most of the DNS requests are timing out.
