NSX-v ECMP Active/Passive configuration

OK, I know that the title of this post is a bit of an oxymoron.  ECMP Active/Passive?  Isn’t ECMP about active/active/active/active/…?  Yes it is, but imagine a scenario where you are building your NSX-v deployment across two campuses, with datacenter firewalls upstream.  That makes it important to keep data flow predictable so that asymmetric routing (ingress through one datacenter, egress through the other) doesn’t become an issue.  Roie Ben Haim has a fantastic write-up on why this is important, so I’ll leave you with this link as a primer.

Welcome back.  Now that we are on the same page on why we need to control datacenter ingress/egress, let’s further imagine that we have some incredibly demanding north-south requirements that force us to lay down the maximum ECMP configuration of 8 ESGs.  Does that mean you can’t have passive, lower-weighted ESGs on the secondary site of the DLR?  Does it mean you have to fail the ESGs over to the secondary site?

These questions hit me today, so I set out to answer them.

Before answering these questions I needed to do some work in my lab to enable the testing, as I only had a single DLR and two ESGs.  To solve this I whipped up the following PowerShell (PowerNSX) code to deploy 7 more ESGs and to configure BGP on them.  A few notes to understand the snippet:

  • 10.100.0.32/27 is my network core.  The ESGs uplink to here and peer with a pair of vyos routers.
  • 10.100.0.64/27 is my DLR transit.
  • AS 64513 is my vyos routers
  • AS 64514 is my ESGs
  • AS 64515 is my DLR
  • The snippet may not be perfect; it was a hack job to get the test going.
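# Requires PowerCLI and PowerNSX, with active Connect-VIServer and Connect-NsxServer sessions
$cluster = (Get-Cluster -Name "lab")
$datastore = (Get-Datastore -Name "Lab1")
# Uplink (network core) and internal (DLR transit) port groups
$uplinkpg = Get-VDPortgroup -Name "VLAN0002 10.100.0.32_27"
$internalpg = Get-VDPortgroup -Name "vxw-dvs-65-virtualwire-2-sid-7002-DLR Transit"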
foreach ($i in 3..9)
{
    $edgename = "edge-$i"
    $uplinkIP = "10.100.0.4$i"
    $internalIP = "10.100.0.7$i"
    $uplink = New-NsxEdgeInterfaceSpec -Name Uplink -Type Uplink -ConnectedTo $uplinkpg -PrimaryAddress $uplinkIP -SubnetPrefixLength 27 -Index 0
    $internal1 = New-NsxEdgeInterfaceSpec -Name Internal -Type Internal -ConnectedTo $internalpg -PrimaryAddress $internalIP -SubnetPrefixLength 27 -Index 1
    $edge = New-NsxEdge -Name $edgename -Cluster $cluster -Datastore $datastore -Username admin -Password "guj34mp0w3R2win!" -FormFactor compact -Interface $uplink,$internal1
}

foreach ($i in 3..9)
{
    $edgename = "edge-$i"
    # Use each ESG's uplink address as its BGP router ID
    $routerID = "10.100.0.4$i"
    Get-NsxEdge -Name $edgename | Get-NsxEdgeRouting | Set-NsxEdgeBgp -EnableBGP -LocalAS 64514 -RouterId $routerID -GracefulRestart -Confirm:$false
    Get-NsxEdge -Name $edgename | Get-NsxEdgeRouting | Set-NsxEdgeRouting -EnableBgpRouteRedistribution -Confirm:$false
}

foreach ($i in 3..9)
{
    $edgename = "edge-$i"
    get-nsxedge -Name $edgename |Get-NsxEdgeFirewall| Set-NsxEdgeFirewall -Enabled:$false -confirm:$false
    Get-NsxEdge -name $edgename | Set-NsxEdge 
}

foreach ($i in 3..9)
{
    $edgename = "edge-$i"
    # Peer each ESG with both vyos routers (AS 64513) and with the DLR protocol address (AS 64515)
    Get-NsxEdge -Name $edgename | Get-NsxEdgeRouting | New-NsxEdgeBgpNeighbour -IpAddress 10.100.0.33 -RemoteAS 64513 -Confirm:$false -Weight 50 -HoldDownTimer 6 -KeepAliveTimer 2
    Get-NsxEdge -Name $edgename | Get-NsxEdgeRouting | New-NsxEdgeBgpNeighbour -IpAddress 10.100.0.34 -RemoteAS 64513 -Confirm:$false -Weight 50 -HoldDownTimer 6 -KeepAliveTimer 2
    Get-NsxEdge -Name $edgename | Get-NsxEdgeRouting | New-NsxEdgeBgpNeighbour -IpAddress 10.100.0.71 -RemoteAS 64515 -Confirm:$false -Weight 50 -HoldDownTimer 6 -KeepAliveTimer 2
}

foreach ($i in 3..9)
{
    # Add each new ESG's internal (transit) address as a BGP neighbour on the DLR
    $esgInternalIP = "10.100.0.7$i"
    Get-NsxLogicalRouter -Name "DLR01" | Get-NsxLogicalRouterRouting | New-NsxLogicalRouterBgpNeighbour -IpAddress $esgInternalIP -ForwardingAddress 10.100.0.70 -ProtocolAddress 10.100.0.71 -RemoteAS 64514 -Confirm:$false -Weight 50 -HoldDownTimer 6 -KeepAliveTimer 2
}

The heavy lifting done, I logged into my vyos routers and added the new ESGs as BGP peers, then gave everything a few minutes to settle.  One thing to note: the script as written leaves the DLR peering with 9 ESGs (the original 2 plus the 7 new ones), all with equal weights.
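
The vyos side of that peering is not shown in the post. As a rough sketch only, the loop below prints one configuration-mode command per new ESG, reusing the uplink addressing and AS numbers from the notes above. The command form is an assumption: it matches VyOS releases where the local AS is part of the `protocols bgp` path, and newer releases use a different syntax, so check it against your version before pasting.

```powershell
# Prints VyOS config-mode commands for edge-3 .. edge-9.
# Assumption: older VyOS syntax ("set protocols bgp <local-as> neighbor ...").
$vyosAS = 64513   # vyos routers (from the notes above)
$esgAS  = 64514   # ESGs (from the notes above)

$vyosCommands = foreach ($i in 3..9) {
    # Same uplink addressing as the deployment loop: 10.100.0.43 .. 10.100.0.49
    "set protocols bgp $vyosAS neighbor 10.100.0.4$i remote-as $esgAS"
}
$vyosCommands
```

The output would be pasted into each vyos router in configuration mode, followed by a commit. Timers are left out here; they would need to line up with the hold-down and keep-alive values used on the ESG side.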

So how does the DLR handle having 9 equally weighted paths?  Let’s run show ip bgp and find out.  The DLR successfully peers with all upstream ESGs with nary a complaint.

Now what about actual path selection?  Up next, show ip route.

Only 8 selected paths.  Well, that’s a bummer, but not entirely unexpected since the advertised ECMP limit for NSX is 8-way.

Let’s lower the weight on .79 to make it a member of the “passive” datacenter, then repeat our show ip route command.  Our expectation here is that .79 will drop from the route table and be replaced by .78.
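
The post does not show how the weight was changed. One way to script it with PowerNSX might look like the sketch below, which removes the DLR's neighbour entry for .79 and re-creates it with a lower weight. The value 40 is an assumption (anything below the 50 used elsewhere should do), the filter relies on the neighbour object exposing an `ipAddress` property, and none of this has been run against a live NSX Manager.

```powershell
# Sketch only: requires PowerNSX and an active Connect-NsxServer session.
$passivePeer   = "10.100.0.79"   # edge-9, the "passive" site ESG
$passiveWeight = 40              # assumption: any value below the 50 used for the other ESGs

# Remove the existing neighbour entry for .79 from the DLR ...
Get-NsxLogicalRouter -Name "DLR01" | Get-NsxLogicalRouterRouting |
    Get-NsxLogicalRouterBgpNeighbour |
    Where-Object { $_.ipAddress -eq $passivePeer } |
    Remove-NsxLogicalRouterBgpNeighbour -Confirm:$false

# ... and re-create it with the lower weight; other settings match the original script.
Get-NsxLogicalRouter -Name "DLR01" | Get-NsxLogicalRouterRouting |
    New-NsxLogicalRouterBgpNeighbour -IpAddress $passivePeer `
        -ForwardingAddress 10.100.0.70 -ProtocolAddress 10.100.0.71 `
        -RemoteAS 64514 -Weight $passiveWeight `
        -HoldDownTimer 6 -KeepAliveTimer 2 -Confirm:$false
```

The same change can of course be made by editing the neighbour in the NSX UI.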

Excellent, this looks great.  I will spare you the screenshot, but as expected .79 still shows up in show ip bgp.  I still have two questions, though.

  • If all members of the 8-way ECMP (.65, .66, .73-.78) fail, what does the DLR do with .79?
  • If I revert .79 back to a weight of 50 with the others, what does the DLR do if one of the selected ESGs fails?

Since .79 is currently configured as the passive site ESG, let’s fail edges 1-8.

And again it behaves exactly as expected, using .79 as the proper ESG!

Okay, now for our final test.  We will power edges 1-8 back up and return .79 to a weight of 50, which as we saw earlier will result in 9 possible BGP paths, with NSX selecting 8 of them to use for ECMP.  Once everything is up and stable, we will fail edge-7 and observe what happens (.66 was the one left out).

After .77 failed, we can see that the DLR added .66 to the ECMP route table.

Recap

So in today’s screenshot-laden post we examined the behavior of an NSX-v DLR when ECMP is utilized but you overload it with BGP peers.  What did we learn today?

  • It is possible to overload a DLR with more than 8 BGP peers of equal weight.  When this occurs the DLR will select 8 of the peers and use them for ECMP.  If one of the selected peers fails, it will add one of the “overloaded” peers to the pool.  This is interesting because it allows you to provide N+1 8-way ECMP by deploying extra ESGs for situations where the bandwidth provided by 8 ESGs is critical to your environment.
  • It is possible to overload a DLR configured for 8-way ECMP with one or more ESGs of lower weight for scenarios like datacenter failover.
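
The two observations can be summed up in a toy model. The function below is not how NSX is implemented; it only mirrors the behavior seen in the tests: among the peers that are up, those with the highest weight are candidates, and at most 8 of them are used. `Select-EcmpPath` and the peer objects are hypothetical names made up for this sketch, and cutting ties in input order is an assumption, since the tests here do not establish how the DLR decides which equal-weight peer is left out.

```powershell
# Toy model of the behavior observed in this post; NOT the NSX implementation.
function Select-EcmpPath {
    param(
        [Parameter(Mandatory)] [object[]] $Peer,  # objects with Address, Weight, Up
        [int] $MaxPaths = 8                        # advertised NSX-v ECMP limit
    )
    $alive = @($Peer | Where-Object { $_.Up })
    if ($alive.Count -eq 0) { return @() }
    # Only the peers tied at the highest weight are ECMP candidates.
    $best = ($alive | Measure-Object -Property Weight -Maximum).Maximum
    $candidates = @($alive | Where-Object { $_.Weight -eq $best })
    # Assumption: candidates beyond the limit are dropped in input order.
    return @($candidates | Select-Object -First $MaxPaths)
}

# The nine ESG transit addresses from the lab, all at weight 50.
$peers = foreach ($last in 65, 66, 73, 74, 75, 76, 77, 78, 79) {
    [pscustomobject]@{ Address = "10.100.0.$last"; Weight = 50; Up = $true }
}

# Make .79 the "passive" ESG (40 is an arbitrary lower weight).
$passive = $peers | Where-Object { $_.Address -eq '10.100.0.79' }
$passive.Weight = 40
$activeSite = @(Select-EcmpPath -Peer $peers)

# Fail every weight-50 ESG; only the passive peer remains.
$peers | Where-Object { $_.Weight -eq 50 } | ForEach-Object { $_.Up = $false }
$failover = @(Select-EcmpPath -Peer $peers)
```

In this model `$activeSite` holds the eight weight-50 peers and `$failover` holds only .79, which lines up with the datacenter failover test above.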

While we explicitly tested this with an NSX-v DLR, I suspect this same behavior can be found in NSX-v ESGs, any other Vyatta-style router, and possibly even classical hardware routers.  I am really at the neophyte level when it comes to routing and BGP behavior, so I don’t want to overpromise here.

Hopefully these two pieces of information are helpful to you in making informed decisions when planning out your NSX and possibly other routing deployments.