Troubleshooting 'unreachable' on a CDOT port

Troubleshooting 'unreachable' on a CDOT port

Edward Rolison
Hello fellow NetApp admins.
Over the weekend, I hit a really quite "interesting" sort of problem. One of those weekends that... no one really wants to have.
It involved a firmware upgrade on a switch going catastrophically wrong, causing chaos for several hours.

And off the back of that, on one of our two CDOT nodes, the LACP group carrying the primary 'data' interface has... seemingly died.

I say "seemingly" because:

- Snooping the interfaces shows packets going in and out (mostly ARP).
- But a snoop on the switch side doesn't see the ARP replies. So the switch never 'learns' the MAC, and doesn't forward traffic to it.
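To make the "never learns the MAC" point concrete, here is a toy model of a learning switch. It is a sketch under simplifying assumptions, not how any particular switch is implemented: the switch records a source MAC against a port only when a frame from that MAC actually arrives, and floods broadcasts and unknown destinations. The MAC addresses and port numbers are hypothetical.

```python
from typing import Dict

BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of a transparent bridge's MAC address table (illustrative only)."""

    def __init__(self) -> None:
        self.mac_table: Dict[str, int] = {}  # MAC address -> port it was last seen on

    def receive(self, src_mac: str, dst_mac: str, in_port: int) -> str:
        # Learning happens only for frames that actually arrive at the switch:
        # the *source* MAC is recorded against the ingress port.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            # Broadcasts and unknown unicasts are flooded to all other ports.
            return "flood"
        return f"port {out_port}"


sw = LearningSwitch()
CLIENT = "aa:bb:cc:00:00:01"  # hypothetical client MAC
LIF = "00:a0:98:00:00:01"     # hypothetical MAC of the node's data port

# The client's ARP request ("who has the LIF's IP?") is broadcast and flooded.
print(sw.receive(CLIENT, BROADCAST, in_port=1))  # flood
# The node's ARP reply never reaches the switch, so its MAC is never learned.
print(LIF in sw.mac_table)  # False
```

The same missing reply also means the client never resolves the LIF's IP address, so both ends stay stuck even though the node's own capture shows it answering.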

This happens on both ports of the LACP group, and even moving it to another switch entirely hasn't helped. Manually offlining the ports doesn't help either (except that if I offline both, the LIF migrates automatically).

But moving the LIFs over onto the other head has fixed it, for now (although obviously we're failed over, and have reduced resilience).

Has anyone run into anything similar? Or can anyone give me some insight into what could explain this perplexing behaviour?
The 'source' of the problem was, probably, some serious network strangeness: loops, VLANs going up and down, all sorts of chaos.

I haven't (yet) rebooted the failed node, as the vservers are running quite merrily. 

I'm wondering if there's some sort of DoS/ARP flood protection that might be tripping us up.
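ARP flood protection on switches is commonly some form of rate policing. As a rough illustration only, here is a generic token-bucket policer; the rate and burst values are assumptions, not the defaults of any particular switch, and real implementations may react differently to violations (some shut the port down rather than just dropping packets).

```python
class TokenBucket:
    """Token-bucket policer, the general shape of many ARP rate limiters (illustrative)."""

    def __init__(self, rate_pps: float, burst: float) -> None:
        self.rate = rate_pps    # tokens replenished per second
        self.capacity = burst   # most tokens the bucket can hold
        self.tokens = burst     # start full
        self.last = 0.0         # time of the previous update, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last packet, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the packet is dropped


def allowed_count(rate_pps: float, burst: float, pkt_times: list) -> int:
    """How many packets, arriving at the given times, the policer lets through."""
    bucket = TokenBucket(rate_pps, burst)
    return sum(bucket.allow(t) for t in pkt_times)


# An ARP storm (100 packets in one second) versus background ARP (1 per second).
storm = [i / 100 for i in range(100)]
normal = [float(i) for i in range(10)]
print(allowed_count(15, 15, storm))   # most of the storm is dropped
print(allowed_count(15, 15, normal))  # 10: every packet passes
```

Note that a policer like this would drop excess traffic during a storm but should pass normal ARP again once the storm stops, so a problem that persists long after the loop is gone points somewhere else.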

Thanks,
Ed. 

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

RE: Troubleshooting 'unreachable' on a CDOT port

Steiner, Jeffrey

I'd take a close look at spanning tree and VLAN IDs.

Spanning tree is just generally evil and likes to cause totally unexpected problems out of spite. I barely understand STP; I've just been bitten repeatedly.

VLAN IDs have tripped me up a lot. Traffic will seem to be flowing, but because either a VLAN ID is mis-set or the default VLAN ID is incorrect, the packets go nowhere.
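The VLAN failure mode described above can be sketched by parsing the 802.1Q tag and checking it against what a trunk port carries. This is a toy model, assuming a trunk with an explicit allowed-VLAN list where untagged frames fall into the native VLAN; the MAC address and VLAN numbers are hypothetical.

```python
import struct
from typing import Optional, Set

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_id(frame: bytes) -> Optional[int]:
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if it is untagged."""
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: TPID or EtherType.
    (tpid,) = struct.unpack_from("!H", frame, 12)
    if tpid != TPID_8021Q:
        return None
    (tci,) = struct.unpack_from("!H", frame, 14)
    return tci & 0x0FFF  # the low 12 bits of the Tag Control Information are the VID


def accepted(frame: bytes, allowed_vlans: Set[int], native_vlan: int) -> bool:
    """Toy trunk-port check: is this frame's VLAN one the port carries?"""
    vid = vlan_id(frame)
    # Untagged frames are assigned to the port's native (default) VLAN.
    effective = native_vlan if vid is None else vid
    return effective in allowed_vlans


dst = bytes.fromhex("ffffffffffff")
src = bytes.fromhex("00a098000001")          # hypothetical node MAC
arp = struct.pack("!H", 0x0806) + bytes(28)  # EtherType ARP plus a zeroed payload
tagged_20 = dst + src + struct.pack("!HH", TPID_8021Q, 20) + arp
untagged = dst + src + arp

print(accepted(tagged_20, {10, 20}, native_vlan=1))  # True
print(accepted(tagged_20, {10, 30}, native_vlan=1))  # False: VLAN 20 not carried
print(accepted(untagged, {10, 20}, native_vlan=1))   # False: lands in VLAN 1
```

In every case the link is up and the sender sees its frames leave, which is exactly why a VLAN mismatch can look like "traffic seems to be flowing".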

From: [hidden email] [mailto:[hidden email]] On Behalf Of Edward Rolison
Sent: Monday, November 07, 2016 12:42 PM
To: [hidden email]
Subject: Troubleshooting 'unreachable' on a CDOT port

 


Re: Troubleshooting 'unreachable' on a CDOT port

Tim McCarthy

On Mon, Nov 7, 2016 at 6:42 AM, Edward Rolison <[hidden email]> wrote:
Ed. 

Ok...gotta say it....

Have you tried physically unlinking and relinking the network connection?
I have occasionally seen this fix weird problems.

Also, have you checked the port settings on the switch to make sure they line up as expected?

Do you have portfast enabled on the LACP ports (spanning-tree portfast trunk)?
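For context on why portfast matters, here is a sketch of classic IEEE 802.1D port states after link-up, assuming the default 15-second Forward Delay and a port that becomes designated. Rapid STP variants behave differently, so treat this as illustrative of the older behaviour only.

```python
from typing import List, Tuple

FORWARD_DELAY_S = 15  # IEEE 802.1D default Forward Delay timer, in seconds


def port_state_timeline(portfast: bool, forward_delay_s: int = FORWARD_DELAY_S) -> List[Tuple[int, str]]:
    """(seconds after link-up, state entered) for a port that becomes designated."""
    if portfast:
        # Edge ports (Cisco "portfast") go straight to forwarding.
        return [(0, "forwarding")]
    return [
        (0, "listening"),                   # data frames dropped, no MAC learning
        (forward_delay_s, "learning"),      # MACs learned, data frames still dropped
        (2 * forward_delay_s, "forwarding"),
    ]


def forwards_data_at(t: int, portfast: bool) -> bool:
    """True if the port forwards user traffic t seconds after link-up."""
    state = "down"
    for start, entered in port_state_timeline(portfast):
        if t >= start:
            state = entered
    return state == "forwarding"


print(forwards_data_at(20, portfast=False))  # False: still learning
print(forwards_data_at(30, portfast=False))  # True
print(forwards_data_at(0, portfast=True))    # True
```

Without portfast, every link flap costs roughly 30 seconds of dropped data traffic under these defaults, which adds up quickly when a switch is bouncing links.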


--tmac

Tim McCarthy, Principal Consultant

Proud Member of the #NetAppATeam

I Blog at TMACsRack




Re: Troubleshooting 'unreachable' on a CDOT port

Edward Rolison
Thank you for the replies - we've got to a point where we _think_ we're "just" tickling a known issue: 987243

In certain instances, the Ethernet interface on the UTA2 X1143-R6 adapter and onboard ports might stop sending packets due to lack of transmission resources.
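An "up but not sending" port is easiest to spot by comparing both ends of the link. The sketch below assumes you can sample the node's transmit counter and the switch's receive counter for the facing port at roughly the same moments; the field names and numbers are hypothetical, and whether the node's own TX counter keeps incrementing under this particular bug is not something the thread establishes (Ed's snoop did show outbound ARP).

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical counters for one link, sampled from both ends at the same moment."""
    node_tx_pkts: int    # packets the storage node reports sending on its port
    switch_rx_pkts: int  # packets the switch reports receiving on the facing port
    link_up: bool


def silently_dropping(before: Snapshot, after: Snapshot, min_sent: int = 10) -> bool:
    """Flag a link that is up, with the node 'sending', but nothing arriving at the switch."""
    sent = after.node_tx_pkts - before.node_tx_pkts
    seen = after.switch_rx_pkts - before.switch_rx_pkts
    return after.link_up and sent >= min_sent and seen == 0


t0 = Snapshot(node_tx_pkts=1_000, switch_rx_pkts=5_000, link_up=True)
t1 = Snapshot(node_tx_pkts=1_200, switch_rx_pkts=5_000, link_up=True)
print(silently_dropping(t0, t1))  # True: 200 sent, 0 received
```

For an LACP group, run the check per member port rather than on the aggregate, since the hash may place all of one flow's traffic on a single member.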

So now we're just lining up a reboot for a suitable outage window, followed later by a code update if the reboot has done the trick.

In the interim, though, does anyone know of a good workaround for rerouting my intercluster (replication) traffic?
I can't fail that interface over to another node, and because the interface is "up" but not "working", my replication jobs have failed.



On 7 November 2016 at 12:00, tmac <[hidden email]> wrote:


Re: Troubleshooting 'unreachable' on a CDOT port

andrei.borzenkov@ts.fujitsu.com
You should be able to fail it over to another port on the same node.
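A minimal sketch of that approach: pick a different, working port on the same node and build the ONTAP `network interface migrate` command for the intercluster LIF. The port records, the `healthy` flag, and the vserver, LIF, node, and port names are all hypothetical; the target port also needs working connectivity to the peer cluster's intercluster network, which this sketch does not check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Port:
    node: str
    name: str
    link_up: bool
    healthy: bool  # hypothetical: e.g. passed your own reachability test


def pick_same_node_port(ports: List[Port], node: str, current: str) -> Optional[Port]:
    """First working port on the same node other than the LIF's current port."""
    for p in ports:
        if p.node == node and p.name != current and p.link_up and p.healthy:
            return p
    return None


def migrate_command(vserver: str, lif: str, target: Port) -> str:
    """Format an ONTAP 'network interface migrate' command for the chosen port."""
    return (f"network interface migrate -vserver {vserver} -lif {lif} "
            f"-destination-node {target.node} -destination-port {target.name}")


ports = [
    Port("node-02", "a0a", link_up=True, healthy=False),  # the stuck LACP group
    Port("node-02", "e0e", link_up=True, healthy=True),
    Port("node-01", "e0e", link_up=True, healthy=True),
]
target = pick_same_node_port(ports, node="node-02", current="a0a")
if target is not None:
    print(migrate_command("cluster1", "ic_node02", target))
```

Once the node has been rebooted and the original port is working again, `network interface revert` would send the LIF back to its home port.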

Sent from my iPhone

On 8 Nov 2016, at 13:43, Edward Rolison <[hidden email]> wrote:
