BURT 519766 panic on production 3270

Fletcher Cocquyt
Happy 2013!

One of our production 3270 heads panicked and rebooted at 3:30 am on Dec 25 - lump of coal?

The good news is that when the system panicked and rebooted, failover performed as expected, so we logged only a 2-second timeout on our ESXi hosts and Oracle - no downtime.

There is scarce public info on this issue, and NetApp is recommending options ranging from "do nothing (it's rare and may never happen again)" to "replace motherboards and all cards".
Our 3270 clusters (we have 2 in Active:Standby mode) have been stable since we installed them in Feb 2011. We are on 8.1GA - NetApp support says the issue is independent of ONTAP version.

Anyone else encountered this issue?
What was your action and outcome?

thanks,

Fletcher Cocquyt
Stanford University School of Medicine







BURT 519766 panic on production 3270

Jason McDaniel
What protocol: FC, NFS?


BURT 519766 panic on production 3270

Fletcher Cocquyt
In reply to this post by Fletcher Cocquyt
Dec 25 04:19:26 na03.GoCardinal.EDU Dec 25 12:20:49 [na03:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check Error at CPU1. MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on IO Exp  

Thanks - yes, the fact that NetApp is immediately willing to replace most of our HW suggests they know it's an issue with our current HW.
I'd feel better recommending this plan if they could point to the specific HW issue in the current HW and demonstrate how it's fixed in newer HW revs.
Currently those details are not public or forthcoming...
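
For anyone trying to read the panic string above: the STATUS word appears to follow the layout of the x86 machine-check IA32_MCi_STATUS register, and the flags printed next to it (Val, UnCor, Enable, PCC) match its top bits. A minimal Python sketch, assuming the standard Intel bit positions; the function name `decode_mc_status` is purely illustrative and not a NetApp tool:

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS (per the Intel SDM).
FLAG_BITS = {
    "Val": 63,     # register holds a valid error
    "Over": 62,    # an earlier error was overwritten
    "UnCor": 61,   # error was not corrected by hardware
    "Enable": 60,  # error reporting was enabled
    "MiscV": 59,   # MCi_MISC register is valid
    "AddrV": 58,   # MCi_ADDR register is valid
    "PCC": 57,     # processor context may be corrupt
}


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into flags and error codes."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mc_status(0xB200001080200E0F)
print(decoded["flags"])                # ['Val', 'UnCor', 'Enable', 'PCC']
print(hex(decoded["mca_error_code"]))  # 0xe0f
```

The low 16 bits, 0x0e0f, fit the SDM's bus/interconnect error pattern with generic participation, no timeout, a generic request type and a generic level, which seems to line up with the ErrCode(Gen,NTO,Gen,Gen,Gen) text in the panic string. That tells us the class of error, not which component caused it.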


On Jan 2, 2013, at 3:02 PM, Doug Siggins <DSiggins at ma.maileig.com> wrote:

> Fletcher,
> What was the panic string? Did you get a core to NetApp? Sometimes support is a bit reluctant to investigate further unless you press for a real answer. After 2-3 core dumps with the same type of panic string, I start demanding a fix, whether it be hardware or software.
>
>
> Here are two forum posts:
>
> https://forums.netapp.com/thread/33616
> https://forums.netapp.com/thread/35456
>
> I had a similar issue on an older filer. It panicked 2-3 times. Luckily, over time, as we prepare to retire the system, the load has dropped significantly, and I haven't seen the NMI panic for 6+ months. I had suggested we replace the system immediately and migrate off.
>
> I guess the rule of thumb is that if you see the panic more than once, you should definitely think about hardware replacements.

BURT 519766 panic on production 3270

Jayanathan, David
Interesting that you were advised to replace all HW/cards. We've hit this three times in our environment that I know of, and each time I was provided with the following information:



Bug Number / Title:
519766 / FAS32xx Uncorrectable Machine Check Error

Problem Summary:
The storage controller suffered an interruption of service due to an uncorrectable machine check error. The source of the error has not been indicated.

Recommended Solution/Workaround:

If this is the first occurrence:
- Update BIOS FAS3200: 5.1.1 or later.
- Update SP firmware to 1.2.3 or later.
- Update Data ONTAP to 8.0.2P4 or later.
- Restart the system and monitor for any repeats.

If this is the second occurrence:
- Replace the motherboard.
- Mark the faulty hardware for RCA under bug 519766.

I have yet to find any reference in our mail to hitting this bug on systems running 8.0.2P7 or above. Two of the three times we hit it were on the same system, so the second time we upgraded to 8.0.2P4 and performed a motherboard swap. The other time we simply did a failback and opted out of a code upgrade.



Thanks,

David



BURT 519766 panic on production 3270

Nilsson Marcus
Hi,
We had 3 panics in about 6 months on one of the heads in an NFS-only 3240 cluster. We were running 8.0.2P6 (7-mode) at the time.

This was the panic string:

PANIC: Uncorrectable Machine Check Error at CPU1.
MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen));
PLX PCI-E switch on Controller,
Qlogic FC 4G adapter on Controller,
Qlogic FC 4G adapter on Controller.
Root Port(0,6,0): Status(SigSysErr), SecStatus(RcvMstAbt), DevStatus(NFatal), RootErr(UCor,NFatal), ErrSrcID(CorrSrc(0),UCorrSrc(0x20)), UCorrErr(CpTim), FirstUCorrErr(CpTim), Hdr[0](HdrLen(1),AddrType(0),Attr(0),Tc(0),Type(10),Format(2)), Hdr[1]((0x11000004)), Hdr[2]((0x70184c)), Hdr[3]((0)); Br[8624](14,5,0): DevStatus(Corr,UnSup), CorrErr(RNRov,RpTim,AdvsNF); Dv[6432](16,0,0): Status(0xffff), DevStatus(0xffff), CorrErr(0xffffffff), UCorrErr(0xfffffffe), FirstUCorrErr(0xffffffff); Dv[6432](16,0,1): Status(0xffff), DevStatus(0xffff), CorrErr(0xffffffff), UCorrErr(0xfffffffe), FirstUCorrErr(0xffffffff).
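
When comparing panic strings across systems, it can help to pull out the CPU number, the machine-check bank and the STATUS word before looking at the PCIe detail. A rough Python sketch; the regular expression is a guess based only on the strings posted in this thread, not on any documented NetApp format, and `parse_panic` is a made-up helper name:

```python
import re

# Pattern inferred from the panic strings in this thread (an assumption).
PANIC_RE = re.compile(
    r"Uncorrectable Machine Check Error at CPU(?P<cpu>\d+)\.\s*"
    r"MC(?P<bank>\d+) Error:\s*STATUS<(?P<status>0x[0-9a-fA-F]+)>"
)


def parse_panic(text: str) -> dict:
    """Extract CPU, MC bank and STATUS word from a machine-check panic string."""
    match = PANIC_RE.search(text)
    if match is None:
        raise ValueError("no machine-check panic string found")
    return {
        "cpu": int(match["cpu"]),
        "bank": int(match["bank"]),
        "status": int(match["status"], 16),
    }


sample = (
    "PANIC: Uncorrectable Machine Check Error at CPU1.\n"
    "MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,"
    "ErrCode(Gen,NTO,Gen,Gen,Gen));\n"
    "PLX PCI-E switch on Controller,"
)
info = parse_panic(sample)
print(info["cpu"], info["bank"], hex(info["status"]))
```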

NetApp replaced the mainboard and since then we have not seen any additional panics.

BR Marcus



Re: BURT 519766 panic on production 3270

Steffen Knauf
In reply to this post by Jayanathan, David
Hi,

we're running into the same error (FAS3240):

Uncorrectable Machine Check Error at CPU3. MC5 Error: STATUS<0xb200000080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on Controller. Root Port(0,6,0): SecStatus(RcvMstAbt,RcvSysErr); Br[8624](9,0,0): Status(SigSysErr), DevStatus(Corr,NFatal,UnSup), CorrErr(AdvsNF), UCorrErr(UsReq), FirstUCorrErr(UsReq), Hdr[0](HdrLen(1),AddrType(0),Attr(0),Tc(0),Type(0),Format(2)), Hdr[1]((0x70090f)), Hdr[2]((0xdf50404c)), Hdr[3]((0x1c00)).

Problem Summary:

Device Br[8624](9,0,0) reported seeing the following error(s):
"Unsupported Request (UsReq): Some aspect of a received PCI packet was unsupported".

A NetApp engineer told us that the only working solution is to replace all FRUs/cards.

greets

Steffen

 

 

 


BURT 519766 panic on production 3270

Unnikrishnan KP
Hello all,
I have seen this happen at a few customer sites, and for any such error the new
NetApp policy seems to be replacing the hardware. In one instance the
system board and all PCI cards were replaced.

This was the panic string:



Uncorrectable Machine Check Error at CPU0. MC5 Error:
STATUS<0xb200001084200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen));
Root Port(0,6,0): DevStatus(Corr), CorrErr(Rcvr).  in SK process
idle_thread0 on release 8.1
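
Putting the STATUS words reported in this thread side by side shows how little they differ. A small Python sketch, assuming the values are transcribed correctly from the posts above; the dictionary labels are just shorthand for the posters' systems:

```python
# STATUS words as posted in this thread (labels are shorthand only).
REPORTS = {
    "Fletcher (3270, CPU1)": 0xB200001080200E0F,
    "Steffen (FAS3240, CPU3)": 0xB200000080200E0F,
    "Unnikrishnan (8.1, CPU0)": 0xB200001084200E0F,
}


def differing_bits(a: int, b: int) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]


base = REPORTS["Fletcher (3270, CPU1)"]
for label, status in REPORTS.items():
    low16 = status & 0xFFFF
    print(f"{label}: low16={low16:#06x} differs_at={differing_bits(base, status)}")
```

All three share the same top flag bits and the same low 16-bit error code (0x0e0f); relative to the 3270 value they differ only at bit 36 and bit 26 respectively. I would not read more into that without NetApp's decode of those fields.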


Regards,
Unnikrishnan KP


BURT 519766 panic on production 3270

Dan Burkland
We experienced the same issue with our NetApp 6080s running 8.0.1P4, and NetApp replaced the system boards in each.

Regards,

Dan

From: Unnikrishnan KP <krshnakp at gmail.com<mailto:krshnakp at gmail.com>>
Date: Thursday, January 3, 2013 8:12 AM
To: Steffen Knauf <sknauf at chipxonio.de<mailto:sknauf at chipxonio.de>>
Cc: "netapp-users at mailman.stanford.edu<mailto:netapp-users at mailman.stanford.edu>" <netapp-users at mailman.stanford.edu<mailto:netapp-users at mailman.stanford.edu>>, "toasters at teaparty.net<mailto:toasters at teaparty.net>" <toasters at teaparty.net<mailto:toasters at teaparty.net>>
Subject: Re: BURT 519766 panic on production 3270

Hello all,
I have seen this happen at a few customer sites and for any errors the new NetApp policy seems to be raplacing the hardware. In one instance the system board and all PCI cards were replaced.

This was the panic string:

Uncorrectable Machine Check Error at CPU0. MC5 Error: STATUS<0xb200001084200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); Root Port(0,6,0): DevStatus(Corr), CorrErr(Rcvr).  in SK process idle_thread0 on release 8.1

Regards,
Unnikrishnan KP


On 3 January 2013 12:15, Steffen Knauf <sknauf at chipxonio.de<mailto:sknauf at chipxonio.de>> wrote:
hi,

we're running into the same error (FAS3240):

Uncorrectable Machine Check Error at CPU3. MC5 Error: STATUS<0xb200000080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on Controller. Root Port(0,6,0):
SecStatus(RcvMstAbt,RcvSysErr); Br[8624](9,0,0): Status(SigSysErr),
DevStatus(Corr,NFatal,UnSup), CorrErr(AdvsNF), UCorrErr(UsReq),
FirstUCorrErr(UsReq), Hdr[0](HdrLen(1),AddrType(0),Attr(0),Tc(0),Type(0)
,Format(2)), Hdr[1]((0x70090f)), Hdr[2]((0xdf50404c)), Hdr[3]((0x1c00)).

Problem Summary:

Device Br[8624](9,0,0) reported seeing the following error(s):
"Unsupported Request (UsReq): Some aspect of a received PCI packet was
unsupported".


A Netapp Engineer told us that the only working solution is to replace all FRU/Cards.

greets

Steffen



Von:toasters-bounces at teaparty.net<mailto:toasters-bounces at teaparty.net> [mailto:toasters-bounces at teaparty.net<mailto:toasters-bounces at teaparty.net>] Im Auftrag von Jayanathan, David
Gesendet: Donnerstag, 3. Januar 2013 00:55
An: Fletcher Cocquyt; Doug Siggins
Cc: netapp-users at mailman.stanford.edu<mailto:netapp-users at mailman.stanford.edu>; toasters at teaparty.net<mailto:toasters at teaparty.net> Lists
Betreff: RE: BURT 519766 panic on production 3270


Interesting that you were advised to replace all HW/cards. We?ve hit this three times in our environment that I know of and all times I was provided with the following information:



Bug Number / Title:

519766 / FAS32xx Uncorrectable Machine Check Error



Problem Summary:

The storage controller suffered an interruption of service due to an uncorrectable machine check error. The source of the error has not been indicated.



Recommended Solution/Workaround:

If this is the first occurrence:

- Update BIOS FAS3200: 5.1.1 or later.

- Update SP firmware to 1.2.3 or later.

- Update Data ONTAP to 8.0.2P4 or later.

- Restart the system and monitor for any repeats.



If this is the second occurrence:

- Replace the motherboard.

- Mark the faulty hardware for RCA under bug 519766.



I have yet to find any references in mail of hitting this bug on any of our systems running 8.0.2P7 or above. Two of the times we hit it were on the same system, so on the 2nd time we upgraded to 8.0.2P4 and performed a motherboard swap. The other time we purely did a failback and opted out of doing a code upgrade.



Thanks,

David


From: toasters-bounces at teaparty.net [mailto:toasters-bounces at teaparty.net] On Behalf Of Fletcher Cocquyt
Sent: Wednesday, January 02, 2013 3:14 PM
To: Doug Siggins
Cc: netapp-users at mailman.stanford.edu; toasters at teaparty.net Lists
Subject: Re: BURT 519766 panic on production 3270

Dec 25 04:19:26 na03.GoCardinal.EDU Dec 25 12:20:49 [na03:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check Error at CPU1. MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on IO Exp

thanks - yes, the fact that Netapp is immediately willing to replace most of our HW suggests they know it's an issue with our current HW.
I'd feel better recommending this plan if they could point to the specific HW issue in the current HW and demonstrate how it's fixed in newer HW revs.
Currently those details are not public/forthcoming...


On Jan 2, 2013, at 3:02 PM, Doug Siggins <DSiggins at ma.maileig.com> wrote:

Fletcher,
What was the panic string? Did you get a core to netapp? Sometimes support is a bit reluctant to investigate further unless you press for a real answer. After 2-3 core dumps with the same type panic string, I start demanding a fix whether it be hardware or software.


Here are two forum posts:

https://forums.netapp.com/thread/33616
https://forums.netapp.com/thread/35456

I had a similar issue on an older filer. It panic'd 2-3 times. Luckily, over time, as we prepare to retire the system, the load has dropped significantly, and I haven't seen the NMI panic for 6+ months. I had suggested we replace the system immediately and migrate off.

I guess the rule of thumb is that if you see the panic more than once, you should definitely think about hardware replacements.
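
One way to apply that rule of thumb is to count panic strings per filer in the collected syslog. The sketch below assumes log lines shaped like the one Fletcher posted ("[node:mgr.stack.string:notice]: Panic string: ..."); the function names and the threshold of two are illustrative assumptions, and other ONTAP versions may log panics differently.

```python
import re
from collections import Counter

# Matches "[na03:mgr.stack.string:notice]: Panic string: <kind> at CPU<n>."
PANIC_LINE = re.compile(
    r"\[(?P<node>[^:\]]+):mgr\.stack\.string:notice\]: "
    r"Panic string: (?P<kind>.+?)(?: at CPU\d+)?\."
)


def count_panics(lines):
    """Count panics per (node, panic kind), ignoring which CPU reported it."""
    counts = Counter()
    for line in lines:
        match = PANIC_LINE.search(line)
        if match:
            counts[(match["node"], match["kind"])] += 1
    return counts


def repeat_offenders(counts, threshold=2):
    """Return the (node, kind) pairs seen at least `threshold` times."""
    return sorted(key for key, seen in counts.items() if seen >= threshold)
```

Feeding it the lines from a central syslog file and printing `repeat_offenders(count_panics(lines))` gives the list of filers that have hit the same type of panic more than once.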









________________________________
From: toasters-bounces at teaparty.net [toasters-bounces at teaparty.net] on behalf of Fletcher Cocquyt [fcocquyt at stanford.edu]
Sent: Wednesday, January 02, 2013 5:21 PM
To: toasters at teaparty.net Lists
Cc: netapp-users at mailman.stanford.edu
Subject: BURT 519766 panic on production 3270
Happy 2013!

One of our production 3270 heads panic'ed and rebooted 3:30 am Dec 25 - lump of coal ?

The good news is, when our system panic'ed and rebooted, the failover performed as expected so we had only a 2 second timeout logged on our ESXi hosts, Oracle - no downtime.

There is scarce public info on this issue and Netapp is recommending options from "do nothing - (its rare and may never happen again)" to "replace motherboards and all cards"
Our 3270 clusters (we have 2 in Active:Standby mode) have been stable since we installed them in Feb 2011.  We are on 8.1GA - Netapp support says the issue is independent of OnTAP version.

Anyone else encountered this issue?
What was your action and outcome?

thanks,

Fletcher Cocquyt
Stanford University School of Medicine


_______________________________________________
Toasters mailing list
Toasters at teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters



BURT 519766 panic on production 3270

Fletcher Cocquyt
In reply to this post by Jayanathan, David
Yes - we are taking the stance of requesting more information.

We'd be more confident recommending the drastic action of replacing all HW if Netapp could point to the specific HW issue(s) in the current HW and demonstrate how it's fixed in newer HW revs.
Currently those details are not public/forthcoming...

Without the full picture, we fear the very considerable effort of the HW replacement plan could be wasted if the issue then recurred. We are asking Netapp to make the technical details available to us to raise confidence in the HW replacement plan.

thanks



BURT 519766 panic on production 3270

Unnikrishnan KP
Hello all,
The two bug IDs and the information NetApp publishes for them are not
useful at all, to say the least:

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=504167

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=519766


Regards,
Unnikrishnan KP



BURT 519766 panic on production 3270

Antonio Varni
It's not a lot more, but there are little clues you can coax out of
support.netapp.com.

Maybe this is already known... you can at least get the burt title, etc.
This can give you additional things to search for, so you can piece together
as much info as possible without having internal netapp support access.
Maybe you can find this elsewhere, so sorry if this adds little to the
conversation:

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=504167
additional info:

504167: FAS32xx Uncorrectable Machine Checks
Content Type: Troubleshooting and Support
Content Sub-Type: Bug
Short Title: 504167: FAS32xx Uncorrectable Machine Checks
Last Updated Date: Fri, 16 Dec 2011 02:29:19 PST
File Type: htm
Description: Uncorrectable Machine Checks may occur on the FAS32xx
platform. These require diagnosis and remediation by Support Engineers
Bug Id: 504167
Date Created: Mon, 09 May 2011 07:24:02 PDT
Keywords: FAS32xx Uncorrectable Machine Checks
Burt Title: Carnegie: PMC-Sierra SAS HBA NMI PCIe panic caused by MfTLB
(Malformed TLP)
Burt Link:
http://burtweb-prd.eng.netapp.com/burt/burt-bin/start?burt-id=504167
Duplicate Of: 519766.0
Burt Patch Release: -
Fixed-In Version: -


_ _
antonio varni
[technology]

Estalea, L.P.
10 E. Figueroa St,.2nd Floor
Santa Barbara, CA 93101
v 805.252.0115
f 805.899.2697
e avarni at estalea.com
w www.estalea.com




BURT 519766 panic on production 3270

Patrick Giagnocavo
I am only a newbie with NetApps, but I have some experience with
rackmount servers, as I have 2 racks' worth of them :)

A machine check exception is generated by the CPU, usually.

This Wikipedia page tells you in general what is going on:
http://en.wikipedia.org/wiki/Machine_Check_Exception

so the 2nd core (CPU1, not CPU0) had a problem (in the original post
on this thread).

The problem was not correctable and seems to have been on the PCI
Express bus (either on the bridge chip itself, or a device connected
to it).

You are not the only person to experience this (found via google):
https://twitter.com/nerdicwalker/status/110360608121167873

They require diagnosis because the error message is not specific
enough to figure out what is going on.

The only times I have seen this in my systems (non-NA) were 1) bad or
slightly incompatible RAM, easily fixed, and 2) a motherboard that was
bad, which I stopped using.  So, there is quite a range as to what can be
going on.
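
To pull the CPU number and the reporting component out of a panic string like the ones posted in this thread, something along these lines may help. It is a sketch that assumes the exact wording seen here ("Uncorrectable Machine Check Error at CPUn. MCn Error: STATUS<0x...>(...); <source>"); the function name and the returned field names are hypothetical.

```python
import re

# Pattern built from the panic strings quoted in this thread.
PANIC_RE = re.compile(
    r"Uncorrectable Machine Check Error at CPU(?P<cpu>\d+)\. "
    r"MC(?P<bank>\d+) Error: STATUS<(?P<status>0x[0-9a-fA-F]+)>"
    r"\([^;]*\); (?P<source>[^.]*)"
)


def parse_panic(text: str):
    """Return the CPU, MC bank, STATUS value and source, or None if no match."""
    match = PANIC_RE.search(text)
    if match is None:
        return None
    cpu = int(match["cpu"])
    return {
        "cpu": cpu,                    # zero-based, as printed
        "core_ordinal": cpu + 1,       # CPU1 is the 2nd core
        "bank": int(match["bank"]),
        "status": int(match["status"], 16),
        "source": match["source"].strip(),
    }


if __name__ == "__main__":
    sample = (
        "Panic string: Uncorrectable Machine Check Error at CPU1. MC5 Error: "
        "STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,"
        "ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on IO Exp"
    )
    print(parse_panic(sample))
```

The "source" field is just the text up to the first period after the semicolon, so it only names the component that reported the error, not necessarily the part that is at fault.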

Hope this helps,

Patrick

PS am looking for FAS250 or so on the cheap for testing / dev work if
anyone has one.


BURT 519766 panic on production 3270

Fletcher Cocquyt
We met with our Netapp team today and received the technical explanation we needed to move forward with the hardware replacement option.

As one reply already mentioned, there is a real hardware issue identified with the 32xx/62xx series, and Netapp is now working to proactively replace the parts with suspect PCM (DRAM), SAS, IOxM chips.

Our clusters operate in active:standby mode, so we won't need downtime or risk a production failover for this fix.

thanks


BURT 519766 panic on production 3270

Unnikrishnan KP
Hello all,
I have brought this up as a NetApp community thread:
https://communities.netapp.com/thread/25901

I hope we can get more information from other users too.

Regards,
Unnikrishnan KP

