ssh timing out (filer high CPU load)

ssh timing out (filer high CPU load)

Edward Rolison
On the off chance - I'm having trouble with a filer. I can't ssh to it reliably (at all, mostly). 

I'm pretty sure that's correlated with high CPU load - my system console has shown it 'spiked' at >95% for the last 24h, which is much higher than 'normal'.

What I'm not sure of is quite what's causing it - the filer is busy, but not abnormally so.

The only thing I can think of that _might_ have changed is API calls (qtree-list, get-file-info) - I've recently started doing quota SNMP trap enrichment (but that's 'every few minutes' at most).

But otherwise - I'm not sure what might be causing sshd to stall, or whether there's a way to 'kick' it.

This is a 7-Mode filer, on 8.2.1.

I've got a case open, but would appreciate any further insight on how to track down this sort of issue, where high CPU load leaves ssh unresponsive.

I'm pretty sure a failover/failback will do the trick, but that'll have to wait until the weekend - I'd like not to if I can manage it.

My current ps list looks like:

Process statistics over 67.328 seconds...

   ID State Domain %CPU StackUsed %StackUsed Name
  195 RR    N       47%      6928        10% NwkThd_00
  196 RR    N       47%      7880        12% NwkThd_01
  197 RR    0       47%      6928        10% NwkThd_02
  223 BR    s        7%      7648        46% pmcsas_intrd_1
  259 BR    e        5%      2440        19% fal_io_thread2
  502 BR    R        7%      7448        45% raidio_thread
  503 BR    R        7%      7448        45% raidio_thread
  635 BG    k        6%     15184        11% snmpd
 1614 BR    0        5%      3464        10% ntm_main
 1711 RR    w       35%     14256        21% wafl_exempt00
 1712 BR    w       35%     14136        21% wafl_exempt01
 1713 BR    w       35%     14136        21% wafl_exempt02
 2599 BR    k        5%      2752         8% gr_scheduler


That seems pretty busy for a 4cpu system...
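
As a rough sanity check on the listing above, the per-thread figures can be totalled and grouped. This is only a minimal Python sketch, not a NetApp tool. It assumes each %CPU value is a share of a single CPU (so four CPUs would give a 400% budget), which the listing itself does not confirm, and the names `parse_ps`, `group_key` and `cpu_by_group` are made up for illustration.

```python
import re
from collections import defaultdict

# Rows copied from the ps listing in this message.
PS_OUTPUT = """\
  195 RR    N       47%      6928        10% NwkThd_00
  196 RR    N       47%      7880        12% NwkThd_01
  197 RR    0       47%      6928        10% NwkThd_02
  223 BR    s        7%      7648        46% pmcsas_intrd_1
  259 BR    e        5%      2440        19% fal_io_thread2
  502 BR    R        7%      7448        45% raidio_thread
  503 BR    R        7%      7448        45% raidio_thread
  635 BG    k        6%     15184        11% snmpd
 1614 BR    0        5%      3464        10% ntm_main
 1711 RR    w       35%     14256        21% wafl_exempt00
 1712 BR    w       35%     14136        21% wafl_exempt01
 1713 BR    w       35%     14136        21% wafl_exempt02
 2599 BR    k        5%      2752         8% gr_scheduler
"""

NUM_CPUS = 4  # from "4cpu system"; the 400% budget is an assumption

# ID, State, Domain, %CPU, StackUsed, %StackUsed, Name
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)%\s+(\d+)\s+(\d+)%\s+(\S+)\s*$"
)


def parse_ps(text):
    """Turn each data row into a dict; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({"id": int(m.group(1)), "cpu": int(m.group(4)),
                         "name": m.group(7)})
    return rows


def group_key(name):
    """Strip a trailing index such as '_00' or '2' so sibling threads group together."""
    return re.sub(r"[_\d]+$", "", name)


def cpu_by_group(rows):
    """Sum %CPU per thread family."""
    totals = defaultdict(int)
    for row in rows:
        totals[group_key(row["name"])] += row["cpu"]
    return dict(totals)


rows = parse_ps(PS_OUTPUT)
total = sum(row["cpu"] for row in rows)
by_group = cpu_by_group(rows)

print(f"{len(rows)} threads listed, {total}% of an assumed {NUM_CPUS * 100}%")
for name, cpu in sorted(by_group.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cpu:4d}%  {name}")
```

On these figures the 13 listed threads add up to 288%, of which the NwkThd and wafl_exempt threads account for 141% and 105% respectively, while snmpd itself shows 6%.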



Thanks and regards,
Ed. 

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: ssh timing out (filer high CPU load)

Douglas Siggins-2
The first thing that caught my eye was snmpd. Any chance you set up new SNMP polling from monitoring stations that is querying the disks over and over? If you can, turn off SNMP for a short while to see if the load goes away.

On Wed, Aug 10, 2016 at 7:51 AM, Edward Rolison <[hidden email]> wrote:

Re: ssh timing out (filer high CPU load)

Edward Rolison
Thanks for the response. Yes, we're polling with Zabbix (and generating SNMP traps).

So I'll shut those down for a while, and see if that helps. 

On 10 August 2016 at 16:33, Douglas Siggins <[hidden email]> wrote:

Re: ssh timing out (filer high CPU load)

Douglas Siggins-2
Yep, it was Zabbix for me as well. It killed the CPU on all my filers. You will have to go through and remove a bunch of checks.
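
One way to decide which checks to prune first is to rank them by how many SNMP values they pull per hour. The sketch below is only an illustration in Python: the check names, intervals and OID counts are entirely hypothetical (they are not taken from any real Zabbix template or from this filer), and `Check` and `requests_per_hour` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    interval_s: int  # seconds between polls
    oids: int        # values fetched per poll, e.g. one per disk


def requests_per_hour(check):
    """Values requested from the filer per hour by one check."""
    return check.oids * 3600 / check.interval_s


# Hypothetical inventory; replace with the real item list from the monitoring system.
CHECKS = [
    Check("disk status (per disk)", interval_s=60, oids=96),
    Check("volume usage (per volume)", interval_s=300, oids=40),
    Check("cpu busy", interval_s=60, oids=1),
]

ranked = sorted(CHECKS, key=requests_per_hour, reverse=True)
for check in ranked:
    print(f"{requests_per_hour(check):8.0f}/h  {check.name}")
```

With numbers like these, a per-disk check polled every minute dominates everything else, which is the kind of check worth removing or slowing down first.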


On Wed, Aug 10, 2016 at 12:26 PM, Edward Rolison <[hidden email]> wrote:

Re: ssh timing out (filer high CPU load)

Edward Rolison
With Zabbix off all night, we've got as far as picking up a possible bug with 'sshd' - the login is actually 'going', in that it's connecting and doing key exchange; it's just not getting as far as the 'shell' login.

(And on the filer, I get 'connection timed out' messages). 
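
To separate "the listener is not answering at all" from "the session stalls later", a quick probe can check whether the server still sends its SSH identification line. This is a minimal Python sketch using only the standard library. It covers just the first step of the connection (TCP connect plus banner), not key exchange or the shell login, and `probe_ssh_banner` is a made-up helper name.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification line, or None if it never arrives.

    Simplification: only the first line sent by the server is read, which is
    assumed to be the identification string.
    """
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Read until end of line; the identification string is at most 255 bytes.
            while not data.endswith(b"\n") and len(data) < 255:
                chunk = sock.recv(255 - len(data))
                if not chunk:
                    break
                data += chunk
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return None
    line = data.decode("ascii", errors="replace").strip()
    return line if line.startswith("SSH-") else None


# Example call (hostname is a placeholder):
# print(probe_ssh_banner("filer01.example.com"))
```

If the banner comes back promptly but logins still hang, running the client with `ssh -vvv` shows how far the handshake gets before it stalls.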

I am still unsure quite why - rshstat/rshkill cleared out some stale processes, but I think they were more like symptom than cause.

Our next line is 'reboot it', which'll have to wait until an outage window. 

Don't suppose anyone has any handy tricks for 'force kill' on sshd on a filer? (I've gone as far as firing up systemshell, but 'sshd' doesn't seem to respond to kill signals). 


On 10 August 2016 at 18:04, Douglas Siggins <[hidden email]> wrote:

Re: ssh timing out (filer high CPU load)

Douglas Siggins-2
Again similar issue. Pretty sure I did a kill -9, but not positive. I believe my issue was similar to this:



On Thu, Aug 11, 2016 at 7:43 AM, Edward Rolison <[hidden email]> wrote: