OnCommand CPU report question for 2.x OPM


OnCommand CPU report question for 2.x OPM

Klise, Steve-2
In former versions of DFM (or whatever it's called now), I was able to chart individual CPUs in Performance Manager.  Now I only see an average across all processors with 2.x OnCommand Performance Manager.  I know I can drop down to the console and run a sysstat -whatever... but I was wondering if I was missing something that would let OnCommand Performance Manager see or graph individual CPUs.  Is this an option, and if so, how would I enable and set it up?  I am running the LOD and can't see how this can be done... If it's not available, it would be of value to some of my customers.

Steve



_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: OnCommand CPU report question for 2.x OPM

Christopher S Eno
Hi Steve,

I ended up having to follow NetApp’s instructions on sending OCPM data to Graphite (and later to Grafana via the NetApp Harvest tool), then graphing the individual CPU counters.  OCPM just tracks “utilization” of the node, which is just not helpful.
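For anyone wanting to try the same route, here is a minimal sketch of shipping one counter sample to Graphite over Carbon's plaintext protocol (`<path> <value> <timestamp>` per line). The host, default port, and the metric path in the usage note are assumptions for illustration, not Harvest's actual schema:

```python
import socket
import time

def format_metric(path: str, value: float, timestamp: int) -> str:
    # Carbon plaintext protocol: one "<path> <value> <timestamp>" line per sample
    return f"{path} {value} {timestamp}\n"

def send_metric(path: str, value: float,
                host: str = "graphite.example.com", port: int = 2003) -> None:
    # Open a TCP connection to the Carbon listener and send one sample
    line = format_metric(path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Usage would be something like `send_metric("netapp.cluster1.node1.cpu0.busy", 42.5)` (a hypothetical metric path).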



> On Apr 21, 2016, at 11:28 AM, Klise, Steve <[hidden email]> wrote:
> [...]



Re: OnCommand CPU report question for 2.x OPM

Flores, Paul
Utilization is supposed to include CPU load, kahuna load and disk busy %
as factors in its value.

In theory, it should make a better metric for node utilization than CPU
cores by themselves.  While I recognize that is not really helpful if you
are trying to do a deep dive into 'why' things are happening on a given
node, OPM is supposed to make it easy for you to alarm on 'where the
performance of the node is going', without having to write "if cpu > x and
kahuna > y and disk busy > z then page me that the controller is melting".
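As a toy illustration of the point, here is the hand-rolled multi-counter alarm a composite utilization metric spares you from writing; every threshold and name below is invented for the example:

```python
# Toy illustration only: thresholds and counter names are invented.
def should_page(cpu: float, kahuna: float, disk_busy: float,
                cpu_max: float = 90.0, kahuna_max: float = 50.0,
                disk_max: float = 85.0) -> bool:
    # The hand-rolled alarm: page only when every factor crosses its threshold
    return cpu > cpu_max and kahuna > kahuna_max and disk_busy > disk_max

def should_page_on_utilization(node_utilization: float, limit: float = 80.0) -> bool:
    # What a single composite utilization metric reduces the check to
    return node_utilization > limit
```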


Paul Flores
Professional Services Consultant 3
Americas Performance Assessments Team
NetApp
281-857-6981 Direct Phone
713-446-5219 Mobile Phone
[hidden email]

http://www.netapp.com/us/media/ds-3444.pdf
<http://www.netapp.com/us/solutions/professional/assessment.html>




On 4/21/16, 10:33 AM, "[hidden email] on behalf of Scott Eno" <[hidden email]> wrote:

>[...]



Re: OnCommand CPU report question for 2.x OPM

Christopher S Eno
Don’t know if we’re allowed to attach images, but I’ll try.  If you can see the attached image, you’ll see that marrying OCPM -> NetApp Harvest -> Grafana gets you a really nice breakdown of these processes.

Sadly no alerting, just monitoring.




On Apr 21, 2016, at 1:07 PM, Flores, Paul <[hidden email]> wrote:

[...]

Re: OnCommand CPU report question for 2.x OPM

Jeffrey Mohler

There's nothing really to alert from here.

Use their utilization metric.


On Thu, Apr 21, 2016 at 10:13, Scott Eno <[hidden email]> wrote:

[...]

Re: OnCommand CPU report question for 2.x OPM

Flores, Paul
In reply to this post by Christopher S Eno
Those are good if you want to know more about _why_ your utilization metric is high, but looking at them on their own is only part of the story.  Nothing you see by looking at the domains is going to help you _more_ than just monitoring utilization, because utilization _includes_ the only domain that could cause you pain by being over utilized. 

PF

From: Scott Eno <[hidden email]>
Date: Thursday, April 21, 2016 at 12:13 PM
To: Paul Flores <[hidden email]>
Cc: "NGC-steve.klise-wwt.com" <[hidden email]>, Toasters <[hidden email]>
Subject: Re: OnCommand CPU report question for 2.x OPM

[...]

Re: OnCommand CPU report question for 2.x OPM

Michael Bergman
If by "monitoring utilisation" Paul means this PC:

system:system:cpu_busy

(N.B. the formula that calculates this inside the Counter Mgr changed in
8.2.1 both 7- & c-mode)

...then yes, it includes the highest-utilised *single threaded* kernel
domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads)
and hostOS. That's for a recent/modern ONTAP; don't trust this if you're
still on some old version!

The formula for calculating it is like this:

MAX(system:system:average_processor_busy,
    MAX(util_of(s-threaded domain1, s-threaded domain2, ... domain10)))

and it has been since 8.2.1, and still is in all 8.3 releases to date.



There are 10 of these s-threaded domains. You can see them in statit output.
The multi-threaded ones are not counted here, in other words, but those can
give you problems too. Not just wafl_exempt, which is where WAFL mostly
executes (hopefully!) (it's sometimes called parallel-Kahuna).

The domain named Kahuna in statit output is the only one included in the
new Node Utilisation metric, which also includes something called B2B CP
margin. s-Kahuna is the most dominant source of overload in this "domain"
area; that said, I've had systems suffer from overload of other
single-threaded domains too. And multi-threaded ones as well: there have
been ONTAP bugs causing nwk_exempt to over-utilise (that was *not*
pleasant and hard to find). Under normal circumstances this would be
really rare.

The formula for the new Node Utilisation metric is basically like this:

system:system:node_util =
         MAX(system:system:avg_processor_busy,
             100-system:system:b2b_cp_margin,
             MAX(single threaded domain{1,2,3,...} utilization))
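Read literally, the two formulas in this post could be sketched as follows. This is a sketch only: the names follow the thread's notation, and `b2b_cp_margin` in particular is the poster's own notation rather than a confirmed Counter Manager counter:

```python
# Sketch of the two formulas above, assuming the counter values (0-100 %)
# have already been collected; names mirror this thread's notation and
# are not necessarily real Counter Manager counter names.
def cpu_busy(avg_processor_busy: float, serial_domain_utils: list[float]) -> float:
    # system:system:cpu_busy since ONTAP 8.2.1, per the post above
    return max(avg_processor_busy, max(serial_domain_utils))

def node_util(avg_processor_busy: float, b2b_cp_margin: float,
              serial_domain_utils: list[float]) -> float:
    # The newer Node Utilisation metric: adds back-to-back CP headroom
    return max(avg_processor_busy,
               100.0 - b2b_cp_margin,
               max(serial_domain_utils))
```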


The main reason for avoiding system:system:cpu_busy here is that it's been
so plagued over the years by showing the wrong (= not interesting!) thing
that misunderstandings have been abundant, and the controversy around it
just never seems to end.

Anyway.
'Node Utilisation' aims to calculate a ballpark estimate of how much "head
room" is left until the system gets into B2B CP "too much" (the odd one is
OK, and most often not noticeable by the application/users).  To do the
calculation, you need to know the utilisation of the disks in the RAID
groups inside the system -- something which isn't that easy to do.
There's no single PC in the CM (Counter Mgr) which will give you the
equivalent of what sysstat -x calls "Disk Util" -- that column shows the
most utilised drive in the whole system for each iteration. In other
words, it can be a different drive on each iteration of sysstat (which is
quite OK).

For scripting and doing things yourself, you pretty much have to extract
*all* the util counters from *all* the drives in the system and then
post-process them all. In a big system with many hundreds of disks, this
becomes quite cumbersome.
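The post-processing step described above amounts to something like this; the drive names and the flat name-to-percentage layout are illustrative, not actual Counter Manager output:

```python
# Hypothetical post-processing of per-drive utilisation samples, as the
# paragraph above describes; drive names and values are illustrative.
def worst_disk(disk_utils: dict[str, float]) -> tuple[str, float]:
    # Mimic sysstat -x "Disk Util": report the busiest drive this interval
    name = max(disk_utils, key=disk_utils.get)
    return name, disk_utils[name]
```

Run each interval, this can (as the post notes for sysstat) pick a different drive every time.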

However, the utilisation of a drive is not as obvious a metric as one may
think. It seems simple; it's measured internally by the system at a 1 kHz
rate -- is there a command on the disk or not?

But there is a caveat... apparently (as Paul Flores informed me recently)
the "yes" answer to the question is actually "is there data going in/out
of the disk right now?"  Meaning that if a drive is *really* *really*
busy, so d**n busy that it spends a lot of time seeking, then the util
metric will actually "lie" to you: it will go down even though the disk is
busier than ever.  I'm thinking this probably doesn't matter much IRL,
because it's really only slow (7.2K rpm, large) drives that could ever get
into this state -- and if your system ends up in this state you're in deep
s**t anyway, and there's no remedy except a reboot or killing *all* the
workload generators to release the I/O pressure.

Think of it as a motorway completely jammed up: no car can move anywhere.
How do you "fix" it?  A: close *all* the entrance ramps and just wait. It
will drain out after a while.

Hope this was helpful to people. Sorry for the length of this text, but
these things are quite complex and I don't want to add to confusion or
cause misunderstandings more than absolutely necessary.

/M

On 2016-04-21 19:47, Flores, Paul wrote:

> [...]

Re: OnCommand CPU report question for 2.x OPM

Michael Bergman
Ok, so two things (comments).

1.
I believe Paul meant the new metric 'Node Utilisation' in his reply.
N.B. there's no PC in the CM or anything like that for it; it exists only
inside OCPM.

Since it's actually currently defined like this (I *think*):

system:system:node_util =
    MAX(system:system:avg_processor_busy,  # Normalized to 85
    100-system:system:b2b_cp_margin,
    <Kahuna utilisation>)                  # Normalized to 50

what Paul wrote makes sense:

> [...] because utilization _includes_ the only domain that could cause
> you pain by being over utilized.

2.
Pls note!  There's no Performance Counter in the CM called
system:system:b2b_cp_margin or
system:system:node_util.

It's just a notation I used to make it clear and stringent. I think there
probably *should* be such PCs, in the future!


My general view is that Kahuna isn't the only serial domain that can cause
you pain by being over-utilised. It's uncommon, rare even, for any of the
other 9 to bottleneck a system, but it can (and has) happened.  And, as I
wrote before, you can get hurt by over-utilised multi-threaded domains
too.  Again, it's not that common, though personally I think it would make
a lot of sense to include at least a few of those domains in the overall
formula for 'Node Util' as well.  R&D efforts are ongoing, I'm sure :-)

The main argument for Kahuna being so dominant in causing trouble is heavy
CIFS workload: SMB operations which have to be serialised, and are done a
lot... :-(

That said:  my very humble opinion is that since ONTAP 8.2.1,
system:system:cpu_busy actually isn't that bad at all. If you know what it
shows and how it's calculated, it tells you stuff about the utilisation of
one or other of the 10 serial domains inside the system. Point being: it
may not be Kahuna (even if it most often is).  I've watched our systems
for long periods of time, looking at the difference between these two in
parallel:

system:system:cpu_busy
system:system:avg_processor_busy

while at the same time running sysstat -M.  Conclusion: it's not at all
always Kahuna that makes the former go up now and then.  It's been a bit
of a mystery at times, as I've had trouble matching things up so that I
can tell which of the 10 single-threaded domains is causing cpu_busy to
increase during some measurement intervals.  I need to do more with this;
the data shown by sysstat -M is in the CM as PCs as well, so it's better
to use 'stats show' in the node shell to look at it.

Hope this helps,
/M


On 2016-04-21 21:11, Michael Bergman wrote:

> [...]

Re: OnCommand CPU report question for 2.x OPM

Klise, Steve-2
Thank you everyone for the great info. Learn something new every day!
Steve




On 4/21/16, 1:27 PM, "[hidden email] on behalf of Michael Bergman" <[hidden email]> wrote:

>[...]


Re: OnCommand CPU report question for 2.x OPM

Jeffrey Mohler
In reply to this post by Michael Bergman
"CPU" isn't a plagued reading...

It's just irrelevant.

People work VERY VERY hard (including here) to make ONTAP look like a
Linux box, with smoothly threading (to infinity) processes that get things
done evenly across all resources.

Utopia.

It's not... no matter how hard people try.

Now: being SO visible, and SO informatically driven via sysstat,
sysstat -M, and a multitude of other ways to "view" it, one could come to
the conclusion that it _MEANS_ something... it HAS to...

But really...

_________________________________
[hidden email]
Tech Yahoo, Storage Architect, Principal
Twitter: @PrincipalYahoo
CorpIM:  Hipchat & Iris



On Thursday, April 21, 2016 12:11 PM, Michael Bergman <[hidden email]> wrote:


If by "montoring utilisation" Paul means this PC:

system:system:cpu_busy

(N.B. the formula that calculates this inside the Counter Mgr changed in
8.2.1 both 7- & c-mode)

...then yes, it includes the highest utilised *single threaded* kernel
domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads),
hostOS. For a recent/modern ONTAP that is, don't trust this if you're still
on some old version!

The formula for calculating it is like this:

MAX(system:system:average_processor_busy,
    MAX(util_of(s-threaded domain1, s-threaded domain2,... domain10))

and it has been since 8.2.1 and still is in all 8.3 rels to this date.



There are 10 of these s-threaded domains. You can see them in statit output.
The multi-threaded ones are not counted here in other words, but those can
give you problems too. Not just wafl_exempt, which is where WAFL executes
mostly (hopefully!) (it's sometimes called parallel-Kahuna).

The domain named Kahuna in statit output, is the only one included in the
new Node Utilisation metric, which also includes something called B2B CP
margin. s-Kahuna is the most dominant source of overload in this "domain"
area, that said I've had systems suffer from overload of other single
threaded domains too.  And multi-threaded ones as well, there have been
ONTAP bugs causing nwk_exempt to over utilise (that was *not* pleasant and
hard to find).  Under normal circumstances this would be really rare

The formula for the new Node Utilisation metric is basically like this:

system:system:node_util =
        MAX(system:system:avg_processor_busy,
            100-system:system:b2b_cp_margin,
            MAX(single threaded domain{1,2,3,...} utilization))


The main reason for avoiding system:system:cpu_busy here, is that it's been
so plagued over the years by showing the wrong (= not interesting!) thing
that misunderstandings have been abundant and controversy just never seems
to end around it

Anyway.
'Node Utilisation' aims to calculate, a ballpark estimate, how much "head
room" there's left until the system will get into B2B CP "too much"  (not
the odd one, that's OK and most often not noticeable by the application
/users).  To do the calculation, you need to know the utilisation of the
disks in the Raid Groups inside the system -- something which isn't that
easy to do.  There's no single PC in the CM (Counter Mgr) which will give
you the equiv of what sysstat -x calls "Disk Util" -- that col will show the
most utilised drive in the whole system for each iteration. I.o.w. it can be
a different drive each iteration of sysstat (which is quite OK).

For scripting and doing things yourself, you pretty much have to extract
*all* the util counters from *all* the drives in the system and then post
process them all. In a big system, with many 100 of disks, this becomes
quite cumbersome

However, the utilisation of a drive is not as obvious a metric as one may
think. It seems simple; it's measured internally by the system at a 1 KHz
rate -- is there a command on the disk or not?

But there is a caveat... apparently (as Paul Flores informed me recently)
the "yes" answer to the Q is actually "is there data going in/out of the
disk right now?"  Meaning that if a drive is *really* *really* busy, so d**n
busy that it spends a lot of time seeking, then the util metric will
actually "lie" to you.  It will go down even if the disk is busier than
ever.  I'm thinking that this probably doesn't matter much IRL, because it's
literally only slow (7.2K rpm, large) drives which could ever get into this
state -- and if your system would end up in this state you're in deep s**t
anyway and there's no remedy except a reboot or kill *all* the workload
generators to release the I/O pressure

Think of it as a motorway completely jammed up, no car can move anywhere.
How do you "fix" it?  A: close *all* the entrance ramps, and just wait. It
will drain out after a while

Hope this was helpful to ppl, sorry for the length of this text but these
things are quite complex and I don't want to add to confusion or cause
misunderstandings more than absolutely necessary

/M

On 2016-04-21 19:47, Flores, Paul wrote:

> Those are good if you want to know more about _why_ your utilization metric
> is high, but looking at them on their own is only part of the story. Nothing
> you see by looking at the domains is going to help you _more_ than just
> monitoring utilization, because utilization _includes_ the only domain that
> could cause you pain by being over utilized.
>
> PF
>
> From: Scott Eno <[hidden email]>
> Date: Thursday, April 21, 2016 at 12:13 PM
> To: Paul Flores <[hidden email]>
> Cc: "NGC-steve.klise-wwt.com" <[hidden email]>, Toasters <[hidden email]>

> Subject: Re: OnCommand CPU report question for 2.x OPM
>
> Don’t know if we’re allowed to attach images, but I’ll try. If you can see
> the attached image, you see that marrying OCPM -> NetApp Harvest -> Grafana
> you get a really nice breakdown of these processes.
>
> Sadly no alerting, just monitoring.
>
_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters




Re: OnCommand CPU report question for 2.x OPM

Flores, Paul
In reply to this post by Michael Bergman
Thanks, I forgot about the b2b CP measure as well. :)

The reasons that the node utilization metric works off of Kahuna and not
off of Kahu (the other serialized domain) are varied and subject to
another long-winded discussion, but if you look at how things are handled
in bento, you will see that Kahuna affects the ENTIRE system's ability to
do work, and will override Kahu (parallelized serial work down in the
lower affinities).

As more user workload migrates into the lower affinities (volume/aggr and
the like), Kahuna usage becomes less of a potential workload bottleneck.
However, the possibility exists, if we get some kind of bug that drops
things into that processing domain, that it can and will pre-empt
everything going on beneath it.

So, for example, in 8.2, since we are still doing some CIFS things in
Kahuna, it's possible for a small CIFS workload to pre-empt a busier NFS
workload, due to the amount of serial processing being demanded. It would
NOT be possible for that busier NFS 'serialized' workload to cause the CIFS
metadata-type stuff happening in Kahuna to slow down, since Kahuna and
Kahu are mutually exclusive execution-wise, and Kahuna has priority over
Kahu.

Going to 8.3, this is not so much of a problem, but the fact remains: the
architecture of the software is such that Kahuna is a high-priority
workload domain, so anything dropping into it has the potential to disrupt
work going on in other parts of the system, i.e. it represents a potential
bottleneck to performance, and is an important thing to track when you
want to represent Node Utilization.

A lot of people are listening to how folks want an 'easy' button for
headroom. It's going to continue to get better as ONTAP gets a handle on
spreading the user workload more evenly across more cores, IMHO.

PF

On 4/21/16, 3:27 PM, "[hidden email] on behalf of Michael
Bergman" <[hidden email] on behalf of
[hidden email]> wrote:

>Ok so two things (comments).
>
>1.
>I believe Paul meant the new metric 'Node Utilisation' in his reply.
>N.B. there's no PC in the CM or anything like that for it; it's only
>inside OCPM.
>
>Since it's actually currently defined like this (I *think*):
>
>system:system:node_util =
>    MAX(system:system:avg_processor_busy,  # Normalized to 85
>    100-system:system:b2b_cp_margin,
>    <Kahuna utilisation>)                  # Normalized to 50
>
>what Paul wrote makes sense:
>
>> [...] because utilization _includes_ the only domain that could cause
>> you pain by being over utilized.
>
>2.
>Pls note!  There's no Performance Counter in the CM called either
>system:system:b2b_cp_margin or
>system:system:node_util.
>
>It's just a notation I used to make it clear and stringent. I think there
>probably *should* be such PCs, in the future!
>
>
>My general view is that Kahuna isn't the only serial domain that can cause
>you pain by being over utilised. It's rare, rather than common, for any of
>the other 9 to bottleneck a system, but it can (and has) happened.  And, as
>I wrote before, you can get hurt by over-utilised multi-threaded domains
>too.  Again, it's not that common, though personally I think it would
>make a lot of sense to include at least a few of those domains in the
>overall formula for 'Node Util' as well.  R&D efforts are ongoing, I'm sure
>:-)
>
>The main argument about Kahuna being so dominant in causing trouble is
>heavy
>CIFS workload. SMB operations which have to be serialised, and are done a
>lot... :-(
>
>That said: my very humble opinion is that since ONTAP 8.2.1,
>system:system:cpu_busy actually isn't that bad at all. If you know what it
>shows, and how it's calculated, it tells you about the utilisation of one
>or other of the 10 serial domains inside the system. Point being: it may
>not be Kahuna (even if it most often is).  I've watched our systems for long
>periods of time, looking at the difference between these two in parallel:
>
>system:system:cpu_busy
>system:system:avg_processor_busy
>
>while at the same time running sysstat -M.  Conclusion: it's not at all
>always Kahuna that makes the former go up now and then.  It's been a bit of
>a mystery at times, as I've had trouble matching the data up so that I can
>tell which of the 10 single-threaded domains is causing cpu_busy to increase
>during some measurement intervals.  I need to do more with this; the data
>shown by sysstat -M is in the CM as performance counters as well, so it's
>better to use 'stats show' in the node shell to look at it.
>
>Hope this helps,
>/M
>
>
>On 2016-04-21 21:11, Michael Bergman wrote:
>> If by "monitoring utilisation" Paul means this PC:
>>
>> system:system:cpu_busy
>>
>> (N.B. the formula that calculates this inside the Counter Mgr changed in
>> 8.2.1 both 7- & c-mode)
>>
>> ...then yes, it includes the highest utilised *single threaded* kernel
>> domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads),
>> hostOS. For a recent/modern ONTAP, that is; don't trust this if you're still
>> on some old version!
>>
>> The formula for calculating it is like this:
>>
>> MAX(system:system:avg_processor_busy,
>>     MAX(util_of(domain1), util_of(domain2), ... util_of(domain10)))
>>
>> and it has been since 8.2.1 and still is in all 8.3 rels to this date.
>> [...]
>
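If it helps to make the quoted node_util definition concrete, here's a toy reconstruction in Python. The normalisation targets (85 for avg_processor_busy, 50 for Kahuna) are taken straight from the formula above, but how OCPM actually applies them (here: scale so that hitting the target reads as 100) is my assumption, not OCPM's real code:

```python
# Toy reconstruction of the node_util definition quoted above.
# The scaling convention is an assumption on my part.

def node_util(avg_processor_busy, b2b_cp_margin, kahuna_util):
    """All inputs and the result are percentages (0-100-ish)."""
    return max(
        avg_processor_busy * 100.0 / 85,  # "Normalized to 85"
        100 - b2b_cp_margin,              # back-to-back CP headroom used up
        kahuna_util * 100.0 / 50,         # "Normalized to 50"
    )

# e.g. a node at 68% average CPU, 30% b2b CP margin, 20% Kahuna:
print(node_util(68, 30, 20))  # -> 80.0, driven by avg_processor_busy
```

The point of the MAX is that whichever of the three -- average CPU, b2b CP headroom, or the Kahuna serial domain -- is closest to its ceiling becomes the reported node utilisation.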

