super secret flags

Peter D. Gray
Hi people

Just out of idle curiosity, am I the only NetApp admin who does
not know about the super secret flags that allow snapmirror to actually
work at a reasonable speed?

We were running 8.3.2 cluster mode, and spent weeks looking into why our
snapmirrors to our remote site ran so slowly. We were often 2 days behind
over 40G networks. Obviously, we focussed on network issues, and we wasted
a lot of time. We could make no sense of the problem at all, since
sometimes transfers appeared to work OK, then later slowed to a crawl.

We eventually opened a case, and it did not take too long to get a reply that
basically said "why don't you just disable the global snapmirror throttle."
I had already looked for such a beast, but found nothing.

As you may or may not know, it turns out to be a per-node setting. The name of the
flag is repl_throttle_enable. Of course, you can only see such flags or change
them on the node, in privileged mode.

Setting the flag to 0 immediately (and I do mean immediately) allowed our
snapmirrors to run at the speed you might expect over 40G. Instead of
taking 2 days, snapmirror updates now took 2 hours.
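
For the record, the procedure looks roughly like this (from memory, so
double-check it; this is a diag-level, per-node flag that really should
only be touched with support's blessing):

    ::> node run -node <node-name>
    node> priv set diag
    node*> printflag repl_throttle_enable    # show the current value
    node*> setflag repl_throttle_enable 0    # 0 = throttle off, 1 = on

Repeat on every node doing replication; the setting is per node.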

We have since upgraded to 9.1.  The flags reverted to on, but again can be
set to off. I think there is a documented global snapmirror throttle
option in 9.1, but I have not looked into that yet.
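
For anyone who wants to chase that: the ONTAP 9 docs describe cluster-wide
replication throttle options along these lines (names as documented for
9.x, so verify against your release before relying on them):

    ::> options replication.throttle.enable on
    ::> options replication.throttle.outgoing.max_kbs 500000
    ::> options replication.throttle.incoming.max_kbs 500000

That is a documented bandwidth cap, which may or may not interact with the
diag flag above.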

Are we the only site in the world to have seen this issue?
We use snapmirror DR for all our mirrors, which may be a factor.

As I said, just idle curiosity and maybe helping someone avoid the
time wasting we had.

Regards,
pdg

Peter Gray                                    Ph (direct): +61 2 4221 3770
Information Management & Technology Services  Ph (switch): +61 2 4221 3555
University of Wollongong                      Fax: +61 2 4229 1958
Wollongong NSW 2522                           Email: [hidden email]
Australia                                     URL: http://pdg.uow.edu.au

RE: super secret flags

Steiner, Jeffrey
I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.

On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network or the latency resulting from a network with a problem, such as packet loss. I'd imagine it could also be caused by latency from an overloaded destination SATA aggregate, and it's not out of the question that something newer like 40Gb Ethernet might create some odd issue that warrants setting this flag.

In normal practice, you shouldn't need to touch this parameter. I've been around a long time, I'd never heard of it before now, and I've never needed it in any of my lab setups, even though I rely on SnapMirror heavily.

The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem, or creating new problems.

You might consider continuing to follow up on the case to ensure that either (a) you're in an odd situation where this parameter really is warranted or (b) there is some kind of underlying problem that needs fixing. If you're otherwise happy with the way the system is performing and the parameter change worked, I'd probably call it good...
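
While you're following up, it's easy to keep an eye on how far behind the
mirrors actually are from the cluster shell (standard commands, nothing
privileged):

    ::> snapmirror show -fields state,status,lag-time

If lag-time shrinks after a change, you have your answer.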


RE: super secret flags

Clendening, William D
We routinely have to turn off the throttle to get reasonable throughput on SnapMirrors and vol moves.  I know of other customers that leave it off permanently.  We try to leave it at the default.

My understanding is that with throttling off, you can potentially impact the performance of client IO.  However, we've never had any complaints.
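
If you want to watch for that client-side impact while experimenting,
something along these lines works from the cluster shell (a sketch; column
names vary a bit by release):

    ::> qos statistics volume latency show -iterations 30

Run it before and after flipping the throttle and compare the per-volume
latency numbers.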

Doug Clendening

RE: super secret flags

Steiner, Jeffrey
Thanks for the note. I'm working on a number of DR-related projects that rely on SnapMirror. I'll be looking into this more closely. If I find out anything useful, I'll relay it. If this is something that everyone will need to adjust, so be it, but it still seems odd to me that it would be required outside fringe cases.


Re: super secret flags

Ehrenwald, Ian
In reply to this post by Peter D. Gray
It's funny you mention this - I had a support case open not more than a
couple of weeks ago regarding exactly the same thing.

I have a few fairly large volumes (20t, 40t) that are consistently lagged
in their SM replication to our DR site by over a week.  The primary and DR
sites are connected via a 5Gb/s link, and we've been able to fill that pipe
in the past.  The aggregate that holds these two volumes is made of 4 x
DS2246 on both the source and destination side.  The destination aggregate
is mostly idle; the source aggregate sees 20K+ IOPS 24/7/365.  I
also have an aggregate made of 8 x DS2246 that's pretty busy all the
time too, and volumes on that aggregate replicate to an identical
aggregate at the DR site and are never lagged.

The support engineer I was working with did mention that we could disable
this global throttle, though it may have an impact on client latency, so I
didn't do it.

The best idea we could come up with is that the source side aggregate with
the lagged SM volumes, and the node that owns it (a FAS8060), might be
IOPS and CPU bound and we could consider adding more shelves to this
aggregate, running a reallocate to spread the blocks around, and seeing if
that helps.
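
For reference, the measure-then-spread step would look something like this
from the cluster shell (a sketch; I'd read the reallocation docs for your
release before running it against a busy aggregate):

    ::> volume reallocation measure -vserver <svm> -path /vol/<volname>
    ::> volume reallocation show
    ::> volume reallocation start -vserver <svm> -path /vol/<volname>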

It's not really in my budget at the moment to purchase four more DS2246
shelves with 24 x 1.2t each (2 for primary, 2 for DR), so this has rekindled my
interest in trying this global throttle flag on a weekend, where if IO bogs
down nobody will complain (too much) :)



--
Ian Ehrenwald
Senior Infrastructure Engineer
Hachette Book Group, Inc.
1.617.263.1948 / [hidden email]

Re: super secret flags

Ehrenwald, Ian
OK so curiosity got the better of me :)  I just disabled this internal
throttle and the lagged SnapMirrors went from about 150Mbit/s to 2.7Gbit/s
according to our network monitoring tools.  CPU utilization on the node
that owns the disks has definitely increased, sometimes to the tune of 93%
or higher, and latency across all volumes has ticked up by a small but
measurable amount.  Disk utilization % as measured by sysstat is still
within a reasonable range.  I do see a lot more CP activity, mostly :s :n
and :f.

I understand why this throttle is enabled by default, and I would not
leave it disabled permanently because of HA CPU headroom concerns, but the
increase in SM throughput is unbelievable.
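
For anyone wanting to watch the same counters, this is roughly what I was
looking at on the node that owns the disks (nodeshell; output columns vary
by release):

    node> sysstat -x 1

The interesting bits are the CPU and Disk Util columns plus the CP ty
column, whose letters tell you what triggered each consistency point.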


--
Ian Ehrenwald
Senior Infrastructure Engineer
Hachette Book Group, Inc.
1.617.263.1948 / [hidden email]

Re: super secret flags

Peter D. Gray

On Mon, Jan 30, 2017 at 02:56:48PM +0000, Ehrenwald, Ian wrote:
> It's funny you mention this - I had a support case open not more than a
> couple of weeks ago regarding exactly the same thing.
>
>
> It's not really in my budget at the moment to purchase four more DS2246
> shelves with 24 x 1.2t each (2 for primary, 2 for DR), so this has rekindled my
> interest in trying this global throttle flag on a weekend, where if IO bogs
> down nobody will complain (too much) :)
>

We saw absolutely no impact on performance on the source filers. I am sure it's
possible it could impact performance.

In fact I would argue our performance may have improved, because the snapmirrors
now finish in just a few hours overnight when we are under light load, rather
than having 20 snapmirrors running during the day for days on end.

Also, I already have a mechanism to control throttling: each snapmirror has its own
throttle setting. What's the point of a per-mirror limit if the global throttle overrides it?
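
For the record, the per-mirror throttle I mean is the documented one on the
relationship itself, set in KB/s with 0 meaning unlimited, e.g.:

    ::> snapmirror modify -destination-path <svm>:<volume> -throttle 10240

As I understand it, that caps a single transfer, while the node flag above
throttles replication as a whole.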

Also, data protection can be more important than performance. That should be up to
us to decide, not NetApp.

The bottom line is that a global snapmirror rate throttle is a great idea.
It should be documented and easily controllable by the administrator
and should have a reasonable default value. It appears the current throttle
has none of these things.

Regards,
pdg

Re: super secret flags

Peter D. Gray
In reply to this post by Steiner, Jeffrey
On Mon, Jan 30, 2017 at 06:13:22AM +0000, Steiner, Jeffrey wrote:
> I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.
>

I am not entirely convinced that every customer should need to raise a support case
to get their snapmirrors working properly.


> On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network or the latency resulting from a network with a problem, such as packet loss. I'd imagine it could also be caused by latency from an overloaded destination SATA aggregate, and it's not out of the question that something newer like 40Gb Ethernet might create some odd issue that warrants setting this flag.
>

Hmmm.... we have a pretty good network. And it's hard to believe our disk latency at 1AM is
a problem. As I said, we got a factor-of-10 improvement in snapmirror performance, and no
noticeable drop in filer performance at either end.

But as I said elsewhere, it should be my choice how I prioritize performance over
data protection. Give me the tools and the documentation.

> In normal practice, you shouldn't need to touch this parameter. I've been around a long time, I'd never heard of it before now, and I've never needed it in any of my lab setups, even though I rely on SnapMirror heavily.

Did not work here.

>
> The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem, or creating new problems.

Hmmmm...... you could be right. But on the other hand we spent 3 weeks of our time
looking at this problem only to be told about a really simple fix that seems to work a treat.

You can see that does not make us happy.

>
> You might consider continuing to follow up on the case to ensure that either (a) you're in an
> odd situation where this parameter really is warranted or (b) there is some kind of underlying
> problem that needs fixing. If you're otherwise happy with the way the system is performing
> and the parameter change worked, I'd probably call it good...


Not after 3 weeks of my time and other people's time spent chasing a non-existent network problem.
The thing that made me the most angry is that there is a completely undocumented
setting that has an absolutely massive impact on the performance of a major
feature in ONTAP.

Basically, I posted this to see if any other people have seen the problem.
It appears at least some have.

Regards,
pdg


RE: super secret flags

Steiner, Jeffrey
Thanks for all the feedback; this definitely appears to be a gap. This parameter wasn't intended to be required outside edge cases, but it seems that "edge cases" is way too narrow.

I have a question - what is your use of post-processing compression or deduplication?

There seem to be a few other cases where heavy post-processing work created contention with snapmirror operations. Without going into too much detail, they both run as lower-priority tasks to ensure they don't interfere with "real" work like host IO operations.

If that's really the context then we need to update the KB article so nobody else ends up chasing a network or disk latency problem that doesn't exist. I'd imagine there could be other lower-priority tasks that could disproportionately mess with snapmirror transfer rates too.


Re: super secret flags

Ehrenwald, Ian
Hi Jeffrey
Not sure if you were addressing Peter specifically or the list in general, but I'll answer nonetheless :)

I have 23 entries in 'volume efficiency show -state enabled', and of those, 14 are inline-only policy on SSD.  None of those are being SnapMirrored.  Of the remaining 9, 4 are being SnapMirrored to our DR site.  Of those 4, 1 is the constantly lagged 20t volume previously mentioned.  The other constantly lagged volumes are not being deduped or compressed.

My rough guesstimate is that disabling this global throttle and leaving the SnapMirrors running overnight has transferred more snapshot data from these lagged volumes in less than 24 hours than in the previous month or more in total.  The tradeoff I am seeing is increased node CPU utilization, as well as an occasional small uptick in latency to NFS clients.

I wonder whether, as filer hardware gets more powerful and enough evidence comes to light, this global throttle could be disabled by default or at least made less aggressive - the snapmirror throughput increase is so significant that if I hadn't seen it myself, I'd guess someone was reading the numbers incorrectly.
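
In case anyone wants to compare notes, I pulled those numbers by
cross-checking efficiency state against what's mirrored, roughly like this
(standard commands; adjust the fields to taste):

    ::> volume efficiency show -state enabled -fields policy
    ::> snapmirror show -fields source-path,lag-time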


Ian Ehrenwald
Senior Infrastructure Engineer
Hachette Book Group, Inc.
1.617.263.1948 / [hidden email]


Re: super secret flags

Tim Parkinson
In reply to this post by Steiner, Jeffrey
Hi Jeffrey,

Just adding another voice to the "we've experienced abysmal snapmirror
performance in cmode" crowd. We've never really had a satisfactory
answer as to why from our third-party support people/NetApp, and we have
spent a tremendous amount of time trying to track down the cause of our
snapmirror issues (including buying larger controllers). This is the
first we've heard of this throttle setting, and we will certainly test it
over a weekend to see if it helps us out, since we still see lagging
mirrors and can't work out why.

We have a large number of post-process deduped volumes, no
compression, to answer your question.

Regards,

Tim

--
Tim Parkinson
Server & Storage Administrator
University of Sheffield
0114 222 3039


RE: super secret flags

Steiner, Jeffrey
Thanks for all the replies; I'm going to bring this up with engineering and ask for clearer guidance. I suspect we really need an updated KB article at a minimum. If this particular problem arises because there's legitimately a lot of "extra" work going on, then this is just another tunable that needs to be documented. On the other hand, if something like post-processing dedupe is abnormally outcompeting SnapMirror, that's a bug that ought to be fixed.

It might take a week, but I'll report back on what I find.


RE: super secret flags

Steiner, Jeffrey
In reply to this post by Tim Parkinson
If anyone on this distribution list runs into unexplained slow snapmirror transfers, please open a support case and cite BURT 1030457. It sounds like, under some circumstances we don't fully understand, the throttle is too aggressive. Post-processing deduplication jobs seem to be connected, but there's probably more to it than just that.

I've tagged the BURT with the support cases mentioned so far in this thread, and requested a better KB article explaining when this flag might need to be updated.


Re: super secret flags

Filip Sneppe
Hi Jeffrey and others,

I don't want to hijack this thread, since this is specifically about the repl_throttle_enable flag, but are you guys aware of the performance impact on SnapMirror when transfers run over etherchannels with port-based hashing on the sender side?

I have come across this a couple of times (the first time I encountered it, I logged a case: 2005111796). Unfortunately I have never had the time to troubleshoot it properly. In case 2005111796, support observed packet loss in the setup with port-based hashing, but we had to destroy our (test/troubleshooting) setup before we could get to the bottom of it. Since then, I have come across this on several occasions. More often than not it was not a real issue, since those SnapMirrors ran across WAN links, or SnapMirror runs at night and can take all the time it wants, but on 1Gbps/10Gbps LANs where SM updates need to be fast, it is an issue. However, I found out there is a TR that mentions that SnapMirror performance can be impacted by port-based ifgrps, so I've never bothered to open any additional cases for it.

Can anyone else confirm this behavior?
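
For anyone wanting to check their own setup: the distribution function is
visible in the ifgrp listing, and as far as I know it cannot be changed on
an existing ifgrp, so moving from port- to ip-based hashing means
recreating the ifgrp in a maintenance window:

    ::> network port ifgrp show -node <node> -ifgrp a0a
    ::> network port ifgrp create -node <node> -ifgrp a0b -distr-func ip -mode multimode_lacp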

(To put in my two cents on the repl_throttle_enable flag: a customer today reported this SnapMirror progress with and without throttling:
300GB in 2 hours before, vs. 100GB in 15 minutes after we disabled the flag. Also, earlier this week I had to wait for 160 TB of vol move operations on a second-tier system. When I disabled the repl_throttle_enable flag, I saw little or no impact for volumes with "dead/unmodified" data on them, but a big impact for (NFS) VMware datastores with live VMs sitting on them: the cutover estimate from "vol move show" was reduced by 24 hours almost immediately. I am quite sure those VMs will have been impacted, as CPU and disk load was pegged at 90+%.)

Best regards,
Filip

On Thu, Feb 2, 2017 at 9:13 AM, Steiner, Jeffrey <[hidden email]> wrote:
If anyone on this distribution list runs into unexplained slow snapmirror transfers, please open a support case and cite BURT 1030457. It sounds like under some circumstances we don't fully understand, the throttle is too aggressive. Post-processing deduplication jobs seem to be connected, but there's probably more to it than just that.

I've tagged the BURT with the support cases mentioned so far in this thread, and requested a better KB article explaining when this flag might need to be updated.

-----Original Message-----
From: Tim Parkinson [mailto:[hidden email]]
Sent: Wednesday, February 01, 2017 6:37 AM
To: Steiner, Jeffrey <[hidden email]>
Cc: [hidden email]
Subject: Re: super secret flags

Hi Jeffrey,

Just adding another voice to the "We've experienced abysmal snapmirror performance in cmode" crowd. We've never really had a satisfactory answer to why from our third party support people/netapp and have spent a tremendous amount of time trying to track down the cause of snapmirror issues (including buying larger controllers). This is the first we've heard of this throttle setting, and will certainly test it over a weekend to see if it helps us out, since we still see lagging mirrors and can't work out why.

We have a large number of post-process deduped volumes, no compression, to answer your question.

Regards,

Tim

On 31 January 2017 at 07:30, Steiner, Jeffrey <[hidden email]> wrote:
> Thanks for all the feedback, this definitely appears to be a gap. This parameter wasn't intended to be required outside edge cases, but it seems that "edge cases" is way too narrow.
>
> I have a question - what is your use of post-processing compression or deduplication?
>
> There seems to be a few other cases where a lot of post-processing work was creating contention with snapmirror operations. Without going into too much detail, they both run as lower-priority tasks to ensure they don't interfere with "real" work like host IO operations.
>
> If that's really the context then we need to update the KB article so nobody else ends up chasing a network or disk latency problem that doesn't exist. I'd imagine there could be other lower-priority tasks that could disproportionately mess with snapmirror transfer rates too.
>
> -----Original Message-----
> From: Peter D. Gray [mailto:[hidden email]]
> Sent: Monday, January 30, 2017 11:52 PM
> To: Steiner, Jeffrey <[hidden email]>
> Cc: NGC-pdg-uow.edu.au <[hidden email]>; [hidden email]
> Subject: Re: super secret flags
>
> On Mon, Jan 30, 2017 at 06:13:22AM +0000, Steiner, Jeffrey wrote:
>> I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.
>>
>
> I am not entirely convinced that every customer should need to raise a support case to get their snapmirrors working properly.
>
>
>> On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network or the latency resulting from a network with a problem, such as packet loss. I'd imagine it could be also caused by latency imposed by an overloaded destination SATA aggregate as well, plus it's not out of the question that something newer like 40Gb Ethernet might create some kind of odd issue that warrants setting this flag.
>>
>
> Hmmm.... we have a pretty good network. And it's hard to believe our disk latency at 1AM is a problem. As I said, we got a factor of 10 in snapmirror performance, and no noticeable drop in filer performance at either end.
>
> But as I said elsewhere, it should be my choice how I prioritize performance over data protection. Give me the tools and the documentation.
>
>> In normal practice, you shouldn't need to touch this parameter. I've been around a long time, and I'd never heard of it before now, and I've never used it with any of my lab setups, and I rely on SnapMirror heavily.
>
> Did not work here.
>
>>
>> The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem, or creating new problems.
>
> Hmmmm...... you could be right. But on the other hand, we spent 3 weeks of our time looking at this problem, only to be told about a really simple fix that seems to work a treat.
>
> You can see that does not make us happy.
>
>>
>> You might consider continuing to follow up on the case to ensure that
>> either (a) you're in an odd situation where this parameter really is
>> warranted or (b) there is some kind of underlying problem that needs
>> fixing. If you're otherwise happy with the way the system is performing and the parameter change worked, I'd probably call it good...
>
>
> Not after 3 weeks of my time and other people's time spent chasing a non-existent network problem.
> The thing that made me the most angry is that there is a completely undocumented setting with an absolutely massive impact on the performance of a major feature in ONTAP.
>
> Basically, I posted this to see if any other people have seen the problem.
> It appears at least some have.
>
> Regards,
> pdg



--
Tim Parkinson
Server & Storage Administrator
University of Sheffield
0114 222 3039


_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: super secret flags

Peter D. Gray
On Thu, Feb 02, 2017 at 03:41:19PM +0100, Filip Sneppe wrote:
> Hi Jeffrey and others,
>
> for (NFS) VMware datastores with some live VMs sitting on them: the cutover
> estimation from "vol move show" was reduced by 24 hours almost immediately
> - I am quite sure those VMs will have been impacted as CPU and disk load
> was pegged at 90+%).


You know about the -bypass-throttling true flag on the volume move command, right?
Again, a super secret option, only available in diag mode.
Without it, we find volume move takes forever.

This is our canned volume move command:

    set -privilege diag -confirmations off ';' \
        volume move start -vserver "$svm" -volume "$volume" \
        -destination-aggregate "$new_aggregate" -bypass-throttling true

Volume split also takes about a thousand years and I have not found a way to speed that up.

However, you can cheat by moving the new volume to another aggregate (as long as you disable the
throttle). That effectively splits the volume but about 100 times quicker
than waiting for the split to finish.
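
Here is a hedged sketch of what that cheat might look like; svm1, clone_vol and aggr_other are placeholder names, and this assumes the freshly created volume is a clone, where the move forces a full copy onto the new aggregate and thereby splits it from its parent:

    set -privilege diag -confirmations off
    volume move start -vserver svm1 -volume clone_vol \
        -destination-aggregate aggr_other -bypass-throttling true
    volume move show -vserver svm1 -volume clone_vol      (watch the cutover estimate)

Same caveat as with the node flag: -bypass-throttling is a diag-mode option, so the usual "check with support first" warning presumably applies here too.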

There seems to be a widespread issue with the built-in ONTAP throttles.

Regards,
pdg

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: super secret flags

Peter D. Gray
On Thu, Feb 02, 2017 at 03:41:19PM +0100, Filip Sneppe wrote:

> In case 2005111796, support observed packet loss in the setup with
> port-based hashing, but we had to destroy our (test/troubleshooting)
> setup before we could get to the bottom of this.
>
> Can anyone else confirm this behavior?

Is this an LACP ifgrp? We have had no issues with LACP on cluster mode.
On 7-mode, we saw many missed LACP packets, but, like you, we never investigated
fully because it kept working. One thing our network guys drum into us is that the LACP setting
MUST agree at both ends.


Regards,
pdg

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: super secret flags

Vervloesem Wouter
We don't think the issue was caused by LACP.
The difference between a configuration that replicates 'fast' and one that replicates 'slow' was the distr-func: set to IP (fast) or port (slow).
In both situations we used "multimode_lacp" as the mode.
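
For reference, you can check which distribution function your ifgrps use, and rebuild one with IP hashing, along these lines. The node, ifgrp and port names below are invented, and as far as we know distr-func is fixed at creation time, so changing it means deleting and recreating the ifgrp (disruptive - migrate any LIFs off it first):

    network port ifgrp show -fields distr-func,mode
    network port ifgrp remove-port -node node01 -ifgrp a0a -port e0c
    network port ifgrp remove-port -node node01 -ifgrp a0a -port e0d
    network port ifgrp delete -node node01 -ifgrp a0a
    network port ifgrp create -node node01 -ifgrp a0a -distr-func ip -mode multimode_lacp
    network port ifgrp add-port -node node01 -ifgrp a0a -port e0c
    network port ifgrp add-port -node node01 -ifgrp a0a -port e0d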


Regards,

Wouter Vervloesem
Storage Consultant

Neoria NV
Prins Boudewijnlaan 41 - 2650 Edegem
T +32 3 451 23 82 | M +32 496 52 93 61



_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

RE: super secret flags

Steiner, Jeffrey

There is no requirement for the LACP hashing configuration to be the same on both sides. On the whole, it doesn't make any difference if there's a mismatch.

 

The important thing that a lot of people miss is that LACP distribution policies are controlled by the sending device. There is no negotiation. For example, you can have ONTAP using IP hashing, while the switch is using src-dst-MAC hashing. That might be a bad idea, such as with a routed environment where only 2 MAC addresses are talking, but it doesn't create a compatibility problem.

 

I've seen a few older switches that really don't like port hashing. I'm not sure exactly what's happening, but it seemed like the switch's architecture wasn't expecting the same IP/MAC to appear on multiple different ports. It would pass traffic, but CPU utilization jumped significantly whenever any kind of port hashing was in use. Changing to IP hashing solved the problem.
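
To make that concrete: each sender's hash is configured independently, so you check (and change) them separately. The ONTAP line below is the standard ifgrp view; the switch lines are the usual Cisco IOS equivalents (the values shown are examples, not recommendations):

    ONTAP side (hashes filer-to-switch traffic):
        network port ifgrp show -fields distr-func,mode

    Cisco IOS side (hashes switch-to-filer traffic):
        show etherchannel load-balance
        configure terminal
        port-channel load-balance src-dst-ip

Note that the switch-side setting is typically global to the switch rather than per port-channel, which is one more reason a mismatch with the filer side is common - and, as above, harmless in itself.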

 

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: super secret flags

Filip Sneppe
Hi Jeffrey,

In our case(s), the determining factor was that the distr-func was set to "port" on the sender/source side of the SnapMirror relationship. 
At the receiving end, this setting didn't matter. Yes, we are aware that the hashing algorithm does not need to be matched between both sides (including the hashing algorithm on the switch).

Also, I suspect it's not so much an LACP issue; we would probably have run into the same issue with a static multimode etherchannel, although we've never tested this.

Before we had to break up our testing environment, we had tested and confirmed this behavior on Cisco 3750 and Nexus switches. Those aren't very exotic, so the performance drop did worry us.

ps. great thread by the way. Thanks Peter D. Gray for that other hidden flag in your reply :-)

Best regards,
Filip


_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

RE: super secret flags

Steiner, Jeffrey

I almost mentioned that the Cisco 3750 was a model where I'd seen problems with port hashing. The 3750 has been around for a while, so it might have improved over time. I've also seen odd problems related to the four SFP ports on the right-hand side of that switch.

 

A Nexus shouldn't have issues unless vPCs are in use. There are some odd special settings needed to make vPCs play nicely with LACP on some vendors' gear, including NetApp's, and there have been a few vPC-related bugs as well.

 

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters