Raid disk layout - the ability to lose a shelf.


Raid disk layout - the ability to lose a shelf.

jordan slingerland-2
Hello,

I am deploying a new 8040, and it was requested that the aggregates/raid groups be laid out in such a way that no more than 2 disks in any raid group are within the same shelf.

At first this sounds like it reduces single points of failure and could protect availability from the failure of a full disk shelf.

I argue against this strategy and was wondering whether anyone on this list had any feedback.

My thought is that this configuration marginally increases availability at the cost of additional risk to data integrity. With this strategy, each time a disk failed we would endure not only the initial rebuild to a spare, but a second rebuild when a disk replace is executed to put the original shelf/slot/disk back into the active raid group.

Additionally, if a shelf failure were encountered, I question whether it would even be possible to limp along. In an example configuration, we would be down 24 disks; 4 or 5 would rebuild to the remaining available spares. Those rebuilds alone would require significant CPU to run concurrently, and I expect they would impact data services significantly. At least 10 other raid groups would also be either singly or doubly degraded. I expect the performance degradation at this point would be so great that the most practical course of action would be to shut down the system until the failed shelf could be replaced.
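To make that scenario concrete, here is a rough back-of-the-envelope sketch. The shelf size, the 2-disks-per-raid-group-per-shelf cap, and the spare count are assumptions for illustration, not figures from the actual deployment:

```python
# Rough model of losing one fully populated shelf under the proposed
# layout (no raid group has more than 2 disks on any single shelf).
DISKS_PER_SHELF = 24          # assumed shelf size
MAX_PER_RG_PER_SHELF = 2      # the proposed layout rule
SPARES_AVAILABLE = 5          # assumed spares remaining

# With at most 2 disks per raid group on the shelf, 24 lost disks
# touch at least 24 / 2 = 12 distinct raid groups.
raid_groups_hit = DISKS_PER_SHELF // MAX_PER_RG_PER_SHELF

# Spares absorb only the first few rebuilds; the rest stay degraded.
rebuilding = min(SPARES_AVAILABLE, DISKS_PER_SHELF)
still_degraded = DISKS_PER_SHELF - rebuilding

print(f"{raid_groups_hit} raid groups affected, "
      f"{rebuilding} concurrent rebuilds, "
      f"{still_degraded} failed disks with no spare")
```

Under those assumptions, a dozen raid groups end up degraded at once, which is the core of the objection above.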

 
Thanks for any input. I would like to know if anyone has experience thinking through this type of scenario. Is considering this configuration interesting, or perhaps silly? Are any best-practice recommendations being violated?

Thanks in advance.

--Jordan

_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters

Re: Raid disk layout - the ability to lose a shelf.

Klise, Steve-2

I know a medical/pharma customer that does it. It just raises complexity, and you have to make sure the rebuilds happen on the proper drives. Unless you have had a shelf failure (or questionable power), it's hard to justify; otherwise, let "NetApp do what NetApp does". It might also be a challenge evacuating a shelf, say for decommissioning an older, smaller disk type (migrating away from 1TB SATA shelves, for example).

 

 

Reply | Threaded
Open this post in threaded view
|

Re: Raid disk layout - the ability to lose a shelf.

Tim McCarthy
In reply to this post by jordan slingerland-2
This can be crazy! You will end up with small raid groups eating up extra disks for parity. Sacrificing space for what?

Now, given a large environment (like 10 shelves or more)...maybe you can start with this. I did this for a customer once.
...ONCE

We ended up with 16- or 18-disk raid groups, and there were no more than 2 disks per raid group per shelf.
We took this one a bit farther, too: all even-numbered disks (*0, *2, *4, *6, *8) were assigned to node 2, the rest to node 1.
When a disk fails, assign even to node 2 and odd to node 1.

This made the aggregates a bit trickier to place, but it happened.
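The even/odd ownership convention above can be sketched as a tiny helper; the `0a.<shelf>.<bay>` disk-name format here is only an assumption for illustration:

```python
def owner_for(disk_name):
    """Assign even-numbered bays to node 2 and odd bays to node 1,
    per the even/odd convention described above."""
    bay = int(disk_name.rsplit(".", 1)[-1])   # "0a.10.4" -> 4
    return "node2" if bay % 2 == 0 else "node1"

print(owner_for("0a.10.4"))  # node2
print(owner_for("0a.10.7"))  # node1
```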

Now, when a disk fails, I cannot control where it rebuilds other than to a spare. I tried to keep the spares on one shelf, thinking that in the event of failures they will likely be in different raid groups.

However, one could script some monitoring software to watch where the spares are and flag more than 2 disks of a raid group showing up in the same shelf, then possibly force the running of a "disk copy start" command to nondisruptively move the disk. THIS TAKES LONGER than a reconstruction! Why? The process is NICE'd to use limited resources, because it is not critical yet.
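A minimal sketch of such a placement check, assuming you can already collect each disk's raid group and shelf from your own inventory tooling (the disk names and layout below are hypothetical):

```python
from collections import Counter

def find_violations(placement, max_per_shelf=2):
    """placement maps disk name -> (raid_group, shelf).
    Return the (raid_group, shelf) pairs that hold more than
    max_per_shelf disks of the same raid group."""
    counts = Counter(placement.values())
    return sorted(key for key, n in counts.items() if n > max_per_shelf)

# Hypothetical inventory snapshot:
layout = {
    "0a.10.0": ("rg0", "shelf10"),
    "0a.10.1": ("rg0", "shelf10"),
    "0a.10.2": ("rg0", "shelf10"),  # third rg0 disk on shelf10
    "0a.11.0": ("rg1", "shelf11"),
}
print(find_violations(layout))  # [('rg0', 'shelf10')]
```

Each flagged pair would be a candidate for the nondisruptive "disk copy start" move described above.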


--tmac

Tim McCarthy, Principal Consultant




Re: Raid disk layout - the ability to lose a shelf.

Francis Kim
In reply to this post by jordan slingerland-2
Your customer obviously has (ir)rational reasons for their lack of confidence in disk shelf HA.

After mentioning all the objections you're likely to get in this forum, you might ask your customer whether there's a subset of data that could be protected with this approach, or whether SyncMirror might be an alternative.



Re: Raid disk layout - the ability to lose a shelf.

Scott M Gelb
In reply to this post by Tim McCarthy
Agreed. For shelf resiliency, SyncMirror to a different stack is a no-charge license. Doubling the disks isn't, but for stack and shelf resiliency that might be a better option.




Re: Raid disk layout - the ability to lose a shelf.

Douglas Siggins-2
In reply to this post by Tim McCarthy
I find the whole exercise a fine idea. However, I've never seen whole-shelf failures; usually it's one of the modules (that's why there are two). Has anyone ever experienced a whole shelf go offline because of a failure other than power?

I've only had one of my 4243s with a single slot failure.




RE: Raid disk layout - the ability to lose a shelf.

Steiner, Jeffrey

For the most part, the shelf is just sheet metal. The controllers, power supplies, and all that are redundant. I'm not in hardware engineering, but I imagine that's the reason there isn't a native ability to isolate data within certain shelves. Catastrophic sheet metal failure is unlikely.

 

I was a NetApp customer for about 10 years before joining the company and the only shelf "failure" I encountered was a mildly annoying temperature sensor failure. There was only the one sensor, so I had to replace the shelf. It didn't cause downtime or anything.

 

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Douglas Siggins
Sent: Tuesday, June 21, 2016 5:42 PM
To: NGC-tmacmd-gmail.com <[hidden email]>
Cc: Toasters <[hidden email]>
Subject: Re: Raid disk layout - the ability to lose a shelf.

 


RE: Re: Raid disk layout - the ability to lose a shelf.

Clendening, William D
In reply to this post by Douglas Siggins-2

I have seen whole shelf failures due to Bug ID 902420.   (Don’t bother trying to look up the bug, it’s one of those wonderful, completely blank ones.)

 

Doug Clendening

 

(c) 713-516-4671

 


Re: Raid disk layout - the ability to lose a shelf.

Rhorer, Kyle L. (JSC-OD)[THE BOEING COMPANY]
In reply to this post by jordan slingerland-2
Another issue to think about besides resiliency… what happens in this “no more than two RAID group disks per shelf” scheme when they want to add another shelf because they’re running out of capacity?


Re: Raid disk layout - the ability to lose a shelf.

jordan slingerland-2
Thanks for all the replies so far. That is a valid point, but I believe in that situation each raid group could be extended by 1 or 2 disks. The initial configuration will be 12 shelves and 12-disk raid groups, so though the raid groups will end up being smaller than I would typically recommend, they are not tiny.

On Tue, Jun 21, 2016 at 12:01 PM, Rhorer, Kyle L. (JSC-OD)[THE BOEING COMPANY] <[hidden email]> wrote:
Another issue to think about besides resiliency… what happens in this “no more than two RAID group disks per shelf” scheme when they want to add another shelf because they’re running out of capacity?

> On Jun 21, 2016, at 10:11, jordan slingerland <[hidden email]> wrote:
>
> Hello,
>
> I am deploying a new 8040 and it was requested that the aggregates /  raid groups are laid out in such a way that no more than 2 disks in any raid group are within the same shelf.
>
> At first this sounds like it reduces single points of failure and could protect availability from the failure of a full disk shelf.
>
> I argue against this strategy and was wondering if anyone in this list had any feedback.
>
> My thought is that this configuration is marginally increasing availability at the sacrifice of additional risk to data integrity.  With this strategy, each time a disk failed we would endure not only the initial rebuilt to spare, but a second rebuild when a disk replace is executed to put the original shelf/slot/disk back into the the active raid group.
>
> Additional, if a shelf failure were encountered, I question whether it would even be possible to limp along. In an example configuration, we would be down 24 disks, 4 or 5 would rebuild to the remaining spares available.  Those rebuilds along should require significant cpu to occur concurrently and I expect would impact data services significantly.  Additionally, at least 10 other raid groups would be either single or double degraded.  I expect the performance degradation at this point would be so great that the most practical course of action would be to shutdown the system until the failed shelf could be replaced.
>
>
> Thanks for any input.  I would like to know if anyone has any experience thinking through this type of scenario.  Is considering this configuration interesting or perhaps silly?  Are any best practice recommendations being violated?
>
> Thanks in advance.
>
> --Jordan
> _______________________________________________
> Toasters mailing list
> [hidden email]
> http://www.teaparty.net/mailman/listinfo/toasters



_______________________________________________
Toasters mailing list
[hidden email]
http://www.teaparty.net/mailman/listinfo/toasters
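To put rough numbers on the limp-along scenario quoted above (a back-of-the-envelope sketch; all sizes are assumptions: 24-disk shelves and 12-disk RAID-DP groups laid out with at most 2 disks per group on any one shelf):

```python
# Impact of losing one whole 24-disk shelf, assuming 12-disk RAID-DP
# raid groups laid out with at most 2 disks per group per shelf.
DISKS_LOST = 24
MAX_PER_RG_PER_SHELF = 2
PARITY_DISKS = 2  # RAID-DP tolerates 2 concurrent failures per raid group

# Fewest raid groups touched: every affected group loses the full 2 disks,
# so at least 24 / 2 = 12 groups are degraded, each double-degraded.
min_groups_affected = DISKS_LOST // MAX_PER_RG_PER_SHELF

# No group exceeds the double-parity limit, so no group loses data outright.
survivable = MAX_PER_RG_PER_SHELF <= PARITY_DISKS
print(min_groups_affected, survivable)  # 12 True
```

The breadth, not the depth, is the concern: no single group exceeds RAID-DP's two-failure limit, but at least a dozen groups would be reconstructing at once.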

Re: Raid disk layout - the ability to lose a shelf.

Momonth

I experienced a disk shelf failure once: some internal electronics failed, smoke and all that fun... no data availability.

Another possible option is to enable data mirroring across two separate SAS domains; it used to require the syncmirror_local license on 7-Mode ONTAP.


Re: Raid disk layout - the ability to lose a shelf.

andrei.borzenkov@ts.fujitsu.com
In reply to this post by Douglas Siggins-2
I have seen (more than once) NetApp support recommend power cycling a shelf. And I did a full shelf replacement less than a month ago.

Sent from my iPhone

On Jun 21, 2016, at 18:53, Douglas Siggins <[hidden email]> wrote:

I find the whole exercise a fine idea. However, I've never seen a whole shelf fail; usually it's one of the modules (that's why there are two). Has anyone ever experienced a whole shelf go offline because of a failure other than power?

I've only had one of my 4243s with a single slot failure.



On Tue, Jun 21, 2016 at 11:26 AM, tmac <[hidden email]> wrote:
This can be crazy! You will end up with small raid groups eating up disks for parity, and sacrificing that space for what?

Now, given a large environment (like 10 shelves or more)...maybe you can start with this. I did this for a customer once.
...ONCE

We ended up with 16- or 18-disk raid groups, with no more than 2 disks per raid group per shelf.
We took this a bit farther too: all even-numbered disks (*0, *2, *4, *6, *8) were assigned to node 2, the rest to node 1.
When a disk fails, assign even to node 2, odd to node 1.

This made the aggregates a bit trickier to place, but it happened.

Now, when a disk fails, I cannot control where it rebuilds other than to a spare. I tried to keep the spares on one shelf, thinking that in the event of failures they would likely land in different raid groups.

However, one could script some monitoring software to watch where the spares are and flag any raid group showing more than 2 disks in the same shelf, then possibly force a "disk copy start" command to nondisruptively move a disk. THIS TAKES LONGER than a reconstruction!!! Why? The process is nice'd to use limited resources because it is not yet critical.
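A sketch of the monitoring idea described above (hypothetical: it assumes you have already parsed `aggr status -r` output, or similar, into (raidgroup, shelf) pairs; the flagging logic is the point, not the ONTAP parsing):

```python
from collections import Counter

def find_violations(disks, limit=2):
    """disks: iterable of (raidgroup, shelf) pairs for in-use data/parity disks.
    Returns the (raidgroup, shelf) keys where more than `limit` disks share a shelf."""
    counts = Counter(disks)
    return sorted(key for key, n in counts.items() if n > limit)

# Hypothetical inventory: rg1 ended up with 3 disks on shelf 0 after a
# rebuild to a spare, so it should be flagged for a "disk copy start".
inventory = [
    ("rg0", 0), ("rg0", 1), ("rg0", 2),
    ("rg1", 0), ("rg1", 0), ("rg1", 0), ("rg1", 1),
]
print(find_violations(inventory))  # [('rg1', 0)]
```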


--tmac

Tim McCarthy, Principal Consultant




RE: Raid disk layout - the ability to lose a shelf.

Steiner, Jeffrey

A shelf can certainly have problems, like any electronic device; I just question whether isolating aggregates to particular shelves solves a real problem. If a temperature sensor went bad, or a drive connector in one of the bays was bent, you would still have to deal with maintenance work and potential downtime to address it, irrespective of which drives from which aggregates are where.

 


Re: Raid disk layout - the ability to lose a shelf.

Jeffrey Mohler
"A shelf can certainly have problems like any electronic device, I just question whether isolating aggregates to particular shelves solves a real problem."
---

That's been my thinking throughout this whole thread.

All great thoughts, but it's really deep in the weeds to waste time overthinking the problem on enterprise hardware using retail JBOD thinking.

We're way down into the 0.00x AFR percentages here, even where specific targeted/found bugs actually exist.  Add more decimal places for the much rarer smoke events that can kill a shelf.

ONTAP does a good job at layout, informed by a lot of practical experience and data on the best placement for highest reliability.
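Purely illustrative arithmetic (the AFR figures below are made up, chosen only to be of the magnitudes discussed here) comparing expected disk failures against expected whole-shelf events on a 12-shelf system:

```python
# Illustrative only: expected annual failure events for a 12-shelf system.
# Both AFR values are assumed for the sake of the comparison.
DISK_AFR = 0.02     # ~2% per disk-year (assumed)
SHELF_AFR = 0.0005  # "0.00x"-range whole-shelf events per shelf-year (assumed)
SHELVES, DISKS = 12, 288

expected_disk_failures = DISKS * DISK_AFR      # ~5.8 disk failures per year
expected_shelf_failures = SHELVES * SHELF_AFR  # ~0.006 shelf events per year
print(round(expected_disk_failures, 2), round(expected_shelf_failures, 3))
```

With numbers anywhere near these, ordinary disk failures outnumber shelf events by roughly three orders of magnitude, which is the point: optimize the layout for the common failure, not the rare one.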
 
_________________________________
[hidden email]
Tech Yahoo, Storage Architect, Principal
Twitter: @PrincipalYahoo
CorpIM:  Hipchat & Iris






RE: Raid disk layout - the ability to lose a shelf.

Duncan Cummings
In reply to this post by jordan slingerland-2

I don’t know that I would try to do it deliberately, but ONTAP does it automatically when you have enough shelves.

 

And I have seen this save a panic when a whole shelf failed.  Facilities were testing a power rail without checking that the power supplies in all devices were operational; the engineer was opening the rack, replacement power supply in hand, when it went down.

 

It took a while to rebuild, but it didn’t go down.

 

 

 


Duncan Cummings
NetApp Specialist
Interactive Pty Ltd
Telephone +61 7 3323 0800
Facsimile +61 7 3323 0899
Mobile +61 403 383 050
www.interactive.com.au

-------Confidentiality & Legal Privilege-------------
"This email is intended for the named recipient only. The information contained in this message may be confidential, or commercially sensitive. If you are not the intended recipient you must not reproduce or distribute any part of the email, disclose its contents to any other party, or take any action in reliance on it. If you have received this email in error, please contact the sender immediately. Please delete this message from your computer. Confidentiality and legal privilege are not waived or lost by reason of mistaken delivery to you."

