Viruses? Paying ransom? Losing data? The remedy is here!

This is not advertising! This is reality. Companies are hit by viruses all the time. You need a diligent team securing your infrastructure 24×7 just to be able to feel safe.

And yet, despite feeling safe, sooner or later some virus hits you and your data is exposed. Let's look at what this vicious cycle looks like: a virus hits you, and only afterwards is a solution proposed and put in place:

Virus and data loss cycle

So, a virus is created, published, and spreads quickly; that same virus reaches your environment and your systems come under attack. Some data gets encrypted and your business loses access to it. Meanwhile, the best minds in the world are working to document steps to protect you from this virus. Your IT team implements measures to protect against the virus and its propagation.

Then you realize you have many business units asking for access to their data, or reporting at an alarming rate that they have lost it.

The result is your team either paying the ransom for some data sets or asking the different business units to regenerate their data.

Now you have become known as the IT leader who could not protect the company from this virus or recover its data! This endless cycle is always the same: a new virus, a moment of great pain, confusion, and data loss, followed by an antivirus patch. And then the cycle repeats the same sequence of events.

Now, let's look at the next cycle:

Breaking the virus and data loss cycle

The virus hits as always and once again enters your environment, bypassing your previous antivirus protection measures. That is the nature of our world. But this time you have an additional layer of security: a backup and restore solution that helps you avoid paying the ransom and lets you restart your business from an acceptable point in time. In this cycle you were able to recover your business data, keep operating, and you paid no stranger a single fee!

Let's take a look at the importance of business data over time, and at how this cycle of virus protection and virus impact may or may not affect it.

Data Importance vs time and data loss

Bear with me; I know the graph has a lot of elements. I will explain it in detail.

Horizontally we have time; vertically we have the amount of data, which grows, and with it the importance of that data to the business. On the right-hand side there is a label for the horizontal bands called exposed data.

As time goes by we create data, and at a given moment we come under attack by viruses. The entire data set is exposed to being lost, corrupted, or infected. You may be lucky, however, and not all of the data is affected; only the data set inside the RED circle is the actual data loss.

This happens during the window of vulnerability, represented in a creamy mustard color. Then you patch against the virus, only for the cycle to restart.

Keep in mind, however, that you are not in control of, nor can you predict, which data is affected. That data could be highly relevant to your day-to-day operations, or it could be an older data set with less impact on your immediate business.

Let's look at a different environment, one where you have implemented a way to recover your data.

Data Importance vs time and data Protection

As with the previous graph, your environment is exposed to a new virus, but during the window of vulnerability you are able to restore the affected data sets because you have implemented backup and restore services. This is true regardless of whether the data set is new, old, or both! Through attacks, infections, patches, and fixes, you can restore your data from an earlier point in time. This gives you the confidence that your business will persist and withstand a virus attack.

What should you protect?

Physical servers, virtual servers, laptops, and your SaaS platforms (Google, 365, and even Salesforce). You should also be asking for DRaaS: Disaster Recovery as a Service. This protects you when an entire site goes down for any reason, or even during a massive attack where so many of your systems are affected that it could be far simpler to click failover and let a remote site become the primary site.

If you don't know where to start, send me an email at jcalderon@KIONetworks.com

Regards!

Julio Calderon

Twitter: @JulioCUS

Skype: Storagepro

Email: Jcalderon@kionetworks.com

 

6 Tips To Make Your OpenStack Enterprise Ready.

 

How to Make Your OpenStack Environment Enterprise Ready. 6 Tips.

At a baseline, let’s first come to an agreement of what “Enterprise Ready” means. As a storage consultant and IT generalist with a specialty in cloud architecture, I would define enterprise ready as an environment with the following characteristics:

Predictable

No surprises here: we know and understand the environment’s behaviors during any stress point.

Available

Availability, measured in uptime, indicates how many nines are supported and in general the practices that need to be in place to guarantee a highly available environment.

Fast

The performance of the environment should be dependable and we should be able to set clear expectations with our clients and know which workloads to avoid.

Well Supported

There should be a help line with somebody reliable to back you up in knowledge and expertise.

Expandable

We should know where we can grow and by how much.

Low Maintenance

The environment should be so low-maintenance as to be a “set it and forget it” type of experience.

How to Get There: Artificial Intelligence

Now that we know the characteristics and their meanings, the question is, how do we make our open source environment enterprise ready? Let’s take it one at a time. Hint: artificial intelligence can help at every turn.

Predictable

To make your OpenStack environment enterprise ready, you need to perform a wide range of testing to discover how it functions during issues, failures, and high workloads. At KIO Networks, we do continuous testing and internal documentation so our operations team knows exactly what testing was done and how the environment behaved.

Artificial Intelligence can help by documenting historical behavior and predicting potential issues down to the minute that our operations team will encounter an anomaly. It’s the fastest indication that something’s not running the way it’s supposed to.

Available

To test high availability, we perform component failures and document behavior. It is important to fail every single component, including hardware, software, and supporting dependencies of the cloud environment like Internet lines, power supplies, load balancers, and physical or logical components. In our tests, there are always multiple elements that fail and are either recovered or replaced. You need to know your exposure time: how long it takes your team both to recover and to replace an element.
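
As a back-of-the-envelope companion to the uptime discussion, here is a small sketch that converts a "nines" target into an annual downtime budget and checks a measured exposure time against it. The recovery and replacement figures are invented for illustration, and treating the whole exposure window as downtime is a deliberately pessimistic simplification.

```python
# Back-of-the-envelope availability math: convert a "nines" target into an
# annual downtime budget, then check a measured exposure time against it.
# The recovery/replacement figures are invented for illustration.

HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_budget_hours(nines: int) -> float:
    """Allowed downtime per year for an availability target of N nines."""
    availability = 1 - 10 ** (-nines)   # e.g. 3 nines -> 0.999
    return HOURS_PER_YEAR * (1 - availability)

for nines in (2, 3, 4, 5):
    print(f"{nines} nines -> {downtime_budget_hours(nines):7.3f} hours of downtime per year")

# Hypothetical figures from a failover test:
recover_hours = 0.5   # time to fail over / recover the service
replace_hours = 6.0   # time to physically replace the failed element
exposure = recover_hours + replace_hours
print(f"Exposure time: {exposure} hours "
      f"(fits a 3-nines budget: {exposure <= downtime_budget_hours(3)})")
```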

AI-powered tools complement traditional monitoring mechanisms. Monitoring mechanisms need to know what your KPIs are. From time to time you may encounter a new problem and need to establish a new KPI for it alongside additional monitoring. With AI, you can see that something abnormal is happening, and that clarity will help your administrators home in on the issue, fix it, and create a new KPI to monitor. The biggest difference with an AI-powered tool is that you're able to do that without the surprise outage.

Fast

Really, this is about understanding speed and either documenting limitations or opting for a better solution. Stress testing memory, CPU, and storage IO is a great start. Doing so at a larger scale is desirable in order to learn breaking points and establish KPIs for capacity planning and, just as important, day-to-day monitoring.
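
As a rough illustration of the storage IO piece, here is a tiny random-read micro-benchmark sketch. It is not a substitute for purpose-built tools like fio or IOMeter (the file name, size, and duration are arbitrary assumptions, and page-cache hits will inflate the numbers), but it shows the shape of the measurement: drive random 4 KiB reads for a fixed window and count operations per second.

```python
# A tiny random-read micro-benchmark, for illustration only. Real stress tests
# should use purpose-built tools such as fio or IOMeter; here the file name,
# size, and duration are arbitrary, and page-cache hits will inflate the result.
import os
import random
import time

PATH = "stress_test.bin"          # hypothetical scratch file
FILE_SIZE = 256 * 1024 * 1024     # 256 MiB test file
BLOCK = 4096                      # 4 KiB blocks
DURATION = 10                     # seconds of load

# Create the test file in 1 MiB chunks.
with open(PATH, "wb") as f:
    chunk = os.urandom(1024 * 1024)
    for _ in range(FILE_SIZE // len(chunk)):
        f.write(chunk)

blocks = FILE_SIZE // BLOCK
ops = 0
deadline = time.monotonic() + DURATION

# Issue random 4 KiB reads until the time window expires.
with open(PATH, "rb", buffering=0) as f:
    while time.monotonic() < deadline:
        f.seek(random.randrange(blocks) * BLOCK)
        f.read(BLOCK)
        ops += 1

print(f"~{ops / DURATION:.0f} random 4 KiB read IOPS (page cache not bypassed)")
os.remove(PATH)
```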

Do you know of a single person who would be able to manually correlate logs to understand if performance latency is improving based on what’s happening now compared to yesterday, 3 weeks ago, and 5 months ago? It’s impossible! Now, imagine your AI-powered platform receiving all your logs from your hardware and software. This platform would be able to identify normal running conditions and notify you of an issue as soon as it sees something unusual. This would happen before it hits your established KPIs, before it slows down your parallel storage, before your software-defined storage is impacted, and before the end user’s virtual machine times out.
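
To make the idea concrete, here is a hedged sketch of the kind of comparison such a platform automates: checking the current latency window against baselines from different points in the past and flagging large deviations. The sample values and the three-sigma rule are illustrative assumptions, not how Loom Systems actually works.

```python
# A hedged sketch of the comparison an AI-driven platform automates: check the
# current latency window against baselines from different points in the past
# and flag large deviations. Sample values and the 3-sigma rule are assumptions.
from statistics import mean, stdev

def is_anomalous(current: list[float], baseline: list[float], sigmas: float = 3.0) -> bool:
    """Flag the current window if its mean sits more than `sigmas` standard
    deviations above the baseline mean."""
    return mean(current) > mean(baseline) + sigmas * stdev(baseline)

# Illustrative latency samples in milliseconds.
baselines = {
    "yesterday":    [2.1, 2.3, 2.0, 2.4, 2.2, 2.1],
    "3 weeks ago":  [2.0, 2.2, 2.1, 2.3, 2.0, 2.2],
    "5 months ago": [1.9, 2.0, 2.1, 2.0, 2.2, 1.9],
}
current_window = [2.2, 2.4, 2.3, 5.8, 6.1, 5.9]   # something changed mid-window

for label, baseline in baselines.items():
    verdict = "ANOMALY" if is_anomalous(current_window, baseline) else "ok"
    print(f"vs {label:12s}: {verdict}")
```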

Well Supported

We emphasize the importance of continuously building our expertise in-house but also rely on certain vendors as the originators of code that we use and/or as huge contributors to open source projects. It’s crucial for businesses to keep growing their knowledge base and to continue conducting lab tests for ongoing learning.

I don't expect anyone to build their own AI-powered platform. Many have built log platforms with visualization front ends, but that is still a manual process that relies heavily on someone to do the correlation and create new signatures for searching out specific information as needed. However, if you are interested in a set of signatures that's self-adjusting, never rests, and can predict what will go wrong, alongside an outside team that's ready to assist you, I would recommend Loom Systems. I have not found anything in the market yet that comes close to what they do.

Expandable

When testing growth, the question always is: what does theory tell you, and what can you prove? Having built some of the largest clouds in LATAM, KIO knows how to manage a large-volume cloud, but smaller companies can always reach out to peers or hardware partners to borrow hardware. Of course, there's always the good, old-fashioned way: you buy it all, build it all, test it all, shrink it afterwards, and sell it. All of the non-utilized parts can be recycled into other projects. Loom Systems and its AI-powered platform can help you keep watch over your infrastructure as your human DevOps teams continue to streamline operations.

Low Maintenance

Every DevOps team wants a set-it-and-forget-it experience. Yes, this is achievable, but how do you get there? Unfortunately, there's no shortcut. It takes learning, documenting, and applying lessons to all of your environments. After many man-hours of managing such an environment, our DevOps team has applied scripts to self-heal and correct, built templates to monitor and detect conditions, and set up monitors to alert them when KPIs are being hit. The process is intensive initially, but eventually dedicated DevOps teams get to a place where their environment is low maintenance.
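
To make the pattern concrete, here is a minimal sketch of that "self-heal, then alert" loop. It is illustrative only: the KPI (filesystem utilization), the 90% threshold, and the placeholder remediation are my assumptions, not KIO's actual scripts or templates.

```python
# A minimal sketch of the "self-heal, then alert" loop described above. The KPI
# (filesystem utilization), the 90% threshold, and the placeholder remediation
# are assumptions for illustration.
import shutil
import time

KPI_THRESHOLD = 90.0   # percent utilization (assumed)
CHECK_INTERVAL = 60    # seconds between checks

def disk_usage_percent(path: str = "/") -> float:
    """The KPI for this sketch: filesystem utilization of `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def remediate() -> None:
    """Placeholder remediation: a real runbook might rotate logs or clean images."""
    print("running cleanup playbook...")

def alert(message: str) -> None:
    """Placeholder alert: a real setup would page the on-call admin."""
    print(f"ALERT: {message}")

def monitor_once() -> None:
    if disk_usage_percent() > KPI_THRESHOLD:
        remediate()
        if disk_usage_percent() > KPI_THRESHOLD:   # self-heal did not help
            alert(f"disk still at {disk_usage_percent():.1f}% after remediation")

if __name__ == "__main__":
    while True:
        monitor_once()
        time.sleep(CHECK_INTERVAL)
```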

The AI-powered platform from Loom Systems helps you by alerting you of the unknown. Your team will be shown potential fixes and be prompted to add new fixes. As time goes by, the entire team will have extensive documentation available that will help new or junior admins just joining the team. This generates a large knowledge base, a mature project, and also a lower-maintenance team.

All serious businesses should enjoy the benefits of running a predictable, highly available, fast, well supported, easily expandable and low-maintenance environment.  The AI-powered platform built by Loom Systems takes us there much faster and gives us benefits that are usually reserved for huge corporations. Just as an example, if you’re the first in the market offering a new product or service, you can feel confident with Loom Systems that they’ll detect problems early and give you actionable intelligence so you can fix them with surgical precision.

It’s been a pleasure sharing my learnings with you and I look forward to hearing your feedback. Please share your comments and points of view – they’re all welcome!

 

Best Regards,

Julio Calderon

Twitter: @JulioCUS

Skype: Storagepro

Email: jcalderon@kionetworks.com

Data Virtualization and Strategy

Do you have a storage strategy? Are you getting the most out of all your storage platforms? Are you underspending and getting substandard results, or overspending for A-OK results?

There are multiple forces driving you toward a choice, and to help you narrow it down to the basic aspects, here is a picture. In the center we have our users. Our users want it all: reliability, performance, and low cost. Well, low-cost, highly available, high-performance storage doesn't exist! You will have to strike a compromise between these key aspects. In a nutshell, you have to choose how much of each aspect you get, and there is no way to get it all at low cost.

In this triangle we show three layers of technology: media, access type, and storage features. In the media layer, as we change media we lower costs. In the access layer we show some basic types you will recognize: NFS, SMB/CIFS, WebDAV, S3, and block storage over FC or iSCSI. Yes, there are many more, such as block over InfiniBand and protocols like FTP and SFTP. In the storage feature layer we start to see levels of data protection and higher availability, such as local or remote snapshots, replication, offsite backups, and storage clustering; there are also RAID levels and numbers of controllers that increase availability.

To dig deeper into the media layer, let's look at the technologies involved. For the highest performance we have flash and SSD; then we lower performance with SAS and SATA; and finally the lowest performance is object storage. As we lower performance we also lower costs. These are just general examples; we all know there are other media types, such as tape with LTFS, that could lower costs further. You as a storage administrator have to choose what to give your users, provided you actually have it.

So, if you provide storage as a service, then you most likely already have a compromise between all the aspects, what you might call the sweet spot. However, because this is a sweet spot across all the key aspects, it does not respond to the full spectrum of needs. Some workloads require super high availability, replication, clustering, and 100% uptime, while others require low cost via object storage, and then there are the LOCAL flash cards for high-performance needs. After all, most high-performance IO requirements come from local workloads like TMP, SCRATCH, and SWAP (this data does not require high durability, since it is recreated on the fly as the OS and applications run).

What we need is a solution that does not force you into a compromise. Much easier said than done. Example below.

As you can gather from the graphic, via a single point of access you receive a slice across the three aspects: reliability, performance, and cost. If you provide storage solutions, you know this is very expensive and complex, because you have to acquire, configure, and manage all of the technologies previously covered; you must then choose a level of performance with data protection and reliability; and finally, a cost is associated with each platform and configuration.

So, what is out there to help you with this? Well, take INFINIDAT, a centralized storage platform with higher levels of redundancy and good performance. Look at Pure Storage, an all-SSD solution running in deduplicated mode with good density and TCO over a longer-than-usual period of time. You should also look at SolidFire, which can assign a number of IOPS to the storage you provide, helping you align performance to workload. Then look at XtremIO from EMC for great performance, and NetApp for an all-inclusive set of data storage and data protection features, by far the most mature in the market at combining storage and protection. Look at newer players such as Actifio and Rubrik for copy data management, and at the established block storage virtualization engines such as IBM's SVC or the virtualization engine from HDS, also OEMed by HPE. There are many more technologies that I am leaving out. The list is HUGE! And your time could be consumed just learning from each one of them.

Here at KIO Networks we do this for you! This helps us offer higher value and the service that is just right for you. We have expertise in a wide spectrum of technologies across key practices.

So, next time you look for storage on demand, look at KIO Networks. In the spirit of helping you learn something new, I'd like to introduce a subject that promises to help businesses align the right storage to the right application at the right price.

There is a storage player out in the market promising to address this complex challenge. It's called Primary_Data.

What will data virtualization do for you?

In the simplest terms, it allows you to move data heterogeneously across multiple platforms.

I hope you enjoy the set of videos; I know I did.

The Launch of Primary_Data

 

Introduction to Primary_Data

 

What is data virtualization: I think the video is overcomplicated, but it's a good video.

 

Use cases for Data Virtualization


Primary_Data User Experience demo

 

Vsphere Demo

 

I hope you enjoyed the videos. As with all cool technology, the devil is in the details. From my side, I will make sure to take a deep dive into the technology and report back to you guys.

Have a great week!

Julio Calderon  Twitter: @JulioCUS  Skype: Storagepro email: jcalderon@kionetworks.com

Replication! Any way you want it, from Rubrik. Come read about it.

This article was posted by a great professional in the data protection arena, Rolland Miller. His info can be found here: http://www.rubrik.com/blog/author/rolland-miller/

Rolland is currently investing his time at Rubrik, and I am sure they will absolutely make it big! Here is the original post location: http://www.rubrik.com/blog/unlimited-replication-with-rubrik-2-0/#.VdXN-vu9esk.linkedin

— Start of Snip Original Post —

Today we announced Rubrik 2.0, which is packed with exciting new features. I’ve been working in the storage industry for the past 16 years, the majority of time spent working on backup and DR solutions for companies. This isn’t my first rodeo–I’ve seen a lot, advised a lot of customers in how to architect their backup and disaster recovery infrastructure. Needless to say, I haven’t been this thrilled for a long time—our engineers are building something truly innovative that will simplify how recovery is done on-site or off-site for the better.

Why is Rubrik Converged Data Management 2.0 so interesting?

Our 2.0 release is anchored by Unlimited Replication. There are no limits to how many snapshot replicas you can have. There is zero impact on your production systems as replication occurs since this isn’t array-based replication. This is asynchronous, deduplicated, masterless, SLA driven replication that can be deployed any way you like, many-to-one, many-to-many, one-to-one, uni-directionally or bi-directionally. In the past, replication has always been engineered with a master-slave architecture in mind because systems have always had an active-passive view of control. Our Converged Data Management platform is fundamentally a distributed architecture that allows you to share nothing, but do everything—each node is a master of its domain. Our engineers apply the same building principles to replication. Hub and spoke? Check. Bi-directional? Check. Dual-hub, multi-spoke, and archived to the cloud. Check! Check! Check!

A key property of Converged Data Management is instant data access. Data is immediately available, regardless of locality, for search and recovery. Using Rubrik for replication allows you to recover directly on the Rubrik appliance since applications can be mounted directly. Files can be found instantly with our Global Real-Time Search. There’s no need to copy files over to another storage system. We’ll give you near-zero RTO.

In this release, we extend our SLA policy engine concept into the realm of replication. You can define near-continuous data replication on a per-VM basis within the same place that backup policies are set. There’s no need to individually manage replication and backup jobs—instead, you’ve freed up your time by managing SLA policies instead of individual replication targets. Once you specify a few parameters, the engine automates schedule execution. For more on managing SLA policies, see Chris Wahl’s Part 1 and Part 2 posts.

No SLA policy is complete without measurement. In 2.0, we’re releasing beautifully simple reporting that helps you validate whether your backup snapshots are successful and whether they’re meeting the defined SLA policies. Our reporting will help you keep an eye on system capacity utilization, growth, and runway—so you’ll never be caught short-handed.

A New Addition to the Family

Finally, we’re welcoming the new r348, our smartest dense machine yet. We’re doubling the capacity within the same 2U form factor, while maintaining the fast, flash-optimized performance for all data operations, from ingest to archival.

Catch Us at VMworld

In less than two weeks, we’ll be at VMworld. Make sure to stop by our Booth 1045 to see a live demo. Arvind “Nitro” Nithrakashyap and Chris Wahl will be leading a breakout session on Wednesday, 9/2 at 10 am and giving away epic Battle of Hoth LEGO sets.

— End of Snip Original Post —

WOW!  Let me quote Rolland, “There is zero impact on your production systems as replication occurs since this isn’t array-based replication. This is asynchronous, deduplicated, masterless, SLA driven replication that can be deployed any way you like, many-to-one, many-to-many, one-to-one, uni-directionally or bi-directionally.”

WOW! KICK ASS! Love it. If you are a techno geek like I am, you will be as super excited about this as I am.

Just imagine the ramifications! This could be a true base platform for a service provider. Without limits, you really could serve a huge pool of clients and their needs. Obviously, you still need to size properly for your forecasted use and growth. Anyway, hopefully I can join you all at VMworld, where I am sure Rubrik will WOW all of you!

Enjoy!  @KIO Networks we love cool technology!

Regards,

Julio Calderon / @JuliocUS /email: jcalderon@kionetworks.com

DON’T BACKUP. GO FORWARD (an outsider's view of Rubrik)

As a Senior Product Manager for Data at KIO Networks, I am always in search of technologies that will enable us to provide additional value around data protection and storage.

I have been looking at Rubrik for a year now, and after a recent review with their Director of Presales, this is what I understand. Hopefully you will find it useful.

As with any great building, you must have a strong foundation, and in my personal opinion Rubrik has solid technical founders. Here are the initial key founders; obviously, everyone in a company like Rubrik is key. For a full list follow this link: http://www.rubrik.com/company/

Bipul Sinha (CEO)

Founding investor in Nutanix. Partner at Lightspeed (PernixData, Numerify, Bromium). IIT, Kharagpur (BTech), Wharton (MBA).

Arvind Nithrakashyap (Engineering)

Co-founder of Oracle Exadata and Principal Engineer, Oracle Cluster. Led real-time ad infrastructure at RocketFuel. Silk Road trekker. IIT, Madras (BTech), University of Massachusetts, Amherst (MS).

Arvind Jain (Engineering)

Google Distinguished Engineer. Founding engineer at Riverbed. Chief Architect at Akamai. Chipotle advocate. IIT, Delhi (BS). University of Washington (PhD Dropout).

These folks came from the Googles, Facebooks, Data Domains, and VMwares of the world. These are the key players in technology and services worldwide. Not bad, right?

Put your money where your mouth is!

That's what the initial investors did: Lightspeed Venture Partners, as well as industry luminaries John W. Thompson (Microsoft Chairman, former Symantec CEO), Frank Slootman (ServiceNow CEO, former Data Domain CEO), and Mark Leslie (Leslie Ventures, Veritas Founding CEO).

You can read more about their Series A funding at: http://www.rubrik.com/blog/press-release/rubrik-invents-time-machine-for-cloud-infrastructure-to-redefine-47-billion-data-management-market/

This initial funding was quickly followed by a Series B of $41 million. You can read about it here: http://fortune.com/2015/05/26/rubrik-archive-data/

Now, in both funding blog references you will read key descriptors such as time machine, archive, backup, etc. I think these are either catchy names or references to well-known technologies. By the end of my post you will see that Rubrik is more than a catchy name and should not be compared to previously known technologies. It should be seen as a new technology that removes the need for the old.

What is Rubrik trying to change?

The fundamental change is to converge all backup software, deduplicated storage, catalog management, and data orchestration into a single piece of software packaged within an appliance. Rubrik seamlessly scales to manage data sets of any size, from 10 TB or 50 TB up to 1 PB.

Expandability: so, what do we do with archive and long-retention data sets?

With an S3 tie-in, data is sent over to object storage.

If you are interested in lowering your running costs, you need a long-term-retention storage pool, and when it comes to lowering cost there is nothing better than object storage.

Look at your data protection operation and ask: what time frame do most restores come from? The answer is yesterday, or within the last week.

In primary storage we create tiers with different characteristics: performance, reliability, and cost. With continued data growth, it makes sense to do the same with our data protection platforms! Tier your data protection vaults and storage pools to best align with your restore requirements and your initial data copy (backup) requirements. Ingest quickly, restore quickly, and for the unlikely restore from a year ago, wait a bit longer to pull data from the CLOUD or your own object storage pool.
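
As an illustration of that tiering idea (not a Rubrik feature or configuration), here is a minimal sketch that routes restore points to a fast local pool or an object-storage archive based on their age; the pool names and the one-week cutoff are assumptions.

```python
# Illustration of the tiering idea above (not a Rubrik feature or configuration):
# keep recent restore points on a fast local pool and send older ones to an
# object storage archive. Pool names and the one-week cutoff are assumptions.
from datetime import datetime, timedelta, timezone

FAST_TIER_WINDOW = timedelta(days=7)   # "yesterday, or within the last week"

def tier_for_snapshot(taken_at: datetime, now: datetime) -> str:
    return "fast-local-pool" if now - taken_at <= FAST_TIER_WINDOW else "s3-archive-pool"

now = datetime.now(timezone.utc)
for age_days in (0, 3, 30, 365):
    snapshot_time = now - timedelta(days=age_days)
    print(f"restore point from {age_days:3d} days ago -> {tier_for_snapshot(snapshot_time, now)}")
```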

Ease of deployment results in lower costs. In Rubrik's case, the promise is 50 minutes or less from installation to the first backup of a VM. Now, I should mention that, as publicly available today, Rubrik currently integrates with VMware environments. You have physical, Hyper-V, OpenStack, and more? Never fear! Rubrik has plans to expand to other platforms. My recommendation, as with any technology: if you don't find the capability anywhere else and you need it, reach out to the vendor with an opportunity and ask them to consider your priorities in their roadmap. It doesn't hurt to ask! For me, my priorities are OpenStack integration and MS Hyper-V.

Now, going back to what Rubrik can do for your VMware environment. Granular restore from archive. Yes, the VM is protected and at the same time it is indexed.

After you have protected a VM with Rubrik, restoring it moves no data! Instead, the VM is presented to VMware and you can BOOT the machine. As a VMware admin you will see this VM running in a datastore. So, how will it perform? Well, how does 30,000 IOPS per appliance sound to you? Cool, right? This number came from an IOMeter-driven test running 4K blocks under a 50% read / 50% write random load.
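
For context, here is a quick back-of-the-envelope conversion of that figure into throughput at the quoted 4K block size; it is pure arithmetic, not additional vendor data.

```python
# Pure arithmetic: the quoted 30,000 IOPS at a 4 KiB block size, expressed as
# throughput. No additional vendor data is implied.
iops = 30_000
block_bytes = 4 * 1024
throughput_mib_s = iops * block_bytes / (1024 * 1024)
print(f"{iops:,} IOPS x 4 KiB ~= {throughput_mib_s:.0f} MiB/s")   # ~117 MiB/s
```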

So, why is this so cool? Well, let's look at the contrast. Traditional backup is used to restore, meaning you copy the data, and then copy the data back to its origin in case of data loss. In the Rubrik case, you copy the data and keep versions, and each version is now usable in parallel with the original without the need to move the data back, removing the painful restore window.

The idea is to do more with less: you don't have to wait for data loss to boot up a VM. You could use this for test and dev workflows. Boot a new machine from Rubrik, test in parallel, and then tear down the instance.

Combine this functionality with replication and you could have your test and dev offshore.

A famous slogan from Data Domain comes to mind: tape is dead, get over it. Yet in 2015 I kept hearing requirements for actual physical tape. However, I have been able to convert most of those over to disk, and I am sure you are hearing the same. Rubrik is going to make that conversation a much easier one. The tape-out requirement in this case can go to the cloud, achieving off-site, long retention at low cost.

In my opinion, Rubrik is fulfilling its promise to remove all the traditional components of a backup appliance; however, it is focused on VMware. I look forward to seeing Rubrik have the same impact on Hyper-V, OpenStack, and Docker, and to seeing it expand its long-retention or tape-out option to Google and Azure. It just makes sense!

If you are tech savvy: they have REST APIs and are built on HTML5. I challenge you to go build your own products and services around them!

My personal favorite aspects: Rubrik can provide a DR solution with replication, supporting both bi-directional and hub-and-spoke topologies. Imagine the architecture: two Rubriks, Site 1 and Site 2, cross-replicating so that each site holds both its primary data sets and the replica from the secondary site. The cherry on top? All licensing is capacity-based to keep it simple. You won't be surprised by an additional license required to get your flux capacitor working. There's no additional cost for replication either; it's included free of charge.

In a nutshell, Rubrik has a great technical foundation, and its current focus on VMware is the best move for them, as it will help them gain a footprint in most environments. The features, functions, and ease of use, with the ability to be more than just backup, are a big WIN for the consumer. The way I see it, this is a great start to a new wave of data protection. Rubrik can finally provide additional value to backup and reach RTOs and RPOs that are out of reach for other platforms.

If you are interested in my previous write up on Rubrik visit:  https://tinyurl.com/yblscdgq or https://tinyurl.com/yb8y4fj5

Have a great day!

Julio Calderon Senior Product Manager (Data)  Skype: Storagepro  Twitter: @JuliocUS Email: jcalderon@kionetworks.com

Triple Parity Data Protection!

I remember when, a long time ago, double parity was only provided by NetApp. NetApp back then had the experience of concentrating more data and had probably been impacted by dual drive failures. It was called RAID-DP; you can find its details here: http://www.netapp.com/us/communities/tech-ontap/tot-back-to-basics-raid-dp-1110-hk.aspx

Dual parity nowadays is a basic RAID level offered by most vendors. As disk sizes and the amount of data continue to grow, we concentrate ever more important aspects of our business in these technology units (storage devices).

The issue with lower parity in RAID groups is the stress put on the remaining drives in the group. During the stressful task of rebuilding after a drive failure, you hope you do not lose another drive.

Now, think about this. All drives come with an MTBF, a mean time between failures, and when you ordered your storage it came with most of the disks you are using today. So, if the error rate is high enough that your storage chooses to take a drive offline, or the drive actually dies, what is the likelihood that the other drives from the same batch behave the same way? And on top of that, you are adding stress to those same pre-existing drives. Obviously, there are many more variables to account for, but I think you get the point.
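
To put a rough number on that risk, here is a hedged back-of-the-envelope estimate of a second drive failing while a rebuild is in flight. It assumes independent drives with a constant failure rate, which is exactly the optimistic assumption the same-batch argument calls into question, and all of the inputs are illustrative.

```python
# A rough estimate of a second drive failing while a rebuild is in flight,
# assuming independent drives with a constant failure rate (the optimistic
# assumption questioned above). All inputs are illustrative.
import math

MTBF_HOURS = 1_200_000     # typical enterprise drive spec
GROUP_SIZE = 12            # drives in the RAID group
REBUILD_HOURS = 20         # time spent rebuilding the failed drive

surviving = GROUP_SIZE - 1
p_second_failure = 1 - math.exp(-surviving * REBUILD_HOURS / MTBF_HOURS)
print(f"P(another drive fails during the rebuild) ~= {p_second_failure:.4%}")
```

Even under those friendly assumptions the window is not zero, and same-batch drives plus the extra rebuild stress only push the real number higher.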

Needless to say, the industry needs more data protection. Here is a URL for a paper by Atul Goel and Peter Corbett, both of NetApp: https://atg.netapp.com/wp-content/uploads/2012/12/RTP_Goel.pdf. Their algorithm will protect you from triple drive failure. I can't wait to see it as an option in all storage arrays!

In the meantime, what can you do to protect against dual drive failures? Some folks say mirror, mirror, mirror and forget RAID 5 or 6. Yes, triple mirror. Sure, that's an approach! The claim that triple mirror is cheaper? It depends on your volume and what business you run. If you are looking to protect against double failures and do not want to do RAID 6, you might be better off just mirroring your data set from a volume on one RAID 5 to a volume on a different RAID 5. When you want to protect against dual disk failures, the idea is to have more copies of the data, and that can be achieved in different ways.
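
For a quick sense of the trade-off being discussed, here are the textbook usable-capacity ratios and worst-case failure tolerance for the schemes mentioned, assuming an 8-drive group for the parity layouts; real arrays add spares and metadata overhead on top of this.

```python
# Textbook usable-capacity ratios and worst-case failure tolerance for the
# schemes mentioned above, assuming an 8-drive group for the parity layouts.
# Real arrays add spares and metadata overhead on top of this.
n = 8   # drives per RAID group (assumption)

schemes = {
    "RAID 5 (single parity)": ((n - 1) / n, 1),
    "RAID 6 (double parity)": ((n - 2) / n, 2),
    "Two-way mirror":         (1 / 2,       1),
    "Triple mirror":          (1 / 3,       2),
}

for name, (usable, failures) in schemes.items():
    print(f"{name:24s} usable: {usable:5.1%}   failures tolerated (worst case): {failures}")
```

Mirroring a RAID 5 volume to a second RAID 5 volume, as suggested above, lands at roughly 44% usable capacity with 8-drive groups on each side, while keeping a full second copy of the data.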

There are also other blogs and articles I reviewed before I put this one together. Here they are for your review: http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-9.html http://www.techrepublic.com/blog/the-enterprise-cloud/raid-5-or-raid-6-which-should-you-select/

Above is a graph I put together to depict how, as drive sizes continue to expand, our time to recover also continues to expand. The time it takes to recover is our time at risk. Now, with technologies such as deduplication and compression capable of storing 3X, or in some cases 30X, the original data size, we concentrate more data (and more risk) in the same storage footprint, making recovery times more critical than ever before. Keep in mind that your clients might be implementing compression and deduplication at the application level, so you may unknowingly be assuming a higher-risk mode of operations.
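
A simple sketch of the rebuild-time math behind that graph, assuming a rebuild that proceeds at roughly 150 MB/s per drive; real rebuilds are often slower because the array keeps serving production IO at the same time.

```python
# Rebuild-time math behind the graph, assuming ~150 MB/s of rebuild throughput
# per drive; real rebuilds are often slower because the array keeps serving
# production IO at the same time.
REBUILD_MB_S = 150

for size_tb in (1, 4, 10, 16):
    hours = size_tb * 1_000_000 / REBUILD_MB_S / 3600
    print(f"{size_tb:2d} TB drive -> ~{hours:5.1f} hours of rebuild (time at risk)")
```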

In a nutshell, RAID 5 was effective protection for smaller drives, and the impact of a rebuild was not that long given the amount of data you could store. Now that data sets are larger and single drives are reaching tens of TBs each, imagine your data loss risk! You should already be using RAID 6, and when we start using these much larger drives we should start using triple mirror. I recommend you look at RAID technologies that work at the chunk level, so that all of your drives take data and the RAID chunks are spread across all drives as well. Examples: EMC XtremIO XDP, Huawei RAID 2.0, HP 3PAR chunklet-based RAID. Using these technologies spreads the rebuild load across many more disks, lowering the burden per disk and increasing rebuild speeds.

 

Hope you found this informative, have a great day and weekend!
Regards,

Julio Calderon, Global Senior Product Manager @KIO Networks

Skype: Storagepro  Twitter: @JuliocUS  Email: jcalderon@kionetworks.com

Effortless! How to apply AFRs, MTBFs to your data management practice.

If you consume storage and build your own platform, or have an enterprise or SMB storage array, you have probably already heard the references to MTBF and AFR. What do they really mean to you, and how relevant are they? Really.

What is MTBF? Mean time between failures. What is AFR? Annualized failure rate. There is a good reference here: http://knowledge.seagate.com/articles/en_US/FAQ/174791en?language=en_US

MTBF is just a reference point that happens to be synthetically created under a set of specific variables that most likely do not represent your use case. AFR sounds much more consumable, but since it is derived from an MTBF, it is not much better!

However, there is a very popular study of real failure rates out in the open. It's done by Backblaze.

I wanted to take a small step toward satisfying my own curiosity: how do these AFRs differ from the data sheets, and what predictions can I make for my own environment?

In the Backblaze drive study the following chart was shared (great read, highly recommend it): https://www.backblaze.com/blog/hard-drive-reliability-stats-q1-2016/

For me, I really wanted to compare each drive's specification AFR to the actual AFR found by Backblaze.

If you want to learn how to calculate AFR based on MTBF follow this link: http://support.mdl.ru/pc_compl/firma/quantum/products/whitepapers/mtbf/qntmtbf4.htm

So, in the chart below I documented my research on published AFRs. Since not all of these drives had a published AFR, and it was hard to find some of the MTBFs, I had to omit the top two drives listed by Backblaze. Also, for some drives I calculated the AFR based on their MTBF. If you want proof of the research, send me your email and I'll send you all the links.
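
Where only an MTBF was published, a common conversion to AFR assumes a constant failure rate and a 24×7 duty cycle (8,760 power-on hours per year); here is a minimal sketch of that conversion, with the duty-cycle figure being my assumption.

```python
# A common MTBF-to-AFR conversion, assuming a constant failure rate and a 24x7
# duty cycle (8,760 power-on hours per year); the duty cycle is an assumption.
import math

HOURS_PER_YEAR = 8760

def afr_from_mtbf(mtbf_hours: float) -> float:
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (1_000_000, 1_200_000, 2_000_000):
    print(f"MTBF {mtbf:>9,} h -> AFR ~= {afr_from_mtbf(mtbf):.2%}")
```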

In the graph below you can see how some drive vendors' published MTBF or AFR figures never hit the mark of reality, reality being represented by the AFRs reported by Backblaze.

Now, how much delta is there between public spec sheet data and reality? Let's see! In the table below I calculated how much better or worse the actual AFR was compared to the public data.

In the graph below we can see that the larger deltas are on the worse side.

If you take a closer look at the numbers, you can see that out of 16 drive models, 8 of them performed better than their public data sheets documented (1.05%), while the other 8 performed a lot worse than their data sheets (4.17% worse overall). This was mostly due to the failure rates of 2 specific drive models.

I was hoping for much more predictability from my findings, but as I can see, drives are like wine: you might get a taste of a great year, while in some cases you probably overspent for what you got.

In my opinion, you really can't fully trust the AFR or MTBF references from drive manufacturers, given that your drives may live under very different variables from those used to calculate the product data sheets. My suggestion is to add a buffer of a few points on average so that you can best plan for your yearly drive failures, and to make sure that the storage technology you use can help you get through drive rebuilds. Check out my previous post on parity and protection at: https://www.linkedin.com/pulse/triple-parity-data-protection-julio-calderon?trk=pulse_spock-articles
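
To make that buffer concrete, here is a small, purely illustrative planning calculation; the fleet size, spec-sheet AFR, and margin are assumptions you would replace with your own numbers.

```python
# A purely illustrative version of that planning buffer: take the spec-sheet
# AFR, add a couple of points of margin, and estimate yearly failures so you
# can stock spares. Replace the inputs with your own numbers.
fleet_size = 500          # drives in service (assumption)
spec_afr = 0.0087         # 0.87% from a data sheet (assumption)
buffer = 0.02             # extra margin based on observed deltas (assumption)

planned_afr = spec_afr + buffer
expected_failures = fleet_size * planned_afr
print(f"Plan for ~{expected_failures:.0f} drive failures per year "
      f"({planned_afr:.2%} planned AFR across {fleet_size} drives)")
```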

Wishful thinking: I wish all drives in the world would send their uptime and error information to a central repository where we (the public) could learn real-life AFRs to share back with users and vendors alike.

If you have something to share, I look forward to your comments.

Regards!

Julio Calderon, Global Senior Product Manager @KIO Networks

Skype: Storagepro  Twitter: @JuliocUS Email: jcalderon@kionetworks.com