Wednesday, October 2, 2013

Fun with Tintri Part 3

So last time I spent some time going through my performance testing with the Tintri VMstore T540 to see how it stacked up against our current storage platform and also to get a good idea of just how far I could push it. The results were great and I knew that this thing would be able to handle just about whatever I might need to throw at it. At this point I really wanted to throw some real live workloads at it but I still had a few questions to answer before I could move on to that step. These are the kinds of questions that may not exactly be at the top of your mind when thinking about new storage, but they are important to answer nonetheless. Questions like:

  • What does Tintri's support system look like?
  • What kind of alerting and notification capabilities does it give me?
  • They say it's reliable and all but what really happens if I unplug that cable/remove that disk?

I decided to do some experimenting to see how the T540 would react to various hardware issues/failures so I could get an idea of what to expect and where to look when I had some actual VMs running on it. Before we get into the actual testing breakdown, it's good to take a quick look at the hardware status page provided in the T540 web interface:

This dashboard of sorts is the best place to start when you're having a system problem or think you may be having one. It gives you a quick snapshot of all the components of your system and how they are currently operating. It also tells you which of the controllers is currently active, and this is where you would perform a manual controller failover if you ever needed to. This screen changes and highlights information as the system experiences certain issues, as I will illustrate in my test cases.

Test 1: Pull a Network Cable from the Active Controller
I mean really, what else could you possibly try first? This is the quick, simple test to see how the system responds to the loss of a single network connection. If you have your system properly configured for redundancy, the answer is that you won't even notice it. I pulled one of my 10 Gig data connections from the active controller and the standby NIC took over without a hiccup. The hardware status page does let you know when a NIC goes down:

Test 2: Pull both Network Cables from the Active Controller
So now I know that the system doesn't even flinch at a single network cable loss, but what if both data connections were lost from the active controller? As you probably guessed, this will cause a failover to the standby controller which happens seamlessly. I had a continuous ping going to a test VM I had on the Tintri during the controller failover and I didn't even see a dropped ping. I can't guarantee that this will happen for you or that it will always work that way but in this case it never missed a beat. The hardware tab reported that both my data connections on the former active controller were down and there were also some system alerts generated. This leads us on to the alerts tab:

The Alerts button at the top of the screen in the web interface will have a number in parentheses next to it if the system has new alerts to be reviewed, which you can see in the picture above. The first entry there is just a notice telling me that my data network is up but is not redundant (from when I pulled the single cable), and then there are alerts from when both cables were removed and the system initiated the controller failover. If you configured your system to do so, you'll also receive an e-mail when an alert is generated, which I will show in just a little bit. Once you've reviewed the alerts you've got a few options for what to do with them: you can mark them as read, which removes them as active alerts, or you can archive them, which moves them to the archived section in case you need to review them later. One of my favorite features here is the ability to add comments to the alerts, so if you've got multiple administrators working on a single Tintri system you can leave notes for issues that pop up. Of course this gives me a chance to leave some horrendously awful comments for my poor co-workers. I wonder if Tintri support sees them in the autosupport data? One can only hope!
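If you want something a little more rigorous than eyeballing the continuous ping from Test 2, here is a minimal sketch of how you could check a saved ping log for gaps. It assumes Linux-style ping output that prints an `icmp_seq=` field (Windows ping does not), and the function name and the sample address are made up for illustration. It also can't see replies lost after the last one that made it back.

```python
import re
from typing import List

# Assumption: Linux-style ping output, where each reply line has "icmp_seq=N".
SEQ_PATTERN = re.compile(r"icmp_seq=(\d+)")


def find_dropped_pings(ping_output: str) -> List[int]:
    """Return the sequence numbers missing between the first and last reply."""
    seen = {int(seq) for seq in SEQ_PATTERN.findall(ping_output)}
    if not seen:
        return []
    # Any number between the lowest and highest reply that never showed up
    # corresponds to a ping that went unanswered.
    return [seq for seq in range(min(seen), max(seen) + 1) if seq not in seen]


if __name__ == "__main__":
    # Hypothetical log excerpt; 10.0.0.50 is a placeholder test VM address.
    sample_log = "\n".join([
        "64 bytes from 10.0.0.50: icmp_seq=1 ttl=64 time=0.41 ms",
        "64 bytes from 10.0.0.50: icmp_seq=2 ttl=64 time=0.39 ms",
        "64 bytes from 10.0.0.50: icmp_seq=5 ttl=64 time=0.44 ms",
    ])
    print(find_dropped_pings(sample_log))
```

An empty list means no reply was missed between the first and last ping in the log, which matches what I saw during the failover.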

Test 3: Pull a Power Cable
This one is another quick and easy thing to test and it works as expected. A T540 chassis has two power supplies and a single one can power the entire system. Pulling a power cable will generate an alert in the alert log and will send you an e-mail. I noticed that it did take a little longer to get this alert than some of the other ones but that may be by design due to inherent issues with power.

Test 4: Pull a Hard Drive
Anyone who has ever worked with storage knows that drive failures are a bit more common than we would like so it's important to know how the system is going to handle it. I walked up to the chassis and yanked the contents of drive bay #1 from the chassis to see what this would do and as expected the system handles this without issue. My test VM kept on kicking and didn't notice a thing. Going back to the web interface the first thing that I saw on my hardware dashboard is that the drive I removed changed from green to red to indicate that it had been removed and it also showed one of the healthy disks was in rebuild mode.

I only pulled the single disk in my testing, but according to the Tintri T540 specifications each disk group (SSD and HDD) is in a RAID-6, so each group should be able to tolerate two disk failures. After I noticed the change on my hardware dashboard the e-mail alerts started to flow in. The first image shows the alert I received when the disk was removed and the rebuild started:

And this next one is the notification after the disk was reinserted into the system:
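As a quick aside on the RAID-6 point from Test 4, the rule of thumb is easy to express in code. This is only a sketch of the general RAID-6 property (two parity disks per group, so up to two failed disks per group), not a model of how the T540 actually lays out or rebuilds its disks. The function and group names are hypothetical.

```python
from typing import Dict

# RAID-6 keeps two parity blocks per stripe, so a group survives up to
# two simultaneous disk failures.
RAID6_MAX_FAILURES = 2


def array_survives(failed_disks_per_group: Dict[str, int]) -> bool:
    """Return True if every RAID-6 group has at most two failed disks."""
    for group, failed in failed_disks_per_group.items():
        if failed < 0:
            raise ValueError(f"negative failure count for group {group!r}")
        if failed > RAID6_MAX_FAILURES:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical scenarios: group names are labels only.
    print(array_survives({"ssd": 1, "hdd": 2}))
    print(array_survives({"ssd": 0, "hdd": 3}))
```

The check is per group: two failed HDDs plus one failed SSD is still within tolerance, while three failures in the same group is not.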

Tintri Support
One of the things that I really wanted to experience was interacting with Tintri support to get an idea of how the support process works and figure out what I could expect when I need to get assistance. I didn't want to submit a service request for no reason but I quickly found that it wouldn't even be necessary. During initial system setup we configured the T540 to send alerts not only to my team but also to Tintri's support team and by doing this support cases were automatically created when alerts were generated by the system. Here's an example of a case e-mail I received after the controller failover I caused in Test 2 by yanking both of the data connections from my active controller:

Now in my case I was just testing, so I didn't need to engage their support personnel, but it was nice to see that a case was automatically opened and that I could go forward with it if I needed to. If you don't respond to a case that is automatically opened, it will be closed after a period of inactivity or once they see that the condition has cleared, so there's nothing you have to do there.

One thing I do want to reiterate is that their support team seems fairly proactive in reaching out when they see that your system is having issues. At one point I was doing some testing with Horizon View desktops on the T540 and I had global snapshotting turned on, which caused it to try to snapshot some replica disks that simply should not be snapshotted. This generated an alert every time it tried, and after a few of these alerts I got a call from Tintri support just to ask me if everything was OK and to offer assistance. This was pretty great to see and it's something I have just not experienced with other storage providers out there.

In general if you need to submit a support case that isn't automatically opened you would log into the support portal website where you can select from all the appliances you have registered and open a case on the one experiencing the problem. While they recommend you submit cases via the support portal there is also an 800 number you can call if you need to open a case that way. I must say they do keep the web submission form nice and straight to the point:

Conclusions
While I didn't test every possible scenario here, it's clear that the system is designed with reliability in mind and can handle most of the common fault scenarios you will see. Tintri has provided a simple interface that lets you quickly check the health and status of your system and view and respond to any alerts that are generated. Their support methodology allows for both manual and automatic case creation, and their staff takes a proactive approach to case management, which is a great thing to see. Time will tell if they can maintain this approach as Tintri continues to grow, but for the time being I'm very happy with the interactions I've had with their support.

As I've been writing this entry I've finally started to move some stuff in my environment over to the Tintri and so far things are looking good. I'm not entirely sure what I'm going to write about next but word on the street is that Tintri OS 2.1 will be coming soon along with the very intriguing Tintri Global Center and if that is the case I will certainly be upgrading to that and taking a closer look into the functionality that it will provide. We'll see if something else hits me before that time but either way stay tuned!