Wednesday, November 21, 2012

Citrix Provisioning Server High Availability

Since this week is a short week for us, with Thanksgiving tomorrow, we decided to work with our Disaster Recovery specialist on certifying our XenDesktop, XenApp, and Provisioning Server infrastructure. Let me note that most DR tests conducted here usually take a full week. Being the overconfident guys that we are, we scheduled three days. This is where the story starts to unfold, as I explain an issue we ran into during our testing.
Like most infrastructure projects, one piece is built at a time to ensure that each component works on its own. In this scenario, we had built one Provisioning Server and streamed 10 XenDesktops from it to test functionality, performance, and the build process. All was great! We then went further and built the second Provisioning Server and added it to the same farm as the first. We copied the VHD and PVP files from one server to the other, checked the load balancing of the vDisk in the console and, voilà... load balanced. We thought we had everything locked down for our testing.
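(For what it's worth, the copy itself was nothing fancy. Here is a rough sketch of the idea in Python, with hypothetical store paths and a hypothetical vDisk name, in case you would rather script it than drag files around by hand.)

    import shutil
    from pathlib import Path

    # Hypothetical store paths and vDisk name -- adjust for your own environment.
    SOURCE_STORE = Path(r"\\PVS01\D$\vDiskStore")
    DEST_STORE = Path(r"\\PVS02\D$\vDiskStore")
    VDISK_NAME = "Win7Desktop"

    # Copy both the vDisk image (.vhd) and its matching properties file (.pvp)
    # so the second Provisioning Server can serve the same vDisk.
    for ext in (".vhd", ".pvp"):
        src = SOURCE_STORE / (VDISK_NAME + ext)
        dst = DEST_STORE / (VDISK_NAME + ext)
        print("Copying", src, "->", dst)
        shutil.copy2(src, dst)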

On the first day we had people launch a XenDesktop and leave the session running. We then went ahead and stopped the Stream Service on one of the Provisioning Servers... Not all of the XenDesktops failed over to the other node. We scoured the Event Viewer for messages and could not find anything to indicate a cause. We then started the Stream Service back up, and those XenDesktops still showed as connected to that node. We then decided to try failing over in the other direction... It worked successfully. Now we were really confused, and we started running through the configuration to explain how this could be happening. When we contacted one of our consultants about the situation, he directed us to look into a known Citrix bug as a possible culprit. I have to give him the benefit of the doubt, though, since he did not get the opportunity to look through our environment to determine the root cause.
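If you want to repeat this kind of failover test without clicking through services.msc on each node, the rough idea can be scripted. Below is a minimal sketch in Python, assuming hypothetical server names and assuming the Stream Service's internal name is StreamService (verify yours with `sc \\<server> query` first). Confirming which node each XenDesktop actually reconnects to still has to be done in the Provisioning Services console.

    import subprocess
    import time

    # Assumptions: hypothetical server names, and "StreamService" as the internal
    # name of the Citrix PVS Stream Service -- verify before relying on it.
    STREAM_SERVICE = "StreamService"
    NODE_TO_FAIL = r"\\PVS01"      # the node we deliberately take down
    SURVIVING_NODE = r"\\PVS02"    # the node the XenDesktops should fail over to

    def sc(server, *args):
        # Run sc.exe against a remote server and return its output.
        result = subprocess.run(["sc", server] + list(args),
                                capture_output=True, text=True)
        return result.stdout

    # Simulate a failure by stopping the Stream Service on one node.
    print(sc(NODE_TO_FAIL, "stop", STREAM_SERVICE))

    # Give the target devices some time to retry against the other node,
    # then confirm the surviving node's Stream Service is still running.
    time.sleep(60)
    print(sc(SURVIVING_NODE, "query", STREAM_SERVICE))

    # Bring the "failed" node back once the test is done.
    print(sc(NODE_TO_FAIL, "start", STREAM_SERVICE))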

Right before calling it quits for the day and chalking the issue up to the known Citrix bug, we dug a little deeper into the configuration and found the culprit, something we realized we never would have noticed had we not gone through our DR testing. When going through the Bootstrap configuration on each Provisioning Server, we noticed that only one IP address was entered.


Both nodes in the Provisioning Server pair had the same single IP address in their Bootstrap configuration: the IP address of node 1. Out of pure desperation, we decided to use the "Read Servers from Database" option, and it pulled the IP addresses of both nodes and put them into the Bootstrap. Of course, we made sure we did the same thing on the other node's Bootstrap configuration.
For good measure, we went ahead and rebooted the XenDesktops to ensure that they booted up with the new Bootstrap file (probably unnecessary, but like I said, we were desperate).

We went ahead and tested using the same procedure as stated above... and it worked! It's amazing how something so small, something that might seem unrelated, could affect the outcome.

Like I said earlier, I don't think we would have ever known to look into this configuration if we hadn't run this DR test. So, as a warning to others: please check the Bootstrap configuration on all of your Provisioning Servers and make sure each one lists every server, using "Read Servers from Database". If you don't, there is a chance that you won't fail over successfully during a true disaster.
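If you want something a little more repeatable than eyeballing the console, a quick check like the sketch below (hypothetical IP addresses, and again assuming the Stream Service's internal name is StreamService) will at least confirm that every server that should be in your Bootstrap answers and is streaming. It is not a substitute for a real failover test, but it catches the obvious gaps.

    import subprocess

    # Hypothetical IP addresses -- replace with every server that should appear
    # in your Bootstrap (Configure Bootstrap -> Read Servers from Database).
    BOOTSTRAP_SERVERS = ["10.0.0.11", "10.0.0.12"]

    # Assumed internal name of the PVS Stream Service; verify it in your environment.
    STREAM_SERVICE = "StreamService"

    for ip in BOOTSTRAP_SERVERS:
        # Basic reachability check (Windows ping syntax).
        ping = subprocess.run(["ping", "-n", "2", ip], capture_output=True, text=True)
        reachable = (ping.returncode == 0)

        # Confirm the Stream Service is actually running on that server.
        query = subprocess.run(["sc", "\\\\" + ip, "query", STREAM_SERVICE],
                               capture_output=True, text=True)
        running = "RUNNING" in query.stdout

        print(ip, "reachable:", reachable, "- stream service running:", running)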
