CITRIX XENDESKTOP AND PVS: A WRITE CACHE PERFORMANCE STUDY

Thursday, July 10, 2014   Source: Exit | the | Fast | Lane

image

If you're unfamiliar, PVS (Citrix Provisioning Server) is a vDisk deployment mechanism available for use within a XenDesktop or XenApp environment that uses streaming for image delivery. Shared read-only vDisks are streamed to virtual or physical targets, from which users can access random pooled or static desktop sessions. Random desktops are reset to a pristine state between logoffs, while users requiring static desktops have their changes persisted within a Personal vDisk pinned to their own desktop VM. Any changes that occur during a user session are captured in a write cache. This is where the performance-demanding write IOs land and where PVS offers a great deal of flexibility as to where those writes go. Write cache destination options are defined via PVS vDisk access modes, which can dramatically change the performance characteristics of your VDI deployment. While PVS does add a degree of complexity to the overall architecture, since its own infrastructure is required, it is worth considering because it can reduce the amount of physical computing horsepower required for your VDI desktop hosts. The following diagram illustrates the relationship of PVS to Machine Creation Services (MCS) in the larger architectural context of XenDesktop. Keep in mind that PVS is frequently used to deploy XenApp servers as well.

image

PVS 7.1 supports the following write cache destination options (from the Citrix PVS documentation; see the references below):

  • Cache on device hard drive – Write cache can exist as a file in NTFS format, located on the target-device’s hard drive. This write cache option frees up the Provisioning Server since it does not have to process write requests and does not have the finite limitation of RAM.
  • Cache on device hard drive persisted (experimental phase only) – The same as Cache on device hard drive, except cache persists. At this time, this write cache method is an experimental feature only, and is only supported for NT6.1 or later (Windows 7 and Windows 2008 R2 and later).
  • Cache in device RAM – Write cache can exist as a temporary file in the target device’s RAM. This provides the fastest method of disk access since memory access is always faster than disk access.
  • Cache in device RAM with overflow on hard disk – When RAM is zero, the target device write cache is only written to the local disk. When RAM is not zero, the target device write cache is written to RAM first.
  • Cache on a server – Write cache can exist as a temporary file on a Provisioning Server. In this configuration, all writes are handled by the Provisioning Server, which can increase disk IO and network traffic.
  • Cache on server persistent – This cache option allows for the saving of changes between reboots. Using this option, after rebooting, a target device is able to retrieve changes made from previous sessions that differ from the read only vDisk image.

Many of these were available in previous versions of PVS, including cache to RAM, but what makes v7.1 more interesting is the ability to cache to RAM with overflow to HDD. This provides the best of both worlds: extreme RAM-based IO performance without the risk, since you can now overflow to HDD if the RAM cache fills. Previously you had to be very careful to ensure your RAM cache didn't fill completely, as that could result in catastrophe. Granted, if the need to overflow does occur, affected user VMs will be at the mercy of your available HDD performance, but this is still better than the alternative (a BSOD).
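For admins who drive PVS from PowerShell rather than the console, switching a vDisk to this mode looks roughly like the sketch below. Treat it strictly as a sketch: PVS 7.x ships a PowerShell snap-in (Citrix.PVS.SnapIn), but the exact cmdlet parameters, property names, and the numeric write cache type code shown here are assumptions to verify against your PVS version's PowerShell reference, and the vDisk, store, and site names are placeholders.

```powershell
# Sketch only: the WriteCacheType code and cmdlet/property usage below are
# assumptions to confirm against the Citrix PVS PowerShell documentation
# (Get-Help Get-PvsDisk / Set-PvsDisk on a PVS server).
Add-PSSnapin Citrix.PVS.SnapIn

# Placeholder vDisk/store/site names; parameter names may differ by version.
$disk = Get-PvsDisk -Name "Win7-Gold" -StoreName "Store1" -SiteName "Site1"

$disk.WriteCacheType = 9     # assumed code for "Cache in device RAM with overflow on hard disk"
$disk.WriteCacheSize = 256   # RAM cache size in MB before overflowing to local disk
Set-PvsDisk $disk
```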

Results

Even when caching directly to HDD, PVS shows lower IOPS per user than MCS does on the same hardware. We decided to take things a step further by testing a number of different caching options. We ran tests on both Hyper-V and ESXi using our three standard user VM profiles against LoginVSI's low, medium, and high workloads. For reference, below are the standard user VM profiles we use in all Dell Wyse Datacenter enterprise solutions:

Profile Name | Number of vCPUs per Virtual Desktop | Nominal RAM (GB) per Virtual Desktop | Use Case
Standard     | 1                                   | 2                                    | Task Worker
Enhanced     | 2                                   | 3                                    | Knowledge Worker
Professional | 2                                   | 4                                    | Power User

We tested three write caching options across all user and workload types: cache on device HDD, RAM + Overflow (256MB), and RAM + Overflow (512MB). Doubling the amount of RAM cache on the more intensive workloads paid off big, reducing host IOPS to nearly zero. That's almost 100% of user-generated IO absorbed completely by RAM. We didn't capture the IOPS generated in RAM here using PVS, but as the fastest medium available in the server, and from previous work done with other in-RAM technologies, I can tell you that 1600MHz RAM is capable of tens of thousands of IOPS per host. We also tested thin vs thick provisioning using our high-end profile when caching to HDD, just for grins. Ironically, thin provisioning outperformed thick for ESXi; the opposite proved true for Hyper-V. To achieve these impressive IOPS numbers on ESXi it is important to enable intermediate buffering (see the links at the bottom). I've highlighted the more impressive RAM + overflow results in red below. Note: IOPS per user below indicates IOPS as observed at the disk layer of the compute host. This does not mean these sessions generated close to no IO; the vast majority of it was simply absorbed by the RAM cache before ever reaching disk.

Hypervisor | PVS Cache Type              | Workload     | Density | Avg CPU % | Avg Mem Usage (GB) | Avg IOPS/User | Avg Net KBps/User
ESXi       | Device HDD only             | Standard     | 170     | 95%       | 1.2                | 5             | 109
ESXi       | 256MB RAM + Overflow        | Standard     | 170     | 76%       | 1.5                | 0.4           | 113
ESXi       | 512MB RAM + Overflow        | Standard     | 170     | 77%       | 1.5                | 0.3           | 124
ESXi       | Device HDD only             | Enhanced     | 110     | 86%       | 2.1                | 8             | 275
ESXi       | 256MB RAM + Overflow        | Enhanced     | 110     | 72%       | 2.2                | 1.2           | 284
ESXi       | 512MB RAM + Overflow        | Enhanced     | 110     | 73%       | 2.2                | 0.2           | 286
ESXi       | HDD only, thin provisioned  | Professional | 90      | 75%       | 2.5                | 9.1           | 250
ESXi       | HDD only, thick provisioned | Professional | 90      | 79%       | 2.6                | 11.7          | 272
ESXi       | 256MB RAM + Overflow        | Professional | 90      | 61%       | 2.6                | 1.9           | 255
ESXi       | 512MB RAM + Overflow        | Professional | 90      | 64%       | 2.7                | 0.3           | 272

For Hyper-V we observed a similar story, and we did not enable intermediate buffering at the recommendation of Citrix. This is important! Citrix strongly recommends not using intermediate buffering on Hyper-V as it degrades performance. Most other numbers are well in line with the ESXi results, save for the cache to HDD numbers being slightly higher.

Hypervisor | PVS Cache Type              | Workload     | Density | Avg CPU % | Avg Mem Usage (GB) | Avg IOPS/User | Avg Net KBps/User
Hyper-V    | Device HDD only             | Standard     | 170     | 92%       | 1.3                | 5.2           | 121
Hyper-V    | 256MB RAM + Overflow        | Standard     | 170     | 78%       | 1.5                | 0.3           | 104
Hyper-V    | 512MB RAM + Overflow        | Standard     | 170     | 78%       | 1.5                | 0.2           | 110
Hyper-V    | Device HDD only             | Enhanced     | 110     | 85%       | 1.7                | 9.3           | 323
Hyper-V    | 256MB RAM + Overflow        | Enhanced     | 110     | 80%       | 2                  | 0.8           | 275
Hyper-V    | 512MB RAM + Overflow        | Enhanced     | 110     | 81%       | 2.1                | 0.4           | 273
Hyper-V    | HDD only, thin provisioned  | Professional | 90      | 80%       | 2.2                | 12.3          | 306
Hyper-V    | HDD only, thick provisioned | Professional | 90      | 80%       | 2.2                | 10.5          | 308
Hyper-V    | 256MB RAM + Overflow        | Professional | 90      | 80%       | 2.5                | 2.0           | 294
Hyper-V    | 512MB RAM + Overflow        | Professional | 90      | 79%       | 2.7                | 1.4           | 294

Implications

So what does it all mean? If you're already a PVS customer this is a no-brainer: upgrade to v7.1 and turn on "cache in device RAM with overflow to hard disk" now. Your storage subsystems will thank you. The benefits are clear in ESXi and Hyper-V alike. If you're deploying XenDesktop soon and debating MCS vs PVS, this is a very strong mark in the "pro" column for PVS. The fact of life in VDI is that we always run out of CPU first, but that doesn't mean we get to ignore or undersize IO performance, as that matters too. Enabling RAM to absorb the vast majority of user write cache IO allows us to stretch our HDD subsystems even further, since their burden is diminished. Cut your local disk costs by two-thirds or stretch those shared arrays 2 or 3x. PVS cache in RAM + overflow allows you to design your storage around capacity requirements, with less need to overprovision spindles just to meet IO demands (which wastes capacity).

References:

DWD Enterprise Reference Architecture

http://support.citrix.com/proddocs/topic/provisioning-7/pvs-technology-overview-write-cache-intro.html

When to Enable Intermediate Buffering for Local Hard Drive Cache

FOLDER REDIRECTION FUN WITH DFS AND NAS

Sunday, April 18, 2010

Source: Exit | the | Fast | Lane

Folder redirection has been around since Windows 2000 and has undergone significant changes since then. The core function is the same: take a local directory path and point it somewhere else, without the user knowing or caring that it isn't local. The advantage is that your users can store their important documents "on the network" without you having to map drives or instruct them to save to a particular location. You can selectively redirect documents and app data but exclude photos, music, etc. if you choose. In this example, I am interested in redirecting My Documents for all users to a secure, redundant, and high-performance NetApp filer. Technologies involved are Windows 7 Enterprise, Server 2008 R2 DFS, and NetApp CIFS shares running on a FAS2020.

First thing, create your CIFS shares on the filer. The way NetApp NAS with the CIFS protocol works is that the filer actually becomes a member of your domain. You can even apply certain GPO settings to it! The stated domain type is incorrect as I’m running in 2008 R2 native mode, but this doesn’t affect anything functionally from what I can tell.

 image

Your CIFS shares are then managed just like a regular Windows server. You can even connect to the filer via the computer management MMC.

image

Each share you create exists inside a volume and has an associated qtree. All the other NetApp goodies still apply: deduplication, snapshots, auto volume grow, and opportunistic locking. The rest of the options look very much like a regular Windows file server.

image

I have created a single hidden share called Users$ that sits in a 300GB volume. All of my users' My Documents will live here. Following best practices, I have granted Authenticated Users full control to the share, as I will control granular access with NTFS permissions. We're ready to prepare the DFS namespace.

Now in my redirection GPO I could simply point all users to redirect to \\<filername>\users$\<username>, but one of the values DFS provides is a consistent domain-based namespace: \\domain\<dfsRoot>\<redirection_root>\<username>. Everything that exists as a file share in my environment will be accessible via a DFS namespace; it's much cleaner this way and much easier to change targets should I need to enact my DR plan, which I would do via folder targets pointed at my DR filer. I first create a new domain-based namespace in Server 2008 mode called Users (add a $ to the end to make it hidden):

image

This is simply the DFS root, which lives in the local file system on my namespace server (a domain controller). To be able to point to my filer I need to create another folder inside of this DFS root that can then be targeted to a matching folder on my NAS. So I will create a new folder called "root" on both the NAS (\\cufas2\users$) and in DFS (\\domain.com\Users). First create the new root folder on the NAS, then add a folder to DFS with a target that points to the root folder on the filer. Additional targets can be configured and controlled for redundancy, replication, and DR.
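For reference, on servers that carry the DFS Namespaces PowerShell module (Server 2012 and later; this 2010-era build used the GUI wizards), the namespace and folder target above could be created with something like the following sketch. The namespace server name and paths are the placeholders from this post.

```powershell
# Sketch using the DFSN module (Server 2012+); paths mirror the examples above.
Import-Module DFSN

# Create the domain-based namespace root (target is the share on the namespace server).
New-DfsnRoot -Path '\\domain.com\Users' -TargetPath '\\namespaceserver\Users' -Type DomainV2

# Add the "root" folder in the namespace, targeted at the matching folder on the filer.
New-DfsnFolder -Path '\\domain.com\Users\root' -TargetPath '\\cufas2\users$\root'
```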

image

Before we configure the GPO let’s set permissions on the Root folder. This is a critical step and is what will ultimately make or break this configuration. Since this is the root folder for the entire share, remove permission inheritance and set the following permissions:

  • CREATOR OWNER – Full Control (Apply onto: Subfolders and Files Only)
  • System – Full Control (Apply onto: This Folder, Subfolders and Files)
  • Domain Admins – Full Control (Apply onto: This Folder, Subfolders and Files)
  • Authenticated Users – Create Folder/Append Data (Apply onto: This Folder Only)
  • Authenticated Users – List Folder/Read Data (Apply onto: This Folder Only)
  • Authenticated Users – Read Attributes (Apply onto: This Folder Only)
  • Authenticated Users – Traverse Folder/Execute File (Apply onto: This Folder Only)

image

This will allow all users to programmatically create their directory folders beneath the root folder as well as be granted full control to them without the ability to see anyone else’s folders. Domain Admins will have full control to all folders. Now we’re ready to set up our folder redirection GPO.
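If you'd rather script these root permissions than click through the security dialogs, the same ACL can be applied with PowerShell. Below is a minimal sketch: the root folder path and "MYDOMAIN" are placeholders, and the inheritance and propagation flags mirror the "Apply onto" selections in the list above.

```powershell
# Sketch: apply the root-folder permissions from the list above.
# The path and "MYDOMAIN" are placeholders.
$root = '\\cufas2\users$\root'

function New-Rule($Identity, $Rights, $Inherit, $Propagate) {
    New-Object System.Security.AccessControl.FileSystemAccessRule -ArgumentList `
        $Identity, $Rights, $Inherit, $Propagate, 'Allow'
}

$acl = Get-Acl $root
# Break inheritance and discard inherited entries, as done manually above.
$acl.SetAccessRuleProtection($true, $false)

# CREATOR OWNER - Full Control (Subfolders and Files Only)
$acl.AddAccessRule((New-Rule 'CREATOR OWNER' 'FullControl' 'ContainerInherit,ObjectInherit' 'InheritOnly'))
# SYSTEM and Domain Admins - Full Control (This Folder, Subfolders and Files)
$acl.AddAccessRule((New-Rule 'SYSTEM' 'FullControl' 'ContainerInherit,ObjectInherit' 'None'))
$acl.AddAccessRule((New-Rule 'MYDOMAIN\Domain Admins' 'FullControl' 'ContainerInherit,ObjectInherit' 'None'))
# Authenticated Users - Create Folder, List, Read Attributes, Traverse (This Folder Only)
$acl.AddAccessRule((New-Rule 'Authenticated Users' 'CreateDirectories,ListDirectory,ReadAttributes,Traverse' 'None' 'None'))

Set-Acl -Path $root -AclObject $acl
```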

Folder redirection is a user configuration setting, so the GPO that contains these settings must be linked to an OU that houses user accounts, or linked high enough in the AD tree that user-housing OUs will inherit it. Redirection can be set in basic or advanced mode: basic redirects everyone to the same location, while advanced provides the opportunity to redirect users differently based on security group. In either case you can redirect to the user's home directory, create a folder for each user under a specified root, redirect to a specific location, or redirect to the user's local %userprofile% path. I will be using the "create a folder" option under advanced mode, and the path is the DFS root created earlier: \\domain.com\users$\root. For now the policy will apply to one group, Domain Users, but I will have future flexibility should I need additional groups redirected differently. The effect of this policy is that each user, once successfully redirected, will automatically have a new folder under the root directory named after their username, with the Documents folder beneath it. Any other folders I choose to redirect will also live under this %username% directory.

image

Additionally, you can grant the user exclusive access to their folder, which keeps out even domain admins and creates problems for backups. Since we've set very specific permissions on the root we don't need to worry about this anyway. The other pertinent option is to move the contents of the source to the destination, which I have had problems with in my environment. I'll be leaving both of these unchecked.

image

Now when a user logs into any machine in the enterprise they will see the exact same documents folder which lives safe and sound on an enterprise storage array. Should the primary array fail, DFS will repoint their redirected documents to a DR filer in another datacenter.

 image

Something else to consider for your laptop users is offline files in conjunction with folder redirection. I have this enabled by default for all laptops and expressly disabled for desktops. This is a good compromise: laptop users will enjoy the benefits of redirection while in the office but will also be able to access and work on these documents while away. The next time they connect to the corporate network, any changes they made to an offline file will sync back up with their redirected folders on the NAS.

15 comments :

  1. Thanks a lot, very useful info for us.

  2. Nice article

  3. Nice article I’m getting ready to configure a root that will be for roaming profiles and folder redirections

  4. Curious about Windows 7 Search functions – my guess is that it is not available since you are using a DFS Namespace and a filer / server that is not running Windows Search Service. Was Search not a requirement for your implementation?

  5. Hi Kevin,

    Correct, integrated search was not a key design consideration at the time. SharePoint can integrate into Windows 7 search depending on your license level, but for this users would have to resort to opening the folder and then searching within it (slow).

    -Peter

  6. Hi, I'm not sure if you still monitor this blog, but if you do please reply back! I have some questions regarding the setup. I'm trying to do the exact same thing on a FAS2040, but there are already shares created for users' "home folders" which are mapped drives. I'm trying to move to folder redirection.

  7. Are you looking for guidance on what’s possible or need help crafting a plan? Is your DFS namespace setup conducive to user redirection as I outlined here?

  8. I will be doing the same thing you are trying to do, but I'm not sure about DFS. I have no experience with it but I could just follow your guide and TechNet. I guess I'm looking for some guidance when it comes to the actual filer. I came into an already set up environment where the previous admin created a share for each individual user and used a script to map a home drive. One of the hurdles, for example, is I can't access the filer from my MMC snap-in. It says I don't have permissions, and I can't find the settings on the filers to enable remote administration or give myself permissions.

  9. Ah, ok, I think I understand your situation. So those shares that exist currently on the filer should be fine. You’ll just be changing how they’re accessed and presented to the user via DFS/ redirection.

    Re admin access, can't help you there. Might put a call into NetApp.

  10. Thanks for your help! I have one more question: when you were setting up the NTFS permissions for the share, did you encounter a problem when adding "Creator Owner" with full control?
    When I add it, it strips it out of the permissions and says access denied on ~snapshots or something. I think the snapshots folder inside the share is preventing it. Any thoughts? NetApp support hasn't been helpful to me…

  11. Try this process with a brand new share, do you still get the error?

  12. Hi Weestro, I’m Argie.
    Interesting post. How do you manage replication between “Target Folders” in Filers.
    As far as I know (and I do try), DFS replication is not possible using shares in Filers, because they are not Windows Machines. Even NetApp says not to use DFS replication (see TR-3782).
    Do you have some workaround for this? Mind to Share?
    Thanks,
    Argie

  13. Hi Argie,

    There are a few ways to do this but the best way would be via the native replication tool (snapmirror) and replicate at the volume or qtree level of your root folder path. Now in DFS just add folder targets for the paths on each side of your replication mirror. You could then use site referral ordering or disable the DR target referrals in DFS until you need them, then in a DR test or failover scenario, make your DR targets active and break the snapmirror.

    HTH,

    Peter

  14. Is there a way to use storage replication like SnapMirror and to use some kind of automatic switching between main and DR site in case of failure?

  15. There is, but you’ll need a way to automate the process. VMware’s SRM tool, for example, does this. Zerto is another. If you’re a scripting master you could probably create something or use automation tools like Puppet, Chef, Salt…

XENAPP 7.X ARCHITECTURE AND SIZING

Thursday, May 08, 2014

Source: Exit | the | Fast | Lane

image

Peter Fine here from Dell CCC Solution Engineering, where we just finished an extensive refresh of our XenApp recommendation within the Dell Wyse Datacenter for Citrix solution architecture. Although not called "XenApp" in XenDesktop versions 7 and 7.1 of the product, the name has returned officially for version 7.5. XenApp is still architecturally linked to XenDesktop from a management infrastructure perspective but can also be deployed as a standalone architecture from a compute perspective. The best part of all now is flexibility: start with XenApp or start with XenDesktop, then seamlessly integrate the other at a later time with no disruption to your environment. All XenApp really is now is a Windows Server OS running the Citrix Virtual Delivery Agent (VDA). That's it! XenDesktop, on the other hand, is a Windows desktop OS running the VDA.

Architecture

The logical architecture depicted below displays the relationship between the two use cases, outlined in red. All of the infrastructure that controls the brokering, licensing, etc. is the same between them. This simplification of architecture comes as a result of XenApp shifting from the legacy Independent Management Architecture (IMA) to XenDesktop's FlexCast Management Architecture (FMA). It just makes sense, and we are very happy to see Citrix make this move. You can read more about the individual service components of XenDesktop/XenApp here.

image

Expanding the architectural view to include the physical and communication elements, XenApp fits quite nicely with XenDesktop and complements any VDI deployment. For simplicity, I recommend using compute hosts dedicated to XenApp and XenDesktop, respectively, for simpler scaling and sizing. Below you can see the physical management and compute hosts on the far left side, with each of their respective components considered within. Management will remain the same regardless of what type of compute host you ultimately deploy, but there are several different deployment options. Tier 1 and tier 2 storage are treated the same way when XenApp is in play and can make use of local or shared disk depending on your requirements. XenApp also integrates nicely with PVS, which can be used for deployment and easy scale-out scenarios. I have another post queued up for PVS sizing in XenDesktop.

image

From a stack view perspective, XenApp fits seamlessly into an existing XenDesktop architecture or can be deployed into a dedicated stack. Below is a view of a Dell Wyse Datacenter stack tailored for XenApp running on either vSphere or Hyper-V using local disks for Tier 1. XenDesktop slips easily into the compute layer here with our optimized host configuration. Be mindful of the upper scale when utilizing a single management stack, as 10K users and above is generally considered very large for a single farm. The important point to note is that the network, management, and storage layers are completely interchangeable between XenDesktop and XenApp. Only the host config in the compute layer changes slightly for XenApp-enabled hosts based on our optimized configuration.

image

Use Cases

There are a number of use cases for XenApp, which ultimately relies on Windows Server's RDSH role (terminal services). The age-old and most obvious use case is hosted shared sessions, i.e. many users logging into and sharing the same Windows Server instance via RDP. This is useful for managing access to legacy apps, providing a remote access/VPN alternative, or controlling access to an environment that can only be reached via the XenApp servers. The next step up naturally extends to application virtualization, where instead of multiple users being presented with and working from a full desktop, they simply launch the applications they need from another device. These virtualized apps, of course, consume a full shared session on the backend even though the user only interacts with a single application. Either scenario can now be deployed easily via Delivery Groups in Citrix Studio.

image

XenApp also complements full XenDesktop VDI through the use of application off-load. It is entirely viable to load every application a user might need within their desktop VM, but this comes at a performance and management cost. Every VDI user on a given compute host will have a percentage of allocated resources consumed by running these applications, which all have to be kept up to date and patched unless part of the base image. Leveraging XenApp with XenDesktop provides the ability to off-load applications and their loads from the VDI sessions to the XenApp hosts. Let XenApp absorb those burdens for the applications that make sense. Now instead of running MS Office in every VM, run it from XenApp and publish it to your VDI users. Patch it in one place, shrink your gold images for XenDesktop, and free up resources for other more intensive, non-XenApp-friendly apps you really need to run locally. Best of all, your users won't be able to tell the difference!

image

Optimization

We performed a number of tests to identify the optimal configuration for XenApp. There are a number of ways to go here: physical, virtual, or PVS streamed to physical/virtual using a variety of caching options. There are also a number of ways in which XenApp can be optimized. Citrix wrote a very good blog article covering many of these optimization options, most of which we confirmed. The one outlier turned out to be NUMA, where we really didn't see much difference with it turned on or off. We ran through the following test scenarios using the core DWD architecture with LoginVSI light and medium workloads for both vSphere and Hyper-V:

  • Virtual XenApp server optimization on both vSphere and Hyper-V to discover the right mix of vCPUs, oversubscription, RAM and total number of VMs per host
  • Physical Windows 2012 R2 host running XenApp
  • The performance impact and benefit of enabling NUMA to keep the RAM accessed by a CPU local to its adjacent DIMM bank.
  • The performance impact of various provisioning mechanisms for VMs: MCS, PVS write cache to disk, PVS write cache to RAM
  • The performance impact of increased user idle time to simulate less than 80+% concurrency of user activity on any given host.

To identify the best XenApp VM config we tried a number of configurations, including a mix of 1.5x CPU core oversubscription, fewer very beefy VMs, and many less beefy VMs. It's important to note that we based this on the 10-core Ivy Bridge E5-2690v2 part, which features Hyper-Threading and Turbo Boost. These things matter! The highest density and best user experience came with 6 x VMs, each outfitted with 5 x vCPUs and 16GB RAM. Of the delivery methods we tried (outlined in the table below), Hyper-V netted the best results regardless of provisioning methodology. We did not see a density difference between PVS caching methods, but PVS cache in RAM completely removed any IOPS generated against the local disk. I'll go more into PVS caching methods and results in another post.

Interestingly, of all the scenarios we tested, the native Server 2012 R2 + XenApp combination performed the poorest. PVS streamed to a physical host is another matter entirely, but unfortunately we did not test that scenario. We also saw no benefit from enabling NUMA. There was a time when a CPU accessing an adjacent CPU’s remote memory banks across the interconnect paths hampered performance, but given the current architecture in Ivy Bridge and its fat QPIs, this doesn’t appear to be a problem any longer.

The "Dell Light" workload below was adjusted to account for less than the 80% user concurrency we typically plan for in traditional VDI. Citrix observed that XenApp users in the real world tend not to all work at the same time. Fewer users working concurrently means freed resources and the opportunity to run more total users on a given compute host.

The net of this study shows that the hypervisor and XenApp VM configuration matter more than the delivery mechanism. MCS and PVS ultimately netted the same performance results but PVS can be used to solve other specific problems if you have them (IOPS).

image

* CPU % for ESXi hosts was adjusted to account for the fact that Intel E5-2600 v2 series processors with the Turbo Boost feature enabled will exceed the ESXi host CPU metric of 100% utilization. With E5-2690v2 CPUs the rated 100% in ESXi is 60000 MHz of usage (2 sockets x 10 cores x 3.0 GHz base clock), while actual usage with Turbo has been seen to reach 67000 MHz in some cases. The Adjusted CPU % Usage is based on 100% = 66000 MHz usage and is used in all charts for ESXi to account for Turbo Boost. Windows Hyper-V metrics, by comparison, do not report usage in MHz, so only the reported CPU % usage is used in those cases.

** The “Dell Light” workload is a modified VSI workload to represent a significantly lighter type of user. In this case the workload was modified to produce about 50% idle time.

†Avg IOPS observed on disk is 0 because it is offloaded to RAM.

Summary of configuration recommendations:

  • Enable Hyper-Threading and Turbo for oversubscribed performance gains.
  • NUMA did not show a tremendous impact either enabled or disabled.
  • 1.5x CPU oversubscription per host produced excellent results. 20 physical cores x 1.5 oversubscription netting 30 logical vCPUs assigned to VMs.
  • Virtual XenApp servers outperform dedicated physical hosts with no hypervisor so we recommend virtualized XenApp instances.
  • Using 10-Core Ivy Bridge CPUs, we recommend running 6 x XenApp VMs per host, each VM assigned 5 x vCPUs and 16GB RAM.
  • PVS cache in RAM (with HD overflow) will reduce the user IO generated to disk to almost nothing but may require greater RAM densities on the compute hosts. 256GB is a safe high water mark using PVS cache in RAM, based on a 21GB cache per XenApp VM.

Resources:

Dell Wyse Datacenter for Citrix – Reference Architecture

XenApp/ XenDesktop Core Concepts

Citrix Blogs – XenApp Scalability

4 comments :
  1. Do you have anything on XenApp 7.5 + HDX 3D? This is super helpful, but there is even less information on sizing for XenApp when GPUs are involved.

  2. Unfortunately we don't yet have any concrete sizing data for XenApp with graphics, but this is teed up for us to tackle next. I'll add some of the architectural considerations, which will hopefully help.

  3. Two questions:
    1. Did you include antivirus in your XenApp scalability considerations? If not, physical box overhead with Win 2012 R2 and 1 AV instance is minimal, when compared to 6 PVS streamed VMs outfitted with 6 AV instances respectively (I am not recommending to go physical though).
    2. When suggesting PVS cache in RAM to improve scalability of XenApp workloads, do you consider CPU, not the IO, to be the main culprit? After all, you only have 20 cores in a 2 socket box, while there are numerous options to fix storage IO.

    PS. Some of your pictures are not visible

  4. Hi Alex,

    1) Yes, we always use antivirus in all testing that we do at Dell. Real world simulation is paramount. Antivirus used here is still our standard McAfee product, not VDI-optimized.

    2) Yes, CPU is almost always the limiting factor and exhausts first, ultimately dictating the limits of compute scale. You can see here that PVS cache in RAM didn’t change the scale equation, even though it did use slightly less CPU, but it all but eliminates the disk IO problem. We didn’t go too deep on the higher IO use cases with cache in RAM but this can obviously be considered a poor man’s Atlantis ILIO.

    Thanks for stopping by!

SO LONG NETBIOS, IT’S BEEN FUN!

Thursday, January 13, 2011

Source: Exit | the | Fast | Lane

Without going too far down the "history of NetBIOS" rabbit hole, this protocol has been included in all versions of Windows since Windows for Workgroups, up to and including Windows 7. Back before DNS was adopted as the primary name resolution protocol, NetBIOS was used for PCs in workgroups to find each other by name and communicate. NetBIOS over TCP/IP is a non-routable broadcast protocol and is by nature very chatty on the wire. WINS was created to centralize and resolve NetBIOS name to IP address registrations, but DNS is still a much more efficient method and became the basis for Active Directory in Windows 2000. The problem with overly chatty broadcasts is that all hosts constrained within the boundaries of a broadcast domain (L3 VLAN) have to process every packet that is broadcast. This can be especially taxing in L3 VLANs with a large number of hosts.

While Windows has functioned fine without NetBIOS for over a decade, Microsoft continued to support the protocol for legacy applications that required its use. Having just built a pristine environment with Windows 7/Server 2008 R2 and all the latest technology, I decided to explore the elimination of NetBIOS from my environment.

So, why bother?

The number one reason this effort is worthwhile is the elimination of a broadcast protocol from your network stream. This will ultimately free up network interface usage and CPU cycles that are currently processing each packet that any host in the VLAN broadcasts. Run a network trace and you will see an alarming number of name query broadcasts sent and received via UDP/137. As you can see from the network capture below, 66.16 is broadcasting a NetBIOS name resolution request for a host called DC2.

Locally, on a client running NetBIOS, run the nbtstat -n command to display the local name table. If you can display this table, that means NetBIOS is alive and well. The wireless adapter is disabled on this client so there will be nothing in the cache for that connection.

Other reasons to abandon NetBIOS include maintaining antiquated browse lists, worrying about which resources might be accidentally visible in those browse lists, and fighting browser master wars. I have no need for this protocol in my network and will take steps to remove it.

How to disable NetBIOS over TCP/IP

There are a few ways to go about disabling NetBIOS programmatically; I want the path of least resistance. As with any process, you can accomplish this goal manually, but that is tedious, time-consuming, and ineffective. While this task can be done via GPO Preferences, that really isn't the cleanest method either. You would need to create a new GPO Pref registry item targeting the NetBiosOptions value in the PC's network interface key path. The problem with this method is that each PC will have a different GUID assigned to its network interfaces, highlighted below. You would first need to determine all applicable GUIDs in your network and push this policy so that each can be updated individually. No easy task.
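To illustrate what that per-machine scripting amounts to, here is a small sketch that walks every NetBT interface key and sets the value directly, sidestepping the need to know the GUIDs up front. It would still have to run on each PC (as a startup script, for example), which is exactly the hassle the DHCP approach below avoids.

```powershell
# Sketch: disable NetBIOS over TCP/IP on every interface on this machine
# by setting NetbiosOptions = 2 under each Tcpip_{GUID} key.
$base = 'HKLM:\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces'
Get-ChildItem $base | ForEach-Object {
    Set-ItemProperty -Path $_.PSPath -Name NetbiosOptions -Value 2
}
```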

Luckily, if you use DHCP to assign IPs to your clients there is an easier way. By default all Windows clients are set to “default” under the NetBIOS setting portion on the WINS tab in their NIC’s TCP/IP settings. This default setting allows all clients that use a DHCP server to use the NetBIOS settings as defined by that server.

image

This setting corresponds to the "NetBiosOptions" registry entry in the aforementioned key path, whose value of 0 means that the default setting is enabled. Manually disabling NetBIOS above would set a value of 2 on this entry in the registry.

image

Great! So we can control NetBIOS via DHCP; this is certainly easier than the GPO Pref method. From your DHCP snap-in, navigate to the "scope options" portion of the scope you would like to change. Right-click --> Configure Options. Select the Advanced tab and change the vendor class dropdown to "Microsoft Windows 2000 Options." You will now see a 001 option to disable NetBIOS. Select it and change the data entry to 0x2, then click OK to activate.

image
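On DHCP servers new enough to carry the DhcpServer PowerShell module (Server 2012 onward; the environment in this post used the GUI), the same vendor-class scope option can be set with something like the sketch below. The scope ID is a placeholder, and the option/value mapping should be double-checked against the 001 option shown in the dialog above.

```powershell
# Sketch (DhcpServer module, Server 2012+): set the Microsoft vendor-class
# 001 "disable NetBIOS" option to 0x2 on a scope. Scope ID is a placeholder.
Set-DhcpServerv4OptionValue -ScopeId 192.168.10.0 `
    -VendorClass 'Microsoft Windows 2000 Options' -OptionId 1 -Value 2
```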

**This will not take effect until your clients renew their address leases and pull the new scope options.

Verify

On the DHCP client with a renewed IP lease, you will see a new registry entry, in the same key path shown previously, called “DhcpNetBiosOptions” with the corresponding value you set in the scope.

image

This new key is only read by the system if the NetBiosOptions value is 0. Running nbtstat –n should now yield an empty name cache with no node address.

image
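A quick way to read both values for every interface from PowerShell, using the same key path as before:

```powershell
# List NetbiosOptions and DhcpNetBiosOptions for each NetBT interface.
$base = 'HKLM:\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces'
Get-ChildItem $base | ForEach-Object {
    Get-ItemProperty -Path $_.PSPath |
        Select-Object PSChildName, NetbiosOptions, DhcpNetBiosOptions
}
```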

Another network traffic capture, assuming all clients in the scope have been updated, should yield no NetBIOS traffic. Keep in mind that disabling NetBIOS stops your ability to send broadcasts, not receive them.

image

Now, without all those NBNS broadcasts, you can keep tabs on what spanning tree is up to!

References:

RFC1001

RFC1002

NetBIOS over TCP/IP Configuration Parameters

MS KB313314

FUN WITH STORAGE SPACES IN WINDOWS SERVER 2012 R2 (Reposted From Exit | the | Fast | Lane)

Wednesday, September 17, 2014

Source: Exit | the | Fast | Lane

image

Previously I took a look at a small three-disk Storage Spaces configuration on Windows 8.1 for my Plex media server build. Looking at the performance of the individual disks, I grew curious, so I set out to explore Spaces performance in Server 2012 R2 with one extra disk. Windows 8.1 and Server 2012 R2 are cut from the same cloth, with the Server product adding performance tiering, deduplication, and the means for larger scale. At its core, Storage Spaces is essentially the same between the products and accomplishes a singular goal: take several raw disks and create a pool with adjustable performance and resiliency characteristics. The execution of this, however, is very different between the products.

For my test bed I used a single Dell PowerEdge R610 which has 6 local 15K SAS disks attached to a PERC H700 RAID controller with 1GB battery-backed cache. Two disks in RAID 1 for the OS (Windows Server 2012 R2) and four disks left for Spaces. Spaces is a very capable storage aggregation tool that can be used with large external JBODs and clustered across physical server nodes, but for this exercise I want to look only at local disks and compare performance between three and four disk configurations.

Questions I’m looking to answer:

  • Do 2 columns outperform 1? (To learn about columns in Spaces, read this great article)
  • Do 4-disk Spaces greatly outperform 3-disk Spaces?
  • What advantage, if any, does controller cache bring to the table?
  • Are there any other advantages in Spaces for Server 2012 R2 that outperform the Windows 8.1 variant (aside from SSD tiering, dedupe and scale)?

No RAID

The first rule of Storage Spaces is: thou shalt have no RAID in between the disks and the Windows OS. Spaces handles the striping and resiliency manipulation itself, so you can't have hardware or software RAID in the way. The simplest way to accomplish this is via a pass-through SAS HBA, but if you only have a RAID controller at your disposal, it can work by setting all disks to RAID 0. For Dell servers, in the OpenManage Server Administrator (OMSA) console, create a new Virtual Disk for every physical disk you want to include, and set its layout to RAID-0. *NOTE – There are two uses of the term "virtual disk" in this scenario. Virtual Disk in the Dell storage sense is not to be confused with virtual disk in the Windows sense, although both are ultimately used to create a logical disk from a physical resource.

image

To test the native capability of Spaces with no PERC hardware cache assist, I set the following policy on each Virtual Disk:

image

Make some Spaces!

Windows will now see these local drives as assignable disks for a Storage Pool. From the File and Storage Services tab in Server Manager, launch the New Storage Pool Wizard. This is where creating Spaces in Windows 8.1 and Server 2012 R2 diverge in a major way. What took just a few clicks on a single management pane in Win8.1 takes several screens and multiple wizards in Server 2012 R2. Alternatively, all of this can be done in PowerShell using the New-StoragePool and New-VirtualDisk cmdlets. First, name the new storage pool:

image

Add physical disks to the pool, confirm and commit.

image

You will now see your new pool created under the Storage Spaces section. Next create new virtual disks within the pool by right-clicking and selecting “New Virtual Disk”.

image

Select the pool to create the Virtual Disk within and name it. You'll notice that this is where you would enable performance tiering if you had the proper disks in place, using at least one SSD and one HDD.

image

Choose the Storage Layout (called resiliency in Win8.1 and in PowerShell). I previously determined that Mirror Spaces are the best way to go so will be sticking with that layout for these investigations.

image

Choices are good, but since we want to save committed capacity, choose Thin Provisioning:

image

Specify the size:

image

Confirm and commit:

image

Now we have a pool (Storage Space) and a Virtual Disk; next we need to create a volume. Honestly I don't know why Microsoft didn't combine this step with the previous Virtual Disk creation dialogs, as a Virtual Disk with no file system or drive letter is utterly useless.

image

Next select the size of the volume you want to create (smaller than or equal to the vDisk size specified previously), assign a drive letter and file system. It is recommended to use ReFS whenever possible. Confirm and commit.

image

Finally, a usable Storage Space presented to the OS as a drive letter!
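As mentioned earlier, the whole wizard sequence collapses into a handful of cmdlets. Below is a rough PowerShell equivalent of what was just built; the pool and vDisk names, size, and volume label are arbitrary placeholders, and it assumes the four RAID-0 pass-through disks show up as poolable.

```powershell
# Rough PowerShell equivalent of the wizard steps above (names/sizes are placeholders).
$disks     = Get-PhysicalDisk -CanPool $true            # the four RAID-0 "virtual disks"
$subsystem = Get-StorageSubSystem | Select-Object -First 1

New-StoragePool -FriendlyName 'LocalPool' `
    -StorageSubSystemFriendlyName $subsystem.FriendlyName -PhysicalDisks $disks

# -NumberOfColumns could also be specified here; columns are covered below.
New-VirtualDisk -StoragePoolFriendlyName 'LocalPool' -FriendlyName 'MirrorSpace' `
    -ResiliencySettingName Mirror -ProvisioningType Thin -Size 500GB

# Initialize, partition, and format the new space (ReFS, drive letter auto-assigned).
Get-VirtualDisk -FriendlyName 'MirrorSpace' | Get-Disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem ReFS -NewFileSystemLabel 'Spaces'
```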

Performance Baseline

First, to set a baseline I created a 4-disk RAID10 set using the integrated PERC H700 with 1GB cache (enabled). I copied a 6GB file back and forth between the C: and M: drives to test 100% read and 100% write operations. Write IO was fairly poor, not even breaking 20 IOPS, with the RAID10 double write penalty in full effect. Reads were much better, as expected, nearing 300 IOPS at peak.

image

3-Disk Mirror Space Performance

To force the server to go to disk for all IO operations (since I simply moved the same 6GB file between volumes), I used a little utility called Flush Windows Cache File by DelphiTools. This very simple CLI tool flushes the file cache and memory working sets along with the modified and purged memory standby lists.

Test constants: 3-disk Mirror space, PERC caching disabled, single column (default for 3 disks), 6GB file used to simulate a 100% read operation and a 100% write operation:

image

image

Here are a few write operations followed by a read operation to demonstrate the disk performance capabilities. Write IOPS spike ~30 with read IOPS ~170. Interestingly, the write performance is better than RAID10 on the PERC here, with read performance at almost half.

image

Turning the PERC caching back on doesn’t help performance at all, for reads or writes.

image

4-Disk Space, 1 Column Performance

Test constants: 4-disk Mirror space, PERC caching enabled, single column, 6GB file used to simulate a 100% read operation and a 100% write operation:

When adding a 4th disk to a 3-disk Space, the column layout has already been determined to be singular, so this does not change. To use a two column layout with four disks, the space would need to be destroyed and recreated. Let’s see how four disks perform with one column.

image

You’ll notice in the vDisk output that the Allocated size is larger now with the 4th disk, but the number of columns is unchanged at 1.

image

Adding the 4th disk to the existing 3-disk Space doesn't appear to have helped performance at all. Even with PERC caching enabled, there is no discernible difference in performance between 3 disks or 4 using a single column, so the extra spindle adds nothing here. Disabling PERC cache appears to make no difference, so I'll skip that test here.

image

4-Disk Space, 2 Column Performance

Test constants: 4-disk Mirror space, PERC caching enabled, two columns, 6GB file used to simulate a 100% read operation and a 100% write operation.

Using 4 disks will by default create Virtual Disks with 2 columns, as confirmed in PowerShell; this is also the minimum number of disks required to use two columns:

image
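For anyone following along in PowerShell rather than the screenshots, the column count can be read straight off the virtual disk object; the friendly name below is the placeholder used earlier:

```powershell
# Check how many columns (and what resiliency) a Space was created with.
Get-VirtualDisk -FriendlyName 'MirrorSpace' |
    Select-Object FriendlyName, ResiliencySettingName, NumberOfColumns, Size, FootprintOnPool
```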

WOW! Huge performance difference with four disks and two columns. The read performance is consistently high (over 700 IOPS) with the writes performing in a wider arc upwards of 300 IOPS. This is a 1000% performance increase for writes and 300% increase for reads. Literally off the chart! Spaces is allowing each of these participating disks to exhibit their full performance potential individually.

image

Thinking this was too good to be true, I disabled the PERC cache and ran the test again. This is no fluke; I got very close to the same results the second time with all disks participating. Incredible.

image

Plex Performance

Since I began this whole adventure with Plex in mind, it's fitting that I should end it there too. Streaming a compatible file across the network yielded the following results. Skipping around in the file, all four disks were read from at some point during the stream, half spiking to ~110 IOPS, the other two to ~80 IOPS. This is twice the observed performance of my home 3-disk 7.2K SATA-based HTPC.

image

Conclusions:

Using 15K SAS disks here, which are theoretically capable of ~200 IOPS each, performance is very good overall and seems to mitigate much of the overhead typically seen in a traditional RAID configuration. Spaces provides the resilient protection of a RAID configuration while allowing each disk in the set to maximize its own performance characteristics. The 3-disk one column Spaces configuration works but is very low performing. Four disks with one column don’t perform any better, so there is significant value in running a configuration that supports two or more columns. Columns are what drive the performance differential and the cool thing about Spaces is that they are tunable. Four disks with a two column configuration should be your minimum consideration for Spaces, as the performance increase adding that second column is substantial. Starting with three disks and adding a fourth will persist your single column configuration so is not advised. Best to just start with at least four disks. PERC caching showed to neither add nor detract from performance when used in conjunction with Spaces, so you could essentially ignore it altogether. Save your money and invest in SAS pass-through modules instead. The common theme that sticks out here is that the number of columns absolutely matters when it comes to performance in Spaces. The column-to-disk correlation ratio for a two-way mirror is 1:2, so you have to have a minimum of four disks to use two columns. Three disks and one column work fine for Plex but if you can start with four, do that.

Resources:

Storage Spaces FAQs

Storage Spaces – Designing for Performance