CITRIX XENSERVER – SETTING MORE THAN ONE VCPU PER VM TO IMPROVE APPLICATION PERFORMANCE AND SERVER CONSOLIDATION E.G. FOR CAD/3-D GRAPHICAL APPLICATIONS

source: Citrix Blog

Many rich graphical applications such as CAD applications may benefit from allocating more than 1 vCPU per VM (1 vCPU is the default) via XenCenter or the command line. This will be of interest to many of those evaluating XenDesktop and XenApp HDX 3D Pro GPU pass-through and GPU sharing technologies, and can lead to noticeable performance enhancements. The performance gains possible, though, are highly application specific, and users should evaluate and benchmark for their particular application and user workflows.

Overprovisioning

Virtualisation allows users to define more vCPUs than there are physical CPUs (pCPUs) available. A key driver for server consolidation is to virtualise workloads which spend much of their time idle. When virtualising, the expectation is often that when one of the VMs requires increased processing resources, several of the other VMs on the same host will be idle. This type of statistical multiplexing enables the total number of vCPUs across all VMs on a host to be potentially much greater than the number of physical cores.

In the context of virtualising graphical workloads (such as those that benefit from GPU acceleration), this means you really should look at the demographics and usage patterns of your user base – I blogged about this a few months ago. If you are streaming video or similar constantly to all users there may be little scope to overprovision, but if your users use applications in ways that use the CPUs in bursts, there is often significant scope to consolidate resources.

Reservation of Capacity for Dom0

Before proceeding, it is worth noting that the XenServer scheduler ensures that dom0 cannot be starved of processing resources, given that it is crucial for normal operation of the platform. The scheduler ensures that dom0's vCPUs are allocated at least as many real CPU cycles as they would receive if each had a dedicated physical core. In other words, today, we ensure that dom0 can use the entirety of 4 physical cores if it needs them. The maximum amount of CPU resource that dom0 can use is the equivalent of half the number of physical cores on the system.

This means that on a heavily loaded host, capacity for VMs will be the number of physical cores, minus 4.
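
As a quick sanity check, you can see how many vCPUs the control domain has on a given host. The commands below are a minimal sketch and the output will vary with your hardware and XenServer version:

# how many vCPUs the control domain (dom0) has been given
[root@xenserver ~]# xe vm-list is-control-domain=true params=name-label,VCPUs-max
# the hypervisor's view: which physical CPUs dom0's vCPUs are currently running on
[root@xenserver ~]# xl vcpu-list Domain-0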

vCPU Consolidation Ratio

As a rule of thumb, the ideal consolidation ratio will mean that ~80% of physical CPU utilisation is achieved under normal operation, leaving a 20% margin for bursts of activity. The consolidation ratio that can be achieved is therefore highly variable; for some workloads, a 10:1 consolidation ratio is not theoretically unreasonable.

Having said this, very high consolidation ratios can result in unexpected performance impacts. For example, when attempting to use 150 vCPUs on a 54 physical core box, 4 cores will go to dom0, giving a consolidation ratio of 3:1 (quite low). However, if all the CPUs are reasonably heavily loaded, each vCPU will receive a 30 millisecond time slice roughly once every 90 milliseconds, i.e. it waits around 60 milliseconds between slices. The higher the consolidation ratio, the longer the interval between time slices.

This waiting can impact aspects such as TCP network throughput, because packets appear to take far longer than expected to arrive at the VM. In the worst case, with very high consolidation ratios, if a VM receives a time slice extremely infrequently (once every few seconds) it will appear to freeze.

There is no absolute limit for the vCPU consolidation ratio. Clearly, the greater the ratio, the greater the expectation that the VMs will spend a larger fraction of their lives idle. The way to understand whether a consolidation ratio is appropriate on a running system is to examine the overall physical CPU utilisation of the host.
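
One rough way to gauge your current consolidation ratio from the CLI is to compare the number of physical CPU threads the host reports with the total number of vCPUs configured across the VMs. This is only a sketch (the tr/awk plumbing is simply one way to total the --minimal output), and it counts configured rather than running vCPUs:

# number of physical CPU threads on the host
[root@xenserver ~]# xe host-cpu-list --minimal | tr ',' '\n' | wc -l
# total vCPUs configured across all VMs, excluding the control domain
[root@xenserver ~]# xe vm-list is-control-domain=false params=VCPUs-max --minimal | tr ',' '\n' | awk '{total+=$1} END {print total}'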

cores-per-socket

Windows client operating systems support a maximum of two CPU sockets, e.g. Windows 7 can only use 2 sockets. XenServer's default behaviour is to present each vCPU as a separate single-core socket; i.e. 4 vCPUs would appear as a 4 socket system. Within XenServer you can set the number of cores to present per socket; this setting is applied per guest VM. With it you can tell XenServer to present 4 vCPUs as a single 4 core socket:

[root@xenserver ~]# xe vm-param-set uuid=<vm-uuid> platform:cores-per-socket=4 VCPUs-max=4 VCPUs-at-startup=4
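
To check the value afterwards, a quick sketch:

[root@xenserver ~]# xe vm-param-get uuid=<vm-uuid> param-name=platform param-key=cores-per-socket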

Further background can be found at http://support.citrix.com/article/CTX126524.

 

Licensing

Users will need to check whether any additional constraints are imposed by the licensing conditions of their graphical software and operating systems.

Maximum number of vCPUs per VM

The underlying Xen hypervisor ultimately limits the number of vCPUs per guest VM, see here. The Citrix XenServer limit is currently lower and is defined by what is auto- and regression-tested; this limit is currently 16 vCPUs. You should reference the XenServer Configuration Guide for the version of XenServer you are interested in, e.g. for XS6.2, see here. It is therefore likely that higher numbers of vCPUs will work, which may be of interest to unsupported users; however, for those with support, we advise you to stick to the supported limits. If there were sufficient demand, raising the supported limit is something we would evaluate.

The XenCenter GUI currently allows users to set up to 8 vCPUs; for higher numbers, users must use the xe command line interface (CLI).
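
For example, to give a VM 12 vCPUs from the CLI (an illustrative value within the supported limit; the exact workflow may vary, but the VM generally needs to be halted, and VCPUs-max must be raised before VCPUs-at-startup):

# the VM must be shut down before changing its vCPU count
[root@xenserver ~]# xe vm-shutdown uuid=<vm-uuid>
[root@xenserver ~]# xe vm-param-set uuid=<vm-uuid> VCPUs-max=12
[root@xenserver ~]# xe vm-param-set uuid=<vm-uuid> VCPUs-at-startup=12
[root@xenserver ~]# xe vm-start uuid=<vm-uuid>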

 

Anecdotal information

I've collected some of the feedback I've had from those involved with vGPU deployments and, although it cannot be considered official Citrix-endorsed best practice, I felt it was useful to share:

  • Most of the major CAE analysis applications will probably benefit from more than 2 cores, although in some cases it depends on licensing. Ansys, Abaqus, the SolidWorks analysis module etc. will all usually take as many cores as they can find. Some applications will limit you to 2 cores unless you license an HPC pack or similar, though.
  • Similarly, ray tracing applications, even when they are GPU accelerated, will often use multiple CPU cores (in addition to the GPU), although current versions may not take all cores since they want to leave some for interactivity.
  • We've had feedback from customers doing analysis work that they expect applications to use all available cores so jobs run faster, and that they notice it in virtualised environments when CPU cores become oversubscribed (please be aware of the need to understand overprovisioning, discussed above).
  • Our general rule is that to do good 3D interactive graphics you need a minimum of 4 vCPU cores, and for workstation cases 8 vCPU cores, not because all 8 will be used during graphics, but because most high-end workstations have 8 cores and users and applications expect it.
  • Likewise, during graphics all 4 vCPU cores may not be 100% busy, but it's likely that 2 are, between the various needs of application code, graphics driver execution, the operating system, virus checkers etc. So if you only have 2 vCPU cores we frequently observe a noticeable impact on performance.

How to investigate vCPU usage and server consolidation

One useful tool for investigating the performance of a system is xentop; details of xentop and other useful tools are available here. This allows you to understand the overall load on the system, and hence whether you are below the 80% load recommended above.
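
For example (a minimal sketch; option support may vary slightly between versions):

# interactive per-domain CPU usage, refreshed every 2 seconds
[root@xenserver ~]# xentop -d 2
# a single batch-mode snapshot, convenient for logging
[root@xenserver ~]# xentop -b -i 1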

Over the last few versions of XenServer we've been increasing the number of metrics available and also improving the documentation and alerting mechanisms around these metrics. These new metrics complement those comprehensively documented in Chapter 9 of the XenServer 6.2 Administrator's Guide. If you are interested in measuring vCPU overprovisioning from the point of view of the host, you can use the host's cpu_avg metric to see whether it is too close to 1.0 (rather than 0.8, i.e. 80%). If you are interested in measuring vCPU overprovisioning from the point of view of a specific VM, you can use the VM's runstate_* metrics, especially the ones measuring runnable time, which should be less than 0.01 or so. These metrics can be investigated via the command line or XenCenter.
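
As a rough sketch of how to read these from the xe CLI, the data-source commands can be used; the exact data source names (cpu_avg, runstate_runnable and friends) vary between versions, so list them first rather than taking the names below as definitive:

# list the host's data sources, then query average physical CPU utilisation (0.0 to 1.0)
[root@xenserver ~]# xe host-data-source-list uuid=<host-uuid> | grep cpu
[root@xenserver ~]# xe host-data-source-query uuid=<host-uuid> data-source=cpu_avg
# for a specific VM, look at the runstate metrics; runnable time close to 0 suggests
# its vCPUs rarely have to wait for a physical CPU
[root@xenserver ~]# xe vm-data-source-list uuid=<vm-uuid> | grep runstate
[root@xenserver ~]# xe vm-data-source-query uuid=<vm-uuid> data-source=runstate_runnable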

 

Investigating cores per socket and thread per core

The Xen hypervisor utility xl can be of some use for debugging and diagnosis in XenServer environments, particularly the inquiry options such as info [-n, --numa], with which you can query hardware information such as cores_per_socket and threads_per_core and similar data that you might want to log or keep in benchmarking assessments. Further details and links to this utility are given on this page, alongside other useful tools for use in XenServer environments.
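
For example, a quick sketch of pulling the topology fields out of xl info:

# CPU topology summary as seen by the hypervisor
[root@xenserver ~]# xl info | grep -E 'nr_cpus|nr_nodes|cores_per_socket|threads_per_core'
# add -n for the per-CPU / NUMA topology detail
[root@xenserver ~]# xl info -n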

Fine tuning

This type of fine tuning is indeed fiddly, and we are open to user feedback on how it could be improved, including the documentation of this type of information. In fact, our documentation team are specifically watching this post to see what comments it solicits. A large number of you have already written blogs and articles about your experiences of best practice in your own "real life" XenServer deployments, and they are of interest to us and to many customers.

Further Improvements for graphical applications – CPU pinning and GPU pinning

For many of these applications, a pinning strategy that considers the NUMA architecture of the server can also lead to performance enhancements. It is highly application and server specific, and you really do need to investigate this with your particular applications, user demographic profile and usage patterns. There is some information linked to at the bottom of this page, but it is really a topic best addressed in a future blog; in the meantime, I'd be interested to hear user anecdotes if anyone has tried this.
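
For anyone who does want to experiment, the sketch below shows one way to pin a VM's vCPUs to a subset of physical CPUs using the VCPUs-params:mask parameter and then inspect the result with xl. The CPU numbers are purely illustrative and should be chosen with the host's NUMA topology (xl info -n, above) in mind; this is not Citrix-endorsed best practice, and the mask takes effect the next time the VM starts:

# restrict this VM's vCPUs to physical CPUs 0-3 (applied at next VM start)
[root@xenserver ~]# xe vm-param-set uuid=<vm-uuid> VCPUs-params:mask=0,1,2,3
# confirm where each vCPU is running and which physical CPUs it is allowed on
[root@xenserver ~]# xl vcpu-list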

One Comment

  1. Tobias Kreidl

    Very nice article, Rachel! Looking forward to more.

    Indeed it should also be noted that for many thin clients, one of the VCPUs may be used pretty much entirely just for reformatting and dealing with the video output, so we — in fact — always assign two VCPUs for all our VMs at a minimum to provide extra CPU cycles where they are needed. With as many as 80 or more VMs running on a 32-VCPU server you'd think this would be inefficient, but with the environment we have, the dom0 load on a server is typically just 20-40%. Even with 20 or so pretty active users, it rarely climbs above 60%. Note that we allocate eight dom0 instances, each with 4 GB of memory, to be able to accommodate heavy use or boot storms on the XenDesktop environment, and stress tests have shown that four dom0 instances at times were just not quite enough. With 256 GB of memory, the extra 16 GB were worth allocating for that purpose.

    The use of a GPU/vGPU can cut VCPU load on a VM by 30 to 50%. We have not had the chance to see how well this scales with many users, but the impact is clearly not negligible. Testing is currently underway to see whether allocating a portion of a GPU to a VM is better or worse with many users than, say, creating a XenApp VM running on XenServer, leveraging GPU pass-through for the entire XenApp server, and having the VMs leverage the GPU that way. And of course, you can do both! The recent articles on PVH are, of course, tantalizing.

    We have not tried CPU pinning and, in an environment with this many VMs and a parade of users coming and going, it probably wouldn't make that much sense. The CPU multiplexing certainly doesn't seem to incur a big penalty in this environment, at least from qualitative observations.

    We have also experimented with various XenServer parameters, such as modifying the txqueuelen and some of the network and TCP parameters with apparently positive effects. Linux has such a plethora of tuning options, which of course should be modified with caution.
