28-11-14

Notes From The Trenches: Horizon View Multimedia and Graphics Tuning

In the last few weeks I've spent a lot of time with customers from different verticals looking at the performance of multimedia playback and hardware graphics rendering for applications. I thought I already had a good grasp of how all of these technologies work and how best to tune them, but it's not until you're presented with varying requirements that your skills are really put to the test: finding bottlenecks and, in some cases, turning what you thought you knew on its head.

This post relates solely to VMware Horizon View, but I guess it could be applied to any virtual desktop solution, whether that be XenDesktop, vWorkspace or anything else. From a View perspective, we are always hammered with the message "know your use cases". In many ways, this message gets repeated to the point that a touch of snow blindness kicks in and you just make some general assumptions about what will work and what won't.

To recap, at the very least before starting a View proof of concept (PoC), you should be asking yourself the following questions :-

  • How many users will require multimedia playback?
  • What level of quality is expected? 480p? 720p?
  • Will the videos be in full screen or will a smaller window suffice?
  • What applications will require OpenGL or DirectX support?

By covering off the above points, you put yourself in a position where the proof of concept can address a high proportion of user requirements. Don't beat yourself up though if you don't nail it 100% out of the gate; getting it right on day one is unheard of! Taking each point in turn, let's look at why the responses matter.

How many users will require multimedia playback?

Quite often, organisations limit access to video streaming sites such as YouTube, Vimeo, iPlayer and others to preserve office bandwidth. From a View perspective, this can do you a favour, but there are always use cases that slip under that radar and result in users having a degraded VDI experience, giving the solution an unmerited bad name. News websites such as Sky News or BBC News all have multimedia content these days, from radio interviews to video clips in small boxes with a full screen option.

The result of this question will help determine an appropriate pool design. If you have users that will use a lot of video (presenters, trainers, academics, for example), then it would make sense to have a dedicated pool of desktops for these users with more resource than would be given to “regular” users. Also, VMware’s recommendation is that good quality multimedia playback requires 2 vCPUs per virtual desktop, which in turn affects your infrastructure sizing requirements as this will impact your host:desktop consolidation ratios.
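To illustrate the knock-on effect on consolidation ratios, here's a rough back of the envelope sketch you can run in a shell. The host spec and overcommit ratio are made-up assumptions purely for illustration, so substitute your own figures:

    # All figures are illustrative assumptions, not recommendations
    CORES_PER_HOST=20        # e.g. 2 sockets x 10 cores (assumption)
    VCPU_OVERCOMMIT=4        # vCPU:pCore ratio you are comfortable with (assumption)
    VCPU_PER_DESKTOP=2       # VMware's guidance for multimedia desktops

    TOTAL_VCPU=$(( CORES_PER_HOST * VCPU_OVERCOMMIT ))
    echo "Roughly $(( TOTAL_VCPU / VCPU_PER_DESKTOP )) x 2 vCPU multimedia desktops per host"

On those assumed numbers you'd be planning for around 40 multimedia desktops per host, before memory, storage and GPU constraints are taken into account.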

If the answer to this question is "all users", then the next step is to determine what quality is expected.

What level of quality is expected? 480p? 720p?

If there are users expecting full screen 1080p native performance at 60 fps (the same as a Blu-ray player, etc.) then this level of expectation should be reset. The nature of a virtual desktop solution means that this won't be achieved, due to bandwidth, compression, hardware resource and other potential bottlenecks along the way. Remember that a Blu-ray player has a single cable running from it to the TV, so there isn't a bunch of other traffic flying down that cable, and the cable length is typically less than a metre (or 3 feet 3⅜ inches in old money!), so it's not even an apples to oranges comparison, really.

In a View environment, PCoIP is the display protocol of choice. RDP is also available, but generally lacks the flexibility and tuning options afforded to us by PCoIP. In addition, there are several hardware options to help augment and improve the PCoIP experience. If you recall, Teradici are the owners of PCoIP (VMware license it for use in View), and Teradici offer both host card based and zero client based options to offload PCoIP processing to dedicated hardware.

One other thing worth bearing in mind is that, in my experience, the human eye can't really tell the difference between 720p and 1080p at typical desktop screen sizes and viewing distances, so if you get mired in discussions with end users about this, you're kind of missing the point. If you can deliver video at a good resolution (720p) and a good frame rate, the rest is just splitting hairs in my opinion.

Will the videos be in full screen or will a smaller window suffice?

This obviously matters because the larger the video playback surface area, the more resource is required to push it along. In the tests I've done, standard 480p video played full screen uses around 20MB more GPU RAM than a video embedded in a web page (BBC News is a good example of this). If full screen, high quality video playback is required, you need to factor this into pool designs and the PoC. As always though, benchmark it yourself during the PoC phase, as your mileage will inevitably vary.

What applications will require OpenGL or DirectX support?

As we ascend to the more demanding groups of users, it's key to know who requires extra grunt on their virtual desktop. This again affects pool design, but also means that some consideration will need to be given to specialist graphics hardware to enable these users to work as effectively in a virtual environment as they do in the physical world. Some examples of applications requiring this level of support include, but are not limited to:-

(All of the following are OpenGL applications.)

  • Adobe After Effects, Photoshop CS3/CS4, Premiere Pro
  • AutoCAD
  • Google Earth
  • Google SketchUp
  • Scilab
  • Virtools

DirectX tends to be used mainly for gaming, so it's debatable whether there's a use case for it in general environments. That being said, one recent customer teaches computer game design and coding, which places this use case right in the middle of your View deployment.

What are my options?

There is a very good white paper from VMware that discusses graphics acceleration in View and is well worth a read. It focusses primarily on the NVIDIA GRID solution (which I've discussed previously) and is available here. So let's say you've done your scoping exercise and completed a desktop assessment, where do you go from here? There are lots of resources out there, if your Google-Fu is up to snuff, that will tell you about the different varieties of hardware available; I'm just going to provide a simple reference based on my testing.

Multimedia Playback

For good multimedia performance, look for a thin or zero client with the Teradici chip installed. I recently tested the 10Zig V1200P Zero Client with a customer and I have to say the multimedia performance was exceptional. This was partly due to the excellent bandwidth to the desktop that the customer had, but also because with the Tera2 chip installed, the PCoIP processing gets offloaded to a dedicated hardware device, in a similar way to how a TCP/IP Offload Engine works on a network card. It almost goes without saying that hardware offload will outperform software processing, and that was certainly the case in my experience.

If you can't stretch that far budget wise, ensure you follow best practice and have 2 vCPUs per virtual desktop, as video playback without hardware offload is an almost entirely CPU based operation. If you run up Task Manager on two desktops, one uniprocessor and one multiprocessor, you should see the difference in CPU spiking during multimedia playback. Remember though not to oversubscribe vCPUs; there is not a linear improvement the more vCPUs you add to a desktop, and in actual fact you could slow the whole environment down. This is well known in the virtualisation community and there is a good explanation as to why here. So in short? No more than 2 vCPUs per desktop unless a specific use case calls for it, which would depend on desktop applications that can actually make use of the extra processors.
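If you want to sanity check whether the extra vCPU is helping rather than hurting, esxtop on the host is a quick way to watch CPU ready time while a test video plays. This is just a sketch of the approach; interpret the figures against your own baseline rather than any hard threshold:

    # Interactive: press "c" for the CPU view and watch the %RDY column for
    # the desktop VMs during playback - sustained high ready time on
    # multi-vCPU desktops is a sign of vCPU oversubscription
    esxtop

    # Or capture a 60 second batch sample (12 samples, 5 seconds apart) to review offline
    esxtop -b -d 5 -n 12 > /tmp/playback-stats.csv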

Applications requiring OpenGL or DirectX support

Primarily in the case of OpenGL, I have seen improved performance using NVIDIA GRID K1 cards operating in vSGA mode. As stated in a previous post, make sure you have the latest versions of ESXi, Horizon View and the NVIDIA VIB on the ESXi host. So how much better is hardware acceleration? A picture is worth a thousand words, so below is the output from the PassMark 3D test for two desktops, one using software rendering and one using hardware rendering, both with 128MB of video RAM.

[Photos: PassMark 3D test results, software rendering vs hardware rendering, both with 128MB video RAM]

On the same two desktops I've also got the 2D PassMark test results shown below:-

[Photos: PassMark 2D test results for the same two desktops]

Apologies for the general rubbishness of the pictures; they were taken with a phone camera, but they give you a sense of the kind of performance you can expect with hardware acceleration. Thankfully PassMark is a free tool and can be downloaded from here.

How much Video RAM should I allocate to my pool?

Great question and somewhat subjective. VMware state that if you are using Windows 7 with Aero (and let's be honest, most people are) you should set a value between 64MB and 128MB in the pool settings (I couldn't find any change to this advice for View 6). In my experience, this is fine for basic use cases, but where multimedia is required, and especially good quality video playback, it won't be enough. We talk all the time about right sizing View deployments, but one real gap in the analysis for me is how to right size video RAM per pool.

I found a free utility called GPU-Z which is really useful for benchmarking performance during multimedia playback and determining the high watermark for video RAM usage. This can then be taken forward into the View design to ensure multimedia users have enough resource for their use case. GPU-Z can be downloaded here and is pretty simple to use. For my testing purposes, I ran it up during normal Windows navigation (starting and closing apps, web pages etc.).

[Screenshot: GPU-Z main tab, running on a laptop]

The above screenshot shows the tool running on a laptop and gives you an idea of the kind of information that can be gleaned. The Sensors tab is the one with all the key information; the screenshot below shows the "idle" state of the GPU during normal operations.

[Screenshot: GPU-Z Sensors tab with the GPU idle, around 206MB of video RAM in use]

As you can see above, we’re already at 206MB of video RAM and we’re not really doing anything. If your pool is set to the recommended 128MB RAM, there is an obvious pinch point there already. This will lead to degraded multimedia performance as there isn’t sufficient resource. Playing a small screen video from the BBC News website, the video RAM usage climbs to 284MB as shown below.

[Screenshot: GPU-Z Sensors tab during small-window video playback, around 284MB of video RAM in use]

And then finally, on the full screen version of the video, the usage climbs still further to 363MB, as shown below. So in this particular case, you'd be looking to set an initial pool high watermark of 370MB of video RAM (for example) to give the user sufficient horsepower for video playback in View. That being said, the results will most likely be lower in a virtual environment than on the laptop used here, so make sure you continue to monitor usage during the PoC phase to ensure that the pool video RAM size is neither under- nor over-specced.

[Screenshot: GPU-Z Sensors tab during full-screen video playback, around 363MB of video RAM in use]
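If you want to turn the GPU-Z high watermark into a pool setting, the arithmetic is nothing more sophisticated than the sketch below. The rounding rule is my own assumption rather than any official guidance, and the measured figures come from the laptop test above, so re-measure inside a virtual desktop during the PoC:

    # Observed GPU-Z figures from the laptop test above
    IDLE_MB=206            # desktop idle
    SMALL_VIDEO_MB=284     # small-window BBC News video
    FULL_VIDEO_MB=363      # full screen video - the high watermark

    # Round the watermark up to the nearest 10MB as a starting pool value (assumption)
    POOL_VRAM_MB=$(( (FULL_VIDEO_MB + 9) / 10 * 10 ))
    echo "Initial pool video RAM setting: ${POOL_VRAM_MB}MB - re-measure and adjust during the PoC"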

Conclusions

In summary, I’d say look at the following:-

  • Review end user requirements for graphics performance, including multimedia and application support
  • Spend time tuning PCoIP if you find that bandwidth is a constraint, but remember this is usually a balance or trade off between picture quality and playback smoothness. Audio quality is also affected by tuning the maximum audio bandwidth (see the session variable sketch after this list)
  • Conduct a PoC to level expectation appropriately
  • Use free tools such as Passmark and GPU-Z to accurately benchmark the environment and right size the capacity
  • Obtain eval units if you are going down the thin/zero client route and test out all the different use cases you know of to see which unit is most appropriate
  • vSGA with NVIDIA GRID cards can be very cost effective when applications require additional 3D resource
  • Consider the use of a Teradici APEX 2800 card to offload some graphics processing to dedicated hardware in the host (caveat: I haven't tested this, so Teradici, feel free to loan me one!)
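To give the PCoIP tuning point above something concrete to hang off, these are the kind of session variables involved. They are normally pushed to the desktops via the pcoip.adm GPO template (or the equivalent registry values under HKLM\SOFTWARE\Policies\Teradici\PCoIP); the values below are illustrative examples rather than recommendations, so treat this as a sketch and test against your own bandwidth and use cases:

    # Illustrative PCoIP session variables - example values only
    pcoip.max_link_rate                 = 20000   # per-session bandwidth cap, in kbps
    pcoip.maximum_frame_rate            = 24      # lower = less bandwidth, less smoothness
    pcoip.minimum_image_quality         = 40      # quality floor under congestion (30-100)
    pcoip.maximum_initial_image_quality = 80      # ceiling for the initial image build (30-100)
    pcoip.audio_bandwidth_limit         = 256     # audio cap, in kbps - the audio quality trade-off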

08-11-14

Adventures in NVIDIA GRID K1 and Horizon View

Just had a really interesting week helping a customer with a proof of concept environment using Horizon View 6 and an NVIDIA GRID K1 card, so I thought I’d blog about it. You know, like you do. Anyway, this customer already had a View 5.2 environment running on vSphere 5.1 on a Cisco UCS back end. They use 10 Zig V-1200 zero clients on the desktop to provide virtual desktops to end users. As they are an education customer providing tuition on 3D rendering and computer games production, they wanted to see how far they could push a K1 card and get acceptable performance for end users. Ultimately, the customer’s goal is to move away from fat clients as much as possible and move over to thin or zero clients.

I have to say that, in my opinion, there is not a great deal of content out there about K1 cards and Horizon View. One of the best articles I've seen is by my good friend Steve Dunne, where he conducted a PoC with a customer using vDGA (dedicated graphics). In our case, we were testing vSGA (shared graphics), as the TCO of dedicated would be far too high and impossible to justify to the cheque signers. This PoC involved a Cisco UCS C240-M3 with an NVIDIA K1 card pre-installed. The K1 card has four GPUs and 16GB of RAM on board, and is a dual-slot PCIe card.

I'm not going to produce a highly structured and technical test plan with performance metrics, as that really isn't how we ran the initial testing; it was a bit more ad hoc than that. We ran the following basic tests:-

– Official 720p “Furious 7” trailer from YouTube (replete with hopelessly unrealistic stunts)

– Official 1080p “Fury” trailer from YouTube  (replete with hopelessly unrealistic acting)

– Manipulating a pre-assembled dune buggy type unit in LEGO Digital Designer

– Manipulating objects in 3DS Max

When we started off at View 5.2, we found that video playback was very choppy, lip sync was out on the trailers and 3D objects would just hang on the screen. Not the most auspicious of starts, so I ensured the NVIDIA VIB for ESXi 5.1 was the latest version (it was) and that ESXi was at 5.1 U1. We made a single change at a time and went back over the test list above to see what difference it made. We initially set the test pool's 3D renderer to Hardware, with 256MB of video RAM. The gpuvm command was used to check that the test VMs were indeed assigned to the GRID K1 card, and we used nvidia-smi -l to monitor GPU usage during the testing process.
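For anyone wanting to repeat the host-side checks, this is roughly what we ran from the ESXi shell (the 5 second refresh interval is just an example):

    # Confirm the NVIDIA VIB is installed and which version it is
    esxcli software vib list | grep -i nvidia

    # List the VMs bound to the GRID card and how much GPU memory each has reserved
    gpuvm

    # Check the xorg service that vSGA relies on is running
    /etc/init.d/xorg status

    # Watch GPU utilisation on a loop while the tests run (5 second refresh)
    nvidia-smi -l 5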

Remember that when you configure video RAM in a desktop pool, half of the RAM is assigned from host memory and the other half comes from the GRID K1 card, so keep this in mind when sizing your environment. The customer's goal is to provide enough capacity for 50 concurrent heavy 3D users per host, so they are looking at two GRID K1 cards per host, pending testing and based on the K120Q profile.
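To make that halving concrete, here's a quick back of the envelope check based on the 256MB pool setting and the 50-user target. It's a straightforward split rather than an official sizing formula, so treat it as a sanity check only:

    POOL_VRAM_MB=256
    DESKTOPS_PER_HOST=50
    PER_DESKTOP_GPU_MB=$(( POOL_VRAM_MB / 2 ))    # half comes from the K1 card
    PER_DESKTOP_HOST_MB=$(( POOL_VRAM_MB / 2 ))   # half comes from host RAM

    echo "Host RAM consumed for video: $(( DESKTOPS_PER_HOST * PER_DESKTOP_HOST_MB ))MB"
    echo "GPU memory consumed: $(( DESKTOPS_PER_HOST * PER_DESKTOP_GPU_MB ))MB against 16384MB per K1 card"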

So after performing some Google-Fu, the next change we decided to make was to add the Teradici audio driver to the base image. The main reason for this was that there is apparently a known issue with 10Zig devices and audio sync, so we thought we'd give it a try. Although audio quality and sync did improve, it still didn't really give us the results we were looking for.

Having gone back to the drawing board (and the forums), the next change we made was to upgrade the View Agent on the virtual desktop from 5.2 to 5.3. Some customers in the VMware Communities forums had observed fairly major performance improvements doing this, without the need to upgrade the rest of the View infrastructure to 5.3. We did this, and boy did things improve! It was like night and day: the video barely flickered at all and the audio was perfect. At this point we decided that View 5.2 was obviously not going to give us the performance we needed to put this environment in front of power users for UAT.

The decision was taken to upgrade the identical but unused parallel environment at another site from vSphere 5.1 and View 5.2 to vSphere 5.5 U2 and View 6.0.1. The reasoning behind this was that I knew PCoIP had improved markedly in 6.0.1 from the 5.x days, plus it meant we were testing on the latest platform available. Once we upgraded the environment and re-ran the tests, we saw further improvement without major pool or image changes. We updated VMware Tools and the View Agent to the latest versions as part of the upgrade, and the customer was really impressed with the results.

In fact, as we were watching the 720p and 1080p videos, we remarked that you'd never know you were watching them on a zero client, basically streamed from a data centre. That remark is quite telling: if a bunch of grizzled techies say that, end users are likely to be even more chirpy! We also performed more rudimentary testing with LEGO Digital Designer and 3DS Max, with improved results. The PoC kit has now been left with the customer, as really we need subject matter experts to test whether or not this solution provides acceptable end user performance to totally replace fat clients.

What is the takeaway from this?

The takeaway, in my opinion, is that VDI is following a similar path to the one datacentre virtualisation followed a few years back. First you take the quick and easy wins such as web servers and file servers, and then you get more ambitious and virtualise database servers and messaging servers once confidence in the platform has been established.

VDI started with the lighter demands of the "knowledge worker", who uses a web browser and some Office applications at a basic level. Now that this target has been proven and conquered, we're moving up the stack to concentrate on users who need more grunt from their virtual desktop. With the improvements in Horizon View and ESXi, and now the support of hardware vendors such as NVIDIA and Teradici, native levels of performance for some heavy use cases can be achieved at a sensible cost.

That being said, running a PoC will also help find where the performance tipping point is and what realistic expectations are. In our case, early testing has shown that video, audio and smaller scale 3D object manipulation are more than feasible and a realistic goal for production. However, much heavier workloads such as the Unreal Development Kit may still be best suited to dedicated gaming hardware rather than a virtual desktop environment. The one thing I haven't mentioned, of course, is the fact that the customer has GigE to the desktop and a 10GigE backbone between their two sites. This makes a huge difference; I doubt we'd have seen the same results on an equivalent environment with 100Mbps to the desktop and a 1Gbps backbone.

The customer will be testing the PoC until the end of December; hopefully I can share the results nearer the time as to whether or not they proceed and, if so, how they do it. Hopefully for anyone researching or testing NVIDIA GRID cards in a VDI environment, this has been of some help.

[Image: NVIDIA GRID card]