05-05-17

Event Review – Google Cloud Next London – Day One

I was fortunate enough to spend the last couple of days at the Google Cloud Next London event at the ExCel centre and I have a few thoughts about it I’d like to share. The main takeaway I got from the event is that while there may not be the breadth of services within Google Cloud (GCP) as there is in AWS or Azure, GCP is not a “me too” public cloud hyperscaler.

While some core services such as cloud storage, VPC networking, IaaS and databases are available, there are some key differences with GCP that are worth knowing about. My interpretation of what I saw over the couple of days was that Google have taken some of the core services they’ve been delivering for years, such as Machine Learning, Maps and Artificial Intelligence and presenting them as APIs for customers to consume within their GCP account.

This is a massive difference from what I can see with AWS and Azure. Sure, there are components of the above available in those platforms, but these are services which have been at the heart of Google’s consumer services for over a decade and they have incredible power. In terms of market size, both AWS and Azure dwarf GCP, but don’t be fooled into thinking this is not a priority area for Google, because it is. They have ground to make up, but they have very big war chests of capital to spend and also have some of the smartest people on the planet working for them.

To start with, in the keynote, there was the usual run down of event numbers, but the one that was most interesting for me was that there were 4,500 delegates, which is up a whopping 300% on last year, and 67% of registered attendees described themselves as developers. Google Cloud is made up of GCP, G Suite (Gmail and the other consumer apps), Maps and APIs, Chrome and Android. Google Cloud provides services to 1 billion people worldwide per day. Incredible!

Gratuitous GC partner slide

There was the usual shout out of thanks to the event sponsors. One thing I did notice in contrast to other vendor events I’ve been to was the paucity of partners in the exhibition hall. There were several big names including Rackspace, Intel and Equinix but obviously building a strong partner ecosystem is still very much a work in progress.

We then had a short section with Diane Greene, who many industry veterans will know as one of the founders of VMware. She is now Senior VP for Google Cloud and it’s her job to get Google Cloud better recognition in the market. Something I found quite odd about this section is that she seemed quite ill prepared for her content and brought some paper notes with her on stage, which is very unusual these days. There were several quite long pauses and it seemed very under-rehearsed, which surprised me. Normally the keynote speakers are well versed and very slick.

GDPR and GC investment

Anyway, moving on to other factoids – Greene committed Google to be fully GDPR compliant by the time it becomes law next May. She also stated there has been $29.4 billion spent on Google Cloud in the last three years. The Google fibre backbone carries one third of all internet traffic. Let that sink in for a minute!

There is ongoing investment in the GC infrastructure and when complete in late 2017/early 2018, there will be 17 regions and 50 availability zones in the GC environment, which will be market leading.

 

GCP regions, planned and current

Google Cloud billing model

One aspect of the conference that was really interesting was the billing model for virtual machines. In the field, my experience with AWS and Azure has been one of pain when trying to determine the most cost effective way to provide compute services. It becomes a minefield of right sizing instances, purchasing reserved instances, deciding what you might need in three year’s time, looking at Microsoft enterprise agreements to try and leverage Hybrid Use Benefit. Painful!

The GCP billing model is one in which you can have custom VM sizes (much like we’ve always had with vSphere, Hyper-V and KVM), so there is less waste per VM. Also, the longer you use a VM, the cheaper the cost becomes (this is referred to as sustained usage discount). Billing is also done per minute, which is in contrast to AWS and Azure who bill per hour. So even if you only use a part hour, you still pay the full amount.

It is estimated that 45% of public cloud compute spend is wasted, the GC billing model should help reduce this figure. You can also change VM sizes at any time and the sustained usage discount can result in “up to” 57% savings. Worth looking at, I think you’ll agree.

Lush from the UK were brought up to discuss their migration to GCP and they performed this in 22 days and they calculate 40% savings on hosting charges per year. Not bad!

Co-existence and migration

There has also been a lot of work done within GCP to support Windows native tools such as PowerShell (there are GCP cmdlets) and Visual Studio. There are also migration tools that can live move VMs from vSphere,. Hyper-V and KVM, as you’d probably expect. Worth mentioning too at this point that GCP has live migration for VMs as per vSphere and Hyper-V, which is unique to GCP right now, certainly to the best of my knowledge.

G Suite improvements

Lots of work has been done around G Suite, including improvements to Drive to allow for team sharing of documents and also using predictive algorithms to put documents at the top of the Drive page within one click, rather than having to search through folders for the document you’re looking for. Google claim a 40% hit rate from the suggested documents.

There are also add ons from the likes of QuickBooks, where you can raise an invoice from directly within Gmail and be able to reconcile it when you get back to QuickBooks. Nice!

Encryption in the cloud

Once the opening keynote wrapped, I went to my first breakout session which was about encryption within GC. I’m not going to pretend I’m an expert in this field, but Maya Kaczorowski clearly is, and she is a security PM at Google. The process of encrypting data within the GC environment can be summarised thus :-

  • Data uploaded to GC is “chunked” into small pieces (variable size)
  • Each chunk is encrypted and has it’s own key
  • Chunks are written randomly across the GC environment
  • Getting one chunk of data compromised is effectively useless as you will still need the other chunks
  • There is a strict hierarchy to the Key Management Service (shown below)

Google key hierarchy

A replay of this session is available on YouTube and is well worth a watch. Probably a couple of times so you actually understand it!

What’s new in Kubernetes and Google Container Engine

Next up was a Kubernetes session and how it works with Google Container Engine (GCE). I have to say, I’ve heard the name of Kubernetes thrown around a lot, but never really had the time or the inclination to see what all the fuss is about. As I understand it, Kubernetes is a wrapper over the top of container technologies such as Docker to provide more enterprise management and features such as clustering and scaling.

Kubernetes was written initially by Google before being open sourced and it’s rapidly becoming one of the biggest open source projects ever. One of the key drivers for using containers and Kubernetes is the ability to port your environment to any platform. Containers and Kubernetes can be run on Azure, AWS, GC or even on prem. Using this technology avoids vendor lock in, if this is a concern for you.

Kubernetes contributors and users

There is also a very high release cadence – a new version ships every three months and version 1.7 is due at the end of June (1.6 is the current version). The essence of containerisation is that you can start to use and develop microservices (services broken down into very small, fast moving parts rather than one huge bound up, inflexible monolithic stack). Containers also are stateless in the sense that data is stored elsewhere (cloud storage bucket, etc) and are disposable items.

In a Kubernetes cluster, you can now scale up to 5,000 pods per cluster. A cluster is a collection of nodes (think VMs) and pods are container items running isolated from each other on a node. Clusters can be multi-zone and multi-region and now also have the concept of “taints” and “tolerances”. Think of taints as node characteristics such as having a GPU, or a certain RAM or CPU size. A tolerance is a container rule that allows or disallows affinity based on the node taint. For example, a tolerance would allow a container to run on a node with a GPU only.

The final point of note here is that Google offer a managed Kubernetes service called Google Container Engine.

From Blobs to Relational Tables, where do I store my data?

My next breakout was to try and get a better view of the different storage options within GC. One of the first points made was really interesting in that Rolls Royce actually lease engines to airlines so they can collect telemetry data and have the ability to tune engines as well as perform pro-active maintenance based on data received back from the engines.

In summary, your storage options include:-

  • RDBMS – Cloud SQL
  • Data Warehousing – BigQuery
  • Hadoop – Cloud Storage
  • NoSQL – Cloud BigTable
  • NoSQL Docs – Cloud datastore
  • Scalable RDBMS – Cloud Spanner

Cloud Storage can have several different characteristics, including multi-region, regional, nearline and coldline. This is very similar to the options provided by AWS and Azure. Cloud Storage has an availability SLA of 99.95% and you use the same API to access all storage tiers.

Data lifecycle policies are available available in a similar way to S3, moving data between the tiers when rules are triggered. Delivery Network is performed using the Cloud CDN product and message queuing is performed using Cloud Pub/Sub. Cloud Storage for hybrid environments is also available in a similar way to StorSimple or the AWS Storage Gateway using partner solutions such as Panzura (cold storage, backup, tiering device, etc.)

Cloud SQL – 99.95% SLA, with failover replica and read replicas, which seemed very similar to how AWS RDS works. One interesting product was Cloud Spanner. This is a horizontally scalable RDBMS solution that offers typical SQL features such as ACID but with the scalability of typical cloud NoSQL solutions. This to me seemed a pretty unique feature of GC, I haven’t seen this elsewhere. Cloud Spanner also provides global consistency, 99.99% uptime SLA and a 99.999% multi-region availability SLA. Cool stuff!

Serverless Options on GCP

My next breakout was on serverless options on GCP. Serverless seems to the latest trend in cloud computing that for some people is the answer to everything and nothing. Both AWS and Azure provide serverless products, and there are a lot of similarities with the Google Functions product.

To briefly deconstruct serverless tech, this is where you use event driven process to perform a specific task. For example, a file gets uploaded to a storage bucket and this causes an event trigger where “stuff” is performed by a fleet of servers. Once this task is complete, the process goes back to sleep again.

The main benefit of serverless is cost and management. You aren’t spinning VMs up and down and you aren’t paying compute fees for idle VMs. Functions is charged per 100ms of usage and also how much RAM is assigned to the process. The back end also auto scales so you don’t have to worry about setting up your own auto scaling policies.

Cloud Functions is in it’s infancy right now, so only node.js is supported but more language support will be added over time. Cloud storage, Pub/Sub channels and HTTP webhooks can be used to capture events for serverless processes.

Day Two wrap up to come in the next post!

24-03-17

Avoiding vendor lock-in in the public cloud

A little while back, I had a pretty frank discussion with a customer about vendor lock-in in the public cloud and he left me under no illusions that he saw cloud more as a threat than an opportunity. I did wonder if there had been some incident in the past that had left him feeling this way, but didn’t really feel it appropriate to probe into that much further.

Instead of dwelling on the negatives of this situation, we decided to accentuate the positives and try to formulate some advice on how best this risk could be mitigated. This was especially important as there was already a significant investment made by the business into public cloud deployments. It is an important issue though – it’s easy enough to get in, but how do you get out? There are several strategies you could use, I’m just going to call out a couple of them as an example.

To start with, back in the days of all on premises deployments, generally you would try and go for a “best of breed” approach. You have a business problem that needs a technical solution so you look at all the potential solutions and choose the best fit based on a number of requirements. Typically these include cost, scalability, support, existing skill sets and strength of the vendor in the market (Gartner Magic Quadrant, etc.). This applies equally in the public cloud – it’s still a product set in a technical solution so the perspective needn’t change all that much.

One potential strategy is to use the best of breed approach to look at all public cloud vendors (for the purpose of this article, I really just mean the “big three” of AWS, Azure and Google Cloud Platform). As you might expect, the best cost, support and deployment options for say SQL Server on Windows would probably be from Microsoft. In that case, you deploy that part of the solution in Azure.

Conversely, you may have a need for a CDN solution and decide that AWS CloudFront represents the best solution, so you build that part of your solution around that product. This way you are mitigating risk by spreading services across two vendors while still retaining the best of breed approach.

However, “doing the splits” is not always preferable. It’s two sets of skills, two lots of billing to deal with and two vendors to punch if anything goes badly wrong.

Another more pragmatic approach is to make open source technologies a key plank of your strategy. Products such as MySQL, Postgres, Linux, Docker, Java, .NET, Chef and Puppet are widely available on public cloud platforms and mean that any effort put into these technologies can be moved elsewhere if need be (even back on premises if you need to). Not only this, but skills in the market place are pretty commoditised now and mean that bringing in new staff to help with the deployments (or even using outside parties) is made easier and more cost effective.

You could go down the road of deploying a typical web application on AWS using Postgres, Linux, Chef, Docker and Java and if for any reason later this approach becomes too expensive or other issues occur, it’s far easier to pick up the data you’ve generated in these environments, walk over to a competitor, drop it down and carry on.

Obviously this masks some of the complexities of how that move would actually take place, such as timelines, cost and skills required, but it presents a sensible approach to stakeholders that provider migration has been considered and has been accounted for in the technical solution.

The stark reality is that whatever you are doing with technology, there will always be an element of vendor lock in. Obviously from a financial perspective there is a motive for them to do that, but also this comes of innovation when a new technology is created which adds new formats and data blobs to the landscape. The key to addressing this is taking a balanced view and being able to tell project stakeholders that you’re taking a best of breed approach based on requirements and you have built in safeguards in case issues occur in future that prompt a re-evaluation of the underlying provider.

 

07-03-17

What is the Cloud Shared Responsibility Model and why should I care?

When I have discussions with customers moving to a public cloud provider, one of the main topics of conversation (quite rightly) is security of services and servers in the cloud. Long discussions and whiteboarding takes place where loads of boxes and arrows are drawn and in the end, the customer is confident about the long term security of their organisation’s assets when moving to Azure or AWS.

Almost as an aside, one of the questions I ask is how patching of VMs will be performed and a very common answer is “doesn’t Microsoft/AWS patch them for us?”. At this point I ask if they’ve heard of the Shared Responsibility Model and often the answer is “no”. So much so that I thought a quick blog post was in order to reinforce this point.

So then, what is the Shared Responsibility Model? Put simply, when you move services into a public cloud provider, you are responsible for some or most of the security and operational aspects of the server (tuning, anti-virus, backup, etc.) and your provider is responsible for services lower down the stack that you don’t have access to, such as the hypervisor host, physical racks, power and cooling.

That being said, there is a bit more to it than that, depending on whether or not we’re talking about IaaS, PaaS or SaaS. The ownership of responsibility can be thought of as a “sliding scale” depending on the service model. To illustrate what I mean, take a look at the diagram below, helpfully stolen from Microsoft (thanks, boys!).

Reading the diagram from left to right, you can see that in the left most column where all services are hosted on prem, it is entirely the responsibility of the customer to provide all of the security characteristics. There is no cloud provider involved and you are responsible for racking, stacking, cooling, patching, cabling, IAM and networking.

As we move right to the IaaS column, you can see subtle shades of grey emerging (quite literally) as with IaaS, you’re hosting virtual machines in a public cloud provider such as Azure or AWS. The provider is responsible for DC and rack security and some of the host infrastructure (for example, cloud providers patch the host on your behalf), but your responsibility is to ensure that workloads are effectively spread across hosts in appropriate fault and update domains for continuity of service.

Note however that in the IaaS model, as you the customer are pretty much responsible for everything from the guest upwards, it’s down to you to configure IAM, endpoint security and keep up to date with security patches. This is where a lot of the confusion creeps in. Your cloud provider is not on the hook if you fail to patch and properly secure your VMs (including network and external access). Every IaaS project requires a patching and security strategy to be baked in from day one and not retrofitted. This may mean extending on prem AD and WSUS for IAM and patching, to leverage existing processes. This is fine and will work, you don’t necessarily need to reinvent the wheel here. Plus if you re-use existing processes, it may shorten any formal on boarding of the project with Service Management.

Carrying on across the matrix to the next column on the right is the PaaS model. In this model, you are consuming pre-built features from a cloud provider. This is most commonly database services such as SQL Server or MySQL but also includes pre-built web environments such as Elastic Beanstalk in AWS. Because you are paying for a sliver of a larger, multi-tenant service, your provider will handle more layers of the lower stack, including the virtual machines the database engine is running on as well as the database engine itself. Typically in this example, the customer does not have any access to the underlying virtual machine either via SSH or RDP, as with IaaS.

However, as the matrix shows, there is still a level of responsibility on the customer (though the operational burden is reduced). In the case of Database PaaS, the customer is still in charge of backing up and securing (i.e. encryption and identity and access management) the data. This is not the responsibility of the provider with the exception of logical isolation from other tenants and the physical security of the hardware involved.

Finally, in the far right column is the SaaS model. The goal of this model is for the customer to obtain a service with as little administrative/operational overhead as possible. As shown in the matrix, the provider is responsible for everything in the stack from the application down, including networking, backup, patching, availability and physical security. IAM functions are shared as most SaaS is multi-tenant, so the provider must enforce isolation (in the same way as PaaS) and the customer must ensure only authorised personnel can access the SaaS solution.

You will note that endpoint security is also classed as a shared responsibility. Taking Office 365 as an example, Microsoft provide security tools such as anti-virus scanning and data loss prevention controls, it is up to the customer to configure this to suit their use case. Microsoft’s responsibility ends with providing the service and the customer’s starts with turning the knobs to make it work to their taste. You will also notice that as in all other cases, it is solely the customer’s responsibility to ensure the classification and accountability of the data. This is not the same as the reliability of the services beneath it (networking, storage and compute) as this is addressed in the lower layers of the model.

I hope this article provides a bit of clarity on what the Shared Responsibility Model is and why you should care. Please don’t assume that just because you’re “going cloud” that a lot of these issues will go away. Get yourself some sound and trusted advice and make sure this model is accounted for in your project plan.

For your further reading pleasure, I have included links below to documentation explaining provider’s stances and implementation of the model :-

As always, any questions or comments on this post can be left below or feel free to ping me on Twitter @ChrisBeckett

31-12-16

AWS Certified DevOps Engineer Professional – Exam Experience & Tips

aws-certified-devops-engineer-professional_large1

I managed to find the time yesterday to sit the above exam before the end of the year to reach my goal of holding all five current AWS certifications. There isn’t a lot out there about this exam, so as usual I thought I would try to pass on the benefit of my experiences for others planning to sit this one.

The exam is 80 questions over 170 minutes. I finished with about 20 minutes to spare and passed barely with a 66%, but as we always say – a pass is a pass! Looking back over the score report, there are four domains tested in the exam:-

  • Domain 1: Continuous Delivery and Process Automation
  • Domain 2: Monitoring, Metrics, and Logging
  • Domain 3: Security, Governance, and Validation
  • Domain 4: High Availability and Elasticity

I managed to score really well on domains 1, 3 and 4 (between 75% and 85%0, but really bombed on domain 2, which really surprised me. This domain focusses mainly on CloudWatch, so it goes without saying that I didn’t know it as well as I thought I did!

Like all the other AWS exams, the questions are worded in a very specific way, and it can take time to read and re-read the questions to truly understand what is being asked. I wouldn’t worry too much about time running out, some of the questions are quite short but you need to look for key words in the questions – such as “cost-effective”, “fault tolerant” and “efficient”. This can help you rule out the obviously incorrect answers.

In terms of what you need to know, I’d say the following :-

  • Domain 1: CloudFront (templates, custom resources), OpsWorks (lifecycles), Elastic Beanstalk (platform support, scaling, Docker), SQS, SNS, Data Pipeline (I was surprised to see this feature in the exam as I figured it was being phased out in favour of Lambda), SWF, bootstrapping
  • Domain 2: CloudWatch, CloudTrail (what it can and can’t do), CloudWatch Logs (Log streams, Log filters, Log agent), EMR
  • Domain 3: IAM (Roles, users, STS, AssumeRole(s))
  • Domain 4: Load balancing, auto scaling, EC2, S3, Glacier, EBS, RDS,  DynamoDB, instance types

And for what I used for study, use your AWS account and the free tier entitlement to much around with all the services. There are loads of walkthroughs in the documentation and provided you don’t leave massive instances running 24/7 it should only cost you pennies to use.

The A Cloud Guru course is well worth the investment of time and money – Adrian and Nick do a great job of taking you through most of what you need to know for the exam. I did find that there wasn’t as much DynamoDB content on the exam as I was expecting, not that I’m complaining because a lot of how it works still really mashes my head!

There are lots of good videos on YouTube, from Re:Invent conferences from years gone by which go into a lot of depth. I can also recommend Ian Massingham’s CloudFormation Masterclass video as a good refresher/primer for CF.

Difficulty wise, it’s definitely a tough exam, don’t let anyone tell you otherwise. 80 questions is a lot and many of them are very verbose in both the question and the answers. I’d say it’s not as tough as the Solutions Architect Pro as it doesn’t cover as broad a range of topics, but you can’t really wing it.

I hope this article helps anyone doing this exam any time soon. I’m going to enjoy being part of the “All 5” club for as long as it lasts (the three “Specialty” exams are coming up early next year, I’ve registered to sit all the betas).

all5

17-11-16

Azure VMs – New Auto-Shutdown Feature Sneaked In (almost!)

I saw the news the other day that Azure Backup has now been included on the VM management blade in the Azure Portal, which is great news as you don’t want to be jumping around in the portal to manage stuff where you don’t need to. However, one feature I notice that appears to have sneaked into the VM management blade without any fanfare at all is the ability to auto schedule the shutdown of a virtual machine.

Many customers request the function of shutting down virtual machines during off hours in order to save cost once any backups and scheduled maintenance tasks have occurred. Previously this would have to be done by using Azure Automation to execute a run book to shut down VMs. This is fine and a valid way of doing this, but on larger estates ends up being a costed feature as the time taken to run the run books exceeds the free tier allowances.

This typical requirement has obviously found it’s way back the product management team at Microsoft and in order to make it a lot easier when spinning up VMs to enable this, it’s been added to the standard VM management blade, as shown below:-

vm-shutdown

As far as I can tell, this feature is either not in use yet or is only available in a small number of regions, ahead of a broader roll out. I tried it on VMs in UK South and North Europe, only to see this message :-

auto-stop

And trying to read between the lines of the error message, will this feature allow starting the VM too? You’d have to hope so! I did ping Azure Support on Twitter to see when this feature would be fully available in the UK/EU and got a very speedy response (thanks, chaps!):-

 

 

So stay tuned for this feature being enabled at some point in the near future. I’d also assume there will be some corresponding PowerShell command to go with it, so that you can add it to scripted methods of deploying multiple virtual machines.

13-10-15

VMworld Europe Day Two

Today is pretty much the day the whole conference springs to life. All the remaining delegates join the party with the TAM and Partner delegates. The Solutions Exchange opened for business and there’s just a much bigger bustle about the place than there was yesterday.

The opening general session was hosted by Carl Eschenbach, and credit to him for getting straight in there and talking about the Dell deal. I think most are scratching their heads, wondering what this means in the broader scheme of things, but Carl reassured the delegates that it would still be ‘business as usual’ with VMware acting as an independent entity. That’s not strictly true, as they’re still part of the EMC Federation, who are being acquired by Dell, so not exactly the same.

Even Michael Dell was wheeled out to give a video address to the conference to try and soothe any nerves, giving one of those award ceremony ‘sorry I can’t be there’ speeches. Can’t say it changed my perspective much!

The event itself continues to grow. This year there are 10,000 delegates from 96 countries and a couple of thousand partners.

Into the guts of the content, first up were Telefonica and Novamedia. The former are a pretty well known European telco, and the latter are a multinational lottery company. The gist of the chat was that VMware solutions (vCloud, NSX etc) have allowed both companies to bring new services and solutions to market far quicker than previously. In Novamedia’s case, they built 4 new data centres and had them up and running in a year. I was most impressed by Jan from Novamedia’s comment ‘Be bold, be innovative, be aggressive’. A man after my own heart!

VMware’s reasonably new CTO Ray O’Farrell then came out and with Kit Colbert discussed the ideas behind cloud native applications and support for containers. I’ll be honest at this point and say that I don’t get the container hype, but that’s probably due in no small part to my lack of understanding of the fundamentals and the use cases. I will do more to learn more, but for now, it looks like a bunch of isolated processes on a Linux box to me. What an old cynic!

VMware have taken to approaches to support containers. The first is to extend vSphere to use vSphere Integrated Containers and the second is the Photon platform. The issue with containerised applications is that the vSphere administrator has no visibility into them. It just looks and acts like a VM. With VIC, there are additional plug-ins into the vSphere Web Client that allow the administrator to view which processes are in use, on which host and how it is performing. All of this management layer is invisible and non-intrusive to the developer.

The concept of ‘jeVM’ was discussed, which is ‘just enough VM’, a smaller footprint for container based environments. Where VIC is a Linux VM on vSphere, the Photon platform is essentially a microvisor on the physical host, serving up resource to containersa running Photon OS, which is a custom VMware Linux build. The Photon platform itself contains two objects – a controller and the platform itself. The former will be open sourced in the next few weeks (aka free!) But the platform itself will be subscription only from VMware. I’d like to understand how that breaks down a bit better.

VRealize Automation 7 was also announced, which I had no visibility of, so that was a nice surprise. There was a quick demo with Yangbing Li showing off a few drag and drop canvas for advanced service blueprints. I was hoping this release would do away with the need for the Windows IaaS VM(s), but I’m reliably informed this is not the case.

Finally, we were treated with a cross cloud vMotion, which was announced as an industry first. VMs were migrated from a local vSphere instance to a vCloud Air DC in the UK and vice versa. This is made possible by ‘stretching’ the Layer 21 network between the host site and the vCloud Air DC. This link also includes full encryption and bandwidth optimisation. The benefit here is that again, it’s all managed from a familiar place (vSphere Web Client) and the cross cloud vMotion is just the migration wizard with a couple of extra choices for source and destination.

I left the general session with overriding feeling that VMware really are light years ahead in the virtualisation market, not just on premises solutions but hybrid too. They’ve embraced all cloud providers, and the solutions are better for it. Light years ahead of Microsoft in my opinion, and VMware have really raised their game in the last couple of years.

My first breakout session of the day was Distributed Switch Best Practices. This was a pretty good session as I’ve really become an NSX fanboy in the last few months, and VDSes are the bedrock of moving packet between VMs. As such, I noted the following:-

  • DV port group still has a one to one mapping to a VLAN
  • There may be multiple VTEPS on a single host. A DV port group is created for all VTEPs
  • DV port group is now called a logical switch when backed by VXLAN
  • Avoid single point of failure
  • Use separate network devices (i.e switches) wherever possible
  • Up to 32 uplinks possible
  • Recommend 2 x 10v Gbps links,  rather than lots of 1 Gbps
  • Don’t dedicate physical up links for management when connectivity is limited and enable NIOC
  • VXLAN compatible NIC recommended, so hardware offload can be used
  • Configure port fast and BPDU on switch ports, DVS does not have STP
  • Always try to pin traffic to a single NIC to reduce risk of out of order traffic
  • Traffic for VTEPs only using single up link in an active passive configuration
  • Use source based hashing. Good spread of VM traffic and simple configuration
  • Myth that VM traffic visibility is lost with NSX
  • Net flow, port mirroring, VXLAN ping tests connections between VTEPs
  • Trace flow introduced with NSX 6.2
  • Packets are specially tagged for monitoring, reporting back to NSX controller
  • Trace flow is in vSphere Web client
  • Host level packet capture from the CLI
  • VDS portgroup, vmknic or up link level, export as pcap for Wireshark analysis
  • Use DFW
  • Use jumbo frames
  • Mark DSCP value on VXLAN encapsulation for Quality of Service

For my final session of the dayt, I attended The Practical Path to NSX and Network Virtualisation. At first I was a bit dubious about this session as the first 20 minutes or so just went over old ground of what NSX was, and what all the pieces were, but I’m glad I stayed with it, as I got a few pearls of wisdom from it.

  • Customer used NSX for PCI compliance, move VM across data center and keep security. No modification to network design and must work with existing security products
  • Defined security groups for VMs based on role or application
  • Used NSX API for custom monitoring dashboards
  • Use tagging to classify workloads into the right security groups
  • Used distributed objects, vRealize for automation and integration into Palo Alto and Splunk
  • Classic brownfield design
  • Used NSX to secure Windows 2003 by isolating VMs, applying firewall rules and redirecting Windows 2003 traffic to Trend Micro IDS/IPS
  • Extend DC across sites at layer 3 using encapsulation but shown as same logical switch to admin
  • Customer used NSX for metro cluster
  • Trace flow will show which firewall rule dropped the packet
  • VROps shows NSX health and also logical and physical paths for troubleshooting

It was really cool to see how NSX could be used to secure Windows 2003 workloads that could not be upgraded but still needed to be controlled on the network. I must be honest, I hadn’t considered this use case, and better still, it could be done with a few clicks in a few minutes with no downtime!

NSX rocks!

 

 

 

12-10-15

VMworld Europe Day One

Today saw the start of VMworld Europe in Barcelona, with today being primarily for partners and TAM customers (usually some of the bigger end users). However, that doesn’t mean that the place is quiet, far from it! There are plenty of delegates already milling around, I saw a lot of queues around the breakout sessions and also for the hands on labs.

As today was partner day, I already booked my sessions on the day they were released. I know how quickly these sessions fill, and I didn’t want the hassle of queuing up outside and hoping that I would get in. The first session was around what’s new in Virtual SAN. There have been a lot of press inches given to the hyper converged storage market in the last year, and I’ve really tried to blank them out. Now the FUD seems to have calmed down, it’s good to be able to take a dispassionate look at all the different offerings out there, as they all have something to give.

My first session was with Simon Todd and was titled VMware Virtual SAN Architecture Deep Dive for Partners. 

It was interesting to note the strong numbers of customer deploying VSAN. There was a mention of 3,000 globally, which isn’t bad for a product that you could argue has only just reached a major stage of maturity. There was the usual gratuitous customer logo slide, one of which was of interest to me. United Utilities deal with water related things in the north west, and they’re a major VSAN customer.

There were other technical notes, such as VSAN being an object based file system, not a distributed one. One customer has 14PB of storage over 64 nodes, and the limitation to further scaling out that cluster is a vSphere related one, rather than a VSAN related one.

One interesting topic of discussion was whether or not to use passthrough mode for the physical disks. What this boils down to is the amount of intelligence VSAN can gather from the disks if they are in passthrough mode. Basically, there can be a lot of ‘dialog’ between the disks and VSAN if there isn’t a controller in the way. I have set it up on IBM kit in our lab at work, and I had to set it to RAID0 as I couldn’t work out how to set it to passthrough. Looks like I’ll have to go back to that one! To be honest, I wasn’t getting the performance I expected, and that looks like it’s down to me.

VSAN under the covers seems a lot more complex than I thought, so I really need to have a good read of the docs before I go ahead and rebuild our labs.

There was also an interesting thread on troubleshooting. There are two fault types in VSAN – degraded and absent. Degraded state is when (for example) an SSD is wearing out, and while it will still work for a period of time, performance will inevitably suffer and the part will ultimately go bang. Absent state is where a temporary event has occured, with the expectation that this state will be recovered from quickly. Examples of this include a host (maintenance mode) or network connection down and this affects how the VSAN cluster behaves.

There is also now the ability to perform some proactive testing, to ensure that the environment is correctly configured and performance levels can be guaranteed. These steps include a ‘mock’ creation of virtual machines and a network multicast test. Other helpful troubleshooting items include the ability to blink the LED on a disk so you don’t swap out the wrong one!

The final note from this session was the availability of the VSAN assessment tool, which is a discovery tool run on customer site, typically for a week, that gathers existing storage metrics and provides sizoing recommendations and cost savings using VSAN. This can be requested via a partner, so in this case, Frontline!

The next session I went to was Power Play :What’s New With Virtual SAN and How To Be Successful Selling It. Bit of a mouthful I’ll agree, and as I’m not much of a sales or pre-sales guy, there wasn’t a massive amount of takeaway for me from this session, but Rory Choudhari took us through the current and projected revenues for the hyperconverged market, and they’re mind boggling.

This session delved into the value proposition of Virtual SAN, mainly in terms of costs (both capital and operational) and the fact that it’s simple to set up and get going with. He suggested it could live in harmony with the storage teams and their monolithic frames, I’m not so sure myself. Not from a tech standpoint, but from a political one. It’s going to be difficult in larger, more beauracratic environments.

One interesting note was Oregon State University saving 60% using Virtual SAN as compared to refreshing their dedicated storage platform. There are now nearly 800 VASN production customers in EMEA, and this number is growing weekly. Virtual SAN6.1 also brings with it support for Microsoft and Oracle RAC clustering. There is support for OpenStack, Docker and Photon and the product comes in two versions.

If you need an all flash VSAN and/or stretched clusters, you’ll need the Advanced version. For every other use case, Standard is just fine.

After all the VSAN content I decided to switch gears and attend an NSX session called  Disaster Recovery with NSX, SRM and vRO with Gilles Chekroun. Primarily this session seemed to concentrate on the features in the new NSX 6.2 release, namely the universal objects now available (distributed router, switch, firewall) which span datacentres and vCenters. With cross vCenter vMotion, VMware have really gone all out removing vCenter as the security or functionality boundary to using many of their products, and it’s opened a whole new path of opportunity, in my opinion.

There are currently 700 NSX customers globally, with 65 paying $1m or more in their deployments. This is not just licencing costs, but also for integration with third party products such as Palo Alto, for example. Release 6.2 has 20 new features and has the concept of primary and secondary sites. The primary site hosts an NSX Manager appliance and the controller cluster, and secondary sites host only an NSX Manager appliance (so no controller clusters). Each site is aware of things such as distributed firewall rules, so when a VM is moved from one site to another, the security settings arew preserved.

Locale IDs have also been added to provide the ability to ‘name’ a site and use the ID to direct routing traffic down specific paths, either locally on that site or via another site. This was the key takeway from the session that DRis typically slow, complex and expensive, with DR tests only being invoked annually. By providing network flexibility between sites and binding in SRM and vRO for automation, some of these issues go away.

In between times I sat the VCP-CMA exam for the second time. I sat the beta release of the exam and failed it, which was a bit of a surprise as I thought I’d done quite well. Anyway, this time I went through it, some of the questions from the beta were repeated and I answered most in the same way and this time passed easily with a 410/500. This gives me the distinction of now holding a full house of current VCPs – cloud, desktop, network and datacenter virtualisation. Once VMware Education sort out the cluster f**k that is the Advanced track, I hope to do the same at that level.

Finally I went to a quick talk called 10 Reasons Why VMware Virtual SAN Is The Best Hyperconverged Solution. Rather than go chapter and verse on each point I’ll list them below for your viewing pleasure:-

  1. VSAN is built directly into the hypervisor, giving data locality and lower latency
  2. Choice – you can pick your vendor of choice (HP, Dell, etc.) And either pick a validated, pre-built solution or ‘roll your own’ from a list of compatible controllers and hard drives from the VMware HCL
  3. Scale up or scale out, don’t pay for storage you don’t need (typically large SAN installations purchase all forecasted storage up front) and grow as you go by adding disks, SAS expanders and hosts up to 64 hosts
  4. Seamless integration with the existing VMware stack – vROps adapters already exist for management, integration with View is fully supported etc
  5. Get excellent performance using industry standard parts. No need to source specialised hardware to build a solution
  6. Do more with less – achieve excellent performance and capacity without having to buy a lot of hardware, licencing, support etc
  7. If you know vSphere, you knopw VSAN. Same management console, no new tricks or skills to learn with the default settings
  8. 2000 customers using VSAN in their production environment, 65% of whom use it for business critical applications. VSAN is also now third generation
  9. Fast moving road map – version 5.5 to 6.1 in just 18 months, much faster rate of innovation than most monolithic storage providers
  10. Future proof – engineered to work with technologies such as Docker etc

All in all a pretty productive day – four sessions and a new VCP for the collection, so I can’t complain. Also great to see and chat with friends and ex-colleagues who are also over here, which is yet another great reason to come to VMworld. It’s 10,000 people, but there’s still a strong sense of community.