Configuring rbk-log to start on boot

In my last post, I mentioned that the Linux binary we use to connect Polaris to your syslog collector doesn’t start at boot and is killed when you log off. I thought about that and decided it would be stupid not to note how to make it start automatically on boot. I mean, why would you not?

The process itself is not complicated. In days of old you would have used /etc/init.d scripts to configure services; these days we use systemd. There are a couple of tricks to getting the rbk-log tool to work, as it requires either command-line switches or environment variables, and as we used environment variables last time round, I’m going to stick with that.

So what steps do we need to take to make the magic happen?

  • Download the rbk-log tool if you haven’t already
  • Save the file to a home directory and make it executable
  • Configure the syslog service locally
  • Create the systemd service definition file
  • Start the rbk-log service to make sure it starts properly
  • Enable the service to auto start on boot

Downloading the rbk-log tool

You can get this tool from the Rubrik Support website, under Downloads. See my earlier Sentinel blog post for exactly where it lives.

Save the file to a home directory and make it executable

As I’m using Ubuntu on EC2 for this example, there is a user created during VM provisioning called ubuntu. As it has a home directory already created, I’d recommend dropping the tool in there. To make it executable, run chmod ugo+x rbk-log. If you do an ls -l, it should look like the following :-

Saving rbk-log and making it executable for all
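If you want to sanity-check the permissions change before the real binary is in place, the same sequence works against a placeholder file (the rbk-log file below is just an empty stand-in):

```shell
# Create a stand-in file for the downloaded tool (placeholder only)
cd /tmp
touch rbk-log
# Make it executable for user, group and others
chmod ugo+x rbk-log
# The permissions column should now read -rwxr-xr-x (or similar)
ls -l rbk-log
```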

Create the systemd service definition file

We then need to create a small configuration file for the rbk-log tool so that systemd knows how the service should run and what other information it needs to start successfully. To save time, I’ve copied my definition file below; the three values in pointy brackets are placeholders for your own tenant details :-

[Unit]
Description=Rubrik Polaris Syslog Importer
After=network-online.target

[Service]
Environment=RUBRIK_POLARIS_ACCOUNT=<your Polaris tenant URL>
Environment=RUBRIK_POLARIS_USERNAME=<your syslog user>
Environment=RUBRIK_POLARIS_PASSWORD=<your syslog user password>
ExecStart=/home/ubuntu/rbk-log start
Restart=on-failure

[Install]
WantedBy=multi-user.target

The file itself should be pretty self-explanatory. Copy it into /etc/systemd/system and name it rbk-log.service, so the full path to the saved file is /etc/systemd/system/rbk-log.service.

You will then need to go in and check the ExecStart parameter to make sure the path to the rbk-log tool executable is correct.

You will also need to configure the environment variables to match your configuration, so the three values in pointy brackets need to be amended. Save the file when you’re done; if you get any permissions errors, remember you will probably need to run your text editor with the sudo command.

sudo vim /etc/systemd/system/rbk-log.service

Start the rbk-log service to make sure it starts properly

In order for systemd to refresh the config files, issue a sudo systemctl daemon-reload command at the prompt. This should be instant. Once systemd has been reloaded, we can issue the command to start the rbk-log service :-

sudo systemctl start rbk-log.service

Wait a few seconds for the service to start and then issue a status command, to make sure it has started up properly.

sudo systemctl status rbk-log.service 

All being well, you should see something similar to below :-

Checking the status of the rbk-log service
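For reference, the status output of a healthy service looks roughly like this (paths, PIDs and timestamps will of course differ):

```
● rbk-log.service - Rubrik Polaris Syslog Importer
   Loaded: loaded (/etc/systemd/system/rbk-log.service; disabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-06-07 10:15:00 UTC; 5s ago
 Main PID: 1234 (rbk-log)
```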

You can also run tail -f /var/log/syslog to watch the end of the syslog file for any troubleshooting that may be necessary.

Viewing the syslog file for errors

Enable the service to auto start on boot

All being well at this stage, we can now enable the service to start automatically on boot :-

sudo systemctl enable rbk-log

That’s it! You may wish to reboot the VM to make sure the service auto starts, just as a precaution.



Connecting Rubrik Polaris to Azure Sentinel

First blog in nearly two years, so let’s make it something of substance, eh? I’ve been at Rubrik since April and it already feels like a long time (in a good way). As folks who know me will attest, I like to tinker around the edges and build stuff, and I hear many enquiries about how we can hook up Rubrik Polaris to Azure Sentinel, so this is how you do it.

What is Rubrik Polaris?

Polaris is the SaaS based management plane for Rubrik clusters, wherever they may reside. It allows cluster owners to manage their estate via a web browser from one point, reducing admin overhead and providing the capability to add new features to our already market leading backup solution.

As well as providing the management plane, it also extends the product capability to include ransomware recovery tooling (Radar), data classification (Sonar) and orchestrated disaster recovery (AppFlows). All of this is done in situ on the Rubrik appliance, with metadata and signals being transmitted to the Polaris cloud. Customer data never leaves the appliance, and this is important.

What is also important is that we do all of this without the need for any additional infrastructure, such as software virtual appliances or proxy servers. As you’d expect, we log all events centrally and these can be managed via the Polaris web interface. That’s great, but what if you have an existing investment into a SIEM tool? What we don’t want to do is ask the SOC/NOC teams to log into yet another interface to view activity in the environment.

In response to this, we made a tool available to pull Polaris events out in syslog format, which is kind of a Swiss Army Knife of log formats that just about any SIEM tool will ingest. This means that we can now ingest Polaris events and push them into your SIEM of choice.

Rubrik Polaris GPS console

What is Azure Sentinel?

Azure Sentinel is Microsoft’s SIEM tool in the cloud. It’s incredibly popular with Azure customers and is used as a general sink for all log events, regardless of where they originate from. Customers like the range of integration options, limitless scaling and pay as you go pricing. It’s a convenient way to bring logs to a central point. As you’d expect from Microsoft, there is an extensive list of integration options with third parties.

Azure Sentinel console

Integrating Polaris with Sentinel

So to the main thrust of this post, how do we bring the two together? First up, we need to meet some pre-requisites :-

  • Linux VM in Azure
  • rbk-log tool from Rubrik
  • Azure Sentinel workspace
  • Rubrik Polaris account

Provisioning the Linux VM

First we need to provision a Linux machine in Azure. What do we need this for? Well, we need a Linux box to act as our syslog forwarder (this pattern can be used in any syslog-based scenario), which is the method Microsoft uses and recommends. In short, your Linux box ingests syslog messages from your connected system, sends them to the Log Agent on the same box, and that agent sends them up to Sentinel.

Sentinel is not a “typical” centralised syslog server as such, because we’re not piping logs to an IP address or hostname; we’re sending them to a webscale, distributed platform where things such as IP addresses are obfuscated from us. This is where the Log Agent comes in, handling all of that for us.

The diagram below from the Azure documentation site shows how the internal communication happens on the Linux VM syslog forwarder. A picture says a thousand words!


The Log Agent/syslog architecture within the syslog VM

So first up, let’s get a Linux VM. By using one straight out of Azure, we get the benefit of selecting an image with the Log Analytics agent already installed. This will save us some time.

  • From the Azure console, select Create a resource
  • In the search bar, type Ubuntu (or whatever your chosen distro is, but these instructions will focus on Ubuntu)
  • (Optional) Toggle the Pricing button to Free (because I’m tight). This means you only pay for compute.
  • Choose a Linux image. For the purposes of this guide, I’m going to use the Cognosys image
Choosing the Ubuntu image
  • Click Create
  • Follow the provisioning wizard; I recommend an A2 instance type for cost reasons (this can always be changed later if required)
  • Accept the default settings for a quick deployment, or feel free to customise the VNet settings, authentication type, tags etc. Ensure a Public IP address is created and assigned to the VM (this is the default behaviour). I set the authentication type to password. This is so that the provisioning process automatically creates a non-root user on the VM with a home area. This is an optional step, however.
  • Click Review + Create
VM creation screen

Obtaining the rbk-log tool

The next step is to download the rbk-log tool to connect to Polaris. We obtain this from the Rubrik Support site, so credentials and a product entitlement will be required.

  1. Login to Rubrik Support
  2. In the Docs and Downloads section, click View Downloads
  3. Click Rubrik Polaris SaaS Platform (usually at the bottom of the list)
  4. Under Software, click the Download button to the right of rbk-log
  5. Under Documentation, you may also find it useful to download the Rubrik rbk-log Version 1.0 Technical Reference.pdf guide
  6. Once the rbk-log tool has been downloaded, copy it to the Linux VM
  7. Copy the tool to a directory on the PATH, so we can run it from anywhere on the filesystem. Type export to see what the PATH variable currently contains
export output
  8. Once the tool has been copied to its final resting place, change the permissions on the file by typing chmod 755 rbk-log
  9. The final configuration step is to set the required connection parameters for Polaris. These can be found in the Technical Reference guide, but for quick reference, they are the environment variables shown below. The RUBRIK_POLARIS_ACCOUNT URL must represent your Polaris tenant URL. Similarly, the RUBRIK_POLARIS_USERNAME and RUBRIK_POLARIS_PASSWORD variables should represent a discrete syslog user in your tenant (using an existing admin account is not recommended, as passwords can change and MFA can cause issues with the connection). Also, having a suitably named account will help identify account activity much more easily.
  10. As per the Technical Reference guide, you should set these environment variables in the .bashrc file of the Linux user used to run the tool, so that the values are always available on login.
Environment variables required for the rbk-log tool
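As a sketch, the three variables can be set like this; the values here are made-up placeholders, so substitute your own tenant URL and syslog user credentials. Appending the same three export lines to ~/.bashrc makes them persist across logins:

```shell
# Placeholder values only -- substitute your own tenant details
export RUBRIK_POLARIS_ACCOUNT="https://mytenant.my.rubrik.com"
export RUBRIK_POLARIS_USERNAME="rbk-syslog@example.com"
export RUBRIK_POLARIS_PASSWORD="s3cret-placeholder"

# To persist across logins, add the same three lines to ~/.bashrc

# Quick check that the values are set in the current shell
echo "$RUBRIK_POLARIS_ACCOUNT"
```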

Configuring the syslog daemon

We need to check that the syslog daemon has been enabled, configured and set to run on boot. On this particular image, we just have a couple of configuration steps.

  1. Edit the file /etc/rsyslog.conf
  2. Find the sections commented “provides UDP syslog reception” and “provides TCP syslog reception” and uncomment the two lines beneath each, as per the picture below
  3. Restart the syslog service by typing sudo service rsyslog restart
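For reference, on recent Ubuntu releases those two sections of /etc/rsyslog.conf look like this once the lines are uncommented (older releases use the legacy $ModLoad imudp / $UDPServerRun 514 syntax instead):

```
# provides UDP syslog reception
module(load="imudp")
input(type="imudp" port="514")

# provides TCP syslog reception
module(load="imtcp")
input(type="imtcp" port="514")
```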

Testing the Polaris connection

Now the syslog daemon has been configured and started, we can verify a connection to our Polaris tenant to make sure communications are working properly. First, verify that your environment variables have been set as per the previous section. Once this has been verified, run ./rbk-log test and review the result. A successful connection response is shown below. If you do not see something similar, go back and verify your environment variable settings; this is the most common problem.

Successful test result

If the connection test is successful, you can move ahead and start the tool and begin collecting events from Polaris. These will be fed into the local syslog on the Linux VM and then ultimately moved up into Sentinel. Start the tool by typing rbk-log start.

You may notice at this stage that because we are running this from the command line, if we do a CTRL-C or CTRL-D then the tool stops and so does Polaris event collection. An option here is to hit CTRL-Z to suspend the process and then type bg to run the process in the background. This will allow us to check the syslog file itself to see if events are being pulled in.
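The same job-control trick can be tried safely with any long-running command. In the sketch below, sleep stands in for rbk-log, and the & suffix launches the process straight into the background (the same end state you reach with CTRL-Z followed by bg):

```shell
# `sleep 300` stands in for `./rbk-log start`
sleep 300 &
# $! holds the PID of the most recent background job
BGPID=$!
# kill -0 checks the process exists without signalling it
kill -0 "$BGPID" && echo "running in background"
# Tidy up the stand-in process
kill "$BGPID"
```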

The better strategy is to turn the tool into a daemon so it starts on system boot up. This is beyond the scope of this guide, but I cover it in a separate post.

Verifying the Polaris events in syslog

The next step is to ensure Polaris events are being pulled locally into the syslog, before we send them up to Sentinel. To do this, you can type tail -f /var/log/syslog, which will give you a “live feed” of the syslog as it is updated. If things are working properly, you should see similar results to the picture below.

Polaris events in syslog
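If you’d rather not sit watching a live tail, a quick grep does the same job. The sample lines below are fabricated purely to illustrate the shape of the filter; on the real VM you would grep /var/log/syslog itself:

```shell
# Fabricated sample lines, standing in for /var/log/syslog
cat > /tmp/sample-syslog <<'EOF'
Jan  5 10:00:01 syslogvm kernel: Polaris event: user login success
Jan  5 10:00:02 syslogvm CRON[1234]: (root) CMD (run-parts /etc/cron.hourly)
Jan  5 10:00:03 syslogvm kernel: Polaris event: failed to login with invalid password.
EOF

# Only the Polaris lines should come back (2 of the 3 above)
grep "Polaris" /tmp/sample-syslog
```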

In the syslog I can see a lot of login requests against my username. This is being done by the rbk-log tool and can skew log results as we discussed earlier. This is why having a dedicated account for log collection makes sense. Call the account rbk-syslog or something, but this account must have a working e-mail address so the account can be activated properly.

Connecting the Linux VM to Sentinel

Now the final configuration step – connecting the Linux syslog VM to Sentinel. This is done from the Sentinel page in the Azure Portal. These steps assume that you have a Sentinel workspace already configured. If you don’t, set one up as per the Azure documentation.

From the Sentinel blade, select your workspace and click the Data connectors link under Configuration (usually at the bottom of the blade).

Data connectors screen

In the search by name or provider box, type syslog and open the Syslog connector page.

There are two steps to configuring syslog ingestion into Sentinel – the first one is onboarding the Linux Agent and the second is choosing which types of syslog events we want to capture.

First up, open the Install agent on Azure Linux Virtual Machine section and click Download & install agent for Azure Linux Virtual machines. This takes you through to a page where all of your Linux VMs are listed with their connection status. Your newly provisioned VM will show as Not Connected. Let’s sort this!

VM connection status

Click on your syslog VM and in the following screen, click Connect. This will connect the VM to your Sentinel workspace. This can take several minutes, so be patient!

Sentinel VM connection screen

You can view the status of the process by typing top on your Linux VM and observing the processes at work.

Pro tip – if the connection times out, look for a process on the VM called unattended-upgrades. This is a process added to Ubuntu to automatically install security patches out of the box. It will hold a lock on the package database and prevent the connection process from completing. You can either wait for this process to finish, or do what worked for me: reboot the VM and quickly restart the connection process before unattended-upgrades kicks in.

Once the connection completes, the final step is to configure which syslog event types we want to capture. This means we can scope down the collection and reduce the amount of noise. Click Open your workspace agents configuration > to view the syslog event types for collection (or “facilities” in syslog parlance).

Polaris events are logged under the kern facility, so click Add Facility and search for and add kern. You can also scope the level of events you want to capture (debug, warning, etc.), but I just accepted the defaults of all levels. I also added the auth and authpriv facilities so we can also capture local Linux events.

Running a basic Sentinel query to verify Polaris events are being ingested

Now that we have been through all the steps of configuring the syslog forwarder VM and connecting it to Sentinel, it’s time to query the logs! Sentinel uses a query language called KQL (Kusto Query Language). It’s fairly easy to write a simple query, but more advanced KQL is well outside of the scope of this guide.

A simple KQL query to view all Polaris events is

Syslog
| where SyslogMessage contains "Polaris"

This query should yield results from Polaris, though you may need to be patient for a few minutes while all the different processes hook up and synchronise.

You can see below the results of our simple query and you will notice that all of the pertinent Polaris information is held in the SyslogMessage field. This means we can construct our queries against this value. For example, if someone is trying to brute force an admin account, we can search for this in Sentinel.

Using an AND logical operator, we can pull out login failure events using the following KQL :

Syslog
| where SyslogMessage contains "Polaris"
    and SyslogMessage contains "Failure"
    and SyslogMessage contains "failed to login with invalid password."
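Taking that a step further, a summarize over time buckets turns raw failures into something you can alert on. This is just a sketch; the five-minute bin and the threshold of 5 are arbitrary values chosen to illustrate the idea:

```
Syslog
| where SyslogMessage contains "Polaris"
    and SyslogMessage contains "failed to login with invalid password."
| summarize FailureCount = count() by bin(TimeGenerated, 5m)
| where FailureCount > 5
```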


Hopefully this article has given you a good starting point for ingesting Polaris events into Sentinel and how you can construct some basic queries to assess the overall health and security of your Polaris tenant.


Some Google Cloud exam feedback and tips




I started 2019 thinking that it was a bit pointless me sitting any more exams as I didn’t think it would do much to enhance my career at this point. I stayed in that mindset until quite late in the year when I decided that my Google Cloud skills weren’t up to snuff and needed improving.

Once I’d made that decision, it then made sense for me to start down the certification route again. Mainly because for me, having an exam at the end of a learning phase gives me a goal to focus on, rather than just doing bits and pieces aimlessly. With that being said, I had a look at what Google Cloud had to offer on the certification track – they have the following:-

  • Associate Cloud Engineer
  • Professional Cloud Architect
  • Professional Data Engineer
  • Professional Cloud Developer
  • Professional Cloud Network Engineer
  • Professional Cloud Security Engineer
  • Professional Collaboration Engineer
  • G Suite Certification

It was a bit more extensive than I remembered when I last looked at it, but looking down the list, I figured that starting at associate level would be most appropriate for me. The recommendation is to have a minimum of 6 months hands on GCP skills, which I was kind of at.

To start with, all the GCP exams I’ve sat have followed the same format. You book them via WebAssessor, following pretty much the same process as every other IT vendor.

The exams themselves are 50 questions long and I think they’re 2 hours each. You might think 50 questions in 120 minutes doesn’t give you too much time per question, but to be honest, I never found time management as big a problem as it is with the AWS exams.

Where to start?

With any exam, I’d always reference the exam blueprint first. This will give you a good idea of what the exam expects of you in terms of knowledge and experience. More often than not, the exam questions themselves are pretty faithful to the exam blueprint.

As an example, I decided on the Associate Cloud Engineer as my starting point, and the exam blueprint is available here. From memory, I’d say that it’s well worth going over the product offerings and knowing what they can and can’t do, and which use case is most appropriate. The other exams followed a very similar format and there is a good amount of overlap between the three I’ve now done. Knowing when BigTable is more appropriate than BigQuery is very helpful for example, and helps you weed out the incorrect answers in the multiple choices.

Once you’re familiar with the exam blueprint, time to get some hands on. Sign up for a free account so you can try a few things out at a very low cost. In terms of training, I have a subscription to Linux Academy. The course content is very good, pretty close to the exam blueprint and the killer is that there are real hands on labs in GCP that are included as part of your subscription.

I’m not sure how the A Cloud Guru situation will change the quality of Linux Academy, only time will tell, I suppose. For now, it’s the best resource there is for cloud exams, in my opinion.

Study Books

There are also study books from Dan Sullivan available for the Associate Cloud Engineer and Professional Cloud Architect (available below), which are pretty cost effective resources to supplement hands on practice as well as resources such as Linux Academy.

So armed with my Linux Academy subscription, GCP trial and study books, I began my Associate Cloud Engineer journey. For me, a month is about the maximum length of time I will wait from starting study to sitting the exam. I don’t believe in waiting too long, you have to strike while the iron is hot, plus if there are areas I don’t know too well that need to be improved, I’d rather find out sooner than later.

The Associate Cloud Engineer exam I’d say is a bit more “ops” focused than the Professional Cloud Architect, so by that I mean you will get probed about the gcloud command. A lot! Make sure you know the command pretty well, what the options and switches are and the general syntax of how to use it. You’ll also need to know gsutil pretty well too.

Having sat this exam and also two Professional level exams, I’d say don’t go into the Cloud Associate exam thinking it’s easier or a lower level. It’s most certainly not and I think the names are a bit misleading. If you’re coming from AWS, you may have found the Associate level exams quite easy and then found the Professional level exams a big jump up (well I did, anyway).

I’m pleased to say that since I started this process in early November 2019, I’ve passed three GCP exams:-

  • Associate Cloud Engineer
  • Professional Cloud Architect
  • Professional Cloud Security Engineer

All three exams had a decent amount of overlap between them, especially around product patterns and anti-patterns (when to use them and when not to use them), so I’d definitely recommend knowing the common products quite well (compute, networking, databases, pub/sub at an absolute minimum).

In terms of difficulty, I think they’re probably slightly easier than the AWS Professional level exams but more difficult than the Azure exams. It’s different for everyone though, so don’t take that as gospel. Maybe I’ve just had a lot more experience of the AWS exams, but I found it relatively straightforward to eliminate the obviously incorrect responses from the multiple choice. They’re not gimmes though; if you’re not prepared, you won’t pass!

Finally, kudos to Google for giving certified professionals some top quality swag for free. I don’t see how store access to buy a cheap t-shirt or mouse mat is a certification benefit (I’m looking at you, AWS!), especially if you live outside of the US. The shipping to Europe alone is more than the item itself! Google have given me two Certified Professional hoodies free by mail order and a very nice padded jacket at Cloud Next in London, also totally free. I also got a discount code for Google Cloud Next 2020 in the US, knocking $500 off, which is pretty decent.

What’s next for me? Back to AWS to add the Security Specialty hopefully, sometime in January. Now I’ve re-certified my AWS Professional certs, that’s next. Then back to GCP networking and then maybe back to Azure. Who knows?

If you’re sitting the GCP exams, I hope these tips helped and let me know how you get on!



Notes from the field – Cloud Design Part 1

I’ve had an interesting last couple of weeks having discussions with customers who are both already in the public cloud and those that are dipping a toe. One recurrent theme seems to be the idea of taking what you have on premises and putting it into the cloud and expecting it to work in exactly the same way.

I’ve been working with cloud technologies for coming up on 5 years now, and in that time, this concept has been prevalent all the way through. There is a famous quote that seems to have been attributed to different historical people, including Einstein and Henry Ford, but instead I’m going to use Aerosmith’s interpretation of it:-

Cause if you do what you’ve always done
you’ll always get what you always got
Uh could that be nothin’

Steven Tyler, Aerosmith “Get A Grip”

Now that I’ve shoehorned in a hard rock reference, let’s look at what that actually means. For a start, many larger organisations use the end of a DC lease to trigger their move into public cloud by doing a “lift and shift” of VMs into the cloud, maybe deploying a couple of third party appliances (such as CloudGuard IaaS, of course) and then declaring themselves “in the cloud”. Job done.

Well yes and no.

Let me be clear on my view that if you are moving to cloud to save money, you’re doing it for entirely the wrong reasons. Really what you’re buying into is hyperscale technologies – the ability to provision highly complex stacks with a few clicks of a button and paying only for what you use.

If you drag and drop a bunch of most probably oversized VMs into cloud, when you get the monthly bill, you’re in for a shock. In my experience, compute charges make up the lion’s share of your bill. Do your research ahead of time and look if there are ways you can mitigate this cost.

For starters, if you have simple web serving needs, why not use the ability to publish web sites via S3 buckets, or maybe an Azure Web App? There are multiple tiers in the latter service, depending on what levels of performance and resilience you want.

If you have bursty compute requirements, look at auto scaling technologies or even serverless. Don’t be sucked into the dogma that serverless cures all ills, because it doesn’t. Used in the right way, it can be highly cost effective and elegant. Used in the wrong way, it can be expensive and inflexible for your needs.

noun [ C or U ] disapproving
UK /ˈdɒɡ.mə/ US /ˈdɑːɡ.mə/

a fixed, especially religious, belief or set of beliefs that people are expected to accept without any doubts

Cambridge Dictionary

That’s not to say there’s anything wrong with lift and shift of VMs into Azure, AWS, GCP, etc. But it’s a staging post, not a destination. One public sector body I worked with a couple of years ago was really switched on to this. They saw L&S as very much a phase one, then used the “long tail” method of transitioning their apps to something more cloud native, using technologies such as Azure SQL and Azure Web Apps.

As usual, this post is a bit more of a brain dump than anything more formal. In future posts I intend to explore some more of the experiences I’ve had in the field and hopefully some will resonate with you.

As always, comments welcome. You can reach me on Twitter @ChrisBeckett.




Security starts with Windows + L (or command+control+Q)


<tap tap> Is this thing still on?

Well then, long time no blog. In case you are wondering, I joined Check Point as a Cloud Security Architect for UK & I last July, and have just celebrated my one year anniversary. I’m not quite sure how I managed to end up here, but it’s been a fantastic experience and I’ve learned absolutely tons about cyber security in that time.

As such, the topic of my first blog since 865 BC (or what feels like it!) is something that absolutely grinds my gears and I’m hoping we can raise a bit of awareness. The cyber security industry is worth billions of pounds a year and the vendors in the market make some pretty awesome products. Businesses and organisations are taking this topic more seriously than they’ve ever done, beefing up defences and increasing budgets.

CISOs and CSOs are now more commonplace and many cloud professionals are well educated on the importance of best practices such as making S3 buckets private, using security groups to control traffic flow and implementing solutions such as Check Point CloudGuard IaaS to provide deep packet inspection and IPS capabilities.

Automation gives us the ability to close any configuration gaps and perform remediation quicker than a human could spot it and fix it. This is all well and good, but attackers will always look for the easiest way to infiltrate an environment. I mean, why smash a window when the front door has been left open?

Lock your laptop, stoop.

I’ve been doing a lot of travelling the last year and it’s interesting to see how people behave. There is as much behavioural science in cyber security as there is technology. I find it staggering how many people will happily boot up their laptop in a public place, log in, open their e-mail, open a company document and then promptly get up and go to the toilet or order a coffee from the counter.

The one further observation on this is that the more “exclusive” the surroundings, the more likely it is that the individual will make this mistake. Two examples – a lady sat next to me in the lounge at Manchester Airport got ready to work and then promptly buggered off for five minutes. Similarly, I was in the first class carriage on a train and another lady from a pharma company (I won’t say which) opened a spreadsheet with an absolute f**k ton of customer data on it and then went off to the ladies (I presume, she was gone for ages) with the screen unlocked.

A better class of idiot

The one thing that connects these two examples is the fact that they took place in a more “restricted” area. Presumably the assumption is that the better “class” of people you are sat with, the smaller the chance that anything nefarious will happen. It’s impossible to say for sure if this is actually true, but it shows how humans think. If I’m behind the velvet rope, all the thieving assholes are wandering through the duty free shops and drinking themselves into a coma.

Not necessarily true. Many data thieves are well funded (via legal avenues or otherwise) and so quite regularly will pop up behind the velvet rope. They’ve done their research too and have seen the same things I have. Even taking a picture of a laptop screen with a mobile phone takes seconds, you don’t even need to touch a keyboard.

We know now that once data gets out there, you can’t get it back. Whether it’s corporate data or a tweet declaring your undying love for your secondary school English teacher from way back when.

Don’t overlook the simple stuff

At a customer event a couple of months ago, I asked for a show of hands on how many organisations present had a corporate policy on locking your workstation when you aren’t in front of it. About three quarters put their hand up. I followed that up with the question of how many organisations actually enforced this policy. How many do you think? The answer was none.

It’s great that organisations moving to the cloud are really boosting their skills and knowledge around security. It’s a fast moving target and it’s hard to keep up with, but there are some things that are so simple that they often get overlooked.

Start with a policy mandating screen locking when a user walks away. Laptop, desktop, tablet, whatever. Make sure the lock screen has to be cleared by means of a password, PIN or ideally some biometrics such as fingerprint.

This policy will cost you nothing but will make a huge difference. It’s amazing, once you start doing it, it becomes habit very quickly, meaning that users away from the office will do this without thinking. You could even follow this up by advising road warriors to get a privacy screen gauze on their laptop (there are a bunch of them on Amazon or whatever your favourite e-tailer is). All small stuff, inexpensive but forms a good layer of protection against data loss.

Do it today, and do yourself a favour. Like the great Maury Finkle of Finkle’s Fixtures says..




05-01-18 : 6/7 Ain’t Bad : AWS Certified Big Data – Specialty Exam Tips

I’m pleased to say I just returned from sitting the AWS Certified Big Data Specialty exam and I managed to just about pass it first time. As always, I try and give some feedback to the community to help those who are planning on having a go themselves.

The exam itself is 65 questions over 170 minutes. In terms of difficulty, it’s definitely harder than the Associate level exams and in some cases, as tough as the Professional level exams. I didn’t feel particularly time constrained as with some other AWS exams as most of the questions are reasonably short (and a couple of them don’t make sense, meaning you need to take a best guess attempt at it).

In terms of preparation, I was lucky enough to be sent on the AWS Big Data course by my employer just before Christmas and it certainly helped, but there was some exam content I didn’t remember the course covering. I also chose LinuxAcademy over A Cloud Guru, but only because LA had hands-on labs with its course and I don’t think ACG has them right now. There’s really no substitute for a hands-on lab to help understand a concept beyond the documentation.

I also use QwikLabs for hands-on lab practice. There are a number of free labs you can use to help with some of the basics; for the more advanced labs, I’d recommend buying an Advantage Subscription, which allows you to take unlimited labs for a monthly charge. It’s about £40 if you’re in the UK, around $55 for US based folks. It might sound like a lot, but it’s cheaper than paying for an exam resit!

I won’t lie, Big Data is not my strong point and it’s also a topic I find quite dry, having been an infrastructure guy for 20 years or more. That being said, Big Data is a large part of the technology landscape we live in, and I always say a good architect knows a little bit about a lot of things.

As with other AWS exams, the questions are worded in a certain way. For example, “the most cost effective method”, “the most efficient method” or “the quickest method”. Maybe the latter examples are more subjective, but cost effectiveness usually wraps around S3 and Lambda as opposed to massive EMR and Redshift clusters, for example.

What should you focus on? Well the exam blueprint is always a good place to start. Some of the objectives are a bit generic, but you should have a sound grasp of what all the products are, the architecture of them and design patterns and anti-patterns (e.g. when not to use them). From here, you should be able to weed out some of the clearly incorrect answers to give you a statistically better chance of picking the correct answer.

Topic wise, I’d advise focusing on the following:-

  • Kinesis (Streams, Firehose, Analytics, data ingestion and export to other AWS services, tuning)
  • DynamoDB (Performance tuning, partitioning, use patterns and anti-patterns, indexing)
  • S3 (Patterns and anti-patterns, IA/Glacier and lifecycling, partitioning)
  • Elastic MapReduce (Products used in conjunction and what they do – Spark, Hadoop, Zeppelin, Sqoop, Pig, Hive, etc.)
  • QuickSight (Use patterns/anti-patterns, chart types)
  • Redshift (Data ingestion, data export, slicing design, indexing, schema types)
  • Instance types (compute intensive, a smaller number of large instances vs a larger number of small instances)
  • Compression (performance, compression sizes)
  • Machine Learning (machine learning model types and when you’d use them)
  • IoT (understand the basics of AWS IoT architecture)
  • What services are multi-AZ and/or multi-region and how to work around geographic constraints
  • Data Import/Export (when to use, options)
  • Security (IAM, KMS, HSM, CloudTrail)
  • CloudWatch (log files, metrics, etc.)

As with many AWS exams, the topics seem very broad, so it’s well worth knowing a little about all of the above, but certainly focus on EMR and Redshift as they are the bedrock products of Big Data. If you know them well, I’d say you’re halfway there.

You may also find re:Invent videos helpful, especially the Deep Dive ones at the 300 or 400 level. The exam is passable; if I can do it, anyone can! Hopefully this blog helped you out, as there doesn’t seem to be much information out there on the exam since it went GA.

Just the Networking Specialty to do now for the full set, hopefully I’ll get that done before my SA Professional expires in June!



AWS Certification – Changes To Resit Policies?



As I tweeted at the end of last week, I failed the AWS Advanced Networking exam on Friday and I was looking earlier to see when I could reschedule this and jump back on the horse. Originally when I first started sitting AWS exams back in the dark depths of December 2015, you could sit an exam three times before you had to wait 12 months to sit it again.

As you can imagine, sitting my SA Pro exam at the third time of asking was pressure enough but also to have that sword hanging over my head just made the situation practically unbearable. I’m pleased to note that when I logged into the Training and Certification portal this morning, the resit policy has been relaxed quite a bit. From three attempts in a single year, all exams now have the following terms :-

  • You can sit any AWS exam a total of 10 times (Initial sitting plus 9 retakes)
  • You must wait 14 days after any failed attempt before you can register for a resit
  • The maximum number of exam sittings in a 12 month period seems to have been removed

This is a much better approach for test sitters and takes some of the pressure off. It also makes sense from AWS’s point of view, as they can generate more revenue from exams. I’m not sure when this policy changed (a quick Google turned up nothing), but it’s well worth knowing if you’re sitting any exams soon.

As regards the maximum sittings in a single year, if you need more than 10 attempts, it’s probably safe to say you should consider something a bit different. 😉

Screen grab from the T&C portal showing the new resit policy for all exams



Event Review – Google Cloud Next London – Day Two

Seenit demo

The day 2 keynote started with an in depth discussion of Cloud Spanner, as mentioned previously. AWS and Azure provide highly scalable and highly tunable NoSQL services in the form of DynamoDB etc, but when it comes to more traditional “meat and potatoes” RDBMS solutions, they are constrained by the limitations of the products they use, such as MySQL, SQL Server, Postgres, etc.

Cloud Spanner is different: it is a fully scalable RDBMS solution in the cloud that offers the same benefits as the NoSQL solutions in Azure and AWS. Much of the complexity of sharding the database and replicating it globally is taken care of within Cloud Spanner. Automatic tuning is also done over time by background algorithms.

Cloud Spanner goes GA on May 16th and is well worth a look if you have ACID database requirements at scale.

A representative from The Telegraph was brought up to discuss how GC’s data solutions allow them to perform very precise consumer targeting using analytics. It was also worth noting that they are a multi-cloud environment, using best of breed tools depending on the use case. Rare and ballsy!

An example of the powerful Google APIs available was then demonstrated by a UK startup called Seenit. They use the Google Video Intelligence API to automatically tag videos that are uploaded to their service. Shazam then came up on stage to discuss their use of the Google Cloud platform and to share some of the numbers they have for their service.

Shazam by numbers

As you can see from the picture above, there have been over a billion downloads of the app and more than 300 million daily active users. Those numbers take some processing! One of the key takeaways was that in some businesses traffic spikes can be predicted, such as at a major sporting event or during a Black Friday sale. That is less the case with Shazam, so they have to run on an underlying platform that is resilient to unpredictable spikes.

There was a demo of GPU usage in the cloud, around the use case of rendering video. The key benefit of cloud GPU is that you can harness massive scalability at a fraction of the cost it would take to provision your own kit. Not only that, but consumption based charging means that you only pay for what you use, making it a highly cost effective option.

For the final demo of the keynote, there was a show and tell around changes coming to G Suite. This includes Hangouts, which has had some major engineering done to it. It will support call-in delegates, a Green Room to hold attendees before the meeting starts and a new device called the Jamboard. This is a touch screen whiteboard that can be shared with delegates in the Hangouts meeting, who can also interact with the virtual whiteboard, making it a team interactive session. Jamboards are not yet available, but expect them to cost a few thousand pounds/dollars on release.

One of the new aspects of G Suite that I liked was the addition of bots and natural language support. Bots are integrated with Hangouts so that you can assign a project task to a team member, or you can use the bot to find the next free meeting slot for all delegates, all of which takes time in the real world.

Hangouts improvements

Natural language support was demonstrated in Sheets, whereby a user wanted to apply a particular formula but didn’t know how. By expressing what they wanted to do in natural language, Sheets was able to construct a complex formula that achieved these results in a split second, again illustrating the value of the powerful Google APIs.

A final demo was given by another UK startup called Ravelin. They have a service that detects fraud in financial transactions using powerful Machine Learning techniques. They then draw heat maps of suspected fraud activity and this can at a glance show parts of the country where fraud is most likely.

The service sits in the workflow for online payments and can return positive or negative results in milliseconds, thus not delaying the checkout process for the end consumer. Really impressive stuff!

More security and compliance in the cloud

After the keynote, I went to the first breakout of the day which was about security and compliance. This did not just cover GCP but also mobile as well. A Google service called Safety Net makes 400 million checks a day against devices to help prevent attacks and data leaks. This is leveraged by Google Play, whose payment platform serves 1 billion users worldwide.

One stat that blew me away was that 99% of all exploits happen a year after the CVE was published. This is a bit of a damning statement and shows that security and patching is still not treated seriously enough. On the other side of the coin, Android still has a lot to do in this area, so in some respects I thought it was a bit rich of Google to point fingers.

Are you the weakest link?

Google has 15 regions and 100 POPs in 33 countries, with a global fibre network backbone that carries a third of all internet traffic daily. The Google Peering website has more information on the global network and is worth a visit. Google really emphasised their desire to be the most secure cloud provider possible by noting that they have 700+ security researchers and have published 160 academic security white papers. Phishing is still the most common way of delivering malicious payloads.

DLP is now available for both Gmail and Drive, meaning the leak of data to unauthorised sources can now be prevented. There is also support for FIDO approved tokens, which are USB security keys, some with a fingerprint scanner on board. These are fairly cheap and provide an additional layer of security. The session wrapped with announcements around expiring access and IRM support for Drive, S/MIME support for Gmail and third party app whitelisting for G Suite.

On GDPR, Google have stated that you are the data controller and Google is the data processor. Google has certified all infrastructure for FedRAMP, the only provider to have done so. Although FedRAMP doesn’t apply outside of the US, there may be cases where this level of certification will be useful to show security compliance.

Cloud networking for Enterprises

My next breakout was on GC networking. I have to say that as a rule, the way GC does this is very similar to AWS with VPC and subnet constructs, along with load balancing capabilities. Load balancing comes in three main flavours – HTTP(S), SSL and TCP Proxy. You can also have both internal and external load balancing.

Load balancing can be globally distributed, which helps enable high availability and good levels of resilience; this is achieved using IP anycast. IPv6 is now supported on load balancers, but you can only have one address type per load balancer. In respect of CDN, there is a Google CDN, but you can also use third party CDN providers such as Akamai or Fastly.

Fastly took part in the breakout to explain how their solution works. It adds a layer of scalability and also performance on top of public cloud providers. It is custom code written by Fastly to determine optimal routes for network traffic globally. I’m sure it does a lot more than that, so feel free to check them out.

The Fastly network

Andromeda is the name of the SDN written by Google to control all networking functions. There is a 60Gbps link between VMs in the same region and live migration of VMs is available (unique to GC at the time of writing). GCP firewalls are stateful, accept ingress/egress rules and deny is the default unless overridden.
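The default-deny behaviour is easy to picture as priority-ordered rule evaluation. Here’s a hypothetical sketch in Python (my own model for illustration, not GCP’s actual implementation; real GCP rules also carry directions, targets and more):

```python
# Hypothetical model of priority-based firewall evaluation with default deny.
# Rule shape and matchers are invented for illustration.
from typing import Callable, List, Tuple

Rule = Tuple[int, Callable[[dict], bool], str]  # (priority, matcher, action)

def evaluate(rules: List[Rule], packet: dict) -> str:
    # Lower priority number wins, as in GCP; first matching rule decides.
    for _priority, matches, action in sorted(rules, key=lambda r: r[0]):
        if matches(packet):
            return action
    return "deny"  # implicit default: deny unless a rule overrides it

rules = [
    (1000, lambda p: p["port"] == 22 and p["src"].startswith("10."), "allow"),
    (2000, lambda p: p["port"] == 80, "allow"),
]

print(evaluate(rules, {"src": "10.0.0.5", "port": 22}))       # allow
print(evaluate(rules, {"src": "198.51.100.9", "port": 443}))  # deny
```

With no matching rule at all, the packet falls through to the implicit deny, which is the behaviour described above.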

DDoS protection is provided at layers 3 and 4 with Cloud CDN and Load Balancer, with third party appliances supported (Check Point, Palo Alto, F5, etc.). Identity Aware Proxy can be used to create ACLs for access to external and internal sites using G Suite credentials. In respect of VPCs, you can have a single VPC that is used globally and also shared with other organisations. VPCs have expandable IP address ranges, so you don’t need to decide up front how many addresses you will need; this can be changed later.

There is private access to Google services from VPCs including cloud storage, so think of S3 endpoints in AWS and you’ll get the idea. Traffic does not traverse the public internet, but uses Google’s network backbone. You can access any region from a single interconnect through Google’s network (think Direct Connect or ExpressRoute).

Like Azure and AWS, VPC network peering is available. VMs support multiple NICs, up to 10 per VM. XPNs provide cross-project networking, giving you shared central network administration, shared VPNs and fine grained IAM controls, and the Cloud Router supports BGP. Finally, in terms of high bandwidth/low latency connections, you can have a direct connection to Google, with partner interconnects also available.

To wrap up

To summarise, props to Google for a very good event. There was loads of technical deep dive content and if I had one criticism, it would be that the exhibition hall was a bit sparse, but I expect that will be addressed pretty quickly. In respect of functionality, I was pleasantly surprised with how much is currently available in GC.

Standouts for me include the VM charging model, VM sizing, live migration of VMs and added flexibility around the networking piece. It’s clear that Google want to position GC as having all the core stuff you’d expect, differentiated by the massively powerful APIs that help run the consumer side of Google.




Event Review – Google Cloud Next London – Day One

I was fortunate enough to spend the last couple of days at the Google Cloud Next London event at the ExCel centre and I have a few thoughts about it I’d like to share. The main takeaway I got from the event is that while there may not be the breadth of services within Google Cloud (GCP) as there is in AWS or Azure, GCP is not a “me too” public cloud hyperscaler.

While some core services such as cloud storage, VPC networking, IaaS and databases are available, there are some key differences with GCP that are worth knowing about. My interpretation of what I saw over the couple of days was that Google have taken some of the core services they’ve been delivering for years, such as Machine Learning, Maps and Artificial Intelligence, and presented them as APIs for customers to consume within their GCP account.

This is a massive difference from what I can see with AWS and Azure. Sure, there are components of the above available in those platforms, but these are services which have been at the heart of Google’s consumer services for over a decade and they have incredible power. In terms of market size, both AWS and Azure dwarf GCP, but don’t be fooled into thinking this is not a priority area for Google, because it is. They have ground to make up, but they have very big war chests of capital to spend and also have some of the smartest people on the planet working for them.

To start with, in the keynote, there was the usual run down of event numbers, but the one that was most interesting for me was that there were 4,500 delegates, which is up a whopping 300% on last year, and 67% of registered attendees described themselves as developers. Google Cloud is made up of GCP, G Suite (Gmail and the other consumer apps), Maps and APIs, Chrome and Android. Google Cloud provides services to 1 billion people worldwide per day. Incredible!

Gratuitous GC partner slide

There was the usual shout out of thanks to the event sponsors. One thing I did notice in contrast to other vendor events I’ve been to was the paucity of partners in the exhibition hall. There were several big names including Rackspace, Intel and Equinix but obviously building a strong partner ecosystem is still very much a work in progress.

We then had a short section with Diane Greene, who many industry veterans will know as one of the founders of VMware. She is now Senior VP for Google Cloud and it’s her job to get Google Cloud better recognition in the market. Something I found quite odd about this section is that she seemed quite ill prepared for her content and brought some paper notes with her on stage, which is very unusual these days. There were several quite long pauses and it seemed very under-rehearsed, which surprised me. Normally the keynote speakers are well versed and very slick.

GDPR and GC investment

Anyway, moving on to other factoids – Greene committed Google to be fully GDPR compliant by the time it becomes law next May. She also stated there has been $29.4 billion spent on Google Cloud in the last three years. The Google fibre backbone carries one third of all internet traffic. Let that sink in for a minute!

There is ongoing investment in the GC infrastructure and when complete in late 2017/early 2018, there will be 17 regions and 50 availability zones in the GC environment, which will be market leading.


GCP regions, planned and current

Google Cloud billing model

One aspect of the conference that was really interesting was the billing model for virtual machines. In the field, my experience with AWS and Azure has been one of pain when trying to determine the most cost effective way to provide compute services. It becomes a minefield of right-sizing instances, purchasing reserved instances, deciding what you might need in three years’ time and looking at Microsoft enterprise agreements to try and leverage Hybrid Use Benefit. Painful!

The GCP billing model is one in which you can have custom VM sizes (much like we’ve always had with vSphere, Hyper-V and KVM), so there is less waste per VM. Also, the longer you use a VM, the cheaper it becomes (this is referred to as sustained usage discount). Billing is also done per minute, in contrast to AWS and Azure, who bill per hour, meaning that with them, even if you only use part of an hour, you still pay for the whole hour.

It is estimated that 45% of public cloud compute spend is wasted, the GC billing model should help reduce this figure. You can also change VM sizes at any time and the sustained usage discount can result in “up to” 57% savings. Worth looking at, I think you’ll agree.
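The per-minute difference is easy to see with a bit of arithmetic. A toy comparison (the hourly rate here is invented for illustration, not a real price list, and sustained usage discounts are ignored):

```python
import math

HOURLY_RATE = 0.10  # assumed $/hour for some VM size -- illustrative only

def per_hour_bill(minutes_used: int) -> float:
    """Per-hour billing: every started hour is billed in full."""
    return math.ceil(minutes_used / 60) * HOURLY_RATE

def per_minute_bill(minutes_used: int) -> float:
    """Per-minute billing: pay only for the minutes actually used."""
    return (minutes_used / 60) * HOURLY_RATE

# A job that runs for 2 hours 5 minutes is billed as 3 full hours
# under per-hour billing, but only ~2.08 hours under per-minute billing.
print(per_hour_bill(125))
print(per_minute_bill(125))
```

Over a fleet of short-lived VMs, that rounding difference adds up quickly, which is where the wasted-spend figure comes from.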

Lush from the UK were brought up to discuss their migration to GCP, which they performed in 22 days, and they calculate 40% savings on hosting charges per year. Not bad!

Co-existence and migration

There has also been a lot of work done within GCP to support Windows native tools such as PowerShell (there are GCP cmdlets) and Visual Studio. There are also migration tools that can live-move VMs from vSphere, Hyper-V and KVM, as you’d probably expect. Worth mentioning too at this point that GCP has live migration for VMs as per vSphere and Hyper-V, which is unique among the public cloud providers right now, certainly to the best of my knowledge.

G Suite improvements

Lots of work has been done around G Suite, including improvements to Drive to allow for team sharing of documents and also using predictive algorithms to put documents at the top of the Drive page within one click, rather than having to search through folders for the document you’re looking for. Google claim a 40% hit rate from the suggested documents.

There are also add ons from the likes of QuickBooks, where you can raise an invoice from directly within Gmail and be able to reconcile it when you get back to QuickBooks. Nice!

Encryption in the cloud

Once the opening keynote wrapped, I went to my first breakout session which was about encryption within GC. I’m not going to pretend I’m an expert in this field, but Maya Kaczorowski clearly is, and she is a security PM at Google. The process of encrypting data within the GC environment can be summarised thus :-

  • Data uploaded to GC is “chunked” into small pieces (variable size)
  • Each chunk is encrypted and has its own key
  • Chunks are written randomly across the GC environment
  • Compromising one chunk is effectively useless, as an attacker would still need all the other chunks (and their keys)
  • There is a strict hierarchy to the Key Management Service (shown below)

Google key hierarchy

A replay of this session is available on YouTube and is well worth a watch. Probably a couple of times so you actually understand it!
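To make the chunk-plus-key-hierarchy idea concrete, here’s a toy sketch in Python. The XOR “cipher” is purely illustrative and not remotely secure (real implementations use AES, and the KEK itself is wrapped by higher keys in the hierarchy); only the structure matters:

```python
import os
import hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256-derived keystream -- illustration only, NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ k for d, k in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR stream "ciphers" are symmetric

CHUNK = 16  # real chunk sizes are variable and much larger

kek = os.urandom(32)  # key-encryption key; in the real hierarchy this is itself wrapped

blob = b"some object uploaded to cloud storage"
chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]

stored = []
for chunk in chunks:
    dek = os.urandom(32)                       # each chunk gets its own data key
    stored.append((toy_encrypt(kek, dek),      # DEK wrapped by the KEK
                   toy_encrypt(dek, chunk)))   # chunk encrypted with its DEK

# Reading back requires unwrapping each chunk's DEK first
recovered = b"".join(
    toy_decrypt(toy_decrypt(kek, wrapped_dek), ciphertext)
    for wrapped_dek, ciphertext in stored)
assert recovered == blob
```

The point of the structure is the one made in the bullets: stealing one encrypted chunk gets you nothing without its wrapped key, the KEK above it and the other chunks.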

What’s new in Kubernetes and Google Container Engine

Next up was a Kubernetes session and how it works with Google Container Engine (GKE). I have to say, I’ve heard the name of Kubernetes thrown around a lot, but never really had the time or the inclination to see what all the fuss is about. As I understand it, Kubernetes is a layer over the top of container technologies such as Docker that provides enterprise management features such as clustering and scaling.

Kubernetes was written initially by Google before being open sourced and it’s rapidly becoming one of the biggest open source projects ever. One of the key drivers for using containers and Kubernetes is the ability to port your environment to any platform. Containers and Kubernetes can be run on Azure, AWS, GC or even on prem. Using this technology avoids vendor lock in, if this is a concern for you.

Kubernetes contributors and users

There is also a very high release cadence: a new version ships every three months, and version 1.7 is due at the end of June (1.6 is the current version). The essence of containerisation is that you can start to develop microservices (services broken down into small, fast moving parts rather than one huge, inflexible monolithic stack). Containers are also stateless, in the sense that data is stored elsewhere (a cloud storage bucket, etc.), and are disposable items.

In a Kubernetes cluster, you can now scale up to 5,000 pods per cluster. A cluster is a collection of nodes (think VMs) and pods are groups of one or more containers running isolated from each other on a node. Clusters can be multi-zone and multi-region and now also have the concept of “taints” and “tolerations”. Think of taints as node characteristics, such as having a GPU or a certain RAM or CPU size. A toleration is a pod rule that allows or disallows scheduling based on a node’s taints. For example, a toleration could allow a pod to run only on nodes with a GPU.
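The taint/toleration rule can be sketched as a simple set check. This is a heavily simplified model of my own (real Kubernetes taints also carry effects such as NoSchedule and NoExecute, and tolerations have operators):

```python
def can_schedule(pod_tolerations: set, node_taints: set) -> bool:
    """Simplified rule: a pod may land on a node only if it tolerates
    every taint on that node."""
    return node_taints <= pod_tolerations

gpu_node = {"gpu=true"}   # node tainted as a GPU node
plain_node = set()        # untainted node

ml_pod = {"gpu=true"}     # tolerates the GPU taint
web_pod = set()           # tolerates nothing special

print(can_schedule(ml_pod, gpu_node))     # True
print(can_schedule(web_pod, gpu_node))    # False: the taint repels the pod
print(can_schedule(web_pod, plain_node))  # True
```

Note the direction of the mechanism: taints repel pods from nodes, so ordinary workloads stay off the expensive GPU nodes unless they explicitly tolerate them.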

The final point of note here is that Google offer a managed Kubernetes service called Google Container Engine.

From Blobs to Relational Tables, where do I store my data?

My next breakout was to try and get a better view of the different storage options within GC. One of the first points made was really interesting in that Rolls Royce actually lease engines to airlines so they can collect telemetry data and have the ability to tune engines as well as perform pro-active maintenance based on data received back from the engines.

In summary, your storage options include:-

  • RDBMS – Cloud SQL
  • Data Warehousing – BigQuery
  • Hadoop – Cloud Storage
  • NoSQL – Cloud BigTable
  • NoSQL Docs – Cloud Datastore
  • Scalable RDBMS – Cloud Spanner

Cloud Storage can have several different characteristics, including multi-region, regional, nearline and coldline. This is very similar to the options provided by AWS and Azure. Cloud Storage has an availability SLA of 99.95% and you use the same API to access all storage tiers.

Data lifecycle policies are available in a similar way to S3, moving data between the tiers when rules are triggered. Content delivery is performed using the Cloud CDN product and message queuing using Cloud Pub/Sub. Cloud Storage for hybrid environments is also available, in a similar way to StorSimple or the AWS Storage Gateway, using partner solutions such as Panzura (cold storage, backup, tiering device, etc.)
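For a feel of what such a policy looks like, here is roughly the shape of a Cloud Storage lifecycle configuration as sent to its JSON API, expressed as a Python dict. The ages and storage classes are invented for the example, so check the current GCS documentation for the exact schema:

```python
import json

# Roughly the shape of a GCS lifecycle configuration (JSON API style).
# The ages and storage classes here are illustrative values.
lifecycle = {
    "lifecycle": {
        "rule": [
            {   # move objects to a colder tier after 30 days
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30},
            },
            {   # delete objects after a year
                "action": {"type": "Delete"},
                "condition": {"age": 365},
            },
        ]
    }
}

print(json.dumps(lifecycle, indent=2))
```

Anyone who has written an S3 lifecycle rule will find this immediately familiar: an action paired with a condition that triggers it.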

Cloud SQL offers a 99.95% SLA, with failover replicas and read replicas, which seemed very similar to how AWS RDS works. One interesting product was Cloud Spanner. This is a horizontally scalable RDBMS solution that offers traditional guarantees such as ACID transactions but with the scalability of typical cloud NoSQL solutions. This seemed a pretty unique feature of GC; I haven’t seen it elsewhere. Cloud Spanner also provides global consistency, a 99.99% uptime SLA and a 99.999% multi-region availability SLA. Cool stuff!

Serverless Options on GCP

My next breakout was on serverless options on GCP. Serverless seems to be the latest trend in cloud computing, one that for some people is the answer to everything and nothing. Both AWS and Azure provide serverless products, and there are a lot of similarities with the Google Cloud Functions product.

To briefly deconstruct serverless tech, this is where an event-driven process performs a specific task. For example, a file gets uploaded to a storage bucket, which fires an event trigger, and some work is performed by a managed fleet of servers. Once the task is complete, the process goes back to sleep again.

The main benefits of serverless are cost and management. You aren’t spinning VMs up and down and you aren’t paying compute fees for idle VMs. Functions is charged per 100ms of usage and by how much RAM is assigned to the process. The back end also auto-scales, so you don’t have to worry about setting up your own auto scaling policies.

Cloud Functions is in its infancy right now, so only Node.js is supported, but more language support will be added over time. Cloud Storage, Pub/Sub channels and HTTP webhooks can be used to capture events for serverless processes.

Day Two wrap up to come in the next post!


Breaking Bad – Beating Analysis Paralysis

We’ve all been there, right? A large project, or even a project that’s part of a bigger programme of works, that seems stalled, stuck in an infinite loop because there is so much worry that something deployed may prove to be “wrong” further down the line. Meeting after meeting goes by, nips and tucks go into designs, more stakeholders are pulled into the process and the WITWIT cycle becomes one you can’t seem to break.

Barack was unimpressed as the project meeting deferred yet another decision

WITWIT? Yep, it means “What if this? what if that?”. I made it up myself, as you can probably tell. There is also a further damaging introspection stage I like to call WITWOO (“What if this?”, “What Other Obstacles?”), but that is so silly and so contrived that I’m not going to mention it again. No, it’s not an April Fool’s Day post!

“WITWOO! Your project is dooooomed!”

So anyway, back to the project that is stuck in the thought process because everyone is terrified of making the wrong decision. I suppose in some ways, dealing with it depends on the deliverable of the project. If it’s something new, the TR (Terror Rating) is usually pretty high. This is because we’re dealing with a nebulous concept: we can’t see it, touch it or play with it. We don’t know what it can and can’t do, let alone how it can help deliver value to our organisation. We’ve seen demos and it looks cool, we signed off on that bit, but now how can this technology help the business grow?

Let’s start breaking down the problem then into more digestible chunks. I’ve spent a lot of time recently looking at things like Lean, Agile, Kanban, Scrum and Lean Coffee (look that one up, it’s interesting!). All of those things are frameworks, much like ITIL and PRINCE2. That means they aren’t prescriptive and you need to pick and choose which parts of the framework suit the needs of the project deliverable.

Agile type frameworks don’t need to represent software code as such – we’re talking about a product with features we want to consume. This could be anything – Office 365, in house software, even a drinks vending machine for heaven’s sake. We have this big “thing” in the distance, how do we get there in the best way? Keep it in your sights, but deliver smaller pieces quickly and start delivering value much quicker.

I’ve talked before about the Minimum Viable Product process and I’ve had folks rebut this argument with models such as RAT (Riskiest Assumption Test). Either way, if you start arguing about stuff like this, you’re totally missing the point. Similarly if you embrace a full Agile/Kanban method with daily standups, camp fires, burning joss sticks and rounds of “Kum Ba Yah” without having anything tangible to show for it other than “Hey! We do DevOps!”.

The DevOpsiest DevOps team in the world. Singing Kum ba yah. Possibly.

Frameworks are buffets – this means we can pick a bit of this, a bit of that and leave the rest because we have no use for it. Keep it as simple as you can to start relieving the paralysis log jam so it doesn’t make the problem bigger.

Start delivering value

To start with, using the MVP analogy, ask yourself “what is the minimum set of features this product must have on day one to start delivering value to the business?”. Going back to the earlier product deliverable itself, this could be:-

  • Office 365 Product – Just SharePoint Online
  • In house software – A login GUI for end users, linked to LDAP and a dashboard with one chart on it
  • A drinks vending machine – Sell cans of diet cola (any brand will do)

Already we know that if we’ve done our requirements capture properly, these features are just the tip of the iceberg. That being said, by producing this MVP, we now have something our customers can consume. Users can start to add SharePoint sites or login to the new in house software and see a dashboard with a key chart on it or go to the vending machine and buy a can of diet cola.

At this point, we can then talk to our customers and find out if the MVP delivers the initial deliverable and if not, how it might be changed. For example, in the drinks machine scenario, customers like that the machine is there, but would prefer a branded diet cola rather than the own brand version from a large cash and carry warehouse.

Each time we make a change or add a feature, we go back to the customer to find out their thoughts. We also commit to delivering changes regularly, such as weekly or bi-weekly. This is a key Agile concept. We don’t wait until the whole thing is finished before we let people consume it.

Customers would like to be able to buy bottled water and ice tea from the vending machine. Great! However, in the next week, we can only commit to getting one of these drinks in. How do we know which one to add?

Any project board or design authority board will have people on it with strong opinions. In fact, you should want this. In my experience, passive “passengers” sit and say nothing and only complain once something has been delivered and it’s harder to make changes. Members will also be passionate about things that don’t matter to other board members. Brand of cola versus bottled water, for example. How do we break this cycle? This is most likely to be the main bottleneck to progress.

Defining value

We need a framework for defining the value to the business of what the product is delivering. On the project board, for the next release, Gomez wants to stock Diet Pepsi instead of the unbranded diet cola and Morticia doesn’t drink cola but thinks bottled water is essential in any vending machine. Oh, and Gomez is the CTO and Morticia is the CEO.

“Get some water in. What do we say? Now!”

We can’t make both people happy, but we still have to keep delivering value. How can we do this? The best measure of anything is analytical, empirical and measurable. Take instinct, opinion and gut feeling out of it – everyone’s is different. By assigning a numeric value to proposed features, we can be dispassionate about the design decision and also demonstrate to stakeholders that we are delivering maximum value to the business.

Firstly, we need some criteria by which to score the proposed features. We know we want to add Diet Pepsi, bottled water and pre-mixed protein shakes (I forgot to mention the CFO is a gym rat). We have three “wants”, each requester is C-Level and we have to keep delivering maximum value to the business in the shortest amount of time.

“I need your clothes, your boots and your protein shake”

Keep the criteria definitions short and simple – between five and eight, I would say. This way you have enough criteria to give a well-defined score, but you also aren’t listing 20 criteria on which you have to decide. Using this principle, we arrive at the following criteria:-

  • Compatibility of product with vending machine slot
  • Delivery of product in 3-5 days from supplier
  • Availability of stock
  • Demand for product
  • Health benefits to staff

So we now have our criteria, but how do we score them? Remember, KISS! Not the glam rock band, but Keep It Simple, Stupid! One process that works well for me is a simple low, medium and high – so 25, 50 and 100. This way, the rating is easier to decide on, plus there is enough of a gap between the numbers to help decide the ordering of the deliverables.

“Keep it simple, or we’ll tour again. Mmm-kay?”

Based on both the criteria and the scoring values, we can now rate features for the next iteration.

                            Diet Pepsi   Bottled Water   Pre-Mixed Protein
  Machine Compatibility        100            100               25
  Item delivery lead time      100            100               25
  Availability of stock         50             50               25
  Product demand                50            100               25
  Health benefits               50            100              100
  TOTAL                        350            450              200
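The scoring and ranking process above is simple enough to sketch in a few lines of Python. This is purely illustrative – the feature names, criteria order and ratings are taken straight from the table, and the low/medium/high constants match the 25/50/100 scale described earlier:-

```python
# Low/medium/high ratings on the 25/50/100 scale from the post
LOW, MEDIUM, HIGH = 25, 50, 100

# Ratings per feature, in criteria order: machine compatibility,
# delivery lead time, stock availability, product demand, health benefits
features = {
    "Diet Pepsi":        [HIGH, HIGH, MEDIUM, MEDIUM, MEDIUM],
    "Bottled Water":     [HIGH, HIGH, MEDIUM, HIGH,   HIGH],
    "Pre-Mixed Protein": [LOW,  LOW,  LOW,    LOW,    HIGH],
}

# Total each feature's score, then rank highest first to get
# the delivery order for the next releases
totals = {name: sum(ratings) for name, ratings in features.items()}
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for name, total in ranked:
    print(f"{name}: {total}")
# Bottled Water: 450
# Diet Pepsi: 350
# Pre-Mixed Protein: 200
```

The ranked output gives the same delivery order we arrive at below: bottled water first, then Diet Pepsi, then the protein shakes.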

Once scoring is done, we can quickly see a clear set of priorities defined for the next three releases. But how did we arrive at these scores? Well, the following happened during discussions:-

  • The protein mix is in a large bottle that doesn’t fit in a standard slot (so scores low for compatibility)
  • Cola and water are in stock with the supplier, but the protein mix has to be ordered from the manufacturer
  • Cola and water are in stock, but there are not enough bottles to fill the machine to capacity, so some will need to be back ordered
  • An internal poll on the company intranet showed all respondents wanted to buy water, 65% also wanted to buy cola and 10% also wanted to buy protein, so we round the values accordingly
  • Water and protein are healthy and nutritious; diet cola, not so much (it can damage bones, apparently)
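The intranet poll result is the only criterion that needs translating from a raw number into a rating. A minimal sketch of that rounding step follows – note the 80% and 40% thresholds are my own illustrative assumptions, not something defined in the process above:-

```python
def rating_from_demand(percent: int) -> int:
    """Round a poll demand percentage to the low/medium/high scale.

    Thresholds (80% and 40%) are illustrative assumptions.
    """
    if percent >= 80:
        return 100  # high
    if percent >= 40:
        return 50   # medium
    return 25       # low

# Intranet poll results: 100% wanted water, 65% cola, 10% protein
demand_scores = {drink: rating_from_demand(pct) for drink, pct in
                 {"water": 100, "cola": 65, "protein": 10}.items()}
print(demand_scores)
# {'water': 100, 'cola': 50, 'protein': 25}
```

This gives the 100/50/25 demand row you see in the scoring table.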

So there you have it – now we have dispassionately produced an ordered list of features, we commit to delivering bottled water, Diet Pepsi and pre-mixed protein into the vending machine over the next three releases. Customers are informed of this, along with the dates when each item will be added (one item per week over three weeks).

What you should then find is that any arguments stop, because we have applied strict rules to our criteria and weighted them accordingly. Also, we know we want to stock all kinds of goodies in the vending machine, but that is a longer term goal. We have the machine, we want to stock it as quickly as possible and with products we know there is a clear demand for. A happy side effect of this process is we don’t waste time stocking a product nobody wants.

For example, we could add a new Marmite drink to the machine on the first day. It’s healthy, fits in a standard slot, is on special offer at the warehouse and can be with us tomorrow. It also tastes like an old sock, so nobody buys it, and the business is out of pocket on stock that doesn’t sell a single unit. By scoring demand properly, we make the best use of our time and keep waste to a minimum.

What Marmite tastes like.

Remember that the scoring does not reflect importance, it reflects business value and speed of delivery. You may do this for an IT project and find that DR scores lower than most other features. This is not because DR is not important, but more likely because there are other delays or constraints caused by factors such as licensing, compatibility and physical infrastructure. This means DR, while important as a deliverable, will take longer than enabling a new line of business SaaS application, for example.

Bringing it all together

I know that drinks vending machines have bugger all to do with IT projects, but the concepts and the constraints remain exactly the same. You have a big thing – Office 365 (think vending machine), you have things within it that people consume – Outlook, SharePoint, Yammer (think drinks types) and you need to deliver it to end customers as quickly as you can without second guessing which features they want. Let’s say you don’t add Teams because you already use Slack (you hipster, you!). We’d know this because Teams would have a low score.

In summary, the way to break out of analysis paralysis is to a) break the project down into small, deliverable chunks and b) use weightings, metrics and empirical data to define the decision making processes on what to deliver and when. This makes the whole process so much more visible.

Remember that most decisions are reversible, don’t be afraid to get it wrong once in a while and have an open culture within all members of the project board and teams. Finally, don’t embrace frameworks like a cult – choose what works for that project and put away the guitars and joss sticks!