05-01-18

05-01-18 : 6/7 Ain’t Bad : AWS Certified Big Data – Specialty Exam Tips

I’m pleased to say I just returned from sitting the AWS Certified Big Data Specialty exam and I managed to just about pass it first time. As always, I try and give some feedback to the community to help those who are planning on having a go themselves.

The exam itself is 65 questions over 170 minutes. In terms of difficulty, it’s definitely harder than the Associate level exams and in some cases, as tough as the Professional level exams. I didn’t feel particularly time constrained as with some other AWS exams as most of the questions are reasonably short (and a couple of them don’t make sense, meaning you need to take a best guess attempt at it).

In terms of preparation, I was lucky enough to be sent on the AWS Big Data course by my employer just before Christmas and it certainly helped but there was some exam content I didn’t remember the course covering. I also chose LinuxAcademy over A Cloud Guru, but really only for the reason that LA had hands on labs with its course and I don’t think ACG has them right now. There’s really no substitute for hands on lab to help understand a concept beyond the documentation.

I also use QwikLabs for hands on lab practice. There are a number of free labs you can use to help with some of the basics; for the more advanced labs, I’d recommend buying an Advantage Subscription, which allows you to take unlimited labs for a monthly charge. It’s about £40 if you’re in the UK, around $55 for US based folks. It might sound like a lot, but it’s cheaper than paying for an exam resit!

I won’t lie, Big Data is not my strong point and it’s also a topic I find quite dry, having been an infrastructure guy for 20 years or more. That being said, Big Data is a large part of the technology landscape we live in, and I always say a good architect knows a little bit about a lot of things.

As with other AWS exams, the questions are worded in a certain way. For example, “the most cost effective method”, “the most efficient method” or “the quickest method”. Maybe the latter examples are more subjective, but cost effectiveness usually wraps around S3 and Lambda as opposed to massive EMR and Redshift clusters, for example.

What should you focus on? Well, the exam blueprint is always a good place to start. Some of the objectives are a bit generic, but you should have a sound grasp of what all the products are, their architecture, and their design patterns and anti-patterns (i.e. when not to use them). From here, you should be able to weed out some of the clearly incorrect answers to give you a statistically better chance of picking the correct answer.

Topic wise, I’d advise focusing on the following:-

  • Kinesis (Streams, Firehose, Analytics, data ingestion and export to other AWS services, tuning)
  • DynamoDB (Performance tuning, partitioning, use patterns and anti-patterns, indexing)
  • S3 (Patterns and anti-patterns, IA/Glacier and lifecycling, partitioning)
  • Elastic MapReduce (Products used in conjunction and what they do – Spark, Hadoop, Zeppelin, Sqoop, Pig, Hive, etc.)
  • QuickSight (Use patterns/anti-patterns, chart types)
  • Redshift (Data ingestion, data export, slicing design, indexing, schema types)
  • Instance types (compute intensive, a smaller number of large instances vs a larger number of smaller instances)
  • Compression (performance, compression sizes)
  • Machine Learning (machine learning model types and when you’d use them)
  • IoT (understand the basics of AWS IoT architecture)
  • What services are multi-AZ and/or multi-region and how to work around geographic constraints
  • Data Import/Export (when to use, options)
  • Security (IAM, KMS, HSM, CloudTrail)
  • CloudWatch (log files, metrics, etc.)

As with many AWS exams, the topics seem very broad, so well worth knowing a little about all of the above, but certainly focus on EMR and Redshift as they are the bedrock products of Big Data. If you know them well, I’d say you’re half way there.

You may also find Re:Invent videos helpful, especially the Deep Dive ones at the 300 or 400 level. The exam is passable: if I can do it, anyone can! Hopefully this blog helped you out, as there doesn’t seem to be much information out there on the exam since it went GA.

Just the Networking Specialty to do now for the full set, hopefully I’ll get that done before my SA Professional expires in June!

 

29-08-17

AWS Certification – Changes To Resit Policies?

 

 

As I tweeted at the end of last week, I failed the AWS Advanced Networking exam on Friday and I was looking earlier to see when I could reschedule this and jump back on the horse. Originally when I first started sitting AWS exams back in the dark depths of December 2015, you could sit an exam three times before you had to wait 12 months to sit it again.

As you can imagine, sitting my SA Pro exam at the third time of asking was pressure enough but also to have that sword hanging over my head just made the situation practically unbearable. I’m pleased to note that when I logged into the Training and Certification portal this morning, the resit policy has been relaxed quite a bit. From three attempts in a single year, all exams now have the following terms :-

  • You can sit any AWS exam a total of 10 times (Initial sitting plus 9 retakes)
  • You must wait 14 days after any failed attempt before you can register for a resit
  • The maximum number of exam sittings in a 12 month period seems to have been removed
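
As a quick illustration of the terms above, here is a minimal sketch that works out the earliest date a resit could be booked. The function name and the example date are made up for illustration, and it assumes the 14 days are counted from the day of the failed attempt, which the portal text doesn't spell out.

```python
from datetime import date, timedelta
from typing import Optional

MAX_SITTINGS = 10       # initial sitting plus 9 retakes, per the policy above
RETAKE_WAIT_DAYS = 14   # wait after any failed attempt


def earliest_resit(failed_on: date, sittings_so_far: int) -> Optional[date]:
    """Return the first date a resit could be registered, or None if no sittings remain."""
    if sittings_so_far >= MAX_SITTINGS:
        return None
    # Assumption: the waiting period starts on the day of the failed attempt.
    return failed_on + timedelta(days=RETAKE_WAIT_DAYS)


# Example: a first attempt failed on 25 August 2017 (illustrative date).
print(earliest_resit(date(2017, 8, 25), sittings_so_far=1))
```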

This is a much better approach for test sitters and takes some of the pressure off. It also makes sense from AWS’s point of view as they can generate more revenues from exams now. I’m not sure when this policy changed (I quickly Googled it and found nothing), but it’s well worth knowing if you’re sitting any exams soon.

As regards the maximum sittings in a single year, if you need more than 10 attempts, it’s probably safe to say you should consider something a bit different. 😉

Screen grab from the T&C portal showing the new resit policy for all exams

 

24-03-17

Avoiding vendor lock-in in the public cloud

A little while back, I had a pretty frank discussion with a customer about vendor lock-in in the public cloud and he left me under no illusions that he saw cloud more as a threat than an opportunity. I did wonder if there had been some incident in the past that had left him feeling this way, but didn’t really feel it appropriate to probe into that much further.

Instead of dwelling on the negatives of this situation, we decided to accentuate the positives and try to formulate some advice on how best this risk could be mitigated. This was especially important as there was already a significant investment made by the business into public cloud deployments. It is an important issue though – it’s easy enough to get in, but how do you get out? There are several strategies you could use, I’m just going to call out a couple of them as an example.

To start with, back in the days of all on premises deployments, generally you would try and go for a “best of breed” approach. You have a business problem that needs a technical solution so you look at all the potential solutions and choose the best fit based on a number of requirements. Typically these include cost, scalability, support, existing skill sets and strength of the vendor in the market (Gartner Magic Quadrant, etc.). This applies equally in the public cloud – it’s still a product set in a technical solution so the perspective needn’t change all that much.

One potential strategy is to use the best of breed approach to look at all public cloud vendors (for the purpose of this article, I really just mean the “big three” of AWS, Azure and Google Cloud Platform). As you might expect, the best cost, support and deployment options for say SQL Server on Windows would probably be from Microsoft. In that case, you deploy that part of the solution in Azure.

Conversely, you may have a need for a CDN solution and decide that AWS CloudFront represents the best solution, so you build that part of your solution around that product. This way you are mitigating risk by spreading services across two vendors while still retaining the best of breed approach.

However, “doing the splits” is not always preferable. It’s two sets of skills, two lots of billing to deal with and two vendors to punch if anything goes badly wrong.

Another more pragmatic approach is to make open source technologies a key plank of your strategy. Products such as MySQL, Postgres, Linux, Docker, Java, .NET, Chef and Puppet are widely available on public cloud platforms and mean that any effort put into these technologies can be moved elsewhere if need be (even back on premises if you need to). Not only this, but skills in the market place are pretty commoditised now and mean that bringing in new staff to help with the deployments (or even using outside parties) is made easier and more cost effective.

You could go down the road of deploying a typical web application on AWS using Postgres, Linux, Chef, Docker and Java and if for any reason later this approach becomes too expensive or other issues occur, it’s far easier to pick up the data you’ve generated in these environments, walk over to a competitor, drop it down and carry on.

Obviously this masks some of the complexities of how that move would actually take place, such as timelines, cost and skills required, but it presents a sensible approach to stakeholders that provider migration has been considered and has been accounted for in the technical solution.

The stark reality is that whatever you are doing with technology, there will always be an element of vendor lock in. Obviously from a financial perspective there is a motive for them to do that, but also this comes of innovation when a new technology is created which adds new formats and data blobs to the landscape. The key to addressing this is taking a balanced view and being able to tell project stakeholders that you’re taking a best of breed approach based on requirements and you have built in safeguards in case issues occur in future that prompt a re-evaluation of the underlying provider.

 

07-03-17

What is the Cloud Shared Responsibility Model and why should I care?

When I have discussions with customers moving to a public cloud provider, one of the main topics of conversation (quite rightly) is security of services and servers in the cloud. Long discussions and whiteboarding take place where loads of boxes and arrows are drawn and in the end, the customer is confident about the long term security of their organisation’s assets when moving to Azure or AWS.

Almost as an aside, one of the questions I ask is how patching of VMs will be performed and a very common answer is “doesn’t Microsoft/AWS patch them for us?”. At this point I ask if they’ve heard of the Shared Responsibility Model and often the answer is “no”. So much so that I thought a quick blog post was in order to reinforce this point.

So then, what is the Shared Responsibility Model? Put simply, when you move services into a public cloud provider, you are responsible for some or most of the security and operational aspects of the server (tuning, anti-virus, backup, etc.) and your provider is responsible for services lower down the stack that you don’t have access to, such as the hypervisor host, physical racks, power and cooling.

That being said, there is a bit more to it than that, depending on whether or not we’re talking about IaaS, PaaS or SaaS. The ownership of responsibility can be thought of as a “sliding scale” depending on the service model. To illustrate what I mean, take a look at the diagram below, helpfully stolen from Microsoft (thanks, boys!).

Reading the diagram from left to right, you can see that in the left most column where all services are hosted on prem, it is entirely the responsibility of the customer to provide all of the security characteristics. There is no cloud provider involved and you are responsible for racking, stacking, cooling, patching, cabling, IAM and networking.

As we move right to the IaaS column, you can see subtle shades of grey emerging (quite literally) as with IaaS, you’re hosting virtual machines in a public cloud provider such as Azure or AWS. The provider is responsible for DC and rack security and some of the host infrastructure (for example, cloud providers patch the host on your behalf), but your responsibility is to ensure that workloads are effectively spread across hosts in appropriate fault and update domains for continuity of service.

Note however that in the IaaS model, as you the customer are pretty much responsible for everything from the guest upwards, it’s down to you to configure IAM, endpoint security and keep up to date with security patches. This is where a lot of the confusion creeps in. Your cloud provider is not on the hook if you fail to patch and properly secure your VMs (including network and external access). Every IaaS project requires a patching and security strategy to be baked in from day one and not retrofitted. This may mean extending on prem AD and WSUS for IAM and patching, to leverage existing processes. This is fine and will work, you don’t necessarily need to reinvent the wheel here. Plus if you re-use existing processes, it may shorten any formal on boarding of the project with Service Management.

Carrying on across the matrix to the next column on the right is the PaaS model. In this model, you are consuming pre-built features from a cloud provider. This is most commonly database services such as SQL Server or MySQL but also includes pre-built web environments such as Elastic Beanstalk in AWS. Because you are paying for a sliver of a larger, multi-tenant service, your provider will handle more layers of the lower stack, including the virtual machines the database engine is running on as well as the database engine itself. Typically in this example, the customer does not have any access to the underlying virtual machine either via SSH or RDP, as with IaaS.

However, as the matrix shows, there is still a level of responsibility on the customer (though the operational burden is reduced). In the case of Database PaaS, the customer is still in charge of backing up and securing (i.e. encryption and identity and access management) the data. This is not the responsibility of the provider with the exception of logical isolation from other tenants and the physical security of the hardware involved.

Finally, in the far right column is the SaaS model. The goal of this model is for the customer to obtain a service with as little administrative/operational overhead as possible. As shown in the matrix, the provider is responsible for everything in the stack from the application down, including networking, backup, patching, availability and physical security. IAM functions are shared as most SaaS is multi-tenant, so the provider must enforce isolation (in the same way as PaaS) and the customer must ensure only authorised personnel can access the SaaS solution.

You will note that endpoint security is also classed as a shared responsibility. Taking Office 365 as an example, Microsoft provide security tools such as anti-virus scanning and data loss prevention controls, but it is up to the customer to configure these to suit their use case. Microsoft’s responsibility ends with providing the service and the customer’s starts with turning the knobs to make it work to their taste. You will also notice that as in all other cases, it is solely the customer’s responsibility to ensure the classification and accountability of the data. This is not the same as the reliability of the services beneath it (networking, storage and compute) as this is addressed in the lower layers of the model.
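
To make the sliding scale a bit more concrete, here is a rough sketch of the model as a lookup table. The layer names and the three-way customer/provider/shared split are my own simplification of the description above, not an official matrix from Microsoft or AWS, and only layers the text actually discusses for every column are included.

```python
# Simplified, illustrative view of the Shared Responsibility Model.
# Layer names and ownership values are assumptions based on the prose above.
CUSTOMER, PROVIDER, SHARED = "customer", "provider", "shared"
MODELS = ("on-prem", "IaaS", "PaaS", "SaaS")

# Each tuple lists the owner per model, in the same order as MODELS.
RESPONSIBILITY = {
    "physical security": (CUSTOMER, PROVIDER, PROVIDER, PROVIDER),
    "host infrastructure": (CUSTOMER, SHARED, PROVIDER, PROVIDER),
    "guest OS patching": (CUSTOMER, CUSTOMER, PROVIDER, PROVIDER),
    "identity and access management": (CUSTOMER, CUSTOMER, SHARED, SHARED),
    "data classification and accountability": (CUSTOMER, CUSTOMER, CUSTOMER, CUSTOMER),
}


def customer_duties(model: str) -> list:
    """Return the layers the customer owns, fully or partly, for a service model."""
    column = MODELS.index(model)  # raises ValueError for an unknown model
    return [layer for layer, owners in RESPONSIBILITY.items()
            if owners[column] != PROVIDER]


for model in MODELS:
    print(model, "->", customer_duties(model))
```

The point to take away is that the customer's list never becomes empty, whichever column you pick.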

I hope this article provides a bit of clarity on what the Shared Responsibility Model is and why you should care. Please don’t assume that just because you’re “going cloud” that a lot of these issues will go away. Get yourself some sound and trusted advice and make sure this model is accounted for in your project plan.

For your further reading pleasure, I have included links below to documentation explaining providers’ stances and implementation of the model :-

As always, any questions or comments on this post can be left below or feel free to ping me on Twitter @ChrisBeckett

30-01-17

AWS Specialty Beta Exams – Feedback and Tips

 


 

At the end of last week, I completed all three new AWS beta “specialty” exams. For those not aware, AWS are bringing in three new certifications to complement the existing five that have been around for a while. The new exams focus on specific technology areas: Networking, Security and Big Data.

There was a special offer running during the beta in that the exams were half the usual price, plus a free resit if you don’t get past them the first time. It’s difficult to say at what level these are pitched – in general, a lot of the content seemed “Pro” level to me, certainly you need to know a lot more than the Associate exams.

The exams themselves were the Pro length 170 minutes, with varying numbers of questions. The Networking exam had something like 130, the Security I think was 106 and the Big Data 100. The questions were typical wordy type AWS questions with some of the usual favourite key words such as “resilient” and “cost optimal”. Certainly from a format perspective, there’s nothing really new here. Of the three, I think I did best on the Security exam, followed by a borderline Networking exam and Big Data trailing in a very distant last. There were a lot of terms in that exam I’d never even heard of before!

Results are due at the end of March which is when the beta collation period ends. I have no expectation on the Networking and Big Data exams, but then again you never know how these things are going to be scored and evaluated. The Security exam I felt went quite well, but who knows?

With respect to the content, these were the key takeaway areas :-

Networking

  • Direct Connect – tons of questions on this.
  • VPNs
  • VPC, including peering – what is and isn’t possible (between accounts in the same region, etc.)
  • BGP – including prepending and MED
  • Routing – both static and dynamic
  • Routing tables
  • Route propagation
  • DHCP Option Sets
  • NAT (gateways and instances)
  • S3 Endpoints
  • CloudFront
  • Jumbo Frames
  • Network optimised instances

Security

  • IAM (users, groups, roles, policies)
  • Encryption (data in flight and at rest – disk encryption, VPN etc)
  • Database encryption (TDE, disk encryption)
  • KMS
  • CloudHSM
  • CloudTrail and CloudWatch
  • Federation with other sources, SAML, Cognito etc
  • AssumeRole and how that works
  • Tagging
  • S3 (versioning, MFA delete)
  • IAM Access keys

Big Data

  • EMR
  • Redshift (including loading and unloading data to S3, performance issues loading files, Avro, gzip etc.)
  • Pig
  • Hive
  • Hadoop
  • iPython
  • Zeppelin
  • DynamoDB (make sure you understand partitioning, performance issues and indexes – global and local)
  • QuickSight
  • Machine Learning (including models)
  • RDS
  • Lambda
  • S3 ETags
  • Kinesis
  • Elasticsearch
  • Kibana
  • IoT
  • API Gateway
  • Kafka
  • Encryption (TDE, CloudHSM, KMS)

As you can see from above, the focus may be relatively narrow, but you do need to understand things pretty well. I wouldn’t say the exam questions go into great depth, but you certainly need to know each of the topics listed above and really what they can and can’t do. From there, you should be able to work out what you think is the right answer.

So now we wait until the end of March, I expect and am prepared to sit all three again as we continue on the never ending treadmill that is IT certification 😉

Study materials included acloud.guru as usual and also the AWS YouTube channel. The Re:Invent 300 and 400 level videos are really good preparation for the exams as they go into some decent depth.

Any comments or questions, please feel free to hit me up on Twitter.

 

31-12-16

AWS Certified DevOps Engineer Professional – Exam Experience & Tips


I managed to find the time yesterday to sit the above exam before the end of the year to reach my goal of holding all five current AWS certifications. There isn’t a lot out there about this exam, so as usual I thought I would try to pass on the benefit of my experiences for others planning to sit this one.

The exam is 80 questions over 170 minutes. I finished with about 20 minutes to spare and passed barely with a 66%, but as we always say – a pass is a pass! Looking back over the score report, there are four domains tested in the exam:-

  • Domain 1: Continuous Delivery and Process Automation
  • Domain 2: Monitoring, Metrics, and Logging
  • Domain 3: Security, Governance, and Validation
  • Domain 4: High Availability and Elasticity

I managed to score really well on domains 1, 3 and 4 (between 75% and 85%), but really bombed on domain 2, which really surprised me. This domain focusses mainly on CloudWatch, so it goes without saying that I didn’t know it as well as I thought I did!

Like all the other AWS exams, the questions are worded in a very specific way, and it can take time to read and re-read the questions to truly understand what is being asked. I wouldn’t worry too much about time running out, some of the questions are quite short but you need to look for key words in the questions – such as “cost-effective”, “fault tolerant” and “efficient”. This can help you rule out the obviously incorrect answers.

In terms of what you need to know, I’d say the following :-

  • Domain 1: CloudFormation (templates, custom resources), OpsWorks (lifecycles), Elastic Beanstalk (platform support, scaling, Docker), SQS, SNS, Data Pipeline (I was surprised to see this feature in the exam as I figured it was being phased out in favour of Lambda), SWF, bootstrapping
  • Domain 2: CloudWatch, CloudTrail (what it can and can’t do), CloudWatch Logs (Log streams, Log filters, Log agent), EMR
  • Domain 3: IAM (Roles, users, STS, AssumeRole(s))
  • Domain 4: Load balancing, auto scaling, EC2, S3, Glacier, EBS, RDS,  DynamoDB, instance types

As for study materials, use your AWS account and the free tier entitlement to muck around with all the services. There are loads of walkthroughs in the documentation and provided you don’t leave massive instances running 24/7 it should only cost you pennies to use.

The A Cloud Guru course is well worth the investment of time and money – Adrian and Nick do a great job of taking you through most of what you need to know for the exam. I did find that there wasn’t as much DynamoDB content on the exam as I was expecting, not that I’m complaining because a lot of how it works still really mashes my head!

There are lots of good videos on YouTube, from Re:Invent conferences from years gone by which go into a lot of depth. I can also recommend Ian Massingham’s CloudFormation Masterclass video as a good refresher/primer for CF.

Difficulty wise, it’s definitely a tough exam, don’t let anyone tell you otherwise. 80 questions is a lot and many of them are very verbose in both the question and the answers. I’d say it’s not as tough as the Solutions Architect Pro as it doesn’t cover as broad a range of topics, but you can’t really wing it.

I hope this article helps anyone doing this exam any time soon. I’m going to enjoy being part of the “All 5” club for as long as it lasts (the three “Specialty” exams are coming up early next year, I’ve registered to sit all the betas).


19-08-16

AWS : Keeping up with the changes


As we all know, working in the public cloud space means changes in the blink of an eye. Services are added, updated (and in some cases, removed) at short notice and it’s vital from not just a Solutions Architect’s perspective but from an end user or operational standpoint that we keep up to date with these announcements, as and when they happen.

In days of old, we’d keep an eye on a vendor’s annual conference when they’d reveal something cool in their keynote, with a release on that day or to follow shortly after. In the public cloud, innovation happens much quicker and it’s no longer a case of waiting for “Geek’s Christmas”.

To that end, today I was pointed towards the AWS “What’s New” blog, which in essence is a change log for AWS services. Yesterday alone lists 8 announcements or service updates.

It’s a site well worth bookmarking and reviewing on a regular basis, I’d suggest weekly if you have time. If you’re designing AWS infrastructures or running your business on AWS, you need to know what’s on the roadmap so you can plan accordingly.

You can visit the What’s New blog site here.

 

16-08-16

AWS Certified Solutions Architect Professional – Study Guide – Domain 8.0: Cloud Migration and Hybrid Architecture (10%)


The final part of the study guide is below – thanks to all those who have tuned in over the past few weeks and given some very positive feedback. I hope it helps (or has helped) you get into the Solutions Architect Pro club. It’s a tough exam to pass and the feeling of achievement is immense. Good luck!

8.1 Plan and execute for applications migrations

  • AWS Management Portal available to plug AWS infrastructure into vCenter. This uses a virtual appliance and can enable migration of vSphere workloads into AWS
  • Right click on VM and select “Migrate to EC2”
  • You then select region, environment, subnet, instance type, security group, private IP address
  • Use cases:-
    • Migrate VMs to EC2 (VM must be powered off and configured for DHCP)
    • Reach new regions from vCenter to use for DR etc
    • Self service AWS portal in vCenter
    • Create new EC2 instances using VM templates
  • The inventory view is presented as :-
    • Region
      • Environment (family of templates and subnets in AWS)
        • Template (prototype for EC2 instance)
          • Running instance
            • Folder for storing migrated VMs
  • Templates map to AMIs and can be used to let admins pick a type for their deployment
  • Storage Gateway can be used as a migration tool
    • Gateway cached volumes (block based iSCSI)
    • Gateway stored volumes (block based iSCSI)
    • Virtual tape library (iSCSI based VTL)
    • Takes snapshots of mounted iSCSI volumes and replicates them via HTTPS to AWS. From here they are stored in S3 as snapshots and then you can mount them as EBS volumes
    • It is recommended to get a consistent snapshot of the VM by powering it off, taking a VM snapshot and then replicating this
  • AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
  • AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos
  • Pipeline has the following concepts:-
    • Pipeline (container node that is made up of the items below, can run on either EC2 instance or EMR node which are provisioned automatically by DP)
    • Datanode (end point destination, such as S3 bucket)
    • Activity (job kicked off by DP, such as database dump, command line script)
    • Precondition (readiness check optionally associated with data source or activity. Activity will not be done if check fails. Standard and custom preconditions available- DynamoDBTableExists, DynamoDBDataExists, S3KeyExists, S3PrefixExists, ShellCommandPrecondition)
    • Schedule
  • Pipelines can also be used with on premises resources such as databases etc
  • Task Runner package is installed on the on premises resource to poll the Data Pipeline queue for work to do (database dump etc, copy to S3)
  • Much of the functionality has been replaced by Lambda
  • Setup logging to S3 so you can troubleshoot it
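
To show how the Data Pipeline concepts above hang together, here is a minimal sketch in plain Python. It is not the Data Pipeline API: the `Activity` class, `run_pipeline` and `key_exists` are hypothetical names invented for illustration, and the precondition only imitates something like S3KeyExists by checking an in-memory set rather than querying S3.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Activity:
    """A job the pipeline kicks off, guarded by optional readiness checks."""
    name: str
    action: Callable[[], str]
    preconditions: List[Callable[[], bool]] = field(default_factory=list)


def run_pipeline(activities: List[Activity]) -> List[str]:
    """Run each activity whose preconditions all pass; skip the rest."""
    results = []
    for activity in activities:
        if all(check() for check in activity.preconditions):
            results.append(activity.action())
        else:
            results.append(f"{activity.name}: skipped (precondition failed)")
    return results


# Stand-in for a precondition such as S3KeyExists; a real one would query S3.
existing_keys = {"input/dump.sql"}


def key_exists(key: str) -> Callable[[], bool]:
    return lambda: key in existing_keys


pipeline = [
    Activity("copy-dump", lambda: "copy-dump: done", [key_exists("input/dump.sql")]),
    Activity("load-report", lambda: "load-report: done", [key_exists("input/report.csv")]),
]
results = run_pipeline(pipeline)
print(results)
```

The second activity is skipped because its input key is missing, which mirrors the "activity will not be done if check fails" behaviour described in the list.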

8.2 Demonstrate ability to design hybrid cloud architectures

  • Biggest CIDR block you can have is a /16 and smallest is /28 for reservations
  • First four IP addresses and last one are reserved by AWS – always 5 reserved
    • 10.0.0.0 – Network address
    • 10.0.0.1 – Reserved for VPC router
    • 10.0.0.2 – Reserved by AWS for DNS services
    • 10.0.0.3 – Reserved by AWS for future use
    • 10.0.0.255 – Reserved for network broadcast. Network broadcast not supported in a VPC, so this is reserved
  • When migrating to Direct Connect from a VPN, make the VPN connection and Direct Connect connection(s) part of the same BGP area. Then configure the VPN to have a higher cost than the Direct Connect connection. AS path prepending will do this, as BGP prefers the route with the shortest AS path: a path containing a single ASN is preferred over one containing three or four
  • For applications that require multicast, you need to configure a VPN between the EC2 instances with in-instance software, so the underlying AWS infrastructure is not aware of it. Multicast is not supported by AWS
  • VPN network must be a different CIDR block than the underlying instances are using (for example 10.x address for EC2 instances and 172.16.x addresses for VPN connection to another VPC)
  • SQL can be migrated by exporting database as flat files from SQL Management Studio, can’t replicate to another region or from on premises to AWS
  • CloudSearch can index documents stored in S3 and is powered by Apache Solr
    • Full text search
    • Drill down searching
    • Highlighting
    • Boolean search
    • Autocomplete
    • CSV, PDF, HTML, Office docs and text files supported
  • Can also search DynamoDB with CloudSearch
  • CloudSearch can automatically scale based on load or can be manually scaled ahead of expected load increase
  • Multi-AZ is supported and it’s basically a service hosted on EC2, and these are how the costs are derived
  • EMR can be used to run batch processing jobs, such as filtering log files and putting results into S3
  • EMR uses Hadoop which uses HDFS, a distributed file system across all nodes in the cluster where there are multiple copies of the data, meaning resilience of the data and also enables parallel processing across multiple nodes
  • Hive is used to perform SQL like queries on the data in Hadoop, uses simple syntax to process large data sets
  • Pig is used to write MapReduce programs
  • EMR cluster has three components:-
    • Master node (manages data distribution)
    • Core node (stores data on HDFS from tasks run by task nodes and are managed by the master node)
    • Task nodes (managed by the master node and perform processing tasks only, do not form part of HDFS and pass processed data back to core nodes for storage)
  • EMRFS can be used to output data to S3 instead of HDFS
  • Can use spot, on demand or reserved instances for EMR cluster nodes
  • S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. You use S3DistCp by adding it as a step in a cluster or at the command line. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster
  • Larger data files are more efficient than smaller ones in EMR
  • Storing data persistently on S3 may well be cheaper than leveraging HDFS, as large data sets will require large instance sizes in the EMR cluster
  • Smaller EMR cluster with larger nodes may be just as efficient but more cost effective
  • Try to complete jobs within 59 minutes to save money (EMR billed by hour)
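
To make the 59 minute tip concrete, here is a minimal sketch of the arithmetic, assuming every started hour is billed in full as the note above describes. The function name, node count and hourly rate are hypothetical figures chosen for illustration, not real AWS prices.

```python
import math


def cluster_cost(runtime_minutes, node_count, hourly_rate_per_node):
    """Cost of a job when every started hour is billed in full (assumption)."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * node_count * hourly_rate_per_node


# Hypothetical figures: 10 nodes at $0.25 per node-hour.
under_the_hour = cluster_cost(59, node_count=10, hourly_rate_per_node=0.25)
over_the_hour = cluster_cost(61, node_count=10, hourly_rate_per_node=0.25)

print(under_the_hour)  # 2.5
print(over_the_hour)   # 5.0
```

Under that assumption, two extra minutes of runtime double the bill, which is why trimming a job to finish just inside the hour can be worth the effort.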

15-08-16

QwikLabs Competition Winner

Just a quick post today to say thanks to everyone who entered the QwikLabs competition and as promised, we have a winner! The random number generator picked out Hardik Mistry and he has already unwrapped his prize! Thanks again to QwikLabs for the token and for their support. If you haven’t yet swung by their site, I highly recommend it.

09-08-16

AWS Certified Solutions Architect Professional – Study Guide – Domain 7.0: Scalability and Elasticity (15%)

7.1 Demonstrate the ability to design a loosely coupled system

  • Amazon CloudFront is a web service (CDN) that speeds up distribution of your static and dynamic web content, for example, .html, .css, .php, image, and media files, to end users. CloudFront delivers your content through a worldwide network of edge locations. When an end user requests content that you’re serving with CloudFront, the user is routed to the edge location that provides the lowest latency, so content is delivered with the best possible performance. If the content is already in that edge location, CloudFront delivers it immediately. If the content is not currently in that edge location, CloudFront retrieves it from an Amazon S3 bucket or an HTTP server (for example, a web server) that you have identified as the source for the definitive version of your content.
  • CloudFront has two aspects – origin and distribution. You create a distribution and link it to an origin, such as S3, an EC2 instance, existing website etc
  • Two types of distributions, web and RTMP
  • Geo restrictions can be used to whitelist or blacklist traffic from specific countries, controlling which countries can access the distribution
  • GET, HEAD, PUT, POST, PATCH, DELETE and OPTIONS HTTP commands supported
  • Allowed methods are what CloudFront will pass on to the origin server. If you do not need to modify content, consider not allowing PUT, POST, PATCH, DELETE to ensure users do not modify content
  • CloudFront does not cache responses to POST, PUT, DELETE and PATCH requests. You can POST content to an Edge location and this is then sent on to the origin server
  • SSL can be used to provide HTTPS. Can either use CloudFront’s own certificate or use your own
    • To support older browsers, a dedicated IP custom SSL certificate is needed at each edge location, which can be very expensive
    • SNI (Server Name Indication) custom SSL certs can be used by adding all hostnames behind the certificate but it is presented as a single IP address. Uses SNI extensions in newer browsers
  • 100 CNAME aliases per distribution, can use wildcard CNAMEs
  • Use Invalidation Requests to forcibly remove content from Edge locations. Need to use API call to do this or do it from the console, or set a TTL on the content
  • Alias records can be used to map a friendly name to a CloudFront URL (Route 53 supports this). Supports zone apex entry (name without www, such as example.com). DNS records for the same name must have the same routing type (simple, weighted, latency, etc) or you will get an error in the console
  • Alias records can then have “evaluate target” set to yes so that existing health checks are used to ensure the underlying resources are up before sending traffic onwards. If a health check for the underlying resource does not exist, evaluate target settings have no effect
  • AWS doesn’t charge for mapping alias records to CloudFront distributions
  • CloudFront supports dynamic web content using cookies to forward on to the origin server
  • Forward query strings passes the whole URL to the origin if configured in CloudFront, but only for a web server or application as S3 does not support this feature
  • Cookie values can then be logged into CloudFront access logs
  • CloudFront can be used to proxy upload requests back to the origin to speed up data transfers
  • Use a zero value TTL for dynamic content
  • Different URL patterns can send traffic to different origins
  • Whitelist certain HTTP headers such as cloudfront-viewer-country so that locale details can be passed through to the web server for custom content
  • Device detection can serve different content based on the User Agent string in the header request
  • Invalidating objects removes them from CloudFront edge caches. A faster and less expensive method is to use versioned object or directory names
  • Enable access logs in CloudFront and then send them to an S3 bucket. EMR can be used to analyse the logs
  • Signed URLs can be used to provide time limited access or access to private content on CloudFront. Signed cookies can be used to limit secure access to certain parts of the site. Use cases are signed URLs for a marketing e-mail and signed cookies for web site streaming or whole site authentication
  • Cache-control max-age header will be sent to browser to control how long the content is in the local browser cache for, can help improve delivery, especially of static items
  • If-modified-since will allow the browser to send a request for content only if it is newer than the modification date specified in the request. If the content has not changed, content is pulled from the browser cache
  • Set a low TTL for dynamic content as most content can be cached even if it’s only for a few seconds. CloudFront can also present stale data if TTL is long
  • Popular Objects report and cache statistics can help you tune CloudFront behaviour
  • Only forward cookies that are used to vary or tailor user based content
  • Use Smooth Streaming on a web distribution for live streaming using Microsoft technology
  • RTMP is true media streaming, whereas progressive download delivers the file in chunks to, say, a mobile device. RTMP is Flash only
  • Supports existing WAF policies
  • You can create custom error response pages
  • Two ElastiCache engines available – Redis and Memcached. Exam will give scenarios and you must select the most appropriate
  • As a rule of thumb, simple caching is done by memcached and complex caching is done by Redis
  • Only Redis is multi-AZ and has backup and restore and persistence capabilities, sorting, publisher/subscriber, failover
  • Redis can act as a persistent key store as well as a caching engine
  • Redis has backup and restore and automatic failover and is best used for frequently changing data in a complex scale
  • Redis doesn’t need a database to back it as Memcached does
  • Leaderboards are a good use case for Redis
  • Redis can be configured to use an Append Only File (AOF) that will repopulate the cache in case all nodes are lost and cache is cleared. This is disabled by default. AOF is like a replay log
  • Redis has a primary node and read only nodes. If the primary fails, a read only node is promoted to primary. Writes done to primary node, reads done from read replicas (asynchronous replication)
  • Redis snapshots are used to increase the size of nodes. This is not the same as EC2 snapshots, the snapshot creates a new node based on the snapshot and size is picked when launching
  • Redis can be configured to back up automatically each day in a backup window, or you can take manual snapshots. Automatic backups have retention limits, manual snapshots don’t
  • Memcached can scale horizontally and is multi-threaded, supports sharding
  • Memcached uses lazy loading, so if an app doesn’t get a hit from the cache, it requests it from the DB and then puts that into cache. Write through updates the cache when the database is updated
  • TTL can be used to expire out stale or unread data from the cache
  • Memcached does not maintain its own data persistence, the database does this. Scale by adding more nodes to a cluster
  • Vertically scaling Memcached nodes requires standing up a new cluster of the required instance sizes/types. All instances in a cluster are the same type
  • Single endpoint for all Memcached nodes
  • Put Memcached nodes in different AZs
  • Memcached nodes are empty when first provisioned, bear this in mind when scaling out as this will affect cache performance while the nodes warm up
  • For low latency applications, place Memcached clusters in the same AZ as the application stack. More configuration and management but better performance
  • When deciding between Memcached and Redis, here are a few questions to consider:
    • Is object caching your primary goal, for example to offload your database? If so, use Memcached.
    • Are you interested in as simple a caching model as possible? If so, use Memcached.
    • Are you planning on running large cache nodes, and require multithreaded performance with utilization of multiple cores? If so, use Memcached.
    • Do you want the ability to scale your cache horizontally as you grow? If so, use Memcached.
    • Does your app need to atomically increment or decrement counters? If so, use either Redis or Memcached.
    • Are you looking for more advanced data types, such as lists, hashes, and sets? If so, use Redis.
    • Does sorting and ranking datasets in memory help you, such as with leaderboards? If so, use Redis.
    • Are publish and subscribe (pub/sub) capabilities of use to your application? If so, use Redis.
    • Is persistence of your key store important? If so, use Redis.
    • Do you want to run in multiple AWS Availability Zones (Multi-AZ) with failover? If so, use Redis.
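
The lazy loading, write through and TTL ideas from the Memcached notes above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeDatabase` and `Cache` are hypothetical in-memory stand-ins, not the ElastiCache, Memcached or Redis client APIs.

```python
import time


class FakeDatabase:
    """Stand-in for the backing database (hypothetical, in-memory)."""

    def __init__(self):
        self.rows = {}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value


class Cache:
    """Tiny TTL cache standing in for a cache node (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.items = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.items[key]  # TTL passed, expire the stale entry
            return None
        return value

    def set(self, key, value):
        self.items[key] = (value, self.clock() + self.ttl)


def read_lazy(cache, db, key):
    """Lazy loading: only go to the database on a cache miss."""
    value = cache.get(key)
    if value is None:
        value = db.get(key)
        if value is not None:
            cache.set(key, value)
    return value


def write_through(cache, db, key, value):
    """Write through: update the cache whenever the database is updated."""
    db.put(key, value)
    cache.set(key, value)


db = FakeDatabase()
cache = Cache(ttl_seconds=60)
db.put("user:1", "alice")
read_lazy(cache, db, "user:1")  # miss, so the database is read
read_lazy(cache, db, "user:1")  # hit, served from the cache
print(db.reads)  # 1
```

The second read never touches the database, which is the offloading effect the notes describe. A freshly added node starts with an empty `items` dictionary, which is why scaling out costs some cache performance until the nodes warm up.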
  • Amazon Kinesis is a managed service that scales elastically for real-time processing of streaming data at a massive scale. The service collects large streams of data records that can then be consumed in real time by multiple data-processing applications that can be run on Amazon EC2 instances.
  • You’ll create data-processing applications, known as Amazon Kinesis Streams applications. A typical Amazon Kinesis Streams application reads data from an Amazon Kinesis stream as data records. These applications can use the Amazon Kinesis Client Library, and they can run on Amazon EC2 instances. The processed records can be sent to dashboards, used to generate alerts, dynamically change pricing and advertising strategies, or send data to a variety of other AWS services. The PutRecord command is used to put data into a stream
  • Data is stored in Kinesis for 24 hours, but this can go up to 7 days
  • You can use Streams for rapid and continuous data intake and aggregation. The type of data used includes IT infrastructure log data, application logs, social media, market data feeds, and web clickstream data. Because the response time for the data intake and processing is in real time, the processing is typically lightweight
  • The following are typical scenarios for using Streams
    • Accelerated log and data feed intake and processing
    • Real-time metrics and reporting
    • Real-time data analytics
    • Complex stream processing
  • An Amazon Kinesis stream is an ordered sequence of data records. Each record in the stream has a sequence number that is assigned by Streams. The data records in the stream are distributed into shards
  • A data record is the unit of data stored in an Amazon Kinesis stream. Data records are composed of a sequence number, partition key, and data blob, which is an immutable sequence of bytes. Streams does not inspect, interpret, or change the data in the blob in any way. A data blob can be up to 1 MB
  • Retention Period is the length of time data records are accessible after they are added to the stream. A stream’s retention period is set to a default of 24 hours after creation. You can increase the retention period up to 168 hours (7 days) using the IncreaseRetentionPeriod operation
  • A partition key is used to group data by shard within a stream
  • Each data record has a unique sequence number. The sequence number is assigned by Streams after you write to the stream with client.putRecords or client.putRecord
  • In summary, a record has three things:-
    • Sequence number
    • Partition key
    • Data BLOB
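
As a rough illustration of how a partition key groups records by shard: Streams hashes the partition key with MD5 to a 128-bit integer and places the record in the shard whose hash key range contains that value. The sketch below assumes the hash space is split evenly between shards, and the function names and sample keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def shard_ranges(shard_count):
    """Split the 128-bit hash key space into equal, contiguous ranges."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside every shard range")


ranges = shard_ranges(4)
for key in ["web-01", "web-02", "web-01"]:
    print(key, "-> shard", shard_for(key, ranges))
```

Records with the same partition key always land in the same shard, so a key that is too popular can overload a single shard while the others sit idle.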
  • Producers put records into Amazon Kinesis Streams. For example, a web server sending log data to a stream is a producer
  • Consumers get records from Amazon Kinesis Streams and process them. These consumers are known as Amazon Kinesis Streams Applications
  • An Amazon Kinesis Streams application is a consumer of a stream that commonly runs on a fleet of EC2 instances
  • A shard is a uniquely identified group of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity
  • Once a stream is created, you can add data to it in the form of records. A record is a data structure that contains the data to be processed in the form of a data blob. After you store the data in the record, Streams does not inspect, interpret, or change the data in any way. Each record also has an associated sequence number and partition key
  • There are two different operations in the Streams API that add data to a stream, PutRecords and PutRecord. The PutRecords operation sends multiple records to your stream per HTTP request, and the singular PutRecord operation sends records to your stream one at a time (a separate HTTP request is required for each record). You should prefer using PutRecords for most applications because it will achieve higher throughput per data producer
  • An Amazon Kinesis Streams producer is any application that puts user data records into an Amazon Kinesis stream (also called data ingestion). The Amazon Kinesis Producer Library (KPL) simplifies producer application development, allowing developers to achieve high write throughput to an Amazon Kinesis stream.
  • You can monitor the KPL with Amazon CloudWatch
  • The agent is a stand-alone Java software application that offers an easier way to collect and ingest data into Streams. The agent continuously monitors a set of log files and sends new data records to your Amazon Kinesis stream. By default, records within each file are determined by a new line, but can also be configured to handle multi-line records. The agent handles file rotation, checkpointing, and retry upon failures. It delivers all of your data in a reliable, timely, and simple manner. It also emits CloudWatch metrics to help you better monitor and troubleshoot the streaming process.
  • You can install the agent on Linux-based server environments such as web servers, front ends, log servers, and database servers. After installing, configure the agent by specifying the log files to monitor and the Amazon Kinesis stream names. After it is configured, the agent durably collects data from the log files and reliably submits the data to the Amazon Kinesis stream
  • SNS is Simple Notification Service – a publisher creates a topic and subscribers then receive the messages published to that topic. These can be pushed to Android, iOS, etc
  • Use SNS to send push notifications to desktops, Amazon Device Messaging, Apple Push for iOS and OSX, Baidu, Google Cloud for Android, MS push for Windows Phone and Windows Push notification services
  • Steps to create mobile push:-
    • Request credentials from mobile platforms
    • Request token from mobile platforms
    • Create platform application object
    • Publish message to mobile endpoint
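
The last two steps above might look something like this with boto3. It is a hedged sketch: the function name and arguments are hypothetical, `sns` is assumed to behave like `boto3.client("sns")`, and the credential and device token are assumed to have been obtained from the mobile platform already. Registering the device token as an endpoint is an extra call the API needs before it can publish. A single `PlatformCredential` attribute suits a platform such as GCM; APNS also needs a `PlatformPrincipal`.

```python
def send_mobile_push(sns, app_name, platform, credential, device_token, message):
    """Walk through the mobile push steps with an SNS client.

    `sns` is expected to behave like boto3.client("sns") (assumption).
    """
    # Create the platform application object from the platform credentials
    app = sns.create_platform_application(
        Name=app_name,
        Platform=platform,  # for example "GCM"
        Attributes={"PlatformCredential": credential},
    )
    # Register the device token as an endpoint of that application
    endpoint = sns.create_platform_endpoint(
        PlatformApplicationArn=app["PlatformApplicationArn"],
        Token=device_token,
    )
    # Publish the message to the mobile endpoint
    response = sns.publish(TargetArn=endpoint["EndpointArn"], Message=message)
    return response["MessageId"]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   send_mobile_push(boto3.client("sns"), "my-app", "GCM", api_key, token, "Hello")
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching a real AWS account.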
  • Grid computing vs cluster computing
    • Grid computing is generally loosely coupled, is often used with spot instances and tends to grow and shrink as required. Can use different regions and instance types
    • Distributed workloads
    • Designed for resilience (auto scaling) – horizontal scaling rather than vertical scaling
    • Cluster computing has two or more instances working together in low latency, high throughput environments
    • Uses same instance types
    • GPU instances do not support SR-IOV networking
  • Elastic Transcoder encodes media files and uses a pipeline with a source and destination bucket, a job and a pre-set (what media type, watermarks etc). Pre-sets are templates and may be altered to provide custom settings. Pipelines can only have one source and one destination bucket
  • Integrates into SNS for job status updates and alerts