16-08-16

AWS Certified Solutions Architect Professional – Study Guide – Domain 8.0: Cloud Migration and Hybrid Architecture (10%)


The final part of the study guide is below – thanks to all those who have tuned in over the past few weeks and given some very positive feedback. I hope it helps (or has helped) you get into the Solutions Architect Pro club. It’s a tough exam to pass and the feeling of achievement is immense. Good luck!

8.1 Plan and execute for applications migrations

  • AWS Management Portal for vCenter is available to plug AWS infrastructure into vCenter. It is deployed as a virtual appliance and enables migration of vSphere workloads into AWS
  • Right click on VM and select “Migrate to EC2”
  • You then select region, environment, subnet, instance type, security group, private IP address
  • Use cases:-
    • Migrate VMs to EC2 (VM must be powered off and configured for DHCP)
    • Reach new regions from vCenter to use for DR etc
    • Self service AWS portal in vCenter
    • Create new EC2 instances using VM templates
  • The inventory view is presented as:-
    • Region
      • Environment (family of templates and subnets in AWS)
        • Template (prototype for EC2 instance)
          • Running instance
            • Folder for storing migrated VMs
  • Templates map to AMIs and can be used to let admins pick a type for their deployment
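
Because the migrated VM ultimately surfaces as an AMI, launching it into the chosen environment can also be scripted. Below is a minimal boto3 sketch that launches an instance from such an AMI, supplying the same parameters the vCenter plug-in prompts for – the AMI, subnet, security group IDs, instance type and private IP are all hypothetical placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Launch an instance from the AMI produced by the migration, supplying the
    # same choices the vCenter plug-in asks for (subnet, type, SG, private IP).
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",          # AMI from the migrated VM (placeholder)
        InstanceType="m4.large",
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",      # placeholder
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateIpAddress="10.0.1.25",
    )
    print(response["Instances"][0]["InstanceId"])
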
  • Storage Gateway can be used as a migration tool
    • Gateway cached volumes (block based iSCSI)
    • Gateway stored volumes (block based iSCSI)
    • Virtual tape library (iSCSI based VTL)
    • Takes snapshots of mounted iSCSI volumes and replicates them over HTTPS to AWS. From there they are stored in S3 as EBS snapshots, and you can then create and mount EBS volumes from them (see the sketch after this list)
    • It is recommended to get a consistent snapshot of the VM by powering it off, taking a VM snapshot and then replicating this
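
As a rough illustration of the last hop of that migration path, the following boto3 sketch creates an EBS volume from a snapshot that Storage Gateway has replicated to AWS and attaches it to an instance. The snapshot ID, instance ID, Availability Zone and device name are hypothetical placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Create an EBS volume from the snapshot that Storage Gateway replicated to AWS.
    volume = ec2.create_volume(
        AvailabilityZone="eu-west-1a",
        SnapshotId="snap-0123456789abcdef0",      # placeholder
    )

    # Wait for the volume to become available, then attach it to the target instance.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId="i-0123456789abcdef0",         # placeholder
        Device="/dev/sdf",
    )
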
  • AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
  • AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos
  • Pipeline has the following concepts:-
    • Pipeline (the container made up of the items below; the work runs on either an EC2 instance or an EMR cluster, which Data Pipeline can provision automatically)
    • Datanode (end point destination, such as S3 bucket)
    • Activity (job kicked off by DP, such as database dump, command line script)
    • Precondition (readiness check optionally associated with a data source or activity. The activity will not run if the check fails. Standard and custom preconditions are available: DynamoDBTableExists, DynamoDBDataExists, S3KeyExists, S3PrefixExists, ShellCommandPrecondition)
    • Schedule
  • Pipelines can also be used with on-premises resources such as databases
  • The Task Runner package is installed on the on-premises resource and polls Data Pipeline for work to do (database dump, copy to S3, etc.)
  • Much of this functionality can now be replaced by Lambda
  • Set up logging to S3 so you can troubleshoot pipeline runs (note the pipelineLogUri field in the sketch below)
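
To tie the concepts above together, here is a rough boto3 sketch that creates and activates a pipeline containing a Default object (including the S3 log location), a daily Schedule and a ShellCommandActivity routed to an on-premises Task Runner via a worker group. The pipeline name, roles, buckets, worker group and command are hypothetical placeholders, and the field keys follow the standard Data Pipeline definition syntax:

    import boto3

    dp = boto3.client("datapipeline", region_name="eu-west-1")

    # Pipeline name, roles, buckets, worker group and command are placeholders.
    pipeline_id = dp.create_pipeline(
        name="onprem-db-dump", uniqueId="onprem-db-dump-001"
    )["pipelineId"]

    dp.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=[
            {   # Default object: schedule type, IAM roles and the S3 log location
                "id": "Default", "name": "Default",
                "fields": [
                    {"key": "scheduleType", "stringValue": "cron"},
                    {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                    {"key": "pipelineLogUri", "stringValue": "s3://my-pipeline-logs/"},
                    {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                    {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
                    {"key": "schedule", "refValue": "DailySchedule"},
                ],
            },
            {   # Schedule: run once a day from first activation
                "id": "DailySchedule", "name": "DailySchedule",
                "fields": [
                    {"key": "type", "stringValue": "Schedule"},
                    {"key": "period", "stringValue": "1 day"},
                    {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
                ],
            },
            {   # Activity picked up by Task Runner on the on-premises host via the worker group
                "id": "DumpToS3", "name": "DumpToS3",
                "fields": [
                    {"key": "type", "stringValue": "ShellCommandActivity"},
                    {"key": "workerGroup", "stringValue": "onprem-db-servers"},
                    {"key": "command", "stringValue": "pg_dump mydb > /tmp/mydb.sql && aws s3 cp /tmp/mydb.sql s3://my-dump-bucket/"},
                ],
            },
        ],
    )

    dp.activate_pipeline(pipelineId=pipeline_id)
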

8.2 Demonstrate ability to design hybrid cloud architectures

  • The biggest CIDR block you can assign to a VPC is a /16 and the smallest is a /28
  • The first four IP addresses and the last one in each subnet are reserved by AWS – always 5 reserved per subnet
    • 10.0.0.0 – Network address
    • 10.0.0.1 – Reserved for VPC router
    • 10.0.0.2 – Reserved by AWS for DNS services
    • 10.0.0.3 – Reserved by AWS for future use
    • 10.0.0.255 – Reserved for network broadcast. Network broadcast not supported in a VPC, so this is reserved
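
The reservation arithmetic is easy to sanity-check with Python's standard ipaddress module; for the 10.0.0.0/24 example above it reproduces the five reserved addresses and leaves 251 usable hosts:

    import ipaddress

    subnet = ipaddress.ip_network("10.0.0.0/24")

    # AWS reserves the first four addresses and the last (broadcast) address.
    reserved = [subnet.network_address + i for i in range(4)] + [subnet.broadcast_address]
    usable = subnet.num_addresses - 5

    print("Reserved:", ", ".join(str(ip) for ip in reserved))  # 10.0.0.0-10.0.0.3 and 10.0.0.255
    print("Usable host addresses:", usable)                    # 251 for a /24
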
  • When migrating to Direct Connect from a VPN, advertise the same prefixes over both the VPN and the Direct Connect connection(s) as part of the same BGP setup. Then make the VPN the less preferred path, for example with AS path prepending on the VPN side. BGP is a path-vector protocol and prefers the shortest AS path, so a route whose path contains a single ASN is preferred over one that has been prepended to three or four entries
  • For applications that require multicast, you need to build an overlay network between the EC2 instances using in-instance VPN software, so the multicast traffic is tunnelled and the underlying AWS infrastructure never sees it. Native multicast is not supported in a VPC
  • The VPN overlay network must use a different CIDR block from the one the underlying instances are using (for example 10.x addresses for the EC2 instances and 172.16.x addresses for the VPN connection to another VPC)
  • SQL Server can be migrated by exporting the database as flat files from SQL Server Management Studio; it cannot be natively replicated to another region or from on-premises into AWS
  • CloudSearch can index documents stored in S3 and is powered by Apache Solr
    • Full text search
    • Drill down searching
    • Highlighting
    • Boolean search
    • Autocomplete
    • CSV, PDF, HTML, Office docs and text files supported
  • Can also search DynamoDB with CloudSearch
  • CloudSearch can automatically scale based on load or can be manually scaled ahead of expected load increase
  • Multi-AZ is supported; CloudSearch is essentially a managed service running on EC2 search instances, and costs are derived from those instance hours (a query sketch follows below)
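
As a small illustration of querying a domain, the sketch below uses the boto3 cloudsearchdomain client against a hypothetical search endpoint – the endpoint URL is a placeholder, as each CloudSearch domain has its own:

    import boto3

    # Each CloudSearch domain has its own search endpoint (placeholder shown).
    search_client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://search-mydomain-xxxxxxxxxxxx.eu-west-1.cloudsearch.amazonaws.com",
    )

    # Simple full-text query, returning up to 10 hits.
    results = search_client.search(
        query="migration",
        queryParser="simple",
        size=10,
    )
    for hit in results["hits"]["hit"]:
        print(hit["id"], hit.get("fields"))
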
  • EMR can be used to run batch processing jobs, such as filtering log files and putting results into S3
  • EMR uses Hadoop which uses HDFS, a distributed file system across all nodes in the cluster where there are multiple copies of the data, meaning resilience of the data and also enables parallel processing across multiple nodes
  • Hive is used to perform SQL-like queries on the data in Hadoop, using a simple syntax to process large data sets
  • Pig is a high-level scripting language (Pig Latin) used to write data flows that compile down to MapReduce programs
  • EMR cluster has three components:-
    • Master node (manages the cluster, distributing work and data to the core and task nodes and tracking their status)
    • Core node (stores data on HDFS from tasks run by task nodes and are managed by the master node)
    • Task nodes (managed by the master node and perform processing tasks only, do not form part of HDFS and pass processed data back to core nodes for storage)
  • EMRFS can be used to output data to S3 instead of HDFS
  • Can use spot, on-demand or reserved instances for EMR cluster nodes (a launch sketch follows below)
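
The node roles and purchasing options above translate fairly directly into an EMR launch call. The following boto3 sketch starts a small cluster with on-demand master and core nodes and spot task nodes – the release label, instance types and counts, key name, bid price and log bucket are all illustrative placeholders:

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    # On-demand master/core nodes, cheaper spot task nodes. Release label,
    # instance types/counts, key name, bid price and log bucket are placeholders.
    cluster = emr.run_job_flow(
        Name="log-filtering",
        ReleaseLabel="emr-5.0.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
        LogUri="s3://my-emr-logs/",
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Instances={
            "Ec2KeyName": "my-key",
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2, "Market": "ON_DEMAND"},
                {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2, "Market": "SPOT", "BidPrice": "0.10"},
            ],
        },
    )
    print(cluster["JobFlowId"])
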
  • S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. You use S3DistCp by adding it as a step in a cluster or at the command line. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster
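
For example, on EMR release 4.x and later S3DistCp can be invoked through command-runner.jar as a cluster step; the cluster ID, bucket and paths below are placeholders:

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    # Add an S3DistCp step to an existing cluster (cluster ID, bucket and paths are placeholders).
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "Copy raw logs from S3 into HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp", "--src", "s3://my-log-bucket/raw/", "--dest", "hdfs:///input/"],
            },
        }],
    )
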
  • Larger data files are more efficient than smaller ones in EMR
  • Storing data persistently on S3 may well be cheaper than leveraging HDFS, as large data sets would require large instance sizes in the EMR cluster
  • Smaller EMR cluster with larger nodes may be just as efficient but more cost effective
  • Try to complete jobs within 59 minutes to save money (EMR billed by hour)