16-08-16

AWS Certified Solutions Architect Professional – Study Guide – Domain 8.0: Cloud Migration and Hybrid Architecture (10%)


The final part of the study guide is below – thanks to all those who have tuned in over the past few weeks and given some very positive feedback. I hope it helps (or has helped) you get into the Solutions Architect Pro club. It’s a tough exam to pass and the feeling of achievement is immense. Good luck!

8.1 Plan and execute for applications migrations

  • AWS Management Portal for vCenter is available to plug AWS infrastructure into vCenter. It is deployed as a virtual appliance and enables migration of vSphere workloads into AWS
  • Right click on VM and select “Migrate to EC2”
  • You then select region, environment, subnet, instance type, security group, private IP address
  • Use cases:-
    • Migrate VMs to EC2 (VM must be powered off and configured for DHCP)
    • Reach new regions from vCenter to use for DR etc
    • Self service AWS portal in vCenter
    • Create new EC2 instances using VM templates
  • The inventory view is presented as:-
    • Region
      • Environment (family of templates and subnets in AWS)
        • Template (prototype for EC2 instance)
          • Running instance
            • Folder for storing migrated VMs
  • Templates map to AMIs and can be used to let admins pick a type for their deployment
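
Because the migrated VM ultimately surfaces as an AMI, launching it into the chosen environment can also be scripted. Below is a minimal boto3 sketch that launches an instance from such an AMI, supplying the same parameters the vCenter plug-in prompts for – the AMI, subnet, security group IDs, instance type and private IP are all hypothetical placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Launch an instance from the AMI produced by the migration, supplying the
    # same choices the vCenter plug-in asks for (subnet, type, SG, private IP).
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",          # AMI from the migrated VM (placeholder)
        InstanceType="m4.large",
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",      # placeholder
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateIpAddress="10.0.1.25",
    )
    print(response["Instances"][0]["InstanceId"])
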
  • Storage Gateway can be used as a migration tool
    • Gateway cached volumes (block based iSCSI)
    • Gateway stored volumes (block based iSCSI)
    • Virtual tape library (iSCSI based VTL)
    • Takes snapshots of mounted iSCSI volumes and replicates them over HTTPS to AWS. From there they are stored in S3 as EBS snapshots, and you can then create and mount EBS volumes from them (see the sketch after this list)
    • It is recommended to get a consistent snapshot of the VM by powering it off, taking a VM snapshot and then replicating this
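
As a rough illustration of the last hop of that migration path, the following boto3 sketch creates an EBS volume from a snapshot that Storage Gateway has replicated to AWS and attaches it to an instance. The snapshot ID, instance ID, Availability Zone and device name are hypothetical placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Create an EBS volume from the snapshot that Storage Gateway replicated to AWS.
    volume = ec2.create_volume(
        AvailabilityZone="eu-west-1a",
        SnapshotId="snap-0123456789abcdef0",      # placeholder
    )

    # Wait for the volume to become available, then attach it to the target instance.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId="i-0123456789abcdef0",         # placeholder
        Device="/dev/sdf",
    )
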
  • AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
  • AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos
  • Pipeline has the following concepts:-
    • Pipeline (the container made up of the items below; the work runs on either an EC2 instance or an EMR cluster, which Data Pipeline can provision automatically)
    • Datanode (end point destination, such as S3 bucket)
    • Activity (job kicked off by DP, such as database dump, command line script)
    • Precondition (readiness check optionally associated with a data source or activity. The activity will not run if the check fails. Standard and custom preconditions are available: DynamoDBTableExists, DynamoDBDataExists, S3KeyExists, S3PrefixExists, ShellCommandPrecondition)
    • Schedule
  • Pipelines can also be used with on-premises resources such as databases
  • The Task Runner package is installed on the on-premises resource and polls Data Pipeline for work to do (database dump, copy to S3, etc.)
  • Much of this functionality can now be replaced by Lambda
  • Set up logging to S3 so you can troubleshoot pipeline runs (note the pipelineLogUri field in the sketch below)
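
To tie the concepts above together, here is a rough boto3 sketch that creates and activates a pipeline containing a Default object (including the S3 log location), a daily Schedule and a ShellCommandActivity routed to an on-premises Task Runner via a worker group. The pipeline name, roles, buckets, worker group and command are hypothetical placeholders, and the field keys follow the standard Data Pipeline definition syntax:

    import boto3

    dp = boto3.client("datapipeline", region_name="eu-west-1")

    # Pipeline name, roles, buckets, worker group and command are placeholders.
    pipeline_id = dp.create_pipeline(
        name="onprem-db-dump", uniqueId="onprem-db-dump-001"
    )["pipelineId"]

    dp.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=[
            {   # Default object: schedule type, IAM roles and the S3 log location
                "id": "Default", "name": "Default",
                "fields": [
                    {"key": "scheduleType", "stringValue": "cron"},
                    {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                    {"key": "pipelineLogUri", "stringValue": "s3://my-pipeline-logs/"},
                    {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                    {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
                    {"key": "schedule", "refValue": "DailySchedule"},
                ],
            },
            {   # Schedule: run once a day from first activation
                "id": "DailySchedule", "name": "DailySchedule",
                "fields": [
                    {"key": "type", "stringValue": "Schedule"},
                    {"key": "period", "stringValue": "1 day"},
                    {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
                ],
            },
            {   # Activity picked up by Task Runner on the on-premises host via the worker group
                "id": "DumpToS3", "name": "DumpToS3",
                "fields": [
                    {"key": "type", "stringValue": "ShellCommandActivity"},
                    {"key": "workerGroup", "stringValue": "onprem-db-servers"},
                    {"key": "command", "stringValue": "pg_dump mydb > /tmp/mydb.sql && aws s3 cp /tmp/mydb.sql s3://my-dump-bucket/"},
                ],
            },
        ],
    )

    dp.activate_pipeline(pipelineId=pipeline_id)
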

8.2 Demonstrate ability to design hybrid cloud architectures

  • The biggest CIDR block you can assign to a VPC is a /16 and the smallest is a /28
  • The first four IP addresses and the last one in each subnet are reserved by AWS – always 5 reserved per subnet
    • 10.0.0.0 – Network address
    • 10.0.0.1 – Reserved for VPC router
    • 10.0.0.2 – Reserved by AWS for DNS services
    • 10.0.0.3 – Reserved by AWS for future use
    • 10.0.0.255 – Reserved for network broadcast. Network broadcast not supported in a VPC, so this is reserved
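
The reservation arithmetic is easy to sanity-check with Python's standard ipaddress module; for the 10.0.0.0/24 example above it reproduces the five reserved addresses and leaves 251 usable hosts:

    import ipaddress

    subnet = ipaddress.ip_network("10.0.0.0/24")

    # AWS reserves the first four addresses and the last (broadcast) address.
    reserved = [subnet.network_address + i for i in range(4)] + [subnet.broadcast_address]
    usable = subnet.num_addresses - 5

    print("Reserved:", ", ".join(str(ip) for ip in reserved))  # 10.0.0.0-10.0.0.3 and 10.0.0.255
    print("Usable host addresses:", usable)                    # 251 for a /24
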
  • When migrating to Direct Connect from a VPN, advertise the same prefixes over both the VPN and the Direct Connect connection(s) as part of the same BGP setup. Then make the VPN the less preferred path, for example with AS path prepending on the VPN side. BGP is a path-vector protocol and prefers the shortest AS path, so a route whose path contains a single ASN is preferred over one that has been prepended to three or four entries
  • For applications that require multicast, you need to build an overlay network between the EC2 instances using in-instance VPN software, so the multicast traffic is tunnelled and the underlying AWS infrastructure never sees it. Native multicast is not supported in a VPC
  • The VPN overlay network must use a different CIDR block from the one the underlying instances are using (for example 10.x addresses for the EC2 instances and 172.16.x addresses for the VPN connection to another VPC)
  • SQL Server can be migrated by exporting the database as flat files from SQL Server Management Studio; it cannot be natively replicated to another region or from on-premises into AWS
  • CloudSearch can index documents stored in S3 and is powered by Apache Solr
    • Full text search
    • Drill down searching
    • Highlighting
    • Boolean search
    • Autocomplete
    • CSV, PDF, HTML, Office docs and text files supported
  • Can also search DynamoDB with CloudSearch
  • CloudSearch can automatically scale based on load or can be manually scaled ahead of expected load increase
  • Multi-AZ is supported; CloudSearch is essentially a managed service running on EC2 search instances, and costs are derived from those instance hours (a query sketch follows below)
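
As a small illustration of querying a domain, the sketch below uses the boto3 cloudsearchdomain client against a hypothetical search endpoint – the endpoint URL is a placeholder, as each CloudSearch domain has its own:

    import boto3

    # Each CloudSearch domain has its own search endpoint (placeholder shown).
    search_client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://search-mydomain-xxxxxxxxxxxx.eu-west-1.cloudsearch.amazonaws.com",
    )

    # Simple full-text query, returning up to 10 hits.
    results = search_client.search(
        query="migration",
        queryParser="simple",
        size=10,
    )
    for hit in results["hits"]["hit"]:
        print(hit["id"], hit.get("fields"))
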
  • EMR can be used to run batch processing jobs, such as filtering log files and putting results into S3
  • EMR uses Hadoop which uses HDFS, a distributed file system across all nodes in the cluster where there are multiple copies of the data, meaning resilience of the data and also enables parallel processing across multiple nodes
  • Hive is used to perform SQL-like queries on the data in Hadoop, using a simple syntax to process large data sets
  • Pig is a high-level scripting language (Pig Latin) used to write data flows that compile down to MapReduce programs
  • EMR cluster has three components:-
    • Master node (manages the cluster, distributing work and data to the core and task nodes and tracking their status)
    • Core node (stores data on HDFS from tasks run by task nodes and are managed by the master node)
    • Task nodes (managed by the master node and perform processing tasks only, do not form part of HDFS and pass processed data back to core nodes for storage)
  • EMRFS can be used to output data to S3 instead of HDFS
  • Can use spot, on-demand or reserved instances for EMR cluster nodes (a launch sketch follows below)
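
The node roles and purchasing options above translate fairly directly into an EMR launch call. The following boto3 sketch starts a small cluster with on-demand master and core nodes and spot task nodes – the release label, instance types and counts, key name, bid price and log bucket are all illustrative placeholders:

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    # On-demand master/core nodes, cheaper spot task nodes. Release label,
    # instance types/counts, key name, bid price and log bucket are placeholders.
    cluster = emr.run_job_flow(
        Name="log-filtering",
        ReleaseLabel="emr-5.0.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
        LogUri="s3://my-emr-logs/",
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Instances={
            "Ec2KeyName": "my-key",
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2, "Market": "ON_DEMAND"},
                {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m4.xlarge",
                 "InstanceCount": 2, "Market": "SPOT", "BidPrice": "0.10"},
            ],
        },
    )
    print(cluster["JobFlowId"])
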
  • S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. You use S3DistCp by adding it as a step in a cluster or at the command line. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster
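
For example, on EMR release 4.x and later S3DistCp can be invoked through command-runner.jar as a cluster step; the cluster ID, bucket and paths below are placeholders:

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    # Add an S3DistCp step to an existing cluster (cluster ID, bucket and paths are placeholders).
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "Copy raw logs from S3 into HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp", "--src", "s3://my-log-bucket/raw/", "--dest", "hdfs:///input/"],
            },
        }],
    )
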
  • Larger data files are more efficient than smaller ones in EMR
  • Storing data persistently on S3 may well be cheaper than leveraging HDFS, as large data sets would require large instance sizes in the EMR cluster
  • Smaller EMR cluster with larger nodes may be just as efficient but more cost effective
  • Try to complete jobs within 59 minutes to save money (EMR billed by hour)