Enterprise Database Systems
Hadoop Operations
Capacity Management for Hadoop Clusters
Cloudera Manager and Hadoop Clusters
Deploying Hadoop Clusters
Designing Hadoop Clusters
Hadoop Cluster Availability
Hadoop in the Cloud
Operating Hadoop Clusters
Performance Tuning of Hadoop Clusters
Securing Hadoop Clusters
Stabilizing Hadoop Clusters

Capacity Management for Hadoop Clusters

Course Number:
df_ahop_a08_it_enus
Lesson Objectives

Capacity Management for Hadoop Clusters

  • start the course
  • compare the differences between availability and performance
  • describe different strategies of resource capacity management
  • describe how schedulers perform various resource management tasks
  • set quotas for the HDFS file system
  • recall how to set the maximum and minimum memory allocations per container
  • describe how fair scheduling gives all applications, on average, an equal share of resources over time
  • describe the primary algorithm and the configuration files for the Fair Scheduler
  • describe the default behavior of the Fair Scheduler methods
  • monitor the behavior of Fair Share
  • describe the policy for single resource fairness
  • describe how resources are distributed over the total capacity
  • identify different configuration options for single resource fairness
  • configure single resource fairness
  • describe the minimum share function of the Fair Scheduler
  • configure minimum share on the Fair Scheduler
  • describe the preemption functions of the Fair Scheduler
  • configure preemption for the Fair Scheduler
  • describe dominant resource fairness
  • write service levels for performance
  • use the Fair Scheduler with multiple users (see the configuration sketch after this list)
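
For illustration, here is a minimal sketch of two of the hands-on tasks above: setting HDFS quotas with dfsadmin, and a Fair Scheduler allocation file that defines a minimum share. The queue name, paths, and values are hypothetical examples, not recommendations.

    # Set a name quota (max file/directory count) and a space quota on a directory
    # (binary suffixes such as 10t for 10 terabytes are accepted)
    hdfs dfsadmin -setQuota 100000 /user/projects
    hdfs dfsadmin -setSpaceQuota 10t /user/projects

    <!-- fair-scheduler.xml: one queue with a minimum share; preemption itself
         is switched on in yarn-site.xml via yarn.scheduler.fair.preemption=true -->
    <allocations>
      <queue name="analytics">
        <minResources>10000 mb,10vcores</minResources>
        <weight>2.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
      </queue>
      <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
    </allocations>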

Overview/Description
Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. This course focuses on capacity management for Hadoop clusters. You will be introduced to the concepts of resource management through scheduling. You will learn how to use the Fair Scheduler, and how to plan for scaling.

Target Audience
Administrators looking to add to their knowledge of capacity management for Hadoop clusters

Cloudera Manager and Hadoop Clusters

Course Number:
df_ahop_a10_it_enus
Lesson Objectives

Cloudera Manager and Hadoop Clusters

  • start the course
  • describe what cluster management entails and recall some of the tools that can be used
  • describe different tools from a functional perspective
  • describe the purpose and functionality of Cloudera Manager
  • install Cloudera Manager
  • use Cloudera Manager to deploy a cluster
  • use Cloudera Manager to install Hadoop
  • describe the different parts of the Cloudera Manager Admin Console
  • describe the Cloudera Manager internal architecture
  • use Cloudera Manager to manage a cluster
  • manage Cloudera Manager's services
  • manage hosts with Cloudera Manager
  • set up Cloudera Manager for high availability
  • use Cloudera Manager to manage resources
  • use Cloudera Manager's monitoring features
  • manage logs through Cloudera Manager
  • improve cluster performance with Cloudera Manager
  • install and configure Impala
  • install and configure Sentry
  • implement security administration using Hive
  • perform backups, snapshots, and upgrades using Cloudera Manager
  • configure Hue with MySQL
  • import data using Hue
  • use Hue to run a Hive job
  • use Hue to edit Oozie workflows and coordinators
  • format HDFS, create an HDFS directory, import data, run a WordCount, and view the results (see the command sketch after this list)
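
The final objective maps to a short command sequence. A minimal sketch follows; the data file, the HDFS paths, and the examples jar location (which varies by distribution and version) are assumptions.

    # Format HDFS -- first-time setup only; this erases existing NameNode metadata
    hdfs namenode -format
    # Create an HDFS directory and import local data
    hdfs dfs -mkdir -p /user/demo/input
    hdfs dfs -put books.txt /user/demo/input
    # Run the stock WordCount example and view the results
    hadoop jar hadoop-mapreduce-examples.jar wordcount /user/demo/input /user/demo/output
    hdfs dfs -cat /user/demo/output/part-r-00000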

Overview/Description
Cloudera Manager is a simple, automated, customizable management tool for Hadoop clusters. In this course, you will become familiar with the various web consoles available with Cloudera Manager. You will learn how to use Cloudera Manager to perform everything from Hadoop cluster installation to performance tuning to diagnosing issues.

Target Audience
Administrators wanting to add Cloudera Manager to their skill sets

Deploying Hadoop Clusters

Course Number:
df_ahop_a03_it_enus
Lesson Objectives

Deploying Hadoop Clusters

  • start the course
  • describe configuration management tools
  • simulate a configuration management tool
  • build an image for a baseline server
  • build an image for a data server
  • build an image for a master server
  • provision an admin server
  • describe the layout and structure of the Hadoop cluster
  • provision a Hadoop cluster
  • distribute configuration files and admin scripts
  • use init scripts to start and stop a Hadoop cluster
  • configure a Hadoop cluster (see the configuration sketch after this list)
  • configure logging for the Hadoop cluster
  • build images for required servers in the Hadoop cluster
  • configure a MySQL database
  • build the Hadoop clients
  • configure Hive daemons
  • test the functionality of Flume, Sqoop, HDFS, and MapReduce
  • test the functionality of Hive and Pig
  • configure HCatalog daemons
  • configure Oozie
  • configure Hue and Hue users
  • install Hadoop onto the admin server
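
As a reference point for the configuration steps above, a pseudo-distributed deployment needs surprisingly little: a default filesystem URI and a replication factor of 1. A minimal sketch follows; the hostname and port are common defaults, not prescriptions.

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

After formatting the NameNode, the stock start-dfs.sh and start-yarn.sh scripts bring the daemons up, which is the kind of operation the init scripts above automate.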

Overview/Description
There are important decisions you must make to ensure the network, disks, and hosts are configured correctly when deploying a Hadoop cluster. This course walks you through the steps to install Hadoop in pseudo-distributed mode and to set up some of the common open source software used to create a Hadoop ecosystem.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Designing Hadoop Clusters

Course Number:
df_ahop_a01_it_enus
Lesson Objectives

Designing Hadoop Clusters

  • start the course
  • describe the principles of supercomputing
  • recall the roles and skills needed for the Hadoop engineering team
  • recall the advantages and shortcomings of using Hadoop as a supercomputing platform
  • describe the three axioms of supercomputing
  • describe the "dumb hardware and smart software" and "share nothing" design principles
  • describe the "move processing, not data," "embrace failure," and "build applications, not infrastructure" design principles
  • describe the different rack architectures for Hadoop
  • describe the best practices for scaling a Hadoop cluster
  • recall the best practices for different types of network clusters
  • recall the primary responsibilities for the master, data, and edge servers
  • recall some of the recommendations for a master server and edge server
  • recall some of the recommendations for a data server
  • recall some of the recommendations for an operating system
  • recall some of the recommendations for hostnames and DNS entries
  • describe the recommendations for HDDs
  • calculate the correct number of disks required for a storage solution (a worked example follows this list)
  • compare the use of commodity hardware with enterprise disks
  • plan for the development of a Hadoop cluster
  • set up flash drives as boot media
  • set up a kickstart file as boot media
  • set up a network installer
  • identify the hardware and networking recommendations for a Hadoop cluster
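
To make the disk calculation concrete, here is one worked example under stated assumptions: 100 TB of data, the default replication factor of 3, 25% headroom for intermediate and temporary data, and 4 TB data disks.

    raw capacity needed = 100 TB x 3 replicas x 1.25 overhead = 375 TB
    disks needed        = 375 TB / 4 TB per disk = 93.75, round up to 94 disks
    at 12 disks per data server, that is roughly 8 data servers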

Overview/Description
Hadoop is an Apache Software Foundation project and open source software platform for scalable, distributed computing. Hadoop can provide fast and reliable analysis of both structured and unstructured data. In this course, you will learn about the design principles, the cluster architecture, considerations for servers and operating systems, and how to plan for a deployment.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Hadoop Cluster Availability

Course Number:
df_ahop_a04_it_enus
Lesson Objectives

Hadoop Cluster Availability

  • start the course
  • describe how Hadoop leverages fault tolerance
  • recall the most common causes for NameNode failure
  • recall the uses for the Checkpoint node
  • test the availability of the NameNode
  • describe the operation of the NameNode during a recovery
  • swap to a new NameNode
  • recall the most common causes for DataNode failure
  • test the availability of the DataNode
  • describe the operation of the DataNode during a recovery
  • set up the DataNode for replication
  • identify and recover from a missing data block scenario
  • describe the functions of Hadoop high availability
  • edit the Hadoop configuration files for high availability
  • set up a high availability solution for the NameNode (see the configuration sketch after this list)
  • recall the requirements for enabling an automated failover for the NameNode
  • create an automated failover for the NameNode
  • recall the most common causes for YARN task failure
  • describe the functions of YARN containers
  • test YARN container reliability
  • recall the most common causes of YARN job failure
  • test application reliability
  • describe the system view of the Resource Manager configurations set for high availability
  • set up high availability for the Resource Manager
  • move the Resource Manager HA to alternate master servers
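
The NameNode high availability objectives revolve around a handful of configuration properties. A minimal sketch follows; the nameservice name, NameNode IDs, and ZooKeeper hosts are hypothetical.

    <!-- hdfs-site.xml (excerpt) -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml: ZooKeeper quorum used by the failover controllers -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1:2181,zk2:2181,zk3:2181</value>
    </property>

With manual failover configured instead, an administrator can move the active role with hdfs haadmin -failover nn1 nn2.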

Overview/Description
When examining Hadoop availability, it's important not to focus solely on the NameNode. There is a tendency to do so, since the NameNode is the single point of failure for HDFS and many components in the ecosystem rely on it, but Hadoop availability is a larger, more general issue. In this course, we examine availability, and how to recover from failures, for the NameNode, DataNode, HDFS, and YARN.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Hadoop in the Cloud

Course Number:
df_ahop_a02_it_enus
Lesson Objectives

Hadoop in the Cloud

  • start the course
  • describe how cloud computing can be used as a solution for Hadoop
  • recall some of the most common services of the EC2 service bundle
  • recall some of the most common services that Amazon offers
  • describe how the AWS credentials are used for authentication
  • create an AWS account
  • describe the use of AWS access keys
  • describe AWS identification and access management
  • set up AWS IAM
  • describe the use of SSH key pairs for remote access
  • set up S3 and import data
  • provision a micro instance of EC2
  • prepare to install and configure a Hadoop cluster on AWS
  • create an EC2 baseline server
  • create an Amazon machine image
  • create an Amazon cluster
  • describe what the command line interface is used for
  • use the command line interface
  • describe the various ways to move data into AWS
  • recall the advantages and limitations of using Hadoop in the cloud
  • recall the advantages and limitations of using AWS EMR
  • describe EMR end-user connections and EMR security levels
  • set up an EMR cluster
  • run an EMR job from the web console
  • run an EMR job with Hue
  • run an EMR job with the command line interface (see the CLI sketch after this list)
  • write an Elastic MapReduce script for AWS
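
For the command line interface objectives, here is a hedged sketch of launching a small EMR cluster; the cluster name, release label, instance type and count, and key pair name are placeholders to adapt.

    aws emr create-cluster \
      --name "demo-cluster" \
      --release-label emr-5.36.0 \
      --applications Name=Hadoop Name=Hive Name=Hue \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles \
      --ec2-attributes KeyName=my-keypair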

Overview/Description
Amazon Web Services, also known as AWS, is a secure cloud-computing platform offered by Amazon.com. This course introduces AWS and its most prominent tools, such as IAM, S3, and EC2. Additionally, we will cover how to install, configure, and use a Hadoop cluster on AWS.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Operating Hadoop Clusters

Course Number:
df_ahop_a06_it_enus
Lesson Objectives

Operating Hadoop Clusters

  • start the course
  • monitor and improve service levels
  • deploy a Hadoop release
  • describe the purpose of change management
  • describe rack awareness
  • write configuration files for rack awareness
  • start and stop a Hadoop cluster
  • write init scripts for Hadoop
  • describe the tools fsck and dfsadmin
  • use fsck to check the HDFS file system (see the command sketch after this list)
  • set quotas for the HDFS file system
  • install and configure trash
  • manage an HDFS DataNode
  • use include and exclude files to replace a DataNode
  • describe the operations for scaling a Hadoop cluster
  • add a DataNode to a Hadoop cluster
  • describe the process for balancing a Hadoop cluster
  • balance a Hadoop cluster
  • describe the operations involved for backing up data
  • use distcp to copy data from one cluster to another
  • describe MapReduce job management on a Hadoop cluster
  • perform MapReduce job management on a Hadoop cluster
  • plan an upgrade of a Hadoop cluster
  • write and complete a plan to install HBase with high availability
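
Several of the objectives above map directly to one-line administrative commands. A quick sketch, with example paths and thresholds:

    # Check HDFS health, listing files and their block/replication status
    hdfs fsck / -files -blocks
    # Report cluster-wide capacity and per-DataNode usage
    hdfs dfsadmin -report
    # Rebalance until every DataNode is within 10% of the cluster's average usage
    hdfs balancer -threshold 10
    # Copy data between clusters for backup (cluster names are hypothetical)
    hadoop distcp hdfs://clusterA:8020/data hdfs://clusterB:8020/backup/data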

Overview/Description
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. In this course, we will examine many of the HDFS administration and operational processes required to operate and maintain a Hadoop cluster. We will take a look at how to balance a Hadoop cluster, manage jobs, and perform backup and recovery for HDFS.

Target Audience
Administrators looking to expand their skills and knowledge surrounding operational activities of Hadoop clusters

Performance Tuning of Hadoop Clusters

Course Number:
df_ahop_a09_it_enus
Lesson Objectives

Performance Tuning of Hadoop Clusters

  • start the course
  • recall the three main functions of service capacity
  • describe different strategies of performance tuning
  • list some of the best practices for network tuning
  • install compression
  • describe the configuration files and parameters used in performance tuning of the operating system
  • describe the purpose of Java tuning
  • recall some of the rules for tuning the DataNode
  • describe the configuration files and parameters used in performance tuning of memory for daemons
  • describe the purpose of memory tuning for YARN
  • recall why the Node Manager kills containers
  • performance tune memory for the Hadoop cluster (see the configuration sketch after this list)
  • describe the configuration files and parameters used in performance tuning of HDFS
  • describe the sizing and balancing of the HDFS data blocks
  • describe the use of TestDFSIO
  • performance tune HDFS
  • describe the configuration files and parameters used in performance tuning of YARN
  • configure speculative execution
  • describe the configuration files and parameters used in performance tuning of MapReduce
  • tune MapReduce for performance
  • describe the practice of benchmarking on a Hadoop cluster
  • describe the different tools used for benchmarking a cluster
  • perform a benchmark of a Hadoop cluster
  • describe the purpose of application modeling
  • optimize memory and benchmark a Hadoop cluster
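
As a concrete reference for the memory tuning and benchmarking objectives, a sketch follows. The sizes are illustrative values for a worker with 64 GB of RAM, and the exact name and path of the job client test jar vary by Hadoop version.

    <!-- yarn-site.xml: memory the NodeManager may hand out to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>49152</value>
    </property>
    <!-- smallest and largest container the scheduler will grant -->
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value>
    </property>

    # Benchmark HDFS throughput: write 10 files of 1000 MB each, then read them back
    hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000MB
    hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000MB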

Overview/Description
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage. This course will focus on performance tuning of the Hadoop cluster. We will examine best practices and recommendations for performance tuning of the operating system, memory, HDFS, YARN, and MapReduce.

Target Audience
Administrators looking to expand their skill sets to include performance tuning Hadoop clusters

Securing Hadoop Clusters

Course Number:
df_ahop_a05_it_enus
Lesson Objectives

Securing Hadoop Clusters

  • start the course
  • describe the four pillars of the Hadoop security model
  • recall the ports required for Hadoop and how network gateways are used
  • set up security groups for AWS
  • describe Kerberos and recall some of the common commands
  • diagram Kerberos and label the primary components
  • prepare for a Kerberos installation
  • install Kerberos
  • configure Kerberos
  • describe how to configure HDFS and YARN for use with Kerberos
  • configure HDFS for Kerberos
  • configure YARN for Kerberos
  • describe how to configure Hive for use with Kerberos
  • configure Hive for Kerberos
  • describe how to configure Pig, Sqoop, and Oozie for use with Kerberos
  • configure Pig and HTTPFS for use with Kerberos
  • configure Oozie for use with Kerberos
  • configure Hue for use with Kerberos
  • describe how to configure Flume for use with Kerberos
  • describe the security model for users on a Hadoop cluster
  • describe the use of POSIX and ACL for managing user access
  • create access control lists (see the sketch after this list)
  • describe how to encrypt data in motion for Hadoop, Sqoop, and Flume
  • encrypt data in motion
  • describe how to encrypt data at rest
  • recall the primary security threats faced by the Hadoop cluster
  • describe how to monitor Hadoop security
  • configure HBase for Kerberos
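
Two small concrete pieces behind the objectives above: switching Hadoop's authentication to Kerberos in core-site.xml, and granting access with an HDFS access control list. The user name and path are hypothetical.

    <!-- core-site.xml (excerpt) -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>

    # Grant one user read/execute access to a directory, then verify
    hdfs dfs -setfacl -m user:alice:r-x /data/sensitive
    hdfs dfs -getfacl /data/sensitive

Note that ACLs must first be enabled with dfs.namenode.acls.enabled=true in hdfs-site.xml.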

Overview/Description
Hadoop development has allowed big data technologies to reach companies in all sectors of the economy, but as adoption grows, so do the security concerns. In this course, you will examine the risks and learn how to implement security protocols for Hadoop clusters.

Target Audience
Administrators looking to expand their skill set into Hadoop security

Stabilizing Hadoop Clusters

Course Number:
df_ahop_a07_it_enus
Lesson Objectives

Stabilizing Hadoop Clusters

  • start the course
  • describe the importance of event management
  • describe the importance of incident management
  • describe the different methodologies used for root cause analysis
  • recall what Ganglia is and what it can be used for
  • recall how Ganglia monitors Hadoop clusters
  • install Ganglia
  • describe Hadoop Metrics2
  • install Hadoop Metrics2 for Ganglia
  • describe how to use Ganglia to monitor a Hadoop cluster
  • use Ganglia to monitor a Hadoop cluster
  • recall what Nagios is and what it can be used for
  • install Nagios
  • use Nagios commands
  • use Nagios to monitor a Hadoop cluster
  • use Hadoop Metrics2 for Nagios
  • describe how to manage logging levels
  • describe how to configure Hadoop jobs for logging
  • describe how to configure log4j for Hadoop
  • describe how to configure JobHistoryServer logs
  • configure Hadoop logs (see the sketch after this list)
  • describe the problem management lifecycle
  • recall some of the best practices for problem management
  • describe the categories of errors for a Hadoop cluster
  • conduct a root cause analysis on a major problem
  • use different monitoring tools to identify problems, failures, errors, and solutions
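
Logging levels can be adjusted both persistently and at runtime. A minimal sketch follows; the package, daemon class, and web UI port (9870 for the NameNode in Hadoop 3, 50070 in Hadoop 2) are examples.

    # log4j.properties: quiet a noisy package persistently
    log4j.logger.org.apache.hadoop.hdfs.server.datanode=WARN

    # Raise a daemon's logging level at runtime, without a restart
    hadoop daemonlog -setlevel namenode-host:9870 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG
    hadoop daemonlog -getlevel namenode-host:9870 org.apache.hadoop.hdfs.server.namenode.NameNode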

Overview/Description
Apache Hadoop is increasing in popularity as a framework for large-scale, data-intensive applications. Tuning Hadoop clusters is vital to improving cluster performance. In this course, you will look at the importance of incident and log management and examine the best practices for root cause analysis.

Target Audience
Engineers looking to expand their skill sets in the area of Hadoop stability
