Enterprise Database Systems
Hadoop Operations
Capacity Management for Hadoop Clusters
Cloudera Manager and Hadoop Clusters
Deploying Hadoop Clusters
Designing Hadoop Clusters
Hadoop Cluster Availability
Hadoop in the Cloud
Operating Hadoop Clusters
Performance Tuning of Hadoop Clusters
Securing Hadoop Clusters
Stabilizing Hadoop Clusters

Capacity Management for Hadoop Clusters

Course Number:
df_ahop_a08_it_enus
Lesson Objectives

Capacity Management for Hadoop Clusters

  • start the course
  • compare the differences between availability and performance
  • describe different strategies of resource capacity management
  • describe how schedulers perform various resource management tasks
  • set quotas for the HDFS file system
  • recall how to set the maximum and minimum memory allocations per container
  • describe how fair scheduling gives all applications, on average, an equal share of resources over time
  • describe the primary algorithm and the configuration files for the Fair Scheduler
  • describe the default behavior of the Fair Scheduler methods
  • monitor the behavior of Fair Share
  • describe the policy for single resource fairness
  • describe how resources are distributed over the total capacity
  • identify different configuration options for single resource fairness
  • configure single resource fairness
  • describe the minimum share function of the Fair Scheduler
  • configure minimum share on the Fair Scheduler
  • describe the preemption functions of the Fair Scheduler
  • configure preemption for the Fair Scheduler
  • describe dominant resource fairness
  • write service levels for performance
  • use the Fair Scheduler with multiple users (see the configuration sketch after this list)
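
For illustration, here is a minimal sketch of two of the hands-on tasks above: setting HDFS quotas with dfsadmin, and a Fair Scheduler allocation file that defines a minimum share. The queue name, paths, and values are hypothetical examples, not recommendations.

    # Set a name quota (max file/directory count) and a space quota on a directory
    # (binary suffixes such as 10t for 10 terabytes are accepted)
    hdfs dfsadmin -setQuota 100000 /user/projects
    hdfs dfsadmin -setSpaceQuota 10t /user/projects

    <!-- fair-scheduler.xml: one queue with a minimum share; preemption itself
         is switched on in yarn-site.xml via yarn.scheduler.fair.preemption=true -->
    <allocations>
      <queue name="analytics">
        <minResources>10000 mb,10vcores</minResources>
        <weight>2.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
      </queue>
      <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
    </allocations>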

Overview/Description
Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. This course focuses on capacity management for Hadoop clusters. You will be introduced to the concepts of resource management through scheduling. You will learn how to use the Fair Scheduler, and how to plan for scaling.

Target Audience
Administrators looking to add to their knowledge of capacity management for Hadoop clusters

Cloudera Manager and Hadoop Clusters

Course Number:
df_ahop_a10_it_enus
Lesson Objectives

Cloudera Manager and Hadoop Clusters

  • start the course
  • describe what cluster management entails and recall some of the tools that can be used
  • describe different tools from a functional perspective
  • describe the purpose and functionality of Cloudera Manager
  • install Cloudera Manager
  • use Cloudera Manager to deploy a cluster
  • use Cloudera Manager to install Hadoop
  • describe the different parts of the Cloudera Manager Admin Console
  • describe the Cloudera Manager internal architecture
  • use Cloudera Manager to manage a cluster
  • manage Cloudera Manager's services
  • manage hosts with Cloudera Manager
  • set up Cloudera Manager for high availability
  • use Cloudera Manager to manage resources
  • use Cloudera Manager's monitoring features
  • manage logs through Cloudera Manager
  • improve cluster performance with Cloudera Manager
  • install and configure Impala
  • install and configure Sentry
  • implement security administration using Hive
  • perform backups, snapshots, and upgrades using Cloudera Manager
  • configure Hue with MySQL
  • import data using Hue
  • use Hue to run a Hive job
  • use Hue to edit Oozie workflows and coordinators
  • format HDFS, create an HDFS directory, import data, run a WordCount, and view the results (see the command sketch after this list)
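
The final objective maps to a short command sequence. A minimal sketch follows; the data file, the HDFS paths, and the examples jar location (which varies by distribution and version) are assumptions.

    # Format HDFS -- first-time setup only; this erases existing NameNode metadata
    hdfs namenode -format
    # Create an HDFS directory and import local data
    hdfs dfs -mkdir -p /user/demo/input
    hdfs dfs -put books.txt /user/demo/input
    # Run the stock WordCount example and view the results
    hadoop jar hadoop-mapreduce-examples.jar wordcount /user/demo/input /user/demo/output
    hdfs dfs -cat /user/demo/output/part-r-00000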

Overview/Description
Cloudera Manager is a simple, automated, customizable management tool for Hadoop clusters. In this course, you will become familiar with the various web consoles available with Cloudera Manager. You will learn how to use Cloudera Manager to perform everything from Hadoop cluster installation to performance tuning to diagnosing issues.

Target Audience
Administrators wanting to add Cloudera Manager to their skill sets

Deploying Hadoop Clusters

Course Number:
df_ahop_a03_it_enus
Lesson Objectives

Deploying Hadoop Clusters

  • start the course
  • describe configuration management tools
  • simulate a configuration management tool
  • build an image for a baseline server
  • build an image for a data server
  • build an image for a master server
  • provision an admin server
  • describe the layout and structure of the Hadoop cluster
  • provision a Hadoop cluster
  • distribute configuration files and admin scripts
  • use init scripts to start and stop a Hadoop cluster
  • configure a Hadoop cluster (see the configuration sketch after this list)
  • configure logging for the Hadoop cluster
  • build images for required servers in the Hadoop cluster
  • configure a MySQL database
  • build the Hadoop clients
  • configure Hive daemons
  • test the functionality of Flume, Sqoop, HDFS, and MapReduce
  • test the functionality of Hive and Pig
  • configure HCatalog daemons
  • configure Oozie
  • configure Hue and Hue users
  • install Hadoop onto the admin server
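
As a reference point for the configuration steps above, a pseudo-distributed deployment needs surprisingly little: a default filesystem URI and a replication factor of 1. A minimal sketch follows; the hostname and port are common defaults, not prescriptions.

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

After formatting the NameNode, the stock start-dfs.sh and start-yarn.sh scripts bring the daemons up, which is the kind of operation the init scripts above automate.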

Overview/Description
There are important decisions you must make to ensure the network, disks, and hosts are configured correctly when deploying a Hadoop cluster. This course walks you through the steps to install Hadoop in pseudo-distributed mode and to set up some of the common open source software used to create a Hadoop ecosystem.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Designing Hadoop Clusters

Course Number:
df_ahop_a01_it_enus
Lesson Objectives

Designing Hadoop Clusters

  • start the course
  • describe the principles of supercomputing
  • recall the roles and skills needed for the Hadoop engineering team
  • recall the advantages and shortcomings of using Hadoop as a supercomputing platform
  • describe the three axioms of supercomputing
  • describe the "dumb hardware and smart software" and "share nothing" design principles
  • describe the "move processing, not data," "embrace failure," and "build applications, not infrastructure" design principles
  • describe the different rack architectures for Hadoop
  • describe the best practices for scaling a Hadoop cluster
  • recall the best practices for different types of network clusters
  • recall the primary responsibilities for the master, data, and edge servers
  • recall some of the recommendations for a master server and edge server
  • recall some of the recommendations for a data server
  • recall some of the recommendations for an operating system
  • recall some of the recommendations for hostnames and DNS entries
  • describe the recommendations for HDDs
  • calculate the correct number of disks required for a storage solution (a worked example follows this list)
  • compare the use of commodity hardware with enterprise disks
  • plan for the development of a Hadoop cluster
  • set up flash drives as boot media
  • set up a kickstart file as boot media
  • set up a network installer
  • identify the hardware and networking recommendations for a Hadoop cluster
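
To make the disk calculation concrete, here is one worked example under stated assumptions: 100 TB of data, the default replication factor of 3, 25% headroom for intermediate and temporary data, and 4 TB data disks.

    raw capacity needed = 100 TB x 3 replicas x 1.25 overhead = 375 TB
    disks needed        = 375 TB / 4 TB per disk = 93.75, round up to 94 disks
    at 12 disks per data server, that is roughly 8 data servers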

Overview/Description
Hadoop is an Apache Software Foundation project and open source software platform for scalable, distributed computing. Hadoop can provide fast and reliable analysis of both structured and unstructured data. In this course, you will learn about the design principles, the cluster architecture, considerations for servers and operating systems, and how to plan for a deployment.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Hadoop Cluster Availability

Course Number:
df_ahop_a04_it_enus
Lesson Objectives

Hadoop Cluster Availability

  • start the course
  • describe how Hadoop leverages fault tolerance
  • recall the most common causes for NameNode failure
  • recall the uses for the Checkpoint node
  • test the availability of the NameNode
  • describe the operation of the NameNode during a recovery
  • swap to a new NameNode
  • recall the most common causes for DataNode failure
  • test the availability of the DataNode
  • describe the operation of the DataNode during a recovery
  • set up the DataNode for replication
  • identify and recover from a missing data block scenario
  • describe the functions of Hadoop high availability
  • edit the Hadoop configuration files for high availability
  • set up a high availability solution for the NameNode (see the configuration sketch after this list)
  • recall the requirements for enabling an automated failover for the NameNode
  • create an automated failover for the NameNode
  • recall the most common causes for YARN task failure
  • describe the functions of YARN containers
  • test YARN container reliability
  • recall the most common causes of YARN job failure
  • test application reliability
  • describe the system view of the Resource Manager configurations set for high availability
  • set up high availability for the Resource Manager
  • move the Resource Manager HA to alternate master servers
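
The NameNode high availability objectives revolve around a handful of configuration properties. A minimal sketch follows; the nameservice name, NameNode IDs, and ZooKeeper hosts are hypothetical.

    <!-- hdfs-site.xml (excerpt) -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml: ZooKeeper quorum used by the failover controllers -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1:2181,zk2:2181,zk3:2181</value>
    </property>

With manual failover configured instead, an administrator can move the active role with hdfs haadmin -failover nn1 nn2.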

Overview/Description
When examining Hadoop availability, it's important not to focus solely on the NameNode. There is a tendency to do so, since the NameNode is the single point of failure for HDFS and many components in the ecosystem rely on it, but Hadoop availability is a larger, more general issue. In this course, we examine availability, and how to recover from failures, for the NameNode, DataNode, HDFS, and YARN.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Hadoop in the Cloud

Course Number:
df_ahop_a02_it_enus
Lesson Objectives

Hadoop in the Cloud

  • start the course
  • describe how cloud computing can be used as a solution for Hadoop
  • recall some of the most common services of the EC2 service bundle
  • recall some of the most common services that Amazon offers
  • describe how the AWS credentials are used for authentication
  • create an AWS account
  • describe the use of AWS access keys
  • describe AWS identification and access management
  • set up AWS IAM
  • describe the use of SSH key pairs for remote access
  • set up S3 and import data
  • provision a micro instance of EC2
  • prepare to install and configure a Hadoop cluster on AWS
  • create an EC2 baseline server
  • create an Amazon machine image
  • create an Amazon cluster
  • describe what the command line interface is used for
  • use the command line interface
  • describe the various ways to move data into AWS
  • recall the advantages and limitations of using Hadoop in the cloud
  • recall the advantages and limitations of using AWS EMR
  • describe EMR end-user connections and EMR security levels
  • set up an EMR cluster
  • run an EMR job from the web console
  • run an EMR job with Hue
  • run an EMR job with the command line interface (see the CLI sketch after this list)
  • write an Elastic MapReduce script for AWS
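
For the command line interface objectives, here is a hedged sketch of launching a small EMR cluster; the cluster name, release label, instance type and count, and key pair name are placeholders to adapt.

    aws emr create-cluster \
      --name "demo-cluster" \
      --release-label emr-5.36.0 \
      --applications Name=Hadoop Name=Hive Name=Hue \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles \
      --ec2-attributes KeyName=my-keypair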

Overview/Description
Amazon Web Services, also known as AWS, is a secure cloud-computing platform offered by Amazon.com. This course introduces AWS and its most prominent tools, such as IAM, S3, and EC2. Additionally, we will cover how to install, configure, and use a Hadoop cluster on AWS.

Target Audience
Developers interested in expanding their knowledge of Hadoop from the operations perspective

Operating Hadoop Clusters

Course Number:
df_ahop_a06_it_enus
Lesson Objectives

Operating Hadoop Clusters

  • start the course
  • monitor and improve service levels
  • deploy a Hadoop release
  • describe the purpose of change management
  • describe rack awareness
  • write configuration files for rack awareness
  • start and stop a Hadoop cluster
  • write init scripts for Hadoop
  • describe the tools fsck and dfsadmin
  • use fsck to check the HDFS file system (see the command sketch after this list)
  • set quotas for the HDFS file system
  • install and configure trash
  • manage an HDFS DataNode
  • use include and exclude files to replace a DataNode
  • describe the operations for scaling a Hadoop cluster
  • add a DataNode to a Hadoop cluster
  • describe the process for balancing a Hadoop cluster
  • balance a Hadoop cluster
  • describe the operations involved for backing up data
  • use distcp to copy data from one cluster to another
  • describe MapReduce job management on a Hadoop cluster
  • perform MapReduce job management on a Hadoop cluster
  • plan an upgrade of a Hadoop cluster
  • write and complete a plan to install HBase with high availability
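
Several of the objectives above map directly to one-line administrative commands. A quick sketch, with example paths and thresholds:

    # Check HDFS health, listing files and their block/replication status
    hdfs fsck / -files -blocks
    # Report cluster-wide capacity and per-DataNode usage
    hdfs dfsadmin -report
    # Rebalance until every DataNode is within 10% of the cluster's average usage
    hdfs balancer -threshold 10
    # Copy data between clusters for backup (cluster names are hypothetical)
    hadoop distcp hdfs://clusterA:8020/data hdfs://clusterB:8020/backup/data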

Overview/Description
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. In this course, we will examine many of the HDFS administration and operational processes required to operate and maintain a Hadoop cluster. We will take a look at how to balance a Hadoop cluster, manage jobs, and perform backup and recovery for HDFS.

Target Audience
Administrators looking to expand their skills and knowledge surrounding operational activities of Hadoop clusters

Performance Tuning of Hadoop Clusters

Course Number:
df_ahop_a09_it_enus
Lesson Objectives

Performance Tuning of Hadoop Clusters

  • start the course
  • recall the three main functions of service capacity
  • describe different strategies of performance tuning
  • list some of the best practices for network tuning
  • install compression
  • describe the configuration files and parameters used in performance tuning of the operating system
  • describe the purpose of Java tuning
  • recall some of the rules for tuning the DataNode
  • describe the configuration files and parameters used in performance tuning of memory for daemons
  • describe the purpose of memory tuning for YARN
  • recall why the Node Manager kills containers
  • performance tune memory for the Hadoop cluster (see the configuration sketch after this list)
  • describe the configuration files and parameters used in performance tuning of HDFS
  • describe the sizing and balancing of the HDFS data blocks
  • describe the use of TestDFSIO
  • performance tune HDFS
  • describe the configuration files and parameters used in performance tuning of YARN
  • configure speculative execution
  • describe the configuration files and parameters used in performance tuning of MapReduce
  • tune MapReduce for performance
  • describe the practice of benchmarking on a Hadoop cluster
  • describe the different tools used for benchmarking a cluster
  • perform a benchmark of a Hadoop cluster
  • describe the purpose of application modeling
  • optimize memory and benchmark a Hadoop cluster
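
As a concrete reference for the memory tuning and benchmarking objectives, a sketch follows. The sizes are illustrative values for a worker with 64 GB of RAM, and the exact name and path of the job client test jar vary by Hadoop version.

    <!-- yarn-site.xml: memory the NodeManager may hand out to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>49152</value>
    </property>
    <!-- smallest and largest container the scheduler will grant -->
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value>
    </property>

    # Benchmark HDFS throughput: write 10 files of 1000 MB each, then read them back
    hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000MB
    hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000MB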

Overview/Description
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage. This course will focus on performance tuning of the Hadoop cluster. We will examine best practices and recommendations for performance tuning of the operating system, memory, HDFS, YARN, and MapReduce.

Target Audience
Administrators looking to expand their skill sets to include performance tuning Hadoop clusters

Securing Hadoop Clusters

Course Number:
df_ahop_a05_it_enus
Lesson Objectives

Securing Hadoop Clusters

  • start the course
  • describe the four pillars of the Hadoop security model
  • recall the ports required for Hadoop and how network gateways are used
  • set up security groups for AWS
  • describe Kerberos and recall some of the common commands
  • diagram Kerberos and label the primary components
  • prepare for a Kerberos installation
  • install Kerberos
  • configure Kerberos
  • describe how to configure HDFS and YARN for use with Kerberos
  • configure HDFS for Kerberos
  • configure YARN for Kerberos
  • describe how to configure Hive for use with Kerberos
  • configure Hive for Kerberos
  • describe how to configure Pig, Sqoop, and Oozie for use with Kerberos
  • configure Pig and HTTPFS for use with Kerberos
  • configure Oozie for use with Kerberos
  • configure Hue for use with Kerberos
  • describe how to configure Flume for use with Kerberos
  • describe the security model for users on a Hadoop cluster
  • describe the use of POSIX and ACL for managing user access
  • create access control lists (see the sketch after this list)
  • describe how to encrypt data in motion for Hadoop, Sqoop, and Flume
  • encrypt data in motion
  • describe how to encrypt data at rest
  • recall the primary security threats faced by the Hadoop cluster
  • describe how to monitor Hadoop security
  • configure HBase for Kerberos
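
Two small concrete pieces behind the objectives above: switching Hadoop's authentication to Kerberos in core-site.xml, and granting access with an HDFS access control list. The user name and path are hypothetical.

    <!-- core-site.xml (excerpt) -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>

    # Grant one user read/execute access to a directory, then verify
    hdfs dfs -setfacl -m user:alice:r-x /data/sensitive
    hdfs dfs -getfacl /data/sensitive

Note that ACLs must first be enabled with dfs.namenode.acls.enabled=true in hdfs-site.xml.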

Overview/Description
Hadoop development has allowed big data technologies to reach companies in all sectors of the economy, but as adoption grows, so do the security concerns. In this course, you will examine the risks and learn how to implement security protocols for Hadoop clusters.

Target Audience
Administrators looking to expand their skill set into Hadoop security

Stabilizing Hadoop Clusters

Course Number:
df_ahop_a07_it_enus
Lesson Objectives

Stabilizing Hadoop Clusters

  • start the course
  • describe the importance of event management
  • describe the importance of incident management
  • describe the different methodologies used for root cause analysis
  • recall what Ganglia is and what it can be used for
  • recall how Ganglia monitors Hadoop clusters
  • install Ganglia
  • describe Hadoop Metrics2
  • install Hadoop Metrics2 for Ganglia
  • describe how to use Ganglia to monitor a Hadoop cluster
  • use Ganglia to monitor a Hadoop cluster
  • recall what Nagios is and what it can be used for
  • install Nagios
  • use Nagios commands
  • use Nagios to monitor a Hadoop cluster
  • use Hadoop Metrics2 for Nagios
  • describe how to manage logging levels
  • describe how to configure Hadoop jobs for logging
  • describe how to configure log4j for Hadoop
  • describe how to configure JobHistoryServer logs
  • configure Hadoop logs (see the sketch after this list)
  • describe the problem management lifecycle
  • recall some of the best practices for problem management
  • describe the categories of errors for a Hadoop cluster
  • conduct a root cause analysis on a major problem
  • use different monitoring tools to identify problems, failures, errors, and solutions
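
Logging levels can be adjusted both persistently and at runtime. A minimal sketch follows; the package, daemon class, and web UI port (9870 for the NameNode in Hadoop 3, 50070 in Hadoop 2) are examples.

    # log4j.properties: quiet a noisy package persistently
    log4j.logger.org.apache.hadoop.hdfs.server.datanode=WARN

    # Raise a daemon's logging level at runtime, without a restart
    hadoop daemonlog -setlevel namenode-host:9870 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG
    hadoop daemonlog -getlevel namenode-host:9870 org.apache.hadoop.hdfs.server.namenode.NameNode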

Overview/Description
Apache Hadoop is increasing in popularity as a framework for large-scale, data-intensive applications. Tuning Hadoop clusters is vital to improving cluster performance. In this course, you will look at the importance of incident and log management and examine the best practices for root cause analysis.

Target Audience
Engineers looking to expand their skill sets in the area of Hadoop stability
