About Course

This comprehensive Big Data course is designed by industry experts around current industry job requirements to help you learn the Hadoop and Spark modules of Big Data. It is an industry-recognized Big Data Hadoop certification course that combines Hadoop development, Hadoop administration, Hadoop testing, and analytics with Apache Spark. The program will also prepare you to clear the Cloudera CCA175 Big Data certification exam.

Course Curriculum

Module 01 - Hadoop Installation and Setup

  • The architecture of Hadoop Cluster
  • What is High Availability and Federation?
  • How to set up a production cluster?
  • Various shell commands in Hadoop
  • Understanding configuration files in Hadoop
  • Installing a single node cluster with Cloudera Manager
  • Understanding Spark, Scala, Sqoop, Pig, and Flume

Module 02 - Introduction to Big Data Hadoop and Understanding HDFS and MapReduce

  • Introducing Big Data and Hadoop
  • What is Big Data and where does Hadoop fit in?
  • Two important Hadoop ecosystem components, namely, MapReduce and HDFS
  • In-depth Hadoop Distributed File System – replication, block size, Secondary NameNode, and High Availability; in-depth YARN – Resource Manager and Node Manager

Module 03 - Deep Dive in MapReduce

  • Learning the working mechanism of MapReduce
  • Understanding the mapping and reducing stages in MR
  • Various terminologies in MR like Input Format, Output Format, Partitioners, Combiners, Shuffle, and Sort
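The map, shuffle/sort, and reduce stages covered in this module can be sketched in plain Python. This is a conceptual illustration only, not Hadoop's Java MapReduce API; the function names and sample input are made up:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    """Shuffle & sort: group the intermediate pairs by key."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data big ideas", "data pipelines"]
result = dict(reduce_phase(shuffle_and_sort(map_phase(lines))))
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

In real Hadoop, the shuffle-and-sort step happens inside the framework between the map and reduce tasks; only the mapper and reducer are written by the developer.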

Module 04 - Introduction to Hive

  • Introducing Hadoop Hive
  • Detailed architecture of Hive
  • Comparing Hive with Pig and RDBMS
  • Working with Hive Query Language
  • Creation of a database, table, group by and other clauses
  • Various types of Hive tables, HCatalog
  • Storing the Hive Results, Hive partitioning, and Buckets
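The partitioning and bucketing topics in this module look like the following in HiveQL. This is an illustrative fragment only; the table and column names are made up:

```sql
-- Illustrative HiveQL: a partitioned, bucketed table (names are made up)
CREATE TABLE page_views (
    user_id  STRING,
    url      STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;

-- A GROUP BY query that prunes to a single partition
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY url;
```

Partitioning splits data into directories by column value, while bucketing hashes rows into a fixed number of files within each partition.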

Module 05 - Advanced Hive and Impala

  • Indexing in Hive
  • The Map Side Join in Hive
  • Working with complex data types
  • The Hive user-defined functions
  • Introduction to Impala
  • Comparing Hive with Impala
  • The detailed architecture of Impala

Module 06 - Introduction to Pig

  • Apache Pig introduction and its various features
  • Various data types and schemas in Pig
  • The available functions in Pig; Pig Bags, Tuples, and Fields


Module 07 - Flume, Sqoop and HBase

  • Apache Sqoop introduction
  • Importing and exporting data
  • Performance improvement with Sqoop
  • Sqoop limitations
  • Introduction to Flume and understanding the architecture of Flume
  • Introduction to HBase and the CAP theorem

Module 08 - Writing Spark Applications Using Scala

  • Using Scala for writing Apache Spark applications
  • Detailed study of Scala
  • The need for Scala
  • The concept of object-oriented programming
  • Various classes in Scala like getters, setters, constructors, abstract, extending objects, overriding methods
  • The Java and Scala interoperability
  • The concept of functional programming and anonymous functions
  • The Bobsrockets package and comparing mutable and immutable collections
  • Scala REPL, lazy values, control structures in Scala, Directed Acyclic Graph (DAG), the first Spark application using SBT/Eclipse, Spark Web UI, and Spark in the Hadoop ecosystem

Module 09 - Use Case Bobsrockets Package

  • Introduction to Scala packages and imports
  • The selective imports
  • The Scala test classes
  • Introduction to JUnit test class
  • JUnit interface via JUnit 3 suite for Scala test
  • Packaging of Scala applications in the directory structure
  • Examples of Spark Split and Spark Scala

Module 10 - Introduction to Spark

  • Introduction to Spark
  • How Spark overcomes the drawbacks of MapReduce
  • Understanding in-memory MapReduce
  • Interactive operations on MapReduce
  • Spark stack, fine-grained vs. coarse-grained updates, Spark on Hadoop YARN, HDFS revision, and YARN revision
  • The overview of Spark and how it is better than Hadoop
  • Deploying Spark without Hadoop
  • Spark history server and Cloudera distribution

Module 11 - Spark Basics

  • Spark installation guide
  • Memory management
  • Executor memory vs. driver memory
  • Working with Spark Shell
  • The concept of Resilient Distributed Datasets (RDD)
  • Learning to do functional programming in Spark

Module 12 - Working with RDDs in Spark

  • Spark RDD
  • Creating RDDs
  • RDD partitioning
  • Operations and transformation in RDD
  • Deep dive into Spark RDDs
  • The RDD general operations
  • RDDs as read-only partitioned collections of records
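A key property of RDD operations covered here is that transformations are lazy: nothing executes until an action forces evaluation. A minimal Python sketch of that idea using generators (not the actual Spark API; the function names are illustrative):

```python
def transform_map(data, fn):
    """Like an RDD map: builds a lazy pipeline, computes nothing yet."""
    return (fn(x) for x in data)

def transform_filter(data, pred):
    """Like an RDD filter: also lazy."""
    return (x for x in data if pred(x))

def action_collect(data):
    """Like an RDD action: triggers the whole pipeline."""
    return list(data)

numbers = range(1, 6)
pipeline = transform_filter(transform_map(numbers, lambda x: x * x),
                            lambda x: x % 2 == 1)
# Nothing has executed yet; collect() forces the computation:
result = action_collect(pipeline)
print(result)  # [1, 9, 25]
```

In Spark, this laziness is what lets the engine build a DAG of transformations and optimize the whole pipeline before any data is touched.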

Module 13 - Aggregating Data with Pair RDDs

  • Understanding the concept of key-value pair in RDDs
  • Learning how Spark makes MapReduce operations faster
  • Various operations of RDD
  • MapReduce interactive operations
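The key-value aggregation idea behind pair RDDs (e.g., Spark's reduceByKey operation) can be sketched in plain Python; the sample data is made up:

```python
def reduce_by_key(pairs, fn):
    """Conceptual reduceByKey: merge values per key with a binary function."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged

sales = [("us", 10), ("eu", 5), ("us", 7), ("eu", 3), ("apac", 4)]
totals = reduce_by_key(sales, lambda a, b: a + b)
print(totals)  # {'us': 17, 'eu': 8, 'apac': 4}
```

Spark runs this merge on each partition first and only shuffles the partial results, which is one reason reduceByKey is faster than a naive group-then-sum.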

Module 14 - Writing and Deploying Spark Applications

  • Comparing the Spark applications with Spark Shell
  • Creating a Spark application using Scala or Java
  • Deploying a Spark application
  • Scala-built applications
  • Creation of the mutable list, set and set operations, list, tuple, and concatenating list
  • Creating an application using SBT

Module 15 - Project Solution Discussion and Cloudera Certification Tips and Tricks

  • Working through the solution of the Hadoop project
  • Its problem statements and the possible solution outcomes
  • Preparing for the Cloudera certifications
  • Points to focus on for scoring the highest marks

Module 16 - Parallel Processing

  • Learning about Spark parallel processing
  • Deploying on a cluster
  • Introduction to Spark partitions
  • File-based partitioning of RDDs
  • Understanding of HDFS and data locality
  • Mastering the technique of parallel operations
  • Comparing repartition and coalesce
  • RDD actions

Module 17 - Spark RDD Persistence

  • The execution flow in Spark
  • Understanding the RDD persistence overview
  • Spark execution flow, and Spark terminology
  • Distributed shared memory vs. RDDs
  • RDD limitations
  • Spark shell arguments

Module 18 - Spark MLlib

  • Introduction to Machine Learning
  • Types of Machine Learning
  • Introduction to MLlib
  • Various ML algorithms supported by MLlib

Module 19 - Integrating Apache Flume and Apache Kafka

  • Why Kafka and what is Kafka?
  • Kafka architecture
  • Kafka workflow
  • Configuring Kafka cluster
  • Operations
  • Kafka monitoring tools
  • Integrating Apache Flume and Apache Kafka

Module 20 - Spark Streaming

  • Introduction to Spark Streaming
  • Features of Spark Streaming
  • Spark Streaming workflow
  • Initializing StreamingContext, discretized Streams (DStreams), input DStreams and Receivers
  • Transformations on DStreams, output operations on DStreams, windowed operators and why they are useful
  • Important windowed operators and stateful operators

Module 21 - Improving Spark Performance

  • Introduction to various variables in Spark like shared variables and broadcast variables
  • Learning about accumulators
  • The common performance issues
  • Troubleshooting the performance problems

Module 22 - Spark SQL and Data Frames

  • The context of SQL in Spark for providing structured data processing
  • JSON support in Spark SQL
  • Working with XML data
  • Parquet files
  • Creating Hive context
  • Reading JDBC files
  • Creating Data Frames
  • Working with CSV files
  • Data frame to JDBC
  • User-defined functions in Spark SQL

Module 23 - Scheduling/Partitioning

  • Learning about the scheduling and partitioning in Spark
  • Hash partition
  • Range partition
  • Scheduling within and across applications
  • Static partitioning, dynamic sharing, and fair scheduling
  • Map partition with index, the Zip, and GroupByKey
  • Spark Master high availability, standby Masters with ZooKeeper, single-node recovery with the local file system, and higher-order functions
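The hash and range partitioning schemes listed above can be sketched in plain Python. This is a conceptual illustration, not Spark's internal partitioner code; the partition counts, keys, and boundaries are made up:

```python
def hash_partition(key, num_partitions):
    """Hash partitioner: the same key always lands in the same partition."""
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    """Range partitioner: place keys by sorted boundary ranges."""
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

keys = [3, 15, 27, 41]
boundaries = [10, 20, 30]  # 4 partitions: <10, 10-19, 20-29, >=30
print([range_partition(k, boundaries) for k in keys])  # [0, 1, 2, 3]
```

Hash partitioning spreads keys evenly but scatters nearby keys; range partitioning keeps ordered keys together, which is why Spark uses it for sorted output.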

Who can join this course

  • IT Professionals
  • Software Developers
  • System Administrators
  • Project Managers
  • Database Administrators
  • Marketing Professionals
  • B.Tech Freshers and Graduates
  • Job Seekers