About Course

This comprehensive Big Data course is designed by industry experts around current industry job requirements to help you learn the Hadoop and Spark modules of Big Data. It is an industry-recognized Big Data Hadoop certification course that combines Hadoop development, Hadoop administration, Hadoop testing, and analytics with Apache Spark. The program will also prepare you to clear the Cloudera CCA175 Big Data certification exam.

Course Curriculum

Module 01 - Hadoop Installation and Setup

  • The architecture of Hadoop Cluster
  • What is High Availability and Federation?
  • How to set up a production cluster?
  • Various shell commands in Hadoop
  • Understanding configuration files in Hadoop
  • Installing a single node cluster with Cloudera Manager
  • Understanding Spark, Scala, Sqoop, Pig, and Flume

Module 02 - Introduction to Big Data Hadoop and Understanding HDFS and MapReduce

  • Introducing Big Data and Hadoop
  • What is Big Data and where does Hadoop fit in?
  • Two important Hadoop ecosystem components, namely, MapReduce and HDFS
  • In-depth Hadoop Distributed File System – replication, block size, Secondary NameNode, and High Availability; in-depth YARN – Resource Manager and Node Manager

Module 03 - Deep Dive in MapReduce

  • Learning the working mechanism of MapReduce
  • Understanding the mapping and reducing stages in MR
  • Various terminologies in MR like Input Format, Output Format, Partitioners, Combiners, Shuffle, and Sort
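The map, shuffle/sort, and reduce stages covered in this module can be sketched in plain Python. This is a conceptual illustration only, not Hadoop's Java MapReduce API; the function names and sample input are made up:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    """Shuffle & sort: group the intermediate pairs by key."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data big ideas", "data pipelines"]
result = dict(reduce_phase(shuffle_and_sort(map_phase(lines))))
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

In real Hadoop, the shuffle-and-sort step happens inside the framework between the map and reduce tasks; only the mapper and reducer are written by the developer.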

Module 04 - Introduction to Hive

  • Introducing Hadoop Hive
  • Detailed architecture of Hive
  • Comparing Hive with Pig and RDBMS
  • Working with Hive Query Language
  • Creation of a database, table, group by and other clauses
  • Various types of Hive tables, HCatalog
  • Storing the Hive Results, Hive partitioning, and Buckets
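The partitioning and bucketing topics in this module look like the following in HiveQL. This is an illustrative fragment only; the table and column names are made up:

```sql
-- Illustrative HiveQL: a partitioned, bucketed table (names are made up)
CREATE TABLE page_views (
    user_id  STRING,
    url      STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;

-- A GROUP BY query that prunes to a single partition
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY url;
```

Partitioning splits data into directories by column value, while bucketing hashes rows into a fixed number of files within each partition.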

Module 05 - Advanced Hive and Impala

  • Indexing in Hive
  • The Map Side Join in Hive
  • Working with complex data types
  • The Hive user-defined functions
  • Introduction to Impala
  • Comparing Hive with Impala
  • The detailed architecture of Impala

Module 06 - Introduction to Pig

  • Apache Pig introduction and its various features
  • Various data types and schemas in Pig
  • The available functions in Pig; Pig Bags, Tuples, and Fields


Module 07 - Flume, Sqoop and HBase

  • Apache Sqoop introduction
  • Importing and exporting data
  • Performance improvement with Sqoop
  • Sqoop limitations
  • Introduction to Flume and understanding the architecture of Flume
  • Introduction to HBase and the CAP theorem

Module 08 - Writing Spark Applications Using Scala

  • Using Scala for writing Apache Spark applications
  • Detailed study of Scala
  • The need for Scala
  • The concept of object-oriented programming
  • Various classes in Scala like getters, setters, constructors, abstract, extending objects, overriding methods
  • The Java and Scala interoperability
  • The concept of functional programming and anonymous functions
  • The Bobsrockets package and comparing mutable and immutable collections
  • Scala REPL, lazy values, control structures in Scala, Directed Acyclic Graph (DAG), the first Spark application using SBT/Eclipse, Spark Web UI, and Spark in the Hadoop ecosystem

Module 09 - Use Case Bobsrockets Package

  • Introduction to Scala packages and imports
  • The selective imports
  • The Scala test classes
  • Introduction to JUnit test class
  • JUnit interface via JUnit 3 suite for Scala test
  • Packaging of Scala applications in the directory structure
  • Examples of Spark Split and Spark Scala

Module 10 - Introduction to Spark

  • Introduction to Spark
  • How Spark overcomes the drawbacks of MapReduce
  • Understanding in-memory MapReduce
  • Interactive operations on MapReduce
  • Spark stack, fine-grained vs. coarse-grained updates, Spark on Hadoop YARN, HDFS revision, and YARN revision
  • The overview of Spark and how it is better than Hadoop
  • Deploying Spark without Hadoop
  • Spark history server and Cloudera distribution

Module 11 - Spark Basics

  • Spark installation guide
  • Memory management
  • Executor memory vs. driver memory
  • Working with Spark Shell
  • The concept of Resilient Distributed Datasets (RDD)
  • Learning to do functional programming in Spark

Module 12 - Working with RDDs in Spark

  • Spark RDD
  • Creating RDDs
  • RDD partitioning
  • Operations and transformation in RDD
  • Deep dive into Spark RDDs
  • The RDD general operations
  • RDDs as read-only partitioned collections of records
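A key property of RDD operations covered here is that transformations are lazy: nothing executes until an action forces evaluation. A minimal Python sketch of that idea using generators (not the actual Spark API; the function names are illustrative):

```python
def transform_map(data, fn):
    """Like an RDD map: builds a lazy pipeline, computes nothing yet."""
    return (fn(x) for x in data)

def transform_filter(data, pred):
    """Like an RDD filter: also lazy."""
    return (x for x in data if pred(x))

def action_collect(data):
    """Like an RDD action: triggers the whole pipeline."""
    return list(data)

numbers = range(1, 6)
pipeline = transform_filter(transform_map(numbers, lambda x: x * x),
                            lambda x: x % 2 == 1)
# Nothing has executed yet; collect() forces the computation:
result = action_collect(pipeline)
print(result)  # [1, 9, 25]
```

In Spark, this laziness is what lets the engine build a DAG of transformations and optimize the whole pipeline before any data is touched.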

Module 13 - Aggregating Data with Pair RDDs

  • Understanding the concept of key-value pair in RDDs
  • Learning how Spark makes MapReduce operations faster
  • Various operations of RDD
  • MapReduce interactive operations
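The key-value aggregation idea behind pair RDDs (e.g., Spark's reduceByKey operation) can be sketched in plain Python; the sample data is made up:

```python
def reduce_by_key(pairs, fn):
    """Conceptual reduceByKey: merge values per key with a binary function."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged

sales = [("us", 10), ("eu", 5), ("us", 7), ("eu", 3), ("apac", 4)]
totals = reduce_by_key(sales, lambda a, b: a + b)
print(totals)  # {'us': 17, 'eu': 8, 'apac': 4}
```

Spark runs this merge on each partition first and only shuffles the partial results, which is one reason reduceByKey is faster than a naive group-then-sum.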

Module 14 - Writing and Deploying Spark Applications

  • Comparing the Spark applications with Spark Shell
  • Creating a Spark application using Scala or Java
  • Deploying a Spark application
  • Scala-built applications
  • Creation of the mutable list, set and set operations, list, tuple, and concatenating list
  • Creating an application using SBT

Module 15 - Project Solution Discussion and Cloudera Certification Tips and Tricks

  • Working through the solution of the Hadoop project
  • Its problem statements and the possible solution outcomes
  • Preparing for the Cloudera certifications
  • Points to focus on for scoring the highest marks

Module 16 - Parallel Processing

  • Learning about Spark parallel processing
  • Deploying on a cluster
  • Introduction to Spark partitions
  • File-based partitioning of RDDs
  • Understanding of HDFS and data locality
  • Mastering the technique of parallel operations
  • Comparing repartition and coalesce
  • RDD actions

Module 17 - Spark RDD Persistence

  • The execution flow in Spark
  • Understanding the RDD persistence overview
  • Spark execution flow, and Spark terminology
  • Distributed shared memory vs. RDDs
  • RDD limitations
  • Spark shell arguments

Module 18 - Spark MLlib

  • Introduction to Machine Learning
  • Types of Machine Learning
  • Introduction to MLlib
  • Various ML algorithms supported by MLlib

Module 19 - Integrating Apache Flume and Apache Kafka

  • Why Kafka and what is Kafka?
  • Kafka architecture
  • Kafka workflow
  • Configuring Kafka cluster
  • Operations
  • Kafka monitoring tools
  • Integrating Apache Flume and Apache Kafka

Module 20 - Spark Streaming

  • Introduction to Spark Streaming
  • Features of Spark Streaming
  • Spark Streaming workflow
  • Initializing StreamingContext, discretized Streams (DStreams), input DStreams and Receivers
  • Transformations on DStreams, output operations on DStreams, windowed operators and why they are useful
  • Important windowed operators and stateful operators

Module 21 - Improving Spark Performance

  • Introduction to various variables in Spark like shared variables and broadcast variables
  • Learning about accumulators
  • The common performance issues
  • Troubleshooting the performance problems

Module 22 - Spark SQL and Data Frames

  • The context of SQL in Spark for providing structured data processing
  • JSON support in Spark SQL
  • Working with XML data
  • Parquet files
  • Creating Hive context
  • Reading JDBC files
  • Creating Data Frames
  • Working with CSV files
  • Data frame to JDBC
  • User-defined functions in Spark SQL

Module 23 - Scheduling/Partitioning

  • Learning about the scheduling and partitioning in Spark
  • Hash partition
  • Range partition
  • Scheduling within and across applications
  • Static partitioning, dynamic sharing, and fair scheduling
  • Map partition with index, the Zip, and GroupByKey
  • Spark Master high availability, standby Masters with ZooKeeper, single-node recovery with the local file system, and higher-order functions
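The hash and range partitioning schemes listed above can be sketched in plain Python. This is a conceptual illustration, not Spark's internal partitioner code; the partition counts, keys, and boundaries are made up:

```python
def hash_partition(key, num_partitions):
    """Hash partitioner: the same key always lands in the same partition."""
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    """Range partitioner: place keys by sorted boundary ranges."""
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

keys = [3, 15, 27, 41]
boundaries = [10, 20, 30]  # 4 partitions: <10, 10-19, 20-29, >=30
print([range_partition(k, boundaries) for k in keys])  # [0, 1, 2, 3]
```

Hash partitioning spreads keys evenly but scatters nearby keys; range partitioning keeps ordered keys together, which is why Spark uses it for sorted output.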

Who can join this course

  • IT Professionals
  • Software Developers
  • System Administrators
  • Project Managers
  • Database Administrators
  • Marketing Professionals
  • B.Tech Freshers and Graduates
  • Job Seekers