Hadoop & Big Data


    Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is a top-level project of the Apache Software Foundation.

    Hadoop makes it possible to run applications on clusters of thousands of nodes handling thousands of terabytes of data. Its distributed file system provides rapid data transfer among nodes and allows the system to continue operating uninterrupted when a node fails. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.


Course Content

Linux Basics

Java Basics

SQL Basics

Introduction to Big Data and Hadoop

  • What is Big Data?
  • What are the challenges for processing big data?
  • What technologies support big data?
  • What is Hadoop?
  • Why Hadoop?
  • History of Hadoop
  • Use cases of Hadoop
  • Hadoop ecosystem
  • HDFS
  • MapReduce
  • Statistics

Setting Up the Hadoop Environment

  • Pseudo-distributed mode
  • Cluster mode
  • IPv6
  • SSH
  • Installation of Java and Hadoop
  • Hadoop configuration
  • Hadoop processes (NN, SNN, JT, DN, TT)
  • Temporary directory
  • Web UI
  • Common errors when running a Hadoop cluster, and their solutions
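For the configuration step above, a minimal core-site.xml for pseudo-distributed mode might look like the following sketch. The port and temporary directory are illustrative, and the property names are the Hadoop 1.x ones that match the JT/TT daemons this course covers:

```xml
<!-- core-site.xml: pseudo-distributed setup (values are illustrative) -->
<configuration>
  <property>
    <!-- Where clients find the NameNode -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <!-- Base for Hadoop's temporary directories -->
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
</configuration>
```

The temporary directory must exist and be writable by the user running the daemons; forgetting this is one of the common startup errors covered in this module.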

Hadoop Processes

  • NameNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker
  • DataNode

HDFS

  • Configuring HDFS
  • Interacting With HDFS
  • HDFS Permissions and Security
  • Additional HDFS Tasks
  • HDFS Overview and Architecture
  • HDFS Installation
  • Hadoop File System Shell
  • File System Java API

Understanding the Cluster

  • Typical workflow
  • Writing files to HDFS
  • Reading files from HDFS
  • Rack awareness
  • The five Hadoop daemons

Let’s talk MapReduce

  • Before MapReduce
  • MapReduce overview
  • Word count problem
  • Word count flow and solution
  • MapReduce flow
  • Algorithms for simple problems
  • Algorithms for complex problems

Developing the MapReduce Application

  • Data Types
  • File Formats
  • Explain the Driver, Mapper and Reducer code
  • Configuring development environment – Eclipse
  • Writing unit test
  • Running locally
  • Running on cluster
  • Hands-on exercises
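Besides a Java Driver/Mapper/Reducer, a job can be driven from the command line with Hadoop Streaming, which lets any executable act as mapper or reducer. A sketch for the "running on cluster" step, with an illustrative jar path (Hadoop 1.x layout) and made-up HDFS paths:

```shell
# Identity mapper (/bin/cat) and a trivial reducer (wc),
# purely to show the job submission shape
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input  /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

# Inspect the result written by the reducer
hadoop fs -cat /user/hadoop/output/part-00000
```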

How MapReduce Works

  • Anatomy of MapReduce job run
  • Job submission
  • Job initialization
  • Task assignment
  • Job completion
  • Job scheduling
  • Job failures
  • Shuffle and sort
  • Oozie workflows
  • Hands-on exercises

MapReduce Types and Formats

  • MapReduce types
  • Input Formats – Input splits & records, text input, binary input, multiple inputs & database input
  • Output Formats – text output, binary output, multiple outputs, lazy output and database output
  • Hands-on exercises

MapReduce Features

  • Counters
  • Sorting
  • Joins – Map side and reduce side
  • Side data distribution
  • MapReduce combiner
  • MapReduce partitioner
  • MapReduce distributed cache
  • Hands-on exercises

Pig

  • Pig Overview
  • Installation
  • Pig Latin
  • Pig with HDFS
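A small Pig Latin script gives the flavor of the topics above. The HDFS path and the schema are illustrative; a Pig installation pointed at HDFS is assumed:

```pig
-- Load a tab-separated file from HDFS (path and schema are made up)
users  = LOAD '/user/hadoop/users.txt' AS (name:chararray, age:int);

-- Filter, group, and aggregate
adults = FILTER users BY age >= 18;
by_age = GROUP adults BY age;
counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;

-- Print the result (compiled into MapReduce jobs under the hood)
DUMP counts;
```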

Hive

  • Hive Overview
  • Installation
  • Hive QL
  • Analyzing unstructured data with Hive
  • Analyzing semi-structured data with Hive
  • Hive UDF
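To illustrate HiveQL, here is a sketch of defining a table over data already in HDFS and querying it with familiar SQL syntax. Table name, columns, and path are illustrative:

```sql
-- Table over tab-separated data (names and path are made up)
CREATE TABLE logs (ip STRING, ts STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Move a file already in HDFS under the table
LOAD DATA INPATH '/user/hadoop/logs.tsv' INTO TABLE logs;

-- SQL-style query, compiled by Hive into MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```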

HBase

  • HBase Overview and Architecture
  • HBase Installation
  • HBase Shell
  • CRUD operations
  • Scanning and Batching
  • Filters
  • HBase Key Design
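The CRUD, scanning, and filtering topics above map onto a handful of HBase shell commands. The table, column family, and values below are illustrative; the commands are entered inside `hbase shell`:

```
create 'users', 'info'                  # table with one column family
put 'users', 'row1', 'info:name', 'alice'   # Create/Update a cell
get 'users', 'row1'                     # Read one row
scan 'users', {LIMIT => 10}             # Scan with batching limit
delete 'users', 'row1', 'info:name'     # Delete a cell
disable 'users'                         # must disable before dropping
drop 'users'
```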

ZooKeeper

  • ZooKeeper Overview
  • Installation
  • Server Maintenance

Sqoop

  • Sqoop Overview
  • Installation
  • Imports and Exports
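Sqoop's imports and exports are driven from the command line. A sketch with illustrative connection details, database, and table names (a reachable RDBMS and a running cluster are assumed):

```shell
# Import a relational table into HDFS (-P prompts for the password)
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders

# Export HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table order_summary \
  --export-dir /user/hadoop/summary
```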

Working with Flume

  • Introduction
  • Configuration and Setup
  • Flume Source with example
  • Channel
  • Flume Sink with example
  • Complex Flume architectures
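A single-agent Flume configuration wires a source, a channel, and a sink together, as in the sketch below. The agent name, file paths, and capacity are illustrative:

```properties
# One agent: source -> channel -> sink (all names are made up)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: tail a log file with the exec source
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: write events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```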

Configuration

  • Basic Setup
  • Important Directories
  • Selecting Machines
  • Cluster Configurations
  • Small Clusters: 2-10 Nodes
  • Medium Clusters: 10-40 Nodes
  • Large Clusters: Multiple Racks

Integrations

  • Distributed installations
  • Best Practices