About Spark
Apache Spark™ is a unified analytics engine for large-scale data processing.

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Course Contents

The following topics are covered in the Spark course:

  • Starting an HDP 3.x Cluster
  • Introduction to the Hadoop Distributed File System (HDFS)
  • Demonstration: Understanding Block Storage
  • Using HDFS Commands
  • Big Data
  • Big Data and HDP
  • Environment Setup
  • Installing HDP
  • Managing Ambari Users and Groups
  • Managing Cluster Nodes
  • Adding, Decommissioning, and Recommissioning Worker Nodes
  • Components of Spark
  • Downloading and setup (with Hands-On Exercise)
  • Core Spark - Driver Program & SparkContext, worker nodes, Executor, tasks
  • Spark standalone application (with Hands-On Exercise)
  • Spark vs. Hadoop
  • Scala API
  • Python API
  • Scala Introduction
  • Scala Programming
  • Why Scala
  • Installation (with Hands-On Exercise)
  • Sample program
  • Scala execution workflow
  • Data types
  • First Scala program
  • Values and variables
  • Singleton Object
  • Functions
  • Classes & Objects
  • Constructors
  • Access modifiers
  • Control structures
  • If else
  • Loops
  • Arrays (with Hands-On Exercise)
  • File I/O
  • Database connectivity using JDBC
  • Use case 1
  • Maven
  • SBT (with Hands-On Exercise)
  • Intro to Spark (with Hands-On Exercise)
  • Installation of Spark
  • Hardware requirements
  • Software requirements
  • Configuring and running the Spark cluster
  • Your first Spark program
  • Coding Spark jobs in Scala
  • Tools and utilities for administrators/developers
  • Scaling out the cluster
  • Batch versus real-time data processing
  • Batch processing
  • Real-time data processing
  • Architecture of Spark
  • Architecture of Spark Streaming
  • Cluster components
  • Memory configuration & management
  • Intro to RDD
  • Partitions
  • Immutability & Lineage
  • Types of RDD
  • Operations on RDD
  • DataFrame and SparkSQL operations
  • RDD intro
  • creating RDDs (with Hands-On Exercise)
  • RDD operations (with Hands-On Exercise)
  • Data types
  • Transformations and functions (with Hands-On Exercise)
  • Caching (with Hands-On Exercise)
  • Loading and saving your data (with Hands-On Exercise)
  • Input sources
  • Output operations (with Hands-On Exercise)
  • Loading from S3 (with Hands-On Exercise)
  • Loading from HDFS
  • Hadoop Configuration for Spark
  • Spark SQL
  • Aggregations
  • Databases: HBase
  • Hands-On Exercise
  • Spark packaging structure and client APIs
  • Spark Core
  • SparkContext and Spark Config
  • RDD – APIs
  • Other Spark Core packages
  • Spark libraries and extensions
  • Spark Streaming
  • Spark MLlib
  • Spark SQL
  • Spark GraphX
  • Resilient distributed datasets and discretized streams
  • Resilient distributed datasets
  • Motivation behind RDD
  • Fault tolerance
  • Transformations and actions
  • RDD storage
  • RDD persistence
  • Shuffling in RDD
  • Discretized streams
  • Data loading from distributed and varied sources
  • Accumulators
  • Broadcast variables
  • Numeric RDD operations
  • Spark runtime architecture
  • Deploying applications
  • Packaging code with dependencies
  • Scheduling
  • Cluster managers
  • Hands-On Exercise
  • Setting Up Spark Cluster
  • Configuring Spark with SparkConf
  • Components of execution: Jobs
  • Finding information
  • Understanding the Structure of Data and the Need of Spark SQL
  • Anatomy of Spark SQL
  • DataFrame Programming
  • Understanding Aggregations and Multi-Datasource Joining with SparkSQL
  • Introducing Datasets and Understanding Data Catalogs
  • Getting Started with the SparkSession (or HiveContext or SQLContext)
  • Spark SQL Dependencies
  • Basics of Schemas
  • DataFrame API
  • Transformations
  • Multi-DataFrame Transformations
  • Plain Old SQL Queries and Interacting with Hive Data
  • Data Representation in DataFrames and Datasets
  • Data Loading and Saving Functions
  • DataFrameWriter and DataFrameReader
  • Formats
  • Save Modes
  • Partitions (Discovery and Writing)
  • Datasets
  • Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
  • Query Optimizer
  • Debugging Spark SQL Queries
  • JDBC/ODBC Server
  • Spark Stream Processing
  • Data Stream Processing and Micro Batch Data Processing
  • A Log Event Processor
  • Windowed Data Processing and More Processing Options
  • Kafka Stream Processing
  • Spark Streaming Jobs in Production
  • Metrics and Debugging
  • Spark WebUI
  • Monitoring Spark jobs
  • Evaluating Spark jobs
  • Memory consumption and resource allocation
  • Job metrics
  • Monitoring tools for Spark
  • Debugging & troubleshooting Spark jobs
  • Understanding Machine Learning and the Need of Spark for it
  • Wine Quality Prediction and Model Persistence
  • Wine Classification
  • Spam Filtering
  • Feature Algorithms and Finding Synonyms
  • The Need for Spark and the Basics of the R Language
  • DataFrames in R and Spark
  • Spark DataFrame Programming with R
  • Understanding Aggregations and Multi-Datasource Joins in SparkR
  • Charting and Plotting Libraries and Setting Up a Dataset
  • Charts
  • Bar Chart and Pie Chart
  • Scatter Plot and Line Graph
  • Designing Spark Applications
  • Lambda Architecture
  • Estimating cluster resource requirements
  • Hands-On Use Cases / PoCs on the Spark Stack
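As a small preview of the material above, the sketch below shows plain Scala (no Spark dependency): values, a function, a simple class, and the functional collection operations (map, filter, groupBy) whose style Spark's RDD transformations mirror. The names `ScalaPreview`, `wordCount`, and `Temperature` are illustrative, not part of the course material.

```scala
// A self-contained preview of the Scala fundamentals covered in the course:
// values, functions, classes, and functional collection operations.
// Plain Scala only -- this is an analogy to RDD-style processing, not Spark code.
object ScalaPreview {
  // Values (val) are immutable; variables (var) are mutable.
  val greeting: String = "Hello, Spark course"

  // Word count over a line of text: the classic warm-up for RDD-style jobs.
  def wordCount(text: String): Map[String, Int] =
    text.toLowerCase
      .split("\\s+")             // tokenize on whitespace
      .filter(_.nonEmpty)        // drop empty tokens
      .groupBy(identity)         // group identical words together
      .map { case (word, occurrences) => (word, occurrences.length) }

  // A simple class with a constructor parameter and a method.
  class Temperature(val celsius: Double) {
    def isFreezing: Boolean = celsius <= 0.0
  }

  def main(args: Array[String]): Unit = {
    println(greeting)
    println(wordCount("to be or not to be")) // Map(to -> 2, be -> 2, or -> 1, not -> 1)
    println(new Temperature(-5.0).isFreezing)
  }
}
```

In actual Spark code, the same `map`/`filter` pipeline would run over an RDD or DataFrame distributed across the cluster rather than a local collection, which is the shift the RDD modules of the course build toward.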

Have a Question?

Contact us

Website: robochef.co