Apache Spark is a fast, general-purpose cluster computing system used as a unified analytics engine for big data and machine learning applications. Apache Spark is an engine that can:
Process data in both real-time and batch mode
Respond in sub-second time
Perform in-memory processing
According to the Spark documentation, it is an alternative to Hadoop MapReduce:
Up to 100 times faster than Hadoop MapReduce for in-memory processing
Around 10 times faster when processing data from disk
Designed for both high speed and ease of use
It provides high-level APIs in Java, Scala, Python, and R
This article explains how to set up and use Apache Spark in a multi-node cluster environment. Apache Spark is used for distributed, in-memory processing and is popular because of the components it offers:
Spark Core – Consumes and processes data in batch mode
Spark Streaming – Consumes and processes continuous data streams
Clients – Interactive processing through the bundled shells
Spark SQL – Allows SQL queries to be used for structured data processing
MLlib – A machine learning library that delivers high-quality algorithms
GraphX – A library for graph processing
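These modules are exposed through Spark's language APIs and interactive shells. As a rough illustration (assuming Spark is extracted to /usr/local/spark, the installation directory used later in this article), the bundled interactive clients are launched as follows:
# Interactive clients that ship with the Spark distribution
/usr/local/spark/bin/spark-shell   # Scala shell
/usr/local/spark/bin/pyspark       # Python shell
/usr/local/spark/bin/sparkR        # R shell
/usr/local/spark/bin/spark-sql     # SQL command-line client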
Apache Spark supports multiple resource managers:
Standalone – A basic cluster manager that ships with the Spark compute engine. It provides core functionality such as memory management, fault recovery, and task scheduling.
Apache YARN – The cluster manager that comes with Hadoop
Apache Mesos – Another general-purpose cluster manager
Kubernetes – A general-purpose container orchestration platform
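The cluster manager is selected with the --master option of spark-submit. The following is a minimal sketch of the master URL for each manager; the standalone URL uses this article's master hostname with the default standalone port 7077, while the YARN, Mesos, and Kubernetes hosts (and my_app.py itself) are illustrative placeholders:
# Standalone cluster manager (the one used in this article)
spark-submit --master spark://master.spark.com:7077 my_app.py
# Apache YARN (requires a Hadoop installation and HADOOP_CONF_DIR)
spark-submit --master yarn my_app.py
# Apache Mesos (5050 is the default Mesos master port)
spark-submit --master mesos://mesos-master:5050 my_app.py
# Kubernetes (points at the Kubernetes API server)
spark-submit --master k8s://https://k8s-apiserver:6443 my_app.py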
Every developer needs a local environment to run and test Spark applications. This article explains the detailed steps for setting up a multi-node Apache Spark cluster.
When an application is submitted to Spark, it creates one driver process and multiple executor processes for that application across the nodes.
The entire set of driver and executors is dedicated to that application.
The driver is responsible for analysing, distributing, scheduling, and monitoring the work, and it maintains all of the application's state during its lifetime.
Each node can run multiple executors. Executors are responsible for executing the code assigned to them by the driver and for reporting the status of the computation back to the driver.
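As a sketch of how driver and executor resources are requested at submission time, the command below runs the Pi example that ships with the Spark distribution against the standalone master built in this article; the memory and core values are only illustrative:
# One driver is created for the application; executors are launched
# on the worker nodes within the requested memory and core limits
/usr/local/spark/bin/spark-submit \
  --master spark://master.spark.com:7077 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --total-executor-cores 2 \
  /usr/local/spark/examples/src/main/python/pi.py 10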
The steps in this article were tested with the below laptop configuration:
RAM: 12GB DDR4
Processor: Intel Core i5, 4 cores
Graphics Processor: NVIDIA GeForce 940MX
Create machines with the below configuration in VirtualBox:
Apache Spark needs static IP addresses for the nodes to communicate with each other. VirtualBox provides a Host-only network mode for communication between the host and its guests: nodes created with this network mode can communicate with each other, and the VirtualBox host machine (the laptop) can access every VM connected to the host-only network. In VirtualBox, navigate to File -> Host Network Manager.
Click on Create -> Configure Adapter manually with IPv4 Address: 192.168.56.1 and Network Mask: 255.255.255.0
Select the machine -> Settings
Navigate to Network -> Adapter 1 and set as below:
Since the nodes also need internet access, NAT is used. Select Adapter 2 and configure as below:
Two Ethernet networks must now show as connected inside each VM.
Click on Ethernet 1 Settings -> IPv4 -> Manual and set the static address for each node:
For Master/Driver
Address: 192.168.56.101
Network Mask: 255.255.255.0
Gateway: 192.168.56.1
For node1
Address: 192.168.56.102
Network Mask: 255.255.255.0
Gateway: 192.168.56.1
For node2
Address: 192.168.56.103
Network Mask: 255.255.255.0
Gateway: 192.168.56.1
Click on Ethernet 2 Settings -> IPv4 -> Automatic (DHCP) on all the nodes.
Open the hostname file on master, node1, and node2 and set the respective hostname as below:
sudo nano /etc/hostname
# Replace the contents with the machine's hostname
* On master - master.spark.com
* On node1 - node1.spark.com
* On node2 - node2.spark.com
Then, on all three machines, map the static IP addresses to the hostnames:
sudo nano /etc/hosts
# Add below lines
192.168.56.101 master.spark.com
192.168.56.102 node1.spark.com
192.168.56.103 node2.spark.com
Next, reboot all the machines:
reboot
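After the machines come back up, it is worth verifying that each node can reach the others by hostname. For example, from the master:
# Confirm the static IPs and hostname resolution
ping -c 3 node1.spark.com
ping -c 3 node2.spark.com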
Install Java on all three machines. OpenJDK 11 is available from the default Ubuntu repositories:
sudo apt-get update
sudo apt-get install openjdk-11-jdk
java -version
Install SSH on all three machines, then set up passwordless login from the master to every node (ubuntu is the account used on the VMs):
sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-copy-id ubuntu@master.spark.com
ssh-copy-id ubuntu@node1.spark.com
ssh-copy-id ubuntu@node2.spark.com
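To confirm that passwordless login works, a quick check from the master should print each worker's hostname without asking for a password:
ssh ubuntu@node1.spark.com hostname
ssh ubuntu@node2.spark.com hostname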
On the master, list the worker hostnames in the slaves file inside Spark's conf directory (/usr/local/spark/conf in this setup):
sudo nano /usr/local/spark/conf/slaves
# Add below lines
node1.spark.com
node2.spark.com
Disable the firewall on all machines so the nodes can reach each other, then start the cluster from the master:
sudo ufw disable
cd /usr/local/spark
./sbin/start-all.sh
The Spark master web UI is now available at:
http://192.168.56.101:8080/
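As a quick sanity check, the jps tool that ships with the JDK should show a Master process on the master and a Worker process on each node; the web UI should also list both workers as ALIVE:
# On the master
jps   # expect a Master process
# On node1 and node2
jps   # expect a Worker process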
Install Jupyter on the master and configure PySpark to use it as the driver front end:
pip install jupyter
sudo nano ~/.bashrc
# Add below lines (adjust JAVA_HOME if OpenJDK 11 is installed in a different location)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH:~/.local/bin:$JAVA_HOME/bin
source ~/.bashrc
Start the notebook server so that it can be reached from the host laptop:
jupyter notebook --ip 0.0.0.0
The notebook is then available from the host at:
http://192.168.56.101:8888/
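Alternatively, because PYSPARK_DRIVER_PYTHON is set to jupyter, launching the PySpark shell against the standalone master should open a notebook whose kernel already has the spark session connected to the cluster; the core count below is just an example:
# Starts Jupyter as the PySpark driver, attached to the standalone cluster
pyspark --master spark://master.spark.com:7077 --total-executor-cores 2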
Source:
https://www.innovationmerge.com/2021/06/26/Setting-up-a-multi-node-Apache-Spark-Cluster-on-a-Laptop/