
Setting up a multi-node Apache Spark Cluster on a local Windows machine with VirtualBox

6 mins read

Prerequisite

Apache Spark is a fast, general-purpose cluster computing system. It is used as a lightning-fast unified analytics engine for big data and machine learning applications. Apache Spark is an engine that can:

Process data in both real-time and batch mode

Respond with sub-second latency

Perform in-memory processing

According to the Spark documentation, it is an alternative to Hadoop MapReduce:

Up to 100 times faster than Hadoop MapReduce when data is processed in memory

About 10 times faster when data is processed from disk

Designed for both speed and ease of use

It provides high-level APIs in Java, Scala, Python, and R.

This article explains how to set up and use Apache Spark in a multi-node cluster environment. Apache Spark is used for distributed, in-memory processing, and it is popular because of the components it offers:

Spark Core – the base engine; consumes and processes batch data

Spark Streaming – consumes and processes continuous data streams

Clients – interactive processing (for example, the Spark shells)

Spark SQL – SQL queries for structured data processing

MLlib – a machine-learning library that delivers high-quality algorithms

GraphX – a library for graph processing

Apache Spark supports multiple resource managers:

Standalone – a basic cluster manager that comes with the Spark compute engine; it provides basic functionality such as memory management, fault recovery, and task scheduling

Apache YARN – the cluster manager of Hadoop

Apache Mesos – another general-purpose cluster manager

Kubernetes – a general-purpose container orchestration platform

Every developer needs a local environment to run and test Spark applications. This article explains the detailed steps for setting up a multi-node Apache Spark cluster.

Block Diagram (Source: iNNovationMerge)

When an application is submitted to Spark, it creates one driver process and multiple executor processes for that application across the nodes.

Together, the driver and its executors are dedicated to the application for as long as it runs.

The driver is responsible for analysing, distributing, scheduling, and monitoring the work, and it maintains all of the application's state during its lifetime.

Each node can run multiple executors. Executors run the tasks assigned to them by the driver and report their status back to the driver.
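For example, driver and executor resources can be requested explicitly when an application is submitted. The sketch below is only illustrative: my_app.py is a placeholder script, the resource sizes are arbitrary, and the master URL matches the standalone cluster set up later in this article.

# Hypothetical submission; my_app.py and the resource sizes are placeholders
/usr/local/spark/bin/spark-submit \
  --master spark://master.spark.com:7077 \
  --driver-memory 1g \
  --executor-memory 512m \
  --total-executor-cores 4 \
  my_app.py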

Hardware Requirements

The steps in this article were tested on a laptop with the following configuration:

  • RAM : 12GB DDR4
  • Processor : Intel Core i5, 4 cores
  • Graphic Processor : NVIDIA GeForce 940MX

Software Requirements

  • Oracle VirtualBox
  • Ubuntu (guest OS for the three virtual machines)
  • OpenJDK 11
  • Apache Spark
  • Python 3 with Jupyter Notebook

Implementation

Create Master and Worker Nodes

Create virtual machines with the following configuration in VirtualBox:

  • Master – 2 vCPU, 3GB RAM, Ubuntu OS
  • Node1 – 2 vCPU, 1GB RAM, Ubuntu OS
  • Node2 – 2 vCPU, 1GB RAM, Ubuntu OS

Create Host-only network

Apache Spark needs static IP addresses to communicate between nodes. VirtualBox has a Host-only network mode for communication between the host and guests. In simple words, nodes created with this network mode can communicate with each other, and the VirtualBox host machine (the laptop) can access all VMs connected to the host-only network. In VirtualBox, navigate to File -> Host Network Manager.

Host Network Manager (Source: iNNovationMerge)

Click on Create -> Configure Adapter manually with IPv4 Address: 192.168.56.1 and Network Mask: 255.255.255.0

Configure Adapter manually (Source: iNNovationMerge)
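If you prefer the command line, the same host-only interface can be created with VBoxManage. The interface name vboxnet0 below is an assumption (on Windows hosts it is usually "VirtualBox Host-Only Ethernet Adapter"); check the name reported by the list command.

VBoxManage hostonlyif create
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1 --netmask 255.255.255.0
VBoxManage list hostonlyifs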

Assign a Host-only network for Master and worker nodes

Select the machine -> Settings

Select machine (Source: iNNovationMerge)

Navigate to Network -> Adapter1 and set as below:

Adapter 1 Settings (Source: iNNovationMerge)

Assign NAT Network Adapter for Master and worker nodes

Since the nodes also need internet access, a NAT adapter is used. Select Adapter 2 and configure it as below:

Adapter 2 Settings (Source: iNNovationMerge)
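The same adapter layout can also be applied from the command line while a VM is powered off. The VM name "Master" below is an assumption; repeat the commands for Node1 and Node2, and use the host-only interface name reported by VBoxManage list hostonlyifs.

# Adapter 1: host-only, Adapter 2: NAT
VBoxManage modifyvm "Master" --nic1 hostonly --hostonlyadapter1 "vboxnet0"
VBoxManage modifyvm "Master" --nic2 nat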

Verify the network configuration

Verify network (Source: iNNovationMerge)

Start the master and worker nodes from VirtualBox

Start machine (Source: iNNovationMerge)

Check network settings in each node

Two Ethernet networks must be connected

Network settings (Source: iNNovationMerge)

Click on Ethernet 1 Settings -> IPv4 -> Manual

Ethernet 1 Settings (Source: iNNovationMerge)

For Master/Driver

Address: 192.168.56.101

Network Mask: 255.255.255.0

Gateway: 192.168.56.1

For node1

Address: 192.168.56.102

Network Mask: 255.255.255.0

Gateway: 192.168.56.1

For node2

Address: 192.168.56.103

Network Mask: 255.255.255.0

Gateway: 192.168.56.1

Click on Ethernet 2 Settings -> IPv4 -> Automatic(DHCP) in all the nodes

Ethernet 2 Settings (Source: iNNovationMerge)
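If you prefer the terminal over the GUI, the same static address can be set with nmcli. The connection name "Wired connection 1" below is only an assumption (list yours with nmcli connection show), and this example is for the master, so use 192.168.56.102/103 on the workers.

# Find the connection name of the host-only adapter, then set its static address
nmcli connection show
nmcli connection modify "Wired connection 1" ipv4.method manual ipv4.addresses 192.168.56.101/24 ipv4.gateway 192.168.56.1
nmcli connection up "Wired connection 1"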

Check network connectivity between the host and the nodes, for example with ping as sketched after the list below.

Connectivity (Source: iNNovationMerge)

Get all Host-only network IPs

  • master – 192.168.56.101
  • node1 – 192.168.56.102
  • node2 – 192.168.56.103
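A quick way to confirm that the machines can reach each other on the host-only network, from the host and from each VM:

ping -c 3 192.168.56.101
ping -c 3 192.168.56.102
ping -c 3 192.168.56.103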

Set hostname

Open the hostname file on master, node1, and node2 and set the respective hostname as below:

Set Hostname (Source: iNNovationMerge)

sudo nano /etc/hostname
# Set the single line in this file on each machine:
#   master -> master.spark.com
#   node1  -> node1.spark.com
#   node2  -> node2.spark.com

Add network information to the hosts file of the master, node1, and node2

sudo nano /etc/hosts

# Add below lines
192.168.56.101	master.spark.com
192.168.56.102	node1.spark.com
192.168.56.103	node2.spark.com

Next, reboot all the machines:

sudo reboot

Install Java on all the nodes:

sudo apt-get update
sudo apt-get install openjdk-11-jdk
java -version
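The Spark scripts and the ~/.bashrc entries later in this article expect JAVA_HOME to be set. The path below is the usual location of OpenJDK 11 on 64-bit Ubuntu; verify it on your machines before adding the line.

# Point JAVA_HOME at the installed JDK (path may differ on your system)
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc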

Set up passwordless SSH from the master to all nodes. Note that openssh-server must also be installed on node1 and node2 so that the master (and the Spark start scripts) can reach them.

SSH Setup (Source: iNNovationMerge)

sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-copy-id ubuntu@master.spark.com
ssh-copy-id ubuntu@node1.spark.com
ssh-copy-id ubuntu@node2.spark.com
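Passwordless login from the master should now work. A quick check (replace ubuntu if your VMs use a different user, matching the ssh-copy-id commands above):

ssh ubuntu@node1.spark.com hostname
ssh ubuntu@node2.spark.com hostname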

Configure the worker ("slaves") information, only on the master:

sudo nano /usr/local/spark/conf/slaves
# Add below lines (recent Spark releases name this file "workers" instead of "slaves")
node1.spark.com
node2.spark.com
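Both the slaves file above and the start scripts below assume a Spark distribution already unpacked at /usr/local/spark on every node. If it is not installed yet, a minimal sketch looks like the following; 3.1.2 is only an example release, so pick a current one and repeat the steps on the master and both workers.

# Download and unpack a Spark binary release, then expose it as SPARK_HOME
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
sudo mv spark-3.1.2-bin-hadoop3.2 /usr/local/spark
echo 'export SPARK_HOME=/usr/local/spark' >> ~/.bashrc
source ~/.bashrc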

Disable firewall

sudo ufw disable

Start Spark from the master:

cd /usr/local/spark
./sbin/start-all.sh
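If the daemons started correctly, jps on the master should show a Master process, and jps on each worker should show a Worker process:

# Run on each machine to list the running Java processes
jps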

Open the Spark master UI at http://192.168.56.101:8080/

Spark URL (Source: iNNovationMerge)
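To confirm the cluster executes work end to end, the SparkPi example that ships with the binary distribution can be submitted to the standalone master. The jar path below matches the layout assumed earlier, and 7077 is the default master port.

/usr/local/spark/bin/spark-submit \
  --master spark://master.spark.com:7077 \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_*.jar 100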

Configure Jupyter Notebook

pip install jupyter
nano ~/.bashrc

# Add below lines (SPARK_HOME and JAVA_HOME are assumed to be exported already, as in the earlier steps)
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH:~/.local/bin:$JAVA_HOME/bin

source ~/.bashrc

Run Jupyter Notebook

jupyter notebook --ip 0.0.0.0

Open Jupyter Notebook from URL http://192.168.56.101:8888/

Start Coding

Jupyter Notebook (Source: iNNovationMerge)

View the application from the Spark master URL

Application (Source: iNNovationMerge)

Source:

https://www.innovationmerge.com/2021/06/26/Setting-up-a-multi-node-Apache-Spark-Cluster-on-a-Laptop/

https://medium.com/@jootorres_11979/how-to-install-and-set-up-an-apache-spark-cluster-on-hadoop-18-04-b4d70650ed42

Amir Masoud Sefidian
Machine Learning Engineer