Installing Apache Spark on Ubuntu requires a few dependency packages, namely JDK, Scala, and Git, to be installed on the system.
As a part of our Server Management Services, we help our customers with software installations regularly.
Today, let us discuss the steps to get a basic setup going on a single system.
How to install Apache Spark on Ubuntu?
Apache Spark is an open-source, general-purpose distributed framework for analyzing big data in cluster computing environments. It can easily process and distribute work on large datasets across multiple computers.
The first step in installing Apache Spark on Ubuntu is to install its dependencies. Before installing the dependencies, it is a good idea to ensure that the system packages are up to date:
root@ubuntu1804:~# apt update -y
root@ubuntu1804:~# apt -y upgrade
Installing the dependencies means installing the JDK, Scala, and Git packages. This can be done with the following command:
root@ubuntu1804:~# apt install default-jdk scala git -y
We can now verify the installed dependencies by running these commands:
java -version; javac -version; scala -version; git --version
The output prints the versions if the installation completed successfully for all packages.
Steps to install Apache Spark on Ubuntu
The steps to install Apache Spark include:
- Download Apache Spark
- Configure the Environment
- Start Apache Spark
- Start Spark Worker Process
- Verify Spark Shell
Let us now discuss each of these steps in detail.
Download Apache Spark
Now that the dependencies are installed on the system, the next step is to download Apache Spark to the server. Mirrors with the latest Apache Spark version can be found on the Apache Spark download page. Download Apache Spark using the following command:
root@ubuntu1804:~# wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
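Optionally, verify that the archive downloaded correctly before extracting it. A simple check, assuming the archive name above, is to compute its SHA-512 checksum locally and compare it against the value published alongside the release on the Apache Spark download page:
root@ubuntu1804:~# sha512sum spark-3.0.1-bin-hadoop2.7.tgz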
After completing the download, extract the Apache Spark tar file using this command and move the extracted directory to /opt:
root@ubuntu1804:~# tar -xvzf spark-*
root@ubuntu1804:~# mv spark-3.0.1-bin-hadoop2.7/ /opt/spark
Configure the Environment
Before starting the Spark master server, we need to configure a few environmental variables. First, set the environment variables in the .profile file by running the following commands:
root@ubuntu1804:~# echo "export SPARK_HOME=/opt/spark" >> ~/.profile
root@ubuntu1804:~# echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
root@ubuntu1804:~# echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
To ensure that these new environment variables are accessible within the shell and available to Apache Spark, it is also necessary to run the following command.
root@ubuntu1804:~# source ~/.profile
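As a quick sanity check, assuming the variables were written to ~/.profile as shown above, confirm that the shell has picked them up and can locate the Spark scripts:
root@ubuntu1804:~# echo $SPARK_HOME
root@ubuntu1804:~# which spark-shell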
Start Apache Spark
Now that we have configured the environment, the next step is to start the Spark master server. This can be done with the command:
root@ubuntu1804:~# start-master.sh
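Optionally, confirm that the master process has come up by checking that it is listening on its default ports (7077 for worker connections and 8080 for the web interface; adjust the values if these defaults were changed):
root@ubuntu1804:~# ss -tlnp | grep -E '7077|8080'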
To view the web interface, it is necessary to use SSH tunneling to forward a port from the local machine to the server. Log out of the server, then run the following command, replacing the hostname with the server's hostname or IP address:
ssh -L 8080:localhost:8080 root@hostname
It should now be possible to view the web interface from a browser on the local machine by visiting http://localhost:8080/. Once the web interface loads, copy the URL as it will be needed in the next step.
Start Spark Worker Process
In this case, Apache Spark is installed on a single machine, so the worker process will also be started on this server. Back in the terminal, start the worker by running the following command, pasting in the Spark URL from the web interface.
root@ubuntu1804:~# start-slave.sh spark://ubuntu1804.awesome.com:7077
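By default, the worker offers all available cores and most of the system memory to the master. On a small single-machine setup, the worker's resources can be limited with the -c (cores) and -m (memory) options, for example (using the same example master URL as above):
root@ubuntu1804:~# start-slave.sh spark://ubuntu1804.awesome.com:7077 -c 1 -m 1G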
Now that the worker is running, it should be visible back in the web interface.
Verify Spark Shell
The web interface is handy, but it will also be necessary to ensure that Spark’s command-line environment works as expected. In the terminal, run the following command to open the Spark Shell.
root@ubuntu1804:~# spark-shell
The Spark Shell is available not only in Scala but also in Python. Exit the current Spark Shell by pressing Ctrl+D. To test out PySpark, run the following command.
root@ubuntu1804:~# pyspark
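After exiting PySpark (also with Ctrl+D), the bundled examples provide another quick end-to-end test. The following command runs the SparkPi example that ships with Spark and should print an approximation of Pi:
root@ubuntu1804:~# run-example SparkPi 10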
Shut Down Apache Spark
If it becomes necessary for any reason to turn off the master and worker Spark processes, run the following commands:
root@ubuntu1804:~# stop-slave.sh
root@ubuntu1804:~# stop-master.sh
[Need any further assistance to install Apache Spark on Ubuntu? We’re available 24*7]
Conclusion
In short, Apache Spark is an open-source, general-purpose distributed framework used in cluster computing environments for analyzing big data. Today, we saw how our Support Engineers install Apache Spark on a single Ubuntu system.