
How to install Apache Spark on Ubuntu?

Sep 21, 2020

Installing Apache Spark on Ubuntu requires a few dependency packages, namely the JDK, Scala, and Git, to be installed on the system.

As a part of our Server Management Services, we regularly help our customers with software installations.

Today, let us discuss the steps to get a basic setup going on a single system.

How to install Apache Spark on Ubuntu?

Apache Spark is an open-source, general-purpose distributed framework that helps analyze big data in cluster computing environments. It can easily process and distribute work on large datasets across multiple computers.

The first step in installing Apache Spark on Ubuntu is to install its dependencies. Before we start with the dependencies, it is a good idea to ensure that the system packages are up to date with the update and upgrade commands:

root@ubuntu1804:~# apt update -y
root@ubuntu1804:~# apt -y upgrade

The dependencies to install are the JDK, Scala, and Git packages. This can be done with the following command:

root@ubuntu1804:~# apt install default-jdk scala git -y

We can now verify the installed dependencies by running these commands:

root@ubuntu1804:~# java -version; javac -version; scala -version; git --version

If the installation completed successfully, the output prints the version of each package.

Steps to install Apache Spark on Ubuntu

The steps to install Apache Spark include:

  • Download Apache Spark
  • Configure the Environment
  • Start Apache Spark
  • Start Spark Worker Process
  • Verify Spark Shell

Let us now discuss each of these steps in detail.

Download Apache Spark

Now that the dependencies are installed on the system, the next step is to download Apache Spark to the server. Mirrors with the latest Apache Spark version can be found on the Apache Spark download page. Download Apache Spark using the following command:

root@ubuntu1804:~# wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
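
Optionally, verify the integrity of the downloaded archive before extracting it. Apache publishes a SHA-512 checksum alongside each release; the .sha512 URL below follows the usual Apache naming convention, so double-check it against the download page. Compare the two hashes manually:

root@ubuntu1804:~# wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz.sha512
root@ubuntu1804:~# sha512sum spark-3.0.1-bin-hadoop2.7.tgz
root@ubuntu1804:~# cat spark-3.0.1-bin-hadoop2.7.tgz.sha512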


After the download completes, extract the Apache Spark tar file and move the extracted directory to /opt with the following commands:

root@ubuntu1804:~# tar -xvzf spark-*

root@ubuntu1804:~# mv spark-3.0.1-bin-hadoop2.7/ /opt/spark
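
As a quick sanity check, list the new directory; a healthy Spark distribution contains bin, sbin, conf, and jars subdirectories, among others:

root@ubuntu1804:~# ls /opt/spark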

Configure the Environment

Before starting the Spark master server, we need to configure a few environment variables. First, set them in the .profile file by running the following commands (note the escaped \$PATH in the second line, which keeps the variable from being expanded before it is written to the file):

root@ubuntu1804:~# echo "export SPARK_HOME=/opt/spark" >> ~/.profile
root@ubuntu1804:~# echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
root@ubuntu1804:~# echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

To ensure that these new environment variables are accessible within the shell and available to Apache Spark, it is also necessary to run the following command.

root@ubuntu1804:~# source ~/.profile
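
To confirm that the variables took effect in the current shell, echo SPARK_HOME and look up spark-shell on the PATH; both should point under /opt/spark:

root@ubuntu1804:~# echo $SPARK_HOME
root@ubuntu1804:~# which spark-shell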

Start Apache Spark

Now that we have configured the environment, the next step is to start the Spark master server. This can be done with the command:

root@ubuntu1804:~# start-master.sh
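
The start script also writes a log under /opt/spark/logs (the exact filename embeds the user and hostname, hence the glob below). If the web interface is inconvenient, the master URL needed in the next step can usually be pulled straight from that log:

root@ubuntu1804:~# grep "Starting Spark master" /opt/spark/logs/spark-*Master*.out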

To view the web interface, it is necessary to use SSH tunneling to forward a port from the local machine to the server. Log out of the server and then run the following command, replacing hostname with the server’s hostname or IP address:

 ssh -L 8080:localhost:8080 root@hostname

It should now be possible to view the web interface from a browser on the local machine by visiting http://localhost:8080/. Once the web interface loads, copy the Spark master URL (of the form spark://hostname:7077) shown at the top, as it will be needed in the next step.


Start Spark Worker Process

In this case, Apache Spark is installed on a single machine, so the worker process will also be started on this server. Back in the terminal, start the worker by running the following command, pasting in the Spark URL from the web interface:

root@ubuntu1804:~# start-slave.sh spark://ubuntu1804.awesome.com:7077
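
Here spark://ubuntu1804.awesome.com:7077 is just this example server’s URL. If the copied URL is not at hand, substituting the machine’s hostname on the fly usually works too, assuming the master is listening on its default port 7077:

root@ubuntu1804:~# start-slave.sh spark://$(hostname):7077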

Now that the worker is running, it should be visible back in the web interface.

Verify Spark Shell

The web interface is handy, but it will also be necessary to ensure that Spark’s command-line environment works as expected. In the terminal, run the following command to open the Spark Shell.

root@ubuntu1804:~# spark-shell
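
Once the scala> prompt appears, a one-line job confirms that Spark can actually schedule work; the numbers here are arbitrary, and the count should come back as 1000:

scala> spark.range(1000).count()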

The Spark Shell is available not only in Scala but also in Python. Exit the current Spark Shell by pressing CTRL+D. To test out pyspark, run the following command:

root@ubuntu1804:~# pyspark
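
The same kind of smoke test works at the >>> prompt; sc is the SparkContext that pyspark creates automatically, and the sum of 0 through 999 should come back as 499500:

>>> sc.parallelize(range(1000)).sum()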


Shut Down Apache Spark

If it becomes necessary for any reason to stop the master and worker Spark processes, run the following commands:

root@ubuntu1804:~# stop-slave.sh
root@ubuntu1804:~# stop-master.sh
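
Alternatively, Spark’s sbin directory (already on the PATH from the .profile changes) ships a combined stop-all.sh script that stops the master and workers together. Note that Hadoop installations carry a script of the same name, so this shortcut is safest on a Spark-only server like this one:

root@ubuntu1804:~# stop-all.sh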

[Need any further assistance to install Apache Spark on Ubuntu? – We’re available 24*7]

Conclusion

In short, Apache Spark is an open-source, general-purpose distributed framework used in cluster computing environments for analyzing big data. Today, we saw how our Support Engineers install Apache Spark on a single Ubuntu system.
