Apache Spark is a highly complex platform with many interlocking parts. Online web services such as AWS allow you to launch clusters of nodes with Spark already installed.
I chose Digital Ocean as the cloud platform on which to learn and run Spark. It is a low-cost option that strikes the right balance between ease of use and custom control. I can create a simple Droplet online, connect to it via SSH, and install all the packages I need for a Spark/Data Science setup.
Digital Ocean offers preconfigured Droplet images with applications already installed. At the time of writing, there does not appear to be one for Data Science/Data Engineering. For the amateur (i.e., me), it is useful to understand how to install the Spark framework from scratch. Digital Ocean also gives you the option to make your own images (called “Snapshots”) and save them. Since I am tinkering with many different areas of Data Science, it makes sense for me to generate my own prebuilt image and customize this “base” image as I move along.
However, having an image file is not always enough: starting from a prebuilt base image may not be desirable. For small, highly customized builds, it is nice to be able to start from a clean Droplet.
When installing and building from scratch, there are many (many...) steps to follow. Along the way, I built a rudimentary shell script to automate the tasks for the next node. And so, the script was born. I have tested it on a fresh Droplet, and it actually works!
The first version of my build scripts can be seen below. There are two parts to my build script (called 1.sh and 2.sh). Once the Droplet is created, I sftp into it and place the two scripts in the root folder. Using the root account, I run them in order.
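The upload-and-run step could be sketched roughly as follows. This is only an illustration: it uses scp rather than an interactive sftp session, it assumes "the root folder" means /root, and the helper names (run, deploy_scripts) and the DRY_RUN switch are hypothetical, not part of my original workflow.

```shell
# Hypothetical wrapper: copy both scripts to a Droplet and run them as root.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

deploy_scripts() {
    host=$1   # Droplet IP address or hostname (placeholder)
    # Assumption: the scripts sit in the current local folder and go to /root.
    run scp 1.sh 2.sh "root@$host:/root/"
    # -t allocates a terminal, in case a script prompts for input.
    run ssh -t "root@$host" "bash /root/1.sh && bash /root/2.sh"
}

# Example (203.0.113.10 is a documentation-only address):
# DRY_RUN=1 deploy_scripts 203.0.113.10
```

Running it with DRY_RUN=1 first is a cheap way to check the commands before touching a real Droplet.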
The first script (1.sh) sets up a user account on the Droplet, and copies over root ssh keys to the user folder.
#Start a new droplet with both user and root keys loaded.
#Create the account first (assumed step; usermod fails if "user" does not exist).
adduser user
usermod -aG sudo user
#This transfers the keys over properly, without having to fix permissions by hand.
rsync --archive --chown=user:user ~/.ssh /home/user
#Now go into /root/.ssh/authorized_keys and delete the user key.
#It is ok if user has both the root and user key (either is fine).
I already have two SSH key files generated (one for user and one for root). Digital Ocean has an option to install them in your Droplet when it is created. Once the SSH keys are copied over, I manually delete the user key from root's authorized_keys file.
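The manual deletion could also be scripted. The following is a minimal sketch, assuming the user's public key is available as a file in the usual "type key comment" format; the function name remove_key and the file name user.pub are hypothetical. It keeps a .bak copy so that a mistake does not lock root out.

```shell
# Hypothetical helper: remove one public key from an authorized_keys file.
# $1 = path to authorized_keys, $2 = path to the public key file to remove.
remove_key() {
    auth_file=$1
    pub_file=$2
    # The base64 key body is the second field of a public key line.
    key_body=$(awk '{print $2; exit}' "$pub_file")
    if [ -z "$key_body" ]; then
        echo "no key found in $pub_file" >&2
        return 1
    fi
    # Keep a backup with the same permissions before rewriting the file.
    cp -p "$auth_file" "$auth_file.bak"
    # Write back every line that does not contain the key body.
    grep -vF -- "$key_body" "$auth_file.bak" > "$auth_file"
    # grep exits with 1 when no lines are left; only 2 or more is an error.
    [ $? -le 1 ]
}

# Example (run as root on the Droplet):
# remove_key /root/.ssh/authorized_keys /root/user.pub
```

Matching on the key body rather than the trailing comment means the right line is removed even if the comment was edited.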
With the user setup done, it’s time to install our system with the 2.sh script:
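#The rest of this assumes we are using the root account.
#folder for user to do installs (assumed to be /home/user).
cd /home/user
#Refresh the package lists before upgrading.
apt-get update
apt-get upgrade -y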
apt install default-jdk scala -y
apt install git build-essential -y
apt install zlib1g-dev libffi-dev -y
#Droplet comes with python3.6, but doesn’t have these modules.
apt install python3-pip python3-venv -y
apt install r-base -y
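#install for python 3.7
apt-get install libsqlite3-dev libssl-dev -y
curl -O https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tar.xz
tar xf Python-3.7.2.tar.xz
#Build and install from source (assumed steps; altinstall leaves the system python3 untouched).
cd Python-3.7.2
./configure
make
make altinstall
cd ..
#create an env folder:
python3.7 -m venv /home/user/env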
curl -O http://apache.mirror.colo-serv.net/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar xvf spark-2.4.0-bin-hadoop2.7.tgz
#a nicer renaming.
mv ./spark-2.4.0-bin-hadoop2.7 ./spark2.4-hadoop2.7
rm -f *.tgz
rm -f *.xz
#set /home/user folder so it is owned by user!
chown -R user:user /home/user
Finally, I test that Spark, Python 3.7, venv, R and Java all work from the console. The Droplet is now ready to go!
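A quick way to run that final check is to ask the shell which of the expected commands are missing. This is just a sketch: the function name missing_tools is hypothetical, and the Spark path in the example assumes the archive was unpacked and renamed inside /home/user.

```shell
# Report which of the named commands cannot be found on PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Trim the leading space before printing; prints an empty line if all are found.
    echo "${missing# }"
}

# Example (run on the Droplet):
# missing_tools java scala python3.7 R git
# /home/user/spark2.4-hadoop2.7/bin/spark-submit --version
```

An empty result means every listed command was found. Spark is checked separately because its bin folder is not on PATH in this setup.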
The “yes” command for scripted installs: https://askubuntu.com/questions/519/how-do-i-write-an-application-install-shell-script
Basic Spark install: https://medium.com/@josemarcialportilla/installing-scala-and-spark-on-ubuntu-5665ee4b62b1
Untar a tar.xz file: https://askubuntu.com/questions/92328/how-do-i-uncompress-a-tarball-that-uses-xz
Install C compilation tools: https://stackoverflow.com/questions/19816275/no-acceptable-c-compiler-found-in-path-when-installing-python
Fixing the zlib error: https://unix.stackexchange.com/questions/291737/zipimport-zipimporterror-cant-decompress-data-zlib-not-available
Set up SSH keys for user: https://www.digitalocean.com/community/tutorials/initial-server-setup-with-ubuntu-18-04
Create SSH keys: https://www.digitalocean.com/docs/droplets/how-to/add-ssh-keys/create-with-openssh/