Distributed LIBLINEAR: A Practical Guide to Running Spark LIBLINEAR Using VirtualBox

1. Introduction

By following this guide, you can build a distributed Spark cluster on VirtualBox on your computer. Because the settings are complicated, we provide an image with Ubuntu-13.10-server-i386 and all necessary packages installed to save your time. You can download the image here and follow the guide step by step. We assume your master node is called pineapple0 and your slave nodes are called pineapple1, pineapple2, and so on.

IMPORTANT:

We do not recommend modifying the name of the master (pineapple0), since it appears in many configuration files.

2. Build VirtualBox

2.1 Verify the image

$ md5sum pineapple.ova
>> 670545fb0bed92dfc9a7ec3296dec80e pineapple.ova

2.2 Build a virtual environment

Please visit this tutorial to see how to establish a virtual environment.

3. Build Apache Hadoop

You may notice that we have already built a Hadoop system in this image file. All you have to do is specify the list of slave machines.

3.1 Edit ~/hadoop-1.2.1/conf/slaves

This setting needs to be applied on the master machine. Add the slave machines for HDFS usage.

pineapple0
pineapple1
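
If you prefer to write the file from the command line instead of using an editor, a minimal sketch (assuming the home directory layout of the provided image) is:

$ cat > ~/hadoop-1.2.1/conf/slaves << EOF
pineapple0
pineapple1
EOF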

3.2 Build the authenticity

Please execute the following commands in the same terminal.

3.2.1 Master authenticity

On the master machine pineapple0, log in to the master itself.

$ ssh pineapple0
(type: yes)

Return to master machine pineapple0.

$ exit

3.2.2 Slave authenticity

Log in to the slave.

$ ssh pineapple1

On the slave machine pineapple1, log in to the master.

$ ssh pineapple0
(type: yes)

Return to the master machine pineapple0 by exiting twice.

$ exit
$ exit

Note that you should type "yes" to confirm the SSH host authenticity when prompted.

IMPORTANT: You should be able to log in WITHOUT typing a password.
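
The provided image should already contain the SSH keys needed for passwordless login. If you are still prompted for a password, one common fix is to regenerate and distribute a key from the master. This is only a sketch (assuming ssh-keygen and ssh-copy-id are available on the image) and is not required if the logins above already work without a password.

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ ssh-copy-id pineapple0
$ ssh-copy-id pineapple1

Repeat the same commands on each slave if the slave-to-master login also asks for a password.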

4. Test the Built Hadoop System

All the following steps must be done on the master.

4.1 Format the HDFS

Format the file system by running

$ ~/hadoop-1.2.1/bin/hadoop namenode -format

4.2 Start the HDFS

In master machine pineapple0, activate the cluster.

$ ~/hadoop-1.2.1/bin/start-all.sh
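
To check that the daemons started, you can list the running Java processes with jps (part of the installed JDK). This is just a quick sanity check; the exact set of processes depends on your slaves file.

$ jps
(expect NameNode, SecondaryNameNode, and JobTracker on the master; DataNode and TaskTracker on machines listed in conf/slaves)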

4.3 Put a file into HDFS

$ cd ~/hadoop-1.2.1/
$ hadoop fs -put LICENSE.txt LICENSE.txt

Note that you may need to wait a moment to run this command if HDFS is in safe mode.
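
If the put command complains about safe mode, you can check (or wait for) the safe mode status with the standard Hadoop 1.x dfsadmin tool. This is only a convenience sketch:

$ hadoop dfsadmin -safemode get
>> Safe mode is OFF
$ hadoop dfsadmin -safemode wait
(blocks until HDFS leaves safe mode)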

$ hadoop fs -ls

>> -rw-r--r--   2 spongebob supergroup 13366 2014-03-04 03:09 /user/spongebob/LICENSE.txt

$ hadoop fs -cat LICENSE.txt

>> Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
(skip some output message)

If you see this license text, your setup is successful.
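
If you want to clean up, you can remove the test file from HDFS afterwards (optional):

$ hadoop fs -rm LICENSE.txt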

4.4 Shut down HDFS

If you want to shut down HDFS, run

$ ~/hadoop-1.2.1/bin/stop-all.sh

Remember to start HDFS before you run your Spark process.

5. Build Apache Spark

First download a prebuilt Spark package using our script. Then specify the list of slave machines.

5.1 Download prebuilt Spark package by a script

We choose Spark version 1.0.0 to demonstrate the process. Please execute the following commands in the terminal of the master machine.

$ cd ~
$ ./initSpark.pl http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0-bin-hadoop1.tgz
(skip some output message)

For each slave machine, you also need to download the prebuilt package.

$ ssh pineapple1
$ ./initSpark.pl http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0-bin-hadoop1.tgz
(skip some output message)
$ exit

5.2 Edit ~/spark-1.0.0-bin-hadoop1/conf/slaves

The following settings must be applied on the master machine. Modify the list of slave machines.

pineapple0
pineapple1

5.3 Specify the configuration

Copy the configuration template files.

$ cd ~/spark-1.0.0-bin-hadoop1/conf/
$ cp spark-defaults.conf.template spark-defaults.conf
$ cp spark-env.sh.template spark-env.sh
$ chmod 755 spark-env.sh

Add "spark.master spark://pineapple0:7077" to spark-defaults.conf by the following command.

echo "spark.master spark://pineapple0:7077" >> spark-defaults.conf

Add "export SPARK_MASTER_IP=pineapple0" and "export SPARK_MASTER_PORT=7077" to spark-env.sh by the following commands.

$ echo "export SPARK_MASTER_IP=pineapple0" >> spark-env.sh
$ echo "export SPARK_MASTER_PORT=7077" >> spark-env.sh

5.4 Start Spark Cluster

$ ~/spark-1.0.0-bin-hadoop1/sbin/start-all.sh
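
To verify that the cluster is running, a minimal check (using the defaults configured above) is to look for the Spark Master and Worker processes with jps, or to open the standalone master web UI at http://pineapple0:8080 and confirm that all workers are registered.

$ jps
(expect a Master process on pineapple0 and a Worker process on every machine listed in conf/slaves)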

5.5 Try a built-in application first

On the master pineapple0, go to the top-level Spark directory (~/spark-1.0.0-bin-hadoop1/) and run a simple application to compute Pi.

$ cd ~/spark-1.0.0-bin-hadoop1/
$ ./bin/run-example SparkPi

It is successful if you see the computed result of Pi.
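
SparkPi also accepts the number of slices as an optional argument, which controls how many partitions the job is split into. A usage sketch (100 is an arbitrary choice; the printed digits vary from run to run):

$ ./bin/run-example SparkPi 100
(skip some output message)
>> Pi is roughly 3.14...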

5.6 Shut down Spark Cluster

$ ~/spark-1.0.0-bin-hadoop1/sbin/stop-all.sh

Remember to start the Spark cluster before you run your Spark process.


All Hadoop and Spark settings are now done. Please continue to the page Running Spark LIBLINEAR.