Spark LIBLINEAR: A Practical Guide for Running a Spark LIBLINEAR Example

1. Introduction

This guide demonstrates how to run Spark LIBLINEAR in your Spark cluster. We assume that you already have a Spark environment. If not, please check the VirtualBox Guide and Amazon EC2 Guide pages.

This guide is based on the following environment:
Ubuntu 13.10
Java JDK 7u55 (you must use a JDK earlier than 8 because of a dependency issue)
Hadoop 1.2.1
Spark 1.0.0

2. Running a Spark LIBLINEAR Example

This section demonstrates how to run Spark LIBLINEAR. We assume your Spark home is ~/spark-1.0.0-bin-hadoop1/ and your Hadoop home is ~/hadoop-1.2.1/.

Note that if you use Amazon EC2 to create your Spark cluster, Spark will be placed at ~/spark/ instead of ~/spark-1.0.0-bin-hadoop1/, and Hadoop will be placed at ~/ephemeral-hdfs/ instead of ~/hadoop-1.2.1/. Please adjust the directory names, paths, master name, and user name in this guide accordingly.

To simplify this guide, we assume the master machine name and user name are "pineapple0" and "spongebob," respectively.


2.1 Download Spark LIBLINEAR

Download the Spark LIBLINEAR package as a tar.gz or zip file, extract it, and move the Spark LIBLINEAR directory into your Spark directory.

$ cd ~
$ tar zxvf spark-liblinear-1.95.tar.gz
$ mv spark-liblinear-1.95/ ~/spark-1.0.0-bin-hadoop1/

2.2 Check jar file spark-liblinear-1.95.jar

To save you time, we have packed the Spark LIBLINEAR Java class files into the jar file spark-liblinear-1.95.jar. You can find this file at ~/spark-1.0.0-bin-hadoop1/spark-liblinear-1.95/spark-liblinear-1.95.jar. If you want to build it yourself, please check the README file at ~/spark-1.0.0-bin-hadoop1/spark-liblinear-1.95/README.spark.

2.3 Start your HDFS

$ ~/hadoop-1.2.1/bin/start-all.sh

2.4 Put your training data into HDFS

Spark LIBLINEAR supports the LIBSVM data format. Take the dataset heart_scale as an example. You can find heart_scale in the Spark LIBLINEAR directory and put it into HDFS by

$ hadoop fs -put ~/spark-1.0.0-bin-hadoop1/spark-liblinear-1.95/heart_scale heart_scale
$ hadoop fs -ls

>> Found 1 item
>> -rw-r--r--   2 spongebob supergroup       27670 2014-04-25 20:07 /user/spongebob/heart_scale

The upload is successful if you see this information in HDFS.
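For reference, each line of a LIBSVM-format file such as heart_scale is a label followed by sparse index:value feature pairs. A minimal sketch of parsing one such line in plain Scala (this helper, parseLibsvmLine, is ours for illustration and is not part of the Spark LIBLINEAR API; loadLibSVMData does this for you):

```scala
// Parse one LIBSVM-format line: "<label> <index>:<value> <index>:<value> ..."
def parseLibsvmLine(line: String): (Double, Array[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { t =>
    val Array(idx, value) = t.split(":")
    (idx.toInt, value.toDouble)
  }
  (label, features)
}

val (y, x) = parseLibsvmLine("+1 1:0.708 3:-0.333 12:1")
println(y)                 // 1.0
println(x.mkString(", "))  // (1,0.708), (3,-0.333), (12,1.0)
```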

2.5 Running Spark LIBLINEAR in the Spark Shell

Remember to start your Spark cluster before executing the following steps.

$ ~/spark-1.0.0-bin-hadoop1/sbin/start-all.sh

2.5.1 Start Spark Shell

Spark provides an interactive shell for executing its API. To start the Spark shell, move to your Spark directory and run

$ cd ~/spark-1.0.0-bin-hadoop1/
$ ./bin/spark-shell --jars "/home/spongebob/spark-1.0.0-bin-hadoop1/spark-liblinear-1.95/spark-liblinear-1.95.jar"

2.5.2 Run a Spark LIBLINEAR Training Task

Import the classes of Spark LIBLINEAR.

scala> import tw.edu.ntu.csie.liblinear._

Load the dataset into memory with the API loadLibSVMData, and then train a model with SparkLiblinear.train.

scala> val data = Utils.loadLibSVMData(sc, "hdfs://pineapple0:9000/user/spongebob/heart_scale")
scala> val model = SparkLiblinear.train(data, "-s 0 -c 1.0 -e 1e-2")

Note that you can specify LIBLINEAR options in the second argument of train(). In this example, -s 0 selects L2-regularized logistic regression, -c 1.0 sets the regularization parameter C, and -e 1e-2 sets the stopping tolerance.

2.5.3 Run a Spark LIBLINEAR Prediction Task

Let's use the obtained model to predict labels for the training data with predict().

scala> val labelAndPreds = data.map { point =>
  val prediction = model.predict(point)
  (point.y, prediction)
}

scala> val accuracy = labelAndPreds.filter(r => r._1 == r._2).count.toDouble / data.count
scala> println("Training Accuracy = " + accuracy)
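For intuition, the accuracy above is simply the fraction of (label, prediction) pairs that agree. The same logic on a plain Scala collection, with made-up values standing in for the RDD of pairs:

```scala
// Toy (label, prediction) pairs; in the Spark version these come from an RDD
val pairs = Seq((1.0, 1.0), (-1.0, 1.0), (1.0, 1.0), (-1.0, -1.0))

// Accuracy = number of correct predictions / total number of instances
val accuracy = pairs.count { case (y, pred) => y == pred }.toDouble / pairs.size
println("Training Accuracy = " + accuracy)  // Training Accuracy = 0.75
```

On an RDD the same computation uses filter/count instead of a local count, since the data is distributed across the cluster.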