Distributed training is useful when a data set is too large to be stored/trained on a single machine. Distributed LIBLINEAR, extended from LIBLINEAR, is an open source library for large-scale linear classification in distributed environments.
In this guide, we provide a step-by-step tutorial on setting up an environment for MPI LIBLINEAR on a cloud computing platform, Amazon Elastic Compute Cloud (EC2). We provide examples of how to rent and set up machines on AWS. Then we demonstrate how to train a linear classifier with MPI LIBLINEAR, where the data are stored distributedly across many nodes.
For users who already have prepared machines, please go directly to Section 3.
We assume that you already have an AWS account and some experience on running LIBLINEAR. To get a sense of AWS EC2, you may first visit the official web console, on which you can do many operations like launching instances, creating volumes, monitoring machines, etc.
Boto 3 is AWS’s Python SDK, which offers a Python interface for developers to manage their settings on AWS EC2. In this guide, we use this package because it is programmable and simple to demonstrate, though the same settings can also be made through the web console.
First, follow the below steps to set up Boto 3 on your local machine.
Install the Boto 3 package via pip (or use another Python package management tool such as Conda):
pip install boto3
Set up an access key on the AWS website. An access key can be created from the AWS web page (by clicking your account name > My Security Credentials > Access Keys)
Put the key information in ~/.aws/credentials as follows:

[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
Note that you should carefully manage your access keys. You may check AWS Identity and Access Management for advanced permission controls.
To see whether you are able to access AWS, try the following example. We recommend using IPython or a similar interactive Python shell.
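A minimal sketch of this check with Boto 3 might look as follows (the region name here is an assumption; use the region you work in):

```python
import boto3

# Connect to EC2 in the us-east-1 region (adjust to your region).
ec2 = boto3.resource('ec2', region_name='us-east-1')

# List all existing instances under this region.
for instance in ec2.instances.all():
    print(instance.id, instance.instance_type, instance.public_ip_address)
```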
The above code retrieves all existing EC2 instances under the us-east-1 region. If no errors are reported, you can successfully access AWS through Boto 3. Note that nothing will be printed if you have not created any instances.
Before launching an instance, we have to set up a security group and a key pair.
On EC2, a security group is a virtual firewall that controls the network traffic on instances. In order to do distributed training, the network traffic among machines needs to be allowed. Here we show an example of how to set up a security group.
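A sketch of such a setup with Boto 3 might look as follows (the region name and description text are assumptions; the group name matches the one used later in this guide):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

# Create the security group (the description is arbitrary).
sec_group = ec2.create_security_group(
    GroupName='SECURITY_GROUP_LIBLINEAR',
    Description='Security group for MPI LIBLINEAR')

# Allow all inbound TCP traffic on any port (needed for MPI communication).
sec_group.authorize_ingress(
    IpProtocol='tcp', FromPort=0, ToPort=65535, CidrIp='0.0.0.0/0')

print(sec_group.id)
```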
The code creates a security group named SECURITY_GROUP_LIBLINEAR and allows all inbound TCP traffic on any port (which is required by MPI communication). Once the group has been created on EC2, you will be able to launch instances with this security group by specifying its ID. Note that you may consider more advanced settings (e.g., only allowing inbound traffic from certain IP addresses) to enhance security. For more details, please visit the documentation on the Boto 3 website.
Before launching instances, we may want to create an SSH key pair on EC2 for logging in.
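A Boto 3 sketch for this step might be (the region name is an assumption; the key name matches the one used when launching instances later):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

# Create a key pair on EC2; the private key material is returned only once.
key_pair = ec2.create_key_pair(KeyName='KEY_PAIR_DEMO')
print(key_pair.key_material)
```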
The content of the private key will be returned and printed. Save the returned key in a private directory, say a file at ~/.ssh/id_rsa_aws, and remember to change the permission of the key file (see the command below) to make sure it is not accessible by other users.
chmod 600 ~/.ssh/id_rsa_aws
Launching an instance on EC2 can be complicated because there are many options offered by AWS. For simplicity, we mostly choose default settings here. The following example demonstrates how to create a t2.micro instance, which provides a basic level of CPU performance.
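A sketch of the launch call might look as follows (the region name is an assumption; the security group ID placeholder must be replaced with the ID of the group created earlier):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

instances = ec2.create_instances(
    ImageId='ami-66506c1c',       # Ubuntu Server 16.04
    InstanceType='t2.micro',
    MinCount=1,
    MaxCount=1,
    KeyName='KEY_PAIR_DEMO',      # key pair created earlier
    SecurityGroupIds=['<your-security-group-id>'])  # group created earlier
print(instances[0].id)
```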
The ID of the launched machine will be printed. Now if you re-run the instance-listing code from earlier, the instance just created will be shown.
ami-66506c1c is the image ID of Ubuntu Server 16.04. You can find a detailed list of image IDs on the AWS Marketplace.
To get more information about the instance, we run the following code.
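A sketch that waits for each instance and prints its information might be (the region name is an assumption):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

for instance in ec2.instances.all():
    instance.wait_until_running()  # block until the instance is running
    instance.reload()              # refresh attributes such as the public IP
    print(instance.id, instance.instance_type, instance.public_ip_address)
```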
Then you should get messages like this:
<ID-of-the-instance> t2.micro <public-ip-of-the-instance>
Now you have successfully launched an instance! Try to login to it by
ssh -i ~/.ssh/id_rsa_aws ubuntu@<public-ip-of-the-instance>
Preparing the environment (e.g., installing dependency packages) on all training nodes can be a tedious task. On AWS EC2, it is possible to create a custom image, which can then be used to launch instances (just like we adopted the official Ubuntu one in Section 2.3). Therefore, we only need to prepare everything on one machine and copy its volume to the others, which saves a lot of effort.
Log in to the instance launched previously and install some dependencies:
ssh -i ~/.ssh/id_rsa_aws ubuntu@<public-ip-of-the-instance>
sudo apt update
sudo apt install build-essential openmpi-bin libopenmpi-dev libopenblas-dev zip bzip2
cd ~
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/mpi/mpi-liblinear-2.20.zip
unzip mpi-liblinear-2.20.zip
Generate an SSH key and make other nodes accessible via this shared key:
ssh-keygen
(Several ENTERs)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Go back to your local machine and create an image based on this instance.
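A Boto 3 sketch for this step might be (the region name and image name are assumptions; the instance ID is the one printed when the instance was launched):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

# Create an image from the prepared instance (the image name is arbitrary).
instance = ec2.Instance('<ID-of-the-instance>')
image = instance.create_image(Name='mpi-liblinear-image')
print(image.id)
```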
Then we will obtain the image’s id. The image will also appear on AWS web console.
Note that you can terminate the instance if you do not need it any more.
In this guide, we consider 4 instances as the training nodes. We are going to launch these instances with the image created in Section 2.4. By passing the image ID to the ImageId argument of ec2.create_instances, the instances will be initialized from the specified image.
instances = ec2.create_instances(
    ImageId=image.id,
    InstanceType='t2.micro',
    MaxCount=4,
    MinCount=4,
    KeyName='KEY_PAIR_DEMO',
    SecurityGroupIds=[sec_group.id]
)
for instance in instances:
    instance.wait_until_running()
    instance.reload()
    print(instance.id, instance.instance_type, instance.public_ip_address)
Then the 4 public IPs corresponding to the instances will be printed. Note that one of the instances should be selected as the “master” node, and the others as “slave” nodes.
Now you have 4 instances available to run MPI LIBLINEAR.
For users who followed the above steps to set up AWS EC2 instances, please go directly to Section 4.
If you already have machines and would like to set up the environment for them, please follow the requirements below.
MPI LIBLINEAR is supposed to be run on UNIX-like machines. Basically, the following libraries are required:
Open MPI
OpenBLAS

If you are using Ubuntu, you may install them by

sudo apt-get update
sudo apt-get install openmpi-bin libopenmpi-dev libopenblas-dev

Note that you have to install them on ALL machines. For other Unix-like systems, you may need to download the source code and build it yourself. For example, you may check the official installation guides of Open MPI and OpenBLAS.
All your machines are accessible via SSH login without a password prompt from any other one. Every machine should have the SSH key, and the key must be authorized on every node (i.e., put into ~/.ssh/authorized_keys). This site may be helpful.
All machines have the exact SAME working directory. You should download MPI LIBLINEAR to the same working directory on each machine. For example, all machines have the code directory at /home/your_username/mpi-liblinear-2.20 (note that the username should also be the same, so you may consider creating a new user, say mpiuser, on all machines).
NOTICE: Firewalls can be an issue for MPI communication. If you encounter errors, please turn the firewall off and try again.
Then check Section 4 for running MPI LIBLINEAR.
After completing the above configuration, we are about to run the distributed training on AWS instances. The procedure is simple.
You have to prepare your data set following the LIBSVM format. Several examples can be found at LIBSVM Data. For demonstration, we select the small data set rcv1_train.binary.
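For reference, a LIBSVM-format file stores one instance per line: a label followed by sparse index:value pairs with indices in ascending order. The lines below are an illustrative example, not actual rcv1 data:

```
+1 3:0.5 10:1 14:1
-1 2:1 7:0.8 14:1
```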
Then we choose one instance as the master node and log in to it via SSH.
ssh -i ~/.ssh/id_rsa_aws ubuntu@<public-ip-of-the-instance>
cd mpi-liblinear-2.20
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2
bzip2 -d rcv1_train.binary.bz2
Create a file named machinefile which contains the IP of each node (the 4 public IPs corresponding to all instances; see the output in Section 2.5).
MASTER_INSTANCE_IP SLAVE_INSTANCE_A_IP SLAVE_INSTANCE_B_IP SLAVE_INSTANCE_C_IP
NOTICE: The first line should be the IP of the master instance; otherwise you might encounter problems when using split.py.

split.py will equally split and distribute the data set to the other instances specified in machinefile.
python3 split.py machinefile rcv1_train.binary
We first need to generate the executable files on the 4 instances. The options -n 4 -npernode 1 specify that 4 processes are launched in total, one per node.
mpirun -n 4 -npernode 1 --machinefile machinefile make
For more advanced options of mpirun, please visit the Open MPI official website.
Now we are ready to train the data distributedly by running the command below, taking the -s 0 solver for example.
mpirun -n 4 -npernode 1 --machinefile machinefile ./train -s 0 rcv1_train.binary
You will see messages like this.
#instance = 20242, #feature = 47236
iter 1 act 8.598e+03 pre 7.607e+03 delta 3.571e+01 f 1.403e+04 |g| 6.197e+02 CG 4
iter 2 act 1.167e+03 pre 1.004e+03 delta 3.571e+01 f 5.433e+03 |g| 1.505e+02 CG 5
iter 3 act 1.408e+02 pre 1.253e+02 delta 3.571e+01 f 4.266e+03 |g| 4.297e+01 CG 4
iter 4 act 1.046e+01 pre 9.910e+00 delta 3.571e+01 f 4.125e+03 |g| 9.166e+00 CG 5
More options of train can be found in the README file of MPI LIBLINEAR.
We then try to predict the training data:
mpirun -n 4 --machinefile machinefile ./predict rcv1_train.binary rcv1_train.binary.model rcv1_train.binary.out
The prediction accuracy will be shown as follows.
Accuracy = 97.7571% (19788/20242)
Amazon EC2 offers two types of instance pricing: On-Demand Instances and Spot Instances. In this tutorial, we only consider On-Demand Instances. Normally the price of Spot Instances is much lower than that of On-Demand Instances (check the price history on the EC2 website). However, your Spot Instances could be interrupted if EC2 needs the machines back. If your application does not involve long training times, you may consider using Spot Instances to save money.
You may first check our FAQ. If you still encounter problems or have any suggestions, please send mails to Chih-Jen Lin.