Distributed training is useful when a data set is too large to be stored/trained on a single machine. Distributed LIBLINEAR, extended from LIBLINEAR, is an open source library for large-scale linear classification in distributed environments.
In this guide, we provide a step-by-step tutorial on setting up an environment for MPI LIBLINEAR on a cloud computing platform, Amazon Elastic Compute Cloud (EC2). We provide examples of how to rent and set up machines on AWS. Then we demonstrate how to train a linear classifier with MPI LIBLINEAR, where the data are stored distributedly across many nodes.
For users who already have prepared machines, please go directly to Section 3.
We assume that you already have an AWS account and some experience on running LIBLINEAR. To get a sense of AWS EC2, you may first visit the official web console, on which you can do many operations like launching instances, creating volumes, monitoring machines, etc.
Boto 3 is AWS’s Python SDK, which offers a Python interface for developers to manage their settings on AWS EC2. In this guide, we use this package because it is programmable and simple to demonstrate, though the same settings can also be made through the web console.
First, follow the below steps to set up Boto 3 on your local machine.
Install the Boto 3 package via pip (or use another Python package management tool such as Conda):
pip install boto3
Set up an access key on the AWS website. An access key can be created from the AWS web page (by clicking your account name > My Security Credentials > Access Keys)
Put the key information in ~/.aws/credentials as follows:

[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
Note that you should carefully manage your access keys. You may check AWS Identity and Access Management for advanced permission controls.
To see whether you are able to access AWS, try the following example. We recommend using IPython or a similar interactive Python shell.
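A minimal sketch of this check with Boto 3 might look as follows (the region name here is an assumption; use the region you work in):

```python
import boto3

# Connect to EC2 in the us-east-1 region (adjust to your region).
ec2 = boto3.resource('ec2', region_name='us-east-1')

# List all existing instances under this region.
for instance in ec2.instances.all():
    print(instance.id, instance.instance_type, instance.public_ip_address)
```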
The above code retrieves all existing EC2 instances under the us-east-1 region. If no errors are reported, you can successfully access AWS through Boto 3. Note that nothing will be printed if you have not created any instances.
Before launching an instance, we have to set up a security group and a key pair.
On EC2, a security group is a virtual firewall that controls the network traffic on instances. In order to do distributed training, the network traffic among machines needs to be allowed. Here we show an example of how to set up a security group.
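A sketch of such a setup with Boto 3 might look as follows (the region name and description text are assumptions; the group name matches the one used later in this guide):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

# Create the security group (the description is arbitrary).
sec_group = ec2.create_security_group(
    GroupName='SECURITY_GROUP_LIBLINEAR',
    Description='Security group for MPI LIBLINEAR')

# Allow all inbound TCP traffic on any port (needed for MPI communication).
sec_group.authorize_ingress(
    IpProtocol='tcp', FromPort=0, ToPort=65535, CidrIp='0.0.0.0/0')

print(sec_group.id)
```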
The code creates a security group named SECURITY_GROUP_LIBLINEAR and allows all inbound TCP traffic on any port (which is required by MPI communication). Once the group has been created on EC2, you will be able to launch instances with this security group by specifying its ID. Note that you may consider more advanced settings (e.g., only allowing inbound traffic from certain IP addresses) to enhance security. For more details, please visit the documentation on the Boto 3 website.
Before launching instances, we may want to create an SSH key pair on EC2 for logging in.
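A Boto 3 sketch for this step might be (the region name is an assumption; the key name matches the one used when launching instances later):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

# Create a key pair on EC2; the private key material is returned only once.
key_pair = ec2.create_key_pair(KeyName='KEY_PAIR_DEMO')
print(key_pair.key_material)
```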
The content of the private key will be returned and printed. Save the returned key in a private directory, say a file at ~/.ssh/id_rsa_aws, and remember to change the permission of the key file (see the command below) to make sure it is not accessible by other users.
chmod 600 ~/.ssh/id_rsa_aws
Launching an instance on EC2 can be complicated because there are many options offered by AWS. For simplicity, we mostly choose default settings here. The following example demonstrates how to create a t2.micro instance, which provides a basic level of CPU performance.
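A sketch of the launch call might look as follows (the region name is an assumption; the security group ID placeholder must be replaced with the ID of the group created earlier):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

instances = ec2.create_instances(
    ImageId='ami-66506c1c',       # Ubuntu Server 16.04
    InstanceType='t2.micro',
    MinCount=1,
    MaxCount=1,
    KeyName='KEY_PAIR_DEMO',      # key pair created earlier
    SecurityGroupIds=['<your-security-group-id>'])  # group created earlier
print(instances[0].id)
```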
The ID of the launched machine will be printed. Now if you re-run the instance-listing code from earlier, the instance just created will be shown.
ami-66506c1c is the image ID of Ubuntu Server 16.04. You can find a detailed list of image IDs on the AWS Marketplace.
To get more information about the instance, we run the following code.
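A sketch that waits for each instance and prints its information might be (the region name is an assumption):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

for instance in ec2.instances.all():
    instance.wait_until_running()  # block until the instance is running
    instance.reload()              # refresh attributes such as the public IP
    print(instance.id, instance.instance_type, instance.public_ip_address)
```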
Then you should get messages like this:
<ID-of-the-instance> t2.micro <public-ip-of-the-instance>
Now you have successfully launched an instance! Try to login to it by
ssh -i ~/.ssh/id_rsa_aws ubuntu@<public-ip-of-the-instance>
Preparing the environment (e.g., installing dependency packages) on all training nodes can be a tedious task. On AWS EC2, it is possible to create a custom image, which can then be used to launch instances (just like we adopted the official Ubuntu one in Section 2.3). Therefore, we only need to prepare everything on one machine and copy its volume to the others, which saves a lot of effort.
Log in to the instance launched previously and install some dependencies:
ssh -i ~/.ssh/id_rsa_aws ubuntu@<public-ip-of-the-instance>
sudo apt update
sudo apt install build-essential openmpi-bin libopenmpi-dev libopenblas-dev zip bzip2
cd ~
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/mpi/mpi-liblinear-2.20.zip
unzip mpi-liblinear-2.20.zip
Generate an SSH key and make other nodes accessible via this shared key:
ssh-keygen
(Several ENTERs)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Go back to your local machine and create an image based on this instance.
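A Boto 3 sketch for this step might be (the region name and image name are assumptions; the instance ID is the one printed when the instance was launched):

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

# Create an image from the prepared instance (the image name is arbitrary).
instance = ec2.Instance('<ID-of-the-instance>')
image = instance.create_image(Name='mpi-liblinear-image')
print(image.id)
```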
Then we will obtain the image’s id. The image will also appear on AWS web console.
Note that you can terminate the instance if you do not need it any more.
In this guide, we consider 4 instances as the training nodes. We are going to launch these instances with the image created in Section 2.4. By passing the image ID to the ImageId argument of ec2.create_instances, the instances will be initialized from the specified image.
instances = ec2.create_instances(
    ImageId=image.id,
    InstanceType='t2.micro',
    MaxCount=4,
    MinCount=4,
    KeyName='KEY_PAIR_DEMO',
    SecurityGroupIds=[sec_group.id]
)
for instance in instances:
    instance.wait_until_running()
    instance.reload()
    print(instance.id, instance.instance_type, instance.public_ip_address)
Then the 4 public IPs corresponding to the instances will be printed. Note that one of the instances should be selected as the “master” node, and the others as “slave” nodes.
Now you have 4 instances available to run MPI LIBLINEAR.
For users who followed the above steps to set up AWS EC2 instances, please go directly to Section 4.
If you already have machines and would like to set up the environment for them, please follow the requirements below.
MPI LIBLINEAR is supposed to be run on UNIX-like machines. Basically, the following libraries are required:
Open MPI
OpenBLAS

If you are using Ubuntu, you may install them by

sudo apt-get update
sudo apt-get install openmpi-bin libopenmpi-dev libopenblas-dev

Note that you have to install them on ALL machines. For other Unix-like systems, you may need to download the source code and build it yourself. For example, you may check the official installation guides of Open MPI and OpenBLAS.
All your machines are accessible via SSH login without a password prompt from any other one. Every machine should have the SSH key, and the key must be authorized on every node (i.e., put into ~/.ssh/authorized_keys). This site may be helpful.
All machines have the exact SAME working directory. You should download MPI LIBLINEAR to the same working directory on each machine. For example, all machines have the code directory at /home/your_username/mpi-liblinear-2.20 (note that the username should also be the same, so you may consider creating a new user, say mpiuser, on all machines).
NOTICE: Firewalls can be an issue for MPI communication. If you encounter errors, please turn the firewall off and try again.
Then check Section 4 for running MPI LIBLINEAR.
After completing the above configuration, we are about to run the distributed training on AWS instances. The procedure is simple.
You have to prepare your data set following the LIBSVM format. Several examples can be found at LIBSVM Data. For demonstration, we select the small data set rcv1_train.binary.
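For reference, a LIBSVM-format file stores one instance per line: a label followed by sparse index:value pairs with indices in ascending order. The lines below are an illustrative example, not actual rcv1 data:

```
+1 3:0.5 10:1 14:1
-1 2:1 7:0.8 14:1
```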
Then we choose one instance as the master node and log in to it via SSH.
ssh -i ~/.ssh/id_rsa_aws ubuntu@<public-ip-of-the-instance>
cd mpi-liblinear-2.20
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2
bzip2 -d rcv1_train.binary.bz2
Create a file named machinefile which contains the IP of each node (the 4 public IPs corresponding to all instances; see the output in Section 2.5).
MASTER_INSTANCE_IP SLAVE_INSTANCE_A_IP SLAVE_INSTANCE_B_IP SLAVE_INSTANCE_C_IP
NOTICE: The first line should be the IP of the master instance; otherwise you might encounter problems when using split.py.

split.py will equally split and distribute the data set to the other instances specified in machinefile.
python3 split.py machinefile rcv1_train.binary
We first need to generate the executable files on the 4 instances. The options -n 4 -npernode 1 specify that 4 processes are launched in total, one per node.
mpirun -n 4 -npernode 1 --machinefile machinefile make
For more advanced options of mpirun, please visit the Open MPI official website.
Now we are ready to train the data distributedly by running the command below, taking the -s 0 solver for example.
mpirun -n 4 -npernode 1 --machinefile machinefile ./train -s 0 rcv1_train.binary
You will see messages like this.
#instance = 20242, #feature = 47236
iter 1 act 8.598e+03 pre 7.607e+03 delta 3.571e+01 f 1.403e+04 |g| 6.197e+02 CG 4
iter 2 act 1.167e+03 pre 1.004e+03 delta 3.571e+01 f 5.433e+03 |g| 1.505e+02 CG 5
iter 3 act 1.408e+02 pre 1.253e+02 delta 3.571e+01 f 4.266e+03 |g| 4.297e+01 CG 4
iter 4 act 1.046e+01 pre 9.910e+00 delta 3.571e+01 f 4.125e+03 |g| 9.166e+00 CG 5
More options of train can be found in the README file of MPI LIBLINEAR.
We then try to predict the training data:
mpirun -n 4 --machinefile machinefile ./predict rcv1_train.binary rcv1_train.binary.model rcv1_train.binary.out
The prediction accuracy will be shown as follows.
Accuracy = 97.7571% (19788/20242)
Amazon EC2 offers two types of instance pricing: On-Demand Instances and Spot Instances. In this tutorial, we only consider On-Demand Instances. Normally the price of Spot Instances is much lower than that of On-Demand Instances (check the price history on the EC2 website). However, your Spot Instances could be interrupted if EC2 needs the machines back. If your application does not involve long training times, you may consider using Spot Instances to save money.
You may first check our FAQ. If you still encounter problems or have any suggestions, please send mails to Chih-Jen Lin.