Q: While using the Python interface, I have a memory efficiency issue when storing instances in the required data structure. What should I do?

Currently, we support Python list/tuple and dictionary as the data structures to store instances. However, it is easy to modify the code to support other data structures.
Here, we use a "ctypes array" from the built-in Python library "ctypes" as an example.
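As background, here is a minimal standalone sketch of the representation used below: each instance becomes a ctypes array of (feature index, feature value) pairs, stored as plain C doubles instead of boxed Python objects (the sample pairs are arbitrary):

	from ctypes import c_double

	pairs = [(1, 0.5), (3, -1.2), (10, 0.7)]
	# A len(pairs)-by-2 C array; row i holds the i-th (index, value) pair
	xi = ((c_double * 2) * len(pairs))()
	for i, (ind, val) in enumerate(pairs):
		xi[i][0] = ind
		xi[i][1] = val
	for row in xi:
		print(int(row[0]), row[1])  # 1 0.5, 3 -1.2, 10 0.7

With this representation in mind, two functions need to be modified.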
  1. Modify "def svm_read_problem" in svmutil.py (LIBSVM) or liblinearutil.py (LIBLINEAR).
    Replace
    	label, features = line
    	xi = {}
    	for e in features.split():
    		ind, val = e.split(":")
    		xi[int(ind)] = float(val)
    	prob_y += [float(label)]
    	prob_x += [xi]
    
    with
    	label, features = line
    	features = features.split()
    	# Two-dimensional len(features)-by-2 array:
    	# column 0 stores the feature index,
    	# column 1 stores the feature value
    	array = (c_double * 2) * len(features)
    	xi = array()
    	for i in range(len(features)):
    		ind, val = features[i].split(":")
    		xi[i][0] = int(ind)
    		xi[i][1] = float(val)
    	prob_y += [float(label)]
    	prob_x += [xi]
    
  2. Modify "def gen_svm_nodearray" in svm.py (LIBSVM) or "def gen_feature_nodearray" in liblinear.py (LIBLINEAR). Here we take "def gen_feature_nodearray" (LIBLINEAR) as an example; modifying "def gen_svm_nodearray" (LIBSVM) is similar.
    Replace
    		index_range = range(1, len(xi))
    	else:
    		raise TypeError('xi should be a dictionary, list or tuple')
    
    with
    		index_range = range(1, len(xi))
    	elif isinstance(xi, Array):
    		# Handle ctypes array
    		return gen_feature_nodearray_from_ctypes_array(xi, feature_max, issparse)
    	else:
    		raise TypeError('xi should be a dictionary, list, tuple or ctypes.Array')
    
    Then, add a new function to handle this data structure:
    	def gen_feature_nodearray_from_ctypes_array(xi, feature_max=None, issparse=True):
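    		# xi is an n-by-2 ctypes array; each element is a (feature index, feature value) pair of C doubles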
    		values = list(xi)
    		if feature_max:
    			assert(isinstance(feature_max, int))
    			values = filter(lambda x: x[0] <= feature_max, values)
    		if issparse:
    			values = filter(lambda x: x[1] != 0, values)
    
    		values = sorted(values, key=lambda x: x[0])
    		ret = (feature_node * (len(values)+2))()
    		ret[-1].index = -1 # for bias term
    		ret[-2].index = -1
    		for idx, (fea, val) in enumerate(values):
    			ret[idx].index = int(fea)
    			ret[idx].value = val
    		max_idx = 0
    		if values:
    			max_idx = int(values[-1][0])
    		return ret, max_idx
    
    And we are done.
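
After both changes, the interface is used exactly as before; only the in-memory representation of each instance changes. A hypothetical LIBLINEAR session (using the bundled heart_scale data as a stand-in file):

	from liblinearutil import svm_read_problem, train, predict

	# x is now a list of ctypes arrays instead of a list of dictionaries
	y, x = svm_read_problem('heart_scale')
	model = train(y, x)
	# predict() also goes through the modified gen_feature_nodearray
	p_labels, p_acc, p_vals = predict(y, x, model)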

We conducted experiments to compare the memory usage of the ctypes array and the Python dictionary. See the tables below.

Test environment:
LIBLINEAR version 1.91
Intel E5620 2.4GHz, 8-core, 64-bit, Memory: 64GB, Disk: 4000GB
****************************************************************************************************
dataset: random-generated, 1.5M instances, 150K features, 100 non-zeros per instance.

data structure     |   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   VER
----------------------------------------------------------------------------------------------------
list of dictionary | 17122 applerma  20   0 21.5g  21g 3808 S    0 34.1   2:58.07 python    (2.7.3)
list of array      | 16433 applerma  20   0 2664m 2.6g 3812 S    0  4.1   4:01.64 python    (2.7.3)
----------------------------------------------------------------------------------------------------
list of dictionary | 12209 applerma  20   0 22.0g  21g 3648 S    0 34.9   4:06.99 python3.2 (3.2.3)
list of array      | 12140 applerma  20   0 2664m 2.6g 3660 S    0  4.1   5:39.53 python3.2 (3.2.3)
----------------------------------------------------------------------------------------------------
list of dictionary | 12069 applerma  20   0 12.5g  12g 3528 S    0 19.7   3:24.30 python3.3 (3.3.0)
list of array      | 11821 applerma  20   0 2660m 2.5g 3532 S    0  4.0   5:26.49 python3.3 (3.3.0)
====================================================================================================
Memory ratio between the two approaches (dictionary RES / array RES)
python (2.7.3) 21 / 2.6 ~ 8.07
       (3.2.3) 21 / 2.6 ~ 8.07
       (3.3.0) 12 / 2.5 ~ 4.80
****************************************************************************************************
dataset: 'url_combined', 2396130 instances, 3231961 features, around 100 non-zeros per instance.

data structure     |   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   VER
----------------------------------------------------------------------------------------------------
list of dictionary | 16730 applerma  20   0 38.6g  38g 3808 S    0 61.4   6:03.76 python    (2.7.3)
list of array      | 17106 applerma  20   0 4810m 4.7g 3812 S    0  7.4   7:50.32 python    (2.7.3)
----------------------------------------------------------------------------------------------------
list of dictionary | 12191 applerma  20   0 40.0g  39g 3648 S    0 63.6   9:06.73 python3.2 (3.2.3)
list of array      | 12232 applerma  20   0 4915m 4.8g 3652 S    0  7.6  10:56.34 python3.2 (3.2.3)
----------------------------------------------------------------------------------------------------
list of dictionary | 12118 applerma  20   0 26.2g  26g 3512 S    0 41.5   6:25.71 python3.3 (3.3.0)
list of array      | 12088 applerma  20   0 4901m 4.7g 3520 S    0  7.5   9:55.15 python3.3 (3.3.0)
====================================================================================================
Memory ratio between the two approaches (dictionary RES / array RES)
python (2.7.3) 38 / 4.7 ~ 8.085
       (3.2.3) 39 / 4.8 ~ 8.125
       (3.3.0) 26 / 4.7 ~ 5.53
****************************************************************************************************
The ctypes array has much better memory efficiency than the Python dictionary. Note that Python 3.3.0 consumes much less memory than earlier Python versions. We also tested Python list/tuple, but these data structures could not finish in a feasible amount of time.

However, the ctypes array seems to take more time than the dictionary, so we next check the running time by reporting

READ TIME: the time to read the data file and store it in the data structure.
TRAIN TIME: the time to convert the data structure to LIBSVM/LIBLINEAR's internal structures and train the model.
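
As a rough sketch of how such timings can be collected (assuming the LIBLINEAR Python interface; 'data.txt' is a placeholder file name):

	import time
	from liblinearutil import svm_read_problem, train

	start = time.time()
	y, x = svm_read_problem('data.txt')  # READ TIME
	read_time = time.time() - start

	start = time.time()
	model = train(y, x)                  # TRAIN TIME
	train_time = time.time() - start

	print('READ: %.3fs  TRAIN: %.3fs  TOTAL: %.3fs'
	      % (read_time, train_time, read_time + train_time))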

********************************************************************************************************************
version: python (3.3.0)
dataset: randomly generated sparse data (for each setting, we ran the experiment twice)
total features: 20K; the Feature Size column below is the number of non-zero features per instance

Data Structure     | Instance Size | Feature Size |   READ TIME(s)    |   TRAIN TIME(s)     |   TOTAL TIME(s)     |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |         1.5M  |         100  | 233.873 / 230.885 | 2442.176 / 2372.660 | 2676.049 / 2603.545 |
list of array      |           --  |          --  | 343.717 / 357.864 | 2142.012 / 2125.639 | 2485.730 / 2483.504 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |         500K  |         100  |  74.262 /  75.424 |  616.010 /  616.809 |  690.272 /  692.234 |
list of array      |           --  |          --  | 110.396 / 112.538 |  698.169 /  691.384 |  808.566 /  803.923 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |          10K  |         100  |   1.644 /   1.779 |    7.987 /    7.945 |    9.632 /    9.725 |
list of array      |           --  |          --  |   2.664 /   2.280 |    9.661 /    9.256 |   12.326 /   11.537 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |          10K  |          1K  |  14.047 /  14.140 |   51.475 /   50.927 |   65.522 /   65.067 |
list of array      |           --  |          --  |  22.564 /  22.553 |   67.680 /   67.753 |   90.244 /   90.307 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |          10K  |         10K  | 138.109 / 138.069 |  111.620 /  105.487 |  249.730 /  243.556 |
list of array      |           --  |          --  | 235.826 / 214.531 |  285.419 /  278.632 |  521.246 /  493.164 |

********************************************************************************************************************
version: python (3.3.0)
dataset: randomly generated dense data (for each setting, we ran the experiment twice)

Data Structure     | Instance Size | Feature Size |   READ TIME(s)    |   TRAIN TIME(s)     |   TOTAL TIME(s)     |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |         1.5M  |         100  | 212.936 / 222.638 | 2501.161 / 2504.249 | 2714.097 / 2726.887 |
list of array      |           --  |          --  | 350.020 / 348.115 | 2394.653 / 2401.696 | 2744.673 / 2749.812 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |         500K  |         100  |  72.373 /  74.261 |  755.902 /  750.575 |  828.275 /  824.836 |
list of array      |           --  |          --  | 116.867 / 118.757 |  702.789 /  706.021 |  819.657 /  824.778 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |          10K  |         100  |   1.574 /   1.526 |    8.292 /    8.115 |    9.866 /    9.641 |
list of array      |           --  |          --  |   2.290 /   2.362 |    9.510 /    9.996 |   11.800 /   12.359 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |          10K  |          1K  |  13.768 /  13.935 |   56.815 /   56.858 |   70.583 /   70.794 |
list of array      |           --  |          --  |  22.779 /  22.954 |   73.417 /   73.496 |   96.197 /   96.451 |
--------------------------------------------------------------------------------------------------------------------
list of dictionary |          10K  |         10K  | 140.726 / 145.051 |  176.726 /  170.361 |  317.453 /  315.413 |
list of array      |           --  |          --  | 230.698 / 233.546 |  348.275 /  330.043 |  578.973 /  563.590 |

********************************************************************************************************************
We also checked other Python versions, 2.7.3 and 3.2.3; the results are similar.
The READ TIME of the ctypes array is higher than that of the dictionary. The reason may be that the ctypes array is not a built-in data structure, so storing parsed values into it is slower. For TRAIN TIME, which structure is better depends on the instance size and the feature size.

Currently, we do not switch to the ctypes array for the following reasons.
  1. The running time of the ctypes array is not always better than that of the dictionary.
  2. A Python dictionary is friendlier to Python users and easier to manipulate (see the sketch below).
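
For example, updating a single feature value is a one-line operation on a dictionary, while a ctypes array must be scanned pair by pair (a small illustration with arbitrary values):

	from ctypes import c_double

	# Dictionary: direct update by feature index
	xi = {1: 0.5, 7: -1.2}
	xi[7] = 0.0

	# ctypes array: scan the (index, value) pairs to find feature 7
	xi_arr = ((c_double * 2) * 2)()
	xi_arr[0][0], xi_arr[0][1] = 1, 0.5
	xi_arr[1][0], xi_arr[1][1] = 7, -1.2
	for pair in xi_arr:
		if int(pair[0]) == 7:
			pair[1] = 0.0  # rows are views into xi_arr, so this updates it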