LIBSVM Data: Multi-label Classification
Multi-label classification has recently become an important topic, but publicly available data sets remain scarce. We have collected the following sets. In each file, labels appear at the beginning of each line and are separated by commas; a minimal sketch of reading this format follows.
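The sketch below reads a file in this format with scikit-learn's load_svmlight_file, whose multilabel option parses the comma-separated labels at the start of each line. The file name is a placeholder.

    from sklearn.datasets import load_svmlight_file

    # Load a multi-label set in LIBSVM format: each line starts with
    # comma-separated labels, followed by index:value feature pairs.
    # "train.svm" is a placeholder file name.
    X, y = load_svmlight_file("train.svm", multilabel=True)
    # X is a sparse feature matrix; y is a list of label tuples,
    # one tuple per instance.
    print(X.shape, y[0])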
bibtex
BlogCatalog
delicious
EUR-Lex
- Source: [LM10a]
- Preprocessing:
Both the tokenized texts and the tf-idf features provided here are the same as those used by AttentionXML (up to tiny numerical differences in the tf-idf features). The texts are extracted from the source documents of the original EUR-Lex dataset. Before tokenization, the symbols '>', '<', '"' and '&' are replaced with their textual representations 'gt', 'lt', 'quot' and 'amp', respectively. The text is then processed with PyLucene's LetterTokenizer, LowerCaseFilter, StopFilter (with stopwords taken from the original dataset) and PorterStemFilter, in that order. To reproduce the texts used by AttentionXML, words containing no English letters are removed, and the Greek letter 'σ' at the end of a word is transformed to 'ς' (except in the formula dH/dσ, which appears in one document). The EUROVOC labels for each instance are taken directly from the original dataset. Finally, the tokenized texts are written in the format labels<TAB>texts, where the labels and the texts are each separated by white spaces.
The tf-idf features are not calculated from the texts provided here; instead, they are calculated from the tokenized texts provided by the original EUR-Lex dataset, using TfidfVectorizer from sklearn with the idf formula modified to be log(N/df). A sketch of this idf modification appears after this entry. The code for generating the dataset is provided.
- # of classes: 3,956
- # of data: 15,449 (training) / 3,865 (testing)
- # of features: 186,104
- Files:
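As a rough illustration of the modified idf mentioned above, the following sketch computes tf-idf weights with idf = log(N/df) by hand, using CountVectorizer instead of patching TfidfVectorizer. The whitespace token pattern and the L2 normalization are assumptions, not taken from the released code.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    def tfidf_log_n_over_df(texts):
        # Raw term counts; tokens are assumed to be whitespace-separated.
        vec = CountVectorizer(token_pattern=r"\S+")
        tf = vec.fit_transform(texts)          # CSR matrix of term counts
        n_docs = tf.shape[0]
        # Document frequency: the number of documents containing each term.
        df = np.bincount(tf.indices, minlength=tf.shape[1])
        idf = np.log(n_docs / df)              # the modified idf: log(N/df)
        X = tf.multiply(idf).tocsr()           # tf * idf, column-wise
        return normalize(X, norm="l2"), vec    # L2 normalization is assumed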
EURLEX57K
- Source: [IC19a]
- Preprocessing:
The data for generating the raw texts are downloaded from this website. Following lmtc-eurlex57k, we concatenate the four sections header, recitals, main_body, and attachments, with white space between them. In addition, all newlines are replaced with white spaces; a sketch of this step appears after this entry. The code used to generate the sets is also provided. The data is in the format
ID<TAB>labels<TAB>raw texts.
- # of classes: 4,271
- # of data: 45,000 (training) / 6,000 (validation) / 6,000 (testing)
- # of features: N/A
- Files:
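A minimal sketch of the concatenation step, assuming each document is a JSON file with the four section fields named above; the field names celex_id and concepts for the ID and the labels are likewise assumptions based on the EURLEX57K release.

    import json

    SECTIONS = ["header", "recitals", "main_body", "attachments"]

    def to_raw_text(path):
        with open(path) as f:
            doc = json.load(f)
        parts = []
        for key in SECTIONS:
            value = doc.get(key, "")
            if isinstance(value, list):   # some sections are lists of strings
                value = " ".join(value)
            parts.append(value)
        # Concatenate the four sections with white space between them,
        # then replace all newlines with white spaces.
        text = " ".join(parts).replace("\n", " ")
        labels = " ".join(doc.get("concepts", []))
        return "\t".join([doc.get("celex_id", ""), labels, text])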
Flickr
mediamill (exp1)
- Source: Mediamill / The Mediamill Challenge Problem
- Preprocessing:
We combine all binary classification problems into a multi-label one.
- # of classes: 101
- # of data: 30,993 (training) / 12,914 (testing)
- # of features: 120
- Files:
PPI
rcv1v2 (topics; subsets)
- Source: [DL04b]
- # of classes: 101
- # of data: 3,000 (training) / 3,000 (testing)
- # of features: 47,236
- Files:
rcv1v2 (topics; full sets)
- Source: [DL04b]
- Preprocessing:
The four test sets correspond to the four test files from the RCV1 site (appendix B.13); a combined file is also provided. In the test set, the number of classes is 103. We further provide files of the original labels and tokenized texts (appendix B.12 of the RCV1 site) in the format
ID<TAB>labels<TAB>tokens,
where the labels and the tokens are each separated by spaces. A sketch of reading this format appears after this entry.
- # of classes: 101
- # of data: 23,149 (training) / 781,265 (testing)
- # of features: 47,236
- Files:
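A minimal sketch of reading the tokenized-text files just described; the file name is a placeholder.

    def read_tokenized(path):
        # Each line has the form ID<TAB>labels<TAB>tokens, with the
        # labels and the tokens separated by spaces.
        with open(path) as f:
            for line in f:
                doc_id, labels, tokens = line.rstrip("\n").split("\t")
                yield doc_id, labels.split(), tokens.split()

    # "rcv1_tokens.txt" is a placeholder file name.
    for doc_id, labels, tokens in read_tokenized("rcv1_tokens.txt"):
        print(doc_id, len(labels), len(tokens))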
rcv1v2 (industries; full sets)
- Source: [DL04b]
- Preprocessing:
The four test sets correspond to the four test files from the RCV1 site. In the test set, the number of classes is 350.
- # of classes: 313
- # of data: 23,149 (training) / 781,265 (testing)
- # of features: 47,236
- Files:
rcv1v2 (regions; full sets)
- Source: [DL04b]
- Preprocessing:
The four test sets correspond to the four test files from the RCV1 site. In the test set, the number of classes is 296.
- # of classes: 228
- # of data: 23,149 (training) / 781,265 (testing)
- # of features: 47,236
- Files:
scene-classification
- Source: [MB04a]
- # of classes: 6
- # of data: 1,211 (training) / 1,196 (testing)
- # of features: 294
- Files:
siam-competition2007
- Source: SIAM Text Mining Competition 2007
- Preprocessing:
We remove "." (periods) before transforming the data to vectors. We use binary term frequencies and normalize each instance to unit length; a sketch appears after this entry.
- # of classes: 22
- # of data: 21,519 (training) / 7,077 (testing)
- # of features: 30,438
- Files:
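A minimal sketch of the weighting just described, using scikit-learn; the whitespace token pattern is an assumption.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Binary term frequencies, no idf weighting, and L2 normalization,
    # so each instance has unit length. Tokenization on white space
    # is an assumption.
    vectorizer = TfidfVectorizer(binary=True, use_idf=False, norm="l2",
                                 token_pattern=r"\S+")
    X = vectorizer.fit_transform(documents)  # `documents`: list of strings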
Wiki10-31K
- Source: [AZ09a]
- Preprocessing:
The raw texts are extracted from the original HTML documents by concatenating all the <p> tags within the block <div id="bodyContent"> ... </div> in each file, with white spaces between them. We tried to generate the raw texts, and also the split, as closely as possible to those used in AttentionXML, which are based on those provided by the Extreme Classification Repository. Due to updates of the source pages, eight instances are slightly different from those in AttentionXML; the other instances are exactly the same, except that tabs in the texts are replaced with white spaces. The raw text data is in the format labels<TAB>raw texts, where the labels are separated by spaces.
The tf-idf features are calculated from the raw texts provided here using sklearn's TfidfVectorizer with the default configuration, except that min_df is set to 3 to avoid too many features; a sketch appears after this entry. Note that the resulting tf-idf features differ from those provided by the Extreme Classification Repository. The code used to generate the raw texts and the tf-idf features is provided.
- # of classes: 30,938
- # of data: 14,146 (training) / 6,616 (testing)
- # of features: 104,374
- Files:
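A minimal sketch of the tf-idf setup above; fitting the vocabulary on the training texts only is an assumption.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # sklearn defaults except min_df=3, which drops terms that appear
    # in fewer than three documents.
    vectorizer = TfidfVectorizer(min_df=3)
    X_train = vectorizer.fit_transform(train_texts)  # fit on training texts
    X_test = vectorizer.transform(test_texts)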
yeast
- Source: [AE02a]
- # of classes: 14
- # of data: 1,500 (training) / 917 (testing)
- # of features: 103
- Files:
YouTube