Rajan edited this page Dec 7, 2022 · 21 revisions

RCV1-V2 (Reuters Corpus Volume 1, Text categorization dataset) Example for VW

One source of test data is Leon Bottou's page on stochastic gradient descent. To use it, download the RCV1-V2 dataset here.

There are three files: rcv1.train.dat.gz, rcv1.test.dat.gz, and vw_process. vw_process is a simple script that converts from svmlight format to VW's format.

Once in VW's format, the individual files look like:

1 |f 13:3.9656971e-02 24:3.4781646e-02 69:4.6296168e-02 85:6.1853945e-02 ... 
0 |f 9:8.5609287e-02 14:2.9904654e-02 19:6.1031535e-02 20:2.1757640e-02 ... 
...

From the above, you can see that the input data format is similar to SVMlight's feature:value sparse representation format. There are two important differences:

  1. A feature name can be any string that does not contain a colon ':', space ' ', or pipe '|' (which are special characters). The example above does not use this. In general, it is pretty handy because you can feed in much less processed data than most learning algorithms accept.
  2. The features are divided into namespaces. The semantics of a namespace is: features with the same name but a different namespace are different features. This example just has one namespace "f" which is the simplest (and probably most common) case.
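
Given these differences, turning an svmlight line into a VW line mostly means inserting a namespace marker after the label. The following Python sketch illustrates the idea. It is not the vw_process script itself, and the function name svmlight_to_vw and the default namespace "f" are assumptions made for illustration.

```python
def svmlight_to_vw(line, namespace="f"):
    """Convert one svmlight line ("label idx:val ...") into a VW-style line.

    Hypothetical helper for illustration; not the vw_process script.
    """
    # Drop a trailing svmlight comment, if any.
    body = line.split("#", 1)[0].strip()
    label, *features = body.split()
    # Labels are passed through unchanged; only the namespace is added.
    return f"{label} |{namespace} " + " ".join(features)


print(svmlight_to_vw("1 13:3.9656971e-02 24:3.4781646e-02"))
```

The label is copied as-is, so any remapping of labels (for example from -1 to 0) would have to be done separately.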

There are a couple of variations on the above format. If you want to importance-weight examples, place the importance weight after the label and before the first namespace. A missing importance weight is treated as 1 by default. Similarly, if a feature has a value of 1, it can be written as just its name rather than name:1.
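
To make these rules concrete, here is a minimal Python sketch of a parser for single-label lines like the ones above. It is a simplification, not VW's actual parser: the name parse_vw_line is hypothetical, it assumes the namespace name follows the pipe immediately, and it ignores any other parts of VW's input format.

```python
def parse_vw_line(line):
    """Return (label, importance, {namespace: {feature: value}}).

    Simplified, hypothetical parser for illustration only.
    """
    head, *sections = line.strip().split("|")
    head_tokens = head.split()
    label = float(head_tokens[0])
    # A missing importance weight is treated as 1.
    importance = float(head_tokens[1]) if len(head_tokens) > 1 else 1.0

    namespaces = {}
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        # Assumption: the first token after the pipe is the namespace name.
        name, feature_tokens = tokens[0], tokens[1:]
        features = namespaces.setdefault(name, {})
        for token in feature_tokens:
            feature, sep, value = token.partition(":")
            # A bare feature name is shorthand for name:1.
            features[feature] = float(value) if sep else 1.0
    return label, importance, namespaces


print(parse_vw_line("1 2.0 |f cat dog:0.5"))
print(parse_vw_line("0 |a x |b x:3"))
```

In the second call the feature x appears under both namespace a and namespace b and is stored twice, reflecting that the same name in different namespaces denotes different features.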

A command for training is the following.

vw rcv1.train.vw.gz --cache_file cache_train -f r_temp

Here:

  1. --cache_file cache_train makes VW parse the data into its own internal compressed format and store it in cache_train. The second time you run the above command, VW reads the cache instead, so it should be much faster: about 1.5 seconds on my current desktop machine.
  2. -f r_temp stores the output regressor in the file r_temp.

Next, you can test according to the following:

vw -t --cache_file cache_test -i r_temp -p p_out rcv1.test.vw.gz

Here the flags are:

  1. -t tells VW not to use the labels for training.
  2. -i r_temp loads the regressor at r_temp before examples are processed.
  3. -p p_out writes the predictions to the file p_out.

To measure performance, I often use perf, a tool that Rich Caruana put together for the 2004 KDD Cup challenge. This software has the advantage that many people cared that it worked right. To use perf, first create a file with the labels:

zcat rcv1.test.vw.gz | cut -d ' ' -f 1 | sed -e 's/^-1/0/' > labels

and then type:

perf -ACC -files labels p_out -t 0.5
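
If perf is not at hand, a comparable accuracy figure can be computed in a few lines of Python. This is a sketch rather than a reimplementation of perf: the function names are hypothetical, it assumes the labels and predictions are given one number per example, and it counts a prediction at or above the threshold as positive, which may differ from how perf treats ties.

```python
def read_first_column(path):
    """Read the first whitespace-separated number on each non-empty line."""
    with open(path) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def accuracy(labels, predictions, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions differ in length")
    correct = sum(
        (p >= threshold) == (y >= threshold)
        for y, p in zip(labels, predictions)
    )
    return correct / len(labels)


# Small in-memory demonstration with made-up numbers (not RCV1 results).
demo_labels = [1, 0, 1, 0]
demo_predictions = [0.9, 0.2, 0.4, 0.6]
print(accuracy(demo_labels, demo_predictions))
```

With the files from the steps above, the inputs would be read_first_column("labels") and read_first_column("p_out"), and the error rate is one minus the returned accuracy.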

The results on my machine are summarized by the following table(*):

Method    Wall-clock execution time    Test set error rate
VW        3.0s                         5.54%
svmsgd    21.8s                        5.74%

There are several things to understand about the results.
  1. VW optimizes squared loss and then thresholds the prediction at 0.5, whereas svmsgd optimizes hinge loss, so their internal optimizations fundamentally differ. For svmsgd, 5.74% was the best I found for one pass with lambda = 0.00001. Note that no attempt is made here to guard against overfitting in parameter tuning.
  2. The timing numbers are wall-clock execution time. svmsgd spends about 19s of wall-clock time loading the data and making it ready, and then trains in 0.38s of CPU time. VW runs fully online, so loading the data and learning from it are fundamentally interleaved.
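
The two objectives mentioned in the first point can be written down in a few lines. This is a sketch of the standard textbook definitions, not code taken from VW or svmsgd. The function names are made up for illustration, and it assumes squared loss is used with labels in {0, 1} and hinge loss with labels in {-1, +1}.

```python
def squared_loss(prediction, label):
    """Squared loss for a real-valued prediction and a label in {0, 1}."""
    return (prediction - label) ** 2


def hinge_loss(score, label):
    """Hinge loss for a real-valued score and a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


def classify_at_half(prediction):
    """Turn a squared-loss prediction into a 0/1 class by thresholding at 0.5."""
    return 1 if prediction >= 0.5 else 0


print(squared_loss(0.5, 1), hinge_loss(0.5, 1), classify_at_half(0.75))
```

Squared loss keeps penalizing predictions that overshoot the label, while hinge loss is zero once the score clears the margin, which is one reason the two error rates are not directly comparable.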

(*) The comparison with svmsgd is obsolete, as Leon has updated his code.
