PyML tutorial

Author: Asa Ben-Hur

Overview

PyML is an interactive, object-oriented framework for machine learning written in Python. PyML focuses on kernel methods for classification and regression, including Support Vector Machines (SVM). It provides tools for feature selection and model selection, syntax for combining classifiers, and methods for assessing classifier performance.

Installation

Requirements:

  • Python version 2.5 or higher.
  • The numpy package, version 1.0 or higher.
  • The matplotlib package is required for the graphical functionality of PyML, but PyML can be used without it.

Currently Unix/Linux and Mac OS-X are supported (the library contains some C++ code, so it is not automatically portable). A setup.py script is provided, so installation follows the standard python way:

python setup.py build
python setup.py install

Recent versions of python allow you to install packages in a per-user site-packages directory, which does not require administrator privileges (see the python documentation for details).

To check that PyML is installed correctly, go to the data directory of PyML, run the python interpreter and type

>>> from PyML import *

NOTE: You may get an import error if you perform this import operation from the directory in which you performed the installation. To test the installation:

>>> from PyML.demo import pyml_test
>>> pyml_test.test('svm')

which will test the SVM functionality of PyML. To check the version of PyML you’re using do:

>>> import PyML
>>> PyML.__version__

Getting help

To get help on any PyML class, method or function, type: help('PyMLname') at the python prompt or look at the API documentation which is at doc/autodoc/index.html of the PyML distribution. A copy of the tutorial is provided with the PyML distribution at doc/tutorial.pdf.

PyML overview

PyML has classes that contain data (dataset containers), and support computation of various kernels as well as various kernel-based classifiers. Additional functionality includes feature selection, model selection and preprocessing.

Importing modules and classes

NOTE: In version 0.7.0 the code has been restructured, so if you have used earlier versions you will need to change the way you import modules and classes.

There are several ways of importing a PyML class. Suppose you want to import the SparseDataSet class, which is in PyML/containers/vectorDatasets. The standard import statement

>>> from PyML.containers.vectorDatasets import SparseDataSet

will obviously work. As a shortcut both

>>> from PyML.containers import SparseDataSet

and

>>> from PyML import SparseDataSet

will do the job. The statement from PyML import * imports some of the most commonly used PyML modules and classes. In what follows we will assume you have invoked this command, and we will point out when additional imports are required.

Data Containers

A dataset is a collection of patterns and their class labels. Class labels and pattern IDs are stored in a Labels object, accessible as the labels attribute of a dataset; it holds the class label and pattern ID of each pattern.

PyML has several dataset containers:

  • A container for vector data: VectorDataSet.
  • A container for sparse vector data: SparseDataSet.
  • A container for holding a precomputed kernel matrix: KernelData.
  • A container for strings: SequenceData.
  • A container for patterns that are composed of pairs of objects: PairDataSet.
  • A container that collects an assortment of datasets into a single container: Aggregate.

PyML offers several methods for constructing a dataset instance:

  1. Reading data from a file (the sparse format is compatible with the format used by libsvm and svmlight); delimited files are also supported.
  2. Construction from a 2-dimensional array provided as a numpy array or a list of python lists.
  3. Various flavors of copy construction.

Reading data from a file

Data is read from a file by calling the constructor of a dataset container class with a file name argument. PyML supports two file formats for vector data:

  • Delimited format (comma, tab, or space delimited).
  • Sparse format.

Note that the non-sparse dataset container VectorDataSet only supports the delimited, non-sparse format; the sparse containers handle both formats. If you need to convert a dataset from one format to another, read it into a sparse dataset container and use its save method, specifying the file format.

Sparse file format

The sparse file format supported by PyML is similar to that used by LIBSVM (LIBSVM-format is recognized by PyML). Each pattern is represented by a line of the form:

[id,]label fid1:fval1 fid2:fval2 ....

or for unlabeled data:

[id,]fid1:fval1 fid2:fval2 ....

where:

  • id - a pattern ID [optional].
  • label - the class label associated with the pattern. Class labels can be arbitrary strings, not just -1/1; PyML converts the labels into an internal representation, which is a number between 0 and the number of classes minus 1.
  • data is provided as pairs fid:fval, where fid is the feature ID and fval is its value; feature IDs can be integers or strings. A caveat for string-based feature IDs: these are converted to integers using python’s hash function. If the hash values of the feature IDs are not unique, an error will occur notifying you of a “non-unique hash”. If that occurs, convert your feature IDs into integers and the problem will go away. This usually happens only when the data contains a few hundred thousand features.

Using one of the example files in sparse format provided with the PyML distribution:

>>> data = SparseDataSet('heartSparse.data')

Typing the name of the variable gives some useful information about the data:

>>> data
<SparseDataSet instance>
number of patterns: 270
number of features: 13
class Label  /  Size
 +1 : 120
 -1 : 150

Other useful things: len(data) is the number of examples in a dataset (note that other PyML objects also define a length). data.numFeatures is the number of features in the data. Containers of vector data also have a variety of other member functions such as mean, std, scale, translate, and eliminateFeatures; check the documentation for details.
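For example, continuing with the heart dataset loaded above:

>>> len(data)
270
>>> data.numFeatures
13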

Delimited files

Constructing a dataset from a delimited file is the same as for the sparse format, only now we have to give the parser some hints on which column contains the labels and which column contains the pattern IDs (if any). As in the sparse format, class labels can be arbitrary strings. For example:

>>> data = VectorDataSet('iris.data', labelsColumn = -1)

or

>>> data = SparseDataSet('iris.data', labelsColumn = -1)

The labelsColumn keyword argument tells the parser which column of the file contains the labels. Column numbering follows the python array indexing convention: labelsColumn = -1 denotes the last column, and counting starts at zero, i.e. labelsColumn = 0 denotes the first column. If the file also contains pattern IDs, the idColumn keyword argument gives the column in which these are located. If labelsColumn is equal to 1, the parser assumes that the first column contains the IDs. Labels and IDs can only appear in certain columns; the allowed combinations are (idColumn, labelsColumn) = (None, None), (None, 0), (None, -1), (0, 1), (0, -1), where None means that a value is not provided.
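For instance, to read a file whose first column contains pattern IDs and whose last column contains the labels (the file name here is hypothetical):

>>> data = SparseDataSet('myData.csv', idColumn = 0, labelsColumn = -1)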

A few notes:

  • If the first non-comment line of a delimited file is non-numeric, its tokens are taken as feature IDs.
  • PyML recognizes the type of file (sparse or delimited); if your file is not recognized correctly you can give a hint to the constructor which parser to invoke using the ‘hint’ keyword argument whose values can be ‘sparse’ or ‘csv’.
  • VectorDataSet has a few limitations: It does not support feature selection, and cannot be used with the liblinear SVM solver, which is the fastest PyML SVM solver.

Gzipped files

All dataset containers can read from gzipped files. You don’t need to do anything – the parser automatically detects gzipped files and substitutes the default file handler with the python gzip.GzipFile handler. A file is detected as being a gzipped file if python’s gzip.GzipFile handler can read from it.

Copy Construction

To make a copy of a dataset data of class VectorDataSet:

>>> data2 = VectorDataSet(data)

An even better way of doing this is:

>>> data2 = data.__class__(data)

This says “make me another copy of yourself”, so that you don’t even need to keep track of the class of the data object. This method of copy construction works for ANY PyML object, not just data containers.

Now going back to the iris data: suppose we only want to look at two out of the three classes; then the following form of copy construction is used:

>>> data2 = data.__class__(data, classes = ['Iris-versicolor', 'Iris-virginica'])

The classes keyword gives a list of classes to be extracted. One can also give a list of patterns to copy in the form:

>>> data3 = data.__class__(data, patterns = listOfPatterns)

If you have a list of patterns you want to eliminate:

>>> data4 = data.__class__(data, patterns = misc.setminus(range(len(data)),
                                                          patternsToEliminate))

Constructing a dataset from an array

You can use PyML to create datasets on the fly from python lists or numpy arrays. In this method of construction the input is a two-dimensional array. Suppose this array is called X; it is then assumed that X[i] is pattern i in the dataset. Given X you can now create a dataset using:

>>> data = VectorDataSet(X)

Note that you can use SparseDataSet as well. If your dataset is labeled, you can add the labels (a list of strings whose length equals the number of rows of your data matrix) by passing them with the ‘L’ keyword. You can also pass pattern IDs and feature IDs (by default the pattern and feature IDs are just running numbers).

>>> data = VectorDataSet(X, L = L, patternID = patternID, featureID = featureID)
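A minimal sketch using a list of python lists (the values and labels are made up for illustration):

>>> X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
>>> L = ['+1', '-1', '+1']
>>> data = VectorDataSet(X, L = L)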

Adding features to a dataset

Once you have created a dataset it is possible to add features to it. You can add one feature at a time by calling a dataset’s addFeature(id, values) method, where id is the feature’s ID and values is a list of that feature’s values. If you would like to update a dataset with the features in another dataset object, use the method addFeatures(other). You can start the process of adding features from an existing dataset object or from an empty one. An empty dataset is constructed as SparseDataSet(n), where n is an integer which specifies the number of examples.
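A minimal sketch of building a dataset feature by feature (the feature IDs and values are made up, and otherData stands for some other dataset object):

>>> data = SparseDataSet(3)                   # an empty dataset with 3 examples
>>> data.addFeature('f1', [0.5, 1.2, 0.0])    # add a single feature
>>> data.addFeature('f2', [1.0, 0.0, 2.5])
>>> data.addFeatures(otherData)               # add all the features of another dataset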

Adding/modifying labels

If you want to add labels to a dataset or want to change a dataset’s labels you can do so by using the dataset’s attachLabels method that takes as an argument a Labels object. Suppose your labels are stored in a list L and the IDs are stored in a list called ids, proceed as follows:

>>> data.attachLabels(Labels(L, patternID = ids))

If you are dealing with a classification problem the elements of L should be strings. To convert a multi-class dataset into a two-class dataset, use the oneAgainstRest function of the datafunc module.

Using kernels

Many of the classifiers implemented in PyML are kernel classifiers, in which classification is performed using a kernel function that measures the similarity of a pair of patterns. A data container comes equipped with a kernel object (making it in effect a feature space). By default a linear kernel is attached to a container object. To change the kernel, use the attachKernel method of the container. You can either construct a kernel object from the ker module, e.g. k = ker.Polynomial(degree = 2), followed by data.attachKernel(k), or alternatively do data.attachKernel('poly', degree = 2). For the Gaussian kernel: data.attachKernel('gaussian', gamma = 1). The keyword arguments passed to attachKernel are passed on to the kernel constructor. To view individual entries of the kernel matrix, use its eval function as follows: data.kernel.eval(data, 0, 0) computes the 0,0 entry of the kernel matrix for a dataset data. The kernel object is passed the data object in this call, since a kernel is not aware of the dataset to which it is attached.

The kernel matrix associated with a particular dataset (or rather with the feature space associated with the dataset by virtue of the kernel function attached to it) can be extracted using the getKernelMatrix() method of a dataset. This returns the kernel matrix as a two-dimensional numpy array. The kernel matrix can then be displayed using the ker.showKernel(data) function (requires the matplotlib library). Note that you are passing the dataset rather than the kernel matrix.
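For example (a minimal sketch; the gamma value is illustrative, and showKernel requires matplotlib):

>>> data.attachKernel('gaussian', gamma = 0.5)
>>> K = data.getKernelMatrix()    # a two-dimensional numpy array
>>> ker.showKernel(data)          # display the kernel matrix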

Kernels for biological sequences

The method sequenceData.spectrum_data constructs a sparse dataset that represents the k-mer composition of a sequence (say, DNA or protein). This kernel has been shown to be useful in a variety of sequence analysis problems. It was originally proposed in: C. Leslie, E. Eskin, and W. S. Noble, “The spectrum kernel: a string kernel for SVM protein classification”. This implementation allows you to account for a single mismatch. Type help(sequenceData.spectrum_data) for details on how to use this method.

The spectrum kernel ignores the position at which a k-mer occurs within the sequence. If position is relevant for your problem, use the method sequenceData.positional_kmer_data. This is an implementation of the so-called weighted-degree kernel of Sonnenburg et al. In this representation, each k-mer in your sequence is indexed by the position at which it appears. This implementation does not allow mismatches and shifts; if those are important for you, use the SequenceData container described below, which provides these features. Because this implementation explicitly constructs the features, it is easy to analyze the contribution of individual features, perform feature selection, and use the very fast liblinear solver.

Non-vector data

PyML supports several containers for non-vector data:

  • A class for storing pre-computed kernels (KernelData). This class supports kernels stored in tab/space/comma delimited format. Each row of the kernel is stored in a line in the file; pattern IDs appear in the first column, followed by the corresponding kernel matrix row.
  • A class for biological sequence data: SequenceData. So far this container supports only the so-called weighted-degree kernel. Other sequence kernels, such as the spectrum kernel can be used by constructing sparse datasets as described above.
  • A container for storing pairs of data objects (PairDataSet).

Usage example:

# construct the dataset:
>>> kdata = KernelData(kernelFile)
# construct a Labels object out of a file that contains the labels
# a labels file is a delimited file with two columns -- the first
# contains the pattern IDs, and the second the labels
>>> labels = Labels(labelsFile)
# attach the labels to the dataset:
>>> kdata.attachLabels(labels)

The aggregate container

In many applications you will be faced with heterogeneous data that is composed of several different types of features, where each type of feature would benefit from a different kernel. The Aggregate container is what you need in this case. To use this container, construct a dataset object for each set of features as appropriate. Suppose these are stored in a python list called datas. The aggregate is then constructed as:

>>> dataAggregate = Aggregate(datas)

It is assumed that each dataset in the list of datasets refers to the same set of examples in the same order. The kernel of the Aggregate object is the sum of the kernels of the individual datasets. In constructing the aggregate you can also set a weight for each dataset using the weights keyword argument of the Aggregate constructor. Also note that the Aggregate container works only with the C++ data containers.
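For example, to weight the contribution of each dataset's kernel (a minimal sketch; the weight values are illustrative and assume that datas contains two datasets):

>>> dataAggregate = Aggregate(datas, weights = [1.0, 0.5])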

Training and testing a classifier

All the classifiers in PyML offer the same interface:

  • A constructor that offers both copy construction, and construction “from scratch”.
  • train(data) - train the classifier on the given dataset
  • test(data)
  • trainTest(data, trainingPatterns, testingPatterns)
  • cv(data)
  • stratifiedCV(data)
  • loo(data)

Most classifiers also implement classify(data, i) and decisionFunc(data, i) methods that classify individual data points; these are not typically invoked by the user, who is encouraged to use the test method instead, since it does some additional bookkeeping.
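For example, training on one subset of patterns and testing on the rest (a minimal sketch; the split below is arbitrary):

>>> s = SVM()
>>> r = s.trainTest(data, range(200), range(200, len(data)))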

SVMs

Let’s go back to the 'heart' dataset and construct an SVM classifier for that problem:

>>> s=SVM()
>>> s
<SVM instance>
C : 10.000000
Cmode: classProb
trained: 0

Notes: If you would like to change the value of the parameter C, simply type:

>>> s.C = some_other_value

Or set C in the constructor:

>>> s = svm.SVM(C = someValue)

The Cmode attribute indicates how the C parameter is used; there are two modes: ‘equal’, in which all classes get the same value of C, and ‘classProb’ (the default), in which each class is assigned a different value of C, inversely proportional to the number of examples in the class. This way, each misclassification in the minority class (the class with fewer examples) is given a higher penalty, which is a good way of handling datasets with unbalanced class distributions.
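For example, to give all classes the same value of C (a minimal sketch; Cmode is set in the constructor, as in other examples in this tutorial):

>>> s = SVM(C = 10, Cmode = 'equal')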

To train the svm use its train method:

>>> s.train(data)

By default the libsvm solver is used in training. You have a few other options:

  • For large datasets, and if all you need is a linear SVM, you can use the liblinear solver, which can be orders of magnitude faster than libsvm. Instantiate an SVM which uses liblinear by the command

    >>> s = svm.SVM(optimizer = 'liblinear')
    

    for an L1 loss SVM, or

    >>> s = svm.SVM(optimizer = 'liblinear', loss = 'l2')
    

    for an L2 loss SVM. The liblinear optimizers require the use of the SparseDataSet container.

  • When your dataset is a non-vector container, you need to use the PyML native optimizer; it is chosen automatically in this case. If you want to choose it explicitly, instantiate an SVM instance as svm.SVM(optimizer = 'mysmo'). It is slower than libsvm, so it is not the default.

To assess the performance of a classifier, use its cv (cross-validation) method:

>>> r = s.cv(data, 5)

This performs 5 fold cross validation and stores the results in a Results object. An alternative way of specifying the number of folds is:

>>> r = s.cv(data, numFolds=5)

Stratified cross-validation (stratifiedCV) is a better choice when the data is unbalanced, since it samples according to the class size. There is also a leave-one-out method (loo). The Results object obtained by performing cross-validation stores information on classification accuracy in each of the folds, and averaged over the folds. Try printing the object to get an idea of what it provides. A detailed description of the Results object is found in the Appendix on Results objects.
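For example (a minimal sketch; the number-of-folds argument is assumed to work as it does for cv):

>>> r = s.stratifiedCV(data, 10)
>>> print r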

NOTE: As of version 0.6.9, SVM training is supported only for the C++ containers.

Saving results and models

To save a results object use its save method, which saves the object using python’s pickle module. To load results, import the loadResults function: from PyML.evaluators.resultsObjects import loadResults. Note that the save method of a Results object first converts the object into one whose representation will remain constant between versions, so that results objects remain readable across versions.

The model obtained after training an SVM can be saved for future use (saving of trained classifiers is only available for the SVM and OneAgainstRest classifiers):

>>> s.train(data)
>>> s.save(file_name)
>>> new_svm = SVM()
>>> new_svm.load(file_name, data)
>>> results = new_svm.test(test_data)

Note that loading a saved SVM requires giving the method the dataset that was used to train the SVM.

Using kernels

As mentioned above, a dataset comes equipped with a kernel, so the SVM object knows what type of kernel to use in training the classifier. To override the kernel attached to a dataset define a kernel object and instantiate an SVM object that uses that kernel:

>>> k = ker.Polynomial()
>>> s = SVM(k)

Alternatively, attach a different kernel to the dataset:

>>> data.attachKernel('polynomial')

This attaches a polynomial kernel (default degree = 2) to the dataset (attachKernel also accepts a kernel object), and

>>> r = s.cv(data)

performs CV on an SVM with a polynomial kernel.

Linear SVMs have functionality not found in the nonlinear SVM, namely explicit computation of the weight vector; this results in more efficient classification, and is also used for feature selection (see the RFE class in the feature selection module).

SVM regression

Reading data for a regression problem is different than for classification: you need to tell the parser to interpret the labels as numbers rather than class labels. The file formats are the same, simply replace class label by the numerical value you want to predict. To read data use e.g.:

>>> data = SparseDataSet(fileName, numericLabels = True)

Now construct a Support Vector Regression (SVR) object:

>>> from PyML.classifiers.svm import SVR
>>> s = SVR()

This object supports the classifier interface, except for stratifiedCV; the classify function returns the predicted value, performing the same function as the decisionFunc method. The result of any of the testing methods (cv, test, etc.) is similar to the Results object used for classification problems, and contains the attributes Y (the predicted values), givenY (the given values), and patternID (the IDs of the patterns that were tested).

Note that SVR has an additional parameter – eps (epsilon). See any standard SVM reference for an explanation of the epsilon insensitive loss-function.
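A minimal sketch putting this together (the parameter values are illustrative, and it is assumed that eps can be set in the constructor in the same way as C):

>>> s = SVR(C = 10, eps = 0.1)
>>> r = s.cv(data, 5)
>>> print r.getRMSE()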

Other classifiers in PyML

Additional classifiers include:

  • k-nearest neighbor classifier (class KNN).
  • Ridge Regression classifier (class RidgeRegression).

To instantiate a KNN instance:

>>> from PyML.classifiers import knn
>>> num_neighbors = 3
>>> k = knn.KNN(num_neighbors)

A ridge regression instance:

>>> from PyML.classifiers import ridgeRegression
>>> regularization_param = 1
>>> rr = ridgeRegression.RidgeRegression(regularization_param)

Both the KNN classifier and the ridge regression classifier are implemented as kernel methods.

Multi-class classification

Multi-class classifiers are found in the multi module. PyML supports one-against-one and one-against-the-rest classification.

To construct a one-against-the-rest classifier that uses a linear SVM as a base classifier:

>>> from PyML.classifiers import multi
>>> mc = multi.OneAgainstRest(SVM())

To assess the performance of the classifier proceed as usual:

>>> r = mc.cv(data)

A one-against-one classifier is provided by the class OneAgainstOne.

Model selection

Selecting classifier parameters is performed using the ModelSelector class in the modelSelection module. In order for the ModelSelector object to know which sets of parameters it needs to consider, it must be supplied with a Param object. The Param object specifies a classifier, the parameter that needs to be selected, and a list of values to consider for that parameter:

>>> param = modelSelection.Param(svm.SVM(), 'C', [0.1, 1, 10, 100, 1000])

The Param object is now supplied to a ModelSelector:

>>> m = modelSelection.ModelSelector(param)

The ModelSelector class implements the standard classifier interface, and

>>> m.train(data)

performs cross-validation for each value of the parameter defined in the Param instance, and chooses the value of the parameter that gives the highest success-rate. It then trains a classifier using the best parameter choice. You can also create a ModelSelector that optimizes a different measure of accuracy, for example the ROC score:

>>> m = modelSelection.ModelSelector(param, measure = 'roc')

To perform a grid search for a two-parameter classifier (e.g. SVM with a Gaussian kernel), use the ParamGrid object. This generates a grid of parameter values using the values supplied by the user:

>>> param = modelSelection.ParamGrid(svm.SVM(ker.Gaussian()), 'C', [0.1, 1, 10, 100, 1000],
                                     'kernel.gamma', [0.01, 0.1, 1, 10])

The ParamGrid object is then supplied to a ModelSelector as before. For SVMs with a Gaussian kernel, the SVMselect class performs a more efficient search: it first searches for an optimal value of the width of the Gaussian kernel, using a relatively low value of the soft-margin constant, and then optimizes the soft-margin constant once the width parameter is chosen.
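The ModelSelector itself behaves like any other classifier, so it fits into the usual workflow (a minimal sketch; testData is a hypothetical held-out dataset):

>>> m = modelSelection.ModelSelector(param)
>>> m.train(data)
>>> r = m.test(testData)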

Feature selection

All feature selection methods offer a select(data) method that applies the feature selection criterion to the dataset, and selects a subset of features according to the setting of the feature selection object.

The feature selection methods offered by PyML are:

  • Recursive Feature Elimination (RFE) (class RFE)
  • Filter methods (class Filter)
  • Random feature selection (class Random)

To use RFE on our training_data we would do the following:

>>> rfe = featsel.RFE()
>>> rfe.select(training_data)

You can examine the features that were selected by looking at the featureID attribute of your data. You can now train your classifier in the usual way:

>>> classifier.train(training_data)

When you apply your classifier to test data, PyML automatically projects the data to the features that were used in training, so all you need to do for testing is:

>>> classifier.test(testing_data)

Cross-validation of feature selection methods is a little tricky, since feature selection needs to be performed for each fold separately, rather than on the data as a whole. Doing it on the whole dataset before classifier training introduces information about the test set into the training process. The process is done correctly using the FeatureSelect classifier template:

>>> from PyML.classifiers.composite import FeatureSelect
>>> featureSelector = FeatureSelect(svm.SVM(), featsel.RFE())

This feature selector uses an SVM as a classifier and RFE for feature selection. FeatureSelect(classifier, featureSelector) is a classifier that is trained using the classifier’s train method after applying the featureSelector’s select method to the data. Training a FeatureSelect object affects the data on which it was trained: It will now contain only the selected set of features, and looking at the dataset’s featureID attribute, you can see which features were selected. In cross-validation of a FeatureSelect object, feature selection is performed on the training data of each fold separately. Therefore the overall dataset is not affected. To determine which features were used in each fold you can do the following:

>>> m = composite.FeatureSelect(SVM(), featsel.RFE())
>>> results = m.stratifiedCV(data)
>>> print results.getLog()[0]  # the features selected in the first fold

Recursive feature elimination

When you instantiate an RFE object you see the following:

>>> rfe = featsel.RFE()
>>> rfe
<RFE instance>
mode: byFraction
Fraction to eliminate each iteration : 0.1
target number of features : 20
automatic selection of the number of features : 1

At each iteration RFE trains an SVM and removes the features with the smallest components of the weight vector w. Either a given fraction of the features is removed at each iteration (rfe.fractionToEliminate) in the 'byFraction' mode, or a given number in the 'byNumber' mode (in this case the number of features eliminated at each step is given by 'numToEliminate'). rfe.targetNumFeatures is the number of features at which to stop the elimination process. If automatic selection of the number of features is chosen (the default behavior), then the number of features is chosen as the smallest number of features that minimizes the number of support vectors (the number of features tested is bounded from below by the variable targetNumFeatures).

There is also a very simple feature selection class based on a feature scoring function (a filter method): featsel.Filter. This class’s select method applies a feature scoring function to the data to obtain a ranking of the features and keeps the numFeatures highest-scoring features.

Let’s see how this works:

>>> score = featsel.FeatureScore('golub')
>>> filter = featsel.Filter(score)
>>> filter
<Filter instance>
mode: byNum
number of features to keep : 100
<FeatureScore instance>
score name : golub
mode : oneAgainstOne

The filter has several modes of operation:

  • ‘byNum’ - the user specifies the number of features to keep. To change the number of features to keep, modify the attribute numFeatures.
  • ‘byThreshold’ - the user specifies the threshold below which a feature is discarded.
  • ‘bySignificance’ - the score for a feature is compared to the score obtained using random labels. Only features that score a number of standard deviations above the average value are retained (set the variable ‘sigma’ to control the number of standard deviations).

The filter constructed above is composed of a FeatureScore of type ‘golub’ which scores a feature by the difference in the means of that feature between classes, weighted by the standard deviation (see code for how this works for a multi-class problem).
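For example, to keep only the 50 top-scoring features (a minimal sketch; the number is illustrative):

>>> filter.numFeatures = 50
>>> filter.select(data)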

To use with a classifier define a featureSelector:

>>> featureSelector = FeatureSelect(classifier, filter)

Since featsel.Filter and featsel.RFE both have the same interface, both can be used in the same way in conjunction with the FeatureSelect classifier.

The Chain Classifier Object

When designing a classifier one often needs to do a series of operations before training a classifier – preprocessing, feature selection etc. When performing CV, such operations need to be part of the training procedure. Instead of coding a class whose train method does a series of operations, you can use the Chain object. For example, instead of using the FeatureSelect classifier:

>>> from PyML.classifiers.composite import Chain
>>> chain = Chain([featureSelector, classifier])

The constructor takes a list of classes, the last of them being a classifier. Each member of the chain needs to implement a train and test method. The command

>>> chain.train(data)

performs featureSelector.train(data) (which invokes the feature selector’s select method), followed by classifier.train(data), sequentially performing the actions in the chain. Note that by default a Chain classifier uses deepcopy copying of data, as does the FeatureSelect object.

Preprocessing and Normalization

There are many ways to normalize a kernel/dataset. We differentiate between two major types of normalization methods:

  • Normalize the features or the kernel such that k(x,x) = 1, i.e. feature vectors have unit length.
  • Normalize the features such that each feature has magnitude O(1).

We begin by considering normalization of the data such that inputs are unit vectors. If your data is explicitly represented as vectors, you can directly normalize your data vectors to unit length using the normalize method of a container: data.normalize(norm), where norm is either 1 or 2 for the L1 or L2 norm (this operation divides each vector by its L1 or L2 norm). If possible, use this method of normalizing your data.

When your data is not in explicit vector form, normalization can be performed at the level of the kernel function, i.e. in feature space. Given a kernel k(x,x’), cosine normalization computes the kernel k(x, x’) / sqrt(k(x, x) k(x’,x’)). In PyML this is achieved by attaching to your data a kernel with the cosine form of normalization associated with it. For example:

>>> data.attachKernel('polynomial', degree = 3, normalization = 'cosine')

PyML recognizes two additional forms of feature space normalization similar to the cosine kernel: Tanimoto normalization (named after the Tanimoto coefficient), and a form named after Dice’s coefficient. These can be used by setting the normalization keyword to 'tanimoto' or 'dices'. They should give very similar results to cosine normalization.

PyML also provides a method for normalizing each feature separately, namely standardization (i.e., for each feature, subtracting the mean and dividing by the standard deviation). Standardization is implemented by the Standardizer class:

>>> from PyML.preproc import preproc
>>> p = preproc.Standardizer()
>>> p.train(data)

Standardization destroys the sparsity of sparse data, so this operation is not recommended for such data; moreover, do not use it on SparseDataSet objects, since the result is not the one you would expect.

Visualizing classifier decision surface

PyML.demo.demo2d is a module for visualizing the decision surface of a classifier. SVMs and other classifiers base their classification on what’s called a decision function or discriminant function. For the SVM and ridge regression classifier, the classification is determined by the sign of this function. To get an intuition of how the decision function behaves as a function of classifier and kernel parameters you can use the demo2d module. To use it proceed as follows:

>>> from PyML.demo import demo2d
# first create a dataset by following the instructions onscreen:
>>> demo2d.getData()
# decision surface of an SVM with Gaussian kernel:
>>> demo2d.decisionSurface(svm.SVM(ker.Gaussian(gamma = 0.5)))

Practical notes on SVM usage

The SVM is not converging — what can I do?

The default kernel attached to a dataset object is the linear kernel. In some cases SMO-type SVM training algorithms do not converge when using a linear kernel, unless the data is first normalized or a non-linear kernel is used. When the SVM does not converge even for nonlinear kernels or normalized data, consider using a lower value for the SVM soft-margin constant (C): a lower value of C makes the problem easier, because outliers can be more easily ignored. When all else fails, try the ‘gradientDescent’ or ‘gist’ solvers; they are much slower, but may provide a result when SMO algorithms fail.

Results objects

PyML stores the results of a classifier’s test method or the various flavors of cross-validation in a Results object. The type of object returned depends on the type of learner: a RegressionResults object for regression and a ClassificationResults object for classification. A Results object is generated either by cross-validation

>>> r = classifier.cv(data)

or by testing a trained classifier:

>>> classifier.train(trainingData)
>>> r = classifier.test(testingData)

In the following we focus on the ClassificationResults object, as it is more developed, but the ideas apply to RegressionResults as well. Results objects are lists, where each element of the list groups the results on a chunk of data. When the object is the result of cross-validation, the number of elements in the list is equal to the number of cross-validation folds. When it is the result of using a classifier’s test function, it contains a single element. The results in fold i of a results object r are accessed as r[i].

The ClassificationResults object has a set of accessor functions that allow the user to obtain detailed information about the results of testing the classifier. For example, r.getSuccessRate() returns the average success rate over the cross-validation folds, while r.getSuccessRate(0) returns the success rate in the first fold (folds are indexed from 0 to numFolds - 1). All the accessor functions follow the same interface: getAttribute(fold = None), where getAttribute is one of the functions listed below. The ‘fold’ parameter is the cross-validation fold that you want to query. If it is not specified, the attribute is “pooled” over the different folds. For statistics that measure the success of the classifier, “pooling” means an average of the results in the different folds. When the attribute is, say, the list of predicted classes for a given fold, these lists are aggregated into a list whose length equals the number of folds, where each element contains the list of results pertinent to a particular fold.

A complete list of accessor functions follows. First is a list of accessor functions for statistics that summarize classifier performance:

  • getSuccessRate, getBalancedSuccessRate, getSensitivity, getPPV. The success rate is the fraction of correctly classified examples; the balanced success rate takes into account the size of each class, which is useful for unbalanced datasets. Sensitivity is the fraction of examples from the positive class that are correctly classified (the number of correct positive predictions over the number of examples in the positive class); the PPV is the fraction of positive predictions that are correct (the number of correct positive predictions over the number of positive predictions).
  • The area under ROC curve is accessed through the getROC accessor function, or simply as r.roc. In the case of cross-validation results this is an average over the different folds. r[0].roc returns the ROC score for the first fold. The area under the ROC50 curve is accessed through getROCn that also accepts an optional parameter that specifies a value different than 50 (getROCn(rocN, fold), and by default rocN is 50). You can also type something like r.roc10. If you want to specify the fraction of false positives rather than their number, use getROCn('1%') or getROCn(0.01).
  • The confusion matrix (getConfusionMatrix) – element i,j is the number of patterns in class j that were classified into class i (in other words, column j in the matrix describes the breakdown of how members of class j were classified). The string label of class j is accessed through getClassLabels. When no fold is specified, the confusion matrices of individual folds are summed.

Additional accessor functions are:

  • getPatternID returns a list of the pattern IDs of the classified examples.
  • getPredictedLabels returns a list of predicted (string) labels. The order matches that of the patternIDs, so that getPredictedLabels(fold)[i] is the label of getPatternID(fold)[i].
  • getGivenLabels returns the labels provided by the user.
  • getPredictedClass, getGivenClass provide class IDs rather than their string names. These are lists of numbers in the range [0:numClasses] (note that python range notation does not include the last index).
  • getDecisionFunction — the decision function values produced by the classifier.
  • getInfo — a description of the dataset and classifier used in each fold.
  • getLog returns information about the training/testing process. Information includes training and testing time; each classifier may include different information—an SVM classifier for example provides the number of support vectors.

If you have matplotlib installed on your system, the ROC curve can be displayed by r.plotROC(). If the results object was obtained by running some form of cross-validation this produces an ROC curve that is averaged over the cross-validation folds. If you want to plot an ROC50 curve use r.plotROC(rocN = 50), and r.plotROC('roc.pdf') saves the ROC curve to a pdf file. See the documentation for more details.
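Putting a few of these together (a minimal sketch, continuing from a cross-validation run):

>>> r = s.cv(data, 5)
>>> print r.getSuccessRate()     # averaged over the folds
>>> print r.getSuccessRate(0)    # success rate in the first fold
>>> print r.getConfusionMatrix()
>>> r.plotROC()                  # requires matplotlib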

In addition to the accessor functions shared with ClassificationResults, the RegressionResults object has a getRMSE accessor function.

Python note: to determine the attributes of an object o type dir(o), or if you have tab completion enabled, just type o. and use the tab to obtain the list of attributes.

Registered Attributes

In some cases one may want to associate some additional information with a dataset. You can always do something like:

data.attr = someObject

However, under copy construction, e.g. data2 = data.__class__(data, patterns = ...), that attribute does not get copied. In order for the attribute to be copied in copy construction you need to “register” it. Suppose you have an attribute called ‘auxiliaryData’ that you want to attach to a dataset; then proceed as follows:

# read a dataset:
>>> data = ...
# create the auxiliary data
>>> auxiliaryData = ...
>>> data.registerAttribute('auxiliaryData', auxiliaryData)

An alternative form is:

>>> data.auxiliaryData = auxiliaryData
>>> data.registerAttribute('auxiliaryData', auxiliaryData)

The semantics of copying the auxiliary data under copy construction is as follows: if the attribute is a list or a dataset whose length matches the length of the dataset, it is assumed that pattern i matches pattern i in the auxiliary dataset or element i of the list. Under copy construction the appropriate elements of the dataset/list are copied. Otherwise, the copied dataset will simply contain a new reference to the auxiliary data.

An example where you may want to use registered attributes is in setting the value of the SVM C parameter on a pattern-by-pattern basis. Given a list of values Clist, the syntax for using them with an SVM classifier is:

>>> data.registerAttribute('C', Clist)
>>> s = svm.SVM(optimizer = 'mysmo', Cmode = 'fromData')

Note that you need to use the ‘mysmo’ optimizer since libsvm does not support setting individual C values.

A related caveat applies to preprocessing such as standardization: if you standardize an entire dataset and then do classifier.cv(data), you are using values of the test data in training. To avoid this, you will need to incorporate the rescaling as part of the training of the classifier using the Chain object. This is done as:

>>> chain = Chain([Standardizer(), SVM()])

which instantiates an SVM classifier that normalizes its input before training and testing, where the test data is normalized according to the values computed for the training data.