TrainRegression - Train a regression model

Train a classifier from multiple images to perform regression.

Detailed description

This application trains a classifier from multiple input images or a CSV file in order to perform regression. Predictors are composed of the pixel values in each band, optionally centered and reduced using an XML statistics file produced by the ComputeImagesStatistics application. The output value for each predictor is assumed to be the last band (or the last column for CSV files). Training and validation predictor lists are built so that their sizes stay within the maximum bounds given by the user, and so that their proportion corresponds to the training and validation sample ratio (sample.vtr). Several classifier parameters can be set depending on the chosen classifier. In the validation process, the mean square error is computed between the ground truth and the values predicted by the estimated model. This application is based on LibSVM and on OpenCV Machine Learning classifiers, and is compatible with OpenCV 2.3.1 and later.
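
As an illustration of the two computations mentioned above, the following is a minimal sketch of centering and reducing a predictor and of computing a mean square error. It is not the application's code: the function names and the sample values are hypothetical, and the per-band mean and standard deviation are assumed to have been read from the XML statistics file beforehand.

```python
def standardize(sample, means, stddevs):
    """Center and reduce one predictor vector, band by band."""
    return [(v - m) / s for v, m, s in zip(sample, means, stddevs)]


def mean_square_error(truth, predicted):
    """Average of the squared differences between reference and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)


# Hypothetical two-band pixel and per-band statistics (illustrative values only)
pixel = [120.0, 80.0]
means = [100.0, 60.0]
stddevs = [10.0, 5.0]

print(standardize(pixel, means, stddevs))  # [2.0, 4.0]
print(mean_square_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```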

Parameters

This section describes in detail the parameters available for this application. Table [1] presents a summary of these parameters and the parameter keys to be used in command-line and programming languages. The application key is TrainRegression.

[1]	Table: Parameters table for Train a regression model.

Parameter Key	Parameter Name	Parameter Type
io	Input and output data	Group
io.il	Input Image List	Input image list
io.csv	Input CSV file	Input File name
io.imstat	Input XML image statistics file	Input File name
io.out	Output regression model	Output File name
io.mse	Mean Square Error	Float
sample	Training and validation samples parameters	Group
sample.mt	Maximum training predictors	Int
sample.mv	Maximum validation predictors	Int
sample.vtr	Training and validation sample ratio	Float
classifier	Classifier to use for the training	Choices
classifier libsvm	LibSVM classifier	Choice
classifier dt	Decision Tree classifier	Choice
classifier gbt	Gradient Boosted Tree classifier	Choice
classifier ann	Artificial Neural Network classifier	Choice
classifier rf	Random forests classifier	Choice
classifier knn	KNN classifier	Choice
classifier sharkrf	Shark Random forests classifier	Choice
classifier sharkkm	Shark kmeans classifier	Choice
classifier.libsvm.k	SVM Kernel Type	Choices
classifier.libsvm.k linear	Linear	Choice
classifier.libsvm.k rbf	Gaussian radial basis function	Choice
classifier.libsvm.k poly	Polynomial	Choice
classifier.libsvm.k sigmoid	Sigmoid	Choice
classifier.libsvm.m	SVM Model Type	Choices
classifier.libsvm.m epssvr	Epsilon Support Vector Regression	Choice
classifier.libsvm.m nusvr	Nu Support Vector Regression	Choice
classifier.libsvm.c	Cost parameter C	Float
classifier.libsvm.nu	Cost parameter Nu	Float
classifier.libsvm.opt	Parameters optimization	Boolean
classifier.libsvm.prob	Probability estimation	Boolean
classifier.libsvm.eps	Epsilon	Float
classifier.dt.max	Maximum depth of the tree	Int
classifier.dt.min	Minimum number of samples in each node	Int
classifier.dt.ra	Termination criteria for regression tree	Float
classifier.dt.cat	Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split	Int
classifier.dt.f	K-fold cross-validations	Int
classifier.dt.r	Set Use1seRule flag to false	Boolean
classifier.dt.t	Set TruncatePrunedTree flag to false	Boolean
classifier.gbt.t	Loss Function Type	Choices
classifier.gbt.t sqr	Squared Loss	Choice
classifier.gbt.t abs	Absolute Loss	Choice
classifier.gbt.t hub	Huber Loss	Choice
classifier.gbt.w	Number of boosting algorithm iterations	Int
classifier.gbt.s	Regularization parameter	Float
classifier.gbt.p	Portion of the whole training set used for each algorithm iteration	Float
classifier.gbt.max	Maximum depth of the tree	Int
classifier.ann.t	Train Method Type	Choices
classifier.ann.t back	Back-propagation algorithm	Choice
classifier.ann.t reg	Resilient Back-propagation algorithm	Choice
classifier.ann.sizes	Number of neurons in each intermediate layer	String list
classifier.ann.f	Neuron activation function type	Choices
classifier.ann.f ident	Identity function	Choice
classifier.ann.f sig	Symmetrical Sigmoid function	Choice
classifier.ann.f gau	Gaussian function (Not completely supported)	Choice
classifier.ann.a	Alpha parameter of the activation function	Float
classifier.ann.b	Beta parameter of the activation function	Float
classifier.ann.bpdw	Strength of the weight gradient term in the BACKPROP method	Float
classifier.ann.bpms	Strength of the momentum term (the difference between weights on the 2 previous iterations)	Float
classifier.ann.rdw	Initial value Delta_0 of update-values Delta_{ij} in RPROP method	Float
classifier.ann.rdwm	Update-values lower limit Delta_{min} in RPROP method	Float
classifier.ann.term	Termination criteria	Choices
classifier.ann.term iter	Maximum number of iterations	Choice
classifier.ann.term eps	Epsilon	Choice
classifier.ann.term all	Max. iterations + Epsilon	Choice
classifier.ann.eps	Epsilon value used in the Termination criteria	Float
classifier.ann.iter	Maximum number of iterations used in the Termination criteria	Int
classifier.rf.max	Maximum depth of the tree	Int
classifier.rf.min	Minimum number of samples in each node	Int
classifier.rf.ra	Termination Criteria for regression tree	Float
classifier.rf.cat	Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split	Int
classifier.rf.var	Size of the randomly selected subset of features at each tree node	Int
classifier.rf.nbtrees	Maximum number of trees in the forest	Int
classifier.rf.acc	Sufficient accuracy (OOB error)	Float
classifier.knn.k	Number of Neighbors	Int
classifier.knn.rule	Decision rule	Choices
classifier.knn.rule mean	Mean of neighbors values	Choice
classifier.knn.rule median	Median of neighbors values	Choice
classifier.sharkrf.nbtrees	Maximum number of trees in the forest	Int
classifier.sharkrf.nodesize	Min size of the node for a split	Int
classifier.sharkrf.mtry	Number of features tested at each node	Int
classifier.sharkrf.oobr	Out of bound ratio	Float
classifier.sharkkm.maxiter	Maximum number of iteration for the kmeans algorithm.	Int
classifier.sharkkm.k	The number of class used for the kmeans algorithm.	Int
rand	set user defined seed	Int
inxml	Load otb application from xml file	XML input parameters file
outxml	Save otb application to xml file	XML output parameters file

[Input and output data]: This group of parameters allows setting input and output data.

Input Image List: A list of input images. The first (n-1) bands should contain the predictors. The last band should contain the output value to predict.
Input CSV file: Input CSV file containing the predictors, and the output values in last column. Only used when no input image is given.
Input XML image statistics file: Input XML file containing the mean and the standard deviation of the input images.
Output regression model: Output file containing the estimated model (.txt format).
Mean Square Error: Mean square error computed with the validation predictors.
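
The CSV layout described above (predictors first, output value in the last column) can be sketched as follows. This is only an illustration of the column convention: the function name is hypothetical, and the comma delimiter and the absence of a header row are assumptions, not documented requirements of the application.

```python
import csv
import io


def split_rows(csv_text):
    """Split each CSV row into its predictors and its output value (last column)."""
    predictors, targets = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip empty lines
        values = [float(field) for field in row]
        predictors.append(values[:-1])
        targets.append(values[-1])
    return predictors, targets


# Two samples with two predictors each; the third column is the value to predict
text = "0.1,0.2,1.5\n0.3,0.4,2.5\n"
print(split_rows(text))
```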

[Training and validation samples parameters]: This group of parameters allows you to set training and validation sample lists parameters.

Maximum training predictors: Maximum number of training predictors (default = 1000) (no limit = -1).
Maximum validation predictors: Maximum number of validation predictors (default = 1000) (no limit = -1).
Training and validation sample ratio: Ratio between training and validation samples (0.0 = all training, 1.0 = all validation) (default = 0.5).
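
One plausible reading of how these three parameters interact is sketched below. This is an assumption for illustration, not the application's actual sampling code: the function name is hypothetical, and the sketch simply applies the ratio first and the maximum bounds afterwards, so the final proportion may differ from the ratio once a bound is hit.

```python
def split_counts(n_samples, max_train=1000, max_valid=1000, valid_ratio=0.5):
    """Return (training count, validation count) for n_samples available samples.

    valid_ratio follows the documented convention: 0.0 = all training,
    1.0 = all validation. A bound of -1 means "no limit".
    """
    n_valid = int(round(n_samples * valid_ratio))
    n_train = n_samples - n_valid
    if max_train >= 0:
        n_train = min(n_train, max_train)
    if max_valid >= 0:
        n_valid = min(n_valid, max_valid)
    return n_train, n_valid


print(split_counts(5000))                 # (1000, 1000)
print(split_counts(100, valid_ratio=0.25))  # (75, 25)
```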

Classifier to use for the training: Choice of the classifier to use for the training. Available choices are:

LibSVM classifier: This group of parameters allows setting SVM classifier parameters.

SVM Kernel Type: SVM Kernel Type. Available choices are:

Linear: Linear Kernel, no mapping is done, this is the fastest option.

Gaussian radial basis function: This kernel is a good choice in most cases. It is an exponential function of the Euclidean distance between the vectors.

Polynomial: Polynomial Kernel, the mapping is a polynomial function.

Sigmoid: The kernel is a hyperbolic tangent function of the vectors.
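
The four kernel types can be sketched with their usual textbook definitions, as below. The parameters gamma, coef0 and degree are not exposed in the parameter table of this application; they are included here as assumptions with illustrative default values, using the names commonly found in LibSVM.

```python
import math


def dot(u, v):
    """Dot product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # No mapping: plain dot product
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    # Exponential of the negative squared Euclidean distance
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # Polynomial function of the dot product
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # Hyperbolic tangent of the scaled dot product
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))  # 11.0
print(rbf_kernel(u, u))     # 1.0
```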

SVM Model Type: Type of SVM formulation. Available choices are:

Epsilon Support Vector Regression: The distance between feature vectors from the training set and the fitting hyper-plane must be less than Epsilon. For outliers, the penalty multiplier C is used.

Nu Support Vector Regression: Same as the epsilon regression except that this time the bounded parameter nu is used instead of epsilon.

Cost parameter C: SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.

Cost parameter Nu: Cost parameter Nu, in the range 0..1. The larger the value, the smoother the decision.

Parameters optimization: SVM parameters optimization flag.

Probability estimation: Probability estimation flag.

Epsilon: The distance between feature vectors from the training set and the fitting hyper-plane must be less than Epsilon. For outliers, the penalty multiplier is set by C.
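
The roles of Epsilon and C can be illustrated with the usual epsilon-insensitive loss: residuals smaller than Epsilon cost nothing, and larger ones are penalized in proportion to C. The sketch below uses hypothetical function names and values and only shows the penalty term, not the full LibSVM optimization.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear outside of it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_penalty(truths, preds, cost=1.0, epsilon=0.1):
    """Sum of the per-sample losses, scaled by the cost parameter C."""
    return cost * sum(
        epsilon_insensitive_loss(t, p, epsilon) for t, p in zip(truths, preds)
    )


print(epsilon_insensitive_loss(1.0, 1.05))  # 0.0 (inside the tube)
print(svr_penalty([1.0, 2.0], [1.05, 3.0], cost=2.0))
```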

Decision Tree classifier: This group of parameters allows setting Decision Tree classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/decision_trees.html.

Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.

Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then this node will not be split.

Termination criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split further.

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

K-fold cross-validations: If cv_folds > 1, the tree is pruned using K-fold cross-validation, where K is equal to cv_folds.

Set Use1seRule flag to false: If true, pruning will be harsher. This makes the tree more compact and more resistant to noise in the training data, but a bit less accurate.

Set TruncatePrunedTree flag to false: If true, then pruned branches are physically removed from the tree.

Gradient Boosted Tree classifier: This group of parameters allows setting Gradient Boosted Tree classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/gradient_boosted_trees.html.

Loss Function Type: Type of loss function used for training. Available choices are:

Squared Loss

Absolute Loss

Huber Loss
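
For reference, the three loss functions can be sketched in their common textbook form, as functions of the residual r = truth - prediction. The constant factors and the Huber threshold (called delta here, a hypothetical name) may differ from what OpenCV uses internally.

```python
def squared_loss(r):
    """Penalizes large residuals heavily."""
    return r * r


def absolute_loss(r):
    """Grows linearly, hence less sensitive to outliers."""
    return abs(r)


def huber_loss(r, delta=1.0):
    """Quadratic for small residuals, linear beyond the threshold delta."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)


for residual in (0.5, 3.0):
    print(residual, squared_loss(residual), absolute_loss(residual), huber_loss(residual))
```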

Number of boosting algorithm iterations: Number “w” of boosting algorithm iterations, with w*K being the total number of trees in the GBT model, where K is the output number of classes.

Regularization parameter: Regularization parameter.

Portion of the whole training set used for each algorithm iteration: Portion of the whole training set used for each algorithm iteration. The subset is generated randomly.

Maximum depth of the tree: The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.

Artificial Neural Network classifier: This group of parameters allows setting Artificial Neural Network classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/neural_networks.html.

Train Method Type: Type of training method for the multilayer perceptron (MLP) neural network. Available choices are:

Back-propagation algorithm: Method to compute the gradient of the loss function and adjust weights in the network to optimize the result.

Resilient Back-propagation algorithm: Almost the same as the Back-prop algorithm except that it does not take into account the magnitude of the partial derivative (coordinate of the gradient) but only its sign.

Number of neurons in each intermediate layer: The number of neurons in each intermediate layer (excluding input and output layers).

Neuron activation function type: This function determines whether the output of the node is positive or not, depending on the output of the transfer function. Available choices are:

Identity function

Symmetrical Sigmoid function

Gaussian function (Not completely supported)

Alpha parameter of the activation function: Alpha parameter of the activation function (used only with sigmoid and gaussian functions).

Beta parameter of the activation function: Beta parameter of the activation function (used only with sigmoid and gaussian functions).
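
The three activation functions and the role of Alpha and Beta can be sketched as follows. The formulas follow the definitions given in the OpenCV neural network documentation as far as we know; treat them as an illustration rather than as the exact implementation, and note that the default values below are arbitrary.

```python
import math


def identity(x):
    """Identity function: the output equals the input."""
    return x


def symmetrical_sigmoid(x, alpha=1.0, beta=1.0):
    """beta * (1 - exp(-alpha * x)) / (1 + exp(-alpha * x)), bounded by +/- beta."""
    e = math.exp(-alpha * x)
    return beta * (1.0 - e) / (1.0 + e)


def gaussian(x, alpha=1.0, beta=1.0):
    """beta * exp(-alpha * x^2), with its peak value beta at x = 0."""
    return beta * math.exp(-alpha * x * x)


print(identity(2.5))             # 2.5
print(symmetrical_sigmoid(0.0))  # 0.0
print(gaussian(0.0, beta=2.0))   # 2.0
```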

Strength of the weight gradient term in the BACKPROP method: Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.

Strength of the momentum term (the difference between weights on the 2 previous iterations): Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.

Initial value Delta_0 of update-values Delta_{ij} in RPROP method: Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).

Update-values lower limit Delta_{min} in RPROP method: Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
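
The difference between the two training methods, and the role of the parameters above, can be sketched for a single weight. This is a simplified illustration, not OpenCV's implementation: the function names are hypothetical, and the RPROP increase factor, decrease factor and upper limit (dw_plus, dw_minus, dw_max) are assumed values that are not exposed by this application.

```python
def backprop_step(weight, grad, prev_delta, dw_scale=0.1, moment_scale=0.1):
    """Gradient step plus a momentum term based on the previous update."""
    delta = -dw_scale * grad + moment_scale * prev_delta
    return weight + delta, delta


def rprop_step(weight, grad, prev_grad, step,
               dw_plus=1.2, dw_minus=0.5, dw_min=1e-7, dw_max=50.0):
    """Use only the sign of the gradient; adapt the step size per weight."""
    if grad * prev_grad > 0:
        step = min(step * dw_plus, dw_max)   # same direction: accelerate
    elif grad * prev_grad < 0:
        step = max(step * dw_minus, dw_min)  # sign change: slow down
    sign = (grad > 0) - (grad < 0)
    return weight - sign * step, step


print(backprop_step(1.0, 2.0, 0.0))
print(rprop_step(1.0, 0.5, 0.5, 0.1))
```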

Termination criteria: Termination criteria. Available choices are:

Maximum number of iterations: Set the number of iterations allowed to the network for its training. Training will stop regardless of the result when this number is reached.

Epsilon: Training will focus on result and will stop once the precision is at most epsilon.

Max. iterations + Epsilon: Both termination criteria are used. Training stops as soon as either one is reached.

Epsilon value used in the Termination criteria: Epsilon value used in the Termination criteria.

Maximum number of iterations used in the Termination criteria: Maximum number of iterations used in the Termination criteria.
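
The three termination choices (iter, eps, all) can be sketched as a simple stopping test. The function name and default values are hypothetical, and the quantity compared against epsilon (here, the change in error between two iterations) is an assumption made for the illustration.

```python
def should_stop(iteration, error_change, term="all", max_iter=1000, epsilon=0.01):
    """Return True when training should stop under the chosen criterion."""
    reached_iter = iteration >= max_iter
    reached_eps = error_change <= epsilon
    if term == "iter":
        return reached_iter
    if term == "eps":
        return reached_eps
    if term == "all":
        # Stop at whichever criterion is met first
        return reached_iter or reached_eps
    raise ValueError("unknown termination criterion: " + term)


print(should_stop(10, 0.001, term="iter"))  # False
print(should_stop(10, 0.001, term="all"))   # True
```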

Random forests classifier: This group of parameters allows setting Random Forests classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/random_trees.html.

Maximum depth of the tree: The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.

Minimum number of samples in each node: If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

Termination Criteria for regression tree: If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.

Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.

Size of the randomly selected subset of features at each tree node: The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.

Maximum number of trees in the forest: The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.

Sufficient accuracy (OOB error): Sufficient accuracy (OOB error).

KNN classifier: This group of parameters allows setting KNN classifier parameters. See the complete documentation at http://docs.opencv.org/modules/ml/doc/k_nearest_neighbors.html.

Number of Neighbors: The number of neighbors to use.

Decision rule: Decision rule for regression output. Available choices are:

Mean of neighbors values: Returns the mean of neighbors values.

Median of neighbors values: Returns the median of neighbors values.
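
The two decision rules can be sketched with a brute-force nearest-neighbor search. This is a minimal illustration with hypothetical names and data, not the OpenCV implementation; the squared Euclidean distance is assumed as the neighbor metric.

```python
from statistics import mean, median


def knn_regress(query, samples, k=3, rule="mean"):
    """Predict a value from the k nearest samples.

    samples is a list of (predictor_vector, value) pairs.
    """
    by_distance = sorted(
        samples,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)),
    )
    values = [value for _, value in by_distance[:k]]
    if rule == "mean":
        return mean(values)
    if rule == "median":
        return median(values)
    raise ValueError("unknown decision rule: " + rule)


samples = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 9.0), ([10.0], 100.0)]
print(knn_regress([0.5], samples, k=3, rule="mean"))    # 4.0
print(knn_regress([0.5], samples, k=3, rule="median"))  # 2.0
```

In this example the median is less affected than the mean by the third neighbor, whose value (9.0) is far from the other two.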

Shark Random forests classifier: This group of parameters allows setting Shark Random Forests classifier parameters. See the complete documentation at http://image.diku.dk/shark/doxygen_pages/html/classshark_1_1_r_f_trainer.html. Note that training is parallel.

Maximum number of trees in the forest: The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees also increases the prediction time linearly.

Min size of the node for a split: If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.

Number of features tested at each node: The number of features (variables) which will be tested at each node in order to compute the split. If set to zero, the square root of the number of features is used.

Out of bound ratio: Set the fraction of the original training dataset to use as the out of bag sample. A good default value is 0.66.

Shark kmeans classifier: This group of parameters allows setting Shark kMeans classifier parameters. See the complete documentation at http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/kmeans.html.

Maximum number of iteration for the kmeans algorithm.: The maximum number of iterations for the kmeans algorithm (0 = unlimited).

The number of class used for the kmeans algorithm.: The number of classes used for the kmeans algorithm. The default is 2 classes.

set user defined seed: Set a specific seed with an integer value.

Load otb application from xml file: Load otb application from xml file.

Save otb application to xml file: Save otb application to xml file.

Example

To run this example in command-line, use the following:

otbcli_TrainRegression -io.il training_dataset.tif -io.out regression_model.txt -io.imstat training_statistics.xml -classifier libsvm

To run this example from Python, use the following code snippet:

#!/usr/bin/python

# Import the otb applications package
import otbApplication

# The following line creates an instance of the TrainRegression application
TrainRegression = otbApplication.Registry.CreateApplication("TrainRegression")

# The following lines set all the application parameters:
TrainRegression.SetParameterStringList("io.il", ['training_dataset.tif'])

TrainRegression.SetParameterString("io.out", "regression_model.txt")

TrainRegression.SetParameterString("io.imstat", "training_statistics.xml")

TrainRegression.SetParameterString("classifier","libsvm")

# The following line executes the application
TrainRegression.ExecuteAndWriteOutput()
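
As a complement, the following sketch assembles a command line for another classifier from a dictionary of parameter keys taken from the table above. The helper name is hypothetical, the values chosen for classifier.rf.nbtrees and classifier.rf.max are arbitrary, and the script only prints the command without running it.

```python
import shlex


def build_train_regression_command(params):
    """Turn a {parameter key: value} mapping into an otbcli argument list."""
    args = ["otbcli_TrainRegression"]
    for key, value in params.items():
        args.append("-" + key)
        if isinstance(value, (list, tuple)):
            # List parameters (e.g. io.il) take several values after the key
            args.extend(str(v) for v in value)
        else:
            args.append(str(value))
    return args


params = {
    "io.il": ["training_dataset.tif"],
    "io.imstat": "training_statistics.xml",
    "io.out": "regression_model.txt",
    "classifier": "rf",
    "classifier.rf.nbtrees": 50,  # illustrative value
    "classifier.rf.max": 10,      # illustrative value
}

print(shlex.join(build_train_regression_command(params)))
```

The resulting list can be passed as is to subprocess.run on a system where the OTB command-line tools are installed.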

Limitations

None

Authors

This application has been written by OTB-Team.