TrainImagesRegression¶
Train a regression model from multiple triplets of predictor images, label images and training vector data.
Description¶
Train a regression model from multiple triplets of predictor images, label images and training vector data.
The training vector data must contain polygons corresponding to the input sampling positions. This data is used to extract samples using pixel values in each band of the predictor image and the corresponding ground truth extracted from the label image. If no training vector data is provided, the samples will be extracted on the full image extent.
At the end of the application, the mean square error between ground truth and predicted values is computed using the output model and the validation vector data. Note that if no validation data is given, the training data will be used for validation.
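As an illustration of the reported metric, here is a minimal sketch of a mean square error computation in plain Python. The function name and the sample values are hypothetical; the application computes this value internally from the validation samples.

```python
def mean_square_error(ground_truth, predicted):
    """Return the mean of the squared differences between two equal-length sequences."""
    if len(ground_truth) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not ground_truth:
        raise ValueError("at least one sample is required")
    squared = [(g - p) ** 2 for g, p in zip(ground_truth, predicted)]
    return sum(squared) / len(squared)


# Hypothetical validation samples: label-image values and model predictions.
truth = [1.0, 2.0, 3.0]
prediction = [1.0, 2.5, 2.0]
print(mean_square_error(truth, prediction))
```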
The number of training and validation samples can be specified with parameters. If no size is given, all samples will be used.
This application is based on LibSVM, OpenCV Machine Learning, and Shark ML. The output of this application is a text model file, whose format corresponds to the ML model type chosen. There is no image nor vector data output.
This application has several output images and supports “multi-writing”. Instead of computing and writing each image independently, the streamed image blocks are written synchronously for each output. The output images will be computed strip by strip, using the available RAM to compute the strip size, and a user-defined streaming mode can be specified using the streaming extended filenames (type, mode and value). Multi-writing can be disabled using the multi-write extended filename option &multiwrite=false. In this case the output images will be written one by one. Note that multi-writing is not supported for MPI writers.
Parameters¶
Input and output data¶
This group of parameters allows setting input and output data.
Input predictor Image List -io.il image1 image2...
Mandatory
A list of input predictor images.
Input label Image List -io.ip image1 image2...
Mandatory
A list of input label images.
Input Vector Data List -io.vd vectorfile1 vectorfile2...
A list of vector data to select the training samples.
Validation Vector Data List -io.valid vectorfile1 vectorfile2...
A list of vector data to select the validation samples.
Input XML image statistics file -io.imstat filename [dtype]
XML file containing mean and variance of each feature.
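A common use of per-feature statistics is to center and scale each feature before training. The sketch below assumes that usage (the text above only states that the file holds a mean and a variance per feature); the function name and values are hypothetical.

```python
import math


def standardize(sample, means, variances):
    """Center each feature on its mean and divide by its standard deviation.

    Features with zero variance are mapped to 0.0 to avoid a division by zero
    (a choice made for this sketch).
    """
    return [
        (x - m) / math.sqrt(v) if v > 0 else 0.0
        for x, m, v in zip(sample, means, variances)
    ]


# Hypothetical two-band pixel with per-band mean and variance.
print(standardize([3.0, 10.0], means=[1.0, 10.0], variances=[4.0, 25.0]))
```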
Output model -io.out filename [dtype]
Mandatory
Output file containing the model estimated (.txt format).
Mean Square Error -io.mse float
Mean square error computed using the validation dataset.
Sampling parameters¶
This group of parameters allows setting sampling parameters.
Number of training samples -sample.nt int
Number of training samples.
Number of validation samples -sample.nv int
Number of validation samples.
Training and validation sample ratio -sample.ratio float
Default value: 0.5
Ratio between training and validation samples.
Sampler type -sample.type [periodic|random]
Default value: periodic
Type of sampling (periodic or random).
- Periodic sampler
Takes samples regularly spaced.
- Random sampler
The positions to select are randomly shuffled.
Periodic sampler options¶
Jitter amplitude -sample.type.periodic.jitter int
Default value: 0
Jitter amplitude added during sample selection (0 = no jitter)
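The two sampling strategies can be sketched on a flat list of candidate positions. This is an illustrative model only, not the application's implementation: the function names, the index-based positions and the clamping of jittered indices are assumptions.

```python
import random


def periodic_positions(total, wanted, jitter=0, seed=None):
    """Pick up to `wanted` indices out of `total`, regularly spaced.

    `jitter` shifts each index by a random offset in [-jitter, jitter],
    clamped to the valid range; 0 means no jitter.
    """
    if wanted <= 0 or total <= 0:
        return []
    rng = random.Random(seed)
    step = max(total // wanted, 1)
    positions = []
    for base in range(0, total, step):
        if len(positions) == wanted:
            break
        offset = rng.randint(-jitter, jitter) if jitter > 0 else 0
        positions.append(min(max(base + offset, 0), total - 1))
    return positions


def random_positions(total, wanted, seed=None):
    """Shuffle all indices and keep the first `wanted` of them."""
    rng = random.Random(seed)
    indices = list(range(total))
    rng.shuffle(indices)
    return sorted(indices[:wanted])


print(periodic_positions(10, 5))
print(random_positions(10, 3, seed=1))
```

Passing a fixed seed makes both samplers reproducible, which mirrors the purpose of the random seed parameter.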
Random seed -rand int
Set a specific random seed with integer value.
Available RAM (MB) -ram int
Default value: 256
Available memory for processing (in MB).
Elevation management¶
This group of parameters allows managing elevation values. Supported formats are SRTM, DTED or any geotiff. DownloadSRTMTiles application could be a useful tool to list/download tiles related to a product.
DEM directory -elev.dem directory
This parameter allows selecting a directory containing Digital Elevation Model files. Note that this directory should contain only DEM files. Unexpected behaviour might occur if other images are found in this directory.
Geoid File -elev.geoid filename [dtype]
Use a geoid grid to get the height above the ellipsoid in case there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles. A version of the geoid can be found on the OTB website (https://gitlab.orfeo-toolbox.org/orfeotoolbox/otb-data/blob/master/Input/DEM/egm96.grd).
Default elevation -elev.default float
Default value: 0
This parameter allows setting the default height above ellipsoid when there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles, and no geoid file has been set. This is also used by some applications as an average elevation value.
Classifier to use for the training -classifier [libsvm|dt|ann|rf|knn|sharkrf]
Default value: libsvm
Choice of the classifier to use for the training.
- LibSVM classifier
This group of parameters allows setting SVM classifier parameters.
- Decision Tree classifier
http://docs.opencv.org/modules/ml/doc/decision_trees.html
- Artificial Neural Network classifier
http://docs.opencv.org/modules/ml/doc/neural_networks.html
- Random forests classifier
http://docs.opencv.org/modules/ml/doc/random_trees.html
- KNN classifier
http://docs.opencv.org/modules/ml/doc/k_nearest_neighbors.html
- Shark Random forests classifier
http://image.diku.dk/shark/doxygen_pages/html/classshark_1_1_r_f_trainer.html.
It is noteworthy that training is parallel.
LibSVM classifier options¶
SVM Kernel Type -classifier.libsvm.k [linear|rbf|poly|sigmoid]
Default value: linear
SVM Kernel Type.
- Linear
Linear Kernel, no mapping is done, this is the fastest option.
- Gaussian radial basis function
This kernel is a good choice in most cases. It is an exponential function of the Euclidean distance between the vectors.
- Polynomial
Polynomial Kernel, the mapping is a polynomial function.
- Sigmoid
The kernel is a hyperbolic tangent function of the vectors.
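The four kernels listed above can be written out following the usual LibSVM definitions. This is a sketch for illustration: `gamma`, `coef0` and `degree` are LibSVM kernel parameters that this application does not expose in the options documented here, so the default values used below are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # No mapping: plain inner product.
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    # Exponential of the negative squared Euclidean distance.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # Polynomial function of the inner product.
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # Hyperbolic tangent of the scaled inner product.
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), rbf_kernel(u, v), poly_kernel(u, v), sigmoid_kernel(u, v))
```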
SVM Model Type -classifier.libsvm.m [epssvr|nusvr]
Default value: epssvr
Type of SVM formulation.
- Epsilon Support Vector Regression
The distance between feature vectors from the training set and the fitting hyper-plane must be less than Epsilon. For outliers the penalty multiplier C is used.
- Nu Support Vector Regression
Same as the epsilon regression except that this time the bounded parameter nu is used instead of epsilon.
Cost parameter C -classifier.libsvm.c float
Default value: 1
SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.
Cost parameter Nu -classifier.libsvm.nu float
Default value: 0.5
Cost parameter Nu, in the range 0..1. The larger the value, the smoother the decision.
Parameters optimization -classifier.libsvm.opt bool
Default value: false
SVM parameters optimization flag.
Probability estimation -classifier.libsvm.prob bool
Default value: false
Probability estimation flag.
Epsilon -classifier.libsvm.eps float
Default value: 0.001
The distance between feature vectors from the training set and the fitting hyper-plane must be less than Epsilon. For outliers the penalty multiplier is set by C.
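The interplay between Epsilon and C can be illustrated with the standard epsilon-insensitive loss used in support vector regression: deviations below Epsilon cost nothing, and larger deviations are penalized in proportion to C. The function names below are hypothetical, and the sketch only shows the penalty term, not the full SVR objective.

```python
def epsilon_insensitive_loss(target, prediction, eps=0.001):
    """Zero inside the epsilon tube, linear outside of it."""
    return max(0.0, abs(target - prediction) - eps)


def svr_penalty(targets, predictions, c=1.0, eps=0.001):
    """Sum of the per-sample losses, scaled by the cost parameter C."""
    return c * sum(
        epsilon_insensitive_loss(t, p, eps) for t, p in zip(targets, predictions)
    )


# The first sample lies inside the tube, the second one is an outlier.
print(svr_penalty([1.0, 2.0], [1.0, 3.0], c=2.0, eps=0.5))
```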
Decision Tree classifier options¶
Maximum depth of the tree -classifier.dt.max int
Default value: 10
The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
Minimum number of samples in each node -classifier.dt.min int
Default value: 10
If the number of samples in a node is smaller than this parameter, then this node will not be split.
Termination criteria for regression tree -classifier.dt.ra float
Default value: 0.01
If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split further.
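The three stopping rules described above (maximum depth, minimum sample count, regression accuracy) can be combined into one check. This is a sketch under assumptions: the function name is hypothetical and the node estimate is taken to be the mean of the sample values, which the text above does not specify.

```python
def should_split(values, depth, max_depth=10, min_samples=10, reg_accuracy=0.01):
    """Return True if a regression-tree node may be split further."""
    if depth >= max_depth:
        return False  # the tree is already as deep as allowed
    if not values or len(values) < min_samples:
        return False  # too few samples in the node
    estimate = sum(values) / len(values)  # assumed node estimate: the mean
    if all(abs(v - estimate) < reg_accuracy for v in values):
        return False  # every sample is already within the accuracy bound
    return True


node_values = [0.0, 1.0] * 6  # 12 hypothetical training values in one node
print(should_split(node_values, depth=3))
```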
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split -classifier.dt.cat int
Default value: 10
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
Set Use1seRule flag to false -classifier.dt.r bool
Default value: false
If true, then pruning will be harsher. This will make the tree more compact and more resistant to noise in the training data, but a bit less accurate.
Set TruncatePrunedTree flag to false -classifier.dt.t bool
Default value: false
If true, then pruned branches are physically removed from the tree.
Artificial Neural Network classifier options¶
Train Method Type -classifier.ann.t [back|reg]
Default value: reg
Type of training method for the multilayer perceptron (MLP) neural network.
- Back-propagation algorithm
Method to compute the gradient of the loss function and adjust weights in the network to optimize the result.
- Resilient Back-propagation algorithm
Almost the same as the Back-prop algorithm except that it does not take into account the magnitude of the partial derivative (coordinate of the gradient) but only its sign.
Number of neurons in each intermediate layer -classifier.ann.sizes string1 string2...
Mandatory
The number of neurons in each intermediate layer (excluding input and output layers).
Neuron activation function type -classifier.ann.f [ident|sig|gau]
Default value: sig
This function determines whether the output of the node is positive or not depending on the output of the transfer function.
- Identity function
- Symmetrical Sigmoid function
- Gaussian function (Not completely supported)
Alpha parameter of the activation function -classifier.ann.a float
Default value: 1
Alpha parameter of the activation function (used only with sigmoid and gaussian functions).
Beta parameter of the activation function -classifier.ann.b float
Default value: 1
Beta parameter of the activation function (used only with sigmoid and gaussian functions).
Strength of the weight gradient term in the BACKPROP method -classifier.ann.bpdw float
Default value: 0.1
Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.
Strength of the momentum term (the difference between weights on the 2 previous iterations) -classifier.ann.bpms float
Default value: 0.1
Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.
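To show how the gradient and momentum strengths interact, here is a minimal sketch of a gradient step with momentum on a single weight. It is not OpenCV's implementation: the function name and the toy objective are hypothetical, and only the roles of the two strengths follow the descriptions above.

```python
def backprop_step(weight, gradient, previous_delta, dw_scale=0.1, moment_scale=0.1):
    """One weight update: gradient term plus a fraction of the previous update."""
    delta = -dw_scale * gradient + moment_scale * previous_delta
    return weight + delta, delta


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
weight, delta = 0.0, 0.0
for _ in range(200):
    weight, delta = backprop_step(weight, 2.0 * (weight - 3.0), delta)
print(weight)
```

With `moment_scale` set to 0 the update reduces to plain gradient descent; a positive value reuses part of the previous step, which smooths fluctuations.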
Initial value Delta_0 of update-values Delta_{ij} in RPROP method -classifier.ann.rdw float
Default value: 0.1
Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).
Update-values lower limit Delta_{min} in RPROP method -classifier.ann.rdwm float
Default value: 1e-07
Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
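The role of the initial update-value and of its lower limit can be illustrated with a simplified RPROP step on a single weight. This is a sketch, not OpenCV's code: the increase factor 1.2, decrease factor 0.5 and upper limit 50 are commonly used RPROP constants assumed here, and the variant shown omits weight backtracking.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weight, gradient, prev_gradient, step,
               step_min=1e-7, step_max=50.0, increase=1.2, decrease=0.5):
    """Adapt the step from the gradient sign only, then move against the gradient."""
    if gradient * prev_gradient > 0:
        step = min(step * increase, step_max)  # same direction: speed up
    elif gradient * prev_gradient < 0:
        step = max(step * decrease, step_min)  # overshoot: slow down
    return weight - sign(gradient) * step, step


# Toy objective f(w) = (w - 3)^2, starting from the initial update-value 0.1.
weight, step, prev_gradient = 0.0, 0.1, 0.0
for _ in range(200):
    gradient = 2.0 * (weight - 3.0)
    weight, step = rprop_step(weight, gradient, prev_gradient, step)
    prev_gradient = gradient
print(weight)
```

Only the sign of the gradient enters the weight update, which is the difference from plain back-propagation noted in the training method description.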
Termination criteria -classifier.ann.term [iter|eps|all]
Default value: all
Termination criteria.
- Maximum number of iterations
Set the number of iterations allowed to the network for its training. Training will stop regardless of the result when this number is reached.
- Epsilon
Training will focus on result and will stop once the precision is at most epsilon.
- Max. iterations + Epsilon
Both termination criteria are used. Training stops as soon as either one is reached.
Epsilon value used in the Termination criteria -classifier.ann.eps float
Default value: 0.01
Epsilon value used in the Termination criteria.
Maximum number of iterations used in the Termination criteria -classifier.ann.iter int
Default value: 1000
Maximum number of iterations used in the Termination criteria.
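The three termination modes can be summarized as a small predicate. The function name and the `change` argument (whatever precision measure the training loop tracks) are hypothetical; the defaults mirror the values listed above.

```python
def should_stop(iteration, change, mode="all", max_iter=1000, eps=0.01):
    """Decide whether training stops under the iter, eps or all criterion."""
    hit_iter = iteration >= max_iter
    hit_eps = change <= eps
    if mode == "iter":
        return hit_iter
    if mode == "eps":
        return hit_eps
    if mode == "all":
        return hit_iter or hit_eps  # first criterion reached wins
    raise ValueError("mode must be 'iter', 'eps' or 'all'")


print(should_stop(iteration=50, change=0.005, mode="all"))
```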
Random forests classifier options¶
Maximum depth of the tree -classifier.rf.max int
Default value: 5
The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.
Minimum number of samples in each node -classifier.rf.min int
Default value: 10
If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
Termination Criteria for regression tree -classifier.rf.ra float
Default value: 0
If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split -classifier.rf.cat int
Default value: 10
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
Size of the randomly selected subset of features at each tree node -classifier.rf.var int
Default value: 0
The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.
Maximum number of trees in the forest -classifier.rf.nbtrees int
Default value: 100
The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees increases the prediction time linearly.
Sufficient accuracy (OOB error) -classifier.rf.acc float
Default value: 0.01
Sufficient accuracy (OOB error).
KNN classifier options¶
Number of Neighbors -classifier.knn.k int
Default value: 32
The number of neighbors to use.
Decision rule -classifier.knn.rule [mean|median]
Default value: mean
Decision rule for regression output.
- Mean of neighbors values
Returns the mean of neighbors values.
- Median of neighbors values
Returns the median of neighbors values.
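The two decision rules can be sketched with a brute-force nearest-neighbor search. This is an illustration, not the OpenCV implementation: the function name, the Euclidean distance and the sample data are assumptions.

```python
import math
import statistics


def knn_regress(query, samples, k=32, rule="mean"):
    """Predict a value from the k nearest samples.

    `samples` is a list of (feature_vector, value) pairs.
    """
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    values = [value for _, value in nearest]
    if rule == "mean":
        return statistics.fmean(values)
    if rule == "median":
        return statistics.median(values)
    raise ValueError("rule must be 'mean' or 'median'")


samples = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 9.0), ([10.0], 100.0)]
print(knn_regress([0.5], samples, k=3, rule="mean"))
print(knn_regress([0.5], samples, k=3, rule="median"))
```

The median rule is less sensitive to a single extreme neighbor value than the mean rule, which is visible in the two results printed above.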
Shark Random forests classifier options¶
Maximum number of trees in the forest -classifier.sharkrf.nbtrees int
Default value: 100
The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Keep in mind that increasing the number of trees increases the prediction time linearly.
Min size of the node for a split -classifier.sharkrf.nodesize int
Default value: 25
If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
Number of features tested at each node -classifier.sharkrf.mtry int
Default value: 0
The number of features (variables) which will be tested at each node in order to compute the split. If set to zero, the square root of the number of features is used.
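The default rule for the number of tested features can be written as a one-line helper. The function name is hypothetical, and both the truncation of the square root and the clamping to the feature count are assumptions of this sketch; the text above only states that the square root is used when the value is zero.

```python
import math


def effective_mtry(mtry, feature_count):
    """Number of features tested at each node, given the user setting."""
    if mtry > 0:
        return min(mtry, feature_count)
    # Zero means "use the square root of the number of features".
    return max(1, int(math.sqrt(feature_count)))


print(effective_mtry(0, 16))
```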
Out of bag ratio -classifier.sharkrf.oobr float
Default value: 0.66
Set the fraction of the original training dataset to use as the out of bag sample. A good default value is 0.66.
Temporary files cleaning -cleanup bool
Default value: true
If activated, the application will try to clean all temporary files it created.
Examples¶
From the command-line:
otbcli_TrainImagesRegression -io.il inputPredictorImage.tif -io.ip inputLabelImage.tif -io.vd trainingData.shp -io.valid validationData.shp -sample.nt 500 -sample.nv 100 -io.imstat imageStats.xml -classifier rf -io.out model.txt
From Python:
import otbApplication
app = otbApplication.Registry.CreateApplication("TrainImagesRegression")
app.SetParameterStringList("io.il", ['inputPredictorImage.tif'])
app.SetParameterStringList("io.ip", ['inputLabelImage.tif'])
app.SetParameterStringList("io.vd", ['trainingData.shp'])
app.SetParameterStringList("io.valid", ['validationData.shp'])
app.SetParameterInt("sample.nt", 500)
app.SetParameterInt("sample.nv", 100)
app.SetParameterString("io.imstat", "imageStats.xml")
app.SetParameterString("classifier", "rf")
app.SetParameterString("io.out", "model.txt")
app.ExecuteAndWriteOutput()
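When the Python bindings are not available, the command-line form can also be assembled from a script. The helper below is a hypothetical sketch that only builds the argument list from the parameters shown in the examples above; running it requires the OTB command-line tools to be installed and on the PATH.

```python
import shlex


def build_command(params):
    """Turn a {parameter key: value or list of values} mapping into an argument list."""
    args = ["otbcli_TrainImagesRegression"]
    for key, value in params.items():
        args.append("-" + key)
        values = value if isinstance(value, (list, tuple)) else [value]
        args.extend(str(v) for v in values)
    return args


command = build_command({
    "io.il": ["inputPredictorImage.tif"],
    "io.ip": ["inputLabelImage.tif"],
    "io.vd": ["trainingData.shp"],
    "io.valid": ["validationData.shp"],
    "sample.nt": 500,
    "sample.nv": 100,
    "io.imstat": "imageStats.xml",
    "classifier": "rf",
    "io.out": "model.txt",
})
print(shlex.join(command))
# To execute: subprocess.run(command, check=True)
```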