Classifier training
The whole training process is handled by cli_train.py,
which is in turn parametrized by config.yml. While the configuration is exhaustively described in the configuration file itself, here we recapitulate some key takeaways.
The input of cli_train.py
is the dataset prepared by the data preparation pipeline, taken at least up to records selection (feature selection or feature scaling can be excluded). The output is the stored trained classifier and, optionally, the predictions on the test dataset. The output can be further evaluated (exploratory data analysis) and interpreted (classifier interpretation).
Supported classifiers
Below are outlined the classifiers with all possible tags that can be used in the configuration:
Random Forest: rf, random_forest, random forest
Linear Support Vector Machines: svm, support_vector_machines, support_vector_machine, support vector machine, support vector machiens
Gaussian Naive Bayes: gnb, gaussian_naive_bayes, gaussian naive bayes, gaussian_nb, gaussian nb
Complement Naive Bayes: cnb, complement_naive_bayes, complement naive bayes, complement_nb, complement nb
Linear Discriminant Analysis: lda, linear_discriminant_analysis, lienar discriminant analysis
Explainable Boosting: eb, ebm, explainable_boosting, explainable boosting
Decision Rule List: rules
Please note that only Random Forest is currently fully supported; the other classifiers are considered experimental.
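As an illustration of how the tags above could map to canonical classifier names, here is a hypothetical sketch; the real mapping lives inside cli_train.py and may differ in its canonical names and error handling.

```python
# Hypothetical alias table built from the tag lists above; the canonical
# names on the right-hand side are illustrative, not the project's own.
ALIASES = {
    "rf": "random_forest", "random_forest": "random_forest",
    "random forest": "random_forest",
    "svm": "svm", "support_vector_machines": "svm",
    "support_vector_machine": "svm", "support vector machine": "svm",
    "support vector machiens": "svm",
    "gnb": "gaussian_nb", "gaussian_naive_bayes": "gaussian_nb",
    "gaussian naive bayes": "gaussian_nb", "gaussian_nb": "gaussian_nb",
    "gaussian nb": "gaussian_nb",
    "cnb": "complement_nb", "complement_naive_bayes": "complement_nb",
    "complement naive bayes": "complement_nb",
    "complement_nb": "complement_nb", "complement nb": "complement_nb",
    "lda": "lda", "linear_discriminant_analysis": "lda",
    "lienar discriminant analysis": "lda",
    "eb": "ebm", "ebm": "ebm", "explainable_boosting": "ebm",
    "explainable boosting": "ebm",
    "rules": "rules",
}

def resolve_classifier(tag: str) -> str:
    """Map a configuration tag to a canonical classifier name."""
    try:
        return ALIASES[tag.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown classifier tag: {tag!r}") from None
```

Normalizing with strip() and lower() makes the lookup tolerant of casing and surrounding whitespace in the configuration value.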
Correct input format
The input is a single .h5
(HDF) file which must contain dataframes under the following keys:
X_train
- training features
y_train
- training labels
X_test
- testing features
While it is preferable to use the output of the data preparation pipeline, the input can be generated by other tools. The input file only has to contain the specified dataframes under the correct keys, with the correct dimensionality and corresponding data quality, for example no missing values.
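The input contract described above can be sketched as a validation function; this is illustrative code, not part of cli_train.py. The `frames` dict stands in for the contents of the .h5 file, e.g. as read per key with pandas.read_hdf (which requires the optional `tables` package).

```python
import pandas as pd

def validate_input(frames: dict) -> None:
    """Check the input contract: required keys present, matching shapes,
    shared feature columns, and no missing values."""
    required = ("X_train", "y_train", "X_test")
    for key in required:
        if key not in frames:
            raise ValueError(f"missing dataframe: {key}")
    X_train, y_train, X_test = (frames[k] for k in required)
    if len(X_train) != len(y_train):
        raise ValueError("X_train and y_train must have the same row count")
    if list(X_train.columns) != list(X_test.columns):
        raise ValueError("X_train and X_test must share the same columns")
    for key in ("X_train", "X_test"):
        if frames[key].isna().any().any():
            raise ValueError(f"{key} contains missing values")

# Example: a minimal, well-formed input.
frames = {
    "X_train": pd.DataFrame({"feat_1": [0.1, 0.2], "feat_2": [1.0, 0.5]}),
    "y_train": pd.DataFrame({"label": [0, 1]}),
    "X_test": pd.DataFrame({"feat_1": [0.3], "feat_2": [0.2]}),
}
validate_input(frames)  # passes silently
```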
Classifier training
The classifiers are trained using cross-validation, with the number of splits specified in the configuration as cv_splits
(5 by default). The optimized metric is usually the F1-score, which is macro-averaged for multi-class classification. Additionally, inverse class weights are used when possible to deal with class imbalance.
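The training scheme just described can be sketched with scikit-learn; this is an illustration of the defaults (5-fold cross-validation, macro-averaged F1, balanced class weights) on synthetic data, not the project's actual training code.

```python
# Illustrative sketch: 5-fold CV optimizing macro-averaged F1 with
# inverse ("balanced") class weights on an imbalanced 3-class problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=200, n_classes=3, n_informative=6,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```

`class_weight="balanced"` weights each class inversely to its frequency, which is one common realization of the inverse class weights mentioned above.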
Some of the columns in the dataset can optionally be ignored during training. Please see columns_to_ignore
in the configuration. However, take care to leave the default columns there, because they are intended only as metadata.
Training for multiple objectives and in parallel
Currently, the classifiers specified in a single configuration file are trained sequentially. Still, the training of a single classifier can be parallelized by specifying threads
(8 by default) in the configuration. If the goal is to train multiple classifiers in parallel, the solution is to have a separate configuration file for each classifier.
Additionally, when training classifiers for multiple objectives (or just using different input datasets for the same objective), new configuration files have to be specified.
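As a sketch, one configuration file per classifier could look like the fragment below. The cv_splits, threads, and columns_to_ignore keys are taken from this document; the remaining key names and values are illustrative and should be checked against config.yml.

```yaml
# config_rf.yml - trains only the Random Forest (illustrative key names)
classifier: rf
cv_splits: 5
threads: 8
columns_to_ignore:
  - sample_id
# a sibling config_svm.yml would differ only in the classifier tag
```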