Classifier training

The whole classifier training process is driven by cli_train.py, which is in turn parametrized by config.yml. While the configuration is exhaustively described in the configuration file itself, here we recapitulate some key takeaways.

The input of cli_train.py is the dataset prepared by the data preparation pipeline, processed at least up to records selection (feature selection and feature scaling may be omitted). The output is the stored trained classifier and, optionally, predictions on the test dataset. The output can be further evaluated (exploratory data analysis) and interpreted (classifier interpretation).

Supported classifiers

Below are outlined the classifiers with all possible tags that can be used in the configuration:

Please note that only Random Forest is currently fully supported. The other classifiers are considered experimental.

Correct input format

The input is a single .h5 (HDF5) file which must contain dataframes under the following keys:

  • X_train - training features

  • y_train - training labels

  • X_test - testing features

While it is desirable to use the output of the data preparation pipeline, the input can be generated by other tools. The input file only has to contain the specified dataframes under the correct keys, with the correct dimensionality and corresponding data quality, for example no missing values.
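As a minimal sketch, a compatible input file could be produced with pandas as shown below. The file name, feature names, and shapes are illustrative assumptions; only the three keys are required.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative feature matrices and labels; in practice these come from
# the data preparation pipeline or an equivalent tool.
X_train = pd.DataFrame(rng.normal(size=(100, 4)),
                       columns=[f"f{i}" for i in range(4)])
y_train = pd.Series(rng.integers(0, 2, size=100), name="label")
X_test = pd.DataFrame(rng.normal(size=(20, 4)), columns=X_train.columns)

# Store each dataframe under the key expected in the input file.
with pd.HDFStore("dataset.h5", mode="w") as store:
    store.put("X_train", X_train)
    store.put("y_train", y_train.to_frame())
    store.put("X_test", X_test)
```

Note that writing HDF5 files with pandas requires the PyTables package to be installed.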

Classifier training

The classifiers are trained using cross-validation with the number of splits specified in the configuration as cv_splits (5 by default). The optimized metric is usually the F1-score, which is macro-averaged for multi-class classification. Additionally, inverse class weights are used when possible to mitigate class imbalance.
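The setup above can be sketched with scikit-learn; this is an illustrative approximation, not the tool's actual code. The dataset is synthetic, `cv=5` corresponds to the default cv_splits, `f1_macro` is the macro-averaged F1-score, and `class_weight="balanced"` applies inverse-frequency class weights.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced multi-class dataset for illustration only.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# "balanced" weights each class inversely to its frequency, mirroring
# the inverse class weights described above.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)

# Five-fold cross-validation scored with macro-averaged F1.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```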

Some of the columns in the dataset can optionally be ignored during training; see columns_to_ignore in the configuration. However, take care to keep the default columns listed there, because they are intended as metadata only and must not be used as features.
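A hypothetical sketch of how such columns might be excluded before fitting; the column names and the columns_to_ignore value below are illustrative, not defaults from the actual config.yml.

```python
import pandas as pd

# Illustrative dataframe mixing real features with a metadata column.
X_train = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3],
    "feature_b": [1.0, 0.5, 0.2],
    "sample_id": ["s1", "s2", "s3"],  # metadata, not a feature
})

# Hypothetical columns_to_ignore entry from the configuration.
columns_to_ignore = ["sample_id"]

# Drop the ignored columns before training; errors="ignore" tolerates
# listed columns that are absent from a particular dataset.
features = X_train.drop(columns=columns_to_ignore, errors="ignore")
print(list(features.columns))  # → ['feature_a', 'feature_b']
```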

Training for multiple objectives and in parallel

Currently, the classifiers specified in a single configuration file are trained sequentially. Still, the training of a single classifier can be parallelized by specifying threads (8 by default) in the configuration. If the goal is to train multiple classifiers in parallel, the solution is to have a separate configuration file for each classifier.
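Assuming the threads setting maps onto scikit-learn's `n_jobs` parameter (an assumption, since the document does not state the mechanism), single-classifier parallelism could look like this; the config dictionary is a hypothetical excerpt, with 8 being the documented default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical excerpt of the parsed config.yml; 8 is the documented default.
config = {"threads": 8}

# Synthetic data for illustration only.
X, y = make_classification(n_samples=100, random_state=0)

# Random Forest training parallelizes across trees; here the documented
# `threads` setting is assumed to feed n_jobs.
clf = RandomForestClassifier(n_jobs=config["threads"], random_state=0)
clf.fit(X, y)
print(clf.n_jobs)
```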

Additionally, when training classifiers for additional objectives (or simply using different input datasets for the same objective), new configuration files have to be created.