Data preparation
The whole process of preparing the dataset is driven by cli_prepare.py,
which is in turn parametrized by config.yml. While the configuration is exhaustively described in the configuration file itself, here we recapitulate the steps of the preparation pipeline. There are five steps:
Cleaning
Feature engineering
Records selection
Feature scaling
Feature selection
The input to the full pipeline is the set of files produced during crypto API mining. The output of the full pipeline is usually a single hdf file per objective, which contains train and test dataframes of features and target labels. The currently supported objectives are:
malware labeling – label malicious sample as one of the malware families
malware detection – classify a sample as malicious or benign
The configuration can be tweaked to output intermediate results of the pipeline steps. For example, it might be beneficial to store the output of the cleaning step. Pipeline steps can also be skipped, but only from the beginning or the end, because each step expects either the output of the previous step or paths to the data generated by the previous step.
We further explain each of the pipeline steps in order. We also provide an overview of the provided tweaked configuration files.
Cleaning
During this step, the outputs of crypto API mining are merged together and cleaned. Each input file can be further specified as benign (or malicious) in the configuration. Additionally, the source of the sample can be added to possibly differentiate between different datasets during analysis.
Overall, the process of cleaning can be decomposed into five sequential steps, all of which are optional unless stated otherwise:
Clean missing values – mandatory step that replaces the missing values with appropriate constants
Clean labels:
Rename family names that are considered inaccurate, based on the rules in the configuration.
Rename types that are considered inaccurate, based on the rules in the configuration.
Adjust the mapping between family names and types to be 1:1. Specifically, change each sample’s type to the most common type corresponding to the sample’s family name.
Remove third-party cryptography libraries from third-party packages, because they carry duplicate information
Remove similar classes from the crypto API call records, because they are considered duplicates. Generally, a class is removed when its name contains illegal characters and another similar class exists.
Clean crypto API calls:
Remove crypto API call records for samples whose crypto API imports are empty.
Remove all the crypto API calls that are prefixes of another one on the same line.
Remove crypto API call false positives, for example when a call name appears only as a substring of a user-defined method.
The output of this step can be optionally stored in a single .json file.
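As an illustration, the prefix-removal rule above (dropping any crypto API call that is a prefix of another call on the same line) can be sketched as follows. The data layout, a list of call strings per line, is an assumption for illustration, not the pipeline's actual representation:

```python
def drop_prefix_calls(calls_per_line):
    """For each line, drop any call that is a strict prefix of another
    call on the same line; calls_per_line is a list of lists of strings.
    (Illustrative helper; not part of cli_prepare.py.)"""
    cleaned = []
    for calls in calls_per_line:
        kept = [
            c for c in calls
            # keep c unless some *other* call on this line extends it
            if not any(other != c and other.startswith(c) for other in calls)
        ]
        cleaned.append(kept)
    return cleaned
```

For example, on a line containing both `javax.crypto.Cipher` and `javax.crypto.Cipher.getInstance`, only the longer call is kept.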
Feature engineering
During this step, features suitable for machine learning are engineered.
This step’s output can be optionally stored in a single .csv (hdf) file.
Records selection
First, the dataset is split into train and test sets with the ratio specified in the configuration. Next, the labels are adjusted for the specified objectives:
malware detection – change the target label to benign or malicious
malware labeling – keep only the top family labels; optionally merge the others into the special label OTHER, and optionally remove all samples with the UNKNOWN family tag
After this step, the dataset is ready for classifier training.
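The records selection step can be sketched roughly as below. The column names family and is_malicious, the helper name select_records, and the default parameter values are illustrative assumptions, not the pipeline's actual schema or configuration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def select_records(df, objective, test_ratio=0.2, top_k=10):
    """Sketch of records selection: adjust labels per objective,
    then split into train/test. (Hypothetical column names.)"""
    if objective == "malware_detection":
        # Collapse the target into benign vs. malicious.
        y = df["is_malicious"].map({True: "malicious", False: "benign"})
    else:  # malware labeling
        # Keep the top-k families, merge the rest into OTHER,
        # and drop samples tagged UNKNOWN.
        top = df["family"].value_counts().nlargest(top_k).index
        y = df["family"].where(df["family"].isin(top), other="OTHER")
        keep = df["family"] != "UNKNOWN"
        df, y = df[keep], y[keep]
    X = df.drop(columns=["family", "is_malicious"])
    return train_test_split(X, y, test_size=test_ratio, random_state=0)
```

The returned four dataframes correspond to the X_train, X_test, y_train, and y_test keys described below.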
This step’s output per objective can be optionally stored in a single .h5 (hdf) file. The dataframes are stored under the keys X_train, X_test, y_train, and y_test, where the prefix X represents features and y represents labels.
Feature scaling
During this step, the features are optionally scaled according to the configuration, in the given order:
Normalize features by overall class count (metadata_n_classes)
Normalize features by total lines of code (metadata_n_lines)
Scale features using standard scaling
In the default configuration, only normalization by class count is used.
An important configuration option is columns_to_ignore. The default columns should be left in place: they are metadata columns and are not intended for use during training. More columns can be added to exclude them from scaling (in general, only numeric columns are scaled).
This step’s output per objective can be optionally stored in a single .h5 (hdf) file. The dataframes are stored under the keys X_train, X_test, y_train, and y_test, where the prefix X represents features and y represents labels.
Feature selection
During this step, a subset of features deemed useful for each objective is selected in three steps:
Remove features with low variance using Variance Threshold
Remove features that are linearly correlated above a threshold specified in configuration
Use Boruta to remove features with low predictive power
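The first two selection steps can be sketched with scikit-learn as below; the final Boruta step (e.g. BorutaPy from the boruta package) is only indicated in a comment, and the thresholds here are illustrative, not the configured defaults:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def select_features(X, corr_threshold=0.95, var_threshold=0.0):
    """Sketch of the variance and correlation filters; the final
    Boruta step is omitted. (Illustrative, not cli_prepare.py.)"""
    # 1. Drop low-variance features.
    vt = VarianceThreshold(var_threshold).fit(X)
    X = X.loc[:, vt.get_support()]
    # 2. Drop one feature from every pair correlated above the threshold,
    #    scanning only the upper triangle to avoid double counting.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    X = X.drop(columns=drop)
    # 3. Boruta would be applied here, e.g.:
    #    BorutaPy(RandomForestClassifier(), n_estimators="auto").fit(X.values, y)
    return X
```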
An important configuration option is columns_to_ignore. The default columns should be left in place: they are metadata columns and are not intended for use during training. More columns can be added to exclude them from feature selection.
This step’s output per objective can be optionally stored in a single .h5 (hdf) file. The dataframes are stored under the keys X_train, X_test, y_train, and y_test, where the prefix X represents features and y represents labels.
Configurations
Below, the provided configurations are outlined with a short description of each:
full_with_clean_output.yml – whole pipeline with output of cleaned dataset
without_feature_selection.yml – whole pipeline without feature selection
clean_only.yml – clean only configuration
from_clean.yml – pipeline without cleaning (starting from records selection)
without_feature_scaling.yml – pipeline without feature scaling but with feature selection
to_feature_scaling.yml – pipeline from cleaning up to records selection (without feature scaling and feature selection)
full_with_feature_engineering_output.yml – full pipeline with output during feature engineering