Exploratory data analysis

There are generally two ways to explore the dataset in a semi-automatic way:

  • using the Evaluator on a raw dataset from crypto API mining.

  • using Automatic exploratory analysis on the prepared (train) dataset from the feature engineering step of the data preparation pipeline.

These two ways are outlined below.

Evaluator

Some useful information can be extracted from the dataset using the Evaluator. The Evaluator can be used as an API inside your own notebooks/scripts, as sketched below.
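A very rough sketch of what such API usage could look like is shown here; note that the module path, the Evaluator constructor and the method name are all assumptions for illustration, not the project's actual interface:

```python
import json

from evaluation.evaluator import Evaluator  # assumed module path

# Load a raw dataset mined from the crypto API
# (a top-level list of records is assumed).
with open("records.json") as f:
    records = json.load(f)

evaluator = Evaluator(records)  # assumed constructor
evaluator.print_summary()       # assumed method printing basic statistics
```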

Even though the evaluator can consume a raw dataset, it is preferable to clean the dataset first using the data preparation pipeline. For this reason, explore_template.ipynb is available, which handles the cleaning before the evaluator is used; all needed configuration can be set up in the constants in the first cell. For examples of executed exploratory notebooks see:

However, these notebooks (and the evaluator itself) expect a single records.json file. If there are multiple records.json files, for example one per year, it can be useful to merge them together. For this reason the repository contains a simple utility, merge_jsons.py.
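The exact interface of merge_jsons.py is not reproduced here; conceptually, the merge amounts to concatenating the per-file record lists, as in this rough sketch (the file names and the assumption that each file holds a top-level list of records are illustrative):

```python
import json

# Assumed: each records.json holds a top-level list of record objects.
paths = ["records_2020.json", "records_2021.json", "records_2022.json"]

merged = []
for path in paths:
    with open(path) as f:
        merged.extend(json.load(f))  # concatenate the per-year record lists

# Write a single merged file that the notebooks and evaluator expect.
with open("records.json", "w") as f:
    json.dump(merged, f)
```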

Automatic exploratory analysis

Automatic exploratory analysis is performed using the pandas-profiling library. Our wrapper around this library, cli_automatically_explore.py, takes an input .csv file and an output path where the report is stored.

The input file should contain features after feature engineering; feature scaling and feature selection are not needed here. Generating the report for the whole dataset can be very time-consuming, so there are two ways to deal with this:

  • use minimal mode by specifying the --minimal_mode flag – this mode skips calculating some information, for example correlations between all features.

  • use sampling by specifying --sample_size, which is 10 000 by default.

When minimal mode is used, the sample size is completely ignored, as illustrated in the sketch below.
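At the library level, the two options roughly correspond to the following; the file names are placeholders, and how the wrapper wires its flags to pandas-profiling is an assumption:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("train_features.csv")  # features after feature engineering

MINIMAL_MODE = False  # corresponds to --minimal_mode
SAMPLE_SIZE = 10_000  # corresponds to --sample_size (default)

if MINIMAL_MODE:
    # Minimal mode skips expensive computations such as pairwise
    # correlations and profiles the full dataset, so the sample
    # size is ignored.
    report = ProfileReport(df, minimal=True)
else:
    # Otherwise profile a random sample to keep the runtime manageable.
    report = ProfileReport(df.sample(n=min(SAMPLE_SIZE, len(df))))

report.to_file("report.html")
```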