APK Dataset

There are two ways to supply the dataset of APK binaries for analysis:

  • load your own APKs from disk,

  • use the download task of cli_process.py to fetch a dataset from Androzoo.

Load your own APKs from disk

Prepare a folder with the following structure

dataset
└── data
    └── apk
        ├── first.apk
        ├── second.apk
        ├── third.apk
        ├── fourth.apk
        └── fifth.apk

and point the tool to the folder by specifying

dataset_path: "/path/to/folder/with/dataset"

in the config file.
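
Before running the tool, it can help to confirm that the folder matches the expected layout. The following is a minimal sketch, not part of the tool itself: the helper name list_apks is hypothetical, and it only assumes the dataset/data/apk structure shown above.

```python
from pathlib import Path


def list_apks(dataset_path):
    """Return the sorted APK files found under <dataset_path>/data/apk.

    Hypothetical helper for illustration; it is not part of cli_process.py.
    """
    apk_dir = Path(dataset_path) / "data" / "apk"
    if not apk_dir.is_dir():
        raise FileNotFoundError(f"expected APK folder at {apk_dir}")
    # Keep only regular files with the .apk extension.
    return sorted(p for p in apk_dir.iterdir() if p.is_file() and p.suffix == ".apk")
```

Calling list_apks("/path/to/folder/with/dataset") with the same value as dataset_path should return one entry per APK you placed in the folder.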

Download samples from Androzoo

Apart from using your own dataset, this tool can leverage the Androzoo dataset to download malicious APKs directly from its database. This is done as part of the download task in cli_process.py. To create datasets from Androzoo, you have to:

  • register in the Androzoo service and obtain your API key,

  • download the Androzoo list of APKs (a CSV file) and optionally shuffle its lines (if you want to ensure random sampling),

  • in the configuration file, provide the path to the Androzoo CSV file as

csv_path: '/path/to/androzoo/androzoo_shuffled.csv' # path to your csv file of Androzoo

  • include the Androzoo token as a command-line parameter when running cli_process.py as

./cli_process.py -a my_androzoo_token

The parameters for the Androzoo sampling (how many samples to download, whether they should be malicious, etc.) are set in the download task configuration in the configuration file.
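
The optional shuffling of the Androzoo CSV can be done with any tool. The sketch below is one possible approach under stated assumptions: the first line of the file is a header that must stay in place, no field contains an embedded newline, and the whole file fits in memory (the full Androzoo list is large, so this may not hold on small machines). The function name shuffle_csv_lines is hypothetical.

```python
import random


def shuffle_csv_lines(src_path, dst_path, seed=None):
    """Copy src_path to dst_path with the data rows shuffled.

    The first line is treated as the header and is kept first.
    Returns the number of shuffled data rows.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        header = src.readline()
        rows = src.readlines()
    # Make sure the last row ends with a newline so rows stay separate
    # after they are reordered.
    if rows and not rows[-1].endswith("\n"):
        rows[-1] += "\n"
    # A fixed seed makes the shuffle reproducible.
    random.Random(seed).shuffle(rows)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        dst.write(header)
        dst.writelines(rows)
    return len(rows)
```

The output path can then be used as the csv_path value in the configuration file.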

What the dataset looks like after processing

The folder sample_dataset shows the form of a dataset (and is used in the sample experiment). The directory structure is shown in the listing below.

dataset
├── data
│   ├── apk
│   │   ├── 0d9e99e79d9a04e41da65fed037b8560612c12d55958166427557f78bc887585.apk
│   │   ├── 3d44791ccd1d5b871a1e4d4fcfc90b667f219c7979ca02d220e442552ec2b357.apk
│   │   ├── 570232b8f1b197fcf31a91a924f0fa2b0480a7b1cd98c2f0ab79b1110eb94e3c.apk
│   │   ├── 58d75f26c56e9f91b06513cf9a990289222d85641d338be6d9cb3967583d3727.apk
│   │   ├── 63890df926f97eb10370a0fb5f667ce9610887128a8cc69e0e53c57ffe06c05d.apk
│   │   ├── 9fb6230d41fa2a9c7b89d61ab61e77079dd415b481810f320aa267758ced59e9.apk
│   │   ├── a002fecefcd5e26b14e4221c1c59a2e60751034e8f9cd2a36c4e2a4267d03521.apk
│   │   ├── a5513cc8ed6bac8ec3252a51b550ee38c1b3a463e5f21303a1af0b7d83e0fb7a.apk
│   │   ├── d9e5c58dced69e209b5fc8721cc249ef3cd8b9f425469a3d8e3a1c829b0932a9.apk
│   │   └── f26a759ff6b136ba291494b0cbf42f06b240092662ab78d2f7d6369d685b5a4a.apk
│   └── dx
├── meta.yml
└── readme.md
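
The file names in the listing are 64 hexadecimal characters long, which is consistent with SHA-256 digests. Assuming (this is not stated by the tool) that each APK is named after the SHA-256 hash of its own content, the following sketch reports files whose name does not match their digest. Both helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_misnamed_apks(dataset_path):
    """List APK file names under data/apk that differ from their SHA-256.

    Assumption: files are expected to be named <sha256>.apk.
    """
    apk_dir = Path(dataset_path) / "data" / "apk"
    return [
        p.name
        for p in sorted(apk_dir.glob("*.apk"))
        if p.stem.lower() != sha256_of_file(p)
    ]
```

An empty result means every APK name agrees with its content hash. If you load your own APKs with arbitrary names such as first.apk, they will all be reported, which is expected under this assumption.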

Meta.yml file

The meta.yml file is created automatically for each supplied dataset, so you do not have to provide it yourself.