APK Dataset
There are two ways of supplying the dataset of APK binaries for analysis:
load your own APKs from disk,
run the download task of cli_process.py to obtain a dataset from Androzoo.
Load own APKs from disk
Prepare a folder with the following structure
dataset
├── data
│ ├── apk
│ │ ├── first.apk
│ │ ├── second.apk
│ │ ├── third.apk
│ │ ├── fourth.apk
│ │ ├── fifth.apk
and point the tool to that folder by specifying
dataset_path: "/path/to/folder/with/dataset"
in the config file.
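If your APKs are scattered in another directory, the layout above can be prepared with a short script. The following is only a minimal sketch and not part of the tool: the helper name prepare_dataset and its arguments are hypothetical, and the only thing taken from this README is the dataset/data/apk layout.

```python
import shutil
from pathlib import Path


def prepare_dataset(source_dir, dataset_dir):
    """Copy every *.apk file from source_dir into dataset_dir/data/apk.

    Hypothetical helper (not part of the tool). Returns the sorted
    names of the files that were copied.
    """
    apk_dir = Path(dataset_dir) / "data" / "apk"
    # Create dataset/data/apk, including missing parent folders.
    apk_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for apk in sorted(Path(source_dir).glob("*.apk")):
        shutil.copy2(apk, apk_dir / apk.name)
        copied.append(apk.name)
    return copied
```

Calling prepare_dataset("/path/to/my/apks", "/path/to/folder/with/dataset") would produce a folder that can be used as dataset_path.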
Download samples from Androzoo
Apart from using your own dataset, this tool can download malicious APKs directly from the Androzoo database. This is done as part of the download task in cli_process.py. To create datasets from Androzoo, you have to:
register with the Androzoo service and obtain your API key,
download the list of APKs in Androzoo (CSV) and, if you want to ensure random sampling, shuffle its lines,
in the configuration file, provide the path to the Androzoo CSV file as
csv_path: '/path/to/androzoo/androzoo_shuffled.csv' # path to your csv file of Androzoo
include the Androzoo token as a command-line parameter when running cli_process.py as
./cli_process.py -a my_androzoo_token
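The shuffling step mentioned above can be done with any tool you like. As a minimal sketch (not part of this tool), the hypothetical helper below shuffles the data rows of a CSV file in memory. It assumes that the first line of the file is a header row that should stay on top and that every record occupies exactly one line. The Androzoo list is large, so for very big files an on-disk shuffling tool may be more appropriate.

```python
import random


def shuffle_csv_lines(src_path, dst_path, seed=None):
    """Write a copy of src_path to dst_path with its data rows shuffled.

    Hypothetical helper. Assumes the first line is a header, which is
    kept as the first line of the output. Returns the number of data rows.
    """
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        header = src.readline()
        rows = src.readlines()

    # Make sure the last row ends with a newline so rows do not merge
    # once their order changes.
    if rows and not rows[-1].endswith("\n"):
        rows[-1] += "\n"

    # A fixed seed makes the sampling order reproducible.
    random.Random(seed).shuffle(rows)

    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(header)
        dst.writelines(rows)
    return len(rows)
```

The output file is the one you would then reference as csv_path in the configuration file.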
The parameters for Androzoo sampling (how many samples to download, whether they should be malicious, etc.) are set in the download task section of the configuration file.
What the dataset looks like after processing
The folder sample_dataset shows the form of a dataset (and is used in the sample experiment). The directory structure is shown in the listing below.
dataset
├── data
│ ├── apk
│ │ ├── 0d9e99e79d9a04e41da65fed037b8560612c12d55958166427557f78bc887585.apk
│ │ ├── 3d44791ccd1d5b871a1e4d4fcfc90b667f219c7979ca02d220e442552ec2b357.apk
│ │ ├── 570232b8f1b197fcf31a91a924f0fa2b0480a7b1cd98c2f0ab79b1110eb94e3c.apk
│ │ ├── 58d75f26c56e9f91b06513cf9a990289222d85641d338be6d9cb3967583d3727.apk
│ │ ├── 63890df926f97eb10370a0fb5f667ce9610887128a8cc69e0e53c57ffe06c05d.apk
│ │ ├── 9fb6230d41fa2a9c7b89d61ab61e77079dd415b481810f320aa267758ced59e9.apk
│ │ ├── a002fecefcd5e26b14e4221c1c59a2e60751034e8f9cd2a36c4e2a4267d03521.apk
│ │ ├── a5513cc8ed6bac8ec3252a51b550ee38c1b3a463e5f21303a1af0b7d83e0fb7a.apk
│ │ ├── d9e5c58dced69e209b5fc8721cc249ef3cd8b9f425469a3d8e3a1c829b0932a9.apk
│ │ └── f26a759ff6b136ba291494b0cbf42f06b240092662ab78d2f7d6369d685b5a4a.apk
│ └── dx
├── meta.yml
└── readme.md
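To check quickly whether a folder follows this structure, something like the sketch below could be used. It is a hypothetical helper, not part of the tool, and it only looks at the paths visible in the listing above (data/apk, data/dx and meta.yml). It makes no assumption about the content of these files.

```python
from pathlib import Path


def summarize_dataset(dataset_dir):
    """Report which parts of the layout shown above are present.

    Hypothetical helper (not part of the tool).
    """
    root = Path(dataset_dir)
    apk_dir = root / "data" / "apk"
    # Collect APK file names only if the apk folder exists.
    apks = sorted(p.name for p in apk_dir.glob("*.apk")) if apk_dir.is_dir() else []
    return {
        "apk_count": len(apks),
        "apk_names": apks,
        "has_dx_dir": (root / "data" / "dx").is_dir(),
        "has_meta": (root / "meta.yml").is_file(),
    }
```

For the sample_dataset folder shown above, the returned apk_count would be 10.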
Meta.yml file
The meta.yml file is created automatically for each supplied dataset, so you don’t have to provide it yourself.