Malrec Dataset

What is Malrec?

Malware sandbox systems have become a critical part of the Internet's defensive infrastructure. These systems allow malware researchers to quickly understand a sample's behavior and its effect on a system. However, current systems face two limitations. First, for performance reasons, the amount of data they can collect is limited (typically to system call traces and memory snapshots). Second, they lack the ability to perform retrospective analysis, that is, to later extract features of the malware's execution that were not considered relevant when the sample was originally executed. To address both, we have created a new malware sandbox system, Malrec, which uses PANDA's whole-system deterministic record and replay to capture high-fidelity, whole-system traces of malware executions with low time and space overheads. Here we present a new dataset of 66,301 malware recordings collected over a two-year period. The Malrec system and dataset can help provide a standardized benchmark for evaluating the performance of future dynamic analyses.

Download

Raw data:

malrec_dataset.tar (Torrent) (1.3T, MD5: 35699d63041b390ed794dd4c2e215246)
The record/replay logs; see the "Getting Started" section for details on how to use them.
Note that this is a very large file. Alternatively, you can email Brendan Dolan-Gavitt and ask him to mail you a hard drive containing the dataset (with some delay), for the cost of the drive plus shipping.
references.tar.xz (1.1G, MD5: a80e1c5740a8087e623a5d0b3b9100ac) - The reference snapshots (see below)
tools.tar.xz (4.1K, MD5: e2b0a040ab2ab7057d9df42ee7876862) - Tools for unpacking traces
virustotal.tar.gz (229M, MD5: d0879d306ae663159cfb26092847a422) - Antivirus labels for each sample
uuid_md5.txt (4.5M, MD5: 10042d787a6f1566893f75e3d27c91cc) - Mapping between recording UUIDs and sample MD5s

Derived datasets:

bagofwords_full.tar.gz (52G, MD5: bfa92e1acbfc3eab46f20438c2f18775) - Bag-of-words counts for textual data in memory accesses
bagofwords_ranges.tar.gz (20G, MD5: 9e4149b2184febfa3996a5ef3029e6f5) - Bag-of-words counts for textual data in memory accesses (active malware ranges only)
weighted_tfidf.tar.gz (12G, MD5: bdfd41cd67dba4cc9aa84e7bad7e0d4c) - Weighted TFIDF scores for textual data in memory accesses
malrec_pcaps.tar.gz (73G, MD5: d8dfeb6c1c632338843fcae02ac0dcb1) - Network activity in PCAP form
activityranges.tar.gz (4G, MD5: e0bc1ada961bca62069de6629b4ff9a1) - Instruction ranges where the malware sample was active
sctext_results.tar (465G, MD5: af0c821d10112305c0cd25a495bd4f26) - System call traces for each sample
avclass.csv (3.1M, MD5: 5ceee211cf45dda6e67a945e4e6bae4b) - AVClass results for each sample
recordings.csv (4.8M, MD5: 7b6b4b4903734f43d55b6af03c04f5c7) - Recording timestamp and instruction counts
insthist_results (281M) - Histogram data for instruction mnemonics (aggregate)
insthist_window_results (2.2T) - Histogram data for instruction mnemonics (rolling window of 1000 instructions)
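
Because several of these archives are very large, it may be worth checking each download against the MD5 digest listed above before unpacking it. The following is a minimal sketch, assuming a POSIX shell with the coreutils md5sum utility and awk available; verify_md5 is a hypothetical helper name introduced here and is not part of the Malrec tools.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper (not part of the Malrec tools).
# Returns 0 if FILE's MD5 digest equals EXPECTED_MD5, 1 otherwise.
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest field.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual, expected $expected)" >&2
        return 1
    fi
}
```

For example, after loading the function into your shell, the tools archive could be checked against its listed digest with:

    $ verify_md5 tools.tar.xz e2b0a040ab2ab7057d9df42ee7876862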

Getting Started

First, you will need to get and install PANDA 1.0 (not PANDA 2.0). You can find instructions on setting up PANDA in the user's manual.

Each recording is packaged as a .txz archive containing two files:

A .patch file, listing the reference snapshot and the differences needed to reconstruct the specific snapshot for this replay
The -rr-nondet.log file, PANDA's log of the nondeterministic inputs captured during the recording

To run a replay:

Unpack the reference snapshots:


    $ tar xJvf references.tar.xz
    logs/rr/references/0000c18e-a947-42ea-abb2-234ea18facdc-rr-snp
    logs/rr/references/0002f074-cd1b-4523-aacd-eeccd61c0f96-rr-snp
    logs/rr/references/00568419-706b-4c2e-ad3a-4de0add3780d-rr-snp
    logs/rr/references/023c870e-4be8-4f1c-a712-340e21c67565-rr-snp
    logs/rr/references/0335fe75-a7bd-4963-8304-da7e59005692-rr-snp
    logs/rr/references/097b607a-735e-4ac0-b853-c15dc58b58fc-rr-snp
    logs/rr/references/1b5091e3-98a5-4058-a944-c5d6f87fe103-rr-snp
    logs/rr/references/5ad7f823-1b5f-4f99-8f2b-53bf69e0fc08-rr-snp

Unpack the replay you want to run (named, say, UUID):


    $ tar xJvf UUID.txz
    logs/rr/UUID.patch
    logs/rr/UUID-rr-nondet.log

Create the patched snapshot:


    $ python tools/bpatch.py logs/rr/UUID.patch
    Using reference logs/rr/references/0000c18e-a947-42ea-abb2-234ea18facdc-rr-snp as a base
    Creating patched snapshot logs/rr/UUID-rr-snp
    All done, no errors.

Run your replay:


    $ /path/to/qemu-system-x86_64 -replay logs/rr/UUID -m 1G
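
The per-recording steps above (unpack, patch, replay) can be combined into a small wrapper. This is only a sketch under stated assumptions: replay_recording and run, along with the DRY_RUN and QEMU variables, are hypothetical names introduced here; the reference snapshots are assumed to be unpacked already; and the function is assumed to be called from the directory that contains tools/ and the .txz archives. QEMU should point at the qemu-system-x86_64 binary from your PANDA 1.0 build; the default is the same placeholder path used above.

```shell
#!/bin/sh
# Hypothetical wrapper around the three per-recording steps shown above.
# QEMU defaults to the placeholder path; override it with your PANDA 1.0 build.
QEMU=${QEMU:-/path/to/qemu-system-x86_64}

# run CMD ARGS...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replay_recording UUID: unpack UUID.txz, build the patched snapshot, replay it.
replay_recording() {
    uuid=${1:-}
    if [ -z "$uuid" ]; then
        echo "usage: replay_recording UUID" >&2
        return 2
    fi
    # Each step runs only if the previous one succeeded.
    run tar xJvf "$uuid.txz" &&
    run python tools/bpatch.py "logs/rr/$uuid.patch" &&
    run "$QEMU" -replay "logs/rr/$uuid" -m 1G
}
```

Setting DRY_RUN=1 before calling replay_recording prints the three commands without running them, which is a cheap way to check the paths before starting a long replay.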
