What is Malrec?

Malware sandbox systems have become a critical part of the Internet's defensive infrastructure. These systems allow malware researchers to quickly understand a sample's behavior and its effect on a system. However, current systems face two limitations. First, for performance reasons, the amount of data they can collect is limited (typically to system call traces and memory snapshots). Second, they lack the ability to perform retrospective analysis, that is, to later extract features of the malware's execution that were not considered relevant when the sample was originally executed. We have created a new malware sandbox system, Malrec, which uses PANDA's whole-system deterministic record and replay to capture high-fidelity, whole-system traces of malware executions with low time and space overheads. Here we present a new dataset of 66,301 malware recordings collected over a two-year period. The Malrec system and dataset can help provide a standardized benchmark for evaluating the performance of future dynamic analyses.


Raw data:

  • malrec_dataset.tar (Torrent) (1.3T, MD5: 35699d63041b390ed794dd4c2e215246)
    The record/replay logs; see the "Getting Started" section for details on how to use them.
    Note that this is a very large file. The dataset can also be obtained (with some delay) by emailing Brendan Dolan-Gavitt and asking him to mail you a hard drive with the dataset (for the price of the drive + shipping).
  • references.tar.xz (1.1G, MD5: a80e1c5740a8087e623a5d0b3b9100ac) - The reference snapshots (see below)
  • tools.tar.xz (4.1K, MD5: e2b0a040ab2ab7057d9df42ee7876862) - Tools for unpacking traces
  • virustotal.tar.gz (229M, MD5: d0879d306ae663159cfb26092847a422) - Antivirus labels for each sample
  • uuid_md5.txt (4.5M, MD5: 10042d787a6f1566893f75e3d27c91cc) - Mapping between recording UUIDs and sample MD5s
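
Because several of these files are very large, it is worth checking a download against its listed MD5 before unpacking it. The following is a minimal sketch, not part of the Malrec tools: the helper names md5_of_file and verify_download are made up for illustration, and the file is read in chunks so that even the 1.3T archive does not need to fit in memory.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the checksum listed on this page."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Checksum copied from the list above; the local path is an assumption.
    path = "tools.tar.xz"
    if os.path.exists(path):
        ok = verify_download(path, "e2b0a040ab2ab7057d9df42ee7876862")
        print(path, "OK" if ok else "MD5 MISMATCH")
```

The same check applies to any other file on this page that lists an MD5.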

Derived datasets:

  • bagofwords_full.tar.gz (52G, MD5: bfa92e1acbfc3eab46f20438c2f18775) - Bag-of-words counts for textual data in memory accesses
  • bagofwords_ranges.tar.gz (20G, MD5: 9e4149b2184febfa3996a5ef3029e6f5) - Bag-of-words counts for textual data in memory accesses (active malware ranges only)
  • weighted_tfidf.tar.gz (12G, MD5: bdfd41cd67dba4cc9aa84e7bad7e0d4c) - Weighted TFIDF scores for textual data in memory accesses
  • malrec_pcaps.tar.gz (73G, MD5: d8dfeb6c1c632338843fcae02ac0dcb1) - Network activity in PCAP form
  • activityranges.tar.gz (4G, MD5: e0bc1ada961bca62069de6629b4ff9a1) - Instruction ranges where the malware sample was active
  • sctext_results.tar (465G, MD5: af0c821d10112305c0cd25a495bd4f26) - System call traces for each sample
  • avclass.csv (3.1M, MD5: 5ceee211cf45dda6e67a945e4e6bae4b) - AVClass results for each sample
  • recordings.csv (4.8M, MD5: 7b6b4b4903734f43d55b6af03c04f5c7) - Recording timestamp and instruction counts
  • insthist_results (281M) - Histogram data for instruction mnemonics (aggregate)
  • insthist_window_results (2.2T) - Histogram data for instruction mnemonics (rolling window of 1000 instructions)

Getting Started

First, you will need to get and install PANDA 1.0 (not PANDA 2.0). You can find instructions on setting up PANDA in the user's manual.

Each recording is distributed as a .txz archive containing two files:

  1. A .patch file, listing the reference snapshot and the differences needed to reconstruct the specific snapshot for this replay
  2. The -rr-nondet.log file, PANDA's log of the nondeterministic events recorded during execution
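
To confirm that a downloaded recording contains both files before unpacking it, the archive can be inspected with Python's standard tarfile module. This is only a sketch: list_recording is a hypothetical helper, and the logs/rr/ prefix mentioned in the comment is inferred from the paths used in the replay steps rather than from a documented layout.

```python
import tarfile


def list_recording(archive_path):
    """Return (name, size in bytes) for each regular file in a .txz archive."""
    # "r:xz" opens an xz-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:xz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


# Example use, with UUID standing in for a real recording name:
#   for name, size in list_recording("UUID.txz"):
#       print(size, name)
# One would expect to see a .patch file and a -rr-nondet.log file,
# presumably under logs/rr/.
```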

To run a replay:

  1. Unpack the reference snapshots:
        $ tar xJvf references.tar.xz
  2. Unpack the replay you want to run (named, say, UUID):
        $ tar xJvf UUID.txz
  3. Create the patched snapshot:
        $ python tools/bpatch.py logs/rr/UUID.patch
        Using reference logs/rr/references/0000c18e-a947-42ea-abb2-234ea18facdc-rr-snp as a base
        Creating patched snapshot logs/rr/UUID-rr-snp
        All done, no errors.
  4. Run your replay:
        $ /path/to/qemu-system-x86_64 -replay logs/rr/UUID -m 1G
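
When replaying many recordings, the four steps above can be wrapped in a small script. The sketch below simply assembles the same commands shown in the steps. The function names are illustrative, /path/to/qemu-system-x86_64 remains a placeholder for your PANDA 1.0 build, and the script assumes it runs from the directory holding references.tar.xz, the .txz file, and tools/. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def replay_commands(uuid, qemu="/path/to/qemu-system-x86_64", unpack_references=True):
    """Build the command lines from the steps above for one recording."""
    cmds = []
    if unpack_references:
        # Step 1: only needed once, so it can be skipped for later recordings.
        cmds.append(["tar", "xJvf", "references.tar.xz"])
    # Step 2: unpack the recording itself.
    cmds.append(["tar", "xJvf", f"{uuid}.txz"])
    # Step 3: rebuild this recording's snapshot from its reference snapshot.
    cmds.append(["python", "tools/bpatch.py", f"logs/rr/{uuid}.patch"])
    # Step 4: run the replay under PANDA.
    cmds.append([qemu, "-replay", f"logs/rr/{uuid}", "-m", "1G"])
    return cmds


def run_replay(uuid, dry_run=True, **kwargs):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in replay_commands(uuid, **kwargs):
        print("$ " + shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_replay("UUID")  # dry run: prints the four commands without running them
```

Passing unpack_references=False avoids unpacking the reference snapshots again for every recording after the first.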