What is Malrec?
Malware sandbox systems have become a critical part of the Internet's defensive infrastructure. These systems allow malware researchers to quickly understand a sample's behavior and effect on a system. However, current systems face two limitations: first, for performance reasons, the amount of data they can collect is limited (typically to system call traces and memory snapshots). Second, they lack the ability to perform retrospective analysis – that is, to later extract features of the malware's execution that were not considered relevant when the sample was originally executed. We have created a new malware sandbox system, Malrec, which uses PANDA's whole-system deterministic record and replay to capture high-fidelity, whole-system traces of malware executions with low time and space overheads. Here we present a new dataset of 66,301 malware recordings collected over a two-year period. The Malrec system and dataset can help provide a standardized benchmark for evaluating the performance of future dynamic analyses.
- malrec_dataset.tar (Torrent) (1.3T, MD5: 35699d63041b390ed794dd4c2e215246) - The record/replay logs; see the "Getting Started" section for details on how to use them.
Note that this is a very large file. The dataset can also be obtained (with some delay) by emailing Brendan Dolan-Gavitt and asking him to mail you a hard drive with the dataset (for the price of the drive + shipping).
- references.tar.xz (1.1G, MD5: a80e1c5740a8087e623a5d0b3b9100ac) - The reference snapshots (see below)
- tools.tar.xz (4.1K, MD5: e2b0a040ab2ab7057d9df42ee7876862) - Tools for unpacking traces
- virustotal.tar.gz (229M, MD5: d0879d306ae663159cfb26092847a422) - Antivirus labels for each sample
- uuid_md5.txt (4.5M, MD5: 10042d787a6f1566893f75e3d27c91cc) - Mapping between recording UUIDs and sample MD5s
- bagofwords_full.tar.gz (52G, MD5: bfa92e1acbfc3eab46f20438c2f18775) - Bag-of-words counts for textual data in memory accesses
- bagofwords_ranges.tar.gz (20G, MD5: 9e4149b2184febfa3996a5ef3029e6f5) - Bag-of-words counts for textual data in memory accesses (active malware ranges only)
- weighted_tfidf.tar.gz (12G, MD5: bdfd41cd67dba4cc9aa84e7bad7e0d4c) - Weighted TFIDF scores for textual data in memory accesses
- malrec_pcaps.tar.gz (73G, MD5: d8dfeb6c1c632338843fcae02ac0dcb1) - Network activity in PCAP form
- activityranges.tar.gz (4G, MD5: e0bc1ada961bca62069de6629b4ff9a1) - Instruction ranges where the malware sample was active
- sctext_results.tar (465G, MD5: af0c821d10112305c0cd25a495bd4f26) - System call traces for each sample
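Given the size of these archives, it can be worth checking a download against the MD5 listed above before unpacking it. The following is a minimal Python sketch, not part of the Malrec tools; the function name `md5_of_file` and the default chunk size are illustrative choices. It hashes the file in fixed-size chunks, so even the 1.3T archive never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so memory use stays constant for huge files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For an intact download, `md5_of_file("tools.tar.xz")` should return `e2b0a040ab2ab7057d9df42ee7876862`, the value listed for that file.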
Each recording is distributed as a .txz archive containing two files:
- A .patch file, listing the reference snapshot and the differences needed to obtain the specific snapshot for this replay
- A -rr-nondet.log file, the log of non-deterministic events needed to replay the execution
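To confirm that a downloaded recording has this layout before replaying it, the archive can be inspected without extracting it. This is a hedged sketch using Python's standard `tarfile` module; the helper name `recording_members` is hypothetical and it assumes only the two file suffixes described above.

```python
import tarfile


def recording_members(txz_path):
    """Return (patch_files, nondet_logs) found in a recording archive."""
    # "r:xz" opens an xz-compressed tar archive for reading.
    with tarfile.open(txz_path, "r:xz") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    patches = [n for n in names if n.endswith(".patch")]
    logs = [n for n in names if n.endswith("-rr-nondet.log")]
    return patches, logs
```

A well-formed recording should yield exactly one entry in each list.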
To run a replay:
- Unpack the reference snapshots:
$ tar xJvf references.tar.xz
logs/rr/references/0000c18e-a947-42ea-abb2-234ea18facdc-rr-snp
logs/rr/references/0002f074-cd1b-4523-aacd-eeccd61c0f96-rr-snp
logs/rr/references/00568419-706b-4c2e-ad3a-4de0add3780d-rr-snp
logs/rr/references/023c870e-4be8-4f1c-a712-340e21c67565-rr-snp
logs/rr/references/0335fe75-a7bd-4963-8304-da7e59005692-rr-snp
logs/rr/references/097b607a-735e-4ac0-b853-c15dc58b58fc-rr-snp
logs/rr/references/1b5091e3-98a5-4058-a944-c5d6f87fe103-rr-snp
logs/rr/references/5ad7f823-1b5f-4f99-8f2b-53bf69e0fc08-rr-snp
- Unpack the replay you want to run (named, say, UUID):
$ tar xJvf UUID.txz
logs/rr/UUID.patch
logs/rr/UUID-rr-nondet.log
- Create the patched snapshot:
$ python tools/bpatch.py logs/rr/UUID.patch
Using reference logs/rr/references/0000c18e-a947-42ea-abb2-234ea18facdc-rr-snp as a base
Creating patched snapshot logs/rr/UUID-rr-snp
All done, no errors.
- Run your replay:
$ /path/to/qemu-system-x86_64 -replay logs/rr/UUID -m 1G
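When replaying many recordings, the per-recording steps above (unpack, patch, replay) can be scripted. The sketch below is one possible wrapper, not part of the Malrec tools: the function names are hypothetical, and it assumes the reference snapshots are already unpacked, that `UUID.txz` and `tools/` are in the current directory, and that the QEMU path is adjusted to your installation.

```python
import subprocess


def replay_commands(uuid, qemu="/path/to/qemu-system-x86_64", memory="1G"):
    """Return the argv lists for unpacking, patching and replaying a recording."""
    base = "logs/rr/" + uuid
    return [
        ["tar", "xJvf", uuid + ".txz"],                   # unpack the recording
        ["python", "tools/bpatch.py", base + ".patch"],   # create the patched snapshot
        [qemu, "-replay", base, "-m", memory],            # run the replay
    ]


def run_replay(uuid, **kwargs):
    """Run each step in order, stopping at the first command that fails."""
    for argv in replay_commands(uuid, **kwargs):
        subprocess.run(argv, check=True)
```

Keeping the command construction separate from execution makes it easy to print and inspect the commands before running anything.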