Unsupervised Binary Code Translation for Low-Resource Architectures with Applications to Vulnerability Discovery and Malware Detection

This research proposes to apply the ideas and techniques of neural machine translation (NMT) to binary code analysis by translating a binary from a low-resource instruction set architecture (ISA) into an equivalent binary in a high-resource ISA. A single model can then be trained on the high-resource ISA and reused across other ISAs.
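
To make the NMT analogy concrete, the sketch below shows one common way binary code can be rendered as a "sentence" of tokens that a translation model could consume. It is only an illustration, not the project's implementation: the use of the capstone disassembler, the immediate-masking rule, and the token format are all assumptions.

```python
# Minimal sketch (illustrative only): turn raw machine code into an NMT-style
# token sequence. Each instruction becomes one token; immediates and addresses
# are masked to "IMM" to keep the vocabulary small (a common, assumed choice).
import re
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def tokenize_x86(code_bytes: bytes, base_addr: int = 0x1000) -> list[str]:
    """Disassemble raw x86-64 bytes and emit one normalized token per instruction."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    tokens = []
    for insn in md.disasm(code_bytes, base_addr):
        ops = re.sub(r"0x[0-9a-f]+", "IMM", insn.op_str)  # mask constants/addresses
        tokens.append(f"{insn.mnemonic}_{ops}".strip("_").replace(" ", ""))
    return tokens

# Example: a tiny function prologue rendered as a "sentence" of instruction tokens.
print(" ".join(tokenize_x86(bytes.fromhex("554889e5b82a000000"))))
# e.g. -> "push_rbp mov_rbp,rsp mov_eax,IMM"
```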

Funded by the CCI Hub


Project Investigator

Principal Investigator (PI): Lannan Lisa Luo, Associate Professor, George Mason University Computer Science Department

Rationale and Background

Binary analysis techniques can identify vulnerabilities in software components within the supply chain, finding security flaws such as buffer overflows, memory corruption, or other programming errors that attackers can exploit.

In addition, software components are not immune to malware attacks, which can disrupt critical operations and compromise sensitive data.

With diverse ISAs on the market, training a deep-learning model requires a large amount of data, which is a challenge for ISAs where such data is scarce.

For instance, acquiring a large dataset of PowerPC malware is challenging. Moreover, given a binary analysis task and multiple ISAs, training one model per ISA takes considerable time and effort for data collection, labeling and cleaning, and parameter tuning.

Methodology

Researchers propose retargeted-architecture binary code analysis to alleviate the per-ISA effort and cope with the data scarcity issue.

To conduct this analysis, researchers will design an unsupervised binary code translation model.

This will advance binary analysis by developing a bridge that facilitates model reuse across ISAs, and demonstrate its applications in vulnerability discovery, code clone detection, and malware classification.

Researchers plan to train a single model on a high-resource ISA (x86) and reuse it for other ISAs without modification by translating a binary from a low-resource ISA into the high-resource ISA.

Following this translation, a model that has been trained with rich data for the high-resource ISA can then be used to analyze the translated binary.

This approach eliminates the need to collect data for multiple ISAs, as well as the per-ISA fine-tuning effort.
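
The overall workflow might look like the schematic sketch below. Every component here is a labeled stand-in rather than the project's actual code: the translation function, the x86 analyzer, and the toy token mapping are all hypothetical and serve only to show how an x86-trained model could be reused unchanged.

```python
# Schematic sketch of the proposed reuse pipeline (stand-in components only):
# 'translate_to_x86' represents the unsupervised translation model, and
# 'x86_analyzer' represents any analysis model already trained on x86 data.

def translate_to_x86(low_resource_tokens: list[str]) -> list[str]:
    """Stand-in for the unsupervised NMT model: maps, e.g., ARM instruction
    tokens to an equivalent x86 token sequence."""
    demo_map = {"mov_r0,IMM": "mov_eax,IMM", "bx_lr": "ret"}  # toy mapping for illustration
    return [demo_map.get(tok, tok) for tok in low_resource_tokens]

def x86_analyzer(x86_tokens: list[str]) -> str:
    """Stand-in for a model trained with rich x86 data (vulnerability detector,
    clone detector, malware classifier, ...)."""
    return "benign" if "ret" in x86_tokens else "suspicious"

# Pipeline: low-resource binary -> translated x86 "sentence" -> existing x86 model.
arm_tokens = ["mov_r0,IMM", "bx_lr"]               # tokens from a (toy) ARM function
verdict = x86_analyzer(translate_to_x86(arm_tokens))
print(verdict)                                      # the x86 model is reused unchanged
```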

Projected Outcomes

The research will advance binary analysis by facilitating model reuse across ISAs, and propel its various security applications, such as vulnerability discovery and malware detection.

This will save the computing resources needed to train a large number of models, as well as the effort of collecting datasets for each ISA, especially low-resource ISAs.

Researchers, companies, and government agencies will be able to apply models to the security analysis of binaries across ISAs.