Malware-DNA: Machine Learning for Malware Analysis that Treats Malwares as Mutations in the Genome of the Software

Abstract

Malware is one of the most dangerous and costly cyber threats to organizations, the public, and national security, and a crucial factor in modern warfare. Adoption of ML-based solutions against malware threats has been relatively slow despite the potential cost savings. Most prior ML-based malware defenses do not sufficiently address the following real-world challenges: the cost of obtaining labeled malware, the poor generalization of supervised solutions to new malware, training and testing under class imbalance where both rare and prominent malware families are present, and the need to identify novel malware families. Cybersecurity analysts regularly go through large quantities of malware samples to determine whether a new specimen belongs to a previously known malware family. Classifying a new malware sample into a known family reduces the number of files analysts need to examine and aids in understanding the behavior of the malware, which helps in estimating the severity of the threat, developing mitigation strategies, and reducing the cost and time spent on malware analysis. We have developed a new ML method, named Malware-DNA, for malware family classification, characterization, and identification that achieves state-of-the-art results and addresses major shortcomings in the field. Malware-DNA treats malware as analogous to genomic DNA and explores the hidden hierarchical structure of malware data without prior knowledge, using ideas from our SmartTensors AI Platform (2021 R&D100 winner in the IT category and recipient of an R&D100 2021 Market Disruptor Bronze Medal). This enables the discovery of the multi-structure composition of malware and the separation of mixed latent features. The hierarchical exploration is based on semi-supervised and unsupervised ML techniques, yielding better generalization to new malware data.
Under the DNA analogy, our approach follows ideas from our recent ML methods for human cancer: mutations to the genome can cause various inherited diseases, such as certain cancers. Similarly, this project treats malware as malicious mutations (e.g., cancer) in the software genome (i.e., the computer code), and targets the extraction and recognition of these new mutational malware signatures. The hierarchical approach is a tensor factorization technique that incorporates automatic model determination. Combined with the ability to make abstaining predictions (selective classification, i.e., predicting “we do not know what this is”), it allows Malware-DNA to identify and classify both rare and prominent malware, work under label imbalance, maintain accuracy even when only a small quantity of labeled data is available, and detect novel malware (i.e., malware without labels or software that we have not seen before). Our method first builds an archive of latent multi-modal signatures (ALMAS) whose combinations describe and characterize complex data. This archive is then used for rapid/real-time characterization and classification of new data and for detecting unknown or novel phenomena. In our preliminary studies, we created a catalog of malware from multiple families and benign specimens with static analysis features from the EMBER-2018 dataset [6]. We select one malware class to be the novel family. We showcase the performance of our approach by classifying the malware families and benign-ware and detecting the specimens belonging to the novel malware family, and we report our results with the Area Under the Risk-Coverage Curve (AURC) score.
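
The abstract does not spell out how the AURC score is computed. As a minimal sketch, assuming the commonly used definition from the selective classification literature (rank predictions by confidence, then average the error rate of the retained predictions over all coverage levels), it could look like the following. The function names `risk_coverage_curve` and `aurc` and the toy inputs are hypothetical and are not taken from the Malware-DNA implementation.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) arrays for a selective classifier.

    confidence: per-sample confidence scores (higher = more certain).
    correct:    per-sample flags, 1 if the prediction was right, else 0.
    """
    confidence = np.asarray(confidence, dtype=float)
    errors = 1.0 - np.asarray(correct, dtype=float)

    # Keep the most confident predictions first; the least confident
    # ones are the first to be abstained on as coverage shrinks.
    order = np.argsort(-confidence, kind="stable")
    sorted_errors = errors[order]

    counts = np.arange(1, len(sorted_errors) + 1)
    coverage = counts / len(sorted_errors)
    # Risk at coverage k/n is the error rate among the top-k samples.
    risk = np.cumsum(sorted_errors) / counts
    return coverage, risk


def aurc(confidence, correct):
    """Area under the risk-coverage curve (lower is better)."""
    _, risk = risk_coverage_curve(confidence, correct)
    return float(risk.mean())


if __name__ == "__main__":
    # Toy example: four predictions, one of them wrong.
    conf = [0.9, 0.8, 0.6, 0.3]
    hit = [1, 1, 0, 1]
    print(aurc(conf, hit))
```

A lower AURC means the model's errors are concentrated among its least confident predictions, which are exactly the ones an abstaining classifier would decline to label.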

Publication
Presented at the Conference on Data Analysis 2023 (CoDA 23), Santa Fe, New Mexico. March 7-9, 2023

Keywords:

malware analysis, non-negative matrix factorization, large data, reject-option, selective classification

Citation:

Eren, M.E., Rasmussen, K.O., Nicholas, C., and Alexandrov, B. Malware-DNA: Machine Learning for Malware Analysis that Treats Malwares as Mutations in the Genome of the Software. CoDA ’23: Conference on Data Analysis, March 7-9, 2023, Santa Fe, New Mexico, USA.

BibTeX:

@INPROCEEDINGS{eren_coda_2023,
  author={M. E. {Eren} and K.O. {Rasmussen} and C. {Nicholas} and B. S. {Alexandrov}},
  booktitle={Conference on Data Analysis 2023 (CoDA '23)},
  title={Malware-DNA: Machine Learning for Malware Analysis that Treats Malwares as Mutations in the Genome of the Software}, 
  year={2023}}
Maksim E. Eren
Scientist

My research interests lie at the intersection of the machine learning and cybersecurity disciplines, with a concentration in tensor decomposition.