Faster classification using compression analytics
Ting, Christina T.; Johnson, Nicholas; Onunkwo, Uzoma O.; Tucker, James D.
Compression analytics have recently gained interest for malware classification and digital forensics. They rely on measured similarity between byte sequences in datasets and require no prior feature extraction; in other words, these methods are featureless. Being featureless makes compression analytics particularly appealing for computer security applications, where good static features are either unknown or easily circumvented by adversaries. However, previous classification methods based on compression analytics relied on algorithms whose cost scaled with both the size of each labeled class and the number of classes. In this work, we introduce an approach that, in addition to being featureless, performs fast and accurate inference at a cost that is independent of the size of each labeled class. Our method calculates a representative sample, the Fréchet mean, for each labeled class and uses it at inference time. We introduce a greedy algorithm for calculating the Fréchet mean and evaluate its utility for classification across a variety of computer security applications, including authorship attribution of source code, file fragment type detection, and malware classification.
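The idea of classifying against a single representative per class can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the similarity measure is the normalized compression distance (NCD) computed with zlib, and the representative is restricted to an existing training sample (a medoid, i.e. the sample minimizing the sum of squared distances to its class). The function names `ncd`, `medoid`, `fit`, and `predict` are hypothetical, and the authors' greedy Fréchet mean algorithm may differ substantially.

```python
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes, using zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte sequences."""
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)


def medoid(samples: list) -> bytes:
    """Assumption: approximate the Frechet mean by the training sample
    that minimizes the sum of squared NCDs to all samples in its class."""
    return min(samples, key=lambda c: sum(ncd(c, s) ** 2 for s in samples))


def fit(classes: dict) -> dict:
    """Compute one representative per labeled class (done once, offline)."""
    return {label: medoid(samples) for label, samples in classes.items()}


def predict(reps: dict, x: bytes) -> str:
    """Assign x to the class whose representative is nearest under NCD.
    This takes one distance evaluation per class, regardless of class size."""
    return min(reps, key=lambda label: ncd(reps[label], x))
```

In this sketch, inference costs one compression-distance evaluation per class, whereas a nearest-neighbor scheme over all labeled samples would cost one evaluation per training sample. Computing each representative happens once, before inference.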