HP Threat Research Blog

December 6, 2019 Category: Threat Research By: Alex Holland

Tracking Malware Campaigns Using String Metrics


A requirement of effective network security monitoring is situational awareness, which simply means understanding benign and malicious activity in a network. Situational awareness enables network defenders to establish baselines of normal behaviour, freeing up analyst capacity to investigate anomalous activity that might be malicious. One aspect of situational awareness is perceiving potential and actual threats, such as phishing campaigns, as they develop over time. Visualising data and presenting them through dashboards is a useful analytic technique for understanding the threats facing a network at a glance. In this post, we describe how you can apply string metrics to email logs to identify potential malware campaigns and visualise the results as similarity link charts. We have also released an accompanying tool, graph_similar_strings.py, that generates similarity link charts.[1]

Pattern Recognition

The human brain is excellent at recognising visual and auditory patterns.[2] This ability to recognise patterns lends itself to analysing security data. For example, phishing emails of the same campaign tend to contain similarities in subject lines, attachment names, header information and message text that an analyst can spot. The problem is that analyst time is scarce and email logs are extremely noisy, which means that it’s impractical for an analyst to examine every email log. As a result, security teams can miss patterns of potentially malicious activity in email logs, leaving a gap in situational awareness. A solution is to automate the process of spotting patterns, which we describe here. String metrics are one tool that we can use for this purpose.

Introduction to String Metrics

String metrics, also called string similarity functions, measure how similar two strings are.[3] They express similarity as a distance between the strings: the smaller the distance, the more alike the strings are. By setting distance thresholds it’s possible to use string metrics to identify strings that are similar but not identical. This is a useful property for spotting patterns in email logs because even if an attacker varies some characters in the string being analysed, such as a subject line or attachment name, the distance between two emails of the same campaign is likely to be small.

String metrics are a complex topic and here we only scratch the surface of what’s possible by applying them to one problem. We’ve listed some resources at the end of the article if you want to learn more about how they work and their application to other scenarios. For this project we compared three string metrics: Jaccard, Hamming and Levenshtein. Each metric takes two strings to compare, which we refer to as A and B.

Here are some simple examples to illustrate how each metric works using Michaël Meyer’s Python distance library.[4]

Jaccard Distance

The Jaccard index of two strings is the size of the intersection of their character sets, A and B, divided by the size of their union; the Jaccard distance is one minus the index. For example, to find the Jaccard index of the strings “hello” and “world” we first calculate the size of the union by adding the number of unique characters in each set (4 + 5 = 9) and subtracting the number of characters that are shared, i.e. the intersection. In this case, the characters “l” and “o” are shared, so the size of the union is 7 (9 – 2 = 7). Finally, we divide the size of the intersection by the size of the union (2 / 7 ≈ 0.29), which gives a Jaccard distance of about 0.71. Jaccard is quick to compute, produces a normalised result (an index of 0 means no characters are shared while 1 means all characters are shared) and can compare strings of different lengths. The downside of Jaccard is that it isn’t sensitive to character order or duplication.

>>> import distance
>>> jaccard_distance = distance.jaccard('hello', 'world')  # returns the distance, not the index
>>> 1 - jaccard_distance  # Jaccard index, approximately 0.29
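
To make the arithmetic above concrete, here is a minimal sketch of the same calculation using only Python's built-in sets. The function names jaccard_index and jaccard_distance are our own for illustration and are not part of the distance library.

```python
def jaccard_index(a: str, b: str) -> float:
    """Size of the intersection of the character sets divided by the size of their union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def jaccard_distance(a: str, b: str) -> float:
    """The Jaccard distance is the complement of the index."""
    return 1.0 - jaccard_index(a, b)


print(round(jaccard_index("hello", "world"), 2))     # 0.29
print(round(jaccard_distance("hello", "world"), 2))  # 0.71
# Character order and duplication are ignored, so these two strings look identical:
print(jaccard_distance("invoice", "voiceinn"))       # 0.0
```

The last line illustrates the downside mentioned above: two visibly different strings can have a distance of zero if they use the same set of characters.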

Hamming Distance

The Hamming distance between two strings is the number of positions at which their characters differ. For example, to find the Hamming distance between “hello” and “world” we compare the strings position by position and count the mismatches (4). Hamming is quick to compute but has a significant shortcoming in that it can only compare strings that are the same length.

Position     0     1     2     3      4
String A     h     e     l     l      o
String B     w     o     r     l      d
Different?   True  True  True  False  True

>>> import distance
>>> distance.hamming('hello', 'world')
4
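
The position-by-position comparison in the table can be sketched in a few lines of plain Python. This is an illustrative implementation with a function name of our own choosing, not the code used by the distance library.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming("hello", "world"))  # 4

# Attachment names from the same campaign often differ in length,
# which is where Hamming distance stops being usable:
try:
    hamming("invoice.doc", "invoice.docx")
except ValueError as err:
    print(err)
```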

Levenshtein Distance

The Levenshtein distance is the smallest number of single-character operations (insertions, deletions or substitutions) needed to change one string into another. Using our example again, to find the Levenshtein distance between “hello” and “world” we count the character operations required to turn one into the other (4). Of the three metrics, Levenshtein distance is the most expensive to compute, but it doesn’t have the shortcomings of Jaccard and Hamming in that it’s sensitive to character order and duplication and can compare strings of different lengths.

Position    0             1             2             3     4
String A    h             e             l             l     o
String B    w             o             r             l     d
Operation   Substitution  Substitution  Substitution  None  Substitution

>>> import distance
>>> distance.levenshtein('hello', 'world')
4
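
For readers who want to see how the count of operations is derived, the following is a minimal sketch of the standard dynamic-programming approach to Levenshtein distance. It is written for clarity rather than speed and is not the implementation used by the distance library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a                       # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))    # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]                     # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost,   # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein("hello", "world"))                # 4
print(levenshtein("invoice.doc", "invoice.docx"))   # 1
```

Unlike Hamming distance, the second call works even though the two filenames have different lengths.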

Overall, we chose Levenshtein as the string metric that was best suited to the requirements of the dashboard. Since the dashboard would be updated periodically rather than in real time, we weren’t concerned about speed. If speed is important, you may find that another string metric is more appropriate. You can also speed up the computation of Levenshtein distance by setting a limit on the length difference between strings and skipping pairs that exceed it. Because the Levenshtein distance between two strings is never smaller than the difference in their lengths, a limit equal to the distance threshold discards no matches.
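
The length-difference shortcut can be sketched as follows. The helper name within_threshold and the max_distance parameter are hypothetical, chosen for this example, and the Levenshtein function is a plain-Python stand-in for whichever implementation you use.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (stand-in implementation)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def within_threshold(a: str, b: str, max_distance: int) -> bool:
    """Return True if the two strings are at most max_distance edits apart."""
    # Cheap check first: the distance can never be smaller than the length difference,
    # so such pairs cannot match and the expensive calculation is skipped.
    if abs(len(a) - len(b)) > max_distance:
        return False
    return levenshtein(a, b) <= max_distance


print(within_threshold("invoice_1029.doc", "invoice_1031.doc", 2))  # True
print(within_threshold("a.doc", "quarterly_report.doc", 3))         # False
```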

After identifying what string metric to use, the next step is to gather and filter data for the dashboard.

Gathering and Refining Email Data

As of 2019, email is the most common initial access vector for malware, meaning that special focus should be placed on tracking email-borne threats.[5] The first task is obtaining email logs to analyse. In an enterprise setting, email logs could be obtained from an email gateway or from email threats isolated by Bromium Secure Platform. Since we’re interested in identifying potentially malicious email campaigns, you can filter out irrelevant data such as emails where the sender and recipient are on the same domain. As many phishing emails spoof the sender address to create more credible lures, verify that the emails originated from your domain by checking Sender Policy Framework (SPF), DomainKeys Identified Mail (DKIM) and Domain-based Message Authentication, Reporting, and Conformance (DMARC) information before filtering out data.[6][7][8]
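
As a rough illustration of this filtering step, the sketch below drops same-domain emails only when their authentication results pass. The record layout (sender, recipient, spf, dkim and dmarc fields holding result strings) is an assumption made for this example; real gateway logs will differ. Requiring all three checks to pass is a deliberately conservative choice of ours, not a rule from the tool.

```python
def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def is_verified_internal(record: dict) -> bool:
    """True only for same-domain mail whose SPF, DKIM and DMARC results all passed."""
    same_domain = domain_of(record["sender"]) == domain_of(record["recipient"])
    authenticated = all(record.get(check) == "pass" for check in ("spf", "dkim", "dmarc"))
    return same_domain and authenticated


def filter_candidates(records: list) -> list:
    """Keep everything except verified internal mail."""
    return [record for record in records if not is_verified_internal(record)]


# Hypothetical log records for illustration only.
records = [
    {"sender": "alice@example.com", "recipient": "bob@example.com",
     "spf": "pass", "dkim": "pass", "dmarc": "pass"},   # genuine internal mail
    {"sender": "ceo@example.com", "recipient": "bob@example.com",
     "spf": "fail", "dkim": "none", "dmarc": "fail"},   # spoofed internal sender
    {"sender": "billing@example.net", "recipient": "bob@example.com",
     "spf": "pass", "dkim": "pass", "dmarc": "pass"},   # external sender
]

candidates = filter_candidates(records)
print([record["sender"] for record in candidates])
# ['ceo@example.com', 'billing@example.net']
```

The spoofed message stays in the dataset even though its sender appears to be on the recipient's domain, which is the reason for checking authentication results before filtering.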

You can also enrich the similarity link charts by highlighting anomalous data, such as unusual file extensions. Bromium data on isolated email threats suggests that there is a lot of diversity in the file extensions of malware delivered by email. For instance, in November 2019 we observed 47 unique file extensions for email-borne threats across our customers. The most common file extensions are those associated with Microsoft Office, but we also found many exotic archive formats containing malware (e.g. ARJ, UUE, Z). Based on this observation, we can highlight uncommon formats because these are more likely to be malicious.

G_ R24
GZ R29

Table 1 – File extensions of malware delivered by email, November 2019.
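
One simple way to highlight uncommon formats is to compare each attachment's extension against a list of extensions you consider routine. The sketch below assumes such a list; the contents of COMMON_EXTENSIONS and the sample filenames are hypothetical and should be replaced with values from your own environment.

```python
from collections import Counter
from pathlib import PurePath

# Assumption: extensions treated as routine in this environment.
COMMON_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".pdf"}


def extension_of(filename: str) -> str:
    """Return the final extension of a filename in lower case, e.g. '.arj'."""
    return PurePath(filename).suffix.lower()


def uncommon_attachments(filenames: list) -> list:
    """Return the filenames whose extension is not in COMMON_EXTENSIONS."""
    return [name for name in filenames if extension_of(name) not in COMMON_EXTENSIONS]


# Hypothetical attachment names for illustration only.
names = ["invoice.doc", "payment.ARJ", "scan.uue", "report.xlsx", "order.Z"]

# Frequency of each extension, useful for deciding what counts as routine.
print(Counter(extension_of(name) for name in names).most_common())

print(uncommon_attachments(names))  # ['payment.ARJ', 'scan.uue', 'order.Z']
```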

Introducing graph_similar_strings.py

To demonstrate how to analyse email logs using string metrics and visualise the results, we’ve published a Python script called graph_similar_strings.py.[1] It reads a list of strings and generates link charts that cluster similar strings together based on a chosen string metric and distance threshold.

The script can be used to create a dashboard to maintain situational awareness of potentially malicious email campaigns by clustering similar strings, such as filenames and subject lines, together. To use the script, supply a text file containing a list of strings. By default, the script outputs DOT files, which can be rendered as images using the NetworkX and Graphviz graph generation and visualisation libraries.[9][10] If Graphviz is in your PATH, the script can export to SVG (recommended) or PNG images for you. We recommend tuning the distance thresholds to your environment by first running the script against a sample dataset from your network.
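
To show the general idea of turning string distances into a link chart, here is a minimal standard-library sketch that links every pair of strings within a Levenshtein threshold and writes the result as DOT text. It is not the published script: the function names, the max_distance parameter and the sample filenames are our own, and the real tool may work differently.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (stand-in implementation)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity_edges(strings: list, max_distance: int) -> list:
    """Return pairs of unique strings that are at most max_distance edits apart."""
    unique = sorted(set(strings))
    return [(a, b) for a, b in combinations(unique, 2)
            if abs(len(a) - len(b)) <= max_distance
            and levenshtein(a, b) <= max_distance]


def to_dot(strings: list, max_distance: int) -> str:
    """Render the similarity graph as DOT text for an undirected graph."""
    def quote(text: str) -> str:
        return '"' + text.replace('"', '\\"') + '"'

    lines = ["graph similar_strings {"]
    for node in sorted(set(strings)):
        lines.append(f"    {quote(node)};")
    for a, b in similarity_edges(strings, max_distance):
        lines.append(f"    {quote(a)} -- {quote(b)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical attachment names for illustration only.
filenames = ["invoice_1029.doc", "invoice_1031.doc", "holiday_photos.zip"]
print(to_dot(filenames, max_distance=2))
```

If the output is saved to a file such as chart.dot, Graphviz can render it with a command like dot -Tsvg chart.dot -o chart.svg.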

Figure 1 – Attachment name similarity link chart generated with graph_similar_strings.py. Similar filenames are clustered together.

Practical Example – Tracking Malware Campaigns using Similar Filenames

Figures 2 to 4 are filename similarity link charts that show the development of a subset of Emotet malspam campaign activity over three months (September to November 2019). Each node is a unique filename that is connected by edges to other filenames that met the distance threshold. By regularly generating link charts, you can understand the scale and nature of malware campaigns at a glance and gain better situational awareness of email-borne threats as they develop.

Figure 2 – Filename similarity link chart for an Emotet malspam campaign in September 2019 (month 1).

Figure 3 – Filename similarity link chart for an Emotet malspam campaign in October 2019 (month 2).

Figure 4 – Filename similarity link chart for an Emotet malspam campaign in November 2019 (month 3).
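
If you want to track the charts numerically as well as visually, one option is to group linked filenames into clusters and record how large each cluster is from month to month. The sketch below finds connected groups in a list of nodes and edges; the input data and function name are hypothetical, and this step is our own suggestion rather than a feature of the script.

```python
from collections import defaultdict


def clusters(nodes: list, edges: list) -> list:
    """Group nodes into connected clusters, largest first."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    result = []
    for node in nodes:
        if node in seen:
            continue
        # Depth-first walk collecting every node reachable from this one.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        result.append(sorted(component))
    return sorted(result, key=len, reverse=True)


# Hypothetical nodes (unique filenames) and edges (pairs within the threshold).
nodes = ["a1.doc", "a2.doc", "a3.doc", "b.zip"]
edges = [("a1.doc", "a2.doc"), ("a2.doc", "a3.doc")]

for cluster in clusters(nodes, edges):
    print(len(cluster), cluster)
# 3 ['a1.doc', 'a2.doc', 'a3.doc']
# 1 ['b.zip']
```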

Further Reading

String metrics can be used in other detection scenarios too. For instance, they can be used to identify typosquatted domain names and malicious files that are named to closely resemble legitimate system files.[11] To learn more, we recommend the following resources:

  • Metcalf, Leigh and Casey, William, Cybersecurity and Applied Mathematics (Cambridge, MA: Syngress, 2016)
  • Collins, Michael, Network Security Through Data Analysis (Sebastopol, CA: O’Reilly, 2017), 2nd Ed.
  • Saxe, Joshua and Sanders, Hillary, Malware Data Science (San Francisco, CA: No Starch Press, 2018)

[1] https://github.com/cryptogramfan/Malware-Analysis-Scripts/tree/master/graph_similar_strings

[2] Mattson, Mark P. “Superior pattern processing is the essence of the evolved human brain.” Frontiers in neuroscience vol. 8 265. 22 Aug. 2014, doi:10.3389/fnins.2014.00265

[3] https://en.wikipedia.org/wiki/String_metric

[4] https://pypi.org/project/Distance/

[5] Verizon, 2019 Data Breach Investigations Report, p. 13, https://enterprise.verizon.com/resources/reports/2019-data-breach-investigations-report.pdf

[6] https://en.wikipedia.org/wiki/Sender_Policy_Framework

[7] https://en.wikipedia.org/wiki/DomainKeys_Identified_Mail

[8] https://en.wikipedia.org/wiki/DMARC

[9] https://networkx.github.io/

[10] https://www.graphviz.org/

[11] https://attack.mitre.org/techniques/T1036/
