Monday, March 25, 2024

Analyse, hunt and classify malware using .NET metadata

Introduction

Last week, I ran into a sample that turned out to be PureCrypter, a loader and obfuscator for many kinds of malware, such as Agent Tesla and RedLine.

Upon further investigation, I developed Yara rules for the various stages, which can be found here (excluding the final payload):

With that out of the way, all of this reminded me of the fact that we can also write Yara rules for unique identifiers specific to malware written in .NET, or any other .NET assemblies for that matter.

A bit of history

This isn’t my first encounter with analysing .NET malware at scale: several years ago, I co-authored a presentation with Santiago on hunting SteamStealer malware, which was surging exponentially at the time (the malware intended to steal your Steam inventory items and/or your account). A huge thanks goes to Brian Wallace who had developed a tool at the time called GetNetGUIDs which made it trivial to extract all the GUID types and start clustering to identify patterns: basically, which of the malware samples are likely authored by the same person or belong to the same attack campaign.

.NET assemblies or binaries often contain all sorts of metadata, such as the internal assembly name and GUIDs: specifically, the MVID and the TYPELIB.

  • GUID: Also known as the TYPELIB ID, generated when creating a new project.

  • MVID: Module Version ID, a unique identifier for a .NET module, generated at build time.

  • TYPELIB: the TYPELIB version, or number of the type library (think major & minor version).

These specific identifiers can be parsed with the strings command and a simple regular expression (regex): [a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}
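
As a rough Python equivalent of the strings-plus-regex approach, the sketch below scans raw bytes for GUID-looking text in both ASCII and UTF-16LE form. The function name find_guid_strings is made up for this illustration, and like the manual approach it returns every GUID-shaped string it finds, not just the Typelib.

```python
import re


def _guid_pattern(char_suffix):
    """Build the 8-4-4-4-12 GUID regex; char_suffix is appended to every character."""
    hex_char = rb"[a-fA-F0-9]" + char_suffix
    dash = rb"-" + char_suffix
    return re.compile(dash.join(hex_char * n for n in (8, 4, 4, 4, 12)))


ASCII_GUID = _guid_pattern(b"")        # plain one-byte characters
WIDE_GUID = _guid_pattern(rb"\x00")    # UTF-16LE: each character followed by a NUL byte


def find_guid_strings(data):
    """Return GUID-looking strings found in data, lowercased, in order, without duplicates."""
    found = [m.group().decode("ascii").lower() for m in ASCII_GUID.finditer(data)]
    found += [m.group().decode("utf-16-le").lower() for m in WIDE_GUID.finditer(data)]
    return list(dict.fromkeys(found))


# Usage on a file (the path is a placeholder):
#   with open("sample.bin", "rb") as handle:
#       print(find_guid_strings(handle.read()))
```

This only finds GUIDs stored as text, so a value kept in binary form will not show up here.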

Taking a sample of PureLogStealer posted by James_in_the_box, you could then write a Yara rule based on the MVID or Typelib detected.

As shown on VirusTotal for this sample:

Figure 1 - Sample with MVID 9066ee39-87f9-4468-9d70-b57c25f29a67

And the resulting (simple) Yara rule could then be as follows:

rule PureLogStealer_GUID

{

strings:

$mvid = "9066ee39-87f9-4468-9d70-b57c25f29a67" ascii wide fullword

condition:

$mvid

}

There are however some issues with this: 

  • The MVID is stored as a binary value rather than a string, whereas the Typelib GUID is effectively stored as a string. Since we only have the MVID here, the sample above will not be detected with this rule.

  • It is important to note that VirusTotal does not seem to report the Typelib.

  • It is cumbersome to “do it the manual way” with strings and regex, especially on larger data sets – and it’s prone to issues such as:

    • false positives: if you run "strings" on the sample and then use the following CyberChef recipe – we get plenty of GUIDs, but only 1 is the actual Typelib;

    • false negatives: we miss out on unique identifiers, which means we might miss detection of samples, campaigns or actors.
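
To illustrate the first issue, the sketch below converts the textual MVID into the 16-byte layout that .NET uses for a Guid (the first three fields are little-endian) and renders it as a Yara hex string. The helper names are made up for this example. A raw byte search like this can hit anywhere in the file, not only in the GUID stream, so treat a match as a hint rather than proof.

```python
import uuid


def guid_to_disk_bytes(guid_text):
    """Return the 16-byte mixed-endian form of a GUID, as produced by uuid.UUID.bytes_le."""
    return uuid.UUID(guid_text).bytes_le


def guid_to_yara_hex(guid_text):
    """Render the binary GUID as a Yara hex string such as { 39 EE ... }."""
    raw = guid_to_disk_bytes(guid_text)
    return "{ " + " ".join(f"{byte:02X}" for byte in raw) + " }"


def contains_binary_guid(data, guid_text):
    """Check whether the binary form of the GUID occurs anywhere in data."""
    return guid_to_disk_bytes(guid_text) in data


# Example: hex string for the MVID shown in Figure 1.
print(guid_to_yara_hex("9066ee39-87f9-4468-9d70-b57c25f29a67"))
```

The printed value could be used as a hex string in the strings section of a rule in place of the text string, which is one way around the binary-versus-string mismatch when the dotnet module is not available.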

Note that tools such as ILSpy or dnSpy(Ex) can also show the Typelib GUID and MVID; however, not all tools display all data. For example:

Figure 2 - dnSpy detects the Typelib GUID of the sample

And if we go the "oldschool" route using ildasm:

Figure 3 - ildasm displays the MVID or Module Version ID


For all the above reasons, let’s go beyond and do more: both with Yara, and with a new Python tool I’ve created.

The now and the tooling

Before we dive into the tooling, one final bit of history: Yara has evolved, and we can now hunt and detect more effectively thanks to the following modules:

  • 2017: introduction of .NET module (link)

  • 2022: introduction of console module (link)

This means that using the .NET module, we can now write a Yara rule like so instead:

import "dotnet"

rule PureLogStealer_GUID

{

condition:

dotnet.guids[0]== "9066ee39-87f9-4468-9d70-b57c25f29a67"

}

And indeed:

Figure 4 - Yara now detects the sample

Yara rule

Let’s now leverage the power of Yara and its dotnet and console modules to write a new Yara rule that displays useful data of any given .NET sample that can be leveraged to create meaningful rules, for example: assembly name, typelib and MVID. 

Figure 5 - Yara rule to display .NET information to the console

We first verify whether the binary is a .NET compiled file. If so, we log certain Portable Executable (PE) or binary information to the console as well, and then display all relevant .NET information.

And the output will be, again for the same sample:

Figure 6 - Yara rule output: sample metadata!


Meaning we can now write a rule as follows:

import "dotnet"

rule PureLogStealer_GUID

{

condition:

dotnet.guids[0]=="9066ee39-87f9-4468-9d70-b57c25f29a67" or

dotnet.typelib=="856e9a70-148f-4705-9549-d69a57e669b0"

}

Python tool

But what if we want to run this on a large set of samples and produce statistics, which we can then use to hunt or classify malware families, or cluster campaigns?

A newly developed Python tool will help you do exactly that. It supports both a single file and a whole folder of your samples or malware repository. It will skip over any non-.NET binary and simply report the typelib, MVID and typelib ID (if present, which is seldom the case and rarely useful).


If we run it on our single sample like before:

Figure 7 - New tool output on single sample


The tool (or script) has the following capabilities:

Figure 8 - Run the tool with -h to display usage or help

You need Python 3, pythonnet and a compiled dnlib.dll in order for it to work.

You are of course not limited to just using the MVID or Typelib for .NET malware hunting: you can also use the assembly name and other features that could be unique, using either the Yara rule or the Python tool to extract the data you’d like.
Both the Yara rule and the Python tool are published on the following GitHub page: https://github.com/bartblaze/DotNet-MetaData 

I highly recommend using the tool rather than the Yara rule, as it detects .NET metadata more reliably. Both the Yara rule and the Python tool can be adapted to display less or more information according to your needs.


Clustering

Tracking attackers’ campaigns is always an exercise, and can be both fun and exhausting, depending on how many rabbit holes you (want to) go down. An example of clustering campaigns as well as malware developers is the work I did with Santiago, as mentioned earlier, which resulted in the following graphics:

Figure 9 - Statistics from 2016 research (bonus obfuscation stats)


This was a pretty large dataset (1,300 samples!) and specific to SteamStealers at the time.

For our analysis purposes, I took 4 of the currently most popular malware families (that are .NET based or have at least a .NET variant) according to Any.run’s Malware Trends: https://any.run/malware-trends/. These are:

  • RedLine

  • Agent Tesla

  • Quasar

  • Pure*: basically anything related to PureCrypter, PureLogs, …

Downloading the latest available samples per family from MalwareBazaar, then running my DotNetMetadata Python script, and playing around with pandas and matplotlib, we can create the following graphs per family:



RedLine – 56 samples

Figure 10 - RedLine Typelib GUID frequency


Figure 11 - RedLine MVID frequency


Agent Tesla – 140 samples

Figure 12 - Agent Tesla Typelib GUID frequency



Figure 13 - Agent Tesla MVID frequency





Quasar – 141 samples


Figure 14 - Quasar Typelib GUID frequency



Figure 15 - Quasar MVID frequency




Pure* family - 194 samples 


Figure 16 - Pure* Typelib GUID frequency



Figure 17 - Pure* MVID frequency




While these pie charts are certainly hypnotic and display the frequency, or occurrence, of the same typelib or MVID, we can also leverage them to create meaningful Yara rules for clustering samples per family. In the case of Quasar especially, the MVID "60f5dce2-4de4-4c86-aa69-383ebe2f504c" looks like a good candidate.

You might think that while these charts look visually appealing (depending on your art preferences), they may not be particularly useful because they don't scale well with larger datasets. You’re exactly right! By limiting the number of results displayed, we can indeed produce even better results. Let’s run our visualisations again on the sample dataset for the 4 malware families above, a total of 531 samples, and this time we will:

  • Run it on the whole sample set

  • Extract the assembly name

  • List only the top 10 of assembly names

  • Use a bar chart instead of a pie


And the result:

Figure 18 - Assembly name frequency - looking better right?
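
The tally behind a chart like Figure 18 can be sketched with the standard library alone. This is not the code used for the figures: the record layout (a list of dictionaries with an "assembly_name" key) and the sample values are assumptions made for illustration, and the tool's real output format may differ.

```python
from collections import Counter


def top_values(records, field, limit=10):
    """Count each non-empty value of `field` and return the `limit` most common as (value, count)."""
    counts = Counter(record[field] for record in records if record.get(field))
    return counts.most_common(limit)


# Made-up records, standing in for per-sample metadata extracted earlier.
records = [
    {"assembly_name": "Client"},
    {"assembly_name": "Client"},
    {"assembly_name": "Stub"},
    {"assembly_name": ""},       # empty names are ignored
]

for name, count in top_values(records, "assembly_name"):
    print(f"{count:4d}  {name}")
```

The resulting pairs can then be handed to pandas or matplotlib to draw the bar chart, and the same function works for other fields such as the MVID or Typelib.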

The top 3 is then:

  • “Client”: Quasar family

  • “Product Design 1”: Pure family

  • “Sample Design 1”: Pure family

Client is likely the default assembly name when compiling the Quasar malware (project), and Product Design and Sample Design are likely default assembly names from the PureCrypter builder. 

If we then want to write a Yara rule for Quasar based on the default assembly name:

import "dotnet"

rule Quasar_AssemblyName

{

condition:

dotnet.assembly.name == "Client"

}


But why stop there? We can build a Yara rule to classify our malware dataset or repository:

import "dotnet"

import "console"

rule DotNet_Malware_Classifier

{

condition:

(dotnet.assembly.name == "Client" and console.log("Likely Quasar, assembly name: ", dotnet.assembly.name)) or

(dotnet.assembly.name == "Product Design 1" and console.log("Likely Pure family, assembly name: ", dotnet.assembly.name)) or

(dotnet.assembly.name == "Sample Design 1" and console.log("Likely Pure family, assembly name: ", dotnet.assembly.name))

}


And we run this new Yara rule on the combined samples of the Pure family and Quasar:

Figure 19 - Simple "malware classifier"


We can combine sets of Yara rules based on assembly name, Typelib, MVID and so on to create rules with a higher confidence, and we can use this in further hunting, classification and... much more.
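
One way to combine indicators is to generate the rule text from the values you have collected. The generator below is a hypothetical helper, not part of the published tooling. It emits the same dotnet fields used in the rules above, inserts values verbatim (names containing quotes or backslashes would need escaping), and can require every indicator type to match for a stricter rule.

```python
def build_dotnet_rule(rule_name, assembly_names=(), mvids=(), typelibs=(), require_all=False):
    """Render a Yara rule from indicator lists: values of one type are OR-ed together,
    and the types are joined with 'and' (require_all=True) or 'or' (default)."""
    groups = []
    for template, values in (
        ('dotnet.assembly.name == "{}"', assembly_names),
        ('dotnet.guids[0] == "{}"', mvids),
        ('dotnet.typelib == "{}"', typelibs),
    ):
        if values:
            groups.append("(" + " or ".join(template.format(v) for v in values) + ")")
    if not groups:
        raise ValueError("at least one indicator is required")
    joiner = " and\n        " if require_all else " or\n        "
    return (
        'import "dotnet"\n\n'
        f"rule {rule_name}\n"
        "{\n"
        "    condition:\n"
        f"        {joiner.join(groups)}\n"
        "}\n"
    )


# Illustration only: pairs the Quasar assembly name and MVID mentioned earlier.
print(build_dotnet_rule(
    "Quasar_Defaults",
    assembly_names=["Client"],
    mvids=["60f5dce2-4de4-4c86-aa69-383ebe2f504c"],
))
```

Whether two indicators actually occur together in the same samples is something to check against your own dataset before switching to require_all=True.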


Bonus

If you’ve made it this far, it only makes sense to add one extra use-case for all of this: finding new crypters or obfuscators!

When I ran the script on the 500+ samples, there was 1 assembly / binary that stood out:

Figure 20 - Potential new crypter "Cronos"

Making a simple Yara rule again:

import "dotnet"

rule cronos_crypter

{

strings:

$cronos = "Cronos-Crypter" ascii wide nocase

condition:

dotnet.is_dotnet and $cronos

}


Running this on the Unpac.me dataset yields:

Figure 21 - Unpac.me Yara hunt results


4 matches in 12 weeks: it appears this crypter is not popular (yet): 2 Async RAT samples and 2 PovertyStealer samples have used it so far. 


Bonus on Bonus


Let’s go with a final bonus round: improving the previous “classification” rule by also reviewing results for Async RAT. Seeing the previous crypter was used on at least 2 Async RAT samples, I wanted to see some statistics for this malware as well, for just the assembly name. This results in the following, based on 86 samples:

Figure 22 - Another pie chart: AsyncRat top used assembly names

 

Jumping out are the following assembly names:

  • AsyncClient

  • Client --> Also seen in Quasar!

  • XClient

  • Output

  • Loader

  • Stub


AsyncClient is likely the default name when building the Async RAT project. But we are interested in widening the net: from the previous rule DotNet_Malware_Classifier, let’s update it with these new “generic” or default assembly names:


import "dotnet"

import "console"

rule DotNet_Malware_Classifier

{

condition:

(dotnet.assembly.name == "Client" and console.log("Suspicious assembly name: ", dotnet.assembly.name)) or

(dotnet.assembly.name == "Output" and console.log("Suspicious assembly name: ", dotnet.assembly.name)) or

(dotnet.assembly.name == "Loader" and console.log("Suspicious assembly name: ", dotnet.assembly.name)) or

(dotnet.assembly.name == "Stub" and console.log("Suspicious assembly name: ", dotnet.assembly.name))

}




Figure 23 - Classifier Yara rule results


Conclusion

In this blog post, two new tools were presented to extract metadata from .NET malware samples. Specifically, we can now reliably extract 2 unique GUIDs: the Typelib and the MVID.

The Python script is capable of extracting the desired data from a large set of .NET assemblies, whereas the Yara rule is tailored for use with one particular sample. Of course, either of them can be used interchangeably: you can still fine-tune the Yara rule for a large set and work this way if you don’t want to rely on an external script. Similarly, the script can be extended to extract more data to be used.

Based on the output of these tools, you can then create Yara hunting rules, combine them with your existing rule sets, or use them in an attempt to classify malware families or specific attack campaigns.

Some closing remarks:

  • GUIDs could be spoofed or even removed. No method is 100% reliable.

  • However, this method can enhance already existing rulesets, especially those where .NET obfuscators (e.g. SmartAssembly) obfuscate (user) strings, modules and more, making it harder to write Yara rules for a malware family. Detecting based on GUID, however, can work regardless of obfuscation method.

  • That said, obfuscating or deobfuscating may also alter the GUIDs. Keep this in mind when creating your detection rules based on an original or unpacked/deobfuscated sample.

  • If you encounter a GUID comprised entirely of zeros, such as 00000000-0000-0000-0000-000000000000, avoid using it for hunting since it's an empty GUID. This indicates the value may not be set or has been altered. This would make for a poor hunting rule as it can be a default value for any .NET project.

  • You can also use this methodology and tooling for .NET assemblies that are not malicious: extract developer information and other metadata per your use case or purpose.

    In addition, the Python tool, just like the Yara rule, allows for analysing, classifying and hunting on much more .NET (meta)data.
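
The remark about all-zero GUIDs is easy to enforce before a value ends up in a hunting rule. The check below is a small sketch with a made-up function name, using only Python's uuid module.

```python
import uuid

NIL_GUID = uuid.UUID(int=0)  # 00000000-0000-0000-0000-000000000000


def is_huntable_guid(value):
    """Return True if value parses as a GUID and is not the all-zero (empty) GUID."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Not a string, or not in a GUID format that uuid.UUID accepts.
        return False
    return parsed != NIL_GUID


candidates = [
    "9066ee39-87f9-4468-9d70-b57c25f29a67",
    "00000000-0000-0000-0000-000000000000",
]
print([guid for guid in candidates if is_huntable_guid(guid)])
```

Filtering like this only removes the empty value; it says nothing about whether a remaining GUID is unique to a family.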

     

Happy .NET hunting! You can find the tools and some of the example Yara rules in the repository: https://github.com/bartblaze/DotNet-MetaData 

As always, feedback is welcomed.