Modern emails are formatted using HTML and CSS, allowing the writer to apply structure, layout, and branding that would not be possible using plain text. Many companies issue style guides and templates for external or official communications that prescribe fonts, text size and other aspects of layout.
Using HTML and CSS makes email more complex, and with that complexity comes the potential for exploitation. For example, malicious emails often use legitimate CSS styling or user-invisible text to conceal a malicious payload and to mimic a known company's layout.
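To make the idea of user-invisible text concrete, the following is a minimal sketch that collects text inside elements whose inline style hides them. It is illustrative only and is not the method used in any product: the names `HIDDEN_STYLE`, `HiddenTextFinder`, and `find_hidden_text` are hypothetical, only inline `style` attributes are inspected, and only a handful of hiding techniques (`display: none`, `visibility: hidden`, zero `font-size`, zero `opacity`) are assumed.

```python
import re
from html.parser import HTMLParser

# Assumed, non-exhaustive set of inline declarations that hide text.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?![\d.])"   # 0 or 0px, but not 0.5em
    r"|opacity\s*:\s*0(?![\d.])",    # 0, but not 0.5
    re.IGNORECASE,
)

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenTextFinder(HTMLParser):
    """Collects text nodes that sit inside an element hidden by inline CSS."""

    def __init__(self):
        super().__init__()
        self._stack = []        # (tag, hidden) for each currently open element
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = self._stack[-1][1] if self._stack else False
        hidden = parent_hidden or bool(HIDDEN_STYLE.search(style))
        self._stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack and self._stack[-1][1] and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html):
    """Return the text fragments found inside inline-hidden elements."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

Hidden text is not malicious in itself, so a result from a sketch like this is better treated as one input feature than as a verdict.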
We have found that legitimate email communication broadly falls into three categories:
- Internal communication using plain text or simple HTML (interpersonal communications)
- External communication using complex HTML with recurring structures and styles (such as email signatures)
- External communication using complex HTML with full set templates and styles (such as newsletters and announcements)
The categories are characterized by features such as the frequency of CSS appearance, the frequency of HTML node appearance, and HTML tree depth. A classifier can use these categories to further direct feature extraction and tracking. The ability to quantify the complexity and style of an HTML document, and to track changes over time or against a model, allows the detection of anomalous and potentially malicious email communications.
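The three features named above can be sketched with Python's standard `html.parser` module. This is a simplified illustration under stated assumptions, not the product's implementation: `EmailFeatureParser` and `extract_features` are hypothetical names, CSS is counted only from inline `style` attributes (not `<style>` blocks), and depth is tracked by counting open tags, so unclosed elements in malformed HTML will inflate it.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never have children, so they do not deepen the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EmailFeatureParser(HTMLParser):
    """Counts tags and inline CSS properties and tracks maximum nesting depth."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.css_property_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def _count_inline_css(self, attrs):
        for name, value in attrs:
            if name == "style" and value:
                for declaration in value.split(";"):
                    prop, sep, _ = declaration.partition(":")
                    if sep and prop.strip():
                        self.css_property_counts[prop.strip().lower()] += 1

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        self._count_inline_css(attrs)
        if tag in VOID_TAGS:
            # A void element is a leaf one level below the current depth.
            self.max_depth = max(self.max_depth, self.depth + 1)
        else:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: count it as a leaf node.
        self.tag_counts[tag] += 1
        self._count_inline_css(attrs)
        self.max_depth = max(self.max_depth, self.depth + 1)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def extract_features(html):
    """Return simple complexity features for one email body."""
    parser = EmailFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "node_count": sum(parser.tag_counts.values()),
        "distinct_tags": len(parser.tag_counts),
        "css_declarations": sum(parser.css_property_counts.values()),
        "max_depth": parser.max_depth,
        "tag_counts": dict(parser.tag_counts),
        "css_property_counts": dict(parser.css_property_counts),
    }
```

A plain-text message yields zeros for every count, while a templated newsletter would typically produce many nodes, many CSS declarations, and a deep tree. A feature dictionary like this could be stored per sender and compared against later messages to flag a sudden change in style or complexity.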
This approach has been incorporated into the Darktrace Antigena Email product and contributes to detecting account takeovers and behavioral anomalies.