Phishing subjects with invisible characters: RFC 2047 + soft hyphen evasion, and how to hunt it

Attackers are splitting RFC 2047 encoded Subject headers and peppering them with soft hyphens (U+00AD) to sneak past filters. Here’s the...

SANS ISC documented a phishing message whose Subject was split into multiple RFC 2047 “encoded-words,” with soft hyphen characters (U+00AD) inserted between letters to break keyword matches. Outlook renders these as normal-looking text, so users never see the obfuscation, but filters that don’t normalize Unicode or decode RFC 2047 first can miss it (SANS ISC). Soft hyphen is a format character that’s typically invisible except at line breaks (Unicode UAX #14; see “Use of Soft Hyphen”), and Microsoft has previously called out invisible Unicode (including U+00AD and U+2060) as a phish-evasion tactic in both bodies and subject lines (Microsoft Threat Intelligence, 2021).

Intrusion Flow

  • Sender composes a Subject using MIME encoded-words like =?UTF-8?B?...?= and splits it across multiple encoded segments. This is allowed by RFC 2047, which defines the encoded-word syntax and permits multiple segments separated by folding whitespace (RFC 2047). Header folding itself is defined by RFC 5322, which allows CRLF + WSP folding and mandates unfolding before display (RFC 5322).
  • Inside the decoded text, the actor inserts the soft hyphen (SHY, U+00AD) between letters. SHY is generally not rendered unless a line break occurs at that position, so the Subject looks clean in common MUAs while keyword matching sees a different string (Unicode UAX #14; Unicode Core Spec, Hyphenation).
  • Some gateways and rulesets evaluate “subject contains word X” after MIME decoding but without Unicode normalization/stripping of format characters, so “pa­ss­word” doesn’t match “password” (Exchange mail flow predicates decode first).

Here’s why this matters for forensics: if you only hunt on visible text or don’t remove format/invisible code points, you’ll miss messages that look benign to users but are deliberately obfuscated to slide past filters.
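
To make the gap concrete, here is a minimal sketch that builds a Subject the way the flow above describes and runs a naive keyword check against the decoded text. The sample subject and the helper name encode_word are illustrative assumptions; they are not taken from the SANS ISC sample.

```python
import base64
from email.header import decode_header, make_header

SHY = "\u00ad"  # SOFT HYPHEN

def encode_word(text: str) -> str:
    """Wrap text in one RFC 2047 base64 encoded-word (hypothetical helper)."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{b64}?="

# Illustrative subject: soft hyphens between letters, split into two encoded-words
part1 = f"Your pa{SHY}ss{SHY}word "
part2 = f"ex{SHY}pires today"
raw_subject = encode_word(part1) + " " + encode_word(part2)

# Decode the way a filter would after MIME decoding
decoded = str(make_header(decode_header(raw_subject)))

naive_hit = "password" in decoded.lower()                   # no stripping
clean_hit = "password" in decoded.replace(SHY, "").lower()  # SHY removed first
print(naive_hit, clean_hit)  # prints: False True
```

The decoded string still carries the soft hyphens, so the keyword check only succeeds once they are removed.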

Key Artifacts to Pull

Detection Notes

In most IR engagements, we pull the raw message, decode headers, normalize Unicode, then run string and rule matching on the sanitized text. Below are drop-in snippets and rules that do exactly that.

1) Decode RFC 2047 and strip invisible/format characters (Python)

The email package decodes RFC 2047 encoded-words; we then normalize and remove code points commonly abused for evasion (SOFT HYPHEN U+00AD, ZERO WIDTH SPACE U+200B, ZWNJ U+200C, ZWJ U+200D, WORD JOINER U+2060, ZWNBSP/BOM U+FEFF). Soft hyphen is typically invisible until a line break (Unicode UAX #14).

# Python 3.11+
from email.header import decode_header, make_header  # RFC 2047 decoding
import unicodedata

# characters often abused for subject/body evasion
INVISIBLES = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

def decode_rfc2047(value: str) -> str:
    # returns a Unicode string
    parts = decode_header(value)  # list of (bytes/str, charset)
    return str(make_header(parts))  # joins parts correctly

def strip_invisibles(s: str) -> str:
    # NFC is fine; normalization does not remove SHY - you must strip explicitly
    s = unicodedata.normalize("NFC", s)
    return "".join(ch for ch in s if ch not in INVISIBLES)

raw_subject = "=?UTF-8?B?WcKtb3XCrXIgUMKtYXPCrXN3wq1vwq1yZCBpwq1zIEHCrWLCrW91dCA=?= =?UTF-8?B?dMKtbyBFwq14wq1wwq1pcsKtZQ==?="
decoded = decode_rfc2047(raw_subject)
cleaned = strip_invisibles(decoded)
print({"decoded": decoded, "cleaned": cleaned})
  • RFC 2047 decoding via email.header is standard Python (docs).
  • The invisibles chosen are documented as zero-width/format characters used for line breaking or join control (Unicode Core Spec on WJ/ZWSP/ZWNJ/ZWJ).
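
The fixed set above covers the code points named in this write-up. A broader option, sketched here as an assumption rather than something the cited sources recommend, is to drop every character in Unicode general category Cf (Format), which includes all six listed code points. The trade-off is that Cf also contains characters with legitimate uses, such as ZWJ in emoji sequences and bidirectional controls, so this fits matching pipelines better than anything that feeds back into display. The function names are hypothetical.

```python
import unicodedata

def strip_format_chars(s: str) -> str:
    """Remove every code point whose general category is Cf (Format)."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")

def has_format_chars(s: str) -> bool:
    """Report whether the string contains any Cf (Format) code point."""
    return any(unicodedata.category(ch) == "Cf" for ch in s)

sample = "pa\u00adss\u200bwo\u2060rd"
print(has_format_chars(sample), strip_format_chars(sample))  # prints: True password
```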

2) Microsoft 365 Defender Advanced Hunting (KQL)

Flag subjects that contain common invisible characters, and produce a normalized subject for downstream matching.

// Invisible code points are written as regex escapes so the query survives copy/paste:
// U+00AD, U+200B, U+200C, U+200D, U+2060, U+FEFF
let InvisRegex = @"[\x{00AD}\x{200B}\x{200C}\x{200D}\x{2060}\x{FEFF}]";
EmailEvents
| where isnotempty(Subject)
| extend HasInvisible = Subject matches regex InvisRegex
| extend NormSubject = replace_regex(Subject, InvisRegex, "")
| where HasInvisible
| project Timestamp, NetworkMessageId, InternetMessageId, SenderFromAddress, RecipientEmailAddress, Subject, NormSubject, DeliveryLocation, ThreatTypes

3) Exchange Online mail flow rule (server-side)

If you must match a risky keyword even when attackers intersperse invisibles, use a regex condition over the Subject with optional zero-width characters between letters. Mail flow rules evaluate text after MIME decoding, which is what we want (predicate behavior).

Example: detect “password” with any count of U+00AD/200B/200C/200D/2060/FEFF between letters:

Condition: The subject matches these text patterns
Pattern (single line):
(?i)p(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)a(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)s(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)s(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)w(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)o(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)r(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)d
  • Mail flow conditions that “match text patterns” use regular expressions; guidance and syntax are documented by Microsoft (Exchange Online regex usage).
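
Hand-writing that pattern for every keyword is error-prone. The following sketch generates it for an arbitrary keyword and sanity-checks it locally with Python's re module. The function name build_pattern is hypothetical, and Python's regex dialect is only a stand-in here: whether Exchange Online accepts the same constructs should be confirmed in a test rule before relying on it.

```python
import re

# Character class of invisibles, kept as literal \uXXXX escape text
# so the generated pattern stays copy/paste safe.
INVISIBLE_CLASS = r"[\u00AD\u200B\u200C\u200D\u2060\uFEFF]"

def build_pattern(keyword: str) -> str:
    """Return a regex matching keyword with optional invisibles between letters."""
    gap = f"(?:{INVISIBLE_CLASS}*)"
    return "(?i)" + gap.join(re.escape(ch) for ch in keyword)

pattern = build_pattern("password")
print(pattern)

# Local sanity check against an obfuscated and a clean subject
obfuscated = "Your Pa\u00adss\u200bword expires"
print(bool(re.search(pattern, obfuscated)), bool(re.search(pattern, "Your password")))
```

Keeping keywords alphanumeric avoids differences in how each regex engine treats escaped spaces or punctuation.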

4) Message trace pivots (delivery evidence)

  • Query delivery and event details for an affected message ID using the new V2 cmdlets; the sample in the docs chains Get-MessageTraceV2 | Get-MessageTraceDetailV2 (docs). Microsoft’s GA announcement covers the worldwide rollout and the legacy deprecation timeline; GCC and other sovereign clouds follow a separate schedule (announcement).

Response Guidance

  • Pull the raw message and decode the Subject before triage. Don’t trust the visible rendering: decode RFC 2047 and strip format characters first (RFC 2047; Unicode UAX #14).
  • Search for similarly obfuscated subjects tenant-wide:
    • Defender AH: run the KQL above and pivot on sender, sending IP, URLs (EmailUrlInfo), and delivery locations (tables reference).
    • Exchange trace: enumerate deliveries by Subject and NetworkMessageId using V2 cmdlets (Get-MessageTraceDetailV2).
  • Consider a containment rule while you tune detections: use a temporary mail flow rule that either prepends a warning or quarantines when the Subject matches your invisible-character regex (mail flow predicates and regex; regex guidance).
  • For long-term hygiene, normalize/strip format characters in preprocessing before keyword or ML pipelines evaluate subjects and bodies (SHY and WJ/ZW* often carry no semantic meaning in English text; see Unicode behavior for these characters: Core Spec Chapter 23).
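
As one way to apply the first and last points above, this sketch parses a raw message, lets the standard library decode and unfold the Subject, and reports how many invisible characters were removed. The sample message, the addresses, and the function name triage_subject are made up for illustration.

```python
from email import message_from_bytes, policy

INVISIBLES = "\u00ad\u200b\u200c\u200d\u2060\ufeff"
STRIP_TABLE = {ord(ch): None for ch in INVISIBLES}  # map each code point to "delete"

def triage_subject(raw_message: bytes) -> dict:
    """Decode the Subject of a raw RFC 5322 message and strip invisibles."""
    # policy.default unfolds the header and decodes RFC 2047 encoded-words
    msg = message_from_bytes(raw_message, policy=policy.default)
    subject = str(msg["Subject"] or "")
    cleaned = subject.translate(STRIP_TABLE)
    return {
        "subject": subject,
        "cleaned": cleaned,
        "invisible_count": len(subject) - len(cleaned),
    }

# Made-up sample: folded Subject with two encoded-words and two soft hyphens
raw = (
    b"From: sender@example.com\r\n"
    b"To: user@example.com\r\n"
    b"Subject: =?UTF-8?B?cGHCrXNzwq13b3Jk?=\r\n"
    b" =?UTF-8?B?IHJlc2V0?=\r\n"
    b"\r\n"
    b"body\r\n"
)
result = triage_subject(raw)
print(result["cleaned"], result["invisible_count"])
```

A non-zero invisible_count is a cheap signal to route the message for closer review before any keyword or ML stage sees the cleaned text.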

Takeaways

  • Add a pre-processing step to decode RFC 2047 and remove invisible/format characters before any subject/body matching (RFC 2047; UAX #14).
  • Ship a Defender hunting query to find Subjects containing U+00AD/200B/200C/200D/2060/FEFF, and keep a normalized subject column for triage (EmailEvents).
  • If you run Exchange Online, consider a temporary mail flow rule using a Unicode-aware regex to block or tag subjects with interspersed invisibles (Exchange predicates; regex usage).
  • When investigating, use Graph/EWS or Gmail API to get raw headers and verify the Subject’s true code points, not just what the client UI shows (Graph; Gmail API).
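
For the code point check, a short helper like the one below can list what a decoded Subject really contains. It is a sketch with a hypothetical function name, and it assumes the Subject has already been RFC 2047 decoded.

```python
import unicodedata

def describe_code_points(s: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per character in the string."""
    lines = []
    for ch in s:
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines

for line in describe_code_points("pa\u00adss"):
    print(line)
```

An invisible character such as U+00AD SOFT HYPHEN shows up as its own line in the listing even though the client UI renders nothing for it.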

Sources / References