Phishing subjects with invisible characters: RFC 2047 + soft hyphen evasion, and how to hunt it

October 28, 2025 (Last Modified: October 28, 2025)

4n6 Beat

Attackers are splitting RFC 2047 encoded Subject headers and peppering them with soft hyphens (U+00AD) to sneak past filters. Here’s the...

SANS ISC documented a phishing message whose Subject was split into multiple RFC 2047 “encoded-words,” with soft hyphen characters (U+00AD) inserted between letters to break keyword matches. Outlook renders these as normal-looking text, so users never see the obfuscation, but filters that don’t normalize Unicode or decode RFC 2047 first can miss it (SANS ISC). Soft hyphen is a format character that’s typically invisible except at line breaks (Unicode UAX #14; see “Use of Soft Hyphen”), and Microsoft has previously called out invisible Unicode (including U+00AD and U+2060) as a phish-evasion tactic in both bodies and subject lines (Microsoft Threat Intelligence, 2021).

Intrusion Flow

Sender composes a Subject using MIME encoded-words like =?UTF-8?B?...?= and splits it across multiple encoded segments. This is allowed by RFC 2047, which defines the encoded-word syntax and permits multiple segments separated by folding whitespace (RFC 2047). Header folding itself is defined by RFC 5322, which allows CRLF + WSP folding and mandates unfolding before display (RFC 5322).
Inside the decoded text, the actor inserts soft hyphen U+00AD between letters. A soft hyphen (SHY) is generally not rendered unless a line break occurs at that position, so the Subject looks clean in common MUAs while keyword matching sees a different string (Unicode UAX #14; Unicode Core Spec, Hyphenation).
Some gateways and rulesets evaluate “subject contains word X” after MIME decoding but without Unicode normalization or stripping of format characters, so the keyword “password” doesn’t match the decoded string “p<U+00AD>a<U+00AD>s<U+00AD>sword”, even though both render identically on screen (Exchange mail flow predicates decode first).
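
The mismatch described in this flow can be reproduced locally. What follows is a minimal sketch, not the actor's tooling: the sample subject text, the to_encoded_word helper, and the split point are all assumptions made for illustration.

```python
import base64
from email.header import decode_header, make_header

SHY = "\u00ad"  # SOFT HYPHEN

def to_encoded_word(text: str) -> str:
    """Hypothetical helper: wrap text in one RFC 2047 Base64 encoded-word."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{b64}?="

# Illustrative subject: a soft hyphen between every letter of the keyword.
obfuscated = "Your " + SHY.join("password") + " expires today"

# Split into two encoded-words separated by whitespace.
half = len(obfuscated) // 2
header_value = " ".join(
    to_encoded_word(part) for part in (obfuscated[:half], obfuscated[half:])
)

# Decode the way a filter would after MIME decoding.
decoded = str(make_header(decode_header(header_value)))

print(header_value)
print(repr(decoded))  # repr() makes the \xad code points visible
print("naive match:   ", "password" in decoded)
print("stripped match:", "password" in decoded.replace(SHY, ""))
```

The naive substring check misses the keyword, while the same check on the stripped string finds it. That gap is what the technique exploits.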

Here’s why this matters for forensics: if you only hunt on visible text or don’t remove format/invisible code points, you’ll miss messages that look benign to users but are deliberately obfuscated to slide past filters.

Key Artifacts to Pull

Full RFC 5322 message (raw headers):
- Microsoft 365: Use Microsoft Graph to retrieve internetMessageHeaders from the message resource (explicit $select=internetMessageHeaders) (Graph v1.0 example). EWS also exposes Internet headers via the InternetMessageHeaders element (EWS reference).
- Gmail/Google Workspace: users.messages.get with format=full returns the top-level payload.headers, including Subject (Gmail API; the field reference shows headers on the MessagePart: https://developers.google.com/workspace/gmail/api/reference/rest/v1/users.messages).
Message tracing and delivery metadata:
- Exchange Online “new message trace” cmdlets surface 90 days of data; use Get-MessageTraceV2/Get-MessageTraceDetailV2 (WW tenants) for delivery path and NetworkMessageId (Microsoft Learn; GA announcement & deprecation of legacy cmdlets).
Defender for Office 365 telemetry:
- Advanced Hunting tables: EmailEvents, MessageEvents, EmailPostDeliveryEvents, EmailUrlInfo, EmailAttachmentInfo for subject, delivery, and URL/artifact pivoting (EmailEvents; MessageEvents; EmailPostDeliveryEvents; EmailUrlInfo; EmailAttachmentInfo).
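
Once a raw message is in hand, the Subject can be inspected as transmitted rather than as rendered. The sketch below assumes you already have the raw RFC 5322 bytes (for example from an exported .eml). The message content and the ew helper are made up for illustration; no Graph, EWS, or Gmail call is shown.

```python
import base64
import unicodedata
from email import message_from_bytes, policy
from email.header import decode_header, make_header

def ew(text: str) -> bytes:
    """Hypothetical helper: one RFC 2047 Base64 encoded-word as bytes."""
    return b"=?UTF-8?B?" + base64.b64encode(text.encode("utf-8")) + b"?="

# Stand-in for a raw exported message (illustrative content only).
raw_message = (
    b"From: sender@example.test\r\n"
    b"Subject: " + ew("P\u00adass\u00adword") + b"\r\n"
    b" " + ew(" reset") + b"\r\n"
    b"\r\n"
    b"body\r\n"
)

# compat32 hands back the header value as transmitted: still encoded and folded.
msg = message_from_bytes(raw_message, policy=policy.compat32)
raw_subject = msg["Subject"]

# Decode the RFC 2047 encoded-words into a Unicode string.
decoded_subject = str(make_header(decode_header(raw_subject)))

# List every format (category Cf) code point that the UI would not show.
format_chars = [
    f"U+{ord(ch):04X}" for ch in decoded_subject
    if unicodedata.category(ch) == "Cf"
]

print(repr(raw_subject))
print(repr(decoded_subject))
print(format_chars)
```

Keeping both raw_subject and decoded_subject in case notes preserves the evidence of how the header was built as well as what it decodes to.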

Detection Notes

In most IR engagements, we pull the raw message, decode headers, normalize Unicode, then run string and rule matching on the sanitized text. Below are drop-in snippets and rules that do exactly that.

1) Decode RFC 2047 and strip invisible/format characters (Python)

The email package decodes RFC 2047 encoded-words; we then normalize and remove code points commonly abused for evasion (SOFT HYPHEN U+00AD, ZERO WIDTH SPACE U+200B, ZWNJ U+200C, ZWJ U+200D, WORD JOINER U+2060, ZWNBSP/BOM U+FEFF). Soft hyphen is typically invisible until a line break (Unicode UAX #14).

# Python 3.11+
from email.header import decode_header, make_header  # RFC 2047 decoding
import unicodedata

# characters often abused for subject/body evasion
INVISIBLES = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

def decode_rfc2047(value: str) -> str:
    # returns a Unicode string
    parts = decode_header(value)  # list of (bytes/str, charset)
    return str(make_header(parts))  # joins parts correctly

def strip_invisibles(s: str) -> str:
    # NFC is fine; normalization does not remove SHY - you must strip explicitly
    s = unicodedata.normalize("NFC", s)
    return "".join(ch for ch in s if ch not in INVISIBLES)

raw_subject = "=?UTF-8?B?WcKtb3XCrXIgUMKtYXPCrXN3wq1vwq1yZCBpwq1zIEHCrWLCrW91dCA=?= =?UTF-8?B?dMKtbyBFwq14wq1wwq1pcsKtZQ==?="
decoded = decode_rfc2047(raw_subject)
cleaned = strip_invisibles(decoded)
print({"decoded": decoded, "cleaned": cleaned})

RFC 2047 decoding via email.header is standard Python (docs).
The invisibles chosen are documented as zero-width/format characters used for line breaking or join control (Unicode Core Spec on WJ/ZWSP/ZWNJ/ZWJ).
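
A fixed list can miss invisibles that are not on it. As a hedged alternative (an assumption of this sketch, not something the cited sources prescribe), you can key on the Unicode general category Cf ("format"), which covers all six code points above. The function names are illustrative.

```python
import unicodedata

def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for every category-Cf character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

def strip_format_chars(text: str) -> str:
    """Drop every category-Cf character; use the result for matching only."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

sample = "p\u00ada\u200bss\u2060word\ufeff"
print(find_format_chars(sample))
print(strip_format_chars(sample))
```

Category Cf also includes characters that matter in some scripts and in emoji sequences (ZWJ and ZWNJ, for instance), so keep the original string for evidence and display, and use the stripped copy only as matching input.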

2) Microsoft 365 Defender Advanced Hunting (KQL)

Flag subjects that contain common invisible characters, and produce a normalized subject for downstream matching.

// U+00AD, U+200B, U+200C, U+200D, U+2060, U+FEFF written as regex escapes,
// because literal invisible characters are easily lost when a query is copied.
let InvisRegex = @"[\x{00AD}\x{200B}-\x{200D}\x{2060}\x{FEFF}]";
EmailEvents
| where isnotempty(Subject)
| where Subject matches regex InvisRegex
| extend NormSubject = replace_regex(Subject, InvisRegex, "")
| project Timestamp, NetworkMessageId, InternetMessageId, SenderFromAddress, RecipientEmailAddress, Subject, NormSubject, DeliveryLocation, ThreatTypes

EmailEvents schema is documented for Defender hunting (EmailEvents).
You can pivot to delivery/post-delivery actions via MessageEvents and EmailPostDeliveryEvents (MessageEvents; EmailPostDeliveryEvents).

3) Exchange Online mail flow rule (server-side)

If you must match a risky keyword even when attackers intersperse invisibles, use a regex condition over the Subject with optional zero-width characters between letters. Mail flow rules evaluate text after MIME decoding, which is what we want (predicate behavior).

Example: detect “password” with any count of U+00AD/200B/200C/200D/2060/FEFF between letters:

Condition: The subject matches these text patterns
Pattern (single line):
(?i)p(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)a(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)s(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)s(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)w(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)o(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)r(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)d

Mail flow conditions that “match text patterns” use regular expressions; guidance and syntax are documented by Microsoft (Exchange Online regex usage).
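
Hand-writing that pattern for every keyword is error-prone. The sketch below generates it for an arbitrary keyword and sanity-checks it with Python's re module. The build_pattern name is hypothetical. Python's regex engine is not the one Exchange uses, so a match here only validates the pattern's logic, and the rule should still be tested in the tenant before you rely on it.

```python
import re

# Same six code points as the rule above, kept as \uXXXX escapes in the pattern.
INVISIBLE_CLASS = r"[\u00AD\u200B\u200C\u200D\u2060\uFEFF]"

def build_pattern(keyword: str) -> str:
    """Allow any run of invisible characters between the letters of keyword."""
    gap = f"(?:{INVISIBLE_CLASS}*)"
    return "(?i)" + gap.join(re.escape(ch) for ch in keyword)

pattern = build_pattern("password")
print(pattern)

# Local sanity checks against illustrative subjects.
for subject in ("Reset your P\u00adass\u200bword now", "Passport renewal"):
    print(repr(subject), bool(re.search(pattern, subject)))
```

For the keyword "password" the generated string has the same shape as the pattern shown above, so the output can be pasted into the rule condition one keyword at a time.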

4) Message trace pivots (delivery evidence)

Query delivery and event details for an affected message ID using the new cmdlets; sample in docs shows chaining Get-MessageTraceV2 | Get-MessageTraceDetailV2 (docs). Note that Microsoft’s GA announcement clarifies WW rollout and legacy deprecation timelines; GCC and other sovereign clouds have a separate schedule (announcement).

Response Guidance

Pull the raw message and decode the Subject before triage. Don’t trust the visible rendering: decode RFC 2047 and strip format characters first (RFC 2047; Unicode UAX #14).
Search for similarly obfuscated subjects tenant-wide:
- Defender AH: run the KQL above and pivot on sender, sending IP, URLs (EmailUrlInfo), and delivery locations (tables reference).
- Exchange trace: enumerate deliveries by Subject and NetworkMessageId using V2 cmdlets (Get-MessageTraceDetailV2).
Consider a containment rule while you tune detections: use a temporary mail flow rule that either prepends a warning or quarantines when the Subject matches your invisible-character regex (mail flow predicates and regex; regex guidance).
For long-term hygiene, normalize/strip format characters in preprocessing before keyword or ML pipelines evaluate subjects and bodies (SHY and WJ/ZW* often carry no semantic meaning in English text; see Unicode behavior for these characters: Core Spec Chapter 23).
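
When sweeping for similarly obfuscated subjects, it can help to separate invisibles wedged inside words from incidental ones such as a leading BOM. The heuristic below is an assumption of this sketch, not a rule from the cited sources: it counts runs of invisible characters that have a letter or digit on both sides. It expects an already-decoded Subject, and the function name is illustrative.

```python
import re

# A run of invisible code points flanked by word characters on both sides.
INTERIOR_INVISIBLE = re.compile(
    r"(?<=\w)[\u00AD\u200B\u200C\u200D\u2060\uFEFF]+(?=\w)"
)

def interior_invisible_count(decoded_subject: str) -> int:
    """Count runs of invisible characters sitting between two word characters."""
    return len(INTERIOR_INVISIBLE.findall(decoded_subject))

samples = [
    "Your P\u00adas\u00adsword is about to expire",  # keyword split twice
    "\ufeffQuarterly report",                         # leading BOM only
    "Invoice attached",                               # clean
]
for s in samples:
    print(interior_invisible_count(s), repr(s))
```

A non-zero count is only a triage signal. Text in scripts that legitimately use ZWJ or ZWNJ inside words will also score, so treat the number as a reason to look closer, not as a verdict.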

Takeaways

Add a pre-processing step to decode RFC 2047 and remove invisible/format characters before any subject/body matching (RFC 2047; UAX #14).
Ship a Defender hunting query to find Subjects containing U+00AD/200B/200C/200D/2060/FEFF, and keep a normalized subject column for triage (EmailEvents).
If you run Exchange Online, consider a temporary mail flow rule using a Unicode-aware regex to block or tag subjects with interspersed invisibles (Exchange predicates; regex usage).
When investigating, use Graph/EWS or Gmail API to get raw headers and verify the Subject’s true code points, not just what the client UI shows (Graph; Gmail API).

Sources / References

SANS ISC diary: A phishing with invisible characters in the subject line (Oct 28, 2025): https://isc.sans.edu/diary/A%2Bphishing%2Bwith%2Binvisible%2Bcharacters%2Bin%2Bthe%2Bsubject%2Bline/32428
RFC 2047 - Message Header Extensions for Non-ASCII Text: https://www.rfc-editor.org/rfc/rfc2047
RFC 5322 - Internet Message Format (folding/unfolding): https://www.ietf.org/rfc/inline-errata/rfc5322.html
Unicode UAX #14 - Line Breaking Properties (Soft Hyphen behavior): https://unicode.org/reports/tr14/
Unicode Core Specification (Spaces, Word Joiner, Zero-Width characters): https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/
Microsoft Security Blog (2021): Invisible Unicode characters used to bypass detection: https://www.microsoft.com/en-us/security/blog/2021/08/18/trend-spotting-email-techniques-how-modern-phishing-emails-hide-in-plain-sight/
Exchange Online mail flow predicates (decode-before-match behavior): https://learn.microsoft.com/en-us/exchange/security-and-compliance/mail-flow-rules/conditions-and-exceptions
Exchange Online regex usage in mail flow rules: https://learn.microsoft.com/en-us/exchange/mail-flow-best-practices/regular-expressions-usage-transport-rules
Get-MessageTraceDetailV2 documentation: https://learn.microsoft.com/en-us/powershell/module/exchangepowershell/get-messagetracedetailv2
Exchange Team GA announcement for new Message Trace (WW rollout, legacy deprecation): https://techcommunity.microsoft.com/blog/exchange/announcing-general-availability-ga-of-the-new-message-trace-in-exchange-online/4420243
Microsoft Graph - Get message (internetMessageHeaders): https://learn.microsoft.com/en-us/graph/api/message-get?view=graph-rest-1.0
EWS - InternetMessageHeaders element: https://learn.microsoft.com/en-us/exchange/client-developer/web-service-reference/internetmessageheaders
Gmail API - users.messages.get: https://developers.google.com/gmail/api/v1/reference/users/messages/get
Python email.header (RFC 2047 decoding): https://docs.python.org/3.11/library/email.header.html
Defender XDR Advanced Hunting - EmailEvents table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailevents-table
Defender XDR Advanced Hunting - MessageEvents table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-messageevents-table
Defender XDR Advanced Hunting - EmailPostDeliveryEvents table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailpostdeliveryevents-table
Defender XDR Advanced Hunting - EmailUrlInfo table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailurlinfo-table
Defender XDR Advanced Hunting - EmailAttachmentInfo table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailattachmentinfo-table