Phishing subjects with invisible characters: RFC 2047 + soft hyphen evasion, and how to hunt it
SANS ISC documented a phishing message whose Subject was split into multiple RFC 2047 “encoded-words,” with soft hyphen characters (U+00AD) inserted between letters to break keyword matches. Outlook renders these as normal-looking text, so users never see the obfuscation, but filters that don’t normalize Unicode or decode RFC 2047 first can miss it (SANS ISC). Soft hyphen is a format character that’s typically invisible except at line breaks (Unicode UAX #14; see “Use of Soft Hyphen”), and Microsoft has previously called out invisible Unicode (including U+00AD and U+2060) as a phish-evasion tactic in both bodies and subject lines (Microsoft Threat Intelligence, 2021).
Intrusion Flow
- Sender composes a Subject using MIME encoded-words like =?UTF-8?B?...?= and splits it across multiple encoded segments. This is allowed by RFC 2047, which defines the encoded-word syntax and permits multiple segments separated by folding whitespace (RFC 2047). Header folding itself is defined by RFC 5322, which allows CRLF + WSP folding and mandates unfolding before display (RFC 5322).
- Inside the decoded text, the actor inserts soft hyphen U+00AD between letters. SHY is generally not rendered unless a line break occurs, so the Subject looks clean in common MUAs while keyword matching sees a different string (Unicode UAX #14; Unicode Core Spec, Hyphenation).
- Some gateways and rulesets evaluate “subject contains word X” after MIME decoding but without Unicode normalization or stripping of format characters, so the keyword “password” does not match a Subject whose decoded text is “pa<U+00AD>ss<U+00AD>word” (Exchange mail flow predicates decode first).
Here’s why this matters for forensics: if you only hunt on visible text or don’t remove format/invisible code points, you’ll miss messages that look benign to users but are deliberately obfuscated to slide past filters.
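To reproduce the flow above in a lab, the following is a minimal sketch that builds an obfuscated Subject the same way: it inserts U+00AD between adjacent letters and emits the result as two Base64 encoded-words. The helper names (obfuscate, to_encoded_words) and the split point are illustrative assumptions, not details taken from the observed sample.

```python
import base64
from email.header import decode_header, make_header

SHY = "\u00ad"  # SOFT HYPHEN


def obfuscate(text: str) -> str:
    """Insert a soft hyphen between adjacent letters (hypothetical helper)."""
    out = []
    for prev, ch in zip(" " + text, text):
        if prev.isalpha() and ch.isalpha():
            out.append(SHY)
        out.append(ch)
    return "".join(out)


def to_encoded_words(text: str, split_at: int) -> str:
    """Encode text as two RFC 2047 Base64 encoded-words separated by a space."""
    chunks = [text[:split_at], text[split_at:]]
    words = [
        "=?UTF-8?B?" + base64.b64encode(c.encode("utf-8")).decode("ascii") + "?="
        for c in chunks
    ]
    return " ".join(words)


hidden = obfuscate("Your Password is About to Expire")
header_value = to_encoded_words(hidden, split_at=len(hidden) // 2)

# Decode the header the way a gateway would before keyword matching.
decoded = str(make_header(decode_header(header_value)))

print(header_value)
print("password" in decoded.lower())                   # match on decoded text only
print("password" in decoded.replace(SHY, "").lower())  # match after stripping U+00AD
```

The first check prints False and the second prints True, which is the gap the technique exploits: decoding alone is not enough for a keyword match.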
Key Artifacts to Pull
- Full RFC 5322 message (raw headers):
  - Microsoft 365: Use Microsoft Graph to retrieve internetMessageHeaders from the message resource (explicit $select=internetMessageHeaders) (Graph v1.0 example). EWS also exposes Internet headers via the InternetMessageHeaders element (EWS reference).
  - Gmail/Google Workspace: users.messages.get with format=full returns the top-level payload.headers including Subject (Gmail API; field reference shows headers on the MessagePart, including Subject: https://developers.google.com/workspace/gmail/api/reference/rest/v1/users.messages).
- Message tracing and delivery metadata:
  - Exchange Online “new message trace” cmdlets surface 90 days of data; use Get-MessageTraceV2/Get-MessageTraceDetailV2 (WW tenants) for delivery path and NetworkMessageId (Microsoft Learn; GA announcement & deprecation of legacy cmdlets).
- Defender for Office 365 telemetry:
  - Advanced Hunting tables: EmailEvents, MessageEvents, EmailPostDeliveryEvents, EmailUrlInfo, EmailAttachmentInfo for subject, delivery, and URL/artifact pivoting (EmailEvents; MessageEvents; EmailPostDeliveryEvents; EmailUrlInfo; EmailAttachmentInfo).
Detection Notes
In most IR engagements, we pull the raw message, decode headers, normalize Unicode, then run string and rule matching on the sanitized text. Below are drop-in snippets and rules that do exactly that.
1) Decode RFC 2047 and strip invisible/format characters (Python)
The email package decodes RFC 2047 encoded-words; we then normalize and remove code points commonly abused for evasion (SOFT HYPHEN U+00AD, ZERO WIDTH SPACE U+200B, ZWNJ U+200C, ZWJ U+200D, WORD JOINER U+2060, ZWNBSP/BOM U+FEFF). Soft hyphen is typically invisible until a line break (Unicode UAX #14).
# Python 3.11+
from email.header import decode_header, make_header  # RFC 2047 decoding
import unicodedata

# characters often abused for subject/body evasion
INVISIBLES = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}


def decode_rfc2047(value: str) -> str:
    # returns a Unicode string
    parts = decode_header(value)    # list of (bytes/str, charset)
    return str(make_header(parts))  # joins parts correctly


def strip_invisibles(s: str) -> str:
    # NFC is fine; normalization does not remove SHY - you must strip explicitly
    s = unicodedata.normalize("NFC", s)
    return "".join(ch for ch in s if ch not in INVISIBLES)


raw_subject = "=?UTF-8?B?WcKtb3XCrXIgUMKtYXPCrXN3wq1vwq1yZCBpwq1zIEHCrWLCrW91dCA=?= =?UTF-8?B?dMKtbyBFwq14wq1wwq1pcsKtZQ==?="
decoded = decode_rfc2047(raw_subject)
cleaned = strip_invisibles(decoded)
print({"decoded": decoded, "cleaned": cleaned})
- RFC 2047 decoding via email.header is standard Python (docs).
- The invisibles chosen are documented as zero-width/format characters used for line breaking or join control (Unicode Core Spec on WJ/ZWSP/ZWNJ/ZWJ).
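When you have the whole raw message rather than a single header value, the standard library's modern parser can unfold and decode the Subject in one step. A minimal sketch, assuming a raw RFC 5322 message as a string; the sample message and the cleaned_subject name are illustrative, and the Subject is deliberately folded across two lines to exercise unfolding.

```python
from email import message_from_string, policy

INVISIBLES = "\u00ad\u200b\u200c\u200d\u2060\ufeff"

# Illustrative message: the Subject is folded (CRLF + space) between encoded-words.
RAW_MESSAGE = (
    "From: sender@example.com\r\n"
    "Subject: =?UTF-8?B?WcKtb3XCrXIgUMKtYXPCrXN3wq1vwq1yZCBpwq1zIEHCrWLCrW91dCA=?=\r\n"
    " =?UTF-8?B?dMKtbyBFwq14wq1wwq1pcsKtZQ==?=\r\n"
    "\r\n"
    "body\r\n"
)


def cleaned_subject(raw: str) -> str:
    """Parse a raw message, then drop the listed invisible characters."""
    # policy.default returns headers already unfolded and RFC 2047-decoded.
    msg = message_from_string(raw, policy=policy.default)
    subject = str(msg["Subject"] or "")
    return subject.translate({ord(ch): None for ch in INVISIBLES})


print(cleaned_subject(RAW_MESSAGE))
```

For evidence handling, keep the original bytes alongside the cleaned string; the cleaned value is only for matching and triage.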
2) Microsoft 365 Defender Advanced Hunting (KQL)
Flag subjects that contain common invisible characters, and produce a normalized subject for downstream matching.
// The character class uses \x{...} escapes so the invisible code points survive copy/paste.
// U+00AD, U+200B, U+200C, U+200D, U+2060, U+FEFF
let InvisRegex = @"[\x{00AD}\x{200B}\x{200C}\x{200D}\x{2060}\x{FEFF}]";
EmailEvents
| where isnotempty(Subject)
| extend HasInvisible = Subject matches regex InvisRegex
| extend NormSubject = replace_regex(Subject, InvisRegex, "")
| where HasInvisible
| project Timestamp, NetworkMessageId, InternetMessageId, SenderFromAddress, RecipientEmailAddress, Subject, NormSubject, DeliveryLocation, ThreatTypes
- EmailEvents schema is documented for Defender hunting (EmailEvents).
- You can pivot to delivery/post-delivery actions via MessageEvents and EmailPostDeliveryEvents (MessageEvents; EmailPostDeliveryEvents).
3) Exchange Online mail flow rule (server-side)
If you must match a risky keyword even when attackers intersperse invisibles, use a regex condition over the Subject with optional zero-width characters between letters. Mail flow rules evaluate text after MIME decoding, which is what we want (predicate behavior).
Example: detect “password” with any count of U+00AD/200B/200C/200D/2060/FEFF between letters:
Condition: The subject matches these text patterns
Pattern (single line):
(?i)p(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)a(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)s(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)s(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)w(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)o(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)r(?:[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*)d
- Mail flow conditions that “match text patterns” use regular expressions; guidance and syntax are documented by Microsoft (Exchange Online regex usage).
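Writing these patterns by hand gets error-prone beyond one keyword. A small sketch that generates an equivalent pattern for any keyword and checks it locally follows. It runs on Python's re engine, not Exchange's, so treat a local match only as a sanity check and confirm the pattern's behavior in your tenant. The tolerant_pattern name is hypothetical.

```python
import re

# Optional run of the invisible characters listed above.
INVISIBLE_CLASS = r"[\u00AD\u200B\u200C\u200D\u2060\uFEFF]*"


def tolerant_pattern(keyword: str) -> str:
    """Join the keyword's characters with an optional run of invisibles."""
    return INVISIBLE_CLASS.join(re.escape(ch) for ch in keyword)


pattern = tolerant_pattern("password")
matcher = re.compile(pattern, re.IGNORECASE)

print(pattern)
print(bool(matcher.search("Your Pa\u00adss\u200bword expires")))  # obfuscated keyword
print(bool(matcher.search("Your passport expires")))              # unrelated word
```

The generated pattern omits the (?i) prefix and non-capturing groups of the hand-written version; case-insensitivity is applied through the re.IGNORECASE flag instead.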
4) Message trace pivots (delivery evidence)
- Query delivery and event details for an affected message ID using the new cmdlets; sample in docs shows chaining Get-MessageTraceV2 | Get-MessageTraceDetailV2 (docs). Note that Microsoft’s GA announcement clarifies WW rollout and legacy deprecation timelines; GCC and other sovereign clouds have a separate schedule (announcement).
Response Guidance
- Pull the raw message and decode the Subject before triage. Don’t trust the visible rendering: decode RFC 2047 and strip format characters first (RFC 2047; Unicode UAX #14).
- Search for similarly obfuscated subjects tenant-wide:
  - Defender AH: run the KQL above and pivot on sender, sending IP, URLs (EmailUrlInfo), and delivery locations (tables reference).
  - Exchange trace: enumerate deliveries by Subject and NetworkMessageId using V2 cmdlets (Get-MessageTraceDetailV2).
- Consider a containment rule while you tune detections: use a temporary mail flow rule that either prepends a warning or quarantines when the Subject matches your invisible-character regex (mail flow predicates and regex; regex guidance).
- For long-term hygiene, normalize/strip format characters in preprocessing before keyword or ML pipelines evaluate subjects and bodies (SHY and WJ/ZW* often carry no semantic meaning in English text; see Unicode behavior for these characters: Core Spec Chapter 23).
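For the preprocessing step, one option is to strip by Unicode general category rather than maintain a fixed list. The sketch below removes every character in category Cf (Format), which covers the six code points discussed here plus others such as bidirectional controls. This is a broader technique than the explicit set used earlier, and the keep parameter is an assumption added for cases where ZWJ/ZWNJ are legitimate (emoji sequences, some scripts).

```python
import unicodedata


def strip_format_chars(text: str, keep: frozenset[str] = frozenset()) -> str:
    """Drop all Unicode category Cf characters except those listed in keep."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cf" or ch in keep
    )


sample = "Pa\u00adss\u202eword\u200b reset"
print(strip_format_chars(sample))

# Preserve joiners when the pipeline also handles emoji or non-English text.
joiners = frozenset({"\u200c", "\u200d"})
print(repr(strip_format_chars("a\u200db\u00adc", keep=joiners)))
```

Apply this to a copy used for matching and keep the original text for evidence, since removing format characters can change how legitimate non-English content renders.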
Takeaways
- Add a pre-processing step to decode RFC 2047 and remove invisible/format characters before any subject/body matching (RFC 2047; UAX #14).
- Ship a Defender hunting query to find Subjects containing U+00AD/200B/200C/200D/2060/FEFF, and keep a normalized subject column for triage (EmailEvents).
- If you run Exchange Online, consider a temporary mail flow rule using a Unicode-aware regex to block or tag subjects with interspersed invisibles (Exchange predicates; regex usage).
- When investigating, use Graph/EWS or Gmail API to get raw headers and verify the Subject’s true code points, not just what the client UI shows (Graph; Gmail API).
Sources / References
- SANS ISC diary: A phishing with invisible characters in the subject line (Oct 28, 2025): https://isc.sans.edu/diary/A%2Bphishing%2Bwith%2Binvisible%2Bcharacters%2Bin%2Bthe%2Bsubject%2Bline/32428
- RFC 2047 - Message Header Extensions for Non-ASCII Text: https://www.rfc-editor.org/rfc/rfc2047
- RFC 5322 - Internet Message Format (folding/unfolding): https://www.ietf.org/rfc/inline-errata/rfc5322.html
- Unicode UAX #14 - Line Breaking Properties (Soft Hyphen behavior): https://unicode.org/reports/tr14/
- Unicode Core Specification (Spaces, Word Joiner, Zero-Width characters): https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/
- Microsoft Security Blog (2021): Invisible Unicode characters used to bypass detection: https://www.microsoft.com/en-us/security/blog/2021/08/18/trend-spotting-email-techniques-how-modern-phishing-emails-hide-in-plain-sight/
- Exchange Online mail flow predicates (decode-before-match behavior): https://learn.microsoft.com/en-us/exchange/security-and-compliance/mail-flow-rules/conditions-and-exceptions
- Exchange Online regex usage in mail flow rules: https://learn.microsoft.com/en-us/exchange/mail-flow-best-practices/regular-expressions-usage-transport-rules
- Get-MessageTraceDetailV2 documentation: https://learn.microsoft.com/en-us/powershell/module/exchangepowershell/get-messagetracedetailv2
- Exchange Team GA announcement for new Message Trace (WW rollout, legacy deprecation): https://techcommunity.microsoft.com/blog/exchange/announcing-general-availability-ga-of-the-new-message-trace-in-exchange-online/4420243
- Microsoft Graph - Get message (internetMessageHeaders): https://learn.microsoft.com/en-us/graph/api/message-get?view=graph-rest-1.0
- EWS - InternetMessageHeaders element: https://learn.microsoft.com/en-us/exchange/client-developer/web-service-reference/internetmessageheaders
- Gmail API - users.messages.get: https://developers.google.com/gmail/api/v1/reference/users/messages/get
- Python email.header (RFC 2047 decoding): https://docs.python.org/3.11/library/email.header.html
- Defender XDR Advanced Hunting - EmailEvents table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailevents-table
- Defender XDR Advanced Hunting - MessageEvents table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-messageevents-table
- Defender XDR Advanced Hunting - EmailPostDeliveryEvents table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailpostdeliveryevents-table
- Defender XDR Advanced Hunting - EmailUrlInfo table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailurlinfo-table
- Defender XDR Advanced Hunting - EmailAttachmentInfo table: https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailattachmentinfo-table