THE PERIMETER · AI SECURITY · May 16, 2026

Your RAG pipeline is a covert exfiltration channel waiting to happen.

VectorSmuggle, released by ThirdKey researcher Jascha Wanger, catalogs six steganographic techniques to hide arbitrary data inside vector embeddings. The vectors still work for legitimate search. They also carry payloads your DLP cannot see, headed somewhere your security team is not monitoring. Enterprise RAG adoption moved sensitive corporate data into a storage format existing tooling cannot inspect — and the security industry is only starting to notice.

6 smuggling techniques
0 DLP tools that inspect this
May 14 public disclosure
TL;DR · 30-second version
  1. VectorSmuggle (disclosed May 14) catalogs six steganographic techniques to hide arbitrary data inside vector embeddings. The embeddings still work correctly for legitimate retrieval; they just also carry payloads invisible to existing tooling.
  2. Your DLP, CASB, and IDS inspect bytes that look like documents and packets that look like network traffic. None of them inspect floats that look like embeddings. Vector databases are not in scope for most current DLP audits.
  3. If your RAG pipeline ingests sensitive content from any source, you have an unmanaged data egress channel today. Inventory who can call your embedding API and where the resulting vectors land.
DEEP ANALYSIS
FOR YOU

If your team operates a RAG pipeline, ask one question this week: who can call our embedding API, and where do the resulting vectors land? If the answer is "anyone with the API key, and we do not track destinations" — that is the data egress channel.

When you build RAG features, treat the embedding service as an outbound data egress point, not as a function call. The trust boundary is real, and existing tooling does not enforce it for you.

DLP and CASB vendors will need to extend coverage to vector databases. Adjacent plays: embedding-inspection startups, vector-DB-native security tooling, RAG observability platforms. Watch funding and acqui-hire activity in this space over the next two quarters.

If your product ships a RAG feature to enterprise customers, security questionnaires from large buyers will soon include vector-DB egress and embedding-inspection questions. Have an answer ready before the questionnaire arrives.

The open question: what is the detection accuracy ceiling for embedding-payload steganography? VectorSmuggle proves the attack works; nobody has published rigorous detection benchmarks yet. A well-designed study would shape the entire defense category.

What shipped that matters.

  • May 14 · LSB float encoding (steganography): Hide data in the least-significant bits of embedding floats. Search still works.
  • May 14 · Subspace utilization (steganography): Use unused dimensions of the embedding space as a covert channel.
  • May 14 · Multi-vector splitting (steganography): Spread a payload across many vectors. Each looks benign in isolation.
  • May 14 · Adversarial perturbation (steganography): Modify vectors in directions that preserve semantic meaning but encode data.
  • May 14 · Cross-corpus blending (steganography): Mix sensitive content into legitimate document embeddings during ingestion.
  • May 14 · VectorPin (defense): Proposed cryptographic signing of embeddings so any modification breaks the signature.

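The technical primitive that makes this work is simple: a 1024-dimensional float vector has far more information capacity than the semantic signal needs. Stored as float32, that is 32,768 bits (4 KB) per vector, and similarity search is largely insensitive to small changes in the low-order bits.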
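The capacity claim can be made concrete with a minimal sketch of low-order-bit encoding. This is an illustration only, not code from VectorSmuggle: the helper names (`hide`, `reveal`, `cosine`), the choice of 8 low mantissa bits per float32, the sample payload, and the random stand-in vector are all assumptions.

```python
import math
import random
import struct

LOW_MASK = 0xFF  # assumption: overwrite the 8 lowest mantissa bits of each float32


def _f32_to_u32(x: float) -> int:
    """Reinterpret a value (rounded to float32) as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _u32_to_f32(u: int) -> float:
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", u))[0]


def hide(vector: list[float], payload: bytes) -> list[float]:
    """Store one payload byte in the low 8 mantissa bits of each leading float."""
    if len(payload) > len(vector):
        raise ValueError("payload larger than carrier capacity")
    out = [_u32_to_f32(_f32_to_u32(x)) for x in vector]  # round to float32 first
    for i, byte in enumerate(payload):
        bits = _f32_to_u32(out[i])
        out[i] = _u32_to_f32((bits & ~LOW_MASK & 0xFFFFFFFF) | byte)
    return out


def reveal(vector: list[float], length: int) -> bytes:
    """Read the payload back out of the low 8 bits."""
    return bytes(_f32_to_u32(x) & LOW_MASK for x in vector[:length])


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for a real embedding
secret = b"Q3 revenue draft: internal only"          # hypothetical payload
stego = hide(clean, secret)

recovered = reveal(stego, len(secret))
similarity = cosine(clean, stego)
capacity_bytes = len(clean)  # one byte per dimension at 8 bits each
print(recovered, capacity_bytes)
```

At 8 bits per dimension a 1024-dimension vector carries 1 KB, while each modified value moves by at most roughly 2^-15 of its own magnitude, so the cosine similarity between `clean` and `stego` stays very close to 1 and retrieval behaves as before.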

BEFORE
Traditional exfiltration paths
  • Email attachments — scanned by DLP
  • File transfers — logged at egress
  • Cloud storage uploads — flagged by CASB
  • Network tunneling — detected by IDS
  • Direct printing — controlled by endpoint policy
AFTER
Vector embedding exfiltration
  • Document is converted to a 1024-dim float vector
  • Vector contains semantic meaning AND a steganographic payload
  • Vector ships to a vector database over HTTPS
  • Vector lands in vector database as a normal RAG record
  • DLP, CASB, and IDS see only legitimate RAG ingestion traffic
  • Recipient retrieves vector and decodes payload at their leisure

Your security stack inspects bytes that look like documents and packets that look like network traffic. None of it inspects floats that look like embeddings.

Six failure modes that exist today, mostly invisible to current tooling. Severity reflects how reachable each is given current RAG adoption patterns.

  1. HIGH

    Insider exfiltration via the embedding API

    Anyone with access to your embedding service can encode arbitrary data into the embeddings they generate. The embedding then ships to a vector DB — yours or an attacker-controlled one — through legitimate HTTPS traffic. No data loss prevention tool inspects float vectors today.

    DO Treat the embedding API as a high-trust data egress point. Audit who has access. Log the embeddings it returns (or their hashes) alongside source documents for forensic correlation.
  2. HIGH

    Compromised RAG pipeline = mass exfiltration window

    If an attacker compromises the embedding service, ingestion pipeline, or the vector DB itself, they can encode steganographic payloads at scale across every document ingested. The compromise might be invisible to legitimate users because retrieval still works correctly.

    DO Apply embedding integrity verification at retrieval time. The VectorPin scheme — cryptographic signing — is one approach. If you cannot deploy that yet, at minimum log embedding hashes alongside source documents to detect drift.
  3. HIGH

    Vector DBs are not in scope for most DLP audits

    DLP rules cover email, file shares, cloud storage, and code repos. They almost never cover vector databases. Sensitive content can therefore flow into vector storage and out again without triggering a single DLP alert, even if your DLP would have caught it in any other format.

    DO Add vector DB egress to your DLP scope. At minimum, classify which vector DBs hold sensitive embeddings and monitor egress traffic from them like you would for any sensitive data store.
  4. MEDIUM

    No audit trail for what content went into embeddings

    Most RAG pipelines log "ingested document X." They do not log what semantic content was actually encoded into the vector. If steganographic payload encoding happens at ingestion, there is no easy way to detect it after the fact.

    DO Log document-to-embedding mappings with content hashes on both sides. The asymmetry — knowing the source but not the encoded vector — is what hides the attack.
  5. MEDIUM

    Adversarial ingestion via supply chain

    If your RAG pipeline ingests external content (web pages, customer-submitted documents, third-party data feeds), an attacker can submit pre-perturbed documents that produce vectors carrying their payload. They never need access to your embedding API.

    DO Apply input validation at the document layer before embedding generation. The validation does not need to detect the steganography — it needs to flag content that does not match expected sources or patterns.
  6. MEDIUM

    Cross-corpus blending in trusted documents

    An attacker with write access to even one trusted document source can blend sensitive payload data into the embeddings of that source's documents. Subsequent retrievals carry the payload alongside legitimate content.

    DO Audit write access to RAG source corpora as carefully as you audit write access to production databases. The corpus is the attack surface.
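The logging advice in the fourth failure mode (content hashes on both sides of the document-to-embedding mapping) could look something like the following sketch. The record fields, function names, and the model label `example-embedder-v1` are hypothetical, and the float32 serialization is an assumption about how vectors are stored.

```python
import hashlib
import json
import struct


def embedding_digest(vector: list[float]) -> str:
    """SHA-256 over the exact float32 bytes, so any bit-level change alters it."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return hashlib.sha256(raw).hexdigest()


def audit_record(doc_id: str, document: bytes, vector: list[float], model: str) -> dict:
    """Build one audit-log entry that hashes both the source and the vector."""
    return {
        "doc_id": doc_id,
        "doc_sha256": hashlib.sha256(document).hexdigest(),
        "embedding_sha256": embedding_digest(vector),
        "model": model,
        "dims": len(vector),
    }


def has_drifted(record: dict, stored_vector: list[float]) -> bool:
    """True if the vector in the database no longer matches what was logged."""
    return embedding_digest(stored_vector) != record["embedding_sha256"]


vec = [0.25, -0.5, 0.125]  # toy vector; real embeddings have far more dimensions
rec = audit_record("doc-001", b"quarterly report", vec, "example-embedder-v1")
print(json.dumps(rec, indent=2))

tampered = [0.25, -0.5, 0.1250001]  # small change in one component
print(has_drifted(rec, vec), has_drifted(rec, tampered))
```

A log like this only shows that a stored vector changed after the record was written. It says nothing about a payload that was already present when the embedding was first logged, which is why the record should be written as close to the embedding call as possible.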

Three concrete actions this week.

  1. Inventory your embedding API usage

    Who has access? What documents flow through it? Where do the vectors end up? If you cannot answer in one paragraph, you have an unmanaged data egress channel.

  2. Add vector DBs to DLP scope

    Get the egress traffic from vector databases into the same monitoring posture as your other sensitive data stores. Even if your tools cannot inspect the contents, they can flag anomalous egress patterns.

  3. Consider embedding signing if data is sensitive

    VectorPin or equivalent: cryptographically sign embeddings at creation and verify signatures at retrieval. Signing catches modification after creation, not a payload encoded before the signature is applied, but it raises the bar dramatically.
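As a rough illustration of the sign-at-creation, verify-at-retrieval idea, here is a minimal sketch using HMAC-SHA256 from the Python standard library. It is not the VectorPin format: the message layout, function names, demo key, and model label are assumptions, and a production scheme would more likely use asymmetric signatures with managed keys.

```python
import hashlib
import hmac
import struct


def _message(doc_sha256: str, model: str, vector: list[float]) -> bytes:
    """Bind the vector to its source document hash and the model that made it."""
    model_bytes = model.encode("utf-8")
    raw = struct.pack(f"<{len(vector)}f", *vector)  # exact float32 bytes
    # Length-prefix the model name so field boundaries are unambiguous.
    return doc_sha256.encode("ascii") + struct.pack("<I", len(model_bytes)) + model_bytes + raw


def sign_embedding(key: bytes, doc_sha256: str, model: str, vector: list[float]) -> str:
    """Compute an authentication tag at embedding-creation time."""
    return hmac.new(key, _message(doc_sha256, model, vector), hashlib.sha256).hexdigest()


def verify_embedding(key: bytes, doc_sha256: str, model: str,
                     vector: list[float], tag: str) -> bool:
    """Recompute the tag at retrieval time and compare in constant time."""
    expected = sign_embedding(key, doc_sha256, model, vector)
    return hmac.compare_digest(expected, tag)


key = b"demo-key-not-for-production"  # hypothetical; load from a secrets manager
model = "example-embedder-v1"         # hypothetical model label
doc_hash = hashlib.sha256(b"quarterly report").hexdigest()
vec = [0.25, -0.5, 0.125]

tag = sign_embedding(key, doc_hash, model, vec)
tampered = [0.25, -0.5, 0.1250001]  # e.g. low-order bits rewritten after signing

print(verify_embedding(key, doc_hash, model, vec, tag))
print(verify_embedding(key, doc_hash, model, tampered, tag))
```

Storing the tag next to the vector lets the retrieval path reject any record whose floats were altered after signing. The signing step itself then becomes the trust boundary to protect.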

Signals in the next 60 days that matter.

First public CVE or incident report for a VectorSmuggle-style attack

Once a real incident is reported, regulatory and enterprise adoption of embedding inspection is likely to accelerate sharply. Track the news cycle.

DLP vendor responses

Major DLP players (Forcepoint, Symantec, Microsoft Purview) are likely to add embedding inspection eventually. Watch which one ships first; it will probably set the de facto standard.

OWASP / NIST RAG security guidance

Standards bodies tend to lag research by 6–12 months. If OWASP publishes a RAG-specific top 10, expect embedding exfiltration to be on it, with procurement requirements to follow.