RAG Pipelines and the Right to Be Forgotten: An Engineering Problem Disguised as a Legal One
Vector databases break the assumptions behind GDPR Article 17. Here's why deletion in RAG systems is architecturally different from relational databases, and how to design for it.
A user submits a deletion request. In a relational database, the operation is one line of SQL: DELETE FROM users WHERE id = 42. The B-tree rebalances, VACUUM reclaims space, and the data is gone. ACID guarantees make the outcome verifiable.
In a vector database powering a RAG pipeline, that same request triggers a cascade of problems that most engineering teams haven't designed for. The embedding is structurally entangled with an HNSW graph. The metadata might live in a separate store. The chunked text that generated the embedding might be cached in three places. And "gone" is harder to prove when your index is eventually consistent.
This is the gap between what data protection law demands and what most RAG architectures can deliver. It's not a compliance checkbox problem. It's a systems design problem.
What the Law Actually Requires
GDPR Article 17(1) grants data subjects the right to obtain erasure of their personal data "without undue delay." The controller must comply when, among other grounds, the data subject withdraws consent (Art. 17(1)(b)), the data was unlawfully processed (Art. 17(1)(d)), or the data is no longer necessary for its original purpose (Art. 17(1)(a)).
Article 17(2) extends the obligation: if the controller has made the data public, it must take "reasonable steps, including technical measures, to inform controllers which are processing the personal data" to erase it. In a RAG pipeline, this means every downstream system that received chunked text or derived embeddings.
Brazil's LGPD mirrors this in Article 18(IV), granting the right to "anonymization, blocking, or deletion of unnecessary, excessive data, or data processed in violation of this Law," and Article 18(VI) specifically covers "deletion of personal data processed with the consent of the data subject." Ecuador's LOPDP Article 15 establishes a 15-day window for controllers to fulfill deletion requests.
These are not theoretical provisions. In March 2025, the EDPB launched a coordinated enforcement action (CEF 2025) focused specifically on the right to erasure, with 30 data protection authorities plus the EDPS participating. The stated goal: to assess whether erasure rights are "truly actionable in practice and not merely theoretical."
The Precedent: Algorithmic Disgorgement
Regulators have already demonstrated they will order the destruction of AI systems built on improperly handled data. The FTC has made this a core enforcement tool.
In Everalbum/Paravision (FTC Case No. 192-3172, January 2021), the FTC ordered destruction of all facial recognition models trained on photos collected without proper consent. Not just the data. The models derived from the data.
In WW International/Kurbo (March 2022), the FTC ordered deletion of all personal data collected from children without parental consent and "any models or algorithms developed using that personal information." This was the first COPPA case requiring algorithm destruction.
In Rite Aid (December 2023), the same pattern: delete the images, delete the models, and a five-year ban on AI facial recognition.
In Amazon Ring (May 2023), the FTC required a $5.8 million refund plus deletion of both data and algorithms trained on improperly accessed customer videos.
The EDPB's Opinion 28/2024 (adopted December 17, 2024) addresses this directly for European enforcement: when an AI model was developed with unlawfully processed personal data, remediation measures may include "the erasure of the whole dataset used to develop the AI model and/or the AI model itself."
The pattern is clear. If your data pipeline is non-compliant, regulators are not limiting orders to the data. They are ordering deletion of everything derived from it.
Why Vector Databases Break the Deletion Assumption
In a relational database, a row is an independent unit. Deleting it is an O(log n) B-tree operation that doesn't affect the correctness of queries on other rows.
In an HNSW index (the dominant indexing structure in pgvector, Qdrant, and Weaviate), vectors are structurally interdependent. The graph maintains edges between neighboring vectors across hierarchical layers. Deleting a vector has three consequences that relational databases don't share:
Graph connectivity degrades. Nodes that used the deleted vector as a routing hop now have a dead link. If it was a bridge node in an upper layer, entire regions of the search space become harder to reach, reducing recall.
Tombstones accumulate. Most implementations mark deleted nodes rather than removing them from the graph. Tombstoned nodes are skipped during search but still consume memory and I/O. Heavy deletion leads to index bloat where a significant fraction of the graph is dead weight.
Consistency is eventual, not immediate. Pinecone's documentation states explicitly that deletes are eventually consistent and "there can be a slight delay before new or changed records are visible to queries." Qdrant and Weaviate handle this through segment optimization cycles. pgvector, running on PostgreSQL, is the exception: its deletions are ACID-compliant and immediately consistent.
The practical cost: after deleting more than 10-20% of vectors from an HNSW index, search quality degrades enough that a full reindex is necessary. For pgvector, that's REINDEX INDEX CONCURRENTLY. For Qdrant, it's a segment rebuild triggered by the deleted_threshold parameter. For Pinecone, there is no user-triggered reindex. You delete and recreate the index.
For a million-vector index at 1536 dimensions, reindexing takes minutes to tens of minutes on modern hardware, plus roughly $100 in embedding API costs if re-computation is needed. At ten million vectors, multiply accordingly.
Are Embeddings Personal Data?
This is the threshold question. If embeddings aren't personal data, deletion obligations don't apply to them. The answer, in most practical RAG deployments, is that they are.
GDPR Article 4(1) defines personal data as "any information relating to an identified or identifiable natural person." The test isn't whether you can reconstruct the original text from the embedding. It's whether the embedding relates to an identifiable person and could be used to single out that person.
Three factors push embeddings into the "personal data" category in practice:
Linkability. If the embedding is stored alongside a user ID, tenant ID, document source, or any identifier, the record as a whole is personal data regardless of whether the vector alone is.
Singling out. Research on embedding inversion attacks has demonstrated that original inputs can be recovered from embeddings. In the best-performing attacks studied, exact inputs were recovered in 92% of cases, including names and health diagnoses. Even without full inversion, if a user's documents produce a distinctive cluster in embedding space, that cluster can re-identify the user, meeting the "singling out" criterion from Recital 26.
Metadata coupling. RAG systems inherently maintain the link between embeddings and source documents. That's the "Retrieval" in Retrieval-Augmented Generation. The chunk text, the document ID, the user who uploaded it: these travel with the vector. Even if a future ruling determines standalone embeddings aren't personal data, the record they're stored in certainly is.
The EDPB's Opinion 28/2024 addresses when AI models can be considered anonymous. The assessment must be case-by-case, considering both "the likelihood of direct extraction of personal data from the model" and "the likelihood of obtaining personal data from queries." For a RAG pipeline where vectors are stored alongside source metadata and chunk text, anonymity is hard to argue.
The Article 29 Working Party's Opinion on Anonymisation Techniques (WP216, April 2014) established that data is only truly anonymous if it resists three attacks: singling out, linkability, and inference. Embeddings stored with identifiers fail the linkability test. Embeddings with distinctive clustering properties may fail the singling out test.
The practical conclusion: treat embeddings as personal data. The cost of implementing proper deletion is far lower than the cost of an enforcement action.
Designing for Deletion: Architecture Patterns
The core insight is that tenant-level physical isolation turns an O(n) deletion problem into an O(1) metadata operation. Instead of finding and deleting individual vectors from a shared index, you drop a partition.
pgvector: Partitioned Tables with Row-Level Security
pgvector's strongest advantage is that it inherits PostgreSQL's full relational tooling. Partitioned tables give you physical isolation per tenant. Row-Level Security enforces it at the query layer.
CREATE TABLE documents (
id BIGSERIAL,
tenant_id UUID NOT NULL,
content TEXT,
embedding vector(1536),
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY LIST (tenant_id);
-- Each tenant gets a physical partition
CREATE TABLE documents_tenant_a PARTITION OF documents
FOR VALUES IN ('aaaa-...');
-- Each partition gets its own HNSW index
CREATE INDEX ON documents_tenant_a
USING hnsw (embedding vector_cosine_ops);
-- Row-Level Security for defense in depth
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
USING (tenant_id = current_setting('app.current_tenant')::uuid);
To delete all data for a tenant: DROP TABLE documents_tenant_a;. Instant. No tombstones. No index degradation. ACID-compliant.
For user-level deletion within a tenant, metadata in a JSONB column enables targeted deletes:
DELETE FROM documents
WHERE metadata->>'user_id' = 'user-456';
A GIN index on the JSONB column (CREATE INDEX ON documents USING GIN (metadata)) keeps this fast.
Pinecone: Namespace Isolation
Pinecone's namespaces provide hard isolation boundaries. One namespace per tenant means deletion is index.delete(delete_all=True, namespace="tenant-a").
For user-level deletion within a namespace, Pinecone supports metadata-filtered deletion on pod-based indexes:
index.delete(
filter={"user_id": {"$eq": "user-456"}},
namespace="tenant-a"
)
Critical caveat: serverless indexes do not support delete-by-metadata-filter. The documented workaround is to query by metadata first, collect the IDs, then delete by ID. For compliance workflows, this two-step process needs error handling and verification.
Weaviate: Native Multi-Tenancy
Weaviate introduced native multi-tenancy in v1.20, designed specifically for tenant data isolation. Each tenant gets its own shard at the storage level:
client.schema.create_class({
"class": "Document",
"multiTenancyConfig": {"enabled": True},
"properties": [
{"name": "content", "dataType": ["text"]},
{"name": "userId", "dataType": ["text"]}
]
})
Deleting a tenant is a shard-level operation: client.schema.remove_tenants("Document", ["tenant-a"]). No point-by-point deletion. The shard is dropped.
Weaviate also supports a built-in TTL (Time-To-Live) at the collection level for automatic object expiration, a useful complement to explicit deletion.
Qdrant: Payload-Filtered Deletion
Qdrant supports deletion by payload filter without the two-step query-then-delete pattern:
client.delete(
collection_name="documents",
points_selector=Filter(
must=[FieldCondition(
key="user_id",
match=MatchValue(value="user-456")
)]
)
)
For tenant-level isolation, Qdrant supports one collection per tenant. The deleted_threshold optimizer parameter controls when automatic segment rebuilding triggers after deletions, defaulting to 0.1 (10% deleted vectors in a segment).
The Deletion Pipeline
Individual vector store operations are necessary but not sufficient. A complete deletion pipeline must account for every location where personal data exists in the RAG system:
1. Source documents → object store (S3, GCS, local)
2. Text chunks → may be cached or stored separately
3. Chunk-to-doc maps → relational table linking chunks to sources
4. Embeddings → vector store
5. Embedding cache → if pre-computed embeddings are cached
6. LLM conversation → if context windows are logged
A defensible deletion workflow:
- Receive and log the request. Record the erasure request with timestamp, data subject identifier, and scope. This log itself is the audit trail.
- Soft-delete immediately. Exclude the user's data from all queries. This can happen in milliseconds and ensures no further processing while hard deletion runs.
- Look up all artifacts. Use the chunk-to-document mapping table to find every chunk ID, every embedding, and every cached object associated with the user.
- Hard-delete from every store. Vector store, object store, cache, mapping table. Each deletion should be logged independently.
- Verify. Run a count query against the vector store filtered by the user's identifier. The result must be zero. Log the verification.
- Schedule reindex if necessary. If cumulative deletions exceed your threshold (10-20% of vectors), trigger a background reindex to restore search quality.
GDPR Article 17(1) requires erasure "without undue delay." A pipeline that soft-deletes within seconds and hard-deletes within hours is defensible. A pipeline that queues deletion requests for a monthly batch job is likely not.
The OWASP Standard
The OWASP AI Security Verification Standard (AISVS) codifies these patterns in its Control Family C08 (Memory, Embeddings & Vector Database). Three requirements are directly relevant:
8.3.1: Every vector and metadata record must carry a TTL or explicit retention label honored by automated cleanup jobs.
8.3.2: User-initiated deletion requests must purge vectors, metadata, cache copies, and derivative indices within 30 days.
8.3.3: Post-deletion, logical deletes must be followed by cryptographic shredding of storage blocks or key-vault key destruction.
The OWASP Top 10 for LLM Applications (2025 edition) lists "Vector and Embedding Weaknesses" as LLM08, identifying data lifecycle and deletion risks as a top-10 concern for LLM applications.
What This Means for Architecture Decisions
The choice of vector database is a compliance decision, not just a performance one.
pgvector offers the strongest compliance story: ACID transactions, partitioned tables for tenant isolation, Row-Level Security, standard SQL deletion, and immediate consistency. The trade-off is that you're managing PostgreSQL infrastructure and HNSW performance tuning yourself.
Weaviate's native multi-tenancy provides shard-level isolation with a managed experience. Tenant deletion is clean and atomic.
Qdrant's payload filtering handles user-level deletion without a two-step process, and the configurable segment optimizer automates post-deletion cleanup.
Pinecone's serverless limitation on metadata-filtered deletion is a real constraint for compliance workflows. If you're on serverless Pinecone, design around namespaces for isolation, not metadata filters for deletion.
Regardless of the vector store, three design decisions matter more than the choice of database:
-
Tag every embedding with tenant and user identifiers at ingestion. If you embed documents into a shared index without provenance metadata, you cannot honor a deletion request without scanning the entire dataset or reindexing from scratch.
-
Maintain a chunk-to-document mapping table in a relational database. The vector store should not be your source of truth for data lineage. A PostgreSQL table mapping
chunk_id → document_id → user_idlets you look up every artifact that needs deletion in milliseconds. -
Design for tenant-level isolation from day one. Partitions, namespaces, shards, or collections. The mechanism varies by database, but the principle is universal: deletion at the isolation boundary is atomic and fast. Deletion within a shared index is slow, lossy, and hard to verify.
Retrofitting any of these into an existing system is expensive. The schema changes, the data migration, the reindexing. This is load-bearing infrastructure. It needs to be in the foundation, not bolted on after the building is up.
Keep reading:
- Why Trademark Monitoring Is a Natural Fit for RAG
- The Direct Effect of the EU AI Act: Why Local Compliance Isn't Enough
- Your ISO 27001 Certificate Doesn't Make You GDPR Compliant
Building RAG pipelines and need help with data protection compliance? Let's talk.