Personal data management in the age of Machine Learning


Regulations and approaches such as Privacy by Design require that collected personal information is stored, processed and transferred in a way that reduces the risk of unauthorised access, minimises the amount of data collected and shared with third parties, and provides users with granular controls. Machine Learning applications for natural language processing are no exception, as the text passed to these tools may contain personally identifiable information (PII) as well as other sensitive information.

There are a number of ways to manage sensitive data in a machine learning application. Here we focus on the Retrieval Augmented Generation (RAG) use case. Please note that this is by no means to be taken as legal advice on how to comply with privacy and security regulations.

Retrieval Augmented Generation is the process of retrieving existing information or data to support the generation of new content or responses. It involves using a retrieval system to extract relevant information from a corpus of text, which is then used to inform the output generated by a generative large language model, or LLM (such as OpenAI GPTs, Google’s Gemini, Mistral and others). This approach combines the benefits of both retrieval and generation techniques, allowing for more accurate and contextually relevant output by incorporating factual information from external sources.

Here is a high level process chart of a RAG framework, with the numbered circles representing the stages where personal or sensitive data may be encountered.

Figure: High-level process chart of a RAG framework, highlighting the stages where personal data may be encountered.

  1. The user’s query. It may contain names, phone numbers or other types of personal data.
  2. The documents. Names, phone numbers, addresses and much more can appear anywhere in a document.
  3. The retrieved context. The text retrieved to answer the question may contain personal data taken from the original document.
  4. The answer. The output of the generative model may contain the same personal data as the input.

When and how should personal data be handled? These are the questions.

When

There’s no perfect workflow; it all depends on your use case. However, from our research and case studies, two main options stand out:

  • Hide sensitive information contained in the documents before storing them. This is probably not the best option, as it affects text retrieval and not only the answers. If users' questions contain personal or sensitive data, it might be difficult to match them with the stored anonymised (or pseudonymised) data. If you own and control where the data is stored, it may not be worth the extra effort.
  • Hide sensitive information contained in the relevant text extracted from the documents, before sending it to the LLM. This way less text is processed, and the retrieved text can be masked together with the question. To avoid processing the same chunk of text multiple times and to reduce latency, storing both the masked and non-masked text is a viable option.
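The idea of storing both the masked and the non-masked text can be sketched as a small cache. This is a minimal illustration, not a recommended implementation: `MaskedChunkStore` and `mask_emails` are hypothetical names, the masker only hides e-mail addresses, and a real application would persist the pairs in its document store rather than in memory.

```python
import hashlib
import re
from typing import Callable, Dict, Tuple

# Hypothetical masker used only for illustration: it hides e-mail addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_emails(text: str) -> str:
    return EMAIL_RE.sub("[EMAIL]", text)


class MaskedChunkStore:
    """Keeps the original and the masked version of each chunk side by side."""

    def __init__(self, masker: Callable[[str], str]) -> None:
        self._masker = masker
        # content hash -> (original chunk, masked chunk)
        self._store: Dict[str, Tuple[str, str]] = {}
        self.mask_calls = 0  # how many times the masker actually ran

    @staticmethod
    def _key(chunk: str) -> str:
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def masked(self, chunk: str) -> str:
        """Return the masked chunk, running the masker only on first sight."""
        key = self._key(chunk)
        if key not in self._store:
            self.mask_calls += 1
            self._store[key] = (chunk, self._masker(chunk))
        return self._store[key][1]


store = MaskedChunkStore(mask_emails)
chunk = "Contact Gianluca at gianluca@example.com for details."
first = store.masked(chunk)   # masker runs
second = store.masked(chunk)  # served from the cache
print(first)
```

Keying on a content hash means a chunk retrieved for many different questions is masked only once, which is where the latency saving comes from.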

Consider an example. If the question is "What is Gianluca's role in OOT?" and the anonymised source document says "[Person 1] is the CEO of [Company 1]", the model can never return a meaningful answer, because nothing links "Gianluca" to "[Person 1]". Therefore, the question and the context must be processed in such a way that "[Person 1]" and "[Company 1]" refer to the same entities in both sentences.
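One way to keep placeholders consistent is to share a single mapping between the question and the context, and to reverse it on the answer. The sketch below assumes the entities have already been detected by some other component; `PlaceholderMap` is a hypothetical helper, the model answer is hard-coded, and the plain string replacement is deliberately naive (it would also replace an entity name that happens to occur inside a longer word).

```python
from typing import Dict, List, Tuple


class PlaceholderMap:
    """Maps each detected entity to one stable placeholder, and back."""

    def __init__(self) -> None:
        self._forward: Dict[str, str] = {}   # original value -> placeholder
        self._counters: Dict[str, int] = {}  # label -> last number used

    def placeholder_for(self, value: str, label: str) -> str:
        if value not in self._forward:
            number = self._counters.get(label, 0) + 1
            self._counters[label] = number
            self._forward[value] = f"[{label} {number}]"
        return self._forward[value]

    def mask(self, text: str, entities: List[Tuple[str, str]]) -> str:
        # Naive replacement: good enough for a sketch, not for production.
        for value, label in entities:
            text = text.replace(value, self.placeholder_for(value, label))
        return text

    def unmask(self, text: str) -> str:
        for value, placeholder in self._forward.items():
            text = text.replace(placeholder, value)
        return text


# Entities are assumed to come from a detector that is not shown here.
entities = [("Gianluca", "Person"), ("OOT", "Company")]

mapping = PlaceholderMap()
masked_question = mapping.mask("What is Gianluca's role in OOT?", entities)
masked_context = mapping.mask("Gianluca is the CEO of OOT.", entities)

# Stand-in for the LLM call: the model only ever sees placeholders.
masked_answer = "[Person 1] is the CEO of [Company 1]."
answer = mapping.unmask(masked_answer)
print(masked_question)
print(answer)
```

Because both texts go through the same mapping, "[Person 1]" means the same person in the question and in the context, and the original names are restored only after the model has answered.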

How

There are a few services out there, but we are interested in Open Source solutions. Two projects have caught our attention:

Microsoft Presidio Analyzer

Presidio is an open source extensible kit for PII de-identification for text and images. It uses multiple recognizers, such as regular expressions, checksum, rule-based logic, Named Entity Recognition and context from surrounding words.
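To give a feel for how several of these techniques can work together, here is a standalone sketch that combines a regular expression, a checksum (the Luhn algorithm) and context words to detect credit card numbers. It does not use Presidio and is not its API: `Finding`, `find_credit_cards`, the context word list and the scores are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Illustrative context words; the list is an assumption.
CONTEXT_WORDS = ("card", "credit", "visa", "mastercard")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit, counting from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_credit_cards(text: str, window: int = 30) -> List[Finding]:
    findings = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):
            continue  # the pattern matched, but the checksum rules it out
        score = 0.6  # arbitrary base score for pattern + checksum
        before = text[max(0, match.start() - window):match.start()].lower()
        if any(word in before for word in CONTEXT_WORDS):
            score = 0.95  # arbitrary boost when context words precede it
        findings.append(Finding("CREDIT_CARD", match.start(), match.end(), score))
    return findings


text = "My credit card is 4111 1111 1111 1111, thanks."
results = find_credit_cards(text)
print([(text[f.start:f.end], f.score) for f in results])
```

The checksum removes many false positives that a pattern alone would produce, and the context words raise confidence when the surrounding text supports the match.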

LLM-Guard

LLM-Guard is an open-source Python library tailored for sensitive and personal data handling when interacting with Large Language Models (LLMs). It offers scanners and methods for sanitization, detection of harmful language, prevention of data leakage, and resistance against prompt injection attacks. It can handle, for example, credit card numbers, person names, phone numbers, e-mail addresses and others. It is worth mentioning that LLM-Guard uses Presidio.