Artificial intelligence (AI) is everywhere, and almost every organization is either using AI technology or developing its own AI innovations. However, AI initiatives rely on data, and that data is increasingly protected by government regulations around the world.
In particular, the European Union General Data Protection Regulation (GDPR), which applies to any organization that handles the personal data of people in the EU, has strict rules about how any data pertaining to living individuals can be collected, stored, and processed. This is why anyone practicing AI should be aware of the legal implications of the data they collect and use.
What is GDPR Compliance?
The GDPR is a law that updates and unifies the data privacy laws of the European Union (EU), replacing the 1995 EU Data Protection Directive. It entered into force on 25 May 2018. GDPR obligations apply to the personal data of individuals in the EU, regardless of whether the organization collecting the data is located in the EU. The GDPR also defines penalties for non-compliance.
The main focus of GDPR compliance is on maintaining business transparency and extending the rights of “data subjects” – the individuals whose data organizations collect as part of their business activity. When a data breach is detected, the GDPR requires businesses to notify the supervisory authority within 72 hours and affected individuals without undue delay.
The purpose of the GDPR is to protect individuals and their personally identifiable information (PII), ensuring that organizations that collect this data, known as data controllers, do it in a responsible way. GDPR also aims to ensure personal data is secure, protected from risks like unauthorized access and accidental loss.
GDPR Compliance Challenges of AI Projects
Purpose Limitation
The GDPR’s purpose limitation principle requires organizations to inform data subjects of the purpose for which their data is collected and processed. This can be an obstacle for AI-powered solutions.
AI uses data to identify patterns and generate new insights, but this may not be the purpose for which the raw data was originally collected. Unless the data controller obtains further consent for AI analysis, the GDPR allows further processing only if the new purpose is compatible with the original one.
Fairness and Discrimination
GDPR stipulates that the interests of data subjects must be taken into account when processing data. It also requires that data controllers take steps to prevent discriminatory impacts on individuals. This also applies to machine learning teams that store and use data for AI projects.
Many machine learning systems produce discriminatory outcomes because they are trained on biased data. ML teams need to learn how to mitigate this bias; otherwise they risk non-compliance with the GDPR.
Data Minimization
The GDPR stipulates that the data collected should be “adequate, relevant and limited to what is necessary”. This means ML teams should think twice before using data for their models. Engineers need to determine the type and quantity of data the project actually requires. This is not always predictable, so teams must continually reevaluate the type and amount of data needed in order to meet the data minimization requirement.
Transparency
The GDPR gives data subjects the power to determine how their data is used by third-party controllers. This means organizations must be open and transparent about why and for what purpose they collect data. However, the nature of machine learning development makes this difficult: most AI models are black boxes, and it is not always clear how they make decisions, especially in advanced applications.
How to Develop Compliance-friendly AI
Make Sure You’re Allowed to Use the Data
The GDPR requires privacy by design, meaning that privacy should be implemented by default, and respect for user privacy should be a guiding principle of system design. This means that if an application collects data that contains PII, it should minimize the amount of data collected, specify the exact purpose of the data, and limit the retention period.
GDPR also requires positive consent for the collection and use of personal data. This means the user must explicitly grant permission to use the data for a specific purpose. Even open-source datasets might contain personal data such as social security numbers. When using this type of data, it is important to anonymize or mask it to avoid compliance risk.
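As a minimal sketch of this kind of masking (the regex patterns, placeholder tags, and sample record below are illustrative assumptions, not an exhaustive PII detector), direct identifiers can be replaced with neutral tokens before data enters a training pipeline:

```python
import re

# Hypothetical example: scrub SSN- and email-shaped tokens from free-text
# records before they are used for training. The patterns are deliberately
# simple; a production system would detect many more identifier types.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def scrub(text: str) -> str:
    """Replace SSN- and email-shaped tokens with placeholder tags."""
    text = SSN_PATTERN.sub("[REDACTED-SSN]", text)
    return EMAIL_PATTERN.sub("[REDACTED-EMAIL]", text)

record = "Contact jane.doe@example.com, SSN 123-45-6789, re: claim #42"
print(scrub(record))
# Contact [REDACTED-EMAIL], SSN [REDACTED-SSN], re: claim #42
```

Note that pattern-based scrubbing only catches identifiers with a predictable shape; free-form names or addresses typically need dedicated tooling or manual review.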
Check if Synthetic Data is Sufficiently Different from Original Data
To comply with the GDPR, training datasets must be cleaned before training by removing personal identifiers that could violate user rights. In many cases, however, this makes the dataset less useful for analysis: AI model training thrives on context, so models trained on “scrubbed” datasets can be less accurate and effective.
A novel solution is to generate synthetic data: data that is not based on real people or events but still appears realistic, used to augment AI/ML training. For example, a synthetic dataset could be a list of names, addresses, and social security numbers that resemble real details but do not belong to any real person.
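A toy sketch of the idea, using only the Python standard library (the name pools and field formats are invented for the example; real projects typically use dedicated synthetic-data libraries such as Faker or trained generative models):

```python
import random

# Generate records that look like real customer data but are sampled at
# random, so they describe no actual person. Purely illustrative.
FIRST = ["Alex", "Sam", "Jordan", "Taylor"]
LAST = ["Smith", "Jones", "Garcia", "Chen"]

def synthetic_record(rng: random.Random) -> dict:
    return {
        "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
        "ssn": f"{rng.randint(100, 899):03d}-{rng.randint(10, 99):02d}-{rng.randint(1000, 9999):04d}",
        "zip": f"{rng.randint(10000, 99999)}",
    }

rng = random.Random(0)  # fixed seed so runs are reproducible
dataset = [synthetic_record(rng) for _ in range(3)]
for row in dataset:
    print(row)
```

One caveat, reflected in this section's title: naive generators can accidentally reproduce real values (a randomly generated SSN may belong to someone), so synthetic datasets should be checked for overlap with the original data before use.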
Ensure AI Models Are Explainable
Explainable AI (XAI) methods help mitigate the black-box effect of AI. The goal of explainable AI is to help humans understand what is happening inside AI systems – in other words, to enable AI models to explain their decisions.
Explainable AI can also provide insights into a model’s abilities and help understand future behavior. While it does not reduce the need for data, it can help researchers understand the exact data they need to improve model accuracy, helping to comply with data minimization requirements.
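To make the idea concrete, here is a minimal sketch of a post-hoc explanation for a linear scoring model: each feature's contribution is its weight times its value, so a human can see which inputs drove a decision. The feature names and weights are invented for illustration; non-linear models need dedicated XAI tooling such as SHAP or LIME.

```python
# Hypothetical linear credit-scoring model: score = sum(weight * value).
# Per-feature contributions are themselves the explanation.
weights = {"income": 0.4, "debt_ratio": -0.7, "years_employed": 0.2}
applicant = {"income": 1.2, "debt_ratio": 0.9, "years_employed": 0.5}

contributions = {f: weights[f] * applicant[f] for f in weights}
score = sum(contributions.values())

# Report contributions sorted by magnitude, largest driver first.
for feature, c in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature:>15}: {c:+.2f}")
print(f"{'score':>15}: {score:+.2f}")
```

A breakdown like this also supports data minimization: features that consistently contribute little are candidates for removal from the dataset.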
Respond to Breaches of Personal Data
The GDPR sets several rules on how businesses should handle personal data, including a well-documented incident response process for data breaches.
These incident response guidelines require businesses to notify the GDPR supervisory authority within 72 hours of becoming aware of an incident. The notice must include the type and amount of data involved, the approximate number of people affected, the contact information of the data protection officer (DPO), the likely consequences of the breach, and the organization’s incident response and mitigation plan.
The GDPR also requires the organization to notify affected individuals without undue delay if an incident is likely to put them at high risk.
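One way to prepare is to keep breach notifications structured so nothing required is missing when the clock starts. The sketch below models the notification fields listed above as a Python dataclass; the class and field names are our own illustration, not part of any official schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Structured record covering the fields the GDPR expects in a notification
# to the supervisory authority. Illustrative only.
@dataclass
class BreachNotification:
    detected_at: datetime
    data_categories: list    # type and amount of data involved
    affected_subjects: int   # approximate number of people affected
    dpo_contact: str         # data protection officer contact details
    likely_consequences: str # likely consequences of the breach
    mitigation_plan: str     # response and mitigation measures

    def notification_deadline(self) -> datetime:
        """Notify the supervisory authority within 72 hours of detection."""
        return self.detected_at + timedelta(hours=72)

incident = BreachNotification(
    detected_at=datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc),
    data_categories=["email addresses", "hashed passwords"],
    affected_subjects=10_000,
    dpo_contact="dpo@example.com",
    likely_consequences="credential-stuffing risk for affected accounts",
    mitigation_plan="force password reset; rotate keys; notify users",
)
print(incident.notification_deadline().isoformat())
```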
To meet these requirements, organizations must be ready to take immediate action to respond to an incident. This requires:
- Visibility over PII – making it clear how this data is collected, stored and used.
- Strict security measures – security controls that can protect data and prevent breaches.
- Ownership – identifying who in the organization owns the incident response process, and ensuring they understand and have the means to comply with regulatory requirements.
Keep Track of All Personal Data Collected
Data protection regulations, including GDPR, often require organizations to understand the location and use of all collected PII. Accurate data classification is essential to comply with the user’s right to be forgotten and the right to access personal information. The organization must have a way to know what personal information is contained in a data set, to understand the security measures needed to protect that information.
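A simple starting point for this kind of classification is to scan tabular data for PII-shaped values so the affected columns can be located for access or erasure requests. In the sketch below, the patterns and sample rows are illustrative assumptions, not production-grade detectors:

```python
import re

# Flag which columns of a tabular dataset contain PII-shaped values.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_columns(rows: list) -> dict:
    """Map each column name to the set of PII types seen in its values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.match(str(value)):
                    findings.setdefault(column, set()).add(pii_type)
    return findings

rows = [
    {"user": "a@example.com", "tax_id": "123-45-6789", "plan": "pro"},
    {"user": "b@example.com", "tax_id": "987-65-4321", "plan": "free"},
]
print(classify_columns(rows))
# {'user': {'email'}, 'tax_id': {'ssn'}}
```

An inventory like this can then drive both the security controls applied to each column and the lookups needed to honor access and erasure requests.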
Conclusion
In this article, I explained the compliance challenges of AI projects, especially with respect to the GDPR, and provided several best practices that can help you make your AI initiative legally compliant:
- Make sure you’re allowed to use the data
- Check if synthetic data is sufficiently different from original data
- Ensure AI models are explainable
- Immediately respond to breaches of personal data
- Keep track of personal data collected
I hope this will be useful as you continue with AI innovation while respecting privacy rights and minimizing compliance risk for your organization.