GDPR Compliant? Let's check!

A Spark and Jupyter walkthrough for scanning datasets for personally identifiable information before GDPR trouble starts.

  • bigdata
  • gdpr
  • spark
  • jupyter

Checking PII data on large datasets

This repository contains a simple PySpark notebook that reads through each row of a given Spark dataframe to extract all PII values. The check runs at two levels: first on the column names, then on each cell of the dataframe. Get started by cloning the repo:

git clone https://github.com/markthebault/pii-check-spark.git

Interpret the PII check

A positive column-level check does not necessarily mean that the dataframe contains PII: a column can have a PII-like name without holding personal values. A positive cell-level check, on the other hand, means there is a very good chance that the dataframe contains PII.

The algorithm used

Columns

The PII check starts with the column names. PII values usually sit in columns with typical names, such as name, address, and phone. The algorithm computes a similarity ratio between each column name and a list of typical PII column names. If the ratio is higher than 0.85, the column is considered a PII column.
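To make the column check concrete, here is a minimal sketch of the idea. It assumes the similarity ratio comes from Python's standard difflib.SequenceMatcher, which may not be what the notebook uses, and the reference list PII_COLUMN_NAMES and the helper names are hypothetical. Only the 0.85 threshold comes from the description above.

```python
from difflib import SequenceMatcher

# Hypothetical reference list; the notebook's actual list may differ.
PII_COLUMN_NAMES = ["name", "first_name", "last_name", "address", "phone", "email"]
THRESHOLD = 0.85  # threshold quoted in the text


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_pii_column(column: str) -> bool:
    """True if the column name is close enough to any typical PII column name."""
    return any(similarity(column, ref) > THRESHOLD for ref in PII_COLUMN_NAMES)


def pii_columns(columns):
    """Keep only the column names flagged as likely PII columns."""
    return [c for c in columns if is_pii_column(c)]
```

With a Spark dataframe df, calling pii_columns(df.columns) would return the suspicious column names, since df.columns is a plain list of strings. A fuzzy ratio catches small variations such as "phones", but it will miss abbreviations like "tel".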

Cells

To check if a cell contains a PII value, the algorithm runs it against some regexes:

  • Name check
  • Email
  • Phone number
  • Street addresses
  • IPs
  • Credit card number
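As an illustration, a few of these cell checks could look like the sketch below. The regular expressions are simplified assumptions of mine, not the ones shipped in the notebook, and the function names are hypothetical. Phone numbers and street addresses are left out because their formats depend heavily on the country. The credit card check adds a Luhn checksum to cut down on false positives from arbitrary long numbers.

```python
import re

# Simplified, assumed patterns; real-world checks need more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"  # 0 to 255
IPV4_RE = re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of `number` (non-digits are ignored)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def cell_pii_types(value) -> set:
    """Return the set of PII categories detected in a single cell value."""
    text = "" if value is None else str(value)
    found = set()
    if EMAIL_RE.search(text):
        found.add("email")
    if IPV4_RE.search(text):
        found.add("ip")
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        found.add("credit_card")
    return found
```

Returning a set of categories rather than a single boolean makes it easier to report which kind of PII was found in which column.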

To check for a person’s name, the Python code uses a provided list of names. Of course, if a name is not in the list, it will not be detected as PII.
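The name lookup can be sketched as a simple set membership test. The tiny KNOWN_NAMES set below is a hypothetical stand-in for the provided list, and the tokenization is my own assumption.

```python
import re

# Hypothetical stand-in for the provided list of names.
KNOWN_NAMES = frozenset({"alice", "bob", "marie", "john"})
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def contains_known_name(value, names=KNOWN_NAMES) -> bool:
    """True if any whole word in the cell matches a name from the list."""
    text = "" if value is None else str(value)
    return any(token.lower() in names for token in TOKEN_RE.findall(text))
```

Matching whole words avoids flagging "bobcat" because of "bob", but the limitation stays the same: a name that is missing from the list, such as "Zoltan" here, goes undetected.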

Limitations of this algorithm

This code is written with Spark, so it scales as far as your cluster does. Checking large datasets (more than 1 TB) can be very expensive and take a very long time. Note that the code is not optimized for maximum performance.

Currently, the number of checks is too small to be certain that no PII value goes undetected. I would advise you to add other checks. For instance, AWS Macie defines these checks:

| Name | Classification | Minimum number of matches | Risk |
| --- | --- | --- | --- |
| Arista network configuration | Regex | 1 | 7 |
| BBVA Compass Routing Number - California | Regex | 1 | 1 |
| Bank of America Routing Numbers - California | Regex | 10 | 1 |
| Box Links | Regex | 1 | 3 |
| CVE Number | Regex | 1 | 3 |
| California Drivers License | Regex | 10 | 1 |
| Chase Routing Numbers - California | Regex | 50 | 1 |
| Cisco Router Config | Regex | 3 | 9 |
| Citibank Routing Numbers - California | Regex | 1 | 1 |
| DSA Private Key | Regex | 1 | 8 |
| Dropbox Links | Regex | 1 | 3 |
| EC Private Key | Regex | 1 | 8 |
| Encrypted DSA Private Key | Regex | 1 | 3 |
| Encrypted EC Private Key | Regex | 1 | 3 |
| Encrypted Private Key | Regex | 1 | 3 |
| Encrypted PuTTY SSH DSA Key | Regex | 1 | 3 |
| Encrypted PuTTY SSH RSA Key | Regex | 1 | 3 |
| Encrypted RSA Private Key | Regex | 1 | 3 |
| Google Application Identifier | Regex | 1 | 2 |
| HIPAA PHI National Drug Code | Regex | 2 | 2 |
| Huawei config file | Regex | 1 | 8 |
| Individual Taxpayer Identification Numbers (ITIN) | Regex | 100 | 4 |
| John the Ripper | Regex | 1 | 1 |
| KeePass 1.x CSV Passwords | Regex | 1 | 8 |
| KeePass 1.x XML Passwords | Regex | 1 | 8 |
| Large number of US Phone Numbers | Regex | 100 | 1 |
| Large number of US Zip Codes | Regex | 100 | 3 |
| Lightweight Directory Access Protocol | Regex | 3 | 2 |
| Metasploit Module | Regex | 1 | 6 |
| MySQL database dump | Regex | 1 | 7 |
| MySQLite database dump | Regex | 1 | 7 |
| Network Proxy Auto-Config | Regex | 1 | 3 |
| Nmap Scan Report | Regex | 1 | 7 |
| PGP Header | Regex | 1 | 5 |
| PGP Private Key Block | Regex | 1 | 8 |
| PKCS7 Encrypted Data | Regex | 1 | 5 |
| Password etc passwd | Regex | 4 | 8 |
| Password etc shadow | Regex | 4 | 8 |
| PlainText Private Key | Regex | 1 | 8 |
| PuTTY SSH DSA Key | Regex | 1 | 8 |
| PuTTY SSH RSA Key | Regex | 1 | 8 |
| Public Key Cryptography System (PKCS) | Regex | 1 | 3 |
| Public encrypted key | Regex | 1 | 1 |
| RSA Private Key | Regex | 1 | 8 |
| SSL Certificate | Regex | 1 | 3 |
| SWIFT Codes | Regex | 2 | 4 |
| Samba Password config file | Regex | 1 | 7 |
| Simple Network Management Protocol Object Identifier | Regex | 1 | 5 |
| Slack 2FA Backup Codes | Regex | 1 | 8 |
| UK Drivers License Numbers | Regex | 50 | 4 |
| UK Passport Number | Regex | 5 | 1 |
| USBank Routing Numbers - California | Regex | 50 | 1 |
| United Bank Routing Number - California | Regex | 1 | 1 |
| Wells Fargo Routing Numbers - California | Regex | 10 | 1 |
| aws_access_key | Regex | 1 | 3 |
| aws_credentials_context | Regex | 1 | 3 |
| aws_secret_key | Regex | 1 | 10 |
| facebook_secret | Regex | 1 | 8 |
| github_key | Regex | 1 | 8 |
| google_two_factor_backup | Regex | 1 | 8 |
| heroku_key | Regex | 1 | 7 |
| microsoft_office_365_oauth_context | Regex | 1 | 1 |
| pgSQL Connection Information | Regex | 1 | 2 |
| slack_api_key | Regex | 1 | 7 |
| slack_api_token | Regex | 1 | 8 |
| ssh_dss_public | Regex | 1 | 1 |
| ssh_rsa_public | Regex | 1 | 1 |
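If you want to add checks in the spirit of this list, one possible shape is a small record holding a regex, a minimum number of matches, and a risk score. The sketch below is an assumption of mine and not how Macie is implemented: the check names, minimum matches, and risk values are taken from the table above, while both regular expressions are commonly used approximations that I chose for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    pattern: re.Pattern
    min_matches: int  # matches required before the check fires
    risk: int         # risk score


# The regexes are illustrative assumptions, not Macie's definitions.
AWS_ACCESS_KEY = Check(
    name="aws_access_key",
    pattern=re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    min_matches=1,
    risk=3,
)
US_ZIP_CODES = Check(
    name="Large number of US Zip Codes",
    pattern=re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    min_matches=100,
    risk=3,
)


def count_matches(check: Check, values) -> int:
    """Total number of pattern matches across all non-null values."""
    return sum(
        len(check.pattern.findall(str(v))) for v in values if v is not None
    )


def is_triggered(check: Check, values) -> bool:
    """A check fires only once it reaches its minimum number of matches."""
    return count_matches(check, values) >= check.min_matches
```

The minimum number of matches is what keeps a noisy pattern usable: a single five-digit number proves nothing, while a hundred of them in one dataset start to look like zip codes.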

Speed up the process

My recommendation is to improve the performance of the algorithm, but this can be a hard task. The other solution is to take a large sample of rows, such as 200k, and run the PII check on that sample only. Of course, if you need to comply with GDPR, you need to read all rows of the dataframe. Even when reading through all rows, I don’t guarantee that the output of this algorithm is 100% accurate.
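The sampling approach can be sketched as follows. The helper sample_fraction is hypothetical. It only computes the fraction to pass to PySpark's DataFrame.sample, shown in the comments with an assumed dataframe named df.

```python
def sample_fraction(total_rows: int, target_rows: int = 200_000) -> float:
    """Fraction of rows to sample so that roughly `target_rows` rows are kept."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, target_rows / total_rows)


# Usage with PySpark, assuming `df` is the dataframe to check:
#   total = df.count()
#   sample_df = df.sample(fraction=sample_fraction(total), seed=42)
#   ... run the column-level and cell-level checks on sample_df ...
```

Note that DataFrame.sample returns approximately the requested fraction, not an exact row count, and that a clean sample says nothing about the rows that were not read.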

Conclusion

Asserting that a dataset does not contain any PII is not a simple task, and it is expensive. If you have the budget, I strongly recommend using AWS Macie or having a strong data scientist improve this algorithm.