Checking PII data on large datasets
This repository contains a simple PySpark notebook that reads thought each line of a given Spark dataframe to extract all pii values. There is a check at two levels, the first one is at the column name level and the second one consists on checking each cell all the dataframe. Get started by cloning the repo:
git clone https://github.com/markthebault/pii-check-spark.git
Interpret the PII check
If the check at the column level is positive, that does not necessarily means that the dataframe contains PII data. In case of a cell check level positive there is a very good chance that this dataframe contains pii data
The algorithm used
Columns
The Pii check consist in first checking the columns, typical columns names are used for pii value (such as name, address, phone….). The algorithms read the column name and perform a ratio of similarity to the tipical Pii data columns names. If the results is superior at 0.85 percent then we consider than the column is a PII column.
The cells
To check if a cell contains a PII value, the algorithm runs it against some REGEX to check the value: - Name check - Email - Phone number - Street addresses - IPs - Credit card number In order to check the Names of the a person, in the python code the is a list of provided Names. Of course if the name is not in the list, it can not be considered as PII information.
Limitations of this algorithm
This code is written with spark, so it scales as long as your cluster does. Checking large datasets > 1TB can be very expensive and very long. Note that the code is not optimized to have the maximum performances.
Currently the number of checks are not sufficient to be certain if there is no PII value undetected. I will advise you to add some other checks (For instance AWS Macie define those checks):
Name | Classification | Minimum number of matches | Risk |
---|---|---|---|
Arista network configuration | Regex | 1 | 7 |
BBVA Compass Routing Number - California | Regex | 1 | 1 |
Bank of America Routing Numbers - California | Regex | 10 | 1 |
Box Links | Regex | 1 | 3 |
CVE Number | Regex | 1 | 3 |
California Drivers License | Regex | 10 | 1 |
Chase Routing Numbers - California | Regex | 50 | 1 |
Cisco Router Config | Regex | 3 | 9 |
Citibank Routing Numbers - California | Regex | 1 | 1 |
DSA Private Key | Regex | 1 | 8 |
Dropbox Links | Regex | 1 | 3 |
EC Private Key | Regex | 1 | 8 |
Encrypted DSA Private Key | Regex | 1 | 3 |
Encrypted EC Private Key | Regex | 1 | 3 |
Encrypted Private Key | Regex | 1 | 3 |
Encrypted PuTTY SSH DSA Key | Regex | 1 | 3 |
Encrypted PuTTY SSH RSA Key | Regex | 1 | 3 |
Encrypted RSA Private Key | Regex | 1 | 3 |
Google Application Identifier | Regex | 1 | 2 |
HIPAA PHI National Drug Code | Regex | 2 | 2 |
Huawei config file | Regex | 1 | 8 |
Individual Taxpayer Identification Numbers (ITIN) | Regex | 100 | 4 |
John the Ripper | Regex | 1 | 1 |
KeePass 1.x CSV Passwords | Regex | 1 | 8 |
KeePass 1.x XML Passwords | Regex | 1 | 8 |
Large number of US Phone Numbers | Regex | 100 | 1 |
Large number of US Zip Codes | Regex | 100 | 3 |
Lightweight Directory Access Protocol | Regex | 3 | 2 |
Metasploit Module | Regex | 1 | 6 |
MySQL database dump | Regex | 1 | 7 |
MySQLite database dump | Regex | 1 | 7 |
Network Proxy Auto-Config | Regex | 1 | 3 |
Nmap Scan Report | Regex | 1 | 7 |
PGP Header | Regex | 1 | 5 |
PGP Private Key Block | Regex | 1 | 8 |
PKCS7 Encrypted Data | Regex | 1 | 5 |
Password etc passwd | Regex | 4 | 8 |
Password etc shadow | Regex | 4 | 8 |
PlainText Private Key | Regex | 1 | 8 |
PuTTY SSH DSA Key | Regex | 1 | 8 |
PuTTY SSH RSA Key | Regex | 1 | 8 |
Public Key Cryptography System (PKCS) | Regex | 1 | 3 |
Public encrypted key | Regex | 1 | 1 |
RSA Private Key | Regex | 1 | 8 |
SSL Certificate | Regex | 1 | 3 |
SWIFT Codes | Regex | 2 | 4 |
Samba Password config file | Regex | 1 | 7 |
Simple Network Management Protocol Object Identifier | Regex | 1 | 5 |
Slack 2FA Backup Codes | Regex | 1 | 8 |
UK Drivers License Numbers | Regex | 50 | 4 |
UK Passport Number | Regex | 5 | 1 |
USBank Routing Numbers - California | Regex | 50 | 1 |
United Bank Routing Number - California | Regex | 1 | 1 |
Wells Fargo Routing Numbers - California | Regex | 10 | 1 |
aws_access_key | Regex | 1 | 3 |
aws_credentials_context | Regex | 1 | 3 |
aws_secret_key | Regex | 1 | 10 |
facebook_secret | Regex | 1 | 8 |
github_key | Regex | 1 | 8 |
google_two_factor_backup | Regex | 1 | 8 |
heroku_key | Regex | 1 | 7 |
microsoft_office_365_oauth_context | Regex | 1 | 1 |
pgSQL Connection Information | Regex | 1 | 2 |
slack_api_key | Regex | 1 | 7 |
slack_api_token | Regex | 1 | 8 |
ssh_dss_public | Regex | 1 | 1 |
ssh_rsa_public | Regex | 1 | 1 |
Speed up the process
The recommendation that I will make is of course to improve the performances of the algorithm but this can be an hard task. The other solution will take a large amount of rows like 200k and test the Pii data on those rows. Of course if you need to comply against GDPR you need to read all rows of the dataframe. Even reading through all rows I don’t guaranty the output of this algortim is 100% objective.
Conclusion
Asserting that the data does not contains any PII data is not a simple task and it is expensive. I strongly recommend if you have the budget for that to use AWS Macie. Or having strong DataScientist that can improve this algorithm.