Identifying generic emails
I am working on a project with a few colleagues in the politics department, and we had a large number of email addresses to sort through. Some of them were generic, like email@example.com, whereas others were personal, like firstname.lastname@example.org. My first task was to sort the generic from the personal and discard the generic.
I had no idea how to start with this. The basic structure of the data was:
There were lots of other variables, but these were the ones I ended up using. Eventually I decided that if I already had the name and the email address, I could find some way to compare the two and determine how similar they were.
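As a rough illustration of the columns used in the rest of this post, here is a tiny made-up data frame. The column names firstname, surname and email match the code below; the rows themselves are hypothetical and are not taken from the real data.

```r
# Hypothetical example data; the names and addresses are invented for illustration.
df <- data.frame(
  firstname = c("Jane", "George", "Amy"),
  surname   = c("Doe", "Smith", "Jones"),
  email     = c("Jane.Doe@example.org", "info@example.com", "aj123@example.org"),
  stringsAsFactors = FALSE
)

# Inspect the structure: three character columns
str(df)
```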
I found a package called stringdist which could be used to figure out how similar different strings were.
But I needed to clean everything up a bit first. I needed to paste their names into a single string and convert it to lower case.
library(magrittr)  # provides the %>% pipe (also loaded by dplyr)
df$fullname <- paste0(df$firstname, df$surname) %>% tolower()
Easy enough. Then I needed to delete everything from the @ onwards in the email address, because the domain would never be helpful. I did this using a regular expression and then made the email prefix all lower case too. I also removed all numbers, because names don’t often have numbers in them.
df$short_email <- gsub("@(.*)", "", df$email) %>% tolower()
df$short_email <- gsub("[[:digit:]]", "", df$short_email)
Finally I got to the good bit. I used the longest common substring method of the stringdist() function to determine which names and email addresses were the most similar.
library(stringdist)
df$str_dist <- stringdist(df$fullname, df$short_email, method = "lcs")
This gave me the difference between the two strings. The lower the number, the more similar the email address and the full name were.
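To get a feel for the numbers, here is a small sketch using two made-up pairs. It assumes the stringdist package is installed. With method = "lcs" the distance counts the characters that cannot be paired up between the two strings, so a name that differs from the email prefix only by a dot should score 1, while a generic prefix like info scores much higher.

```r
library(stringdist)  # install.packages("stringdist") if it is not installed

# Hypothetical cleaned values: the same name against a personal and a generic prefix
fullname    <- c("janedoe", "janedoe")
short_email <- c("jane.doe", "info")

d <- stringdist(fullname, short_email, method = "lcs")
print(d)
# expected: 1 7
# "janedoe" vs "jane.doe": only the "." is unpaired
# "janedoe" vs "info": only "n" and "o" pair up, leaving 5 + 2 unpaired characters
```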
However, the difference is biased against shorter names/email addresses, so I divided by the absolute difference in number of characters and then arranged the data frame by the score.
df$score <- df$str_dist / abs(nchar(df$fullname) - nchar(df$short_email))
df$score[is.na(df$score)] <- 0
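It is worth seeing what this arithmetic does when the name and the email prefix have the same length, because the denominator is then zero. The sketch below uses four made-up pairs, with the lcs distances worked out by hand and hard-coded so that it runs in base R without the stringdist package. In R, 0/0 gives NaN (which is.na() catches, so an exact match is set to 0), while a non-zero distance divided by zero gives Inf, which sorts to the bottom.

```r
# Hypothetical cleaned values (not from the real data)
fullname    <- c("janedoe",  "janedoe", "janedoe", "janedoe")
short_email <- c("jane.doe", "info",    "janedoe", "contact")

# lcs distances for these pairs, hand-computed and hard-coded for illustration
str_dist <- c(1, 7, 0, 12)

# Same scoring as above: distance divided by the absolute length difference
score <- str_dist / abs(nchar(fullname) - nchar(short_email))
score[is.na(score)] <- 0  # 0/0 is NaN, which is.na() treats as missing

print(data.frame(short_email, str_dist, score))
# jane.doe: 1 / 1 = 1
# info:     7 / 3 = 2.33...
# janedoe:  0 / 0 = NaN, replaced by 0
# contact:  12 / 0 = Inf (same length, but very different)
```

If the Inf values get in the way of sorting or plotting, one option would be to replace them with a large finite number, although that is a change to the scoring above and not part of it.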
This is not a perfect way of doing things and, as you can see, it causes a few problems with very short emails like gs, one of which is more likely to be generic. But in general it saves a huge amount of time. Going through 10,000 emails was much faster like this, and it generally has a high level of accuracy.
You could easily adapt this to the format of emails in your data frame. For example, if you know email addresses will start with initials, such as gs791, you can change your system to compare the string distance between the initials of their names and their email.
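A minimal sketch of that initials variant might look like the following. The data, and the column names initials and initial_dist, are made up for illustration, and it assumes the stringdist package is installed. It is one possible way of doing this, not a tested recipe.

```r
library(stringdist)

# Hypothetical example data
df <- data.frame(
  firstname = c("George", "Jane"),
  surname   = c("Smith", "Doe"),
  email     = c("gs791@example.org", "info@example.com"),
  stringsAsFactors = FALSE
)

# First letter of each name, pasted together and lower-cased: "gs", "jd"
df$initials <- tolower(paste0(substr(df$firstname, 1, 1),
                              substr(df$surname, 1, 1)))

# Same cleaning as before: drop the domain, lower-case, strip digits
df$short_email <- gsub("[[:digit:]]", "", tolower(gsub("@(.*)", "", df$email)))

# Distance between the initials and the cleaned email prefix
df$initial_dist <- stringdist(df$initials, df$short_email, method = "lcs")

# Lowest distance first: gs791 matches the initials "gs" exactly
print(df[order(df$initial_dist), c("email", "initials", "initial_dist")])
```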