Identifying generic emails

I am working on a project with a few colleagues in the politics department, and we had a large number of email addresses to sort through. Some were generic, such as inquiries@example.com, while others were personal, such as gstride@example.com. My first task was to separate the generic from the personal and discard the generic.

I had no idea how to start with this. The basic structure of the data was:

firstname surname email
Greg Stride
Greg Stride
Greg Stride
Greg Stride
Greg Stride
Greg Stride

There were lots of other variables, but these were the ones I ended up using. Eventually I decided that if I already had the name and the email address, I could find some way to compare the two and determine how similar they were.

I found a package called stringdist, which can measure how similar two strings are.

But I needed to clean everything up a bit first. I needed to paste their names into a single string and convert it to lower case.

library(dplyr)  # provides the %>% pipe
df$fullname <- paste0(df$firstname, df$surname) %>% tolower()
firstname surname email fullname
Greg Stride gregstride
Greg Stride gregstride
Greg Stride gregstride
Greg Stride gregstride
Greg Stride gregstride
Greg Stride gregstride

Easy enough. Then I needed to delete everything from the @ onwards in the email address, because the domain would never be helpful. I did this using regular expressions and then made the email prefix all lower case too. I also removed all numbers, because names rarely contain them.

df$short_email <- gsub("@(.*)", "", df$email ) %>% tolower()
df$short_email <- gsub("[[:digit:]]", "", df$short_email)
firstname surname email fullname short_email
Greg Stride gregstride inquiries
Greg Stride gregstride gstride
Greg Stride gregstride gs
Greg Stride gregstride gregstride
Greg Stride gregstride gds
Greg Stride gregstride inq

Finally, I got to the good bit. I used the longest common substring method of the stringdist() function to determine which names and email addresses were the most similar.

library(stringdist)
df$str_dist <- stringdist(df$fullname, df$short_email,
                          method = "lcs")
firstname surname email fullname short_email str_dist
Greg Stride gregstride inquiries 13
Greg Stride gregstride gstride 3
Greg Stride gregstride gs 8
Greg Stride gregstride gregstride 0
Greg Stride gregstride gds 9
Greg Stride gregstride inq 11

This gave me the difference between the two strings. The lower the number, the more similar the email address and the full name were.
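To make the numbers less mysterious, here is a minimal base R sketch of what I understand the "lcs" distance to measure: the number of characters in either string that are not part of their longest common subsequence. The helper names lcs_length() and lcs_distance() are my own, not part of stringdist, and this is an illustration of the idea rather than the package's actual implementation.

```r
# Length of the longest common subsequence of two strings,
# computed with a standard dynamic-programming table.
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # tab[i + 1, j + 1] = LCS length of the first i chars of a
  # and the first j chars of b
  tab <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[length(x) + 1, length(y) + 1]
}

# Distance = characters left unmatched across both strings.
lcs_distance <- function(a, b) {
  nchar(a) + nchar(b) - 2L * lcs_length(a, b)
}

lcs_distance("gregstride", "gstride")    # 3, as in the table
lcs_distance("gregstride", "inquiries")  # 13, as in the table
```

All of gstride appears in order inside gregstride, so only the three leftover characters count. inquiries shares just three characters in order (r, i, e), which leaves 7 + 6 = 13 unmatched.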

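However, the raw distance is biased against shorter email addresses: every character of the name that is missing from the email adds to the distance, so gs scores worse than gstride even though both are clearly personal. To correct for this, I divided the distance by the absolute difference in the number of characters and then arranged the df by the score. An exact match gives 0/0, which R reports as NaN, so I set those scores to 0.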

df$score <- df$str_dist/abs(nchar(df$fullname) - nchar(df$short_email))
df$score[is.na(df$score)] <- 0
firstname surname email fullname short_email str_dist score
Greg Stride gregstride gregstride 0 0.000000
Greg Stride gregstride gstride 3 1.000000
Greg Stride gregstride gs 8 1.000000
Greg Stride gregstride gds 9 1.285714
Greg Stride gregstride inq 11 1.571429
Greg Stride gregstride inquiries 13 13.000000

This is not a perfect way of doing things. As you can see, it causes a few problems with very short emails such as inq and gs, which score fairly close together even though inq is much more likely to be generic. But in general it saves a huge amount of time. Going through 10,000 emails was much faster like this, and it generally has a high level of accuracy.
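Once the scores are in place, the last step is to discard the generic addresses. A rough sketch of how that could look is below. The cut-off of 1.5 is purely hypothetical, chosen to suit this toy table rather than taken from the real analysis, and the likely_generic column name is my own. On real data you would want to eyeball the sorted scores before picking a threshold.

```r
# Toy data copied from the score table above.
df <- data.frame(
  short_email = c("gregstride", "gstride", "gs", "gds", "inq", "inquiries"),
  score = c(0, 1, 1, 9 / 7, 11 / 7, 13),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off: an assumption for illustration, tune it on your data.
cutoff <- 1.5

# Flag anything above the cut-off as probably generic, then keep the rest.
df$likely_generic <- df$score > cutoff
personal <- df[!df$likely_generic, ]

personal$short_email
```

With this cut-off, inq and inquiries are flagged and the other four rows are kept. Borderline rows near the threshold are still worth checking by hand.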

You could easily adapt this to the format of the emails in your dataframe. For example, if you know email addresses will start with initials, such as gs791, you can change your system to compare the string distance between the initials of their names and their email.
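As a sketch of the initials idea, something like the following might work. The rows are made up for illustration, and the initials and initial_dist column names are my own. To keep the example free of extra packages it uses base R's adist(), which computes Levenshtein edit distance rather than the "lcs" method used earlier. You could swap stringdist() back in.

```r
# Illustrative rows only: these addresses are invented for the example.
df <- data.frame(
  firstname = c("Greg", "Greg", "Greg"),
  surname = c("Stride", "Stride", "Stride"),
  email = c("gs791@example.com", "inquiries@example.com", "gds@example.com"),
  stringsAsFactors = FALSE
)

# Build lower-case initials from the first letter of each name.
df$initials <- tolower(paste0(substr(df$firstname, 1, 1),
                              substr(df$surname, 1, 1)))

# Same cleaning as before: drop the domain, lower-case, strip digits.
df$short_email <- tolower(sub("@.*", "", df$email))
df$short_email <- gsub("[[:digit:]]", "", df$short_email)

# adist() returns a matrix of all pairs; the diagonal holds the
# row-by-row distance between each person's initials and their email.
df$initial_dist <- diag(adist(df$initials, df$short_email))

df[order(df$initial_dist), c("short_email", "initials", "initial_dist")]
```

Here gs791 reduces to gs and matches the initials exactly, so it sorts to the top, while inquiries ends up far away.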

I have used three excellent packages for this: kableExtra to make the tables look nice, stringdist, which I mentioned earlier, and dplyr for the pipe and arrange functions.

Dr Greg Stride
Researcher

My research interests include UK elections, election administration and public opinion
