Identifying generic emails

Last updated on Aug 3, 2021

I am working on a project with a few colleagues in the politics department and we had a large number of email addresses to sort through. Some of them were generic, such as inquiries@example.com, while others were personal, such as gstride@example.com. My first task was to separate the generic from the personal and discard the generic.

I had no idea how to start with this. The basic structure of the data was:

firstname	surname	email
Greg	Stride	Inquiries@example.com
Greg	Stride	gstride@example.com
Greg	Stride	gs791@example.com
Greg	Stride	gregstride@example.com
Greg	Stride	gds@example.com
Greg	Stride	inq@example.com

There were lots of other variables, but these were the ones I ended up using. Eventually I decided that if I already had the name and the email address, I could find some way to compare the two and determine how similar they were.

I found a package called stringdist which could be used to figure out how similar different strings were.

But I needed to clean everything up a bit first. I needed to paste their names into a single string and convert it to lower case.

library(dplyr)  # provides the %>% pipe
df$fullname <- paste0(df$firstname, df$surname) %>% tolower()

firstname	surname	email	fullname
Greg	Stride	Inquiries@example.com	gregstride
Greg	Stride	gstride@example.com	gregstride
Greg	Stride	gs791@example.com	gregstride
Greg	Stride	gregstride@example.com	gregstride
Greg	Stride	gds@example.com	gregstride
Greg	Stride	inq@example.com	gregstride

Easy enough. Then I needed to delete the @ and everything after it in the email address, because the domain would never be helpful. I did this using regular expressions and then made the email prefix all lower case too. I also removed all numbers because names rarely contain them.

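# Drop the @ and everything after it, then lower-case what is left
df$short_email <- gsub("@.*", "", df$email) %>% tolower()
# Remove any digits from the email prefix
df$short_email <- gsub("[[:digit:]]", "", df$short_email)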

firstname	surname	email	fullname	short_email
Greg	Stride	Inquiries@example.com	gregstride	inquiries
Greg	Stride	gstride@example.com	gregstride	gstride
Greg	Stride	gs791@example.com	gregstride	gs
Greg	Stride	gregstride@example.com	gregstride	gregstride
Greg	Stride	gds@example.com	gregstride	gds
Greg	Stride	inq@example.com	gregstride	inq

Finally I got to the good bit. I used the longest common substring method of the stringdist() function to determine which names and email addresses were the most similar.

library(stringdist)
# method = "lcs" counts the characters left unpaired between the two strings
df$str_dist <- stringdist(df$fullname, df$short_email,
                          method = "lcs")

firstname	surname	email	fullname	short_email	str_dist
Greg	Stride	Inquiries@example.com	gregstride	inquiries	13
Greg	Stride	gstride@example.com	gregstride	gstride	3
Greg	Stride	gs791@example.com	gregstride	gs	8
Greg	Stride	gregstride@example.com	gregstride	gregstride	0
Greg	Stride	gds@example.com	gregstride	gds	9
Greg	Stride	inq@example.com	gregstride	inq	11

This gave me the difference between the two strings. The lower the number, the more similar the email address and the full name were.
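To see where these numbers come from: as far as I understand it, the lcs distance counts the characters left unpaired once the longest common subsequence of the two strings has been matched up, which works out as nchar(a) + nchar(b) minus twice the length of that subsequence. The sketch below checks this by hand in base R. The helpers lcs_length() and lcs_distance() are names made up for illustration and are not part of stringdist.

```r
# Length of the longest common subsequence of two strings,
# computed with the standard dynamic-programming table.
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # tab[i + 1, j + 1] holds the LCS length of the first i characters of a
  # and the first j characters of b; row/column 1 is the empty prefix.
  tab <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[length(x) + 1, length(y) + 1]
}

# Unpaired characters = all characters minus the paired ones in both strings.
lcs_distance <- function(a, b) {
  nchar(a) + nchar(b) - 2L * lcs_length(a, b)
}

lcs_distance("gregstride", "gstride")    # 3
lcs_distance("gregstride", "inquiries")  # 13
```

For gstride, all 7 characters can be paired with characters of gregstride, leaving 3 unpaired, which matches the table above.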

However, the raw distance is biased against shorter names and email addresses, so I divided it by the absolute difference in the number of characters and then arranged the df by the score.

df$score <- df$str_dist/abs(nchar(df$fullname) - nchar(df$short_email))
# An exact match gives 0/0 = NaN, which is.na() also catches; score it as 0
df$score[is.na(df$score)] <- 0

firstname	surname	email	fullname	short_email	str_dist	score
Greg	Stride	gregstride@example.com	gregstride	gregstride	0	0.000000
Greg	Stride	gstride@example.com	gregstride	gstride	3	1.000000
Greg	Stride	gs791@example.com	gregstride	gs	8	1.000000
Greg	Stride	gds@example.com	gregstride	gds	9	1.285714
Greg	Stride	inq@example.com	gregstride	inq	11	1.571429
Greg	Stride	Inquiries@example.com	gregstride	inquiries	13	13.000000
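The sorting step itself isn't shown above. Here is a minimal sketch in base R on a cut-down copy of the table, with the distances typed in by hand so it runs without stringdist. With dplyr loaded, arrange(df, score) should give the same ordering.

```r
# Cut-down copy of the example data: just the columns needed for scoring.
df <- data.frame(
  short_email = c("inquiries", "gstride", "gs", "gregstride", "gds", "inq"),
  str_dist    = c(13, 3, 8, 0, 9, 11),
  stringsAsFactors = FALSE
)
df$fullname <- "gregstride"

# Same scoring as above: distance divided by the difference in length.
df$score <- df$str_dist / abs(nchar(df$fullname) - nchar(df$short_email))
df$score[is.na(df$score)] <- 0  # 0 / 0 gives NaN for an exact match

# Sort ascending so the most name-like addresses come first.
df <- df[order(df$score), ]
df$short_email
```

One thing to watch for: if the name and the email prefix have the same length but different characters, the division gives Inf rather than NaN, so is.na() leaves it alone and that row sorts to the bottom.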

This is not a perfect way of doing things. As you can see, it causes a few problems with very short email prefixes like inq and gs, which end up with fairly close scores even though one of them is much more likely to be generic. But in general it saves a huge amount of time. Going through 10,000 emails was much faster like this, and it generally has a high level of accuracy.

You could easily adapt this to the format of the emails in your dataframe. For example, if you know email addresses will start with initials, such as gs791, you can change your system to compare the string distance between the initials of their names and their email.
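A rough sketch of that idea, assuming the address starts with the first letter of the first name followed by the first letter of the surname. The initials and initial_dist columns are hypothetical names, and real addresses may also include middle initials, as gds does. To keep the example free of extra packages it uses adist() from base R, which gives the Levenshtein edit distance rather than the lcs distance used above; stringdist() would slot in the same way.

```r
df <- data.frame(
  firstname = "Greg",
  surname   = "Stride",
  email     = c("gs791@example.com", "gds@example.com", "inq@example.com"),
  stringsAsFactors = FALSE
)

# Build lower-case initials from the first letter of each name.
df$initials <- tolower(paste0(substr(df$firstname, 1, 1),
                              substr(df$surname, 1, 1)))

# Same cleaning as before: drop the domain, lower-case, strip digits.
df$short_email <- gsub("[[:digit:]]", "", tolower(sub("@.*", "", df$email)))

# Row-by-row edit distance between the initials and the email prefix.
df$initial_dist <- mapply(function(a, b) adist(a, b)[1, 1],
                          df$initials, df$short_email,
                          USE.NAMES = FALSE)
df$initial_dist
```

Here gs791 comes out with a distance of 0, gds with 1 and inq with 3, so the generic address is clearly separated from the two personal ones.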

I have used three excellent packages for this: kableExtra to make the tables look nice, stringdist, which I mentioned earlier, and dplyr for the pipe and arrange functions.


Dr Greg Stride
