Set difference of two lists using BASH shell

Recently a handful of e-mail messages went undelivered due to some mis-communication between 2 servers.

One server had a record of all the addresses it thought it sent to over the period of time in question, and the other server a record of all the addresses to which it had actually delivered (including messages from several other servers).

I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).

I had two files:

  • possibly-delivered.txt
  • definitely-delivered.txt

First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not exist in any of the lines I was looking for, I removed all the lines using sed (stream editor):

sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt

I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):

wc -l possibly-delivered-edited.txt

204 lines returned. Great! Now, how to compare the 2 files to get only the results I wanted?

With a little help from Set Operations in the Unix Shell I found what I needed–comm (compare):

comm -23 possibly-delivered-edited.txt definitely-delivered.txt

However, comm warned me that the 2 files were not in sorted order, so first I had to sort them:

sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt

comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looks like the sed command had converted my Windows linebreaks to Unix linebreaks, so I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt

comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)