Set difference of two lists using BASH shell

Recently a handful of e-mail messages went undelivered due to some mis-communication between 2 servers.

One server had a record of all the addresses it thought it sent to over the period of time in question, and the other server a record of all the addresses to which it had actually delivered (including messages from several other servers).

I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).

I had two files:

  • possibly-delivered.txt
  • definitely-delivered.txt

First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not exist in any of the lines I was looking for, I removed all the lines using sed (stream editor):

sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt

I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):

wc -l possibly-delivered-edited.txt

204 lines returned. Great! Now, how to compare the 2 files to get only the results I wanted?

With a little help from Set Operations in the Unix Shell I found what I needed–comm (compare):

comm -23 possibly-delivered-edited.txt definitely-delivered.txt

However, comm warned me that the 2 files were not in sorted order, so first I had to sort them:

sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looks like the sed command had converted my Windows linebreaks to Unix linebreaks, so I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)

One thought on “Set difference of two lists using BASH shell”

  1. Thanks for the tip about sorting the inputs, was banging my head against the wall just now and that was it!

Leave a Reply

Your email address will not be published. Required fields are marked *