Recently a handful of e-mail messages went undelivered due to a miscommunication between two servers.
One server had a record of all the addresses it thought it had sent to over the period of time in question, and the other server had a record of all the addresses to which it had actually delivered (including messages from several other servers).
I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered.)
I had two files:
- possibly-delivered.txt
- definitely-delivered.txt
First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not appear in any of the lines I was looking for, I removed all of those lines using sed (stream editor):
sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt
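To see what that sed expression does, here is a minimal sketch on made-up data. The sample addresses and the temporary file name are assumptions for illustration only; they are not from my real logs. It also shows that grep -v should give the same result for a simple pattern like this one.

```shell
# Hypothetical sample data standing in for possibly-delivered.txt.
workdir=$(mktemp -d)
printf '%s\n' \
  'alice@example.com' \
  'status: undelivered, will retry' \
  'bob@example.com' \
  'undelivered' > "$workdir/sample.txt"

# /undelivered/ selects every line containing that string; d deletes it.
sed_result=$(sed '/undelivered/d' "$workdir/sample.txt")

# grep -v prints only the lines that do NOT match, so the output is the same here.
grep_result=$(grep -v 'undelivered' "$workdir/sample.txt")

printf '%s\n' "$sed_result"
rm -r "$workdir"
```

Only the two address lines survive. Either tool works; sed is handy if you expect to add more edits to the same command later.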
I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):
wc -l possibly-delivered-edited.txt
204 lines returned. Great! Now, how to compare the two files to get only the results I wanted?
With a little help from Set Operations in the Unix Shell, I found what I needed in comm (which compares two sorted files line by line):
comm -23 possibly-delivered-edited.txt definitely-delivered.txt
However, comm warned me that the two files were not in sorted order, so first I had to sort them:
sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt
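Why the sorting matters: comm walks both files in step, so it only works if the two inputs are in the same order. Here is a small sketch on made-up lists (the file names and addresses are hypothetical). Setting LC_ALL=C is my own precaution rather than something I needed above; it makes sort and comm agree on one collation order.

```shell
workdir=$(mktemp -d)
# Hypothetical, deliberately unsorted sample lists.
printf '%s\n' 'carol@example.com' 'alice@example.com' 'bob@example.com' > "$workdir/sent.txt"
printf '%s\n' 'dave@example.com' 'bob@example.com' > "$workdir/delivered.txt"

# Sort both files under the same collation so comm sees a consistent order.
LC_ALL=C sort "$workdir/sent.txt" > "$workdir/sent-sorted.txt"
LC_ALL=C sort "$workdir/delivered.txt" > "$workdir/delivered-sorted.txt"

# -2 hides lines unique to the second file, -3 hides lines common to both,
# leaving only the lines unique to the first file.
only_in_sent=$(LC_ALL=C comm -23 "$workdir/sent-sorted.txt" "$workdir/delivered-sorted.txt")
printf '%s\n' "$only_in_sent"
rm -r "$workdir"
```

This prints alice@example.com and carol@example.com: the addresses in the first list that never show up in the second.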
Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt
This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looked like the sed command had converted my Windows line breaks to Unix line breaks, so I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt
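Line endings matter here because comm compares lines byte for byte, so a trailing carriage return is enough to make two otherwise identical addresses look different. The sketch below uses made-up one-line files (hypothetical names) to show the effect, and normalizes in the other direction with tr, which is an alternative to unix2dos rather than what I actually ran.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: the same address with Unix (LF) and Windows (CRLF) endings.
printf 'alice@example.com\n' > "$workdir/unix.txt"
printf 'alice@example.com\r\n' > "$workdir/windows.txt"

# comm -12 prints only the lines common to both files.
# The trailing carriage return makes the two lines differ, so nothing matches.
before=$(( $(comm -12 "$workdir/unix.txt" "$workdir/windows.txt" | wc -l) ))

# Strip carriage returns so both files use Unix line endings.
tr -d '\r' < "$workdir/windows.txt" > "$workdir/windows-lf.txt"
after=$(( $(comm -12 "$workdir/unix.txt" "$workdir/windows-lf.txt" | wc -l) ))

echo "common lines before: $before, after: $after"
rm -r "$workdir"
```

Whichever direction you convert, the point is that both files need to end up with the same kind of line ending before you compare them.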
Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt
That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)
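In hindsight, the intermediate files are optional. As a rough sketch, assuming made-up sample data and hypothetical file names, the filtering, line-ending cleanup, sorting, and comparison can be chained into one pipeline, with comm reading the first list from standard input via "-":

```shell
workdir=$(mktemp -d)
# Hypothetical sample data with Windows line endings; the real inputs would be
# possibly-delivered.txt and definitely-delivered.txt.
printf '%s\r\n' 'carol@example.com' 'undelivered notice' 'alice@example.com' 'bob@example.com' > "$workdir/possibly.txt"
printf '%s\r\n' 'dave@example.com' 'bob@example.com' > "$workdir/definitely.txt"

# Normalize line endings and sort the second list once.
tr -d '\r' < "$workdir/definitely.txt" | LC_ALL=C sort > "$workdir/definitely-sorted.txt"

# Filter, normalize, and sort the first list, then hand it to comm on stdin ("-").
never_delivered=$(sed '/undelivered/d' "$workdir/possibly.txt" \
  | tr -d '\r' \
  | LC_ALL=C sort \
  | LC_ALL=C comm -23 - "$workdir/definitely-sorted.txt")

printf '%s\n' "$never_delivered"
rm -r "$workdir"
```

I would still keep the wc -l sanity check as a separate step when working with real data, since it is the quickest way to notice that a filter removed too much or too little.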
Thanks for the tip about sorting the inputs, was banging my head against the wall just now and that was it!