{"id":926,"date":"2013-03-05T18:53:25","date_gmt":"2013-03-05T23:53:25","guid":{"rendered":"http:\/\/osric.com\/chris\/accidental-developer\/?p=926"},"modified":"2013-03-05T18:54:24","modified_gmt":"2013-03-05T23:54:24","slug":"set-difference-of-two-lists-using-bash-shell","status":"publish","type":"post","link":"https:\/\/osric.com\/chris\/accidental-developer\/2013\/03\/set-difference-of-two-lists-using-bash-shell\/","title":{"rendered":"Set difference of two lists using BASH shell"},"content":{"rendered":"<p>Recently a handful of e-mail messages went undelivered due to some mis-communication between 2 servers.<\/p>\n<p>One server had a record of all the addresses it <em>thought<\/em> it sent to over the period of time in question, and the other server a record of all the addresses to which it had actually delivered (including messages from several other servers).<\/p>\n<p>I had both lists, but what I really wanted was just the <em>set difference<\/em>: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).<\/p>\n<p>I had two files:<\/p>\n<ul>\n<li>possibly-delivered.txt<\/li>\n<li>definitely-delivered.txt<\/li>\n<\/ul>\n<p>First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: &#8220;undelivered&#8221;. Since that term did not exist in any of the lines I was looking for, I removed all the lines using <em>sed<\/em> (stream editor):<\/p>\n<p><code>sed '\/undelivered\/d' possibly-delivered.txt &gt; possibly-delivered-edited.txt<\/code><\/p>\n<p>I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using <em>wc<\/em> (word count):<\/p>\n<p><code>wc -l possibly-delivered-edited.txt<\/code><\/p>\n<p>204 lines returned. Great! 
Now, how to compare the 2 files to get only the results I wanted?<\/p>\n<p>With a little help from <a href=\"http:\/\/www.catonmat.net\/blog\/set-operations-in-unix-shell\/\">Set Operations in the Unix Shell<\/a> I found what I needed&#8211;<em>comm<\/em> (compare):<\/p>\n<p><code>comm -23 possibly-delivered-edited.txt definitely-delivered.txt<\/code><\/p>\n<p>However, comm warned me that the 2 files were not in sorted order, so first I had to <em>sort<\/em> them:<\/p>\n<p><code>sort possibly-delivered-edited.txt &gt; possibly-delivered-edited-sorted.txt<br \/>\nsort definitely-delivered.txt &gt; definitely-delivered-sorted.txt<\/code><\/p>\n<p>Again:<br \/>\n<code>comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt<\/code><\/p>\n<p>This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looked like the sed command had converted my Windows line breaks to Unix line breaks, so I ran a command to put them back:<br \/>\n<code>unix2dos possibly-delivered-edited-sorted.txt<\/code><\/p>\n<p>Again:<br \/>\n<code>comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt<\/code><\/p>\n<p>That returned the addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently a handful of e-mail messages went undelivered due to some mis-communication between 2 servers. 
One server had a record of all the addresses it thought it sent to over the period of time in question, and the other server a record of all the addresses to which it had actually delivered (including messages from &hellip; <a href=\"https:\/\/osric.com\/chris\/accidental-developer\/2013\/03\/set-difference-of-two-lists-using-bash-shell\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Set difference of two lists using BASH shell<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[232],"tags":[197,295,293,297,266,296,294],"class_list":["post-926","post","type-post","status-publish","format-standard","hentry","category-tips-tricks","tag-bash","tag-comm","tag-sed","tag-shell","tag-sort","tag-unix2dos","tag-wc"],"_links":{"self":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/926","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/comments?post=926"}],"version-history":[{"count":5,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/926\/revisions"}],"predecessor-version":[{"id":932,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/926\/revisions\/932"}],"wp:attachment":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/media?parent=926"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/categories?post=
926"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/tags?post=926"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}