Removing exceptions from a list using Bash (with sed and awk)

  • I have a CSV file, a list of 1000+ users and user properties.
  • I have a list of exceptions (users to be excluded from processing), one user per line, about 50 total.

How can I remove the exceptions from the list?

# make a copy of the original list
cp list-of-1000.csv list-of-1000-less-exceptions.csv
# loop through each line in exceptions.txt and remove matching lines from the copy
while IFS= read -r line; do sed -i "/${line}/d" list-of-1000-less-exceptions.csv; done < exceptions.txt

This is a little simplistic: the pattern matches anywhere on the line, so it causes problems if any username is a substring of another username or of some other field. (For example, if user ‘bob’ is on the list of exceptions, but the list of users also contains ‘bobb’, both would be deleted.)

In the particular instance I am dealing with, the username is conveniently the first field in the CSV file. This allows me to match the start of the line and the comma following the username:

while IFS= read -r line; do sed -i "/^${line},/d" list-of-1000-less-exceptions.csv; done < exceptions.txt
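
As a hedged alternative to the sed loop, the same first-field filter can be sketched as a single awk pass. It compares usernames as literal strings, so a character such as `.` in a username is not treated as a regular-expression operator. The sample rows and the `workdir` variable are invented for illustration; the file names follow the ones used above.

```shell
# Sketch only: the sample data below is made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'bob,Bob,Sales' 'bobb,Bobbi,IT' 'a.c,Ann,HR' 'abc,Abe,Ops' > list-of-1000.csv
printf '%s\n' 'bob' 'a.c' > exceptions.txt

# While reading the first file (NR==FNR), remember each exception as an array key.
# While reading the second file, print a line only if its first field is not a key.
awk -F, 'NR==FNR { skip[$0]; next } !($1 in skip)' exceptions.txt list-of-1000.csv > list-of-1000-less-exceptions.csv

cat list-of-1000-less-exceptions.csv
```

With this sample data, ‘bobb’ and ‘abc’ survive even though ‘bob’ and ‘a.c’ are excluded. One caveat: if exceptions.txt is empty, NR==FNR is also true for the CSV file and nothing is printed.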

What if the username was the third field in the CSV instead of the first?

Use awk to copy the third field to the front of each line, so the same start-of-line match still works:
awk -F, -vOFS=, '{print $3,$0}' list-of-1000.csv > copy-of-1000-less-exceptions.csv

  • -F, sets the field separator to a comma (defaults to whitespace)
  • -vOFS=, sets the Output Field Separator (OFS) to a comma (defaults to a space)
  • $3 is the third field (the username), printed first
  • $0 is the entire original line, printed unchanged after it
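
To make the effect of that command concrete, here is a small sketch run on a single made-up row (the `sample` and `keyed` variables and the row itself are hypothetical):

```shell
# Hypothetical sample row: the username "carol" is the third field.
sample='1001,Carol Jones,carol,Sales'

# Prepend a copy of field 3; $0 is the original line, unchanged.
keyed=$(printf '%s\n' "$sample" | awk -F, -v OFS=, '{ print $3, $0 }')

printf '%s\n' "$keyed"
# prints: carol,1001,Carol Jones,carol,Sales
```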

while IFS= read -r line; do sed -i "/^${line},/d" copy-of-1000-less-exceptions.csv; done < exceptions.txt

Now there’s still an extra username in this file. Maybe that doesn’t matter, but maybe it does. There are several ways to remove it. Here’s one:

awk -F, -vOFS=, '{$1=""; print $0}' copy-of-1000-less-exceptions.csv | sed 's/^,//' > list-of-1000-less-exceptions.csv

  • -F, sets the field separator to a comma (defaults to whitespace)
  • -vOFS=, sets the Output Field Separator (OFS) to a comma (defaults to a space)
  • $1="" sets the first field to an empty string
  • print $0 prints the whole line, which awk rebuilds using OFS once a field has been assigned

The result of the awk command has an initial comma on each line. The first field is still there; it’s just set to an empty string. I used sed to remove the leading comma.
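
The two stages can be seen separately in this small sketch, again on a made-up row (the `keyed`, `blanked`, and `stripped` variables are hypothetical names):

```shell
# Hypothetical row that already has the copied username in front.
keyed='carol,1001,Carol Jones,carol,Sales'

# Stage 1: blank the first field; awk rebuilds the line with OFS,
# so the now-empty field leaves a leading comma behind.
blanked=$(printf '%s\n' "$keyed" | awk -F, -v OFS=, '{ $1 = ""; print $0 }')
printf '%s\n' "$blanked"
# prints: ,1001,Carol Jones,carol,Sales

# Stage 2: remove that leading comma with sed.
stripped=$(printf '%s\n' "$blanked" | sed 's/^,//')
printf '%s\n' "$stripped"
# prints: 1001,Carol Jones,carol,Sales
```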

You could also use sed alone to remove the extra username field:
sed -i 's/^[^,]*,//' copy-of-1000-less-exceptions.csv
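
If the intermediate copy is not needed for anything else, a possible shortcut is to let awk test the third field directly, which avoids adding and then removing the extra column. This is only a sketch on invented sample data; `col` is a hypothetical awk variable naming the field that holds the username.

```shell
# Sketch only: the sample data below is made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '1,Bob,bob,Sales' '2,Bobbi,bobb,IT' '3,Carol,carol,HR' > list-of-1000.csv
printf '%s\n' 'bob' 'carol' > exceptions.txt

# Load the exceptions as array keys, then keep only the lines
# whose field number "col" is not one of those keys.
awk -F, -v col=3 'NR==FNR { skip[$0]; next } !($col in skip)' exceptions.txt list-of-1000.csv > list-of-1000-less-exceptions.csv

cat list-of-1000-less-exceptions.csv
# prints: 2,Bobbi,bobb,IT
```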

Set difference of two lists using Bash shell

Recently a handful of e-mail messages went undelivered due to some miscommunication between two servers.

One server had a record of all the addresses it thought it sent to over the period of time in question, and the other server had a record of all the addresses to which it had actually delivered (including messages from several other servers).

I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).

I had two files:

  • possibly-delivered.txt
  • definitely-delivered.txt

First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not exist in any of the lines I was looking for, I removed all the lines using sed (stream editor):

sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt

I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):

wc -l possibly-delivered-edited.txt

204 lines returned. Great! Now, how to compare the 2 files to get only the results I wanted?

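With a little help from Set Operations in the Unix Shell I found what I needed: comm (compare). The -2 and -3 options suppress the lines found only in the second file and the lines common to both, which leaves only the lines unique to the first file: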

comm -23 possibly-delivered-edited.txt definitely-delivered.txt

However, comm warned me that the 2 files were not in sorted order, so first I had to sort them:

sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt
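
As a rough sketch of the sort-then-compare step on a tiny data set, the first list can also be piped straight into comm, which reads standard input when given `-` as a file name. The addresses and the `workdir` and `missing` variables are invented for illustration.

```shell
# Sketch only: the addresses below are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'carol@example.com' 'alice@example.com' 'bob@example.com' > possibly-delivered-edited.txt
printf '%s\n' 'bob@example.com' 'dave@example.com' 'alice@example.com' > definitely-delivered.txt

# comm needs both inputs sorted the same way.
sort definitely-delivered.txt > definitely-delivered-sorted.txt

# Sort the first list on the fly and compare it against the sorted second list.
missing=$(sort possibly-delivered-edited.txt | comm -23 - definitely-delivered-sorted.txt)

printf '%s\n' "$missing"
# prints: carol@example.com
```

In Bash specifically, process substitution can remove the remaining temporary file as well, at the cost of portability to other shells.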

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looked like the sed command had converted my Windows line breaks to Unix line breaks, so the two files no longer used the same line endings. I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)
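
A possible alternative to converting one file back to Windows line breaks is to strip the carriage returns from both inputs before sorting, so that the comparison runs on identical line endings either way. This is a sketch under assumed conditions: the sample addresses are invented, and the `*-clean-sorted.txt` file names are hypothetical.

```shell
# Sketch only: the first sample file uses Unix line breaks, the second uses Windows ones.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'alice@example.com' 'carol@example.com' > possibly-delivered-edited.txt
printf '%s\r\n' 'alice@example.com' 'bob@example.com' > definitely-delivered.txt

# Delete every carriage return, then sort, for both inputs.
tr -d '\r' < possibly-delivered-edited.txt | sort > possibly-clean-sorted.txt
tr -d '\r' < definitely-delivered.txt | sort > definitely-clean-sorted.txt

# Lines that appear only in the first list.
missing=$(comm -23 possibly-clean-sorted.txt definitely-clean-sorted.txt)

printf '%s\n' "$missing"
# prints: carol@example.com
```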