3 ways to iterate over lines of a file in Linux

Frequently I need to run a process for each item in a list, stored in a text file one item per line: usernames, filenames, e-mail addresses, etc. Obviously there are more than 3 ways to do this, but here are 3 I have found useful:

Bash
bash prog1.sh list.txt

Source: prog1.sh

while IFS= read -r line
do
    echo "$line"
done < "$1"

4 lines. Not bad.

Perl
perl prog2.pl list.txt

Source: prog2.pl

while(<>) {
    print `echo $_`;
}

3 lines. Pretty good.

Perl -n
perl -n prog3.pl list.txt

Source: prog3.pl

print `echo $_`;

1 line! The -n switch wraps your Perl code in a loop that processes each line of the input file. I just discovered this while flipping through my 17-year-old copy of Programming Perl.

I really like this method because you can write a script that processes a single input, reuse it easily from another script, and process an entire list just by adding the -n switch. (There’s also a similar -p switch that does the same thing but additionally prints out each line, as shown below.)
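For example, here is -p in action as a quick sketch, using Perl’s built-in uc function and the same list.txt input. No explicit print is needed; -p prints each line after the code runs:

perl -p -e '$_ = uc $_' list.txt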

I should note that in the examples above, I am using echo as a substitute for any command external to the script itself. In the Perl examples, there would be no need to call echo to merely print the contents of the line, but it’s a convenient stand-in for a generic command.

As suggested by a comment on a previous post, I have made these examples available in a git repository: iterate over lines.

Removing exceptions from a list using Bash (with sed and awk)

  • I have a CSV file, a list of 1000+ users and user properties.
  • I have a list of exceptions (users to be excluded from processing), one user per line, about 50 total.

How can I remove the exceptions from the list?

# make a copy of the original list
cp list-of-1000.csv list-of-1000-less-exceptions.csv
# loop through each line in exceptions.txt and remove matching lines from the copy
while read -r line; do sed -i "/${line}/d" list-of-1000-less-exceptions.csv; done < exceptions.txt

This is a little simplistic and could be a problem if any usernames are subsets of other usernames. (For example, if user ‘bob’ is on the list of exceptions, but the list of users also contains ‘bobb’, both would be deleted.)

In the particular instance I am dealing with, the username is conveniently the first field in the CSV file. This allows me to match the start of the line and the comma following the username:

while read -r line; do sed -i "/^${line},/d" list-of-1000-less-exceptions.csv; done < exceptions.txt

What if the username was the third field in the CSV instead of the first?

Use awk to prepend the username (the third field) as a new first field:
awk -F, -vOFS=, '{print $3,$0}' list-of-1000.csv > copy-of-1000-less-exceptions.csv

  • -F, sets the field separator to a comma (defaults to whitespace)
  • -vOFS=, sets the Output Field Separator (OFS) to a comma (defaults to a space)
  • $3 prints the third field
  • $0 prints all the fields, with the specified field separator between them

Each line of the copy now begins with the username, so the same anchored pattern works:

while read -r line; do sed -i "/^${line},/d" copy-of-1000-less-exceptions.csv; done < exceptions.txt

Now there’s still an extra username field at the start of each line. Maybe that doesn’t matter, but maybe it does. There are several ways to remove it–here’s one:

awk -F, -vOFS=, '{$1=""; print $0}' copy-of-1000-less-exceptions.csv | sed 's/^,//' > list-of-1000-less-exceptions.csv

  • -F, sets the field separator to a comma (defaults to whitespace)
  • -vOFS=, sets the Output Field Separator (OFS) to a comma (defaults to a space)
  • $1="" sets the first field to an empty string
  • print $0 prints all the fields

The result of the awk command has an initial comma on each line. The first field is still there, it’s just set to an empty string. I used sed to remove it.

You could also use sed alone to remove the extra username field:
sed -i 's/^[^,]*,//' copy-of-1000-less-exceptions.csv
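As an aside, the loop-plus-sed approach can be condensed to a single grep command. Here is a sketch (assuming the same file names as above) that uses process substitution to turn each exception into an anchored ^username, pattern:

grep -v -f <(sed 's/^/^/; s/$/,/' exceptions.txt) list-of-1000.csv > list-of-1000-less-exceptions.csv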

PowerShell Ellipsis (dot dot dot)

Sometimes when you retrieve an object via PowerShell, some of the properties are truncated, denoted by an ellipsis (“…”).

For example:
Get-Mailbox chris | Select AddressListMembership

AddressListMembership
---------------------
{\Staff Global Address List, \Staff, \IT Staff, \Exchange Admins...}

How do you see the full list? There are a couple ways:

Select -ExpandProperty
Get-Mailbox chris | Select -ExpandProperty AddressListMembership

$FormatEnumerationLimit = -1
This is a per-session variable in PowerShell. By default the value is 4, but if you change it to -1 it will enumerate all items. This will affect every property of every object, so it may be more than you need.
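For example, to see the full list from the earlier command, set the variable and run the query again in the same session:

$FormatEnumerationLimit = -1
Get-Mailbox chris | Select AddressListMembership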

MySQL date_add and date_sub functions running against millions of rows

One of my servers runs a query once a week to remove all rows from a Syslog table (>20,000,000 rows) in a MySQL database that are older than 60 days. This was running terribly slowly and interfering with other tasks on the server.

Although the original query used a DELETE statement I’ve used SELECT statements in the examples below.

SELECT COUNT(*)
FROM SystemEvents
WHERE ReceivedAt < DATE_SUB(NOW(), INTERVAL 60 DAY);

That counts about 900,000 matching rows and takes about 45 seconds.

SELECT COUNT(*)
FROM SystemEvents
WHERE ReceivedAt < DATE_ADD(CURRENT_DATE, INTERVAL -60 DAY);

Likewise takes about 48 seconds.

Is MySQL running a function every time it makes a comparison? I decided to try using a hard-coded date to find out:

SELECT COUNT(*)
FROM SystemEvents
WHERE ReceivedAt < '2015-11-12 12:00:00';

6 seconds! Much faster.

I created a user-defined variable:
SET @sixty_days_ago = DATE_SUB(NOW(), INTERVAL 60 DAY);

Then ran the query:
SELECT COUNT(*)
FROM SystemEvents
WHERE ReceivedAt < @sixty_days_ago;

12 seconds. Not 6 seconds, but still a fraction of the original time!
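Applying the same idea to the original weekly cleanup, the DELETE would look something like this (a sketch based on the queries above):

SET @sixty_days_ago = DATE_SUB(NOW(), INTERVAL 60 DAY);
DELETE FROM SystemEvents
WHERE ReceivedAt < @sixty_days_ago;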

Holding messages in the Postfix mail queue

Earlier today, someone sent a large number of email messages, each containing a 30-megabyte attachment, to users on our servers. This put our Postfix servers under a heavy load and caused some messages to be delivered after a substantial delay. (This was in part due to additional processing done by our servers; I’m sure a plain-jane Postfix instance could have handled it without an issue.)

This was no good. The sender–let’s call it bigbulk.test.com–should be able to send such messages, but not at the expense of normal mail delivery. I needed to deprioritize those messages so that other mail could be delivered first.

The first thing I did was to hold all the mail from bigbulk.test.com:

  • Retrieve the mail queue
  • Select only the lines containing bigbulk.test.com
  • Select only the queue ID, the first item listed in each result
  • Pass the queue IDs to the postsuper -h command

mailq | grep bigbulk.test.com | cut -d ' ' -f 1 | xargs -n1 postsuper -h

But what about delivering them? I released them in small batches so as not to overload the server again.

  • Retrieve the mail queue
  • Select only the lines containing bigbulk.test.com
  • Select only the queue ID (stripping out the hold-indicator)
  • Select only the first 5 results
  • Pass the queue IDs to the postsuper -H command

mailq | grep bigbulk.test.com | cut -d '!' -f 1 | head -n5 | xargs -n1 postsuper -H
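Once the backlog had cleared, any messages still on hold could be released in one shot; postsuper accepts the special queue ID ALL in place of a specific ID:

postsuper -H ALL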

Event processing, interval processing in Excel

(And by Excel, I mean MS Excel, Open Office, and Google Docs.)

I was recently working with a large amount of computer-generated event data. I wanted to analyze the data, but was only concerned with events (rows) that occurred within intervals demarcated by certain start and end events.

At the time, I had no answer for this in Excel. I wrote a small computer program that read the file one line at a time and ignored lines that occurred outside the intervals of interest. Recently I came up with a solution for this problem in Excel, so I thought I would share it here.

In this example, I am going to use a highly simplified traffic study. A computer at a traffic light records 2 kinds of events:

sensor events
on or off, indicating whether or not there is a car in the intersection
light events
red, amber, or green, indicating the new light color

Here are some sample data collected by this computer:

seconds event state
0 light green
7 sensor on
8 sensor off
15 sensor on
16 sensor off
25 light amber
30 light red
60 light green
85 light amber
90 light red
92 sensor on
93 sensor off
120 light green
145 light amber
150 light red
180 light green
199 sensor on
200 sensor off
204 sensor on
205 light amber
206 sensor off
210 light red
240 light green
265 light amber
269 sensor on
270 light red
271 sensor off
300 light green

Let’s say we want to find out how many cars drove through a red light–that is, the light was red when the car started driving through the intersection.

First, add a new column. This column will indicate the current state of the light for each event. That’s trivial for each light event, but associating the state of the light with each sensor event is what we’re after. In this column, add the following formula:

Excel and Google Sheets:
=IF(B2="light",C2,D1)

Open Office Spreadsheets:
=IF(B2="light"; C2; D1)

That formula means:

  • IF the current event is a light event
  • THEN set this cell to the current state
  • ELSE set this cell to the most recent light state.

Next, add another column. This column will indicate whether the row represents a car driving through a red light. In this column, add the following formula:

Excel and Google Sheets
=IF(B2="sensor", IF(C2="on", IF(D2="red", 1, 0), 0), 0)

Open Office Spreadsheets
=IF(B2="sensor"; IF(C2="on"; IF(D2="red"; 1; 0); 0); 0)

The above is a nested series of if statements:

  • IF the row contains a sensor event AND
  • IF the sensor event is an on event AND
  • IF the current state of the light is red
  • THEN it is a traffic violation
  • ELSE it is not a traffic violation

Copy these formulae to the other rows, via Edit–Fill–Down (Excel and Open Office) or ctrl-d (or cmd-d on Mac). The spreadsheet should now indicate that there was one incident of running a red light, which occurred at second 92.
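To get that count directly instead of scanning the column by eye, a simple sum over the violation column works (assuming the violation column is E and the data occupies rows 2 through 29):

=SUM(E2:E29)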

Using group expressions in regular expression pattern matching

I’ve used group expressions in regexes many times, but only for replacement. Yesterday I learned that they can also be used for matching.

For example, let’s say you have the text:

Banananananas don’t grow in Mississississippi because banananas are afraid of getting turned into Missississippi’s famous bananana pudding.

The following regular expression will find instances of iss or an repeated three or more times in a row.

(iss|an)\1\1+

You can use \1\1 as the replacement (or $1$1 in Dreamweaver, which uses backslashes to identify groups in match expressions, but dollar signs to represent groups in replace expressions) to turn the misspelled words into Mississippi and banana(s).
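As a concrete, runnable version of that find-and-replace, here is a sketch as a perl one-liner (sample.txt is a hypothetical file containing the text above):

perl -pe 's/(iss|an)\1\1+/$1$1/g' sample.txt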

Another example might be applying consistent formatting to phone numbers or dates.

Phone numbers
Let’s say you usually use 555-555-1212 as the format for phone numbers and sometimes you use 555.555.1212, but the new trend is to use spaces instead of dashes or dots as separators:

Find: ([\d]{3})([-\.])([\d]{3})\2([\d]{4})
Replace: \1 \3 \4

Dates
Let’s say you usually use 12/5/2013 as the format for dates, dabbled with 12.5.2013, but now you’ve decided that dashes are clearer:

Find: ([\d]{1,2})([\./])([\d]{1,2})\2([\d]{4})
Replace: \1-\3-\4

In both cases you could just repeat the bracketed character class, but then you could end up matching strings you didn’t intend to:

  • 555-555.1212
  • 12.5/2013
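These editor patterns also translate directly to the command line. For instance, the phone-number reformatting as a perl one-liner (phones.txt is a hypothetical input file):

perl -pe 's/([\d]{3})([-\.])([\d]{3})\2([\d]{4})/$1 $3 $4/g' phones.txt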

ANT deployment script and SFTP

My development team is moving away from developing on mapped drives/file shares to using cloud-hosted servers on Amazon Web Services (AWS). This is introducing a change to our usual workflow, as our access to the remote servers is limited to SSH and SFTP.

Although I previously used Apache Ant scripts through Eclipse to facilitate deploying application updates, the scripts were generally unpopular with the rest of the development team. (Many of them do not use Eclipse and prefer to drag-and-drop files from their development sandboxes to the development or production servers.) Additionally, my original Ant scripts relied on the sync command to synchronize folders on the file shares.

Here is a revised Ant script that uses SCP (Secure Copy)–not SFTP, but it achieves the same goal–to deploy application files from a developer sandbox to the development or production server:

<project name="Deploy myapp" default="Sandbox to Dev">
  <input message="Username:" addproperty="username" />
  <input message="Password:" addproperty="passwd" />
  <property name="applicationFolder" value="myapp"/>
  <property name="site" value="osric.com"/>
  <property name="sandboxRoot" value="${basedir}"/>
  <property 
    name="development" 
    value="${username}:${passwd}@dev.osric.com:/home/web/${site}/${applicationFolder}"/>
  <property 
    name="production" 
    value="${username}:${passwd}@osric.com:/home/web/${site}/${applicationFolder}"/>
  <target name="Sandbox to Dev">
    <scp todir="${development}" trust="true">
      <fileset dir="${sandboxRoot}">
        <exclude name="**/build.xml"/>
        <exclude name="**/.*"/>
      </fileset>
    </scp>
  </target>
  <target name="Sandbox to Production">
    <scp todir="${production}" trust="true">
      <fileset dir="${sandboxRoot}">
        <exclude name="**/build.xml"/>
        <exclude name="**/.*"/>
      </fileset>
    </scp>
  </target>
</project>

There are a couple issues with this script to be aware of:

  • SCP is not included with Ant. The script produced the error “Problem: failed to create task or type scp”. I needed to:
    1. Download JSCH
    2. Place the file in Eclipse’s plugins/[ant folder]/lib folder
    3. Add the JAR file to the Ant build path (via Window–Preferences–Ant Home Entries (default)–Add External JARs…–select the jsch .jar file)
  • The password input is in plain text. Hiding password input in Ant provides a solution for Ant, but one that does not work from Eclipse. I have seen other possible solutions, so I’ll update this once I implement one and confirm that it works.
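With JSCH in place, either target can also be run from the command line rather than from Eclipse (assuming the script is saved as build.xml in the sandbox root; the jar path below is an example):

ant -lib /path/to/jsch.jar "Sandbox to Dev"
ant -lib /path/to/jsch.jar "Sandbox to Production"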

Set difference of two lists using BASH shell

Recently a handful of e-mail messages went undelivered due to some miscommunication between 2 servers.

One server had a record of all the addresses it thought it had sent to over the period of time in question, and the other server had a record of all the addresses to which it had actually delivered messages (including messages from several other servers).

I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).

I had two files:

  • possibly-delivered.txt
  • definitely-delivered.txt

First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not exist in any of the lines I was looking for, I removed all the lines using sed (stream editor):

sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt

I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):

wc -l possibly-delivered-edited.txt

204 lines returned. Great! Now, how to compare the 2 files to get only the results I wanted?

With a little help from Set Operations in the Unix Shell I found what I needed–comm (compare):

comm -23 possibly-delivered-edited.txt definitely-delivered.txt

However, comm warned me that the 2 files were not in sorted order, so first I had to sort them:

sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looks like the sed command had converted my Windows linebreaks to Unix linebreaks, so I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)
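For future reference, the whole operation condenses to a single line using process substitution (a sketch that assumes both files already use the same line endings):

comm -23 <(sed '/undelivered/d' possibly-delivered.txt | sort) <(sort definitely-delivered.txt)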