Minor improvements to legacy Perl code

We’re always working with code we didn’t write. You’ll spend far more time looking at code you didn’t write (or don’t remember writing) than you will spend writing new code.

Today I looked at an example Perl script that used 45 lines of code to pull the company associated with an OUI (Organizationally Unique Identifier) from a text file, given a MAC address.

I thought I could do slightly better.

find_mac_co.sh:

#!/bin/sh
OUI=$(echo "$1" | sed 's/[^A-Fa-f0-9]//g' | cut -c1-6)
awk -F "\t" -v IGNORECASE=1 -v OUI="$OUI" '$0 ~ OUI { print $3 }' ouidb.tsv
exit 0

Example run:

$ sh find_mac_co.sh 7c:ab:60:ff:ff:ff
Apple, Inc.
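For reference, here is what the OUI extraction step produces for that example input:

$ echo "7c:ab:60:ff:ff:ff" | sed 's/[^A-Fa-f0-9]//g' | cut -c1-6
7cab60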

There’s probably a way to make the Perl version shorter too. I’m more familiar with bash and shell commands.

The biggest problem with this script is that it relies on an up-to-date list of OUIs. An even better way is to query an API:

find_mac_co_api.sh

#!/bin/sh
MACADDRESS="$1"
curl "https://api.maclookup.app/v2/macs/$MACADDRESS/company/name"
exit 0

Example run:

$ sh find_mac_co_api.sh 7c:ab:60:ff:ff:ff
Apple, Inc.

3 ways to iterate over lines of a file in Linux

Frequently I need to run a process for each item in a list, stored in a text file one item per line: usernames, filenames, e-mail addresses, etc. Obviously there are more than 3 ways to do this, but here are 3 I have found useful:

Bash
sh prog1.sh list.txt

Source: prog1.sh

while read -r line
do
    echo "$line"
done < "$1"

4 lines. Not bad.

Perl
perl prog2.pl list.txt

Source: prog2.pl

while(<>) {
    print `echo $_`;
}

3 lines. Pretty good.

Perl -n
perl -n prog3.pl list.txt

Source: prog3.pl

print `echo $_`;

1 line! The -n switch basically wraps your Perl code in a loop that processes each line of the input file. I just discovered this while flipping through my 17-year-old copy of Programming Perl (link is to a newer edition).

I really like this method because you can write a script that processes a single input that could easily be reused by another script, but can also easily be used to process an entire list by adding just the -n switch. (There’s also a similar -p switch that does the same thing, but additionally prints out each line.)

I should note that in the examples above, I am using echo as a substitute for any command external to the script itself. In the Perl examples, there would be no need to call echo to merely print the contents of the line, but it’s a convenient stand-in for a generic command.

As suggested by a comment on a previous post, I have made these examples available in a git repository: iterate over lines.

Removing exceptions from a list using Bash (with sed and awk)

  • I have a CSV file, a list of 1000+ users and user properties.
  • I have a list of exceptions (users to be excluded from processing), one user per line, about 50 total.

How can I remove the exceptions from the list?

# make a copy of the original list
cp list-of-1000.csv list-of-1000-less-exceptions.csv
# loop through each line in exceptions.txt and remove matching lines from the copy
while read line; do sed -i "/${line}/d" list-of-1000-less-exceptions.csv; done < exceptions.txt

This is a little simplistic and could be a problem if any usernames are subsets of other usernames. (For example, if user ‘bob’ is on the list of exceptions, but the list of users also contains ‘bobb’, both would be deleted.)

In the particular instance I am dealing with, the username is conveniently the first field in the CSV file. This allows me to match the start of the line and the comma following the username:

while read line; do sed -i "/^${line},/d" list-of-1000-less-exceptions.csv; done < exceptions.txt
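To illustrate the difference on a throwaway example (hypothetical usernames and file names):

printf 'bob,Bob Smith,Active\nbobb,Bobby Jones,Active\n' > /tmp/users.csv
printf 'bob\n' > /tmp/exceptions.txt
while read line; do sed -i "/^${line},/d" /tmp/users.csv; done < /tmp/exceptions.txt
cat /tmp/users.csv
# only the bobb line remains: bobb,Bobby Jones,Active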

What if the username was the third field in the CSV instead of the first?

Use awk:
awk -F, -vOFS=, '{print $3,$0}' list-of-1000.csv > copy-of-1000-less-exceptions.csv

  • -F, sets the field separator to a comma (defaults to whitespace)
  • -vOFS=, sets the Output Field Separator (OFS) to a comma (defaults to a space)
  • $3 prints the third field
  • $0 prints all the fields, with the specified field separator between them
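To see what that does to a single (hypothetical) row where the username is the third field:

$ echo 'jdoe@example.com,Jane Doe,jdoe,Active' | awk -F, -vOFS=, '{print $3,$0}'
jdoe,jdoe@example.com,Jane Doe,jdoe,Active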

while read line; do sed -i "/^${line},/d" copy-of-1000-less-exceptions.csv; done < exceptions.txt

Now there’s still an extra username in this file. Maybe that doesn’t matter, but maybe it does. There are several ways to remove it–here’s one:

awk -F, -vOFS=, '{ $1=""; print $0 }' copy-of-1000-less-exceptions.csv | sed 's/^,//' > list-of-1000-less-exceptions.csv

  • -F, sets the field separator to a comma (defaults to whitespace)
  • -vOFS=, sets the Output Field Separator (OFS) to a comma (defaults to a space)
  • $1="" sets the first field to an empty string
  • print $0 prints all the fields

The result of the awk command has an initial comma on each line. The first field is still there, it’s just set to an empty string. I used sed to remove it.

You could also use sed alone to remove the extra username field:
sed -i 's/^[^,]*,//' copy-of-1000-less-exceptions.csv
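A quick check of that substitution on a hypothetical row:

$ echo 'jdoe,jdoe@example.com,Jane Doe,jdoe,Active' | sed 's/^[^,]*,//'
jdoe@example.com,Jane Doe,jdoe,Active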

Canvas enrollments.csv and add_sis_stickiness

I have an enrollments.csv file for Instructure’s Canvas LMS, and I want all of the enrollments in it to “stick”, that is, to survive a batch-mode SIS import. These are primarily course designers: they have no official standing in the class, so they are not in our database and are not included in the regular enrollment updates.

According to the Canvas documentation for SIS imports:

add_sis_stickiness – Boolean

This option, if present, will process all changes as if they were UI changes. This means that “stickiness” will be added to changed fields. This option is only processed if ‘override_sis_stickiness’ is also provided.

Source: https://canvas.instructure.com/doc/api/sis_imports.html#method.sis_imports_api.create

However, experience tells me otherwise. An inquiry to Instructure’s support confirms that add_sis_stickiness does not apply to enrollments. Enrollments added this way will be deleted following the next enrollments batch import.

The choices to preserve these course designer enrollments are basically to add each one manually using the web UI, or add them via the API. Either option will make the enrollments “stick.”

I opted to use the API. Since I already had a formatted input file, I wrote a short Bash script (with the help of several man pages and a couple of Stack Overflow pages) that reads the CSV and processes each row, adding the enrollment via the API:

headerrow=1
while read -r row; do
    if [ "$headerrow" -eq 0 ]
    then
        # get the SIS course ID
        cid="$(echo "$row" | cut -d',' -f1)"
        # get the SIS user ID
        uid="$(echo "$row" | cut -d',' -f2)"
        # get the role / enrollment type
        type="$(echo "$row" | cut -d',' -f3)"
        # reformat the enrollment type (e.g. designer -> DesignerEnrollment)
        tid="$(echo "$type" | cut -c1 | tr '[:lower:]' '[:upper:]')$(echo "$type" | cut -c2-)Enrollment"
        echo "course is $cid"
        echo "user is $uid"
        echo "type is $tid"
        result="$(curl "https://[yourcanvassite].instructure.com/api/v1/courses/sis_course_id:$cid/enrollments" -H 'Authorization: Bearer [REDACTED]' -X POST -F "enrollment[type]=$tid" -F "enrollment[user_id]=sis_user_id:$uid" -F 'enrollment[enrollment_state]=active' -F 'enrollment[notify]=false')"
        echo "$result"
    fi
    headerrow=0
done < enrollments.csv
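For reference, the script expects enrollments.csv to start with a header row followed by three comma-separated fields per line: SIS course ID, SIS user ID, and role. Something like this (hypothetical values; any additional columns are ignored by cut):

course_id,user_id,role
SIS-COURSE-101,SIS-USER-1234,designer
SIS-COURSE-102,SIS-USER-5678,teacher

With those rows, the role column becomes DesignerEnrollment and TeacherEnrollment, respectively, after the cut/tr reformatting.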

Holding messages in the Postfix mail queue

Earlier today, someone sent a large number of email messages, each containing a 30-megabyte attachment, to users on our servers. This put our Postfix servers under a heavy load and caused some messages to be delivered after a substantial delay. (This was due in part to additional processing done by our servers; I’m sure a plain-jane Postfix instance could have handled it without an issue.)

This was no good. The sender–let’s call it bigbulk.test.com–should be able to send such messages, but not at the expense of normal mail delivery. I needed to deprioritize those messages so that other mail could be delivered first.

The first thing I did was to hold all the mail from bigbulk.test.com:

  • Retrieve the mail queue
  • Select only the lines containing bigbulk.test.com
  • Select only the queue ID, the first item listed in each result
  • Pass the queue IDs to the postsuper -h command

mailq | grep bigbulk.test.com | cut -d ' ' -f 1 | xargs -n1 postsuper -h

But what about delivering them? I released them in small batches so as not to overload the server again.

  • Retrieve the mail queue
  • Select only the lines containing bigbulk.test.com
  • Select only the queue ID (stripping out the hold-indicator)
  • Select only the first 5 results
  • Pass the queue IDs to the postsuper -H command

mailq | grep bigbulk.test.com | cut -d '!' -f 1 | head -n5 | xargs -n1 postsuper -H
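If you don’t want to babysit that, a small (hypothetical) wrapper can release a batch of five held messages every few minutes until none remain on hold. It keys off the ! that mailq appends to the queue ID of a held message:

# release five held bigbulk.test.com messages, wait, and repeat until none are left on hold
while mailq | grep bigbulk.test.com | grep -q '!'; do
    mailq | grep bigbulk.test.com | grep '!' | cut -d '!' -f 1 | head -n5 | xargs -n1 postsuper -H
    sleep 300
done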

Set difference of two lists using BASH shell

Recently a handful of e-mail messages went undelivered due to some mis-communication between 2 servers.

One server had a record of all the addresses it thought it had sent to over the period of time in question, and the other server had a record of all the addresses to which it had actually delivered mail (including mail from several other servers).

I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).

I had two files:

  • possibly-delivered.txt
  • definitely-delivered.txt

First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not appear in any of the lines I was looking for, I removed those lines using sed (stream editor):

sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt

I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):

wc -l possibly-delivered-edited.txt

204 lines returned. Great! Now, how to compare the 2 files to get only the results I wanted?

With a little help from Set Operations in the Unix Shell I found what I needed–comm (compare). The -2 and -3 switches suppress the lines unique to the second file and the lines common to both files, leaving only the lines unique to the first file:

comm -23 possibly-delivered-edited.txt definitely-delivered.txt

However, comm warned me that the 2 files were not in sorted order, so first I had to sort them:

sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looked like the sed command had converted my Windows linebreaks to Unix linebreaks, so the lines no longer matched. I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)
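In hindsight, the whole process condenses to a single line using Bash process substitution (assuming both files have already been given matching line endings):

comm -23 <(sed '/undelivered/d' possibly-delivered.txt | sort) <(sort definitely-delivered.txt)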

Using FFmpeg to programmatically slice and splice video

My wife has a research project in which she needs to analyze brief (8-second) segments of hundreds of much longer videos. My goal was to take the videos (~30 minutes each), cut out only the relevant sections, and splice them together, with a static marker between segments. This should allow her and her colleagues to analyze the videos quickly and with precise time-points (instead of using a slider in a video player to locate and estimate time-points). I’ve posted my notes from this process below for my own reference, and in case they prove useful to anyone else.

To my knowledge, the best tool for the job is FFmpeg, an open source video tool.
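Roughly, the FFmpeg invocations involved look like the following sketch (filenames and time-points are illustrative, not the exact commands from my notes):

# extract an 8-second segment starting at 5m32s, copying streams without re-encoding
ffmpeg -ss 00:05:32 -i session01.mp4 -t 8 -c copy segment01.mp4
# list the segments (and static marker clips) to join, then concatenate them
printf "file 'segment01.mp4'\nfile 'marker.mp4'\nfile 'segment02.mp4'\n" > segments.txt
ffmpeg -f concat -safe 0 -i segments.txt -c copy combined.mp4

Note that stream copy (-c copy) cuts on keyframes, so for frame-accurate 8-second segments you may need to drop it and re-encode.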

Monitoring web server status with a shell script

Recently, my VPS (Virtual Private Server) ran into some issues where it exceeded the maximum amount of RAM allotted under my subscription. When this happens, the web server software shuts down and does not restart until I manually restart it.

This is bad. I’m not always visiting my own web site, so it could be down for days without me knowing. Although I really need to identify what is using all the RAM, in the meantime I’ll settle for a monitoring system that will notify me when the server is down.

#!/bin/bash
if curl -s --head http://osric.com/ | grep "200 OK" > /dev/null
  then 
    echo "The HTTP server on osric.com is up!" > /dev/null
  else
    echo "The HTTP server on osric.com is down!"
fi

cURL will let you retrieve a URL via the command line, and provides more options than Wget for a single URL. In this case, I used the -s (silent) switch to eliminate the status/progress output, and the --head switch to retrieve only the document headers. The headers are then piped to Grep, which searches for the string “200 OK” (the HTTP status message for a successful request).

I send the result of that to /dev/null so that the output doesn’t appear on the screen.

If grep does find “200 OK”, then I send a success message to /dev/null. This is largely unnecessary, but it is nice to leave in for testing the script against a successful case: just remove the > /dev/null. If grep doesn’t find “200 OK”, then there is a problem. It might not necessarily mean that the web server is down, but it definitely indicates a problem that needs to be investigated.

I added a call to this script to a crontab to run every 5 minutes. If there is no output, nothing happens. If there is output, the output is sent to me via e-mail, which, assuming I am checking my e-mail religiously, should reduce server downtime.
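The crontab entry looks something like this (the path to the script is hypothetical; cron mails any output to the crontab’s owner):

*/5 * * * * /usr/local/bin/check-osric-http.sh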

WordPress Manual Update Instructions

One of the great things about WordPress is the one-click upgrade procedure. It’s particularly convenient, because WordPress has frequent upgrades and security updates. Without an easy way of upgrading, many users would complain of upgrade fatigue, or would continue running older versions with security flaws.

Of course, not everyone can use the one-click upgrade: my host is not configured properly, so I need to upgrade manually. Fortunately, a manual upgrade is relatively painless. Although it is described in some detail at Upgrading WordPress: Manual Update, I’m listing my specific procedures (using the Bash shell) here.

  1. Download the latest version to your home directory:
    wget http://wordpress.org/latest.tar.gz
  2. Remove any old WordPress directory before unpacking the latest:
    rm -r ./wordpress
  3. Unpack the latest version:
    tar -xzf latest.tar.gz
  4. Backup the database:
    mysqldump --add-drop-table -h localhost -u [username] -p [database name] | bzip2 -c > [site name].[dd-MMM-yyyy].bak.sql.bz2
  5. Disable plugins. You can do this via the admin interface, or the database:
    UPDATE wp_options SET option_value = 'a:0:{}' WHERE option_name = 'active_plugins';
  6. Remove the wp-admin and wp-includes directories:
    rm -r ./path/to/your/wordpress/wp-admin ./path/to/your/wordpress/wp-includes
  7. Copy only the updated files over to the WordPress install:
    cp -ru ./wordpress/* ./path/to/your/wordpress/
  8. Visit the admin page (which may prompt a database upgrade).
  9. Re-enable any plugins.

Step 6 may seem unnecessary in light of step 7, but in my experience merely updating the wp-admin and wp-includes directories is not enough: there may be old files that are not present in the latest version, but that will cause problems if they still exist.
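One quick way to confirm which version is now installed is to check the version file WordPress ships in wp-includes (assuming the standard layout):

grep wp_version ./path/to/your/wordpress/wp-includes/version.php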

I’ve found that my reCAPTCHA plugin doesn’t retain its API keys, but you can look them up again on the reCAPTCHA site. Other plugins may have similar issues.

Thanks to Perishable Press for the tip on disabling WordPress plugins via MySQL.