Using group expressions in regular expression pattern matching

I’ve used group expressions in regexes many times, but only for replacement. Yesterday I learned that they can also be used for matching.

For example, let’s say you have the text:

Banananananas don’t grow in Mississississippi because banananas are afraid of getting turned into Missississippi’s famous bananana pudding.

The following regular expression will find instances of iss or an that are repeated more than twice.

(iss|an)\1\1+

You can use \1\1 as the replacement (or $1$1 in Dreamweaver, which uses backslashes to identify groups in match expressions, but dollar signs to represent groups in replace expressions) to turn the misspelled words into Mississippi and banana(s).

Another example might be applying consistent formatting to phone numbers or dates.

Phone numbers
Let’s say you usually use 555-555-1212 as the format for phone numbers and sometimes you use 555.555.1212, but the new trend is to use spaces instead of dashes or dots as separators:

Find: ([\d]{3})([-\.])([\d]{3})\2([\d]{4})
Replace: \1 \3 \4

Dates
Let’s say you usually use 12/5/2013 as the format for dates, dabbled with 12.5.2013, but now you’ve decided that dashes are clearer:

Find: ([\d]{1,2})([\./])([\d]{1,2})\2([\d]{4})
Replace: \1-\3-\4

In both cases you could just repeat the bracketed character class, but then you could end up matching strings you didn’t intend to:

  • 555-555.1212
  • 12.5/2013

ANT deployment script and SFTP

My development team is moving away from developing on mapped drives/file shares to using cloud-hosted servers on Amazon Web Services (AWS). This is introducing a change to our usual workflow, as our access to the remote servers is limited to SSH and SFTP.

Although I previously used Apache Ant scripts through Eclipse to facilitate deploying application updates, the scripts were generally unpopular with the rest of the development team. (Many of them do not use Eclipse and preferred just to drop-and-drag files from their development sandboxes to the development or production servers.) Additionally, my original Ant scripts relied on the sync command to synchronize folders on the file shares.

Here is a revised Ant script that uses SCP (Secure Copy)–not SFTP but achieves the same goal–to deploy application files from a developer sandbox to the development or production server:

<project name="Deploy myapp" default="Sandbox to Dev">
  <input message="Username:" addproperty="username" />
  <input message="Password:" addproperty="passwd" />
  <property name="applicationFolder" value="myapp"/>
  <property name="site" value="osric.com"/>
  <property name="sandboxRoot" value="${basedir}"/>
  <property 
    name="development" 
    value="${username}:${passwd}@dev.osric.com:/home/web/${site}/${applicationFolder}"/>
  <property 
    name="production" 
    value="${username}:${passwd}@osric.com:/home/web/${site}/${applicationFolder}"/>
  <target name="Sandbox to Dev">
    <scp todir="${development}" trust="true">
      <fileset dir="${sandboxRoot}">
        <exclude name="**/build.xml"/>
        <exclude name="**/.*"/>
      </fileset>
    </scp>
  </target>
  <target name="Sandbox to Production">
    <scp todir="${production}" trust="true">
      <fileset dir="${sandboxRoot}">
        <exclude name="**/build.xml"/>
        <exclude name="**/.*"/>
      </fileset>
    </scp>
  </target>
</project>

There are a couple issues with this script to be aware of:

  • SCP is not included with Ant. The script produced the error “Problem: failed to create task or type scp”. I needed to:
    1. Download JSCH
    2. Place the file in Eclipse’s plugins/[ant folder]/lib folder
    3. Add the JAR file to the Ant build path (via Window–Preferences–Ant Home Entries (default)–Add External JARs…–select the jsch .jar file)
  • The password input is in plain text. Hiding password input in Ant provides a solution for Ant, but one that does not work from Eclipse. I have seen other possible solutions, so I’ll update this once I implement once and confirm that it works.

Set difference of two lists using BASH shell

Recently a handful of e-mail messages went undelivered due to some mis-communication between 2 servers.

One server had a record of all the addresses it thought it sent to over the period of time in question, and the other server a record of all the addresses to which it had actually delivered (including messages from several other servers).

I had both lists, but what I really wanted was just the set difference: only the elements of the first list that did not appear in the second. (In other words, a list of the recipients whose messages were never delivered).

I had two files:

  • possibly-delivered.txt
  • definitely-delivered.txt

First, the possibly-delivered.txt file had a bunch of extraneous lines, all of which contained the same term: “undelivered”. Since that term did not exist in any of the lines I was looking for, I removed all the lines using sed (stream editor):

sed '/undelivered/d' possibly-delivered.txt > possibly-delivered-edited.txt

I already knew (from prior investigations) that there should be 204 addresses in that list, so I performed a check to make sure there were 204 lines in the file using wc (word count):

wc -l possibly-delivered-edited.txt

204 lines returned. Great! Now, how to compare the 2 files to get only the results I wanted?

With a little help from Set Operations in the Unix Shell I found what I needed–comm (compare):

comm -23 possibly-delivered-edited.txt definitely-delivered.txt

However, comm warned me that the 2 files were not in sorted order, so first I had to sort them:

sort possibly-delivered-edited.txt > possibly-delivered-edited-sorted.txt
sort definitely-delivered.txt > definitely-delivered-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

This returned zero results. That was not possible (or at least, highly improbable!), so I checked the files. It looks like the sed command had converted my Windows linebreaks to Unix linebreaks, so I ran a command to put them back:
unix2dos possibly-delivered-edited-sorted.txt

Again:
comm -23 possibly-delivered-edited-sorted.txt definitely-delivered-sorted.txt

That returned my list of addresses from the first list that did not appear in the second list. (Quickly, accurately, and without tedium.)

Updating a Windows timestamp

Let’s say you need to modify the timestamp, or last modified date, on a Windows file. In my case, I wanted to change the last modified date on a lot of PDF files, so that the last modified attribute would reflect the date the content was last modified (rather than the file).

The best and easiest way I’ve found to do this is via Cygwin:
touch -t 201109201145 *.pdf

When I searched for how to accomplish this, I found all sorts of sites advising me to download a $20 shareware application, or write a C# application to do it. Fortunately, in a *nix type environment, it’s a one-line command.

There are other ports of Unix/Linux/*nix commands for Windows, some of which may be more lightweight than Cygwin. (I rely on Cygwin quite a bit, so I have it installed on all my Windows machines.) I have not used any of the following, but here are links to some alternatives, in case you are interested: