April 2012 – The Accidental Developer

Grammars and the Random Goth Lyric Generator

To celebrate one of the last days of National Poetry Month as well as The Accidental Developer’s 100th blog post, I will attempt to combine a bit of computer science and poetry.

I’ve been studying grammars and formal languages, among other things, this past semester in my Theory of Computation class. One thing that it reminded me of was the second Javascript application I ever developed (with the help of my friend and college classmate Miranda Tarrow): The Random Goth Lyric Generator.

I took a simple sentence structure (subject, verb, adjective, object) and made random substitutions for each line, 4 lines per stanza, 4 stanzas per poem.

The (slightly less-than-formal) grammar for each line would look something like this:
Line -> SSL | PSL SSL -> SNP SV A O PSL -> PNP PV A O SNP -> Singular Noun | Singular Noun Phrase PNP -> Plural Noun | Plural Noun Phrase SV -> Singular Verb PV -> Plural Verb A -> Adjective O -> Object

We could extend this to the entire poem:
Poem -> Stanza Stanza Stanza Stanza Stanza - > Line Line Line Line

The word list was meant to be dark and foreboding but was often hilarious–the examples included:

Nouns & Noun Phrases:

My solitude
Your touch
A ravenous she-wolf
Spiders

Verbs & Verb Phrases:

entangles
summons
grovels before
spews forth

Adjectives:

labyrinthine
diseased
spectral
infernal

Line -> SSL -> SNP SV A O could become:

Your touch entangles infernal spiders.

I don’t know why the list of objects was a separate list of nouns, as it seems to me now that it could have pulled from the same list. Since the grammar used just one sentence structure, the results were very repetitive but frequently humorous. I often considered expanding the possible sentence structure (something as simple as making the adjective optional), but decided that the repetition was part of the charm. In fact, many poems and song lyrics feature repetition, and the results seemed eerily intentional at times.

The page was very popular for a time. I received quite a bit of e-mail regarding the page, including suggestions for additional words. Someone sent a song they’d recorded for which they used the random lyrics (with the addition of a shouted, “There’s that m———— word again!” in the middle). At least one randomly-generated poem was published in a small poetry journal.

I’ve considered creating a sequel to parody William Carlos Williams and loading it up with words from his own poems:
Poem -> S1 S2 S3 S4 S1 -> Noun Verb newline Preposition S2 -> Article Adjective Noun newline Noun S3 -> Adjective Preposition Adjective newline Noun S4 -> Preposition Article Adjective newline Noun

Which might produce something like:

no one spilled
with

the whole honey
suckle

pressed after sweet
odor

while the urgent
petals

It just doesn’t seem quite as funny or compelling. I can venture a guess that William’s sparse form and carefully selected language doesn’t lend itself to random imitation as well as verbose and self-indulgent free verse. Although perhaps the sample is merely too small!

Using FFmpeg to programmatically slice and splice video

My wife has a research project in which she needs to analyze brief (8-second) segments of hundreds of much longer videos. My goal was to take the videos (~30 minutes each) and cut out only the relevant sections and splice them together, including a static marker between each segment. This should allow her and her colleagues to analyze the videos quickly and using precise time-points (instead of using a slider in a video player to locate and estimate time-points). I’ve posted my notes from this process below for my own reference, and in case it should prove useful to anyone else.

To my knowledge, the best tool for the job is FFmpeg, an open source video tool. Continue reading Using FFmpeg to programmatically slice and splice video

Robert Sedgewick: “Algorithms for the Masses”

On 9 April 2012, I saw Robert Sedgewick give the talk, “Algorithms for the Masses,” on the campus of Drexel University. I have several of Sedgewick’s books on my shelves at home, including Algorithms in Java, Third Edition, Parts 1-5 and Introduction to Programming in Java: An Interdisciplinary Approach. One of my previous computer science professors, Kenneth Sloan, counted Sedgewick among his classmates.

The basic thesis of the lecture was that good algorithms matter and that we need to champion good algorithms where they are most needed (particularly in the sciences).

One of his points was that computer science is currently very abstract and lacks a basis in the scientific method. Algorithms need to be tested against models to see how they actually perform. In some cases, the theoretical performance of an algorithm can be off by several orders of magnitude compared to actual performance. For example, the quicksort algorithm is quadratic (N²) in the worst case, but N log N most of the time. There’s a reason why quicksort (by the way, the subject of Sedgewick’s 1975 PhD dissertation) is widely used, in spite of the fact that it is O(N²) versus binary sort’s O(N log N).

Sedgewick said, though, that he has run into many computer scientists who fail to observe the difference between theoretical worst-case and actual performance. Some will choose an algorithm based on Big-O analysis alone. Sedgewick’s response: Big-O is an upper bound, but is your input an example of the worst-case? Probably not. Algorithms should be chosen based on their actual performance.

[As an alternative to Big-O notation, Sedgewick suggested Tilde notation, although from my perspective I don’t see that there is a great difference between them.]

He also gave an example of taking theory too far in the other direction. A computer scientist gave a talk demonstrating that his algorithm, Algorithm B, though exceedingly complex, was superior to the simpler Algorithm A. When Sedgewick asked him why, he explained that Algorithm B removed a log log N factor. Sedgewick’s analysis was that log log N, in this universe, amounts to 6 — hardly worth trading algorithms for what, realistically, amounts to a constant factor.

[Why 6? Wikipedia and other sources estimate the number of atoms in the observable universe at 10⁸⁰. The natural logarithm of 10⁸⁰ is 184. The natural logarithm of 184 is 5.2. 6 sounds like a fine estimate.]

Another point was that scientists often need algorithms in their daily work, but do things the hard way for a lack of knowledge. One example was a biologist who was trying to use Excel to calculate a standard deviation for over a million data points, an idea that caused several audience members to cringe.

How do we bring a better understanding of algorithms to the masses? (By masses, I think he really means the masses of college-educated scientists–not quite everyone, but still a much larger group than just computer scientists.) He had several suggestions:

Analytic Combinatorics
From what I gather, analytic combinatorics is a way of using formal languages to describe recurrence relations, and thus a simpler (and easier-to-teach) method of creating generating functions. I don’t exactly know what that means, but you can read the book on it (by Flajolet and, of course, Sedgewick): Analytic Combinatorics (PDF).

Testing Algorithms Empirically
In computer science classes, he suggests students run a program on an increasing series of inputs (e.g. n = { 1000, 2000, 4000, 8000, 16000, … }) and examining the ratios of input size to run times to understand the real impact of running in linear time, N log N time, quadratic time, etc. (This is something that some of my past computer science courses have included, so apparently they have already adopted this piece of Sedgewick’s advice.)

Changing Intro to CS Courses
Sedgewick recommends identifying core elements of computer science, such as classic algorithms, and teaching them to everyone as early as possible. Some changes made to the curriculum at Princeton (where Sedgewick teaches) have led to a dramatic increase in enrollment in intro computer science courses and from a wider range of majors.

Change Publishing
Sedgewick touted his Introduction to Programming in Java and its accompanying web site as a major change for textbooks. The programming examples are short and simple, but demonstrate a wide range of real-world applications across several branches of science. (Although he did not mention it in the lecture, I also appreciated that the examples in this book often include graphics, sound, and animation — which are far more thrilling results than the usual ASCII that intro CS students see as the fruits of their labors.)

He also criticized academic publishing for making journal articles look as much like boring print articles as possible, in spite of the fact that they are now primarily accessed online. Where are the full-color figures? The hyperlinks? Animated simulations? These things are all possible online, but instead publishers restrict content to the form that is least likely to be accessed.

Several times in the lecture, he also mentioned that freshly-minted computer scientists often have little or no background in science: no physics, no chemistry, no biology. Although he recognized this as a problem, he had no solutions (other than to, perhaps, require foundational science courses as part of a CS degree).

The last two items–changing the curriculum and publishing–really sounded like a Sedgewick paid-programming infomercial. “Everyone should take courses in the subject of which I am an expert, and they should use the book I wrote to teach it.” It was hard not to be a little skeptical of his motives. At the same time, I can’t help but think that he’s right.