Hosting a static site on AWS using S3 and CloudFront

A few years ago, Michael Berkowski gently scolded me for hosting a site on HTTP — not HTTPS. I decided that the easiest way to fix this (ignoring Let’s Encrypt for now) was to instead host the site, a static site that hasn’t been updated in years, on AWS. Specifically, to host the site using S3 and CloudFront.

The domain was redbuswashere.com, related to a road trip adventure that didn’t go exactly as planned.

Since that time, I’ve migrated several other sites to AWS, using S3 to store the files and CloudFront as the front-end CDN. I’ve learned a few things in the process, including several of the things that can go wrong. I’ve also created a YouTube video on the process, for people who want to see this step-by-step: Hosting a Static HTML Site on AWS S3.

Continue reading Hosting a static site on AWS using S3 and CloudFront

DirectoryIndex on a static HTML site hosted by AWS

Apache’s mod_dir has a DirectoryIndex option so that if you request a directory, it can return the index document for that directory. For example:

https://www.example.com/dir/ would return https://www.example.com/dir/index.html

The directive typically looks something like this:

DirectoryIndex index.html index.cgi index.pl index.php index.xhtml index.htm

(It’s been many years since I’ve seen index.cgi and index.pl!)

When I recently converted a WordPress site to a static site and hosted it via AWS CloudFront backed by AWS S3 buckets, I found that directory indexes didn’t work. A request for https://www.example.com/dir/ would return a 403 Forbidden error.

Stack Overflow to the rescue (and a question from 2015, no less): “How do you set a default root object for subdirectories for a statically hosted website on Cloudfront?” included several possible solutions.

The solution I liked best was to deploy a pre-built Lambda function that implements similar functionality: standard-redirects-for-cloudfront.

Note that the instructions guide you to get the ARN from the CloudFormation output panel. This is important, as you need not just the ARN but also an appended version number. (In my case it was the ARN followed by :1.) Otherwise you’ll get the following error when adding it to the Origin request section of the CloudFront behavior:

The function ARN must reference a specific function version. (The ARN must end with the version number.)
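
For reference, a Lambda ARN with an appended version number looks something like the following. The region, account ID, and function name here are placeholders; use the exact value from your own CloudFormation output:

arn:aws:lambda:us-east-1:123456789012:function:my-redirect-function:1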

Minor improvements to legacy Perl code

We’re always working with code we didn’t write. You’ll spend far more time looking at code you didn’t write (or don’t remember writing) than you will spend writing new code.

Today I looked at an example Perl script that used 45 lines of code to pull the company associated with an OUI (Organizationally Unique Identifier) from a text file, given a MAC address.

I thought I could do slightly better.

find_mac_co.sh:

#!/bin/sh
# Strip separators from the MAC address and keep the first six hex digits (the OUI)
OUI=$(echo "$1" | sed 's/[^A-Fa-f0-9]//g' | cut -c1-6)
# Look up the OUI in the tab-separated database and print the company name (third column);
# IGNORECASE is a gawk extension
awk -F "\t" -v IGNORECASE=1 -v OUI="$OUI" '$0 ~ OUI { print $3 }' ouidb.tsv
exit 0

Example run:

$ sh find_mac_co.sh 7c:ab:60:ff:ff:ff
Apple, Inc.

There’s probably a way to make the Perl version shorter too. I’m more familiar with bash and shell commands.

The biggest problem with this script is that it relies on an up-to-date list of OUIs. An even better way is to query an API:

find_mac_co_api.sh

#!/bin/sh
MACADDRESS="$1"
curl "https://api.maclookup.app/v2/macs/$MACADDRESS/company/name"
exit 0

Example run:

$ sh find_mac_co_api.sh 7c:ab:60:ff:ff:ff
Apple, Inc.

Renaming multiple files: replacing or truncating varied file extensions

In the previous post, I ran into an issue where Wget saved files to disk verbatim, including query strings/parameters. The files on disk ended up looking like this:

  • wp-includes/js/comment-reply.min.js?ver=6.4.2
  • wp-includes/js/jquery/jquery-migrate.min.js?ver=3.4.1
  • wp-includes/js/jquery/jquery.min.js?ver=3.7.1
  • wp-includes/css/dist/block-library/style.min.css?ver=6.4.2

I wanted to find a way to rename all these files, truncating each filename at the question mark (removing the question mark and everything after it). As an example, jquery.min.js?ver=3.7.1 would become jquery.min.js.
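
As a rough sketch of one way to do this in the shell (the full post may take a different approach), find the affected files and strip everything from the question mark onward with a parameter expansion:

# Rename files whose names contain a query string, truncating at the first '?'
find . -type f -name '*\?*' | while IFS= read -r f; do
    mv -- "$f" "${f%%\?*}"
done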

Continue reading Renaming multiple files: replacing or truncating varied file extensions

Converting a WordPress site to a static site using Wget

I recently made a YouTube tutorial on converting a WordPress site to a static HTML site. This blog post is a companion to the video.

First of all, why convert a WordPress site to a static HTML site? There are a number of reasons, but my primary concern is to reduce update fatigue. WordPress itself, along with WordPress themes and plugins, has frequent security updates. Many sites have stable content after an initial editing phase, and applying never-ending security updates to a site that doesn’t change doesn’t make sense.

The example site I used in the tutorial is www.stress2012.com, a site for an academic conference/workshop that was held in 2012. It’s 2024: the site content is not going to change.

To mirror the site, I used Wget with the following command:

Continue reading Converting a WordPress site to a static site using Wget

3 ways to remove blank lines from a file

There are certainly more than 3 ways to do this. I’ve typically used sed, but here is my sed method along with two other methods using tr and awk:

sed:

sed '/^$/d' file_with_blank_lines

tr:

tr -s '\n' <file_with_blank_lines

awk:

awk '{ if ($0 != "") print $0 }' file_with_blank_lines

(Comparing against the empty string avoids an awk gotcha: a bare if ($0) test treats a line containing just a 0 as false and drops it.)

If you have other favorite ways, leave a note in the comments!

Migrating database servers

As I’m migrating websites and applications from one server to another, I’m also migrating databases from one server to another.

Even though I’ve done this dozens, if not hundreds, of times, I always find myself looking up how to do this. I’m migrating from one MySQL (MariaDB) server to another, so it’s relatively straightforward, but there’s still some command syntax I don’t remember off the top of my head.

First, export the old database to a file:

DBHOST=old-db-host.osric.com
DBUSER=dbusername
DBNAME=dbname
mysqldump --add-drop-table -h $DBHOST -u $DBUSER -p $DBNAME >$DBNAME.07-NOV-2023.bak.sql

This mysqldump command produces output that will re-create the necessary tables and insert the data.

In this case I’m not compressing the output, but it would be trivial to pipe the output of mysqldump to a compression utility such as xz, bzip2, or gzip. For my data, which is entirely text-based, any of these utilities performs well, although xz achieves the best compression:

mysqldump --add-drop-table -h $DBHOST -u $DBUSER -p $DBNAME | xz -c >$DBNAME.07-NOV-2023.bak.sql.xz
mysqldump --add-drop-table -h $DBHOST -u $DBUSER -p $DBNAME | bzip2 -c >$DBNAME.07-NOV-2023.bak.sql.bz2
mysqldump --add-drop-table -h $DBHOST -u $DBUSER -p $DBNAME | gzip -c >$DBNAME.07-NOV-2023.bak.sql.gz
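
If you go the compression route, there is no need to decompress to an intermediate file when it comes time to import on the new server; you can pipe the decompressed output straight to mysql. A sketch using the xz example above:

xz -dc $DBNAME.07-NOV-2023.bak.sql.xz | sudo mysql --user=root --host=localhost dbname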

Next, create the new database and a database user account. This assumes there is a database server running:

sudo mysql -u root
CREATE DATABASE dbname;
CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'your-t0p-s3cr3t-pa55w0rd';
GRANT ALL PRIVILEGES ON dbname.* TO 'dbuser'@'localhost';

Note that the CREATE USER and GRANT statements will result in a “0 rows affected” message, which is normal:

Query OK, 0 rows affected (0.002 sec)

There are other ways to create the database; see 7.4.2 Reloading SQL-Format Backups in the MySQL documentation.

Next, import the database from the file. This example uses the root user because I did not grant dbuser the PROCESS privilege (which is a global privilege, not a table-level one):

sudo mysql --user=root --host=localhost dbname <dbname.07-NOV-2023.bak.sql
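
To confirm the import worked, a quick sanity check is to list the tables in the new database:

sudo mysql --user=root --host=localhost -e 'SHOW TABLES' dbname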

WordPress 6.3 is incompatible with older versions of PHP

After installing WordPress 6.3, this site was broken because the new version of WordPress isn’t compatible with PHP 5.x.

I know WordPress has been complaining about this for a while, but PHP 5.x is the default version on CentOS 7, which is still supported until June 30, 2024.

Instead of encouraging users on systems with old versions of PHP to apply the update, I would expect WordPress to warn that applying the update will absolutely break the target website.

I’m exceedingly annoyed at WordPress. An absolutely terrible experience.

I currently have the site running on a temporary server that is a little fragile; it remains to be seen how stable it will be over the coming days.

Running Splunk in AWS

I don’t like using Google Analytics. The data is useful and well-presented, but I really just want basic web stats without sending all my web stats (along with data from my users) to Google. I’ve considered a number of other options, including Matomo. But I already use Splunk at work, so why not run Splunk at home too?

Splunk Enterprise offers a 60-day trial license. After that, there’s a free license. It’s not really clear that the free license covers what I’m trying to do here. The free license info includes:

If you want to run Splunk Enterprise to practice searches, data ingestion, and other tasks without worrying about a license, Splunk Free is the tool for you.

I think this scenario qualifies. This is a hobby server. I use Splunk at my day job, so this is in some sense Splunk practice. I’ll give it more thought over the next 60 days. Your situation may vary!

I launched an EC2 instance in AWS (Amazon Web Services). I picked a t2.micro instance. That instance size might be too small, but I’m not planning to send much data there. I picked Amazon Linux, which uses yum and RPMs for package management, familiar from the RHEL, CentOS, and now Rocky Linux servers I use frequently. (One thing to note: the default user for Amazon Linux is ec2-user. I always have to look that up.)

For purposes of this post, I’ll use 203.0.113.18 as the EC2 instance’s public IP address. (203.0.113.0/24 is an address block reserved for documentation, see RFC 5737.)

I transferred the RPM to the new server. I’m using Splunk 9.0.3, the current version as of this writing. I installed it:

sudo yum install splunk-9.0.3-dd0128b1f8cd-linux-2.6-x86_64.rpm

Yum reported the installed size as 1.4 GB. That’s important to note, since I used an 8 GB disk, the default volume size when I launched the EC2 instance.

I added an inbound rule to the security group associated with the EC2 instance to allow 8000/tcp traffic from my home IPv4 address.
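
I did that in the AWS console, but the same rule can be added with the AWS CLI. A sketch, where the security group ID and the home address (another RFC 5737 documentation address) are placeholders:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8000 --cidr 198.51.100.27/32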

The installation works! I was able to connect to 203.0.113.18:8000 in a web browser. My connection to 203.0.113.18:8000 was not encrypted, but one thing at a time, right?

Disk space, as I suspected, might be an issue. This warning appeared in Splunk’s health status:

MinFreeSpace=5000. The diskspace remaining=3962 is less than 1 x minFreeSpace on /opt/splunk/var/lib/splunk/audit/db
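
A quick way to see how much space is actually left on the volume behind that path:

df -h /opt/splunk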

Next question: how do I get data into Splunk? The Splunk Enterprise download page helpfully includes a link to a “Getting Data In — Linux” video, although the video focused on ingesting local logs. I’m more interested in setting up the Splunk Universal Forwarder on a different server and ingesting logs from the osric.com web server. I installed the Splunk forwarder on the target web server.

I enabled a receiver via Splunk web (see Enable a receiver for Splunk Enterprise for information). I used the suggested port, 9997/tcp.
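
(The receiver can also be enabled from the Splunk CLI on the indexer; a sketch, assuming the default install path:)

sudo /opt/splunk/bin/splunk enable listen 9997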

I also allowed this traffic from the web server’s IPv4 address via the AWS security group associated with the EC2 instance.

I configured the forwarder on the target web server (see Configure the universal forwarder using configuration files for more details):

$ ./bin/splunk add forward-server 203.0.113.18:9997
Warning: Attempting to revert the SPLUNK_HOME ownership
Warning: Executing "chown -R splunk /opt/splunkforwarder"
WARNING: Server Certificate Hostname Validation is disabled. Please see server.conf/[sslConfig]/cliVerifyServerName for details.
Splunk username: admin
Password:
Added forwarding to: 203.0.113.18:9997.

I tried running a search, but the disk space limitations finally became apparent:

Search not executed: The minimum free disk space (5000MB) reached for /opt/splunk/var/run/splunk/dispatch. user=admin., concurrency_category="historical", concurrency_context="user_instance-wide", current_concurrency=0, concurrency_limit=5000

I increased the disk to 16 GB. (I’d never done that before for an EC2 instance, but it was surprisingly easy.)
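
Resizing the EBS volume is only part of it; on Amazon Linux you typically also need to grow the partition and the filesystem to use the new space. A sketch, where the device name, partition number, and XFS root filesystem are assumptions that depend on the AMI and instance type:

sudo growpart /dev/xvda 1
sudo xfs_growfs -d /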

I needed to add something to monitor. On the target web server host I ran the following:

$ sudo -u splunk /opt/splunkforwarder/bin/splunk add monitor /var/www/chris/data/logs

The resulting output included the following message:

Checking: /opt/splunkforwarder/etc/system/default/alert_actions.conf
                Invalid key in stanza [webhook] in /opt/splunkforwarder/etc/system/default/alert_actions.conf, line 229: enable_allowlist (value: false).

It’s not clear if that’s actually a problem, and a few search results suggested it wasn’t worth worrying about.

Everything was configured to forward data from the web server to Splunk. How could I find the data? I tried running a simple Splunk search:

index=main

0 events returned. I also checked the indices at http://203.0.113.18:8000/en-US/manager/search/data/indexes, which showed there were 0 events in the main index.

I ran tcpdump on the target web server and confirmed there were successful connections to 203.0.113.18 on port 9997/tcp:

sudo tcpdump -i eth0 -nn port 9997

I tried another search on the Splunk web interface, this time querying some of Splunk’s internal indexes:

index=_* osric

Several results were present. Clearly communication was happening. But where were the web logs?

The splunk user on the target web server doesn’t have permissions to read the web logs! I ran the following:

chown apache:splunk /var/www/chris/data/logs/osric*

After that change, the Indexes page in the Splunk web interface still showed 0 events in the main index.

I followed the advice on What are the basic troubleshooting steps in case of universal forwarder and heavy forwarder not forwarding data to Splunk?, but still couldn’t find the problem. I took another close look at the advice to check permissions. Tailing a specific log file worked fine, but getting a directory listing as the splunk user failed:

$ sudo -u splunk ls logs
ls: cannot open directory logs: Permission denied

Of course! The splunk user had access to the logs themselves, but not to the directory containing them. It couldn’t enumerate the log files. I ran the following:

$ sudo chgrp splunk logs

That did it! Logs were flowing! Search queries like the following produced results on the Splunk web interface:

index=main

The search was slow, and there were warnings present when searching:

Configuration initialization for /opt/splunk/etc took longer than expected (1964ms) when dispatching a search with search ID 1676220274.309. This usually indicates problems with underlying storage performance.

It looks like t2.micro is much too small and under-powered for Splunk, even for an instance with very little data (only 3 MB of data and 20,000 log events in the main index).

Despite these drawbacks, the data was searchable. How did Splunk compare as a solution?

Dashboards
I’ll need to create dashboards from scratch. I’ll want to know top pages, top URIs resulting in 404 errors, top user agents, etc. All of those will need to be built. It’s possible there’s a good Splunk app available that includes a lot of common dashboards for the Apache web server, but I haven’t really explored that.

Google Analytics can’t report on 404 errors, but otherwise it provides a lot of comprehensive dashboards and data visualizations. Even if all you want are basic web stats, an application tailored to web analytics will include a lot of ready-made functionality.

Robots, Spiders, and Crawlers (and More)
It turns out, a large percentage of requests to my web server are not from human beings. Many of the requests are coming from robots. At least 31% of requests in the past day were coming from these 9 bots:

  • 8LEGS
  • Sogou
  • PetalBot
  • AhrefsBot
  • SEOkicks
  • zoominfobot
  • SemrushBot
  • BingBot
  • DotBot

Google Analytics (and presumably other web analytics tools) does a great job of filtering these out. It’s good to know which bots are visiting, but it doesn’t really tell me anything about which content is most popular with users.

Security Insights
Related to the above, the stats from the web logs do a much better job of showing suspicious activity than Google Analytics does. It’s much easier to see which IP addresses are requesting files that don’t exist, or are repeatedly trying and failing to log in to WordPress (19% of all requests are for wp-login.php). This is useful information that I can use to help protect the server: I’ve previously written about how to block WordPress scanners using fail2ban. A tool dedicated to web analytics likely won’t provide this kind of detail, and may in fact hide it from site administrators if they aren’t also reviewing their logs.
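
That 19% figure came from the indexed web logs, but for reference, here is a quick way to get a similar number straight from an Apache access log with awk (the log path is a placeholder):

awk '/wp-login\.php/ { hits++ } { total++ } END { if (total) printf "%.1f%% of %d requests were for wp-login.php\n", 100 * hits / total, total }' /var/log/httpd/access_log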

Costs
The t2.micro instance will cost me approximately 8 USD per month, but it clearly isn’t powerful enough to run Splunk at any reasonable level of performance, even for a single-user system with a fairly small number of log events.

What is the right size instance? I don’t have enough experience running Splunk as an administrator to make a guess, or even to determine if the bottleneck is CPU (likely) or RAM. But I decided to at least try upgrading the instance to t2.medium to see if that made a difference, since that includes 2 virtual CPUs (twice that of the t2.micro) and 4 GB RAM (four times that of t2.micro).

It did make a difference! The Splunk web interface is much faster now, but will cost roughly 33 USD per month. That’s getting close to the amount I pay to run the web server itself. I think setting up Splunk to collect web stats was a useful exercise, but I’m going to look at some other alternatives as Google Analytics replacements.

DIY Gist Chatbots

[This was originally posted at the now-defunct impractical.bot on 23 Feb 2019]

I created a tool that will allow anyone to experiment with NLTK (Natural Language Toolkit) chatbots without writing any Python code. The repository for the backend code is available on GitHub: Docker NLTK chatbot.

I plan to expand on this idea, but it is usable now. In order to create your own bot:

  • Create a GitHub account
  • Create a “gist” or fork my demo gist: Greetings Bot Source
  • Customize the name, match, and replies elements
  • Note your username and the unique ID of your gist (a hash value, a 32-character string of letters and numbers)
  • Visit http://osric.com/chat/user/hash, replacing user with your GitHub username and hash with the unique ID of your gist. For an example, see Greetings Bot.

You can now interact with your custom bot, or share the link with your friends!

One more thing: if you update your gist, you’ll need to let the site know to update the code. Just click the “Reload Source” link on the chat page.