Running Splunk in AWS

I don’t like using Google Analytics. The data is useful and well-presented, but I really just want basic web stats without sending all that data (along with data about my users) to Google. I’ve considered a number of other options, including Matomo. But I already use Splunk at work, so why not run Splunk at home too?

Splunk Enterprise offers a 60-day trial license. After that, there’s a free license. It’s not really clear that the free license covers what I’m trying to do here. The free license info includes:

If you want to run Splunk Enterprise to practice searches, data ingestion, and other tasks without worrying about a license, Splunk Free is the tool for you.

I think this scenario qualifies. This is a hobby server. I use Splunk at my day job, so this is in some sense Splunk practice. I’ll give it more thought over the next 60 days. Your situation may vary!

I launched an EC2 instance in AWS (Amazon Web Services). I picked a t2.micro instance. That instance size might be too small, but I’m not planning to send much data there. I chose Amazon Linux, which uses yum and RPMs for package management, familiar from the RHEL, CentOS, and now Rocky Linux servers I use frequently. (One thing to note: the default user for Amazon Linux is ec2-user. I always have to look that up.)
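
For reference, the AWS CLI can launch an equivalent instance with something like the following (the AMI ID, key pair, and security group ID here are placeholders, not the values I actually used):

aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t2.micro \
    --key-name my-key-pair \
    --security-group-ids sg-0123456789abcdef0 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=splunk}]'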

For purposes of this post, I’ll use 203.0.113.18 as the EC2 instance’s public IP address. (203.0.113.0/24 is an address block reserved for documentation, see RFC 5737.)

I transferred the RPM to the new server. I’m using Splunk 9.0.3, the current version as of this writing. I installed it:

sudo yum install splunk-9.0.3-dd0128b1f8cd-linux-2.6-x86_64.rpm

Yum reported the installed size as 1.4 GB. That’s worth noting, since I used an 8 GB root volume, the default size when I launched the EC2 instance.

I added an inbound rule to the security group associated with the EC2 instance to allow 8000/tcp traffic from my home IPv4 address.
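
The equivalent AWS CLI command looks roughly like this (the security group ID and home address are placeholders; 198.51.100.0/24 is another documentation block from RFC 5737):

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 8000 \
    --cidr 198.51.100.27/32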

The installation worked! I was able to connect to 203.0.113.18:8000 in a web browser. The connection was not encrypted, but one thing at a time, right?

Disk space, as I suspected, might be an issue. This warning appeared in Splunk’s health status:

MinFreeSpace=5000. The diskspace remaining=3962 is less than 1 x minFreeSpace on /opt/splunk/var/lib/splunk/audit/db

Next question: how do I get data into Splunk? The Splunk Enterprise download page helpfully includes a link to a “Getting Data In — Linux” video, although the video focused on ingesting local logs. I’m more interested in setting up the Splunk Universal Forwarder on a different server and ingesting logs from the osric.com web server. I installed the Splunk forwarder on the target web server.

I enabled a receiver via Splunk web (see Enable a receiver for Splunk Enterprise for information). I used the suggested port, 9997/tcp.
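
The same receiver can also be enabled from the command line on the Splunk server, assuming Splunk runs as the splunk user created by the RPM install:

sudo -u splunk /opt/splunk/bin/splunk enable listen 9997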

I also allowed this traffic from the web server’s IPv4 address via the AWS security group associated with the EC2 instance.

I configured the forwarder on the target web server (see Configure the universal forwarder using configuration files for more details):

$ ./bin/splunk add forward-server 203.0.113.18:9997
Warning: Attempting to revert the SPLUNK_HOME ownership
Warning: Executing "chown -R splunk /opt/splunkforwarder"
WARNING: Server Certificate Hostname Validation is disabled. Please see server.conf/[sslConfig]/cliVerifyServerName for details.
Splunk username: admin
Password:
Added forwarding to: 203.0.113.18:9997.
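
Behind the scenes, that command writes the forwarding configuration to outputs.conf under /opt/splunkforwarder/etc/system/local. The result looks roughly like the following (stanza names can vary slightly by version):

[tcpout]
defaultGroup = default-autolb-group

[tcpout:default-autolb-group]
server = 203.0.113.18:9997

[tcpout-server://203.0.113.18:9997]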

I tried running a search, but the disk space limitations finally became apparent:

Search not executed: The minimum free disk space (5000MB) reached for /opt/splunk/var/run/splunk/dispatch. user=admin., concurrency_category="historical", concurrency_context="user_instance-wide", current_concurrency=0, concurrency_limit=5000

I increased the disk to 16 GB. (I’d never done that before for an EC2 instance, but it was surprisingly easy.)
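
Resizing is a two-step process: grow the EBS volume (via the console or aws ec2 modify-volume), then grow the partition and filesystem from inside the instance. On Amazon Linux the in-instance steps look roughly like this (the device name may differ, and use resize2fs instead of xfs_growfs if the root filesystem is ext4 rather than XFS):

sudo growpart /dev/xvda 1
sudo xfs_growfs -d /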

I needed to add something to monitor. On the target web server host I ran the following:

$ sudo -u splunk /opt/splunkforwarder/bin/splunk add monitor /var/www/chris/data/logs

The resulting output included the following message:

Checking: /opt/splunkforwarder/etc/system/default/alert_actions.conf
                Invalid key in stanza [webhook] in /opt/splunkforwarder/etc/system/default/alert_actions.conf, line 229: enable_allowlist (value: false).

It’s not clear if that’s actually a problem, and a few search results suggested it wasn’t worth worrying about.
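
Warnings aside, the add monitor command records a monitor stanza in the forwarder’s inputs.conf (under etc/apps/search/local or etc/system/local, depending on the version), roughly:

[monitor:///var/www/chris/data/logs]
disabled = false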

Everything was configured to forward data from the web server to Splunk. How could I find the data? I tried running a simple Splunk search:

index=main

0 events returned. I also checked the indices at http://203.0.113.18:8000/en-US/manager/search/data/indexes, which showed there were 0 events in the main index.

I ran tcpdump on the target web server and confirmed there were successful connections to 203.0.113.18 on port 9997/tcp:

sudo tcpdump -i eth0 -nn port 9997

I tried another search on the Splunk web interface, this time querying some of Splunk’s internal indexes:

index=_* osric

Several results were present. Clearly communication was happening. But where were the web logs?

The splunk user on the target web server didn’t have permission to read the web logs! I ran the following:

chown apache:splunk /var/www/chris/data/logs/osric*

After that change, the Indexes page in the Splunk web interface still showed 0 events in the main index.

I followed the advice in What are the basic troubleshooting steps in case of universal forwarder and heavy forwarder not forwarding data to Splunk?, but nothing stood out. I took another close look at the advice to check permissions. Tailing a specific log file as the splunk user worked fine, but getting a directory listing failed:

$ sudo -u splunk ls logs
ls: cannot open directory logs: Permission denied

Of course! The splunk user had access to the logs themselves, but not to the directory containing them. It couldn’t enumerate the log files. I ran the following:

$ sudo chgrp splunk logs

That did it! Logs were flowing! Search queries like the following produced results on the Splunk web interface:

index=main

The search was slow, and warnings appeared while searching:

Configuration initialization for /opt/splunk/etc took longer than expected (1964ms) when dispatching a search with search ID 1676220274.309. This usually indicates problems with underlying storage performance.

It looks like the t2.micro is much too small and underpowered for Splunk, even for an instance with very little data (only 3 MB of data and 20,000 log events in the main index).

Despite these drawbacks, the data was searchable. How did Splunk compare as a solution?

Dashboards
I’ll need to create dashboards from scratch: top pages, top URIs resulting in 404 errors, top user agents, and so on. It’s possible there’s a good Splunk app available that includes a lot of common dashboards for the Apache web server, but I haven’t really explored that.
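
As a starting point, searches like the following would back those dashboard panels, assuming Splunk recognizes the forwarded logs as the access_combined sourcetype (the field names come from that sourcetype and may differ otherwise):

index=main sourcetype=access_combined | top limit=20 uri_path
index=main sourcetype=access_combined status=404 | top limit=20 uri_path
index=main sourcetype=access_combined | top limit=20 useragent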

Google Analytics can’t report on 404 errors, but otherwise it provides a lot of comprehensive dashboards and data visualizations. Even if all you want are basic web stats, an application tailored to web analytics will include a lot of ready-made functionality.

Robots, Spiders, and Crawlers (and More)
It turns out a large percentage of the requests to my web server are not from human beings at all. At least 31% of requests in the past day came from these 9 bots:

  • 8LEGS
  • Sogou
  • PetalBot
  • AhrefsBot
  • SEOkicks
  • zoominfobot
  • SemrushBot
  • BingBot
  • DotBot

Google Analytics (and presumably other web analytics tools) does a great job of filtering these out. It’s good to know which bots are visiting, but bot traffic doesn’t tell me anything about which content is most popular with actual users.
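
A crude approximation of that filtering is possible in Splunk. A search along these lines excludes the obvious bots before counting top pages (a sketch only; the regex needs tuning, and the useragent field again assumes the access_combined sourcetype):

index=main sourcetype=access_combined
| regex useragent!="(?i)(bot|spider|crawl|sogou|petal|ahrefs|semrush)"
| top limit=20 uri_path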

Security Insights
Related to the above, the stats from the web logs do a much better job of showing suspicious activity than Google Analytics does. It’s much easier to see which IP addresses are requesting files that don’t exist, or are repeatedly trying and failing to log in to WordPress (19% of all requests are for wp-login.php). This is useful information that I can use to help protect the server: I’ve previously written about how to block WordPress scanners using fail2ban. A tool dedicated to web analytics likely won’t provide this kind of detail, and may in fact hide it from site administrators if they aren’t also reviewing their logs.
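
As an example of that kind of detail, a search roughly like this one lists the IP addresses hammering the WordPress login page, which could then be fed to fail2ban or a firewall rule (again assuming access_combined field names):

index=main sourcetype=access_combined uri_path="/wp-login.php"
| stats count by clientip
| sort - count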

Costs
The t2.micro instance costs me approximately 8 USD per month, but it clearly isn’t powerful enough to run Splunk at any reasonable level of performance, even for a single-user system with a fairly small number of log events.

What is the right instance size? I don’t have enough experience running Splunk as an administrator to guess, or even to determine whether the bottleneck is CPU (likely) or RAM. But I decided to at least try upgrading the instance to t2.medium to see if that made a difference, since it includes 2 virtual CPUs (twice that of the t2.micro) and 4 GB of RAM (four times that of the t2.micro).

It did make a difference! The Splunk web interface is much faster now, but the instance will cost roughly 33 USD per month. That’s getting close to the amount I pay to run the web server itself. I think setting up Splunk to collect web stats was a useful exercise, but I’m going to look at some of the other alternatives to Google Analytics.