My RSS Feeds, version 2.0

Having rendered holly inaccessible recently, I found myself cut off from my old quick-and-dirty RSS aggregator: a PHP script that, on demand, downloaded the various feeds and rendered them all on a single page.

It worked, but downloading feeds on demand caused noticeable slowdowns whenever the servers hosting the feeds were slow or unresponsive (*cough* LiveJournal *cough*). A stand-alone aggregator program wouldn’t (presumably) have this problem, but I rather like having a web interface to my favorite RSS feeds (which is the same reason I don’t use Firefox’s new Live Bookmarks feature).

So, now was as good a time as any to write a new, improved aggregator.

This time I wrote it in Perl instead of PHP, and it runs off-line instead of on-demand. It does the same basic thing: fetches RSS feeds one by one, caches them locally (to avoid hammering the servers), and generates an HTML page with all the links organized. Simple but effective.

The new version generates a static HTML page when it’s run, instead of building it dynamically using PHP. This means it’s best run as a cron job to regenerate the page every few hours or so. That way you don’t have to wait for feeds to be fetched, and it’s still reasonably up-to-date.
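For instance, a crontab entry along these lines would regenerate the page every four hours (the path to the script is just a placeholder; use wherever you saved it):

```shell
# min hour dom mon dow  command
0     */4  *   *   *    /home/paul/bin/rss-update.pl
```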

The configuration file the script uses to specify feeds is pretty straightforward. It’s set up as a series of stanzas separated by blank lines; each stanza describes either a feed or a section header (used to group feeds together in categories). It looks something like this:

Section: Technology

URL: http://www.arstechnica.com/
Title: Latest Headlines from Ars
CacheAs: ars
Period: 6

Only the Section: and URL: lines are required (the former for a section-header stanza, the latter for a feed stanza). Title overrides the title given in the RSS feed itself. CacheAs tells it which file to cache the feed into (the default is based on the URL). Period specifies how many hours to wait between fetches (the default is 4).
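To make the format concrete, here’s a minimal sketch of how such stanzas can be read, the same way the script below reads its feed file: a blank line ends a stanza, and every other line is a Key: value pair. (`parse_stanzas` is just an illustrative helper, not part of the script.)

```perl
#! /usr/bin/perl
use strict;
use warnings;

# Split blank-line-separated stanzas of "Key: value" lines into
# a list of hash references, one hash per stanza.
sub parse_stanzas
{
	my ($text) = @_;
	my @stanzas;
	my $scratch = {};
	for my $line (split /\n/, $text, -1)
	{
		if ($line eq '' && %$scratch)
		{
			# Blank line: the current stanza is complete.
			push @stanzas, $scratch;
			$scratch = {};
		}
		elsif ($line =~ /^([^:]+):\s*(.*?)\s*$/)
		{
			$scratch->{$1} = $2;
		}
	}
	push @stanzas, $scratch if %$scratch;   # last stanza may lack a trailing blank line
	return @stanzas;
}

my @entries = parse_stanzas(<<'EOF');
Section: Technology

URL: http://www.arstechnica.com/
Title: Latest Headlines from Ars
EOF

print scalar(@entries), " stanzas\n";   # 2 stanzas
print $entries[1]{URL}, "\n";           # http://www.arstechnica.com/
```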

Anyway, here’s the script, for those interested. It hasn’t been heavily tested yet, but it seems to work OK. License is GPL2. Be sure to change the three config variables near the top to point to what files you want to work with; the ones I have probably won’t work for you.

(Try to ignore the giant gaps. WordPress doesn’t seem to understand you shouldn’t add <p> tags inside a big <pre> block. Grumble.)

#! /usr/bin/perl -w

use strict;

use Getopt::Long;
use LWP::Simple;
use XML::RSS;

my $CACHE_DIR = '/home/paul/.www/rss/cache';
my $FEED_FILE = '/home/paul/.www/rss/feeds.txt';
my $OUTPUT_FILE = '/home/paul/.www/rss/index.html';

my @entries = ();

my $verbose = '';
GetOptions('verbose' => \$verbose);

load_feed_file();
update_cache();
generate_page();

exit 0;

sub load_feed_file
{
	local *FEEDS;
	open FEEDS, '<', $FEED_FILE
		or die "Couldn't open $FEED_FILE: $!";

	my $scratch = {};

	while (<FEEDS>)
	{
		chomp;
		if ($_ eq '' && scalar keys %$scratch > 0)
		{
			push @entries, $scratch;
			$scratch = {};
		}
	elsif (/^([^:]+):\s*(.*?)\s*$/)
		{
			$scratch->{$1} = $2;
		}
	}

	if (scalar keys %$scratch > 0)
	{
		push @entries, $scratch;
	}

	close FEEDS
		or die "Couldn't close $FEED_FILE: $!";
}

sub is_feed
{
	my ($entry) = @_;
	return (defined $entry->{URL});
}

sub is_header
{
	my ($entry) = @_;
	return (defined $entry->{Section});
}

sub update_cache
{
	foreach my $entry (@entries)
	{
		next unless is_feed($entry);

		unless (defined $entry->{CacheAs})
		{
			$entry->{CacheAs} = $entry->{URL};
			$entry->{CacheAs} =~ s#/#_#g;
		}

		$entry->{Period} = 4 unless defined $entry->{Period};
		my $age = -M "$CACHE_DIR/$entry->{CacheAs}";
		$age *= 24 if defined $age;
		if (defined $age && $age < $entry->{Period})
		{
			print "Skipping $entry->{URL} ($age hours old)\n"
				if $verbose;
			next;
		}

		print "Trying $entry->{URL}\n"
			if $verbose && !defined $age;
		print "Trying $entry->{URL} (was $age hours old)\n"
			if $verbose && defined $age;

		my $result = mirror($entry->{URL},
		                    "$CACHE_DIR/$entry->{CacheAs}");
		if (is_error($result))
		{
			warn "Couldn't mirror $entry->{URL} (code $result)";
		}
		print "  ... and the result was $result\n"
			if $verbose;
	}
}

sub generate_page
{
	print "Generating page\n"
		if $verbose;

	local *OUTPUT;
	open OUTPUT, '>', $OUTPUT_FILE
		or die "Couldn't open $OUTPUT_FILE: $!";

	my $now = localtime;

	print OUTPUT <<EOF;
<html>
<head>
<title>My RSS Feeds</title>
</head>
<body>
<h1>My RSS Feeds</h1>
<p><small>Page generated $now.</small></p>
EOF

	foreach my $entry (@entries)
	{
		if (is_header($entry))
		{
			print_header($entry);
		}
		elsif (is_feed($entry))
		{
			print_feed($entry);
		}
	}

	print OUTPUT <<EOF;
</body>
</html>
EOF

	close OUTPUT
		or die "Couldn't close $OUTPUT_FILE: $!";
}

sub print_header
{
	my ($entry) = @_;

	print OUTPUT <<EOF;
<h2>$entry->{Section}</h2>
EOF
}

sub print_feed
{
	my ($entry) = @_;

	my $rss = XML::RSS->new;
	eval { $rss->parsefile("$CACHE_DIR/$entry->{CacheAs}"); };
	if ($@)
	{
		$entry->{Title} = $entry->{URL}
			unless defined $entry->{Title};
		print OUTPUT <<EOF;
<h3><a href="$entry->{URL}">$entry->{Title}</a></h3>
<p>RSS file invalid.</p>
EOF
		return;
	}

	$entry->{Title} = $rss->{channel}->{title}
		unless defined $entry->{Title};
	print OUTPUT <<EOF;
<h3><a href="$rss->{channel}->{link}">$entry->{Title}</a></h3>
<p><small>$rss->{channel}->{description}</small></p>
<ul>
EOF

	foreach my $item (@{$rss->{items}})
	{
		$item->{title} = '[no title]'
			unless defined $item->{title};
		print OUTPUT <<EOF;
<li><a href="$item->{link}">$item->{title}</a></li>
EOF
	}

	print OUTPUT <<EOF;
</ul>
EOF
}

3 Responses

  1. hey, lj isn’t bad. it’s better than blurty normally.

  2. Perhaps, but I follow a *lot* more LiveJournal-hosted blogs than I do Blurty-hosted blogs. If LiveJournal is characteristically slow, it affects the script a lot more.

    Speaking of which, there’s a couple of minor issues the script I posted above has. I suppose I’ll post a revised version in the near future.

  3. Hey, any chance you could post it as a plaintext file somewhere? Safari (and MacOS in general) is pretty aggressive about preserving the whole semantic Unicodey nature of clipboard text, and somewhere along the line when you posted the code it also got all SmartQuotey anyway.
