Importing Slashdot Journal Articles by Yak

I’ve imported all my old Slashdot journal articles because:

  • posterity
  • I like the fact I’ve been writing on the internet for so long and I want my domain to show it
  • there’s something to be said for keeping your own writing on your own domain and not someone else’s
  • because I can.

It turns out that although Slashdot has an export feature, it doesn’t include the journal entries. Let the yak-shaving begin.

What worked

Use wget to download all the paginated lists of posts into html files. (I forget whether I looped this or got wget to spider it; either would work.)

Parse the paginated lists of posts to extract the individual post urls into the file urls.txt:

#!/bin/sh -v
# Start with an empty urls.txt, then append the post links found on each list page.
: > urls.txt
for page in page*.html; do
    xidel --data "$page" --xquery 'for $var in //article return $var//span[@class="story-title"]//a[@rel]/@href' >> urls.txt
done
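
If the list pages repeat entries (one of the dead-ends noted further down mentions repeated content), urls.txt may end up with duplicates or blank lines. A minimal Ruby sketch for tidying it before downloading; `clean_urls` is a hypothetical helper, not part of the original scripts:

```ruby
# Hypothetical helper: trim whitespace, drop blank lines,
# and keep only the first occurrence of each URL (order preserved).
def clean_urls(lines)
  lines.map(&:strip).reject(&:empty?).uniq
end

# Rewrite urls.txt in place when run directly next to an existing file.
if __FILE__ == $PROGRAM_NAME && File.exist?("urls.txt")
  urls = clean_urls(File.readlines("urls.txt"))
  File.write("urls.txt", urls.join("\n") + "\n")
end
```

Running it after the xidel loop and before the download loop would avoid fetching the same post twice.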

Loop through those urls, downloading the individual post pages:

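#!/bin/bash -v
mkdir -p posts
cd posts
while read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    echo "$url"
    # The line that set $file2 is missing from this listing; deriving a flat
    # filename from the URL path is an assumption, not the original naming.
    file2=$(printf '%s' "$url" | sed 's|^.*//||; s|/|-|g').html
    echo "$file2"
    # The hrefs look protocol-relative (//host/path), so add a scheme if needed.
    case "$url" in //*) url="https:$url" ;; esac
    curl "$url" > "$file2"
done < ../urls.txt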

Parse the downloaded files, transforming them into individual markdown files:

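#!/usr/bin/env ruby

require 'nokogiri'
require 'date'
require 'fileutils'

FileUtils.mkdir_p("out")  # make sure the output directory exists

Dir.glob("posts/*").each do |input|
    puts input
    doc = File.open(input) { |f| Nokogiri::HTML(f) }

    doc.css("article[data-fhtype='journal']").each do |a|
        # the href appears to be protocol-relative (it gets "https:" prepended below)
        url = a.at_css(".story-title a[rel]").attribute("href").text
        # skip the first three characters of the <time> text before parsing the date
        date = Date.parse(a.at_css("time").text[3..])
        # url[23..] drops a fixed-length prefix from the href, leaving the post path
        outfile = date.to_s + "-slashdot-journal-" + url[23..].gsub("/","-") + ".md"
        puts outfile
        File.open("out/" + outfile, "w") {|out|
            # front matter
            out.write "---"
            out.write "\n"
            out.write "title: \"" + a.at_css(".story-title a[rel]").text + "\""
            out.write "\n"
            out.write "date: "+ date.to_s
            out.write "\n"
            out.write "slashdot_url: https:"+ url
            out.write "\n"
            out.write "---"
            out.write "\n"
            out.write "\n"
            # body: one block per paragraph div, separated by blank lines
            a.css("div[class=body] div[class=p]").each do |p|
                out.write p.inner_html.strip
                out.write "\n"
                out.write "\n"
            end
        }
    end
end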
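
The `url[23..]` slice above appears to assume every href starts with `//slashdot.org/journal/` (23 characters), and the hand-written front matter would break on a title containing a double quote. A hedged sketch of the same two steps as small helpers; `journal_filename` and `front_matter` are hypothetical names, and using the standard `yaml` library for quoting is a swapped-in technique rather than what the script does:

```ruby
require 'date'
require 'yaml'

# Hypothetical helper: build the output filename from a journal href.
# Strips an optional scheme, the host, and "/journal/" instead of a fixed 23 chars.
def journal_filename(url, date)
  slug = url.sub(%r{\A(?:https?:)?//[^/]+/journal/}, "").gsub("/", "-")
  "#{date}-slashdot-journal-#{slug}.md"
end

# Hypothetical helper: let the YAML library handle quoting of the title.
# Assumes url is protocol-relative, as in the script above.
def front_matter(title, date, url)
  fields = { "title" => title, "date" => date, "slashdot_url" => "https:" + url }
  fields.to_yaml + "---\n\n"   # to_yaml already emits the opening "---" line
end
```

Inside the loop these would replace the `outfile = ...` line and the block of `out.write` calls up to the second `---`, for example `out.write front_matter(title, date, url)`.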

Exploration with nokogiri

Once you have an html file on disk you can explore the in-memory model interactively with irb, which helps iterate on scripts like the above more rapidly.


$ irb
irb(main):001:0> require 'nokogiri'
irb(main):002:0> doc = File.open("page-0.html") { |f| Nokogiri::XML(f) }
irb(main):003:0> doc.css("article[data-fhtype='journal']").each {|a| puts "---", "title: " + a.at_css(".story-title").text, "time: "+ a.at_css("time").text, "---" }; nil

Dead-ends explored

  • xq - doesn’t seem to provide a rich enough expression language to pick bits out of html and stitch them back together in interesting ways; it’s more of a tool for capturing better-structured data.
  • xidel - can do xquery, not just xpath; I got further with this, but not far enough
  • wget’ing the paginated list of posts - for some reason this resulted in repeated content when parsed with nokogiri
  • wget’ing individual post pages - I suspected manipulation of the html, so dropped down to curl


