The Problem:
Late last year, not long after Tumblr announced changes to a number of its content policies to get its app back on the iOS App Store, I removed all of the posts from a blog I had been maintaining since May of 2013. I ended up deleting 5,321 posts. Since this was the largest online collection of my public writing, I backed it up first using Tumblr’s GDPR-compliant data export tool.
But all the posts came back with filenames like: 50074020934.html
I’d like to look through all of these posts and maybe restore some of them here to my website, but I don’t have the free time to open all 5,000+ files one by one to decide what is worth restoring and what is better left in the memory of the internet (shitposts are fun, aren’t they?).
The Plan:
The posts all follow a pretty basic format:
<!DOCTYPE HTML>
<html>
<head></head>
<body>
<h1>Derivative (Arkansas) Portrait #3</h1>
<p>I once spent a full hour convincing a tipsy friend of mine that metric time was the reason British television programmes tend to have longer run-times than American ones.</p>
<div id="footer"><span id="timestamp">February 22nd, 2015 9:10pm</span><span class="tag">prose</span></div>
</body>
</html>
My Vogon framework doesn’t have much of a reason to exist yet, but I’ve had a lot of luck using it for data manipulation projects.
My plan is to use PHP’s built-in DOMDocument and DOMXPath classes to open each file, check whether there is a title in the <h1> tag, and then save a copy of the post with the title as the filename, making it a little easier to scan for things that might be worth posting back here.
The Method:
<?php
// ROOT is assumed to be defined elsewhere (for example, by the framework's bootstrap).
$active_dir = ROOT . DIRECTORY_SEPARATOR . 'main' . DIRECTORY_SEPARATOR . 'ext' . DIRECTORY_SEPARATOR . 'post_clean' . DIRECTORY_SEPARATOR;
$data_dir   = $active_dir . 'data' . DIRECTORY_SEPARATOR;
$input_dir  = $data_dir . 'input' . DIRECTORY_SEPARATOR;
$output_dir = $data_dir . 'output' . DIRECTORY_SEPARATOR;

$input = dir_contents($input_dir); // Get a list of our posts.

foreach ($input as $filename) {
    $file = $input_dir . $filename;

    // Try to parse the file directly; fall back to parsing its contents as a string.
    $domdoc = new DOMDocument();
    if (!@$domdoc->loadHTMLFile($file)) {
        $html = file_get_contents($file);
        if (empty($html) || !@$domdoc->loadHTML($html)) {
            continue; // Skip this file rather than aborting the whole run.
        }
    }

    $xpath = new DOMXPath($domdoc);
    $headings = $xpath->query('//h1');
    if ($headings->length > 0) {
        // query() returns a DOMNodeList, so item(0) gives us the first match.
        $title = $headings->item(0)->nodeValue;
        if (!empty($title)) {
            // Replace anything that isn't a letter or digit so the title is safe as a filename.
            $safe_filename = preg_replace('/[^a-zA-Z0-9]/', '_', $title) . '.html';
            // Note: posts that share a title overwrite each other here.
            file_put_contents($output_dir . $safe_filename, file_get_contents($file));
        }
    }
}
?>
The actual code ended up being pretty easy. The only custom function I needed was “dir_contents”, which is just scandir() with an optional filter by extension and automatic removal of the “.” and “..” entries.
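For anyone following along without the framework, here is a rough stand-in for that helper. The name dir_contents_sketch, its parameters, and its return shape are my guesses for illustration, not the framework's actual signature; it only does what the description says (scandir(), an optional extension filter, and no “.” or “..”).

```php
<?php
// Hypothetical stand-in for the framework's dir_contents(); the real
// function's signature is not shown in this post.
function dir_contents_sketch(string $dir, ?string $extension = null): array
{
    if (!is_dir($dir)) {
        return []; // Nothing to list for a missing directory.
    }

    $entries = scandir($dir);
    if ($entries === false) {
        return []; // Directory exists but could not be read.
    }

    $files = [];
    foreach ($entries as $entry) {
        // Drop the "current" and "parent" directory entries.
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        // Optional filter: keep only entries with the requested extension.
        if ($extension !== null
            && strcasecmp(pathinfo($entry, PATHINFO_EXTENSION), $extension) !== 0) {
            continue;
        }
        $files[] = $entry;
    }

    return $files;
}
```

Calling dir_contents_sketch($input_dir, 'html') would then return only the exported post files, in the alphabetical order scandir() uses by default.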
The Results:
Since we dropped any posts that didn’t have a title, and potentially overwrote any posts with the same name, we ended up with only 842 items in our output directory, but as a proof of concept it was very successful. Before I’m finished with it, I’ll probably append the date and maybe the tags to the filename, while also adding some handling for the posts that don’t have a title.
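As a rough sketch of where that could go, the snippet below pulls the title, timestamp, and tags out of a post and folds them into a filename. The function names, the “untitled” fallback, and the idea of appending the original post ID to avoid overwrites are all my assumptions, not something I have run against the full export. The date format string assumes every timestamp looks like the sample post's.

```php
<?php
// Sketch only: both function names and the filename layout are hypothetical.

// Pull the title, timestamp, and tags out of one exported post.
function extract_post_meta(string $html): array
{
    $previous = libxml_use_internal_errors(true); // Keep parser warnings quiet.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Return the text of the first node matching $query, or '' if none.
    $first = function (string $query) use ($xpath): string {
        $nodes = $xpath->query($query);
        return ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->nodeValue)
            : '';
    };

    $tags = [];
    foreach ($xpath->query('//span[@class="tag"]') as $node) {
        $tags[] = trim($node->nodeValue);
    }

    return [
        'title'     => $first('//h1'),
        'timestamp' => $first('//span[@id="timestamp"]'),
        'tags'      => $tags,
    ];
}

// Build "date-title-tags-postid.html"; $postId is the original numeric filename.
function build_post_filename(string $postId, array $meta): string
{
    $clean = function (string $text): string {
        return trim(preg_replace('/[^a-zA-Z0-9]+/', '_', $text), '_');
    };

    $parts = [];

    // Assumes timestamps look like "February 22nd, 2015 9:10pm".
    $date = DateTime::createFromFormat('F jS, Y g:ia', $meta['timestamp']);
    if ($date !== false) {
        $parts[] = $date->format('Y-m-d');
    }

    // Untitled posts get a placeholder instead of being dropped.
    $slug = $clean($meta['title']);
    $parts[] = $slug !== '' ? $slug : 'untitled';

    foreach ($meta['tags'] as $tag) {
        if ($clean($tag) !== '') {
            $parts[] = $clean($tag);
        }
    }

    // The post ID is unique, so posts that share a title no longer collide.
    $parts[] = $postId;

    return implode('-', $parts) . '.html';
}
```

Run against the sample post above with a made-up ID of 50074020934, this would produce 2015-02-22-Derivative_Arkansas_Portrait_3-prose-50074020934.html.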
I’ll also probably put some more effort into making the titles a little less… “computery” so that we don’t get things like “___All_the_Money_or_the_Simple_Life_Honey________The_Dandy_Warhols.html”, which is very clear but doesn’t need to be nearly that long.
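One possible way to do that, sketched here rather than tested on the real export: collapse each run of non-alphanumeric characters into a single underscore, trim the underscores off both ends, and cap the length. The function name tidy_title and the default 60-character limit are arbitrary choices of mine.

```php
<?php
// Sketch only: tidy_title() and its default length cap are hypothetical.
function tidy_title(string $title, int $maxLength = 60): string
{
    // The "+" collapses a whole run of punctuation or spaces into one underscore.
    $slug = preg_replace('/[^a-zA-Z0-9]+/', '_', $title);

    // Remove the leading and trailing underscores left by quotes or spaces.
    $slug = trim($slug, '_');

    // Cap the length, then tidy up in case the cut landed on an underscore.
    return rtrim(substr($slug, 0, $maxLength), '_');
}

echo tidy_title('"All the Money or the Simple Life Honey" - The Dandy Warhols'), "\n";
// prints "All_the_Money_or_the_Simple_Life_Honey_The_Dandy_Warhols"
```

An empty return value means the title had no letters or digits at all, so the caller would still need a fallback name for those posts.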