Retrieve your blog posts from a WordPress eXtended RSS file with WXR to HTML

by Jon

If you’ve ever migrated or retired a WordPress blog, you’re probably familiar with WordPress eXtended RSS (WXR) files. There’s a thorough summary here, but basically a WXR file is a copy of all of the textual content on your site: pages, blog posts, and comments. You can use it to migrate that content from blog to blog, or just to archive it for your own backups.
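Under the hood, a WXR file is just an RSS 2.0 feed with some WordPress-specific extension elements mixed in. Heavily trimmed down (the element names are real, but the values here are made up, and the exact `wp` namespace version varies by WordPress release), it looks something like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>My Old Blog</title>
    <link>http://example.com</link>
    <description>Years of posts</description>
    <pubDate>Sat, 01 Dec 2012 00:00:00 +0000</pubDate>
    <item>
      <title>Hello World</title>
      <link>http://example.com/hello-world/</link>
      <pubDate>Mon, 06 Sep 2004 12:00:00 +0000</pubDate>
      <content:encoded><![CDATA[The post body lives here.]]></content:encoded>
    </item>
  </channel>
</rss>
```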

The problem is, there’s not much you can do with the file other than import it back into WordPress or another blogging system. But what if you just want to read the content? What if it’s been years since that blog was live, and you just want to rescue a favorite post?

The file is XML, so it is technically human-readable, but there’s a lot of ugly markup to sift through as well. Not fun. You could import the file back into a WordPress install, but that seems a tad overkill.

I’ve been hit with this exact problem myself. I have the WXR files of two old WordPress sites, with years of content dating back to the early days of blogging. I’d like to do something with that content.

Though WXR files may be a pain for us to read, as XML they’re a breeze for software to parse. With that in mind, I wrote WXR to HTML. It’s a short and sweet Python script for converting a WXR file into a plain, easy-to-read HTML file.

Note, the goal is not to recreate the original sites in all their former glory. I don’t even care that much about the comments; I just want the blog posts in a form that I can read, search, and copy/paste from. It’s not really possible to rebuild the sites anyway; WXR files only contain the raw text: no images, no styles, and no layout information. With those limitations in mind, here’s the script:

#!/usr/bin/env python

"""
WXR to HTML <https://jonthysell.com/>

Copyright 2012 Jon Thysell <thysell@gmail.com>

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
"""

import sys
import codecs
import string
from lxml import etree

_header_html = """
<html>
<head>
<title>%s</title>
</head>
<body>
<h1>%s</h1>
<p>%s</p>
<p><em>Exported from <a href="%s">%s</a> %s</em></p>
"""

_footer_html = """
<p><em>HTML generated by WXR to HTML &lt;<a href="http://jonthysell.com">http://jonthysell.com</a>&gt;</em></p>
</body>
</html>
"""

_item_html = u"""
<h2><a href="%s">%s</a></h2>
<p>Published: %s</p>
<p>%s</p>
"""

_title = ""
_link = ""
_desc = ""
_pubdate = ""
_items = []

# This controls whether to add paragraph tags. Most likely you want
# this on; only set it to False if your posts are already valid HTML.
_autop = True

def autop(s):
    # Normalize Windows line endings, turn blank lines into paragraph
    # breaks, and turn any remaining newlines into line breaks.
    s = s.replace("\r\n", "\n")
    s = s.replace("\n\n", "</p><p>")
    s = s.replace("\n", "<br />")
    return s

def main(input_file):
    """Take a WXR XML file and export an HTML file."""
    global _title, _link, _desc, _pubdate, _items, _autop
    print("Reading from %s" % input_file)
    with codecs.open(input_file, 'r') as wxr_file:
        tree = etree.parse(wxr_file)
        _title = tree.xpath('/rss/channel/title')[0].text
        _link = tree.xpath('/rss/channel/link')[0].text
        _desc = tree.xpath('/rss/channel/description')[0].text
        _pubdate = tree.xpath('/rss/channel/pubDate')[0].text
        xml_items = tree.xpath('/rss/channel/item')
        for xml_item in xml_items:
            t = xml_item.xpath('title')[0].text
            l = xml_item.xpath('link')[0].text
            p = xml_item.xpath('pubDate')[0].text
            c = xml_item.xpath('content:encoded', namespaces={'content': 'http://purl.org/rss/1.0/modules/content/'})[0].text
            if _autop:
                c = autop(c)
            _items.append((l, t, p, c))

    # Assumes the input file name ends in ".xml".
    output_file = input_file[:-3] + "html"
    print("Writing to %s" % output_file)
    with codecs.open(output_file, encoding='utf-8', mode='w') as html_file:
        p = (_title, _title, _desc, _link, _link, _pubdate)
        html_file.write(_header_html % p)
        for _item in _items:
            html_file.write(_item_html % _item)
        html_file.write(_footer_html)

if __name__ == "__main__":
    main(sys.argv[1])
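As a quick sanity check, here’s the autop conversion in isolation, run on a made-up post body with Windows-style line endings and a blank line between paragraphs (remember that `_item_html` supplies the outermost `<p>…</p>` pair):

```python
def autop(s):
    # Same idea as the script's autop: normalize Windows line endings,
    # turn blank lines into paragraph breaks, and turn remaining
    # newlines into <br /> tags.
    s = s.replace("\r\n", "\n")
    s = s.replace("\n\n", "</p><p>")
    s = s.replace("\n", "<br />")
    return s

body = "First paragraph.\r\n\r\nSecond paragraph,\r\nwith a soft break."
print(autop(body))
# First paragraph.</p><p>Second paragraph,<br />with a soft break.
```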

To run this, you’ll need Python and the lxml module installed. The script takes one parameter, the WXR file, and exports a single HTML file with all of your posts and pages, including titles, original links, and timestamps. It will not export your comments, tags, categories, etc. If you need that, feel free to tweak the script.
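If you do tweak it, the one non-obvious bit is the namespace handling: `<content:encoded>` lives in the RSS content-module namespace, so queries need an explicit prefix-to-URI mapping. Here’s a self-contained sketch of that same lookup using only the standard library’s `xml.etree` instead of lxml, on a made-up in-memory document:

```python
import xml.etree.ElementTree as ET

# A tiny in-memory WXR fragment (made-up values, real element names).
WXR = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello World</title>
      <content:encoded><![CDATA[The post body.]]></content:encoded>
    </item>
  </channel>
</rss>"""

# The prefix-to-URI mapping is passed along with each query; the
# prefix itself ("content") is arbitrary, the URI is what matters.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

root = ET.fromstring(WXR)
item = root.find("channel/item")
print(item.find("title").text)                # Hello World
print(item.find("content:encoded", NS).text)  # The post body.
```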

Now I can finally delve into my own personal back-catalog. It’s rather exciting to look at my posts from so long ago.

Do you find this script useful? Say so in the comments!

/jon