Shortcuts, Part 2

First things first: create a step-by-step to-do list of what I’ll be doing to merge these Disqus threads:

  1. I need to look up the JSON objects for 1200+ threads – by disqus_identifier – to determine what URL is associated with each one.  Ideally, I’ll end up with two files: one that has all the JSON objects for the 600+ old disqus_identifiers in it, and one that has all the counterpart objects for the 600+ new disqus_identifiers in it.
  2. Clean the data: it’d be great not to merge any threads that don’t actually have any comments (“posts,” in Disqus terminology) in them.
  3. Reduce the data: the end result of the process should be one CSV file, created by concatenating the corresponding URLs from the old and new files, on each line, with a comma in between.
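
For concreteness, each line of that finished mapping file should pair an old thread URL with its new counterpart, something like this (the URLs here are made up):

http://www.example.com/old-blog/some-post/, http://www.example.com/new-blog/some-post/
http://www.example.com/old-blog/another-post/, http://www.example.com/new-blog/another-post/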

Looking up information about a thread

Disqus provides an API that can return information about a thread:

http://disqus.com/api/docs/threads/details/

Using this API for a single thread is relatively easy – just plug in the necessary values:

http://disqus.com/api/3.0/threads/details.json?api_key=YOUR_API_KEY&forum=yourforumname&thread=ident:URL_ESCAPED_IDENTIFIER
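
For instance, with placeholder values for the key and forum, and a hypothetical identifier of 2011/05/some-old-post, the finished call looks like this (note that the slashes in the identifier have to be percent-encoded):

http://disqus.com/api/3.0/threads/details.json?api_key=MY_API_KEY&forum=myforum&thread=ident:2011%2F05%2Fsome-old-post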

Looking up information about MANY threads

Retrieving this data for hundreds of threads, however, is going to require some automation.  Command-line tools such as curl and wget are tried-and-true in situations like this; they can be used alone or in concert with shell scripts, but there are also alternatives.  Since the “best approach” always depends on context and need, and I’m putting something basic together here rather than writing a thesis on the possible methods, I’m going to stick with wget.

To use wget to do what we want, we need some notion of what it can accept as input and what it can produce as output.  We want to feed it a list of URLs and have it write all of the JSON responses to a single file, which we can parse later.
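
As a rough sketch of that invocation (the filenames here are placeholders): wget can read its URLs from a file with -i and concatenate everything it downloads into one output file with -O.

wget -i url-list.txt -O results.json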

Prep work: creating our two input files

Just as there are various ways to automate the retrieval of URLs, there are also many ways to parse and manipulate files.  For novice users with little time to learn command-line tools for parsing purposes, a few minutes with a spreadsheet tool like MS Excel will satisfy most needs.

I’m starting with a file that has two identifiers on every line – separated by a comma – so first I’ll try using MS Excel to split the two columns into worksheets we can save as separate files.

It’s clear upon opening the file that I can also pull out the entries where both old and new disqus_identifiers on a single line are URLs. The Disqus migration tool can process those directly.

With those deleted, it’s time to use Excel formulae to populate the third and fourth columns of our spreadsheet with the Disqus API calls we’re going to make using wget.  It should be relatively straightforward: the only variables in the formulae are the first two columns, which supply the disqus_identifier values we want to look up.  Something like this:

=CONCATENATE("http://disqus.com/api/3.0/threads/details.json?api_key=MY_API_KEY&forum=myforum&thread=ident:",A1)
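
With a hypothetical identifier like 2011/05/some-old-post sitting in A1, that formula produces the following, with the slashes in the identifier coming through untouched:

http://disqus.com/api/3.0/threads/details.json?api_key=MY_API_KEY&forum=myforum&thread=ident:2011/05/some-old-post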

Before copying this out to all the other cells below it, though, I realize it has to be scrapped: our disqus_identifiers all contain characters that must be URL-encoded, and our version of Excel does not have the ENCODEURL function (if yours does, just use ENCODEURL(A1) in the formula above instead of A1).

Where does that leave me?

Other command-line tools based on regular expressions, such as sed or awk, won’t URL-encode characters on their own; I’d have to spell out an explicit substitution for every character that needs escaping.  And while I could go back to Excel and write a VBA macro that does the same character-by-character replacement, as others have done, that’s no less tedious.  So I save what I’ve done so far (each column into a separate file) and turn to Python.
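
(As an aside, the column split itself doesn’t need Excel either.  A rough Python sketch, assuming the combined file is named pairs.txt and every line holds exactly one comma, could look like this.)

# Sketch only: split a two-column, comma-separated file into left.txt and right.txt.
src = open('pairs.txt', 'r')
left = open('left.txt', 'w')
right = open('right.txt', 'w')
for line in src:
  old, new = line.rstrip('\n').split(',', 1)
  left.write(old.strip() + '\n')
  right.write(new.strip() + '\n')
src.close()
left.close()
right.close()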

With the old identifiers in a file named left.txt and the new ones in right.txt, I create a Python script, run from the same directory, that takes a file and uses each of its lines to fill in the template for one line of a new file that wget will consume. Why didn’t I use Python to also make the calls to the server and process the output as it came back?  To keep the risk and complexity down.  If there’s an error parsing one of the responses somewhere down the line and I have to start over, I’d rather not repeat all the calls to the server that have already executed – so I’ll be prepping my URLs, calling wget exactly once on each of my two input files (hopefully), and then attempting to parse the results into a Disqus-ready migration file… which I suspect will take at least a couple of runs of trial-and-error.

My Python script:

#!/usr/bin/python
import optparse
import urllib

def parselogfile(inputlogfilename):
  # Read one disqus_identifier per line and write one API-ready URL per line
  # to a new file named 'api-ready.<inputfilename>'.
  src = open(inputlogfilename, 'r')
  dest = open('api-ready.' + inputlogfilename, 'w')

  template = 'https://disqus.com/api/3.0/threads/details.json?api_key=MY_API_KEY&forum=myforum&thread=ident:'

  for line in src:
    dest.write(template)
    # URL-encode the identifier; safe='' ensures '/' gets encoded as well.
    dest.write(urllib.quote(line.rstrip('\n'), ''))
    dest.write('\n')

  src.close()
  dest.close()

def main():
  p = optparse.OptionParser(description='Prepares a list of disqus identifiers for calls to the API',
                                   prog='pyprepare',
                                version='pyprepare 1.0',
                                  usage='%prog filename')
  options, arguments = p.parse_args()
  if len(arguments) == 1:
    parselogfile(inputlogfilename=arguments[0])
  else:
    p.print_help()

if __name__ == '__main__':
  main()

Two things to note:

  1. urllib.quote() is what’s doing our URL encoding, and its second parameter (known as safe) is set to '' (an empty string) instead of the default '/'; otherwise '/' won’t get encoded (there’s a quick demonstration below);
  2. line.rstrip('\n') removes the '\n' from the end of each line if there is one (not the same as simply chopping off the last character, because the line might not end with '\n' if it’s the last line of the file).
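
To see the difference that safe parameter makes, here’s a quick interactive Python session using the same hypothetical identifier as earlier:

>>> import urllib
>>> urllib.quote('2011/05/some-old-post')        # default safe='/' leaves the slashes alone
'2011/05/some-old-post'
>>> urllib.quote('2011/05/some-old-post', '')    # an empty safe string encodes them too
'2011%2F05%2Fsome-old-post'

Putting the whole step together (assuming the script above is saved as pyprepare.py, and with JSON output filenames of my own choosing), the plan is to run it once per input file and then hand each generated file to wget:

python pyprepare.py left.txt
python pyprepare.py right.txt
wget -i api-ready.left.txt -O old-threads.json
wget -i api-ready.right.txt -O new-threads.json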
