Shortcuts, Part 3

With two files full of pre-prepared calls to the Disqus API, it’s time to fire up wget.

We test a couple of URLs from our files first, to make sure we’re getting responses we can work with.  But by default, wget saves its output to a file, so we output to the terminal (stdout) instead, using the ‘-O -’ switch (should appear all on one continuous line):

wget -O -\&forum=myforum\&thread=ident:myencodedthreadid

The space and hyphen after ‘-O’ are used to designate stdout; the hyphen is replaced with a filename if the default one based on the URL is not desired.  The ampersands are escaped for running on the command-line; they do not need to be escaped in our file.
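If you’d rather build these test commands in a script than type the backslashes by hand, here is a minimal sketch of the same idea: quote the whole URL so the shell leaves the ampersands alone. The URL below is a made-up placeholder, not the real API call.

```python
import shlex

# Hypothetical URL with a query string; the ampersands are what the shell trips on.
url = "https://example.com/api?forum=myforum&thread=ident:myencodedthreadid"

# shlex.quote wraps the string in single quotes when it contains
# characters that are special to the shell (such as ? and &).
command = "wget -O - " + shlex.quote(url)
print(command)
```

Quoting the URL is an alternative to escaping each ampersand; either way, the URLs stored in the file for ‘-i’ stay unescaped.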

From the wget manual:

Use of ‘-O’ is not intended to mean simply “use the name file instead of the one in the URL;” rather, it is analogous to shell redirection: ‘wget -O file http://foo’ is intended to work like ‘wget -O - http://foo > file’; file will be truncated immediately, and all downloaded content will be written there.

This is useful information for us: it means we don’t need to specify any additional options.  If we use ‘-O’ with a filename, the file is emptied once at the start, and the output from every request generated by the file fed to wget is then written, one response after another, to that same file.

To run wget with a sampling of our URLs, we add the corresponding option:

wget -O tester-output.txt -i tester.txt

But with the output being saved to a file, we no longer need wget to be so chatty on the terminal (wget is verbose by default, and prints progress messages for every request)!  Try this instead, adding the ‘-q’ switch:

wget -qO tester-output.txt -i tester.txt

With all this testing, we’re also concerned about one more thing: rate-limiting! Let’s space our requests out so that we at least get a sense of the rate we’re allowed if we begin to get errors. A wait of 0.1 seconds between requests should allow us to make at most about 1200 calls in two minutes:

wget -qO tester-output.txt -i tester.txt -w 0.1
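As a quick sanity check on that number, here is a small sketch of the arithmetic. The 400 ms response time in the second call is purely a hypothetical figure, there to show that the real count will be lower once each request’s own duration is included.

```python
def max_calls(window_ms: int, wait_ms: int, latency_ms: int = 0) -> int:
    """Upper bound on sequential requests that fit in a time window."""
    return window_ms // (wait_ms + latency_ms)

# Two minutes, 0.1 s wait, ignoring how long each request takes.
print(max_calls(120_000, 100))

# Same window, assuming (hypothetically) 400 ms per response.
print(max_calls(120_000, 100, 400))
```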

Now let’s check the output. In my case, it looks like the JSON objects returned by the details.json API call are all that’s being stored in the file, with no HTTP headers or anything else, which is good.  And the link element is in each one, which is what we’ll be using. But I’m seeing things like \u00a0 (the JSON escape sequence for a non-breaking space) and some other wonky stuff that I think has to do with how non-ASCII characters are escaped.  As long as it’s not in the link values, we should be OK.  Let’s let it run with the full file and see.
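Since wget writes every response into one file, that file holds several JSON documents back to back rather than one valid JSON document. Here is a minimal Python sketch of one way to read them back out. The sample string and the "response" wrapper around each object are assumptions for illustration; adjust the keys to whatever your output actually contains.

```python
import json

def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Made-up sample standing in for two responses saved one after the other.
sample = (
    '{"response": {"link": "http://example.com/a", "title": "A\\u00a0B"}}'
    '{"response": {"link": "http://example.com/b", "title": "C"}}'
)

objects = list(iter_json_objects(sample))
links = [obj["response"]["link"] for obj in objects]
print(links)

# The \u00a0 escape becomes an ordinary non-breaking space once parsed.
print(repr(objects[0]["response"]["title"]))
```

To run this against the real output, read the file into a string first, for example with open("tester-output.txt", encoding="utf-8").read(), and pass that to iter_json_objects.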

Working with the data

We have a good set of data in both files. The file with the “old” disqus_identifiers returned 621 out of 628 hits; the file with the “new” disqus_identifiers returned only 479.  The missing hits returned “BAD REQUEST”, and manual inspection of a sampling of the failures, in a browser, confirms that these are likely identifiers for which a thread does not actually exist.

Using Python and perhaps a basic text editor (my favorite is TextWrangler), we’ll process our files a bit further to glean some more useful information.  For example, I noticed in a quick scan of the files that many threads returned by the API already had both identifiers I’m working with:

    "title":"Post Title",

Given that the link in both files is a match, and more importantly that the thread id is the same, it’s probably fair to presume that there’s no need to merge any threads here.

What do we want to do with the information we have so far?  If our goal is to have an exhaustive list of URLs for threads that need to be merged, we might want to go back to our original data and compare it with what we have.  Here’s what I’m thinking for next steps:

  1. Find a way to compare our original list of old disqus_identifiers with the JSON file that resulted from looking up threads with those IDs, to identify the 7 disqus_identifiers that did not return a thread; that’s few enough that we can probably look each one up manually to verify the error and to determine whether the associated posts have a different thread that we’re ok with.
  2. Having vetted the missing identifiers, I’d like to then find a way to separate out the threads that have two identifiers already, and make sure that the second identifier is the one that is in the file with the “new” identifiers.  If any threads have two identifiers but neither is in the file with the “new” ones, we have a different problem on our hands that I’ll need to devise a solution for.
  3. With a list of threads that only have one identifier (the “old” one) I can then go back to the original CSV supplied by our vendor, (1) look up what the new identifier is, and (2) look that identifier up in the JSON file containing the objects returned for the new identifiers to see what the associated link is.
  4. I suspect what I’ll find is that the link associated with the new/correct identifiers is actually the post preview link for each entry, and that I’ll want to merge in reverse to the post URL associated with the old disqus_identifier.
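The first three steps above could be sketched in Python roughly like this. It assumes each thread has already been parsed into a dict with "id", "link" and "identifiers" keys; those key names, the sample identifiers, and the helper names are all assumptions for illustration rather than anything confirmed by the data so far.

```python
def missing_identifiers(requested, threads):
    """Step 1: identifiers we looked up that appear in no returned thread."""
    seen = set()
    for thread in threads:
        seen.update(thread.get("identifiers", []))
    return [ident for ident in requested if ident not in seen]

def split_by_identifier_count(threads):
    """Steps 2 and 3: separate single-identifier threads from the rest."""
    single = [t for t in threads if len(t.get("identifiers", [])) == 1]
    multiple = [t for t in threads if len(t.get("identifiers", [])) > 1]
    return single, multiple

def unexpected_multiples(multiple, new_ids):
    """Step 2: multi-identifier threads where none is a known 'new' one."""
    known = set(new_ids)
    return [t for t in multiple if not known.intersection(t["identifiers"])]

# Made-up data standing in for the real lists and API results.
old_ids = ["old-1", "old-2", "old-3"]
new_ids = ["new-1", "new-2"]
threads = [
    {"id": "100", "link": "http://example.com/post-1",
     "identifiers": ["old-1", "new-1"]},
    {"id": "200", "link": "http://example.com/post-2",
     "identifiers": ["old-2"]},
]

missing = missing_identifiers(old_ids, threads)
single, multiple = split_by_identifier_count(threads)
problems = unexpected_multiples(multiple, new_ids)

print("never returned a thread:", missing)
print("only one identifier:", [t["id"] for t in single])
print("two identifiers, neither new:", [t["id"] for t in problems])
```

The single-identifier list is what would feed step 3, where each old identifier gets looked up in the vendor’s CSV.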

In reality

Given that the Disqus JS is still loading on our blog preview pages, it’s safe to assume that all this work to winnow down the set of threads to merge could be subsumed in a more straightforward fix: we could generate a list of post ids, convert them to post URLs, and use wget to fetch each one, recording in the process the redirect that occurs and processing that into a merge file for Disqus. That would probably close the issue more effectively.
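For comparison, here is a rough sketch of that same idea in Python instead of wget. Everything specific in it is an assumption: the URL template, the post ids, and the two-column “requested URL, final URL” CSV layout would all need checking against the blog and against whatever format Disqus expects for merges. The demo uses a fake resolver so it runs without touching the network.

```python
import csv
import io
import urllib.request

def resolve_final_url(url):
    """Follow redirects and return the URL we end up at (needs network)."""
    with urllib.request.urlopen(url) as response:
        return response.geturl()

def build_merge_rows(post_ids, url_template, resolve=resolve_final_url):
    """Return (requested_url, final_url) pairs for posts that redirect."""
    rows = []
    for post_id in post_ids:
        start = url_template.format(post_id=post_id)
        final = resolve(start)
        if final != start:  # only redirected posts are interesting
            rows.append((start, final))
    return rows

def to_csv(rows):
    """Serialize the pairs as two-column CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

# Hypothetical template and a fake resolver standing in for real requests.
template = "http://example.com/?p={post_id}"
fake_redirects = {"http://example.com/?p=1": "http://example.com/posts/one"}

def fake_resolve(url):
    return fake_redirects.get(url, url)

rows = build_merge_rows([1, 2], template, resolve=fake_resolve)
print(to_csv(rows), end="")
```

Passing the resolver in as a parameter keeps the redirect-following step swappable, so the row-building logic can be tested offline before it is pointed at the live site.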
