QAD Normalized Google Distance

I was introduced today to the idea of Normalized Google Distance as a measure of semantic relatedness between keywords. It struck me as a really neat idea, but I couldn’t find an implementation anywhere to play around with. After struggling with a convoluted and far too manual Google Spreadsheet system, I mashed up this code:

#!/usr/bin/env python3
import json
import math
import sys
import urllib.parse
import urllib.request

def gsearch(searchfor):
    # Ask the Google AJAX web search API about a phrase and return its responseData.
    query = urllib.parse.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.request.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    return data

args = sys.argv[1:]
# m: assumed total number of pages in Google's index (the N in the NGD formula)
m = 45000000000
if len(args) != 2:
    print("need two words as arguments")
    sys.exit(1)
# Estimated hit counts for each word alone and for both words together.
n0 = int(gsearch(args[0])['cursor']['estimatedResultCount'])
n1 = int(gsearch(args[1])['cursor']['estimatedResultCount'])
n2 = int(gsearch(args[0] + " " + args[1])['cursor']['estimatedResultCount'])
# NGD = (max(log n0, log n1) - log n2) / (log m - min(log n0, log n1))
l1 = max(math.log10(n0), math.log10(n1)) - math.log10(n2)
l2 = math.log10(m) - min(math.log10(n0), math.log10(n1))
distance = l1 / l2
print(distance)
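The arithmetic can also be tried without touching Google at all. The following is a minimal sketch, not part of the script above: the function name `ngd`, its parameter names, and the choice to return infinity when any count is zero are all assumptions made for illustration. The counts in the example call are made up.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts (illustrative helper).

    fx, fy -- hit counts for each term on its own
    fxy    -- hit count for both terms together
    n      -- assumed total number of indexed pages
    """
    if min(fx, fy, fxy) <= 0:
        # log10(0) is undefined. Treating such terms as infinitely far
        # apart is an assumption of this sketch, not part of the script.
        return float("inf")
    lx, ly, lxy = math.log10(fx), math.log10(fy), math.log10(fxy)
    return (max(lx, ly) - lxy) / (math.log10(n) - min(lx, ly))

# Made-up counts, purely for illustration.
print(ngd(1000, 100, 10, 10**6))
```

With these invented numbers the numerator is 3 - 1 and the denominator is 6 - 2, so the distance comes out at roughly 0.5. Two terms that always appear together (all three counts equal) give a distance of 0.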

So if you want to play, there it is. Just beware that each run makes three requests, and Google blocks you for a while if you make too many requests too quickly.
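One way to stay clear of the blocking is to space the requests out. The sketch below is only an illustration under assumptions: the `Throttle` class and `throttled` wrapper are hypothetical helpers, and the two-second gap in the usage comment is a guess, since the actual limits Google applies are not known here.

```python
import time

class Throttle:
    """Keep successive calls at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None      # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                # Too soon after the previous call: pause for the rest of the gap.
                self.sleep(remaining)
                now = self.clock()
        self.last = now

def throttled(func, throttle):
    """Wrap func so that every call first waits on the throttle."""
    def wrapper(*args, **kwargs):
        throttle.wait()
        return func(*args, **kwargs)
    return wrapper

# Usage sketch (the 2.0 second gap is an arbitrary guess):
#   slow_search = throttled(gsearch, Throttle(2.0))
#   n0 = int(slow_search(args[0])['cursor']['estimatedResultCount'])
```

Sharing a single `Throttle` across all three searches means a full run takes at least two gaps longer, which is a small price if it avoids being blocked.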

Responses

  • Hmm, I’m not certain about NGD as a similarity measure, as it relies on the Google algorithm, which is opaque and relies more on page content co-appearance than on user behaviour.

    I would suggest that you read ‘Implicit association via crowd-sourced coselection’ (Ashman, H. et al., 2011) for an overview of some of the behavioural linking aspects of this semantic linking.

    Also, be aware that Schmakeit, J.F. found a significant difference between the Google API results and the web results, which may cast doubt on the accuracy of the given method. That thesis hasn’t been published yet (actually it’s been suppressed a bit because it’s potentially quite damaging), but I’d expect to see it appear mid next year, possibly in JASIT.

