The English-Chinese dictionary, revisited

February 23, 2009, [MD]

My previous post on extracting an English-Chinese dictionary garnered a fair amount of attention, I got reddit'ed (on their frontpage for a short time), solidot'ed, mentioned on the Wall Street Journal blog, and more. Very fun. About 10,000 page views in three days, and a bunch of comments, both here and at Reddit. It was fun to follow the referrers in the log and see how the article spread, first through some big portals, then being picked up by others, as well as personal blogs, twitter, etc.

For the simple dictionary server I wrote - which was only a small part of my article, but I guess the most immediate and visible way of testing it out - I generate a simple log file. I don't want to infringe anyone's privacy, so I am not logging IP or anything, simply the word that is searched for. I just had a look at this file, and here are the stats. First, I did a quick

cat log | wc

to see how many words (ie. searches) had been performed. The result is 2,819. Then I did a

cat log | sort | uniq | wc

this shows me how many different unique words have been searched for (ie if five people searched for China, that would be one entry - doesn't account for misspellings of course). The result was 702 different words. Then I wanted to make an overview over the most frequent search words, but I couldn't immediately think of a way to do this using shell commands. So I wrote a quick Ruby script. (Note, I could of course have imported this into a spreadsheet, but what if it had been a million rows?)

text = File.read(ARGV[0]) counter = {} text.each do |line| counter[line] = counter[line] ? counter[line] + 1 : 1 end puts "

" counter.sort {|a,b| b[1]<=>a[1]}.each do |key| puts "" end puts "
" + key[0].chomp + " " + key[1].to_s + "
"

This gave the following top searches:


searchword 414 sex 376 Toronto 253 open access 174 test.cgi/ 148 天安门 146 北京 135 开放获取 118 telephone 75 托福 41 很好很强大 23 上海 16 favicon.gif 11 Dictionary 11 favicon.ico 11 toronto 8 word 7 中国 7


It's an interesting combination. First, I notice that things that I link to in my article appears very frequently. I began talking about my server in my previous article with the text: I didn’t bother writing a form page for it, but the API is extremely simple: http://reganmian.net/en-zh/*searchword.*Since the searchword is italicized, my intent was for the user to replace that with whatever they wanted to search for, and unlike my subsequent real examples, this URL wasn't even linked, but still people ended up clicking on it (finding no hits, and hopefully not abandoning the dictionary just because of this).

The second hit is sex, unsurprisingly. And apart from being titillating, this is a searchword that really illustrates the strength of this dictionary compared to other more traditional ones. The result is 141 headwords, as diverse as East Sussex, Oral sex in Islamic law, Sex Pistols and Psyochopathia Sexualis. I certainly didn't know how to translate metrosexual or genetic sexual attraction to Chinese, but now I do. (And, incidentally, this blog will probably be blocked in all British Schools, and Saudi Arabia).

Toronto and Open Access where both examples I provided, but test.cgi/ is an interesting one. It has a huge amount of hits, and as far as I know, hasn't been listed anywhere. I am curious if this is an automated attempt at exploiting a vulnerability. Then we get some Chinese ones, like 天安门 (Tiananmen), 北京 (Beijing) and 开放获取 (Open Access).

Further down, we find "很好很强大". It literally means "very good, very strong", but is a Chinese internet meme which can be used to express strong surprise (and often unhappiness) with something. According to this blogger (in Chinese), the expression can be traced back to Jin Ping Mei, a Chinese erotic classic. Although there is a nice Wikipedia entry in Chinese, there is no English corresponding article, and thus no entry in the dictionary.

By the way, I made a corresponding dictinary for traditional characters at http://reganmian.net/en-tw/search-something. Now we'll see if in a while, when I check it's log, "search-something" is the top search word...

Stian

Stian Håklev February 23, 2009 Toronto, Canada
comments powered by Disqus