The English-Chinese dictionary, revisited

My previous post on extracting an English-Chinese dictionary garnered a fair amount of attention. I got Reddit’ed (on their front page for a short time), Solidot’ed, mentioned on the Wall Street Journal blog, and more. Very fun: about 10,000 page views in three days, and a bunch of comments, both here and at Reddit. It was fun to follow the referrers in the log and see how the article spread, first through some big portals, then picked up by others, as well as by personal blogs, Twitter, etc.

For the simple dictionary server I wrote (only a small part of my article, but I guess the most immediate and visible way of testing it out), I generate a simple log file. I don’t want to infringe on anyone’s privacy, so I am not logging IP addresses or anything else, simply the word that is searched for. I just had a look at this file, and here are the stats. First, I did a quick

cat log | wc -l

to see how many lines (i.e. searches) the log contained, since each search is logged on its own line. The result was 2,819. Then I did a

cat log | sort | uniq | wc -l

This shows how many distinct words have been searched for (i.e. if five people searched for China, that counts as one entry; it doesn’t account for misspellings, of course). The result was 702 different words. Then I wanted an overview of the most frequent search words, but I couldn’t immediately think of a way to do this using shell commands. So I wrote a quick Ruby script. (Note: I could of course have imported this into a spreadsheet, but what if it had been a million rows?)


# Count how often each search word appears in the log file given as ARGV[0].
counter = Hash.new(0)
File.foreach(ARGV[0]) do |line|
  counter[line.chomp] += 1
end
# Print an HTML table, most frequent words first.
puts "<table>"
counter.sort_by { |_word, count| -count }.each do |word, count|
  puts "<tr><td>#{word}</td><td>#{count}</td></tr>"
end
puts "</table>"

This gave the following top searches:

searchword 414
sex 376
Toronto 253
open access 174
test.cgi/ 148
天安门 146
北京 135
开放获取 118
telephone 75
托福 41
很好很强大 23
上海 16
favicon.gif 11
Dictionary 11
favicon.ico 11
toronto 8
word 7
中国 7

It’s an interesting combination. First, I notice that things I link to in my article appear very frequently. I began talking about my server in my previous article with the text: I didn’t bother writing a form page for it, but the API is extremely simple: http://reganmian.net/en-zh/searchword. Since the searchword is italicized, my intent was for the user to replace it with whatever they wanted to search for. Unlike my subsequent real examples, this URL wasn’t even linked, but people still ended up requesting it (finding no hits, and hopefully not abandoning the dictionary just because of this).

The second hit is sex, unsurprisingly. Apart from being titillating, this is a search word that really illustrates the strength of this dictionary compared to more traditional ones. The result is 141 headwords, as diverse as East Sussex, Oral sex in Islamic law, Sex Pistols and Psychopathia Sexualis. I certainly didn’t know how to translate metrosexual or genetic sexual attraction into Chinese, but now I do. (And, incidentally, this blog will probably be blocked in all British schools and in Saudi Arabia.)

Toronto and Open Access were both examples I provided, but test.cgi/ is an interesting one. It has a huge number of hits and, as far as I know, hasn’t been listed anywhere. I am curious whether this is an automated attempt at exploiting a vulnerability. Then we get some Chinese ones, like 天安门 (Tiananmen), 北京 (Beijing) and 开放获取 (Open Access).

Further down, we find “很好很强大”. It literally means “very good, very strong”, but it is a Chinese internet meme that can be used to express strong surprise (and often unhappiness) with something. According to this blogger (in Chinese), the expression can be traced back to Jin Ping Mei, a Chinese erotic classic. Although there is a nice Wikipedia entry in Chinese, there is no corresponding English article, and thus no entry in the dictionary.
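A side note on the counts: the tally is case-sensitive, so Toronto (253) and toronto (8) end up as separate rows, and browser requests such as favicon.ico are counted as if they were searches. Here is a minimal sketch of a normalized count, assuming the same one-word-per-line log format. The names normalized_counts and IGNORED are made up for illustration and are not part of the script I actually ran.

```ruby
# Sketch: tally search words case-insensitively and skip non-search requests.
# IGNORED and normalized_counts are hypothetical names chosen for this example.
IGNORED = ["favicon.ico", "favicon.gif"].freeze

def normalized_counts(lines)
  counts = Hash.new(0)               # missing keys start at 0
  lines.each do |line|
    word = line.strip.downcase       # "Toronto\n" and "toronto\n" both become "toronto"
    next if word.empty? || IGNORED.include?(word)
    counts[word] += 1
  end
  # Highest count first; ties are broken alphabetically for a stable order.
  counts.sort_by { |word, count| [-count, word] }
end

# When run as a script with a log file argument, print one "word<TAB>count" per line.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  normalized_counts(File.foreach(ARGV[0])).each do |word, count|
    puts "#{word}\t#{count}"
  end
end
```

Whether folding case is desirable depends on the use: it merges Toronto and toronto, but it would also merge words where capitalization carries meaning.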

By the way, I made a corresponding dictionary for traditional characters at http://reganmian.net/en-tw/search-something. Now we’ll see if, when I check its log in a while, “search-something” is the top search word…

Stian


4 Responses to “The English-Chinese dictionary, revisited”

  1. Nick
    February 23rd, 2009 @ 2:41 pm

    Looks like test.cgi/ is what it searches for when you enter an empty search string.

    Great dictionary! I just looked up “Python” and was not disappointed.

  2. Stian Håklev
    February 23rd, 2009 @ 3:12 pm

    Nick: Good catch. You are right, I just tried to search for an empty word. Haha, I actually hadn’t tried that before. Guess it’s my Apache URL rewrite thing.

    Interesting that so many people search for an empty string. I guess I should include a message about the syntax etc for people who don’t get any hits. Then again, my purpose was never to develop a killer web English-Chinese dictionary – I know the design is minimalistic at best, and I didn’t even bother to add a search field :) – rather to share some hacks with people, put all the code up and hope that others continue experimenting. The server was just a quick demonstration.

  3. Language representation among DOAJ Open Access journals | Random Stuff that Matters
    April 5th, 2009 @ 10:17 pm

    [...] in OpenOffice.org is not good enough. I whipped up a quick Ruby script, reusing a few lines from my previous script to count the most frequent search-words used with my online Chinese-English dictionary, and got the [...]

  4. Sepra
    October 16th, 2009 @ 6:36 am

    Great post. Excellent article
