Release early, release often: English-Chinese dictionary based on Wikipedia

Background
Although there are some great Chinese dictionaries out there, I often encounter cases where they are not enough. I might be looking for a specific concept, like “open access” (in scholarly publishing), and want to know how that is written in Chinese, so that I can google for articles about it in Chinese. Or I might want to know how Heidegger, Copenhagen or Grey’s Anatomy is written in Chinese – a dictionary is unlikely to have any of these (it might have the first two, if it is very complete, but certainly not the last).

Wikipedia is a great, perhaps unintentional, source of these words, because of interwiki links. Each article (usually) has a number of links to articles on the same topic in another language. In the wiki markup, these look like [[en:Oslo]] (to make a link to the English article about Oslo), and they are listed on the bottom left side of the screen. When, a few days ago, I needed to know about open access in Chinese, I went to the English page on open access, and clicked on the Chinese link, to the page 开放获取.
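To make the idea concrete, here is a minimal Ruby sketch that pulls the interwiki links out of a piece of wiki markup. The regular expression is my own assumption about what a language prefix looks like (two or three lowercase letters, optionally followed by a variant such as -yue); it is not MediaWiki's own parser, and the sample markup is made up for illustration.

```ruby
# Sketch: collect interwiki links ([[en:Oslo]] style) from wiki markup.
# INTERWIKI is an assumed pattern, not MediaWiki's real link grammar.
INTERWIKI = /\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]\|]+)\]\]/

# Returns a Hash mapping language code to article title.
def interwiki_links(markup)
  links = {}
  markup.scan(INTERWIKI) { |lang, title| links[lang] = title.strip }
  links
end

# Illustrative sample, not copied from a real article.
sample = "Oslo is the capital of Norway.\n[[en:Oslo]]\n[[zh:奥斯陆]]\n[[zh-yue:奧斯陸]]"
p interwiki_links(sample)
```

Ordinary internal links such as [[Oslo]] have no language prefix and are ignored by this pattern.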

Data extraction
One of the neat things about Wikipedia is that you can download the whole database, and do fun things with the contents. In fact, I tried doing this all the way back in 2007, which gave good results, but I never did anything further (and didn’t write it up, since I was in Indonesia, and not blogging, at the time). Recently, a friend of mine has been playing around with dictd servers and files, and this inspired me to take up what I had left.

Luckily, I had pasted the little code snippet I made back then into an email to my friend, and I could find it easily. It worked without any modifications. All I had to do was download the latest Chinese database, an XML file containing all the articles (the file zhwiki-20090116-pages-articles.xml.bz2 in this directory; you can find all the different databases at Wikipedia Downloads). I then ran the Ruby script below, to extract all the titles of articles, and their interwiki links, to a separate, tab-separated text file:


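# Extract Chinese article titles and their English interwiki links
# from the Chinese Wikipedia XML dump into a tab-separated file.
zh = File.open("zhwiki-20090116-pages-articles.xml")
zhtitle = File.open("english-chinese.txt","w")
title, entitle, hitcounter, counter = '','',0,0

zh.each_line do |line|
  counter += 1   # note: this counts lines read, not articles
  # Remember the most recent <title>; the interwiki link comes later in the page.
  if line.match(/<title>(.*?)<\/title>/)
    title = Regexp::last_match[1]
  end
  if line.match(/\[\[en:(.*?)\]\]/)
    entitle = Regexp::last_match[1]
    # Skip meta pages and titles that start with a Latin letter.
    unless title.match(/^Wikipedia|^User|^Help|^[A-Za-z]/i)
      zhtitle << title << "\t" << entitle << "\n"
      hitcounter += 1
      if hitcounter % 100 == 0
        puts "In #{counter} articles, found #{hitcounter} hits: #{hitcounter.to_f/counter.to_f*100}%."
      end
    end
  end
end
zhtitle.close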

This generates an output file that looks like this

设计模式 Design pattern
中华人民共和国 People’s Republic of China
克利斯登·奈加特 Kristen Nygaard
黑客 Hacker (computing)
林纳斯·托瓦兹 Linus Torvalds
理查德·斯托曼 Richard Stallman
自由软件基金会 Free Software Foundation
2003年7月 July 2003
操作系统 Operating system

It’s interesting to compare the statistics for this run with those for the run I did in March 2007. In 2007, the XML dump was 520MB, with 59,600 hits (articles with interwiki links to English). In 2009, the XML dump is 1.1GB, with around 123,300 hits, i.e. roughly double both the file size and the number of hits in two years. As you can see from the small selection above, not all of these are useful (dates, for example), but many are, and would not be found in ordinary dictionaries.
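If the date entries get in the way, they could be filtered out afterwards. The following is only a sketch: the DATE_TITLE pattern is my guess at what date titles look like (digits followed by 年, 月 or 日), and useful_entry? is a hypothetical helper name, not part of the script above.

```ruby
# Assumed pattern for date-like titles such as "2003年7月", "1987年" or "7月4日".
DATE_TITLE = /\A(\d+年)?(\d+月)?(\d+日)?\z/

# True for tab-separated lines whose Chinese title does not look like a date.
def useful_entry?(line)
  title, _english = line.chomp.split("\t")
  return false if title.nil? || title.empty?
  !title.match?(DATE_TITLE)
end

lines = ["设计模式\tDesign pattern\n", "2003年7月\tJuly 2003\n", "黑客\tHacker (computing)\n"]
kept = lines.select { |l| useful_entry?(l) }
puts kept
```

Other kinds of noise (lists, for example) would need their own rules.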

Transforming to simplified and traditional characters
The Chinese Wikipedia contains articles written in both simplified and traditional characters, and has a built-in facility to convert this on the fly, so that a user can read everything in simplified or traditional according to the settings. Converting from simplified to traditional and back is not trivial, because there are a number of traditional characters that all convert to the same simplified character, etc. Chinese Wikipedia has come up with a great conversion database which deals with this, and once again, we are able to download it and use it for our own purposes.

Back in 2007, a friend of mine downloaded this database, and converted it to a sed script. Sed is an extremely fast command line regexp search-and-replace program for *NIX (also built into OSX). The file looks like this:

s/幾畫/几画/g
s/賣畫/卖画/g
s/滷鹼/卤碱/g
s/原畫/原画/g

each line is an instruction to do global search and replace on for example 幾畫 (traditional) with 几画 (simplified). This is from the traditional->simplified file, there is also a simplified->traditional file. Note that sometimes the conversion isn’t simply between characters, but also between words, when different words are used in mainland China and Taiwan. (Download cn->tw and tw->cn sed files).

So instead of having a file that mixed simplified and traditional characters, we can easily generate one with simplified and one with traditional, using sed:

sed -f cntw english-chinese.txt > english-chinese.tw.txt
sed -f twcn english-chinese.txt > english-chinese.cn.txt
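If sed is not available, the same rule files could be applied from Ruby. This is a rough sketch under two assumptions: every rule has the exact form s/FROM/TO/g, and FROM is treated as a literal string rather than a regexp. The helper names load_rules and convert are made up for this example.

```ruby
# Parse sed rules of the form s/FROM/TO/g into [from, to] pairs.
# Lines that do not match that exact form are ignored.
def load_rules(script_text)
  script_text.each_line.map do |rule|
    m = rule.strip.match(%r{\As/(.+?)/(.*?)/g\z})
    m && [m[1], m[2]]
  end.compact
end

# Apply the rules one after another, in file order, as sed would.
def convert(text, rules)
  rules.inject(text) { |out, (from, to)| out.gsub(from, to) }
end

rules = load_rules("s/幾畫/几画/g\ns/賣畫/卖画/g\n")
puts convert("賣畫 / 幾畫", rules)
```

With tens of thousands of rules this will be much slower than sed, so it is only practical for small files.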

Instead of doing all this yourself, you can directly download the entire textfile in simplified, or traditional characters (or both, zipped).

Using the file
With this simple file, you can already do a lot. The simplest is to use grep, a fast command line tool that searches lines in a text file. To quickly search for open access, I would use

grep -i "open access" english-chinese.cn.txt

and get the following result:

开放获取    Open access

the -i means that grep ignores case differences. Note that on my Mac, I cannot see Chinese characters in the terminal window (it might be possible to fix with some settings). An alternative would be to do

grep -i "open access" english-chinese.cn.txt > out.tmp

and then open out.tmp in a text editor that can read UTF8 (unicode). Note that in many text editors you have to specifically ask to open the file as UTF8.

This is cumbersome, but you can of course make different kinds of interfaces to it.
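For instance, the same search can be done in a few lines of Ruby, which sidesteps the terminal encoding problem if you write the result to a file. This is a sketch: search_lines is a hypothetical helper, and it does a plain case-insensitive substring match (similar to grep -i -F), not a regexp search.

```ruby
# Return the lines that contain the term, ignoring case.
def search_lines(lines, term)
  needle = term.downcase
  lines.select { |line| line.downcase.include?(needle) }
end

# With the real file, the lines could come from something like
#   File.foreach("english-chinese.cn.txt", encoding: "UTF-8")
sample = ["开放获取\tOpen access\n", "黑客\tHacker (computing)\n"]
puts search_lines(sample, "open access")
```

Because the match is a plain substring test, Chinese search terms work the same way as English ones.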

Web interface
One simple interface I made was a web interface. Initially I simply ran the grep command through a Ruby wrapper, but I realized that passing arbitrary user input to the command line would let people run their own commands on my server, so I changed to a very simple search in Ruby. Note that this is not indexed, and is extremely inefficient: every request scans the whole file. Putting the data into a database, or using something like Ferret, would make it much faster. But it works. Source:

#!/usr/bin/env ruby
require 'rubygems'
require 'cgi'
require "fcgi"

# Load the whole dictionary once; every request scans it line by line.
dict = File.read("zhcn-en.txt")

FCGI.each_cgi do |cgi|
  text = cgi['bigger']
  search = text.gsub(/\.html/,'')
  safe = CGI.escapeHTML(search)   # never echo raw user input into the page

  puts "Content-Type: text/html; charset=utf-8"
  puts   # a blank line ends the HTTP headers
  puts "<html><head><title>#{safe} | English-Chinese dictionary</title>"
  puts '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >'
  puts "</head><body>"
  puts "<h1>Search result for #{safe}</h1><i>This is a simple search of a database extracted from the interwiki links of Chinese Wikipedia. shaklev@gmail.com</i><p>"
  puts "<table>"
  dict.each_line do |line|
    # Plain substring match, so the input is not interpreted as a regexp.
    if line.downcase.include?(search.downcase)
      zh, en = line.chomp.split("\t")
      puts "<tr><td>#{zh}</td><td>#{en}</td></tr>"
    end
  end
  puts "</table></body></html>"
end

I didn’t bother writing a form page for it, but the API is extremely simple: http://reganmian.net/en-zh/searchword. Here are some examples:

http://reganmian.net/en-zh/Toronto
http://reganmian.net/en-zh/open access
http://reganmian.net/en-zh/sex

One advantage of the simple search is that it accepts both English and Chinese input, see for example:

http://reganmian.net/en-zh/开放获取
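As for the indexing mentioned earlier, the simplest possible "index" is a Hash built once at startup. The sketch below is not what the site runs: it only finds exact headword matches (not substrings), and build_index and lookup are names I made up for the example. It does keep the property that both English and Chinese input work.

```ruby
# Build a Hash from each lowercased headword (Chinese or English)
# to the dictionary lines that contain it as a whole field.
def build_index(lines)
  index = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    zh, en = line.chomp.split("\t")
    next if zh.nil? || en.nil?   # skip malformed lines
    index[zh.downcase] << line.chomp
    index[en.downcase] << line.chomp
  end
  index
end

# Exact, case-insensitive lookup; returns an empty Array for unknown words.
def lookup(index, word)
  index.fetch(word.downcase, [])
end

index = build_index(["开放获取\tOpen access\n", "黑客\tHacker (computing)\n"])
puts lookup(index, "Open Access")
puts lookup(index, "开放获取")
```

A lookup is then a single hash access instead of a scan over more than 100,000 lines, at the cost of losing substring matches.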

Redirects and disambiguation
When I initially entered “open access” in English Wikipedia, I arrived at a disambiguation page giving me links to different meanings of the term, one of which, Open access (publishing), was the one I wanted. It is also often the case that abbreviations, people’s last names, etc. are redirected to the full article name. I figured it would be useful to have an index of all these disambiguations and redirects, so that I could incorporate that in the database. If, for example, NATO was a redirect to North Atlantic Treaty Organization, I could have both of those function as headwords for the same Chinese term in the dictionary.

And if you looked up open access, I could have (publishing): Chinese term, (infrastructure): different Chinese term, etc. The problem is that I would have to sort through the English database to do this, and the English dump is 7.8GB packed (probably something like 150GB unpacked – pure text). There is also a dump of redirects, but that is just an SQL dump containing the ID of each article and the title of the redirect, so I would first have to import the SQL dump of page titles into a local SQL database. I tried, but it took forever, and I gave up. This is not impossible, but it will take more time and more programming.
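Once a redirect table is available, merging it in should be the easy part. Here is a sketch of that last step only, assuming the redirects have already been loaded into a Hash (redirect title to target title); add_redirect_headwords is a hypothetical helper and the sample data is illustrative.

```ruby
# dictionary: English article title => Chinese title
# redirects:  redirect title => target article title (assumed already extracted)
# Returns a new Hash in which each redirect also points at the Chinese title.
def add_redirect_headwords(dictionary, redirects)
  expanded = dictionary.dup
  redirects.each do |from, target|
    chinese = dictionary[target]
    expanded[from] ||= chinese if chinese   # never overwrite a real article
  end
  expanded
end

dictionary = { "North Atlantic Treaty Organization" => "北大西洋公约组织" }
redirects  = { "NATO" => "North Atlantic Treaty Organization", "Foo" => "No such article" }
expanded = add_redirect_headwords(dictionary, redirects)
puts expanded["NATO"]
```

Redirects whose target has no Chinese article are simply dropped.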

Other dictionary formats
Having a simple text file is great, you can grep it, and even build simple interfaces, like the web interface I mentioned above. But it would be great if we could also put this database into different dictionaries and lookup programs that already exist.

First I thought about Wenlin. Although it is proprietary, and has not been significantly updated for many years, it is still a very powerful program, which I use frequently when reading texts. I even made a screencast to showcase why I found it so useful. I wondered if it would be possible to import this dictionary into Wenlin. Turns out there is a way to import entries – you need to open a specially formatted textfile in Wenlin, and then choose “import”. I was lucky enough to find a very interesting German project to create a German database for Wenlin, and they had a text file that I could use as a model.

The format looks like this:

cidian.db
New or changed entries:

*** 1 ***
pinyin                  	zàijūliú
characters               	再拘留
serial-number            	1016904350
reference                	vwu3184a1
part-of-speech           	v.
environment              	law
definition               	rearrest

With some experimentation, I found that the serial-number and reference had to be there, but could be empty. part-of-speech and environment were not necessary at all. However, I needed two things. First of all, since simplified and traditional characters are not easy to automatically convert, the program requires that you specify both simplified and traditional characters. This was solved by using the scripts above to generate one file for simplified and one for traditional (the same content was on the same lines in each file, so it would be easy to combine them).

In addition, Wenlin requires you to provide the pinyin for each word. This is because some characters have multiple readings, so that it is not easy to automatically generate (correct) pinyin for characters. I didn’t need this, but Wenlin required it, and it even checked to see that each pinyin was a possible reading for the given Chinese character. So I needed somehow to get all the words rendered in pinyin.

There are many services, and programs, that convert from Chinese text to pinyin. However, I couldn’t find any good command line tools. Command line tools are very good when you are dealing with text files of many megabytes! I tried pasting the text into textfields in Firefox, and in stand-alone applications, and they all choked. Surprisingly, Wenlin itself was able to open the large (around 4MB) file, and it actually has a built-in conversion to pinyin. However, given that this cannot happen automatically, it tags each character with multiple readings, and asks you to select the correct one. I wasn’t too preoccupied by having correct readings, just having possible readings that would be accepted by Wenlin would be enough, so I saved the result of the conversion (which took a while). Some lines looked like this:

lín nà sī·tuō 【◎Fix:◎wǎ;◎wà;◎wā】 【◎Fix:◎zī;◎cí】
Linus Torvalds
lǐ 【◎Fix:◎chá;◎zhā】 dé·sī tuō màn    Richard Stallman

And I had to use the search-and-replace with regexp function in TextMate to remove these options, leaving only the first one (as I mentioned, my goal was not to choose the correct reading, but a possible one).
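The same clean-up can be written as a Ruby one-liner instead of being done in TextMate. This is a sketch based only on the two sample lines above; FIX_BLOCK assumes every ambiguous reading is wrapped exactly as 【◎Fix:◎a;◎b】, and first_reading_only is a made-up helper name.

```ruby
# Matches one 【◎Fix:◎a;◎b;◎c】 block and captures the first reading (a).
FIX_BLOCK = /【◎Fix:◎([^;；】]*)[^】]*】/

# Replace every Fix block with its first listed reading.
def first_reading_only(line)
  line.gsub(FIX_BLOCK, '\1')
end

puts first_reading_only("lǐ 【◎Fix:◎chá;◎zhā】 dé·sī tuō màn")
```

As in the manual approach, this picks a possible reading, not necessarily the correct one.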

Combining all three files, I generated the file in Wenlin’s required format. However, because of all the space required per word, the file became quite large, and Wenlin was unable to cope with it (in fact, even trying to import 100 words automatically failed). I wish there was a command line tool that enabled me to import large amounts of words into Wenlin, but until then, I might have to give this avenue up.
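For reference, generating one entry in the layout shown earlier could look roughly like this. It is a sketch, not the script I actually ran: wenlin_entry is a hypothetical helper, the padding width is a guess, and since the sample above does not show how simplified and traditional forms are written together, the characters argument is passed in ready-made.

```ruby
# Build one Wenlin-style entry. serial-number and reference are present
# but left empty; part-of-speech and environment are omitted.
def wenlin_entry(number, pinyin, characters, definition)
  fields = [
    ["pinyin", pinyin],
    ["characters", characters],
    ["serial-number", ""],
    ["reference", ""],
    ["definition", definition]
  ]
  body = fields.map { |name, value| "#{name.ljust(25)}\t#{value}" }
  (["*** #{number} ***"] + body).join("\n")
end

puts wenlin_entry(1, "zàijūliú", "再拘留", "rearrest")
```

Each entry takes six lines, which is part of why the combined file grew so large.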

Apple’s Dictionary.app
Initially, I thought that Dictionary.app, which is preinstalled on all Macs, used dictd files, but it turns out it uses an Apple-specific format. Luckily, this is well documented, and tools for generating these files are included in the developer package. All you have to do is generate an XML file, which looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<d:dictionary xmlns="http://www.w3.org/1999/xhtml"
xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
  <d:entry id="mathematics1">
    <d:index d:value="mathematics" d:title="mathematics"/>
    <h1>mathematics</h1>
    <p>数学 (數學)</p>
  </d:entry>
  <d:entry id="philosophy2">
    <d:index d:value="philosophy" d:title="philosophy"/>
    <h1>philosophy</h1>
    <p>哲学 (哲學)</p>
  </d:entry>
</d:dictionary>

Here is the script I wrote to generate this file:

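# Combine the line-aligned simplified and traditional files into the XML
# source format used by Apple's Dictionary Development Kit.
pinyin = File.open('english-pinyin.txt')
cn = File.open('english-chinese.cn.txt')
tw = File.open('english-chinese.tw.txt')
result = File.open('MyDictionary.xml','w')
result << '<?xml version="1.0" encoding="UTF-8"?>
<d:dictionary xmlns="http://www.w3.org/1999/xhtml"
xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">'
counter = 0
begin
  # The pinyin file only drives the loop; its lines are not used here.
  pinyin.each do |line|
    counter += 1
    b, english = cn.readline.split("\t")
    c, dummy = tw.readline.split("\t")
    english.gsub!(/\((.*)\)/,'')   # drop qualifiers such as "(computing)"
    english.downcase! ; english.strip! ; b.strip! ; c.strip!
    # Entry and index lines follow the sample XML above (id = headword + counter).
    result << "<d:entry id=\"#{english}#{counter}\">
    <d:index d:value=\"#{english}\" d:title=\"#{english}\"/>
    <h1>#{english}</h1>
    <p>#{b} (#{c})</p>
    </d:entry>"
  end
rescue
  # Any error (end of file, a line without a tab) silently ends the loop.
end
result << "</d:dictionary>"
result.close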

After generating this file (you can download an example here), and editing the MyInfo.plist to reflect the name of the new dictionary, you can run make, and it will churn through, compile the dictionary and generate the index. The finished product (example here) can be installed into your ~/Library/Dictionaries with the command make install or manually, and ideally when you restart Dictionary.app, it will show up.

However, when I compiled the dictionary, I got a number of error messages

"""/Developer/Extras/Dictionary Development Kit"/bin"/
build_dict.sh" "My Dictionary" MyDictionary.xml
MyDictionary.css MyInfo.plist
- Building My Dictionary.dictionary.
*** Invalid index. Skipped -- entry[12504] index[<d:index
d:value="" d:title=""/>](7 similar lines deleted)
2009-02-16 18:47:30.507 add_supplementary_key[58407:10b] ***
Terminating app due to uncaught exception 'NSRangeException',
reason: '*** -[NSCFString characterAtIndex:]: Range or index
out of bounds'
2009-02-16 18:47:30.508 add_supplementary_key[58407:
10b] Stack: (2520711435, (lots of numbers))
/Developer/Extras/Dictionary Development Kit/bin/build_
dict.sh: line 131: 58407 Trace/BPT trap          "$DICT_
BUILD_TOOL_BIN"/add_supplementary_key <$OBJECTS_DIR/
normalized_key_body_list_1.txt > $OBJECTS_DIR/normalized_
key_body_list_2.txt
*** Unknown format. Skipped [raphael  1222584 0  rapha]
- Building key_text index.
(things are good...)
- Finished building ./objects/My Dictionary.dictionary.
echo "Done."
Done.

I am quite aware that I didn’t read the specs for the file format very carefully, but rather just threw together something that seemed to work – but I must say that the message above is less than meaningful. The actual result is that a file is produced, and it does work well in Dictionary.app, but it clearly does not contain all the words.
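The "Invalid index" lines at least show that some entries ended up with an empty headword, presumably where the whole English title sat inside parentheses. Whether that is also what crashes add_supplementary_key is only a guess, but skipping such entries and escaping XML special characters seems like a sensible first thing to try. A sketch, with made-up helper names and a simplified id scheme (entry plus counter) that is my own assumption:

```ruby
# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Returns the XML for one entry, or nil when nothing is left of the
# English title after removing the parenthesised qualifier.
def dictionary_entry(english, simplified, traditional, counter)
  headword = english.gsub(/\(.*\)/, "").downcase.strip
  return nil if headword.empty?
  h = xml_escape(headword)
  <<~XML
    <d:entry id="entry#{counter}">
      <d:index d:value="#{h}" d:title="#{h}"/>
      <h1>#{h}</h1>
      <p>#{xml_escape(simplified)} (#{xml_escape(traditional)})</p>
    </d:entry>
  XML
end

puts dictionary_entry("Hacker (computing)", "黑客", "黑客", 4)
```

The caller would simply skip any entry for which dictionary_entry returns nil.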

I am not quite enamoured with the Dictionary.app interface, since it only shows a list of headwords, and you have to select a headword to see the translation – different from how the website I mentioned does things. However, it is extremely speedy, and it would be nice if I could solve the problem above.

I might also try to convert the dictionary into an actual dictd file, which should be easier. And I’ve even thought of merging it somehow with CEDICT to get one large database (I found some perl scripts that might help with this).

Conclusion
This is how far I got in my amateur hacking this time, before it was time to turn back to studies and “more important things”. It’s interesting that I did a part of this work two years ago, when I wasn’t blogging, and therefore didn’t document it. These days, I figure I derive so much utility from other people’s write-ups about their problems and solutions, and neat hacks, that I ought to share my stuff with the world. Perhaps only a few people ever come across it, but to them it might be very useful. It’s also a great personal archive – had I not found the script I wrote two years ago in GMail, it might have been lost.

It also shows how useful semantically marked up data can be, especially when a website allows you to download its entire database and do fun stuff with it that they never even planned for. (Something similar was the case when I made the Indonesian mouse-over dictionary). There are a large number of useful tools out there, but they become much more useful if you can run them in command line mode. And there needs to be an easy to use, easy to create, and extract from, format for dictionaries, that all applications can read. (Maybe dictd is it, I need to learn more about it first).

Stian

33 Responses to “Release early, release often: English-Chinese dictionary based on Wikipedia”

  1. Bhrgunatha
    February 16th, 2009 @ 9:25 pm

    Thanks, I think this post is going to be really helpful learning contemporary Chinese. Most sources are fairly traditional in the approach to teaching vocabulary and idiom so something like this is a treasure.

  2. John Britton
    February 16th, 2009 @ 11:22 pm

    Good work Stian, would be cool to see this implemented for other languages as well… any takers?

  3. est
    February 16th, 2009 @ 11:46 pm

    haha, nice post, but generally we still write RMS and Linus directly in Chinese. It’s bit odd to say it in Chinese.

  4. Shushan Wen
    February 17th, 2009 @ 12:11 am

    Good job!
    One thing is that when I tried the traditional/simplified Chinese conversion by using sed -f twcn or cnt, it works well in general, but there’re still some characters are not mapped correctly once a while.
    For example, in twcn I can see
    s/迴/回/g
    but in cntw it is
    s/回/回/g

    Another example is cntw has a line as
    s/伙/夥/g
    but twcn does not have the reverse mapping.

    Thanks for your attention,

    Shushan

  5. dd
    February 17th, 2009 @ 12:54 am

    if you looking for user generated dictionary on the web with sharing ability, look no father than kitajiro, though which is Chinese-Japaneses dictionary with ready-to-use dictionary formats for several dictionary tools including pdic and EBWin. also could mention is lingoes dictionary tools with own dictionary file format, author promised file format editor, but at the moment there not available online.
    http://www.ctrans.org/cjdic/index.php
    http://www.lingoes.net/

  6. OwenK
    February 17th, 2009 @ 2:54 am

    You should merge it into cc-cedict! :-)

  7. J.D.
    February 17th, 2009 @ 2:58 am

    Re: chinese characters in Terminal on the Mac, you might have better luck if you set the character encoding to something like UTF-8.

    Preferences > Settings> Advanced > Character Encoding

    J.D.

  8. honato
    February 17th, 2009 @ 3:23 am

    Great Article. Thanks for sharing your experience.

  9. Leon
    February 17th, 2009 @ 5:09 am

    Excellent stuff. I’ve often wondered about amending Wenlin’s dictionary, and always use wikipedia as a reference for Chinese/English proper noun conversion. Have you used the Chinese Pera-Kun plugin for Firefox? I find it very useful for reading firefox web pages, and it may be interesting to see how accessible its internal dictionary format is. Thank you for publishing all the code used in your work, too!

  10. turbo24prg
    February 17th, 2009 @ 6:18 am

    Very nice idea. I’m using this technique manually quite often. When automating this, you don’t have to use any scraping yourself. Use semantic web technologies!

    The Wikipedia is already scraped and turned into RDF by the DBPedia Project. They have a SPARQL Endpoint you can use.:

    http://dbpedia.org/snorql/?query=SELECT+%3Fname+WHERE+{%0D%0A+++[+rdfs%3Alabel+%22Design+pattern%22%40en+%3B+rdfs%3Alabel+%3Fname+]+.%0D%0A+++FILTER+(LANG(%3Fname)+%3D+%27zh%27)+.%0D%0A}

    It may seem a bit confusing, but read about RDF and SPARQL and look at the DBPedia. It’s really awesome.

  11. Stian Håklev
    February 17th, 2009 @ 6:57 am

    Hi Leon. I have used that, and I find it very useful as well. Rather than amending the database, in the past I tried to completely replace the database (with a Hindi one), and my friend tried to replace it with an Esperanto one, but it failed miserably, not quite sure why. I find Mozilla plugins, with all their XML config and not, very convoluted, although I wish I knew how it worked. This is another thing that is needed – a generic mouseover dictionary plugin for Firefox, where you can easily plug in any dictionary (and maybe some simple grammar rules as well).

  12. Stian Håklev
    February 17th, 2009 @ 7:01 am

    That would be fun, but I wonder if it wouldn’t have to be done partly manually. First of all, I might not want to “pollute” the database with lot’s of dates, “List of Harry Potter books”, etc. In addition, they also want pinyin, parts of speech etc.

  13. Stian Håklev
    February 17th, 2009 @ 7:06 am

    Thanks for this. As I stated in the blog, this is taken directly from the Wikipedia conversion tables circa 2007. I imagine that they have been somewhat improved by now (although probably not doubled, like the articles have), and I will try to extract a new version and post it to my blog later.
    In parenthesis, it would also be nice to have a conversion table that simply converted all the characters, without changing, for example, 电脑 into 计算机 etc for different jurisdictions. Personally I use TongWenTang plugin to convert webpages from traditional to simplified (http://tongwen.mozdev.org/). I guess I should just learn how to read traditional characters, since all the dailies in Toronto are in traditional, but…

  14. Stian Håklev
    February 17th, 2009 @ 7:08 am

    This would be the easiest thing in the world. Both the extraction code and the server code would work exactly the same for any other language. I even considered making the extraction code more generic, by having it accept command line parameters, but I decided to keep it simple. At the same time as the Chinese-English, I also generated Norwegian-Chinese and Esperanto-Chinese for a friend. I’d love to see others playing with this code and sharing the results, but I would also be willing to make specific versions available if people request it (of course it also works for other Wikipedia language versions, ie German-Albanian etc, but of course for esoteric pairs, you might get fewer entries).

  15. When Your Dictionaries aren’t Good Enough… « Justrecently’s Weblog
    February 17th, 2009 @ 7:21 am

    [...] Random Matters discusses a rather technical approach (too technical for me). [...]

  16. David Gerard
    February 17th, 2009 @ 8:24 am

    Take care – interwiki links are not necessarily 1:1. However, I know researchers into this have said that interwiki links that *are* 1:1 are usually safe.

  17. Pontus Stenetorp
    February 17th, 2009 @ 10:46 am

    Really nicely done. I had the same idea back in early 2008. Since Swedish lacks dictionaries towards many languages I thought of a way to generate data for a free dictionary. This could indeed provide data for at least a complement for a Any Wikipedia language-Any other Wikipedia dictionary.

    I am happy to see that your work turned out so well. Looking forward to reading more from you. =)

    You have my up-vote on reddit, more people should see that playing with languages is fun.

  18. Max
    February 17th, 2009 @ 11:26 am

    I ran the extraction on Japanese – English and Japanese – German entries. I might upload them somewhere. Note: Apart from the obvious changes I had to fix the code as it gave me syntax errors and stuff:

    zh = File.open("jawiki-latest-pages-articles.xml")
    zhtitle = File.open("german-japanese.txt","w")
    title, entitle, hitcounter, counter = '','',0,0

    while true
      counter += 1
      line = zh.readline
      if line.match(/<title>(.*?)<\/title>/)
        title = Regexp::last_match[1]
      end
      if line.match(/\[\[de:(.*?)\]\]/)
        entitle = Regexp::last_match[1]
        unless title.match(/^Wikipedia|^User|^Help|^[A-Za-z]/i)
          zhtitle << title << "\t" << entitle << "\n"
          hitcounter += 1
          if hitcounter == (hitcounter / 100) * 100
            puts "In #{counter} articles, found #{hitcounter} hits: #{hitcounter.to_f/counter.to_f*100}%."
          end
        end
      end
    end

  19. Stian Håklev
    February 17th, 2009 @ 11:56 am

    Max, sorry about the errors. I think there might be some problems involved with posting code on the blog. I think I will have to invest in a plugin for displaying code that better preserves formatting etc. I will also try to collect all the scripts together and put them in a zipfile for download. I find it very useful to include the actual scripts in the blog though, because generally they are very short, and serve more to demonstrate both specific approaches, and also how easy these things are (and how powerful Ruby is :))…

    There is lot’s more that can be done with this – both in terms of the interface, in integrating this dictionary with other existing open dictionaries, converting the files into specific dictionary formats, etc. I’d love to see work on this, and hope you trackback or leave comments, so that I can keep up. Share your work, and ideas – even if it’s (as this is) in progress. :) Thanks for commenting.

  20. links for 2009-02-17 at DeStructUred Blog
    February 17th, 2009 @ 8:03 pm

    [...] Release early, release often: English-Chinese dictionary based on Wikipedia | Random Stuff that Matt… (tags: interesting English language Wiki Dictionary chinese Conversion Translation) [...]

  21. Tae
    February 17th, 2009 @ 9:56 pm

    I’d beg to differ with your assumption that
    a dictionary wouldn’t contain “Grey’s anatomy” as an entry.
    http://www.nciku.com/search/all/grey%27s%20anatomy

    :)

  22. Stian Håklev
    February 17th, 2009 @ 10:25 pm

    Tae: Inded you are right, I stand corrected. Awesome. Yes, both iciba, and nciku are really great tools – I just came across nciku the other day, and it seems very comprehensive. It doesnt’t have http://reganmian.net/en-zh/NATO%20phonetic%20alphabet though :) Haha… But seriously, the more dictionaries and neat services the better, but the big difference to me is whether you are allowed to download the database and play around with it. Shabdkosh.com is an awesome Hindi dictionary online, but when I was in India studying Hindi, I didn’t have access to the internet, and really needed an offline version, etc… :)

  23. China Journal : Best of the China Blogs: February 18
    February 18th, 2009 @ 12:16 am

    [...] students of the Chinese language, a technical and highly interesting article on using Wikipedia as a Chinese dictionary (h/t Chinese Student Blog). [Random Stuff That [...]

  24. hsknotes
    February 18th, 2009 @ 9:11 am

    You are a hero to all peoples. With some hacks and updates this (already incredible) can be truly awesome.

    Two things needed:

    1. pinyin, (correct pinyin), perhaps impossible, perhaps only with users gradually updating it can this problem be fixed.

    2. in sentence usage. If the paired word could include a link to this wikipedia page or even just the sentence in which it appears, that would of great use for so many reasons.

    Mind telling me what those great chinese dictionaries are? I’ve been here 5 years and have yet to find them, learners, chin-chin, chin-eng, etc. English-chinese certainly have quite a few good ones.

    More later.

  25. hsknotes
    February 18th, 2009 @ 9:52 am

    Partial retraction.

    Ok, this site is still excellent in many ways, but before using it I overestimated the amount of bindings. It only works off of titles (useful, no doubt), not quite sure I thought there would be anything else.

    So, someone mentioned this (maybe you as well), but we’ve just been manually doing this. And for words that aren’t in titles you pray for quality translations so you can match up sentence by sentence.

    Nonetheless, as they say in Xi’an when the food is great (‘unrepresentable in english’).

  26. Richard88
    February 18th, 2009 @ 11:52 pm

    Stian: “Wenlin was unable to cope with it (in fact, even trying to import 100 words automatically failed)”

    I have imported CEDICT into Wenlin witout problems.

    Stian: “Wenlin required [the pinyin for each entry], and it even checked to see that each pinyin was a possible reading for the given Chinese character.”

    If you put .-nopy on a line by itself before the first entry, the pinyin validity check will be skipped.

    e.g. the following entries will be accepted:

    cidian
    .-nopy
    ***
    py blah
    char 一
    1df one; 1; single; a (article); as soon as; entire; whole; all; throughout
    ser 42
    ***
    py blah
    char 一一
    1df one by one; one after another
    ser 43
    ***

  27. Tom Bishop
    February 23rd, 2009 @ 1:53 pm

    This article is very interesting. What was the problem importing the list into Wenlin? More than 100 entries should be no problem; I routinely import all of the ABC Dictionary, about 200,000 entries. Please don’t hesitate to contact me for Wenlin technical support. Also I’d like to mention that we’re working hard on the next editions of both ABC and Wenlin.

  28. Stian Håklev
    February 23rd, 2009 @ 5:53 pm

    Tom,
    I used the method where I first open the file in Wenlin, and then choose Import on the menu. Is there another method? The file I generated for Wenlin was 28 MB, and Wenlin was not too happy even opening it in the editor (although I must say it performed much better than expected on files of several MB, that some text editors choke on).

    I am very excited to hear that you are working hard on the next version. It would be fun to hear more about what kind of features, etc, are planned, and perhaps you could receive some user feedback on what is most needed etc.

    Thanks
    Stian

  29. Tom Bishop
    February 23rd, 2009 @ 6:36 pm

    Stian Håklev wrote:

    >Tom,
    >I used the method where I first open the file in Wenlin, and then choose Import on the menu. Is there another method?

    No, that’s the right way.

    >The file I generated for Wenlin was 28 MB, and Wenlin was not too happy even opening it in the editor (although I must say it performed much better than expected on files of several MB, that some text editors choke on).

    You mean it took a long time? That’s not surprising, but it shouldn’t make the program unhappy. The ABC cidian, as a text file, is 29 megabytes. On my 2-year-old MacBook (2 GHz Intel Core 2 Duo, 1 GB RAM), Wenlin opens it in six seconds, and imports it in six minutes. Did you actually encounter an error or crash during the import, or did you just run out of patience? There should be a progress bar showing the percentage of completion. I encourage you to try again and let it run overnight, or use a faster computer. If there’s a real problem, I really want to know about it.

    > I am very excited to hear that you are working hard on the next version. It would be fun to hear more about what kind of features, etc, are planned, and perhaps you could receive some user feedback on what is most needed etc.

    Frankly, what’s most needed is money! But our plan is to make ABC web-based, publish the Wenlin source code, obtain not-for-profit status, and encourage a lot of experts to volunteer.

    We don’t really need a lot of feedback currently, since our to-do list is already ten miles long. Nevertheless, we always welcome suggestions. The main things we’re already working on are a completely new English-Chinese dictionary; user-interface improvements (e.g. multiple independent windows); and flashcards that aren’t limited to monosyllables.

    Thanks for your encouragement! All the best,

    Tom (wenlin.com)

  30. chinese dictionary
    March 31st, 2009 @ 1:09 am

    open access may translate to :对公开放

  31. Creating a “dictionary” from KDE translation files | Random Stuff that Matters
    November 1st, 2009 @ 5:27 pm

    [...] previously written about how I used interwiki links in Wikipedia to extract dictionary information (here and here). After talking with a friend, I got another idea for how I could extract even more [...]

  32. johnfisherman
    March 18th, 2010 @ 2:36 pm

    I have recently developed an application that taps into the power of Wikipedia to translate terms across languages.

    http://www.fredrocha.net/MemeMiner

    I've used some some AJAX / JSON / jQuery magic for the interface.

    Go take a peek and let me know what you think!

    Thx!

  33. Henry
    June 8th, 2010 @ 4:50 am

    Pretty sweet… Great post :)
