Release early, release often: English-Chinese dictionary based on Wikipedia

February 16th, 2009

Background
Although there are some great Chinese dictionaries out there, I often encounter cases where they are not enough. I might be looking for a specific concept, like “open access” (in scholarly publishing), and want to know how it is written in Chinese, so that I can google for articles about it in Chinese. Or I might want to know how Heidegger, Copenhagen or Grey’s Anatomy is written in Chinese – a dictionary is unlikely to have any of these (it might have the first two, if it is very complete, but certainly not the last).

Wikipedia is a great, perhaps unintentional, source of these words, because of interwiki links. Each article (usually) has a number of links to articles on the same topic in another language. In the wiki markup, these look like [[en:Oslo]] (to make a link to the English article about Oslo), and they are listed on the bottom left side of the screen. When, a few days ago, I needed to know about open access in Chinese, I went to the English page on open access, and clicked on the Chinese link, to the page 开放获取.

Data extraction
One of the neat things about Wikipedia is that you can download the whole database, and do fun things with the contents. In fact, I tried doing this all the way back in 2007, which gave good results, but I never did anything further (and didn’t write it up, since I was in Indonesia, and not blogging, at the time). Recently, a friend of mine has been playing around with dictd servers and files, and this inspired me to take up what I had left.

Luckily, I had pasted the little code snippet I made back then into an email to my friend, and I could find it easily. It worked without any modifications. All I had to do was download the latest Chinese database, an XML file containing all the articles (the file zhwiki-20090116-pages-articles.xml.bz2 in this directory; you can find all the different databases at Wikipedia Downloads). I then ran the Ruby script below, to extract all the titles of articles, and their interwiki links, to a separate, tab-separated text file:


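# Extract Chinese article titles and their English interwiki links
# from the Chinese Wikipedia dump into a tab-separated file.
zh = File.open("zhwiki-20090116-pages-articles.xml")
zhtitle = File.open("english-chinese.txt", "w")
title, entitle, hitcounter, counter = '', '', 0, 0

zh.each_line do |line|
  # Remember the title of the article we are currently inside.
  if line.match(/<title>(.*?)<\/title>/)
    title = Regexp.last_match[1]
    counter += 1
  end
  # An [[en:...]] interwiki link gives the English title.
  if line.match(/\[\[en:(.*?)\]\]/)
    entitle = Regexp.last_match[1]
    # Skip non-article namespaces and titles that start with a Latin letter.
    unless title.match(/^Wikipedia|^User|^Help|^[A-Za-z]/i)
      zhtitle << title << "\t" << entitle << "\n"
      hitcounter += 1
      if hitcounter % 100 == 0
        puts "In #{counter} articles, found #{hitcounter} hits: #{hitcounter.to_f / counter * 100}%."
      end
    end
  end
end
zhtitle.close
zh.close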

This generates an output file that looks like this

设计模式 Design pattern
中华人民共和国 People’s Republic of China
克利斯登·奈加特 Kristen Nygaard
黑客 Hacker (computing)
林纳斯·托瓦兹 Linus Torvalds
理查德·斯托曼 Richard Stallman
自由软件基金会 Free Software Foundation
2003年7月 July 2003
操作系统 Operating system

It’s interesting to compare the statistics for this run with those for the run I did in March 2007. In 2007, the XML dump was 520MB, with 59,600 hits (articles with interwiki links to English). In 2009, the XML dump is 1.1GB, with around 123,300 hits, i.e. roughly double both the file size and the number of hits in two years. As you can see from the small selection above, not all of these are useful (dates, for example), but many are, and would not be found in ordinary dictionaries.

Transforming to simplified and traditional characters
The Chinese Wikipedia contains articles written in both simplified and traditional characters, and has a built-in facility to convert this on the fly, so that a user can read everything in simplified or traditional according to the settings. Converting from simplified to traditional and back is not trivial, because there are a number of traditional characters that all convert to the same simplified character, etc. Chinese Wikipedia has come up with a great conversion database which deals with this, and once again, we are able to download it and use it for our own purposes.

Back in 2007, a friend of mine downloaded this database, and converted it to a sed script. Sed is an extremely fast command line regexp search-and-replace program for *NIX (also built into OSX). The file looks like this:

s/幾畫/几画/g
s/賣畫/卖画/g
s/滷鹼/卤碱/g
s/原畫/原画/g

Each line is an instruction to do a global search and replace, for example replacing 幾畫 (traditional) with 几画 (simplified). This is from the traditional->simplified file; there is also a simplified->traditional file. Note that sometimes the conversion isn’t simply between characters, but also between words, when different words are used in mainland China and Taiwan. (Download cn->tw and tw->cn sed files).

So instead of having a file that mixed simplified and traditional characters, we can easily generate one with simplified and one with traditional, using sed:

sed -f cntw english-chinese.txt > english-chinese.tw.txt
sed -f twcn english-chinese.txt > english-chinese.cn.txt
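For readers without sed at hand, the same rule files can be applied from Ruby. The following is only a sketch, assuming every rule has the exact shape s/FROM/TO/g shown above and that FROM can be treated as a literal string rather than a regexp; the function names and the sample line are made up for illustration.

```ruby
# encoding: utf-8
# Parse sed-style rules of the form s/FROM/TO/g into [from, to] pairs.
# Lines that do not have exactly that shape are skipped.
def parse_sed_rules(lines)
  lines.map { |l| l.strip.match(/\As\/(.+?)\/(.*?)\/g\z/) }
       .compact
       .map { |m| [m[1], m[2]] }
end

# Apply the rules one after another, in file order, treating FROM as a
# literal string (an assumption; sed itself would read it as a regexp).
def convert(text, rules)
  rules.inject(text) { |acc, (from, to)| acc.gsub(from, to) }
end

rules = parse_sed_rules(["s/幾畫/几画/g", "s/賣畫/卖画/g"])
# Hypothetical sample line, not taken from the real data file.
puts convert("幾畫\tSample title", rules)
```

With the real files, the rule lines would come from something like File.readlines("twcn", encoding: "UTF-8"). Applying thousands of rules one by one is slow compared to sed, so this is more useful for small experiments than for the full file.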

Instead of doing all this yourself, you can directly download the entire textfile in simplified, or traditional characters (or both, zipped).

Using the file
With this simple file, you can already do a lot. The simplest is to use grep, a fast command line tool that searches lines in a text file. To quickly search for open access, I would use

grep -i "open access" english-chinese.cn.txt

and get the following result:

开放获取    Open access

The -i means that grep ignores case differences. Note that on my Mac, I cannot see Chinese characters in the terminal window (it might be possible to fix with some settings). An alternative would be to do

grep -i "open access" english-chinese.cn.txt > out.tmp

and then open out.tmp in a text editor that can read UTF8 (unicode). Note that in many text editors you have to specifically ask to open the file as UTF8.

This is cumbersome, but you can of course make different kinds of interfaces to it.
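One such interface is a few lines of Ruby doing the same job as grep -i. This is a rough sketch rather than a finished tool: the function name is invented, and the sample lines stand in for the real file, which could be loaded with File.readlines("english-chinese.cn.txt", encoding: "UTF-8").

```ruby
# encoding: utf-8
# Case-insensitive substring search over "Chinese<TAB>English" lines.
# Returns an array of [chinese, english] pairs; malformed lines are ignored.
def search_entries(lines, term)
  needle = term.downcase
  lines.map { |line| line.chomp.split("\t", 2) }
       .select { |zh, en| zh && en && (en.downcase.include?(needle) || zh.include?(term)) }
end

sample = ["开放获取\tOpen access\n", "黑客\tHacker (computing)\n"]
search_entries(sample, "open access").each { |zh, en| puts "#{zh}\t#{en}" }
```

Because it matches against both columns, the same function accepts a Chinese search term as well as an English one.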

Web interface
One simple interface I made was a web interface. Initially I simply ran the grep command through a Ruby wrapper, but I realized that if I passed arbitrary user-supplied text to the command line, people could use it to run their own commands on my server, so I changed to a very simple search. Note that this is not indexed, and is extremely “inefficient” – by putting this into a database, or using something like Ferret, it would be much faster. But it works. Source:

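#!/usr/bin/env ruby
require 'rubygems'
require 'fcgi'
require 'cgi'

# Load the dictionary once at startup; each line is "Chinese<TAB>English".
entries = File.readlines("zhcn-en.txt")

FCGI.each_cgi do |cgi|
  text = cgi['bigger']
  search = text.gsub(/\.html/, '')
  safe = CGI.escapeHTML(search)  # never echo raw user input into HTML

  puts "Content-Type: text/html; charset=utf-8"
  puts  # blank line ends the HTTP headers
  puts "<html><head><title>#{safe} | English-Chinese dictionary</title>"
  puts '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >'
  puts "</head><body>"
  puts "<h1>Search result for #{safe}</h1><i>This is a simple search of a database extracted from the interwiki links of Chinese Wikipedia. shaklev@gmail.com</i><p>"
  puts "<table>"
  needle = search.downcase
  entries.each do |line|
    # Plain substring match, so the query is not interpreted as a regexp.
    if line.downcase.include?(needle)
      zh, en = line.chomp.split("\t")
      puts "<tr><td>#{CGI.escapeHTML(zh.to_s)}</td><td>#{CGI.escapeHTML(en.to_s)}</td></tr>"
    end
  end
  puts "</table></body></html>"
end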
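The linear scan above could be replaced by an index without going as far as a database or Ferret. As a hedged sketch (not what the site runs), an in-memory Hash keyed on the lower-cased English title gives fast exact-match lookups; the function name and sample data are illustrative only.

```ruby
# encoding: utf-8
# Build a Hash from lower-cased English title to the list of Chinese titles.
# Lines without a tab are skipped.
def build_index(lines)
  index = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    zh, en = line.chomp.split("\t", 2)
    next if zh.nil? || en.nil?
    index[en.downcase] << zh
  end
  index
end

index = build_index(["开放获取\tOpen access\n", "操作系统\tOperating system\n"])
# fetch avoids creating empty entries for words that are not in the index.
puts index.fetch("open access", []).inspect
```

The trade-off is that this only finds whole titles: a search for "access" would no longer match "Open access", which the substring search does.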

I didn’t bother writing a form page for it, but the API is extremely simple: http://reganmian.net/en-zh/searchword. Here are some examples:

http://reganmian.net/en-zh/Toronto
http://reganmian.net/en-zh/open access
http://reganmian.net/en-zh/sex

One advantage of the simple search is that it accepts both English and Chinese input, see for example:

http://reganmian.net/en-zh/开放获取

Redirects and disambiguation
When I initially entered “open access” in English Wikipedia, I arrived at a disambiguation page giving me links to different meanings of the term, one of which, Open access (publishing), was the one I wanted. It is also often the case that abbreviations, people’s last names, etc. are redirected to the full article name. I figured it would be useful to have an index of all these disambiguations and redirects, so that I could incorporate that in the database. If, for example, NATO was a redirect to North Atlantic Treaty Organization, I could have both of those function as headwords for the same Chinese term in the dictionary.

And if you looked up open access, I could have (publishing): Chinese term, (infrastructure): different Chinese term, etc. The problem is that I would have to sort through the English database to do this, and the English dump is 7.8GB packed (probably something like 150GB unpacked – pure text). There is also a dump of redirects, but that is just an SQL dump containing the ID of each article and the title of the redirect, so I would first have to import the SQL dump of page titles into a local SQL database. I tried, but it took forever, and I gave up. This is not impossible, but it will take more time and more programming.
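Once a redirect table is available, merging it into the dictionary is the easy part. A small sketch, assuming the redirects have already been reduced to a Hash of redirect title => target title (getting to that point is the hard part described above); the function name and both sample hashes are illustrative, not real extracted data.

```ruby
# encoding: utf-8
# dictionary: English headword => Chinese term
# redirects:  redirect title   => target article title
# Returns a copy of the dictionary with redirect titles added as headwords.
def add_redirect_headwords(dictionary, redirects)
  expanded = dictionary.dup
  redirects.each do |from, to|
    # Only add an alias when the target has a translation and the
    # alias is not already a headword of its own.
    expanded[from] = dictionary[to] if dictionary.key?(to) && !expanded.key?(from)
  end
  expanded
end

dictionary = { "North Atlantic Treaty Organization" => "北大西洋公约组织" }
redirects  = { "NATO" => "North Atlantic Treaty Organization", "Foo" => "Bar" }
expanded = add_redirect_headwords(dictionary, redirects)
puts expanded["NATO"]
```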

Other dictionary formats
Having a simple text file is great, you can grep it, and even build simple interfaces, like the web interface I mentioned above. But it would be great if we could also put this database into different dictionaries and lookup programs that already exist.

First I thought about Wenlin. Although it is proprietary, and has not been significantly updated for many years, it is still a very powerful program, which I use frequently when reading texts. I even made a screencast to showcase why I found it so useful. I wondered if it would be possible to import this dictionary into Wenlin. Turns out there is a way to import entries – you need to open a specially formatted textfile in Wenlin, and then choose “import”. I was lucky enough to find a very interesting German project to create a German database for Wenlin, and they had a text file that I could use as a model.

The format looks like this:

cidian.db
New or changed entries:

*** 1 ***
pinyin                  	zàijūliú
characters               	再拘留
serial-number            	1016904350
reference                	vwu3184a1
part-of-speech           	v.
environment              	law
definition               	rearrest

With some experimentation, I found that the serial-number and reference had to be there, but could be empty. part-of-speech and environment were not necessary at all. However, I needed two things. First of all, since simplified and traditional characters are not easy to automatically convert, the program requires that you specify both simplified and traditional characters. This was solved by using the scripts above to generate one file for simplified and one for traditional (the same content was on the same lines in each file, so it would be easy to combine them).

In addition, Wenlin requires you to provide the pinyin for each word. This is because some characters have multiple readings, so that it is not easy to automatically generate (correct) pinyin for characters. I didn’t need this, but Wenlin required it, and it even checked to see that each pinyin was a possible reading for the given Chinese character. So I needed somehow to get all the words rendered in pinyin.

There are many services, and programs, that convert from Chinese text to pinyin. However, I couldn’t find any good command line tools. Command line tools are very good when you are dealing with text files of many megabytes! I tried pasting the text into textfields in Firefox, and in stand-alone applications, and they all choked. Surprisingly, Wenlin itself was able to open the large (around 4MB) file, and it actually has a built-in conversion to pinyin. However, given that this cannot happen automatically, it tags each character with multiple readings, and asks you to select the correct one. I wasn’t too preoccupied by having correct readings, just having possible readings that would be accepted by Wenlin would be enough, so I saved the result of the conversion (which took a while). Some lines looked like this:

lín nà sī·tuō 【◎Fix:◎wǎ;◎wà;◎wā】 【◎Fix:◎zī;◎cí】    Linus Torvalds
lǐ 【◎Fix:◎chá;◎zhā】 dé·sī tuō màn    Richard Stallman

And I had to use the search-and-replace with regexp function in TextMate to remove these options, leaving only the first one (as I mentioned, my goal was not to choose the correct reading, but a possible one).
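The same clean-up can be done outside TextMate. Here is a sketch in Ruby rather than the TextMate search-and-replace actually used, assuming every ambiguous syllable is wrapped exactly as in the sample lines above; the function name is made up.

```ruby
# encoding: utf-8
# Replace each 【◎Fix:◎a;◎b;...】 group with its first reading (a).
# The first reading is merely a possible one, not necessarily the right one.
def keep_first_reading(line)
  line.gsub(/【◎Fix:◎([^;】]*)[^】]*】/) { $1 }
end

puts keep_first_reading("lǐ 【◎Fix:◎chá;◎zhā】 dé·sī tuō màn\tRichard Stallman")
```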

Combining all three files, I generated the file in Wenlin’s required format. However, because of all the space required per word, the file became quite large, and Wenlin was unable to cope with it (in fact, even trying to import 100 words automatically failed). I wish there was a command line tool that enabled me to import large amounts of words into Wenlin, but until then, I might have to give this avenue up.

Apple’s Dictionary.app
Initially, I thought that Dictionary.app, which is preinstalled on all Macs, used dictd files, but it turns out it uses an Apple-specific format. Luckily, this is well documented, and tools for generating these files are included in the developer package. All you have to do is generate an XML file, which looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<d:dictionary xmlns="http://www.w3.org/1999/xhtml"
xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">
  <d:entry id="mathematics1">
    <d:index d:value="mathematics" d:title="mathematics"/>
    <h1>mathematics</h1>
    <p>数学 (數學)</p>
  </d:entry>
  <d:entry id="philosophy2">
    <d:index d:value="philosophy" d:title="philosophy"/>
    <h1>philosophy</h1>
    <p>哲学 (哲學)</p>
  </d:entry>
</d:dictionary>

Here is the script I wrote to generate this file:

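pinyin = File.open('english-pinyin.txt')
cn = File.open('english-chinese.cn.txt')
tw = File.open('english-chinese.tw.txt')
result = File.open('MyDictionary.xml', 'w')
result << '<?xml version="1.0" encoding="UTF-8"?>
<d:dictionary xmlns="http://www.w3.org/1999/xhtml"
xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng">'
counter = 0
begin
  # The three files have the same entries on the same lines;
  # the pinyin file only drives the loop here.
  pinyin.each do |line|
    counter += 1
    b, english = cn.readline.split("\t")
    c, dummy = tw.readline.split("\t")
    # Drop parenthesised qualifiers such as "(computing)" from the headword.
    english.gsub!(/\((.*)\)/, '')
    english.downcase!; english.strip!; b.strip!; c.strip!
    # Note: values are inserted as-is, without XML escaping.
    result << "<d:entry id=\"#{english}#{counter}\">
    <d:index d:value=\"#{english}\" d:title=\"#{english}\"/>
    <h1>#{english}</h1>
    <p>#{b} (#{c})</p>
    </d:entry>"
  end
rescue EOFError
  # cn or tw ran out of lines before the pinyin file did
end
result << "</d:dictionary>"
result.close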

After generating this file (you can download an example here), and editing the MyInfo.plist to reflect the name of the new dictionary, you can run make, and it will churn through, compile the dictionary and generate the index. The finished product (example here) can be installed into your ~/Library/Dictionaries with the command make install or manually, and ideally when you restart Dictionary.app, it will show up.

However, when I compiled the dictionary, I got a number of error messages

"""/Developer/Extras/Dictionary Development Kit"/bin"/
build_dict.sh" "My Dictionary" MyDictionary.xml
MyDictionary.css MyInfo.plist
- Building My Dictionary.dictionary.
*** Invalid index. Skipped -- entry[12504] index[<d:index
d:value="" d:title=""/>](7 similar lines deleted)
2009-02-16 18:47:30.507 add_supplementary_key[58407:10b] ***
Terminating app due to uncaught exception 'NSRangeException',
reason: '*** -[NSCFString characterAtIndex:]: Range or index
out of bounds'
2009-02-16 18:47:30.508 add_supplementary_key[58407:
10b] Stack: (2520711435, (lots of numbers))
/Developer/Extras/Dictionary Development Kit/bin/build_
dict.sh: line 131: 58407 Trace/BPT trap          "$DICT_
BUILD_TOOL_BIN"/add_supplementary_key <$OBJECTS_DIR/
normalized_key_body_list_1.txt > $OBJECTS_DIR/normalized_
key_body_list_2.txt
*** Unknown format. Skipped [raphael  1222584 0  rapha]
- Building key_text index.
(things are good...)
- Finished building ./objects/My Dictionary.dictionary.
echo "Done."
Done.

I am quite aware that I didn’t read the specs for the file format very carefully, but rather just threw together something that seemed to work – but I must say that the message above is less than meaningful. The actual result is that a file is produced, and it does work well in Dictionary.app, but it clearly does not contain all the words.
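The “Invalid index” lines at least point at one concrete cause: headwords that end up empty once the parenthesised part is stripped. A hedged sketch of a more defensive entry generator follows. It skips empty headwords and escapes XML special characters; the function name and the e-plus-counter id scheme are assumptions of this sketch, and whether this also cures the NSRangeException crash is untested.

```ruby
# encoding: utf-8
require 'cgi'

# Build one <d:entry> element, or return nil when the headword is empty
# after removing a parenthesised qualifier such as "(computing)".
def dictionary_entry(english, simplified, traditional, counter)
  headword = english.gsub(/\(.*\)/, '').strip.downcase
  return nil if headword.empty?
  h = CGI.escapeHTML(headword)  # escapes &, <, > and quotes
  <<-XML
<d:entry id="e#{counter}">
  <d:index d:value="#{h}" d:title="#{h}"/>
  <h1>#{h}</h1>
  <p>#{CGI.escapeHTML(simplified)} (#{CGI.escapeHTML(traditional)})</p>
</d:entry>
  XML
end

entry = dictionary_entry("Mathematics", "数学", "數學", 1)
puts entry unless entry.nil?
```

In a generator loop, nil results would simply be skipped instead of being written to the XML file.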

I am not quite enamoured with the Dictionary.app interface, since it only shows a list of headwords, and you have to select a headword to see the translation – different from how the website I mentioned does things. However, it is extremely speedy, and it would be nice if I could solve the problem above.

I might also try to convert the dictionary into an actual dictd file, which should be easier. And I’ve even thought of merging it somehow with CEDICT to get one large database (I found some Perl scripts that might help with this).

Conclusion
This is how far I got in my amateur hacking this time, before it was time to turn back to studies and “more important things”. It’s interesting that I did a part of this work two years ago, when I wasn’t blogging, and therefore didn’t document it. These days, I figure I derive so much utility from other people’s write-ups about their problems and solutions, and neat hacks, that I ought to share my stuff with the world. Perhaps only a few people ever come across it, but to them it might be very useful. It’s also a great personal archive – had I not found the script I wrote two years ago in Gmail, it might have been lost.

It also shows how useful semantically marked up data can be, especially when a website allows you to download its entire database and do fun stuff with it that they never even planned for. (Something similar was the case when I made the Indonesian mouse-over dictionary). There’s a large amount of useful tools out there, but they become much more useful if you can run them in command line mode. And there needs to be a dictionary format that is easy to use, easy to create and easy to extract from, and that all applications can read. (Maybe dictd is it, I need to learn more about it first).

Stian

Slideshare has great customer support

February 12th, 2009

Why use Slideshare?
A lot of people have begun posting their slides to Slideshare for sharing with others. In many ways, Slideshare makes more sense to me than for example Scribd – it’s fairly easy for me to download a PDF and view it in my local viewer, but it’s a pain to have to start OpenOffice just to view a Powerpoint. In addition, Slideshare has an incredibly neat feature called Slidecasting, which lets you add an mp3 of the actual talk (or of music, if you wanted to make a photo slideshow), and sync the transition of the slides to the relevant points in the audio. This way people can hear you speak and see the slides change at the appropriate points, without you having to compile it all into a large video file.

It’s also getting easier and easier to produce an audio file – I normally just turn on Audacity on my laptop before I start speaking. The sound becomes a bit noisy, but with the remove noise feature in Audacity, the end-result is surprisingly good. I’ve put several presentations up, but my first slidecast was my presentation on open research and open education which I gave this summer in New Delhi. That presentation vividly illustrates how with a little bit of extra work, your presentation can have a much wider reach. The presentation was given at the Indian Institute of Public Administration, which is where top bureaucrats go to get mid-career training, and I spoke to a group of 25 professors. It was a real honor to be invited, and they had some interesting questions after the talk. However, by putting the slidecast up on Slideshare, I was able to share it with the world.

During the last half year, that slidecast received almost 1,500 hits, and I know that a number of people saw it, and that it benefitted me. For example, I would probably not be giving a presentation on open education at the Education Commons at OISE, if the people working there hadn’t seen this earlier presentation and decided that although I was a first year MA student, I knew how to present. This can be especially important for students and beginning academics who don’t have a lot of formal publications. But even if you do – knowing how to write, and knowing how to present are two different skills.

Problems, and good customer support
I was especially excited about putting my recent presentation from the Connexions/OCWC Conference online (abstract, slidecast) because this was on a fairly specialized topic about which almost no information is available in English. Whereas the previous presentation was a general overview of OER and open research, which would not be very interesting to people already inside the movement, this presentation was “cutting edge” research. Not that many people attended my presentation in Houston, because there were competing tracks (I was competing with David Wiley, no less), but several already told me that they were interested.

However, when trying to set the timings of the Slidecast, I ran into problems. It seemed like save did not work, and the timings did not show up. I tried a variety of different things, like reuploading it in a different format, but I could not get it to work. I looked around to see if there was another webservice that could let me do the same, but I couldn’t easily find one. My other option would have been to turn it into a “movie”, but that seemed like a waste. Even Keynote, which supports creating Flash-movies with narration, doesn’t let you do it with an existing MP3, you have to record from scratch.

So I finally sent off an email to the support desk at Slideshare.net, not expecting much of an answer; after all, I was not a paying customer. I was positively surprised when I received an email back, quite soon after, suggesting that I delete cookies and temporary files. I did, and it had no effect. Back to Slideshare. It actually took us a while to figure out what was going on; in total, my Gmail conversation thread shows that 18 messages went back and forth during 24 hours. The wonderful part was that on the other end was a high-level technician who treated me as a valued collaborator in hunting down this bug. I sent screenshots, tried different techniques, and at one point even used Firebug to look at the XML files sent to my browser. Finally we nailed it: there had been some bug in importing my specific presentation, and the thing was fixed.

In this age of calling customer support and being told by someone to “reboot your computer”, it is wonderful to be dealing with someone who has a deep understanding of how their system works, and who treats you as an intelligent person. Thank you very much to Ashwan, and anyone else who was involved.

(And the slidecast is now available!)

Suggestion
One thing that I would love to see is more detailed analytical data. I know that 1,500 people have visited my presentation, but how many of them actually watched it, and how many clicked on a link, saw the first slide, and left? YouTube has quite detailed analytics about when in a movie people stop watching, etc. We wouldn’t need something that fancy, but perhaps “350 people watched at least half the slides in your presentation”, or “400 people spent more than 10 minutes on your presentation”. Since Slideshare serves up each slide as people move along (I’m assuming), they should be able to tell with a good enough accuracy.

Stian

Presentation on OpenCourseWare in China posted to Slideshare

February 12th, 2009

Last week, I attended the Connexions/OpenCourseWare Consortium conference at Rice University, and I gave a presentation on OpenCourseWare in China (see abstract). The presentation roughly touches on three different aspects: the translation and use of MIT OpenCourseWare in China, the Chinese Ministry of Education led OpenCourseWare project, and the research on OpenCourseWare in Chinese. This is my MA research topic, so expect to hear a lot more about the issue, but this was my first attempt at synthesizing some of my thoughts and findings, and presenting them publicly. I also showcase some of the Chinese OCW sites, and discuss a number of issues and questions.

I recorded this with Audacity on my laptop during my presentation, and was able to sync the slides with the audio on Slideshare. Note that there is a minimum delay between each slide, which leads to a bit of a lag when I show a number of slides in rapid succession. That is corrected as soon as I start speaking longer, however, and shouldn’t be a big problem.

When people asked me questions from the audience, it was not audible on my built-in microphone, so I dubbed over that with my own voice repeating the question – which also saves some time for the listener.

Any feedback or questions are very welcome! You can also download only the mp3 file.

Stian

Large market for Spanish-language books in the US

February 4th, 2009

I love seeing material in multiple languages, even languages I don’t understand. As populations have become more diverse around the world, public libraries have risen to the challenge, and whether you visit the Public Library in Oslo, or in Toronto, they have wonderful collections of children’s and adult books (and often films, DVDs, magazines and newspapers as well) in a wide variety of languages. In Toronto it’s fun going “branch shopping”, since the library has something like 99 branches, and each branch tailors its material to the local population – thus you can see many Indian languages at the Gerrard branch, and Polish and Serbian at the Roncesvalles branch.

However, the big chain bookstores in Canada are anglo-only. If you visit a Chapters (pretty much the only chain bookstore in Canada), and ask for books in other languages, they’ll show you the dictionary section. They don’t have anything in French (in Toronto at least, maybe in Ottawa it’s different), nor in Spanish or Chinese. Certainly, in the various Chinatowns there are great Chinese-language bookstores, but I always wondered – if you are a successful Chinese lawyer living in a nice house in the suburbs, why is it that you have to drive downtown to Spadina (or to the huge Chinese mall in North York) to pick up the latest Hong Kong novel? Why can’t you do it while taking your kids to a movie at the mall?

I’ve also noticed that US bookstores tend to carry more and more materials in Spanish. Of course, the Latinos in the US probably make up a bigger percentage as a single linguistic group (especially in the south) than any group does in for example Toronto (even the Chinese). In addition, at least for the Chinese, a lot of the second-generation kids cannot read Chinese, and there’s the added complication of traditional vs. simplified characters. But still. Canada keeps thinking of itself as the fruit bowl and the US as the melting pot, but I wonder. As I’ve written about before, I found people much more confidently speaking different languages in New York than Toronto (anecdotally).

All this to say that I came across an interesting piece on books translated into English through Ethan Zuckerman’s always excellent blog. It contains some very interesting news about the rise of Spanish publishing within the US, including the fact that El Código Da Vinci (The Da Vinci Code by Dan Brown in Spanish) sold 300,000 copies. Here’s an excerpt from the paper:

The Bilingual Publishing Trend in the US
Spanish is the second most commonly used language in the US after English. According to the 2006 American Community Survey conducted by the US Census Bureau, Spanish is the primary language spoken at home by over 34 million people aged five or older. The US is home to more than 40 million Hispanics, making it the world’s fifth-largest Spanish-speaking community after Mexico, Colombia, Spain and Argentina.

A little over a decade ago, Spanish-language books occupied the smallest slice of shelf space at bookstores around the country. But the 2000 census and its revelations about the fast-growing Hispanic population sparked renewed interest among US publishing houses in meeting the reading wishes of Spanish speakers. Then came Dan Brown’s The Da Vinci Code, which not only shot up the international charts but quickly became one of the best-selling translations into Spanish of all time. While successful Spanish-language titles in the US typically sell between 15,000 and 20,000 copies, more than 300,000 copies of El Código Da Vinci were scooped off bookstore shelves across the land, ushering in what some described as a new era for Spanish-language books in America.

Now publishers are starting to time the release of English and Spanish versions so they coincide. Best-selling translations have helped the book market overall by alerting readers to the broadening selection of Spanish titles available at their local bookstores.

That was when several major US publishers began establishing divisions to cultivate new Hispanic talent and focus on the sale of both Spanish-language books and English books geared for the Hispanic market. About that time, large chain booksellers began hiring Spanish book-buyers to study market demographics and expand their Libros en Español sections. Publishers from Spain were for many years the only players serving the Hispanic market. But now they are competing with US houses for new authors and translation rights.

The author goes on to say that perhaps the rise of e-book readers (which I have been experimenting with lately, with excellent results) will enable publishers to sell directly to customers abroad, for example Spanish and Latin-American publishers could set up websites marketed directly to the diasporas in the US. I personally would love to buy Chinese and Indonesian novels online, if they were in an open format (no DRM), and I could easily pay a reasonable price.

Stian

How to get Bookeen Cybook v3 ebook reader to display Chinese with Stanza

January 31st, 2009

I’ve been interested in e-book readers for a long time, and a good friend generously agreed to lend me his Bookeen Cybook v3 reader for a week or two to play with, since that is the only way you can really get a feel for this technology – just looking at it for a few minutes is not enough.

I will write a longer review later, but for now I wanted to post the solution to a problem that immediately bugged me: How to get the reader to display Chinese text? I had to experiment with a few different ways, but finally I found that you have to generate a Mobipocket file. For Mac, you can use the excellent Stanza to do this (I understand there are programs for Windows too, not sure about Linux, but hopefully there are options).

A small bug, it seems, in Stanza is that if you copy text from for example a webpage (many Chinese novels are available in full-text as one long webpage), and choose “Create new from clipboard” in Stanza, the linefeeds disappear. And for some reason, if I save the text in TextMate or TextEdit as UTF8 text, or even RTF, it doesn’t work, and the text becomes all garbled. However, if I copy the text into OpenOffice, or Word, and either save the file as .doc and open it in Stanza, or copy again from this program to Stanza, it works – with the linefeeds and paragraphs preserved.

So to summarize, copy from a webpage to OpenOffice, and then from OpenOffice to Stanza. Generate a Mobipocket file, and drag it over to the ebook reader. Voilà, pure Chinese ebook goodness!

(This is using Stanza 1.0.0-beta16 – they might fix the loss of linefeeds in a subsequent update. Also, I hope Bookeen comes out with a new firmware update that enables you to read simple unicode txt or rtf files directly).

Stian

Upcoming: A theoretical approach to accreditation of Open Education

January 27th, 2009

I’ve been interested in accreditation/evaluation of Open Education for a long time, and when we discussed a number of different theoretical approaches to the purpose of schooling, and the purpose of accreditation, in class, I realized that it would be very interesting to try to apply these theories to the problem of accreditation of open education. The Dean’s Graduate Conference at OISE is an annual event where graduate students get to present on their on-going research to their peers and people from the community who are interested. This year, we put together a panel on OER, and there will be two other speakers besides me. Hopefully this can begin to raise the status of OER at our school, both as an activity, but also as something that can be researched. Below is my abstract (the conference is in the beginning of May in Toronto).

A theoretical approach to accreditation of Open Education

It has always been possible to gain advanced learning outside of the formal academy, through libraries and book-clubs for example, but the open-education movement has radically increased the feasibility of informal learning. Through the proliferation of open-access journals, open-educational resources (such as MIT OpenCourseWare), collaborative authoring such as Connexions and WikiEducator, and peer-to-peer learning systems such as Peer2PeerUniversity and Wikiversity, determined students with internet access can achieve learning outcomes similar to university courses.

How can such knowledge be accredited and proven? Some of the possibilities currently being explored range from the traditional methods of challenge exams and competency-based accreditation institutions, to attempts at applying peer-based accreditation from the open source world, and portfolios. However, these attempts need to be informed by sociological theories of schooling and accreditation, and I will use human capital theory and credentialism to analyze accreditation of open education.

Stian


Upcoming Presentation: Open Education Around the World

January 25th, 2009

Education Commons is the unit at OISE that handles all the technology needs for teaching, learning and research, and they run an infrequent speaker series on topics that would interest graduate students and faculty. Last fall, Leslie Chan presented on open access (video), and at that time he very generously proposed me as a future speaker.

Together with Professor Jim Slotta, I will give an overview of open education trends around the world, and also discuss how this impacts, or could impact, Canadian students, researchers and universities. I will use some material from my previous talk in Delhi, but significantly updated and with a different twist. Here is the abstract – if you are in Toronto on that day, you are welcome to attend (but please register, so we can gauge attendance).

Open Education Around the World

The term “Open Educational Resources” (OER) was coined at a 2002 UNESCO conference, and refers to the rapidly growing phenomenon of sharing educational resources freely online. Projects have been and are being developed in several American institutions, and in almost 30 countries. These “open resources” can be accessed by the wide educational community of teachers and students in all contexts, which has the potential to radically expand access to education, but raises many questions. How can pedagogical models and online communities support this kind of learning? Are there ways of providing accreditation for new forms of informal learning?

Join us as we give an overview of the field of open education, and participate in the discussion about this new dimension that will impact Canadian higher education in coming years. We will discuss new opportunities for U of T courses, including the challenge of locating high quality, relevant materials for courses (both online and face-to-face) and of integrating these materials in order to enhance student learning. We will discuss the implications of open education for university educators and researchers, as well as other communities of learners such as those in developing nations or those who wish to organize their own program of study.

Date: Thursday, March 12, 2009
Time: 10 am - 12 pm
Place: Knowledge Innovation & Technology Lab (252 Bloor Street West, Room 3-104)
[Register for workshop]

Stian


One size does not fit all: A case study of the spread of OpenCourseWare to India, China and Japan.

January 25th, 2009

I first went to the annual Comparative and International Education Society conference last year, when it was held at Columbia University. It’s a huge event, with something like 3,000+ attendees, including a very hefty component from OISE, both professors and graduate students. It was great going there only as a participant, and getting the feel for the place. I remember feeling a bit silly though, because everyone who saw my name tag, which said “University of Toronto”, naturally assumed I was at the Ontario Institute for Studies in Education (UofT’s faculty of education), but all I could reply was “no, I’m an undergraduate, but I’ve applied”… Very wannabe. This year, luckily, I am a bona fide OISE student.

Looking at the program, there are a lot of very interesting sessions that I am looking forward to. It will also be more fun to go this year, because I am beginning to recognize different theoretical debates, the “big names” in the field, etc. For example, it will be fun to see Jürgen Schriewer duke it out with Ramirez and Meyer, after reading so much about their different conceptions of globalization of education.

I submitted an abstract based on a paper I did for a class on global governance and educational change, where I tried to apply different theories to explain why the idea of OpenCourseWare spread to some countries, and not to others. Here is the abstract:

One size does not fit all: A case study of the spread of OpenCourseWare to India, China and Japan.

Since its inception in 2002, OpenCourseWare (OCW), a movement to post university courses online under open licenses, has spread around the world. Initially proposed by MIT President Charles Vest, and supported by the William and Flora Hewlett Foundation, this concept has spread to universities in more than 30 countries in less than six years. In many cases these universities have created local and regional consortia, and in some cases they are supported by the local government. How is such a rapid dissemination possible, and what does it mean for internationalization of higher education?

This paper will consist of a case study of three Asian countries that produce OCWs. Japan’s initiative was set up through personal connections between MIT and Japanese universities, and is independent of the state. India’s program is not a formal member of the consortium, and consists of the national open university, and the Indian Institutes of Technology. China’s Ministry of Education has financially supported the creation of over 10,000 open courses. I will apply Mintrom’s theory of policy entrepreneurs and innovation diffusion to analyze the spread of this movement, and use the three cases to discuss whether OCW is a case of Ramirez and Meyer’s “global institutionalism”.

Stian


Global Concept, Local Practices: State of the Research on OCW in Chinese

January 25th, 2009

I am giving a few different presentations this year. The first one is at the OpenCourseWare Consortium/Connexions conference in Houston in the first week of February. The presentation is based on the preliminary research I am doing for my MA thesis, which will be on open educational resources in China. In China, there are roughly two categories of OERs – one is the Chinese translations of MIT and other foreign universities’ OCW materials – mostly facilitated by CORE and OOPS. In addition, the Chinese Ministry of Education is funding a large-scale production of Chinese OER (many thousand courses produced and available already), called China Quality OCW (Chinese: 精品课程).

This is a huge project involving 650 different universities, and a lot has been written and researched about it – thousands of peer-reviewed papers. It’s very difficult as an outsider to make any sense of this research quickly, so I am getting research help from a Chinese professor in distance education. The presentation will include an introduction to the Chinese OER project (which most people in North America are not very familiar with), and a demonstration of some of the resources, as well as a presentation of some of the research on OER going on in China. Here’s the abstract that was accepted for a 45-minute session:

Global Concept, Local Practices: State of the Research on OCW in Chinese

Since the MIT OCW program was started in 2002, the OCW movement and idea have spread to many different countries and linguistic contexts. Wonderful innovation, production and research is happening in different countries, and often published in different languages. For the OCW and OER movements to progress, it is imperative that we be able to learn from each other, and bridge these linguistic barriers.

China has been one of the most aggressive adopters of the OCW idea. Not only is China Open Resources for Education (CORE) coordinating efforts to translate MIT OCW into Chinese, but the Chinese Ministry of Education has since 2003 been operating a national OCW program called China Quality OpenCourseWare (精品课程). Chinese universities submit proposals, and can receive between $7,300 and $14,600 per course that is made freely available online. By 2007, there were already over 1,100 courses available online, many of these with extensive resources, and video recordings.

In addition to this large-scale production of OCW, the Chinese scholarly community has also been prolific in researching and publishing about the program. The China Academic Journals database, which provides the full text of over 7,000 Chinese scholarly articles, lists 2,137 articles with the term 精品课程 (China Quality OCW), of which 421 were published in 2008. In numbers, this is roughly equivalent to all the scholarly publications that mention OCW in English and other Latin languages in total – however, the story becomes even more impressive when initial sampling shows that most of the Chinese articles listed mention OCW in their title, and have OCW as their main topic, whereas many of the English language publications are writing about broader issues, and only refer to OCW in passing.

I am currently conducting a research project on this wealth of literature. Initially I will try to provide a broad grouping of the Chinese articles on OCW, provide statistics on number of articles in each group (for example: articles that describe the process of producing individual OCW courses, articles that present surveys on student usage, etc), and in what kind of journals these articles appear. My ultimate objective is not only to gain a good understanding of the state of research around the Chinese Quality OCW program, but also identify specific journal articles that provide theoretical models, methodological approaches or accounts of experiences that are very relevant and useful to the North-American research on OER and OCW.

In my presentation, I will give a brief overview of the history and current state of China Quality OpenCourseWare, how it is funded, produced, and used, and also how it interacts with the Chinese translations of, for example, MIT OCW. I will give an overview of the “state of research”, in terms of pertinent research questions, methodologies and relevant findings, from the Chinese context. I will also argue for a more integrated research roadmap for OCWs in North America, one that actively engages with researchers and the literature from around the world.

Very ambitious, and my research is still in its early stages, but I think it will still be interesting, and I invite people at the conference to attend. I might try to record it as well.

Stian


Cathy Casserly to head new OER program at Carnegie Foundation

January 20th, 2009

I first met Cathy Casserly at the Open Ed conference in Dalian in 2008, where she immediately welcomed me and began thinking of projects I could get involved with and ways she and Hewlett could support me. At that time, I was still an undergraduate, and barely knew anyone in the field – and her reception was a wonderful and inspiring welcome.

From when I first began to understand the development of the OER and OCW movements through David Wiley’s Open Ed course, I was struck by the incredible deftness with which the Hewlett Foundation had basically “constructed a brand new field”. I don’t want to say that there were not many other organizations and institutions contributing, nor that the idea of open education was brand new (go back and read Illich), but there seems to be no doubt that Hewlett has played a huge role. By strategic investments in a number of projects, and by pushing these to collaborate and create synergies, first in the US, and rapidly also internationally, they achieved what I believe is every charitable funding organization’s wet dream: to put a limited amount of resources in just the right place, at just the right time, and create something that is valuable, sustainable, and growing.

Now, Cathy is moving to the Carnegie Foundation for the Advancement of Teaching, to head Carnegie’s new strategic work in open education. From the press release:

As the first full-time Senior Partner appointed by Carnegie President Anthony S. Bryk, Casserly will be responsible for new program initiatives and will manage the strategic direction of Carnegie’s work in Open Educational Resources. In leading efforts to build a new field of Design, Educational Engineering and Development (D-EE-D), Carnegie provides an ideal combination of timing and place to extend the knowledge and evidence base regarding the effectiveness of innovation and Open Educational Resources for learning.

I am very excited that Carnegie has decided to focus strategically on open educational resources. I have long respected their work on teaching and learning in higher education, and read several of their publications, for example on rethinking assessment of scholarly work. From what I can gather from their webpage, they already have a program called the Knowledge Media Library, where they seem to have been playing with ideas similar to OER, and they also helped publish Opening Up Education: The Collective Advancement of Education through Open Technology, Open Content, and Open Knowledge with MIT Press.

It will be very interesting to follow as Carnegie develops its program in OER, and also to see who will replace Cathy at the Hewlett Foundation, and whether their focus will shift.

Stian
