Creating Anki cards for Russian Coursera MOOC with stemming and frequency lists

April 16, 2014, [MD]

In which I automatically generate Anki review cards for vocabulary based on subtitles from a Russian Coursera MOOC

Learning Russian, a 10-year project

I have been working on my Russian on and off for many years. I'm at the level where I don't feel the need for textbooks, but my understanding is not quite good enough for authentic media yet. I've experimented with reading novels in parallel (English and Russian side by side, or Swedish and Russian, like in the picture), and I've been listening to a great podcast for Russian learners.

Language learning with MOOCs

MOOCs can be a great resource for language learning, whether intentionally or not. At the University of Toronto, we found that more than 60% of learners across all MOOCs spoke English as a second language. No doubt some of them do not just see the foreign language of the MOOC as a barrier, but are also hoping to improve their English. To my delight, the availability of non-English language MOOCs has been growing steadily. For example, Coursera has courses in Chinese, French, Spanish, Russian, Turkish, German, Hebrew and Arabic. There are also MOOC providers focused on specific linguistic areas, for example China's XueTangX, France's Université Numérique, and Germany's iversity.

Learning idiomatic Haskell with

March 11, 2014, [MD]

I got my introduction to functional programming through Clojure, but lately I've been really fascinated by Haskell. I gave up on it a few times, because it can seem quite impenetrable, and very different from anything that I'm used to. But there is something about the elegance, and powerful ideas that keeps me coming back. I've read a bunch of tutorials and papers, followed blog posts, and experimented a bit with ghci and IHaskell, but I never really got off the starting block with writing my own Haskell.

How to get practice?

The projects I'm currently working on in Python are too complex and urgent for me to try to implement them in Haskell at this point, and because Haskell is so different, I found myself stymied doing even simple things, even though I'd just finished reading a complex CS paper about monads. I needed some structured tasks that were not too hard, and that gave me good feedback on how to improve.

Listening to the Functional Geekery podcast, I heard about, which provides exercises for many popular languages. There are several similar websites in existence. Perhaps the most well known is Project Euler, but it focuses too much on CS/math-type problems, and is perhaps better for learning algorithms than a particular language. The way works, is that you install it as a command line application. The first time you run it, it downloads the first exercise for a number of languages.

Parsing massive clicklogs, an approach to parallel Python

March 10, 2014, [MD]

I am currently working on analyzing MOOC clicklog data for a research project. The clicklogs themselves are huge text files (from 3GB to 20GB in size), where each line represents one "event" (a mouse click, pausing a video, submitting a quiz, etc). These events are represented as a (sometimes nested) JSON structure, which doesn't really have a clear schema.


Our goal was to run frequent-sequence analyses over these clicklogs, but to do that, we needed to process them in two rounds. Initially, we walk through the log file and convert each line (JSON blob) into a row in a Pandas table, which we store in an HDF5 file using pytables (see also). We convert the JSON key-values to columns and extract information from the URL (for example, /view/quiz?quiz_id=3 results in the column action receiving the value /view/quiz, and the column quiz_id the value 3). We also do a bit of cleaning up of values, throw out some of the columns that are not useful, etc.
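As a rough illustration of that first round, here is a minimal sketch that turns one log line into a flat row. It is not the code we actually ran: the field name `url`, the dotted names for nested keys, and the helper names `flatten` and `event_to_row` are all assumptions made for the example, and the row is a plain dict rather than a Pandas row.

```python
import json
from urllib.parse import parse_qsl, urlsplit


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys (assumed naming scheme)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def event_to_row(line, url_key="url"):
    """Turn one JSON log line into a flat row; `url_key` is a hypothetical field name."""
    row = flatten(json.loads(line))
    url = row.pop(url_key, None)
    if url is not None:
        parts = urlsplit(url)
        # The path becomes the action, e.g. /view/quiz
        row["action"] = parts.path
        # Each query parameter becomes its own column (values stay strings).
        row.update(dict(parse_qsl(parts.query)))
    return row


line = '{"user": {"id": 17}, "url": "/view/quiz?quiz_id=3"}'
print(event_to_row(line))
```

Rows built this way could then be collected into a list and handed to Pandas in batches. Note that query values come out as strings, so a column like quiz_id would still need a type conversion during cleanup.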

Speeding it up

We use Python for the processing, and even with a JSON parser written in C, this process is really slow. An obvious way of speeding it up would be to parallelize the code, taking advantage of the eight cores on the server, rather than only maxing out a single core. I did not have much experience with parallel programming in general, or in Python, so this was a learning experience for me. I by no means consider myself an expert, and I might have missed some obvious things, but I still thought what we came up with might be useful to others.
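To give a flavour of what parallelizing could look like, here is one possible sketch using the standard library's multiprocessing.Pool. It is not necessarily the approach we ended up with. The field name `event_type`, the batch size, and the function names are assumptions for illustration, and the worker only tallies one field instead of building full rows.

```python
import json
from collections import Counter
from itertools import islice
from multiprocessing import Pool


def count_event_types(lines, key="event_type"):
    """Worker: parse a batch of JSON lines and tally one (hypothetical) field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line).get(key, "unknown")] += 1
    return counts


def batches(iterable, size):
    """Yield lists of up to `size` items, so each task is worth its overhead."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def count_parallel(path, workers=8, batch_size=50_000):
    """Fan batches of lines out to worker processes and merge the tallies."""
    total = Counter()
    with open(path, encoding="utf-8") as handle, Pool(workers) as pool:
        results = pool.imap_unordered(count_event_types, batches(handle, batch_size))
        for partial in results:
            total.update(partial)
    return total
```

Call count_parallel from under an `if __name__ == "__main__":` guard, so that worker processes can import the module safely. One caveat with this design: the parent process still reads every line and ships it to the workers, and the pool may read ahead of them, so for a 20GB file you would probably want to bound how many batches are in flight.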

GNU Parallel, quick and easy

March 9, 2014, [MD]

I haven't been blogging for months, and there are a lot of things I'd like to write about. But rather than waiting until I have the time to do that, I thought I'd just quickly capture a neat function that many might not know about.

Running shell commands on multiple files is something we do every day, usually with different wildcard patterns (like rm *, which deletes all the files in the current directory). What's sometimes not quite clear to me is when the wildcard expansion is done by the shell (i.e. rm is given hundreds of arguments), and when the pattern is passed to the command to do its own expansion.

But sometimes you want to do something on files in multiple directories. Some shells like zsh will let you do **/*.md to list all Markdown files, arbitrarily nested, whereas in Bash, this only goes down one directory. An alternative to this is to first generate a list of the files, and then execute the command once for each file. xargs can be used to do this (although I always somehow found its syntax a bit difficult). A great alternative is GNU Parallel, which does the same, but in parallel. Since most current computers have four, eight, or even more cores, this can speed things up significantly. (And even more so if the command needs to wait for a network connection, for example pinging hosts, downloading files with curl, etc).
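For comparison, the same pattern (build the list of files first, then run one command per file, several at a time) can be sketched in plain Python. This is only an illustration of the idea, not a replacement for GNU Parallel; the function names and the default of four workers are assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_files(root, pattern="*.md"):
    """Recursively list matching files under root, like **/*.md in zsh."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


def run_on_each(command, files, workers=4):
    """Run `command <file>` once per file, several at a time; return exit codes."""

    def run(path):
        # The command is a list of arguments; the file path is appended last.
        result = subprocess.run([*command, str(path)], capture_output=True)
        return result.returncode

    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, files))
```

For example, `run_on_each(["wc", "-l"], find_files("."))` would run wc -l once for every Markdown file below the current directory, and return the exit codes in the same order as the file list.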

Playing with word stemming and frequencies in Russian

October 18, 2013, [MD]

I've been diving back into Russian lately, after many years of neglect (and having never really learnt it in the first place). Much to say about my experiments with DIY parallel texts, my adventure at a Russian supermarket in suburban Toronto, etc, but first some geekery.


I've been amazed at the number of people writing (thoughtfully, enthusiastically, beautifully) about Russian literature in English (and other languages), for example Lizok's Bookshelf. In her last entry, Lisa mentioned a book called Seryozha (Серёжа) and a blog post she had written about it, which had attracted a number of comments from India. Color me intrigued: I was very interested to see the number of readers who had enjoyed growing up with this book in Tamil and Bengali (and there was even a Korean reader).

This made me interested in the book itself, so I had a look at the English Wikipedia page (fittingly, the only interwiki link is to Panjabi Wikipedia), looked up the author Vera Panova (Вера Панова), etc., and finally hunted down the book itself (I also tried to find online versions of the English translation, but so far failed).

Tuesday morning hack: Rename, resize and upload image, and get Markdown link

October 16, 2013, [MD]

I've written previously about a quick way of adding pictures to my blog entries. Today I was adding a picture to a pull request on GitHub, and thought it would be a nice thing to automate. In the previous example, the image is just moved to my nanoc blog folder, and then gets uploaded to my server when I sync my entire blog. In this case, I wanted the image put on my server immediately, and I also wanted a bit of control over the size, and even the filename.

Blogging with Nanoc: Easy workflow for embedding images

October 3, 2013, [MD]

Part of the reason I decided to switch from WordPress to a static blogging system was that I had been writing more and more in my wiki, which runs on localhost and is synced to the server. I built a number of tools to speed up my workflow when editing the wiki, many based on simple Ruby-scripts that are triggered by keyboard shortcuts, and use AppleScript (through appscript-rb) to get context (which page I'm currently looking at in Chrome, etc).

Now that I am beginning to write my blog posts as Markdown files in Sublime Text (I keep thinking I should learn Vim or Emacs, and occasionally I'll get inspired, surf people's dot files, do a tutorial -- but it never seems to stick), I not only have access to all the powerful editing features of ST, but I can also begin to add features with Ruby.

Likert-graphs in R, embedding metadata for easier plotting

October 2, 2013, [MD]

I've been working a lot with questionnaire data in R lately. Some are large MOOC-questionnaires with up to 20,000 respondents, others are in-class surveys with 30-250 respondents, where we have to type or scan in the response sheets. However, once it comes to data cleanup and analysis, there is not much difference between 30 and 20,000 respondents. Here's an example of a section of a recent survey we distributed to around 600 undergraduates in history and religion:

Before beginning to do any kind of statistical tests or modeling, I like to generate graphs of the different questions, both univariate, and split across other variables. Most of the questions are typically Likert-type questions, where you have a number of options from "Very unlikely" to "Very likely", or "Not at all" to "To a large extent" -- these are all converted into ordered factors. The questions in the dataframe will then end up like this:

> db[1:5,9:14]
                    X7                   X8   X9               X10        X11        X12
1       To some extent           Not at all <NA>        Not at all       <NA> Not at all
2 To a moderate extent    To a small extent <NA> To a small extent Not at all Not at all
3       To some extent    To a small extent <NA> To a small extent Not at all Not at all
4 To a moderate extent To a moderate extent <NA>        Not at all Not at all Not at all
5    To a small extent To a moderate extent <NA>        Not at all Not at all Not at all

There are many ways of plotting this data, and one of the simplest would probably be a stacked barchart (taken from

Fun with Julia, metaprogramming and Sublime Text

September 29, 2013, [MD]

Julia is a newish programming language developed at MIT, targeting data analysis and scientific computing. You can read the creators of the language describe why they created Julia, or see an example of linear regressions. I've been spending a lot of time programming R in RStudio, which is a great IDE for R (although it often crashes, unfortunately), and have really come to appreciate the power of the programming ecosystem. However, as many will agree (see the R Inferno), the underlying language that R is built on leaves much to be desired: it has a lot of quirks and a weird object orientation system "bolted on", and it is also slow (many packages overcome this by including C extensions, but that makes hacking on them much harder).

One obvious contender is Python with NumPy and SciPy, and the amazing IPython Notebook (great video) and Pandas. There is clearly a lot of momentum in this community, with books being published, tutorials etc. I also see interesting uses of embedded Python for scripting on for example Quantopian.

Another is Julia, which I mentioned above. It's very early days: the language is developing so quickly that you are encouraged to compile from git and update frequently, but it seems to have a very solid foundation, and an incredibly welcoming community. Although I really don't have time these days, I decided to sit down and play with it a bit -- I especially wanted to look at some of the large tables I'm working on in the context of MOOC research, and see how functional their DataFrames are.