Starting data analysis/wrangling with R: Things I wish I'd been told

October 14, 2014, [MD]

R is a very powerful open source environment for data analysis, statistics and graphing, with thousands of packages available. After my previous blog post about Likert scales and metadata in R, a few of my colleagues mentioned that they were learning R through a Coursera course on data analysis. I have been working quite intensively with R for the last half year, and thought I'd try to document and share a few tricks, and things I wish I'd known when I started out.

I don't pretend to be a statistics whiz – I don't have a strong background in math, and much of my training in statistics was of the social science "click here, then here in SPSS" kind, using flowcharts to divine which tests to run, given the kinds of variables you wanted to compare. I'm eager to learn more, but the fact is that running complex statistical functions in R is typically quite easy. The difficult part is acquiring data, cleaning it up, combining different data sources, and preparing it for analysis (they say 90% of a data scientist's job is data wrangling). Of course, knowing which tests to run, and how to analyze the results is also a challenge, but that is general statistical knowledge that applies to all statistics packages.

So here are some of my suggestions and "lessons learnt", in no particular order. Some will find the code samples scary, others will find the suggestion to use for-loops far too basic, but hopefully you will find something useful here.


A pedagogical script for idea convergence through tagging Etherpad content

October 3, 2014, [MD]

I describe a script aimed at supporting student idea convergence through tagging Etherpad content, and discuss how it went when I implemented it in a class

Background

In an earlier blog post I introduced the idea of pedagogical scripting, as well as implementing scripts in computer code. I discussed my desire to make ideas more "moveable", and support deeper work on ideas, and talked about the idea of using tags to support this. Finally, I introduced the tool tag-extract, which I developed to work on a literature review.

Context

I am currently teaching a course on Knowledge and Communication for Development (earlier open source syllabi) at the University of Toronto at Scarborough. The course applies theoretical constructs from development studies to understanding the role of technology, the internet, and knowledge in international development processes, and as implemented in specific development projects.

I began the course with two fairly tech-centric classes, because I believe that having an intuition about how the Internet works is important for subsequent discussions. I've also realized in previous years that even the "digital generation" often has very little understanding of what happens when you, for example, send a Facebook message from one computer to another.


Supporting idea convergence through pedagogical scripts and Etherpad APIs, an introduction

October 3, 2014, [MD]

We can script Etherpad to push discussion prompts out to many small groups, and then pull the information back. Using tags, we can extract information, and as a community organize the emerging folksonomy.

This blog post brings together two long-standing interests of mine. The first is how to script web tools to support small-group collaborative learning, and the other is how to support reorganization of ideas across groups.

Two meanings of the word "scripting"

There is an interesting intersection between two quite different meanings of the word scripting in the context of my work. In the CSCL literature, scripts refer to sequences of activities that support groups of students in carrying out a collaborative learning activity. They can be very simple and generic, like the well-known jigsaw script, or very content-specific. There is active ongoing research on scripting, including how external scripts interfere with internal scripts, and the dangers of over-scripting.


Easy interoperability between Ruby and Python scripts with JSON

September 22, 2014, [MD]

I recently needed to call a Ruby script from Python to do some data processing. I was generating some Etherpad-scripts in Python, and needed to restructure tags (using tag-extract) with a Ruby script. The complication was that this script does not just return a simple string or number, but a somewhat complex data structure that I needed to process further in the Python script.

Luckily, searching online I came across the idea to use JSON as the interchange format, which worked swimmingly. Given that all the information was in text format, and the data structure was not that complex (just some lists and dictionaries), JSON could cope well with the complexity, and was easier to debug, since it's a text format. If I had had other requirements, like binary data, I would have had to investigate other data formats.
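The Python side of such an exchange can be sketched roughly as follows. This is a minimal illustration, not the code from the post: the function name `call_json_script` and the stand-in child process are hypothetical, and the child is a small Python one-liner so that the sketch runs without Ruby installed. A real Ruby script would do the equivalent with `require 'json'`, reading from standard input and printing `data.to_json`.

```python
import json
import subprocess
import sys


def call_json_script(command, payload):
    """Run an external script, send it JSON on stdin, and parse JSON from its stdout."""
    result = subprocess.run(
        command,
        input=json.dumps(payload),  # serialize the Python structure to text
        capture_output=True,
        text=True,
        check=True,                 # raise if the child exits with an error
    )
    return json.loads(result.stdout)  # lists and dicts come back as Python objects


# Hypothetical stand-in for the Ruby script: reads a JSON list of tags,
# returns a dictionary containing a count and the sorted tags.
STAND_IN_CHILD = (
    "import json, sys; "
    "tags = json.load(sys.stdin); "
    "json.dump({'count': len(tags), 'tags': sorted(tags)}, sys.stdout)"
)

if __name__ == "__main__":
    # With Ruby, the command would look like ["ruby", "some_script.rb"] (assumed name).
    restructured = call_json_script([sys.executable, "-c", STAND_IN_CHILD], ["b", "a"])
    print(restructured)
```

Because everything crossing the process boundary is plain text, a failing exchange can be debugged by printing the JSON string on either side.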


Creating Anki cards for Russian Coursera MOOC with stemming and frequency lists

April 16, 2014, [MD]

In which I automatically generate Anki review cards for vocabulary based on subtitles from a Russian Coursera MOOC

Learning Russian, a 10-year project

I have been working on my Russian on and off for many years. I'm at the level where I don't feel the need for textbooks, but my understanding is not quite good enough for authentic media yet. I've experimented with reading novels in parallel (English and Russian side by side, or Swedish and Russian, as in the picture), and I've been listening to a great podcast for Russian learners.

Language learning with MOOCs

MOOCs can be a great resource for language learning, whether intentionally or not. At the University of Toronto, we found that more than 60% of learners across all MOOCs spoke English as a second language; no doubt some of them are not just viewing the foreign language of the MOOC as a barrier, but are also hoping to improve their English. To my delight, the availability of non-English language MOOCs has been growing steadily. For example, Coursera has courses in Chinese, French, Spanish, Russian, Turkish, German, Hebrew and Arabic. There are also MOOC providers focused on specific linguistic areas, for example China's XueTangX, France's Université Numérique, and Germany's iversity.


Learning idiomatic Haskell with Exercism.io

March 11, 2014, [MD]

I got my introduction to functional programming through Clojure, but lately I've been really fascinated by Haskell. I gave up on it a few times, because it can seem quite impenetrable, and very different from anything that I'm used to. But there is something about the elegance and the powerful ideas that keeps me coming back. I've read a bunch of tutorials and papers, followed blog posts, and experimented a bit with ghci and IHaskell, but I never really got off the starting block with writing my own Haskell.

How to get practice?

The projects I'm currently working on in Python are too complex and urgent for me to try to implement them in Haskell at this point, and because Haskell is so different, I found myself stymied doing even simple things, even though I'd just finished reading a complex CS paper about monads. I needed some structured tasks that were not too hard, and that gave me good feedback on how to improve.

Listening to the Functional Geekery podcast, I heard about exercism.io, which provides exercises for many popular languages. There are several similar websites in existence; perhaps the best known is Project Euler, but it focuses too much on CS/math-type problems, and is perhaps better for learning algorithms than a particular language. The way exercism.io works is that you install it as a command line application. The first time you run it, it downloads the first exercise for a number of languages.


Parsing massive clicklogs, an approach to parallel Python

March 10, 2014, [MD]

I am currently working on analyzing MOOC clicklog data for a research project. The clicklogs themselves are huge text files (from 3GB to 20GB in size), where each line represents one "event" (a mouse click, pausing a video, submitting a quiz, etc). These events are represented as a (sometimes nested) JSON structure, which doesn't really have a clear schema.

Introduction

Our goal was to run frequent-sequence analyses over these clicklogs, but to do that, we needed to process them in two rounds. Initially, we walk through the log-file, and convert each line (JSON blob) into a row in a Pandas table, which we store in an HDF5 file using pytables (see also). We convert the JSON key-values to columns, and extract information from the URL (for example /view/quiz?quiz_id=3 results in the column action receiving the value /view/quiz, and the column quiz_id, the value 3). We also do a bit of cleaning up of values, throw out some of the columns that are not useful, etc.
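The URL step can be sketched with the standard library alone. This is an illustration under assumptions rather than the project's actual code: the function names and the `url` key are hypothetical, and the real log schema is not shown in this post. Storing the rows in Pandas and HDF5 is left out.

```python
import json
from urllib.parse import parse_qsl, urlsplit


def url_to_columns(url):
    """Split a URL into an 'action' column plus one column per query parameter."""
    parts = urlsplit(url)
    row = {"action": parts.path}
    row.update(parse_qsl(parts.query))  # query values stay strings at this point
    return row


def line_to_row(line, url_key="url"):
    """Turn one JSON log line into a flat dict (url_key is an assumed field name)."""
    event = json.loads(line)
    row = {key: value for key, value in event.items() if key != url_key}
    row.update(url_to_columns(event.get(url_key, "")))
    return row


if __name__ == "__main__":
    print(line_to_row('{"user": 7, "url": "/view/quiz?quiz_id=3"}'))
```

Note that `quiz_id` comes out as the string "3"; converting such values to numbers would belong to the cleaning-up step mentioned above.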

Speeding it up

We use Python for the processing, and even with a JSON parser written in C, this process is really slow. An obvious way of speeding it up would be to parallelize the code, taking advantage of the eight cores on the server, rather than only maxing out a single core. I did not have much experience with parallel programming in general, or in Python, so this was a learning experience for me. I by no means consider myself an expert, and I might have missed some obvious things, but I still thought what we came up with might be useful to others.


GNU Parallel, quick and easy

March 9, 2014, [MD]

I haven't been blogging for months, and there are a lot of things I'd like to write about. But rather than waiting until I have the time to do that, I thought I'd just quickly capture a neat function that many might not know about.

Running shell commands on multiple files is something we do every day, usually with different wildcard patterns (like rm *, which deletes all the files in the current directory). What's sometimes not quite clear to me is when the wildcard expansion is done by the shell (i.e. rm is given hundreds of arguments), and when the pattern is passed to the command to do its own expansion.

But sometimes you want to do something on files in multiple directories. Some shells like zsh will let you do **/*.md to list all Markdown files, arbitrarily nested, whereas in Bash this only goes down one directory unless the globstar option is enabled (shopt -s globstar, available in Bash 4 and later). An alternative to this is to first generate a list of the files, and then execute the command once for each file. xargs can be used to do this (although I always somehow found its syntax a bit difficult). A great alternative is GNU Parallel, which does the same, but in parallel. Since most current computers have four, eight, or even more cores, this can speed things up significantly. (And even more so if the command needs to wait for a network connection, for example pinging hosts, downloading files with curl, etc).
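To make the pattern concrete (first build the list of files, then run a command once per file, several at a time), here is a rough sketch in Python. It is not GNU Parallel and does not use its syntax; the function name `run_on_each` and the default of four workers are assumptions made for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_on_each(root, pattern, command, workers=4):
    """Run `command <file>` once per matching file under root, several at a time."""
    # Step 1: generate the list of files, arbitrarily nested.
    files = sorted(Path(root).rglob(pattern))

    # Step 2: one external process per file. Threads are enough here,
    # because the actual work happens in the child processes.
    def run_one(path):
        done = subprocess.run(command + [str(path)], capture_output=True, text=True)
        return path, done.returncode, done.stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, files))


if __name__ == "__main__":
    # Example command: a tiny Python child that just echoes the file name.
    echo = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
    for path, code, output in run_on_each(".", "*.md", echo):
        print(code, output.strip())
```

The results come back in the same order as the file list, which makes it easy to see which files failed.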


Playing with word stemming and frequencies in Russian

October 17, 2013, [MD]

I've been diving back into Russian lately, after many years of neglect (and having never really learnt it in the first place). Much to say about my experiments with DIY parallel texts, my adventure at a Russian supermarket in suburban Toronto, etc, but first some geekery.

Серёжа

I've been amazed at the number of people writing (thoughtfully, enthusiastically, beautifully) about Russian literature in English (and other languages), for example Lizok's Bookshelf. In her last entry, Lisa mentioned a book called Seryozha (Серёжа) and a blog post she had written about it, which had attracted a number of comments from India. Color me intrigued: I was very interested to see the number of readers who had enjoyed growing up with this book in Tamil and Bengali (and there was even a Korean reader).

This made me interested in the book itself, so I had a look at the English Wikipedia page (fittingly, the only interwiki link is to Panjabi Wikipedia), looked up the author Vera Panova (Вера Панова), etc., and finally hunted down the book itself (I also tried to find online versions of the English translation, but so far failed).


Tuesday morning hack: Rename, resize and upload image, and get Markdown link

October 16, 2013, [MD]

I've written previously about a quick way of adding pictures to my blog entries. Today I was adding a picture to a pull request on GitHub, and thought it would be a nice thing to automate. In the previous example, the image is just moved to my nanoc blog folder, and then gets uploaded to my server when I sync my entire blog. In this case, I wanted the image put on my server immediately, and I also wanted a bit of control over the size, and even the filename.