March 11, 2014, [MD]
I got my introduction to functional programming through Clojure, but lately I've been fascinated by Haskell. I gave up on it a few times because it can seem quite impenetrable, and very different from anything I'm used to, but there is something about its elegance and powerful ideas that keeps me coming back. I've read a bunch of tutorials and papers, followed blog posts, and experimented a bit with ghci and IHaskell, but I never really got off the starting block with writing my own Haskell.
How to get practice?
The projects I'm currently working on in Python are too complex and urgent for me to try to reimplement in Haskell at this point, and because Haskell is so different, I found myself stymied by even simple things, even though I'd just finished reading a complex CS paper about monads. I needed structured tasks that were not too hard, and that gave me good feedback on how to improve.
Listening to the Functional Geekery podcast, I heard about exercism.io, which provides exercises for many popular languages. There are several similar websites; perhaps the best known is Project Euler, but that one focuses on CS/math-type problems, and is perhaps better for learning algorithms than a particular language. Exercism.io works as a command-line application: the first time you run it, it downloads the first exercise for a number of languages.
March 10, 2014, [MD]
I am currently working on analyzing MOOC clicklog data for a research project. The clicklogs themselves are huge text files (from 3GB to 20GB in size), where each line represents one "event" (a mouse click, pausing a video, submitting a quiz, etc). These events are represented as a (sometimes nested) JSON structure, which doesn't really have a clear schema.
Our goal was to run frequent-sequence analyses over these clicklogs, but to do that, we needed to process them in two rounds. First, we walk through the log file and convert each line (a JSON blob) into a row in a Pandas table, which we store in an HDF5 file using pytables. We convert the JSON key-values to columns and extract information from the URL: for example, /view/quiz?quiz_id=3 results in the column action receiving the value /view/quiz, and the column quiz_id the value 3. We also do a bit of cleaning up of values, throw out some of the columns that are not useful, etc.
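To make the conversion concrete, here is a minimal sketch of the per-line transformation in Python. The field names (url, timestamp) are made up for illustration; the real clicklog schema is messier and partly nested.

```python
import json
from urllib.parse import urlparse, parse_qs

import pandas as pd

def event_to_row(line):
    """Flatten one JSON event line into a flat dict of columns."""
    event = json.loads(line)
    # Keep scalar top-level fields as columns; skip nested structures.
    row = {k: v for k, v in event.items()
           if not isinstance(v, (dict, list))}
    # Split the URL into an action column plus one column per query key.
    parsed = urlparse(event.get("url", ""))
    row["action"] = parsed.path                    # e.g. "/view/quiz"
    for key, values in parse_qs(parsed.query).items():
        row[key] = values[0]                       # e.g. quiz_id -> "3"
    return row

lines = ['{"url": "/view/quiz?quiz_id=3", "timestamp": 1394000000}']
df = pd.DataFrame([event_to_row(l) for l in lines])
# df.to_hdf("clicklog.h5", "events", format="table")  # pytables-backed store
```

The real pipeline would also do the value cleanup and column pruning described above before writing to HDF5.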
Speeding it up
We use Python for the processing, and even with a JSON parser written in C, this process is really slow. An obvious way of speeding it up would be to parallelize the code, taking advantage of the eight cores on the server, rather than maxing out only a single one. I did not have much experience with parallel programming, in general or in Python specifically, so this was a learning experience for me. I by no means consider myself an expert, and I might have missed some obvious things, but I still thought what we came up with might be useful to others.
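As a rough sketch of the approach (not the actual project code), the standard-library multiprocessing module can fan the per-line parsing out to worker processes; parse_line here is a stand-in for the real JSON-to-row conversion:

```python
import json
from multiprocessing import Pool

def parse_line(line):
    # Stand-in for the CPU-bound work done on each log line.
    return json.loads(line)["timestamp"]

def parse_log(path, workers=8):
    # imap streams lines to the workers in chunks, so a 3-20GB
    # file never has to be read into memory as one big list.
    with open(path) as f, Pool(workers) as pool:
        return list(pool.imap(parse_line, f, chunksize=10000))
```

The chunksize matters: shipping lines to workers one at a time drowns the parsing in inter-process communication overhead, while large chunks amortize it.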
March 9, 2014, [MD]
I haven't been blogging for months, and there are a lot of things I'd like to write about. But rather than waiting until I have the time to do that, I thought I'd just quickly capture a neat tool that many might not know about.
Running shell commands on multiple files is something we do every day, usually with different wildcard patterns (like rm *, which deletes all the files in the current directory). What's sometimes not quite clear to me is when the wildcard expansion happens in the shell (so that, e.g., rm is given hundreds of arguments), and when the pattern is passed to the command to do its own expansion.
But sometimes you want to do something to files in multiple directories. Some shells, like zsh, will let you write **/*.md to list all Markdown files, arbitrarily nested, whereas in Bash this only goes down one directory (unless you enable the globstar option). An alternative is to first generate a list of the files, and then execute the command once for each file. xargs can be used for this (although I've always found its syntax a bit difficult). A great alternative is GNU parallel, which does the same, but in parallel. Since most current computers have four, eight, or even more cores, this can speed things up significantly -- and even more so if the command needs to wait on the network, for example pinging hosts or downloading files with curl.
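Since the surrounding scripts in a workflow like this often end up in Python anyway, here is a sketch of the same list-then-execute-in-parallel idea using only the standard library; wc -l is just a stand-in for whatever command you would hand to xargs or parallel:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_lines(path):
    # One external command per file, as parallel would run it.
    result = subprocess.run(["wc", "-l", str(path)],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.split()[0])

# rglob("*.md") matches arbitrarily nested files, like zsh's **/*.md.
files = sorted(Path(".").rglob("*.md"))

# Threads suffice here: the real work happens in the subprocesses.
with ThreadPoolExecutor(max_workers=8) as pool:
    counts = dict(zip(files, pool.map(count_lines, files)))
```

For a one-off at the command line, GNU parallel itself is of course far terser; the Python version only pays off once the per-file step grows beyond a single command.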
October 17, 2013, [MD]
I've been diving back into Russian lately, after many years of neglect (and having never really learnt it in the first place). Much to say about my experiments with DIY parallel texts, my adventure at a Russian supermarket in suburban Toronto, etc, but first some geekery.
I've been amazed at the number of people writing (thoughtfully, enthusiastically, beautifully) about Russian literature in English (and other languages), for example Lizok's Bookshelf. In her latest entry, Lisa mentioned a book called Seryozha (Серёжа) and a blog post she had written about it, which had attracted a number of comments from India. Color me intrigued: it was fascinating to see how many readers had enjoyed growing up with this book in Tamil and Bengali (there was even a Korean reader).
This made me interested in the book itself, so I had a look at the English Wikipedia page (fittingly, the only interwiki link is to the Panjabi Wikipedia), looked up the author Vera Panova (Вера Панова), and finally hunted down the book itself (I also tried to find online versions of the English translation, but so far without luck).
October 16, 2013, [MD]
I've written previously about a quick way of adding pictures to my blog entries. Today I was adding a picture to a pull request on GitHub, and thought it would be a nice thing to automate. In the previous example, the image is just moved to my nanoc blog folder, and then gets uploaded to my server when I sync my entire blog. In this case, I wanted the image put on my server immediately, and I also wanted a bit of control over the size, and even the filename.
October 2, 2013, [MD]
Part of the reason I decided to switch from WordPress to a static blogging system was that I had been writing more and more in my wiki, which runs on localhost and is synced to the server. I built a number of tools to speed up my workflow when editing the wiki, many based on simple Ruby scripts that are triggered by keyboard shortcuts and use AppleScript (through appscript-rb) to get context (which page I'm currently looking at in Chrome, etc).
Writing my blog posts as Markdown files in Sublime Text (I keep thinking I should learn Vim or Emacs, and occasionally I'll get inspired, surf people's dotfiles, do a tutorial -- but it never seems to stick), I not only have access to all the powerful editing features of ST, but I can also begin to add features with Ruby.
October 2, 2013, [MD]
I've been working a lot with questionnaire data in R lately. Some are large MOOC questionnaires with up to 20,000 respondents; others are in-class surveys with 30-250 respondents, where we have to type or scan in the response sheets. However, when it comes to data cleanup and analysis, there is not much difference between 30 and 20,000 respondents. Here's an example of a section of a recent survey we distributed to around 600 undergraduates in history and religion:
Before beginning any kind of statistical tests or modeling, I like to generate graphs of the different questions, both univariate and split across other variables. Most of the questions are typically Likert-type questions, where you have a number of options from "Very unlikely" to "Very likely", or "Not at all" to "To a large extent" -- these are all converted into ordered factors. The questions in the dataframe will then end up like this:
  X7                   X8                   X9   X10               X11        X12
1 To some extent       Not at all           <NA> Not at all        <NA>       Not at all
2 To a moderate extent To a small extent    <NA> To a small extent Not at all Not at all
3 To some extent       To a small extent    <NA> To a small extent Not at all Not at all
4 To a moderate extent To a moderate extent <NA> Not at all        Not at all Not at all
5 To a small extent    To a moderate extent <NA> Not at all        Not at all Not at all
There are many ways of plotting this data, and one of the simplest is probably a stacked barchart (taken from statistical-research.com).
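For comparison, the ordered-factor idea translates directly to Python, where pandas' ordered Categorical plays the same role as R's ordered factor; the response values below are made up to mirror the table above:

```python
import pandas as pd

# Likert labels in their natural order, as in the survey above.
levels = ["Not at all", "To a small extent", "To some extent",
          "To a moderate extent", "To a large extent"]

responses = pd.Series(["To some extent", "Not at all", "To some extent",
                       "To a small extent", None])
likert = pd.Categorical(responses, categories=levels, ordered=True)

# With sort=False the counts come back in level order (not by
# frequency), including zero-count levels -- exactly what a
# Likert barchart needs. Missing answers are dropped.
counts = pd.Series(likert).value_counts(sort=False)
# counts.plot.barh()  # one bar per level, via matplotlib
```

Keeping the levels in an explicit ordered scale also means comparisons like likert < "To some extent" work, which is handy for collapsing categories later.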
September 29, 2013, [MD]
Julia is a newish programming language developed at MIT, targeting data analysis and scientific computing. You can read the creators of the language describe why they created Julia, or see an example of linear regressions. I've been spending a lot of time programming R in RStudio, which is a great IDE for R (although it often crashes, unfortunately), and have really come to appreciate the power of the ecosystem. However, as many will agree (see the R Inferno), the underlying language leaves much to be desired: it has a lot of quirks, a weird object-orientation system "bolted on", and it is slow (many packages overcome this with C extensions, but that makes hacking on them much harder).
One obvious contender is Python with NumPy and SciPy, the amazing IPython Notebook (great video), and Pandas. There is clearly a lot of momentum in this community, with books being published, tutorials appearing, etc. I also see interesting uses of embedded Python for scripting, for example on Quantopian.
Another is Julia, which I mentioned above. It's very early days -- the language is developing so quickly that you are encouraged to compile from git and update frequently -- but it seems to have a very solid foundation, and an incredibly welcoming community. Although I really don't have time these days, I decided to sit down and play with it a bit. I especially wanted to look at some of the large tables I'm working with in the context of MOOC research, and see how functional its DataFrames are.
September 29, 2013, [MD]
Some readers might have noticed that my blog, wiki, and other services were down for almost a month.
I have been hosting with Site 5 for a number of years, and have always been quite happy with them -- I even recommended them to several users of Researchr. They provided lots of space, databases, subdomains, etc. However, I suddenly received an email from them telling me I was using more than my fair share of resources, and needed to move up to a Virtual Private Server, raising my cost from around $6/month to around $70!
I obviously understand that the reason hosting was so cheap was that I was sharing resources with a number of other users. The frustrating part, however, was that there didn't seem to be anything I could do to mitigate the problem -- I didn't have access to good statistics showing where all the extra CPU was being burnt. After all, most of the hits I received went to a static blog and a wiki, and I didn't have significantly more traffic than before.
I removed a number of older scripts and installations, but it didn't seem to help, and I finally had to shut the site down quickly to avoid paying $70/month. I hadn't looked for a new web host in many years, and wanted to spend some time looking carefully through the options. That seemed like a big task, and because those weeks were very busy, I ended up putting it off for a while.