March 9, 2014, [MD]
I haven't been blogging for months, and there are a lot of things I'd like to write about. But rather than waiting until I have the time to do that, I thought I'd just quickly capture a neat function that many might not know about.
Running shell commands on multiple files is something we do every day, usually with different wildcard patterns (like rm *, which deletes all the non-hidden files in the current directory). What's sometimes not quite clear to me is when the wildcard expansion is done by the shell (i.e. rm is given hundreds of arguments), and when the pattern is passed to the command to do its own expansion.
But sometimes you want to do something on files in multiple directories. Some shells like zsh will let you do **/*.md to list all Markdown files, arbitrarily nested, whereas in Bash, this only goes down one directory. An alternative to this is to first generate a list of the files, and then execute the command once for each file.
xargs can be used to do this (although I always somehow found its syntax a bit difficult). A great alternative is GNU parallel, which does the same, but in parallel. Since most current computers have four, eight, or even more cores, this can speed things up significantly. (And even more so if the command needs to wait for a network connection, for example pinging hosts, downloading files with curl, etc.)
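The "list the files first, then run the command once per file, several at a time" idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not how xargs or GNU parallel work internally; the helper names find_files and run_on_each, and the default of four jobs, are made up for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_files(root, pattern="*.md"):
    """Return every file under root matching pattern, at any depth (like zsh's **/*.md)."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


def run_on_each(command, files, jobs=4):
    """Run command once per file, at most `jobs` at a time.

    `command` is a list such as ["wc", "-c"]; the file path is appended as the
    last argument. Results come back in the same order as `files`.
    """
    def run_one(path):
        return subprocess.run(command + [str(path)], capture_output=True, text=True)

    # Threads are enough here: the real work happens in the child processes,
    # so each thread mostly just waits for its command to finish.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, files))
```

Calling run_on_each(["wc", "-c"], find_files(".")) would, for instance, run wc -c separately on each Markdown file below the current directory; each returned result carries the command's stdout and return code.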
October 17, 2013, [MD]
I've been diving back into Russian lately, after many years of neglect (and having never really learnt it in the first place). Much to say about my experiments with DIY parallel texts, my adventure at a Russian supermarket in suburban Toronto, etc, but first some geekery.
I've been amazed at the number of people writing (thoughtfully, enthusiastically, beautifully) about Russian literature in English (and other languages), for example Lizok's Bookshelf. In her last entry, Lisa mentioned a book called Seryozha (Серёжа) and a blog post she had written about it, which had attracted a number of comments from India. Color me intrigued: I was very interested to see the number of readers who had enjoyed growing up with this book in Tamil and Bengali (and there was even a Korean reader).
This made me interested in the book itself, so I had a look at the English Wikipedia page (fittingly, the only interwiki link is to Panjabi Wikipedia), looked up the author, Vera Panova (Вера Панова), and finally hunted down the book itself (I also tried to find online versions of the English translation, but so far failed).
October 16, 2013, [MD]
I've written previously about a quick way of adding pictures to my blog entries. Today I was adding a picture to a pull request on GitHub, and thought it would be a nice thing to automate. In the previous example, the image is just moved to my nanoc blog folder, and then gets uploaded to my server when I sync my entire blog. In this case, I wanted the image put on my server immediately, and I also wanted a bit of control over the size, and even the filename.
October 2, 2013, [MD]
Part of the reason I decided to switch from WordPress to a static blogging system was that I had been writing more and more in my wiki, which runs on localhost and is synced to the server. I built a number of tools to speed up my workflow when editing the wiki, many based on simple Ruby scripts that are triggered by keyboard shortcuts, and use AppleScript (through appscript-rb) to get context (which page I'm currently looking at in Chrome, etc).
Now that I'm beginning to write my blog posts as Markdown files in Sublime Text (I keep thinking I should learn Vim or Emacs, and occasionally I'll get inspired, surf people's dot files, do a tutorial -- but it never seems to stick), I not only have access to all the powerful editing features of ST, but I can also begin to add features with Ruby.
October 2, 2013, [MD]
I've been working a lot with questionnaire data in R lately. Some are large MOOC-questionnaires with up to 20,000 respondents, others are in-class surveys with 30-250 respondents, where we have to type or scan in the response sheets. However, once it comes to data cleanup and analysis, there is not much difference between 30 and 20,000 respondents. Here's an example of a section of a recent survey we distributed to around 600 undergraduates in history and religion:
Before beginning to do any kind of statistical tests or modeling, I like to generate graphs of the different questions, both univariate, and split across other variables. Most of the questions are typically Likert-type questions, where you have a number of options from "Very unlikely" to "Very likely", or "Not at all" to "To a large extent" -- these are all converted into ordered factors. The questions in the dataframe will then end up like this:
                    X7                   X8   X9               X10        X11        X12
1       To some extent           Not at all <NA>        Not at all       <NA> Not at all
2 To a moderate extent    To a small extent <NA> To a small extent Not at all Not at all
3       To some extent    To a small extent <NA> To a small extent Not at all Not at all
4 To a moderate extent To a moderate extent <NA>        Not at all Not at all Not at all
5    To a small extent To a moderate extent <NA>        Not at all Not at all Not at all
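To make the ordered-factor idea concrete, here is a small sketch in plain Python (not the R code used for the survey). It fixes an order for the response labels, then counts answers per level in that order, which is the table a stacked bar chart would be drawn from. The names LEVELS, tabulate and proportions are invented for this example, and the position of "To some extent" relative to "To a moderate extent" is an assumption, since the scale's full order isn't shown here.

```python
from collections import Counter

# Assumed ordering of the response scale. The post shows these labels but not
# their full order, so where "To some extent" sits is a guess.
LEVELS = [
    "Not at all",
    "To a small extent",
    "To some extent",
    "To a moderate extent",
    "To a large extent",
]


def tabulate(responses, levels=LEVELS):
    """Count answers per level, in scale order, ignoring missing values (None)."""
    counts = Counter(r for r in responses if r is not None)
    unknown = set(counts) - set(levels)
    if unknown:
        raise ValueError(f"unexpected responses: {sorted(unknown)}")
    return [(level, counts[level]) for level in levels]


def proportions(responses, levels=LEVELS):
    """Share of non-missing answers per level: the segments of one stacked bar."""
    table = tabulate(responses, levels)
    total = sum(n for _, n in table)
    return [(level, n / total if total else 0.0) for level, n in table]


# Columns X7 and X11 from the five rows shown above (None stands in for <NA>).
x7 = ["To some extent", "To a moderate extent", "To some extent",
      "To a moderate extent", "To a small extent"]
x11 = [None, "Not at all", "Not at all", "Not at all", "Not at all"]
```

Keeping the levels in one explicit list means that options nobody chose still show up with a count of zero, so every question gets the same segments in the same order.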
There are many ways of plotting this data, and one of the simplest would probably be a stacked bar chart (taken from statistical-research.com):
September 29, 2013, [MD]
Julia is a newish programming language developed at MIT, targeting data analysis and scientific computing. You can read the creators of the language describe why they created Julia, or see an example of linear regressions. I've been spending a lot of time programming R in RStudio, which is a great IDE for R (although it often crashes, unfortunately), and have really come to appreciate the power of the programming ecosystem. However, as many will agree (see the R Inferno), the underlying language that R is built on leaves much to be desired: it has a lot of quirks and a weird object orientation system "bolted on", and it is also slow (which many packages work around by including C extensions, but that makes hacking on it much harder).
One obvious contender is Python with NumPy and SciPy, and the amazing IPython Notebook (great video) and Pandas. There is clearly a lot of momentum in this community, with books being published, tutorials, etc. I also see interesting uses of embedded Python for scripting on, for example, Quantopian.
Another is Julia, which I mentioned above. It's very early days: the language is developing so quickly that you are encouraged to compile from git and update frequently, but it seems to have a very solid foundation, and an incredibly welcoming community. Although I really don't have time these days, I decided to sit down and play with it a bit -- I especially wanted to look at some of the large tables I'm working on in the context of MOOC research, and see how functional their DataFrames are.
September 29, 2013, [MD]
Some readers might have noticed that my blog, wiki, and other services were down for almost a month.
I have been hosting with Site 5 for a number of years, and have always been quite happy with them - even recommending them to several users of Researchr. They provided lots of space, databases, subdomains etc. However, I suddenly received an email from them telling me I was using more than my fair share of resources, and needed to move up to a Virtual Private Host, raising my cost from around $6/month to around $70!
I obviously understand that the reason hosting was so cheap was that I was sharing resources with a number of other users. However, the frustrating part was that there didn't seem to be anything I could do to mitigate the problem - I didn't have access to good statistics showing where all the extra CPU was burnt. After all, most of the hits I received went to a static blog and a wiki, and I didn't have significantly more traffic than I'd had before.
I removed a number of older scripts and installations, but it didn't seem to work, and I finally had to quickly shut down to avoid having to pay $70/month. I hadn't looked for a new webhost for many years, and wanted to spend some time looking carefully through the options. That seemed like a big task, and because those weeks were very busy, I ended up putting it off for a while.
May 21, 2013, [MD]
At Beyond the PDF 2 in Amsterdam, the organizers announced a round of microgrants and asked us to use the hashtag #1k to apply for them. Inspired by Martin Fenner's blog post, and my own experiments with scholarly authoring, I posted the following:
After a vote, this "project" was chosen as one of the winners. Martin Fenner graciously agreed to co-organize it with me, and on the 8th of June we're organizing a workshop/meeting/unconference/hackathon on Scholarly Markdown in San Francisco. Sign up on Eventbrite, and add your information to the wiki. We're also using the wiki to collect ideas (we'd love to hear from you even if you can't make it in person!).
Here's the description from the Eventbrite page:
Please join us for a full day of presentations, discussions and coding around Markdown for scholarly content. Some of the potential outcomes of the workshop include:
- notes from discussion about the suitability of Markdown for scientific authoring (different stages, disciplines, etc)
- comprehensive list of tools/initiatives
- notes from discussion about collaboration/synergy between tools
- list of interesting examples/showcase (GH repositories etc)
- list of "barriers", "obstacles" etc
- notes from discussion about way forward, applying for grants, etc
The event is supported by a 1K Force11 Challenge prize.
April 6, 2013, [MD]
As many times before (most recently, Beyond the PDF 2 in Amsterdam), I've archived my tweets from the recent Coursera conference, cleaning them up just a little bit (I took out most retweets, but included a few). See also my impressions from the conference.
Before the conference
- Will be in Philadelphia for #CourseraConfAtPenn Fri+Sat. Anybody wants to meet to talk about #oa, #btpdf2, learning, #OER?
- The Coursera Partners' Conference gets underway tomorrow, April 5th, 2013. We're stoked! #CourseraConfAtPenn
- If you’re hashtag is 18 characters, you’re doing it wrong. Looking at you, #CourseraConfAtPenn.
- @Akibaedx Great to see that edX and Coursera are hanging out :) Very interested in research on flipped (PhD stud from UofT)