Unique publication IDs in open scholar search

March 24, 2013

At Beyond the PDF 2, I gave a Vision talk about "An open alternative to Google Scholar", and since then a group of us have begun discussing how we can make this happen. There isn't yet a fixed venue for this discussion, but we can use the broader hashtag #scholrev (see Peter Murray-Rust's blog post) to coordinate.

Researchr and "Scrobblr"

Many of my thoughts on this topic came out of my work with Researchr, and my wish to have a system with an open API, which would let me integrate search, metadata lookup, etc. with Researchr. I also wanted unique IDs for publications, so that I could link my notes about an article with notes somebody else took about the same article. Together with Ryan Muller, I began work on "Scrobblr", a social hub for reading. The idea was that it could work like Scrobbler for music, where what you listen to is automatically submitted and shared with your friends. In the same way, the papers you read are automatically submitted, and other people in your group can see what you are reading and automatically import your citations (screencast demo).

Although we never got there, we thought a lot about how this could be expanded to a much larger social hub - sharing bibliography lists from different Researchr users, auto-suggesting "you've been reading many of these papers lately, you should get in touch with this other student, who is reading a lot of similar stuff", automatic PDF hash-based lookups (screencast demo), etc. I began writing up design ideas in a document that was never finished, but many of these are relevant to the current ideas about an Open Scholar Search, so I'll post some of them here.

Unique IDs for publications

There are a lot of reasons why we'd want unique IDs for publications: making citation lists unambiguous and easier to parse, enabling rich citations in non-traditional media (wikis, blogs), etc. Right now CrossRef DOIs are the closest we come, and they already show how difficult it is to push for the usage of such identifiers (ORCID will have a similar challenge). It would be great if it were possible to build on the work CrossRef has done, and I've recently become aware of how many interesting innovations the team is coming out with (slides, blog).

However, there are a few barriers. The first is economic - as far as I can see, it costs a minimum of $330 for a publisher to participate. This might not seem like much, but I know of very few independent OA journals that have DOIs. (There might also be large technical implementation costs; I don't know.) Worse than this, however, is that only the publisher can submit metadata. This means that we have to rely on publishers to submit correct metadata (and although they often do, they don't always). It also means that we will never get metadata or identifiers for publishers who don't participate or who no longer exist, or for scholarly material that wasn't published as journal articles (we might want to cite films, archive items, etc., and have unique identifiers for them as well).

Below I discuss how the identifier might be formatted (from Scrobblr notes). This is also related to who can assign an identifier. In the case of CrossRef DOIs, identifiers are assigned by publishers, who get their own "name spaces" (similar to ISBN, DNS or IP numbers). In the case of ORCID, who share their deliberations about identifiers, numbers are assigned centrally, and are simply arbitrary numbers with a specific formatting. This will probably end up being the case with articles in open scholar search as well, but below I play with the idea of using something more semantic - after all, it's a lot easier to give a hat tip to @houshuang than to http://orcid.org/0000-0002-2632-8448, even though both are equally unique. And it is a fascinating idea to be able to write [@scardamalia2006knowledge] in any blog or wiki, and have it work...

Unique IDs

(From "Ideas for Scrobblr":)

Each publication should be assigned a unique ID (UID). This is inspired by the integration of many different applications that is enabled by the concept of a citekey in BibTeX. APIs should enable users to submit UID and receive metadata for any publication (whether in JSON or BibTeX, whether strictly citation info or also social info about tags, other users, links etc). There should also be a number of ways to determine a publication’s UID through various lookups.
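To make the lookup idea concrete, here is a minimal sketch of what "submit a UID, get metadata back" could look like. Everything in it is an assumption for illustration: the `lookup` and `to_bibtex` functions, the in-memory `PUBLICATIONS` store, and the placeholder record are hypothetical, and none of this is an existing Scrobblr API.

```python
import json

# Hypothetical in-memory store standing in for a real database.
# The record below is a made-up placeholder, not a real publication.
PUBLICATIONS = {
    "doe2010example": {
        "type": "article",
        "author": "Doe, Jane",
        "title": "An Example Article",
        "journal": "Journal of Examples",
        "year": "2010",
    },
}


def to_bibtex(uid, record):
    """Render one record as a BibTeX entry, using the UID as the citekey."""
    fields = [f"  {key} = {{{value}}}" for key, value in record.items() if key != "type"]
    return f"@{record['type']}{{{uid},\n" + ",\n".join(fields) + "\n}"


def lookup(uid, fmt="json"):
    """Return metadata for `uid` as a JSON string or a BibTeX entry."""
    record = PUBLICATIONS.get(uid)
    if record is None:
        raise KeyError(f"unknown UID: {uid}")
    if fmt == "json":
        return json.dumps({"uid": uid, **record}, sort_keys=True)
    if fmt == "bibtex":
        return to_bibtex(uid, record)
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    print(lookup("doe2010example", fmt="bibtex"))
```

A real service would presumably put this behind HTTP and add the social information (tags, other users, links) to the JSON response, but the core contract is the same: one stable UID in, one record out, in whichever format the client asks for.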

Format

There are (roughly) two choices for the format of a UID. The first would be a randomly generated (or sequential) ID with no semantic meaning, whether with numbers or letters etc. The second would be the citekey format which Researchr currently uses. The advantage of this is that it is familiar to users (of LaTeX / Researchr etc), and immediately conveys some minimal information about a citation. Through use, certain frequent citations might even be recalled actively or passively. Certainly, it is much easier to reorder three publications cited in a blog post using citekeys ("I’ll put the scardamalia2006knowledge first, and then mention johnson2000corruption") than using random IDs ("See for example 3093049304955 and 88585").

However, there are a few challenges with using the citekey format. The first is generation and the second is collisions. Although the general principle is well understood (last name of first author + year + first word of title) there are a number of permutations, for example

This results in citekeys generated by Researchr or other tools (Google Scholar) being different from those generated by Scrobblr. Some of these choices we can just define arbitrarily, but we might want a decent algorithm for handling multi-word last names - perhaps joining the words of the last name without spaces.
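As a rough sketch of the generation rule (last name of first author + year + first word of title), something like the following could work. The function name `make_citekey`, the stop-word list, and the decision to strip accents and skip leading stop words are my assumptions here, exactly the kind of permutations that would have to be pinned down; they are not taken from Researchr or any other tool.

```python
import re
import unicodedata

# Assumed stop-word list; which words to skip is one of the choices
# that would have to be defined arbitrarily.
STOP_WORDS = {"a", "an", "the", "on", "of", "in", "for", "and", "to"}


def _ascii_lower(text):
    """Lowercase the text and drop accents and other non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def make_citekey(first_author_last_name, year, title):
    """Build a citekey: last name + year + first significant title word."""
    # Multi-word last names are joined without spaces:
    # "van der Meer" becomes "vandermeer".
    name = re.sub(r"[^a-z]", "", _ascii_lower(first_author_last_name))
    words = re.findall(r"[a-z0-9]+", _ascii_lower(title))
    significant = [word for word in words if word not in STOP_WORDS]
    if significant:
        first_word = significant[0]
    else:
        # Title consists only of stop words (or is empty).
        first_word = words[0] if words else ""
    return f"{name}{year}{first_word}"


if __name__ == "__main__":
    # Illustrative inputs only.
    print(make_citekey("Scardamalia", 2006, "Knowledge Building"))
    print(make_citekey("van der Meer", 1999, "The Example Title"))
```

The point of writing it down is less the code itself than the fact that every client would have to implement exactly the same rules, otherwise the same metadata yields different citekeys.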

Given that we can thus generate nice citekeys from submitted metadata (much of which won’t even have a citekey, or have a citekey in a totally different format), we encounter the problem that the citekey in the database might differ from the citekey in the user’s local system. One approach would be to use Researchr or other plugins to “harmonize” these (i.e., automatically modify citekeys on the user’s end) - this would have to be done early in the import process, because everything locally is tied to the citekey (PDF name, wiki pages). (Of course, in the future Scrobblr will be the first place we go to download papers in our fields anyway, so theoretically we won’t even have this problem :)) Or we could just accept that there will be a discrepancy here.

The second problem, however, will be collisions. It is likely that there will be cases of several papers generating the same citekey. Again we’ll need a way of resolving this. A simple way would be to add “b” to the year or something like that - not very elegant, since it will look kind of “random” when viewed out of context. Another approach could have been to go back and give both articles a longer citekey to avoid collision (perhaps the first two words of the title). However, given that a citekey once assigned should be absolute, this is impossible.
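The "add a letter to the year" approach can be sketched as below. This is only one possible reading of the idea: the function `assign_citekey`, the key pattern, and the use of a plain set to track assigned keys are assumptions made for illustration. Note that keys already handed out are never touched; only the newcomer gets a suffix.

```python
import re
import string

# Assumed citekey shape: letters, a four-digit year, then letters/digits.
KEY_PATTERN = re.compile(r"^([a-z]+)(\d{4})([a-z0-9]*)$")


def assign_citekey(base_key, taken):
    """Pick a free citekey for a new publication and record it in `taken`.

    `taken` is the set of citekeys that have already been assigned.
    """
    if base_key not in taken:
        taken.add(base_key)
        return base_key
    match = KEY_PATTERN.match(base_key)
    if match is None:
        raise ValueError(f"unexpected citekey format: {base_key!r}")
    name, year, word = match.groups()
    # Try "b", "c", ... after the year until a free key is found.
    for letter in string.ascii_lowercase[1:]:
        candidate = f"{name}{year}{letter}{word}"
        if candidate not in taken:
            taken.add(candidate)
            return candidate
    raise RuntimeError(f"ran out of suffixes for {base_key!r}")


if __name__ == "__main__":
    assigned = set()
    for _ in range(3):
        # Hypothetical key, submitted three times by different papers.
        print(assign_citekey("smith2000learning", assigned))
```

Running this shows why the result is not very elegant: the second and third papers end up with keys where the suffix letter runs into the title word.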

Given that we can solve all of these things, the final concern is user confusion about local citekeys and Scrobblr citekeys, given that they look so similar. One way to mitigate this in practice would be to come up with some notation for linking to citekeys which specified that they were Scrobblr citekeys. Currently we are using [@citekey] for citations, but this choice is purely arbitrary; it could easily be something else. It would however be great if it were something easy to type, easy on the eyes, and still fairly unambiguous. Since this notation is rarely used on the web today, it would for example be easy to write a plugin that scanned a blog post for it and recognized citations.
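Such a scanner is only a few lines. Here is a minimal sketch, assuming citekeys have the shape letters + four-digit year + letters/digits; the pattern and the function name `find_citations` are my own, and the second citekey in the sample text is hypothetical.

```python
import re

# Matches [@citekey], where a citekey is assumed to be lowercase letters,
# a four-digit year, then optional lowercase letters or digits.
CITATION_PATTERN = re.compile(r"\[@([a-z]+\d{4}[a-z0-9]*)\]")


def find_citations(text):
    """Return the citekeys cited in `text`, in order of first appearance."""
    seen = []
    for key in CITATION_PATTERN.findall(text):
        if key not in seen:
            seen.append(key)
    return seen


if __name__ == "__main__":
    post = (
        "As argued in [@scardamalia2006knowledge] and [@smith2000learning], "
        "and again in [@scardamalia2006knowledge]. Hat tip to @houshuang."
    )
    print(find_citations(post))
```

Because the pattern requires both the square brackets and the year, ordinary @-mentions and email addresses are left alone, which is what makes the notation fairly unambiguous in practice.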

Stian Håklev March 24, 2013 Toronto, Canada