Each publication should be assigned a unique ID (UID). This is inspired by the way the citekey concept in BibTeX enables integration across many different applications. APIs should let users submit a UID and receive metadata for any publication (whether in JSON or BibTeX, and whether strictly citation info or also social info about tags, other users, links etc). There should also be a number of ways to determine a publication's UID through various lookups.
There are (roughly) two choices for the format of a UID. The first would be a randomly generated (or sequential) ID with no semantic meaning, whether made of numbers or letters etc. The second would be the citekey format which researchr currently uses. The advantage of this is that it is familiar to users (of LaTeX / researchr etc), and immediately conveys some minimal information about a citation. Through use, certain frequent citations might even be recalled actively or passively. Certainly, it is much easier to reorder three publications cited in a blog post using citekeys (I'll put the scardamalia2006knowledge first, and then mention johnson2000corruption) than using random IDs (see for example 3093049, 304955 and 88585).
However, there are a few challenges with using the citekey format. The first is generation and the second is collisions. Although the general principle is well understood (last name of first author + year + first word of title), there are a number of permutations, for example:
- I prefer manually changing van2006knowledge to vanderwende2006knowledge
- what to do with punctuation, is it peter2006knowledge or peter2006knowledge-integration
- it often makes sense to include the first word with more than n (=3?) letters, etc.
This can result in citekeys generated by researchr, by other tools (e.g. Google Scholar), and by Scrobblr all being different. Some of these rules we can just define arbitrarily, but we might want a decent algorithm to solve the first point above - perhaps joining the words of the last name without spaces.
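A minimal sketch of such a generator, assuming a ">3 letters" cutoff for the significant title word and assuming multi-part last names are simply joined (both rules are open questions from the list above, not decisions):

```python
import re
import unicodedata

def make_citekey(last_name, year, title, min_word_len=3):
    """Generate last name + year + first 'significant' title word."""
    # Join multi-part last names: "van der Wende" -> "vanderwende"
    name = "".join(last_name.lower().split())
    # Strip accents and any remaining non-letters (handles punctuation)
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = re.sub(r"[^a-z]", "", name)
    # Prefer the first title word with more than min_word_len letters
    words = re.findall(r"[a-z]+", title.lower())
    word = next((w for w in words if len(w) > min_word_len), words[0] if words else "")
    return "%s%d%s" % (name, year, word)

print(make_citekey("van der Wende", 2006, "Knowledge integration and ..."))
# vanderwende2006knowledge
```

Note how this sidesteps the punctuation question as well: "knowledge-integration" would yield plain "knowledge".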
Given that we can thus generate nice citekeys from submitted metadata (much of which won't even have a citekey, or will have a citekey in a totally different format), we encounter the problem that the citekey in the database might differ from the citekey in the user's local system. One approach would be to use researchr or other plugins to "harmonize" these (i.e. automatically modify citekeys on the user's end) - this would have to be done early in the import process, because everything locally is tied to the citekey (PDF name, wiki pages). (Of course, in the future Scrobblr will be the first place we go to download papers in our fields anyway, so theoretically we won't even have this problem :)) Or we could just accept that there will be a discrepancy here.
The second problem is collisions. It is likely that there will be cases of several papers generating the same citekey, and we'll need a way of resolving this. A simple way would be to add "b" to the year or something like that - not very elegant, since it will look kind of "random" when viewed outside of a context. Another approach would be to go back and give both articles a longer citekey to avoid the collision (perhaps using the first two words of the title); however, given that a citekey, once assigned, should never change, this is impossible.
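The "add a letter" disambiguation could be sketched as follows (placing the suffix directly after the year digits is an assumption; it mirrors the common BibTeX convention):

```python
import re

def resolve_collision(citekey, existing):
    """If the citekey is taken, try inserting 'b', 'c', ... after the year."""
    if citekey not in existing:
        return citekey
    match = re.search(r"\d{4}", citekey)
    pos = match.end() if match else len(citekey)
    for suffix in "bcdefghijklmnopqrstuvwxyz":
        candidate = citekey[:pos] + suffix + citekey[pos:]
        if candidate not in existing:
            return candidate
    raise ValueError("ran out of suffixes for " + citekey)

print(resolve_collision("smith2006knowledge", {"smith2006knowledge"}))
# smith2006bknowledge
```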
Given that we can solve all of these things, the final concern is user confusion between local citekeys and Scrobblr citekeys, given that they look so similar. One way to mitigate this in practice would be to come up with some notation for linking to citekeys which specifies that they are Scrobblr citekeys. Currently we are using [ @citekey] for citations, but this choice is purely arbitrary - it could easily be something else. It would however be great if it was something easy to type, easy on the eyes, and still fairly unambiguous. Since the citekey format is rarely used on the web today, it would for example be easy to write a plugin that scanned a blog post for this notation and recognized citations. (Not a great example, since the citation would probably be transformed to a full citation by the blog post software anyway, but...)
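Scanning a page for the provisional [ @citekey] notation really is a one-line pattern; the exact syntax accepted here (optional space, letters/digits/hyphens) is an assumption:

```python
import re

# Provisional notation: "[@citekey]", optionally with a space after the bracket
CITATION_RE = re.compile(r"\[\s*@([A-Za-z][A-Za-z0-9:_-]*)\s*\]")

def find_citations(text):
    """Return all citekeys referenced with the [@...] notation."""
    return CITATION_RE.findall(text)

post = "I'll put [@scardamalia2006knowledge] first, then [ @johnson2000corruption]."
print(find_citations(post))
# ['scardamalia2006knowledge', 'johnson2000corruption']
```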
There are two possible ways to populate the Scrobblr database, the first is user contributions, and the second is “canonical contributions”. I will start with the second one, because while it might be much rarer in the beginning, it avoids a number of problems with the first category.
Despite all the effort users go through in typing in and reformatting citations, “perfect” citation entries are out there, at least in most cases. The metadata put out by a publisher of a given article or chapter should in most cases be considered “canonical”, and since almost all publishers maintain these in a machine-readable format internally, theoretically it should be easy to integrate these. In practice, the situation is of course different.
(The one common exception to publishers' data being considered canonical is the unfortunate practice of listing authors' names using only initials. In most cases, it would be better to have full names. It is hoped that the deployment of ORCID - unique author IDs - will help mitigate this, but even if it is a success, which is not certain, it will probably take a long time to propagate.)
However, there are a number of possible approaches - how attractive these are depends partly on how specialized we intend to keep Scrobblr (or, to put it another way, how much work we want to put into acquiring and hosting metadata for publications which we ourselves are unlikely to be interested in). There are a few categories of canonical or mostly-canonical sources that we could, with some work, integrate:
- large pre-print archives that enable bulk download, for example (only?) arxiv.org which enables full download both of their entire full-text article collection, and metadata in BibTeX (740,000 publications)
- repositories which provide standardized APIs, such as most of the institutional repositories through Directory of Open Access Repositories (DOAR)
- Directory of Open Access Journals provides the TOCs of 4000 OA journals (has API)
etc. In addition to having ideally mostly correct metadata, the sources above all provide links to the full-text of the journal articles, which means that we can also extract PDF hashes.
The obvious way for users to contribute is of course to use researchr, and we already have a fully functional demo of submitting both metadata and PDF hashes. However, we would also like users of other platforms to be able to contribute. Submission of metadata is quite easy - it is no problem to tie into the Mendeley API, for example, to import your entire library or selected publications, and the same goes for many other websites such as CiteULike. The GScholar bookmarklet is also an attractive solution for ad-hoc imports. Finally, one could imagine a web interface for directly entering citations into Scrobblr.
A common problem with many of these methods (apart from the researchr one) is that it will be difficult or impossible to access the PDF and generate a PDF hash. In the case of Google Scholar, we can do this if there is a PDF link present; in the case of Mendeley and Papers, we might be able to do this with a desktop plugin, but not with a purely online API.
Incomplete or erroneous information might be the result of spam, or of legitimate mistakes and lack of full information (for example importing BibTeX from Google Scholar, which has itself extracted it from the PDF file using a heuristic algorithm). Apart from errors, there might also be different, equally legitimate ways of encoding metadata for an article.
Separate from the concept of people submitting incomplete publications is the problem of people submitting completely erroneous information, whether intentional (as an effort to spam) or unintentional (a script gone mad that fills up the database with 1000s of spurious entries). Ideally the use of an API key (and we might need a captcha for sign-up etc), as well as a non-standard API, would mitigate most automatic attempts. We could eventually have a way of flagging erroneous entries, and if enough from the same user were flagged, all his/her contributions could be removed automatically. (Of course, if someone were really determined to undermine the site specifically, creating multiple accounts etc, that would be hard to defend against - not something to worry about in the short-term though).
The problem of “unintentional spamming”, by users creating thousands of entries through erroneous scripts or experimentation is more difficult to secure against. There might be an option in a user’s control panel to view last import, and cancel the entire last import if something went wrong.
The much more common problem will be people submitting incomplete citations or slightly different citations. This is tightly connected to duplication-detection (below). If users are entering citations directly through the Scrobblr interface, there are various ways in which we can do error-checking, enable auto-suggestion of author and journal names, etc. These measures will not ensure consistency and correctness, but they can contribute. However, most of the data will likely be uploaded as individual entries or in bulk from other sources.
Ideally, the Scrobblr platform would offer powerful and user-friendly tools for editing already contributed citations. Possible desired features:
- automatic lookup and suggestion based on DOI, fuzzy Google search, Mendeley (using Mendeley API) etc, both for new publication and for publications that have already been entered, but are not complete
- batch-processing/editing with preview (i.e. change all instances of Buckingham-Shum, S. to Buckingham-Shum, Simon)
- easy ways of normalizing things like author names and journal names (perhaps a fuzzy search that lists S. Buckingham Shum, Simon Buckingham Shum and Simon Buckingham-Shum together, and lets the user rename all of these to one of the variants)
- linked to the above, the possibility of having a database of journals and authors (similar to author pages and journal pages in researchr currently, where people could collaboratively edit meta information, such as link to homepage, picture, research interests etc. These would then help in designating “canonical” author/journal names, and their pages would automatically list their publications. There might be a need for unique author IDs, but when using full author names, that might be a rare enough problem to not warrant extra intrusive functionality - again ORCID to the rescue, hopefully)
- some kind of history/revisions of edits
- the ability to subscribe to certain authors, keywords to see all edits
- distinguishing who can conduct certain kinds of edits depending on some kind of a reputation system, like StackOverflow (can only post comment if you have a certain reputation etc)
- some kind of a voting system for the most “correct” metadata?
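To illustrate the fuzzy name-normalization idea from the list above, here is a toy grouping heuristic, keying on family name plus first initial. The heuristic is purely an assumption and would misfire on many real names (e.g. unhyphenated two-word family names), which is exactly why a human confirms the merge:

```python
import re
from collections import defaultdict

def name_key(name):
    """Crude identity key: (normalized family name, first initial)."""
    if "," in name:
        family, given = [part.strip() for part in name.split(",", 1)]
    else:
        words = name.strip().split()
        family, given = words[-1], " ".join(words[:-1])
    initial = given[:1].lower() if given else ""
    return (re.sub(r"[^a-z]", "", family.lower()), initial)

def group_author_variants(names):
    """Bucket probable variants of the same author together."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return list(groups.values())

print(group_author_variants(
    ["Buckingham-Shum, S.", "Simon Buckingham-Shum", "Scardamalia, M."]))
# [['Buckingham-Shum, S.', 'Simon Buckingham-Shum'], ['Scardamalia, M.']]
```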
This editing depends on each publication having one and only one entry in the database; thus we need mechanisms to avoid duplication.
How do we ensure that there is only one entry in the database for each different publication? (If we were librarians we would spend a lot of time discussing different editions etc, but in most cases with journal publications, this should not be relevant. With books that have been reissued, that might be an issue).
If metadata is submitted together with a PDF hash, it is easy to detect whether an entry with an identical PDF hash already exists. If this occurs, there could be different algorithms for dealing with the differing metadata (unless it is identical) - keeping the first, keeping the last, keeping the longest ("most complete"), etc.
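A sketch of the hash plus one possible "keep the most complete" policy - field by field, keep the longer value. SHA-1 as the binary hash and this particular merge rule are both illustrative assumptions:

```python
import hashlib

def pdf_hash(data):
    """Binary hash over the raw PDF bytes; SHA-1 is an arbitrary but stable choice."""
    return hashlib.sha1(data).hexdigest()

def merge_keep_longest(existing, incoming):
    """Field by field, keep whichever value is longer ('most complete')."""
    merged = dict(existing)
    for field, value in incoming.items():
        if len(str(value)) > len(str(merged.get(field, ""))):
            merged[field] = value
    return merged

old = {"author": "S. Buckingham Shum", "year": "2006"}
new = {"author": "Simon Buckingham Shum"}
print(merge_keep_longest(old, new))
# {'author': 'Simon Buckingham Shum', 'year': '2006'}
```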
However, we might also be dealing with the same publication, but slightly different PDF files (pre-print vs final version, or even just one that has annotations “hard-coded” and one that doesn’t). A text fingerprint engine would solve this problem, and allow us to tie the two PDF hashes together (a publication can have several PDF hashes). However, there might also be cases where metadata is submitted without PDF hashes (similar to the case where PDFs are different, and we have not yet developed a text fingerprinting engine).
In this case, we might be able to match based on for example a DOI field, or an identical URL field. (The problem is that with batch submissions, we are not able to ask the user to confirm.) BibSonomy has developed a system for perceptual hashing of bibtex entries, which allows them to identify duplicates with some degree of certainty - this is documented on their website. Otherwise, the only way might be to import both, and then enable manual “merging” in the future, based on an interactive editing system as listed in the section above.
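In that spirit, a toy normalized-fields fingerprint - this is not BibSonomy's actual algorithm, and the choice of fields and normalization are assumptions - might look like:

```python
import hashlib
import re

def entry_fingerprint(entry):
    """Hash a few normalized fields so that cosmetically different
    entries for the same publication collide on purpose."""
    def norm(value):
        return re.sub(r"[^a-z0-9]", "", str(value).lower())
    key = "|".join(norm(entry.get(f, "")) for f in ("title", "year", "author"))
    return hashlib.md5(key.encode()).hexdigest()

a = {"title": "Knowledge Building: Theory", "year": 2006, "author": "Scardamalia, M."}
b = {"title": "knowledge building theory", "year": "2006", "author": "scardamalia m"}
print(entry_fingerprint(a) == entry_fingerprint(b))
# True
```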
If we do indeed generate two entries and two unique citekeys for one publication, and these are subsequently merged, we might have to allow for “alias” citekeys to continue to live - because there might be references to both citekeys on blogs, etc. This would function like Wikipedia’s #redirect.
The data in the database should be retrievable in a number of ways:
- by canonical ID (whether that is citekey or a random identifier, see above)
- by binary PDF hash
- by perceptual text hash
It could also be searched by any of the fields, as well as by user/keyword etc. The API can return BibTeX as its default, but in the interest of interoperability, it should not be difficult to support a few other formats as well, for example BibJSON. In addition to the bibliographic metadata, there might be other metadata returned, such as other users who have the item in their library, keywords, popularity, linkbacks, etc.
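The three retrieval paths listed above could be sketched as a simple multi-index store (all names here are hypothetical, and a real deployment would be a database, not a dictionary):

```python
class PublicationIndex:
    """Toy in-memory store with the three lookup paths described above."""

    def __init__(self):
        self.by_id = {}         # canonical ID (citekey or random)
        self.by_pdf_hash = {}   # binary PDF hash
        self.by_text_hash = {}  # perceptual text hash

    def add(self, record):
        self.by_id[record["citekey"]] = record
        for digest in record.get("pdf_hashes", []):  # one pub, several PDFs
            self.by_pdf_hash[digest] = record
        if record.get("text_hash"):
            self.by_text_hash[record["text_hash"]] = record

    def get(self, citekey=None, pdf_hash=None, text_hash=None):
        if citekey is not None:
            return self.by_id.get(citekey)
        if pdf_hash is not None:
            return self.by_pdf_hash.get(pdf_hash)
        if text_hash is not None:
            return self.by_text_hash.get(text_hash)
        return None

index = PublicationIndex()
index.add({"citekey": "scardamalia2006knowledge",
           "pdf_hashes": ["ab12"], "text_hash": "tx34"})
print(index.get(pdf_hash="ab12")["citekey"])
# scardamalia2006knowledge
```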
In addition to the general bibliographic information, the BibTeX entry should also contain the unique identifier, whether as the main citekey (to be handled by researchr) or (perhaps more secure) as a separate field. This will show that the entry in the user's system corresponds directly to a specific publication in Scrobblr, and can be used for scrobbling, etc.
- a user might grab a publication directly from a Scrobblr detail page; this would involve parsing information already present on the page, whether through researchr, or in the future for example Zotero (we should investigate which of the metadata readable by Zotero we can easily embed in pages)
- a user might have one or a number of PDFs that were not imported with their metadata. Either researchr or another system with a plugin could automatically query Scrobblr using first the binary hash, and if that fails, by extracting the text and querying with a perceptual text hash generated from the text.
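The fallback described in the last point might look like the following, with extract_text and text_hash_fn standing in for a real PDF text extractor and perceptual fingerprint (neither exists yet), and the two lookup tables standing in for whatever query interface the Scrobblr API ends up exposing:

```python
import hashlib

def identify_pdf(pdf_bytes, by_binary_hash, by_text_hash, extract_text, text_hash_fn):
    """Try the binary hash first; fall back to a perceptual text hash."""
    record = by_binary_hash.get(hashlib.sha1(pdf_bytes).hexdigest())
    if record is None:
        record = by_text_hash.get(text_hash_fn(extract_text(pdf_bytes)))
    return record
```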
One of the main features of Scrobblr is being able to indicate that the user did something with a given publication. Ideally, this will be registered automatically in the background, without requiring an explicit action by the user.
Data might be captured at various points of the user’s workflow, for example in researchr:
- when a publication has been imported/added to a user’s library
- whenever a user opens a publication’s PDF in a reader
- when a user exports clippings from the PDF and generates a researchr page
- when a user edits and posts higher level notes
- when a user makes these pages available online (syncing their researchr installation)
These different data points have different uses. For example, for gathering statistics, and connecting users with each other, feeding recommendation systems, etc, simply recording which publications exist/get added to another user’s library, and when he/she opens them, might be very useful. However, if providing links to generated content, waiting until the content has been published online (not just generated locally) is ideal.
What are the ways in which users of other citation managers could also scrobble?
- create an app, which wraps the default PDF reader - upon launch, it sends the binary (and perhaps perceptual text hash) to the server, and opens the PDF in the regular PDF reader (this could be done in the background, while the PDF is launched, so the user will not notice a significant difference in launch time)
- there is no obvious way of getting access to the user's metadata, so we would just have a binary/perceptual hash. If other users had already entered the publication, this would work. Alternatively, we could still notice that two users were reading the same paper, even though we didn't know the name of the paper - but this is obviously not ideal.
- it is not possible to set a PDF reader only for use with BibDesk (other applications might be different). Users will probably not want to scrobble all of their PDF documents, both for privacy reasons, and because connecting users just because they were both looking at similar GAship applications is not very helpful. However, any attempt to ask the user would quickly become intrusive, and the user would probably turn off the app.
- some citation managers have built-in PDF functionality
- using lsof to show list of open files - shares most of the problems listed above
- hacking Skim (which is open source) to have built-in scrobbling functionality - this requires someone with Obj-C knowledge, would require us to distribute and keep current a fork of Skim, and still runs into many of the problems above. It would, however, be a more elegant solution, and in theory we could add a very unobtrusive way of marking publications for scrobbling or not. (It could also send a binary hash, and only scrobble if the hash already existed in the database, showing an indication of this in the PDF reader with a link to the relevant Scrobblr page - added functionality which might make the PDF reader valuable even to people who do not use any citation manager.)
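The wrapper-app idea from the first point above - hash the PDF, fire off the scrobble in the background, then hand the file to the regular reader - could be sketched like this. The endpoint URL is hypothetical, and "open" is the macOS launcher:

```python
#!/usr/bin/env python3
"""Wrapper that scrobbles a PDF in the background, then opens it normally."""
import hashlib
import subprocess
import sys
import threading
import urllib.request

SCROBBLE_URL = "https://scrobblr.example/api/scrobble"  # hypothetical endpoint

def binary_hash(data):
    """SHA-1 over the raw PDF bytes (arbitrary but stable choice)."""
    return hashlib.sha1(data).hexdigest()

def scrobble(path):
    """Report the hash to the server; failures must never block reading."""
    with open(path, "rb") as handle:
        digest = binary_hash(handle.read())
    try:
        urllib.request.urlopen(SCROBBLE_URL + "?hash=" + digest, timeout=5)
    except OSError:
        pass

if __name__ == "__main__" and len(sys.argv) > 1:
    pdf_path = sys.argv[1]
    # Fire and forget, so the user sees no difference in launch time
    threading.Thread(target=scrobble, args=(pdf_path,), daemon=True).start()
    subprocess.call(["open", pdf_path])  # macOS; adjust per platform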
Note that adding links to user-generated content relevant to different publications, which could be considered a form of scrobbling when using the researchr workflow, is very similar to backlinks from other social sites, to be discussed below.
Currently, most citation managers and many citation sharing websites provide functionality for citing publications and generating publication lists in MS Word, OpenOffice etc. However, there are usually no provisions for using citations in other contexts, such as on blogs, wikis, etc.
The desired functionality can be separated into three parts. The first is ease of insertion, the second is rendering/display (i.e. the value added - why would you want to use this tool rather than typing in or copying and pasting the citation), and the third is the information sent back to the Scrobblr server about the citation (trackback).
To make social citations attractive to users who do not care too much about the “trackback” functionality (see below), Scrobblr must offer an easier way of making nicely formatted citations than the way the user traditionally adds citations. Even though the researchr suite might offer additional functionality, the basic functionality should be available to anyone, without any additionally downloaded and installed software.
The simplest way of inserting a citation would be to look up a citation directly on Scrobblr, and make a note of the UID (which should be displayed prominently - whatever format the UID is in, see above). Let us presume the UID is in a citekey format. The user would then be able to easily format and enter the citation as, for example, "Scardamalia & Bereiter, 2006" (with the full reference - Scardamalia, M., & Bereiter, C. (2006). Knowledge building: Theory, pedagogy, and technology. In K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (pp. 97-118). New York: Cambridge University Press. - shown on mouseover) in a WordPress blog entry, or a wiki entry.
The ease of use could be further enhanced by preferring publications added by the user him/herself, by the user's "friends", or in groups the user is a member of. There could also be some formatting indicating whether a publication is in a user's library. (This would work well for a bookmarklet generated from a user's account page, or a plugin for a personal WordPress blog, but how would this be done for a plugin on a shared wiki? Perhaps the wiki owner chooses certain groups to privilege during install.)
In most cases, the functionality will be launched through a plugin on a website that is controlled by us or the user. The first two plugins could be for WordPress and Dokuwiki. Very similar to the existing WordPress and Dokuwiki plugins, the script would output a citation in citekey format (like [@scardamalia2006knowledge], or some other markup deemed appropriate), which would on display of the page render as a properly marked up citation, with the full citation on mouseover, and linking to the publication's page on Scrobblr.
The publication pages on Scrobblr currently show links to people’s notes about that publication. We want to enhance this with links to other public usages/citations of the same publication. These may include:
- use on the user’s researchr wiki, outside of the ref: page itself (for example on a topical page, referencing a number of citations)
- on a non-researchr group wiki
- in a blog entry on a personal blog
- in a Wikipedia article
- in the case of a website with a citation plugin, users are also able to type in citations manually, in addition to using the selector to insert them automatically
- on public websites where the user is using a bookmarklet/browser plugin, rather than a site plugin, the only time we have information is when the user uses the selector to insert a citation. There is no reasonable way of detecting if that citation is later removed
- users might in the same editing session or in subsequent editing sessions (perhaps by other users in the case of a collaborative website) remove a citation that has already been reported to Scrobblr
We do not want to submit information to the server about private citations - for example in a private e-mail message, or even about usage in a semi-private setting, such as a course management system, if the page with the citation is not available to the world.
One way to check if the URL is publicly available is to attempt to fetch the URL that is submitted to Scrobblr, and do a text search to see if the title of the publication is found in the HTML of the page. This could even be repeated on a semi-regular basis to remove stale links (in the case that the page has subsequently been edited).
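A sketch of that check - the plain substring match is deliberately crude, and real matching would need to tolerate markup and reformatting of the title:

```python
import urllib.request

def page_mentions_title(html, title):
    """Case-insensitive substring check for the publication title."""
    return title.lower() in html.lower()

def citation_still_public(url, title):
    """Fetch the URL anonymously; if it is unreachable (or behind a
    login we cannot pass), treat the citation as not public."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="ignore")
    except (OSError, ValueError):
        return False
    return page_mentions_title(html, title)
```

Because the fetch is anonymous, pages behind a course management system login would fail the check automatically, which is exactly the behaviour we want.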
An excellent example of social citations filling a real gap is a wiki for collaborative literature review. As an example, Peer2Peer University wants to be a hub not only for people seeking informal learning opportunities, but also for people wanting to research collaborative and peer-learning. Given this, we want to set up a collaborative wiki to create a review of the literature on collaborative and peer-learning in open contexts.
The existing social citation platforms allow you to collect citations in groups, however they provide no or very little facility for adding comments, sequencing these, etc. A wiki (for example a publicly hosted Dokuwiki), in conjunction with Scrobblr, would be ideal for this purpose. The Google Scholar bookmarklet makes it easy for people to add publications to the group, without downloading or installing any specific software.
Based on the metadata in the database, citations will be displayed both on the individual publication pages on Scrobblr, and on any pages using social citations plugins and bookmarklets from Scrobblr, as listed above. Given that one of our critiques of traditional ways of citing has been the lack of metadata, and the turning of semantic data into “dumb” citations, we need to think about how to best represent the metadata to the user.
The default way of showing a citation, which academic users are used to, is formatted according to a citation standard, such as the APA. This could apply to in-text citations, such as Brawley (1999), and source list citations, such as
Brawley (1999). How to format citations. Journal of Citation Formatting, 1 (4).
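For illustration, rendering that source-list form from stored metadata might look like the following; a real implementation would use a citation-style engine (e.g. CSL styles) rather than a hand-rolled template per standard:

```python
def format_apa_like(entry):
    """Very rough APA-like source-list rendering from metadata fields."""
    return "{author} ({year}). {title}. {journal}, {volume} ({issue}).".format(**entry)

print(format_apa_like({
    "author": "Brawley",
    "year": 1999,
    "title": "How to format citations",
    "journal": "Journal of Citation Formatting",
    "volume": 1,
    "issue": 4,
}))
# Brawley (1999). How to format citations. Journal of Citation Formatting, 1 (4).
```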