Monday, 3 June 2013

Badly-coded affiliations: a too long-standing curse


Muriel Mewissen, RJ Broker Project manager, delivered a webinar last week on the Repository Junction Broker (RJB) Project, currently under way at the EDINA National Data Centre in Edinburgh. The RJ Broker is a SWORD-based tool for automated content delivery into institutional repositories (IRs): it identifies target IRs by matching each co-author's affiliation to their institution's repository (where available).

In the course of this RSP-organised event, Muriel shared some slides with an analysis of the preliminary content transfers the RJ Broker has performed so far. The first RJB-mediated transfer test involved processing in excess of 60,000 Europe PubMed Central articles and delivering them into the (mock) worldwide repository network.

EuropePMC is a solid disciplinary platform for the biosciences, whose content is often delivered straight from publishers. As a result, its content usually features good-quality metadata, and EuropePMC thus provides a good test case for research article transfer. Moreover, the specific EuropePMC article set selected for this test was remarkably recent. However, the figures Muriel presented for the RJ Broker's ability to resolve authors' affiliations in EuropePMC articles were simply astonishing (see figure below): authors' affiliations were badly coded in the metadata of over half the transferred articles.



This is a well-known issue that the PEER project also had to deal with in its day. Institutions have been telling their authors for ages to harmonise the affiliation they use when signing their papers, but it is still very common to find affiliations such as "Department of Psychology, Compton Rd" or "Radiology Unit, Hearts Lane", which the RJ Broker simply cannot process because they lack their main affiliation node, i.e. the name of the parent institution.
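To make the problem concrete, here is a minimal Python sketch of why such strings cannot be resolved. It is only an illustration under stated assumptions, not the way the RJ Broker actually works: the registry, its made-up IDs and the function names `normalise` and `resolve_affiliation` are all hypothetical, and real matching would need to be far more tolerant of spelling variants.

```python
import re

# Hypothetical registry: a few organisation names mapped to made-up IDs.
# A real registry would hold many thousands of entries.
REGISTRY = {
    "university of edinburgh": "org-001",
    "university of leeds": "org-002",
    "university of hong kong": "org-003",
}


def normalise(text):
    """Lower-case the text and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def resolve_affiliation(affiliation):
    """Return the registry ID of the first comma-separated part of the
    affiliation that names a known organisation, or None if no part does."""
    for part in affiliation.split(","):
        org_id = REGISTRY.get(normalise(part))
        if org_id is not None:
            return org_id
    return None


examples = [
    "Department of Psychology, University of Edinburgh, UK",
    "Department of Psychology, Compton Rd",
    "Radiology Unit, Hearts Lane",
]
for affiliation in examples:
    print(affiliation, "->", resolve_affiliation(affiliation))
```

Only the first example contains a part that names an institution in the registry. The other two give a department and a street, so there is nothing to look up, however good the matching logic is.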

A large collective effort is needed to provide the means to tackle this long-standing issue once and for all, and ORCID looks like a very promising initiative in this regard. If authors' affiliations could be coded into their ORCID iDs (something ORCID is actually aiming to do), the rate of miscoded affiliations could be expected to drop rapidly as a result.

Much like author identification, this is of course a huge challenge that no one has so far managed to solve, and ORCID faces a lot of hard work to find a way to attack the miscoded affiliation issue. But there is currently much talk in the community about organisational IDs and about putting a system in place that will hopefully provide the means to start solving this seemingly unsolvable difficulty. The research information management community badly needs ORCID to succeed in this challenge if it is ever to start building the eagerly awaited service layer on top of the infrastructure layer.




3 comments:

Muriel Mewissen said...

I’m glad you enjoyed the seminar on the RJ Broker. However, I would like to add some background information to the figures you highlight in my slide. The context of this slide was RJ Broker-centric and focused on what the broker can do with the data. The RJ Broker identified an organisation for ~22,500 records, leaving us with ~36,000 for which we couldn’t identify an organisation. You jumped to the wrong conclusion that these were all due to “badly coded” metadata. This isn’t the case. There are several reasons why the RJ Broker can’t identify an organisation. Yes, some records had issues with the metadata, whether missing, incorrect or incomplete. A typical example is an organisation field stating ‘Medical Institute’, which, although correct, is not specific enough. To identify organisations, the RJ Broker relies on the Organisation and Repository Identification (ORI) tool (ori.edina.ac.uk), which holds information on over 24,000 organisations worldwide. However, we know ORI does not hold all organisations; for example, we recently discovered that the British Antarctic Survey wasn’t in ORI because it is not listed in any of the sources ORI harvests data from. Some of these records will have a valid organisation field, but the RJ Broker cannot identify the organisation.

It would be interesting to analyse these ~36,000 records and sort them according to why an organisation cannot be identified. However, this is a time-consuming process which we have not yet been able to perform.

Pablo de Castro said...

Thanks for your comment, Muriel.
I recently came across the following paragraph by David Palmer, Hong Kong University, in his paper "The benefits of authority management in an IR; more than name disambiguation" about correcting mistakes regarding ID issues for HKU institutional authors, http://hub.hku.hk/handle/10722/184124:

"(...) Our work with Scopus corrected 10,000s of egregious Scopus errors on HKU data, again exacerbated by the use of Romanization for Hanzi names. These errors include one author with two or more Scopus AU-ID profiles, one author’s papers distributed amongst two or more profiles of orthographically dissimilar people, two or more homonymous individuals erroneously shown as one Scopus profile, erroneous affiliations, and more".
By this I mean that the post addresses a well-known, very frequent issue in authors' affiliation identification, and that the reference to EuropePMC is rather incidental here. When using the expression "badly coded" I of course mean impossible for the RJ Broker to process, as the next paragraph makes clear. No offence is meant, then, to EuropePMC, whose data are in any case often delivered by publishers themselves, who in turn collected them from the authors themselves at manuscript submission time. Hopefully the ORCID affiliation feature to be shortly provided by Ringgold/ISNI will gradually alleviate this burden.

Ema Susanti said...
This comment has been removed by a blog administrator.