12.07.2016 17:29

Open access to data on new and known taxa


The discovery of new species is regularly newsworthy, as are impressive phenomena such as the moths attracted to the huge light trap that is the Stade de France, reported yesterday. In the former case a scientific name is provided because a vernacular name is missing; in the latter a generic term (moth) is used to describe what happened to be a single species. In both cases Google, Wikipedia or Twitter are about as close to the scientific background of those species as one gets. Yet there are an estimated 17,000 newly discovered species and millions of known species, with an estimated library of over 500 million pages of scientific literature.

For several reasons, this corpus is not accessible, a particular nuisance in the age of the Internet. Each name is the pinnacle of a network of cited published data. However, the entry point is deliberately kept closed, is discoverable only by specialists, or exists only in print libraries. The worst and most frustrating case is the first. There, access is denied through artificial barriers such as paywalls or passwords to a corpus of literature that is in fact a huge, highly structured collection of data based on, and derived from, observations. This is not only frustrating in the age of a global, increasingly neglected biodiversity crisis; from a legal point of view, data are not copyrightable and thus not only want to be free, but ought to be free.

This new button is Plazi’s most prominent spyhole into this hidden world of biodiversity knowledge and a step towards the entry point to the “Biodiversity Knowledge Graph” of Linked Open Data. Every day, new taxonomic treatments (the blocks of text in a scientific article that are explicitly linked to a specific usage of a scientific name) are automatically added from harvested taxonomic articles. The articles are either imported as semantically enhanced documents and converted directly, or imported as Portable Document Format (PDF) files, processed using GoldenGate and integrated into TreatmentBank. Each treatment is given a persistent identifier and can thus be cited.

Currently, data on an estimated 25% of the species and higher taxa newly discovered each year are made available this way, complemented by seven times as many taxa from re-descriptions or catalogue entries of previously published taxa.

Wherever possible this includes links to cited treatments and a listing of a few to hundreds of facts, adding up to a rapidly growing corpus of over 100 million facts.

For GBIF this is the fastest update of its taxonomic backbone. Newly published material citations are made accessible too, unless data conversion issues make this too challenging.

The automatic process is complemented by semi-automatic processing of legacy literature, which is open to interested parties who want to build their own corpus, as recently demonstrated at an EMODNet workshop.