Nicolas BAILLY: “In the era of Big Data: the good, the bad, and the ugly. The future from past experiences in Biodiversity Information Systems and e-infrastructures”

Monday 12 October 2015
by Romain DAVID

Author: Nicolas Bailly, Hellenic Centre for Marine Research, LifeWatchGreece

Since the inception of the Web, the biodiversity domain has progressively entered the Big Data era. Other scientific domains did so earlier, in the pre-Web Internet period, but biodiversity took a while to really engage, even though the annual conferences of the Taxonomic Databases Working Group (TDWG) have been held since the mid-1980s. GBIF for occurrence data, and OBIS for the marine domain, started only in 2001, whereas the first specimen collection was digitized in 1966: it took 35 years to realize the value of gathering large datasets for purposes beyond the management of collection artifacts, e.g., ecological niche modelling at large scale. Genomics was the first sector to assemble big sequence datasets, in the early 1980s, pushed by the requirement to deposit sequences in these datasets before publishing. For taxonomy, the electronic compilation of large lists of names really took off in the early 1990s, but the Catalogue of Life is still 30% incomplete. Portals for species descriptions, life traits and ecological traits are still being developed. And while genetic resources are well recorded for crops and cattle (but still not for aquaculture commodities!), the wild populations of non-domesticated species that could be the source of new types of food are largely unknown in that respect.
This slowness, not to say sluggishness, constitutes the ugly part of the story and is still ongoing. It comes together with too much duplication of effort, especially on the informatics side, and with the loss of many small datasets because researchers are not trained to make their data available, or do not want to, although those data were mostly acquired with public money. Some solutions exist today to minimize this duplication and these losses.
Over the lifetime of a dataset, its status may move from restricted use by one researcher or team for elaborating or testing hypotheses towards a collaborative initiative with many data providers. The bad part of the story comes when the original team tries to impose its own standards and data structure, which usually prevents full collaboration (e.g., FishBase at some points), or when this task is given to a committee, which after many meetings and discussions ends up with highly abstract, impracticable jargon … when there is an end at all (e.g., the attempt of GBIF to define a life traits structure).
Rather, as demonstrated by GBIF and OBIS for occurrence data with the Darwin Core data schema, it is preferable to focus on data exchange standards that everyone can map to their own data structure, so that databases can be managed independently. That is the good part of the story, even if many challenges remain, more in networking, social relationships and long-term maintenance than in the technology itself (e.g., Catalogue of Life). Indexing and metadata management become key points in that perspective, as demonstrated by a number of e-infrastructures (D4Science, BioFresh, LifeWatch, EU BON, EMODnet). In the end, we have to admit that full collaboration producing one single information system is impossible; the best solution is rather coordination, with several systems being produced that share the same full dataset of interchangeable data.
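As a minimal sketch of what "mapping one's own data structure to an exchange standard" can look like, the Python snippet below renames the fields of a locally structured record to Darwin Core terms. The local field names (`species`, `lat`, `lon`, `date`, `internal_id`) and the sample values are hypothetical and purely illustrative; only the target names (`scientificName`, `decimalLatitude`, `decimalLongitude`, `eventDate`) are Darwin Core terms. It is not how GBIF or OBIS implement their exchange.

```python
# Hypothetical local column names (left) mapped to Darwin Core terms (right).
LOCAL_TO_DWC = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
}


def to_darwin_core(record, mapping=LOCAL_TO_DWC):
    """Return a new dict keyed by Darwin Core terms.

    Local fields without a mapping are left out of the exchanged record,
    so the provider's internal structure stays private and independent.
    """
    return {dwc: record[local] for local, dwc in mapping.items() if local in record}


# Illustrative record in a provider's own structure (values are made up).
local_record = {
    "species": "Latimeria chalumnae",
    "lat": -11.9,
    "lon": 43.4,
    "date": "1998-07-30",
    "internal_id": 42,
}

dwc_record = to_darwin_core(local_record)
```

Each provider keeps its own database layout and only maintains its own mapping table, which is why the databases can evolve independently while remaining exchangeable.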
The future research questions can be divided into two groups:
- Which scientific questions can justify that society pays for maintaining the existing Biodiversity Information Systems (BIS) and e-infrastructures as repositories of data, information and knowledge, like encyclopedias or specimen collections, where data are stored but nobody knows whether they will ever be used, or used again?
- Which scientific questions can trigger the development of new BIS and e-infrastructures, just as the question of fisheries management started FishBase/SeaLifeBase and ecological niche modelling triggered GBIF?
For the former, the core use of these data is to develop predictive models of biodiversity change (from genetic variability to ecosystem modifications) under different drivers (human threats, climate change, “normal” fluctuations, etc.) at different scales. The data also allow timely baselines to be established and the past to be kept on record, which helps mitigate the “shifting baseline syndrome”. Among the challenges are the development of new statistical methods to handle missing data, uncertain data (e.g., from citizen science initiatives) and sampling biases (Bayesian statistics, neural networks, etc.), and of automatic tools to update taxonomies for long-term maintenance.
For the latter, modelling trophic interactions in large-scale ecosystems, linking phenotypic with genetic variability (including documenting life traits at population level), and developing more accurate Essential Biodiversity Variables (EBVs) at all scales might be the keys to predicting and monitoring loser and winner species, and thus to adapting and balancing our management of biodiversity in a sustainable way, from both conservation and exploitation points of view.
But in the end, it becomes clear that we will need to link all our biodiversity data with socio-economic data if we want to have a real impact on policy making, which is a step further. For this, semantic web technologies are being developed to build the large network of the Linked Open Data initiative. Further developments are still needed, not only to locate and retrieve entire datasets, but also to let users create their own dataset by seamlessly filtering parts of the data from several datasets.
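The idea of a user assembling a personal dataset from parts of several datasets can be sketched as follows. This is an illustrative Python example only: the provider names (`provider_a`, `provider_b`), the records and the helper `build_user_dataset` are hypothetical, and it assumes the records have already been expressed with shared Darwin Core field names. A real Linked Open Data service would work over remote endpoints rather than in-memory lists.

```python
def build_user_dataset(datasets, predicate):
    """Collect the records matching `predicate` from several datasets.

    Each selected record is copied and tagged with the name of the dataset
    it came from (stored under the Darwin Core term `datasetName`), so the
    provenance survives the merge.
    """
    selected = []
    for dataset_name, records in datasets.items():
        for record in records:
            if predicate(record):
                selected.append({**record, "datasetName": dataset_name})
    return selected


# Made-up records from two hypothetical, independently managed providers.
datasets = {
    "provider_a": [
        {"scientificName": "Gadus morhua", "decimalLatitude": 57.1},
        {"scientificName": "Solea solea", "decimalLatitude": 51.4},
    ],
    "provider_b": [
        {"scientificName": "Gadus morhua", "decimalLatitude": 61.0},
    ],
}

# The user's own filter: one species, north of 55 degrees latitude.
cod_north_of_55 = build_user_dataset(
    datasets,
    lambda r: r["scientificName"] == "Gadus morhua" and r["decimalLatitude"] > 55,
)
```

Keeping the source name on every record is a deliberate choice in this sketch: a merged dataset is only reusable if each record can still be traced back to its provider.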

Further reading

Appeltans W., et al. (>50 authors), 2012. The magnitude of global marine species diversity. Current Biology, 22(23): 2189–2202.

Bailly, N.; Kesner-Reyes, K.; Villacorta-Casal, C.M., 2010. Hotspots of marine biodiversity in the Southeast Asian Sea: Mapping current location and climate change impacts. Terminal Report. Aquatic Biodiversity Informatics Office, The WorldFish Center. 27 p. + annexes.

Berghe, E.V., Coro, G., N. Bailly, F. Fiorellato, C. Aldemita, A. Ellenbroek and P. Pagano. 2015. Retrieving taxa names from large biodiversity data collections using flexible matching workflow. Ecological Informatics 28:29-41.

Coro, G., Pagano, P., & Ellenbroek, A. 2013. Combining simulated expert knowledge with Neural Networks to produce Ecological Niche Models for Latimeria chalumnae. Ecological Modelling, 268, 55-63.

Coro, G., Webb, T. J., Appeltans, W., Bailly, N., Cattrijsse, A., & Pagano, P. 2015. Classifying degrees of species commonness: North Sea fish as a case study. Ecological Modelling, 312, 272-280.

Costello, M.J., W. Appeltans, N. Bailly, W.G. Berendsohn, Y. De Jong, M. Edwards, R. Froese, F. Huettmann, W. Los, J. Mees, H. Segers and F.A. Bisby, 2014. Strategies for the sustainability of online open-access biodiversity databases. Biological Conservation, 173:156-165.

De Jong, Y., Verbeek, M., Michelsen, V., Bjørn, P. de P., Los, W., Steeman, F., … Penev, L. (2014). Fauna Europaea – all European animal species on the web. Biodiversity Data Journal, (2), e4034. doi:10.3897/BDJ.2.e4034

Hardisty, A., Roberts, D. et al. (>70 authors) 2013. A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1), 16.

Roskov Y., Abucay L., Orrell T., Nicolson D., Kunze T., Culham A., Bailly N., Kirk P., Bourgoin T., DeWalt R.E., Decock W., De Wever A., eds. 2015. Species 2000 & ITIS Catalogue of Life, 2015 Annual Checklist. DVD. Species 2000: Naturalis, Leiden, the Netherlands.

Ruggiero, M.A., D.P. Gordon, T.M. Orrell, N. Bailly, T. Bourgoin, R.C. Brusca, T. Cavalier-Smith, M.D. Guiry and P.M. Kirk. 2015. A higher level classification of all living organisms. PLOS ONE 10(4): e0119248. doi:10.1371/journal.pone.0119248.








