Bringing data to the fore


The SourceData project improves access to the data in published scientific articles and enables further use of research results. This EMBO initiative has entered its proof-of-principle phase with the support of the Robert Bosch Foundation. The tools that are being developed as part of the project make the source data in scientific publications easier to find and allow scientists to use the data to their full potential.


Efficient access to primary research could significantly speed up science. Yet scientific journals publish most data as figures that do not easily allow further analysis of the data and are inaccessible to systematic data mining or search. “When I had to make an extensive review of all papers reporting Cdk1 substrates, I realized how cumbersome and time-consuming such an elementary task can be using standard literature search engines like PubMed, let alone finding and comparing the respective figures showing the experimental evidence,” Tim Hunt explains to illustrate this problem. “It would be so wonderful to be able to find and compare all published cell cycle-dependent phosphorylation profiles of such substrates to analyse the differences, depending on the context of the experiments.” To address this central issue in scientific publishing, EMBO has initiated the SourceData project. SourceData will develop tools adapted to the scientific publishing process that will make published data re-usable and searchable.


“The main objective of the SourceData project is to describe the biological content of the data underlying published figures—we call these ‘source data’—by using standardized terminologies and developing data-oriented strategies to search the literature,” explains Thomas Lemberger, Deputy Head of Scientific Publications at EMBO, who leads the project.


“All EMBO Press journals already encourage the presentation of source data for figures that convey the essential findings in a paper,” says Bernd Pulverer, Head of Scientific Publications at EMBO. “Last year, more than half of the research papers in our journals contained source data and we are confident that this will become the standard for publication in the near future.”


The SourceData project has recently received support from the Robert Bosch Foundation to develop the necessary software tools in collaboration with the Vital-IT Center for high-performance computing and the Swiss-Prot groups, both headed by Ioannis Xenarios at the Swiss Institute of Bioinformatics. One of the tools is a computer-assisted biocuration platform that helps biocurators unambiguously identify the biological components involved in a published experiment. In addition, whenever possible, the platform will generate a simplified computer-readable representation of the hypothesis tested by the experiment.


With the biocuration tool in place, editors of scientific manuscripts can describe for the reader the source data underlying the figures that represent the essential findings in a paper. At EMBO Press, dedicated scientific data editors already work with the authors of manuscripts to optimise and structure the presentation of their figures and data.


SourceData recently hired its first biocurators, who joined EMBO in Heidelberg, Germany. “Part of our job is to annotate the figures of papers published in open access and other partner journals. We collaborate with publishers, including EMBO Press, Wiley, HighWire Press and Nature Publishing Group, to make a proof-of-principle demonstration of the capabilities of the SourceData curation process and semantic models that we are developing,” says Sara El-Gebali, one of the new staff. “We are working closely with the software developers at Vital-IT to optimize the usability of the curation tool.”


“Our goal is to transform the scientific paper into an enriched resource,” says Thomas Lemberger. “We want to make scientific papers more useful to the community and the data in them more easily discoverable. One of the ways to do this is to publish the data in a more structured manner and in a form that is accessible to computers. The first paper published by EMBO that included source data was published in Molecular Systems Biology in 2009. More journals are adopting similar policies. We can now build on this experience to go full circle: Soon we will be able to publish figures in a way that not only includes the human-readable illustration and figure legends but also the associated source data and machine-readable metadata. This is a very exciting time.” SourceData collaborates with publishers including Wiley Blackwell and HighWire to ensure the tools are applicable across journals.


“It is an essential part of our job to work in collaboration with our publishing partners Wiley and HighWire to optimize the integration of biocuration tasks within the production workflows. We also need to develop downstream applications such as ‘Smart Figures’ that will allow readers to view figures in one paper in the context of related data published in other papers,” says Nancy George, biocurator at SourceData. “We also need to keep up to date with innovations in the fields related to biocuration as well as the development of community standards and data exchange formats.”


“For a project like SourceData, we need to combine expertise in areas like scientific publishing, data mining, semantic web technologies and biocuration,” says Lemberger. “Our recently appointed members of the Scientific Advisory Board will provide us with essential guidance in this respect.”


The Scientific Advisory Board of SourceData includes Jason Swedlow of the University of Dundee, Scotland; Alfonso Valencia, Director of the Spanish National Bioinformatics Institute; Susanna-Assunta Sansone of the University of Oxford e-Research Centre; Phil Archer, Data Activity Lead of the World Wide Web Consortium; and Mark Patterson, Executive Director of eLife.


The SourceData website will be launched in April at sourcedata.embo.org.