A researcher’s view of data handling for life science

Given the current mess of data handling in life science (or bioscience, as it is also called), which I described in a previous article, what should be done? Let us begin with a few words from one of the gurus:

You have to start with the customer experience and work backwards to the technology.

Steve Jobs, quoted here.

We should start by defining what the needs are. What does the scientist, the research group, want in terms of data storage and handling? What do they need in order to pursue successful life science? What other goals for data storage in Swedish science are there? How can we promote approaches to data handling that facilitate Open Science?

This is not a final text or treatise; it is a snapshot of my thinking on the subject. Serious policy and design specifications must of course be crafted through debate and input from various experts. There is now an initiative at the Science for Life Laboratory, where I work, to discuss these issues. I have written this as a starting point for that discussion, in the hope that it may be useful for SciLifeLab and for others.
What is important, and what is not, for bioscience data handling

There is an ongoing discussion between the main bureaucratic players of Swedish science regarding the issue of data storage and data handling for the biosciences in Sweden. The question they discuss is “who should pay for what, and how should the money be channelled?”

It is exactly the wrong question.

Since all of these actors (except for non-governmental organisations, e.g. KAW) are financed by taxpayers’ money, how they decide to finance things is a technical budget issue. There seems to be a strange idea floating around that if one fiddles with the financing paths, the whole problem of data storage will become more manageable. This is, to my mind, to miss the point entirely.
The mess in bioscience data handling

Science is a social activity relying on knowledge sharing, reproducibility, reanalysis and extension of previous work. The movement towards Open Access publication and Open Science sharing of data and analysis protocols can be seen as a natural development of these ideals. Large data sets are essential to many scientific investigations and are sometimes the product of an investigation. The biosciences have fairly recently started producing large data sets. There are several well-funded international efforts maintaining focused bioscience data sets, such as genomes at Ensembl, protein sequence data at UniProt, and many others.

Bioscience researchers are performing more Big Data experiments, but the various infrastructures available at the group, department, university and national levels are unable to cope. The situation for individual research groups is basically a mess. Various ad hoc solutions are being implemented, ultimately leading to a patchwork of systems that is becoming increasingly difficult for anyone to navigate. This also makes proper implementation of Open Science extremely hard, if not impossible.
Why queues are inevitable

This post is also available in Swedish.

We love to complain about queues. Why do we have to wait? Do the queues in, for example, the health care system not show that too few resources are allocated to it? I have looked a little closer at this problem.

My conclusion: No, we are probably not willing to pay what it costs to eliminate queues. My results rely on some basic assumptions, and are applicable to many different types of scenarios. I have used computer simulations to investigate the problem. The numbers speak for themselves: The queueless society is an unreasonable utopia.
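
The excerpt does not say which model the simulations used, so the following is only a minimal sketch under assumptions of my own choosing: a single server, customers arriving at random (exponentially distributed gaps), exponentially distributed service times, and first-come-first-served order. The function name `mean_wait` and all parameter values are hypothetical. It illustrates the general effect that the average wait grows sharply as the server's utilisation approaches 100 %.

```python
import random


def mean_wait(arrival_rate, service_rate, n_customers=200_000, seed=1):
    """Estimate the average time a customer waits before being served.

    Assumed model (not taken from the original post): one server,
    exponential inter-arrival and service times, first come first served.
    """
    rng = random.Random(seed)
    wait = 0.0   # waiting time of the current customer
    total = 0.0  # sum of waiting times seen so far
    for _ in range(n_customers):
        total += wait
        service = rng.expovariate(service_rate)  # time to serve this customer
        gap = rng.expovariate(arrival_rate)      # time until the next arrival
        # Lindley recursion: the next customer waits for whatever work
        # is still left when they arrive, or not at all if the server is idle.
        wait = max(0.0, wait + service - gap)
    return total / n_customers


if __name__ == "__main__":
    # Service rate fixed at 1 customer per time unit, so the arrival rate
    # equals the utilisation (the fraction of time the server is busy).
    for utilisation in (0.5, 0.8, 0.9, 0.95):
        w = mean_wait(arrival_rate=utilisation, service_rate=1.0)
        print(f"utilisation {utilisation:.2f}: mean wait {w:.1f} service times")
```

For this particular model, queueing theory gives the mean wait as ρ/(μ − λ), where λ is the arrival rate, μ the service rate and ρ = λ/μ the utilisation. Halving the idle capacity therefore roughly doubles the wait, which is one way to see why removing queues altogether requires capacity that stands unused most of the time.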
Why queues are inevitable (Därför är köer oundvikliga)

This blog post is also available in English.

We readily complain about queues. Why should we have to wait? Do not, for example, the health care queues show that too meagre resources are spent on health care? I have taken a slightly closer look at this problem.

My conclusion: No, we are probably not prepared to pay what it costs to abolish the queues. My result rests on a few simple assumptions, and is applicable to many different kinds of operations. I have used computer simulations to work through the problem. The numbers speak plainly: the queue-free society is an unreasonable utopia.
MolScript: A story of success and failure

A scientific paper I published in 1991 is on the list of “The 100 most highly cited papers of all time”. The paper in question is

Per J. Kraulis
MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures.
J. Appl. Cryst. (1991) 24, 946-950

The Top 100 list is published in the 30 Oct 2014 issue of Nature magazine. The list contains the 100 most-cited papers in the entire scientific literature since 1900. The MolScript paper is number 82 on the list, with 13,496 citations.

Update! 26 Nov 2014: The MolScript paper is now Open Access! See the J. Appl. Cryst. web site or DOI:10.1107/S0021889891004399.

Here are a couple of images prepared using the MolScript program:

Ras p21, standard view

This image shows a schematic overview of the ras p21 protein, based on a 3D structure determined by Ernest Laue’s group, of which I was a member (Kraulis, PJ, et al, Biochemistry (1994) 12, 3515-3531). The ras p21 protein is a key component of growth signaling in the cell. In a large fraction of cancer cases, this molecule has been mutated, so that its normal regulatory function has broken down.

