A researcher’s view of data handling for life science

Given the current mess of data handling in life science (or bioscience, as it is also called) which I described in a previous article, what should be done? Let us begin with a few words from one of the gurus:

You have to start with the customer experience and work backwards to the technology.

Steve Jobs, quoted here.

We should start by defining what the needs are. What does the scientist, the research group, want in terms of data storage and handling? What do they need in order to pursue successful life science? What other goals for data storage in Swedish science are there? How can we promote approaches to data handling that facilitate Open Science?

This text is not a final text or treatise. It’s a snapshot of my thinking on the subject. Serious policy and design specifications must of course be crafted through debate and input from various experts. There is now an initiative at the Science for Life Laboratory, where I work, to discuss these issues. I have written this as a starting point for the discussion, in the hope that it may be useful for SciLifeLab and for others.

Let me begin by describing some of the needs that should be met, and some goals we should have. This is based on my experience at SciLifeLab and on discussions with other researchers. Only by having first discussed the issues and the goals can we then proceed to questions about resources and technology. The engineer in me says: Specifications first, then design and implementation. Too often, the discussion revolves around concrete technical solutions before the needs have been properly analyzed. The danger is that we create ad hoc solutions that become fragmented, are badly designed, difficult to maintain, and very hard for the user to understand.

Of course, this does not mean that we have to specify the ideal system in complete detail. That is simply impossible. Instead, we have to figure out which issues are simpler and more fundamental, and which are more complex. We can then organize the work so that we provide the basics first and add increasingly complex layers as we progress. We may discover at some level that some of the basic building blocks are not good enough. Then we need to revisit those before solving the higher-level problem. Hierarchy is essential, but cannot be just top-down.

Rosalind’s day in the lab

Let’s follow an imaginary life science researcher called Rosalind during a day in the lab. What challenges does she face?

Rosalind picks up a package delivered by courier containing a 1 TB hard disk drive with the RNA-Seq data for the samples she sent to a sequencing service. She connects the disk to her laptop, and starts the script that checks the integrity of the data. It takes a couple of hours. She doesn’t have enough space on her laptop to copy the data over, so she lets it stay on the delivery disk.

Rosalind needs to compare the new data with some old pilot experiments run by her former Ph.D. student James who is now a postdoc in another lab. She searches her laptop for the notebook files copied from James’s machine before he left, and after some digging among the folders, finds the file name in an entry. Browsing the entry, she finds the name of the hard drive where it is supposed to be. After some fruitless digging around in the cabinet in her office, she goes to the cupboard James had for his training gear and other stuff, and finally finds the disk there, with another four disks. She had told James to clean out that cupboard, but that had never happened. That’s so like James! He never did bother with the boring stuff. Thankfully, the disks do seem to have helpful labels written on them.

Now, how does she know that the file she has found on the disk is the one used for the final analysis in the paper that’s in press? She knows that James had to rerun the pipeline several times to iron out various minor issues. But did James rename the old files? She cannot find any MD5 hash values in James’ notebook, and she knows there are none in the supplementary material in the paper. The journal didn’t accept the files themselves as supplementary material, as she had hoped they would. She has to take it on trust that the files she found on James’ disk are the final ones.

Now she has to find space somewhere on the server to run the analysis. She manages to free up enough space to transfer the data onto one of the fast disks on the server, but only by deleting some files she thinks are intermediate steps from James’ previous work. She’s uncomfortable doing it, but she doesn’t have enough grant money to buy more fast disk this year.

Coming back the next day, she has a set of result files. Now, how to record the fact that these files were derived from the original RNA-Seq data? She tacks on one more suffix on the file names, to show which step of the analysis she is at. This file name is the handle she will use in her notes.

While her other scripts are chugging along, she starts to think about how to back up and safely store the original RNA-Seq data. She knows that the sequencing service does not guarantee anything. The university has no storage facility for this kind of data. Using box.com, which the university has said is the official solution, is out of the question. Storing 600 GB there? Try uploading that, good luck! It seems she will have to do it the usual, ugly way: go to a computer store and buy a couple of 1 TB disk drives, one to keep at home, and one for the lab. That ought to do it.

So, what is needed?

Here is an attempt at distilling the unmet needs experienced by Rosalind. The ideal storage system should satisfy the following requirements:

  1. It must provide enough storage with good I/O performance for analysis to be performed.
  2. Allow computation and analysis, i.e. be located so that CPU resources can access the data efficiently.
  3. Allow stowing away data that is not going to be analysed just right now.
  4. Allow archiving data that is to be stored indefinitely.
  5. Data that has been stowed or archived must be easily (ideally transparently) moved back to high-performance storage, for renewed analysis.
  6. Be safe, i.e. not become corrupted or changed inadvertently.
  7. Be secure, i.e. not be accessible to or modifiable by others before the researcher is ready to release it, and protected where privacy is a concern.
  8. Allow collaboration with involved partners, which may be located globally.
  9. Allow the data to be tagged or otherwise enriched by metadata, during analysis, and for later publication.
  10. Allow files to be frozen, i.e. protected from being overwritten.
  11. All files should have an MD5 hash value associated with them. This can be used as a file reference, as well as an integrity mark.
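Requirements 10 and 11 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a proposal for the actual system: the function names (`md5_of_file`, `freeze`, `is_unchanged`) are hypothetical, and "freezing" is approximated here by removing the write permission bits on a POSIX file system, which guards against accidental overwrites but is not a real guarantee.

```python
import hashlib
import os
import stat


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that very large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def freeze(path):
    """Remove all write permission bits from the file (hypothetical
    stand-in for 'frozen'; assumes POSIX-style permissions)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def is_unchanged(path, recorded_md5):
    """Compare a file against a hash recorded earlier, for example
    in a notebook entry or in supplementary material."""
    return md5_of_file(path) == recorded_md5
```

Had the hash of the final result file been recorded in the notebook, Rosalind could have called `is_unchanged` on the file she found on James’s disk instead of taking it on trust.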

This list is neither complete, nor well structured. Please feel free to comment below.

The basics of a solution

Here are some ideas about the basic design of a storage system that I believe would help to meet the needs discussed above. This is obviously rather short on detail. But I hope it can help get the discussion going.

Cloud, and yet localized

Storage services such as Amazon S3, box.com and Dropbox should be inspirational models. Their object-based storage design, in which each file is stored and retrieved as a unit, is appropriate for files that are processed as a whole. These services are not appropriate for databases where items or records in a large data set are accessed in a non-deterministic manner.

Many applications in the life sciences do process files as entire blocks of data. The main problem for life science researchers is that analysis is very I/O intensive, meaning that the CPUs must have efficient access to the disks where the data is located. If, for instance, Amazon S3 is used, then computation must use Amazon’s own compute service EC2, otherwise the operation becomes both too expensive and too inefficient. This is not even possible for box.com and Dropbox, since there are no fast CPU resources connected to those.

So for academic life science, we would like to have a cloud system that allows files to be relocated to the compute center where the analysis is to be done. Ideally, this would happen transparently when the scripts accessing the data are started, but a simple web interface to control where data files are to be moved would be a reasonable solution.

Different levels of storage

Storage is expensive. Some technical solutions are cheaper than others, but are also slower. This is likely to be the case for the foreseeable future. So we need a system where only the data that we really need to process (right now!) is located on the storage having the best performance, since this is the most expensive.

We also need a system for archiving. Swedish universities are by law required to store data for at least 10 years, but of course science needs archiving for its own purposes, and for longer than 10 years. This kind of storage basically just has to be durable. It would be OK if it is very slow to access. I think most researchers can live with a solution where archived data takes on the order of a few days to bring back to some faster medium.

Considering the current setup of computing centers in Sweden, it might be reasonable to have an additional layer of half-fast storage in between the other two extremes. For instance, some data sets are required multiple times as reference for some analysis, and it might be useful to have that data located on a cheaper system than the fastest storage. But this is a question that the technical experts can decide. Is the additional complexity worth the savings?

Allocation of storage: Quotas? Grants? Payment?

The trickiest question of all is really how to regulate the amount of storage space a researcher is allocated. Experience shows that a researcher will fill up her allocated storage space, no matter how large. To provide researchers with as much storage as they want is of course not sustainable, or even possible. So some quota system is required.

Quotas can work without involving payments. Quotas can be allocated based on many different criteria, such as the nature of the research, excellence judged in some way, or how efficient the researcher has shown herself to be in using computing resources.

Quotas are, however, also a major pain in the neck. If the researcher needs to do some unanticipated analysis, and that is often what science is all about, then she may have to apply for a larger quota than usual. How is that going to be handled in a fair and efficient way?

The idea described above of a three-layer (or two-layer) storage system could help. By setting fairly strict quotas on the most expensive storage, but allowing very large quotas on the slow archival storage, the responsibility for shuffling data appropriately will be pushed to the researcher. That’s where it must be. No one else can decide which data should be moved from one system to the other. Time limits for having a data set on fast storage, for instance, are not going to work. It is just not possible to say how long it will take to develop a novel analysis method, for instance.
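To make the layered quota idea a little more concrete, here is a rough Python sketch. Everything in it is an assumption made for illustration: the `Tier` class, the `QuotaExceeded` exception and the `move_dataset` function are hypothetical names, and sizes are plain byte counts. The only point is that the researcher decides what moves, while the system merely refuses a move that would exceed the quota on the target tier.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a move would exceed the quota of the target tier."""


@dataclass
class Tier:
    """One storage layer for one researcher (hypothetical model)."""
    name: str
    quota_bytes: int
    datasets: dict = field(default_factory=dict)  # dataset name -> size in bytes

    def used_bytes(self):
        return sum(self.datasets.values())

    def free_bytes(self):
        return self.quota_bytes - self.used_bytes()


def move_dataset(name, source, target):
    """Move a data set between tiers at the researcher's request.
    The source tier is left untouched if the target quota is too small."""
    size = source.datasets[name]
    if size > target.free_bytes():
        raise QuotaExceeded(
            f"{name} needs {size} bytes, but only "
            f"{target.free_bytes()} bytes are free on {target.name}"
        )
    target.datasets[name] = size
    del source.datasets[name]


if __name__ == "__main__":
    GB = 10**9
    # Strict quota on fast storage, generous quota on the archive.
    fast = Tier("fast", quota_bytes=1000 * GB)
    archive = Tier("archive", quota_bytes=100_000 * GB,
                   datasets={"rna_seq": 600 * GB})
    move_dataset("rna_seq", archive, fast)
    print(fast.free_bytes() // GB, "GB left on fast storage")
```

In a sketch like this, a paid extension would simply raise `quota_bytes` on the fast tier; nothing else in the logic would need to change.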

An additional benefit of a layered system controlled by the researcher is that a payment system can be added on top. If a researcher wants more fast storage than she gets in the quota system, she can pay for it herself, and get that amount added to her quota.

There have been suggestions that universities should pay for the storage used by the researchers. This may be reasonable (I do not know), but it does not solve the issue of making researchers use storage space responsibly. University financing is an orthogonal issue to the problem of how to design a sustainable system for researchers.

Web front end

If the layered system above is realized, then a web interface would be the natural way for a researcher to control which data sets should be shuffled between which modes of storage, and which compute centers. By making transfer requests in the web interface, and following their progress, the scientist can be spared the usual horror of using scp or rsync between machines.

A web front end would also allow researchers to publish specific data sets for papers or manuscripts, providing stable identifiers such as DOIs. Tags should be used to mark up each data set, to allow for searches. A web interface is a requirement for Open Science: A data set that does not appear on the Web, in one way or another, is not Open.

So, there it is. Now, discuss!

9 responses to “A researcher’s view of data handling for life science”

  1. A one-liner for a shame-based quota system:

    du -sh /home/researchers/* | sort -rh > quotas_for_each_researcher.txt && python3 -m http.server

    Then publish it and show regular updates so that everyone can see their own usage/costs.

    Then calculate cost function based on the time the data sits in there and how much of it (a.k.a Amazon’s S3 cost systems for academics, guerrilla style).

    My personal bet is that this will prompt researchers to be more efficient with their own disk space and more responsible about their colleagues’ space and resource usage (and save some heated arguments), a.k.a.:

    https://en.wikipedia.org/wiki/Tragedy_of_the_commons

    Why is this not in place already?

    1. I hate to say it, but my feeling is that researchers have no shame when it comes to using resources for their own research. ”My research is the most important, it’s obvious!”

    2. tallphil

      We had basically this at my old job – not a web page, but a monthly e-mail saying how much storage you were using, what it cost per year and then a list of what files you had and how big they were. Made a dramatic difference – personally I saw loads of big files I had sitting around that I’d forgotten about. I’m sure this is the case now too, I think simply telling researchers what they are currently using would have a dramatic impact.

  2. tallphil

    Great article Per! One general comment – I don’t think that Rosalind should host her group’s data herself for others to use; I think she should use a standard repository such as the SRA / ENA (or equivalent for other data types). No matter how good open access repositories are, specialised and centralised repositories will always be more accessible. They benefit from a) being known about by the community and b) being searchable.

    1. Good point! I did at some point think that maybe we should put together a kind of decision tree, where we show which community repositories exist and should be used for what kind of data. I am sure that would be helpful for many not-so-routine users.

  3. Great! I second Phil’s point on standard repositories. In ELIXIR this will probably be addressed in the context of Data Management Plans. We have plans (underpinned by already available solutions) to establish an online decision tree tool to, among other things, direct researchers to appropriate repositories. We also have this, that Micke wrote a couple of years ago: https://wiki.bils.se/wiki/Publishing_Data#Repositories. 🙂

    Not sure I agree that time limits for having a data set on fast storage will not work. If you haven’t touched the data in six months, I think it is a reasonable trade-off to have to wait another day for your data to be put back on the fast storage.

  4. Scientists might have no shame when it comes to raw disk space, but put a theoretical price tag on bytes (as they do at Amazon) over time, and you’ll see some bytes moving and/or disappearing 😉

  5. Johannes Alneberg

    Great article Per!

    For the sake of the argument I have to take the viewpoint of the individual researcher here, and as such, of course ”My research is most important, it’s obvious!”. After all, that’s what’s paying the bills. If you don’t think your own research is important, you wouldn’t do that research.

    For me, computational infrastructure is just a resource like any other. Having access to a large disk and CPU quota means you have (some of) the resources needed to make great science. Just as a large money grant would give you resources to make great science. The distribution of these resources should thus (as with grants) be done by independent referees. You never hear about researchers not applying for, or declining to accept, money because they think other researchers deserve it more, even though that might be the right thing to do.

    However, wasteful use of disk space in particular is slightly different from the case of money. Maybe the difference is that spending money attracts attention, whereas wasting disk space doesn’t. In that case the shame e-mails might be a good idea.

    There is another side to it as well, though. From the individual researcher’s selfish view, the space used for files is not wasted unless one really knows that the files are completely useless, which never happens. In the case where files are easy to regenerate in, say, 2 hours, there is still a 2-hour benefit of having those files available directly. And if there is no cost to the researcher for keeping them, then keeping them is the most efficient choice for the researcher.

    There is another point as well: in my experience, a researcher who does not use the quota given will only be rewarded (read: punished) with a lower quota in the future (when the needs might be greater), and thus fewer resources to do great science.

    I don’t have a clear solution to this either, but it would have to involve some cost for a user to use disk space and potentially also a reward for users using less space. As with grants, I think a scientific committee needs to assess how large a quota each group should be granted.

    1. Good points. I think compute infrastructure in some ways is similar to a piece of shared equipment in a lab. A freezer, let’s say. Limited space, and if it’s full, something has to be thrown out before anything else goes in.

      Yes, the pattern that someone who doesn’t use a quota will get it reduced is problematic. So is over-subscription, where the sum of all quotas is larger than the space available. Very hard to find a sensible solution. But a simple tool to view current usage for a user, maybe combined with emails if biggish files are not touched in a while, might be one step towards more efficient usage.
