Dat in the Lab at UC Merced: Reproducible bioinformatics pipelines in marine ecology

This week the Dat in the Lab team was hosted by the UC Merced Library for a visit to Mike Dawson’s lab. We learned about jellyfish in Palau’s marine lakes (more here), sea star wasting disease, and an exciting environmental DNA (eDNA) project (part of CALeDNA).

Mike Dawson describes differences in fish found in Palau's marine lakes.

Researchers in Dawson’s lab manage a collection of physical specimens (such as preserved jellyfish collected on research trips), extracted DNA and other raw material from samples, as well as genomic data in various stages of processing and annotation.

The Dawson Lab collects samples in the field with the Coral Reef Research Foundation and our collaborators Herwig Stibor, Philippe Pondaven, and Stephan Behl from Munich (Germany) & Brest (France). For more pictures from the Dawson Lab, check out their photo gallery.

The samples come back to the lab (where they are stored in some impressively organized -20 °C freezers), and DNA is extracted and processed to generate raw DNA sequences. These sequences are used in barcoding and population genomics, and they form the basis of an annotated genome.

Impressive freezers!

The raw sequence data are then input into analysis pipelines to investigate phylogenomics, community ecology, and population genomics in marine species. In the Dawson lab, a researcher may participate in all of these processes during their time in the lab, seeing a dataset through from specimen collection to annotated genome. In other words, a single researcher may manage both physical collections and diverse data types representing all points of the research lifecycle.

Rare plastic Orca sighting in the Dawson Lab.

We met with lab members to discuss their research processes and uncover obstacles and concerns relating to data publishing and data management. Through these conversations, reproducible research computing environments emerged as a challenge for many of the researchers. A research computing environment refers to the operating system, software versions, packages, and other dependencies that are specific to the analysis environment on a specific computer or cluster. A pipeline is a series of data transformation or analysis tools linked together. In genomic research, pipelines are used to take raw data and produce an annotated sequence through a series of (theoretically) reproducible steps.
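To make the idea of a research computing environment concrete, here is a minimal sketch of how one might record it on a single machine. It is an illustration only, not something the Dawson lab or Dat uses: the function name `capture_environment` is hypothetical, and it only sees Python packages, whereas a real bioinformatics environment also includes command-line tools and system libraries.

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Return a dictionary describing the current computing environment.

    Hypothetical helper for illustration: it records the operating system,
    the Python version, and the installed Python packages only.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved next to the analysis.
    print(json.dumps(capture_environment(), indent=2))
```

A snapshot like this is itself data, so it can be versioned alongside the datasets it describes.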

While details of a computing environment may be published with an analysis pipeline or publication, running another group's pipeline or incorporating parts of a published pipeline is not always straightforward. Researchers in the Dawson lab want to run pipelines from other groups easily, and they want to package up their pipeline for data transformation, analysis, and annotation so that it can be easily run by other groups. The analysis pipelines themselves are research outputs. Any improvement in documentation and reusability of these resources will enhance the impact of the research by improving the ability of others to use the pipeline.
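As a toy illustration of a pipeline as a series of steps linked together, the sketch below chains three small functions over a list of sequence reads. The step names, the minimum read length, and the example reads are all made up for this post; real pipelines in the lab chain far more complex tools.

```python
def clean_reads(reads):
    """Strip whitespace and upper-case each read."""
    return [r.strip().upper() for r in reads]


def drop_short_reads(reads, min_length=4):
    """Discard reads shorter than min_length (an arbitrary example value)."""
    return [r for r in reads if len(r) >= min_length]


def drop_ambiguous_reads(reads):
    """Discard reads containing characters other than A, C, G, or T."""
    return [r for r in reads if set(r) <= set("ACGT")]


def run_pipeline(reads, steps):
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        reads = step(reads)
    return reads


# The pipeline itself is just an ordered list of steps, which makes it
# easy to document, share, and rerun.
PIPELINE = [clean_reads, drop_short_reads, drop_ambiguous_reads]

if __name__ == "__main__":
    raw = [" acgtac\n", "ac", "ACGNNT", "ggcatt"]
    print(run_pipeline(raw, PIPELINE))  # ['ACGTAC', 'GGCATT']
```

Writing the pipeline down as an explicit, ordered list of steps is one small way to make it easier for another group to rerun or borrow parts of it.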

Dannise V. Ruiz-Ramos described the pipeline she is working with to annotate the sea star genome.

In our meeting with Mike Dawson, he raised the issue of continuity in a research group. How does a group leader handle it when several students or postdocs all get amazing jobs within the same year? Running a group that supports PhD and postdoctoral training positions means that people will move on. When a person leaves one research group, the institutional knowledge they have accrued goes with them.

From the perspective of the group leader, it is important to capture as much of that knowledge as possible and use it to smoothly onboard new group members. Data management is critical in this process, but capturing the research computing environment (and other aspects of workflow not typically thought of as data) is just as critical as preserving the raw datasets and physical specimens. Well-documented, reproducible pipelines will save the group time and effort and provide training for new group members.

Unlovable samples will be dealt with.

What can Dat do to make bioinformatics pipelines more reproducible? Dat enables data version control. All the information that describes a research computing environment is data, but giving a researcher a text file detailing the software and operating system does not guarantee that the researcher will be able to reproduce the environment. Can Dat take some of the work out of this? Can we leverage Dat’s data version control to create containerized virtual environments, building on our dat-container prototype? While a full solution to the reproducible research environment problem may not be possible within the term of this project, we are actively pursuing it and will see where it takes us.
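The gap between describing an environment and reproducing it can be seen in a small sketch. Assuming a manifest that maps Python package names to recorded versions (the manifest format, the function name `check_manifest`, and the package name `hypothetical-aligner` are all invented for this example), the code below can report where the current machine differs from the record, but it cannot rebuild the environment. That rebuilding step is the part a containerized approach would need to cover.

```python
from importlib import metadata


def check_manifest(manifest):
    """Compare recorded package versions with what is installed here.

    Returns a dict mapping package name to (recorded, installed) for every
    mismatch; installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, recorded in manifest.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != recorded:
            mismatches[name] = (recorded, installed)
    return mismatches


if __name__ == "__main__":
    # Hypothetical manifest, as it might be shared in a text file.
    manifest = {"hypothetical-aligner": "1.2.3"}
    for name, (recorded, installed) in check_manifest(manifest).items():
        print(f"{name}: recorded {recorded}, installed {installed}")
```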

After meeting with research group members, we took in the sights of Merced.

Library staff joined the Dawson lab for the Dat tutorial.

The next morning, the UC Merced Library hosted a workshop on Dat (try it at home). The goal of the workshop was to take participants through basic Dat functionality and spark discussion on potential features. Thanks to our UC Davis beta-testers, the UC Merced workshop was a bug-free experience. Like the group at UC Davis, researchers expressed an interest in multi-writer functionality to allow multiple people to author a dataset. We also discussed archiving the lab’s data, the point at which a lab may not reasonably be able to store all versions of all datasets, and how to make data versioning decisions when working with analysis pipelines.

Emily Lin, Head of Digital Curation and Scholarship, toured the Dat team around the UC Merced Library, including a stop in the spectacular McFadden-Willis Reading Room.

This visit underscored that focusing on researcher workflows and feedback is critical for us. We have learned so much from the visits to UC Davis and UC Merced, and the feedback from researchers has already influenced our priorities. What will we learn over the next year as we continue to work with these researchers? Stay tuned over the next few weeks for more detailed posts on the initial projects that we are pursuing at each institution. Thanks to everyone at UC Merced for a productive visit!

UC Merced campus visible in the distance across the fields.