There’s a widely held view in biology that one dataset should equal one paper. Using the same data in two papers is often viewed with suspicion. To readers, it may seem that the authors are trying to get twice the academic credit for a given amount of work. This even has a name, ‘double-dipping’. (Note, this is distinct from publishers’ ‘double-dipping’ to collect both author open access fees and institutional subscriptions.)
I recently double-dipped. Triple-dipped, quadruple-dipped, and more, actually. I was very hesitant to do so, and yet I am happy with the outcome. I’m writing this post to explain what I did, why I did it, and why I think double-dipping can be an excellent option for authors, readers, and funding agencies. So much so that I think funding agencies should consider fellowships to support salaries of students, postdocs, and perhaps even faculty to revisit published data and squeeze out more insights.
What I did.
In 2009, I received funding from the Howard Hughes Medical Institute to pursue a study of how stickleback immune genes evolved across complex landscapes of populations connected by varying degrees of gene flow (for hosts and parasites). The focus initially was meant to be on MHC. Our very first order of business was to set the stage for the evolutionary genetics work by learning the basic natural history of the host-parasite system I intended to study. What parasites are present in the area I worked in on Vancouver Island? To what extent do they differ from one stickleback population to the next? To what extent is this parasite community variation attributable to abiotic variables, biotic communities, fish population traits, or fish genotypes? By answering such basic natural history questions, we get a foundation for choosing the most interesting populations for evolutionary genetic contrasts.
To collect such data, we conducted a field survey in May 2009. “We” included my graduate student Will Stutz (who helped conceive of the project in the first place), a new PhD student Yuexin Jiang, two undergraduates (Chris Thompson and Todasporn Rodbumrung), UBC graduate student collaborator Travis Ingram, and a high-school biology teacher Kim Hendrix (most of us pictured above).
Over the course of four weeks on Vancouver Island we sampled ~100 stickleback from each of 45 populations, a mix of lake, stream, and estuary sites. Then from fall 2009 through 2013 I employed Julie Day and then Kim Ballare as research technicians to count parasites, characterize stomach contents, and measure ecomorphology for ~3500 fish specimens. From 2013 to 2014, Hollis Woodard worked as a technician in my lab to genotype a couple thousand stickleback for MHC, and Yoel Stuart helped her do ddRADseq on a subset of individuals to get neutral genetic markers. The result was an enormous natural history dataset on diet, morphology, infection, MHC, and >100,000 SNPs, all set within a geographic context of varying lake sizes, elevations, etc., scattered across watersheds on north-eastern Vancouver Island.
The resulting data set was vast, and intimidating. I had all the data in hand by sometime in 2014, and it took me nearly a year of on-and-off-again work just to organize and curate the data: checking for errors, odd outliers, misspelled population IDs, and all those little things that can creep into a dataset that has been handled by many different people. For several years in a row, the data would lie fallow for months on end, then I would find a bit of time to work on analyses, only to set it aside and start over half a year later. The problem was that this was nobody’s primary dataset. It was meant to be exploratory (with some a priori predictions, to be sure), and there was so much to explore, and I was the sole person delving into these explorations for a long time. Each summer I’d spend a couple of weeks on Cape Cod with my family, and I’d sneak in some time at a great French patisserie, spending a few hours analyzing these data while my kids were in summer camp (photo below). Over time I built up a set of analyses that answered my a priori questions and went a step further, describing the spatial structure of the parasite metacommunity in great detail.
Thousands of lines of R code later, it was time to write. But I pretty quickly found that the writing built up to over 100 pages of text. I was writing a book, and not even I want to read a book-length document on the natural history of stickleback parasites on Vancouver Island. The trick was, there are so many distinct questions that can come from a dataset of this size. Do we analyze parasite species richness? Or multivariate composition? Do we analyze each species of parasite separately, or via ordination in one group? Do we include genetics, or host diet, or lake abiotic conditions? These are all interesting, and all tell us something different, but doing all of them simply took way too much text; it would strain the interest of any but the most dedicated readers.
At some point, in stepped Emlyn Resetarits (a PhD student with Mathew Leibold and myself), who helped convince me to split this up into bite-sized parts. The result:
1) A paper focused on parasite metacommunity composition – which species are found where, which species are found together or apart, and what predicts this variation? We threw in a big GWAS study of many parasites in the appendix, which might have been a paper unto itself. Bolnick et al 2020 Ecology
2) A paper focused on parasite metacommunity diversity – not so much who is found where, but how diverse they are, which revealed a richly different story than species composition alone. Bolnick et al 2020 Ecography
3) A third paper set aside the parasite information to focus on the evolutionary ecology of stickleback diet and individual specialization. This was revisiting a topic that was core to my academic beginnings, which I hadn’t touched in a few years. But the dataset on stickleback diets (collected to understand infection patterns) was also exactly something I’d hoped to obtain for years. This turned out to give a beautifully clear and intuitive result: generalist populations (eating roughly equal mixes of benthic and limnetic prey, in mid-sized lakes) had the greatest dietary and phenotypic diversity. But these functional variances were unrelated to genomic heterozygosity, which increased steadily with lake size. In short, neutral genomic diversity and functional ecological diversity were unrelated, and responded to entirely different features of the populations’ environments (Bolnick and Ballare, 2020, Ecology Letters).
4) That Ecology Letters paper happened to include, in passing, a GWAS analysis of SNPs related to lake size. Are there loci whose allele frequency varies predictably between smaller versus larger lakes, and whose heterozygosity was largest in mid-sized lakes? Well, at an American Naturalist conference right around when this paper came out, Diana Rennison and I compared notes. I had this GWAS between benthic versus limnetic allopatric lake populations, and she had population genomic data for benthic versus limnetic species pairs in sympatry. Why not compare these? Harer et al 2020 Molecular Ecology was the result. Remarkably, the benthic-limnetic species pairs show both more repeatable evolution and greater divergence than allopatric populations.
5) Most recently, we finally got to the original reason for this data collection: Major Histocompatibility Complex genetic diversity. MHC (here MHC IIb) is among the most diverse genes in the vertebrate genome, usually said to be under balancing or frequency-dependent selection that maintains this diversity. The whole point of this survey was to determine the diversity of MHC and its association with parasite load and diversity. Well, with help from Stijn de Haan we finally ran the bioinformatics pipeline to identify alleles and genotype individuals (using a bioinformatics protocol developed by Will Stutz, the PhD student who first planned this study with me). Then Foen Peng adopted the dataset to run statistical analyses and write. The resulting paper was posted to Molecular Ecology just a few days ago: Peng et al 2021 Molecular Ecology. Disappointingly (i.e., contrary to our starting motives), very little about the parasite community tells us anything about MHC. Instead, MHC diversity seems to be best predicted by neutral genomic diversity, not parasite diversity. And MHC divergence between populations is best predicted by genomic Fst, not parasite or ecological differences. This only deepens the puzzle of MHC diversity for us, because it certainly is insanely diverse, yet we have now twice failed to find a clear adaptive explanation for this variation (see also Stutz et al 2017 Molecular Ecology, which used a different set of populations and a different analytical approach).
6) Ultimately, I’m glad to say we largely moved away from the MHC focus, which seems not to matter for the parasite that engages us most, the cestode Schistocephalus solidus (pictured below), and instead started doing QTL mapping, expression, and GWAS analyses. The data set collected in 2009 for some basic natural history proved to be extremely useful in motivating and guiding our genetic mapping studies (manuscripts in prep, and also Weber et al 2017 PNAS).
Why I did it: the benefits of N-dipping
Now, apologies for what must seem like a lengthy advertisement for a facet of my lab’s recent work (okay, it kind of is an ad). But I have a broader goal with this post. Here we had one survey, one dataset, that has yielded five papers (and more in the queue). That’s some serious data recycling. So is it ethical? Absolutely yes; indeed I’d say it’s morally preferable.
Here’s why. First, each of the papers cited above asks an entirely different question of the data. The biggest overlap is between Peng et al 2021 and Stutz et al 2017, but these used two different datasets, and different analytical approaches, to ask the same question. Stutz et al used parapatric populations to take advantage of gene flow; Peng et al used allopatric populations but far more of them, with the added bonus of ddRADseq genomic data. And the MHC data analyses don’t really make sense until you have grappled with covariation between parasite species and described their diversity, so we had to tackle the Ecology and Ecography papers first. So, the papers support one another, but they certainly aren’t redundant conceptually.
Second, we could have put all this into one paper, but it would have been a 150-page behemoth. You don’t want to read that, and I don’t want to write it. In fact, I did write it. At least, the Ecography, Ecology, and Ecology Letters papers were all mashed into one >100-page manuscript at one point, and the MHC data would have added another >50. And it was so hard to keep clear threads of which analyses went with which results. Breaking it up made it easier to read and understand.
Third, funding agencies put significant investment into this dataset, which required salary for four technicians and two postdocs to pull together. Aren’t we morally obligated to wring every bit of insight out of those hard-won data?
A call to funds
This last question leads me to a suggestion. Many of us accumulate datasets as our careers progress. Often these data sets have a rich, multi-layered nature, but we publish the most exciting bit of data and then move on. This is partly because we prioritize publishing the highest-impact work we can, and for career advancement we are better off setting aside less exciting findings to spend time on the splashiest stuff. But there is also a financial angle. Data analyses, and re-analyses, and writing, take time. Which takes money. Most Associate Professors and Professors have sedimentary layers of unpublished or incompletely published analyses. These took funds to generate. It’s a shame to have their results only partly published, with important components gathering dust. The reason is that when I apply for my next grant, the review panel will want to see evidence of a new plan of action. What data will I collect next, what experiment will I conduct, what survey or model will I design? Revisiting older data to wring out more precious drops of insight? Not fundable. Well, it should be. These data contain more insights. They are hard-won, costly to obtain, and carry more than one paper’s worth of knowledge and lessons. So I think it would be great if NSF or other funding agencies would support fellowships, whether for grad students, postdocs, or faculty, to revisit and repurpose existing data to achieve new ends. The data are there; we just need the time to delve deeper.
So to conclude, I think we need to encourage people to use their hard-won data more efficiently and thoroughly. That requires funds, and social support for the practice. It also has the interesting side-effect that we end up with interconnected papers strewn across many different journals, which build a much larger holistic story when viewed together. I’m intrigued by the notion of bundling published papers to create a narrative arc that transcends a single paper in a single journal. Consider the above description of a set of papers to be such a bundled set.