Best Practices
Before you start your experiments
The relevance of the results we generate depends mainly on the quality of the data provided. It is therefore essential that the experimental layout is optimized for the question to be addressed, for example with respect to the number of biological replicates required. Collecting the data in separate batches ("slices") is usually a disadvantage, as it can introduce batch effects. Appropriate blocking of experimental conditions will, in addition, help to identify robust effects with the highest sensitivity. Our experience from many collaborations in which researchers contacted us only after the data had been collected leads us to strongly recommend that you contact us before you start your experiments.
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." R. Fisher
Naming stuff: be consistent and avoid special characters
Whenever you name things such as files, directories, samples or technical parameters, do not use any characters other than letters (A-Z, a-z), digits (0-9) and the underscore ("_"). Yes, white space (" ") is bad.
When encoding details about a sample in its name, e.g. "Wildtype_IP_treatment_2017201", stick to a fixed scheme and separate the items with underscores.
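As an illustration, here is a small Python sketch that checks names against this rule and replaces disallowed characters with underscores. The sample names and the genotype_assay_treatment_date scheme shown are made up for this example.

```python
import re

# Allowed characters: letters, digits and the underscore (no spaces, no other symbols).
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def check_name(name: str) -> bool:
    """Return True if the name uses only allowed characters."""
    return bool(VALID_NAME.match(name))

def sanitize_name(name: str) -> str:
    """Replace every disallowed character (including spaces) with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

# Example names following a fixed genotype_assay_treatment_date scheme (invented values).
samples = ["Wildtype_IP_treatment_2017201", "mutant IP treatment 2017-02-01"]
for s in samples:
    print(s, "->", s if check_name(s) else sanitize_name(s))
```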
For this and further advice on how to work with data in Excel, see: Data Organization in Spreadsheets
Data safety
The most important part of your experiment is the raw data you obtain. It has to be protected from corruption and loss, and you may be asked to provide it to the community upon submission of a related manuscript. Despite this importance, the core facility does not take responsibility for the integrity of your data: the data reside on our computers or servers only during the analysis, and after project completion they will be deleted.
Please make sure that you always keep a copy of the raw data on your own systems and back it up to a location that is physically separate from the primary storage.
Analytical reproducibility
Reproducible analysis is a central requirement for any data-driven study. It ensures that results can be traced, verified and re-run, both by you and by others. Achieving reproducibility requires scripted analysis, controlled software environments and, ideally, version control.
Scripting.
Analyses implemented as scripts (R, Python, command line workflows, etc.) are vastly more reliable than point-and-click operations. Scripts document every analytical decision, allow exact re-execution and avoid accidental deviations between runs.
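As a minimal sketch of what this means in practice, the following Python script turns one analytical step into something that can be re-run exactly; the input file and its columns (sample, group, count) are hypothetical.

```python
"""Minimal sketch of a scripted analysis step (file names and columns are invented).

Running this script end-to-end gives the same result every time,
unlike a sequence of manual point-and-click operations.
"""
import pandas as pd

INPUT = "counts_per_sample.csv"   # hypothetical raw data table
OUTPUT = "group_means.csv"        # derived result, recreated on every run

def main() -> None:
    data = pd.read_csv(INPUT)                       # columns: sample, group, count
    means = data.groupby("group")["count"].mean()   # one deliberate, documented step
    means.to_csv(OUTPUT)
    print(f"Wrote {OUTPUT} from {INPUT} ({len(data)} rows).")

if __name__ == "__main__":
    main()
```

Because the whole step lives in one file, every analytical decision is visible and can be reviewed, re-executed or corrected later.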
Software environment control.
Results depend not only on code but also on the exact versions of software libraries, tools and dependencies. Environment managers (Conda, Mamba, renv, containers, etc.) make these dependencies explicit and allow the same environment to be recreated later. This prevents silent changes in package behaviour from altering results.
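As a lightweight complement to such environment managers (not a replacement for them), an analysis script can also log the package versions it actually ran with, so that a run can later be compared against the recorded environment. The package list below is only an example.

```python
# Record the exact package versions used in a run, so the environment
# can be compared or rebuilt later. Complements Conda/renv/containers.
import sys
from importlib import metadata

def log_environment(packages, logfile="environment_versions.txt"):
    """Write the Python version and the versions of the given packages to a file."""
    with open(logfile, "w") as fh:
        fh.write(f"python {sys.version.split()[0]}\n")
        for pkg in packages:
            try:
                fh.write(f"{pkg} {metadata.version(pkg)}\n")
            except metadata.PackageNotFoundError:
                fh.write(f"{pkg} NOT INSTALLED\n")

# Example call for a few commonly used packages (adjust to your own analysis).
log_environment(["numpy", "pandas", "scipy"])
```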
Version control.
Version control systems such as Git track changes in code and documentation and make analytical development transparent. They help avoid fragmented “final_v2_really_final” scripts and allow consistent collaboration. For complex analyses, version control is the only reliable way to manage evolving pipelines.
Taken together, these practices are essential for scientific reliability. The absence of scripted, versioned and environment-controlled workflows is a major source of irreproducible findings. If in doubt, ask for support before analysis starts; it is much easier to build reproducibility into a project than to repair it afterwards.
Use of large language models (LLMs) in data analysis
LLMs are increasingly used to assist with coding, exploratory analysis and tool development. They can accelerate routine tasks, generate prototype code and help clarify options when planning analytical approaches. At the same time, their output must be treated with caution.
Rapid development, but uncertain validity.
LLMs can produce code that appears plausible yet contains subtle errors, inefficient logic or incorrect assumptions. As analyses become more complex, these issues become harder to detect, and undetected mistakes can compromise entire projects.
Tight control required.
LLM-generated code should never be used without critical review, testing and validation. This includes checking that the code actually performs what it claims to perform, that it matches the study design and that it integrates correctly with the rest of the workflow. Transparency is essential: generated code should be version-controlled, commented and tested just like manually written code.
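As an illustration of what such a check might look like, here is a hypothetical LLM-generated counts-per-million helper pinned down by an explicit test; the function and the numbers are invented for this example.

```python
# Hypothetical example: an LLM has produced a helper that normalises counts to
# counts-per-million (CPM). Before it enters the workflow, its behaviour is
# pinned down with explicit tests rather than trusted because it "looks right".
import numpy as np

def counts_per_million(counts: np.ndarray) -> np.ndarray:
    """LLM-generated candidate function under review (illustrative only)."""
    return counts / counts.sum() * 1e6

def test_counts_per_million():
    counts = np.array([10.0, 30.0, 60.0])
    cpm = counts_per_million(counts)
    assert np.isclose(cpm.sum(), 1e6)            # totals must equal one million
    assert np.allclose(cpm, [1e5, 3e5, 6e5])     # known input, known output

test_counts_per_million()
print("All checks passed.")
```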
Role in research workflows.
LLMs can support experienced analysts, but they do not replace expertise. They should not be used to bypass conceptual understanding of the analysis or to patch gaps in methodological planning. Their safe use requires a clear analytical strategy, strong scepticism and continuous quality control.
We are collecting practical experience with integrating LLMs into data pre- and post-processing workflows. We are happy to share insights and to offer consultation on the responsible, controlled use of LLMs in research settings.