Files
biobox/docs/DOCKER_GUIDE.md
CI 04a5851ff8 Build branch biobox/main with version main to biobox on branch main (7158daa)
Build pipeline: viash-hub.biobox.main-tb4cv

Source commit: 7158daa5f6

Source message: Fix bases2fastq component, update to latest practices (#190)

* wip updates

* refactor component

* assume bases2fastq follows semver

* fix version command

* add entry to changelog

* move to minor changes
2025-09-01 11:04:56 +00:00

311 lines
7.2 KiB
Markdown

# Docker and Engine Best Practices
This guide covers best practices for setting up Docker engines and managing dependencies in biobox components.
## Table of Contents
- [Preferred Approach: Biocontainers](#preferred-approach-biocontainers)
- [Finding Biocontainers](#finding-biocontainers)
- [Version Detection](#version-detection)
- [Docker Run Syntax](#docker-run-syntax)
- [Custom Containers](#custom-containers)
- [Recommended Base Containers](#recommended-base-containers)
- [Multi-tool Containers](#multi-tool-containers)
- [Container Optimization](#container-optimization)
- [Testing Docker Setup](#testing-docker-setup)
## Preferred Approach: Biocontainers
### Basic Setup
```yaml
engines:
- type: docker
image: quay.io/biocontainers/bowtie2:2.5.4--he96a11b_6
setup:
- type: docker
run:
- bowtie2 --version 2>&1 | head -1 | sed 's/.*version /bowtie2: /' > /var/software_versions.txt
```
### Key Requirements
1. **Use specific versions**: Always pin to specific versions with build strings
2. **Include version detection**: Add setup commands to create `/var/software_versions.txt`
3. **Verify command availability**: Ensure the container has the required commands from `requirements.commands`
## Finding Biocontainers
### Search Strategy
1. **Google search**: `biocontainer <tool_name>`
2. **Direct URL**: `https://quay.io/repository/biocontainers/<tool_name>?tab=tags`
3. **Check version compatibility**: Choose the most recent stable version
4. **Verify build string**: Include the complete version tag with build string
### Version Selection
```yaml
# Good: Specific version with build string
image: quay.io/biocontainers/samtools:1.17--hd87286a_2
# Bad: Latest or incomplete version
image: quay.io/biocontainers/samtools:latest
image: quay.io/biocontainers/samtools:1.17
```
## Version Detection
### Common Patterns
```bash
# Pattern 1: Tool outputs "Tool version X.Y.Z"
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /' > /var/software_versions.txt
# Pattern 2: Tool outputs just version number
echo "tool: $(tool --version 2>&1 | head -1)" > /var/software_versions.txt
# Pattern 3: Complex version output, extract numeric part
tool --version 2>&1 | grep -E "^[0-9]" | head -1 | sed 's/^/tool: /' > /var/software_versions.txt
# Pattern 4: Version in specific format
tool --version 2>&1 | awk '{print "tool: " $NF}' > /var/software_versions.txt
```
### Real Examples
```bash
# bowtie2
bowtie2 --version 2>&1 | head -1 | sed 's/.*version /bowtie2: /' > /var/software_versions.txt
# samtools
samtools --version 2>&1 | head -1 | sed 's/samtools /samtools: /' > /var/software_versions.txt
# fastqc
fastqc --version 2>&1 | sed 's/FastQC v/fastqc: /' > /var/software_versions.txt
```
### Testing Version Detection
Always test your version detection command:
```bash
# Test in the container
docker run quay.io/biocontainers/tool:version bash -c "
tool --version 2>&1 | head -1 | sed 's/.*version /tool: /'
"
```
## Docker Run Syntax
### List vs Multiline Strings
**Preferred: List format**
```yaml
run:
# Single commands
- command1 arg1 arg2
- command2 arg1 arg2
# Chained commands
- command1 && command2 && command3
```
**Alternative: Multiline strings (for complex commands)**
```yaml
run: |
command1 arg1 arg2 && \
command2 arg1 arg2 && \
command3 arg1 arg2
```
**Important:** Comments inside multiline strings (`run: |`) become Dockerfile `RUN` commands and will break the build. Use comments before the `run:` key or use the list format.
## Custom Containers
### When to Use Custom Containers
Use custom containers when:
- No suitable biocontainer exists
- You need to install additional dependencies
- You need a specific base environment (R, Python, etc.)
### Python-based Tools
```yaml
engines:
- type: docker
image: python:3.10-slim
setup:
- type: python
packages:
- numpy~=x.x.x
- pandas~=x.x.x
- scipy~=x.x.x
```
### R-based Tools
```yaml
engines:
- type: docker
image: rocker/r2u:24.04
setup:
- type: r
cran: [devtools, BiocManager]
bioc: [Biostrings, GenomicRanges]
```
### Compilation from Source
```yaml
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [build-essential, cmake, git, wget]
- type: docker
run:
- wget https://github.com/user/tool/archive/v1.0.tar.gz && tar -xzf v1.0.tar.gz
- cd tool-1.0 && make && make install
- echo "tool: 1.0" > /var/software_versions.txt
```
## Recommended Base Containers
### General Purpose
- **Ubuntu**: `ubuntu:22.04` - Good for compilation and apt packages
- **Alpine**: `alpine:latest` - Minimal size, apk packages
- **Debian**: `debian:bookworm-slim` - Stable, well-supported
### Language-Specific
#### Python
```yaml
# Basic Python
image: python:3.10-slim
# With scientific packages
image: python:3.10
# GPU-enabled
image: nvcr.io/nvidia/pytorch:23.08-py3
```
#### R
```yaml
# Fast package installation
image: rocker/r2u:24.04
# Tidyverse included
image: rocker/tidyverse:4.3.0
# Bioconductor base
image: bioconductor/bioconductor_docker:RELEASE_3_17
```
#### Node.js
```yaml
# LTS version
image: node:18-slim
# Alpine variant
image: node:18-alpine
```
#### Other Languages
```yaml
# Java
image: openjdk:11-jre-slim
# Go
image: golang:1.20-alpine
# Rust
image: rust:1.70-slim
# Ruby
image: ruby:3.1-slim
```
## Multi-tool Containers
### Installing Multiple Tools
```yaml
engines:
- type: docker
image: ubuntu:22.04
setup:
- type: apt
packages: [wget, curl, build-essential]
- type: docker
run:
# Install tool 1
- wget https://tool1.com/download && install_tool1
# Install tool 2
- wget https://tool2.com/download && install_tool2
# Create version file
- echo "tool1: $(tool1 --version)" > /var/software_versions.txt
- echo "tool2: $(tool2 --version)" >> /var/software_versions.txt
```
## Container Optimization
### Layer Efficiency
```yaml
# Good: Combine related commands
setup:
- type: docker
run: |
apt-get update && \
apt-get install -y wget curl && \
wget https://tool.com/download && \
install_tool && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Bad: Separate layers for each command
setup:
- type: apt
packages: [wget, curl]
- type: docker
run: wget https://tool.com/download
- type: docker
run: install_tool
- type: docker
run: apt-get clean
```
## Testing Docker Setup
### Viash Docker Debugging
```bash
# Inspect the generated Dockerfile
viash run config.vsh.yaml -- ---dockerfile
# Build with cached layers (faster)
viash run config.vsh.yaml -- ---setup cachedbuild ---verbose
# Build from scratch (clean build)
viash run config.vsh.yaml -- ---setup build ---verbose
# Enter interactive debugging session
viash run config.vsh.yaml -- ---debug
# Check installed tools (inside container)
which tool
tool --version
# Verify version file
cat /var/software_versions.txt
```
### Common Issues
1. **Command not found**: Tool not in PATH or not installed
2. **Version detection fails**: Command syntax varies between tools
3. **Permission issues**: Tools installed in wrong location
4. **Missing dependencies**: Tool requires additional libraries