TrESFlow is a Nextflow DSL2 pipeline that preprocesses TrES-seq data from FASTQ files to cell-by-feature matrices.
Install your conda/mamba/micromamba env as follows (conda-forge & bioconda channels):
micromamba env create -n tres
micromamba activate tres
micromamba install pandas polars ipython pysam pybedtools numpy matplotlib seaborn scipy pyarrow upsetplot anndata scanpy matplotlib-venn leidenalg scikit-learn snapatac2
micromamba install screen samtools bwa-mem2 star fastqc multiqc trim-galore deeptools parallel ucsc-bedGraphToBigWig nextflow git gatk4
Download the repo and cd into it:
git clone git@github.com:CSOgroup/TrESFlow.git
cd TrESFlow
Install codon in your env:
./scripts/install_codon_0.16.3.sh --prefix /path/to/env/prefix
Replace /path/to/env/prefix with the prefix of your env, for example /home/ahrmad/micromamba/envs/tres.
The only supported public input contract is a single hierarchical YAML samplesheet:
library_name: Isa
runtime:
  env_prefix: /home/annan/micromamba/envs/tres
  tmpdir: /mnt/dataFast/ahrmad/tmp/TrESFlow_Isa
references:
  species: human
  root: /mnt/dataFast/ahrmad/TrESFlow_References
  ligation_barcode_whitelist: /mnt/dataFast/ahrmad/TrESFlow_References/ligation_barcode_whitelist.txt
  rna_ref_dir: /mnt/dataFast/ahrmad/TrESFlow_References/rna/human/star
  dna_ref_dir: /mnt/dataFast/ahrmad/TrESFlow_References/dna/human/bwa
  dna_blacklist_bed: /mnt/dataFast/ahrmad/TrESFlow_References/dna/human/hg38-blacklist.v2.bed
  dna_chrom_sizes: /mnt/dataFast/ahrmad/TrESFlow_References/dna/human/hg38.chrom.sizes
  dna_effective_genome_size: 2913022398
samples:
  day15:
    groups:
      Normal:
        sb_barcodes: [CAGT, ACGT, GCTA, CGTA]
      Co2:
        sb_barcodes: [GTCA, TGCA, CTGA, TCGA]
    rna:
      reads:
        i1: test_realdata/day15_I1.fq.gz
        r1: test_realdata/day15_R1.fq.gz
        r2: test_realdata/day15_R2.fq.gz
    dna:
      reads:
        i1: test_realdata/day15_DNA_I1.fq.gz
        i2: test_realdata/day15_DNA_I2.fq.gz
        r1: test_realdata/day15_DNA_R1.fq.gz
        r2: test_realdata/day15_DNA_R2.fq.gz
      mark_barcodes:
        H3K27me3: AGGCTATA
        H3K27ac: GCCTCTAT
Notes:
- Omit the rna: or dna: block when a sample has only one modality. At least one modality block must be present for each sample.
- runtime.env_prefix, runtime.tmpdir, references.species, references.root, and references.ligation_barcode_whitelist are required. Runtime and reference paths are no longer accepted as normal CLI parameters.
- groups.<group>.sb_barcodes remains supported for single-tagmentation samples. Use rna_sb_barcodes and dna_sb_barcodes when RNA and DNA sample barcodes differ; dna.tagmentation: dual requires explicit 3 nt dna_sb_barcodes.
- references.rna_ref_dir is required when RNA samples are present and must point directly to the STAR index directory.
- references.dna_ref_dir, references.dna_blacklist_bed, and references.dna_effective_genome_size are required when DNA samples are present.
- dna.mark_barcodes is the source of truth for DNA modality barcodes.
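The requirement rules in the notes above can be sketched as a small pre-flight check. This is only an illustration, not the pipeline's own validator: the function name validate_samplesheet and the error wording are hypothetical, and the samplesheet is assumed to have already been parsed into a plain dict (for example with a YAML loader).

```python
# Keys the notes list as always required, as (block, key) pairs.
ALWAYS_REQUIRED = [
    ("runtime", "env_prefix"),
    ("runtime", "tmpdir"),
    ("references", "species"),
    ("references", "root"),
    ("references", "ligation_barcode_whitelist"),
]
# Reference keys that are required only when the matching modality is present.
RNA_REQUIRED = ["rna_ref_dir"]
DNA_REQUIRED = ["dna_ref_dir", "dna_blacklist_bed", "dna_effective_genome_size"]


def _missing(mapping, key):
    """True when the key is absent or has an empty value."""
    return (mapping or {}).get(key) in (None, "")


def validate_samplesheet(sheet):
    """Return a list of problems; an empty list means none of the rules is violated."""
    errors = []
    for block, key in ALWAYS_REQUIRED:
        if _missing(sheet.get(block), key):
            errors.append(f"{block}.{key} is required")

    has_rna = has_dna = False
    for name, sample in (sheet.get("samples") or {}).items():
        sample = sample or {}
        rna, dna = sample.get("rna"), sample.get("dna")
        if rna is None and dna is None:
            errors.append(f"samples.{name} needs at least one of rna: or dna:")
        has_rna = has_rna or rna is not None
        has_dna = has_dna or dna is not None

    refs = sheet.get("references")
    if has_rna:
        for key in RNA_REQUIRED:
            if _missing(refs, key):
                errors.append(f"references.{key} is required when RNA samples are present")
    if has_dna:
        for key in DNA_REQUIRED:
            if _missing(refs, key):
                errors.append(f"references.{key} is required when DNA samples are present")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a user fix every missing key in one pass.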
Committed examples:
- smoke test: assets/samplesheet.example.yaml
- canonical real example: assets/samplesheet.real.example.yaml
- generic template: assets/samplesheet.template.yaml
RNA core:
- TAG_RNA_SAMPLE_BARCODE
- TAG_RNA_UMI
- TAG_RNA_CELL_BARCODE
- TRIM_RNA_FASTQS
- SPLIT_RNA_READS
- FQ_TO_SAM
- RNA_STARSOLO_ALIGN
- RNA_FILTERED_BAM
- RNA_COVERAGE
DNA core:
- TAG_DNA_SAMPLE_BARCODE
- TAG_DNA_MODALITY_BARCODE
- TAG_DNA_CELL_BARCODE
- TRIM_DNA_FASTQS
- SPLIT_DNA_READS
- ALIGN_DNA
- MARK_DUPLICATES_DNA
- SPLIT_DUPLICATES_DNA
- BAM_COVERAGE_DNA
Architecture/DAG:
RNA publishes:
- rna_split_fastqs/
- rna_align/
- TrES_Stats/
- pipeline_info/
DNA publishes:
- dna_split_fastqs/
- dna_align/
- TrES_Stats/
- pipeline_info/
The runtime contract comes from the samplesheet runtime: block. runtime.tmpdir
is exported as TMPDIR for pipeline tasks and can become very large on real runs.
The pipeline creates the directory if it is missing and fails if it is not writable.
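The tmpdir behaviour described above (create the directory if missing, fail if it is not writable, export it as TMPDIR) could look roughly like the following. This is a sketch only: ensure_tmpdir and task_env are hypothetical helper names, and the pipeline itself may implement the check differently.

```python
import os


def ensure_tmpdir(path):
    """Create the directory if it is missing and fail if it cannot be written to."""
    os.makedirs(path, exist_ok=True)
    # W_OK alone is not enough for a directory: creating files inside it
    # also needs search (execute) permission.
    if not os.access(path, os.W_OK | os.X_OK):
        raise PermissionError(f"runtime.tmpdir is not writable: {path}")
    return path


def task_env(tmpdir, base_env=None):
    """Return a copy of the environment with TMPDIR pointing at the checked directory."""
    env = dict(os.environ if base_env is None else base_env)
    env["TMPDIR"] = ensure_tmpdir(tmpdir)
    return env
```

Because runtime.tmpdir can become very large on real runs, it is worth pointing it at a filesystem with enough free space before launching.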
Reference paths are explicit in the samplesheet:
references:
  species: human
  root: /mnt/dataFast/ahrmad/TrESFlow_References
  ligation_barcode_whitelist: /mnt/dataFast/ahrmad/TrESFlow_References/ligation_barcode_whitelist.txt
  rna_ref_dir: /mnt/dataFast/ahrmad/TrESFlow_References/rna/human/star
  dna_ref_dir: /mnt/dataFast/ahrmad/TrESFlow_References/dna/human/bwa
  dna_blacklist_bed: /mnt/dataFast/ahrmad/TrESFlow_References/dna/human/hg38-blacklist.v2.bed
  dna_chrom_sizes: /mnt/dataFast/ahrmad/TrESFlow_References/dna/human/hg38.chrom.sizes
  dna_effective_genome_size: 2913022398
references.rna_ref_dir is passed directly to STAR as --genomeDir. The directory must contain Genome, SA, SAindex, chrName.txt, chrLength.txt, chrStart.txt, chrNameLength.txt, and genomeParameters.txt.
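Before a long run it can help to confirm that the STAR index directory is complete. The snippet below is a minimal sketch of such a check; the file list is taken from the sentence above, and missing_star_index_files is a hypothetical helper, not part of TrESFlow.

```python
from pathlib import Path

# Files that references.rna_ref_dir must contain, as listed above.
STAR_INDEX_FILES = (
    "Genome",
    "SA",
    "SAindex",
    "chrName.txt",
    "chrLength.txt",
    "chrStart.txt",
    "chrNameLength.txt",
    "genomeParameters.txt",
)


def missing_star_index_files(rna_ref_dir):
    """Return the required STAR index files that are absent from the directory."""
    root = Path(rna_ref_dir)
    return [name for name in STAR_INDEX_FILES if not (root / name).is_file()]
```

An empty list means every listed file is present; the check says nothing about whether the index matches the intended genome or STAR version.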
references.dna_ref_dir must contain exactly one complete bwa-mem2 sidecar set. The pipeline infers the prefix from files such as hg38.fa.0123, hg38.fa.amb, hg38.fa.ann, hg38.fa.bwt.2bit.64, and hg38.fa.pac.
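One way to infer the index prefix from the sidecar files is sketched below. It is an illustration under assumptions, not the pipeline's code: infer_bwa_mem2_prefix is a hypothetical name, the five suffixes are the ones listed above, and incomplete sets are simply ignored rather than reported.

```python
from pathlib import Path

# Sidecar suffixes written by bwa-mem2 index, as listed above.
BWA_MEM2_SUFFIXES = (".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")


def infer_bwa_mem2_prefix(dna_ref_dir):
    """Return the single index prefix in the directory, or raise ValueError."""
    root = Path(dna_ref_dir)
    seen = {}  # prefix name -> set of sidecar suffixes found for it
    for path in root.iterdir():
        if not path.is_file():
            continue
        for suffix in BWA_MEM2_SUFFIXES:
            if path.name.endswith(suffix) and len(path.name) > len(suffix):
                seen.setdefault(path.name[: -len(suffix)], set()).add(suffix)
                break

    # A prefix counts only if all five sidecar files are present.
    complete = sorted(p for p, found in seen.items() if found == set(BWA_MEM2_SUFFIXES))
    if len(complete) != 1:
        raise ValueError(
            f"expected exactly one complete bwa-mem2 index in {root}, found {len(complete)}"
        )
    return str(root / complete[0])
```

For a directory holding hg38.fa.0123, hg38.fa.amb, hg38.fa.ann, hg38.fa.bwt.2bit.64, and hg38.fa.pac, the function returns the directory path joined with hg38.fa, which is the form of prefix that bwa-mem2 mem takes as its index argument.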
The main remaining runtime CLI parameter is --max_cpus.
Default local CPU budget:
--max_cpus 64
- RNA_STARSOLO_ALIGN reserves up to 48 cores.
- RNA_COVERAGE and BAM_COVERAGE_DNA reserve up to 32 cores.
- ALIGN_DNA reserves up to 48 cores and passes that value to bwa-mem2 and samtools.
- RNA_FILTERED_BAM, trim, split, and duplicate-filter helper steps reserve up to 16 cores.
- tagging processes default to 6 cores and 64 GB memory.
- FQ_TO_SAM and MARK_DUPLICATES_DNA stay at 1 core.
- These are scheduler reservations derived from --max_cpus; Nextflow still prevents local tasks from exceeding the configured CPU budget.
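To get a feel for how the reservations change with a smaller budget, the sketch below assumes that each process reserves the smaller of its cap and --max_cpus. That rule is an assumption made for illustration; the actual derivation lives in the pipeline configuration and may differ. The caps are the ones listed above, and reserved_cpus is a hypothetical helper.

```python
# Per-process caps from the list above (only processes named explicitly there).
CPU_CAPS = {
    "RNA_STARSOLO_ALIGN": 48,
    "ALIGN_DNA": 48,
    "RNA_COVERAGE": 32,
    "BAM_COVERAGE_DNA": 32,
    "RNA_FILTERED_BAM": 16,
    "FQ_TO_SAM": 1,
    "MARK_DUPLICATES_DNA": 1,
}


def reserved_cpus(process, max_cpus=64):
    """Assumed rule: reserve min(cap, max_cpus), never fewer than one core."""
    return max(1, min(CPU_CAPS[process], max_cpus))


if __name__ == "__main__":
    # Show the assumed reservations for a 32-core budget.
    for name in CPU_CAPS:
        print(name, reserved_cpus(name, max_cpus=32))
```

Under this assumption, running with --max_cpus 32 would limit the two aligners to 32 cores each while leaving the 16-core and 1-core steps unchanged.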
Every run writes:
- ${outdir}/pipeline_info/execution_report.html
- ${outdir}/pipeline_info/execution_timeline.html
- ${outdir}/pipeline_info/execution_trace.tsv
- ${outdir}/pipeline_info/flowchart.html
- ${outdir}/pipeline_info/runtime_contract.tsv
The active runtime scripts live under scripts/core_runtime/. upstream/source_scripts/ is kept only as provenance for the vendored core code.
NXF_OFFLINE=true nextflow run . \
--samplesheet /mnt/dataFast/ahrmad/TEST_NF/isa_multiome.yaml \
--outdir /mnt/dataFast/ahrmad/TEST_NF/TrESFlow_Isa \
--max_cpus 32