Skip to content
Merged

v1.2 #82

Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .github/workflows/pkgdown.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,9 @@ on:
branches: [default]
pull_request:
branches: [default]
release:
types: [published]
branches: [default]
workflow_dispatch:
branches: [default]

Expand Down
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: markeR
Title: An R Toolkit for Evaluating Gene Signatures as Phenotypic Markers
Version: 1.1.2
Version: 1.2
Authors@R:
c(
person("Rita", "Martins-Silva",
Expand Down
2 changes: 1 addition & 1 deletion README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ knitr::opts_chunk$set(

> **To cite `markeR` please use:**
>
> Martins-Silva R, Kaizeler A, Barbosa-Morais NL (2025). _markeR: An R Toolkit for Evaluating Gene Signatures as Phenotypic Markers_. doi:10.18129/B9.bioc.markeR, R package version 1.1.2, https://bioconductor.org/packages/markeR.
> Martins-Silva R, Kaizeler A, Barbosa-Morais NL (2025). _markeR: An R Toolkit for Evaluating Gene Signatures as Phenotypic Markers_. doi:10.18129/B9.bioc.markeR, R package version 1.2, https://bioconductor.org/packages/markeR.

The folder `inst/Paper/` is in the **paper** branch and contains all scripts and materials used in the original `markeR` paper to reproduce analyses and figures. You can browse it [here](https://github.com/DiseaseTranscriptomicsLab/markeR/tree/paper/inst/Paper).

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ across experimental and clinical phenotypes.
>
> Martins-Silva R, Kaizeler A, Barbosa-Morais NL (2025). *markeR: An R
> Toolkit for Evaluating Gene Signatures as Phenotypic Markers*.
> <doi:10.18129/B9.bioc.markeR>, R package version 1.1.2,
> <doi:10.18129/B9.bioc.markeR>, R package version 1.2,
> <https://bioconductor.org/packages/markeR>.

The folder `inst/Paper/` is in the **paper** branch and contains all
Expand Down
118 changes: 75 additions & 43 deletions vignettes/articles/Article_BenchmarkingMode.Rmd

Large diffs are not rendered by default.

66 changes: 36 additions & 30 deletions vignettes/articles/Article_DiscoveryMode.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -12,18 +12,20 @@ knitr::opts_chunk$set(
)
```

This vignette provides a comprehensive introduction to the **markeR** package, focusing on its **Discovery Mode**. The discovery mode was designed for users who are interested in quantifying a gene set in a given dataset to explore associations with phenotypic or clinical variables. This approach is particularly suited to hypothesis generation where the phenotype marked by the gene set is of known biological or clinical relevance.
This vignette provides a comprehensive introduction to the **`markeR`** package, focusing on its **Discovery Mode**. The discovery mode was designed for users who are interested in examining the relationship between a gene set and one or more metadata variables of interest, being suitable for exploratory or hypothesis-generating analyses.

# Case-study: Senescence

We will be using an already pre-processed gene expression dataset, derived from the Marthandan et al. (2016) study (GSE63577), that includes human fibroblast samples cultured under two different conditions: replicative senescence and proliferative control. The dataset has already been filtered and normalized using the `edgeR` package. For more information about the dataset structure, see the help pages for `?counts_example` and `?metadata_example`.
In this vignette, we use a pre‑processed RNA-seq dataset from Marthandan et al. (2016, GSE63577), with normalised read counts for human fibroblasts under **replicative senescence** and **proliferative control**. See `?counts_example` for preprocessing details and structure. `markeR` requires as input a filtered and normalised, non log-transformed, gene expression matrix (genes × samples). Row names must be gene identifiers; column names must match sample IDs in the metadata.

We use the accompanying metadata from the Marthandan et al. (2016) study (see `?metadata_example`).

This dataset serves as a working example to demonstrate the main functionalities of the markeR package. In particular, it will be used to showcase the two primary modules designed for benchmarking gene signatures:
This dataset serves as a working example to demonstrate the main functionalities of the `markeR` package. In particular, it will be used to showcase the two primary modules of `markeR` for quantifying phenotypes using gene sets:

- **Score**: calculates expression-based signature scores for each sample, and
- **Enrichment**: evaluates the over-representation of gene signatures within ranked gene lists.
- **Score-based methods**: log2-median expression, ranking approaches, and single-sample gene set enrichment analysis (ssGSEA) to quantify coordinated expression within a gene set.
- **Enrichment-based methods**: GSEA using moderated t- or B-statistics.

To illustrate the usage of `markeR`, we use the **HernandezSegura** gene set: A transcriptomic gene set identified by Hernandez-Segura et al. (2017) as consistently altered across multiple senescence models. This set includes information on the directionality of gene regulation. It has shown strong performance in classification and enrichment analyses, including in the original paper of markeR, and also in the Tutorial on the *Benchmarking Mode* of `markeR`.
To illustrate the usage of `markeR`, we use the **HernandezSegura** gene set: A transcriptomic gene set identified by Hernandez-Segura et al. (2017) as consistently altered across multiple senescence models. This set includes information on the direction of change in expression of genes upon phenotype.

```{r example}
library(markeR)
Expand All @@ -42,8 +44,12 @@ data(counts_example)
counts_example[1:5,1:5]
```

For illustration purposes of different variable types, let's imagine we also had two additional variables: one indicating the number of days between sample preparation and sequencing (`DaysToSequencing`), and another identifying the person who processed each sample (`researcher`). These variables are hypothetical and not part of the original study design.

For illustration purposes, two synthetic variables were added to the data:

* `DaysToSequencing`, the number of days between sample preparation and sequencing;
* `Researcher`, identifying the person who processed each sample.

This enables exploration of associations between gene set activity and both categorical and continuous variables.

```{r load_metadata}
data(metadata_example)
Expand All @@ -58,24 +64,16 @@ head(metadata_example)

# Score-Based approaches

The **Score** module in `markeR` quantifies the association between a gene signature and phenotypic variables by calculating a score for each sample based on the expression of genes in the signature. This score can then be correlated with other variables, such as clinical or experimental conditions:

- **Quantifies associations** between phenotype variables and a gene signature score using *Cohen's effect sizes* and *p-values*.
- **Visualizes** results through lollipop plots, contrast plots, and distribution plots.

This is useful for identifying:

- **Biological relationships** (e.g., phenotype-score associations)
- **Technical confounders** (e.g., batch effects)

A score summarising the collective expression of a gene set is assigned **to each sample**. Scores can be visualised using built-in functions, or used directly in downstream analyses (*e.g.*, comparisons between phenotypic groups of samples, correlations with numerical phenotypes).

## Outputs

The main function returns a structured list with:
If `method = "logmedian"` (or `ssGSEA`, `ranking`), the main function `VariableAssociation()` returns a structured list with:

- **`overall`**: Effect sizes (*Cohen’s f*) and p-values for each variable.
- **`contrasts`**: For categorical variables, pairwise or grouped comparisons using *Cohen’s d* with BH-adjusted p-values.
- **`plot`**: A combined visualization showing:
1. Lollipop plot of effect sizes (*Cohen’s f*)
1. Lollipop plot of effect sizes (i.e., *Cohen’s f* per variable)
2. Distribution plots of the score by variable (density or scatter)
3. Lollipop plots of contrasts (*Cohen’s d*) for categorical variables, if applicable
- **`plot_overall`**, **`plot_contrasts`**, **`plot_distributions`**: Individual components of the combined plot.
Expand All @@ -96,13 +94,10 @@ For this example, we use:
- The **`logmedian`** scoring method
- **`mode = "extensive"`** for thorough contrast analysis

We also include two synthetic phenotypic variables:
Though artificial, `DaysToSequencing` and `Researcher` mimic potential **technical covariates**. Strong associations between these and the score could indicate **batch effects**, where technical variation may confound biological interpretation.

- **`Researcher`** — a categorical variable representing who processed each sample
- **`DaysToSequencing`** — a numeric variable indicating time between preparation and sequencing

Though artificial, these mimic potential **technical covariates**. Strong associations between these and the score could indicate **batch effects**, where technical variation may confound biological interpretation.

In this example, the `Condition` variable shows a large effect size (Cohen’s f and Cohen's d), confirming strong discrimination between Senescent and Proliferative samples. The remaining variables don't show significant associations, suggesting no major batch effects that might be reflected in the computed scores.

```{r variableassoc_score_sen, fig.width=7, fig.height=7, out.width="100%", warning=FALSE, message=FALSE}
results_scoreassoc_bidirect <- VariableAssociation(data = counts_example,
Expand All @@ -118,20 +113,24 @@ results_scoreassoc_bidirect <- VariableAssociation(data = counts_example,
results_scoreassoc_bidirect$Overall
results_scoreassoc_bidirect$Contrasts
```




# Enrichment-based approaches

The `GSEA_VariableAssociation()` function evaluates how phenotypic variables are associated with **gene set activity**, using enrichment scores derived from gene expression statistics (B- or t-statistics). This allows users to understand whether a gene set is enriched or depleted in relation to different sample attributes.
Enrichment-based methods implement **Gene Set Enrichment Analysis (GSEA)**. Genes are ranked according to differential expression statistics, and a Normalised Enrichment Score (NES) per variable of interest is computed, accompanied by a p-value adjusted for multiple hypothesis testing.


## Outputs

The `GSEA_VariableAssociation()` function returns a list with two elements:
If `method = "GSEA"`, the main function `VariableAssociation()` returns a structured list with:

- **`data`**: A tidy data frame of GSEA results. For each variable contrast, this includes:
- **Contrast**: The comparison performed (e.g., A - B, A - (B+C))
- **Statistic**: The metric used for gene ranking (either *t* or *B*)
- **NES**: Normalized Enrichment Score
- **Adjusted p-value**: Multiple-testing corrected p-value (e.g., Benjamini–Hochberg)
- **Adjusted p-value**: Multiple-testing corrected p-value (Benjamini–Hochberg)
- **Gene Set Name**: The gene set being tested

- **`plot`**: A `ggplot2` object showing the NES and significance of each contrast as a **lollipop plot**:
Expand All @@ -156,8 +155,13 @@ Depending on the statistic used (`t` or `B`), the interpretation of results vari

## Example: Exploring Gene Set Enrichment by Variable

The following code evaluates how three phenotypic variables `Condition`, `Researcher`, and `DaysToSequencing`are associated with the **HernandezSegura** gene set:
The following code evaluates how three phenotypic variables (`Condition`, `Researcher`, and `DaysToSequencing`) are associated with the **HernandezSegura** gene set.


In this example, the HernandezSegura gene set shows significant enrichment in samples sequenced by Francisca relative to those processed by Ana or John, which would suggest a differentiated researcher-associated technical impact on the samples' biological phenotype. This gene set shows also a strong depletion in proliferative samples, which is expected given its annotation as senescence-associated and results from the score-based approach.



```{r GSEA_varassoc, fig.width=6, fig.height=6, out.width="60%", warning=FALSE, message=FALSE}
varassoc_gsea <- VariableAssociation(
data = counts_example,
Expand All @@ -179,6 +183,8 @@ varassoc_gsea <- VariableAssociation(
varassoc_gsea$data
```



# Session Information

```{r}
Expand Down
16 changes: 9 additions & 7 deletions vignettes/articles/Article_GeneSetSimilarity.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -11,18 +11,22 @@ knitr::opts_chunk$set(
)
```

Even if a user-defined gene signature demonstrates strong discriminatory power between conditions, it may reflect known biological pathways rather than novel mechanisms. To address this, the `geneset_similarity()` function computes pairwise **Jaccard indices** or **log odds ratios (logOR)** between user-provided gene signatures and a reference set, quantifying their overlap as a percentage or a statistical enrichment.
Even if a user-defined gene signature demonstrates strong discriminatory power between conditions, it may reflect known biological pathways rather than novel mechanisms. To address this, the `geneset_similarity()` function implements two complementary similarity metrics:

* **Jaccard Index**:
the ratio of the number of genes in common over the total number of genes in the two sets.

* **Log Odds Ratio (logOR)** from Fisher’s exact test of association between gene sets, given a specified gene universe.

Users can compare their signatures to:

* **Custom gene sets**, defined manually, or
* **Custom gene sets**, defined manually;
* **MSigDB collections**, via the [`msigdbr`](https://cran.r-project.org/package=msigdbr) package.

The function provides options to:

* Filter by **Jaccard index threshold**, using `jaccard_threshold`
* Filter by **odds ratio and p-value**, using `or_threshold` and `pval_threshold`
* Limit the number of top-matching reference signatures shown, using `num_sigs_toplot`
* Filter by **Jaccard index threshold**, using `jaccard_threshold`;
* Filter by **odds ratio and p-value**, using `or_threshold` and `pval_threshold`, respectively.

# Similarity via Jaccard Index

Expand Down Expand Up @@ -92,12 +96,10 @@ The log odds ratio (logOR) provides a statistically grounded alternative for ass
- Genes in both sets
- Genes in one but not the other
- Gene universe as background
Log-transformed odds ratios are visualized.

> **Note**: When using `metric = "odds_ratio"`, the `universe` parameter **must** be supplied.



## Example 3: Compare against user-defined and MSigDB gene sets

```{r, fig.width=6, fig.height=8, out.width="60%"}
Expand Down
Loading