Title: | Signature Overrepresentation Analysis |
---|---|
Description: | Pathway analysis statistically links observations at the molecular level to biological processes or pathways at the systems (i.e., organism, organ, tissue, cell) level. Traditionally, pathway analysis methods regard pathways as collections of single genes and treat all genes in a pathway as equally informative. However, this can lead to spurious pathways being identified as statistically significant, since components are often shared among pathways. SIGORA seeks to avoid this pitfall by focusing on genes or gene pairs that are (as a combination) specific to a single pathway. In relying on such pathway gene-pair signatures (Pathway-GPS), SIGORA inherently uses the status of other genes in the experimental context to identify the most relevant pathways. The current version allows for pathway analysis of human and mouse datasets. In addition, it contains pre-computed Pathway-GPS data for pathways in the KEGG and Reactome pathway repositories and mechanisms for extracting GPS for user-supplied repositories. |
Authors: | Amir Foroushani [aut], Fiona Brinkman [aut], David Lynn [aut], Witold Wolski [cre] |
Maintainer: | Witold Wolski <[email protected]> |
License: | GPL-3 |
Version: | 3.1.1 |
Built: | 2024-11-26 03:20:42 UTC |
Source: | https://github.com/wolski/sigora |
This function first randomly selects a number (np) of pathways, then randomly selects a number (ng) of genes that are associated with at least one of the selected pathways. The function can be used to compare Sigora's performance to traditional overrepresentation tests.
genesFromRandomPathways(GPSrepo, np, ng)
GPSrepo |
A signature repository (created by ..) or one of the precompiled options. |
np |
How many pathways to select. |
ng |
Number of genes to be selected. |
selectedPathways |
A vector containing the "np" originally selected pathways. |
genes |
A vector containing the "ng" selected genes from selectedPathways. |
Foroushani AB, Brinkman FS and Lynn DJ (2013). “Pathway-GPS and SIGORA: identifying relevant pathways based on the over-representation of their gene-pair signatures.” PeerJ, 1.
data('kegH')
## select 50 genes from 3 human KEGG pathways
seed <- 1234
set.seed(seed)
a1 <- genesFromRandomPathways(kegH, 3, 50)
## originally selected pathways:
a1[["selectedPathways"]]
## what are the genes
a1[["genes"]]
## sigora's results
sigoraRes <- sigora(GPSrepo = kegH, queryList = a1[["genes"]], level = 4)
## compare to traditional methods results:
oraRes <- ora(a1[["genes"]], kegH)
dim(oraRes)
oraRes
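The two-step sampling described for genesFromRandomPathways can be illustrated in base R on a toy gene-pathway table. This is a minimal sketch under stated assumptions, not the package's implementation: `toy` and `sample_genes_from_pathways` are hypothetical names, and the toy table stands in for a real GPS repository.

```r
# Toy gene-pathway table (hypothetical data, not taken from sigora).
toy <- data.frame(
  pathwayId = c("P1", "P1", "P1", "P2", "P2", "P3", "P3", "P3"),
  gene      = c("A",  "B",  "C",  "C",  "D",  "E",  "F",  "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the described behaviour.
sample_genes_from_pathways <- function(tab, np, ng) {
  pathways <- unique(tab$pathwayId)
  # Step 1: draw np pathways at random.
  selected <- pathways[sample.int(length(pathways), np)]
  # Step 2: draw ng genes associated with at least one selected pathway
  # (capped at the size of the available gene pool).
  pool <- unique(tab$gene[tab$pathwayId %in% selected])
  genes <- pool[sample.int(length(pool), min(ng, length(pool)))]
  list(selectedPathways = selected, genes = genes)
}

set.seed(1234)
res <- sample_genes_from_pathways(toy, np = 2, ng = 3)
res$selectedPathways
res$genes
```

Because the genes are drawn only from the selected pathways, a method that recovers exactly those pathways from `res$genes` is behaving as intended, which is the basis for the comparison between sigora and ora.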
This function lists the genes involved in the present GPS for a pathway of interest, ordered by their contribution to the significance of the pathway.
getGenes(yy, i, idmap = load_data("idmap"))
yy |
A sigora analysis result object (created by |
i |
The rank position of the pathway of interest in summary_results. |
idmap |
A dataframe for converting between different gene-identifier types (e.g. ENSEMBL, ENTREZ and HGNC-Symbols of genes). Most users do not need to set this argument, as there is a built-in conversion table. |
A table (dataframe) with ids of the relevant genes, ranked by their contribution to the statistical significance of the pathway.
data('kegH')
set.seed(seed = 12345)
a1 <- genesFromRandomPathways(kegH, 3, 50)
## originally selected pathways:
a1[["selectedPathways"]]
## what are the genes
a1[["genes"]]
## sigora's results with this input:
sigoraRes <- sigora(GPSrepo = kegH, queryList = a1[["genes"]], level = 2)
## Genes related to the second most significant result:
head(getGenes(sigoraRes, 2))
This function highlights the genes involved in the present GPS for a pathway of interest in its diagram. Please note that this functionality is only implemented for results of Reactome or KEGG based analyses.
getURL(yy, i)
yy |
A sigora analysis result object (created by |
i |
The rank position of the pathway of interest in summary_results. |
The URL of the pathway diagram, where the relevant genes from your original query list are highlighted.
data('kegH')
set.seed(seed = 12345)
a1 <- genesFromRandomPathways(kegH, 3, 50)
## originally selected pathways:
a1[["selectedPathways"]]
## what are the genes
a1[["genes"]]
## sigora's results with this input:
sigoraRes <- sigora(GPSrepo = kegH, queryList = a1[["genes"]], level = 2)
## Diagram for the most significant result, where the genes from our list are highlighted in red:
getURL(sigoraRes, 1)
A mapping table for ENSEMBL IDs, ENTREZ IDs and gene names (HGNC/MGI symbols) of human and mouse protein-coding genes.
www.ensembl.org/biomart/martview
data(idmap)
head(idmap)
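A conversion table of this kind is typically used through a lookup with `match()`. The sketch below uses a toy table rather than the real idmap: the identifiers are made up, and the column names `Ensembl` and `Entrez` are assumptions for illustration (only a `Symbol` column is referenced elsewhere in this manual). `convert_ids` is a hypothetical helper, not a sigora function.

```r
# Toy mapping table with made-up identifiers (assumed column names).
toy_idmap <- data.frame(
  Ensembl = c("ENSG_X1", "ENSG_X2", "ENSG_X3"),
  Entrez  = c("101", "102", "103"),
  Symbol  = c("GENE1", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# Translate identifiers from one column of the table to another.
# Identifiers that are not in the table come back as NA.
convert_ids <- function(ids, map, from, to) {
  map[[to]][match(ids, map[[from]])]
}

converted <- convert_ids(c("GENE3", "GENE1", "UNKNOWN"),
                         toy_idmap, from = "Symbol", to = "Entrez")
converted
```

Returning NA for unknown identifiers keeps the output aligned with the input, so unmapped genes can be inspected or dropped explicitly.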
KEGG human pathway GPS data, extracted by makeGPS with default settings. This data can be used by sigora to perform signature overrepresentation analysis.
<http://www.genome.jp/kegg/pathway.html>
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., & Tanabe, M. 2012. “KEGG for integration and interpretation of large-scale molecular data sets.” Nucleic Acids Research 40(D1).
data(kegH)
str(kegH)
KEGG mouse pathway GPS data, extracted by makeGPS with default settings. This data can be used by sigora to perform signature overrepresentation analysis.
<http://www.genome.jp/kegg/pathway.html>
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., & Tanabe, M. 2012. “KEGG for integration and interpretation of large-scale molecular data sets.” Nucleic Acids Research 40(D1).
data(kegM)
str(kegM)
Load and return data when lazyLoad is false, instead of using data(datastr).
load_data(datastr, package = "sigora")
datastr |
Name of the dataset to load. |
package |
Package to load the dataset from; the default is "sigora". |
Returns the requested dataset.
idmap <- load_data("idmap")
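The idea of loading a dataset and returning it as a value, rather than attaching it to the workspace, can be sketched with `utils::data()` and a private environment. This is an illustration of the technique only and not the package's source: `load_data_sketch` is a hypothetical name, and the default package is set to the base `datasets` package so the example runs without sigora installed.

```r
# Hypothetical re-implementation of the idea behind load_data():
# load a named dataset into a private environment and return it.
load_data_sketch <- function(datastr, package = "datasets") {
  env <- new.env()
  utils::data(list = datastr, package = package, envir = env)
  get(datastr, envir = env)
}

# Works like `idmap <- load_data("idmap")`, here with a base R dataset.
cars_df <- load_data_sketch("cars")
head(cars_df)
```

Returning the object makes the call usable as a default argument value, which is how `idmap = load_data("idmap")` appears in the usage of getGenes, ora and sigora.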
Given a repository of gene-pathway associations, either in a tab-delimited file with three columns (pathwayID, pathway description, gene) or as a corresponding dataframe, this function identifies all Gene Pair Signatures (pairs of genes that are, as a combination, unique to a single pathway) and Pathway Unique Genes (genes that are uniquely associated with a single pathway) and stores them in a format that is usable by sigora. Please also see the "details" and "note" sections below.
makeGPS(
  pathwayTable = NULL,
  fn = NULL,
  maxLevels = 5,
  saveFile = NULL,
  repoName = "userrepo",
  maxFunperGene = 100,
  maxGenesperPathway = 500,
  minGenesperPathway = 10
)
pathwayTable |
A data frame describing gene-pathway associations in the following format: pathwayID, pathwayName, Gene. Either pathwayTable or fn should be provided. |
fn |
Where to find the repository. Either pathwayTable or fn should be provided. |
maxLevels |
For hierarchical repositories, the number of levels to consider. |
saveFile |
Where to save the object as an rda file. |
repoName |
Repository name. |
maxFunperGene |
A cutoff threshold: genes with more than this number of associated pathways are excluded to speed up the GPS identification process. |
maxGenesperPathway |
A cutoff threshold: pathways with more than this number of associated genes are excluded to speed up the GPS identification process. |
minGenesperPathway |
A cutoff threshold: pathways with fewer than this number of associated genes are excluded to speed up the GPS identification process. |
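The effect of the three cutoff thresholds can be sketched on a toy table with the pathwayId, pathwayName, gene layout. This is an assumption-laden illustration: `filter_pathway_table` is a hypothetical helper, and the order in which the package applies the filters (genes first, then pathways) is assumed here, not taken from its source.

```r
# Toy gene-pathway table (hypothetical data).
toy <- data.frame(
  pathwayId   = c(rep("P1", 4), rep("P2", 2), rep("P3", 6)),
  pathwayName = c(rep("Pathway one", 4), rep("Pathway two", 2),
                  rep("Pathway three", 6)),
  gene = c("g1", "g2", "g3", "g4",
           "g1", "g5",
           "g1", "g6", "g7", "g8", "g9", "g10"),
  stringsAsFactors = FALSE
)

# Hypothetical pre-filter using the same parameter names as makeGPS.
filter_pathway_table <- function(tab, maxFunperGene = 100,
                                 maxGenesperPathway = 500,
                                 minGenesperPathway = 10) {
  tab <- unique(tab)
  # Exclude genes associated with too many pathways.
  fun_per_gene <- table(tab$gene)
  keep_genes <- names(fun_per_gene)[as.vector(fun_per_gene) <= maxFunperGene]
  tab <- tab[tab$gene %in% keep_genes, ]
  # Exclude pathways with too many or too few remaining genes.
  genes_per_pathway <- table(tab$pathwayId)
  n <- as.vector(genes_per_pathway)
  keep_pw <- names(genes_per_pathway)[n <= maxGenesperPathway &
                                        n >= minGenesperPathway]
  tab[tab$pathwayId %in% keep_pw, ]
}

# Small thresholds so that each filter has a visible effect on the toy data.
filtered <- filter_pathway_table(toy, maxFunperGene = 2,
                                 maxGenesperPathway = 4,
                                 minGenesperPathway = 2)
filtered
```

With these settings, g1 is dropped because it belongs to three pathways, after which P2 is too small and P3 too large, so only P1 remains.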
The primary purpose of makeGPS is to convert a user-supplied gene-pathway association table to a repository of weighted Gene Pair Signatures (GPS) that are unique features of pathways. Such GPS can then be used for signature (gene-pair) based analyses using sigora. Additionally, the resulting object retains the original "single gene"-"pathway" associations for the purpose of follow-up analyses, such as comparison of sigora results to traditional methods. ora is an implementation of the traditional (individual gene) Overrepresentation Analysis.
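The notions of Gene Pair Signatures and Pathway Unique Genes can be made concrete with a brute-force sketch in base R. It enumerates every gene pair within each pathway and keeps the pairs that occur in exactly one pathway. This is for illustration only, on hypothetical toy data: it ignores weighting and hierarchy levels and makes no claim about how makeGPS is implemented.

```r
# Toy gene-pathway table (hypothetical data).
toy <- data.frame(
  pathwayId = c("P1", "P1", "P1", "P2", "P2", "P2", "P3", "P3", "P3"),
  gene      = c("A",  "B",  "C",  "B",  "C",  "D",  "A",  "D",  "E"),
  stringsAsFactors = FALSE
)

# Enumerate all gene pairs that co-occur within each pathway.
genes_by_pathway <- split(toy$gene, toy$pathwayId)
pairs <- do.call(rbind, lapply(names(genes_by_pathway), function(p) {
  g <- sort(unique(genes_by_pathway[[p]]))
  if (length(g) < 2) return(NULL)
  cmb <- t(combn(g, 2))
  data.frame(gene1 = cmb[, 1], gene2 = cmb[, 2], pathwayId = p,
             stringsAsFactors = FALSE)
}))

# A pair is a signature if it occurs in exactly one pathway.
pair_key <- paste(pairs$gene1, pairs$gene2, sep = "|")
n_pathways <- table(pair_key)
gps <- pairs[as.vector(n_pathways[pair_key]) == 1, ]

# A gene is pathway-unique if it is associated with exactly one pathway.
gene_degree <- table(unique(toy)$gene)
pugs <- names(gene_degree)[as.vector(gene_degree) == 1]

gps
pugs
```

In this toy table the pair B, C is shared by P1 and P2 and is therefore not a signature, while every other pair points to a single pathway. The pairwise enumeration grows quadratically with pathway size, which is consistent with the cutoff thresholds and the memory note for makeGPS.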
A GPS repository, to be used by sigora and ora.
This function relies on the package slam, which should be installed from CRAN. It is fairly memory intensive, and it is recommended to run it on a machine with at least 6GB of RAM. Also, make sure to save and reuse the resulting GPS repository in future analyses!
Foroushani AB, Brinkman FS and Lynn DJ (2013). “Pathway-GPS and SIGORA: identifying relevant pathways based on the over-representation of their gene-pair signatures.” PeerJ, 1.
data(nciTable); data(idmap)
## what the input looks like:
head(nciTable)
## create a SigObject. use the saveFile parameter for reuse.
nciH <- makeGPS(pathwayTable = load_data('nciTable'))
ils <- grep("^IL", idmap[, "Symbol"], value = TRUE)
ilnci <- sigora(queryList = ils, GPSrepo = nciH, level = 3)
PID-NCI human pathway repository, as a data frame with three columns corresponding to pathwayId, pathwayName and gene. This is an example of the expected format for the input data to makeGPS.
This dataset is provided to illustrate how to create your own GPS repositories. nciTable is a dataframe with three columns corresponding to pathwayId, pathwayName and gene. Each row describes the association between an individual gene and a PID-NCI pathway. As shown in the examples section, one can convert this dataframe to a GPS repository using makeGPS, and save the results for future reuse. Using the GPS repository thus created, one can perform Signature Overrepresentation Analysis on lists of genes of interest.
<https://github.com/NCIP/pathway-interaction-database/tree/master/download>
data(nciTable)
nciH <- makeGPS(pathwayTable = load_data('nciTable'))
data(idmap)
ils <- grep("^IL", idmap[, "Symbol"], value = TRUE)
ilnci <- sigora(queryList = ils, GPSrepo = nciH, level = 3)
Traditional Overrepresentation Analysis by hypergeometric test: pathways are treated as collections of individual genes and all genes are treated as equally informative. This function is provided for comparison of the results of traditional methods to Sigora.
ora(geneList, GPSrepo, idmap = load_data("idmap"))
geneList |
A vector containing the list of genes of interest (e.g. differentially expressed genes). The following identifier types are supported: gene symbols, ENTREZ IDs or ENSEMBL IDs. |
GPSrepo |
A GPS-repository (either one of the provided precomputed GPS
repositories) or one created by |
idmap |
A dataframe for converting between different gene-identifier types (e.g. ENSEMBL, ENTREZ and HGNC-Symbols of genes). Most users do not need to set this argument, as there is a built-in conversion table. |
The primary purpose of makeGPS is to create a GPS repository. It does, however, retain the original "single gene"-"pathway" associations for the purpose of follow-up analyses, such as comparison of sigora results to traditional methods. ora is an implementation of the traditional (individual gene) Overrepresentation Analysis.
A dataframe with individual gene ORA results.
data(kegM)
## select 50 genes from 3 mouse pathways
set.seed(seed = 345)
a1 <- genesFromRandomPathways(kegM, 3, 50)
## originally selected pathways:
a1[["selectedPathways"]]
## compare to traditional methods results:
oraRes <- ora(a1[["genes"]], kegM)
dim(oraRes)
oraRes
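The hypergeometric test behind traditional overrepresentation analysis can be written in one line with `phyper()` from base R. The sketch below is a generic textbook formulation, not the body of ora: the function name `ora_pvalue` and the numbers in the example are hypothetical.

```r
# Upper-tail hypergeometric p-value for one pathway.
#   k: genes of interest that fall in the pathway
#   m: genes in the pathway
#   N: genes in the universe (all annotated genes)
#   n: genes of interest
ora_pvalue <- function(k, m, N, n) {
  # P(X >= k) when drawing n genes without replacement from N,
  # of which m belong to the pathway.
  phyper(k - 1, m, N - m, n, lower.tail = FALSE)
}

# Hypothetical example: 8 of 50 query genes hit a 40-gene pathway
# in a universe of 1000 genes (about 2 hits expected by chance).
p <- ora_pvalue(k = 8, m = 40, N = 1000, n = 50)
p
```

Every gene counts equally in this test, so a gene shared by many pathways contributes evidence to all of them. That is the behaviour the signature-based approach is designed to avoid.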
Reactome human pathway GPS data, extracted by makeGPS with default settings. This data can be used by sigora to perform signature overrepresentation analysis.
<http://www.reactome.org/>
Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., et al. 2009. “Reactome knowledgebase of human biological pathways and processes.” Nucleic acids research 37(Database issue).
data(reaH)
str(reaH)
Reactome mouse pathway GPS data, extracted by makeGPS with default settings. This data can be used by sigora to perform signature overrepresentation analysis.
<http://www.reactome.org/>
Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., et al. 2009. “Reactome knowledgebase of human biological pathways and processes.” Nucleic acids research 37(Database issue).
data(reaM)
str(reaM)
This function determines which signatures (GPS) from a collection of GPS data (GPSrepo argument) for the specified pathway repository are present in the specified list of genes of interest (queryList argument). It then uses the distribution function of hypergeometric probabilities to identify the pathways whose GPS are over-represented among the present GPS and, if the saveFile argument is provided, saves the results to the specified file.
sigora(
  GPSrepo,
  level,
  markers = FALSE,
  queryList = NULL,
  saveFile = NULL,
  weighting.method = "invhm",
  idmap = load_data("idmap")
)
GPSrepo |
An object created by |
level |
In hierarchical repositories (e.g. Reactome) number of levels to consider. Recommended value for KEGG: 2, for Reactome: 4. |
markers |
Whether to take single genes that are uniquely associated with only one pathway into account (i.e. should pathway unique genes/PUGs be considered GPS?). Recommended value: TRUE (1). |
queryList |
A user specified list of genes of interest ('query list'), as a vector of ENSEMBL/ ENTREZ IDs or gene symbols (HGNC/MGI). |
saveFile |
If provided, the results are saved here as a tab-delimited file (including, for each pathway, a list of genes ordered by their contribution to the statistical significance of the pathway). |
weighting.method |
The weighting method for GPS. The default weighting scheme for the GPS is the reciprocal of the harmonic mean of the degrees of the two component genes of a GPS. A wide range of alternative weighting schemes are pre-implemented (see below). Additional user-defined weighting schemes are also supported. Currently, the following alternatives are pre-implemented: |
idmap |
A dataframe for converting between different gene-identifier types (e.g. ENSEMBL, ENTREZ and HGNC-Symbols of genes). Most users do not need to set this argument, as there is a built-in conversion table. |
summary_results |
A dataframe listing the analysis results. |
detailed_results |
A dataframe describing the detailed evidence (present Gene-Pair Signatures) for each pathway. |
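The two steps performed by sigora, finding the GPS that are present in the query list and then testing each pathway for over-representation among them, can be sketched on a toy signature table. This is a simplified illustration on hypothetical data: it uses unweighted counts and no multiple-testing correction, whereas the package weights each GPS according to weighting.method.

```r
# Toy GPS table: each row is a gene pair that is unique to one pathway
# (hypothetical data).
toy_gps <- data.frame(
  gene1     = c("A",  "A",  "B",  "C",  "A",  "A",  "D"),
  gene2     = c("B",  "C",  "D",  "D",  "D",  "E",  "E"),
  pathwayId = c("P1", "P1", "P2", "P2", "P3", "P3", "P3"),
  stringsAsFactors = FALSE
)
query <- c("A", "B", "C")

# Step 1: a GPS is present when both of its genes are in the query list.
present <- toy_gps$gene1 %in% query & toy_gps$gene2 %in% query

# Step 2: per pathway, test whether its GPS are over-represented among
# the present GPS (upper tail of the hypergeometric distribution).
total <- table(toy_gps$pathwayId)
hits  <- table(factor(toy_gps$pathwayId[present], levels = names(total)))
pvals <- phyper(as.vector(hits) - 1,
                as.vector(total),
                nrow(toy_gps) - as.vector(total),
                sum(present),
                lower.tail = FALSE)
names(pvals) <- names(total)
pvals
```

Gene A belongs to pairs from both P1 and P3, but only the P1 pairs are complete in the query, so only P1 receives support. This is how the status of the other genes in the query list determines which pathway a shared gene counts towards.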
Foroushani AB, Brinkman FS and Lynn DJ (2013). “Pathway-GPS and SIGORA: identifying relevant pathways based on the over-representation of their gene-pair signatures.” PeerJ, 1.
## query list
ils <- grep("^IL", load_data('idmap')[["Symbol"]], value = TRUE)
## using precompiled GPS repositories:
sigRes.ilreact <- sigora(queryList = ils, GPSrepo = load_data('reaH'), level = 4)
sigRes.ilkeg <- sigora(queryList = ils, GPSrepo = load_data('kegH'), level = 2)
## user created GPS repository:
nciH <- makeGPS(pathwayTable = load_data('nciTable'))
sigRes.ilnci <- sigora(queryList = ils, GPSrepo = nciH, level = 2)
## user defined weighting schemes:
myfunc <- function(a, b) { 1 / log(a + b) }
sigora(queryList = ils, GPSrepo = nciH, level = 2, weighting.method = myfunc)
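The default weighting scheme described for weighting.method, the reciprocal of the harmonic mean of the degrees of the two component genes, can be written as a two-argument function of the same shape as `myfunc` above. The name `inv_harmonic_mean` is hypothetical, and treating this as equivalent to the built-in "invhm" option is an assumption based on the argument description rather than on the package source.

```r
# Reciprocal of the harmonic mean of two gene degrees a and b.
# The harmonic mean is 2ab / (a + b), so its reciprocal is (a + b) / (2ab).
inv_harmonic_mean <- function(a, b) {
  (a + b) / (2 * a * b)
}

# A pair of genes annotated to 2 and 4 pathways respectively:
inv_harmonic_mean(2, 4)

# Pairs of widely shared genes receive lower weights than pairs of
# rarely shared genes.
inv_harmonic_mean(c(1, 2, 10), c(1, 4, 20))
```

A function of this form could be passed as `weighting.method = inv_harmonic_mean` in the same way as `myfunc` in the example above.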