Testing for over-representation of anything

Author: António Domingues
Published: February 14, 2016

Recently I wrote a post on how to test for chromosome over-representation in a list of genes. The solution, which I thought was clever at the time, can be simplified and applied to test whether the overlap between two lists of genes is significant. Let’s use the pasilla data again:

Code
library("pasilla")
library("Biobase")
library("DESeq2")

data("pasillaDEXSeqDataSet")

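# Draw two random sets of features; without set.seed() the draws,
# and therefore the results, change on every run
geneset1 <- sample(rownames(counts(dxd)), 150)
geneset2 <- sample(rownames(counts(dxd)), 250)

# Size of the background: all unique features in the data set
universe <- length(
   unique(rownames(counts(dxd)))
   )

# Number of features shared by the two sets
common <- length(
   intersect(
      unique(geneset1),
      unique(geneset2)
      )
   )

# 2x2 contingency table, filled column-wise:
# neither set, only geneset1, only geneset2, both sets
mat <- matrix(
   c(
      universe - length(union(geneset1, geneset2)),
      length(setdiff(geneset1, geneset2)),
      length(setdiff(geneset2, geneset1)),
      common
      ),
   nrow=2
   )

# One-sided test: is the overlap larger than expected by chance?
fr <- fisher.test(mat, alternative="greater")
fr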

Since these are random sets of genes, it is not surprising that the overlap is not significant. This works fairly well and it could even be wrapped in a nice function. Yep, someone else has done it.
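Before getting to that package, here is a minimal sketch of what such a wrapper might look like in base R. The function name `test_overlap`, its arguments and the toy identifiers are all made up for illustration; it simply repackages the contingency-table steps from the code above, and cross-checks the one-sided p-value against the hypergeometric distribution.

```r
# Hypothetical helper (not part of any package): one-sided Fisher's exact
# test for the overlap between two sets drawn from a common universe.
test_overlap <- function(set1, set2, universe_size) {
   set1 <- unique(set1)
   set2 <- unique(set2)
   n_both <- length(intersect(set1, set2))
   n_only1 <- length(setdiff(set1, set2))
   n_only2 <- length(setdiff(set2, set1))
   n_neither <- universe_size - length(union(set1, set2))
   stopifnot(n_neither >= 0)

   # Filled column-wise: rows index membership in set1, columns in set2
   mat <- matrix(
      c(n_neither, n_only1, n_only2, n_both),
      nrow = 2,
      dimnames = list(set1 = c("out", "in"), set2 = c("out", "in"))
      )
   fisher.test(mat, alternative = "greater")
}

# Toy example with made-up identifiers
set.seed(42)  # makes the random draws reproducible
universe_ids <- paste0("gene", 1:500)
toy1 <- sample(universe_ids, 150)
toy2 <- sample(universe_ids, 250)

res <- test_overlap(toy1, toy2, length(universe_ids))
res$p.value

# The same p-value from the hypergeometric distribution:
# probability of seeing at least k shared items by chance
k <- length(intersect(toy1, toy2))
p_hyper <- phyper(k - 1, 150, 500 - 150, 250, lower.tail = FALSE)
all.equal(res$p.value, p_hyper)
```

Passing the universe size explicitly is a deliberate choice in this sketch: the p-value depends heavily on what counts as the background, so it seems safer to make the caller state it.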

GeneOverlap

Amongst other things, including visualization of overlaps, this package has a great function, testGeneOverlap, which takes an object created with newGeneOverlap and does exactly the above:

Code
library(GeneOverlap)
overl <- newGeneOverlap(
   unique(geneset1),
   unique(geneset2),
   genome.size=universe)

overl <- testGeneOverlap(overl)
print(overl)
Detailed information about this GeneOverlap object:
listA size=150, e.g. FBgn0034420:E017 FBgn0010280:E008 FBgn0034420:E012
listB size=250, e.g. FBgn0010909:E008 FBgn0034180:E008 FBgn0034180:E003
Intersection size=81, e.g. FBgn0010280:E008 FBgn0034420:E012 FBgn0050460:E008
Union size=319, e.g. FBgn0034420:E017 FBgn0010280:E008 FBgn0034420:E012
Genome size=498
# Contingency Table:
     notA inA
notB  179  69
inB   169  81
Overlapping p-value=0.15
Odds ratio=1.2
Overlap tested using Fisher's exact test (alternative=greater)
Jaccard Index=0.3

And the result is the same. However, GeneOverlap also reports a few more statistics that can be quite useful:

The Fisher’s exact test also gives an odds ratio which represents the strength of association. If an odds ratio is equal to or less than 1, there is no association between the two lists. If the odds ratio is much larger than 1, then the association is strong. The class also calculates the Jaccard index which measures the similarity between two lists. The Jaccard index varies between 0 and 1, with 0 meaning there is no similarity between the two and 1 meaning the two are identical.

The explanations are also quite nice for beginners. Great Bioconductor package.
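To make the two extra statistics concrete, they can be recomputed by hand from the contingency table printed above. This is only a base R sketch using the counts copied from that output; the variable names are my own, and it is not meant to show how GeneOverlap computes them internally.

```r
# Counts copied from the contingency table printed above
# (rows: notB / inB, columns: notA / inA)
tab <- matrix(
   c(179, 169, 69, 81),
   nrow = 2,
   dimnames = list(c("notB", "inB"), c("notA", "inA"))
   )

n_both <- tab["inB", "inA"]
n_union <- sum(tab) - tab["notB", "notA"]

# Jaccard index: intersection size over union size
jaccard <- n_both / n_union

# Sample (cross-product) odds ratio
or_sample <- (tab["notB", "notA"] * tab["inB", "inA"]) /
   (tab["notB", "inA"] * tab["inB", "notA"])

# fisher.test() reports a conditional maximum likelihood estimate instead,
# so its value can differ slightly from the cross-product ratio
ft <- fisher.test(tab, alternative = "greater")

round(c(jaccard = jaccard, or_sample = or_sample,
        or_fisher = unname(ft$estimate), p = ft$p.value), 2)
```

An odds ratio only a little above 1 together with a p-value well above 0.05 is what one would expect from two random sets.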

Citation

BibTeX citation:
@online{domingues2016,
  author = {Domingues, António},
  title = {Testing for over-Representation of Anything},
  date = {2016-02-14},
  url = {https://amjdomingues.com/posts/2016-02-14-test-gene-overlap/},
  langid = {en}
}
For attribution, please cite this work as:
Domingues, António. 2016. “Testing for over-Representation of Anything.” February 14, 2016. https://amjdomingues.com/posts/2016-02-14-test-gene-overlap/.