goscripts.enrichment_stats module
@author: Pieter Moris
-
goscripts.enrichment_stats.annotateOutput(enrichmentTestResults, GOdict, gafDict, gafSubset)
Adds the GO id names to the array with enrichment results.
Parameters: - enrichmentTestResults (dict of dicts) – A dictionary containing dictionaries mapping the GO IDs to their frequency counts for both the interest and background set.
- GOdict (dict) – A dictionary of GO objects generated by importOBO(). Keys are of the format GO-0000001 and map to OBO objects.
- gafDict (dict) – A dictionary that maps the background’s Uniprot ACs to GO IDs.
- gafSubset (dict) – A dictionary that maps the subset’s Uniprot ACs to GO IDs.
Returns: A pandas DataFrame containing GO IDs, descriptions, frequency counts, p-values and corrected p-values.
Return type: DataFrame
-
goscripts.enrichment_stats.countGOassociations(validTerms, gafDict)
Counts the number of genes associated with at least one of the provided GO terms.
Parameters: - validTerms (set) – A set of GO terms. Should include the GO id of interest and all of its children.
- gafDict (dict) – A dictionary that maps Uniprot ACs to GO IDs.
Returns: The number of associated genes.
Return type: int
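The counting rule can be illustrated with a minimal sketch. It is not the module's own code: the function name count_go_associations and the toy data are hypothetical, and it assumes that each value in gafDict is a set of GO IDs.

```python
def count_go_associations(valid_terms, gaf_dict):
    """Count genes annotated with at least one term in valid_terms.

    Assumes gaf_dict maps each Uniprot AC to a set of GO IDs.
    """
    # A gene is counted once, however many of the valid terms it carries.
    return sum(
        1 for go_ids in gaf_dict.values() if not valid_terms.isdisjoint(go_ids)
    )


# Hypothetical toy data: three genes and their GO annotations.
gaf = {
    "P12345": {"GO-0000001", "GO-0000002"},
    "Q67890": {"GO-0000003"},
    "A11111": {"GO-0000002"},
}
n_associated = count_go_associations({"GO-0000001", "GO-0000002"}, gaf)
print(n_associated)  # 2
```

The first gene matches both terms but contributes only one to the total, which is the behaviour "at least one of the provided GO terms" implies.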
-
goscripts.enrichment_stats.enrichmentAnalysis(GOdict, gafDict, gafSubset, minGenes=3, threshold=0.05, propagation=True)
Performs a GO enrichment analysis.
First, all GO term IDs associated with the genes in the subset of interest, i.e. those defined in the gafSubset dictionary, will be tested using a one-sided hypergeometric test.
If the test is not significant at the chosen threshold (default = 0.05), the test will recursively be performed for all of the GO term’s parents. NOTE: these p-values are counted for multiple testing correction. If the test is significant, the recursive call will stop the propagation.
NOTE: At the moment, this means the test will be propagated until the top level, but after a certain point it might not be worth testing anymore (e.g. “biological process”). NOTE: Isn’t this cherry picking / p-value manipulation?
If the number of genes associated with a GO term is lower than minGenes, the test will be skipped for this term and its parents will recursively be tested. NOTE: these do not count towards the multiple testing correction limit.
For each test, the number of genes associated with the GO term is found by counting the number of genes associated with the GO term itself or with any of its child terms.
In the end, a dictionary containing the tested GO IDs mapped to p-values is returned. Any term that was not tested will be absent.
Parameters: - GOdict (dict) – A dictionary of GO objects generated by importOBO(). Keys are of the format GO-0000001 and map to OBO objects.
- gafDict (dict) – A dictionary that maps the background’s Uniprot ACs to GO IDs.
- gafSubset (dict) – A dictionary that maps the subset’s Uniprot ACs to GO IDs.
- minGenes (int) – The minimum number of genes that has to be associated with a term, before the test will be performed.
- threshold (float) – The significance threshold of the hypergeometric test. If a GO term’s p-value falls below it, the term’s parents are not recursively tested for enrichment.
- propagation (boolean) – Specifies whether or not tests should propagate upwards through the tree.
Returns: A dictionary of dictionaries mapping GO IDs to p-values and frequencies. Only GO IDs that were tested are returned.
Return type: dict of dicts
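The propagation procedure described above can be sketched on a toy hierarchy. This is an illustration under several assumptions, not the module's implementation: the hierarchy is given as plain parents and children dictionaries instead of OBO objects, results are stored as a flat mapping of GO ID to p-value rather than a dict of dicts, the minGenes check is applied to the background count, and all names and data are hypothetical. It needs Python 3.8+ for math.comb.

```python
from math import comb


def tail_p(k, M, n, N):
    """P(X >= k) for a hypergeometric X: M genes, n annotated, N drawn."""
    terms = (comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return sum(terms) / comb(M, N)


def count_genes(terms, gaf):
    """Number of genes annotated with at least one of the given terms."""
    return sum(1 for go_ids in gaf.values() if not terms.isdisjoint(go_ids))


def descendants(term, children):
    """The term itself plus all of its direct and indirect children."""
    found, stack = {term}, [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def recursive_test(term, parents, children, gaf_bg, gaf_sub,
                   min_genes, threshold, results, propagate=True):
    if term in results:  # already tested via another path
        return
    valid = descendants(term, children)
    bg_count = count_genes(valid, gaf_bg)
    significant = False
    if bg_count >= min_genes:  # assumption: the cut-off uses the background count
        sub_count = count_genes(valid, gaf_sub)
        p = tail_p(sub_count, len(gaf_bg), bg_count, len(gaf_sub))
        results[term] = p
        significant = p < threshold
    # Skipped or non-significant terms hand the test over to their parents.
    if propagate and not significant:
        for parent in parents.get(term, ()):
            recursive_test(parent, parents, children, gaf_bg, gaf_sub,
                           min_genes, threshold, results, propagate)


# Hypothetical three-level hierarchy: ROOT -> MID -> LEAF.
ROOT, MID, LEAF = "GO-0000001", "GO-0000002", "GO-0000003"
parents = {LEAF: [MID], MID: [ROOT]}
children = {ROOT: [MID], MID: [LEAF]}

# Ten background genes; the subset of interest holds g1, g2 and g3.
gaf_bg = {"g1": {LEAF}, "g2": {LEAF}, "g3": {MID}, "g4": {MID}}
gaf_bg.update({f"g{i}": {ROOT} for i in range(5, 11)})
gaf_sub = {g: gaf_bg[g] for g in ("g1", "g2", "g3")}

results = {}
for go_ids in gaf_sub.values():
    for term in go_ids:
        recursive_test(term, parents, children, gaf_bg, gaf_sub,
                       min_genes=3, threshold=0.05, results=results)
print(results)
```

In this toy run LEAF has only two background genes, so it is skipped and its parent MID is tested instead. MID is significant at 0.05, so ROOT is never tested and is absent from the results.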
-
goscripts.enrichment_stats.enrichmentOneSided(subsetGO, backgroundTotal, backgroundGO, subsetTotal)
Performs a one-sided (enrichment) hypergeometric test for a given GO term.
The test computes the probability of observing k or more successes (GO associations, subsetGO) in N draws (subsetTotal) from a population of size M (backgroundTotal) containing n successes (backgroundGO). P(k or more) is the sum of the probability mass function over k up to N successes. Because the cdf gives the cumulative probability up to and including its input (k or fewer successes), P(k or more) is calculated as 1 - P(fewer than k) = 1 - P(k-1 or fewer), i.e. the survival function sf (1 - cdf) evaluated at k-1.
Parameters: - subsetGO (int) – The number of genes in the interest subset associated with a GO term.
- backgroundTotal (int) – The total number of genes in the background set.
- backgroundGO (int) – The number of genes in the background set associated with the GO term.
- subsetTotal (int) – The total number of genes in the interest subset.
Returns: The p-value of the one-sided hypergeometric test.
Return type: float
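The tail probability can be checked by hand with a small standard-library sketch. The function below is hypothetical and sums the probability mass function directly. The sf and cdf names in the description suggest a SciPy-style distribution object, where the equivalent call would be scipy.stats.hypergeom.sf(k - 1, M, n, N), but the source does not name the library, so treat that as an assumption. Python 3.8+ is needed for math.comb.

```python
from math import comb


def enrichment_one_sided(subset_go, background_total, background_go, subset_total):
    """Return P(X >= subset_go) for a hypergeometric X."""
    k, M, n, N = subset_go, background_total, background_go, subset_total
    # The number of successes can exceed neither the draws nor the annotated genes.
    upper = min(n, N)
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / comb(M, N)


# 10 background genes, 4 annotated; 3 genes drawn, 2 of them annotated.
p_value = enrichment_one_sided(2, 10, 4, 3)
print(p_value)  # (6*6 + 4*1) / 120 = 1/3
```

Summing from k rather than from k+1 is the point of the k-1 shift: the observed count itself belongs to the "as extreme or more extreme" tail.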
-
goscripts.enrichment_stats.multipleTestingCorrection(enrichmentTestResults, testType='fdr_bh', threshold=0.05)
Updates the original enrichmentTestResults dictionary of dictionaries by appending an additional dictionary mapping GO ids to corrected p-values.
Parameters: - enrichmentTestResults (dict of dicts) – A dictionary of dictionaries mapping GO ids to p-values and counts.
- testType (str) – Specifies the type of multiple testing correction. Options include bonferroni, fdr_bh (Benjamini-Hochberg) and any others defined by statsmodels.stats.multitest.multipletests().
- threshold (float) – The significance level to use.
Returns: Modifies the provided enrichmentTestResults dictionary in-place.
Return type: None
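To show what the default fdr_bh correction does to a set of p-values, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. According to the parameter description the module delegates this step to statsmodels, so the function below is only an illustration. The "pValues" and "corrected" keys and the toy values are assumptions about the layout of enrichmentTestResults.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical results layout: raw p-values keyed by GO ID.
results = {
    "pValues": {
        "GO-0000001": 0.01,
        "GO-0000002": 0.04,
        "GO-0000003": 0.03,
        "GO-0000004": 0.005,
    }
}
go_ids = list(results["pValues"])
corrected = benjamini_hochberg([results["pValues"][g] for g in go_ids])
# Append the corrected values as an extra inner dictionary, in place.
results["corrected"] = dict(zip(go_ids, corrected))
print(results["corrected"])
```

Each raw p-value is scaled by m / rank, and the running minimum keeps a smaller raw p-value from ending up with a larger adjusted value than one ranked above it.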
-
goscripts.enrichment_stats.recursiveTester(GOid, backgroundTotal, subsetTotal, GOdict, gafDict, gafSubset, minGenes, threshold, enrichmentTestResults, propagation)
Implements the recursive enrichment tests for the enrichmentAnalysis() function by propagating through parent terms in case of an insignificant result or low gene count.
Parameters: - GOid (str) – The GO term id that is being tested for enrichment.
- backgroundTotal (int) – The total number of genes in the background set.
- subsetTotal (int) – The total number of genes in the subset of interest.
- GOdict (dict) – A dictionary of GO objects generated by importOBO(). Keys are of the format GO-0000001 and map to OBO objects.
- gafDict (dict) – A dictionary that maps the background’s Uniprot ACs to GO IDs.
- gafSubset (dict) – A dictionary that maps the subset’s Uniprot ACs to GO IDs.
- minGenes (int) – The minimum number of genes that has to be associated with a term, before the test will be performed.
- threshold (float) – The significance threshold of the hypergeometric test. If a GO term’s p-value falls below it, the term’s parents are not recursively tested for enrichment.
- enrichmentTestResults (dict of dicts) – A dictionary of dictionaries that gets passed through the recursion and filled with mappings of GO ids to p-values and frequencies for every enrichment test.
- propagation (boolean) – Specifies whether or not tests should propagate upwards through the tree.
Returns: Does not return anything, but fills in the passed pValues dictionary (which is nested in the enrichmentTestResults dictionary).