Title: | Single Cell Poisson Probability Paradigm |
---|---|
Description: | Useful to visualize the Poissoneity (an independent Poisson statistical framework, where each RNA measurement for each cell comes from its own independent Poisson distribution) of Unique Molecular Identifier (UMI) based single cell RNA sequencing (scRNA-seq) data, and explore cell clustering based on model departure as a novel data representation. |
Authors: | Yue Pan [aut, cre], Justin Landis [aut], Dirk Dittmer [aut], James S. Marron [aut], Di Wu [aut] |
Maintainer: | Yue Pan <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.1 |
Built: | 2024-10-29 05:21:27 UTC |
Source: | https://github.com/cran/scpoisson |
This function returns a matrix holding a novel data representation, with the same dimensions as the input data matrix.
adj_CDF_logit(data, change = 1e-10, ...)
data |
A UMI count data matrix with genes as rows and cells as columns or an S3 object for class 'scppp'. |
change |
A numeric value used to correct for exactly 0 and 1 before logit transformation.
Any values below |
... |
not used. |
This is a function used to calculate model departure as a novel data representation.
A matrix of departure as a novel data representation (matrix as input) or an S3 object for class 'scppp' (scppp object as input; departure result will be stored in object scppp under "representation").
# Matrix as input
test_set <- matrix(rpois(500, 0.5), nrow = 10)
adj_CDF_logit(test_set)
# scppp object as input
adj_CDF_logit(scppp(test_set))
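The role of the `change` argument can be illustrated with a minimal base R sketch. The description of `change` is cut off in this page, so the clamping rule below (values pulled into the interval from `change` to `1 - change`) is an assumption, and `clamp_logit` is a hypothetical helper, not the package's implementation.

```r
# Hypothetical helper: guard against exact 0 and 1 before the logit.
# Assumption: probabilities are clamped to [change, 1 - change].
clamp_logit <- function(p, change = 1e-10) {
  p <- pmin(pmax(p, change), 1 - change)  # pull exact 0 and 1 inside (0, 1)
  log(p / (1 - p))                        # logit transformation
}

x <- clamp_logit(c(0, 0.5, 1))
x  # finite values, symmetric around 0
```

Without the correction, the logit of exactly 0 or 1 would be `-Inf` or `Inf`, which would break downstream clustering.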
This function removes unwanted characters from cluster label string
clust_clean(clust)
clust |
a string indicating the cluster label at each split step |
The clust_clean function removes any "-" or "NA" at the end of a string for a given cluster label
a string with unwanted characters removed
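A minimal sketch of this kind of clean-up, using a regular expression in base R. The label format (`"1-2-NA-NA"`) and the helper name `clean_label` are assumptions made for illustration; the package's own implementation may differ.

```r
# Hypothetical helper: strip any run of "-" or "NA" at the end of a label.
clean_label <- function(clust) {
  sub("(-|NA)+$", "", clust)
}

clean_label("1-2-NA-NA")
clean_label("1-2-")
clean_label("1-2")   # already clean, returned unchanged
```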
This function calculates the number of elements in current cluster
cluster_size(test_dat)
test_dat |
a matrix or data frame with cells to cluster as rows |
a numeric value with number of cells to cluster
This function returns a data frame with differential expression analysis results.
diff_gene_list( data, final_clust_res = NULL, clust1 = "1", clust2 = "2", t_test = FALSE, ... )
data |
A departure matrix generated from adj_CDF_logit() or an S3 object for class 'scppp'. |
final_clust_res |
A data frame with clustering results generated from HclustDepart(). It contains two columns: names (cell names) and clusters (cluster label). |
clust1 |
One of the cluster labels used in the comparison, default "1". |
clust2 |
The other cluster label used to make comparison, default "2". |
t_test |
A logical value indicating whether the t-test should be used to make comparison. In general, for large cluster ( |
... |
not used. |
This is a function used to find differentially expressed genes between two clusters.
A data frame containing genes (ranked in decreasing order of mean difference) and associated statistics (p-values, FDR adjusted p-values, etc.). If the input is an S3 object of class 'scppp', the differential expression analysis results will be stored in the scppp object under "de_results".
get FWER cutoffs for shc object
## S3 method for class 'shc'
fwer_cutoff(obj, alpha, ...)
obj |
|
alpha |
numeric value specifying level |
... |
other parameters to be used by the function |
Patrick Kimes
get example data
get_example_data(x = c("p5", "p56"))
x |
data set to choose |
A data set from example data
This function returns a list with clustering results.
HclustDepart(data, maxSplit = 10, minSize = 10, sim = 100, ...)
data |
A UMI count matrix with genes as rows and cells as columns or an S3 object for class 'scppp'. |
maxSplit |
A numeric value specifying the maximum allowable number of splitting steps (default 10). |
minSize |
A numeric value specifying the minimal allowable cluster size (the number of cells for the smallest cluster, default 10). |
sim |
A numeric value specifying the number of simulations in the Monte Carlo procedure for the statistical significance test, i.e. the n_sim argument when applying sigclust2 (default 100). |
... |
not used. |
This function obtains cell clustering results recursively. At each step, the two-way approximation is recalculated within each subcluster, and the potential for further splitting is assessed using sigclust2. A non-significant result suggests that the cells are reasonably homogeneous and may come from the same cell type. To avoid over-splitting, the maximum allowable number of splitting steps maxSplit (default 10, which leads to at most 2^10 = 1024 clusters in total) and the minimal allowable cluster size minSize (the number of cells in a cluster allowed for further splitting, default 10) may be set beforehand. The process therefore stops when any of the following conditions is satisfied: (1) the split is no longer statistically significant; (2) the maximum allowable number of splitting steps is reached; (3) any current cluster has fewer than minSize cells (10 by default).
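The three stopping conditions can be written as a single predicate. This is a sketch only: `should_stop` is a hypothetical helper, and the significance level `alpha = 0.05` is an assumption, since this page does not state the threshold the package applies to the sigclust2 p-value.

```r
# Hypothetical helper: TRUE when recursive splitting should stop.
# alpha is an assumed significance level, not taken from the package.
should_stop <- function(p_value, step, cluster_sizes,
                        alpha = 0.05, maxSplit = 10, minSize = 10) {
  p_value >= alpha ||              # (1) split is not statistically significant
    step >= maxSplit ||            # (2) maximum number of splitting steps reached
    any(cluster_sizes < minSize)   # (3) some current cluster is too small
}

should_stop(p_value = 0.001, step = 1, cluster_sizes = c(50, 60))  # keep splitting
should_stop(p_value = 0.200, step = 1, cluster_sizes = c(50, 60))  # stop: not significant
```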
A list with the following elements:
res2: a data frame containing two columns: names (cell names) and clusters (cluster label)
sigclust_p: a matrix with cells to cluster as rows and split index as columns; the entry in row i and column j is the p-value for cell i at split step j
sigclust_z: a matrix with cells to cluster as rows and split index as columns; the entry in row i and column j is the z-score for cell i at split step j
If the input is an S3 object for class 'scppp', clustering result will be stored in object scppp under "clust_results".
test_set <- matrix(rpois(500, 0.5), nrow = 10)
HclustDepart(test_set)
This function returns a data frame with interpolated data points.
interpolate(df, reference, sample_id)
df |
The object data frame, i.e. the data frame that requires interpolation. |
reference |
The reference data frame to make comparison. |
sample_id |
A character to denote the object data frame. |
This is a function developed to do linear interpolation for corresponding probability from empirical cumulative distribution function (CDF) and corresponding quantiles. Given a reference data frame and a data frame needed to do interpolation, if there are any CDF values in reference but not in object data frame, do the linear interpolation and insert both CDF values and respective quantiles to the original object data frame.
A data frame containing the CDF values, the sample name, and the corresponding quantiles.
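The interpolation step described above can be sketched with base R's `approx()`. The column names (`cdf`, `quantile`) and the numbers are illustrative assumptions, not the structure the package actually uses.

```r
# Object sample: CDF values and their quantiles (illustrative numbers).
obj <- data.frame(cdf = c(0.2, 0.6, 1.0), quantile = c(0, 1, 2))
# CDF values present in the reference sample.
ref_cdf <- c(0.2, 0.4, 0.6, 0.8, 1.0)

# CDF values in the reference but not in the object data frame.
missing_cdf <- setdiff(ref_cdf, obj$cdf)

# Linear interpolation of the quantiles at the missing CDF values.
new_q <- approx(x = obj$cdf, y = obj$quantile, xout = missing_cdf)$y

# Insert the interpolated points and restore the CDF ordering.
filled <- rbind(obj, data.frame(cdf = missing_cdf, quantile = new_q))
filled <- filled[order(filled$cdf), ]
filled
```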
This function applies logit transformation for a given probability
logit(p)
p |
a numeric value of probability, ranges between 0 and 1, exactly 0 and 1 not allowed |
The logit function transforms a probability within the range of 0 and 1 to the real line
a numeric value transformed to the real line
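For reference, the logit transformation is log(p / (1 - p)). The sketch below uses a hypothetical name, `logit_sketch`, to keep it distinct from the package function, and checks it against base R's `qlogis()`, which computes the same quantity.

```r
# Hypothetical stand-in for the transformation described above.
logit_sketch <- function(p) {
  stopifnot(all(p > 0 & p < 1))  # exactly 0 and 1 are not allowed
  log(p / (1 - p))
}

p <- c(0.1, 0.5, 0.8)
logit_sketch(p)
all.equal(logit_sketch(p), qlogis(p))
```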
This function returns a list with elements useful to check and compare cell clustering.
LouvainDepart( data, pdat = NULL, PCA = TRUE, N = 15, pres = 0.8, tsne = FALSE, umap = FALSE, ... )
data |
A UMI count matrix with genes as rows and cells as columns or an S3 object for class 'scppp'. |
pdat |
A matrix used as input for cell clustering. If not specified, the departure matrix will be calculated within the function. |
PCA |
A logical value specifying whether to apply PCA before Louvain clustering, default is TRUE. |
N |
A numeric value specifying the number of principal components included for further clustering (default 15). |
pres |
A numeric value specifying the resolution parameter in Louvain clustering (default 0.8). |
tsne |
A logical value specifying whether t-SNE dimension reduction should be applied for visualization. |
umap |
A logical value specifying whether UMAP dimension reduction should be applied for visualization. |
... |
not used. |
This is a function used to get cell clustering using Louvain clustering algorithm implemented in the Seurat package.
A list with the following elements:
sdata: a Seurat object
tsne_data: a matrix containing t-SNE dimension reduction results, with cells as rows and the first two t-SNE dimensions as columns; NULL if tsne = FALSE.
umap_data: a matrix containing UMAP dimension reduction results, with cells as rows and the first two UMAP dimensions as columns; NULL if umap = FALSE.
res_clust: a data frame containing two columns: names (cell names) and clusters (cluster label)
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck III WM, Hao Y, Stoeckius M, Smibert P, Satija R (2019). “Comprehensive Integration of Single-Cell Data.” Cell, 177, 1888-1902. doi:10.1016/j.cell.2019.05.031.
set.seed(1234)
test_set <- matrix(rpois(500, 2), nrow = 20)
rownames(test_set) <- paste0("gene", 1:nrow(test_set))
colnames(test_set) <- paste0("cell", 1:ncol(test_set))
LouvainDepart(test_set)
This function returns a data frame with generated sets of samples and simulation index.
nboot_small(x, lambda, R)
x |
a numeric vector of sampled data points to compare with theoretical Poisson. |
lambda |
a numeric value for mean of theoretical Poisson. |
R |
a numeric value for the number of simulation sets to generate. |
This is a function used to simulate a given number sets of samples from a theoretical Poisson distribution that match input samples on sample size and sample mean (or theoretical Poisson parameter). Plotting these as envelopes in Q-Q plot shows the variability in shapes we can expect when sampling from the theoretical Poisson distribution.
A data frame containing the simulated data and the corresponding simulation index. The function generates sets of random samples from the theoretical Poisson distribution.
R gives the number of simulation sets, each of which matches the input samples on sample size and sample mean (or theoretical Poisson parameter).
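A rough base R sketch of this kind of simulation: draw `R` Poisson samples, each of the same size as the input, and record which simulation each value belongs to. The helper name `simulate_sets` and the column names `value` and `sim_id` are assumptions for illustration, not the package's output format.

```r
# Hypothetical helper: R simulated Poisson samples matching x in size.
simulate_sets <- function(x, lambda, R) {
  n <- length(x)
  data.frame(
    value  = rpois(n * R, lambda = lambda),  # simulated counts
    sim_id = rep(seq_len(R), each = n)       # simulation index
  )
}

set.seed(1)
x <- rpois(30, lambda = 2)
sim <- simulate_sets(x, lambda = 2, R = 5)
head(sim)
```

Sorting the values within each `sim_id` gives one simulated quantile curve per set, which is what an envelope on a Q-Q plot is built from.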
This function returns a data frame including data points and corresponding quantile.
new_quantile(data, sample)
data |
A numeric vector of sampled data points. |
sample |
A character string denoting which sample the data points come from. |
This is a function developed to get quantiles for samples with only a few integer values. The point mass at each integer is replaced by a bar on an interval, which gives a more "continuous" approximation of quantiles in this case.
A data frame containing the probability from the cumulative distribution function (CDF), the sample name, and the corresponding quantiles.
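The idea of a more "continuous" quantile for integer data can be sketched as follows. The formulas are not recoverable from this page, so the sketch assumes the usual continuity correction: the point mass at integer z is spread uniformly over the interval from z - 0.5 to z + 0.5, making the CDF piecewise linear. It also assumes the observed integers are consecutive. `q_cont` is a hypothetical name.

```r
x <- c(0, 0, 1, 1, 1, 2)                  # small integer-valued sample

vals  <- sort(unique(x))                  # observed integers
cum_p <- cumsum(as.vector(table(x))) / length(x)  # empirical CDF at each integer

# Knots of the piecewise-linear CDF: it reaches cum_p at z + 0.5
# and starts from 0 at the left edge of the first bar.
knots_q <- c(vals[1] - 0.5, vals + 0.5)
knots_p <- c(0, cum_p)

# Continuous quantile function: invert the piecewise-linear CDF.
q_cont <- function(p) approx(x = knots_p, y = knots_q, xout = p)$y

q_cont(c(0, 0.5, 1))
```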
This function returns a data frame including the probability from cumulative distribution function (CDF) and corresponding quantiles.
new_quantile_pois(data, lambda)
data |
A numeric vector of sampled data points to compare with theoretical Poisson. |
lambda |
A numeric value for theoretical Poisson distribution parameter (equal to mean). |
This is a function developed to get the corresponding quantiles from the theoretical Poisson distribution. The data points range from 0 to the maximum value of the sampled data used for comparison with the theoretical Poisson distribution.
A data frame containing the CDF probability and corresponding quantiles from the theoretical Poisson distribution.
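As a simplified illustration, the theoretical Poisson CDF on the grid from 0 to the sample maximum can be tabulated with base R's `ppois()`. This sketch omits any continuity correction the package may apply, and the column names are assumptions.

```r
x <- c(0, 1, 1, 3)          # sampled data points
lambda <- 1.2               # theoretical Poisson parameter

grid <- 0:max(x)            # 0 up to the sample maximum
pois_ref <- data.frame(
  quantile = grid,
  cdf = ppois(grid, lambda = lambda)  # P(X <= quantile)
)
pois_ref
```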
This function returns a vector consists of parameter estimates for overall offset, cell effect, and gene effect.
para_est_new(test_set)
test_set |
A UMI count data matrix with genes as rows and cells as columns |
This is a function used to calculate parameter estimates based on $\lambda_{gc} = e^{\mu + \alpha_g + \beta_c}$, where $\mu$ is the overall offset, $\alpha$ is a vector with the same length as the number of genes, and $\beta$ is a vector with the same length as the number of cells. The order of elements in the vectors $\alpha$ and $\beta$ is the same as the rows (genes) and columns (cells) of the input data. Be sure to remove cells/genes with all zeros.
A numeric vector containing parameter estimates from overall offset (first element), gene effect (same order as rows) and cell effect (same order as columns).
# Matrix as input
test_set <- matrix(rpois(500, 0.5), nrow = 10)
para_est_new(test_set)
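To make the two-way structure concrete, here is a base R sketch that assumes a log-linear Poisson model with an overall offset, a gene effect, and a cell effect. Under that assumption the fitted means have a closed form (row total times column total, divided by the grand total), which the sketch checks against `glm()`. This is not the package's estimation code, and `glm()` uses its own identifiability constraint (treatment contrasts), so individual coefficients need not match the output of `para_est_new()`.

```r
set.seed(42)
counts <- matrix(rpois(40, lambda = 3), nrow = 5)  # 5 genes x 8 cells

# Genes/cells with all zeros must be removed first; here we just check.
stopifnot(all(rowSums(counts) > 0), all(colSums(counts) > 0))

# Closed-form fitted means under the two-way Poisson model.
lambda_hat <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Same fit via a Poisson GLM with gene and cell as factors.
long <- data.frame(
  y    = as.vector(counts),
  gene = factor(as.vector(row(counts))),
  cell = factor(as.vector(col(counts)))
)
fit <- glm(y ~ gene + cell, family = poisson, data = long)

# Largest discrepancy between the two sets of fitted means.
max_diff <- max(abs(fitted(fit) - as.vector(lambda_hat)))
max_diff
```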
This function returns a data frame with paired quantiles in two samples after interpolation.
qq_interpolation(dfp, dfq, sample1, sample2)
dfp |
A data frame generated from function new_quantile() based on a specific distribution. |
dfq |
Another data frame generated from function new_quantile() based on a specific distribution. |
sample1 |
A character string denoting the sample name of the distribution used to generate dfp. |
sample2 |
A character string denoting the sample name of the distribution used to generate dfq. |
This is a function for quantile interpolation of two samples. For each unique quantile value that has original data point in one sample but no corresponding original data point in another sample, apply a linear interpolation. So the common quantile values after interpolation should have unique points the same as unique quantile points from either sample.
A data frame containing the probability from the cumulative distribution function (CDF), the corresponding quantiles from the first sample (dfp), and the corresponding quantiles from the second sample (dfq).
This function returns a Q-Q plot with envelope using a more "continuous" approximation of quantiles.
qqplot_env_pois(sample_data, lambda, envelope_size = 100, ...)
sample_data |
A numeric vector of sample data points or an S3 object for class 'scppp'. |
lambda |
A numeric value specifying the theoretical Poisson parameter. |
envelope_size |
A numeric value specifying the size of envelope on Q-Q plot (default 100). |
... |
not used. |
This function draws a Q-Q envelope plot used to assess whether the given sample data points come from the theoretical Poisson distribution. By simulating repeated samples of the same size from the candidate theoretical distribution and overlaying the envelope on the same figure, it conveys the natural variation to expect from the theoretical distribution.
If an S3 object of class 'scppp' is used as input and the stored result under "data" is a matrix, the GLM-PCA algorithm will be applied to estimate the Poisson parameter for each matrix entry. Then a specific number of entries will be selected as sample data points to compare with the theoretical Poisson distribution.
A ggplot object.
Townes FW, Street K (2020). glmpca: Dimension Reduction of Non-Normally Distributed Data. R package version 0.2.0, https://CRAN.R-project.org/package=glmpca.
This function returns a ggplot object used to visualize quantiles comparing distributions of two samples.
qqplot_small_test(P, Q, sample1, sample2)
P |
A numeric vector from one sample. |
Q |
A numeric vector from the other sample. |
sample1 |
A character to denote sample name of one distribution |
sample2 |
A character to denote sample name of the other distribution |
This function draws a quantile-quantile plot comparing samples from two discrete distributions after continuity correction and linear interpolation.
A ggplot object. Q-Q plot with continuity correction. Quantiles from one sample on the horizontal axis and corresponding quantiles from the other sample on the vertical axis.
Define S3 class that stores scRNA-seq data and associated information (e.g. model departure representation, cell clustering results) if corresponding functions are called.
scppp(data, sample = c("columns", "rows"))
data |
input data - Usually a matrix of counts |
sample |
by rows or columns |
S3 object for class 'scppp'.
This function returns a list with elements mainly generated from sigclust2.
sigp(test_dat, minSize = 10, sim = 100)
test_dat |
A UMI count data matrix with samples to cluster as rows and features as columns. |
minSize |
A numeric value specifying the minimal allowable cluster size (the number of cells for the smallest cluster, default 10). |
sim |
A numeric value specifying the number of simulations during the Monte Carlo simulation procedure (default 100). |
This is a function used to calculate the significance level of the first split from hierarchical clustering based on Euclidean distance and Ward's linkage.
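The clustering step itself can be sketched in base R. The significance test from sigclust2 is not reproduced here, and `"ward.D2"` is an assumed choice of Ward variant, since this page does not say which one is used. The data are synthetic, with two well-separated groups.

```r
set.seed(7)
# 20 samples x 4 features: two groups with clearly different means.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 6), ncol = 4))

# Hierarchical clustering with Euclidean distance and Ward's linkage.
tree <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# The first split corresponds to cutting the tree into two groups.
first_split <- cutree(tree, k = 2)
table(first_split)
```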
A list with the following elements:
p: p-value for the first split
z: z-score for the first split
shc_result: a shc S3 object as defined in the sigclust2 package
clust2: a vector with the group index for each cell
clust_dat: a matrix of data representation used as input for hierarchical clustering
Kimes PK, Liu Y, Neil Hayes D, Marron JS (2017). “Statistical significance for hierarchical clustering.” Biometrics, 73(3), 811–821.
Michael Linderman (2019). Rclusterpp: Linkable C++ Clustering. R package version 0.2.5, https://github.com/nolanlab/Rclusterpp.
This function generates ggplot object with theme elements that Dirk appreciates on his ggplots
theme_dirk( base_size = 22, base_family = "", base_line_size = base_size/22, base_rect_size = base_size/22, time_stamp = FALSE )
base_size |
base font size, given in pts. |
base_family |
base font family |
base_line_size |
base size for line elements |
base_rect_size |
base size for rect elements |
time_stamp |
Logical value to indicate if the current time should be added as a caption to the plot. Helpful for versioning of plots. |
list that can be added to a ggplot object