Title: | Partition-Assisted Clustering and Multiple Alignments of Networks |
---|---|
Description: | Implements partition-assisted clustering and multiple alignments of networks. It 1) uses partition-assisted clustering to find robust and accurate clusters and 2) discovers coherent relationships of clusters across multiple samples. It is particularly useful for analyzing single-cell data sets. Please see Li et al. (2017) <doi:10.1371/journal.pcbi.1005875> for a detailed description of the method. |
Authors: | Ye Henry Li, Dangna Li |
Maintainer: | Ye Henry Li <[email protected]> |
License: | GPL-3 |
Version: | 1.1.4 |
Built: | 2024-11-11 03:59:40 UTC |
Source: | https://github.com/cran/PAC |
Aggregates results from the clustering and merging step.
aggregateData(dataInput, labelsInput)
dataInput |
Data matrix, with first column being SampleID. |
labelsInput |
cluster labels from PAC. |
The aggregated data of dataInput, with average signal levels for all cluster and sample combinations.
library(PAC)
n <- 5e3 # number of observations
p <- 1 # number of dimensions
K <- 3 # number of clusters
w <- rep(1, K) / K # component weights
mu <- c(0, 2, 4) # component means
sd <- rep(1, K) / K # component standard deviations
g <- sample(1:K, prob = w, size = n, replace = TRUE) # ground truth for clustering
X <- as.matrix(rnorm(n = n, mean = mu[g], sd = sd[g]))
y <- PAC(X, K)
X2 <- as.matrix(rnorm(n = n, mean = mu[g], sd = sd[g]))
y2 <- PAC(X2, K)
# prepend the sample ID column expected by aggregateData
X <- cbind("Sample1", as.data.frame(X)); colnames(X) <- c("SampleID", "Value")
X2 <- cbind("Sample2", as.data.frame(X2)); colnames(X2) <- c("SampleID", "Value")
aggregateData(rbind(X, X2), c(y, y2))
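The aggregation idea can be illustrated in base R without the package. The following is a minimal sketch, assuming that "average signal levels" means the per-column mean within each sample and cluster combination; the toy data and the `ClusterID` column name are hypothetical and are not taken from the package output.

```r
# Toy data: first column is the sample ID, as aggregateData expects.
toy <- data.frame(
  SampleID = c("Sample1", "Sample1", "Sample1", "Sample2", "Sample2"),
  Value = c(1, 3, 10, 2, 4)
)
# Hypothetical cluster labels, one per row of toy.
labels <- c(1, 1, 2, 1, 1)

# Mean signal for every (sample, cluster) combination that occurs.
agg <- aggregate(
  toy["Value"],
  by = list(SampleID = toy$SampleID, ClusterID = labels),
  FUN = mean
)
print(agg)
```

Each output row corresponds to one sample and cluster pair, which is the shape a downstream annotation or heatmap step would typically consume.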
Creates the annotation matrix for the clades in aggregated format. The matrix contains the average signal of each dimension for each clade in each sample.
annotateClades(sampleIDs, topHubs)
sampleIDs |
sampleID vector |
topHubs |
number of top ranked genes to output for annotation; annotation is a concatenated list of top ranked genes. |
Annotated clade matrix
Adds subpopulation proportions to the annotation matrix for the clades.
annotationMatrix_withSubpopProp(aggregateMatrix_withAnnotation)
aggregateMatrix_withAnnotation |
the annotated clade matrix |
Annotated clade matrix with subpopulation proportions
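As a rough illustration of what a subpopulation proportion is, the sketch below computes, for each sample, the fraction of observations assigned to each subpopulation. It assumes that interpretation of "proportion"; the vectors `sample_id` and `subpop` are hypothetical toy inputs, and the package function itself operates on the annotated clade matrix rather than on raw labels.

```r
# Hypothetical per-observation sample IDs and subpopulation labels.
sample_id <- c("s1", "s1", "s1", "s2", "s2")
subpop <- c("A", "A", "B", "A", "B")

# Count observations per (sample, subpopulation), then normalise each row
# so that the proportions within a sample sum to 1.
counts <- table(sample_id, subpop)
props <- prop.table(counts, margin = 1)
print(props)
```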
Finds N leaf centers in the data.
BSPLeaveCenter(data, N = 40, method = "dsp")
data |
a n x p data matrix |
N |
number of leaf centers |
method |
partition method, either "dsp" (discrepancy-based partition) or "ll" (Bayesian sequential partition with limited look-ahead) |
leafctr: the N leaf centers
Makes a constellation plot, in which the cluster centroids are embedded in the t-SNE 2D plane and the cross-sample relationships are plotted as lines connecting related sample clusters (clades).
constellationPlot(pacman_results, perplexity, max_iter, seed, plotTitle = "Constellations of Clades", nudge_x = 0.3, nudge_y = 0.3)
pacman_results |
PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels. |
perplexity |
perplexity setting for running t-SNE |
max_iter |
max_iter setting for running t-SNE |
seed |
seed used to make the t-SNE embedding and the constellation plot reproducible |
plotTitle |
title of the constellation plot |
nudge_x |
nudge on x coordinate of centroid labels |
nudge_y |
nudge on y coordinate of centroid labels |
Computes the F measure between the ground truth and the estimated labels.
fmeasure(g, t)
g |
the ground truth |
t |
estimated labels |
f: the F measure
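For readers unfamiliar with the clustering F measure, here is a minimal base R sketch of one common definition: for each true class, take the best F score over all estimated clusters, then average those scores weighted by class size. This definition is an assumption for illustration; the source does not state which variant `fmeasure` implements, and `f_measure_sketch` is a hypothetical helper, not a package function.

```r
f_measure_sketch <- function(truth, est) {
  # Contingency table: rows are true classes, columns are estimated clusters.
  tab <- unclass(table(truth, est))
  recall <- tab / rowSums(tab)           # n_ij / n_i (divides each row)
  precision <- t(t(tab) / colSums(tab))  # n_ij / n_j (divides each column)
  f <- 2 * precision * recall / (precision + recall)
  f[is.nan(f)] <- 0                      # empty cells contribute nothing
  # Best-matching cluster per true class, weighted by class size.
  sum(rowSums(tab) / sum(tab) * apply(f, 1, max))
}

# Label names do not matter, only the grouping.
perfect <- f_measure_sketch(c(1, 1, 2, 2, 3, 3), c("b", "b", "a", "a", "c", "c"))
# Everything lumped into one cluster scores lower.
lumped <- f_measure_sketch(c(1, 1, 2, 2), c(1, 1, 1, 1))
print(c(perfect = perfect, lumped = lumped))
```

Because the score is built from a contingency table, it is invariant to how the estimated clusters are numbered, which is why it suits comparing clustering output against ground truth.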
Calculates the (global) average spread of subpopulations in clades with 2 subpopulations on the constellation plot.
getAverageSpreadOf2SubpopClades(tsneResults, pacman_results)
tsneResults |
t-SNE output of clade centroids' embedding. |
pacman_results |
PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels. |
Returns the global average of 2-subpopulation clade spread on the constellation plot.
Calculates subpopulations in clades (with two or more subpopulations) that are too far away from other subpopulations (within the same clade) on the constellation plot; these far away subpopulations should be pruned away from the original clades.
getExtraneousCladeSubpopulations(tsneResults, pacman_results, threshold_multiplier, max_threshold)
tsneResults |
t-SNE output of clade centroids' embedding. |
pacman_results |
PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels. |
threshold_multiplier |
multiplier applied to the threshold, where the threshold is (a) the spread from the center of the clade for clades with three or more sample subpopulations and (b) the distance from each subpopulation centroid for clades with exactly two subpopulations. |
max_threshold |
the maximum distance (on t-SNE plane) allowed for sample subpopulations to be categorized into the same clade. |
Returns clade subpopulations to be pruned.
Outputs representative networks for clades/subpopulations larger than a size filter (very small subpopulations are not considered in downstream analyses)
getRepresentativeNetworks(sampleIDs, dim_subset, SubpopSizeFilter, num_networkEdge)
sampleIDs |
sampleID vector |
dim_subset |
a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns |
SubpopSizeFilter |
the size cutoff for small subpopulations. Smaller subpopulations have unstable covariance structure, so no network structure is calculated for them |
num_networkEdge |
the number of edges to draw for each subpopulation mutual information network |
Creates the matrix that can be easily plotted with a heatmap function available in an R package
heatmapInput(aggregateMatrix_withAnnotation)
aggregateMatrix_withAnnotation |
the annotated clade matrix |
the heatmap input matrix
Calculates the Jaccard similarity matrix.
JaccardSM(network1, network2)
network1 |
first network matrix input |
network2 |
second network matrix input |
the alignment/co-occurrence score
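To make the Jaccard idea concrete, the following sketch scores two networks by the number of edges they share divided by the number of edges present in either. It assumes the networks are given as adjacency matrices of equal dimension in which nonzero entries mark edges; `jaccard_edges` is a hypothetical helper, and the exact scoring used by `JaccardSM` may differ.

```r
jaccard_edges <- function(a, b) {
  a <- a != 0                      # treat any nonzero entry as an edge
  b <- b != 0
  n_union <- sum(a | b)
  if (n_union == 0) return(0)      # two empty networks: define the score as 0
  sum(a & b) / n_union
}

# Two symmetric 3-node networks sharing exactly one edge (1-2).
net1 <- matrix(0, 3, 3); net1[1, 2] <- net1[2, 1] <- 1; net1[2, 3] <- net1[3, 2] <- 1
net2 <- matrix(0, 3, 3); net2[1, 2] <- net2[2, 1] <- 1; net2[1, 3] <- net2[3, 1] <- 1
score <- jaccard_edges(net1, net2)
print(score)
```

A score of 1 means identical edge sets and 0 means no shared edges, so higher values indicate subpopulations whose networks align more closely.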
Creates network alignments using networks constructed from subpopulations after PAC.
MAN(sampleIDs, num_PACSupop, smallSubpopCutoff, k_clades)
sampleIDs |
sampleID vector |
num_PACSupop |
number of subpopulations learned in PAC step for each sample |
smallSubpopCutoff |
Population size cutoff for subpopulations in clade calculation. The small subpopulations will be considered in the refinement step. |
k_clades |
number of clades to output before refinement |
clades_network_only: the clades constructed without small subpopulations (by cutoff) using mutual information network alignments
Generates the mutual information network connection matrix (mrnet algorithm) using the parmigene package. Mutual information is calculated with the infotheo package.
MINetwork_matrix_topEdges(dataMatrix, threshold)
dataMatrix |
data matrix |
threshold |
the number of edges to draw for each subpopulation mutual information network |
the mutual information network connection matrix with top edges
Outputs the vectorized summary of a network based on the number of edges connected to a node
MINetwork_simplified_topEdges(dataMatrix, threshold)
dataMatrix |
data matrix |
threshold |
the number of edges to draw for each subpopulation mutual information network |
Plots the mutual information network connections (mrnet algorithm) using the parmigene package. Mutual information is calculated with the infotheo package.
MINetworkPlot_topEdges(dataMatrix, threshold)
dataMatrix |
data matrix |
threshold |
the maximum number of edges to draw for each subpopulation mutual information network |
Wrapper to output the mutual information networks for subpopulations with size larger than a desired threshold.
outputNetworks_topEdges_matrix(dataMatrix, subpopulationLabels, threshold)
dataMatrix |
data matrix with first column being the sample ID |
subpopulationLabels |
the subpopulation labels |
threshold |
the number of edges to draw for each subpopulation mutual information network |
Outputs the representative/clade networks (plots and summary vectors) for subpopulations with size larger than a desired threshold. Saves the networks and the data matrices without the smaller subpopulations.
outputRepresentativeNetworks_topEdges(dataMatrix, subpopulationLabels, threshold)
dataMatrix |
data matrix with first column being the sample ID |
subpopulationLabels |
the subpopulation labels |
threshold |
the number of edges to draw for each subpopulation mutual information network |
Partition-assisted clustering (PAC) 1) uses dsp or bsp-ll to recursively partition the data space and 2) applies a short round of k-means-style postprocessing to efficiently output cluster labels for the data points.
PAC(data, K, maxlevel = 40, method = "dsp", max.iter = 50)
data |
a n x p data matrix |
K |
number of final clusters in the output |
maxlevel |
the maximum level of the partition |
method |
partition method, either "dsp" (discrepancy-based partition) or "bsp" (Bayesian sequential partition) |
max.iter |
maximum number of iterations for the k-means step |
y: cluster labels for the input
library(PAC)
n <- 5e3 # number of observations
p <- 1 # number of dimensions
K <- 3 # number of clusters
w <- rep(1, K) / K # component weights
mu <- c(0, 2, 4) # component means
sd <- rep(1, K) / K # component standard deviations
g <- sample(1:K, prob = w, size = n, replace = TRUE) # ground truth for clustering
X <- as.matrix(rnorm(n = n, mean = mu[g], sd = sd[g]))
y <- PAC(X, K)
# compare the estimated labels against the ground truth
print(fmeasure(g, y))
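The two-stage structure described for PAC (partition the space into leaves, then cluster with a k-means-style step) can be sketched in base R. This is only an illustration of the shape of the approach under strong simplifications: recursive median splits stand in for dsp and bsp-ll, and plain `kmeans` on the leaf centers stands in for the package's postprocessing. The helper `median_partition` is hypothetical and is not part of the package.

```r
# Stage 1 (stand-in): recursive median splits along the widest dimension.
# This is NOT dsp or bsp-ll; it only mimics partitioning the space into leaves.
median_partition <- function(x, idx = seq_len(nrow(x)), depth = 3) {
  if (depth == 0 || length(idx) < 2) return(list(idx))
  widths <- apply(x[idx, , drop = FALSE], 2, function(v) diff(range(v)))
  d <- which.max(widths)
  split_at <- median(x[idx, d])
  left <- idx[x[idx, d] <= split_at]
  right <- idx[x[idx, d] > split_at]
  if (length(left) == 0 || length(right) == 0) return(list(idx))
  c(median_partition(x, left, depth - 1),
    median_partition(x, right, depth - 1))
}

set.seed(1)
# Two well-separated 1-D groups of 50 points each.
x <- as.matrix(c(rnorm(50, mean = 0, sd = 0.1), rnorm(50, mean = 10, sd = 0.1)))
leaves <- median_partition(x, depth = 3)
# One center (column means) per leaf, stacked into a leaves-by-p matrix.
centers <- do.call(rbind, lapply(leaves, function(i) colMeans(x[i, , drop = FALSE])))

# Stage 2 (stand-in): k-means on the leaf centers, then label points by leaf.
km <- kmeans(centers, centers = 2, nstart = 10)
labels <- integer(nrow(x))
for (l in seq_along(leaves)) labels[leaves[[l]]] <- km$cluster[l]
print(table(labels))
```

Clustering a handful of leaf centers instead of every data point is what makes the second stage cheap; the quality of the result then depends on how well the partition follows the data density, which is the job of dsp or bsp-ll in the actual package.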
Calculates the within-cluster spread.
recordWithinClusterSpread(sampleIDs, dim_subset = NULL, SubpopSizeFilter)
sampleIDs |
A vector of sample names. |
dim_subset |
a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns. |
SubpopSizeFilter |
threshold to filter out very small clusters with too few points; these very small subpopulations may be outliers and not biologically relevant. |
Returns the within-cluster spread for each sample.
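One way to picture a within-cluster spread with a size filter is the sketch below. It assumes spread means the average Euclidean distance from each point to its cluster centroid, and that clusters below the size filter are simply skipped; both are assumptions for illustration, and `within_cluster_spread` and `min_size` are hypothetical names rather than the package's own.

```r
within_cluster_spread <- function(x, labels, min_size = 2) {
  # Keep only clusters with at least min_size points.
  keep <- names(which(table(labels) >= min_size))
  sapply(keep, function(k) {
    pts <- x[labels == k, , drop = FALSE]
    ctr <- colMeans(pts)
    # Mean Euclidean distance of the cluster's points to its centroid.
    mean(sqrt(rowSums(sweep(pts, 2, ctr)^2)))
  })
}

x <- rbind(c(0, 0), c(2, 0), c(10, 0), c(10, 4), c(5, 5))
labels <- c("a", "a", "b", "b", "c")
spread <- within_cluster_spread(x, labels, min_size = 2)
print(spread)  # cluster "c" has a single point and is filtered out
```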
Refines the subpopulation labels from PAC using network alignment and small subpopulation information. Outputs a new set of files containing the representative labels.
refineSubpopulationLabels(sampleIDs, dim_subset, clades_network_only, expressionGroupClamp)
sampleIDs |
sampleID vector |
dim_subset |
a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns |
clades_network_only |
the alignment results from MAN; used to translate the original sample-specific labels into clade labels |
expressionGroupClamp |
clamps the subpopulations into desired number of expression groups for assigning small subpopulations into larger groups or their own groups. |
Prunes away specified subpopulations that are far away from the rest of their clades.
renamePrunedSubpopulations(pacman_results, subpopulationsToPrune)
pacman_results |
PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels. |
subpopulationsToPrune |
A vector of clade IDs; these clades will be pruned. |
Returns PAC-MAN analysis result matrix with pruned clades. The pruning process creates new clades to replace the original clade ID of the specified subpopulations.
Runs elbow point analysis to find the practical optimal number of clades to output. Outputs the average within-sample cluster spread for all samples and the elbow point analysis plot with a loess line fitted through the results.
runElbowPointAnalysis(ks, sampleIDs, dim_subset, num_PACSupop, smallSubpopCutoff, expressionGroupClamp, SubpopSizeFilter)
ks |
Vector that is a sequence of clade sizes. |
sampleIDs |
A vector of sample names. |
dim_subset |
a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns. |
num_PACSupop |
Number of PAC subpopulations explored in each sample. |
smallSubpopCutoff |
Size cutoff below which minor subpopulations are not used in the multiple alignments of networks. |
expressionGroupClamp |
clamps the subpopulations into desired number of expression groups for assigning small subpopulations into larger groups or their own groups. |
SubpopSizeFilter |
threshold to filter out very small clusters with too few points in the calculation of cluster spreads; these very small subpopulations may be outliers and not biologically relevant. |
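The function reports a spread curve and a plot, leaving the elbow to be read off by eye. For illustration, one simple geometric rule for locating an elbow is sketched below: rescale both axes to [0, 1] and pick the point farthest from the straight line joining the first and last points. This rule is an assumption added here, not something the source attributes to the package; `elbow_index` and the toy spread values are hypothetical, and no loess smoothing is applied.

```r
elbow_index <- function(k, y) {
  # Rescale both axes to [0, 1] so neither dominates the distance.
  kx <- (k - min(k)) / diff(range(k))
  yy <- (y - min(y)) / diff(range(y))
  n <- length(kx)
  x1 <- kx[1]; y1 <- yy[1]
  x2 <- kx[n]; y2 <- yy[n]
  # Perpendicular distance of every point to the chord (first to last point).
  d <- abs((y2 - y1) * kx - (x2 - x1) * yy + x2 * y1 - y2 * x1) /
    sqrt((y2 - y1)^2 + (x2 - x1)^2)
  which.max(d)
}

ks <- 1:6                                # candidate numbers of clades
avg_spread <- c(10, 5, 2, 1.8, 1.6, 1.5) # toy average within-cluster spreads
best_k <- ks[elbow_index(ks, avg_spread)]
print(best_k)
```

Beyond the selected value, adding more clades reduces the spread only marginally, which is the usual argument for stopping at the elbow.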
A wrapper to run PAC and output subpopulation mutual information networks. Please use the PAC function itself for individual samples or if the MAN step is not needed.
samplePass(sampleIDs, dim_subset, hyperrectangles, num_PACSupop, max.iter, num_networkEdge)
sampleIDs |
sampleID vector |
dim_subset |
a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns |
hyperrectangles |
number of hyperrectangles to learn for each sample |
num_PACSupop |
number of subpopulations to output for each sample using PAC |
max.iter |
maximum number of k-means iterations in the postprocessing step |
num_networkEdge |
a threshold on the number of edges to output for each subpopulation mutual information network |