OnlinePCA.jl (Julia API)
Binarization (CSV file)
OnlinePCA.csv2bin — Method
csv2bin(;csvfile::AbstractString="", binfile::AbstractString="")
Convert a CSV file to a Julia Binary file.
Specify csvfile and binfile as, for example, Data.csv and Data.dat, respectively.
Binarization (Matrix Market <MM> file)
OnlinePCA.mm2bin — Method
mm2bin(;mmfile::AbstractString="", binfile::AbstractString="")
Convert a Matrix Market (MM) file to a Julia Binary file.
Input Arguments
- mmfile: Matrix Market file (e.g., Data.mtx).
- binfile: Julia Binary file (e.g., Data.mtx.zst).
Binarization (Binary COO <BinCOO> file)
OnlinePCA.bincoo2bin — Method
bincoo2bin(;bincoofile::AbstractString="", binfile::AbstractString="")
Convert a Binary COO (BinCOO) file to a Julia Binary file.
Input Arguments
- bincoofile: Binary COO file (e.g., Data.bincoo).
- binfile: Julia Binary file (e.g., Data.bincoo.zst).
Summarization
OnlinePCA.sumr — Method
sumr(; binfile::AbstractString="", outdir::AbstractString=".", pseudocount::Number=1.0, mode::AbstractString="dense", chunksize::Int=1)
Extract summary information from the data matrix.
Input Arguments
- binfile: A Julia Binary file generated by the csv2bin function.
- outdir: The directory in which to save the results.
- pseudocount: The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- mode: "dense" or "sparse_mm" can be specified.
- chunksize: The number of rows to be read at once.
Output Files
- Sample_NoCounts.csv: Sum of counts in each column.
- Feature_Means.csv: Mean of each row.
- Feature_LogMeans.csv: Log10(Mean+pseudocount) of each row.
- Feature_FTTMeans.csv: FTT(Mean+pseudocount) of each row.
- Feature_Vars.csv: Sample variance of each row.
- Feature_LogVars.csv: Log10(Var+pseudocount) of each row.
- Feature_FTTVars.csv: FTT(Var+pseudocount) of each row.
- Feature_CV2s.csv: Squared coefficient of variation (CV2) of each row.
- Feature_NoZeros.csv: Number of zero elements in each row.
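As a rough illustration of the quantities listed above, the following sketch computes them for a small in-memory matrix using only the Julia standard library. It is not the sumr implementation: the helper name summarize is hypothetical, FTT is assumed to be the Freeman-Tukey transform sqrt(x) + sqrt(x + 1), and CV2 is assumed to be variance / mean^2.

```julia
using Statistics

# Freeman-Tukey transform (assumed form: sqrt(x) + sqrt(x + 1)).
ftt(x) = sqrt(x) + sqrt(x + 1)

# Hypothetical helper: row-wise (feature) and column-wise (sample) summaries
# of a features × samples count matrix held in memory.
function summarize(X::AbstractMatrix; pseudocount::Real=1.0)
    means = vec(mean(X, dims=2))
    vars = vec(var(X, dims=2))            # sample variance (n - 1 denominator)
    return (
        Sample_NoCounts = vec(sum(X, dims=1)),
        Feature_Means = means,
        Feature_LogMeans = log10.(means .+ pseudocount),
        Feature_FTTMeans = ftt.(means .+ pseudocount),
        Feature_Vars = vars,
        Feature_LogVars = log10.(vars .+ pseudocount),
        Feature_FTTVars = ftt.(vars .+ pseudocount),
        Feature_CV2s = vars ./ means .^ 2,
        Feature_NoZeros = vec(sum(X .== 0, dims=2)),
    )
end

# Toy matrix: 3 features (rows) × 3 samples (columns).
X = [0 2 4; 1 1 1; 0 0 9]
s = summarize(X)
```

Each field of the returned named tuple corresponds to one of the output files named above; the real function streams the binary file in chunks instead of holding the matrix in memory.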
Filtering
OnlinePCA.filtering — Method
filtering(;input::AbstractString="", featurelist::AbstractString="", samplelist::AbstractString="", thr1::Number=0.0, thr2::Number=0.0, direct1::AbstractString="+", direct2::AbstractString="+", outdir::AbstractString=".")
This function filters genes by criteria such as the mean or variance of each gene.
Input Arguments
- input: A Julia Binary file generated by the csv2bin function.
- featurelist: A row-wise summary data file. The CSV files are generated by the sumr function.
- thr: The threshold used to reject low-signal features.
- outdir: The directory in which to save the results.
Output Files
filtered.zst: Filtered binary file.
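The idea of threshold-based filtering can be sketched on an in-memory matrix as follows. This is only an illustration: the helper filter_rows is hypothetical, and the meaning given here to the direction flag ("+" keeps values above the threshold, anything else keeps values below it) is an assumption, not the documented behaviour of direct1 or direct2.

```julia
# Hypothetical helper: keep the rows of X whose summary value passes a threshold.
# direct = "+" keeps values above thr; any other value keeps values below thr
# (assumed semantics, chosen only for this illustration).
function filter_rows(X::AbstractMatrix, summary::AbstractVector;
                     thr::Real=0.0, direct::AbstractString="+")
    length(summary) == size(X, 1) ||
        throw(DimensionMismatch("one summary value per row is required"))
    keep = direct == "+" ? summary .> thr : summary .< thr
    return X[keep, :], findall(keep)
end

# Toy matrix with one low-signal feature (row 1).
X = [0 0 1; 5 7 9; 2 0 4]
rowmeans = [1 / 3, 7.0, 2.0]
Xf, kept = filter_rows(X, rowmeans; thr=1.0)
```

Returning the retained row indices alongside the filtered matrix makes it easy to map results back to the original feature names.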
Identifying Highly Variable Genes
OnlinePCA.hvg — Method
hvg(;binfile::AbstractString="", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", rowcv2list::AbstractString="", outdir::AbstractString=".")
This function identifies highly variable genes (HVGs), a feature selection method used in scRNA-seq studies.
Input Arguments
- binfile: A Julia Binary file generated by the csv2bin function.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the sumr function.
- rowcv2list: The CV2 of each row of the matrix. The CSV file is generated by the sumr function.
- outdir: The directory in which to save the results.
Output Files
- HVG_useForFit.csv: Parameters to estimate the highly variable genes.
- HVG_a0.csv: Parameters to estimate the highly variable genes.
- HVG_a1.csv: Parameters to estimate the highly variable genes.
- HVG_afits.csv: Parameters to estimate the highly variable genes.
- HVG_varFitRatio.csv: Parameters to estimate the highly variable genes.
- HVG_df.csv: Parameters to estimate the highly variable genes.
- HVG_pval.csv: Parameters to estimate the highly variable genes.
GD-PCA
OnlinePCA.gd — Method
gd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Gradient descent method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numepoch: The number of epochs.
- scheduling: Scheduling of the learning parameter. robbins-monro, momentum, nag, and adagrad are available.
- g: The parameter used when scheduling is set to nag.
- epsilon: The parameter used when scheduling is set to adagrad.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetFull: Offset value for avoiding overflow when calculating the full gradient.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every 1000 iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
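To make the idea behind gradient-based PCA concrete, here is a minimal in-memory sketch using only the Julia standard library. It is not the gd implementation: the helper gd_pca is hypothetical, the data matrix is assumed to hold observations in rows and variables in columns, a constant step size is used instead of a scheduling rule, and orthonormality is restored by a QR decomposition after every step.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: PCA by full-gradient ascent on tr(W' * C * W) / 2.
# X is assumed to hold observations in rows and variables in columns.
function gd_pca(X::AbstractMatrix, dim::Integer; stepsize::Real=0.1, numepoch::Integer=100,
                rng::AbstractRNG=MersenneTwister(1))
    Xc = X .- mean(X, dims=1)                     # center each variable
    C = (Xc' * Xc) / (size(X, 1) - 1)             # covariance matrix
    W = Matrix(qr(randn(rng, size(X, 2), dim)).Q) # random orthonormal start
    for _ in 1:numepoch
        W = Matrix(qr(W + stepsize * (C * W)).Q)  # gradient step, then QR re-orthonormalization
    end
    λ = diag(W' * C * W)                          # variance captured by each column
    return W, λ
end

# Toy data: five variables with standard deviations 5, 3, 1, 0.5, and 0.1.
X = randn(MersenneTwister(0), 200, 5) .* [5.0 3.0 1.0 0.5 0.1]
W, λ = gd_pca(X, 2)
```

Because the whole covariance matrix is formed explicitly, this sketch only suits small problems; it is meant to show the update rule, not the out-of-core processing that the package performs on binary files.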
SGD-PCA
OnlinePCA.sgd — Method
sgd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Stochastic gradient descent method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numbatch: The batch size.
- numepoch: The number of epochs.
- scheduling: Scheduling of the learning parameter. robbins-monro, momentum, nag, and adagrad are available.
- g: The parameter used when scheduling is set to nag.
- epsilon: The parameter used when scheduling is set to adagrad.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetFull: Offset value for avoiding overflow when calculating the full gradient.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
Oja's method
OnlinePCA.oja — Method
oja(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Oja's method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numepoch: The number of epochs.
- scheduling: Scheduling of the learning parameter. robbins-monro, momentum, nag, and adagrad are available.
- g: The parameter used when scheduling is set to nag.
- epsilon: The parameter used when scheduling is set to adagrad.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetFull: Offset value for avoiding overflow when calculating the full gradient.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
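The core update of Oja's rule can be sketched in a few lines of standard-library Julia. This is an illustrative sketch, not the oja implementation: the helper oja_pca is hypothetical, observations are assumed to be the rows of a roughly centered matrix, the step size is assumed to decay as stepsize / t in Robbins-Monro fashion, and orthonormality is restored by QR after each update.

```julia
using LinearAlgebra, Random

# Hypothetical helper: Oja's rule, one observation (row of X) per update.
function oja_pca(X::AbstractMatrix, dim::Integer; stepsize::Real=0.1, numepoch::Integer=3,
                 rng::AbstractRNG=MersenneTwister(1))
    n, p = size(X)
    W = Matrix(qr(randn(rng, p, dim)).Q)       # random orthonormal start
    t = 0
    for _ in 1:numepoch, i in randperm(rng, n)
        t += 1
        x = X[i, :]
        η = stepsize / t                       # Robbins-Monro style decay (assumed form)
        W = Matrix(qr(W + η * x * (x' * W)).Q) # Hebbian update, then re-orthonormalization
    end
    return W
end

# Toy data: the first variable carries almost all of the variance.
X = randn(MersenneTwister(0), 200, 5) .* [10.0 0.1 0.1 0.1 0.1]
W = oja_pca(X, 1)
```

Only one row of the data is touched per update, which is what allows this family of methods to run on matrices that do not fit in memory.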
Reference
- SGD-PCA (Oja's method) : Erkki Oja et al., 1985, Erkki Oja, 1992
CCIPCA
OnlinePCA.ccipca — Method
ccipca(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0, numepoch::Number=3, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Candid covariance-free incremental PCA.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numepoch: The number of epochs.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
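The CCIPCA recursion can be sketched on an in-memory matrix as follows. This is a simplified illustration rather than the ccipca implementation: the helper ccipca_sketch and its amnesic keyword (the amnesic parameter of the original algorithm) are hypothetical names, the rows of X are assumed to be zero-mean observations, and the relationship between the amnesic parameter and the stepsize argument above is not assumed.

```julia
using LinearAlgebra, Random

# Hypothetical helper: CCIPCA over the rows of a (roughly) centered matrix X.
# amnesic = 0 gives a plain running average of the updates.
function ccipca_sketch(X::AbstractMatrix, dim::Integer; amnesic::Real=0.0)
    n, p = size(X)
    V = zeros(p, dim)                    # unnormalized eigenvector estimates
    for t in 1:n
        u = float.(X[t, :])              # residual, deflated component by component
        for i in 1:min(dim, t)
            if i == t
                V[:, i] = u              # initialize the i-th estimate with the residual
            else
                v = V[:, i]
                V[:, i] = (t - 1 - amnesic) / t * v +
                          (1 + amnesic) / t * u * dot(u, v) / norm(v)
            end
            w = V[:, i] / norm(V[:, i])
            u = u - dot(u, w) * w        # remove the i-th direction before the next one
        end
    end
    λ = [norm(V[:, i]) for i in 1:dim]   # vector norms estimate the eigenvalues
    W = V ./ reshape(λ, 1, :)            # unit-length eigenvector estimates
    return W, λ
end

# Toy data: the first variable has standard deviation 10, the second 3.
X = randn(MersenneTwister(0), 200, 5) .* [10.0 3.0 0.1 0.1 0.1]
W, λ = ccipca_sketch(X, 2)
```

No covariance matrix is ever formed: each estimate is updated directly from the current observation, and later components see only the residual left after deflation.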
Reference
- CCIPCA : Juyang Weng et al., 2003
RSGD-PCA
OnlinePCA.rsgd — Method
rsgd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Riemannian stochastic gradient descent method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numbatch: The batch size.
- numepoch: The number of epochs.
- scheduling: Scheduling of the learning parameter. robbins-monro, momentum, nag, and adagrad are available.
- g: The parameter used when scheduling is set to nag.
- epsilon: The parameter used when scheduling is set to adagrad.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetFull: Offset value for avoiding overflow when calculating the full gradient.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
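A minimal sketch of the Riemannian variant: the mini-batch gradient is projected onto the tangent space of the set of orthonormal matrices before the step, and the result is mapped back with a QR-based retraction. The helper rsgd_pca is hypothetical, a constant step size is assumed, and the rows of X are assumed to be centered observations; this is not the rsgd implementation.

```julia
using LinearAlgebra, Random

# Hypothetical helper: mini-batch Riemannian SGD for PCA with a QR retraction.
function rsgd_pca(X::AbstractMatrix, dim::Integer; stepsize::Real=0.005, numbatch::Integer=10,
                  numepoch::Integer=3, rng::AbstractRNG=MersenneTwister(1))
    n, p = size(X)
    W = Matrix(qr(randn(rng, p, dim)).Q)            # random orthonormal start
    for _ in 1:numepoch
        for idx in Iterators.partition(randperm(rng, n), numbatch)
            B = X[idx, :]                           # current mini-batch
            G = B' * (B * W) / length(idx)          # Euclidean stochastic gradient
            R = G - W * (W' * G)                    # projection onto the tangent space
            W = Matrix(qr(W + stepsize * R).Q)      # retraction back to orthonormal columns
        end
    end
    return W
end

# Toy data: the first variable carries almost all of the variance.
X = randn(MersenneTwister(0), 200, 5) .* [10.0 0.1 0.1 0.1 0.1]
W = rsgd_pca(X, 2)
```

The step size has to be matched to the scale of the data; the value used here suits the toy matrix and is not a recommended default.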
Reference
- RSGD-PCA : Silvere Bonnabel, 2013
SVRG-PCA
OnlinePCA.svrg — Method
svrg(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Variance-reduced stochastic gradient descent method, also known as VR-PCA.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numbatch: The batch size.
- numepoch: The number of epochs.
- scheduling: Scheduling of the learning parameter. robbins-monro, momentum, nag, and adagrad are available.
- g: The parameter used when scheduling is set to nag.
- epsilon: The parameter used when scheduling is set to adagrad.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetFull: Offset value for avoiding overflow when calculating the full gradient.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
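The variance-reduction idea can be sketched for the leading component only. Each epoch computes a full gradient at a snapshot vector and then corrects every single-sample gradient with it. The helper vrpca_sketch is hypothetical, the inner loop is simplified to one pass over a random permutation of the rows, and the rows of X are assumed to be centered observations; the package's block (dim > 1) and mini-batch handling are not reproduced here.

```julia
using LinearAlgebra, Random

# Hypothetical helper: variance-reduced stochastic updates for the first principal direction.
function vrpca_sketch(X::AbstractMatrix; stepsize::Real=0.005, numepoch::Integer=3,
                      rng::AbstractRNG=MersenneTwister(1))
    n, p = size(X)
    w_snap = normalize(randn(rng, p))        # snapshot vector
    for _ in 1:numepoch
        u_snap = X' * (X * w_snap) / n       # full gradient at the snapshot
        w = copy(w_snap)
        for i in randperm(rng, n)
            x = X[i, :]
            # stochastic gradient at w, corrected by the snapshot gradient
            g = x * (dot(x, w) - dot(x, w_snap)) + u_snap
            w = normalize(w + stepsize * g)
        end
        w_snap = w                           # the last iterate becomes the new snapshot
    end
    return w_snap
end

# Toy data: the first variable carries almost all of the variance.
X = randn(MersenneTwister(0), 200, 5) .* [10.0 0.1 0.1 0.1 0.1]
w = vrpca_sketch(X)
```

As the iterate approaches the snapshot, the correction term shrinks, so the noise of the stochastic updates vanishes near convergence even with a constant step size.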
Reference
- SVRG-PCA : Ohad Shamir, 2015
RSVRG-PCA
OnlinePCA.rsvrg — Method
rsvrg(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Riemannian variance-reduced stochastic gradient descent method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- stepsize: The step-size parameter used in every iteration.
- numbatch: The batch size.
- numepoch: The number of epochs.
- scheduling: Scheduling of the learning parameter. robbins-monro, momentum, nag, and adagrad are available.
- g: The parameter used when scheduling is set to nag.
- epsilon: The parameter used when scheduling is set to adagrad.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- evalfreq: Evaluation frequency of the reconstruction error.
- offsetFull: Offset value for avoiding overflow when calculating the full gradient.
- offsetStoch: Offset value for avoiding overflow when calculating the stochastic gradient.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
Reference
- RSVRG-PCA : Hongyi Zhang et al., 2016, Hiroyuki Sato et al., 2017
Orthogonal Iteration (Power method)
OnlinePCA.orthiter — Method
orthiter(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Orthogonal iteration, also known as the block power method, subspace iteration, or simultaneous iteration.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- numepoch: The number of epochs.
- lower: Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
- upper: Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
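Orthogonal iteration itself fits in a few lines. The sketch below applies it to a small symmetric matrix held in memory; the helper orthiter_sketch is hypothetical and stands in for the idea only, not for the package's file-based implementation.

```julia
using LinearAlgebra, Random

# Hypothetical helper: orthogonal (block power) iteration on a symmetric matrix C.
function orthiter_sketch(C::AbstractMatrix, dim::Integer; numepoch::Integer=10,
                         rng::AbstractRNG=MersenneTwister(1))
    W = Matrix(qr(randn(rng, size(C, 1), dim)).Q)  # random orthonormal start
    for _ in 1:numepoch
        W = Matrix(qr(C * W).Q)                    # multiply, then re-orthonormalize
    end
    return W, diag(W' * C * W)                     # basis and Rayleigh quotients
end

# Small symmetric test matrix whose eigenvalues are 7, 2, 0.5, and 0.1.
C = [6.0 2.0 0.0 0.0; 2.0 3.0 0.0 0.0; 0.0 0.0 0.5 0.0; 0.0 0.0 0.0 0.1]
W, λ = orthiter_sketch(C, 2; numepoch=50)
```

The convergence rate depends on the ratio between the first discarded eigenvalue and the last retained one, which is why more epochs are needed when the spectrum decays slowly.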
Arnoldi method
OnlinePCA.arnoldi — Method
arnoldi(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, perm::Bool=false, cper::Number=1f0)
Arnoldi method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
Lanczos method
OnlinePCA.lanczos — Method
lanczos(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, perm::Bool=false, cper::Number=1f0)
Lanczos method.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- W: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- V: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
- stop: Whether the calculation has converged
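For readers unfamiliar with the Lanczos process, the sketch below builds a Krylov basis for a small symmetric matrix, reduces the problem to a tridiagonal one, and reads the leading eigenpairs from it. The helper lanczos_sketch and its numsteps keyword are hypothetical, and full reorthogonalization is used for numerical simplicity; whether the package does the same is not assumed.

```julia
using LinearAlgebra, Random

# Hypothetical helper: Lanczos iteration with full reorthogonalization
# for the leading eigenpairs of a symmetric matrix C.
function lanczos_sketch(C::AbstractMatrix, dim::Integer; numsteps::Integer=size(C, 1),
                        rng::AbstractRNG=MersenneTwister(1))
    p = size(C, 1)
    m = min(numsteps, p)
    Q = zeros(p, m)                       # Krylov basis
    α = zeros(m)                          # diagonal of the tridiagonal matrix
    β = zeros(m - 1)                      # off-diagonal of the tridiagonal matrix
    q = normalize(randn(rng, p))
    k = m                                 # number of basis vectors actually built
    for j in 1:m
        Q[:, j] = q
        r = C * q
        α[j] = dot(q, r)
        r -= Q[:, 1:j] * (Q[:, 1:j]' * r) # orthogonalize against all basis vectors so far
        j == m && break
        β[j] = norm(r)
        if β[j] < 1e-12                   # invariant subspace reached, stop early
            k = j
            break
        end
        q = r / β[j]
    end
    E = eigen(SymTridiagonal(α[1:k], β[1:k-1]))
    idx = sortperm(E.values, rev=true)[1:min(dim, k)]
    return Q[:, 1:k] * E.vectors[:, idx], E.values[idx]
end

# Small symmetric test matrix whose eigenvalues are 7, 2, 0.5, and 0.1.
C = [6.0 2.0 0.0 0.0; 2.0 3.0 0.0 0.0; 0.0 0.0 0.5 0.0; 0.0 0.0 0.0 0.1]
W, λ = lanczos_sketch(C, 2)
```

In practice far fewer steps than the matrix dimension are taken, and the largest eigenvalues of the tridiagonal matrix already approximate the leading eigenvalues of C well.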
Halko's method
OnlinePCA.halko — Method
halko(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Halko's method, a randomized SVD algorithm.
Input Arguments
- input: Julia Binary file generated by the OnlinePCA.csv2bin function.
- outdir: The directory in which to save the results.
- scale: {log,ftt,raw}-scaling of the values.
- pseudocount: The value added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
- rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
- dim: The number of PCA dimensions.
- noversamples: The number of oversamples.
- niter: The number of power iterations.
- initW: The CSV file containing the initial values of the eigenvectors.
- initV: The CSV file containing the initial values of the loadings.
- logdir: The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
- perm: Whether the data matrix is shuffled at random.
- cper: Count per X (e.g., CPM: count per million <1e+6>).
Output Arguments
- V: Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
- λ: Eigenvalues (dim × dim)
- U: Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
- Scores: Principal component scores
- ExpVar: Variance explained by the eigenvectors
- TotalVar: Total variance of the data matrix
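The roles of noversamples and niter are easiest to see in a compact randomized SVD. The sketch below works on an in-memory matrix; the helper halko_sketch is hypothetical, and details such as centering, scaling, and file streaming performed by the package are left out.

```julia
using LinearAlgebra, Random

# Hypothetical helper: randomized SVD with oversampling and power iterations.
function halko_sketch(X::AbstractMatrix, dim::Integer; noversamples::Integer=5,
                      niter::Integer=3, rng::AbstractRNG=MersenneTwister(1))
    n, p = size(X)
    l = min(dim + noversamples, n, p)    # sketch width: target rank plus oversamples
    Ω = randn(rng, p, l)                 # Gaussian test matrix
    Q = Matrix(qr(X * Ω).Q)              # orthonormal basis of the sampled range
    for _ in 1:niter                     # power iterations sharpen the spectrum
        Q = Matrix(qr(X' * Q).Q)
        Q = Matrix(qr(X * Q).Q)
    end
    F = svd(Q' * X)                      # SVD of the small l × p projected matrix
    return Q * F.U[:, 1:dim], F.S[1:dim], F.V[:, 1:dim]
end

# Toy matrix of exact rank 3, so three components recover it completely.
rng = MersenneTwister(0)
X = randn(rng, 50, 3) * randn(rng, 3, 20)
U, S, V = halko_sketch(X, 3)
```

Here U has one row per row of X and V one row per column of X, mirroring the shapes listed above; the squared singular values divided by the number of observations minus one give the eigenvalues of the covariance matrix when X is centered.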
Algorithm 971
OnlinePCA.algorithm971 — Method
algorithm971(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)Algorithm 971, which is one of randomized SVD algorithm.
Input Arguments
input: Julia Binary file generated byOnlinePCA.csv2binfunction.outdir: The directory specified the directory you want to save the result.scale: {log,ftt,raw}-scaling of the value.pseudocount: The number specified to avoid NaN by log10(0) and used whenFeature_LogMeans.csv<log10(mean+pseudocount) value of each feature> is generated.rowmeanlist: The mean of each row of matrix. The CSV file is generated byOnlinePCA.sumrfunctions.rowvarlist: The variance of each row of matrix. The CSV file is generated byOnlinePCA.sumrfunctions.colsumlist: The sum of counts of each columns of matrix. The CSV file is generated byOnlinePCA.sumrfunctions.dim: The number of dimension of PCA.noversamples: The number of over-sampling.niter: The number of power interation.initW: The CSV file saving the initial values of eigenvectors.initV: The CSV file saving the initial values of loadings.logdir: The directory where intermediate files are saved, in every evalfreq (e.g. 5000) iteration.perm: Whether the data matrix is shuffled at random.cper: Count per X (e.g. CPM: Count per million <1e+6>)
Output Arguments
V: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
U: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
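The roles of dim, noversamples, and niter can be illustrated with a generic randomized SVD. The following is a minimal sketch under stated assumptions, not the OnlinePCA.jl implementation: randomized_svd is a hypothetical helper that works on a small in-memory matrix, whereas algorithm971 reads a Julia Binary file and applies the scaling options listed above. It assumes dim does not exceed the smaller matrix dimension.

```julia
using LinearAlgebra, Random

# Hypothetical helper (not part of OnlinePCA.jl): a basic randomized SVD
# with oversampling and power iterations on an in-memory matrix.
function randomized_svd(A::AbstractMatrix; dim::Int=3, noversamples::Int=5,
                        niter::Int=3, rng::AbstractRNG=Random.default_rng())
    m, n = size(A)
    l = min(dim + noversamples, m, n)   # sketch width: target rank plus oversamples
    Ω = randn(rng, n, l)                # Gaussian test matrix
    Q = Matrix(qr(A * Ω).Q)             # orthonormal basis for the range of A*Ω
    for _ in 1:niter                    # power iterations, re-orthonormalized each step
        Z = Matrix(qr(A' * Q).Q)
        Q = Matrix(qr(A * Z).Q)
    end
    F = svd(Q' * A)                     # SVD of the small l × n projected matrix
    U = Q * F.U                         # lift the left singular vectors back to m rows
    return U[:, 1:dim], F.S[1:dim], F.V[:, 1:dim]
end
```

A larger noversamples widens the random sketch, and a larger niter helps when the singular values decay slowly; both trade extra computation for accuracy.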
Randomized Block Krylov Iteration
OnlinePCA.rbkiter — Method
rbkiter(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Randomized Block Krylov Iteration.
Input Arguments
input: Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir: The directory in which to save the results.
scale: {log,ftt,raw}-scaling of the values.
pseudocount: The number added to avoid taking log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim: The number of dimensions of PCA.
numepoch: The number of epochs.
lower: Stopping criterion (the calculation is terminated when the relative change of error falls below this value).
upper: Stopping criterion (the calculation is terminated when the relative change of error rises above this value).
initW: The CSV file storing the initial values of the eigenvectors.
initV: The CSV file storing the initial values of the loadings.
logdir: The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm: Whether the data matrix is shuffled at random.
cper: Count per X (e.g. CPM: Count per million <1e+6>).
Output Arguments
W: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
V: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
stop: Whether the calculation has converged
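For orientation, the general idea of a randomized block Krylov method can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code behind rbkiter: block_krylov_pca and its depth keyword are hypothetical names, the matrix is held in memory, and the stopping criteria (lower, upper) are omitted. How numepoch maps onto the Krylov depth in the package is not specified here.

```julia
using LinearAlgebra, Random

# Hypothetical helper (not part of OnlinePCA.jl): randomized block Krylov
# iteration on an in-memory matrix, without any stopping criteria.
function block_krylov_pca(A::AbstractMatrix; dim::Int=3, depth::Int=3,
                          rng::AbstractRNG=Random.default_rng())
    n = size(A, 2)
    block = Matrix(qr(A * randn(rng, n, dim)).Q)   # first block: basis of A*Π
    blocks = [block]
    for _ in 1:depth
        # Next block spans (A*A') times the previous block; orthonormalizing
        # keeps the same column space while avoiding overflow/underflow.
        block = Matrix(qr(A * (A' * block)).Q)
        push!(blocks, block)
    end
    Q = Matrix(qr(reduce(hcat, blocks)).Q)         # basis of the whole Krylov subspace
    F = svd(Q' * A)                                # same left vectors as eigen(Q'*A*A'*Q)
    return Q * F.U[:, 1:dim], F.S[1:dim], F.V[:, 1:dim]
end
```

Unlike plain power iteration, every intermediate block is kept, so the final projection searches a larger subspace for the leading components.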
Single-pass PCA type I
OnlinePCA.singlepass — Method
singlepass(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Single-pass PCA type I, a randomized SVD algorithm.
Input Arguments
input: Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir: The directory in which to save the results.
scale: {log,ftt,raw}-scaling of the values.
pseudocount: The number added to avoid taking log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim: The number of dimensions of PCA.
noversamples: The number of oversamples.
niter: The number of power iterations.
initW: The CSV file storing the initial values of the eigenvectors.
initV: The CSV file storing the initial values of the loadings.
logdir: The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm: Whether the data matrix is shuffled at random.
cper: Count per X (e.g. CPM: Count per million <1e+6>).
Output Arguments
V: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
U: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
Single-pass PCA type II
OnlinePCA.singlepass2 — Method
singlepass2(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)
Single-pass PCA type II, a randomized SVD algorithm.
Input Arguments
input: Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir: The directory in which to save the results.
scale: {log,ftt,raw}-scaling of the values.
pseudocount: The number added to avoid taking log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim: The number of dimensions of PCA.
noversamples: The number of oversamples.
niter: The number of power iterations.
initW: The CSV file storing the initial values of the eigenvectors.
initV: The CSV file storing the initial values of the loadings.
logdir: The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm: Whether the data matrix is shuffled at random.
cper: Count per X (e.g. CPM: Count per million <1e+6>).
Output Arguments
V: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
U: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
Summarization for 10X-HDF5
OnlinePCA.tenxsumr — Method
tenxsumr(; tenxfile::AbstractString="", outdir::AbstractString=".", group::AbstractString="", chunksize::Number=5000)
Extract the summary information of 10X-HDF5.
Input Arguments
tenxfile: The HDF5 file formatted by 10X Genomics.
outdir: The directory in which to save the results.
group: The group name of the HDF5 file (e.g. mm10).
chunksize: The number of rows read at once (e.g. 5000).
Output Files
Sample_NoCounts.csv: Sum of counts in each column.
Feature_Means.csv: Mean in each row.
Feature_LogMeans.csv: Log10(Mean+1) in each row.
Feature_SqrtMeans.csv: sqrt(Mean+1) in each row.
Feature_Vars.csv: Sample variance in each row.
Feature_LogVars.csv: Log10(Var+1) in each row.
Feature_SqrtVars.csv: sqrt(Var+1) in each row.
Feature_CV2s.csv: Coefficient of Variation in each row.
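The quantities in these output files can be reproduced on a small in-memory matrix to check one's understanding. The sketch below is illustrative only: summarize_counts is a hypothetical helper rather than part of the package, it ignores HDF5 reading and chunking, and it assumes that the CV2 value is the squared coefficient of variation (variance divided by squared mean), which the list above does not state explicitly.

```julia
using Statistics

# Hypothetical helper (not part of OnlinePCA.jl): per-column and per-row
# summaries of a features × samples count matrix held in memory.
function summarize_counts(X::AbstractMatrix{<:Real})
    colsums = vec(sum(X, dims=2 - 1))    # sum of counts in each column (dims=1)
    means   = vec(mean(X, dims=2))       # mean of each row
    vars    = vec(var(X, dims=2))        # sample variance of each row (n-1 denominator)
    return (
        Sample_NoCounts   = colsums,
        Feature_Means     = means,
        Feature_LogMeans  = log10.(means .+ 1),
        Feature_SqrtMeans = sqrt.(means .+ 1),
        Feature_Vars      = vars,
        Feature_LogVars   = log10.(vars .+ 1),
        Feature_SqrtVars  = sqrt.(vars .+ 1),
        Feature_CV2s      = vars ./ means .^ 2,   # assumed definition of CV2
    )
end

s = summarize_counts([1 2 3; 0 0 6])
println(s.Feature_Means)   # mean of each of the two rows
```

Rows whose mean is zero yield NaN or Inf in the CV2 column under this definition, so such features are usually filtered out first.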
ALGORITHM971 for 10X-HDF5
OnlinePCA.tenxpca — Method
tenxpca(; tenxfile::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="sqrt", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, chunksize::Number=5000, group::AbstractString, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1.0f0)
A randomized SVD.
Input Arguments
tenxfile: The HDF5 file formatted by 10X Genomics.
outdir: The directory in which to save the results.
scale: {sqrt,log,raw}-scaling of the values.
rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim: The number of dimensions of PCA.
noversamples: The number of oversamples.
niter: The number of power iterations.
chunksize: The number of rows read at once (e.g. 5000).
group: The group name of the 10X-HDF5 file (e.g. mm10).
initW: The CSV file storing the initial values of the eigenvectors.
initV: The CSV file storing the initial values of the loadings.
logdir: The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm: Whether the data matrix is shuffled at random.
cper: Count per X (e.g. CPM: Count per million <1e+6>).
Output Arguments
V: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
U: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
Sparse Randomized SVD (ALGORITHM971 for Binarized MM file)
OnlinePCA.sparse_rsvd — Method
sparse_rsvd(; input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, chunksize::Number=1, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1.0f0)
A randomized SVD.
Input Arguments
input: Julia Binary file generated by the OnlinePCA.mm2bin function.
outdir: The directory in which to save the results.
scale: {ftt,log,raw}-scaling of the values.
rowmeanlist: The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist: The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist: The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim: The number of dimensions of PCA.
noversamples: The number of oversamples.
niter: The number of power iterations.
chunksize: The number of rows read at once (e.g. 1).
initW: The CSV file storing the initial values of the eigenvectors.
initV: The CSV file storing the initial values of the loadings.
logdir: The directory where intermediate files are saved every evalfreq (e.g. 1) iterations.
perm: Whether the data matrix is shuffled at random.
cper: Count per X (e.g. CPM: Count per million <1e+6>).
Output Arguments
V: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
U: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
Exact Out-of-Core PCA
OnlinePCA.exact_ooc_pca — Method
exact_ooc_pca(; input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="raw", pseudocount::Number=1.0f0, dim::Number=3, chunksize::Number=1, mode::AbstractString="dense")
Exact Out-of-Core PCA, which is based on an ordinary full-rank SVD and does not assume a low-rank approximation.
Input Arguments
input: Julia Binary file generated by the OnlinePCA.csv2bin or OnlinePCA.mm2bin function.
outdir: The directory in which to save the results.
scale: {raw,log,ftt}-scaling of the values.
pseudocount: The number added to avoid taking log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
dim: The number of dimensions of PCA.
chunksize: The number of rows to be read at once.
mode: "dense", "sparse_mm", or "sparse_bincoo" can be specified.
Output Arguments
V: Eigen vectors of covariance matrix (No. columns of the data matrix × dim)
λ: Eigen values (dim × dim)
U: Loading vectors of covariance matrix (No. rows of the data matrix × dim)
Scores: Principal component scores
ExpVar: Explained variance by the eigenvectors
TotalVar: Total variance of the data matrix
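One common way to obtain an exact decomposition without holding the whole matrix in memory is to accumulate a small Gram matrix over row chunks and decompose it at the end. The sketch below illustrates that idea only. It is an assumption that this resembles what exact_ooc_pca does internally: chunked_exact_pca is a hypothetical helper, slicing an in-memory matrix stands in for reading chunksize rows from disk, each row is centered by its own mean, and the eigenvalues are left unnormalized.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not part of OnlinePCA.jl): exact PCA computed from a
# Gram matrix that is accumulated one chunk of rows at a time.
function chunked_exact_pca(X::AbstractMatrix; dim::Int=3, chunksize::Int=1)
    m, n = size(X)
    G = zeros(n, n)                                # n × n accumulator
    for start in 1:chunksize:m
        rows  = start:min(start + chunksize - 1, m)
        chunk = X[rows, :]                         # stands in for a read from disk
        C = chunk .- mean(chunk, dims=2)           # center each row by its own mean
        G .+= C' * C                               # add this chunk's contribution
    end
    E = eigen(Symmetric(G))                        # full decomposition, ascending order
    idx = sortperm(E.values, rev=true)[1:dim]      # keep the dim largest eigenvalues
    return E.vectors[:, idx], E.values[idx]
end
```

Because the accumulator is n × n, this approach is practical only when the number of columns is moderate; memory use is independent of the number of rows.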