OnlinePCA.jl (Julia API)

Binarization (CSV file)

OnlinePCA.csv2bin (Method)
csv2bin(;csvfile::AbstractString="", binfile::AbstractString="")

Convert a CSV file to a Julia Binary file.

Specify csvfile and binfile as, for example, Data.csv and Data.dat, respectively.

Binarization (Matrix Market <MM> file)

OnlinePCA.mm2bin (Method)

mm2bin(;mmfile::AbstractString="", binfile::AbstractString="")

Convert a Matrix Market (MM) file to a Julia Binary file.

Input Arguments

  • mmfile : Matrix Market file (e.g., Data.mtx).
  • binfile : Julia Binary file (e.g., Data.mtx.zst).

Binarization (Binary COO <BinCOO> file)

OnlinePCA.bincoo2bin (Method)

bincoo2bin(;bincoofile::AbstractString="", binfile::AbstractString="")

Convert a Binary COO (BinCOO) file to a Julia Binary file.

Input Arguments

  • bincoofile : Binary COO file (e.g., Data.bincoo).
  • binfile : Julia Binary file (e.g., Data.bincoo.zst).

Summarization

OnlinePCA.sumr (Method)
sumr(; binfile::AbstractString="", outdir::AbstractString=".", pseudocount::Number=1.0, mode::AbstractString="dense", chunksize::Int=1)

Extract summary information from the data matrix.

Input Arguments

  • binfile : A Julia Binary file generated by the csv2bin function.
  • outdir : The directory where the results are saved.
  • pseudocount : The value added to avoid taking log10(0); used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • mode : "dense" or "sparse_mm" can be specified.
  • chunksize : The number of rows to be read at once.

Output Files

  • Sample_NoCounts.csv : Sum of counts in each column.
  • Feature_Means.csv : Mean in each row.
  • Feature_LogMeans.csv : Log10(Mean+pseudocount) in each row.
  • Feature_FTTMeans.csv : FTT(Mean+pseudocount) in each row.
  • Feature_Vars.csv : Sample variance in each row.
  • Feature_LogVars.csv : Log10(Var+pseudocount) in each row.
  • Feature_FTTVars.csv : FTT(Var+pseudocount) in each row.
  • Feature_CV2s.csv : Squared coefficient of variation (CV2) in each row.
  • Feature_NoZeros.csv : Number of zero-elements in each row.
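
As a rough illustration of what these summaries contain, the following Julia sketch computes several of them for a small in-memory matrix. It does not call sumr or read a Julia Binary file; the variable names are made up for this example, and the CV2 formula (variance divided by the squared mean) is an assumption about how Feature_CV2s.csv is defined.

```julia
using Statistics

# Toy count matrix: rows are features (genes), columns are samples (cells).
X = [0 2 4;
     1 1 1;
     0 0 9]

pseudocount = 1.0

sample_nocounts  = vec(sum(X, dims=1))                    # sum of counts in each column
feature_means    = vec(mean(X, dims=2))                   # mean of each row
feature_logmeans = log10.(feature_means .+ pseudocount)   # log10(mean + pseudocount)
feature_vars     = vec(var(X, dims=2))                    # sample variance (n - 1 denominator)
feature_logvars  = log10.(feature_vars .+ pseudocount)    # log10(var + pseudocount)
feature_cv2s     = feature_vars ./ feature_means .^ 2     # assumed CV2 definition
feature_nozeros  = vec(sum(X .== 0, dims=2))              # number of zeros in each row
```

A feature whose mean is zero would give an undefined CV2 in this sketch; how the package treats that case is not stated in this section.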

Filtering

OnlinePCA.filtering (Method)
filtering(;input::AbstractString="", featurelist::AbstractString="", samplelist::AbstractString="", thr1::Number=0.0, thr2::Number=0.0, direct1::AbstractString="+", direct2::AbstractString="+", outdir::AbstractString=".")

Filter genes by criteria such as their mean or variance.

Input Arguments

  • input : A Julia Binary file generated by the csv2bin function.
  • featurelist : Row-wise summary data. The CSV files are generated by the sumr function.
  • thr1, thr2 : The thresholds used to reject low-signal features.
  • outdir : The directory where the results are saved.

Output Files

  • filtered.zst : Filtered binary file.

Identifying Highly Variable Genes

OnlinePCA.hvg (Method)
hvg(;binfile::AbstractString="", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", rowcv2list::AbstractString="", outdir::AbstractString=".")

This function identifies highly variable genes (HVGs), a feature selection method used in scRNA-seq studies.

Input Arguments

  • binfile : A Julia Binary file generated by the csv2bin function.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the sumr function.
  • rowcv2list : The CV2 of each row of the matrix. The CSV file is generated by the sumr function.
  • outdir : The directory where the results are saved.

Output Files

  • HVG_useForFit.csv : Parameters to estimate the highly variable genes.
  • HVG_a0.csv : Parameters to estimate the highly variable genes.
  • HVG_a1.csv : Parameters to estimate the highly variable genes.
  • HVG_afits.csv : Parameters to estimate the highly variable genes.
  • HVG_varFitRatio.csv : Parameters to estimate the highly variable genes.
  • HVG_df.csv : Parameters to estimate the highly variable genes.
  • HVG_pval.csv : Parameters to estimate the highly variable genes.
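
The file names HVG_a0.csv and HVG_a1.csv suggest a mean-CV2 trend of the form cv2 ≈ a0 + a1 / mean. The sketch below fits such a trend by ordinary least squares on synthetic values. This is only an assumption for illustration: the model form, the fitting procedure, and all variable names are hypothetical, and hvg itself may estimate these parameters differently.

```julia
using LinearAlgebra

# Hypothetical row-wise summaries (the kind of values sumr writes to CSV).
means = [1.0, 2.0, 4.0, 8.0]
cv2s  = 0.1 .+ 2.0 ./ means          # synthetic data that follows the assumed model exactly

# Design matrix for cv2 ≈ a0 + a1 / mean.
A = hcat(ones(length(means)), 1.0 ./ means)
a0, a1 = A \ cv2s                    # ordinary least squares (not necessarily what hvg does)

afit  = a0 .+ a1 ./ means            # fitted CV2 for each feature
ratio = cv2s ./ afit                 # observed CV2 relative to the fitted trend
```

Under this reading, features whose ratio lies well above 1 would be the candidates for highly variable genes.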

GD-PCA

OnlinePCA.gd (Method)
gd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Gradient descent method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numepoch : The number of epochs.
  • scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
  • g : The parameter used when scheduling is nag.
  • epsilon : The parameter used when scheduling is adagrad.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetFull : Offset value for avoiding overflow when calculating the full gradient.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every 1000 iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).
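
The colsumlist, cper, pseudocount, and scale arguments describe a preprocessing step. The sketch below shows one plausible reading of it on a toy matrix: counts-per-X normalization followed by a log or FTT transform. The order of operations, the FTT definition (taken here to be the Freeman-Tukey form sqrt(x) + sqrt(x + 1)), and every variable name are assumptions for illustration, not the package's implementation.

```julia
# Toy count matrix: rows are features, columns are samples.
X = Float64[0 2; 3 6; 1 0]

colsums     = vec(sum(X, dims=1))    # the kind of values colsumlist would hold
cper        = 1.0e6                  # counts per million
pseudocount = 1.0

# Counts-per-X normalization: divide each column by its total, then multiply by cper.
normalized = cper .* X ./ colsums'

# Two of the scaling options named in the argument list.
logscaled = log10.(normalized .+ pseudocount)

ftt(x) = sqrt(x) + sqrt(x + 1)       # assumed Freeman-Tukey definition
fttscaled = ftt.(normalized)
```

With scale set to raw, the normalized values would presumably be used without a further transform.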

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
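
To make the idea of gradient-based PCA concrete, here is a minimal full-gradient sketch on an in-memory matrix. It is not the gd function: it assumes a centered matrix X laid out as variables × samples, uses a constant step size, and re-orthonormalizes with QR after each step. The function name gd_pca_sketch and its keyword names are hypothetical.

```julia
using LinearAlgebra, Random, Statistics

# Full-gradient ascent on tr(W' C W) with C = X X' / n, keeping W orthonormal.
function gd_pca_sketch(X::AbstractMatrix; dim::Int=2, stepsize::Real=0.1,
                       numepoch::Int=200, rng=MersenneTwister(1))
    nvar, nsamp = size(X)
    W = Matrix(qr(randn(rng, nvar, dim)).Q)       # random orthonormal start
    for _ in 1:numepoch
        G = X * (X' * W) ./ nsamp                 # C * W, computed without forming C
        W = Matrix(qr(W .+ stepsize .* G).Q)      # ascent step, then re-orthonormalize
    end
    return W
end

# Synthetic data with two dominant directions.
rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
X = X .- mean(X, dims=2)                          # center each variable
W = gd_pca_sketch(X; dim=2)
```

Each epoch touches the whole matrix once, which is what distinguishes this scheme from the stochastic variants documented next.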

SGD-PCA

OnlinePCA.sgd (Method)
sgd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Stochastic gradient descent method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numbatch : The batch size.
  • numepoch : The number of epochs.
  • scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
  • g : The parameter used when scheduling is nag.
  • epsilon : The parameter used when scheduling is adagrad.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetFull : Offset value for avoiding overflow when calculating the full gradient.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
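
The roles of numbatch, numepoch, and a decaying step size can be pictured with the mini-batch sketch below. It is an illustration under stated assumptions rather than the sgd function: X is a centered variables × samples matrix, the Robbins-Monro schedule is taken to be stepsize / t, and the name sgd_pca_sketch is hypothetical.

```julia
using LinearAlgebra, Random, Statistics

function sgd_pca_sketch(X::AbstractMatrix; dim::Int=2, stepsize::Real=0.1,
                        numbatch::Int=10, numepoch::Int=3, rng=MersenneTwister(1))
    nvar, nsamp = size(X)
    W = Matrix(qr(randn(rng, nvar, dim)).Q)       # random orthonormal start
    t = 0
    for _ in 1:numepoch
        order = randperm(rng, nsamp)              # visit samples in random order
        for start in 1:numbatch:nsamp
            t += 1
            cols  = order[start:min(start + numbatch - 1, nsamp)]
            batch = X[:, cols]
            G = batch * (batch' * W) ./ length(cols)   # gradient estimated from one mini-batch
            η = stepsize / t                           # assumed Robbins-Monro style decay
            W = Matrix(qr(W .+ η .* G).Q)              # step, then re-orthonormalize
        end
    end
    return W
end

rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
X = X .- mean(X, dims=2)
W = sgd_pca_sketch(X; dim=2, numbatch=10, numepoch=3)
```

Because the step size shrinks as 1 / t, the early batches move W the most; the other scheduling options listed above change only how that step is chosen.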

Oja's method

OnlinePCA.oja (Method)
oja(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Oja's method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numepoch : The number of epochs.
  • scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
  • g : The parameter used when scheduling is nag.
  • epsilon : The parameter used when scheduling is adagrad.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetFull : Offset value for avoiding overflow when calculating the full gradient.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
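
For orientation, the textbook form of Oja's rule processes one sample at a time: W is nudged by x (x' W) and then re-orthonormalized. The sketch below implements that form with a constant step size on a centered variables × samples matrix. The name oja_sketch and the data layout are assumptions; the oja function may differ in scheduling, normalization, and I/O.

```julia
using LinearAlgebra, Random, Statistics

function oja_sketch(X::AbstractMatrix; dim::Int=2, stepsize::Real=0.01,
                    numepoch::Int=5, rng=MersenneTwister(1))
    nvar, nsamp = size(X)
    W = Matrix(qr(randn(rng, nvar, dim)).Q)       # random orthonormal start
    for _ in 1:numepoch
        for j in 1:nsamp
            x = X[:, j]                           # one sample
            W = W .+ stepsize .* (x * (x' * W))   # Hebbian update x (x' W)
            W = Matrix(qr(W).Q)                   # keep the columns orthonormal
        end
    end
    return W
end

rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
X = X .- mean(X, dims=2)
W = oja_sketch(X; dim=2)
```

Only one column of X is needed per update, which is why this family of methods suits data that is streamed from disk.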

CCIPCA

OnlinePCA.ccipca (Method)
ccipca(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0, numepoch::Number=3, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Candid covariance-free incremental PCA.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numepoch : The number of epochs.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
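
The commonly published CCIPCA recursion keeps one unnormalized vector per component, updates it with a running average of x (x' v) / ‖v‖, and deflates the sample before moving to the next component. The sketch below follows that recursion for a centered variables × samples matrix. The name ccipca_sketch and the amnesic keyword are hypothetical, and no claim is made that ccipca maps its stepsize argument to this parameter.

```julia
using LinearAlgebra, Random, Statistics

function ccipca_sketch(X::AbstractMatrix; dim::Int=2, amnesic::Real=0.0)
    nvar, nsamp = size(X)
    V = zeros(nvar, dim)                  # unnormalized eigenvector estimates
    for n in 1:nsamp
        u = X[:, n]                       # residual of the current sample
        for i in 1:min(dim, n)
            if i == n
                V[:, i] = u               # initialize component i with the residual
            else
                v = V[:, i]
                w_old = (n - 1 - amnesic) / n
                w_new = (1 + amnesic) / n
                V[:, i] = w_old .* v .+ w_new .* u .* (dot(u, v) / norm(v))
            end
            vhat = V[:, i] ./ norm(V[:, i])
            u = u .- dot(u, vhat) .* vhat # deflate before the next component
        end
    end
    λ = [norm(V[:, i]) for i in 1:dim]    # vector norms estimate the eigenvalues
    W = V ./ λ'                           # unit-length eigenvector estimates
    return W, λ
end

rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
X = X .- mean(X, dims=2)
W, λ = ccipca_sketch(X; dim=2)
```

The sketch needs at least dim samples, since component i is only initialized when the i-th sample arrives.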

RSGD-PCA

OnlinePCA.rsgd (Method)
rsgd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Riemannian stochastic gradient descent method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numbatch : The batch size.
  • numepoch : The number of epochs.
  • scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
  • g : The parameter used when scheduling is nag.
  • epsilon : The parameter used when scheduling is adagrad.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetFull : Offset value for avoiding overflow when calculating the full gradient.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
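
What makes a stochastic gradient step "Riemannian" can be shown in a few lines: the mini-batch gradient is projected onto the tangent space of the set of orthonormal matrices at W, and the result is mapped back with a retraction. The sketch below uses the projection G − W sym(W' G) and a QR-based retraction with a constant step size. These are common textbook choices taken as assumptions here, and rsgd_pca_sketch is a hypothetical name, not the package function.

```julia
using LinearAlgebra, Random, Statistics

function rsgd_pca_sketch(X::AbstractMatrix; dim::Int=2, stepsize::Real=0.05,
                         numbatch::Int=10, numepoch::Int=3, rng=MersenneTwister(1))
    nvar, nsamp = size(X)
    W = Matrix(qr(randn(rng, nvar, dim)).Q)       # random orthonormal start
    for _ in 1:numepoch
        order = randperm(rng, nsamp)
        for start in 1:numbatch:nsamp
            cols  = order[start:min(start + numbatch - 1, nsamp)]
            batch = X[:, cols]
            G = batch * (batch' * W) ./ length(cols)   # Euclidean mini-batch gradient
            S = W' * G
            R = G .- W * ((S .+ S') ./ 2)              # projection onto the tangent space at W
            W = Matrix(qr(W .+ stepsize .* R).Q)       # QR-based retraction (column signs not fixed)
        end
    end
    return W
end

rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
X = X .- mean(X, dims=2)
W = rsgd_pca_sketch(X; dim=2)
```

The projection removes the part of the gradient that would only rescale or rotate W within its own column space, so the step is spent on changing the subspace itself.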

SVRG-PCA

OnlinePCA.svrg (Method)
svrg(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Variance-reduced stochastic gradient descent method, also known as VR-PCA.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numbatch : The batch size.
  • numepoch : The number of epochs.
  • scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
  • g : The parameter used when scheduling is nag.
  • epsilon : The parameter used when scheduling is adagrad.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetFull : Offset value for avoiding overflow when calculating the full gradient.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
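
The variance-reduction idea behind VR-PCA is easiest to see for a single eigenvector: once per epoch a full gradient is computed at a snapshot, and each stochastic step corrects the one-sample gradient with it. The sketch below follows that scheme for the leading eigenvector of a centered variables × samples matrix. It is a simplified illustration (dim = 1, one sample per step, constant step size) with the hypothetical name vr_pca_sketch, not the svrg function.

```julia
using LinearAlgebra, Random, Statistics

function vr_pca_sketch(X::AbstractMatrix; stepsize::Real=0.01, numepoch::Int=10,
                       rng=MersenneTwister(1))
    nvar, nsamp = size(X)
    w_snap = normalize(randn(rng, nvar))              # snapshot vector
    for _ in 1:numepoch
        full = X * (X' * w_snap) ./ nsamp             # full gradient at the snapshot
        w = copy(w_snap)
        for _ in 1:nsamp
            x = X[:, rand(rng, 1:nsamp)]              # one randomly chosen sample
            # One-sample gradient, corrected by the snapshot's full gradient.
            g = x .* (dot(x, w) - dot(x, w_snap)) .+ full
            w = normalize(w .+ stepsize .* g)
        end
        w_snap = w                                    # new snapshot for the next epoch
    end
    return w_snap
end

rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
X = X .- mean(X, dims=2)
w = vr_pca_sketch(X)
```

As w approaches the snapshot, the sample-dependent term shrinks, so the noise of the stochastic steps fades instead of having to be damped by a decaying step size.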

RSVRG-PCA

OnlinePCA.rsvrg (Method)
rsvrg(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Riemannian variance-reduced stochastic gradient descent method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • stepsize : The step size used in every iteration.
  • numbatch : The batch size.
  • numepoch : The number of epochs.
  • scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
  • g : The parameter used when scheduling is nag.
  • epsilon : The parameter used when scheduling is adagrad.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • evalfreq : Evaluation frequency of the reconstruction error.
  • offsetFull : Offset value for avoiding overflow when calculating the full gradient.
  • offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged

Orthogonal Iteration (Power method)

OnlinePCA.orthiter (Method)
orthiter(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Orthogonal iteration, also known as the block power method, subspace iteration, or simultaneous iteration.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • numepoch : The number of epochs.
  • lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
  • initW : The CSV file storing the initial values of the eigenvectors.
  • initV : The CSV file storing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
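
Orthogonal iteration itself is compact enough to state in code: multiply the current basis by the covariance matrix and re-orthonormalize with QR, repeating for a fixed number of passes. The sketch below does this for an explicit, small covariance matrix. The name orthiter_sketch is hypothetical, and the orthiter function works from a Julia Binary file rather than an in-memory matrix.

```julia
using LinearAlgebra, Random

function orthiter_sketch(C::AbstractMatrix; dim::Int=2, numepoch::Int=10,
                         rng=MersenneTwister(1))
    W = Matrix(qr(randn(rng, size(C, 1), dim)).Q)   # random orthonormal start
    for _ in 1:numepoch
        W = Matrix(qr(C * W).Q)                     # multiply, then re-orthonormalize
    end
    λ = diag(W' * C * W)                            # Rayleigh quotients as eigenvalue estimates
    return W, λ
end

rng = MersenneTwister(42)
X = [3.0, 2.0, 0.5, 0.1] .* randn(rng, 4, 500)
C = X * X' ./ size(X, 2)
W, λ = orthiter_sketch(C; dim=2, numepoch=50)
```

The convergence speed depends on the ratios between neighboring eigenvalues, so well-separated leading components are found in a handful of passes.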

Arnoldi method

OnlinePCA.arnoldi (Method)
arnoldi(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, perm::Bool=false, cper::Number=1f0)

Arnoldi method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged

Lanczos method

OnlinePCA.lanczos (Method)
lanczos(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, perm::Bool=false, cper::Number=1f0)

Lanczos method.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory where the results are saved.
  • scale : The scaling applied to the values: log, ftt, or raw.
  • pseudocount : The value added to avoid taking log10(0); also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
  • rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of the PCA.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

  • W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged

Halko's method

OnlinePCA.halko (Method)
halko(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Halko's method, a randomized SVD algorithm.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("log", "ftt", or "raw").
  • pseudocount : The pseudocount added to the values to avoid log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • noversamples : The number of oversampling dimensions.
  • niter : The number of power iterations.
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source
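
As a rough illustration of how dim, noversamples, and niter interact in a randomized SVD, here is a minimal in-memory sketch using only the Julia standard library. The function name randomized_svd is hypothetical, and this is not the out-of-core implementation used by OnlinePCA.halko, only the textbook scheme of random projection, power iterations, and a small SVD.

```julia
using LinearAlgebra, Random

# Illustrative randomized SVD (hypothetical name, not the OnlinePCA.jl code).
function randomized_svd(A::AbstractMatrix{<:Real}; dim::Int=3,
                        noversamples::Int=5, niter::Int=3,
                        rng::AbstractRNG=Random.default_rng())
    m, n = size(A)
    l = min(dim + noversamples, m, n)   # sketch width: dim plus oversampling
    k = min(dim, l)
    Ω = randn(rng, n, l)                # random test matrix
    Q = Matrix(qr(A * Ω).Q)             # orthonormal basis of the sketch
    for _ in 1:niter                    # power iterations sharpen the basis
        Z = Matrix(qr(A' * Q).Q)
        Q = Matrix(qr(A * Z).Q)
    end
    F = svd(Q' * A)                     # small l × n problem
    return Q * F.U[:, 1:k], F.S[1:k], F.V[:, 1:k]
end

A = randn(MersenneTwister(1), 50, 3) * randn(MersenneTwister(2), 3, 20)
U, S, V = randomized_svd(A; dim=3, rng=MersenneTwister(3))
```

Larger noversamples and niter values generally improve accuracy at the cost of extra passes over the data.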

Algorithm 971

OnlinePCA.algorithm971Method
algorithm971(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Algorithm 971, one of the randomized SVD algorithms.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("log", "ftt", or "raw").
  • pseudocount : The pseudocount added to the values to avoid log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • noversamples : The number of oversampling dimensions.
  • niter : The number of power iterations.
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source

Randomized Block Krylov Iteration

OnlinePCA.rbkiterMethod
rbkiter(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Randomized Block Krylov Iteration.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("log", "ftt", or "raw").
  • pseudocount : The pseudocount added to the values to avoid log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • numepoch : The number of epochs.
  • lower : Stopping criterion (the calculation is terminated when the relative change of error falls below this value).
  • upper : Stopping criterion (the calculation is terminated when the relative change of error rises above this value).
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • W : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • V : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
  • stop : Whether the calculation has converged
source
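
The idea behind randomized block Krylov iteration can be sketched in memory as follows. This is a simplified illustration under stated assumptions, not the OnlinePCA.rbkiter implementation: the name block_krylov_svd is hypothetical, numepoch is assumed to correspond to the number of Krylov blocks added after the initial one, and the lower/upper stopping criteria are omitted.

```julia
using LinearAlgebra, Random

# Illustrative block Krylov sketch (hypothetical name, not the package code).
function block_krylov_svd(A::AbstractMatrix{<:Real}; dim::Int=3,
                          numepoch::Int=3,
                          rng::AbstractRNG=Random.default_rng())
    m, n = size(A)
    k = min(dim, m, n)
    block = Matrix(qr(A * randn(rng, n, k)).Q)   # first block: orth(A * Ω)
    blocks = [block]
    for _ in 1:numepoch
        # Next block spans (A * A') times the previous one; re-orthonormalizing
        # keeps the same span while avoiding numerical overflow.
        block = Matrix(qr(A * (A' * block)).Q)
        push!(blocks, block)
    end
    K = reduce(hcat, blocks)                     # Krylov subspace basis
    Q = Matrix(qr(K).Q)                          # orthonormalize all blocks
    F = svd(Q' * A)                              # project and solve small SVD
    return Q * F.U[:, 1:k], F.S[1:k], F.V[:, 1:k]
end

B = randn(MersenneTwister(4), 40, 3) * randn(MersenneTwister(5), 3, 15)
Ub, Sb, Vb = block_krylov_svd(B; dim=3, rng=MersenneTwister(6))
```

Unlike plain power iteration, every intermediate block is kept, so the final projection uses the whole Krylov subspace.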

Single-pass PCA type I

OnlinePCA.singlepassMethod
singlepass(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Single-pass PCA type I, one of the randomized SVD algorithms.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("log", "ftt", or "raw").
  • pseudocount : The pseudocount added to the values to avoid log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • noversamples : The number of oversampling dimensions.
  • niter : The number of power iterations.
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source

Single-pass PCA type II

OnlinePCA.singlepass2Method
singlepass2(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Single-pass PCA type II, one of the randomized SVD algorithms.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("log", "ftt", or "raw").
  • pseudocount : The pseudocount added to the values to avoid log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • noversamples : The number of oversampling dimensions.
  • niter : The number of power iterations.
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source

Summarization for 10X-HDF5

OnlinePCA.tenxsumrMethod
tenxsumr(; tenxfile::AbstractString="", outdir::AbstractString=".", group::AbstractString="", chunksize::Number=5000)

Extract the summary information of a 10X-HDF5 file.

Input Arguments

  • tenxfile is the HDF5 file formatted by 10X Genomics.
  • outdir is the directory in which to save the results.
  • group is the group name of the HDF5 file (e.g. mm10).
  • chunksize is the number of rows read at once (e.g. 5000).

Output Files

  • Sample_NoCounts.csv : Sum of counts in each column.
  • Feature_Means.csv : Mean in each row.
  • Feature_LogMeans.csv : Log10(Mean+1) in each row.
  • Feature_SqrtMeans.csv : sqrt(Mean+1) in each row.
  • Feature_Vars.csv : Sample variance in each row.
  • Feature_LogVars.csv : Log10(Var+1) in each row.
  • Feature_SqrtVars.csv : sqrt(Var+1) in each row.
  • Feature_CV2s.csv : Coefficient of Variation in each row.
source
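
To make the output files above concrete, the following sketch computes the same kinds of statistics for a small in-memory matrix, reading chunksize rows at a time. It is only an illustration: the function name summarize_counts and the field names are hypothetical, no HDF5 file is read, and CV2 is assumed to mean the squared coefficient of variation (variance divided by squared mean).

```julia
using Statistics

# Hypothetical helper, not part of OnlinePCA.jl.
function summarize_counts(X::AbstractMatrix{<:Real}; chunksize::Int=5000)
    nrow, ncol = size(X)
    colsums = zeros(Float64, ncol)
    means = zeros(Float64, nrow)
    vars = zeros(Float64, nrow)
    for start in 1:chunksize:nrow
        rows = start:min(start + chunksize - 1, nrow)
        chunk = X[rows, :]                     # stands in for one read from disk
        colsums .+= vec(sum(chunk; dims=1))    # accumulate per-column sums
        means[rows] = vec(mean(chunk; dims=2)) # per-row mean
        vars[rows] = vec(var(chunk; dims=2))   # per-row sample variance
    end
    return (colsums=colsums, means=means, vars=vars,
            logmeans=log10.(means .+ 1), sqrtmeans=sqrt.(means .+ 1),
            logvars=log10.(vars .+ 1), sqrtvars=sqrt.(vars .+ 1),
            cv2=vars ./ means .^ 2)            # assumed definition of CV2
end

stats = summarize_counts([1 2 3; 4 4 4]; chunksize=1)
```

Because every chunk contains complete rows, the per-row statistics need no merging across chunks; only the column sums are accumulated.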

ALGORITHM971 for 10X-HDF5

OnlinePCA.tenxpcaMethod
tenxpca(; tenxfile::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="sqrt", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, chunksize::Number=5000, group::AbstractString, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1.0f0)

A randomized SVD.

Input Arguments

  • tenxfile : The HDF5 file formatted by 10X Genomics.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("sqrt", "log", or "raw").
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • noversamples : The number of oversampling dimensions.
  • niter : The number of power iterations.
  • chunksize : The number of rows read at once (e.g. 5000).
  • group : The group name of the 10X-HDF5 file (e.g. mm10).
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source

Sparse Randomized SVD (ALGORITHM971 for Binarized MM file)

OnlinePCA.sparse_rsvdMethod
sparse_rsvd(; input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, chunksize::Number=1, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1.0f0)

A randomized SVD.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.mm2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("ftt", "log", or "raw").
  • rowmeanlist : CSV file containing the mean of each row of the matrix, generated by the OnlinePCA.sumr function.
  • rowvarlist : CSV file containing the variance of each row of the matrix, generated by the OnlinePCA.sumr function.
  • colsumlist : CSV file containing the sum of counts of each column of the matrix, generated by the OnlinePCA.sumr function.
  • dim : The number of dimensions of PCA.
  • noversamples : The number of oversampling dimensions.
  • niter : The number of power iterations.
  • chunksize : The number of rows read at once (e.g. 1).
  • initW : CSV file containing the initial values of the eigenvectors.
  • initV : CSV file containing the initial values of the loadings.
  • logdir : The directory where intermediate files are saved every evalfreq (e.g. 1) iterations.
  • perm : Whether the data matrix is shuffled at random.
  • cper : Count per X (e.g. 1e+6 for CPM, counts per million).

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source

Exact Out-of-Core PCA

OnlinePCA.exact_ooc_pcaMethod
exact_ooc_pca(; input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="raw", pseudocount::Number=1.0f0, dim::Number=3, chunksize::Number=1, mode::AbstractString="dense")

Exact Out-of-Core PCA, which is based on an ordinary full-rank SVD and does not assume a low-rank approximation.

Input Arguments

  • input : Julia Binary file generated by the OnlinePCA.csv2bin or OnlinePCA.mm2bin function.
  • outdir : The directory in which to save the results.
  • scale : The scaling applied to the values ("raw", "log", or "ftt").
  • pseudocount : The pseudocount added to the values to avoid log10(0); it is also used when Feature_LogMeans.csv (the log10(mean+pseudocount) value of each feature) is generated.
  • dim : The number of dimensions of PCA.
  • chunksize : The number of rows to be read at once.
  • mode : "dense", "sparse_mm", or "sparse_bincoo" can be specified.

Output Arguments

  • V : Eigenvectors of the covariance matrix (number of columns of the data matrix × dim)
  • λ : Eigenvalues (dim × dim)
  • U : Loading vectors of the covariance matrix (number of rows of the data matrix × dim)
  • Scores : Principal component scores
  • ExpVar : Variance explained by the eigenvectors
  • TotalVar : Total variance of the data matrix
source
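
One way an exact PCA can be computed while reading only chunksize rows at a time is to accumulate a Gram matrix over the chunks and decompose it once at the end. The sketch below illustrates that idea on an in-memory matrix. It is an assumption about the general technique, not a description of how OnlinePCA.exact_ooc_pca is implemented; the name chunked_exact_pca and the returned field names are hypothetical, and λ here holds the eigenvalues of the Gram matrix (squared singular values of the row-centered data).

```julia
using LinearAlgebra, Statistics

# Hypothetical sketch, not the OnlinePCA.jl implementation.
function chunked_exact_pca(X::AbstractMatrix{<:Real}; dim::Int=3,
                           chunksize::Int=1)
    m, n = size(X)
    k = min(dim, n)
    G = zeros(n, n)                          # Gram matrix of row-centered data
    for start in 1:chunksize:m               # first pass over the rows
        rows = start:min(start + chunksize - 1, m)
        chunk = X[rows, :]                   # stands in for one read from disk
        chunk = chunk .- mean(chunk; dims=2) # center each row
        G .+= chunk' * chunk
    end
    E = eigen(Symmetric(G))                  # eigenvalues in ascending order
    order = sortperm(E.values; rev=true)[1:k]
    λ = E.values[order]
    V = E.vectors[:, order]                  # n × k
    U = zeros(m, k)
    for start in 1:chunksize:m               # second pass builds U chunk-wise
        rows = start:min(start + chunksize - 1, m)
        chunk = X[rows, :]
        chunk = chunk .- mean(chunk; dims=2)
        U[rows, :] = (chunk * V) ./ sqrt.(λ)'
    end
    totalvar = tr(G)
    return (V=V, λ=λ, U=U, expvar=λ ./ totalvar, totalvar=totalvar)
end

X = [sin(i * j) for i in 1:30, j in 1:10]
res = chunked_exact_pca(X; dim=3, chunksize=4)
```

No random projection is involved, so the result matches a full SVD of the centered matrix up to floating-point error, at the cost of holding a matrix of size (number of columns)² in memory.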