OnlinePCA.jl (Julia API)

Binarization (CSV file)

OnlinePCA.csv2bin — Method

csv2bin(;csvfile::AbstractString="", binfile::AbstractString="")

Convert a CSV file to a Julia Binary file.

Specify csvfile and binfile as file paths, for example Data.csv and Data.dat, respectively.

Binarization (Matrix Market <MM> file)

OnlinePCA.mm2bin — Method

mm2bin(;mmfile::AbstractString="", binfile::AbstractString="")

Convert a Matrix Market (MM) file to a Julia Binary file.

Input Arguments

mmfile : Matrix Market file (e.g., Data.mtx).
binfile : Julia Binary file (e.g., Data.mtx.zst).
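
For orientation, a Matrix Market coordinate file stores a sparse matrix as a header line, optional comment lines starting with %, a size line (rows, columns, number of nonzeros), and one "row column value" triplet per nonzero entry with 1-based indices. The sketch below parses a tiny in-memory example with Julia's SparseArrays standard library. The names mm_text and parse_mm are hypothetical and used only for illustration; this is not how mm2bin is implemented.

```julia
using SparseArrays

# A tiny Matrix Market file held in a string (3 rows, 4 columns, 3 nonzeros).
mm_text = """
%%MatrixMarket matrix coordinate integer general
% rows columns nonzeros
3 4 3
1 1 5
2 3 7
3 4 2
"""

# Hypothetical helper: parse coordinate-format text into a sparse matrix.
function parse_mm(text)
    # Drop empty lines and the header/comment lines that start with '%'.
    lines = filter(l -> !isempty(l) && !startswith(l, "%"), split(text, '\n'))
    nrow, ncol, nnonzero = parse.(Int, split(lines[1]))
    rows = Int[]
    cols = Int[]
    vals = Float64[]
    for line in lines[2:end]
        i, j, v = split(line)
        push!(rows, parse(Int, i))
        push!(cols, parse(Int, j))
        push!(vals, parse(Float64, v))
    end
    length(vals) == nnonzero || error("size line does not match the number of entries")
    return sparse(rows, cols, vals, nrow, ncol)
end

A = parse_mm(mm_text)
```

Only the nonzero triplets are stored, which is why this format suits sparse count matrices.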

Binarization (Binary COO <BinCOO> file)

OnlinePCA.bincoo2bin — Method

bincoo2bin(;bincoofile::AbstractString="", binfile::AbstractString="")

Convert a Binary COO (BinCOO) file to a Julia Binary file.

Input Arguments

bincoofile : Binary COO file (e.g., Data.bincoo).
binfile : Julia Binary file (e.g., Data.bincoo.zst).

Summarization

OnlinePCA.sumr — Method

sumr(; binfile::AbstractString="", outdir::AbstractString=".", pseudocount::Number=1.0, mode::AbstractString="dense", chunksize::Int=1)

Extract summary statistics of the data matrix.

Input Arguments

binfile : A Julia Binary file generated by the csv2bin function.
outdir : The directory in which to save the results.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
mode : Either "dense" or "sparse_mm".
chunksize : The number of rows to be read at once.

Output Files

Sample_NoCounts.csv : Sum of counts in each column.
Feature_Means.csv : Mean in each row.
Feature_LogMeans.csv : Log10(Mean+pseudocount) in each row.
Feature_FTTMeans.csv : FTT(Mean+pseudocount) in each row.
Feature_Vars.csv : Sample variance in each row.
Feature_LogVars.csv : Log10(Var+pseudocount) in each row.
Feature_FTTVars.csv : FTT(Var+pseudocount) in each row.
Feature_CV2s.csv : Squared coefficient of variation (CV2) in each row.
Feature_NoZeros.csv : Number of zero-elements in each row.
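
The quantities listed above can be illustrated on a toy matrix with the Statistics standard library. This is a minimal sketch, not the code of sumr: the Freeman-Tukey form sqrt(x) + sqrt(x + 1) for FTT and the definition of CV2 as variance divided by squared mean are assumptions, and all variable names are hypothetical.

```julia
using Statistics

# Toy count matrix: rows are features, columns are samples.
X = Float64[1 0 3;
            0 0 0;
            2 4 6]
pseudocount = 1.0

# Assumed form of the Freeman-Tukey transformation (FTT).
ftt(x) = sqrt(x) + sqrt(x + 1)

sample_nocounts  = vec(sum(X, dims=1))                  # sum of counts in each column
feature_means    = vec(mean(X, dims=2))                 # mean in each row
feature_logmeans = log10.(feature_means .+ pseudocount)
feature_fttmeans = ftt.(feature_means .+ pseudocount)
feature_vars     = vec(var(X, dims=2))                  # sample variance (divides by n - 1)
feature_logvars  = log10.(feature_vars .+ pseudocount)
feature_fttvars  = ftt.(feature_vars .+ pseudocount)
feature_cv2s     = feature_vars ./ feature_means .^ 2   # NaN for an all-zero row
feature_nozeros  = vec(sum(X .== 0, dims=2))            # number of zero elements in each row
```

The all-zero second row shows why a pseudocount is added before taking the logarithm: its mean is 0, and log10(0 + 1) is 0 rather than an undefined value.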

Filtering

OnlinePCA.filtering — Method

filtering(;input::AbstractString="", featurelist::AbstractString="", samplelist::AbstractString="", thr1::Number=0.0, thr2::Number=0.0, direct1::AbstractString="+", direct2::AbstractString="+", outdir::AbstractString=".")

Filter genes by criteria such as their mean or variance.

Input Arguments

input : A Julia Binary file generated by the csv2bin function.
featurelist : A CSV file of row-wise summary data, such as the per-feature CSV files written by the sumr function.
thr : The threshold used to reject low-signal features.
outdir : The directory in which to save the results.

Output Files

filtered.zst : Filtered binary file.

Identifying Highly Variable Genes

OnlinePCA.hvg — Method

hvg(;binfile::AbstractString="", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", rowcv2list::AbstractString="", outdir::AbstractString=".")

This function identifies highly variable genes (HVGs), a feature selection method used in scRNA-seq studies.

Input Arguments

binfile : A Julia Binary file generated by the csv2bin function.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the sumr function.
rowcv2list : The CV2 of each row of the matrix. The CSV file is generated by the sumr function.
outdir : The directory in which to save the results.

Output Files

HVG_useForFit.csv : Parameters to estimate the highly variable genes.
HVG_a0.csv : Parameters to estimate the highly variable genes.
HVG_a1.csv : Parameters to estimate the highly variable genes.
HVG_afits.csv : Parameters to estimate the highly variable genes.
HVG_varFitRatio.csv : Parameters to estimate the highly variable genes.
HVG_df.csv : Parameters to estimate the highly variable genes.
HVG_pval.csv : Parameters to estimate the highly variable genes.

Reference

Highly Variable Genes

GD-PCA

OnlinePCA.gd — Method

gd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Gradient descent method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numepoch : The number of epochs.
scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
g : The parameter used when scheduling is set to nag.
epsilon : The parameter used when scheduling is set to adagrad.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetFull : Offset value for avoiding overflow when calculating the full gradient.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every 1000 iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).
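
To make the roles of scale, pseudocount, colsumlist, and cper concrete, here is a minimal sketch of one plausible preprocessing step. It is an illustration under assumptions, not the package's code: the order of operations (normalize each column by its sum first, then transform), the Freeman-Tukey form of ftt, and the name scale_counts are all hypothetical.

```julia
# Assumed form of the Freeman-Tukey transformation (FTT).
ftt(x) = sqrt(x) + sqrt(x + 1)

# Hypothetical helper: normalize counts per `cper` within each column, then scale.
function scale_counts(X, colsums; scale="ftt", pseudocount=1.0, cper=1.0e6)
    normalized = X ./ colsums' .* cper          # e.g. cper = 1e6 gives counts per million
    if scale == "log"
        return log10.(normalized .+ pseudocount)
    elseif scale == "ftt"
        return ftt.(normalized)
    elseif scale == "raw"
        return normalized
    else
        throw(ArgumentError("scale must be \"log\", \"ftt\", or \"raw\""))
    end
end

X = [1 2; 3 6]
colsums = vec(sum(X, dims=1))                   # what a column-sum list would hold
raw_scaled = scale_counts(X, colsums; scale="raw", cper=100)
log_scaled = scale_counts(X, colsums; scale="log", pseudocount=1.0, cper=100)
```

In this toy case both columns have the same relative composition, so they become identical after normalization even though their raw totals differ.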

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
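
As a rough illustration of gradient-descent PCA, the sketch below runs full-gradient ascent on tr(W'CW) for a small covariance matrix C and re-orthonormalizes W after each step. It is a minimal sketch under assumptions, not the implementation of gd: the function name gd_pca_sketch, the stepsize / t decay used for the Robbins-Monro schedule, and the QR re-orthonormalization are illustrative choices.

```julia
using LinearAlgebra, Random

# Orthonormalize the columns of W with a thin QR factorization.
orthonormalize(W) = Matrix(qr(W).Q)

function gd_pca_sketch(C::AbstractMatrix; dim=2, stepsize=0.1, numepoch=100,
                       rng=MersenneTwister(1))
    W = orthonormalize(randn(rng, size(C, 1), dim))
    for t in 1:numepoch
        eta = stepsize / t                       # assumed Robbins-Monro style decay
        W = orthonormalize(W + eta * (C * W))    # full-gradient step, then orthonormalize
    end
    return W
end

# Toy covariance matrix with known eigenvalues 100, 25, 1, 0.25, 0.01.
C = Matrix(Diagonal([100.0, 25.0, 1.0, 0.25, 0.01]))
W = gd_pca_sketch(C; dim=2)
expvar = tr(W' * C * W) / tr(C)                  # fraction of variance explained by W
```

With dim=2 the best achievable value of tr(W'CW) is the sum of the two largest eigenvalues, here 125, so expvar approaches 125 / 126.26.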

SGD-PCA

OnlinePCA.sgd — Method

sgd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Stochastic gradient descent method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numbatch : The batch size.
numepoch : The number of epochs.
scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
g : The parameter used when scheduling is set to nag.
epsilon : The parameter used when scheduling is set to adagrad.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetFull : Offset value for avoiding overflow when calculating the full gradient.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
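
The scheduling options correspond to well-known learning-rate rules. The sketch below shows textbook forms of three of them on a toy gradient; the exact formulas used by the package may differ, and the function names are hypothetical. nag (Nesterov's accelerated gradient) is usually described as the momentum rule with the gradient evaluated at a look-ahead point, and is omitted here because it needs a gradient function rather than a gradient value.

```julia
# Robbins-Monro style decay: the step shrinks with the iteration count t (assumed form).
robbins_monro_step(grad, stepsize, t) = (stepsize / t) .* grad

# Momentum: blend the previous update v (weight g) with the new gradient step.
momentum_step(grad, v, stepsize, g) = g .* v .+ stepsize .* grad

# AdaGrad: accumulate squared gradients in G and scale each coordinate separately;
# epsilon guards against division by zero.
function adagrad_step(grad, G, stepsize, epsilon)
    G_new = G .+ grad .^ 2
    return stepsize .* grad ./ (sqrt.(G_new) .+ epsilon), G_new
end

grad = [2.0, -4.0]
step_rm = robbins_monro_step(grad, 0.1, 2)       # second iteration
v1 = momentum_step(grad, zeros(2), 0.1, 0.9)     # first momentum update
v2 = momentum_step(grad, v1, 0.1, 0.9)           # second update reuses v1
step_ada, G = adagrad_step(grad, zeros(2), 0.1, 1.0e-8)
```

Note how AdaGrad equalizes the step magnitude across coordinates on the first update, while momentum grows the step when successive gradients agree.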

Oja's method

OnlinePCA.oja — Method

oja(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Oja's method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numepoch : The number of epochs.
scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
g : The parameter used when scheduling is set to nag.
epsilon : The parameter used when scheduling is set to adagrad.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetFull : Offset value for avoiding overflow when calculating the full gradient.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
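
The idea behind Oja's method can be sketched as a streaming update: each incoming vector x nudges W by stepsize * x * (x' * W), and W is then re-orthonormalized. The code below is a minimal sketch of this commonly used QR-normalized variant, not the implementation of oja; the function name oja_sketch, the constant step size, and the convention that each column of X is one streamed vector are assumptions.

```julia
using LinearAlgebra, Random

function oja_sketch(X; dim=2, stepsize=0.01, numepoch=3, rng=MersenneTwister(1))
    W = Matrix(qr(randn(rng, size(X, 1), dim)).Q)        # random orthonormal start
    for epoch in 1:numepoch, x in eachcol(X)
        # Rank-one update from a single vector, followed by QR re-orthonormalization.
        W = Matrix(qr(W + stepsize * x * (x' * W)).Q)
    end
    return W
end

# Toy data: 5 variables with standard deviations 3, 2, 0.3, 0.2, 0.1; 2000 streamed vectors.
rng = MersenneTwister(42)
X = Diagonal([3.0, 2.0, 0.3, 0.2, 0.1]) * randn(rng, 5, 2000)
W = oja_sketch(X; dim=2)

# Compare the captured variance with the best possible value for a 2-dimensional subspace.
C = X * X' / size(X, 2)
captured = tr(W' * C * W) / sum(eigvals(Symmetric(C))[end-1:end])
```

Only one vector is held in memory per update, which is what makes this family of methods suitable for data that do not fit in memory.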

Reference

SGD-PCA (Oja's method) : Erkki Oja et al., 1985; Erkki Oja, 1992

CCIPCA

OnlinePCA.ccipca — Method

ccipca(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0, numepoch::Number=3, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Candid covariance-free incremental PCA.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numepoch : The number of epochs.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
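
The following is a minimal sketch of the CCIPCA update as it is usually stated, with the amnesic parameter set to 0: each eigenvector estimate is a running average of x * (x' * v / norm(v)), its norm estimates the eigenvalue, and the incoming vector is deflated before updating the next component. The function name ccipca_sketch, the convention that each column of X is one incoming vector, and the toy data are assumptions; this is not the implementation of ccipca.

```julia
using LinearAlgebra, Random

function ccipca_sketch(X; dim=2)
    V = zeros(size(X, 1), dim)                    # unnormalized eigenvector estimates
    for (n, x) in enumerate(eachcol(X))
        u = Vector{Float64}(x)
        for i in 1:min(dim, n)
            if i == n
                V[:, i] = u                       # initialize component i with the residual
            else
                v = V[:, i]
                vhat = v / norm(v)
                # Running average of u * (u' * vhat); no covariance matrix is formed.
                V[:, i] = ((n - 1) / n) * v + (1 / n) * u * dot(u, vhat)
                vnew = V[:, i] / norm(V[:, i])
                u = u - dot(u, vnew) * vnew       # deflate before the next component
            end
        end
    end
    λ = [norm(V[:, i]) for i in 1:dim]            # eigenvalue estimates
    W = V ./ reshape(λ, 1, :)                     # unit-length eigenvector estimates
    return W, λ
end

# Toy data: the first variable dominates with standard deviation 10.
rng = MersenneTwister(42)
X = Diagonal([10.0, 2.0, 0.3, 0.2, 0.1]) * randn(rng, 5, 3000)
W, λ = ccipca_sketch(X; dim=2)
λ_top = maximum(eigvals(Symmetric(X * X' / size(X, 2))))
```

Unlike the gradient-based methods, this update has no step size to tune, because the averaging weight 1 / n plays that role.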

Reference

CCIPCA : Juyang Weng et al., 2003

RSGD-PCA

OnlinePCA.rsgd — Method

rsgd(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Riemannian stochastic gradient descent method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numbatch : The batch size.
numepoch : The number of epochs.
scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
g : The parameter used when scheduling is set to nag.
epsilon : The parameter used when scheduling is set to adagrad.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetFull : Offset value for avoiding overflow when calculating the full gradient.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
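
A Riemannian variant differs from plain SGD in two steps: the mini-batch gradient is projected onto the tangent space at W, and the update is mapped back to the set of orthonormal matrices by a retraction. The sketch below uses a QR-based retraction and a constant step size; these choices, the function name rsgd_sketch, and the convention that columns of X are the sampled vectors are assumptions, not a description of rsgd itself.

```julia
using LinearAlgebra, Random

function rsgd_sketch(X; dim=2, stepsize=0.1, numbatch=100, numepoch=10,
                     rng=MersenneTwister(1))
    d, n = size(X)
    W = Matrix(qr(randn(rng, d, dim)).Q)
    for _ in 1:numepoch
        for start in 1:numbatch:n
            B = view(X, :, start:min(start + numbatch - 1, n))
            G = B * (B' * W) / size(B, 2)         # Euclidean mini-batch gradient
            R = G - W * (W' * G)                  # projection onto the tangent space at W
            W = Matrix(qr(W + stepsize * R).Q)    # QR-based retraction
        end
    end
    return W
end

rng = MersenneTwister(42)
X = Diagonal([3.0, 2.0, 0.3, 0.2, 0.1]) * randn(rng, 5, 2000)
W = rsgd_sketch(X; dim=2)
C = X * X' / size(X, 2)
captured = tr(W' * C * W) / sum(eigvals(Symmetric(C))[end-1:end])
```

Because the projected gradient R is orthogonal to the columns of W, it vanishes exactly when W spans an invariant subspace of the covariance matrix.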

Reference

RSGD-PCA : Silvere Bonnabel, 2013

SVRG-PCA

OnlinePCA.svrg — Method

svrg(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Variance-reduced stochastic gradient descent method, also known as VR-PCA.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numbatch : The batch size.
numepoch : The number of epochs.
scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
g : The parameter used when scheduling is set to nag.
epsilon : The parameter used when scheduling is set to adagrad.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetFull : Offset value for avoiding overflow when calculating the full gradient.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
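
Variance reduction can be illustrated for a single component. At the start of each epoch a full gradient is computed at a snapshot vector; inside the epoch, every stochastic gradient is corrected by the difference from the snapshot, so the noise shrinks as the iterate approaches the snapshot. This is a minimal sketch of the VR-PCA scheme for dim = 1 under assumptions (one random sample per step, constant step size, hypothetical name vrpca_sketch), not the implementation of svrg.

```julia
using LinearAlgebra, Random

function vrpca_sketch(X; stepsize=0.01, numepoch=5, rng=MersenneTwister(1))
    d, n = size(X)
    w_snap = normalize(randn(rng, d))             # snapshot vector
    for _ in 1:numepoch
        u_full = X * (X' * w_snap) / n            # full gradient at the snapshot
        w = copy(w_snap)
        for _ in 1:n
            x = view(X, :, rand(rng, 1:n))        # one randomly chosen sample
            # Stochastic gradient, corrected by the snapshot (variance reduction).
            w = w + stepsize * (x * (dot(x, w) - dot(x, w_snap)) + u_full)
            w = normalize(w)
        end
        w_snap = w
    end
    return w_snap
end

rng = MersenneTwister(42)
X = Diagonal([3.0, 2.0, 0.3, 0.2, 0.1]) * randn(rng, 5, 500)
w = vrpca_sketch(X)
top = eigen(Symmetric(X * X' / size(X, 2))).vectors[:, end]   # exact leading eigenvector
alignment = abs(dot(w, top))
```

The price of the correction term is one extra full pass over the data per epoch to compute u_full.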

Reference

SVRG-PCA : Ohad Shamir, 2015

RSVRG-PCA

OnlinePCA.rsvrg — Method

rsvrg(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, stepsize::Number=0.1, numbatch::Number=100, numepoch::Number=3, scheduling::AbstractString="robbins-monro", g::Number=0.9, epsilon::Number=1.0e-8, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, evalfreq::Number=5000, offsetFull::Number=1f-20, offsetStoch::Number=1f-6, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Riemannian variance-reduced stochastic gradient descent method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
stepsize : The step size applied at every iteration.
numbatch : The batch size.
numepoch : The number of epochs.
scheduling : Learning-rate scheduling. robbins-monro, momentum, nag, and adagrad are available.
g : The parameter used when scheduling is set to nag.
epsilon : The parameter used when scheduling is set to adagrad.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
evalfreq : Evaluation frequency of the reconstruction error.
offsetFull : Offset value for avoiding overflow when calculating the full gradient.
offsetStoch : Offset value for avoiding overflow when calculating the stochastic gradient.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged

Reference

RSVRG-PCA : Hongyi Zhang et al., 2016; Hiroyuki Sato et al., 2017

Orthogonal Iteration (Power method)

OnlinePCA.orthiter — Method

orthiter(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Orthogonal iteration, also known as the block power method, subspace iteration, or simultaneous iteration.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
numepoch : The number of epochs.
lower : Stopping criterion (the calculation is terminated when the relative change of the error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of the error rises above this value).
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
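
Orthogonal iteration itself is short enough to sketch in full: multiply the current basis by the covariance matrix, re-orthonormalize with QR, and repeat. The code below is a minimal sketch operating on an explicit covariance matrix C; the function name orthiter_sketch and returning the eigenvalue estimates as a vector of Rayleigh quotients are assumptions made for illustration.

```julia
using LinearAlgebra, Random

function orthiter_sketch(C; dim=2, numepoch=10, rng=MersenneTwister(1))
    W = Matrix(qr(randn(rng, size(C, 1), dim)).Q)
    for _ in 1:numepoch
        W = Matrix(qr(C * W).Q)                   # multiply, then re-orthonormalize
    end
    λ = diag(W' * C * W)                          # Rayleigh quotients as eigenvalue estimates
    return W, λ
end

# Toy covariance matrix with known eigenvalues 100, 25, 1, 0.25, 0.01.
C = Matrix(Diagonal([100.0, 25.0, 1.0, 0.25, 0.01]))
W, λ = orthiter_sketch(C; dim=2, numepoch=10)
```

The convergence speed depends on the ratios of consecutive eigenvalues, which is why a few epochs suffice for this well-separated spectrum.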

Arnoldi method

OnlinePCA.arnoldi — Method

arnoldi(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, perm::Bool=false, cper::Number=1f0)

Arnoldi method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged

Lanczos method

OnlinePCA.lanczos — Method

lanczos(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="",colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, perm::Bool=false, cper::Number=1f0)

Lanczos method.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Variance explained by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged
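
For readers unfamiliar with the Lanczos method, the sketch below builds an orthonormal Krylov basis for a symmetric matrix C, collects the recurrence coefficients in a small tridiagonal matrix, and takes its eigenpairs as approximations (Ritz pairs) of those of C. It is a textbook-style sketch with full reorthogonalization, not the implementation of lanczos; the name lanczos_sketch and the argument k (number of Lanczos steps) are hypothetical.

```julia
using LinearAlgebra, Random

function lanczos_sketch(C, k; rng=MersenneTwister(1))
    n = size(C, 1)
    Q = zeros(n, k)                               # orthonormal Krylov basis
    α = zeros(k)                                  # diagonal of the tridiagonal matrix
    β = zeros(k - 1)                              # off-diagonal of the tridiagonal matrix
    q = normalize(randn(rng, n))
    for j in 1:k
        Q[:, j] = q
        w = C * q
        α[j] = dot(q, w)
        w -= Q[:, 1:j] * (Q[:, 1:j]' * w)         # full reorthogonalization
        if j < k
            β[j] = norm(w)
            q = w / β[j]
        end
    end
    F = eigen(SymTridiagonal(α, β))               # small k × k eigenproblem
    return F.values, Q * F.vectors                # Ritz values (ascending) and Ritz vectors
end

# With k equal to the matrix size, the Ritz values reproduce the full spectrum.
C = Matrix(Diagonal([100.0, 25.0, 1.0, 0.25, 0.01]))
ritz_values, ritz_vectors = lanczos_sketch(C, 5)
```

In practice k is chosen much smaller than the matrix size, and the largest Ritz values converge first.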

Halko's method

OnlinePCA.halko — Method

halko(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Halko's method, a randomized SVD algorithm.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : Scaling applied to the values: log, ftt, or raw.
pseudocount : The value added to avoid NaN caused by log10(0); it is used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of PCA dimensions to compute.
noversamples : The number of oversamples.
niter : The number of power iterations.
initW : The CSV file containing the initial values of the eigenvectors.
initV : The CSV file containing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g., 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g., CPM: count per million <1e+6>).

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source
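
The roles of dim, noversamples, and niter can be illustrated with a minimal, self-contained sketch of a randomized SVD in the style of Halko et al. This is not the code used by OnlinePCA.halko: the helper name randomized_svd is hypothetical, the matrix is held in memory rather than read from a Julia Binary file, and scaling and centering are assumed to have been applied beforehand.

```julia
using LinearAlgebra, Random

# Hypothetical helper (not part of OnlinePCA.jl): rank-`dim` randomized SVD of an
# in-memory matrix A. Scaling and centering are assumed to be done by the caller.
function randomized_svd(A::AbstractMatrix, dim::Int; noversamples::Int=5, niter::Int=3,
                        rng=Random.default_rng())
    m, n = size(A)
    l = min(dim + noversamples, m, n)     # sketch width: dim plus oversampling
    Ω = randn(rng, n, l)                  # Gaussian test matrix
    Q = Matrix(qr(A * Ω).Q)               # orthonormal basis of the sampled range (m × l)
    for _ in 1:niter                      # power iterations, re-orthonormalized each step
        Q = Matrix(qr(A' * Q).Q)          # n × l
        Q = Matrix(qr(A * Q).Q)           # m × l
    end
    B = Q' * A                            # small l × n projected matrix
    F = svd(B)
    U = Q * F.U[:, 1:dim]                 # left singular vectors (rows of A × dim)
    return U, F.S[1:dim], F.V[:, 1:dim]   # singular values and right singular vectors
end
```

Larger noversamples and niter values generally improve accuracy when the singular values decay slowly, at the cost of extra passes over the data.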

Algorithm 971

OnlinePCA.algorithm971 — Method

algorithm971(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Algorithm 971, which is one of the randomized SVD algorithms.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : {log,ftt,raw}-scaling of the value.
pseudocount : The number added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of dimensions of PCA.
noversamples : The number of oversamples.
niter : The number of power iterations.
initW : The CSV file storing the initial values of the eigenvectors.
initV : The CSV file storing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g. CPM: Count per million <1e+6>)

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source
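
The scale, pseudocount, and cper arguments describe a per-value transformation of the count matrix. The sketch below shows one plausible reading of them and is not taken from the OnlinePCA.jl source: the helper name scale_counts is hypothetical, "ftt" is assumed to denote the Freeman-Tukey transformation sqrt(x) + sqrt(x + 1), and the column-sum normalization by cper is assumed to be applied before any of the three scalings.

```julia
# Hypothetical helper (not part of OnlinePCA.jl): apply one of the three `scale`
# options to a count matrix X (rows = features, columns = samples).
function scale_counts(X::AbstractMatrix; scale::AbstractString="ftt",
                      pseudocount::Real=1f0, cper::Real=1f0)
    colsums = sum(X, dims=1)              # total counts per sample (cf. colsumlist)
    N = X ./ colsums .* cper              # assumed normalization: counts per `cper`
    if scale == "log"
        return log10.(N .+ pseudocount)   # pseudocount keeps log10 away from zero
    elseif scale == "ftt"
        return sqrt.(N) .+ sqrt.(N .+ 1)  # assumed Freeman-Tukey transformation
    elseif scale == "raw"
        return N                          # no further transformation
    else
        throw(ArgumentError("scale must be \"log\", \"ftt\", or \"raw\""))
    end
end
```

With cper=1e6 the normalized values would correspond to counts per million (CPM), as in the cper description above.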

Randomized Block Krylov Iteration

OnlinePCA.rbkiter — Method

rbkiter(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, numepoch::Number=10, lower::Number=0, upper::Number=1.0f+38, expvar::Number=0.1f0, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Randomized Block Krylov Iteration.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : {log,ftt,raw}-scaling of the value.
pseudocount : The number added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of dimensions of PCA.
numepoch : The number of epochs.
lower : Stopping criterion (the calculation is terminated when the relative change of error falls below this value).
upper : Stopping criterion (the calculation is terminated when the relative change of error rises above this value).
initW : The CSV file storing the initial values of the eigenvectors.
initV : The CSV file storing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g. CPM: Count per million <1e+6>)

Output Arguments

W : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
V : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix
stop : Whether the calculation has converged

source
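
As a rough illustration of the idea behind randomized block Krylov iteration, the sketch below builds a block Krylov subspace from a random starting block and extracts the top dim singular directions from it. It is a simplified in-memory version, not the implementation behind OnlinePCA.rbkiter: the helper name block_krylov_svd is hypothetical, numepoch is assumed here to mean the number of extra Krylov blocks, and the lower/upper stopping criteria are omitted.

```julia
using LinearAlgebra, Random

# Hypothetical helper (not part of OnlinePCA.jl): randomized block Krylov iteration
# on an in-memory matrix A, without stopping criteria.
function block_krylov_svd(A::AbstractMatrix, dim::Int; numepoch::Int=3,
                          rng=Random.default_rng())
    n = size(A, 2)
    block = Matrix(qr(A * randn(rng, n, dim)).Q)  # basis of the first block A * Π
    K = block
    for _ in 1:numepoch
        # Next Krylov block (A * A') * previous, orthonormalized for stability;
        # this keeps the same column span as the unnormalized block.
        block = Matrix(qr(A * (A' * block)).Q)
        K = hcat(K, block)
    end
    Q = Matrix(qr(K).Q)                           # orthonormal basis of the Krylov subspace
    T = A' * Q
    E = eigen(Symmetric(T' * T))                  # eigen-decomposition of Q' * A * A' * Q
    order = sortperm(E.values; rev=true)[1:dim]   # indices of the largest eigenvalues
    U = Q * E.vectors[:, order]                   # approximate left singular vectors
    σ = sqrt.(max.(E.values[order], 0))           # approximate singular values
    return U, σ
end
```

Unlike plain power iteration, every intermediate block is kept in the subspace, which is why this family of methods tends to need fewer passes for the same accuracy.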

Single-pass PCA type I

OnlinePCA.singlepass — Method

singlepass(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Single-pass PCA type I, which is one of the randomized SVD algorithms.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : {log,ftt,raw}-scaling of the value.
pseudocount : The number added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of dimensions of PCA.
noversamples : The number of oversamples.
niter : The number of power iterations.
initW : The CSV file storing the initial values of the eigenvectors.
initV : The CSV file storing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g. CPM: Count per million <1e+6>)

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source

Single-pass PCA type II

OnlinePCA.singlepass2 — Method

singlepass2(;input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", pseudocount::Number=1f0, rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1f0)

Single-pass PCA type II, which is one of the randomized SVD algorithms.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin function.
outdir : The directory in which to save the results.
scale : {log,ftt,raw}-scaling of the value.
pseudocount : The number added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of dimensions of PCA.
noversamples : The number of oversamples.
niter : The number of power iterations.
initW : The CSV file storing the initial values of the eigenvectors.
initV : The CSV file storing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g. CPM: Count per million <1e+6>)

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source

Summarization for 10X-HDF5

OnlinePCA.tenxsumr — Method

tenxsumr(; tenxfile::AbstractString="", outdir::AbstractString=".", group::AbstractString="", chunksize::Number=5000)

Extract the summary information of 10X-HDF5.

Input Arguments

tenxfile is the HDF5 file formatted by 10X Genomics.
outdir is the directory in which to save the results.
group is the group name of HDF5 (e.g. mm10).
chunksize is the number of rows read at once (e.g. 5000).

Output Files

Sample_NoCounts.csv : Sum of counts in each column.
Feature_Means.csv : Mean in each row.
Feature_LogMeans.csv : Log10(Mean+1) in each row.
Feature_SqrtMeans.csv : sqrt(Mean+1) in each row.
Feature_Vars.csv : Sample variance in each row.
Feature_LogVars.csv : Log10(Var+1) in each row.
Feature_SqrtVars.csv : sqrt(Var+1) in each row.
Feature_CV2s.csv : Coefficient of Variation in each row.

source
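
The quantities written to the output files above can be sketched on an in-memory matrix processed chunksize rows at a time. This is only an illustration, not the tenxsumr implementation: the helper name row_summaries is hypothetical, no HDF5 file is read, and Feature_CV2s is assumed to mean the squared coefficient of variation (variance divided by squared mean).

```julia
using Statistics

# Hypothetical helper (not part of OnlinePCA.jl): per-row and per-column summaries
# of a count matrix X (rows = features, columns = samples), read in row chunks.
function row_summaries(X::AbstractMatrix; chunksize::Int=5000)
    nrows = size(X, 1)
    means = zeros(nrows)
    vars = zeros(nrows)
    colsums = zeros(size(X, 2))
    for start in 1:chunksize:nrows
        rows = start:min(start + chunksize - 1, nrows)
        chunk = view(X, rows, :)
        means[rows] = vec(mean(chunk, dims=2))
        vars[rows] = vec(var(chunk, dims=2))   # sample variance (n - 1 denominator)
        colsums .+= vec(sum(chunk, dims=1))    # accumulate counts per sample
    end
    return (colsums=colsums, means=means,
            logmeans=log10.(means .+ 1), sqrtmeans=sqrt.(means .+ 1),
            vars=vars, logvars=log10.(vars .+ 1), sqrtvars=sqrt.(vars .+ 1),
            cv2s=vars ./ means .^ 2)           # assumed definition; NaN or Inf if mean is 0
end
```

Because each chunk holds complete rows, the row-wise statistics are exact per chunk and only the column sums need to be accumulated across chunks.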

ALGORITHM971 for 10X-HDF5

OnlinePCA.tenxpca — Method

tenxpca(; tenxfile::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="sqrt", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, chunksize::Number=5000, group::AbstractString, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1.0f0)

A randomized SVD.

Input Arguments

tenxfile : The HDF5 file formatted by 10X Genomics.
outdir : The directory in which to save the results.
scale : {sqrt,log,raw}-scaling of the value.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of dimensions of PCA.
noversamples : The number of oversamples.
niter : The number of power iterations.
chunksize : The number of rows read at once (e.g. 5000).
group : The group name of 10XHDF5 (e.g. mm10).
initW : The CSV file storing the initial values of the eigenvectors.
initV : The CSV file storing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g. 5000) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g. CPM: Count per million <1e+6>)

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source

Sparse Randomized SVD (ALGORITHM971 for Binarized MM file)

OnlinePCA.sparse_rsvd — Method

sparse_rsvd(; input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="ftt", rowmeanlist::AbstractString="", rowvarlist::AbstractString="", colsumlist::AbstractString="", dim::Number=3, noversamples::Number=5, niter::Number=3, chunksize::Number=1, initW::Union{Nothing,AbstractString}=nothing, initV::Union{Nothing,AbstractString}=nothing, logdir::Union{Nothing,AbstractString}=nothing, perm::Bool=false, cper::Number=1.0f0)

A randomized SVD.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.mm2bin function.
outdir : The directory in which to save the results.
scale : {ftt,log,raw}-scaling of the value.
rowmeanlist : The mean of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
rowvarlist : The variance of each row of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
colsumlist : The sum of counts of each column of the matrix. The CSV file is generated by the OnlinePCA.sumr function.
dim : The number of dimensions of PCA.
noversamples : The number of oversamples.
niter : The number of power iterations.
chunksize : The number of rows read at once (e.g. 1).
initW : The CSV file storing the initial values of the eigenvectors.
initV : The CSV file storing the initial values of the loadings.
logdir : The directory where intermediate files are saved every evalfreq (e.g. 1) iterations.
perm : Whether the data matrix is shuffled at random.
cper : Count per X (e.g. CPM: Count per million <1e+6>)

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source

Exact Out-of-Core PCA

OnlinePCA.exact_ooc_pca — Method

exact_ooc_pca(; input::AbstractString="", outdir::Union{Nothing,AbstractString}=nothing, scale::AbstractString="raw", pseudocount::Number=1.0f0, dim::Number=3, chunksize::Number=1, mode::AbstractString="dense")

Exact Out-of-Core PCA, which is based on an ordinary full-rank SVD and does not assume a low-rank approximation.

Input Arguments

input : Julia Binary file generated by the OnlinePCA.csv2bin or OnlinePCA.mm2bin function.
outdir : The directory in which to save the results.
scale : {raw,log,ftt}-scaling of the value.
pseudocount : The number added to avoid NaN caused by log10(0); it is also used when Feature_LogMeans.csv <log10(mean+pseudocount) value of each feature> is generated.
dim : The number of dimensions of PCA.
chunksize : The number of rows to be read at once.
mode : "dense", "sparse_mm", or "sparse_bincoo" can be specified.

Output Arguments

V : Eigenvectors of the covariance matrix (No. columns of the data matrix × dim)
λ : Eigenvalues (dim × dim)
U : Loading vectors of the covariance matrix (No. rows of the data matrix × dim)
Scores : Principal component scores
ExpVar : Explained variance by the eigenvectors
TotalVar : Total variance of the data matrix

source
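
One way an exact, full-rank PCA can be computed without holding the whole matrix in memory is to accumulate a small Gram matrix over row chunks and then eigen-decompose it. The sketch below illustrates that idea under stated assumptions and is not the exact_ooc_pca implementation: the helper name chunked_exact_pca is hypothetical, the data sit in an in-memory matrix instead of a Julia Binary file, each row (feature) is assumed to be centered by its own mean, and the explained variance is assumed to be each eigenvalue divided by the trace of the Gram matrix.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not part of OnlinePCA.jl): exact PCA of X (rows = features,
# columns = samples) from row chunks, storing only a samples × samples matrix.
function chunked_exact_pca(X::AbstractMatrix, dim::Int; chunksize::Int=1)
    nrows, ncols = size(X)
    G = zeros(ncols, ncols)
    for start in 1:chunksize:nrows
        rows = start:min(start + chunksize - 1, nrows)
        chunk = X[rows, :]
        chunk = chunk .- mean(chunk, dims=2)      # center each feature over the samples
        G .+= chunk' * chunk                      # accumulate the Gram matrix
    end
    E = eigen(Symmetric(G))                       # full eigen-decomposition, no low-rank step
    order = sortperm(E.values; rev=true)[1:dim]   # indices of the largest eigenvalues
    λ = E.values[order]
    V = E.vectors[:, order]                       # eigenvectors (No. columns × dim)
    expvar = λ ./ tr(G)                           # assumed definition of explained variance
    return V, λ, expvar
end
```

The loadings could then be recovered in a second pass over the rows by projecting each centered chunk onto V; that step is left out to keep the sketch short.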