OnlinePCA.jl Documentation
Overview
OnlinePCA.jl binarizes a CSV file, summarizes the information of the data matrix, and performs several online PCA algorithms on extremely large-scale matrices in an out-of-core manner, without loading the whole data matrix into memory.
Online PCA is performed in the following three steps.
- Step.1 Binarization : We assume that the data are a matrix of integer count values saved as a comma-separated CSV file. The CSV file is converted to a Julia binary file by the `csv2bin` function, which greatly accelerates subsequent I/O. Starting with version 0.3.0, the `mm2bin` and `bincoo2bin` functions are also available; they generate binary files from data in Matrix Market (MM) and binary COO (BinCOO) format, respectively, instead of CSV.
- Step.2 Summarization : The `sumr` function parses the binary file and extracts summary statistics for each row and each column; this information is used in Step.3. Starting with version 0.3.0, the `sparse_mode` argument was added to `sumr`. When it is set to `true`, `sumr` can be applied to binary files produced by `mm2bin`, and its output files are used by `sparse_rsvd`. For input files in the HDF5 format provided by 10X Genomics, the `tenxsumr` function is also provided.
- Step.3 Online PCA : Several online PCA functions can be run against the binary file (see the end-to-end sketch after this list). Row-wise centering is performed with the row-mean vector generated in Step.2 (c.f. the `--rowmeanlist` option), and the column-sum vector is likewise used for column-wise normalization (c.f. the `--colsumlist` option). `gd`, `sgd`, `oja`, `ccipca`, `rsgd`, `svrg`, `rsvrg`, `orthiter`, `lanczos`, `arnoldi`, `rbkiter`, `halko`, `algorithm971`, `singlepass`, and `singlepass2` are implemented. For input files in the HDF5 format provided by 10X Genomics, the `tenxpca` function is also provided. Starting with version 0.3.0, the `sparse_rsvd` function has been released, which runs the same algorithm as `tenxpca` (`algorithm971`) on `mm2bin` output files. The `exact_ooc_pca` function is also available since version 0.3.0; it performs dimensionality reduction by eigendecomposition after constructing the column-by-column covariance matrix out-of-core for longitudinal data. This is not an approximate algorithm; it uses a full-rank eigendecomposition.
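
For concreteness, below is a minimal end-to-end sketch of the three steps in the Julia API, modeled on the package README: it writes a toy count matrix, binarizes it with `csv2bin`, summarizes it with `sumr`, and runs GD-PCA with `gd`. The keyword names (`csvfile`, `binfile`, `outdir`, `input`, `dim`, `numepoch`, `rowmeanlist`), the `Feature_LogMeans.csv` file name, and the toy data follow the README and may differ between versions; treat this as an illustration rather than a verbatim recipe.

```julia
using OnlinePCA
using DelimitedFiles

# Toy integer count matrix (rows = features, columns = samples), written as a
# plain numeric CSV with no header, following the package README example.
tmp = mktempdir()
writedlm(joinpath(tmp, "Data.csv"), rand(0:10, 300, 99), ',')

# Step.1 Binarization: convert the CSV file to the compressed Julia binary format.
csv2bin(csvfile=joinpath(tmp, "Data.csv"), binfile=joinpath(tmp, "Data.zst"))

# Step.2 Summarization: write row-wise and column-wise summary statistics into tmp.
sumr(binfile=joinpath(tmp, "Data.zst"), outdir=tmp)

# Step.3 Online PCA: GD-PCA with row-wise centering by the row means from Step.2.
# "Feature_LogMeans.csv" is the row-mean file name used in the README; check the
# contents of tmp for the files actually written by sumr in your version.
out_gd = gd(input=joinpath(tmp, "Data.zst"), dim=3, numepoch=10,
            rowmeanlist=joinpath(tmp, "Feature_LogMeans.csv"))
```

The return value of `gd` bundles the estimated components and related outputs; see OnlinePCA.jl (Julia API) for the exact return values of each function.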
All programs are available both as a Julia API (OnlinePCA.jl (Julia API)) and as command-line tools (OnlinePCA.jl (Command line tool)).

Reference
Algorithms
- Gradient-based
    - GD-PCA
    - SGD-PCA
    - Oja's method : [Erkki Oja, et al., 1985](https://www.sciencedirect.com/science/article/pii/0022247X85901313), [Erkki Oja, 1992](https://www.sciencedirect.com/science/article/pii/S0893608005800899)
    - CCIPCA : [Juyang Weng, et al., 2003](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.5665&rep=rep1&type=pdf)
    - RSGD-PCA : [Silvere Bonnabel, 2013](https://arxiv.org/abs/1111.5280)
    - SVRG-PCA : [Ohad Shamir, 2015](http://proceedings.mlr.press/v37/shamir15.pdf)
    - RSVRG-PCA : [Hongyi Zhang, et al., 2016](http://papers.nips.cc/paper/6515-riemannian-svrg-fast-stochastic-optimization-on-riemannian-manifolds.pdf), [Hiroyuki Sato, et al., 2017](https://arxiv.org/abs/1702.05594)
- Krylov subspace-based
    - Orthogonal Iteration (a power method that computes multiple eigenvectors at once) : [Zhaojun Bai, 1987](https://www.amazon.co.jp/Templates-Solution-Algebraic-Eigenvalue-Problems/dp/0898714710)
    - Arnoldi method : [Zhaojun Bai, 1987](https://www.amazon.co.jp/Templates-Solution-Algebraic-Eigenvalue-Problems/dp/0898714710)
    - Lanczos method : [Zhaojun Bai, 1987](https://www.amazon.co.jp/Templates-Solution-Algebraic-Eigenvalue-Problems/dp/0898714710)
- Random projection-based
    - Halko's method : [Halko, N., et al., 2011](https://arxiv.org/abs/0909.4061), [Halko, N., et al., 2011](https://epubs.siam.org/doi/abs/10.1137/100804139)
    - Algorithm971 : [George C. Linderman, et al., 2017](https://arxiv.org/abs/1712.09005), [Huamin Li, et al., 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5625842/), [Vladimir Rokhlin, et al., 2009](https://arxiv.org/abs/0809.2274)
    - Randomized Block Krylov Iteration : [C. Musco, et al., 2015](https://arxiv.org/abs/1504.05477)
    - Single-pass PCA : [W. Yu, et al., 2017](https://www.ijcai.org/proceedings/2017/0468.pdf)

Learning Parameter Scheduling
- Robbins-Monro : Herbert Robbins, et al., 1951
- Momentum : Ning Qian, 1999
- Nesterov's Accelerated Gradient Descent (NAG) : Nesterov, 1983
- ADAGRAD : John Duchi, et al., 2011
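
These schedules are not standalone functions but options of the gradient-based solvers above. In the Julia API they are selected through the `scheduling` keyword (together with `stepsize`), as in the README examples; the lowercase scheduling names, the `stepsize` value, and the file names in the sketch below are assumptions for illustration and may need adjustment for your version and data.

```julia
using OnlinePCA

# Hypothetical file names: a binary file from csv2bin and the row-mean file from sumr.
input   = "Data.zst"
rowmean = "Feature_LogMeans.csv"

# The same SGD-PCA run under the four learning-parameter schedules listed above.
# The lowercase scheduling names and the stepsize value are illustrative assumptions.
out_rm  = sgd(input=input, dim=3, scheduling="robbins-monro", stepsize=1.0e-2, rowmeanlist=rowmean)
out_mom = sgd(input=input, dim=3, scheduling="momentum",      stepsize=1.0e-2, rowmeanlist=rowmean)
out_nag = sgd(input=input, dim=3, scheduling="nag",           stepsize=1.0e-2, rowmeanlist=rowmean)
out_ada = sgd(input=input, dim=3, scheduling="adagrad",       stepsize=1.0e-2, rowmeanlist=rowmean)
```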