OnlinePCA.jl Documentation

Overview

OnlinePCA.jl binarizes CSV file, summarizes the information of data matrix and, performs some online-PCA functions for extreamly large scale matrix in an out-of-core manner without loading whole data on memory space.

Online PCA methods are performed as following three steps.

  • Step.1 Binarization : We assume that the data is matrix filled in integar count value and saved as comma-separated CSV file. The CSV file is converted to Julia binary file by csv2bin function. This step extremely accelerates I/O speed. Starting with version 0.3.0, the mm2bin function and bincoo2bin are also available. These function generates binary files for data in Matrix Market (MM) format and Binary COO (BinCOO), not CSV.
  • Step.2 Summarization : sumr function parses the binary file and then extract some summary statistics in each row and each column. The information is used by Step.3. Starting with version 0.3.0, the sparse_mode argument was added to the sumr function.

If this is specified as true, sumr can be applied to binary files output by mm2bin. The output files are used in sparse_rsvd. Assuming a situation where the input file is the HDF5 formatted by 10X Genomics, we also prepared tenxsumr function.

  • Step.3 Online PCA : Some online PCA functions can be performed against the binary file. By using mean vector of each row generated by Step.2, row-wise centering (c.f. --rowmeanlist option) is performed. Likewise, sum vector of each columns are used column-wise normalization (c.f. --colsumlist option). gd, sgd, oja, ccipca, rsgd, svrg, rsvrg, orthiter, lanczos, arnoldi, rbkiter, halko, algorithm971, singlepass, and singlepass2 are implemented. Assuming a situation where the input file is the HDF5 formatted by 10X Genomics, we also prepared tenxpca function. Starting with version 0.3.0, the function sparse_rsvd has been released, allowing the same algorithm (algorithm971) as tenxpca to be performed on mm2bin output files. A function named exact_ooc_pca is also publicly available since version 0.3.0, which performs dimensionality compression by eigenvalue decomposition after constructing a column-by-column covariance matrix out-of-core for longitudinal data. This is not an approximate algorithm, but uses full-rank eigenvalue decomposition.

All programs are available as Julia API (OnlinePCA.jl (Julia API)) and command line tool (OnlinePCA.jl (Command line tool)).

Overview of OnlinePCA.jl

Reference

Algorithms

  • Gradient-based
- GD-PCA
- SGD-PCA
- Oja's method : [Erkki Oja et. al., 1985](https://www.sciencedirect.com/science/article/pii/0022247X85901313), [Erkki Oja, 1992](https://www.sciencedirect.com/science/article/pii/S0893608005800899)
- CCIPCA : [Juyang Weng et. al., 2003](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.5665&rep=rep1&type=pdf)
- RSGD-PCA : [Silvere Bonnabel, 2013](https://arxiv.org/abs/1111.5280)
- SVRG-PCA : [Ohad Shamir, 2015](http://proceedings.mlr.press/v37/shamir15.pdf)
- RSVRG-PCA : [Hongyi Zhang, et. al., 2016](http://papers.nips.cc/paper/6515-riemannian-svrg-fast-stochastic-optimization-on-riemannian-manifolds.pdf), [Hiroyuki Sato, et. al., 2017](https://arxiv.org/abs/1702.05594)
  • Krylov subspace-based
- Orthgonal Iteration (A power method to calculate multiple eigenvectors at once) : [Zhaofun Bai, 1987](https://www.amazon.co.jp/Templates-Solution-Algebraic-Eigenvalue-Problems/dp/0898714710)
- Arnoldi method : [Zhaofun Bai, 1987](https://www.amazon.co.jp/Templates-Solution-Algebraic-Eigenvalue-Problems/dp/0898714710)
- Lanczos method : [Zhaofun Bai, 1987](https://www.amazon.co.jp/Templates-Solution-Algebraic-Eigenvalue-Problems/dp/0898714710)
  • Random projection-based
- Halko's method : [Halko, N., et. al., 2011](https://arxiv.org/abs/0909.4061), [Halko, N. et. al., 2011](https://epubs.siam.org/doi/abs/10.1137/100804139)
- Algorithm971 : [George C. Linderman, et. al., 2017](https://arxiv.org/abs/1712.09005), [Huamin, Li, et. al., 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5625842/), [Vladimir Rokhlin, et. al., 2009](https://arxiv.org/abs/0809.2274)
- Randomized Block Krylov Iteration : [W, Yu, et. al., 2017](https://arxiv.org/abs/1504.05477)
- Single-pass PCA : [C Musco, et. al., 2015](https://www.ijcai.org/proceedings/2017/0468.pdf)

Learning Parameter Scheduling

Identifying Highly Variable Genes