
Commit

Fixed comma error in relation to openjournals/joss-reviews#6533
Sm00thix committed Jul 23, 2024
1 parent 103a87c commit e7046e8
Showing 1 changed file with 1 addition and 1 deletion.
paper/paper.md: 2 changes (1 addition, 1 deletion)
@@ -49,7 +49,7 @@ In conclusion, `ikpls` empowers researchers and practitioners in machine learnin

# Statement of need

- PLS [@wold1966estimation] is a standard method in machine learning and chemometrics. PLS can be used as a regression model, PLS-R (PLS regression), [@wold1983food; @wold2001pls] or a classification model, PLS-DA (PLS discriminant analysis) [@barker2003partial]. PLS takes as input a matrix $\mathbf{X}$ with dimension $(N, K)$ of predictor variables and a matrix $\mathbf{Y}$ with dimension $(N, M)$ of response variables. PLS decomposes $\mathbf{X}$ and $\mathbf{Y}$ into $A$ latent variables (also called components), which are linear combinations of the original $\mathbf{X}$ and $\mathbf{Y}$. Choosing the optimal number of components, $A$, depends on the input data and varies from task to task. Additionally, selecting the optimal preprocessing method is challenging to assess before model validation [@rinnan2009review, @sorensen2021nir] but is required for achieving optimal performance [@du2022quantitative]. The optimal number of components and the optimal preprocessing method are typically chosen by cross-validation, which may be very computationally expensive. The implementations of the fast cross-validation algorithm [@engstrøm2024shortcutting] will significantly reduce the computational cost of cross-validation.
+ PLS [@wold1966estimation] is a standard method in machine learning and chemometrics. PLS can be used as a regression model, PLS-R (PLS regression) [@wold1983food; @wold2001pls], or a classification model, PLS-DA (PLS discriminant analysis) [@barker2003partial]. PLS takes as input a matrix $\mathbf{X}$ with dimension $(N, K)$ of predictor variables and a matrix $\mathbf{Y}$ with dimension $(N, M)$ of response variables. PLS decomposes $\mathbf{X}$ and $\mathbf{Y}$ into $A$ latent variables (also called components), which are linear combinations of the original $\mathbf{X}$ and $\mathbf{Y}$. Choosing the optimal number of components, $A$, depends on the input data and varies from task to task. Additionally, selecting the optimal preprocessing method is challenging to assess before model validation [@rinnan2009review, @sorensen2021nir] but is required for achieving optimal performance [@du2022quantitative]. The optimal number of components and the optimal preprocessing method are typically chosen by cross-validation, which may be very computationally expensive. The implementations of the fast cross-validation algorithm [@engstrøm2024shortcutting] will significantly reduce the computational cost of cross-validation.

This work introduces the Python software package, `ikpls`, with novel, fast implementations of IKPLS Algorithm #1 and Algorithm #2 by @dayal1997improved, which have previously been compared with other PLS algorithms and shown to be fast [@alin2009comparison] and numerically stable [@andersson2009comparison]. The implementations introduced in this work use NumPy [@harris2020array] and JAX [@jax2018github]. The NumPy implementations can be executed on CPUs, and the JAX implementations can be executed on CPUs, GPUs, and TPUs. The JAX implementations are also end-to-end differentiable, allowing integration into deep learning methods. This work compares the execution time of the implementations on input data of varying dimensions. It reveals that choosing the implementation that best fits the data will yield orders of magnitude faster execution than the common NIPALS [@wold1966estimation] implementation of PLS, which is the one implemented by scikit-learn [@scikit-learn], an extensive machine learning library for Python. With the implementations introduced in this work, choosing the optimal number of components and the optimal preprocessing becomes much more feasible than previously. Indeed, derivatives of this work have previously been applied to do this precisely [@engstrom2023improving; @engstrom2023analyzing].

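The second changed paragraph describes the `ikpls` package's NumPy and JAX implementations of IKPLS Algorithm #1 and #2. As a rough illustration of how the package is used, here is a minimal sketch of fitting a PLS regression model and predicting with it. The import path `ikpls.numpy_ikpls`, the `PLS(algorithm=...)` constructor, and the `fit`/`predict` signatures are assumptions based on the package's documentation, not something this commit confirms.

```python
# Minimal sketch of fitting PLS-R with the NumPy-based IKPLS implementation.
# NOTE: the module path `ikpls.numpy_ikpls`, the `PLS(algorithm=1)` constructor,
# and the fit/predict signatures below are assumptions, not confirmed by this diff.
import numpy as np
from ikpls.numpy_ikpls import PLS  # assumed import path

N, K, M, A = 100, 50, 10, 20  # samples, predictors, responses, components

rng = np.random.default_rng(42)
X = rng.uniform(size=(N, K))  # predictor matrix, shape (N, K)
Y = rng.uniform(size=(N, M))  # response matrix, shape (N, M)

pls = PLS(algorithm=1)  # IKPLS Algorithm #1; algorithm=2 would select Algorithm #2
pls.fit(X, Y, A)        # decompose X and Y into A latent variables

# Predictions using every component count at once, or exactly 20 components.
Y_pred_all = pls.predict(X)                  # assumed shape (A, N, M)
Y_pred_20 = pls.predict(X, n_components=20)  # assumed shape (N, M)
```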
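The first changed paragraph notes that the optimal number of components $A$ is typically chosen by cross-validation. A simple hold-out variant of that selection, built on the same assumed interface, might look like the sketch below; it additionally assumes that a single `predict` call returns predictions for every component count from 1 to `A_max`, so all candidate models are scored from one fit.

```python
# Sketch: choosing the number of components on a validation split.
# Builds on the assumed PLS interface from the previous example.
import numpy as np
from ikpls.numpy_ikpls import PLS  # assumed import path

rng = np.random.default_rng(0)
N, K, M, A_max = 100, 50, 1, 30
X = rng.uniform(size=(N, K))
Y = rng.uniform(size=(N, M))

# Hold out 20% of the samples for validation.
n_val = N // 5
X_train, X_val = X[:-n_val], X[-n_val:]
Y_train, Y_val = Y[:-n_val], Y[-n_val:]

pls = PLS(algorithm=1)
pls.fit(X_train, Y_train, A_max)

# Assumed: predict without n_components returns predictions for all
# 1..A_max component counts, shape (A_max, n_val, M).
Y_pred = pls.predict(X_val)
rmse = np.sqrt(np.mean((Y_pred - Y_val[np.newaxis]) ** 2, axis=(1, 2)))
best_A = int(np.argmin(rmse)) + 1  # component counts are 1-indexed
print(f"best number of components: {best_A}")
```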
