
Structure-preserving deep learning

Published online by Cambridge University Press:  27 May 2021

E. CELLEDONI
Affiliation:
Department of Mathematical Sciences, NTNU, N-7491 Trondheim, Norway
M. J. EHRHARDT
Affiliation:
Institute for Mathematical Innovation, University of Bath, Bath BA2 7JU, UK
C. ETMANN
Affiliation:
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK
R. I. MCLACHLAN
Affiliation:
School of Fundamental Sciences, Massey University, Private Bag 11-222, Palmerston North, New Zealand
B. OWREN
Affiliation:
Department of Mathematical Sciences, NTNU, N-7491 Trondheim, Norway
C.-B. SCHÖNLIEB
Affiliation:
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK
F. SHERRY
Affiliation:
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK

Abstract


Over the past few years, deep learning has risen to the foreground as a topic of massive interest, mainly as a result of successes obtained in solving large-scale image processing tasks. There are multiple challenging mathematical problems involved in applying deep learning: most deep learning methods require the solution of hard optimisation problems, and a good understanding of the trade-off between computational effort, amount of data and model complexity is required to successfully design a deep learning approach for a given problem. Much of the progress made in deep learning has been based on heuristic explorations, but there is a growing effort to understand mathematically the structure in existing deep learning methods and to design new methods systematically so that they preserve certain types of structure. In this article, we review a number of these directions: some deep neural networks can be understood as discretisations of dynamical systems; neural networks can be designed to have desirable properties such as invertibility or group equivariance; and new algorithmic frameworks based on conformal Hamiltonian systems and Riemannian manifolds have been proposed to solve the optimisation problems. We conclude our review of each of these topics by discussing some open problems that we consider to be interesting directions for future research.

Type
Papers
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press
