Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R. and Wang, R. (2019). On exact computation with an infinitely wide neural net, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 8141–8150.
Bartlett, P.L., Foster, D.J. and Telgarsky, M.J. (2017). Spectrally-normalized margin bounds for neural networks, in I. Guyon et al. (Eds), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, New York, pp. 6240–6249.
Bottou, L. (1998). On-line learning and stochastic approximations, in D. Saad (Ed.), On-line Learning in Neural Networks, Cambridge University Press, pp. 9–42.
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization, Journal of Machine Learning Research 2: 499–526, DOI: 10.1162/153244302760200704.
Chen, H., Mo, Z., Yang, Z. and Wang, X. (2019). Theoretical investigation of generalization bound for residual networks, Proceedings of the 28th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Francisco, pp. 2081–2087, DOI: 10.24963/ijcai.2019/288.
Chen, Y., Chang, H., Meng, J. and Zhang, D. (2019). Ensemble neural networks (ENN): A gradient-free stochastic method, Neural Networks 110: 170–185.
D’Agostino, R. and Pearson, E.S. (1973). Tests for departure from normality: Empirical results for the distributions of b2 and √b1, Biometrika 60(3): 613–622.
Dieuleveut, A., Durmus, A. and Bach, F. (2017). Bridging the gap between constant step size stochastic gradient descent and Markov chains, arXiv: 1707.06386.
Elisseeff, A., Evgeniou, T. and Pontil, M. (2005). Stability of randomized learning algorithms, Journal of Machine Learning Research 6: 55–79.
Feng, Y., Gao, T., Li, L., Liu, J.-G. and Lu, Y. (2020). Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation, Communications in Mathematical Sciences 18(1).
Hardt, M., Recht, B. and Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent, in M.F. Balcan and K.Q. Weinberger (Eds), Proceedings of the 33rd International Conference on Machine Learning, New York, USA, pp. 1225–1234.
He, F., Liu, T. and Tao, D. (2019). Control batch size and learning rate to generalize well: Theoretical and empirical evidence, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 1143–1152.
He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H. and Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704–2713, DOI: 10.1109/CVPR.2018.00286.
Kramers, H. (1940). Brownian motion in a field of force and the diffusion model of chemical reactions, Physica 7(4): 284–304.
Krizhevsky, A. and Hinton, G. (2009). Learning Multiple Layers of Features from Tiny Images, Master’s thesis, Department of Computer Science, University of Toronto.
LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/.
Li, H., Xu, Z., Taylor, G., Studer, C. and Goldstein, T. (2018). Visualizing the loss landscape of neural nets, in S. Bengio et al. (Eds), Advances in Neural Information Processing Systems, Vol. 31, Curran Associates, New York, pp. 6389–6399.
Li, J., Luo, X. and Qiao, M. (2020). On generalization error bounds of noisy gradient methods for non-convex learning, arXiv: 1902.00621.
Li, Q., Tai, C. and Weinan, E. (2019). Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations, Journal of Machine Learning Research 20(40): 1–47.
Li, X., Lu, J., Wang, Z., Haupt, J. and Zhao, T. (2019). On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond, arXiv: 1806.05159.
Ljung, L., Pflug, G. and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, Switzerland.
London, B., Huang, B., Taskar, B. and Getoor, L. (2014). PAC-Bayesian collective stability, in S. Kaski and J. Corander (Eds), Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, pp. 585–594.
Mandt, S., Hoffman, M.D. and Blei, D.M. (2017). Stochastic gradient descent as approximate Bayesian inference, Journal of Machine Learning Research 18(1): 4873–4907.
McAllester, D.A. (1999). PAC-Bayesian model averaging, Proceedings of the 12th Annual Conference on Computational Learning Theory, COLT ’99, New York, NY, USA, pp. 164–170, DOI: 10.1145/307400.307435.
Negrea, J., Haghifam, M., Dziugaite, G.K., Khisti, A. and Roy, D.M. (2019). Information-theoretic generalization bounds for SGLD via data-dependent estimates, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 11015–11025.
Panigrahi, A., Somani, R., Goyal, N. and Netrapalli, P. (2019). Non-Gaussianity of stochastic gradient noise, arXiv: 1910.09626.
Qian, Y., Wang, Y., Wang, B., Gu, Z., Guo, Y. and Swaileh, W. (2022). Hessian-free second-order adversarial examples for adversarial learning, arXiv: 2207.01396.
Rong, Y., Huang, W., Xu, T. and Huang, J. (2020). DropEdge: Towards deep graph convolutional networks on node classification, arXiv: 1907.10903.
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press.
Simsekli, U., Sagun, L. and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks, in K. Chaudhuri and R. Salakhutdinov (Eds), Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 5827–5837.
Sulaiman, I.M., Kaelo, P., Khalid, R. and Nawawi, M.K.M. (2024). A descent generalized RMIL spectral gradient algorithm for optimization problems, International Journal of Applied Mathematics and Computer Science 34(2): 225–233, DOI: 10.61822/amcs-2024-0016.
Sutskever, I., Martens, J., Dahl, G. and Hinton, G. (2013). On the importance of initialization and momentum in deep learning, in S. Dasgupta and D. McAllester (Eds), Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, pp. 1139–1147.
Villani, C. (2008). Optimal Transport: Old and New, Grundlehren der Mathematischen Wissenschaften, Springer, Berlin/Heidelberg.
Weinan, E., Ma, C. and Wang, Q. (2019). A priori estimates of the population risk for residual networks, arXiv: 1903.02154.
Welling, M. and Teh, Y.W. (2011). Bayesian learning via stochastic gradient Langevin dynamics, Proceedings of the 28th International Conference on Machine Learning, ICML’11, Omnipress, Madison, WI, USA, pp. 681–688.
Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V. and Zhu, Z. (2020). On the noisy gradient descent that generalizes as SGD, Proceedings of the 37th International Conference on Machine Learning, PMLR 119: 10367–10376.
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization, arXiv: 1611.03530.