Question about a paper: I do not understand why it is stated that SGD employs bootstrapping to calculate the gradient.
In this paper, they state that:
As SGD employs the bootstrapping (i.e., random sampling with replacement) [67] for gradient calculation, we can obtain the unbiased estimation of standard gradients calculated by all the data, i.e., $\mathbb{E}[\nabla f_{i_t}(w_t)] \leftarrow \nabla f(w_t)$.
As far as I know, bootstrapping works like this: suppose we have a dataset ABCD; we create multiple resampled datasets from that initial one, for example AABD, DCDA, BAAD, CBAA, AAAB, etc. In SGD, however, we first shuffle the input data and then, at each iteration, take only one data point (one sample) at a time, in a random order. Given the paper's claim that SGD employs bootstrapping for gradient calculation, how can we talk about bootstrapping if we only take one data point at a time (as in SGD)?
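To make my confusion concrete, here is a minimal sketch of what I understand the claim to mean: if each step draws one index uniformly at random with replacement, the single-example gradient is an unbiased estimator of the full-batch gradient. The least-squares objective, the variable names, and the sample sizes below are my own illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy example (my own, not from the paper): least-squares objective
#   f(w) = (1/n) * sum_i (x_i^T w - y_i)^2
# with per-example gradients grad_i(w) = 2 * (x_i^T w - y_i) * x_i.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)          # an arbitrary current iterate w_t

def grad_i(i, w):
    """Gradient of the i-th example's squared error at w."""
    return 2.0 * (X[i] @ w - y[i]) * X[i]

# Full-batch ("standard") gradient computed over all the data.
full_grad = np.mean([grad_i(i, w) for i in range(n)], axis=0)

# What I understand the paper to mean by "bootstrapping": each SGD step
# draws one index i_t uniformly at random WITH replacement, so
# E[grad_{i_t}(w)] equals the full gradient; averaging many such
# single-example gradients should therefore approach full_grad.
draws = rng.integers(0, n, size=500_000)
sgd_avg = np.mean([grad_i(i, w) for i in draws], axis=0)

print("full-batch gradient :", full_grad)
print("avg of SGD gradients:", sgd_avg)
print("difference norm     :", np.linalg.norm(full_grad - sgd_avg))  # small
```

By contrast, the shuffle-then-sweep scheme I described samples without replacement within an epoch, which is why I am unsure the word "bootstrapping" applies.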
Zhou, Qihua, et al. "On-device Learning Systems for Edge Intelligence: A Software and Hardware Synergy Perspective." IEEE Internet of Things Journal (2021).
Topic boosting gradient-descent deep-learning
Category Data Science