Question 8: (12 points) Consider the best practices for trai…
Question 8: (12 points) Consider the best practices for training neural networks. Answer the following questions: (3 points) In practice, the Nesterov momentum often converges faster than the standard momentum. Explain why the correction based on the gradient at the anticipated position might prevent overshooting. (6 points) You observe the following training behaviors: Observation 1: Your deep network (8 layers) trains very slowly. The gradients in early layers are extremely small. Training loss decreases but very gradually. Observation 2: Your network achieves 99\% training accuracy but only 70\% validation accuracy. The gap is large and consistent. Observation 3: Your network’s training is unstable – loss fluctuates wildly and sometimes diverges. Different random initializations lead to very different outcomes. For each observation: (i) Identify whether dropout, batch normalization, or both would help, and (ii) Explain why the chosen technique addresses the specific problem. 3. (3 points) Consider a feedforward neural network with the following architecture: Input (100 features) → Dense(256) → ReLU → Dense(128) → ReLU → Dense(10) → Softmax. Calculate the total number of trainable parameters in this network. Show your work by computing the parameters for each layer separately, including both weights and biases.
Read Details