Understanding Feature Scaling for Improved Gradient Descent in Machine Learning

Machine learning is a rapidly evolving field, with new techniques being developed all the time to improve model performance. One such technique is feature scaling, which can enable gradient descent to run much faster. This article explores the relationship between the size of a feature and the size of its associated parameter, and how feature scaling can improve the performance of gradient descent.

The relationship between feature size and parameter size

To understand the relationship between feature size and parameter size, let's take a concrete example: predicting the price of a house. We'll use two features, the size of the house (x1) and the number of bedrooms (x2), to make our prediction. Let's say that x1 typically ranges from 300 to 2000 square feet, while x2 in the data set ranges from 0 to 5 bedrooms.

Consider a house that has a size of 2000 square feet, five bedrooms, and a price of $500,000. With a linear model of the form price = w1 × x1 + w2 × x2 + b (the numbers below work out if the price is expressed in thousands of dollars), different choices of parameters can result in vastly different predicted prices. For example, if w1 is 50, w2 is 0.1, and b is 50, the estimated price would be 50 × 2000 + 0.1 × 5 + 50 = 100,000 + 0.5 + 50 = 100,050.5 thousand dollars, which is over $100 million. Clearly, this is not a reasonable estimate. On the other hand, if w1 is 0.1, w2 is 50, and b is still 50, the predicted price would be 0.1 × 2000 + 50 × 5 + 50 = 200 + 250 + 50 = 500 thousand dollars, which is equal to $500,000, a much more reasonable estimate. In other words, a feature that takes on large values tends to call for a relatively small parameter, and a feature that takes on small values tends to call for a relatively large one.

The impact of feature scaling on gradient descent

Now let's consider the relationship between feature scaling and gradient descent. If we plot the training data on a scatter plot, with the size of the house on the horizontal axis and the number of bedrooms on the vertical axis, we can see that the horizontal axis has a much larger range of values than the vertical axis.
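The two parameter choices from the house-price example can be checked with a few lines of Python. This is only a sketch: the function name `predict_price` is made up for illustration, and the price is assumed to be expressed in thousands of dollars, which is what makes the article's numbers consistent.

```python
# Linear model from the example: price = w1 * x1 + w2 * x2 + b.
# Assumption: the price is measured in thousands of dollars.

def predict_price(x1, x2, w1, w2, b):
    """Return the predicted price (in thousands of dollars)."""
    return w1 * x1 + w2 * x2 + b

size_sqft, bedrooms = 2000, 5  # the house from the example

# Large parameter on the large-valued feature: wildly overestimates.
poor = predict_price(size_sqft, bedrooms, w1=50, w2=0.1, b=50)

# Small parameter on the large-valued feature: matches the actual price.
good = predict_price(size_sqft, bedrooms, w1=0.1, w2=50, b=50)

print(f"w1=50,  w2=0.1 -> {poor:,.1f} thousand dollars")
print(f"w1=0.1, w2=50  -> {good:,.1f} thousand dollars")
```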
This difference in feature ranges has a significant effect on the contour plot of the cost function, drawn with w1 on the horizontal axis and w2 on the vertical axis: the horizontal axis covers a much narrower range of values, while the vertical axis takes on much larger values.
As a result, the contours of the cost function may be elongated and narrow in one direction, because a very small change in w1 has a large impact on the estimated price, while a change in w2 has a much smaller one. On such a surface, gradient descent can bounce back and forth across the narrow valley, which results in slow convergence and, if the number of iterations is limited, a solution that is still far from the minimum. On the other hand, if we apply feature scaling to both features, so that their ranges of values are more similar, the contours will be more circular and gradient descent can take a more direct path to the minimum, converging faster.
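The effect described above can be sketched in code. The example below is illustrative only: the data set is synthetic, the article does not name a specific scaling method (z-score normalization is used here as one common choice), and the learning rates and iteration count are assumptions picked so that both runs are stable. The helper names `zscore`, `cost`, and `gradient_descent` are hypothetical.

```python
import numpy as np

def zscore(X):
    """Z-score normalization: per column, subtract the mean and divide by the std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def cost(X, y, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((X @ w + b - y)^2)."""
    err = X @ w + b - y
    return float(err @ err) / (2 * len(y))

def gradient_descent(X, y, alpha, iters):
    """Batch gradient descent for linear regression, starting from zeros."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * err.mean()
    return w, b

# Synthetic data (an assumption, not from the article): sizes of 300-2000 sq ft,
# 0-5 bedrooms, price in thousands of dollars generated from w1=0.1, w2=50, b=50.
rng = np.random.default_rng(0)
size = rng.uniform(300, 2000, 200)
bedrooms = rng.integers(0, 6, 200).astype(float)
X_raw = np.column_stack([size, bedrooms])
y = 0.1 * size + 50 * bedrooms + 50 + rng.normal(0, 10, 200)

X_scaled = zscore(X_raw)

# Same iteration budget for both runs. The raw features force a tiny
# learning rate, because larger values make the updates diverge.
iters = 1000
w_raw, b_raw = gradient_descent(X_raw, y, alpha=5e-7, iters=iters)
w_scaled, b_scaled = gradient_descent(X_scaled, y, alpha=0.1, iters=iters)

cost_raw = cost(X_raw, y, w_raw, b_raw)
cost_scaled = cost(X_scaled, y, w_scaled, b_scaled)
print(f"cost after {iters} iterations, raw features:    {cost_raw:.2f}")
print(f"cost after {iters} iterations, scaled features: {cost_scaled:.2f}")
```

With the same number of iterations, the run on scaled features should end at a much lower cost than the run on raw features. The exact gap depends on the learning rates chosen here. Note that a model trained on scaled features expects new inputs to be scaled with the same training-set mean and standard deviation before prediction.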