Polynomial Regression
Introduction
Polynomial regression is a statistical technique for modeling the relationship between a dependent variable ($y$) and one or more independent variables ($x$) using a polynomial function. In other words, it is a way of fitting a curve to a set of data points. The general form of the polynomial regression model is: $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n$$ where,
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients of the polynomial terms.
For example, a polynomial of degree 2 is a quadratic equation, and a polynomial of degree 3 is a cubic equation.
Fitting a polynomial regression model
The goal of polynomial regression is to fit the polynomial curve to the data in a way that minimizes the sum of squared differences between the observed and predicted values, i.e. the residual sum of squares (RSS). The RSS is a measure of the error between the predicted values of $y$ and the actual values of $y$.
The equation for simple linear regression (degree 1) is a special case of polynomial regression, where $n = 1$:
$$y = \beta_0 + \beta_1 x$$
The coefficients are typically estimated using methods such as the method of least squares. The model is then used to make predictions based on new values of $x$. It's important to note that while polynomial regression allows for a more flexible fit to the data, it also runs the risk of overfitting, especially with higher-degree polynomials. Overfitting occurs when the model captures noise or fluctuations in the training data, leading to poor generalization to new, unseen data.
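To get a quick feel for what a least-squares polynomial fit produces, NumPy's `np.polyfit` solves exactly this problem for one-dimensional data; a minimal sketch (the x and y values below are made up for illustration, not taken from this post's example):

```python
import numpy as np

# Illustrative data roughly following y = 2 + x + x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 9.2, 16.8, 27.1])

# polyfit returns coefficients from the highest degree down: [beta_2, beta_1, beta_0]
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)

# Evaluate the fitted polynomial at a new point with np.polyval
print(np.polyval(coeffs, 5.0))
```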
Methods for fitting polynomial regression models
There are two main methods for fitting polynomial regression models:
- Least squares: This is the most common method for fitting polynomial regression models. It finds the coefficients that minimize the RSS, either in closed form (via the normal equations) or numerically.
- Regularization: This method can be used to prevent overfitting, which occurs when a model fits the training data too well and does not generalize well to new data. There are several different regularization methods, but the most common is ridge regression.
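As a hedged sketch of the ridge approach (not part of this post's worked example; the degree, `alpha`, and generated data below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + 1.5 * X[:, 0] + 2 + rng.normal(size=100)

# Ridge adds an L2 penalty on the coefficients, shrinking them toward zero;
# this tames the wild coefficients a high-degree polynomial would otherwise produce
model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```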
Interpreting the coefficients
The coefficients of a polynomial regression model can be interpreted in the following way:
- $\beta_0$ is the intercept: the predicted value of $y$ when $x = 0$.
- $\beta_1$ is the slope of the line or curve at the point $(0, \beta_0)$.
- $\beta_2$ is the rate of change of the slope.
- $\beta_3$ is the rate of change of the rate of change.
- ...
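To see where these readings come from, differentiate the quadratic model $y = \beta_0 + \beta_1 x + \beta_2 x^2$:
$$\frac{dy}{dx} = \beta_1 + 2\beta_2 x, \qquad \frac{d^2y}{dx^2} = 2\beta_2$$
so at $x = 0$ the slope is exactly $\beta_1$, and $\beta_2$ (up to the factor of 2) is the rate at which that slope changes.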
Applications of polynomial regression
Polynomial regression has a wide variety of applications, including:
- Predicting sales: Polynomial regression can be used to predict sales based on factors such as price, advertising, and economic conditions.
- Modeling the growth of plants: Polynomial regression can be used to model the growth of plants based on factors such as temperature, sunlight, and nutrients.
- Analyzing financial data: Polynomial regression can be used to analyze financial data to identify trends and patterns.
- Improving the accuracy of machine learning models: Polynomial regression can be used to improve the accuracy of machine learning models by providing them with a more complex and flexible representation of the data.
Example 1
- Importing the libraries:
```python
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```
- Generating/loading the dataset:
In our current example, we generate a random dataset for X and y. Specifically, we have considered the following equation to generate the random data for y:
$$y = \frac{1}{2} x^2 + \frac{3}{2} x + 2 + \text{noise}$$
and hence the Python code is:
```python
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + 1.5 * X + 2 + np.random.randn(100, 1)  # the quadratic equation shown above, plus Gaussian noise
plt.scatter(X, y, color='r')
plt.xlabel("X")
plt.ylabel("y")
plt.show()
```
which gives the data for our example and plots the generated dataset.
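Note that the cells below rely on `X_train`, `X_test`, `y_train`, and `y_test`, which the post does not show being created; presumably they come from a train/test split. A minimal sketch (the 80/20 split and `random_state` are assumptions, not values from the original):

```python
from sklearn.model_selection import train_test_split

# Assumed step: split the generated data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```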
- Simple line: Now let's start with a simple line, which is the case of degree = 1:
```python
## Apply linear regression
from sklearn.linear_model import LinearRegression

regression1 = LinearRegression()
regression1.fit(X_train, y_train)

## Plot the training data and the best-fit line
plt.scatter(X_train, y_train, color='b')
plt.plot(X_train, regression1.predict(X_train), color='r')
plt.xlabel("X_train")
plt.ylabel("y_pred")
plt.show()

from sklearn.metrics import r2_score
score = r2_score(y_test, regression1.predict(X_test))
print(f"The r-squared value for the model is= {score}")
```
which gives:
The r-squared value for the model is= 0.6405513731105184
and the coefficient in this case can be obtained using `regression1.coef_`, which gives `[[1.43280818]]`.
- Quadratic equation:
Next, we are going to use the following equation:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=True)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

from sklearn.metrics import r2_score

regression = LinearRegression()
regression.fit(X_train_poly, y_train)
y_pred = regression.predict(X_test_poly)
score = r2_score(y_test, y_pred)
print(score)
```
which gives the r-squared value:
0.8726125379887142
which shows a significant improvement over the linear fit.
The coefficients in this case can be obtained using `regression.coef_`, which gives `[[0. 1.47171982 0.42463995]]`; these are close to the true values $\beta_1 = 1.5$ and $\beta_2 = 0.5$ used to generate the data (the leading 0 corresponds to the bias column added by `PolynomialFeatures`).
- Cubic:
This gives the r-squared value:
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=True)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

from sklearn.metrics import r2_score

regression = LinearRegression()
regression.fit(X_train_poly, y_train)
y_pred1 = regression.predict(X_test_poly)
score = r2_score(y_test, y_pred1)
print(score)
```
0.8620083765320085
almost the same value as in the degree-2 case.
- Prediction:
For a new dataset:
we will have the following plot:
```python
## Prediction for new data
X_new = np.linspace(-3, 3, 200).reshape(200, 1)
X_new_poly = poly.transform(X_new)
y_new = regression.predict(X_new_poly)

plt.plot(X_new, y_new, "r-", linewidth=2, label="New Prediction")
plt.plot(X_train, y_train, "y.", label="Training points")
plt.plot(X_test, y_test, "b.", label="Testing points")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```
- Creating a pipeline for any degree:
In this case, we first define a generic function and then supply the new dataset for the fitting.
Then we provide the dataset and plot all degrees.
```python
from sklearn.pipeline import Pipeline

def poly_regression(degree, X_new):
    """Fit a polynomial of the specified degree and predict on X_new."""
    poly_features = PolynomialFeatures(degree=degree, include_bias=True)
    lin_reg = LinearRegression()
    pipeline = Pipeline([
        ("poly_features", poly_features),
        ("lin_reg", lin_reg)
    ])
    pipeline.fit(X_train, y_train)
    y_pred_new = pipeline.predict(X_new)
    return y_pred_new
```
```python
# Generate X_new once for all degrees
X_new = np.linspace(-3, 3, 200).reshape(200, 1)

# Plot predictions for degrees 0, 1, 2, 3, 4
for degree in range(5):
    y_pred = poly_regression(degree, X_new)
    plt.plot(X_new, y_pred, label="Degree " + str(degree), linewidth=2)

plt.plot(X_train, y_train, "b.", linewidth=3, label="Training Data")
plt.plot(X_test, y_test, "g.", linewidth=3, label="Test Data")
plt.legend(loc="upper left")
plt.xlabel("X")
plt.ylabel("y")
plt.axis([-4, 4, 0, 10])
plt.show()
```
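If you would rather compare degrees numerically than visually, one option (not in the original post) is to score each pipeline on the held-out test set; a short sketch reusing the same Pipeline pieces:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Report test-set R^2 for each degree; degrees 0-4 mirror the plotting loop above
for degree in range(5):
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=True)),
        ("lin_reg", LinearRegression()),
    ])
    model.fit(X_train, y_train)
    print(f"degree={degree}: test R^2 = {model.score(X_test, y_test):.4f}")
```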
References
- My GitHub repositories on remote sensing and machine learning
- A Visual Introduction to Linear Regression (best reference for theory and visualization).
- Book on Regression model: Regression and Other Stories
- Book on Statistics: The Elements of Statistical Learning
Some other interesting things to know:
- Visit my website for Data, Big Data, data modeling, data warehousing, SQL, and cloud compute.
- Visit my website on Data engineering