Polynomial regression is a statistical technique for modeling the relationship between a dependent variable ($y$) and one or more independent variables ($x$) using a polynomial function. In other words, it is a way of fitting a curve to a set of data points.
The general form of the polynomial regression model is:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n$$
where,
$y$ is the dependent variable,
$x$ is the independent variable,
$\beta_0, \beta_1, \ldots, \beta_n$ are the coefficients of the polynomial terms.
The degree of the polynomial is determined by the highest power of $x$, denoted as $n$.
For example, a polynomial of degree 2 is a quadratic equation, and a polynomial of degree 3 is a cubic equation.
Fitting a polynomial regression model
The goal of polynomial regression is to fit the polynomial curve to the data in a way that minimizes the residual sum of squares (RSS), i.e. the sum of squared differences between the observed and predicted values.
The RSS is a measure of the error between the predicted values of y and the actual values of y.
The equation for simple linear regression (degree 1) is a special case of polynomial regression, where $n = 1$:
$$y = \beta_0 +\beta_1 x$$
The coefficients $\beta_0, \beta_1, \ldots, \beta_n$ are typically estimated using methods such as the method of least squares. The model is then used to make predictions based on new values of $x$.
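As a minimal sketch of least-squares estimation (assuming NumPy is available), `np.polyfit` recovers the coefficients of a polynomial from data; here we use noise-free points from a known quadratic so the fit is exact:

```python
import numpy as np

# Noise-free data from the quadratic y = 2 + 1.5x + 0.5x^2
x = np.linspace(-3, 3, 50)
y = 2 + 1.5 * x + 0.5 * x**2

# Fit a degree-2 polynomial by least squares;
# np.polyfit returns coefficients from highest to lowest power
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # approximately [0.5, 1.5, 2.0]
```

Because the data contain no noise, the least-squares solution reproduces the true coefficients up to floating-point error.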
It's important to note that while polynomial regression allows for a more flexible fit to the data, it also runs the risk of overfitting, especially with higher-degree polynomials. Overfitting occurs when the model captures noise or fluctuations in the training data, leading to poor generalization to new, unseen data.
There are two main methods for fitting polynomial regression models:
Least squares: This is the most common method for fitting polynomial regression models. Because the model is linear in its coefficients, the coefficients that minimize the RSS can be found in closed form via the normal equations.
Regularization: This method can be used to prevent overfitting, which occurs when a model fits the training data too well and does not generalize well to new data. There are several different regularization methods, but the most common is ridge regression.
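A minimal sketch of ridge regularization, assuming scikit-learn is available: we deliberately use a high-degree polynomial on data generated from a quadratic (the same form used later in this article), and let the ridge penalty `alpha` shrink the coefficients to tame overfitting:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = 6 * rng.random((30, 1)) - 3
y = 0.5 * X**2 + 1.5 * X + 2 + rng.standard_normal((30, 1))

# Deliberately over-flexible degree-10 polynomial; the ridge
# penalty (alpha) shrinks the coefficients toward zero
model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))
model.fit(X, y.ravel())
print(model.score(X, y.ravel()))  # R^2 on the training data
```

Increasing `alpha` shrinks the coefficients more aggressively (more bias, less variance); `alpha = 0` recovers ordinary least squares.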
Interpreting the coefficients
The coefficients of a polynomial regression model can be interpreted in the following way:
$\beta_0$ is the predicted value of $y$ when $x = 0$ (the intercept),
$\beta_1$ is the slope of the curve at the point $(0, \beta_0)$,
$\beta_2$ controls the rate of change of the slope (the second derivative at $x = 0$ is $2\beta_2$),
$\beta_3$ controls the rate of change of that curvature,
...
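To make these interpretations concrete, here is a small numerical check on a hypothetical quadratic $y = 2 + 1.5x + 0.5x^2$ (the coefficients are chosen for illustration): finite differences at $x = 0$ recover the slope $\beta_1$ and the second derivative $2\beta_2$.

```python
# Hypothetical quadratic: y = 2 + 1.5x + 0.5x^2
b0, b1, b2 = 2.0, 1.5, 0.5
f = lambda x: b0 + b1 * x + b2 * x**2

h = 1e-4
# Central difference: the slope at x = 0 equals beta_1
slope_at_0 = (f(h) - f(-h)) / (2 * h)
# Second difference: the curvature at x = 0 equals 2 * beta_2
curvature = (f(h) - 2 * f(0) + f(-h)) / h**2
print(slope_at_0, curvature)  # close to 1.5 and 1.0
```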
Applications of polynomial regression
Polynomial regression has a wide variety of applications, including:
Predicting sales: Polynomial regression can be used to predict sales based on factors such as price, advertising, and economic conditions.
Modeling the growth of plants: Polynomial regression can be used to model the growth of plants based on factors such as temperature, sunlight, and nutrients.
Analyzing financial data: Polynomial regression can be used to analyze financial data to identify trends and patterns.
Improving the accuracy of machine learning models: Polynomial regression can be used to improve the accuracy of machine learning models by providing them with a more complex and flexible representation of the data.
Example-1
Importing the libraries:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Generating datasets/ loading the datasets:
In our current example, we generate a random dataset for X and y. Specifically, we use the following equation to generate the data for y:
$$y = \frac{1}{2} x^2 + \frac{3}{2} x + 2 + \text{noise}$$
and hence the Python code is:
np.random.seed(42)  # for reproducibility
X = 6 * np.random.rand(100, 1) - 3
# quadratic relationship with Gaussian noise, as in the equation above
y = 0.5 * X**2 + 1.5 * X + 2 + np.random.randn(100, 1)
plt.scatter(X, y, color='r')
plt.xlabel("X")
plt.ylabel("y")
plt.show()
which generates the data for our example and plots the dataset.
Simple line: Now let's start with a simple line, which is the special case of degree = 1.
## Apply linear regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# split the data so the model can be evaluated on unseen points
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regression1 = LinearRegression()
regression1.fit(X_train, y_train)
## plot the training data and the best-fit line
plt.scatter(X_train, y_train, color='b')
plt.plot(X_train, regression1.predict(X_train), color='r')
plt.xlabel("X_train")
plt.ylabel("y")
plt.show()
from sklearn.metrics import r2_score
score = r2_score(y_test, regression1.predict(X_test))
print(f"The r-squared value for the model is= {score}")
This gave: The r-squared value for the model is= 0.6405513731105184.
The coefficient in this case can be obtained using regression1.coef_, which gives [[1.43280818]].
Quadratic equation:
Next we are going to use the following equation:
$$y = \beta_0 +\beta_1 x +\beta_2 x^2$$
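A sketch of how this quadratic fit can be done (assuming scikit-learn's `PolynomialFeatures`, which expands X into the columns $[x, x^2]$ so an ordinary linear regression can fit the quadratic):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# same data-generating process as above
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + 1.5 * X + 2 + np.random.randn(100, 1)

# expand X into [x, x^2], then fit an ordinary linear regression
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
regression2 = LinearRegression()
regression2.fit(X_poly, y)

print(regression2.intercept_, regression2.coef_)
# estimates should be close to beta_0 = 2, beta_1 = 1.5, beta_2 = 0.5
```

Because the model now matches the true data-generating equation, the estimated coefficients land near the values used to generate the data, and the fit improves markedly over the straight line.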