Introduction

Polynomial regression is a statistical technique for modeling the relationship between a dependent variable (y) and one or more independent variables (x) using a polynomial function. In other words, it is a way of fitting a curve to a set of data points. The general form of the polynomial regression model is: $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n$$ where,
  • y is the dependent variable.
  • x is the independent variable.
  • $\beta_0, \beta_1, \ldots, \beta_n$ are the coefficients of the polynomial terms.
The degree of the polynomial is determined by the highest power of x, denoted as n.

For example, a polynomial of degree 2 is a quadratic equation, and a polynomial of degree 3 is a cubic equation.
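As a small illustration (the coefficient values here are made up, not from any fitted model), evaluating the model just means plugging each value of x into the polynomial:

    import numpy as np

    # hypothetical coefficients: beta_0 = 2, beta_1 = 1.5, beta_2 = 0.5 (a degree-2 model)
    beta = np.array([2.0, 1.5, 0.5])
    x = np.array([0.0, 1.0, 2.0])

    # y = beta_0 + beta_1*x + beta_2*x^2, evaluated element-wise
    y = sum(b * x**i for i, b in enumerate(beta))
    print(y)  # [2. 4. 7.]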

Fitting a polynomial regression model

The goal of polynomial regression is to fit the polynomial curve to the data in a way that minimizes the residual sum of squares (RSS), i.e. the sum of squared differences between the observed and predicted values of y. The RSS is a measure of the error between the predicted values of y and the actual values of y.
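Written out for m data points, with $\hat{y}_i$ denoting the model's prediction for the i-th observation:

$$\text{RSS} = \sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{m}\left(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2 - \cdots - \beta_n x_i^n\right)^2$$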

The equation for a simple linear regression (degree 1) is a special case of polynomial regression, where n =1.

$$y = \beta_0 + \beta_1 x$$ The coefficients $\beta_0, \beta_1, \ldots, \beta_n$ are typically estimated using the method of least squares. The model is then used to make predictions for new values of x.
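As a quick sketch of least-squares estimation (using NumPy's polyfit on made-up, noise-free data purely for illustration; the scikit-learn workflow used later in this article is shown in Example-1):

    import numpy as np

    # toy data following y = 2 + 1.5x + 0.5x^2 exactly, for illustration only
    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
    y = 2 + 1.5 * x + 0.5 * x**2

    # least-squares fit of a degree-2 polynomial; coefficients are returned highest power first
    coeffs = np.polyfit(x, y, deg=2)
    print(coeffs)  # approximately [0.5 1.5 2. ]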

It's important to note that while polynomial regression allows for a more flexible fit to the data, it also runs the risk of overfitting, especially with higher-degree polynomials. Overfitting occurs when the model captures noise or fluctuations in the training data, leading to poor generalization to new, unseen data.

More about overfitting can be found at: Overfitting, underfitting and good fit.

Methods for fitting polynomial regression models

There are two main methods for fitting polynomial regression models:
  • Least squares: This is the most common method for fitting polynomial regression models. It finds the coefficients that minimize the RSS; because the model is linear in its coefficients, this can be done in closed form (e.g. via the normal equations) or with standard numerical solvers.
  • Regularization: This method can be used to prevent overfitting, which occurs when a model fits the training data too well and does not generalize well to new data. There are several regularization methods, but the most common is ridge regression (a minimal sketch follows this list).
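A minimal sketch of ridge-regularized polynomial regression with scikit-learn; the degree, alpha value, and variable names here are illustrative choices, not taken from the example below:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge

    # toy data for illustration (same quadratic shape as used later in Example-1)
    X = 6 * np.random.rand(100, 1) - 3
    y = 0.5 * X**2 + 1.5 * X + 2 + np.random.randn(100, 1)

    # high-degree polynomial with an L2 penalty on the coefficients to curb overfitting
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=10, include_bias=False)),
        ("ridge", Ridge(alpha=1.0)),
    ])
    model.fit(X, y)
    print(model.named_steps["ridge"].coef_)

Larger values of alpha shrink the coefficients more strongly, trading a little bias for lower variance.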

Interpreting the coefficients

The coefficients of a polynomial regression model can be interpreted in the following way (the derivatives below make this precise):
  • $\beta_0$ is the intercept, i.e. the predicted value of y when x = 0.
  • $\beta_1$ is the slope of the curve at x = 0.
  • $\beta_2$ controls the curvature, i.e. the rate of change of the slope (the second derivative at x = 0 is $2\beta_2$).
  • Higher-order coefficients control successively higher-order rates of change.
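These interpretations follow from differentiating the model and evaluating at $x = 0$:

$$\left.\frac{dy}{dx}\right|_{x=0} = \beta_1, \qquad \left.\frac{d^2y}{dx^2}\right|_{x=0} = 2\beta_2, \qquad \left.\frac{d^3y}{dx^3}\right|_{x=0} = 6\beta_3$$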

Applications of polynomial regression

Polynomial regression has a wide variety of applications, including:
  • Predicting sales: Polynomial regression can be used to predict sales based on factors such as price, advertising, and economic conditions.
  • Modeling the growth of plants: Polynomial regression can be used to model the growth of plants based on factors such as temperature, sunlight, and nutrients.
  • Analyzing financial data: Polynomial regression can be used to analyze financial data to identify trends and patterns.
  • Improving the accuracy of machine learning models: Polynomial regression can be used to improve the accuracy of machine learning models by providing them with a more complex and flexible representation of the data.

Example-1

  • Importing the libraries:
    
                        # import libraries
                        import numpy as np 
                        import pandas as pd 
                        import matplotlib.pyplot as plt 
                        %matplotlib inline                    
                    
  • Generating/loading the dataset: In this example, we generate random data for X and y. Specifically, we use the following equation to generate y: $$y = \frac{1}{2} x^2 + \frac{3}{2} x + 2 + \text{noise}$$ and the corresponding Python code is:
    
                    X = 6 * np.random.rand(100,1)-3
                    y = 0.5*X**2 + 1.5*X + 2 + np.random.randn(100,1)
    
                    # quadratic equation is shown above
    
                    plt.scatter(X, y, color='r')
                    plt.xlabel("X")
                    plt.ylabel("y")
                    plt.show()                
                
    which generates the data for our example and plots the resulting dataset.
  • Simple line: Now let's start with a simple straight line, which is the case of degree = 1. The data is first split into training and test sets (a typical 80/20 split is assumed here, since the original split is not shown):
    
                    ## Split the data into training and test sets (80/20; random_state is an arbitrary choice)
                    from sklearn.model_selection import train_test_split
                    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

                    ## Apply linear regression
                    from sklearn.linear_model import LinearRegression
                    regression1 = LinearRegression()
                    regression1.fit(X_train, y_train)

                    ## Plot the training data and the best-fit line
                    plt.scatter(X_train, y_train, color = 'b')
                    plt.plot(X_train, regression1.predict(X_train), color = 'r')
                    plt.xlabel("X_train")
                    plt.ylabel("y_pred")
                    plt.show()

                    from sklearn.metrics import r2_score

                    score = r2_score(y_test, regression1.predict(X_test))
                    print(f"The r-squared value for the model is= {score}")
    which gave: The r-squared value for the model is= 0.6405513731105184.
    The coefficient can be obtained using regression1.coef_, which in this case is [[1.43280818]].
  • Quadratic equation: Next, we use the following equation: $$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
    
                    from sklearn.preprocessing import PolynomialFeatures
                    poly = PolynomialFeatures(degree = 2, include_bias = True)
                    X_train_poly = poly.fit_transform(X_train)
                    X_test_poly = poly.transform(X_test)
    
                    from sklearn.metrics import r2_score
                    regression = LinearRegression()
                    regression.fit(X_train_poly, y_train)
                    y_pred = regression.predict(X_test_poly)
                    score = r2_score(y_test, y_pred)
                    print(score)
                
    which gave an r-squared value of 0.8726125379887142, a significant improvement over the linear fit.
  • The coefficients in this case can be obtained using regression.coef_ = [[0. 1.47171982 0.42463995]].
  • Cubic: The same steps are repeated with degree = 3:
    
                    from sklearn.preprocessing import PolynomialFeatures
                    poly = PolynomialFeatures(degree = 3, include_bias = True)
                    X_train_poly = poly.fit_transform(X_train)
                    X_test_poly = poly.transform(X_test)
    
                    from sklearn.metrics import r2_score
                    regression = LinearRegression()
                    regression.fit(X_train_poly, y_train)
                    y_pred1 = regression.predict(X_test_poly)
                    score = r2_score(y_test, y_pred1)
                    print(score)                
                
    This gave an r-squared value of 0.8620083765320085, almost the same as in the degree-2 case.

    Prediction:

    For a new dataset:
    
                    ## Prediction for new data 
                    X_new = np.linspace(-3,3,200).reshape(200,1)
                    X_new_poly = poly.transform(X_new)
                    y_new = regression.predict(X_new_poly)
                    plt.plot(X_new, y_new, "r-", linewidth = 2, label = "New Prediction")
                    plt.plot(X_train, y_train, "y.", label = "Training points")
                    plt.plot(X_test, y_test, "b.", label ="Testing points")
                    plt.xlabel("X")
                    plt.plot("y")
                    plt.legend()
                    plt.show()                
                
    we get the following plot:
  • Creating a pipeline for any degree: Here we first define a generic function and then supply a new dataset for prediction.
    
                    from sklearn.pipeline import Pipeline
                    def poly_regression(degree, X_new):
                        """Fit a polynomial of the specified degree and predict on X_new."""
                        poly_features = PolynomialFeatures(degree=degree, include_bias=True)
                        lin_reg = LinearRegression()
                        poly_regression = Pipeline([
                            ("poly_features", poly_features),
                            ("lin_reg", lin_reg)
                        ])
                        poly_regression.fit(X_train, y_train)
                        y_pred_new = poly_regression.predict(X_new)
                        return y_pred_new
                
    Then we provide the dataset and plot all the degrees; a sketch comparing test-set R² across degrees follows the plotting code.
    
                    # Generate X_new once for all degrees
                    X_new = np.linspace(-3, 3, 200).reshape(200, 1)
    
                    # Plotting for degrees 0, 1, 2, 3, 4
                    for degree in range(5):
                        y_pred = poly_regression(degree, X_new)
                        plt.plot(X_new, y_pred, label="Degree " + str(degree), linewidth=2)
    
                    plt.plot(X_train, y_train, "b.", linewidth=3, label="Training Data")
                    plt.plot(X_test, y_test, "g.", linewidth=3, label="Test Data")
                    plt.legend(loc="upper left")
                    plt.xlabel("X")
                    plt.ylabel("y")
                    plt.axis([-4, 4, 0, 10])
                    plt.show()
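To connect the pipeline back to the earlier discussion of overfitting, one simple extension (a sketch that reuses the X_train/X_test split from this example) is to compare the test-set R² across degrees; beyond the true degree of the data, the test score typically stops improving and eventually degrades:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # compare test-set R^2 for several degrees (illustrative sketch)
    for degree in range(1, 6):
        model = Pipeline([
            ("poly_features", PolynomialFeatures(degree=degree, include_bias=True)),
            ("lin_reg", LinearRegression()),
        ])
        model.fit(X_train, y_train)
        print(degree, r2_score(y_test, model.predict(X_test)))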
                
