Machine Learning: Linear Regression

Author: Iqbal Hossain

In this example we will learn how to create a linear regression plot and obtain a predicted value from a linear regression model.

Prerequisite: Python 3. I used the Jupyter web console to run the following script, but the default Python 3 command-line console or PyCharm works as well. We need to install a few packages: numpy, scipy, sklearn and matplotlib (the script below imports only numpy, scipy and matplotlib).

First, create a Python file with any name and copy and paste the following code into it.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
1. numpy is used for handling large data sets. It is useful for working with arrays and their dimensions.
2. scipy is useful for scientific calculation. We use the 'stats' submodule of the scipy library.
3. matplotlib is used to draw the plot.

Keep in mind that numpy, scipy and matplotlib belong to the same scientific Python ecosystem (scipy is built on top of numpy), but they are separate packages, so each one has to be imported separately in our .py file.
xArr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
yArr = [3, 3.5, 6.1, 9, 8, 9.5, 13, 18, 17.4, 19.8]
slope, intercept, r_value, p_value, std_err = stats.linregress(xArr, yArr)
print("Coefficient: ", slope, "\nIntercept: ", intercept, "\nR2: ", r_value**2, "\nP Value: %.16f" % float(p_value), "\nStd. Error: ", std_err)

Coefficient:  1.94848484848 
Intercept:  0.0133333333333 
R2:  0.948627993113 
P Value: 0.0000019448646088 
Std. Error:  0.16031248182
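
The values printed above can be cross-checked by hand. For a single predictor, ordinary least squares has a closed-form solution, so the following sketch recomputes the slope, intercept, R2 and standard error of the slope in plain Python. It only illustrates the textbook formulas and is not SciPy's actual implementation; the helper name simple_linear_fit and the variable names are made up for this example.

```python
import math

def simple_linear_fit(xs, ys):
    """Closed-form least squares for one predictor (hypothetical helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Sums of squared deviations and cross-deviations from the means.
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean

    # R2 is the squared Pearson correlation between x and y.
    r_squared = s_xy ** 2 / (s_xx * s_yy)

    # Standard error of the slope, using n - 2 degrees of freedom.
    residual_ss = s_yy - slope * s_xy
    std_err = math.sqrt(residual_ss / (n - 2) / s_xx)
    return slope, intercept, r_squared, std_err

x_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_data = [3, 3.5, 6.1, 9, 8, 9.5, 13, 18, 17.4, 19.8]

fit_slope, fit_intercept, fit_r2, fit_std_err = simple_linear_fit(x_data, y_data)
print("Coefficient:", fit_slope)
print("Intercept:", fit_intercept)
print("R2:", fit_r2)
print("Std. Error:", fit_std_err)
```

The four numbers should agree with the stats.linregress output up to floating-point rounding. The P value is left out because it needs the t-distribution, which is what the scipy.stats module provides.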
The regression script declares two Python variables, xArr and yArr. Both hold lists, and the two lists must have the same length. We then pass xArr and yArr as arguments to the stats.linregress function, which returns slope, intercept, r_value, p_value and std_err. We square r_value to get R2, and we format p_value as a fixed-point number with 16 decimals so that it is easier to read than scientific notation.

Next we reshape xArr into a 2D array and store it in the variable 'x', and we convert yArr into a numpy array stored in 'y'. We use numpy arrays here because they let us compute intercept + slope*x for every x value at once when drawing the fitted line.

We use a scatter plot to draw the observed data. scatter accepts many arguments, but we use five in this example: the x and y arrays plus color, zorder and label. The zorder argument keeps the dots on top of the grid lines used in the plot; without zorder=2 the grid lines may be drawn over the points.

Showing the prediction point is optional and can safely be skipped (see the code comment). To show it, we append the new 'x' value to our x values and then draw the fitted line up to that point. In this example we predict y for x = 15 to see how our linear model behaves.
x = np.reshape(xArr, (-1,1))
y = np.array(yArr)

plt.scatter(x, y, color='black', zorder=2, label='Original data')

# The following 3 lines are optional. They show the predicted point and extend
# the fitted line up to it. Skip them if you don't want to show the predicted
# point. In this example we predict the y value for x = 15.

xArr.append(15)
x = np.reshape(xArr, (-1,1))
plt.plot(15, (intercept+slope*15),'ro', label='Predicted point')

plt.plot(x, intercept + slope*x, linestyle='--', color='b', label='Fitted line')
plt.title("Visualize predicted point in plot")
plt.grid(color='grey', linestyle='--')
plt.xlabel("values of x")
plt.ylabel("values of y")
plt.legend()
plt.show()
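
Two details of the plotting code are worth a closer look: what np.reshape(xArr, (-1,1)) does, and the fact that xArr.append(15) changes the original list, so re-running the cell in Jupyter adds another 15 each time. The sketch below shows the shapes involved and one possible way to build the extended x values without modifying the list. The names x_values, x_column, x_extended and fitted_y are made up for this example, and the slope and intercept are copied from the printed result rather than recomputed.

```python
import numpy as np

x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Assumption: coefficients copied from the stats.linregress output.
slope_reported = 1.94848484848
intercept_reported = 0.0133333333333

# -1 lets numpy infer the number of rows; 1 asks for a single column.
x_column = np.reshape(x_values, (-1, 1))
print(np.array(x_values).shape)  # (10,)
print(x_column.shape)            # (10, 1)

# np.append returns a new array, so the original list keeps its 10 items
# no matter how often this cell is re-run.
x_extended = np.append(x_values, 15)
fitted_y = intercept_reported + slope_reported * x_extended

print(len(x_values))     # 10
print(x_extended.shape)  # (11,)
```

Passing x_extended and fitted_y to plt.plot would draw the same dashed line as in the script, since plt.plot also accepts 1D arrays.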

In the resulting plot, the fitted line is drawn in blue and reaches our predicted point, drawn in red. We set the predicted point at x = 15, so its y value comes from the fitted line. The black points are our observed values, i.e. the real data.

Let's briefly describe our linear model. The linear regression model equation is:

y = β0 + β1x

Here β0 is the intercept. In this model the intercept is 0.0133333333333 (see the result above). β1 is the coefficient of 'x', and its value is 1.94848484848. It means that every one-unit increase in x raises the predicted y by about 1.948. We also observe that the model's R2 value is 0.948627993113, which is close to 1, and that the P value is 0.0000019448646088 < 0.05, so the positive relationship between 'x' and 'y' is statistically significant.

This is a very basic example of using linear regression in Python. To use another data set, just replace the 'x' and 'y' values according to your requirements. Almost all other settings stay the same.
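
As a small illustration of the model equation, the fitted coefficients can be wrapped in a function that predicts y for any x. This is a minimal sketch: the name predict_y and the constants BETA_0 and BETA_1 are invented for this example, and the two values are copied from the result printed earlier.

```python
# Assumption: coefficients copied from the stats.linregress output.
BETA_0 = 0.0133333333333  # intercept
BETA_1 = 1.94848484848    # coefficient of x

def predict_y(x):
    """Evaluate the fitted line y = BETA_0 + BETA_1 * x (hypothetical helper)."""
    return BETA_0 + BETA_1 * x

# Predicted y for the point shown in red on the plot.
print(predict_y(15))  # about 29.24

# Moving x up by one unit changes the prediction by the slope.
print(predict_y(6) - predict_y(5))  # about 1.948
```

Note that x = 15 lies outside the observed range of 1 to 10, so this prediction assumes the same linear trend continues beyond the data.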