Machine Learning: Logistic Regression

Author: Iqbal Hossain

Logistic regression is a statistical algorithm for analyzing a dataset in which one or more independent variables determine a binary outcome. A binary outcome means the response is either True or False; numerically it is 1 or 0, or we can call it Yes or No. The goal of logistic regression is to find the best-fitting model to describe the relationship between the binary characteristic of interest (the dependent variable) and the independent (predictor or explanatory) variables.

In machine learning we can use it to answer questions such as:

- Whether a customer can get a bank loan or not
- Whether a student will pass or fail an exam
- Whether a particular resource can finish a task in time or not
- Whether people are interested in a particular product or not
- Whether a patient is sick or not sick

In this article we will not go deep into data analysis, because there are hundreds of articles on the internet about logistic regression. Instead, I will show a simple procedure for predicting a value with a logistic regression model, so that you can apply it to machine learning tasks.

Suppose we have data for a few students: how many hours each one studied in a week, and whether they passed or failed.
hour_studied = [0.5, 0.75, 1, 1.25, 1.50, 1.75, 1.75, 2, 2.25, 2.50, 2.75, 3, 3.25, 3.50, 4, 4.25, 4.50, 4.75, 5, 5.50]
exam_passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
In the data above we do not have any grade points, only two values: passed = 1 or failed = 0. Let's see how this data can be visualized in Python. I have commented the code as clearly as I could. You need basic knowledge of Python syntax.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Declare variables
hour_studied = [.5, 0.75, 1, 1.25, 1.50, 1.75, 1.75, 2, 2.25, 2.50, 2.75, 3, 3.25, 3.50, 4, 4.25, 4.50, 4.75, 5, 5.50]
exam_passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Put the data in a DataFrame so it is easier to inspect. NumPy arrays would work as well.

data = pd.DataFrame({'Studied': hour_studied, 'Result': exam_passed })
data = data[['Studied', 'Result']]

# Use the LogisticRegression classifier from scikit-learn
clf = LogisticRegression()

# Fit the classifier to the data. data[['Studied']] is the two-dimensional independent variable; data['Result'] is the dependent variable.

clf.fit(data[['Studied']], data['Result'])

# The fitted classifier exposes the coefficient and the intercept. We need both to calculate the probability of a result.

cof = clf.coef_[0][0]
intercept = clf.intercept_[0]
print("Coefficient: ", cof)
print("Intercept: ", intercept)

Output (exact values depend on the scikit-learn version and its default solver):
Coefficient: 0.843658423352
Intercept: -1.39047646944

# Define the logistic (sigmoid) function that turns the linear value a + b*X into a probability

def logit_model(x):
    result = 1 / (1 + np.exp(-x))
    return result

# Draw the plot

plt.clf()
# Black dots: the observed results (0 = fail, 1 = pass)
plt.scatter(data['Studied'], data['Result'], color='black', marker='.')
# Red line: the fitted probability of passing for each number of study hours
logit = logit_model(data['Studied'].to_numpy() * cof + intercept)
plt.plot(data['Studied'], logit, color='red', linewidth=1)
plt.xlabel("Study hours")
plt.ylabel("Probability of passing")
plt.show()
    

Because logistic regression gives two categorical outputs (pass/fail, yes/no, male/female, sick/cured, bad/good), the data points lie only on the lines 0 and 1, meaning 'Fail' or 'Pass'. The red line shows the probability given by the logistic regression model. It was drawn from the logistic model equation, p = 1 / (1 + e^-(a + bX)), where 'a' is the intercept and 'b' is the coefficient of X.
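
To make the probability line more concrete, here is a minimal sketch that evaluates the logistic model p = 1 / (1 + e^-(a + bX)) for a few study times, using the intercept and coefficient printed in the output above. The function name pass_probability and the variable boundary_hours are only illustrative; they are not part of the original program. The sketch also computes the study time at which the model gives exactly p = 0.5, which is where a + bX = 0.

```python
import math

# Values copied from the "Output" printed earlier in this article.
a = -1.39047646944   # intercept
b = 0.843658423352   # coefficient of X (hours studied)

def pass_probability(hours):
    """Logistic model: p = 1 / (1 + e^-(a + b*hours)). Illustrative helper."""
    return 1.0 / (1.0 + math.exp(-(a + b * hours)))

# Hours at which the model gives exactly p = 0.5 (where a + b*X = 0).
boundary_hours = -a / b

for hours in (1, 2, 3, 4, 5):
    print(f"{hours} h -> p = {pass_probability(hours):.3f}")
print(f"p = 0.5 at about {boundary_hours:.2f} hours")
```

Study times below that boundary get a probability under 0.5, and study times above it get a probability over 0.5, which is why the red line crosses the middle of the plot there.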

But our aim is to predict the result for unknown data. We can either use the regression model equation or use the classifier's predict function to get the predicted value.
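# predict expects two-dimensional input: one row per sample, one column per feature
y_predict = clf.predict(pd.DataFrame({'Studied': [7]}))[0]
print("If studied 7 hours: ", y_predict)

Output:
If studied 7 hours: 1

y_predict = clf.predict(pd.DataFrame({'Studied': [1]}))[0]
print("If studied 1 hour: ", y_predict)

Output:
If studied 1 hour: 0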
    
Add these lines to the end of the code above and you will get the predicted result. Here we pass two test values: if one student studies 7 hours and another studies 1 hour, what would their predicted results be? The classifier was trained on the values given at the start of the program and constructed the fitted model; predict then applies the fitted model equation to the new values. In this way you can train your machine or program to predict unknown values.
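
The other route mentioned earlier, using the regression model equation directly, can be sketched without scikit-learn at all. The example below is only an illustration: it copies the coefficient and intercept printed earlier in this article, and the helper name predict_by_hand and its threshold parameter are hypothetical, not part of any library. It labels a student as passing when the computed probability is above 0.5.

```python
import math

# Coefficient and intercept as printed earlier in this article.
cof = 0.843658423352
intercept = -1.39047646944

def predict_by_hand(hours, threshold=0.5):
    """Return (label, probability) from the logistic model equation.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    probability = 1.0 / (1.0 + math.exp(-(hours * cof + intercept)))
    label = 1 if probability > threshold else 0
    return label, probability

for hours in (7, 1):
    label, probability = predict_by_hand(hours)
    print(f"If studied {hours} hours: {label} (p = {probability:.2f})")
```

With these values the labels agree with the classifier output shown above: 1 for 7 hours and 0 for 1 hour. If you want the probabilities from the fitted classifier itself rather than computing them by hand, scikit-learn provides clf.predict_proba for that.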