Linear Regression: An Introduction to Machine Learning



      Over the past two decades, Machine Learning has become one of the mainstays of information technology. The basic idea is to take a dataset D, learn from it while taking certain factors into account, and then predict an output for a given input. The figure below gives an overview of the fields of Machine Learning (but here we only do a really basic linear regression to get into it):


      (Figure: overview of the fields of Machine Learning. From: Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville)

      Example: predicting house prices (non-linear regression)
      Imagine you are thinking about buying an old villa and want to know when, what and where to buy. You could do two different things: either predict the price over time (e.g. how much prices in winter differ from prices in summer), or predict the price of houses in different areas (e.g. cities). We stick with the first option because it is much easier for now.
      Example:
      Just a simple toy function: ypredict(x) = 100 + sin(x) * 5, where x is the month number and the output is the house price in thousands of euros.

      In this (non-linear) regression prediction, the y-axis shows the price of a house of, say, 150 m², and the x-axis shows the months of a year. We can see that the price of a house differs from month to month. If you buy a house in the middle of January you pay roughly 100,000 €, and at the end of April you tend to pay roughly 10,000 € less. So one would likely buy a house around April or May.
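      To make this concrete, here is a minimal sketch in Python (assuming, as above, that x is the month number 1–12 and the output is the price in thousands of euros) that evaluates the toy predictor for every month:

      import numpy as np

      # toy seasonal predictor from the example: price in thousands of euros
      def y_predict(x):
          return 100 + np.sin(x) * 5

      for month in range(1, 13):                # month numbers 1..12
          price_eur = y_predict(month) * 1000   # convert from thousands of euros to euros
          print(f"month {month:2d}: ~{price_eur:,.0f} EUR")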
      But why is this called regression? Well, as said earlier, we get a continuous output for every value we put into our predictor ypredict(x).

      Linear regression:
      What is a linear function?
      In general we have:

      y = a * x + b

      where b is our bias, a is our slope and y is our output value.

      However, in Machine Learning one uses different letters:

      yhat(x) = beta0 + beta1 * x

      The betas are called "beta-hat" (written with a hat over the symbol, because they are estimates). beta0 is still the same bias and beta1 is still the same slope, just with different names. We also change y to yhat because it is a predictor, not an observed value.


      But how would one train a model like this and how can we optimize it?
      We need:
      1. a dataset with two variables,
      2. a training algorithm, i.e. a formula we can minimize analytically (usually we minimize a cost function / error measure),
      3. an error measure that penalizes wrong predictions and tells us how well our model predicts. In general, we look at how much difference there is between the real and the predicted values.


      Our observed dataset contains x values and y values. We later want to predict a y value for any input x.
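      The file tutorial3.dat itself is not shown in this post; any whitespace-separated file with one x value and one y value per line will work with the code further down. As a purely synthetic stand-in (assumed values, just so the script can be tried out), such a file could be generated like this:

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(0)
      x = np.linspace(1, 6, 30)                        # 30 x values between 1 and 6
      y = 2.0 + 1.5 * x + rng.normal(0, 0.5, x.size)   # roughly linear y values with some noise
      # write a space-separated file with no header, matching how it is read below
      pd.DataFrame({"x": x, "y": y}).to_csv("tutorial3.dat", sep=" ", header=False, index=False)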


      Now that we have our dataset, we can move on to the training algorithm.
      We have to calculate our betas, i.e. the bias and the slope.
      We do that by minimizing an error measure, the Residual Sum of Squares (RSS), and solving for the betas:

      RSS(beta0, beta1) = Σ_{i=1}^{N} (y_i - (beta0 + beta1 * x_i))^2

      Keep in mind: N is the size of the dataset and i indexes the rows of the x and y columns above.
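      As a sketch, the RSS can be written directly as a small Python function (assuming x and y are NumPy arrays holding the observed values):

      import numpy as np

      def rss(beta0, beta1, x, y):
          # residual sum of squares over all N data points
          return np.sum((y - (beta0 + beta1 * x)) ** 2)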


      Minimizing the RSS with respect to beta0 and beta1 (setting the partial derivatives to zero) is a least squares problem and yields the closed-form solution:

      beta1 = Σ_{i=1}^{N} (x_i - xbar) * (y_i - ybar) / Σ_{i=1}^{N} (x_i - xbar)^2
      beta0 = ybar - beta1 * xbar
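      In code, these two closed-form expressions translate into a few vectorized NumPy operations; a minimal sketch (again assuming x and y are NumPy arrays of the observations):

      import numpy as np

      def fit_line(x, y):
          # closed-form least-squares estimates for yhat = beta0 + beta1 * x
          xbar, ybar = x.mean(), y.mean()
          beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
          beta0 = ybar - beta1 * xbar
          return beta0, beta1

      The loop in the full script below computes exactly the same numerator and denominator, just row by row.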

      Furthermore, xbar is the mean of all x values, i.e. the sum over all x divided by N (and ybar is the mean of all y values, defined analogously).



      Finally, we use the Mean Squared Error (MSE) to see how good our model is: for each data point we subtract the predicted yhat(x_i) from the real y_i in our dataset, square the difference, sum over all points and divide by N:

      MSE = (1/N) * Σ_{i=1}^{N} (y_i - yhat(x_i))^2
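      As a one-line sketch (with x, y and the fitted beta0, beta1 as above):

      mse = np.mean((y - (beta0 + beta1 * x)) ** 2)   # average squared difference between real and predicted y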

      Since this would be a lot of work to do by hand, we instead use Python to compute all of this.

      Source Code

      1. """
      2. @author: Chris
      3. """
      4. #used for calculating the solution of the predictor
      5. def predictor(beta0, beta1, x):
      6. return beta0 + beta1 * x
      7. import pandas as pd
      8. import numpy as np
      9. import matplotlib.pyplot as plt
      10. #read dataset
      11. dataset = pd.read_csv("tutorial3.dat", sep = " ",
      12. names= ["x", "y"])
      13. #compute ß1
      14. xmean = np.mean(dataset["x"])
      15. ymean = np.mean(dataset["y"])
      16. numerator = 0
      17. denominator = 0
      18. #calcilate numerator and denominator of the formula
      19. for i, row in dataset.iterrows():
      20. numerator += ((row["x"] - xmean)) * (row["y"] - ymean)
      21. denominator += (row["x"] - xmean)**2
      22. #fraction
      23. beta1 = numerator / denominator
      24. #compute ß0 using the formula given
      25. beta0 = ymean - beta1 * xmean
      26. print("Predictor:")
      27. print("yhat = " + str(beta0) + " + " + str(beta1) + "*x")
      28. #plot our data
      29. plt.scatter(dataset["x"],dataset["y"])
      30. #plot predictor function
      31. x = np.linspace(1,6, 100)
      32. y = beta0 + beta1 * x
      33. plt.plot(x,y, color = "red")
      34. plt.show()
      35. #Calculate Mean Squared error using the formula
      36. MSE = 0
      37. for i, row in dataset.iterrows():
      38. MSE += (row["y"] - predictor(beta0, beta1, row["x"])) **2
      39. MSE = MSE / len(dataset)
      40. print("Mean Squared Error:")
      41. print(np.around(MSE,4))
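      As a quick sanity check (not part of the original script), the hand-computed coefficients can be compared against a library fit, for example with np.polyfit; the values should agree up to floating point precision:

      # np.polyfit returns the coefficients highest degree first: [slope, intercept]
      slope, intercept = np.polyfit(dataset["x"], dataset["y"], 1)
      print("np.polyfit: yhat = " + str(intercept) + " + " + str(slope) + "*x")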



      In the end, the script prints the fitted predictor equation and the MSE, and shows a scatter plot of the data together with the fitted regression line in red.


      For this specific dataset we see that linear regression performs quite well.
