Madhavan Mukund



Data Mining and Machine Learning,
Aug-Nov 2017

Assignment 3: Regression

20 November, 2017, due 1 December, 2017



The Task

The "Combined Cycle Power Plant" data set from the UCI Machine Learning Repository contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work with full load. The original URL for the UCI repository is http://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant.

A combined cycle power plant (CCPP) is composed of gas turbines, steam turbines and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another.

Each data point consists of four hourly average ambient variables and the corresponding output:

  • Temperature (T) in the range 1.81°C – 37.11°C,
  • Ambient Pressure (AP) in the range 992.89 – 1033.30 millibar,
  • Relative Humidity (RH) in the range 25.56% – 100.16%,
  • Exhaust Vacuum (V) in the range 25.36 – 81.56 cm Hg,
  • Net hourly electrical energy output (EP) in the range 420.26 – 495.76 MW.

The task is to predict the energy output based on the other parameters.
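
One common way to write this task down is sketched below, assuming an ordinary least-squares formulation. The assignment does not fix the cost function, so the parameter names \theta_0, ..., \theta_4, the number of data points m, and the factor 1/(2m) are illustrative assumptions rather than requirements.

```latex
\widehat{EP} = \theta_0 + \theta_1 T + \theta_2 V + \theta_3 AP + \theta_4 RH,
\qquad
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \widehat{EP}^{(i)} - EP^{(i)} \right)^2
```

Under this formulation, fitting the model means finding the \theta that minimizes J(\theta).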

  • Use the data as it is and build a linear regression predictive model for the output.

  • Normalize the data set so that each of the four feature columns has mean zero, then build a fresh linear regression predictive model. Does the performance on the test set improve?
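
The normalization step in the second item could look roughly like the sketch below. It assumes the features have already been read into a list of rows in the order (T, V, AP, RH). The function name `center_columns` and the sample values are made up for illustration and are not part of the assignment.

```python
def center_columns(rows):
    """Subtract each column's mean so every feature column has mean zero."""
    n = len(rows)
    n_cols = len(rows[0])
    # Mean of each column, computed over all rows.
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    return centered, means

# Made-up feature rows (T, V, AP, RH), for illustration only.
features = [
    [10.0, 40.0, 1010.0, 80.0],
    [20.0, 50.0, 1015.0, 70.0],
    [30.0, 60.0, 1020.0, 60.0],
]
centered, means = center_columns(features)
print(means)
```

If the data is split into training and test sets, a usual choice is to compute the means on the training set only and subtract those same means from the test set.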

Build both models using batch gradient descent as well as stochastic gradient descent. In all cases, note down the number of iterations before you converge. Explain how you fine-tuned the step size for gradient descent. Plot relevant graphs demonstrating the choice of the step size.
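
To make the difference between the two update rules concrete, here is a minimal sketch in plain Python. It assumes a half mean squared error cost and declares convergence when the cost changes by less than a tolerance `tol` between passes. The cost function, the stopping rule, the function names and the toy data are all illustrative assumptions, not something the assignment prescribes.

```python
import random

def predict(theta, x):
    """Linear model: theta[0] is the intercept, theta[1:] the feature weights."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def cost(theta, X, y):
    """Half mean squared error over the whole data set."""
    m = len(X)
    return sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)

def batch_gd(X, y, step, tol=1e-9, max_iters=100000):
    """One update per pass, using the gradient averaged over all points."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    history = [cost(theta, X, y)]
    for it in range(1, max_iters + 1):
        grad = [0.0] * (n + 1)
        for x, yi in zip(X, y):
            err = predict(theta, x) - yi
            grad[0] += err
            for j in range(n):
                grad[j + 1] += err * x[j]
        theta = [t - step * g / m for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
        # Assumed stopping rule: cost barely changed since the last pass.
        if abs(history[-2] - history[-1]) < tol:
            break
    return theta, it, history

def sgd(X, y, step, tol=1e-9, max_epochs=100000, seed=0):
    """One update per data point, visiting the points in random order."""
    rng = random.Random(seed)
    n = len(X[0])
    theta = [0.0] * (n + 1)
    history = [cost(theta, X, y)]
    order = list(range(len(X)))
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(order)
        for i in order:
            err = predict(theta, X[i]) - y[i]
            theta[0] -= step * err
            for j in range(n):
                theta[j + 1] -= step * err * X[i][j]
        history.append(cost(theta, X, y))
        if abs(history[-2] - history[-1]) < tol:
            break
    return theta, epoch, history

# Toy data generated from y = 3 + 2x (not the power plant data).
X_toy = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
y_toy = [-1.0, 1.0, 3.0, 5.0, 7.0]
theta_b, iters_b, hist_b = batch_gd(X_toy, y_toy, step=0.1)
theta_s, epochs_s, hist_s = sgd(X_toy, y_toy, step=0.1)
print("batch:", iters_b, theta_b)
print("sgd:  ", epochs_s, theta_s)
```

Both functions return the cost after every pass in `history`. Plotting that list against the pass number for several values of `step` is one way to produce the requested graphs: a step that is too large shows up as a cost that grows or oscillates, and one that is too small as a very slow decline.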


Solving the Task

  • You can use any programming language. You should not use standard packages or libraries for regression. You must write the regression code yourself. Clearly document any external packages used by your code.

  • Submit via Moodle a single archive (zip, tar.gz, …) containing:

    • The code you used to solve the assignment.

    • A link to the output produced by your code. Do not include the output in this submission. Save it somewhere on the cloud and provide a link.

    • A short write-up describing how your code ran on the data sets, including convergence time, step size for gradient descent and anything else that you feel is relevant.

  • You can work in groups of two or three. Each group makes a single submission to Moodle. Use any one person's Moodle account to submit. The submission should mention the names of the partners.

  • There will be a short oral presentation and question/answer session for each group.