Machine Learning - Unit 3: Correlation and Regression

Overview

Correlation and regression are statistical techniques for quantifying and describing the relationship between variables. Correlation quantifies the strength and type of a relationship, whereas regression expresses the relationship in the form of an equation that enables us to predict values of unseen variables.

My Reflection

This week focused on statistical measures that are essential to the Exploratory Data Analysis (EDA) we explored in the second week, as well as to machine learning fundamentals. Correlation and regression were introduced through a lecturecast and readings, which also shed light on some of their applications, particularly in financial risk measurement and prediction. The readings also included comparisons between correlation and regression, and covered the different types of regression.

Types of regression were also explored through the notebooks attached to the unit's formative activities. The notebooks contained pre-written code implementing linear, multiple linear, and polynomial regression, and we were invited to change parameters to see their effect on the overall output.

Personally, I did not find the activity of running pre-written code with an open-ended invitation to change parameters very useful or effective for understanding the models and algorithms. However, I tried to make the most of it, and recorded a summary of the three algorithms in the Artefacts section below.

I would also like to note that linear and multiple linear regression, as the basic forms of machine learning, were the two machine learning algorithms I was already familiar and experienced with. I had a sufficient theoretical understanding of polynomial regression, but no hands-on experience with it. The unit therefore gave me the chance to take another look at polynomial regression and to give more consideration to non-linear regression models overall.

For the team project, more colleagues joined us on Teams, and we had our first call, in which we explored the assignment and agreed on the team contract. Later in the week, we sent the contract to the module's tutor.

Artefacts: Collaborative Discussion 1: The 4th Industrial Revolution

Summary post

In the third unit, we wrapped up what we started in Unit 1, summarising our initial post along with the peer responses it received.

My summary post

In my initial post, I critiqued the dominant Industry 4.0 approach to attendance and performance management systems, particularly in remote work settings. Drawing on Hartono et al. (2024) and Ali et al. (2022), I noted that most research prioritises technological efficiency and accuracy, often at the expense of employee wellbeing. I argued for a shift toward Industry 5.0’s human-centric paradigm (Nousala, Metcalf and Ing, 2024), where the psychological, ethical, and social impacts of monitoring systems are foregrounded.

Both peer responses strongly supported this shift. Lauretta Oghenevurie emphasised the erosion of trust caused by intrusive monitoring (O’Connell, 2024) and advocated for User-Centred Design (Anshari et al., 2021) and ethical data governance frameworks like GDPR (Mettler and Naous, 2022). She also introduced the concept of “digital boundaries” (Galanxhi and Nah, 2021) to protect mental health in remote environments.

Jordan Speight extended these ideas by stressing participatory design and output-based management. He highlighted the importance of banning off-hours surveillance, offering opt-outs, and using anonymised wellbeing indicators (Mbare et al., 2024). He also proposed feedback loops and worker committees to transform monitoring into a collaborative process.

Therefore, all contributors agreed that Industry 4.0’s efficiency-centric systems risk harming employee wellbeing, that participatory, user-centred design and ethical safeguards are needed, and that Industry 5.0’s vision of balancing technological innovation with human dignity and trust is something to strive for.

Additionally, the peer responses expanded the discussion: Lauretta introduced the legal and psychological dimensions of monitoring, while Jordan added practical policy recommendations and emphasised autonomy and feedback mechanisms.


Reference list


Artefacts: Correlation and Regression Notebooks

Exploring the effect of different values on correlation and regression

In this activity, we were given four Jupyter notebooks with Python scripts covering covariance and Pearson correlation, linear regression, multiple linear regression, and polynomial regression. The task was to explore how changing variables affects the analysis outcomes. My notes were as follows:

Observing the Impact of Data Points on Correlation and Regression

  1. Covariance & Pearson Correlation

    Using randomly generated data, the notebook calculates covariance and Pearson’s correlation coefficient. By changing the data generation parameters (mean, standard deviation, or the relationship between data1 and data2), you can observe how the strength and direction of the correlation change. For example, increasing noise reduces correlation, while making data2 more dependent on data1 increases correlation.

  2. Linear Regression

    This notebook fits a linear regression model to a set of (x, y) data points. The Pearson correlation is also calculated. Modifying the data points (e.g., adding outliers or changing the spread) directly affects the regression line and the correlation value. Strong linear relationships yield higher correlation and a better-fitting regression line.

  3. Multiple Linear Regression

    Here, CO2 emissions are predicted using two features: weight and volume. The coefficients show how each feature impacts the target. Changing the data (e.g., increasing weight or volume) alters the predicted CO2 and the learned coefficients, demonstrating how multiple variables jointly influence the outcome.

  4. Polynomial Regression

    This notebook fits a polynomial curve to car speed data. The r-squared value measures how well the model fits. Adjusting the data points (e.g., making the relationship more nonlinear or adding noise) changes the curve and r-squared, showing the importance of data distribution in model selection.
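The noise experiment from the covariance and Pearson correlation notebook (item 1 above) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the generation parameters and noise levels are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def covariance_and_r(noise_scale):
    """Generate two related series and measure their relationship."""
    data1 = rng.normal(loc=100, scale=20, size=1_000)
    # data2 depends linearly on data1, plus Gaussian noise
    data2 = data1 + rng.normal(loc=50, scale=noise_scale, size=1_000)
    cov = np.cov(data1, data2)[0, 1]
    r = np.corrcoef(data1, data2)[0, 1]
    return cov, r

_, r_low_noise = covariance_and_r(noise_scale=5)
_, r_high_noise = covariance_and_r(noise_scale=60)

# More noise weakens the linear relationship, so r drops noticeably.
print(f"low noise:  r = {r_low_noise:.2f}")
print(f"high noise: r = {r_high_noise:.2f}")
```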
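The outlier experiment from the linear regression notebook (item 2) could be tried along these lines; the data points below are invented for illustration.

```python
from scipy import stats

# Illustrative (x, y) points that follow roughly y = 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.9]

fit = stats.linregress(x, y)
print(f"clean data:   slope = {fit.slope:.2f}, r = {fit.rvalue:.3f}")

# Adding a single extreme outlier pulls the fitted line upwards
# and weakens the correlation at the same time.
fit_out = stats.linregress(x + [9], y + [40.0])
print(f"with outlier: slope = {fit_out.slope:.2f}, r = {fit_out.rvalue:.3f}")
```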
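For the multiple linear regression notebook (item 3), inspecting the coefficients could look like the sketch below. The car data (weight in kg, engine volume in cm^3, CO2 in g/km) is invented, not the notebook's dataset.

```python
from sklearn.linear_model import LinearRegression

# Invented car data: [weight (kg), engine volume (cm^3)] -> CO2 (g/km).
X = [[790, 1000], [1160, 1200], [929, 1000], [865, 900], [1140, 1500],
     [929, 1100], [1109, 1400], [1365, 1500], [1112, 1300], [1150, 1600]]
y = [99, 95, 95, 90, 105, 105, 92, 115, 105, 104]

model = LinearRegression().fit(X, y)

# One coefficient per feature: the change in predicted CO2 per unit
# increase in that feature, holding the other feature fixed.
print("coefficients (weight, volume):", model.coef_)
print("predicted CO2 for a 1300 kg, 1300 cm^3 car:",
      model.predict([[1300, 1300]])[0])
```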
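Finally, the polynomial fit (item 4) can be reproduced in outline with numpy's polyfit; the hour-of-day vs. speed values below are illustrative stand-ins for the notebook's data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative data: hour of day vs. observed car speed (U-shaped).
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# Fit a cubic polynomial and measure goodness of fit with r-squared.
cubic = np.poly1d(np.polyfit(x, y, 3))
r2_cubic = r2_score(y, cubic(x))
print(f"cubic r-squared:  {r2_cubic:.3f}")

# A straight line (degree 1) cannot capture the U-shape, so its
# r-squared is far lower: model choice must match the data's shape.
line = np.poly1d(np.polyfit(x, y, 1))
r2_line = r2_score(y, line(x))
print(f"linear r-squared: {r2_line:.3f}")
```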

Conclusion

Across all notebooks, modifying data points (by adding noise, outliers, or changing relationships) impacts correlation coefficients, regression lines/curves, and model fit metrics. This demonstrates the sensitivity of statistical and machine learning models to the underlying data, highlighting the importance of data quality and distribution in analysis and prediction.