# Will you regress this?

Simran Thadhani

In order to understand when to use regression analysis, we must first understand what exactly it does. Here’s a simple answer that pops up when you Google its use:

*Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.* Don’t worry if any of these terms sound unfamiliar. This article will help you understand their meaning, application and interpretation. And to do all of this, we’ll take a simple example.

We live in a very dynamic, progressive world. However, there is an ongoing debate about pay gaps based on gender. It's overwhelming to see women fighting for equal rights even today. Let’s say that you’re a neutral party here. You aren’t a part of this movement because you don’t really understand what the fuss is about. Maybe that is because you haven’t faced or witnessed this yet. That’s fair. However, you really want to understand whether gender does play a role in determining your pay. So you collect data on the compensation packages of a set of people. This, by the way, is your *dependent variable*, since you’re determining what it really *depends* on. Going ahead, you also collect data on the education, work experience and gender of these individuals. There you go, you have your *independent variables*. Remember that the independent variables (education, work experience and gender) should not be strongly correlated with one another. If they are, the model cannot cleanly separate the effect of one variable from that of another, so it is worth checking this condition before relying on the results.
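One way to check the independent variables against each other is to look at their pairwise correlations. The sketch below is only an illustration: the data is randomly generated rather than collected, and the variable names and value ranges are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Made-up sample of independent variables (not real survey data).
education = rng.integers(10, 21, size=n)   # years of education
experience = rng.integers(0, 31, size=n)   # years of work experience
gender = rng.integers(0, 2, size=n)        # 0/1 code for two groups

# Each row of the stacked array is one variable;
# np.corrcoef returns the matrix of pairwise correlations.
corr = np.corrcoef(np.vstack([education, experience, gender]))
print(np.round(corr, 2))
```

The diagonal is always 1 (each variable against itself). Off-diagonal values close to 0 suggest the variables carry largely separate information, while values close to +1 or -1 are a warning sign.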

Now, we make a regression model with the data available. How does an equation help us get a deeper understanding of the issue at hand? It's actually the coefficients that help us understand the impacts better.

My dependent variable gets its value from the independent ones. Therefore, if I had to put this mathematically, I’d put it in the following manner:

**Compensation package = A base value + Education + Experience + Gender**

What regression does is attach values that show the impact of each of these variables on the dependent variable (compensation package). So our result would look like:

**Compensation package (Y) = β0 + β1 (Education) + β2 (Experience) + β3 (Gender)**

Here, β0 is the base value, and each of the other β values describes the impact its variable has on Y, i.e. your compensation package.
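To see where the β values come from, here is a minimal sketch that fits such a model with ordinary least squares using NumPy. Everything about the data is an assumption for illustration: the numbers are synthetic, the "true" relationship used to generate them is invented, and gender is limited to a 0/1 code to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up data: these are not real salary figures.
education = rng.integers(10, 21, size=n)    # years of education
experience = rng.integers(0, 31, size=n)    # years of work experience
gender = rng.integers(0, 2, size=n)         # 0 = Female, 1 = Male (two groups only)

# Invented relationship, used only to generate the toy data.
noise = rng.normal(0, 1000, size=n)
compensation = 30000 + 2000 * education + 1500 * experience + 5000 * gender + noise

# Design matrix: a column of ones for β0, then one column per independent variable.
X = np.column_stack([np.ones(n), education, experience, gender])

# Ordinary least squares: the β values that minimise the squared prediction error.
beta, *_ = np.linalg.lstsq(X, compensation, rcond=None)

labels = ["β0 (base)", "β1 (education)", "β2 (experience)", "β3 (gender)"]
for name, b in zip(labels, beta):
    print(f"{name}: {b:.1f}")
```

Because the data was generated from known values, the fitted coefficients should land near 2000, 1500 and 5000 for education, experience and gender. With real data there are no known values to compare against, which is exactly why the fitted coefficients are what we interpret.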

It's very important that we interpret the β values correctly, so as to understand our data and the bigger picture better. Here’s the general way of reading the results:

For a unit change in an independent variable, with the other variables held fixed, Y changes by that variable's β units. So, in our example, we can say that, given that education and experience are fixed, a one-unit change in gender changes compensation by β3 units.

To verify this and take a stance, we can plug in the same values for education and experience, to see whether people with the same education and experience are paid differently. The only differing input value here is gender. Since we can’t input words like “Male”, “Female” and “Other” in our model, we assign numerical values to them.

0=Female

1=Male

2=Other

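If my output, Y (compensation package), differs for the same education and experience, then we can say that gender has a role to play in determining an individual’s compensation package. Further, it's important to note the sign of the coefficient. A + sign denotes a direct relation, i.e. as the variable increases by a unit, the Y value increases; a − sign denotes an inverse relation, i.e. as the variable increases by a unit, the Y value decreases. One caution on the coding above: a single 0/1/2 code treats the categories as if they were ordered and equally spaced, so with more than two categories it is common practice to give each category its own 0/1 indicator (dummy) variable, leaving one category out as the baseline.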
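The comparison described above can be sketched in a few lines. The coefficients here are hypothetical numbers chosen purely for illustration, not results from any real dataset, and the function name is made up for this example.

```python
# Hypothetical coefficients, invented for illustration only.
beta0, beta1, beta2, beta3 = 30000.0, 2000.0, 1500.0, 5000.0

def predicted_compensation(education, experience, gender):
    """Evaluate Y = β0 + β1*Education + β2*Experience + β3*Gender."""
    return beta0 + beta1 * education + beta2 * experience + beta3 * gender

# Same education and experience; only the gender code differs (0 vs 1).
y_gender_0 = predicted_compensation(education=16, experience=5, gender=0)
y_gender_1 = predicted_compensation(education=16, experience=5, gender=1)

gap = y_gender_1 - y_gender_0
print(y_gender_0, y_gender_1, gap)  # 69500.0 74500.0 5000.0
```

The gap equals β3 exactly, because everything else in the equation cancels out. If β3 were negative, the gap would flip sign, and if β3 were zero, the two predictions would be identical.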

Now for the second part of the Google result above, which read, “*If the dependent variable is dichotomous, then logistic regression should be used.*” Here we move away from the gender-based pay gap to a new example. Imagine that you’re a bank and want to analyse which applicants are likely to default on loan repayments. A credit score already exists, right? So what are we talking about now? The credit score is only one of the many parameters taken into account to determine an applicant’s creditworthiness. The bank needs a model that simply tells it how likely an individual is to default. In this model, the dependent variable has only 2 possible outcomes, 0 and 1: zero indicates no default, while one indicates a default. This is the *dichotomous* dependent variable that the result talks about. A model built for such a variable is a *logistic regression* model; it estimates the probability that the outcome is 1.
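As a rough sketch of the loan example, the code below fits a logistic regression with a single predictor using plain gradient descent in NumPy. The applicants are synthetic, the rule that generates defaults is invented, and the learning rate and iteration count are arbitrary choices that happen to work for this toy data. A real project would more likely rely on a statistics or machine learning library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic applicants: a standardised credit score (mean 0, spread 1). Made-up data.
credit_score = rng.normal(0, 1, size=n)

# Invented rule for generating outcomes: lower scores default more often.
true_prob = 1 / (1 + np.exp(-(-1.0 - 2.0 * credit_score)))
default = (rng.random(n) < true_prob).astype(float)   # 1 = default, 0 = no default

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Column of ones for β0, then the credit score.
X = np.column_stack([np.ones(n), credit_score])
beta = np.zeros(2)

# Plain gradient descent on the average log-loss.
for _ in range(5000):
    p = sigmoid(X @ beta)
    gradient = X.T @ (p - default) / n
    beta -= 0.5 * gradient

def default_probability(score):
    """Estimated probability that an applicant with this score defaults."""
    return sigmoid(beta[0] + beta[1] * score)

low_score_risk = default_probability(-1.0)
high_score_risk = default_probability(1.0)
print(low_score_risk, high_score_risk)

# Turning a probability into a 0/1 decision needs a cut-off; 0.5 is a common default.
predicted_class = int(low_score_risk >= 0.5)
```

The model's raw output is a probability between 0 and 1. The 0 or 1 label only appears once a cut-off is applied, and a bank could reasonably pick a stricter or looser cut-off than 0.5.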

Depending on what you want to achieve with your model, you can choose the type of regression model to build. Logistic regression is used to classify items, and hence is often called a classifier. However, estimating continuous quantities such as income, prices or GDP growth rates needs a linear regression model.