Understanding the need for Mathematics

Baivab Dash
7 min read · Jan 12, 2022

Recently I have been asked questions like "I am not very good at mathematics. Can I break into the data science domain?" or "Is maths an absolute must for data science?". I can very well understand the reason behind such questions; I had similar fears at one point. This can be attributed to our academic curriculum focusing on just the theory. The theory is very important, but I think understanding the practical applications allows us to appreciate the theory better. I believe that every education must begin with a thorough understanding of the need for it. This generates curiosity and respect for the subject, which helps learners stick with it when tackling advanced concepts. In this article, I will go through a few example use cases from various fields of mathematics that helped me develop curiosity for each subject.

Linear Algebra

Imagine you are in a posh neighborhood looking to buy a house. Based on its features, the cost of the house will vary. For simplicity's sake, let the price of the house depend only on the lot size. You manage to get this data for a couple of houses and plot it on a 2-D graph.

Now, if you were asked to predict the price of a new house, you could do that simply by fitting a line through the points.

Mathematically, the equation of the above line is y = m*x + c, where

m: the slope of the line

c: the point where the line cuts the y-axis
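
As a quick illustration, here is a minimal sketch of fitting such a line with NumPy. The lot sizes and prices below are made-up numbers, assumed purely for demonstration.

```python
import numpy as np

# Synthetic (lot size, price) data, assumed for illustration only
lot_size = np.array([500.0, 750.0, 1000.0, 1250.0, 1500.0])
price = np.array([100.0, 140.0, 190.0, 230.0, 280.0])  # say, in $1000s

# Least-squares fit of a degree-1 polynomial, i.e. a straight line y = m*x + c
m, c = np.polyfit(lot_size, price, deg=1)

# Predict the price of a house with a lot size of 1100
predicted_price = m * 1100 + c
print(m, c, predicted_price)
```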

But in reality, the price of a house depends on a lot of different factors, like the number of bedrooms, the neighborhood, the year it was built, the garage area, the number of bathrooms, etc. If we consider that many features, we cannot visualize them in a 2-D plot. This is where linear algebra comes in. Consider the table below.

Here, instead of one, we have 11 features on which the SalePrice of the house depends. In 2-D we fit a line through the points, but here we are in 11 dimensions, and our minds cannot visualize anything beyond 3-D. Linear algebra allows us to scale rules that apply in lower dimensions to higher dimensions. In 2-D our equation of the line was y = m*x + c. This can also be written as

y = w1*x + w0, where w1 = m and w0 = c

In our 11-D example, the same equation looks like

y = w0 + w1*x1 + w2*x2 + … + w10*x10 + w11*x11

where [x1, x2, …, x11]: features of the house on which the selling price depends

[w0, w1, …, w11]: weights for each feature, similar to the slope in 2-D (more on this in the next section)

As you can see, writing such lengthy equations is a bit cumbersome. Now imagine we have to predict the prices of hundreds of houses. Thankfully, linear algebra gives us matrix and vector operations: if we stack each house's features as a row of a matrix X and collect the weights into a vector w, all the predictions can be computed at once as y = X*w.

Not only does this make the calculations much more manageable, it is also computationally faster.

In machine learning, we usually have datasets that have hundreds of dimensions. To operate on any of these we require linear algebra.
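
Here is a small sketch of that idea in NumPy. The feature values and weights are random placeholders, assumed just to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_houses, n_features = 100, 11

# Feature matrix: one row per house, one column per feature (placeholder values)
X = rng.random((n_houses, n_features))
# Prepend a column of ones so that w0 acts as the intercept
X = np.hstack([np.ones((n_houses, 1)), X])

# Weight vector [w0, w1, ..., w11] (placeholder values)
w = rng.random(n_features + 1)

# One matrix-vector product predicts the prices of all 100 houses at once
y = X @ w
print(y.shape)  # (100,)
```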

Differential Calculus

In the previous section, we discussed fitting a line through the points, but not how to do that. Here I will shed some light on it. Let's return to our 2-D example, since it's much easier to explain that way, and try to imagine how that line should be.

Looking at the points, anyone can see that no line will pass through all of them perfectly. So instead, we will try to find the best possible fit. One natural way of thinking about it: choose the line such that the total squared distance between the points and the line is as small as possible.

If we frame the above statement as a mathematical problem, what we get is an optimization objective that looks like

minimize over (w1, w0): Σ_i ( y_i − (w1*x_i + w0) )²

The scary-looking equation above says: find w1 and w0 such that the difference between the actual price and the predicted price is as small as possible. This difference is called the loss, and we want the loss to be minimal.

To solve the above problem, we will use a widely used optimization algorithm called gradient descent. Here we assume that there exists a value of the weights for which the loss is minimal. If we plot the loss with respect to the weights, we might see a bowl-shaped curve with a single lowest point.

To begin with, we assign random values to the weights; in that case, the loss will be high. To reduce the loss, we differentiate the loss function with respect to the weights and subtract the result from the weights to get the new weights, i.e.

w_new = w_old − α * (∂Loss/∂w)

where α is a small step size called the learning rate.

Let's understand the reason for differentiating. The derivative of the loss with respect to a weight is the slope of the loss curve at the current weight: if it is positive, increasing the weight increases the loss, so we should decrease the weight; if it is negative, we should increase it. Subtracting the derivative does exactly that.

So, as you can see, differentiating the loss function points us in the correct direction, toward where the loss is minimal.
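
Here is a minimal gradient-descent sketch for the 2-D example, using made-up data and an assumed learning rate, just to show the update rule in action.

```python
import numpy as np

# Made-up (lot size, price) data, scaled down for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([15.0, 18.0, 24.0, 27.0, 32.0])

w1, w0 = 0.0, 0.0   # initial weights
lr = 0.01           # learning rate (step size), assumed

for _ in range(5000):
    y_pred = w1 * x + w0
    error = y_pred - y
    # Derivatives of the mean squared loss with respect to w1 and w0
    grad_w1 = 2 * np.mean(error * x)
    grad_w0 = 2 * np.mean(error)
    # Step against the gradient: subtract the derivative scaled by lr
    w1 -= lr * grad_w1
    w0 -= lr * grad_w0

print(w1, w0)  # converges to the least-squares line (about 4.3 and 10.3 here)
```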

Probability

Probability might be the least intuitive field in mathematics. This can be attributed to our way of thinking: we are used to thinking deterministically. Something is either 0 or 1, good or bad, black or white. But most of the time, things lie somewhere in the middle of the spectrum. I will explain with the help of an example.

Suppose one fine morning you wake up and don't feel very well. You go for a check-up, and after running some tests, the doctor says you have a rare disease that affects nearly 0.1% of the population. You get worried and ask how accurate the test is, to which he replies: it correctly identifies 99% of the people who have the disease, and incorrectly flags 1% of the people who don't.

Immediately you jump to the conclusion that you actually have the disease, but that might not be the case. If we just go by the result of the test, we are not taking into consideration one major factor: how prevalent the disease is. In such a case, we need to apply Bayes' theorem to calculate the probability of having the disease given that the test came back positive. Let

A: Event of you having the disease.

~A: Event of you not having the disease.

B: Event of you testing +ve.

Now, according to Bayes' theorem,

P(A|B) = (P(B|A) * P(A)) / P(B)

P(A|B): Probability of you having the disease given that the test came back positive, also called the posterior probability.

P(B|A): Probability of getting a +ve test result if you have the disease. Here in our case, it is 0.99.

P(A): Prevalence, or prior. It is our prior knowledge before taking the test, i.e. how prevalent the disease is, which in our case is 0.1% (0.001).

P(B): Your test can come back +ve in two cases: one, you have the disease; two, a false positive. So, P(B) = P(B|A)*P(A) + P(B|~A)*P(~A).

So plugging in our values,

P(A|B) = (0.99 * 0.001) / (0.99 * 0.001 + 0.01 * 0.999) ≈ 0.0902 ≈ 9%
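
A quick sketch to verify the arithmetic (the function name is just an illustrative choice):

```python
def posterior(prior, tpr=0.99, fpr=0.01):
    """P(disease | positive test) via Bayes' theorem."""
    p_positive = tpr * prior + fpr * (1 - prior)  # P(B): total probability of a +ve test
    return tpr * prior / p_positive

print(posterior(0.001))  # ~0.0902, i.e. about a 9% chance
```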

So it turns out there is only about a 9% chance you have the disease. That is surprisingly low, given that the test came back positive, and for a 99% accurate test at that. But why is it so unintuitive? Actually, it's not. Let us have a look.

Before taking the test, if someone had asked you your chances of having that rare disease, you might have looked up the stats and answered 0.1%, or, since that's so low, you might have replied that there is no chance. But when you took the test, you got new evidence on the subject, and you let this new evidence completely change your perspective without any regard for your prior knowledge. Bayes' theorem says that new evidence should update your prior knowledge, not replace it.

(One thing I would like to note here is that the 0.1% prior is itself subject to change based on other evidence, like symptoms of the disease, genetic conditions, or other factors.)

Now, if you were to take the test again and it came back positive, the posterior we calculated earlier would become the prior. Recalculating the probability of having the disease given that another test came back positive:

P(A|B) = (0.99 * 0.09) / (0.99 * 0.09 + 0.01 * 0.91) ≈ 0.907 ≈ 90%

It makes sense: now that two test results are saying so, it is very likely that you have the disease.
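
Reapplying the same Bayes calculation with the old posterior as the new prior (a self-contained sketch; note that P(~A) is now 1 − 0.09 = 0.91):

```python
# Second positive test: the previous posterior (~0.09) becomes the new prior
prior = 0.09
p_positive = 0.99 * prior + 0.01 * (1 - prior)  # P(B) with the updated prior
posterior = 0.99 * prior / p_positive
print(posterior)  # ~0.907, i.e. about a 90% chance
```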

On a final note, I hope that I was able to generate some curiosity for these subjects.

Have a nice day…
