I would like to begin by sharing that I am not a data scientist, so I am definitely not an expert. If you read this article with cautious eyes, I understand. However, I am someone who has studied machine learning and deep learning. I studied machine learning during my undergraduate degree, earning the equivalent of an A. I also studied deep learning and an introduction to statistics during my master’s, doing decently well (not an A).

In 2017, I wasn’t sure about a career, and I desperately needed a job. That meant I applied across roles and studied for various kinds of jobs. One of those careers was data science. I had fair experience building models in data science and building Kaggle projects. However, I had a hard time studying statistics. There were myriad concepts and the level only seemed to get tougher. The introduction to stats class in my master’s was hardly an introduction but rather a deep dive. Having struggled through the semester and begun my job search, I realized a fundamental thing: getting an A in a course and preparing for interviews are two entirely different things. In fact, interview preparation is a skill in itself.

With that thought in my head, I knew I needed a definite guide to prepare for stats questions if they came up in an interview. At the end of the semester, our professor shared the 12 concepts that were most important. I noted them down and refined them further using those notes and existing resources. I am sharing them below, and I hope they are helpful to you:
What is the Central Limit Theorem? Can you explain it with an example?
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size grows large, provided the samples are independent and identically distributed, regardless of the original population’s distribution.
The CLT can be applied at any company with a large amount of data. Consider a company like Uber or Lyft that wants to use hypothesis testing to check whether adding a new feature will increase booked rides. If we have a large number of individual rides, where each ride X is a Bernoulli random variable (the rider either books the ride or does not), we can estimate the statistical properties of the total number of bookings. Understanding and estimating these statistical properties plays a significant role in applying hypothesis testing to your data and knowing whether adding a new feature increases the number of booked rides or not.
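To make this concrete, here is a minimal sketch in Python (the booking probability and sample sizes are made-up numbers for illustration, not real ride-sharing data). It shows that the mean of many Bernoulli “book or not” outcomes is approximately normally distributed, as the CLT predicts:

```python
import numpy as np

# Hypothetical setup: each ride request is booked with probability p (made-up value)
p = 0.3            # assumed booking probability
n_rides = 1_000    # rides per sample
n_samples = 5_000  # number of repeated samples

rng = np.random.default_rng(42)

# Each row is one sample of n_rides Bernoulli outcomes (1 = booked, 0 = not booked)
bookings = rng.binomial(n=1, p=p, size=(n_samples, n_rides))

# Booking rate (sample mean) for each sample
sample_means = bookings.mean(axis=1)

# The CLT says these sample means are approximately Normal(p, sqrt(p*(1-p)/n_rides))
print("Empirical mean of sample means:", sample_means.mean())
print("CLT-predicted mean:            ", p)
print("Empirical std of sample means: ", sample_means.std())
print("CLT-predicted std:             ", np.sqrt(p * (1 - p) / n_rides))
```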
What are conditional probability and Bayes’ Theorem?
For any two events A and B, P(A|B) represents the conditional probability of event A occurring given that event B has already occurred. The formula for conditional probability is:

P(A|B) = P(A ∩ B) / P(B)

Continuing from conditional probability, revising our prior probability of an event when new information becomes available is a crucial step, and that is where Bayes’ Theorem becomes useful. Bayes’ Theorem can be summed up in the following equation:

P(A|B) = P(B|A) × P(A) / P(B)
In this equation, A is an event and B is the empirical evidence or information received from the data. So P(A) is the prior probability of event A, P(B) is the probability of event B based on the evidence from the data, and P(B|A) is known as the likelihood. Bayes’ Theorem therefore gives us the probability of an event based on our prior knowledge about it, and updates that probability when we receive new information.
A very easy example of Bayes’ Theorem is predicting the probability of rain on a particular day given that the morning was cloudy. Suppose the probability of rain, P(Rain), is 10% on a day in June, and the probability that the morning was cloudy given that it rained, P(Cloud|Rain), is 50%. Additionally, the probability of a cloudy morning on any day in June, P(Cloud), is 40%. Applying Bayes’ Theorem, the probability that it will rain today given that it was cloudy in the morning is: P(Rain|Cloud) = P(Cloud|Rain) × P(Rain) / P(Cloud) = (0.5 × 0.1) / 0.4 = 0.125, i.e. 12.5%.
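The same calculation as a tiny Python sketch, using the made-up probabilities from the example:

```python
# Bayes' Theorem: P(Rain | Cloud) = P(Cloud | Rain) * P(Rain) / P(Cloud)
p_rain = 0.10              # prior probability of rain on a June day
p_cloud_given_rain = 0.50  # likelihood: cloudy morning given that it rained
p_cloud = 0.40             # evidence: probability of a cloudy morning in June

p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print(f"P(Rain | Cloud) = {p_rain_given_cloud:.3f}")  # 0.125, i.e. 12.5%
```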
What are standard deviation and variance, and what is skewness?
Variance and standard deviation both measure the dispersion or spread of a dataset. Variance is the average of the squared differences from the mean. It gives a sense of how much the values in a dataset differ from the mean. However, because it uses squared differences, the units are squared as well, which can be less intuitive than the standard deviation. Standard deviation is the square root of the variance, bringing the units back to the same as the original data. It provides a more interpretable measure of spread. For example, if the variance of a dataset is 25, the standard deviation is √25 = 5.
Skewness measures the asymmetry of a dataset around its mean, which can be positive, negative, or zero. Data with positive skewness, or right-skewed data, has a longer right tail, meaning the mean is greater than the median. Data with negative skewness, or left-skewed data, has a longer left tail, meaning the mean is less than the median. Zero skewness indicates a symmetric distribution, like a normal distribution, where the mean, median, and mode are equal.
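A minimal sketch of how these three quantities could be computed with NumPy and SciPy, on a small made-up dataset:

```python
import numpy as np
from scipy import stats

# Small made-up dataset for illustration
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

variance = data.var()        # average squared deviation from the mean
std_dev = data.std()         # square root of the variance, in the original units
skewness = stats.skew(data)  # asymmetry of the data around its mean

print("Variance:          ", variance)  # 4.0, so the standard deviation is sqrt(4.0) = 2.0
print("Standard deviation:", std_dev)   # 2.0
print("Skewness:          ", skewness)  # positive here: a longer right tail
```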
What is the normal distribution? What is the Poisson distribution? What is the binomial distribution?
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution characterized by its bell-shaped curve, which is symmetric about the mean. With normal distributions, the mean is, therefore, equal to the median. Also, it’s known that about 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. This is known as the 68-95-99.7 Rule.
The binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. It is used when there are exactly two possible outcomes (success and failure) for each trial. For example, it can be used to model the number of heads in a series of coin flips.
The Poisson distribution is a discrete probability distribution that models the number of events occurring within a fixed interval of time or space, where the events occur independently and at a constant average rate. It is appropriate to use when you want to model the count of rare events, such as the number of emails received in an hour or the number of earthquakes in a year.
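As a quick sketch (the rates and sample sizes are arbitrary choices for illustration), we can draw samples from all three distributions with NumPy and check the properties described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal: check the 68-95-99.7 rule empirically on a standard normal sample
normal = rng.normal(loc=0, scale=1, size=100_000)
for k in (1, 2, 3):
    within = np.mean(np.abs(normal) <= k)
    print(f"Share within {k} standard deviation(s) of the mean: {within:.3f}")

# Binomial: number of heads in 10 fair coin flips, repeated many times
heads = rng.binomial(n=10, p=0.5, size=100_000)
print("Average number of heads (expect about 5):", heads.mean())

# Poisson: emails received per hour, with an assumed average rate of 3
emails = rng.poisson(lam=3, size=100_000)
print("Average emails per hour (expect about 3):", emails.mean())
```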
What is A/B testing? What are some pitfalls with A/B testing?
A/B testing is a way to compare two versions of something (like a webpage, app feature, or product offering) to see which one performs better. You make a change and then test if that change really makes a difference—or if things just happened by chance.
How It Works
You split your audience into at least two groups:
Control group: sees the original version (Version A)
Treatment group: sees the new or changed version (Version B)
Then you track how each group behaves. For example, do more people click “Buy” on Version B than Version A?
Example: Website Landing Page
You design two versions of a website’s landing page. You want more people to sign up (your success metric is the conversion rate).
Null hypothesis (H0): There’s no difference between the two versions.
Alternative hypothesis (H1): One version does better.
Randomly show one version to some visitors, and the other version to others. After enough people visit, you use stats to check: are the results really different, or is it just random?
Statistical Testing
To be confident in your results, you run a 2-sample t-test. It tells you if the difference is big enough to matter statistically.
p-value < alpha (like 0.05) → Your result is statistically significant.
That means the difference you observed would be unlikely to have occurred by chance alone if the two versions actually performed the same.
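For concreteness, here is a minimal sketch of such a 2-sample t-test in Python, using made-up conversion data (the sample sizes and conversion rates are invented for illustration, not from any real experiment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Made-up data: 1 = visitor signed up, 0 = visitor did not
control = rng.binomial(1, 0.10, size=5_000)    # Version A, roughly 10% conversion
treatment = rng.binomial(1, 0.12, size=5_000)  # Version B, roughly 12% conversion

# Two-sample t-test on the sign-up indicators
t_stat, p_value = stats.ttest_ind(treatment, control)

alpha = 0.05
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < alpha:
    print("The difference between the two versions is statistically significant.")
else:
    print("No statistically significant difference was detected.")
```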
Things That Can Go Wrong
Wrong goals: If you’re measuring clicks but care about purchases, your results won’t help.
No counter metrics: You might improve one thing but hurt another (e.g., more sign-ups but worse user experience).
Unfair comparison: If one group is very different (age, location, device), results won’t be valid.
Test too small: If not enough people are in the test, or it runs for too short a time, the results won’t be reliable.
Network effects: If users influence each other (like in social apps), it can mess with your test.
What is a confidence interval?
A 95% confidence interval means that if we were to take many samples and calculate a confidence interval for each sample, about 95% of these intervals would contain the true population parameter. We could also say that we are 95% confident that the parameter value lies in the estimated interval.
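A minimal sketch of computing a 95% confidence interval for a mean using SciPy, on made-up data (the sample size and distribution parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up sample of 50 observations
sample = rng.normal(loc=100, scale=15, size=50)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval for the population mean, based on the t-distribution
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"Sample mean: {mean:.2f}")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")
```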
What is the difference between a z-test and a t-test?
Z-test is used when the population standard deviation is known, and the sample size is large (typically, n > 30). The Z-test assumes that the underlying population is normally distributed, or the sample size is large enough for the Central Limit Theorem to hold. In a Z-test, the test statistic follows the standard normal distribution (Z-distribution), which has a mean of 0 and a standard deviation of 1.
t-test is used when the population standard deviation is unknown, and the sample size is small (typically, n <= 30). The t-test assumes that the sample is drawn from a population with a normal distribution. In a t-test, the test statistic follows a t-distribution, which is similar to the standard normal distribution but has thicker tails. The t-distribution is characterized by the degrees of freedom, which depend on the sample size. As the sample size increases, the t-distribution approaches the standard normal distribution.
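A small sketch contrasting the two tests on a made-up sample; note that the “known” population standard deviation used for the z-test is an assumption made purely for illustration:

```python
import numpy as np
from scipy import stats

# Made-up small sample and a hypothesized population mean
sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
mu_0 = 12.0

# t-test: population standard deviation unknown, small sample
t_stat, t_pvalue = stats.ttest_1samp(sample, popmean=mu_0)
print(f"t-test: statistic = {t_stat:.3f}, p-value = {t_pvalue:.4f}")

# z-test: assumes the population standard deviation is known (0.3 here, purely an assumption)
sigma = 0.3
z_stat = (sample.mean() - mu_0) / (sigma / np.sqrt(len(sample)))
z_pvalue = 2 * (1 - stats.norm.cdf(abs(z_stat)))  # two-sided p-value
print(f"z-test: statistic = {z_stat:.3f}, p-value = {z_pvalue:.4f}")
```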
What is selection bias and how can you avoid it?
Sampling bias, one of the most common forms of selection bias, occurs when a research study design fails to collect a representative sample of the target population. This typically happens because the selection criteria for respondents failed to capture a wide enough sampling frame to represent all viewpoints.
The cause of sampling bias almost always comes down to one of two conditions.
Poor methodology: In most cases, non-representative samples pop up when researchers set improper parameters for survey research. The most accurate and repeatable sampling method is simple random sampling where a large number of respondents are chosen at random. When researchers stray from random sampling (also called probability sampling), they risk injecting their own selection bias into recruiting respondents.
Poor execution: Sometimes data researchers craft scientifically sound sampling methods, but their work is undermined when field workers cut corners. By reverting to convenience sampling (where the only people studied are those who are easy to reach) or giving up on reaching non-responders, a field worker can jeopardize the careful methodology set up by data scientists.
The best way to avoid sampling bias is to stick to probability-based sampling methods. These include simple random sampling, systematic sampling, cluster sampling, and stratified sampling. In these methodologies, respondents are only chosen through processes of random selection—even if they are sometimes sorted into demographic groups along the way.
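As a rough sketch (the population and its age-group proportions are made up), here is how simple random sampling and stratified sampling might look with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Made-up population of 10,000 people with an age-group label
population = pd.DataFrame({
    "person_id": np.arange(10_000),
    "age_group": rng.choice(["18-29", "30-44", "45-64", "65+"],
                            size=10_000, p=[0.30, 0.30, 0.25, 0.15]),
})

# Simple random sampling: every person has the same chance of being selected
simple_random = population.sample(n=500, random_state=42)

# Stratified sampling: draw the same fraction from each age group
stratified = population.groupby("age_group", group_keys=False).sample(frac=0.05, random_state=42)

# Both should roughly reproduce the population's age-group proportions
print(simple_random["age_group"].value_counts(normalize=True))
print(stratified["age_group"].value_counts(normalize=True))
```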
What is regularization?
Regularization is a powerful technique for treating multicollinearity in regression models. It is also used to prevent overfitting by adding a penalty to the model for having large coefficients, which helps create a more generalizable model. Common regularization techniques include Lasso and Ridge regression.
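A minimal sketch with scikit-learn, using made-up data with two nearly identical (multicollinear) features, to show how Ridge and Lasso shrink the unstable least-squares coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(5)

# Made-up data with two almost identical (highly collinear) features
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Plain least squares tends to produce large, unstable coefficients here
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)

# Ridge (L2 penalty) shrinks coefficients; Lasso (L1 penalty) can zero some out entirely
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_)
```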
What is the bias-variance tradeoff?
The bias-variance tradeoff in machine learning involves balancing two error sources. Bias is the error from overly simplistic model assumptions, causing underfitting and missing data patterns. Variance is the error from excessive sensitivity to training data fluctuations, causing overfitting and capturing noise instead of true patterns.
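A small sketch of the tradeoff using polynomial fits of increasing degree on made-up data; the degrees and noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)

# Made-up data: a noisy non-linear relationship
x_train = np.sort(rng.uniform(-1, 1, 30))
y_train = np.sin(3 * x_train) + rng.normal(scale=0.3, size=30)
x_test = np.sort(rng.uniform(-1, 1, 30))
y_test = np.sin(3 * x_test) + rng.normal(scale=0.3, size=30)

# Low degree: high bias (underfits). High degree: high variance (tends to overfit),
# so training error keeps falling while test error typically rises.
for degree in (1, 4, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```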
What is the relationship between the significance level and the confidence level in Statistics?
Confidence level = 1 - significance level.
It's closely related to hypothesis testing and confidence intervals.
Significance level, in the hypothesis testing literature, is the probability of a Type I error that one is willing to tolerate.
Confidence level, in the confidence interval literature, is the probability that the interval contains the true parameter value. Both are usually written as percentages.
What are Type I and Type II errors?
Type I errors in hypothesis testing occur when the null hypothesis is true, but we incorrectly reject it, resulting in a false positive. The probability of making a Type I error is the same as the significance level. Type II errors occur when the null hypothesis is false, but we fail to reject it, leading to a false negative.
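A minimal simulation sketch of both error types, using made-up parameters (the effect size, sample size, and number of repetitions are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
alpha = 0.05
n_experiments = 5_000

# Type I error: the null hypothesis is true (the mean really is 0),
# so every rejection is a false positive.
false_positives = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        false_positives += 1
print("Type I error rate (expect about 0.05):", false_positives / n_experiments)

# Type II error: the null hypothesis is false (the true mean is 0.3),
# so every failure to reject is a false negative.
false_negatives = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=0.3, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value >= alpha:
        false_negatives += 1
print("Type II error rate for this effect size:", false_negatives / n_experiments)
```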
Additional sources that can be helpful, and credit to them as well:
https://www.datacamp.com/blog/statistics-interview-questions
https://grabngoinfo.com/top-12-statistical-concepts-data-science-interview-questions/