I would like to begin by sharing that I am not a data scientist, so I am definitely not an expert. If you read this article with cautious eyes, I understand. However, I am someone who has studied machine learning and deep learning. I studied machine learning during my undergraduate degree, earning the equivalent of an A, and I studied deep learning and an introduction to statistics during my master's, doing decently well (not an A).

In 2017, I wasn't sure about a career, and I desperately needed a job. That meant I applied broadly and prepared for a variety of roles, one of which was data science. I had fair experience building data science models and Kaggle projects. However, I had a hard time studying statistics. There were myriad concepts and the level only seemed to get tougher. The introduction to stats class in my master's was hardly an introduction but rather a deep dive. Having struggled through the semester and begun my job search, I realized a fundamental thing: getting an A in a course and preparing for interviews are two entirely different things. In fact, interview preparation is a skill in itself.

With that thought in my head, I knew I needed a definite guide for preparing for stats questions if they came up in an interview. At the end of the semester, our professor shared the 12 concepts that were most important. I noted them down and used those notes and existing resources to refine them even further. I am sharing them below and I hope they are helpful to you:
Before you dive in, just a note: I provide all my resources and information for free, and I hope that even 1% of this can help you in your career. At the same time, I do this all by myself and don't have anyone to help or any marketing budget to work with. So, if you found this article helpful, consider supporting me by making a donation through buymeacoffee, becoming a paid member on Substack, or subscribing to my YouTube page. That will not only help me keep sharing these resources for free but also help me keep doing this without relying on brands.
IMPORTANT STATS QUESTIONS :
What is the Central Limit Theorem? Explain with an example.
The Central Limit Theorem states that the sampling distribution of the mean of a large number of independent, identically distributed random samples will approach a normal distribution, regardless of the original population’s distribution.
The CLT can be applied at any company with a large amount of data. Consider a company like Uber or Lyft that wants to use hypothesis testing to check whether adding a new feature increases booked rides. If we treat each individual ride request X as a Bernoulli random variable (the rider either books the ride or does not), then with a large number of rides we can estimate the statistical properties of the total number of bookings. Understanding and estimating these statistical properties plays a significant role in applying hypothesis testing to your data and knowing whether adding a new feature will increase the number of booked rides or not.
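To make this concrete, here is a minimal simulation sketch in Python (assuming NumPy and a made-up booking probability of 0.3, not real Uber/Lyft numbers): the distribution of the sample mean of many Bernoulli ride-booking outcomes ends up approximately normal, just as the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(42)
p_book = 0.3          # hypothetical probability that a rider books (assumption)
n_rides = 1_000       # rides per sample
n_samples = 10_000    # number of repeated samples

# Each sample: n_rides Bernoulli outcomes; record the sample mean (booking rate).
sample_means = rng.binomial(1, p_book, size=(n_samples, n_rides)).mean(axis=1)

# By the CLT, the sample means should be approximately normal with
# mean p and standard deviation sqrt(p * (1 - p) / n).
print("empirical mean :", sample_means.mean())        # close to 0.3
print("empirical std  :", sample_means.std())         # close to the theoretical value
print("theoretical std:", np.sqrt(p_book * (1 - p_book) / n_rides))
```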
What is conditional probability and what is Bayes' theorem?
For any two events A and B, P(A|B) represents the conditional probability of event A occurring given that event B has already occurred. The formula for conditional probability is: P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0.
Continuing the discussion of conditional probability, revising our prior probability of an event when new information becomes available is a crucial step, and that is where Bayes' Theorem becomes useful. Bayes' Theorem is summed up by the equation: P(A|B) = P(B|A) × P(A) / P(B).
In this equation A is an event and B is empirical evidence or information received from data. So, P(A) is the prior probability of event A and P(B) is the probability of event B based on evidence from data, and P(B|A) is known as the likelihood. So, Bayes’ Theorem gives us the probability of an event based on our prior knowledge about the event and updates that conditional probability when we get some new information about the same.
A very easy example of Bayes' theorem is predicting the probability of rain on a particular day given that the morning was cloudy. Suppose the probability of rain, P(Rain), is 10% on a day in June, and the probability that the morning was cloudy given that it rained, P(Cloud|Rain), is 50%. Additionally, the probability of a cloudy morning on any day in June, P(Cloud), is 40%. Applying Bayes' theorem, the probability that it will rain today given that it was cloudy in the morning is P(Rain|Cloud) = P(Cloud|Rain) × P(Rain) / P(Cloud) = (0.5 × 0.1) / 0.4 = 0.125, or 12.5%.
What are standard deviation and variance, and what is skewness?
Variance and standard deviation both measure the dispersion or spread of a dataset. Variance is the average of the squared differences from the mean. It gives a sense of how much the values in a dataset differ from the mean. However, because it uses squared differences, the units are squared as well, which can be less intuitive than the standard deviation. Standard deviation is the square root of the variance, bringing the units back to the same as the original data. It provides a more interpretable measure of spread. For example, if the variance of a dataset is 25, the standard deviation is √25 = 5.
Skewness measures the asymmetry of a dataset around its mean, which can be positive, negative, or zero. Data with positive skewness, or right-skewed data, has a longer right tail, meaning the mean is greater than the median. Data with negative skewness, or left-skewed data, has a longer left tail, meaning the mean is less than the median. Zero skewness indicates a symmetric distribution, like a normal distribution, where the mean, median, and mode are equal.
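As a quick illustration (a sketch assuming Python with NumPy and SciPy, and a small made-up dataset), here is how variance, standard deviation, and skewness can be computed:

```python
import numpy as np
from scipy import stats

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # made-up example data, mean = 5

variance = data.var()          # average of the squared deviations from the mean
std_dev = data.std()           # square root of the variance, same units as the data
skewness = stats.skew(data)    # asymmetry around the mean (0 for symmetric data)

print(variance, std_dev, skewness)   # 4.0, 2.0, and a positive skew (longer right tail)
```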
What is the normal distribution? What is the Poisson distribution? What is the binomial distribution?
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution characterized by its bell-shaped curve, which is symmetric about the mean. With normal distributions, the mean is, therefore, equal to the median. Also, it’s known that about 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. This is known as the 68-95-99.7 Rule.
The binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. It is used when there are exactly two possible outcomes (success and failure) for each trial. For example, it can be used to model the number of heads in a series of coin flips.
The Poisson distribution is a discrete probability distribution that models the number of events occurring within a fixed interval of time or space, where the events occur independently and at a constant average rate. It is appropriate to use when you want to model the count of rare events, such as the number of emails received in an hour or the number of earthquakes in a year.
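A short sketch (assuming Python with NumPy; the parameter values are just examples) showing how samples can be drawn from each of these three distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal: continuous and symmetric around the mean (here mean=0, std=1).
normal_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Binomial: number of successes in n=10 trials with success probability p=0.5,
# e.g. heads in 10 coin flips.
binomial_samples = rng.binomial(n=10, p=0.5, size=10_000)

# Poisson: count of events per interval at an average rate lam=3,
# e.g. emails received per hour.
poisson_samples = rng.poisson(lam=3.0, size=10_000)

# Roughly 68% of normal samples should fall within one standard deviation.
print(np.mean(np.abs(normal_samples) <= 1))   # ≈ 0.68
```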
What is A/B testing? What are some pitfalls with A/B testing?
A/B testing is a way to compare two versions of something (like a webpage, app feature, or product offering) to see which one performs better. You make a change and then test if that change really makes a difference—or if things just happened by chance.
How It Works
You split your audience into at least two groups:
Control group: sees the original version (Version A)
Treatment group: sees the new or changed version (Version B)
Then you track how each group behaves. For example, do more people click “Buy” on Version B than Version A?
Example: Website Landing Page
You design two versions of a website’s landing page. You want more people to sign up (your success metric is the conversion rate).
Null hypothesis (H0): There’s no difference between the two versions.
Alternative hypothesis (H1): One version does better.
Randomly show one version to some visitors, and the other version to others. After enough people visit, you use stats to check: are the results really different, or is it just random?
Statistical Testing
To be confident in your results, you run a 2-sample t-test. It tells you if the difference is big enough to matter statistically.
p-value < alpha (like 0.05) → Your result is statistically significant.
Roughly speaking, that means there is less than a 5% chance you would see a difference this large if the two versions were actually the same, so you can be reasonably confident the new version really performs differently (better or worse).
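As a sketch of the kind of test described above (assuming Python with SciPy and simulated conversion data rather than real traffic), a two-sample t-test on the two groups might look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical per-visitor conversion outcomes (1 = signed up, 0 = did not).
control = rng.binomial(1, 0.10, size=5_000)     # Version A, assumed 10% rate
treatment = rng.binomial(1, 0.12, size=5_000)   # Version B, assumed 12% rate

# Two-sample t-test comparing the conversion rates of the two groups.
t_stat, p_value = stats.ttest_ind(treatment, control)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant difference between the two versions.")
else:
    print("No statistically significant difference detected.")
```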
Things That Can Go Wrong
Wrong goals: If you’re measuring clicks but care about purchases, your results won’t help.
No counter metrics: You might improve one thing but hurt another (e.g., more sign-ups but worse user experience).
Unfair comparison: If one group is very different (age, location, device), results won’t be valid.
Test too small: If not enough people are in the test, or it runs for too short a time, the results won't be reliable.
Network effects: If users influence each other (like in social apps), it can mess with your test.
What is a confidence interval?
A 95% confidence interval means that if we were to take many samples and calculate a confidence interval for each sample, about 95% of these intervals would contain the true population parameter. We could also say that we are 95% confident that the parameter value lies in the estimated interval.
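A minimal sketch (assuming Python with SciPy and a small made-up sample) of computing a 95% confidence interval for a mean using the t-distribution:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])  # made-up sample

mean = data.mean()
sem = stats.sem(data)     # standard error of the mean
n = len(data)

# 95% confidence interval based on the t-distribution with n-1 degrees of freedom.
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```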
What is the difference between a z-test and a t-test?
Z-test is used when the population standard deviation is known, and the sample size is large (typically, n > 30). The Z-test assumes that the underlying population is normally distributed, or the sample size is large enough for the Central Limit Theorem to hold. In a Z-test, the test statistic follows the standard normal distribution (Z-distribution), which has a mean of 0 and a standard deviation of 1.
t-test is used when the population standard deviation is unknown, and the sample size is small (typically, n <= 30). The t-test assumes that the sample is drawn from a population with a normal distribution. In a t-test, the test statistic follows a t-distribution, which is similar to the standard normal distribution but has thicker tails. The t-distribution is characterized by the degrees of freedom, which depend on the sample size. As the sample size increases, the t-distribution approaches the standard normal distribution.
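To see the two side by side, here is a rough sketch (assuming Python with SciPy and statsmodels, and a made-up sample of size 25) running a one-sample t-test and a one-sample z-test against the same hypothesized mean:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(1)
sample = rng.normal(loc=101, scale=15, size=25)   # made-up small sample, n = 25
mu0 = 100                                         # hypothesized population mean

# t-test: population standard deviation unknown, small sample -> t-distribution.
t_stat, t_p = stats.ttest_1samp(sample, popmean=mu0)

# z-test: in practice used when sigma is known or n is large; here statsmodels
# estimates the standard deviation from the sample.
z_stat, z_p = ztest(sample, value=mu0)

print(f"t-test: t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"z-test: z = {z_stat:.3f}, p = {z_p:.4f}")
```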
What is selection bias and how do you avoid it?
Selection bias, most often in the form of sampling bias, occurs when a research study design fails to collect a representative sample of the target population. This typically happens because the selection criteria for respondents failed to capture a wide enough sampling frame to represent all viewpoints.
The cause of sampling bias almost always owes to one of two conditions.
Poor methodology: In most cases, non-representative samples pop up when researchers set improper parameters for survey research. The most accurate and repeatable sampling method is simple random sampling where a large number of respondents are chosen at random. When researchers stray from random sampling (also called probability sampling), they risk injecting their own selection bias into recruiting respondents.
Poor execution: Sometimes data researchers craft scientifically sound sampling methods, but their work is undermined when field workers cut corners. By reverting to convenience sampling (where the only people studied are those who are easy to reach) or giving up on reaching non-responders, a field worker can jeopardize the careful methodology set up by data scientists.
The best way to avoid sampling bias is to stick to probability-based sampling methods. These include simple random sampling, systematic sampling, cluster sampling, and stratified sampling. In these methodologies, respondents are only chosen through processes of random selection—even if they are sometimes sorted into demographic groups along the way.
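As a small sketch of stratified sampling (assuming Python with pandas and a hypothetical users table with a region column), drawing the same fraction from every stratum keeps each group represented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical population of users with a demographic attribute to stratify on.
users = pd.DataFrame({
    "user_id": np.arange(10_000),
    "region": rng.choice(["north", "south", "east", "west"], size=10_000),
})

# Stratified sample: draw 10% from every region so each stratum is represented.
stratified_sample = (
    users.groupby("region", group_keys=False)
         .sample(frac=0.10, random_state=42)
)

print(stratified_sample["region"].value_counts(normalize=True))
```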
What is regularization?
Regularization is a powerful technique for treating multicollinearity in regression models. It is also used to prevent overfitting by adding a penalty to the model for having large coefficients, which helps in creating a more generalizable model. Common regularization techniques include Lasso and Ridge regression.
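A brief sketch (assuming Python with scikit-learn and synthetic data) of fitting Ridge (L2 penalty) and Lasso (L1 penalty) regressions; the alpha parameter controls the strength of the penalty:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic regression data with more features than are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))
```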
What is the bias-variance tradeoff?
The bias-variance tradeoff in machine learning involves balancing two error sources. Bias is the error from overly simplistic model assumptions, causing underfitting and missing data patterns. Variance is the error from excessive sensitivity to training data fluctuations, causing overfitting and capturing noise instead of true patterns.
What is the relationship between the significance level and the confidence level in Statistics?
Confidence level = 1 - significance level.
It's closely related to hypothesis testing and confidence intervals.
Significance level, according to the hypothesis testing literature, is the probability of a Type I error that one is willing to tolerate.
Confidence level, according to the confidence interval literature, is the long-run proportion of confidence intervals (computed from repeated samples) that contain the true parameter value. Both are usually written as percentages.
What are Type I and Type II errors?
Type I errors in hypothesis testing occur when the null hypothesis is true, but we incorrectly reject it, resulting in a false positive. The probability of making a Type I error is the same as the significance level. Type II errors occur when the null hypothesis is false, but we fail to reject it, leading to a false negative.
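To see why the Type I error rate matches the significance level, here is a small simulation sketch (assuming Python with NumPy and SciPy): when the null hypothesis is actually true, a test run at alpha = 0.05 falsely rejects roughly 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 5_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null hypothesis is true.
    a = rng.normal(0, 1, size=100)
    b = rng.normal(0, 1, size=100)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1   # rejecting a true null = Type I error

print("observed Type I error rate:", false_positives / n_experiments)  # ≈ 0.05
```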
Additional sources that can be helpful (credit to them as well):
https://www.datacamp.com/blog/statistics-interview-questions
https://grabngoinfo.com/top-12-statistical-concepts-data-science-interview-questions/
IMPORTANT DATA SCIENCE QUESTIONS :
What are L1 and L2 regularization? What are the differences between the two?
Answer:
Regularization is a technique used to avoid overfitting by making the model simpler. The rest of the answer can be leveraged here.
Mention three ways to handle missing or corrupted data in a dataset.
Answer:
In general, real-world data often has a lot of missing values. The cause of missing values can be data corruption or failure to record data. The rest of the answer is here
What are the Bias and Variance in a Machine Learning Model and explain the bias-variance trade-off?
Answer:
The goal of any supervised machine learning model is to estimate the mapping function (f) that predicts the target variable (y) given input (x). The prediction error can be broken down into three parts: The rest of the answer is here
Mention three ways to make your model robust to outliers.
Investigating the outliers is always the first step in understanding how to treat them. After you understand the nature of why the outliers occurred you can apply one of the several methods mentioned here.
Define Precision, recall, and F1 and discuss the trade-off between them?
Precision and recall are two classification evaluation metrics that are used beyond accuracy. The rest of the answer is here
Explain briefly the K-Means clustering and how can we find the best value of K?
K-Means is a well-known clustering algorithm. K-means clustering is often used because it is easy to interpret and implement. The rest of the answer is here
Explain briefly the logistic regression model and state an example of when you have used it recently.
Answer:
Logistic regression is used to calculate the probability of occurrence of an event in the form of a dependent output variable based on independent input variables. Logistic regression is commonly used to estimate the probability that an instance belongs to a particular class. If the probability is bigger than 0.5 then it will belong to that class (positive) and if it is below 0.5 it will belong to the other class. This will make it a binary classifier.
It is important to remember that logistic regression is not inherently a classification model; it is an ordinary regression algorithm that was developed and used before machine learning, but it can be used for classification when we apply a threshold to determine specific categories.
There are a lot of classification applications for it:
classifying email as spam or not, identifying whether a patient is healthy or not, and so on.
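A minimal sketch of logistic regression as a binary classifier (assuming Python with scikit-learn and its built-in breast cancer dataset; the 0.5 threshold is applied by predict):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification task: predict whether a tumor is malignant or benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# predict_proba returns class probabilities; predict applies the 0.5 threshold.
probabilities = model.predict_proba(X_test)[:, 1]
predictions = model.predict(X_test)

print("test accuracy:", model.score(X_test, y_test))
```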
Explain briefly batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. What are the pros and cons of each of them?
Gradient descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to tweak parameters iteratively in order to minimize a cost function. (A small code sketch contrasting the three variants appears at the end of this answer.)
Batch Gradient Descent: In batch gradient descent, the whole training set is used to minimize the loss function, taking a step toward the nearest minimum by calculating the gradient (the direction of descent).
Pros: Since the whole data set is used to calculate the gradient, it will be stable and reach the minimum of the cost function without bouncing (if the learning rate is chosen correctly).
Cons:
Since batch gradient descent uses all the training set to compute the gradient at every step, it will be very slow especially if the size of the training data is large.
Stochastic Gradient Descent:
Stochastic Gradient Descent picks up a random instance in the training data set at every step and computes the gradient based only on that single instance.
Pros:
It makes the training much faster as it only works on one instance at a time.
It becomes easier to train on large datasets.
Cons:
Due to the stochastic (random) nature of this algorithm, it is much less regular than batch gradient descent. Instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around and never settle down. So once the algorithm stops, the final parameters are good but not optimal. For this reason, it is important to use a learning-rate schedule to overcome this randomness.
Mini-batch Gradient:
At each step instead of computing the gradients on the whole data set as in the Batch Gradient Descent or using one random instance as in the Stochastic Gradient Descent, this algorithm computes the gradients on small random sets of instances called mini-batches.
Pros:
The algorithm's progress in parameter space is less erratic than with Stochastic Gradient Descent, especially with larger mini-batches.
You can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
Cons:
It might be difficult to escape from local minima.
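The rough sketch below (assuming Python with NumPy and a simple synthetic linear-regression problem) contrasts the three variants; the only difference is how many training examples are used per gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 4 + 3x + noise
X = 2 * rng.random((1000, 1))
y = 4 + 3 * X + rng.normal(0, 1, size=(1000, 1))
X_b = np.c_[np.ones((1000, 1)), X]   # add a bias (intercept) column

def gradient_descent(X_b, y, batch_size, lr=0.1, n_epochs=100):
    """batch_size=len(X_b) -> batch GD, batch_size=1 -> stochastic GD,
    anything in between -> mini-batch GD."""
    m = len(X_b)
    theta = np.zeros((2, 1))              # start from zero parameters
    for _ in range(n_epochs):
        indices = rng.permutation(m)      # shuffle the data every epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            xi, yi = X_b[batch], y[batch]
            # Gradient of the MSE cost on this batch.
            gradients = 2 / len(batch) * xi.T @ (xi @ theta - yi)
            theta -= lr * gradients
    return theta

# The true parameters are roughly [4, 3] (intercept, slope).
print("batch GD:     ", gradient_descent(X_b, y, batch_size=len(X_b)).ravel())
print("stochastic GD:", gradient_descent(X_b, y, batch_size=1).ravel())
print("mini-batch GD:", gradient_descent(X_b, y, batch_size=32).ravel())
```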
What is boosting in the context of ensemble learning? Discuss two famous boosting methods.
Answer:
Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.
There are many boosting methods available, but by far the most popular are:
Adaptive Boosting: One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor under-fitted. This results in new predictors focusing more and more on the hard cases.
Gradient Boosting: Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration as AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
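A quick sketch (assuming Python with scikit-learn and a synthetic dataset) fitting the two boosting methods described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Adaptive Boosting: re-weights training instances that previous learners got wrong.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gradient Boosting: each new tree is fit to the residual errors of the ensemble so far.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("AdaBoost accuracy:         ", ada.score(X_test, y_test))
print("Gradient Boosting accuracy:", gb.score(X_test, y_test))
```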
What are Loss Functions and Cost Functions? Explain the key Difference Between them.
Answer: The loss function is the measure of the performance of the model on a single training example, whereas the cost function is the average loss function over all training examples or across the batch in the case of mini-batch gradient descent.
Some examples of loss functions are Mean Squared Error, Binary Cross Entropy, etc.
Whereas, the cost function is the average of the above loss functions over training examples.
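A tiny sketch (assuming Python with NumPy and made-up numbers) making the distinction explicit for the squared-error case: the loss is computed per example, while the cost is the average over the batch.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # made-up predictions

# Loss: measured on a single training example (squared error here).
per_example_loss = (y_true - y_pred) ** 2
print("loss per example:", per_example_loss)    # [0.25, 0.0, 2.25, 1.0]

# Cost: the average of the per-example losses over the batch (MSE).
cost = per_example_loss.mean()
print("cost (MSE over the batch):", cost)       # 0.875
```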
Why is boosting a more stable algorithm compared to other ensemble algorithms?
Answer:
Boosting algorithms keep focusing on the errors made in previous iterations until those errors are reduced, whereas in bagging there is no such corrective loop. That is why boosting is considered a more stable algorithm compared to other ensemble algorithms.
Can you explain the ARIMA model and its components?
Answer: The ARIMA model, which stands for Autoregressive Integrated Moving Average, is a widely used time series forecasting model. It combines three key components: Autoregression (AR), Differencing (I), and Moving Average (MA).
Autoregression (AR): The autoregressive component captures the relationship between an observation in a time series and a certain number of lagged observations. It assumes that the value at a given time depends linearly on its own previous values. The "p" parameter in ARIMA(p, d, q) represents the order of autoregressive terms. For example, ARIMA(1, 0, 0) refers to a model with one autoregressive term.
Differencing (I): Differencing is used to make a time series stationary by removing trends or seasonality. It calculates the difference between consecutive observations to eliminate any non-stationary behavior. The "d" parameter in ARIMA(p, d, q) represents the order of differencing. For instance, ARIMA(0, 1, 0) indicates that differencing is applied once.
Moving Average (MA): The moving average component takes into account the dependency between an observation and a residual error from a moving average model applied to lagged observations. It assumes that the value at a given time depends linearly on the error terms from previous time steps. The "q" parameter in ARIMA(p, d, q) represents the order of the moving average terms. For example, ARIMA(0, 0, 1) signifies a model with one moving average term.
By combining these three components, the ARIMA model can capture both autoregressive patterns, temporal dependencies, and stationary behavior in a time series. The parameters p, d, and q are typically determined through techniques like the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
It's worth noting that there are variations of the ARIMA model, such as SARIMA (Seasonal ARIMA), which incorporates additional seasonal components for modeling seasonal patterns in the data.
ARIMA models are widely used in forecasting applications, but they do make certain assumptions about the underlying data, such as linearity and stationarity. It's important to validate these assumptions and adjust the model accordingly if they are not met.
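A short sketch (assuming Python with statsmodels and a made-up series) of fitting an ARIMA(1, 1, 1) model and producing a forecast:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)

# Made-up non-stationary series: a trend plus autocorrelated noise.
n = 200
noise = rng.normal(0, 1, n).cumsum()
series = pd.Series(10 + 0.5 * np.arange(n) + noise,
                   index=pd.date_range("2023-01-01", periods=n, freq="D"))

# ARIMA(p=1, d=1, q=1): one AR term, first differencing, one MA term.
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

print("AIC:", fitted.aic)           # useful for comparing different (p, d, q) orders
print(fitted.forecast(steps=7))     # forecast the next 7 days
```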
DATA SCIENCE PROJECTS YOU CAN PURSUE : Link
Hope this is helpful and good luck!
I hope you found this article helpful. If so, consider supporting me by making a donation through buymeacoffee, becoming a paid member on Substack, or subscribing to my YouTube page.