AI interview questions 2025 with answers

Explain the Bias-Variance Trade-Off and How to Optimise It. Answer: The bias-variance trade-off balances a model's ability to fit the training data (low bias) against its ability to generalise to unseen data (low variance). High bias leads to underfitting (e.g., overly simple models), while high variance leads to overfitting (e.g., overly complex models). To optimise it, use cross-validation to estimate performance on unseen data, regularisation (e.g., L1/L2) to penalise complexity, and more diverse training data to reduce variance. Adjusting model complexity, such as tuning the number of layers in a neural network, also helps find the sweet spot.
Describe Backpropagation in Detail and Derive the Gradient Update Rule. Answer: Backpropagation is the algorithm that computes the gradients used to train a neural network with gradient descent. A forward pass computes the predictions and the loss; a backward pass then applies the chain rule layer by layer, from the output back to the input, to obtain the gradient of the loss with respect to each weight. The gradient update rule is $w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate and $\frac{\partial L}{\partial w}$ is the partial derivative of the loss $L$ with respect to the weight $w$. Repeating this update adjusts the weights iteratively to reduce the error.
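
To make the chain rule concrete, here is a minimal sketch on a two-weight network. The architecture (h = tanh(w1 * x), y = w2 * h), the squared-error loss and all numeric values are assumptions chosen for illustration, not part of the answer above.

```python
import math

def forward(w1, w2, x):
    """Tiny network: h = tanh(w1 * x), y = w2 * h."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def loss_fn(y, target):
    return 0.5 * (y - target) ** 2

def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target                 # L = 0.5 * (y - t)^2
    dL_dw2 = dL_dy * h                 # y = w2 * h, so dy/dw2 = h
    dL_dh = dL_dy * w2                 # dy/dh = w2
    dL_dw1 = dL_dh * (1 - h ** 2) * x  # dh/dw1 = (1 - tanh^2) * x
    return dL_dw1, dL_dw2

def sgd_step(w1, w2, x, target, lr=0.1):
    g1, g2 = backward(w1, w2, x, target)
    return w1 - lr * g1, w2 - lr * g2  # w_new = w_old - eta * dL/dw

w1, w2, x, target = 0.5, -0.3, 1.0, 1.0
loss_before = loss_fn(forward(w1, w2, x)[1], target)
w1_new, w2_new = sgd_step(w1, w2, x, target)
loss_after = loss_fn(forward(w1_new, w2_new, x)[1], target)
print(loss_before, loss_after)  # the loss should drop after one step
```

In an interview it is worth mentioning that frameworks compute these derivatives automatically, but the underlying bookkeeping is the same.
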
How Would You Handle Overfitting in a Deep Learning Model with Limited Data? Answer: With limited data, use data augmentation (e.g., rotating or flipping images) to artificially expand the dataset. Apply regularisation methods such as dropout (randomly disabling neurons during training) or weight decay. Implement early stopping by monitoring the validation loss and halting training once it stops improving. Additionally, leverage transfer learning: take a model pre-trained on a large dataset (e.g., ImageNet) and fine-tune it on the small dataset to reuse its learned features.
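
Early stopping is the easiest of these to show in a few lines. The sketch below is framework-independent; the function name, the `patience` and `min_delta` parameters, and the loss values are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to beat the best
    value seen so far by more than min_delta for `patience` epochs in a row.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run

# Hypothetical validation losses: the best value is reached at epoch 3.
history = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.65]
stop_at = early_stopping_epoch(history, patience=3)
```

In practice you would also keep a checkpoint of the weights from the best epoch and restore it after stopping.
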
Design an Algorithm to Solve a Reinforcement Learning Problem with Sparse Rewards. Answer: For sparse rewards, use a Q-learning or Deep Q-Network (DQN) approach with techniques like reward shaping (adding intermediate rewards) to guide the agent. Incorporate exploration strategies like epsilon-greedy or entropy regularisation to encourage diverse actions. Use a replay buffer to store and sample past experiences, which stabilises training. For long-term dependencies, consider adding a Long Short-Term Memory (LSTM) layer to the DQN so the agent retains memory of sparse reward events.
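
The sketch below shows reward shaping and epsilon-greedy exploration together. It swaps the DQN for plain tabular Q-learning so that it stays self-contained. The six-cell corridor environment, the `potential` function and all hyperparameters are assumptions made for this example.

```python
import random

N_STATES = 6          # corridor cells 0..5; only reaching cell 5 pays a reward
GOAL = N_STATES - 1
ACTIONS = (-1, +1)    # move left, move right
GAMMA = 0.9

def potential(state):
    """Hypothetical potential: closer to the goal means higher potential."""
    return state / GOAL

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    sparse_reward = 1.0 if next_state == GOAL else 0.0
    # Shaping term added to the sparse reward: gamma * phi(s') - phi(s)
    shaped = sparse_reward + GAMMA * potential(next_state) - potential(state)
    return next_state, shaped, next_state == GOAL

def train(episodes=500, alpha=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap on steps per episode
            if rng.random() < epsilon:
                a = rng.randrange(2)                        # explore
            else:
                a = max((0, 1), key=lambda i: q[state][i])  # exploit
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])   # Q-learning update
            state = next_state
            if done:
                break
    return q

q_table = train()
greedy_policy = [ACTIONS[max((0, 1), key=lambda i: q_table[s][i])] for s in range(GOAL)]
```

With the shaping term, every step towards the goal earns a small positive reward, so the agent gets a learning signal long before it first reaches the sparse reward.
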
Explain the Challenges of Training a Generative Adversarial Network (GAN) and How to Mitigate Them. Answer: GAN training is challenging due to mode collapse (the generator produces limited variety), unstable training (oscillations between generator and discriminator), and vanishing gradients. Mitigate mode collapse with minibatch discrimination or unrolled GANs to enforce diversity. Stabilise training using label smoothing, spectral normalisation, or Wasserstein GANs with gradient penalty to enforce Lipschitz continuity. Adjust learning rates and use progressive growing for better convergence.
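
For reference, the critic and generator objectives of a Wasserstein GAN with gradient penalty are usually written as below. The symbols are introduced here for illustration and are not part of the answer above: $P_r$ is the real-data distribution, $P_g$ the generator's distribution, $D$ the critic, $\lambda$ the penalty weight, and $\hat{x}$ a random interpolation between a real and a generated sample.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)]
       + \lambda \, \mathbb{E}_{\hat{x}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big] \\
L_G &= -\mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \\
\hat{x} &= \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
\end{aligned}
```

The penalty term pushes the norm of the critic's gradient towards 1, which is a soft way of enforcing the 1-Lipschitz constraint.
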
How Would You Implement a Custom Loss Function for a Multimodal AI Model? Answer: Define a loss function that combines modality-specific losses (e.g., cross-entropy for text, mean squared error for images) with a regularisation term that aligns the modalities. For example, use a weighted sum: $L_{total} = w_1 \cdot L_{text} + w_2 \cdot L_{image} + \lambda \cdot L_{alignment}$, where $L_{alignment}$ could be a cosine similarity loss between the modality embeddings. Implement it in PyTorch or TensorFlow using differentiable operations so that gradients can be computed for every term, and tune the weights $w_1, w_2, \lambda$ on a validation set.
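
The arithmetic of the weighted sum can be sketched without any framework. The version below uses plain Python lists for a single example; all function names and default weights are assumptions for illustration. In a real model the same expression would be written with PyTorch or TensorFlow tensor operations so that autograd can differentiate it.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one example, computed from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_alignment(u, v):
    """1 - cosine similarity: 0 when the two embeddings point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def total_loss(text_logits, text_target, image_pred, image_target,
               text_emb, image_emb, w1=1.0, w2=1.0, lam=0.1):
    # L_total = w1 * L_text + w2 * L_image + lambda * L_alignment
    return (w1 * cross_entropy(text_logits, text_target)
            + w2 * mse(image_pred, image_target)
            + lam * cosine_alignment(text_emb, image_emb))

example = total_loss(
    text_logits=[2.0, 0.5, -1.0], text_target=0,
    image_pred=[0.2, 0.8], image_target=[0.0, 1.0],
    text_emb=[1.0, 0.0], image_emb=[0.6, 0.8],
)
```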

Advanced technical questions

Why is Quantisation Used?

1. Reduced Model Size:

Lower-bit numbers take less space, so the model requires less memory.

2. Faster Inference:

Lower precision arithmetic operations (e.g. INT8) are faster on compatible hardware (like mobile CPUs, GPUs, TPUs).

3. Lower Power Consumption:

Very important for deploying models on edge devices (phones, IoT, embedded systems).

🔹 Types of Quantisation

1. Post-Training Quantisation (PTQ):

Applied after training. Easy to use but may slightly reduce accuracy.

2. Quantisation-Aware Training (QAT):

Simulates quantisation during training. Helps maintain higher accuracy post-deployment.
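
As a rough illustration of what both approaches do to a tensor, the sketch below maps a few floats to INT8 and back using an affine (scale and zero-point) scheme. This is one common scheme, assumed here for illustration; the function names and sample weights are hypothetical. The quantise-then-dequantise round trip is also what quantisation-aware training simulates in the forward pass.

```python
def quantisation_params(values, qmin=-128, qmax=127):
    """Scale and zero-point of an affine (asymmetric) INT8 mapping."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must contain 0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantise(values, scale, zero_point, qmin=-128, qmax=127):
    return [min(max(round(v / scale) + zero_point, qmin), qmax) for v in values]

def dequantise(q_values, scale, zero_point):
    return [(q - zero_point) * scale for q in q_values]

weights = [-1.0, -0.25, 0.0, 0.4, 1.5]
scale, zero_point = quantisation_params(weights)
q_weights = quantise(weights, scale, zero_point)        # integers in [-128, 127]
restored = dequantise(q_weights, scale, zero_point)     # approximate floats
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored in one byte instead of four, at the cost of a rounding error of at most about half the scale per value.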

🔹 When is Quantisation Used?

Deploying models on mobile or embedded devices
Serving models in real-time systems where speed is critical
Optimising large models (e.g. transformers, CNNs) for production

🔹 Tools That Support Quantisation

TensorFlow Lite
PyTorch (torch.quantization)
ONNX Runtime
NVIDIA TensorRT

🔹 Example

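Convert a trained PyTorch model to INT8 using dynamic post-training quantisation:

```python
import torch

model_fp32 = load_model()  # placeholder: load your trained FP32 model here
model_fp32.eval()          # quantise in inference mode

# Store the weights of every nn.Linear layer as INT8; activations are quantised at run time
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)
```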

1. Machine Learning Fundamentals

Q1. Explain the difference between supervised, unsupervised and reinforcement learning.

Supervised learning uses labelled data to train models (e.g. classification).
Unsupervised learning uses unlabelled data to discover patterns (e.g. clustering).
Reinforcement learning involves an agent learning through rewards and penalties in an environment.

Q2. What is overfitting? How can it be prevented?

Overfitting occurs when a model learns training data too well, including noise. It performs poorly on unseen data. It can be prevented using:

Cross-validation
Regularisation
Pruning (in decision trees)
Early stopping
Using simpler models

Q3. What are precision, recall and F1-score?

Precision: True Positives / (True Positives + False Positives)
Recall: True Positives / (True Positives + False Negatives)
F1-score: Harmonic mean of precision and recall.
Use F1-score when you want a balance between precision and recall.
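
A minimal sketch of the three formulas for a binary problem. The labels and predictions are made-up data for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 2 true positives, 1 false positive, 2 false negatives
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here precision is 2/3, recall is 1/2 and F1 is 4/7. The harmonic mean sits closer to the lower of the two values.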

Q4. Explain bias-variance trade-off.

High bias: Model is too simple, underfitting data
High variance: Model is too complex, overfitting data
The goal is to find the level of model complexity that minimises total error on unseen data, since reducing one of the two usually increases the other.

Q5. How does regularisation work (L1 vs L2)?

Regularisation adds a penalty to the loss function:

L1 (Lasso) adds absolute values of coefficients, promoting sparsity
L2 (Ridge) adds squared values of coefficients, reducing magnitude
Used to prevent overfitting.
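
The difference in behaviour shows up when you apply only the penalty to a weight vector. The sketch below is a simplification that ignores the data-loss gradient: the L1 update is the soft-thresholding (proximal) step commonly used for Lasso, and the L2 update is a plain gradient step. The weights, `lam` and `lr` are assumed values.

```python
def l1_step(weights, lam, lr):
    """Soft-thresholding step for the L1 penalty lam * sum(|w|)."""
    threshold = lr * lam
    return [max(abs(w) - threshold, 0.0) * (1 if w > 0 else -1) for w in weights]

def l2_step(weights, lam, lr):
    """Gradient step for the L2 penalty lam * sum(w^2), whose gradient is 2 * lam * w."""
    return [w - lr * 2 * lam * w for w in weights]

weights = [0.8, -0.05, 0.02, -1.2]
after_l1 = l1_step(weights, lam=0.5, lr=0.2)  # threshold of 0.1
after_l2 = l2_step(weights, lam=0.5, lr=0.2)  # every weight scaled by 0.8
```

L1 sets the two small weights to exactly zero (sparsity), while L2 shrinks every weight proportionally and leaves none at zero.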

Q6. Difference between bagging and boosting?

Bagging: Trains multiple models in parallel on bootstrap samples of the data and combines their predictions (e.g. Random Forest)
Boosting: Trains models sequentially, each correcting the previous (e.g. XGBoost)

2. Deep Learning

Q7. How do neural networks learn?

Neural networks learn by adjusting weights using backpropagation and gradient descent to minimise a loss function.

Q8. What are activation functions? Why use ReLU?

Activation functions introduce non-linearity.

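ReLU (Rectified Linear Unit) is popular because it is cheap to compute and its gradient is 1 for positive inputs, which mitigates the vanishing gradients seen with saturating functions such as sigmoid and tanh.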

Q9. Difference between CNNs and RNNs?

CNNs: Used for spatial data like images
RNNs: Used for sequential data like text or time series

Q10. What is backpropagation?

An algorithm that calculates gradients of the loss function with respect to model weights, allowing them to be updated during training.

Q11. What are vanishing and exploding gradients?

Vanishing: Gradients become too small, slowing or stopping learning
Exploding: Gradients grow too large, causing instability
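Exploding gradients are usually handled with gradient clipping. Vanishing gradients are reduced with proper weight initialisation, non-saturating activations such as ReLU, and normalisation layers or residual connections.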
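
Both effects can be shown numerically. The sketch below is a simplification that ignores the weight matrices: it multiplies the derivative of 20 sigmoid layers to show how quickly the signal vanishes, and it clips a large gradient by its global L2 norm. The layer count and gradient values are assumptions for illustration.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

# Backpropagating through 20 sigmoid layers multiplies 20 factors of at most 0.25.
vanishing_factor = sigmoid_grad(0.0) ** 20

# A gradient with norm 50 is rescaled to norm 5; its direction is unchanged.
clipped = clip_by_global_norm([30.0, 40.0], max_norm=5.0)
```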

Q12. What is transfer learning?

Using a pre-trained model on a new but related task to save time and data. Common in image and NLP tasks.

3. Natural Language Processing (NLP)

Q13. How does a transformer model work?

Transformers use self-attention mechanisms to process input sequences in parallel, capturing context without relying on recurrence.

Q14. What are word embeddings?

Word embeddings like Word2Vec and GloVe convert words into dense vectors capturing semantic meaning.

Q15. What is the attention mechanism?

It allows the model to focus on the relevant parts of the input when generating each output, improving context handling in sequences.
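
A minimal sketch of scaled dot-product attention, the form used in transformers, written with plain lists so it runs without any library. The query, key and value vectors are made-up numbers.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # how much this query attends to each position
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]
outputs, attn_weights = attention(queries, keys, values)
```

The query points along the first axis, so the first and third keys receive most of the weight and the second key receives very little.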

Q16. Difference between BERT and GPT?

BERT: Bidirectional encoder for understanding tasks
GPT: Unidirectional decoder for generative tasks

Q17. How do you handle out-of-vocabulary words?

Use subword tokenisation (e.g. Byte-Pair Encoding) or character-level models to break unknown words into known pieces.
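
The idea of breaking an unknown word into known pieces can be sketched with a greedy longest-match split. Note that this is a simpler stand-in, not BPE itself (BPE learns and applies merge rules); the vocabulary and the `[UNK]` token are assumptions for illustration.

```python
def segment(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a word into known subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # not even a single character is known
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"un", "break", "able", "token", "is", "ation", "s", "a", "b", "l", "e"}
print(segment("unbreakable", vocab))    # ['un', 'break', 'able']
print(segment("tokenisation", vocab))   # ['token', 'is', 'ation']
print(segment("xyz", vocab))            # ['[UNK]']
```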

4. Data Handling and Preprocessing

Q18. How do you deal with missing data?

Remove records or columns
Impute values using mean, median, mode or prediction models
Use algorithms that support missing values

Q19. How to handle imbalanced datasets?

Resampling (oversample minority, undersample majority)
Use synthetic data (e.g. SMOTE)
Adjust class weights in model training
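
Class weights are the cheapest of these to compute. A minimal sketch, assuming the common "balanced" heuristic of n_samples / (n_classes * class_count), which is the formula scikit-learn uses for class_weight="balanced". The 90/10 label split is made-up data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 90 + [1] * 10
class_weights = balanced_class_weights(labels)
```

The minority class gets a weight of 5.0 and the majority class about 0.56, so each class contributes equally to the total loss.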

Q20. What is feature engineering?

Creating new features or modifying existing ones to improve model performance. It includes encoding, transformation and selection.

Q21. Explain dimensionality reduction.

Reduces the number of features while retaining key information.

PCA: Projects data into directions of maximum variance
t-SNE: Visualises high-dimensional data in 2D or 3D

Q22. How do you normalise or standardise data?

Normalisation scales values to [0,1]
Standardisation rescales to have zero mean and unit variance
Used so that features measured on larger scales do not dominate distance-based or gradient-based models.
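
A minimal sketch of both transformations using only the standard library. The sample values are made up, and the functions assume the feature is not constant.

```python
import statistics

def min_max_normalise(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

def standardise(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]  # assumes std > 0

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
normalised = min_max_normalise(heights_cm)   # values in [0, 1]
standardised = standardise(heights_cm)       # zero mean, unit variance
```

In a real pipeline the minimum, maximum, mean and standard deviation should be computed on the training set only and then reused for validation and test data.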

5. Programming and Tools

Q23. Which languages do you use for AI development and why?

Python: Popular due to libraries like TensorFlow, PyTorch, scikit-learn
R: Preferred for statistics-heavy tasks
C++/Java: Used for performance-critical systems

Q24. How would you implement a neural network?

Using Python and PyTorch or TensorFlow:

Define layers and activation functions
Use loss function and optimiser
Train with forward pass and backpropagation
Evaluate on test data

Q25. What is temperature?

Temperature is a parameter that controls randomness when sampling from a probability distribution, such as when a model predicts the next word or token. It is commonly applied in language models such as GPT.

🔹 How Does It Work?

Given a probability distribution over possible next tokens, temperature adjusts the sharpness or softness of that distribution:

Low temperature (< 1.0) → Makes the model more confident, producing less random, more predictable outputs
High temperature (> 1.0) → Makes the model more random and creative, allowing less likely options
Temperature = 1.0 → No change to the original distribution

🔹 Formula (Simplified)

The adjusted probabilities $P_i$ are computed as:

$P_i = \frac{\exp(\log(p_i) / T)}{\sum_j \exp(\log(p_j) / T)}$

Where:

$p_i$ is the original probability
$T$ is the temperature
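
The formula translates directly into code. A minimal sketch, assuming every original probability is greater than zero; the three-token distribution is made up for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution so that P_i is proportional to exp(log(p_i) / T)."""
    logits = [math.log(p) / temperature for p in probs]  # assumes every p > 0
    m = max(logits)                                      # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]
sharp = apply_temperature(probs, 0.5)      # low T: top token becomes more likely
flat = apply_temperature(probs, 2.0)       # high T: distribution moves towards uniform
unchanged = apply_temperature(probs, 1.0)  # T = 1: same distribution
```

With T = 0.5 the top token's probability rises above 0.7, and with T = 2.0 it falls below 0.7, matching the behaviour described above.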
