16 Advanced Machine Learning Engineer - Python Interview Questions & Answers
Below is a list of our Advanced Machine Learning Engineer - Python interview questions, with answer advice and answer examples.
1. Have you used TensorFlow? Why is TensorFlow useful for machine learning?
This question shows the developer's knowledge of machine learning algorithms.
TensorFlow is an open-source machine learning library developed by Google. It is used by many machine learning engineers because it provides incredibly powerful tools for building and analyzing a vast variety of machine learning models.
TensorFlow is compatible with both JavaScript and Python, making it versatile for engineers creating machine learning models for desktop, mobile, web, and cloud.
It is advisable to install TensorFlow inside a virtual environment, as it is a relatively large package.
The technical interviewer may ask this as a "starter" question in order to open a discussion about machine learning algorithms.
Written by Ryan Brown on July 6th, 2021
2. What is syntactic analysis and semantic analysis in the field of Natural Language Processing?
This question shows the developer's understanding of machine learning analytics.
NLP, or Natural Language Processing, is a set of techniques used to analyze text in order to determine key metrics such as its sentiment.
NLP applies two techniques to help the computer "understand" text:
1. Syntactic Analysis
2. Semantic Analysis
Syntactic Analysis analyzes text using grammatical rules to identify sentence structure: how the words are organized and how they relate to each other.
Syntactic Analysis can be broken down into 4 main sub-tasks:
1. Tokenization: This process consists of breaking up the text into smaller components called tokens. This makes the text easier to analyze.
2. Part of Speech: PoS tagging or Part of Speech tagging labels the verbs, adjectives, adverbs, and nouns within the text. This provides context and helps to understand the meaning of the text.
3. Lemmatization and Stemming: This process consists of reducing inflected words to their base form. This makes the text easier to analyze.
4. Stop-word removal: This removes frequently occurring words that add little semantic value. Some examples of words removed are: I, they, have, like, etc.
Semantic Analysis
Semantic Analysis focuses on capturing the meaning of the text. There are two main steps involved in semantic analysis:
1. Word sense disambiguation - the process where the algorithm attempts to identify the context in which a given word is being used.
2. Relationship extraction - the process of determining how entities relate to each other within a given text, for example how a person relates to a place mentioned in the same passage.
Both syntactic and semantic techniques are used in NLP algorithms to help the algorithm "understand" text.
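Two of the syntactic sub-tasks above, tokenization and stop-word removal, can be sketched without any NLP library. This is a minimal, illustrative example in plain Python; the stop-word list is a small sample chosen for illustration, not a complete list:

```python
# A small illustrative stop-word list; real NLP libraries ship much larger ones
STOP_WORDS = {"i", "they", "have", "like", "it", "is", "the", "a"}

def tokenize(text):
    # Break the text into lowercase word tokens, dropping surrounding punctuation
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

def remove_stop_words(tokens):
    # Drop frequently occurring words that add little semantic value
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("I really love pizza, it is the best!")
print(remove_stop_words(tokens))
```

A library such as nltk performs the same steps far more robustly; this sketch only shows the idea behind them.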
Written by Ryan Brown on July 6th, 2021
3. What is the 'bag-of-words' algorithm?
This question shows the developer's knowledge of machine learning algorithms.
The bag-of-words model is used in NLP (Natural Language Processing). A given text is broken up into a set, or "bag", of words, disregarding the grammar and word order of the text.
The frequency of occurrence of each word is then recorded. This process is used to simplify the NLP algorithms. The technical interviewer may ask this question in order to determine if you are aware of various simplification processes before asking you to demonstrate a natural language processing algorithm.
Writing an NLP algorithm entirely from scratch can take a long time, so the technical interviewer may provide starter code for you to build on.
Written by Tiarnan Brady on June 13th, 2021
4. Demonstrate the 'bag of words' model and state the benefits and limitations of using this model.
This question shows the developer's knowledge of machine learning terminology and its purpose.
One of the benefits of using the "bag of words" model is that it simplifies some NLP algorithms. Some of the possible limitations of the model are related to the sparsity and the meaning of the text.
Sparsity refers to the bag-of-words model creating "sparse" vectors, which increases the spatial complexity of the algorithm.
Meaning refers to the context of the text. The bag-of-words model does not take into consideration the order of the words in the text, nor does it "understand" the context of the text. The "meaning" of the sentence is lost in this model.
Ensure that you have installed all the packages required for your algorithm.
Below is an example of the implementation of the "bag of words" model for a given sample text. The output shows the sample text along with the frequency calculation of each of the words.
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from collections import defaultdict

# word_tokenize requires the 'punkt' tokenizer data: nltk.download('punkt')
data = ['I really love pizza, it is delicious. I think it is the best',
        'She is a good person',
        'good people are the best']

# Tokenize each sentence and build the vocabulary
sentences = []
vocab = []
for sent in data:
    x = word_tokenize(sent)
    sentence = [w.lower() for w in x if w.isalpha()]
    sentences.append(sentence)
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

# Map each vocabulary word to an index in the output vector
len_vector = len(vocab)
index_word = {}
i = 0
for word in vocab:
    index_word[word] = i
    i += 1

def bag_of_words(sent):
    # Count word occurrences, then place the counts at the mapped indices
    count_dict = defaultdict(int)
    vec = np.zeros(len_vector)
    for item in sent:
        count_dict[item] += 1
    for key, item in count_dict.items():
        vec[index_word[key]] = item
    return vec

vector = bag_of_words(sentences[0])
print(sentences[0])
print(vector)
Written by Tiarnan Brady on June 13th, 2021
5. Write a script to demonstrate stemming.
This question shows the developer's ability to work with machine learning algorithms.
Stemming is the process of reducing inflected words to a common root or base form, called the stem. For example, a stemming algorithm may reduce "programs", "programmers", and "programmable" to the root "program".
The technical interviewer may ask you to demonstrate a particular process used in Natural Language Processing as opposed to deriving an entire algorithm from scratch as this may consume the vast majority of time in a technical interview.
Below is an example of how the nltk library can be leveraged to write a short script that carries out stemming of an array of words.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["programs", "programmers", "programmable"]
for w in words:
    print(w, ":", ps.stem(w))
Below is an example of how we can "stem" a sentence. First we must "tokenize" the words within the sentence:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "This is a sample sentence that demonstrates the application of the stemming algorithm as it shows the words being condensed into root variants"
words = word_tokenize(sentence)
for w in words:
    print(w, ":", ps.stem(w))
Written by Tiarnan Brady on June 13th, 2021
6. Write a script to demonstrate lemmatization.
This question shows the developer's knowledge of machine learning terminology and its purpose.
Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.
One of the primary benefits of lemmatization is that it takes the context of a word into consideration and returns a valid dictionary word, unlike stemming, which simply trims word endings and can produce non-words.
Below is a Python script that demonstrates lemmatization. Ensure that you have installed all the necessary libraries before running this code.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatiser = WordNetLemmatizer()
sentence = "Cats and dogs are similar. Shoes and horses aren't"
word_list = nltk.word_tokenize(sentence)
lemmatised_output = ' '.join([lemmatiser.lemmatize(w) for w in word_list])
print(lemmatised_output)
Written by Tiarnan Brady on June 13th, 2021
7. Explain 'look ahead bias'.
The technical interviewer is attempting to determine if you are aware of potential issues related to data and machine learning algorithms. Knowledge of common problems associated with machine learning algorithms shows that you have the expertise to debug and troubleshoot algorithms within the existing code base. This is an integral skill for any machine learning engineer.
Look-ahead bias occurs when the training of an algorithm uses information or data that would not have been known during the period being analyzed. This information may bias the algorithm's predictions or classifications. It is a problem because, going forward, the algorithm will not be privy to this information, and accuracy may suffer as a consequence.
Understanding what look-ahead bias is, and when it might have skewed a result or prediction of a machine learning algorithm, is important to prevent overconfidence in predictions.
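As a minimal sketch of how look-ahead bias creeps into time-series features, and one common way to avoid it, consider the following pandas example (the price values are purely illustrative):

```python
import pandas as pd

# Hypothetical daily price series (illustrative values)
prices = pd.Series([100, 102, 101, 105, 107, 106],
                   index=pd.date_range("2021-01-01", periods=6))

# WRONG: today's feature is built from today's close, which is not yet
# known at the time the day's prediction must be made -> look-ahead bias
feature_leaky = prices.pct_change()

# RIGHT: lag the feature by one day so each row only uses information
# that was available before the prediction time
feature_safe = prices.pct_change().shift(1)

print(feature_safe)
```

The lagged version loses one usable row, but every remaining row now reflects only information the model could genuinely have had at prediction time.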
Written by Ryan Brown on July 6th, 2021
8. Data cleaning - how would you 'clean' a dataset?
This question shows the developer's understanding of machine learning data and techniques.
Cleaning data is an important skill to have as a machine learning engineer. Having high-quality data can help improve the accuracy of a machine learning algorithm as the training data is more accurate and contains fewer errors and anomalies.
As a machine learning engineer and data scientist, you will often have to "clean" data. There are a number of techniques commonly used to clean data and ensure accuracy.
1. Remove duplicate or irrelevant data: Removing data points if they are not applicable to the training of the algorithm.
2. Fixing structural errors: Structural errors include incorrect naming conventions, typos, and incorrect capitalization, to name a few. For example, some of your data may have an "N/A" entry while other data points are recorded as "Not Applicable"; this inconsistency will decrease accuracy further down the line when training the algorithm.
3. Filter out unwanted outliers: This can be a difficult process, as you do not want to remove outliers that are correctly recorded. You only want to remove outliers that are incorrectly recorded and skew the data, making the algorithm less accurate.
4. Missing data: Missing data is a difficult problem machine learning engineers encounter when working with less-than-ideal data. There are three primary ways to mitigate its impact. One is to ignore any fields with missing data, which may impact the algorithm. Another is to impute the missing values based on prior or related results; this is not ideal either, as it may undermine the integrity of the data and the accuracy of the algorithm. The final method is to alter your algorithm so that it no longer requires the missing data; this reduces the dataset size, which can affect the training process, but it does not undermine the integrity of the data.
In all cases, missing data poses a major problem for machine learning engineers.
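The cleaning steps above can be sketched with pandas. The small dataset below is hypothetical, invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset exhibiting the issues described above
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cara"],
    "status": ["N/A", "N/A", "Not Applicable", "Active"],
    "weight_kg": [55.0, 55.0, np.nan, 60.0],
})

# 1. Remove exact duplicate rows
df = df.drop_duplicates()

# 2. Fix structural errors: unify inconsistent labels
df["status"] = df["status"].replace({"Not Applicable": "N/A"})

# 4. Handle missing values, here by dropping incomplete rows
df = df.dropna(subset=["weight_kg"])

print(df)
```

Which strategy to use for missing values (dropping, imputing, or changing the algorithm) depends on the dataset and the model, as discussed above; dropping rows is only one option.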
Written by Ryan Brown on July 6th, 2021
9. How would you analyze the quality of a dataset?
This question shows the developer's knowledge of machine learning data sets.
The ability to assess the quality of a given data set is incredibly important for a machine learning engineer. It is often said in the industry that the algorithm is only as good as the data it uses.
Training algorithms using low-quality data has a massive negative impact on their accuracy.
There are 5 primary features that can be used to assess the quality of a data set:
1. Validity
2. Accuracy
3. Completeness
4. Consistency
5. Uniformity
Validity relates to the degree to which your data conforms to the rules or constraints you impose on the data set.
Accuracy means that the data is representative of the actual observed values.
Completeness means that there is no "missing value" or in other words all the data that is required is known and recorded.
Consistency means that your data does not contradict other related data sets, such as the test set used to evaluate your algorithm.
Uniformity is the degree to which the measurements and units are the same. For example, if a dataset outlined the weight of individuals it is important to ensure the measurement only includes kilograms and not a mixture of pounds and kilograms as this would adversely affect the accuracy of the algorithm.
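As a minimal sketch of enforcing uniformity, the following converts hypothetical mixed-unit weight entries to kilograms; the values and column names are illustrative:

```python
import pandas as pd

# Hypothetical weights recorded in mixed units (illustrative values)
df = pd.DataFrame({"weight": [70.0, 154.0, 80.0],
                   "unit": ["kg", "lb", "kg"]})

# Enforce uniformity: convert every pound entry to kilograms
LB_TO_KG = 0.453592
mask = df["unit"] == "lb"
df.loc[mask, "weight"] = df.loc[mask, "weight"] * LB_TO_KG
df.loc[mask, "unit"] = "kg"

print(df)
```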
Written by Ryan Brown on July 6th, 2021
10. Demonstrate how to normalize a data set.
This question shows the developer's knowledge of machine learning terminology and its purpose.
Normalization is a scaling technique that is particularly useful when the data you are analyzing does not follow a Gaussian distribution. Normalization mitigates the impact large outliers have on skewing a data set. The dataset will be reduced to a range of values between 0 and 1.
A technical interviewer wants you to demonstrate what normalization is, why it is necessary and how to normalize a dataset.
The basic equation used to normalize data is as follows:
X_norm = (X - X_min) / (X_max - X_min)
Below is an example of how to normalize data using the min-max feature scaling method, matching the equation above:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5, 100, 222, 123], columns=["Col A"])
print(df)
df_min_max_scaled = df.copy()
for column in df_min_max_scaled.columns:
    col = df_min_max_scaled[column]
    df_min_max_scaled[column] = (col - col.min()) / (col.max() - col.min())
print(df_min_max_scaled)
Written by Tiarnan Brady on June 13th, 2021
11. Explain how you would analyze the performance of a machine learning algorithm.
The technical interviewer is testing your knowledge not only of machine learning techniques but also of the theory behind analyzing the performance of a given machine learning algorithm.
Analyzing performance as well as improving the accuracy of algorithms already in use within a given company is an important part of a machine learning engineer's job.
There are a number of metrics used to analyze the performance of a given algorithm. They include:
Accuracy
Precision
Recall
F1 score
Confusion matrices
These metrics can be used to understand if an algorithm is making decisions or classifications that are for the most part correct.
These metrics are based on analyzing the number of:
True Positives (TP)
False Positives (FP)
True Negatives (TN)
False Negatives (FN)
Leveraging a confusion matrix can help a developer to visualize the performance of an algorithm and make improvements to increase the performance for a given function.
The code below shows how to quickly implement performance analysis on the output of an algorithm using the scikit-learn library:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
y_pred = [0,3,1,2]
y_true = [0,1,1,5]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
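A confusion matrix can be produced with scikit-learn in the same way; below is a minimal sketch on a small set of illustrative labels:

```python
from sklearn.metrics import confusion_matrix

# Illustrative true labels and predictions for a binary classifier
y_true = [0, 1, 0, 0]
y_pred = [0, 1, 1, 0]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Reading the matrix row by row shows where the classifier goes wrong; here, one true-0 sample was predicted as 1 (a false positive).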
Written by Tiarnan Brady on June 13th, 2021
12. Suggest how you would increase the performance of a machine learning algorithm.
This question shows the developer's knowledge of the machine learning algorithm and its purpose.
Improving the performance of existing algorithms is an incredibly powerful skill to have. In some cases, such as self-driving cars and cancer diagnosis, ensuring that a machine learning model's predictions and decisions are accurate is of extreme importance.
Using the confusion matrix of True Positives, True Negatives, False Positives, and False Negatives is helpful in understanding "where" the algorithm is "struggling". For instance, a large number of false positives relative to the number of false negatives may indicate a positive bias within the algorithm.
There are technically no correct answers for a question framed this way; however, the technical interviewer wants to understand your thought process behind troubleshooting and your ability to think through a solution.
One possible starting point may be in the training of the algorithm. Many algorithms produce inaccurate conclusions as a result of the training process. Incomplete or inaccurate data in the training set may train the algorithm incorrectly.
Another possible source of error is the phenomenon of over or underfitting. This occurs when an algorithm either fits too closely to a set of data points or not close enough. This leads to inaccurate insight, prediction, and classifications.
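One simple way to spot overfitting is to compare training and test accuracy. Below is a minimal sketch using an unconstrained decision tree on the iris data set; the gap between the two scores is the signal of interest:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree can effectively memorize the training set
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

# A large gap between training and test accuracy suggests overfitting
print(train_acc, test_acc)
```

Constraining the model (for example with `max_depth`) or gathering more training data are typical ways to close such a gap.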
Knowledge of some common sources of error within machine learning algorithms demonstrates a deeper understanding of some of the practical problems in terms of implementation and performance analysis.
Written by Ryan Brown on July 6th, 2021
13. Explain a scenario where you used a machine learning algorithm? And why was it necessary to use machine learning?
The technical interviewer may ask open questions in an attempt to open up a discourse regarding machine learning methods. It is entirely possible that the technical interviewer will ask you to implement a simple algorithm based on the answer you provide for this question so it is best to stay with algorithms you are comfortable with.
Some examples of where you may use machine learning are as follows:
Self-driving cars
Cancer Diagnosis from medical scans
Stock market predictions
Voice interpretation and voice commands (Siri, Alexa, etc)
A follow-up question about why machine learning is required for these scenarios may be answered by talking through some of its benefits, such as:
It can analyze large data sets.
It can make incredibly accurate and quick decisions that outperform a human's ability.
It can lead to better outcomes and safer processes.
It removes a lot of mundane manual labor from processes.
Written by Ryan Brown on July 6th, 2021
14. Demonstrate a random forest.
This question shows the developer's understanding of machine learning algorithms.
A random forest algorithm can be broken down into 4 fundamental steps:
1. Selecting random samples from a given dataset.
2. Constructing a decision tree for each sample. From this, you will get a predicted result from each decision tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
This example will use the iris data set, a famous data set within machine learning, but feel free to use any data set you are comfortable with:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import pandas as pd

# Load the iris data set into a DataFrame
iris = datasets.load_iris()
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})

# Split the data into training and test sets
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
Y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

# Train a forest of 100 decision trees and evaluate it on the test set
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy", metrics.accuracy_score(y_test, y_pred))
Written by Tiarnan Brady on June 13th, 2021
15. I have a data set containing images of cats and dogs. Suggest an algorithm that can be used to classify these images.
The technical interviewer may ask these questions to determine your knowledge of both machine learning algorithms and their applications. This is generally an open-ended question where you should outline the pros and cons of the algorithm you have selected.
The main problem the machine learning algorithm selected should solve is image classification.
Some examples of effective image classification algorithms are:
Convolutional Neural Networks.
K-nearest neighbors.
Decision trees.
Support Vector Machines.
It is recommended that you select the algorithm you are most comfortable with, as the interviewer may ask you to demonstrate the algorithm using Python.
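As a minimal sketch, a K-nearest-neighbors classifier can be demonstrated on scikit-learn's bundled digits image data set, used here as a stand-in since the cat and dog images themselves are not available:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 8x8 grayscale digit images, flattened to 64-feature vectors
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Each test image is labeled by a vote among its 5 nearest training images
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

For larger, natural images such as cat and dog photos, a convolutional neural network would typically outperform KNN, at the cost of far more training complexity.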
Written by Ryan Brown on July 6th, 2021
16. What would you ask an advanced AI oracle/chat bot?
At the end of a technical interview, an interviewer may ask a "soft" skills question regarding company culture. These questions are often referred to as "fit" questions.
The interviewer may also end the interview with more philosophical questions regarding the future implications of machine learning in society and your views on the future of the field.
There is no correct answer to these types of questions. They are targeted at your thought process and your interest in the wider field, future innovation, and its implications.
These kinds of questions may also be interesting to ask your technical interviewer to open a dialogue and discussion.
Written by Ryan Brown on July 6th, 2021