DATA SCIENCE INTERVIEW QUESTIONS AND ANSWERS



Best Data Science Interview Questions and Answers

Do you believe that you have what it takes to be a part of the future of Data Science? GangBoard is here to guide you in nurturing your career. Various Fortune 1000 organizations around the world are utilizing the technology of Data Science to meet the needs of their customers, and Data Science is being used across numerous businesses. To help you grow in a Data Science job, our page furnishes you with detailed Data Science interview questions and answers. These Data Science interview questions and answers are prepared by industry experts with more than 10 years of experience. They are very useful to freshers and experienced candidates alike who are looking for a new, challenging job at a reputed company. Our Data Science questions and answers are very simple and include plenty of examples for your better understanding.

With these Data Science interview questions and answers, many students have been placed in reputed companies with high salary packages. So utilize our Data Science interview questions and answers to grow in your career.

Q1)What are the types of machine learning?

Supervised learning


Unsupervised learning


Reinforcement Learning


Q2)What is the Supervised learning in machine learning?

Supervised learning – When you know your target variable for the problem statement, it becomes Supervised learning. This can be applied to perform regression and classification.

Example: Linear Regression and Logistic Regression.
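
A minimal supervised-learning sketch (assuming scikit-learn is installed; the data is synthetically generated so the target y is known):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# features X together with a known target variable y
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)  # learn the mapping X -> y
print(model.predict(X[:5]))             # predicted labels for sample rows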

Q3)What is the Unsupervised learning in machine learning?

Unsupervised learning – When you do not know your target variable for the problem statement, it becomes Unsupervised learning. This is widely used to perform Clustering.

Example: K-Means and Hierarchical clustering.
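
A minimal unsupervised-learning sketch (assuming scikit-learn; the points below are made up, and no target labels are supplied):

import numpy as np
from sklearn.cluster import KMeans

# unlabeled data: there is no target variable
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # coordinates of the two centroids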

Q4)What are the commonly used python packages?

NumPy


pandas


scikit-learn


Matplotlib


Q5)What are the commonly used R packages?

caret


data.table


reshape


reshape2


e1071


DMwR


dplyr


lubridate


Q6)Name the commonly used algorithms.

Linear regression


Logistic regression


Random Forest


KNN


Q7)What is precision?

Precision is the ratio of correctly predicted positives (true positives) to all predicted positives.

It is one of the most commonly used error metrics in classification.

The range is from 0 to 1, where 1 represents 100%.

Q8)What is recall?

Recall is the ratio of true positives to all actual positives in the data.

The range is again from 0 to 1.

Q9)Which metric acts like accuracy in classification problem statement?

F1 Score – 2 * (Precision * Recall) / (Precision + Recall)
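
As a quick check, precision, recall, and the F1 score can all be computed with scikit-learn (a minimal sketch; the labels below are made up for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # 2 * (P * R) / (P + R)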

Q10) What is a normal distribution?

A normal distribution is a symmetric, bell-shaped distribution of data around its centre, such that the mean, median, and mode are equal.
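
A quick way to see this (a sketch assuming NumPy; the sample is randomly generated):

import numpy as np

# draw a large sample from a standard normal distribution
sample = np.random.normal(loc=0.0, scale=1.0, size=100000)

# mean and median are both close to the true centre (0.0)
print(np.mean(sample), np.median(sample))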

Q11)What is overfitting?

Any model with a high inconsistency between the training error and the test error poses a serious business problem. If the error rate on the training set is low but the error rate on the test set is high, we can conclude that the model is overfitting.

Q12)What is underfitting?

Any model that predicts poorly on both the training data and the test data poses a serious business problem. If the error rate on the training set is high and the error rate on the test set is also high, we can conclude that the model is underfitting.
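
A minimal sketch of how overfitting shows up in practice (assuming scikit-learn; the dataset is synthetic): an unconstrained decision tree memorizes the training data, so its training accuracy is near perfect while its test accuracy is noticeably lower.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # training accuracy, typically ~1.0
print(tree.score(X_te, y_te))  # test accuracy, noticeably lower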

Q13)What is a univariate analysis?

An analysis that is applied to one attribute at a time is called a univariate analysis.

The boxplot is one of the most widely used univariate plots.

Scatter plots and Cook's distance are among the methods used for bivariate and multivariate analysis.

Q14)Name few methods for Missing Value Treatments.

Central Imputation – This method works like the central tendencies. All missing values are filled with the mean or median for numerical attributes and with the mode for categorical attributes.

KNN – K Nearest Neighbour imputation.

The distance between two or more attributes is calculated using the Euclidean distance, and the nearest neighbours are used to treat the missing values. Mean and mode will again be used, as in central imputation.
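
Both treatments can be sketched in a few lines (assuming pandas and scikit-learn; the tiny data frame and its column names are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'age': [25.0, np.nan, 30.0, 28.0],
                   'city': ['NY', 'LA', None, 'NY']})

# central imputation: mean for numerical, mode for categorical
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# KNN imputation on numeric data (Euclidean-style distance by default)
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X))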

Q15)What is the Pearson correlation?

Correlation between predicted and actual data can be examined and understood using this method.

The range is from -1 to +1.

-1 refers to negative 100% whereas +1 refers to positive 100%.

The formula is r = m * Sd(x) / Sd(y), where m is the slope of the least-squares regression line of y on x.
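
In practice the coefficient is computed directly (a minimal sketch assuming NumPy; the two arrays are made-up sample data):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Pearson correlation coefficient, a value between -1 and +1
r = np.corrcoef(x, y)[0, 1]
print(r)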

Q16)How and by what methods data visualizations can be effectively used?

In addition to giving insights in a very effective and efficient manner, data visualization can also be used in such a way that it is not only restricted to bar, line or some stereotypic graphs. Data can be represented in a much more visually pleasing manner.

One thing that has to be taken care of is to convey the intended insight or finding correctly to the audience. Once that baseline is set, innovation and creativity can help you come up with better-looking and more functional dashboards. There is a fine line between a simple, insightful dashboard and an awesome-looking dashboard that yields no fruitful insight.

Q17)How to understand the problems faced during data analysis?

Most of the problems faced during hands-on analysis or data science work stem from a poor understanding of the problem at hand, and from concentrating more on tools, end results, and other aspects of the project.

Breaking the problem down to a granular level and understanding it takes a lot of time and practice to master. Going back to square one in data science projects can be seen in a lot of companies, and even in your own projects or Kaggle problems.

Q18)Advantages of Tableau Prep?

Tableau Prep reduces a lot of time, just as its parent software (Tableau) does when creating impressive visualizations. The tool has a lot of potential in taking professionals from the data cleaning and merging steps to creating final usable data that can be linked to Tableau Desktop for visualization and business insights. A lot of manual tasks are reduced, and the time saved can be used to produce better findings and insights.

Q19)What is the common perception about visualization?

People think of visualization as just charts and summary information. But visualizations go beyond that and drive business with a lot of underlying principles. Learning design principles can help anyone build effective and efficient visualizations, and the Tableau Prep tool can drastically increase the time we spend on the more important parts. The only issue with Tableau is that it is a paid product, and companies need to pay to leverage that awesome tool.

Q20)What are the time series algorithms?

Time series algorithms like ARIMA, ARIMAX, SARIMA, and Holt-Winters are very interesting to learn, and very useful for solving a lot of complex business problems. Data preparation for time series analysis plays a vital role: stationarity, seasonality, cycles, and noise need time and attention. Take as much time as you need to make the data right; then you can run any model on top of it, as the sketch below illustrates.
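
A minimal forecasting sketch (assuming the statsmodels package is installed; the monthly series here is randomly generated for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# synthetic monthly series standing in for real business data
idx = pd.date_range('2020-01-01', periods=48, freq='MS')
y = pd.Series(np.cumsum(np.random.randn(48)) + 50.0, index=idx)

model = ARIMA(y, order=(1, 1, 1)).fit()  # (p, d, q) orders
print(model.forecast(steps=6))           # six-step-ahead forecast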

Q21)How to choose the right chart in case of creating a viz?

Using the right chart to represent data is one of the key aspects of data visualization and design principles. You will always have several chart options to choose from, but settling on the right one comes only with experience, practice, and a deep understanding of end-user needs. That dictates everything in the dashboard.

Q22) Where to seek help in case of discrepancies in Tableau?

When you face any issue regarding Tableau, try searching the Tableau community forum. It is one of the best places to get your queries answered; you can write your question and usually get an answer within an hour or a day. You can also post on LinkedIn and follow people.

Q23)Now companies are heavily investing their money and time to make the dashboards. Why?

To make stakeholders more aware of the business through data. Working on visualization projects helps you develop one of the key skills every data scientist should possess: thinking from the shoes of the end user.

If you're learning any visualization tool, download a dataset from Kaggle. Building charts and graphs for the dashboard should be the last step. Research the domain and think about the KPIs you would like to see in the dashboard if you were the end user. Then start building the dashboard piece by piece.

Q24)How can I achieve accuracy in the first model that I built?

Building machine learning models involves a lot of interesting steps. A 90% accuracy model doesn't come on the very first attempt. You have to apply a lot of feature selection techniques to get to that point, which means it involves a lot of trial and error. The process will help you learn new concepts in statistics, math, and probability.

Q25)What is the basic responsibility of a Data Scientist?

As a data scientist, we have the responsibility to make complex things simple enough that anyone without context can understand what we are trying to convey.

The moment we start explaining even the simple things, the mission of making the complex simple goes away. This happens a lot when we are doing data visualization.

Less is more. Rather than pushing too much information onto the reader's brain, we need to figure out how easily we can help them consume a dashboard or a chart.

The process is simple to state but difficult to implement. You must bring the complex business value out of a self-explanatory chart. It's a skill every data scientist should strive towards and have in their arsenal.

Q26)How do I become a SAS analyst?

Step 1: Earn a college degree. Businesses prefer SAS programmers who have completed a bachelor's degree program in statistics or computer science.

Step 2: Acquire SAS certification.

Step 3: Consider getting an advanced degree.

Step 4: Gain SAS programming work experience.

Q27)What does SAS stand out to be the best over other data analytics tools?

Ease of learning: The features included in SAS are remarkably easy to learn, and it offers the most suitable option for those who already know SQL. R, on the other hand, comes with a steep learning curve, as it follows a low-level programming style.

Data handling capability: SAS is on par with the leading tools, including R and Python, and when it comes to handling huge data it is one of the best platforms.

Graphical capability: SAS comes with functional graphical capabilities, and only a limited amount of knowledge is needed to customize the plots.

Better tool management: SAS releases its updates under controlled conditions, which is the main reason it is well tested. R and Python, by contrast, have open contribution, so the risk of errors in the latest development versions is also high.

Q28)What is RUN-Group processing?

To use RUN-group processing, you start the procedure and then submit many RUN-groups.

A RUN-group is a group of statements that contains at least one action statement and ends with a RUN statement. It can contain other SAS statements such as AXIS, BY, GOPTIONS, LEGEND, or WHERE.

Q29)Definitions of is BY-Group processing?

BY-group processing is a method of processing observations from one or more SAS data sets that are grouped or ordered by the values of one or more shared variables. All data sets that are being combined must include one or more BY variables.

Q30)What is the right way to validate the SAS program?

Write OPTIONS OBS=0 at the beginning of the code. This runs the program without processing any observations, so it can be validated by checking the log, where errors are recognized by the colors that get highlighted.

Q31)Do you know any SAS functions and Call Routines?

Arguments to SAS functions and CALL routines can be variables, constants, or any SAS expression, including another function. Some of the arguments that SAS accepts are used for special purposes. Multiple arguments are separated with commas.

Q32)What is meant by precision and recall?

Recall:

It is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives present in the data.

Precision:

It is also known as the positive predictive value, and it is based on the predictions: it indicates the number of accurate positives the model claims compared to the total number of positives it claims.

Q33)What is deep learning?

Deep learning is a subset of machine learning that uses multi-layered (deep) neural networks to learn from data.

Q34)What is the F1 score?

The F1 score is defined as a measure of a model’s performance.

Q35)How is the F1 score used?

The F1 score is the harmonic mean of the precision and recall of a model. An F1 score of 1 is classified as best and 0 as worst.

Q36)What is the difference between Machine learning Vs Data Mining?

Data mining is about working on large amounts of data and extracting patterns from it, to the point where unusual and previously unknown patterns are identified.

Machine learning is the study of the design and development of algorithms that give computers the capacity to learn.

Q37)What are confounding variables?

These are extraneous variables in a statistical model that correlate directly or inversely with both the independent and the dependent variable. A study fails when it does not account for the confounding factor.

Q38)How can you randomize the items of a list in place in Python?

Consider the example shown below:

from random import shuffle

x = ['Data', 'Class', 'Blue', 'Flag', 'Red', 'Slow']

shuffle(x)

print(x)

One possible output of the code is shown below (the result is random).

['Red', 'Data', 'Blue', 'Slow', 'Class', 'Flag']

Q39)How to get indices of N maximum values in a NumPy array?

We can get the indices of N maximum values in a NumPy array using the below code:

import numpy as np

arr = np.array([1, 3, 2, 4, 5])

print(arr.argsort()[-3:][::-1])

Output

[4 3 1]

Q40)How do you make 3D plots/visualizations using NumPy/SciPy?

Like 2D plotting, 3D graphics is beyond the scope of NumPy and SciPy, but just as in the 2D case, packages exist that integrate with NumPy. Matplotlib provides basic 3D plotting in the mplot3d subpackage, whereas Mayavi provides a wide range of high-quality 3D visualization features, utilizing the powerful VTK engine.
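
A minimal mplot3d sketch (assuming a recent Matplotlib; the surface data is synthetic):

import numpy as np
import matplotlib.pyplot as plt

# sample surface: z = sin(r) over a grid
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # enables mplot3d axes
ax.plot_surface(X, Y, Z)
plt.show()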

Q41)What are the types of biases that can occur during sampling?

Some simple types of selection bias are described below. Undercoverage occurs when some members of the population are inadequately represented in the sample; a classic example is a survey that relies on a convenience sample drawn from telephone directories and car registration lists.

Selection bias


Undercoverage bias


Survivorship bias


Q42)Which Python library is used for data visualization?

Plotly, also known as Plot.ly because of its main online platform, is a widely used option. It is an interactive online visualization tool for data analytics, scientific graphs, and other visualizations, and it offers some great APIs, including one for Python.
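
A minimal sketch with Plotly's Python API (assuming the plotly package is installed; the iris sample dataset ships with the library):

import plotly.express as px

df = px.data.iris()  # built-in sample dataset
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()           # opens an interactive chart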

Q43)Write code to sort a DataFrame in Python in descending order.

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Sort by the values along either axis.

Parameters:

by: str or list of str
Name or list of names to sort by. If axis is 0 or 'index', then by may contain index levels and/or column labels; if axis is 1 or 'columns', then by may contain column levels and/or index labels. Changed in version 0.23.0: allows specifying index or column level names.

axis: {0 or 'index', 1 or 'columns'}, default 0
Axis to be sorted.

ascending: bool or list of bool, default True
Sort ascending vs. descending. Specify a list for multiple sort orders; if this is a list of bools, it must match the length of by.

inplace: bool, default False
If True, perform the operation in place.

kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
Choice of sorting algorithm. See also np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.

na_position: {'first', 'last'}, default 'last'
'first' puts NaNs at the beginning, 'last' puts NaNs at the end.

Returns:

sorted_obj: DataFrame
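
To answer the question directly, passing ascending=False sorts in descending order (a minimal sketch; the data frame and its 'score' column are made up for illustration):

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [88, 95, 70]})

# sort rows by 'score' in descending order
print(df.sort_values(by='score', ascending=False))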

Q44)Why you should use NumPy arrays instead of nested Python lists?

Let's say you have a list a of numbers, and you want to add 1 to every element of the list.

In regular Python, you would do:

a = [6, 2, 1, 4, 3]

b = [e + 1 for e in a]

Whereas with NumPy, you simply have to do:

import numpy as np

a = np.array([6, 2, 1, 4, 3])

b = a + 1

This also works with every NumPy mathematical function: you can take the exponential of every element of a list using np.exp, for example.
