Friday, 12 August 2022

TYPES OF MACHINE LEARNING ALGORITHMS

Hui Lin

2017-07-08

The categorization here is based on either the structure of the model (such as tree models and regularization methods) or the type of question it answers (such as regression). The summary of algorithms in this section is based on Jason Brownlee’s blog post “A Tour of Machine Learning Algorithms” (http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/). I added and removed some algorithms in each category and added comments. It is far from perfect, but it helps to show a bigger map of the different algorithms. Some algorithms can legitimately be placed in multiple categories: the support vector machine (SVM), for example, can be a classifier and can also be used for regression. So you may see other ways of grouping. Also, the following summary does not list every existing algorithm (there are just too many).

Regression

Regression can refer to the algorithm or a particular type of problem. It is supervised learning. Regression is one of the oldest and most widely used statistical models. It is often called the statistical machine learning method. Standard regression models are:

Ordinary Least Squares Regression
Logistic Regression
Multivariate Adaptive Regression Splines (MARS)
Locally Estimated Scatterplot Smoothing (LOESS)

Least squares regression and logistic regression are traditional statistical models, and both are highly interpretable. MARS is similar to neural networks and partial least squares (PLS) in that they all use surrogate features instead of the original predictors.

They differ in how they create the surrogate features. PLS and neural networks use linear combinations of the original predictors as surrogate features. (To be clear, in neural networks the linear combinations of predictors are put through non-linear activation functions, and deeper neural networks have many layers of non-linear transformation.) MARS creates two contrasted versions of a predictor, split at a truncation point. LOESS is a non-parametric model, usually used only for visualization.
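
As a rough illustration of the "two contrasted versions of a predictor" idea, the sketch below builds a pair of hinge features around a truncation point. The function name `hinge_pair`, the sample values, and the knot at 3.0 are made up for this example; a real MARS fit also searches for the knots, which is not shown here.

```python
def hinge_pair(x, knot):
    """Return the two MARS-style hinge features for value x at the given knot."""
    right = max(0.0, x - knot)  # non-zero only when x is above the knot
    left = max(0.0, knot - x)   # non-zero only when x is below the knot
    return right, left

# Hypothetical predictor values and a hypothetical knot at 3.0
values = [1.0, 3.0, 4.5]
features = [hinge_pair(v, knot=3.0) for v in values]
print(features)
```

Each original value is replaced by two features, one of which is always zero, so a linear model on these features can have a different slope on each side of the knot.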

Similarity-based Algorithms

This type of model is based on a similarity measure. There are three main steps: (1) compare the new sample with the existing ones; (2) search for the closest sample; and (3) use the response of the nearest sample as the prediction.

K-Nearest Neighbour [KNN]
Learning Vector Quantization [LVQ]
Self-Organizing Map [SOM]

The biggest advantage of this type of model is that it is intuitive. K-Nearest Neighbour is generally the most popular algorithm in this set. The other two are less common. The key to similarity-based algorithms is to find an appropriate distance metric for your data.
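
The three steps of a similarity-based model can be sketched in a few lines. This is a minimal K-Nearest Neighbour classifier, assuming Euclidean distance and majority voting; the toy data and the function name `knn_predict` are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, new_x, k=3):
    """Classify new_x by majority vote among its k nearest training samples."""
    # Step 1: compare the new sample with every existing one (Euclidean distance)
    distances = [(math.dist(x, new_x), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k closest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: use the most common label among them as the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0), k=1))
```

Swapping `math.dist` for another distance function is where the choice of metric mentioned above would come in.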

Feature Selection Algorithms

The primary purpose of feature selection is to exclude non-informative or redundant variables and thereby reduce dimension. It is possible that all the independent variables are significant for explaining the response, but more often the response is related to only a portion of the predictors. We will cover feature selection in detail later.

Filter method
Wrapper method
Embedded method

The filter method focuses on the relationship between a single feature and the target variable. It evaluates each feature (or independent variable) before modeling and selects the “important” variables.

The wrapper method removes variables according to a particular rule and finds the feature combination that optimizes the model fit by evaluating a set of feature combinations. In essence, it is a search algorithm.

The embedded method is part of the machine learning model itself. Some models, such as the lasso and decision trees, have a built-in variable selection function.
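
To make the filter method concrete, here is a small sketch that scores each feature on its own against the target and ranks them. Using absolute Pearson correlation as the score is an assumption for illustration (many other scores are possible), and the data and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def filter_rank(features, target):
    """Rank features by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: x1 tracks the target closely, x2 is unrelated to it
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "x1": [1.1, 1.9, 3.2, 3.9, 5.1],
    "x2": [2.0, -1.0, 2.0, -1.0, 2.0],
}
print(filter_rank(features, target))
```

Because every feature is scored in isolation, this kind of filter is cheap but cannot detect features that only matter in combination, which is the gap wrapper methods try to cover.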

Regularization Method

This method itself is not a complete model, but rather an add-on to other models (such as regression models). It adds a penalty function to the criterion the original model uses to estimate its parameters (such as the likelihood function or the sum of squared errors). In this way, it penalizes model complexity and shrinks the model parameters, which is why these are called “shrinkage methods.” This approach is advantageous in practice.

Ridge Regression
Least Absolute Shrinkage and Selection Operator (LASSO)
Elastic Net
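
The shrinkage effect is easiest to see in the simplest possible case. The sketch below assumes a single predictor, no intercept, and a squared (ridge-style) penalty, where the penalized least squares slope has a closed form; the data and the name `ridge_slope` are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a no-intercept, one-predictor regression with an L2 penalty.

    Minimizes sum((y - b*x)^2) + lam * b^2, which gives
    b = sum(x*y) / (sum(x^2) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data generated roughly as y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the estimated slope further toward zero
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With `lam = 0.0` this is ordinary least squares; the lasso and elastic net use different penalties and generally need an iterative solver rather than a one-line formula.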

Decision Tree

Decision trees are no doubt among the most popular machine learning algorithms. Thanks to all kinds of software, implementation is a no-brainer that requires nearly zero understanding of the mechanism. The following are some common tree-based methods:

Classification and Regression Tree (CART)
Iterative Dichotomiser 3 (ID3)
C4.5
Random Forest
Gradient Boosting Machines (GBM)

Bayesian Models

People often confuse Bayes’ theorem with Bayesian models. Bayes’ theorem is an implication of probability theory, and it gives Bayesian data analysis its name.

A Bayesian model is not the same thing as Bayes’ theorem. Given a likelihood, parameters to estimate, and a prior for each parameter, a Bayesian model treats the estimates as a purely logical consequence of those assumptions. The resulting estimates form the posterior distribution, which is the relative plausibility of different parameter values conditional on the observations. The models listed here are not Bayesian in that strict sense; rather, they are models that use Bayes’ theorem.

Naïve Bayes
Averaged One-Dependence Estimators (AODE)
Bayesian Belief Network (BBN)
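
The likelihood-plus-prior recipe described above can be sketched with a grid approximation. The example assumes a coin-tossing setting, a binomial likelihood, and a flat prior, none of which come from the text; it is only meant to show how a posterior over parameter values is obtained.

```python
def grid_posterior(heads, tosses, grid_size=101):
    """Posterior over a coin's heads probability on an evenly spaced grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior: an assumption for this sketch
    # Binomial likelihood up to a constant that cancels on normalization
    likelihood = [p ** heads * (1 - p) ** (tosses - heads) for p in grid]
    unnormalized = [l * pr for l, pr in zip(likelihood, prior)]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]

# Hypothetical observation: 7 heads in 10 tosses
grid, posterior = grid_posterior(heads=7, tosses=10)
best = max(range(len(grid)), key=lambda i: posterior[i])
print(grid[best])
```

The result is a whole distribution over the parameter, not a single estimate; the most plausible value is just one summary of it.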

Kernel Methods

The most common kernel method is the support vector machine (SVM). This type of algorithm maps the input data to a higher-dimensional vector space where classification or regression problems are easier to solve.

Support Vector Machine (SVM)
Radial Basis Function (RBF)
Linear Discriminant Analysis (LDA)

Clustering Methods

Like regression, clustering sometimes refers to a class of problems and sometimes to a class of algorithms. A clustering algorithm usually groups similar samples into categories in a centroid-based or hierarchical manner. The following two are the most common clustering methods:

K-Means
Hierarchical Clustering
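
A centroid-based method such as K-Means alternates between two steps, which the following sketch spells out. It assumes Euclidean distance and a naive initialization from the first k points; the toy data are invented, and production implementations use better initialization and stopping rules.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    centroids = list(points[:k])  # naive initialization: first k points (an assumption)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical toy data: two visibly separate groups
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.9), (0.8, 1.1), (5.1, 4.8), (4.9, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```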

Association Rule

The basic idea of an association rule is: when events occur together more often than one would expect from their rates of occurrence, such co-occurrence is an interesting pattern. The most used algorithms are:

Apriori algorithm
Eclat algorithm
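
The "more often than one would expect" idea is usually quantified with support, confidence, and lift. The sketch below computes these for a single rule on made-up shopping baskets; it is not the Apriori or Eclat algorithm itself, which are concerned with efficiently searching over many candidate itemsets.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_c = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = has_both / n
    confidence = has_both / has_a
    # Lift > 1 means the items co-occur more often than independence would predict
    lift = support / ((has_a / n) * (has_c / n))
    return support, confidence, lift

# Hypothetical shopping baskets
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "eggs"},
]
support, confidence, lift = rule_metrics(transactions, "bread", "butter")
print(support, confidence, lift)
```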

Artificial Neural Network

The term neural network has evolved to encompass a repertoire of models and learning methods. There has been a lot of hype around this model family, making it seem magical and mysterious. A neural network is a two-stage regression or classification model. The basic idea is that it uses linear combinations of the original predictors as surrogate features, and then the new features are put through non-linear activation functions to get hidden units in the second stage. When there are multiple hidden layers, it is called deep learning, another overhyped term. Among the varieties of neural network models, the most widely used “vanilla” net is the single-hidden-layer back-propagation network.

Perceptron Neural Network
Back Propagation
Hopfield Network
Self-Organizing Map (SOM)
Learning Vector Quantization (LVQ)
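
The "linear combinations, then non-linear activation" description corresponds to the forward pass of a single-hidden-layer network. The sketch below assumes a sigmoid activation and a single regression-style output, with hand-picked weights that are purely illustrative; training by back-propagation is not shown.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a single-hidden-layer network with one output."""
    # Linear combinations of the predictors, passed through a non-linearity
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(hidden_weights, hidden_biases)
    ]
    # A linear model on the hidden units gives the output
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical, hand-picked weights: 2 inputs, 2 hidden units, 1 output
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [2.0, -1.0]
output_bias = 0.5

prediction = forward([1.0, 1.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(prediction)
```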

Deep Learning

The name is a little misleading. As mentioned before, it is a multilayer neural network. It has been hyped tremendously, especially after AlphaGo defeated Lee Sedol at the board game Go. We don’t have much experience with the application of deep learning and are not in the right position to talk more about it. Here are some of the common algorithms:

Restricted Boltzmann Machine (RBM)
Deep Belief Networks (DBN)
Convolutional Network
Stacked Autoencoders
Long short-term memory (LSTM)

Dimensionality Reduction

Its purpose is to construct new features that have significant physical or statistical characteristics, such as capturing as much of the variance as possible.

Principal Component Analysis (PCA)
Partial Least Squares Regression (PLS)
Multi-Dimensional Scaling (MDS)
Exploratory Factor Analysis (EFA)

PCA attempts to find uncorrelated linear combinations of original variables that can explain the variance to the greatest extent possible. EFA also tries to explain as much variance as possible in a lower dimension. MDS maps the observed similarity to a low dimension, such as a two-dimensional plane. Instead of extracting underlying components or latent factors, MDS attempts to find a lower-dimensional map that best preserves all the observed similarities between items. So it needs to define a similarity measure as in clustering methods.
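
To illustrate what "the linear combination that explains the most variance" means, the sketch below finds the first principal direction of a small data set by power iteration on the sample covariance matrix. Power iteration is used here only because it needs no external libraries; the data and function name are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Leading principal direction via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v (the quadratic form v' C v)
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance

# Hypothetical data lying close to the line y = x
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

For data that lie near a line, almost all of the total variance falls along this one direction, which is why a single component can stand in for both original variables.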

Ensemble Methods

Ensemble methods made their debut in the 1990s. The idea is to build a prediction model by combining the strengths of a collection of simpler base models. Bagging, originally proposed by Leo Breiman, is one of the earliest ensemble methods. After that, people developed Random Forest [@Ho1998; @amit1997] and boosting methods [@Valiant1984; @KV1989]. This is a class of powerful and effective algorithms.

Bootstrapped Aggregation (Bagging)
Random Forest
Gradient Boosting Machine (GBM)
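
The bagging idea, fitting the same base model on many bootstrap resamples and averaging the results, can be sketched as follows. The nearest-neighbour base model is a deliberately simple placeholder chosen to keep the example short, not a recommendation, and the data are made up.

```python
import random

def nearest_neighbor_predict(sample, x):
    """Base model: return the response of the training point closest to x."""
    return min(sample, key=lambda pair: abs(pair[0] - x))[1]

def bagging_predict(data, x, n_models=25, seed=0):
    """Average the base model's predictions over bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement
        sample = [rng.choice(data) for _ in data]
        predictions.append(nearest_neighbor_predict(sample, x))
    return sum(predictions) / n_models

# Hypothetical (x, y) pairs with a noisy upward trend
data = [(1, 1.2), (2, 1.9), (3, 3.4), (4, 3.8), (5, 5.1), (6, 5.9)]
print(bagging_predict(data, x=3.5))
```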

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Thursday, 11 August 2022

Scientist Cafe

https://scientistcafe.com/

ML algo types

Tuesday, 9 August 2022

Data Mining vs Data Analysis

data-analyst-interview-questions-and-answers

coursera:data-analyst-interview-questions-and-answers

15 Data Analyst Interview Questions and Answers

Written by Coursera Staff • Updated on Feb 5, 2024

Enter your data analyst interview with confidence by preparing with these 15 interview questions.

If you’re like many people, the job interview can be one of the most intimidating parts of the job search process. But it doesn’t have to be. With some advance preparation, you can walk into your data analyst interview feeling calm and confident.

In this article, we’ll review some of the most common interview questions you’ll likely encounter as you apply for an entry-level data analyst position. We’ll walk through what the interviewer is looking for and how best to answer each question. Finally, we’ll cover some tips and best practices for interviewing success. Let’s get started.

General data analyst interview questions

These questions cover data analysis from a high level and are more likely to appear early in an interview.

1. Tell me about yourself.

What they’re really asking: What makes you the right fit for this job?

This question can sound broad and open-ended, but it’s really about your relationship with data analytics. Keep your answer focused on your journey toward becoming a data analyst. What sparked your interest in the field? What data analyst skills do you bring from previous jobs or coursework?

As you formulate your answer, try to answer these three questions:

What excites you about data analysis?
What excites you about this role?
What makes you the best candidate for the job?

An interviewer might also ask:

What made you want to become a data analyst?
What brought you here?
How would you describe yourself as a data analyst?

2. What do data analysts do?

What they’re really asking: Do you understand the role and its value to the company?

If you’re applying for a job as a data analyst, you likely know the basics of what data analysts do. Go beyond a simple dictionary definition to demonstrate your understanding of the role and its importance.

Outline the main tasks of a data analyst: identify, collect, clean, analyze, and interpret. Talk about how these tasks can lead to better business decisions, and be ready to explain the value of data-driven decision-making.

An interviewer might also ask:

What is the process of data analysis?
What steps do you take to solve a business problem?
What is your process when you start a new project?

3. What was your most successful/most challenging data analysis project?

What they’re really asking: What are your strengths and weaknesses?

When an interviewer asks you this type of question, they’re often looking to evaluate your strengths and weaknesses as a data analyst. How do you overcome challenges, and how do you measure the success of a data project?

Getting asked about a project you’re proud of is your chance to highlight your skills and strengths. Do this by discussing your role in the project and what made it so successful. As you prepare your answer, take a look at the original job description. See if you can incorporate some of the skills and requirements listed.

If you get asked the negative version of the question (least successful or most challenging project), be honest as you focus your answer on lessons learned. Identify what went wrong (maybe your data was incomplete or your sample size was too small) and talk about what you’d do differently in the future to correct the error. We’re human, and mistakes are a part of life. What’s important here is your ability to learn from them.

An interviewer might also ask:

Walk me through your portfolio.
What is your greatest strength as a data analyst? How about your greatest weakness?
Tell me about a data problem that challenged you.

4. What’s the largest data set you’ve worked with?

What they’re really asking: Can you handle large data sets?

Many businesses have more data at their disposal than ever before. Hiring managers want to know you can work with large, complex data sets. Focus your answer on the size and type of data. How many entries and variables did you work with? What types of data were in the set?

The experience you highlight doesn't have to come from a job. You’ll often have the chance to work with data sets of varying sizes and types as a part of a data analysis course, bootcamp, certificate program, or degree. As you put together a portfolio, you may also complete some independent projects where you find and analyze a data set. All of this is valid material to build your answer.

An interviewer might also ask:

What type of data have you worked with in the past?

Data analysis process questions

The work of a data analyst involves a range of tasks and skills. Interviewers will likely ask questions specific to various parts of the data analysis process to evaluate how well you perform each step.

5. Explain how you would estimate … ?

What they’re really asking: What’s your thought process? Are you an analytical thinker?

With this type of question (sometimes called a guesstimate), the interviewer presents you with a problem to solve. How would you estimate the best month to offer a discount on shoes? How would you estimate the weekly profit of your favorite restaurant?

The purpose here is to evaluate your problem-solving ability and overall comfort working with numbers. Since this is about how you think, think out loud as you work through your answer.

What types of data would you need?
Where might you find that data?
Once you have the data, how would you use it to calculate an estimate?

6. What is your process for cleaning data?

What they’re really asking: How do you handle missing data, outliers, duplicate data, etc.?

As a data analyst, data preparation, also known as data cleaning or data cleansing, will often account for the majority of your time. A potential employer will want to know that you’re familiar with the process and why it’s important.

In your answer, briefly describe what data cleaning is and why it’s important to the overall process. Then walk through the steps you typically take to clean a data set. Consider mentioning how you handle:

Missing data
Duplicate data
Data from different sources
Structural errors
Outliers
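
When walking through your cleaning steps, a small concrete example can help. The sketch below handles three of the items above (duplicates, missing values, outliers) on made-up records using only the Python standard library. The field names, the median imputation, and the 1.5 × IQR outlier rule are assumptions chosen for illustration; the right choices depend on the data.

```python
import statistics

def clean(records):
    """Deduplicate, fill missing values, and flag outliers in a list of dicts."""
    # Duplicate data: keep the first occurrence of each id
    seen, unique = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(dict(r))
    # Missing data: impute with the median of the observed values
    observed = [r["amount"] for r in unique if r["amount"] is not None]
    median = statistics.median(observed)
    for r in unique:
        if r["amount"] is None:
            r["amount"] = median
    # Outliers: flag values more than 1.5 * IQR beyond the quartiles
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    for r in unique:
        r["outlier"] = not (low <= r["amount"] <= high)
    return unique

# Hypothetical records with one duplicate, one missing value, one extreme value
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.0},
    {"id": 2, "amount": 12.0},
    {"id": 3, "amount": None},
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 13.0},
    {"id": 6, "amount": 500.0},
]
cleaned = clean(records)
print(cleaned)
```

Flagging outliers rather than deleting them keeps the decision about what to do with them visible, which is usually easier to defend in an interview.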

An interviewer might also ask:

How do you deal with messy data?
What is data cleaning?

7. How do you explain technical concepts to a non-technical audience?

What they’re really asking: How are your communication skills?

While drawing insights from data is a critical skill for a data analyst, communicating those insights to stakeholders, management, and non-technical co-workers is just as important.

Your answer should include the types of audiences you’ve presented to in the past (size, background, context). If you don’t have a lot of experience presenting, you can still talk about how you’d present data findings differently depending on the audience.

An interviewer might also ask:

What is your experience conducting presentations?
Why are communication skills important to a data analyst?
How do you present your findings to management?

Tip: In some cases, your interviewer might not be involved in data analysis. The entire interview, then, is an opportunity to demonstrate your ability to communicate clearly. Consider practicing your answers on a non-technical friend or family member.

8. Tell me about a time when you got unexpected results.

What they’re really asking: Do you let the data or your expectations drive your analysis?

Effective data analysts let the data tell the story. After all, data-driven decisions are based on facts rather than intuition or gut feelings. When asking this question, an interviewer might be trying to determine:

How you validate results to ensure accuracy
How you overcome selection bias
If you’re able to find new business opportunities in surprising results

Be sure to describe the situation that surprised you and what you learned from it. This is your opportunity to demonstrate your natural curiosity and excitement to learn new things from data.

9. How would you go about measuring the performance of our company?

What they’re really asking: Have you done your research?

Before your interview, be sure to do some research on the company, its business goals, and the larger industry. Think about the types of business problems that could be solved through data analysis, and what types of data you’d need to perform that analysis. Read up on how data is used by competitors and in the industry.

Show that you can be business-minded by tying this back to the company. How would this analysis bring value to their business?

Technical skill questions

Interviewers will be looking for candidates who can leverage a wide range of technical data analyst skills. These questions are geared toward evaluating your competency across several skills.

10. What data analytics software are you familiar with?

What they’re really asking: Do you have basic competency with common tools? How much training will you need?

This is a good time to revisit the job listing to look for any software emphasized in the description. As you answer, explain how you’ve used that software (or something similar) in the past. Show your familiarity with the tool by using associated terminology.

Mention software solutions you’ve used for various stages of the data analysis process. You don’t need to go into great detail here. What you used and what you used it for should suffice.

An interviewer might also ask:

What data software have you used in the past?
What data analytics software are you trained in?

11. What scripting languages are you trained in?

As a data analyst, you’ll likely have to use SQL and a statistical programming language like R or Python. If you’re already familiar with the language of choice at the company you’re applying to, great. If not, you can take this time to show enthusiasm for learning. Point out that your experience with one (or more) languages has set you up for success in learning new ones. Talk about how you’re currently growing your skills.

An interviewer might also ask:

What functions in SQL do you like most?
Do you prefer R or Python?

Five SQL interview questions for data analysts

Knowledge of SQL is one of the most important skills you can have as a data analyst. Many interviews for data analyst jobs include an SQL screening where you’ll be asked to write code on a computer or whiteboard. Here are five SQL questions and tasks to prepare for:

1. Create an SQL query: Be ready to use JOIN clauses and the COUNT function to show a query result from a given database.

2. Describe an SQL query: Given an SQL query, explain what data is being retrieved.

3. Modify a database: Insert new rows, modify existing records, or permanently delete records from a database.

4. Debug a query: Correct the errors in an existing query to make it functional.

5. Define an SQL term: Understand what terms like foreign and primary key, truncate, drop, union, union all, and left join and inner join mean (and when you’d use them).
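
One way to practice these tasks without installing a database server is Python’s built-in `sqlite3` module. The sketch below uses a made-up two-table schema (`customers` and `orders`) to show a JOIN with COUNT, followed by an UPDATE and a DELETE; the table and column names are hypothetical, and the syntax you are tested on may differ slightly by database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 15.0);
""")

# LEFT JOIN keeps customers with no orders; COUNT(o.id) counts only matched rows
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()
print(rows)

# Modify the database: update one record, then delete another
conn.execute("UPDATE orders SET total = 40.0 WHERE id = 2")
conn.execute("DELETE FROM orders WHERE id = 3")
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)
```

Replacing `LEFT JOIN` with `INNER JOIN` would drop the customer with no orders from the result, which is a quick way to explain the difference between the two.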

12. What statistical methods have you used in data analysis?

What they’re really asking: Do you have basic statistical knowledge?

Most entry-level data analyst roles will require at least a basic competency in statistics and an understanding of how statistical analysis ties into business goals. List the types of statistical calculations you’ve used in the past and what business insights those calculations yielded.

If you’ve ever worked with or created statistical models, be sure to mention that as well. If you’re not already, familiarize yourself with the following statistical concepts:

Mean
Standard deviation
Variance
Regression
Sample size
Descriptive and inferential statistics
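
If you want to refresh the first three concepts, they can be computed directly with Python’s `statistics` module. The sales figures below are made up; note that `variance` and `stdev` are the sample versions, which divide by n − 1.

```python
import statistics

# Hypothetical sample of daily sales figures
sales = [12, 15, 11, 18, 14]

mean = statistics.mean(sales)
variance = statistics.variance(sales)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sales)      # square root of the sample variance

print(mean, variance, std_dev)
```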

An interviewer might also ask:

What is your knowledge of statistics?
How have you used statistics in your work as a data analyst?

13. How have you used Excel for data analysis in the past?

Spreadsheets rank among the most common tools used by data analysts. It’s common for interviews to include one or more questions meant to gauge your skill working with data in Microsoft Excel.

Five Excel interview questions for data analysts

Here are five more questions specific to Excel that you might be asked during your interview:

1. What is a VLOOKUP, and what are its limitations?

2. What is a pivot table, and how do you make one?

3. How do you find and remove duplicate data?

4. What are INDEX and MATCH functions, and how do they work together?

5. What’s the difference between a function and a formula?

14. Explain the term…

What they’re really asking: Are you familiar with the terminology of data analytics?

Throughout your interview, you may be asked to define a term or explain what it means. In most cases, the interviewer is trying to determine how well you know the field and how effective you are at communicating technical concepts in simple terms. While it’s impossible to know what exact terms you may be asked about, here are a few you should be familiar with:

Normal distribution
Data wrangling
KNN imputation method
Clustering
Outlier
N-grams
Statistical model

15. Can you describe the difference between … ?

Similar to the last type of question, these interview questions help determine your knowledge of analytics concepts by asking you to compare two related terms. Some pairs you might want to be familiar with include:

Data mining vs. data profiling
Quantitative vs. qualitative data
Variance vs. covariance
Univariate vs. bivariate vs. multivariate analysis
Clustered vs. non-clustered index in SQL
1-sample T-test vs. 2-sample T-test
Joining vs. blending in Tableau

The final question: Do you have any questions?

Almost every interview, regardless of field, ends with some variation of this question. This process is about you evaluating the company as much as it is about the company evaluating you. Come prepared with a few questions for your interviewer, but don’t be afraid to ask any questions that came up during the interview as well. Some topics you can ask about include:

What a typical day is like
Expectations for your first 90 days
Company culture and goals
Your potential team and manager
The interviewer’s favorite part about the company

Testgorilla: data-analyst-interview-questions

Saturday, 6 August 2022

Data Analytics CV and Cover Letter

Zety Blog

Friday, 12 August 2022