Machine Learning with Apache Spark

Hi Readers,

In this post I will share what I have learned about ‘Machine Learning with Apache Spark’.

What is machine learning?

Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers.

Some example use cases are fraud detection (identifying fraudulent transactions and anomaly detection), cyber security (detecting DoS attacks and scaling up instances upon imminent threats), self-driving cars, sentiment analysis, and credit risk.

Categories of machine learning algorithms

Supervised learning:

A supervised learning algorithm analyzes labeled training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances.

Example algorithms include decision trees, regression, neural networks, and support vector machines (SVM).
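
To make the idea of an "inferred function" concrete, here is a minimal sketch of regression in plain Python (no Spark needed). The training values and the function names `fit_line` and `predict` are made up for illustration; a real project would use a library implementation instead.

```python
# Minimal supervised-learning sketch: fit y = slope * x + intercept by least squares.
# All data values and helper names here are hypothetical, chosen for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error on the labeled pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the inferred function to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Labeled training data: each input comes with a known output.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

model = fit_line(train_x, train_y)
prediction = predict(model, 10.0)
print(model, prediction)
```

The model is learned only from the labeled pairs, and `predict` then maps an input that was never in the training data.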

Unsupervised learning:

In unsupervised learning, there is no pre-existing data with known labels. A well-known example of this is customer segmentation, where you want to identify customer segments based on their behavior. Unsupervised learning is also used in fraud detection and cyber security.

Example algorithms include principal component analysis and clustering.
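
The customer segmentation example can be sketched with a small k-means clustering in plain Python. The customer values (visits per month, average spend) and the starting centroids are made up for illustration, and this is only a bare-bones version of the algorithm, not how Spark implements it.

```python
# Minimal k-means sketch on 2-D points. No labels are used anywhere.
# The data and the initial centroids are hypothetical.

def kmeans(points, centroids, iterations=10):
    """Run plain k-means from the given starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# (visits per month, average spend) for six customers.
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]

centroids, clusters = kmeans(customers, centroids=[(1, 10), (9, 80)])
print(centroids)
print(clusters)
```

The algorithm finds two segments (occasional low spenders and frequent high spenders) purely from the behavior data.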

Semi-supervised learning:

The data is partially labeled, and estimation techniques are used to assign labels to the unlabeled data. It can perform better than unsupervised learning, which is often CPU intensive.

Example algorithms include Clustering and Factorization machines.
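
As a rough illustration of estimating labels for unlabeled data, the sketch below uses nearest-centroid pseudo-labeling in plain Python. This is just one simple technique I picked for the example; the transaction amounts, the labels, and the function name `pseudo_label` are all hypothetical.

```python
# Minimal semi-supervised sketch: a few labeled values are used to estimate
# labels for the unlabeled ones (nearest class centroid).
# Data, labels and helper names are hypothetical.

def pseudo_label(labeled, unlabeled):
    """Give each unlabeled value the label of the nearest class centroid."""
    by_label = {}
    for value, label in labeled:
        by_label.setdefault(label, []).append(value)
    # One centroid (mean value) per known label.
    centroids = {label: sum(vals) / len(vals) for label, vals in by_label.items()}
    return [
        (value, min(centroids, key=lambda lab: abs(value - centroids[lab])))
        for value in unlabeled
    ]


# Transaction amounts: only four of them have a known label.
labeled = [(20.0, "normal"), (35.0, "normal"), (900.0, "fraud"), (1100.0, "fraud")]
unlabeled = [25.0, 950.0, 40.0]

estimated = pseudo_label(labeled, unlabeled)
print(estimated)
```

In practice the estimated labels would be added to the training set and the model retrained, usually keeping only the most confident estimates.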

Reinforcement Learning (RL):

It is a field within machine learning which involves sequential decision making and learning from interaction. In RL, an agent chooses actions that will maximize the expected cumulative reward over a period of time.

Examples include games such as chess, Go, and casino games.
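
Sticking with the casino theme, here is a minimal sketch of an agent learning from interaction: an epsilon-greedy player choosing between three slot machines. This is a simplified, stateless form of RL (a multi-armed bandit), and the win probabilities, step count, and the name `run_bandit` are assumptions made for the example.

```python
import random

# Minimal epsilon-greedy bandit sketch. The agent does not know the win
# probabilities; it learns them from the rewards it receives.
# All numbers and names here are hypothetical.

def run_bandit(win_probabilities, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(win_probabilities)    # pulls per machine
    values = [0.0] * len(win_probabilities)  # running average reward per machine
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(win_probabilities))  # explore: random machine
        else:
            arm = values.index(max(values))              # exploit: best estimate so far
        reward = 1 if rng.random() < win_probabilities[arm] else 0
        counts[arm] += 1
        # Incremental update of the average reward for the chosen machine.
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, values, total_reward


counts, values, total_reward = run_bandit([0.2, 0.5, 0.8])
best_arm = values.index(max(values))
print(counts, values, total_reward, best_arm)
```

The small amount of random exploration lets the agent keep testing every machine, while most pulls go to the one with the highest estimated reward.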

Spark MLlib

MLlib is Spark's machine learning library. It was created in the Berkeley AMPLab.

ML algorithms include:

  • Classification: logistic regression, naive Bayes
  • Regression: generalized linear regression, survival regression
  • Decision trees, random forests, and gradient-boosted trees
  • Recommendation: alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs),…
  • Topic modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining

Spark Packages is a third-party package library which has 79 machine learning packages.

You can also bookmark this page for future reference.

You can share this page with your friends.

Follow me, Jose Praveen, for future notifications.
