In this post I will share what I have learned about ‘Machine Learning with Apache Spark’.
What is machine learning?
Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.
Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers.
Some example use cases include fraud detection (identifying fraudulent transactions and anomalies), cyber security (detecting DoS attacks and scaling up instances when a threat is imminent), self-driving cars, sentiment analysis, and credit risk.
Categories of machine learning algorithms
Supervised Learning:
Supervised learning works from labeled training data. The algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances.
Example algorithms include decision trees, regression, neural networks, and support vector machines (SVM).
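To make the idea of an "inferred function" concrete, here is a minimal sketch in plain Python (no Spark). It uses a nearest-centroid classifier, which is my own choice for illustration and not one of the algorithms listed above. The transaction data and the name `fit_nearest_centroid` are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_nearest_centroid(examples):
    """Learn one centroid per label and return the inferred function."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    # The "model" is simply the mean feature vector of each class.
    centroids = {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

    def predict(features):
        # Map a new example to the label whose centroid is closest.
        return min(centroids, key=lambda label: dist(features, centroids[label]))

    return predict


# Hypothetical labeled data: (amount in thousands, hour of day) -> label
training = [
    ((0.2, 14.0), "normal"),
    ((0.5, 10.0), "normal"),
    ((0.3, 16.0), "normal"),
    ((9.0, 3.0), "fraud"),
    ((8.5, 2.0), "fraud"),
]
predict = fit_nearest_centroid(training)
print(predict((0.4, 12.0)))  # normal
print(predict((9.5, 1.0)))   # fraud
```

The function returned by `fit_nearest_centroid` is the inferred function: it was built only from the labeled examples, yet it can be applied to instances it has never seen.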
Unsupervised Learning:
In unsupervised learning, there is no pre-existing data with known labels. A well-known example is customer segmentation, where you want to identify customer segments based on their behavior. Unsupervised learning is also used in fraud detection and cyber security.
Example algorithms include principal component analysis (PCA) and clustering.
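The customer segmentation example can be sketched with a tiny k-means clustering routine in plain Python. This is only an illustration under simplifying assumptions: the customer data is made up, and the centroids are initialized from the first `k` points to keep the run deterministic (real implementations usually pick starting points more carefully).

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Group points into k clusters; no labels are needed."""
    # Assumption: initialize from the first k points for determinism.
    centroids = list(points[:k])

    def nearest(point):
        return min(range(k), key=lambda c: dist(point, centroids[c]))

    for _ in range(iterations):
        assignments = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, [nearest(p) for p in points]


# Hypothetical customers: (visits per month, average spend)
customers = [(1, 10), (2, 12), (1, 11), (10, 100), (11, 95), (9, 105)]
centroids, assignments = kmeans(customers, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Nobody told the algorithm which customers belong together. The two segments (occasional low spenders and frequent high spenders) emerge from the data alone.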
Semi-supervised Learning:
In semi-supervised learning, the data is only partially labeled, and estimation techniques are used to infer labels for the unlabeled data. It can outperform unsupervised learning, which is often CPU-intensive.
Example algorithms include clustering and factorization machines.
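One simple way to picture "estimating" the missing labels is self-training: start from the few labeled points, label the unlabeled point you are most confident about, and repeat. The sketch below is my own illustration in plain Python and is not one of the algorithms named above. The function name `self_train` and the data are hypothetical.

```python
def self_train(labeled, unlabeled):
    """Spread labels from a few labeled values to the unlabeled ones."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Recompute the mean value (centroid) of each label.
        sums = {}
        for x, y in labeled:
            total, count = sums.get(y, (0.0, 0))
            sums[y] = (total + x, count + 1)
        centroids = {y: total / count for y, (total, count) in sums.items()}
        # Pick the unlabeled point closest to any centroid (most confident).
        best = min(
            remaining,
            key=lambda x: min(abs(x - c) for c in centroids.values()),
        )
        label = min(centroids, key=lambda y: abs(best - centroids[y]))
        labeled.append((best, label))
        remaining.remove(best)
    return labeled


# Hypothetical data: only two of the eight values carry a label.
seed_labels = [(1.0, "low"), (9.0, "high")]
unlabeled = [2.0, 3.0, 4.0, 8.0, 7.0, 6.0]
result = dict(self_train(seed_labels, unlabeled))
print(result[4.0], result[6.0])  # low high
```

Each newly labeled point shifts the centroids a little, so the points in the middle are decided last, once the model has seen more evidence.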
Reinforcement Learning (RL):
It is a field within machine learning which involves sequential decision making and learning from interaction. In RL, an agent chooses actions that will maximize the expected cumulative reward over a period of time.
Examples include games such as chess, Go, and casino games.
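Sticking with the casino theme, a classic toy problem is the multi-armed bandit: an agent facing several slot machines has to learn, purely by playing, which one pays out most often. The sketch below uses the epsilon-greedy strategy in plain Python. The win probabilities, the function name `run_bandit`, and the parameter values are assumptions chosen for illustration.

```python
import random


def run_bandit(win_probabilities, pulls=5000, epsilon=0.1, seed=0):
    """Learn from interaction which slot machine (arm) pays best."""
    rng = random.Random(seed)
    arms = len(win_probabilities)
    counts = [0] * arms        # how often each arm was played
    estimates = [0.0] * arms   # running average reward per arm
    total_reward = 0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(arms)  # explore: try a random arm
        else:
            # exploit: play the arm that currently looks best
            arm = max(range(arms), key=estimates.__getitem__)
        reward = 1 if rng.random() < win_probabilities[arm] else 0
        counts[arm] += 1
        # Incremental update of the average reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical machines; the agent never sees these probabilities directly.
estimates, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=estimates.__getitem__)
print("best arm:", best_arm)
print("pulls per arm:", counts)
```

The agent is never told the payout rates. It balances exploring unfamiliar machines against exploiting the one that has rewarded it most so far, which is the trade-off at the heart of maximizing cumulative reward.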
MLlib is Spark's machine learning library. It was created at the Berkeley AMPLab.
ML algorithms include:
- Classification: logistic regression, naive Bayes
- Regression: generalized linear regression, survival regression
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixtures (GMMs),…
- Topic modeling: latent Dirichlet allocation (LDA)
- Frequent itemsets, association rules, and sequential pattern mining
Spark Packages is a third-party package library that lists 79 machine learning packages.