Machine Learning with Apache Spark

Hi Readers,

In this post I will share what I have learned about 'Machine Learning with Apache Spark'.

What is machine learning?

Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers.

Some example use cases: fraud detection (identifying fraudulent transactions and anomaly detection), cyber security (detecting DoS attacks and scaling up instances when threats are imminent), self-driving cars, sentiment analysis, and credit risk.

Categories of machine learning algorithms

Supervised learning:

A supervised learning algorithm analyzes labeled training data and produces an inferred function, which can then be used to map new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances.

Example algorithms include decision trees, regression, neural networks, and SVMs.

Unsupervised learning:

In unsupervised learning, there is no pre-existing data with known labels. A well-known example is customer segmentation, where you want to identify customer segments based on their behavior. Unsupervised learning is also used in fraud detection and cyber security.

Example algorithms include principal component analysis and clustering.

Semi-supervised learning:

The data is partially labeled, and estimation techniques are used to infer labels for the unlabeled portion. It often performs better than purely unsupervised learning, which can be CPU intensive.

Example algorithms include clustering and factorization machines.

Reinforcement Learning (RL):

It is a field within machine learning which involves sequential decision making and learning from interaction. In RL, an agent chooses actions that will maximize the expected cumulative reward over a period of time.

Examples include games such as chess, Go, and casino games.

Spark MLlib

MLlib is Spark's machine learning library. It was created in the Berkeley AMPLab. A short example of the DataFrame-based spark.ml API follows the list below.

ML algorithms include:

  • Classification: logistic regression, naive Bayes
  • Regression: generalized linear regression, survival regression
  • Decision trees, random forests, and gradient-boosted trees
  • Recommendation: alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs),…
  • Topic modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining
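As a quick illustration of the DataFrame-based spark.ml API, here is a minimal classification sketch using logistic regression. The input path and the column names f1, f2, f3 and label are hypothetical placeholders (not from this post); adjust them to your own dataset.

scala> import org.apache.spark.ml.classification.LogisticRegression
scala> import org.apache.spark.ml.feature.VectorAssembler
scala> val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/user/hypothetical/training_data.csv")  // assumed CSV with numeric columns f1, f2, f3 and a binary label column
scala> val assembler = new VectorAssembler().setInputCols(Array("f1", "f2", "f3")).setOutputCol("features")  // spark.ml expects a single vector column of features
scala> val assembled = assembler.transform(raw)
scala> val Array(train, test) = assembled.randomSplit(Array(0.8, 0.2), seed = 42)  // hold out 20% for evaluation
scala> val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
scala> val model = lr.fit(train)
scala> model.transform(test).select("label", "prediction", "probability").show(5)  // compare predictions against true labels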

Spark Packages, a third-party package index for Spark, lists 79 machine learning packages.

You can also bookmark this page for future reference.

You can share this page with your friends.

Follow me Jose Praveen for future notifications.

Apache Spark – Convert Parquet files into Avro schema

Hi Readers,

In this post I will explain how to convert Parquet files into Avro format: read the Parquet file, then use the write.avro() method provided by the com.databricks:spark-avro_2.11:3.2.0 package.

Apache Parquet is a columnar storage format. Parquet is built to support very efficient compression and encoding schemes.

Apache Avro is a data serialization system. It provides a compact, fast, binary data format and a container file for storing persistent data. Avro relies on schemas.

Step 1:

Start the Spark shell using spark2-shell --packages com.databricks:spark-avro_2.11:3.2.0; the Avro dependency gets downloaded automatically.

Step 2:

Import the Avro package in the Spark shell:

scala> import com.databricks.spark.avro._

Step 3:

Read the Parquet file from the HDFS directory using the spark.read.parquet() method.

scala> val reviewParquetDF = spark.read.parquet("/user/edureka_162051/parquetdata/part-00000-6e546050-c328-4cee-84cd-dd445ff9ac2c.snappy.parquet")

Step 4:

Now use the write.avro() method of the com.databricks:spark-avro_2.11:3.2.0 package and pass an HDFS directory. coalesce can result in partitions with different amounts of data, but here I only need one output file, so I used coalesce(1); only one Avro file gets created in HDFS.

scala> reviewParquetDF.coalesce(1).write.avro("/user/edureka_162051/parquettoavrodata")

(Screenshot: Parquet-to-Avro conversion output)
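To verify the conversion, the Avro output directory can be read back with the same package; a quick sanity check against the directory written above:

scala> val checkAvroDF = spark.read.avro("/user/edureka_162051/parquettoavrodata")
scala> checkAvroDF.printSchema()  // should match the schema of the original Parquet data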

If you have any doubts or get stuck with issues, please comment. You can share this page with your friends.

Follow me Jose Praveen for future notifications.

Difference between the Spark, Storm, Samza and Flink tools

Hi Readers,

In this post, I have consolidated the feature differences between the Spark, Storm, Samza and Flink tools.

(Comparison table image: Apache Spark vs. Storm vs. Samza vs. Flink)

You can also bookmark this page for future reference.

You can share this page with your friends.

Follow me Jose Praveen for future notifications.

Spark SQL with JSON to Avro schema

Hi Readers,

In this post I will explain two things: how to convert a JSON file to Avro format, and how to read the Avro data, query it with Spark SQL, and partition it on a condition.

Apache Avro is a data serialization system. It provides a compact, fast, binary data format and a container file for storing persistent data. Avro relies on schemas.

Step 1:

The JSON dataset is in my HDFS at '/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json'. Start the Spark shell using "spark2-shell --packages com.databricks:spark-avro_2.11:3.2.0".

Step 2:

Import the com.databricks.spark.avro._ package in the spark-shell, then load the JSON data into reviewDF.

scala> val reviewDF = spark.read.json("/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json")

Use printSchema() to inspect the fields and their types.

scala> reviewDF.printSchema()

Step 3:

Convert the JSON data to Avro format, keeping only reviews with overall < 4. coalesce can result in partitions with different amounts of data, but here I only need one output file, so I used coalesce(1); only one Avro file gets created in HDFS.

scala> reviewDF.filter("overall < 4").coalesce(1).write.avro("/user/edureka_162051/avrodata")

Step 4:

Read the Avro data from HDFS.

scala> val reviewAvroDF = spark.read.avro("/user/edureka_162051/avrodata/part-00000-eca1b21e-78dc-44fc-95ea-4ebd02897721.avro")

Use printSchema() to inspect the fields and their types.

scala> reviewAvroDF.printSchema()

Use show(10) to display the first 10 rows in tabular format.

scala> reviewAvroDF.select("reviewerName","reviewText","reviewTime").show(10)

(Screenshot: Avro query output)

Step 5:

The Avro output can be compressed with the uncompressed, snappy, or deflate codecs and partitioned by a column. Here I have used deflate compression and set the deflate level to 5.

scala> spark.conf.set("spark.sql.avro.compression.codec", "deflate")

scala> spark.conf.set("spark.sql.avro.deflate.level", "5")

Partition using the overall field (which holds rating values 1, 2 and 3).

scala> reviewAvroDF.write.partitionBy("overall").avro("/user/edureka_162051/avrodata/partitioned")

Once the partitioning is done, check the HDFS directory; three folders will be created, as shown below.

(Screenshot: partitioned Avro output folders in HDFS)

If you have any doubts or get stuck with issues, please comment. You can share this page with your friends.

Follow me Jose Praveen for future notifications.

Spark SQL with JSON to Parquet files

Hi Readers,

In this post I will explain two things: how to convert a JSON file to Parquet files, and how to read the Parquet data, query it with Spark SQL, and partition it on a condition.

Apache Parquet is a columnar storage format. Parquet is built to support very efficient compression and encoding schemes.

Step 1:

The JSON dataset is in my HDFS at '/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json'. Start the Spark shell using "spark2-shell".

Step 2:

Load the JSON data into reviewDF.

scala> val reviewDF = spark.read.json("/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json")

Use printSchema() to inspect the fields and their types.

scala> reviewDF.printSchema()

Step 3:

Convert the JSON data to a Parquet file, keeping only reviews with overall < 4. coalesce can result in partitions with different amounts of data, but here I only need one output file, so I used coalesce(1); only one Parquet file gets created in HDFS.

scala> reviewDF.filter("overall < 4").coalesce(1).write.parquet("/user/edureka_162051/parquetdata")

Step 4:

Read the Parquet data from HDFS.

scala> val reviewParquetDF = spark.read.parquet("/user/edureka_162051/parquetdata/part-00000-6e546050-c328-4cee-84cd-dd445ff9ac2c.snappy.parquet")

Use printSchema() to inspect the fields and their types.

scala> reviewParquetDF.printSchema()

scala> reviewParquetDF.createOrReplaceTempView("reviewsTable")

scala> val reviewDetailsDF = spark.sql("select reviewerName,reviewText,summary from reviewsTable")

scala> reviewDetailsDF.show(5)

(Screenshot: review query output)
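The same temporary view also supports aggregate queries. For example, a small sketch counting reviews per rating, using the overall column from the schema above:

scala> spark.sql("select overall, count(*) as reviews from reviewsTable group by overall order by overall").show()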

Step 5:

Use snappy compression to compress the Parquet files, and partition the data by a column.

scala> spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

Partition using the overall field (which holds rating values 1, 2 and 3).

scala> reviewParquetDF.write.partitionBy("overall").parquet("/user/edureka_162051/parquetdata/partitioned")

Once the partitioning is done, check the HDFS directory; three folders will be created, as shown below.

(Screenshot: partitioned Parquet output folders in HDFS)
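Reading the partitioned directory back lets Spark prune partitions when you filter on the partition column. A quick sketch, assuming the partitioned path written above:

scala> val partitionedDF = spark.read.parquet("/user/edureka_162051/parquetdata/partitioned")
scala> partitionedDF.filter("overall = 3").count()  // only the matching partition folder is scanned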

If you have any doubts or get stuck with issues, please comment. You can share this page with your friends.

Follow me Jose Praveen for future notifications.

Spark SQL with JSON data

Hi Readers,

In this post I will show how to read a JSON dataset into a Spark SQL DataFrame and then analyse the data.

Step 1:

The JSON dataset is in my HDFS at '/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json'. Start the Spark shell using "spark2-shell".

Step 2:

Read the JSON data using the available Spark session 'spark'.

scala> val reviewDF = spark.read.json("/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json")

Use printSchema() to verify the fields and their types in reviewDF.


scala> reviewDF.printSchema()

The schema looks like below.

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

Step 3:

Now I need to call createOrReplaceTempView, which creates (or replaces) a temporary view of the DataFrame in the Spark session. With that temporary view I can run SQL queries.
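Register reviewDF as a temporary view named reviewsTable so that the SQL queries below can reference it:

scala> reviewDF.createOrReplaceTempView("reviewsTable")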

scala> val selectDF = spark.sql("select asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime from reviewsTable")


scala> selectDF.show()  // you can see the reviewsTable data in tabular format

Step 4:
Get the review results where the overall value is greater than or equal to 4.


scala> val overallDF = spark.sql("select asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary from reviewsTable where overall >=4")


scala> overallDF.show()  // you can see the rows with overall >= 4 in tabular format

(Screenshot: Spark SQL query output)

If you have any doubts or get stuck with issues, please comment. You can share this page with your friends.

Follow me Jose Praveen for future notifications.

My Quora answers on big data, data science, blockchain, Bitcoin, Windows 10/8/7, Linux, J2EE and much more

Hi Readers,

I write answers to the tech questions asked on quora.com. In this post you can read all my answers on various topics like big data, data science, blockchain, Bitcoin, Windows 10/8/7, Linux, J2EE, etc.

Big Data, Data Science Problems and Solutions

https://hadoopexamples.quora.com/Big-Data-Data-Science-Problems-and-Solutions

Java, J2EE Problems and Solutions

https://javastackexamples.quora.com/Java-J2EE-Problems-and-Solutions

Windows 10/ 8/ 7, OS, Memory issues, Linux, VM Problems and Solutions

https://issuesfixed.quora.com/Windows-10-8-7-OS-Memory-issues-Linux-VM-Problems-and-Solutions

Blockchain, Bitcoin Problems and Solutions

https://blockchainlearning.quora.com/Blockchain-Bitcoin-Problems-and-Solutions

#askmequestions #quora

You can also bookmark this page for future reference.

You can share this page with your friends.

Follow me Jose Praveen for future notifications.