Spring Boot with Angular

Hi Readers,

In this post you will find the source code for my Spring Boot with Angular app at https://github.com/JosePraveen/spring-boot-with-angular.

Project Structure

Spring Boot App

[Screenshot: springboot-angular2]

 

Angular App

[Screenshot: springboot-angular3]

Start the Angular app using npm start.

[Screenshot: springboot-angular4]

UI Screens

Bike register screen

[Screenshot: springboot-angular1]

Admin screen

[Screenshot: springboot-angular]

Buyer details screen

[Screenshot: springboot-angular5]


Join optimization techniques in Apache Spark

Hi Readers,

In this post you will learn the various join optimization techniques that can be used in Apache Spark.

The three types of joins in Spark are:

Shuffle hash join (default):

  • It is a map-reduce style join.
  • The datasets are shuffled based on the join key.
  • During the reduce phase, records with the same key are joined.

Broadcast hash join:

  • Use this when one dataset is small enough to fit in memory (a sketch of forcing one appears after the explain example below).

Cartesian join:

  • Use this when every row of one table must be joined with every row of the other table.

Example:
The code snippet below uses .join() on orderDF and orderItemDF to produce joinedOrderDataDF.

scala> var joinedOrderDataDF = orderDF.join(orderItemDF, orderDF("order_id") === orderItemDF("order_item_order_id"))
joinedOrderDataDF: org.apache.spark.sql.DataFrame = [order_id: int, order_date: bigint … 8 more fields]

To check which join execution strategy has been used in joinedOrderDataDF:
scala> joinedOrderDataDF.explain —> prints the physical plan, showing which join type has been used.
scala> joinedOrderDataDF.queryExecution.executedPlan —> gives information on how the DataFrame has been executed.
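As mentioned above, here is a minimal sketch of forcing a broadcast hash join; the small lookup DataFrame and its path are assumptions for illustration, not part of the original example.

import org.apache.spark.sql.functions.broadcast

// Hypothetical small lookup DataFrame; the path and columns are assumptions for illustration.
val smallLookupDF = spark.read.option("header", "true").csv("/path/to/small_lookup.csv")

// broadcast() hints Spark to ship the small side to every executor,
// so the large orderDF is joined without being shuffled across the network.
val broadcastJoinedDF = orderDF.join(broadcast(smallLookupDF), Seq("order_id"))

// The physical plan should now show a BroadcastHashJoin instead of a shuffle-based join.
broadcastJoinedDF.explain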

[Screenshot: march2018_sparkscala-sqljoin]


Parallelism techniques in Apache Spark

Hi Readers,

In this post you will learn the various parallelism techniques that can be used in Apache Spark.

We need parallelism techniques to achieve full utilization of the cluster capacity. For data read from HDFS, the number of partitions is the same as the number of input splits, which is usually the same as the number of blocks.

Example:

If you want to check the default parallelism value in your cluster, use sc.defaultParallelism in your Spark shell.

scala> sc.defaultParallelism
res3: Int = 2

In my cluster the defaultParallelism value is 2.

In the Spark shell we can increase parallelism by passing the value 10 to sc.textFile(), which means the resulting RDD has 10 partitions.

scala> sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", 10)

Important:
To maximize parallelism, the number of partitions should be two to three times the number of cores present in your cluster.

 

coalesce(numOfPartitions):

Reduce the number of partitions using the RDD method coalesce(numOfPartitions), where numOfPartitions is the final number of partitions; it avoids a full shuffle when only shrinking the partition count.

repartition(numOfPartitions):

If your data has to be reshuffled over the network, then use the RDD method repartition(numOfPartitions), where numOfPartitions is the final number of partitions.
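Below is a minimal sketch of both methods in the Spark shell; the RDD name and the partition counts chosen here are assumptions for illustration.

// Reuse the survey file from the example above, read with 10 partitions.
val surveyRDD = sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", 10)
surveyRDD.getNumPartitions    // 10

// coalesce() reduces the partition count without a full shuffle, which is cheap when shrinking.
val narrowed = surveyRDD.coalesce(4)
narrowed.getNumPartitions     // 4

// repartition() performs a full shuffle; use it to increase partitions or rebalance skewed data.
val rebalanced = surveyRDD.repartition(20)
rebalanced.getNumPartitions   // 20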


File Optimization and Compression techniques in Apache Hive

Hi Readers,

In this post you will learn the various file optimization and compression techniques that can be used in Apache Hive.

Hive supports TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats.

Optimization Techniques:

TEXTFILE:

  • This is the default file format for Hive.
  • Data is not compressed in the text file.
  • It can be compressed with compression tools, such as GZip, Bzip2, and Snappy.

SEQUENCEFILE:

  • This is a binary storage format for key/value pairs.
  • The sequence file is more compact than a text file and fits well with the MapReduce output format.
  • Sequence files can be compressed on record or block level where block level has a better compression ratio.

To enable block-level compression, we need the following settings:
hive> SET hive.exec.compress.output=true;

hive> SET io.seqfile.compression.type=BLOCK;

RCFILE:

  • Record Columnar File is a flat file consisting of binary key/value pairs that shares much similarity with a sequence file.
  • The RCFile splits data horizontally into row groups. Then, RCFile saves the row group data in a columnar format by saving the first column across all rows, then the second column across all rows, and so on.
  • This format is splittable and is faster.

ORC:

  • Optimized Row Columnar is available since Hive 0.11.0.
  • It provides a larger block size of 256 MB by default (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), optimized for large sequential reads on HDFS for higher throughput and fewer files to reduce the load on the NameNode.
  • It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns as well as a lightweight index that can be used to skip blocks of rows that do not matter.

PARQUET:

  • Parquet is another columnar file format with a design similar to that of ORC.
  • Parquet has wider support across the majority of projects in the Hadoop ecosystem, compared to ORC, which is mainly supported by Hive and Pig.
  • Parquet has had native support in Hive since Hive 0.13.0.
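As a minimal sketch (the table name and columns below are made up for illustration), the storage format of a Hive table is chosen with the STORED AS clause:

hive> CREATE TABLE orders_orc (order_id INT, order_date STRING, order_status STRING) STORED AS ORC;

hive> CREATE TABLE orders_parquet (order_id INT, order_date STRING, order_status STRING) STORED AS PARQUET;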

 

Compression techniques

To reduce the amount of data transferred between mappers and reducers, enable intermediate output compression; output compression likewise reduces the output data size stored in HDFS. Intermediate compression is enabled as follows:

hive> SET hive.exec.compress.intermediate=true;

The compression codec can be specified in mapred-site.xml, hive-site.xml, or the Hive CLI:

hive> SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Intermediate compression only saves disk space for queries that require multiple MapReduce jobs.

For further saving of disk space, the actual Hive output files can be compressed.

When the hive.exec.compress.output property is set to true, Hive will use the codec configured by the mapred.output.compression.codec property to compress the output stored in HDFS, as follows.

These properties can be set in the hive-site.xml or in the Hive CLI.

hive> SET hive.exec.compress.output=true;

hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;


Optimization techniques in Apache Spark

Hi Readers,

In this post you will learn the various optimization techniques used in Apache Spark.

We can optimize our Spark applications by using techniques such as data serialization and broadcasting.

Data serialization
Spark provides two options for data serialization:
1. Java serialization
2. Kryo serialization

Kryo serialization is much faster and more compact than Java serialization.

You can start using Kryo by initializing your Spark job with a SparkConf as below.

scala> import org.apache.spark._
import org.apache.spark._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> val conf = new SparkConf().setAppName("My App").setMaster("local[*]")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@54c15e
scala> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
res1: org.apache.spark.SparkConf = org.apache.spark.SparkConf@54c15e
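For the best results you can also register the classes you intend to serialize with Kryo; the case class below is a hypothetical example, not from the original post.

// Registering application classes up front avoids Kryo writing full class names with every serialized object.
case class SensorReading(id: Int, value: Double)
conf.registerKryoClasses(Array(classOf[SensorReading]))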

Broadcasting
A broadcast variable is a read-only copy of a value that is cached on each executor, rather than a copy being shipped with every task from the driver program.

When to use:
If a task in your Spark job uses a large object from the driver program, turn that object into a broadcast variable.

How to use:
You can create a broadcast variable using SparkContext.broadcast.

scala> val m1 = 20
m1: Int = 20
scala> val m2 = sc.broadcast(m1)
m2: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(0)
scala> m2.value
res7: Int = 20
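As a small usage sketch (the lookup map and RDD below are made up for illustration), a broadcast value is typically read inside transformations running on the executors:

// Broadcast a lookup table once; every task reads it via .value instead of receiving its own copy.
val countryCodes = sc.broadcast(Map(1 -> "IN", 2 -> "US", 3 -> "UK"))
val userCountryIds = sc.parallelize(Seq(1, 2, 3, 2, 1))
val userCountries = userCountryIds.map(id => countryCodes.value.getOrElse(id, "UNKNOWN"))
userCountries.collect()   // Array(IN, US, UK, US, IN)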

The Broadcast feature of Spark uses the SparkContext to create broadcast values. After that, the BroadcastManager and ContextCleaner are used to control their life cycle.

Other optimization techniques include memory tuning, data structure tuning, GC tuning, etc.


Run a Scala program in Apache Spark

Hi Readers,

In this post you will learn how to run a Scala program in Apache Spark.

On your machine you need to install Scala, sbt (the build tool), and Java 8, and you need access to a Spark cluster (on AWS, or use the Cloudera VM).

Open a command prompt and type sbt new scala/hello-world.g8.

Write your logic in the src/main/scala/Main.scala file.

To compile, run, or package the Scala project, open a command prompt at the root path of the project:

to compile – sbt compile

to create a package – sbt package

to run the project – sbt run

On the Spark cluster, check that the Spark version is the same as the Spark version declared in build.sbt.
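As a minimal sketch, a build.sbt for such a project might look like the following; the Scala and Spark versions shown are assumptions, so replace them with the versions matching your cluster.

// build.sbt (hypothetical example; adjust the versions to match your cluster)
name := "hello-world"
version := "1.0"
scalaVersion := "2.11.12"

// Spark is marked "provided" because the cluster supplies it at runtime via spark-submit.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"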


Machine Learning with Apache Spark

Hi Readers,

In this post I will share my learning of ‘Machine Learning with Apache Spark’.

What is machine learning?

Machine learning is a field of computer science that gives computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers.

Some example use cases are fraud detection (identifying fraudulent transactions and anomaly detection), cyber security (detecting DoS attacks and scaling up instances upon imminent threats), self-driving cars, sentiment analysis, and credit risk.

Categories of machine learning algorithms

Supervised learning:

It analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances.

Example algorithms include decision trees, regression, neural networks, and SVMs.

Unsupervised learning:

In unsupervised learning, there is no pre-existing data with known labels. A well-known example of this is customer segmentation, where you want to identify customer segments based on behavior. Unsupervised learning is also used in fraud detection and cyber security.

Example algorithms include principal component analysis and clustering.

Semi-supervised learning:

The data is partially labeled, and estimation techniques are used to label the unlabeled portion. It has superior performance over unsupervised learning, which is often CPU intensive.

Example algorithms include clustering and factorization machines.

Reinforcement Learning (RL):

It is a field within machine learning which involves sequential decision making and learning from interaction. In RL, an agent chooses actions that will maximize the expected cumulative reward over a period of time.

Examples include games such as chess, Go, and casino games.

Spark MLlib

MLlib stands for Machine Learning Library in Spark. It was created in the Berkeley AMPLab.

ML algorithms include:

  • Classification: logistic regression, naive Bayes
  • Regression: generalized linear regression, survival regression
  • Decision trees, random forests, and gradient-boosted trees
  • Recommendation: alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs),…
  • Topic modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining
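As a minimal sketch of using one of these algorithms from the Spark shell (the column names and data points below are made up for illustration), here is K-means clustering with the spark.ml API:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical two-dimensional points that form two obvious clusters.
val points = spark.createDataFrame(Seq(
  (0.0, 0.0), (0.5, 0.3), (8.0, 8.2), (8.3, 7.9)
)).toDF("x", "y")

// MLlib estimators expect a single vector column, assembled here as "features".
val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
val features = assembler.transform(points)

// Fit a 2-cluster K-means model and print the learned cluster centers.
val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)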

Spark Packages, a third-party package index, lists 79 machine learning packages.
