Machine Learning with Apache Spark

Hi Readers, in this post I will share what I have learned about machine learning with Apache Spark. What is machine learning? Machine learning is a field of computer science that gives computer systems the ability to "learn" (i.e., to progressively improve their performance on a specific task) from data, without being explicitly programmed. Machine learning is closely related …
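
As a small taste of "learning from data" in Spark, here is a minimal MLlib sketch; the toy dataset and the choice of logistic regression are my own illustration, not necessarily the post's example:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Toy labelled data (label, features): the model learns the mapping
// from the examples rather than from hand-written rules.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(0.1, 1.3)),
  (1.0, Vectors.dense(2.2, 0.9))
)).toDF("label", "features")

// Fit a logistic regression model and apply it back to the data.
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(training)
model.transform(training).select("features", "label", "prediction").show()
```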

Continue reading Machine Learning with Apache Spark

Apache Spark – Convert Parquet files into Avro schema

Hi Readers, in this post I will explain how to convert Parquet files into the Avro format: read the Parquet file, then use the write.avro() method provided by the com.databricks:spark-avro_2.11:3.2.0 package. Apache Parquet is a columnar storage format built to support very efficient compression and encoding schemes. Apache Avro is a data serialization system. A compact, fast, binary data …
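
Sketched below is this conversion in Scala, assuming a spark2-shell launched with the spark-avro package on the classpath; the HDFS paths are hypothetical placeholders:

```scala
// Launch the shell with the package the post names:
//   spark2-shell --packages com.databricks:spark-avro_2.11:3.2.0
import com.databricks.spark.avro._

// Read the Parquet source (hypothetical path).
val parquetDF = spark.read.parquet("hdfs:///data/input_parquet")

// The spark-avro import adds the avro() method to DataFrameWriter.
parquetDF.write.avro("hdfs:///data/output_avro")
```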

Continue reading Apache Spark – Convert Parquet files into Avro schema

Difference between Spark, Storm, Samza and Flink tools

Hi Readers, in this post I have consolidated the feature differences between the Spark, Storm, Samza and Flink tools:

https://twitter.com/tjosepraveen/status/961292477310255105
https://twitter.com/tjosepraveen/status/961292666880245761

You can bookmark this page for future reference and share it with your friends. Follow me, Jose Praveen, for future notifications.

Spark SQL with JSON to Avro schema

Hi Readers, in this post I will explain two things: how to convert a JSON file to the Avro format, and how to read the Avro data and use Spark SQL to query and partition it on some condition. Apache Avro is a data serialization system: it provides a compact, fast, binary data format and a container file to store persistent data. Avro relies on schemas. …
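
A minimal Scala sketch of both steps, assuming a spark2-shell launched with com.databricks:spark-avro_2.11:3.2.0; the paths, the "overall" column and the >= 4.0 condition are illustrative assumptions:

```scala
import com.databricks.spark.avro._

// Step 1: convert the JSON file to Avro (hypothetical paths).
val jsonDF = spark.read.json("hdfs:///data/reviews.json")
jsonDF.write.avro("hdfs:///data/reviews_avro")

// Step 2: read the Avro data back and query it with Spark SQL.
val avroDF = spark.read.avro("hdfs:///data/reviews_avro")
avroDF.createOrReplaceTempView("reviews")

// Partition the filtered result on a column; "overall" is an assumed
// rating column and 4.0 an example condition.
spark.sql("SELECT * FROM reviews WHERE overall >= 4.0")
  .write.partitionBy("overall")
  .avro("hdfs:///data/reviews_avro_partitioned")
```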

Continue reading Spark SQL with JSON to Avro schema

Spark SQL with JSON to Parquet files

Hi Readers, in this post I will explain two things: how to convert a JSON file to Parquet files, and how to read the Parquet data and use Spark SQL to query and partition it on some condition. Apache Parquet is a columnar storage format built to support very efficient compression and encoding schemes. Step 1: The JSON dataset …
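
A minimal Scala sketch of both steps; the paths, the "overall" column and the filter condition are illustrative assumptions:

```scala
// Step 1: convert the JSON file to Parquet (hypothetical paths).
val jsonDF = spark.read.json("hdfs:///data/reviews.json")
jsonDF.write.parquet("hdfs:///data/reviews_parquet")

// Step 2: read the Parquet data back and query it with Spark SQL.
val parquetDF = spark.read.parquet("hdfs:///data/reviews_parquet")
parquetDF.createOrReplaceTempView("reviews")

// "overall" is an assumed rating column; partition the filtered rows on it.
spark.sql("SELECT * FROM reviews WHERE overall >= 4.0")
  .write.partitionBy("overall")
  .parquet("hdfs:///data/reviews_parquet_partitioned")
```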

Continue reading Spark SQL with JSON to Parquet files

Spark SQL with JSON data

Hi Readers, in this post I will show how to read a JSON dataset into a Spark SQL DataFrame and then analyse the data. Step 1: The JSON dataset is in my HDFS at 'user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json'; start the Spark shell using "spark2-shell". Step 2: Read the JSON data using the available Spark session 'spark': scala> val …
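
A minimal sketch of these two steps, using the path from Step 1 and the session `spark` that spark2-shell prebuilds; the "overall" rating column in the sample query is an assumption about this Amazon reviews dataset:

```scala
// Inside spark2-shell, the SparkSession is already available as `spark`.
val reviews = spark.read.json("user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json")

// Inspect the schema Spark inferred from the JSON records.
reviews.printSchema()

// Register a temporary view and analyse the data with Spark SQL;
// "overall" (the star rating) is an assumed column name.
reviews.createOrReplaceTempView("reviews")
spark.sql("SELECT overall, COUNT(*) AS num_reviews FROM reviews GROUP BY overall ORDER BY overall").show()
```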

Continue reading Spark SQL with JSON data

My Quora answers on big data, data science, blockchain, bitcoin, Windows 10/8/7, Linux, J2EE and much more

Hi Readers, I regularly write answers to the tech questions asked on quora.com. In this post you can read all my answers on various topics like big data, data science, blockchain, bitcoin, Windows 10/8/7, Linux, J2EE, etc.

Big Data, Data Science Problems and Solutions - https://hadoopexamples.quora.com/Big-Data-Data-Science-Problems-and-Solutions
Java, J2EE Problems and Solutions …

Continue reading My Quora answers on big data, data science, blockchain, bitcoin, Windows 10/8/7, Linux, J2EE and much more