Saturday, November 27, 2021

Difference between Dataproc, Dataflow, Dataprep and Data Fusion

| Dataproc | Dataflow | Dataprep | Data Fusion |
| --- | --- | --- | --- |
| Clusters are provisioned manually. | Workers are provisioned and scaled automatically. | Helps prepare and clean data for future use. | Supports major Hadoop distributions (MapR, Hortonworks) and clouds (AWS, GCP, Azure) for building pipelines. |
| A good fit if your systems depend on Hadoop. It was created as an extension service for Hadoop. Its Hadoop and Spark integration, including for real-time data collection, is its most prominent feature. | Based on Apache Beam; used to collect and clean data-lake data and to process workloads in parallel. It mainly unifies the programming and execution models. | Regarded purely as a data processing tool, with visual analytics as a plus. Mainly used with Bigtable and BigQuery. Use it if you only need to find anomalies or redundancy in the data. | Based on CDAP, an open-source pipeline development tool. Offers a visual tool for building ETL/ELT pipelines. |
| Choose it if you prefer a hands-on DevOps approach. | Choose it if you prefer a serverless approach. | Fully managed, no-ops approach. | UI-driven, fully managed, and no-ops. On GCP it runs jobs on a Cloud Dataproc cluster and comes with many prebuilt connectors for linking sources to sinks. |
| Simple and easy to use | Relatively difficult to use | Easy to use | Very easy to use |
| Used for the data science and ML ecosystem | Used for batch and streaming data processing | Used for UI-driven data processing and multi-source data integration | Offers codeless pipeline development; its enterprise readiness makes data lineage and metadata management much easier. |
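
The selection guidance in the comparison above can be sketched as a small decision helper. This is only an illustration: the function name, its parameters, and the order in which the rules are checked are assumptions made for the example, not part of any Google Cloud API.

```python
from typing import Optional


def recommend_service(
    hadoop_dependent: bool = False,
    only_find_anomalies: bool = False,
    wants_codeless_ui: bool = False,
    wants_serverless: bool = False,
) -> Optional[str]:
    """Map workload traits to the service the comparison suggests.

    Hypothetical helper: the rule priority below is an assumption.
    """
    if hadoop_dependent:
        # Hadoop/Spark-dependent systems, hands-on DevOps approach
        return "Dataproc"
    if only_find_anomalies:
        # Only cleaning data or spotting anomalies and redundancy
        return "Dataprep"
    if wants_codeless_ui:
        # Visual, codeless ETL/ELT pipeline development
        return "Data Fusion"
    if wants_serverless:
        # Serverless batch and streaming processing on Apache Beam
        return "Dataflow"
    # No rule from the comparison applies
    return None


if __name__ == "__main__":
    print(recommend_service(hadoop_dependent=True))
    print(recommend_service(wants_serverless=True))
```

Real workloads often match more than one trait, so treat the result as a starting point for evaluation, not a final answer.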