| Dataproc | Dataflow | Dataprep | Data Fusion |
| --- | --- | --- | --- |
| Supports manual provisioning of clusters. | Supports automatic provisioning of clusters. | Helps prepare and clean data for later use. | Supports major Hadoop distributions (MapR, Hortonworks) and clouds (AWS, GCP, Azure) for building pipelines. |
| A good choice if your systems are Hadoop-dependent, since it is a managed service built around the Hadoop ecosystem; real-time data collection with Hadoop and Spark integration is its strong point (a provisioning sketch follows the table). | Based on Apache Beam, which unifies batch and streaming under one programming model; used for collecting, cleaning, and processing data-lake workloads in parallel (a Beam sketch follows the table). | Positioned purely as a data-preparation tool, with visual analytics and data processing as its plus points; mainly used with Bigtable and BigQuery, and a good fit if you only need to find anomalies or redundancy in the data. | Based on CDAP, an open-source pipeline-development tool; offers a visual interface for building ETL/ELT pipelines. |
| Choose Dataproc if you prefer a hands-on, DevOps approach. | Choose Dataflow if you prefer a serverless approach: fully managed, NoOps. | UI-driven, fully managed, and follows a NoOps approach. | On GCP it runs its jobs on Dataproc clusters and ships with many prebuilt connectors for wiring sources to sinks. |
| Simple and easy to use. | Relatively difficult to use. | Easy to use. | Very easy to use. |
| Used for data science and ML ecosystems. | Used for batch and streaming data processing. | Used for UI-driven data processing and integrating data from multiple sources. | Offers codeless pipeline development; its enterprise-readiness features make data lineage and metadata management much easier. |
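
As the table notes, Dataproc is the hands-on option: you provision the cluster yourself and then submit Hadoop or Spark jobs to it. The sketch below uses the google-cloud-dataproc Python client; the project ID, region, cluster name, machine types, and GCS path are placeholders, so treat it as a minimal outline rather than a production setup.

```python
# Minimal sketch: manually provision a Dataproc cluster and submit a PySpark job.
# Project, region, cluster name, and the GCS script path are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"           # placeholder
cluster_name = "demo-cluster"    # placeholder

endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

# 1. Provision the cluster yourself (the "hands-on, DevOps" part).
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()  # block until the cluster is ready

# 2. Submit a PySpark job to the cluster you manage.
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},  # placeholder
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()
```

The cluster keeps running (and billing) until you delete it, which is exactly the operational work that Dataflow's serverless model takes off your hands.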
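For Dataflow, the pipeline code is written with Apache Beam, and the same pipeline can run locally with the DirectRunner or on the fully managed Dataflow service by switching runners. Below is a minimal word-count sketch; the bucket paths are placeholders, and running it on Dataflow would also require project, region, and temp-location options.

```python
# Minimal Apache Beam word-count sketch. With runner="DirectRunner" it runs
# locally; switching to runner="DataflowRunner" (plus project/region/temp
# location options) hands execution to the serverless Dataflow service.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # or "DataflowRunner"

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")    # placeholder path
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/counts")       # placeholder path
    )
```

Because Beam describes the pipeline and the runner decides where it executes, no cluster provisioning or sizing is needed on your side, which is why the table labels Dataflow as fully managed and NoOps.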