Dataproc |
Dataflow |
Dataprep |
DataFusion |
It supports manual provision to clusters |
It supports automatic provision to clusters |
It helps to prepare and clean the data for future use. |
It supports major Hadoop distributions (MapR, Hortonworks) and Cloud (AWS, GCP, AZURE) to build pipeline. |
If systems are Hadoop dependent then good to use Dataproc. It is created as an extension service for Hadoop. Real-time data collection with Hadoop and Spark integration feature is more prominent in it. |
It is based on Apache Beam, used for data lake data collection, cleaning and workload processing in parallel manner. It mainly merges programming & execution models. |
It is only seen as a data processing tool helps in Visual analytics and processing data as a plus-point. Mainly used with Big table and Big query. If you are only looking to find any anomalies or redundancy in the data, then can use it. |
It is based on CDAP, an open-source pipeline development tool. It offers visualization tool to build ETL/ELT pipelines. |
If one prefers a hands-on Dev-ops approach, then choose Dataproc |
if you prefer a serverless approach, then select Dataflow. Fully Managed and No Ops approach |
On the other hand, it is UI-driven, Fully Managed and follows No-Ops approach. |
In GCP it uses cloud dataproc cluster to perform jobs and comes up with multiple prebuilt connectors from to connect source to sink. |
Simple easy to use |
Relatively difficult to use |
Easy to use |
Very easy to use |
Used for Data Science and ML Eco System |
Used for Batch and Streaming Data processing |
Used for UI driver data processing where as multiple source data integrations. |
It gives you codeless pipeline development and enterprise readiness gives data lineage, metadata management much easier |
Tips and good resources for all.. Oracle, Big data, Hadoop, Unix, Linux
Saturday, November 27, 2021
Difference between Dataproc, Dataflow, Dataprep and Datafusion.
Tuesday, August 3, 2021
Difference between Structured, Semi-Structured and Un Structured data
Properties |
Structured
data |
Semi-structured
data |
Unstructured
data |
Basic |
Data whose
elements are addressable for effective analysis and organized into a
formatted tables, schemas or repository that is typically a database. |
Data is
information that does not reside in a relational database but that have some
organizational properties that make it easier to analyze. With some process, you can store
them in the relation database. |
Data is a data
which is not organized in a predefined manner or does not have a predefined
data model; thus it is not a good fit for a mainstream relational database. There
are some alternative platforms for storing and managing, |
Databases |
RDBMS like Oracle,
MySQL, PostgreSQL.
Commonly data
stored in data warehouses. |
Non RDBMS / NoSQL
databases like Mongo DB, Dynamo DB, Riak, Redis, etc.
Follows Hadoop
Methodology
Commonly data
stored in data lakes and data marts. |
NoSQL databases
like Mongo DB, Cassandra
HBase, CouchDB, Dynamo DB, Riak, Redis, etc.
Store’s character
and binary data such as pictures, audio, video, pdf, log files, satellite
images, scientific images, radar data, etc
Commonly data
stored in data lakes and data marts. |
Scalability |
Very difficult
to scale DB schema. Can apply horizonal and vertical scaling |
scaling is
simpler than structured data |
more scalable. |
Transactions |
Matured transaction
and various concurrency techniques supports ACID |
Transaction is
adapted from DBMS not matured |
No transaction
management and no concurrency |
Flexibility |
It is schema
dependent and less flexible
Having predefined
format of data
Schema on write |
It is more
flexible than structured data but less flexible than unstructured data.
Variety of data
in shapes and sizes.
Schema on read |
It is more
flexible and there is absence of schema.
Variety of data
in shapes and sizes.
Schema on read |
Query
performance |
Structured
query allow complex joining |
Queries over
anonymous nodes are possible |
Only textual
queries are possible |
Version
management |
Versioning over
tuples,row,tables |
Versioning over
tuples or graph is possible |
Versioned as a whole |
Robustness |
Very robust |
New technology,
not very spread |
New technology,
not very spread |
Thursday, June 17, 2021
compare star schema and snowflake schema in datawarehouse modeling
Star Schema |
Snowflake Schema |
Simple Database Design. |
Very Complex Database Design. |
De-normalized Data structure. |
Normalized Data Structure. |
Query also run faster. |
Query runs slower comparatively star schema. |
It contains one fact table surrounded by
dimension tables. |
One fact table surrounded by dimension table
which are in turn surrounded by dimension table |
It is easy to understand and provides optimal disk usage |
It uses smaller disk space. |
Only single join creates the relationship
between the fact table and any dimension tables. |
It requires many joins to fetch the data. |
Single Dimension table contains aggregated
data. |
Data Split into different Dimension Tables. |
High level of data redundancy |
Very low-level data redundancy |
Cube processing is faster. |
Cube processing might be slow because of the
complex join. |
Hierarchies for the dimensions are stored in
the dimensional table. |
Hierarchies are divided into separate
tables. |
Offers higher performing queries using Star
Join Query Optimization. Tables may be connected with multiple dimensions. |
It is represented by
centralized fact table which unlikely connected with multiple dimensions. |
Sunday, May 30, 2021
Standard SQL vs Legacy SQL - functions
Standard
SQL |
Legacy
SQL |
#standardSQL SELECT repository.url FROM `bigquery-public-data.samples.github_nested` LIMIT 5; |
#legacySQL SELECT repository.url FROM [bigquery-public-data.samples.github_nested] LIMIT 5; |
Numeric |
|
SAFE_CAST(x
AS INT64) |
INTEGER(x) |
SAFE_CAST(x
AS INT64) |
CAST(x
AS INTEGER) |
APPROX_COUNT_DISTINCT(x) |
COUNT(DISTINCT
x) |
COUNT(DISTINCT
x) |
EXACT_COUNT_DISTINCT(x) |
APPROX_QUANTILES(x,
buckets) |
QUANTILES(x, buckets
+ 1) |
APPROX_TOP_COUNT(x,
num) |
TOP(x, num), COUNT(*) |
MOD(x,
y) |
x
% y |
Datetime |
|
TIMESTAMP_DIFF(t1,
t2, DAY) |
DATEDIFF(t1, t2) |
CURRENT_TIMESTAMP |
NOW |
FORMAT_TIMESTAMP(fmt,
t) |
STRFTIME_UTC_USEC(t, fmt) |
TIMESTAMP_TRUNC(t,
DAY) |
UTC_USEC_TO_DAY(t) |
REGEXP_CONTAINS(s,
pattern) |
REGEXP_MATCH(s, pattern) |
x
IS NULL |
IS_NULL(x) |
Strings |
|
SAFE_CAST(x
AS STRING) |
STRING(x) |
SAFE_CAST(x
AS STRING) |
CAST(x
AS STRING) |
SUBSTR(s,
0, len) |
LEFT(s, len) |
SUBSTR(s,
-len) |
RIGHT(s, len) |
STRPOS(s,
"abc") > 0 or s LIKE '%abc%' |
s
CONTAINS "abc" |
STRING_AGG(s,
sep) |
GROUP_CONCAT_UNQUOTED(s, sep) |
IFNULL(LOGICAL_OR(x), false) |
SOME(x) |
IFNULL(LOGICAL_AND(x), true) |
EVERY(x) |
Arrays |
|
ARRAY_AGG(x) |
NEST(x) |
ANY_VALUE(x) |
ANY(x) |
arr[SAFE_ORDINAL(index)] |
NTH(index, arr)
WITHIN RECORD |
ARRAY_LENGTH(arr) |
COUNT(arr)
WITHIN RECORD |
Url / IP Address Functions |
|
NET.HOST(url) |
HOST(url) |
NET.PUBLIC_SUFFIX(url) |
TLD(url) |
NET.REG_DOMAIN(url) |
DOMAIN(url) |
NET.IPV4_TO_INT64( NET.IP_FROM_STRING( addr_string)) |
PARSE_IP(addr_string) |
NET.IP_TO_STRING( NET.IPV4_FROM_INT64( addr_int64
& 0xFFFFFFFF)) |
FORMAT_IP(addr_int64) |
NET.IP_FROM_STRING(addr_string) |
PARSE_PACKED_IP(addr_string) |
NET.IP_TO_STRING(addr_bytes) |
FORMAT_PACKED_IP(addr_bytes) |