Tuesday, August 3, 2021

Difference between Structured, Semi-Structured and Unstructured Data

| Property | Structured data | Semi-structured data | Unstructured data |
|---|---|---|---|
| Basic | Data whose elements are addressable for effective analysis, organized into formatted tables, schemas, or a repository, typically a database. | Information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database. | Data that is not organized in a predefined manner or has no predefined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing it. |
| Databases | RDBMS such as Oracle, MySQL, PostgreSQL. Commonly stored in data warehouses. | Non-RDBMS / NoSQL databases such as MongoDB, DynamoDB, Riak, Redis, etc. Follows the Hadoop methodology. Commonly stored in data lakes and data marts. | NoSQL databases such as MongoDB, Cassandra, HBase, CouchDB, DynamoDB, Riak, Redis, etc. Stores character and binary data such as pictures, audio, video, PDF, log files, satellite images, scientific images, radar data, etc. Commonly stored in data lakes and data marts. |
| Scalability | Scaling the database schema is very difficult. Horizontal and vertical scaling can be applied. | Scaling is simpler than for structured data. | More scalable than structured and semi-structured data. |
| Transactions | Mature transaction support and various concurrency techniques; supports ACID. | Transaction handling is adapted from DBMS and is not mature. | No transaction management and no concurrency control. |
| Flexibility | Schema dependent and less flexible. Data has a predefined format. Schema on write. | More flexible than structured data but less flexible than unstructured data. Data comes in a variety of shapes and sizes. Schema on read. | Most flexible; there is no schema. Data comes in a variety of shapes and sizes. Schema on read. |
| Query performance | Structured queries allow complex joins. | Queries over anonymous nodes are possible. | Only textual queries are possible. |
| Version management | Versioning over tuples, rows, and tables. | Versioning over tuples or graphs is possible. | Versioned as a whole. |
| Robustness | Very robust. | New technology, not yet widespread. | New technology, not yet widespread. |
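The "schema on write" versus "schema on read" distinction in the Flexibility row can be made concrete with a small sketch. This is only an illustration under assumptions: the `users` table, the JSON records, and the use of SQLite as a stand-in for an RDBMS are all hypothetical and chosen for brevity.

```python
import json
import sqlite3

# Schema on write: the table definition is enforced at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada"))

try:
    # This row violates NOT NULL, so the write itself is rejected.
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (2, None))
    write_rejected = False
except sqlite3.IntegrityError:
    write_rejected = True

# Schema on read: records are stored as-is and interpreted only when read.
raw_records = [
    '{"id": 1, "name": "Ada"}',
    '{"id": 2, "email": "bob@example.com"}',  # different shape, still accepted
]
names = [json.loads(record).get("name") for record in raw_records]

print(write_rejected)  # True
print(names)           # ['Ada', None]
```

With the structured store, the malformed row never lands in the table; with the semi-structured records, the missing field only shows up when a reader asks for it.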

Thursday, June 17, 2021

Compare Star Schema and Snowflake Schema in Data Warehouse Modeling

| Star Schema | Snowflake Schema |
|---|---|
| Simple database design. | Very complex database design. |
| Denormalized data structure. | Normalized data structure. |
| Queries run faster. | Queries run slower compared with a star schema. |
| Contains one fact table surrounded by dimension tables. | Contains one fact table surrounded by dimension tables, which are in turn surrounded by further dimension tables. |
| Easy to understand, but uses more disk space because dimension data is denormalized. | Uses less disk space. |
| A single join relates the fact table to any dimension table. | Requires many joins to fetch the data. |
| A single dimension table contains the combined data for that dimension. | Data is split into different dimension tables. |
| High level of data redundancy. | Very low level of data redundancy. |
| Cube processing is faster. | Cube processing may be slower because of the complex joins. |
| Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables. |
| Offers higher-performing queries using star join query optimization. Tables may be connected with multiple dimensions. | Represented by a centralized fact table that is unlikely to be connected with multiple dimensions. |
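The join-count difference in the comparison above can be sketched with a toy example. Everything here is an assumption made for illustration: the table names (`fact_sales`, `dim_product_star`, `dim_product_snow`, `dim_category`), the sample rows, and the use of an in-memory SQLite database in place of a real data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER);
INSERT INTO fact_sales VALUES (1, 10), (2, 20), (3, 5);

-- Star: one denormalized dimension table; the category name repeats per product.
CREATE TABLE dim_product_star (product_id INTEGER, product_name TEXT, category_name TEXT);
INSERT INTO dim_product_star VALUES
  (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mug', 'Kitchen');

-- Snowflake: the category hierarchy is normalized into its own table.
CREATE TABLE dim_category (category_id INTEGER, category_name TEXT);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Kitchen');
CREATE TABLE dim_product_snow (product_id INTEGER, product_name TEXT, category_id INTEGER);
INSERT INTO dim_product_snow VALUES (1, 'Pen', 1), (2, 'Notebook', 1), (3, 'Mug', 2);
""")

# Star schema: a single join from the fact table reaches the category name.
star = conn.execute("""
SELECT d.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star d ON f.product_id = d.product_id
GROUP BY d.category_name ORDER BY d.category_name
""").fetchall()

# Snowflake schema: the same question needs two joins.
snow = conn.execute("""
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON f.product_id = p.product_id
JOIN dim_category c ON p.category_id = c.category_id
GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()

print(star)  # [('Kitchen', 5), ('Stationery', 30)]
print(snow)  # [('Kitchen', 5), ('Stationery', 30)]
```

Both layouts answer the query identically; the star version stores 'Stationery' twice (redundancy) while the snowflake version stores it once and pays for that with the extra join.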

Sunday, May 30, 2021

Standard SQL vs Legacy SQL - functions

Standard SQL:

#standardSQL
SELECT repository.url
FROM `bigquery-public-data.samples.github_nested`
LIMIT 5;

Legacy SQL:

#legacySQL
SELECT repository.url
FROM [bigquery-public-data:samples.github_nested]
LIMIT 5;
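The two queries differ mainly in how the table is referenced: standard SQL uses backticks with periods, while the documented legacy SQL form uses square brackets with a colon after the project. As a rough sketch of that rewrite, the hypothetical helper below (`legacy_to_standard_table` is not part of any BigQuery client library) converts one reference style to the other. It assumes simple references of the form `[project:dataset.table]` or `[dataset.table]` and ignores table decorators and wildcard functions.

```python
import re


def legacy_to_standard_table(ref: str) -> str:
    """Convert '[project:dataset.table]' to '`project.dataset.table`'.

    Hypothetical helper for illustration only; it handles plain table
    references and nothing else.
    """
    match = re.fullmatch(r"\[(?:([^:\]]+):)?([^.\]]+)\.([^\]]+)\]", ref.strip())
    if match is None:
        raise ValueError(f"not a recognized legacy table reference: {ref!r}")
    project, dataset, table = match.groups()
    # The project part is optional in legacy references.
    parts = [part for part in (project, dataset, table) if part]
    return "`" + ".".join(parts) + "`"


converted = legacy_to_standard_table("[bigquery-public-data:samples.github_nested]")
print(converted)
```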


Numeric

| Standard SQL | Legacy SQL |
|---|---|
| SAFE_CAST(x AS INT64) | INTEGER(x) |
| SAFE_CAST(x AS INT64) | CAST(x AS INTEGER) |
| APPROX_COUNT_DISTINCT(x) | COUNT(DISTINCT x) |
| COUNT(DISTINCT x) | EXACT_COUNT_DISTINCT(x) |
| APPROX_QUANTILES(x, buckets) | QUANTILES(x, buckets + 1) |
| APPROX_TOP_COUNT(x, num) | TOP(x, num), COUNT(*) |
| MOD(x, y) | x % y |

Datetime

| Standard SQL | Legacy SQL |
|---|---|
| TIMESTAMP_DIFF(t1, t2, DAY) | DATEDIFF(t1, t2) |
| CURRENT_TIMESTAMP | NOW |
| FORMAT_TIMESTAMP(fmt, t) | STRFTIME_UTC_USEC(t, fmt) |
| TIMESTAMP_TRUNC(t, DAY) | UTC_USEC_TO_DAY(t) |
| REGEXP_CONTAINS(s, pattern) | REGEXP_MATCH(s, pattern) |
| x IS NULL | IS_NULL(x) |

Strings

| Standard SQL | Legacy SQL |
|---|---|
| SAFE_CAST(x AS STRING) | STRING(x) |
| SAFE_CAST(x AS STRING) | CAST(x AS STRING) |
| SUBSTR(s, 0, len) | LEFT(s, len) |
| SUBSTR(s, -len) | RIGHT(s, len) |
| STRPOS(s, "abc") > 0 or s LIKE '%abc%' | s CONTAINS "abc" |
| STRING_AGG(s, sep) | GROUP_CONCAT_UNQUOTED(s, sep) |
| IFNULL(LOGICAL_OR(x), false) | SOME(x) |
| IFNULL(LOGICAL_AND(x), true) | EVERY(x) |

Arrays

| Standard SQL | Legacy SQL |
|---|---|
| ARRAY_AGG(x) | NEST(x) |
| ANY_VALUE(x) | ANY(x) |
| arr[SAFE_ORDINAL(index)] | NTH(index, arr) WITHIN RECORD |
| ARRAY_LENGTH(arr) | COUNT(arr) WITHIN RECORD |

URL / IP Address Functions

| Standard SQL | Legacy SQL |
|---|---|
| NET.HOST(url) | HOST(url) |
| NET.PUBLIC_SUFFIX(url) | TLD(url) |
| NET.REG_DOMAIN(url) | DOMAIN(url) |
| NET.IPV4_TO_INT64(NET.IP_FROM_STRING(addr_string)) | PARSE_IP(addr_string) |
| NET.IP_TO_STRING(NET.IPV4_FROM_INT64(addr_int64 & 0xFFFFFFFF)) | FORMAT_IP(addr_int64) |
| NET.IP_FROM_STRING(addr_string) | PARSE_PACKED_IP(addr_string) |
| NET.IP_TO_STRING(addr_bytes) | FORMAT_PACKED_IP(addr_bytes) |
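The nested IP conversions are easier to follow with concrete values. The sketch below is a Python analogue built on the standard `ipaddress` module, not BigQuery code; the function names simply mirror the legacy SQL names for readability, and error handling differs from BigQuery (these versions raise on invalid input).

```python
import ipaddress


def parse_ip(addr_string: str) -> int:
    # Analogue of NET.IPV4_TO_INT64(NET.IP_FROM_STRING(addr_string)).
    return int(ipaddress.IPv4Address(addr_string))


def format_ip(addr_int64: int) -> str:
    # Analogue of NET.IP_TO_STRING(NET.IPV4_FROM_INT64(addr_int64 & 0xFFFFFFFF));
    # the mask keeps only the low 32 bits, as in the SQL expression.
    return str(ipaddress.IPv4Address(addr_int64 & 0xFFFFFFFF))


def parse_packed_ip(addr_string: str) -> bytes:
    # Analogue of NET.IP_FROM_STRING: network-byte-order bytes (4 for IPv4, 16 for IPv6).
    return ipaddress.ip_address(addr_string).packed


def format_packed_ip(addr_bytes: bytes) -> str:
    # Analogue of NET.IP_TO_STRING: packed bytes back to text form.
    return str(ipaddress.ip_address(addr_bytes))


print(parse_ip("1.2.3.4"))                  # 16909060
print(format_ip(16909060))                  # 1.2.3.4
print(parse_packed_ip("1.2.3.4"))           # b'\x01\x02\x03\x04'
print(format_packed_ip(b"\x01\x02\x03\x04"))  # 1.2.3.4
```

The integer form is just the four octets read as one big-endian number: 1·2^24 + 2·2^16 + 3·2^8 + 4 = 16909060.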