Sunday, May 30, 2021

Standard SQL vs Legacy SQL - functions

Standard SQL

Legacy SQL


#standardSQL

SELECT 

repository.url

FROM 

`bigquery-public-data.samples.github_nested`

LIMIT 5;



#legacySQL

SELECT  

repository.url

FROM

[bigquery-public-data.samples.github_nested]

LIMIT 5;


Numeric

SAFE_CAST(x AS INT64)

INTEGER(x)

SAFE_CAST(x AS INT64)

CAST(x AS INTEGER)

APPROX_COUNT_DISTINCT(x)

COUNT(DISTINCT x)

COUNT(DISTINCT x)

EXACT_COUNT_DISTINCT(x)

APPROX_QUANTILES(x, buckets)

QUANTILES(x, buckets + 1)

APPROX_TOP_COUNT(x, num)

TOP(x, num), COUNT(*)

MOD(x, y)

x % y

Datetime

TIMESTAMP_DIFF(t1, t2, DAY)

DATEDIFF(t1, t2)

CURRENT_TIMESTAMP

NOW

FORMAT_TIMESTAMP(fmt, t)

STRFTIME_UTC_USEC(t, fmt)

TIMESTAMP_TRUNC(t, DAY)

UTC_USEC_TO_DAY(t)

REGEXP_CONTAINS(s, pattern)

REGEXP_MATCH(s, pattern)

x IS NULL

IS_NULL(x)

Strings

SAFE_CAST(x AS STRING)

STRING(x)

SAFE_CAST(x AS STRING)

CAST(x AS STRING)

SUBSTR(s, 0, len)

LEFT(s, len)

SUBSTR(s, -len)

RIGHT(s, len)

STRPOS(s, "abc") > 0 or s LIKE '%abc%'

s CONTAINS "abc"

STRING_AGG(s, sep)

GROUP_CONCAT_UNQUOTED(s, sep)

IFNULL(LOGICAL_OR(x), false)

SOME(x)

IFNULL(LOGICAL_AND(x), true)

EVERY(x)

Arrays

ARRAY_AGG(x)

NEST(x)

ANY_VALUE(x)

ANY(x)

arr[SAFE_ORDINAL(index)]

NTH(index, arr) WITHIN RECORD

ARRAY_LENGTH(arr)

COUNT(arr) WITHIN RECORD

Url / IP Address Functions

NET.HOST(url)

HOST(url)

NET.PUBLIC_SUFFIX(url)

TLD(url)

NET.REG_DOMAIN(url)

DOMAIN(url)

NET.IPV4_TO_INT64(

NET.IP_FROM_STRING(

addr_string))

PARSE_IP(addr_string)

NET.IP_TO_STRING(

NET.IPV4_FROM_INT64(

addr_int64 & 0xFFFFFFFF))

FORMAT_IP(addr_int64)

NET.IP_FROM_STRING(addr_string)

PARSE_PACKED_IP(addr_string)

NET.IP_TO_STRING(addr_bytes)

FORMAT_PACKED_IP(addr_bytes)


Wednesday, May 5, 2021

Difference between Cloud Storage and HDFS

Feature
Cloud Storage
HDFS
I/O variance
Higher variance
We can avoid it by using caching and read replica, considering cloud big table and cloud datastore.
Lower variance
Support of file appends or truncates
No
can do overwrite of files
Yes
POSIX - compliant 
not compliant
not fully compliant
file system information
no information available
"hadoop fsck --files --blocks"   -  exposes all directory information's in HDFS.
request latency
greater round trip latency
lesser latency
cost
57% lower total cost of ownership 
Much higher compared to Cloud
separation between cloud storage and HDFS
Multiple clusters can access same location
usage of gs:// in cloud workloads
much better performance on Persistent Disk
Single cluster access
access using hdfs:// with in cluster
lesser performance due to low grade HDD
interoperability
seamless connectivity between spark and hadoop instances.
exported files can import in big query.
easier  to manage cloud dataflow tasks
standalone cluster instances operations
manual access provision between instances and storage
storage management
Lesser overhead on operations team
Much overhead on operations team
startup time
quicker job startup time during job execution
it takes significant time to start up jobs
security         
cloud inbuilt security and access using IAM
manual security and access setup. changing security key is an overhead to operations teams