Sunday, May 30, 2021

Standard SQL vs Legacy SQL - functions

Standard SQL:

#standardSQL
SELECT
  repository.url
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

Legacy SQL:

#legacySQL
SELECT
  repository.url
FROM
  [bigquery-public-data.samples.github_nested]
LIMIT 5;
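
To run both dialects from the command line, a minimal sketch using the bq CLI (assuming the Cloud SDK is installed and a default project is configured; depending on your bq configuration, legacy SQL may still be the default dialect):

## standard SQL via the bq CLI
bq query --use_legacy_sql=false \
  'SELECT repository.url FROM `bigquery-public-data.samples.github_nested` LIMIT 5;'

## legacy SQL via the bq CLI
bq query --use_legacy_sql=true \
  'SELECT repository.url FROM [bigquery-public-data.samples.github_nested] LIMIT 5;'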


Standard SQL                                    Legacy SQL

Numeric
SAFE_CAST(x AS INT64)                           INTEGER(x)
SAFE_CAST(x AS INT64)                           CAST(x AS INTEGER)
APPROX_COUNT_DISTINCT(x)                        COUNT(DISTINCT x)
COUNT(DISTINCT x)                               EXACT_COUNT_DISTINCT(x)
APPROX_QUANTILES(x, buckets)                    QUANTILES(x, buckets + 1)
APPROX_TOP_COUNT(x, num)                        TOP(x, num), COUNT(*)
MOD(x, y)                                       x % y

Datetime
TIMESTAMP_DIFF(t1, t2, DAY)                     DATEDIFF(t1, t2)
CURRENT_TIMESTAMP                               NOW
FORMAT_TIMESTAMP(fmt, t)                        STRFTIME_UTC_USEC(t, fmt)
TIMESTAMP_TRUNC(t, DAY)                         UTC_USEC_TO_DAY(t)

Regular expressions / NULL checks
REGEXP_CONTAINS(s, pattern)                     REGEXP_MATCH(s, pattern)
x IS NULL                                       IS_NULL(x)

Strings
SAFE_CAST(x AS STRING)                          STRING(x)
SAFE_CAST(x AS STRING)                          CAST(x AS STRING)
SUBSTR(s, 0, len)                               LEFT(s, len)
SUBSTR(s, -len)                                 RIGHT(s, len)
STRPOS(s, "abc") > 0 or s LIKE '%abc%'          s CONTAINS "abc"
STRING_AGG(s, sep)                              GROUP_CONCAT_UNQUOTED(s, sep)

Boolean aggregates
IFNULL(LOGICAL_OR(x), false)                    SOME(x)
IFNULL(LOGICAL_AND(x), true)                    EVERY(x)

Arrays
ARRAY_AGG(x)                                    NEST(x)
ANY_VALUE(x)                                    ANY(x)
arr[SAFE_ORDINAL(index)]                        NTH(index, arr) WITHIN RECORD
ARRAY_LENGTH(arr)                               COUNT(arr) WITHIN RECORD

URL / IP address functions
NET.HOST(url)                                   HOST(url)
NET.PUBLIC_SUFFIX(url)                          TLD(url)
NET.REG_DOMAIN(url)                             DOMAIN(url)
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(addr_string))              PARSE_IP(addr_string)
NET.IP_TO_STRING(NET.IPV4_FROM_INT64(addr_int64 & 0xFFFFFFFF))  FORMAT_IP(addr_int64)
NET.IP_FROM_STRING(addr_string)                 PARSE_PACKED_IP(addr_string)
NET.IP_TO_STRING(addr_bytes)                    FORMAT_PACKED_IP(addr_bytes)
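
As a quick, hedged illustration of a few of the mappings above (again via the bq CLI against the public sample table; the alias names are only for demonstration):

## APPROX_COUNT_DISTINCT replaces COUNT(DISTINCT x), MOD(x, y) replaces x % y,
## and SAFE_CAST returns NULL instead of raising an error on a bad cast
bq query --use_legacy_sql=false \
  'SELECT
     APPROX_COUNT_DISTINCT(repository.url) AS approx_repos,
     MOD(COUNT(*), 1000) AS row_count_mod_1000,
     SAFE_CAST("abc" AS INT64) AS bad_cast_returns_null
   FROM `bigquery-public-data.samples.github_nested`;'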


Wednesday, May 5, 2021

Difference between Cloud Storage and HDFS

Feature: I/O variance
  Cloud Storage: Higher variance; it can be mitigated with caching and read replicas, or by considering Cloud Bigtable or Cloud Datastore for latency-sensitive workloads.
  HDFS: Lower variance.

Feature: Support for file appends or truncates
  Cloud Storage: No; objects are immutable and can only be overwritten as a whole.
  HDFS: Yes.

Feature: POSIX compliance
  Cloud Storage: Not compliant.
  HDFS: Not fully compliant.

Feature: File system information
  Cloud Storage: No block or directory metadata of this kind is exposed.
  HDFS: "hdfs fsck / -files -blocks" exposes directory and block information in HDFS.

Feature: Request latency
  Cloud Storage: Greater round-trip latency.
  HDFS: Lower latency.

Feature: Cost
  Cloud Storage: Around 57% lower total cost of ownership.
  HDFS: Much higher compared to Cloud Storage.

Feature: Separation of storage and compute
  Cloud Storage: Multiple clusters can access the same location; cloud workloads address it with gs:// URIs (see the sketch after this table), and any HDFS that remains on a Dataproc cluster runs on Persistent Disk, which performs better than low-grade local disks.
  HDFS: Single-cluster access; accessed with hdfs:// from within the cluster; performance depends on the underlying disks, which on-premises are often lower-grade HDDs.

Feature: Interoperability
  Cloud Storage: Seamless connectivity with Spark and Hadoop instances; exported files can be imported directly into BigQuery; Cloud Dataflow tasks are easier to manage.
  HDFS: Standalone cluster operations; access between instances and storage must be provisioned manually.

Feature: Storage management
  Cloud Storage: Less overhead for the operations team.
  HDFS: Much more overhead for the operations team.

Feature: Startup time
  Cloud Storage: Quicker job startup during job execution.
  HDFS: Jobs take significantly longer to start up.

Feature: Security
  Cloud Storage: Built-in cloud security with access controlled through IAM.
  HDFS: Manual security and access setup; rotating security keys is an added overhead for the operations team.
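
To illustrate the gs:// versus hdfs:// access patterns mentioned above, a minimal sketch from a node of a Dataproc cluster (or any cluster with the Cloud Storage connector installed); the bucket and paths are placeholders:

## Cloud Storage: any cluster with the connector (or gsutil from anywhere) can address the same bucket
hadoop fs -ls gs://my-example-bucket/datasets/
gsutil ls gs://my-example-bucket/datasets/

## HDFS: only reachable from within the cluster that owns it
hdfs dfs -ls hdfs:///user/hadoop/datasets/
hdfs dfs -put local_file.csv hdfs:///user/hadoop/datasets/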

Saturday, August 24, 2019

Create SSH keys and copy them to the remote server

Step 1: Create the public and private keys using ssh-keygen on the local host

mkdir -p $HOME/.ssh
chmod 0700 $HOME/.ssh

## generate a 4096-bit RSA key pair with a custom file name and comment
ssh-keygen -t rsa -b 4096 -f ~/.ssh/newremoteweb.key -C "My new remote web key"

## or accept the defaults (~/.ssh/id_rsa) with an empty passphrase
ssh-keygen -t rsa -P ""


Step 2: Copy the public key to remote-host using ssh-copy-id

## using ssh-copy-id (appends the key to the remote authorized_keys)
ssh-copy-id -i ~/.ssh/id_rsa.pub remote-host
ssh-copy-id -i ~/.ssh/id_rsa.pub remote-user@remote-host

## or copy it manually with scp (note: this overwrites any existing authorized_keys)
scp $HOME/.ssh/id_rsa.pub remote-user@remote-host:~/.ssh/authorized_keys


Adding the public key on the remote server manually

## first create the .ssh directory on the server
ssh remote-user@remote-host "umask 077; test -d .ssh || mkdir .ssh"

## cat the local id_rsa.pub file and pipe it over ssh to append the public key on the remote server
cat $HOME/.ssh/id_rsa.pub | ssh remote-user@remote-host "cat >> .ssh/authorized_keys"

Now try logging into the machine with "ssh 'remote-host'" and check that your key is listed in:

.ssh/authorized_keys

Step 3: Log in to remote-host without entering the password

ssh remote-host

It should connect without prompting for a password. That completes the test.

Some important files to keep in mind:

$HOME/.ssh/id_rsa – contains your private key.
$HOME/.ssh/id_rsa.pub – contains your public key.
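
If the passwordless login in Step 3 still prompts for a password, file permissions are the usual culprit; a minimal check, assuming the default key locations above:

## on the local host: the private key must not be readable by others
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/id_rsa

## on the remote host: sshd ignores authorized_keys that are group- or world-writable
ssh remote-user@remote-host "chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys"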


restarting/reloading the ssh daemon on the server

## on CentOS/RHEL/Fedora (older releases using SysV init)
sudo service sshd reload

## on systemd-based CentOS/RHEL/Fedora
sudo systemctl reload sshd

## on Debian/Ubuntu - older versions
sudo /etc/init.d/ssh reload

## on Debian/Ubuntu - newer versions
sudo systemctl reload ssh

## generic method: send SIGHUP to the running sshd
sudo kill -HUP `cat /var/run/sshd.pid`
## or
sudo kill -HUP $(cat /var/run/sshd.pid)


## list the public keys of all identities currently held by the ssh-agent:
ssh-add -L

## delete all identities from the ssh-agent:
ssh-add -D

## kill the running ssh-agent (the trap also kills it when the shell exits):
kill $SSH_AGENT_PID
trap "kill $SSH_AGENT_PID" 0
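
For completeness, a small sketch of starting an agent and loading a key into it (the key path assumes the defaults used above):

## start an ssh-agent in the current shell and export its environment variables
eval "$(ssh-agent -s)"

## add the private key (you will be asked for its passphrase, if any)
ssh-add $HOME/.ssh/id_rsa

## verify the identities currently loaded
ssh-add -l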


Thursday, July 18, 2019

When do HARD PARSE and SOFT PARSE occur, and how can we avoid them?


When a new SQL statement arrives, Oracle searches the library cache for a matching cursor:
  • If a suitable child cursor is found in the library cache, a SOFT PARSE occurs.
  • If no parent cursor is found, a HARD PARSE occurs.
  • If a parent cursor is found but none of its existing child cursors can be reused by this call (reuse depends on bind variable sizes, optimizer settings, and NLS settings), a HARD PARSE occurs.
  • If a parent cursor is found and an existing child cursor with a compatible environment and execution plan can be reused, a SOFT PARSE occurs.
In short, the parent cursor holds only the SQL statement text, while the child cursors hold the execution plans and the environment they were parsed under.

By using bind variables we can avoid unnecessary hard parses.
If bind variables are not used, every distinct set of literal values produces a new SQL text, so each such statement is hard parsed.
This has a major impact on performance: higher wait times during SQL execution and higher CPU and library cache cost per statement. A minimal example is sketched below.
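
A minimal sketch of the difference, run through SQL*Plus from the shell; the connect string and the classic SCOTT/EMP demo schema are assumptions for illustration only:

## literal values: every distinct literal is a new SQL text, so each one can trigger a hard parse
## bind variable: the SQL text stays constant, so repeated executions soft parse and share one cursor
sqlplus -s scott/tiger@ORCLPDB1 <<'EOF'
SELECT ename FROM emp WHERE empno = 7369;

VARIABLE v_empno NUMBER
EXEC :v_empno := 7369
SELECT ename FROM emp WHERE empno = :v_empno;
EOF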

Saturday, July 6, 2019

Difference between Hadoop 2.x and Hadoop 3.x

Feature: Java version
  Hadoop 2.x: Minimum Java 7.
  Hadoop 3.x: Minimum Java 8.

Feature: Fault tolerance
  Hadoop 2.x: Achieved with replication.
  Hadoop 3.x: Achieved with erasure coding.

Feature: Storage
  Hadoop 2.x: Replication factor 3 for data reliability, which increases disk usage. For example, a file of 3 blocks occupies 3 * 3 = 9 blocks. Storage overhead = 6/3 * 100 = 200%.
  Hadoop 3.x: Erasure coding for data reliability. Blocks are not replicated; instead HDFS computes parity blocks for groups of data blocks, and whenever blocks are lost or corrupted the framework reconstructs them from the remaining data and parity blocks. Storage overhead drops to around 50%. For example, with an RS(6,3) policy a file of 6 blocks occupies 9 blocks (6 data + 3 parity). Storage overhead = 3/6 * 100 = 50%.

Feature: YARN Timeline Service
  Hadoop 2.x: Timeline Service v1 has scalability issues as data grows; it uses a single writer and storage instance and does not scale beyond small clusters.
  Hadoop 3.x: Timeline Service v2 provides better scalability, reliability, and usability, with a scalable back-end store and a distributed writer architecture.

Feature: Heap size management
  Hadoop 2.x: Daemon heap sizes are configured with HADOOP_HEAPSIZE.
  Hadoop 3.x: New ways to configure daemon heap sizes: HADOOP_HEAPSIZE is deprecated in favour of HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN in hadoop-env.sh (values default to MB when no unit is given), and if they are unset the heap is auto-tuned based on the host's memory. The internal JAVA_HEAP_MAX variable is no longer used.

Feature: Standby NameNode
  Hadoop 2.x: Supports only one standby NameNode, tolerating the failure of a single NameNode.
  Hadoop 3.x: Supports two or more standby NameNodes; one NameNode is active and the others are standby.

Feature: Containers
  Hadoop 2.x: Works on the principle of guaranteed containers: a container starts running immediately because the resources are guaranteed to be available. Drawbacks: (1) feedback delays – when a container finishes, it notifies the RM; the RM then schedules a new container on that node and notifies the AM, which starts the new container, so there is a delay in this chain of notifications; (2) allocated vs. utilized resources – the resources the RM allocates to a container can be under-utilized (for example, a 4 GB container that uses only 2 GB), which reduces effective resource utilization.
  Hadoop 3.x: Adds opportunistic containers. Opportunistic containers are queued at the node if resources are not available and have lower priority than guaranteed containers, so the scheduler can preempt them to make room for guaranteed containers.

Feature: Port numbers for multiple services
  Hadoop 2.x: Several services used ports from the Linux ephemeral port range (32768-61000), which could cause Hadoop services to fail at startup because of port conflicts.
  Hadoop 3.x: The affected NameNode, Secondary NameNode, and DataNode ports were moved out of the ephemeral range.
    NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
    Secondary NameNode ports: 50091 -> 9869, 50090 -> 9868
    DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864

Feature: Data load balancer
  Hadoop 2.x: A single DataNode manages many disks, and these disks fill unevenly during normal writes; adding or replacing disks can lead to significant skew within a DataNode. The HDFS balancer in Hadoop 2.x balances across DataNodes and cannot handle this situation.
  Hadoop 3.x: New intra-DataNode balancing handles the situation above. The diskbalancer CLI invokes the intra-DataNode balancer; to enable it, set dfs.disk.balancer.enabled=true on all DataNodes (see the sketch after this table).

Feature: File system support
  Hadoop 2.x: Local file systems, HDFS (the default), FTP, Amazon S3, Windows Azure Storage Blobs (WASB), and other distributed file systems.
  Hadoop 3.x: All of the above, plus the Microsoft Azure Data Lake file system and the Aliyun Object Storage System.

Feature: Scalability
  Hadoop 2.x: Clusters scale up to around 10,000 nodes.
  Hadoop 3.x: Clusters can scale beyond 10,000 nodes.
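
A short sketch of the Hadoop 3.x tooling mentioned above (the host name, paths, and policy choice are placeholders; the commands themselves are the stock hdfs subcommands):

## intra-DataNode disk balancing (requires dfs.disk.balancer.enabled=true on the DataNodes)
hdfs diskbalancer -plan datanode1.example.com
## use the plan file path printed by the -plan step
hdfs diskbalancer -execute datanode1.example.com.plan.json
hdfs diskbalancer -query datanode1.example.com

## erasure coding: list the available policies and apply one to a directory
hdfs ec -listPolicies
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
hdfs ec -getPolicy -path /data/cold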

Thursday, September 6, 2018

What are Edge Nodes or Gateway Nodes in Hadoop?

Edge nodes are the interface between the Hadoop cluster and the outside network, through which Hadoop users can load files into the cluster. Because an edge node acts as a gateway to the cluster, it is sometimes referred to as a gateway node as well.

Commonly, edge nodes are used to run cluster administration tools and client applications. Edge nodes are kept separate from the cluster nodes that run HDFS, MapReduce, and other components, mainly to keep the cluster's computing resources separated from the outside world.

Edge nodes that sit within the cluster network allow centralized management of the Hadoop client configuration, which reduces the administrative effort needed to keep config files up to date across the cluster.
Security within Hadoop itself is limited, even if your cluster operates in a LAN or WAN behind a firewall, so you may want to consider a cluster-specific firewall to fully protect the cluster's non-public data.
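
As a small illustration, the kind of client work that typically runs from an edge node rather than from a DataNode; the user name, file names, and paths here are placeholders:

## copy a local file into HDFS from the edge node
hdfs dfs -mkdir -p /user/analyst/input
hdfs dfs -put sales_2018.csv /user/analyst/input/

## submit a MapReduce example job to the cluster from the edge node
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/analyst/input /user/analyst/output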