Sunday, May 30, 2021
Standard SQL vs Legacy SQL - functions
Standard SQL | Legacy SQL |
#standardSQL SELECT repository.url FROM `bigquery-public-data.samples.github_nested` LIMIT 5; | #legacySQL SELECT repository.url FROM [bigquery-public-data.samples.github_nested] LIMIT 5; |
Numeric | |
SAFE_CAST(x AS INT64) | INTEGER(x) |
SAFE_CAST(x AS INT64) | CAST(x AS INTEGER) |
APPROX_COUNT_DISTINCT(x) | COUNT(DISTINCT x) |
COUNT(DISTINCT x) | EXACT_COUNT_DISTINCT(x) |
APPROX_QUANTILES(x, buckets) | QUANTILES(x, buckets + 1) |
APPROX_TOP_COUNT(x, num) | TOP(x, num), COUNT(*) |
MOD(x, y) | x % y |
Datetime | |
TIMESTAMP_DIFF(t1, t2, DAY) | DATEDIFF(t1, t2) |
CURRENT_TIMESTAMP | NOW |
FORMAT_TIMESTAMP(fmt, t) | STRFTIME_UTC_USEC(t, fmt) |
TIMESTAMP_TRUNC(t, DAY) | UTC_USEC_TO_DAY(t) |
REGEXP_CONTAINS(s, pattern) | REGEXP_MATCH(s, pattern) |
x IS NULL | IS_NULL(x) |
Strings | |
SAFE_CAST(x AS STRING) | STRING(x) |
SAFE_CAST(x AS STRING) | CAST(x AS STRING) |
SUBSTR(s, 0, len) | LEFT(s, len) |
SUBSTR(s, -len) | RIGHT(s, len) |
STRPOS(s, "abc") > 0 or s LIKE '%abc%' | s CONTAINS "abc" |
STRING_AGG(s, sep) | GROUP_CONCAT_UNQUOTED(s, sep) |
IFNULL(LOGICAL_OR(x), false) | SOME(x) |
IFNULL(LOGICAL_AND(x), true) | EVERY(x) |
Arrays | |
ARRAY_AGG(x) | NEST(x) |
ANY_VALUE(x) | ANY(x) |
arr[SAFE_ORDINAL(index)] | NTH(index, arr) WITHIN RECORD |
ARRAY_LENGTH(arr) | COUNT(arr) WITHIN RECORD |
URL / IP Address Functions | |
NET.HOST(url) | HOST(url) |
NET.PUBLIC_SUFFIX(url) | TLD(url) |
NET.REG_DOMAIN(url) | DOMAIN(url) |
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(addr_string)) | PARSE_IP(addr_string) |
NET.IP_TO_STRING(NET.IPV4_FROM_INT64(addr_int64 & 0xFFFFFFFF)) | FORMAT_IP(addr_int64) |
NET.IP_FROM_STRING(addr_string) | PARSE_PACKED_IP(addr_string) |
NET.IP_TO_STRING(addr_bytes) | FORMAT_PACKED_IP(addr_bytes) |
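For example, a distinct count over the same public sample table migrates as shown below. Note that in legacy SQL COUNT(DISTINCT x) is an approximation, so its standard SQL equivalent is APPROX_COUNT_DISTINCT; use COUNT(DISTINCT x) in standard SQL when you need the exact value. The alias repos is only illustrative.
#legacySQL
SELECT COUNT(DISTINCT repository.url) AS repos FROM [bigquery-public-data.samples.github_nested];
#standardSQL
SELECT APPROX_COUNT_DISTINCT(repository.url) AS repos FROM `bigquery-public-data.samples.github_nested`;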
Wednesday, May 5, 2021
Difference between Cloud Storage and HDFS
Feature | Cloud Storage | HDFS |
I/O variance | Higher variance; it can be mitigated with caching and read replicas, or by considering Cloud Bigtable or Cloud Datastore for such workloads. | Lower variance |
Support for file appends or truncates | No; files can only be overwritten | Yes |
POSIX compliance | Not compliant | Not fully compliant |
File system information | Not available | "hdfs fsck / -files -blocks" exposes file, block and directory information in HDFS. |
Request latency | Greater round-trip latency | Lower latency |
Cost | About 57% lower total cost of ownership | Much higher compared to Cloud Storage |
Separation of storage and compute | Multiple clusters can access the same location; cloud workloads use gs:// paths; much better performance on Persistent Disk | Single-cluster access using hdfs:// from within the cluster; lower performance due to low-grade HDDs |
Interoperability | Seamless connectivity between Spark and Hadoop instances; exported files can be imported into BigQuery; easier to manage Cloud Dataflow tasks | Standalone cluster operations; access between instances and storage has to be provisioned manually |
Storage management | Less overhead for the operations team | Much more overhead for the operations team |
Startup time | Quicker job startup during job execution | Jobs take significant time to start up |
Security | Built-in cloud security and access control via IAM | Manual security and access setup; rotating security keys is an overhead for operations teams |
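As a quick illustration of the interoperability row above: once the Cloud Storage connector is installed on the cluster, the same Hadoop tooling can address both storage systems. This is only a sketch; the bucket name my-bucket and the paths are placeholders.
## list data stored on the cluster's HDFS
hadoop fs -ls hdfs:///user/data/
## list data stored on Cloud Storage through the gs:// scheme
hadoop fs -ls gs://my-bucket/data/
## copy a directory from HDFS to Cloud Storage
hadoop distcp hdfs:///user/data/ gs://my-bucket/data/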
Saturday, August 24, 2019
create ssh keys and copy them to the remote server
Step 1: Create the public and private keys using ssh-keygen on local-host
mkdir -p $HOME/.ssh
chmod 0700 $HOME/.ssh
ssh-keygen -t rsa -b 4096 -f ~/.ssh/newremoteweb.key -C "My new remote web key"
## or generate the default key pair (~/.ssh/id_rsa) with an empty passphrase
ssh-keygen -t rsa -P ""
Step 2: Copy the public key to remote-host using ssh-copy-id
ssh-copy-id -i ~/.ssh/id_rsa.pub remote-host
ssh-copy-id -i ~/.ssh/id_rsa.pub remote-user@remote-host
scp $HOME/.ssh/id_rsa.pub remote-user@remote-host:~/.ssh/authorized_keys
Adding the public key on the remote server manually (without ssh-copy-id):
## First create .ssh directory on server
ssh remote-user@remote-host "umask 077; test -d .ssh || mkdir .ssh"
## cat the local id_rsa.pub file and pipe it over ssh to append the public key on the remote server
cat $HOME/.ssh/id_rsa.pub | ssh remote-user@remote-host "cat >> .ssh/authorized_keys"
Now try logging into the machine with "ssh remote-host" and check that your key was appended to:
.ssh/authorized_keys
Step 3: Login to remote-host without entering the password
ssh remote-host
It should connect without prompting for a password. Our testing is done.
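If you generated a non-default key such as ~/.ssh/newremoteweb.key in Step 1, a host entry in ~/.ssh/config saves typing the key path on every login. This is just a sketch; the alias remoteweb is a placeholder.
Host remoteweb
    HostName remote-host
    User remote-user
    IdentityFile ~/.ssh/newremoteweb.key
After that, "ssh remoteweb" logs in with the right key automatically.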
Some important points to consider:
$HOME/.ssh/id_rsa - contains your private key.
$HOME/.ssh/id_rsa.pub - contains your public key.
Reloading the ssh daemon/service on the server:
## on centos/rhel/fedora
sudo service sshd reload
## on centos/rhel/fedora - with systemd
sudo systemctl reload sshd
## on debian/ubuntu - older version
sudo /etc/init.d/ssh reload
## on debian/ubuntu - latest
sudo systemctl reload ssh
## common method to reload sshd
sudo kill -HUP `cat /var/run/sshd.pid`
or
sudo kill -HUP $(cat /var/run/sshd.pid)
## list the public keys of all identities loaded in the ssh-agent:
ssh-add -L
## delete all identities (private keys) from the ssh-agent:
ssh-add -D
## kill the running ssh-agent:
kill $SSH_AGENT_PID
## or kill it automatically when the shell exits:
trap "kill $SSH_AGENT_PID" 0
Thursday, July 18, 2019
When do HARD PARSE and SOFT PARSE occur, and how can we avoid them?
When a new SQL statement arrives, Oracle looks for a matching cursor in the library cache:
- If no parent cursor is found, a HARD PARSE occurs.
- If a parent cursor is found but none of its existing child cursors can be reused by this call (reuse depends on bind variable sizes, optimizer settings and NLS settings), a HARD PARSE occurs as well.
- If a parent cursor is found and an existing child cursor with a compatible environment and execution plan can be reused, a SOFT PARSE occurs.
By using bind variables we can avoid unnecessary HARD PARSEs in the database.
If bind variables are not used, every statement with different literal values is hard parsed.
This has a major impact on server performance: we will see high wait times during SQL execution and a higher cost for each SQL statement.
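A minimal SQL*Plus sketch shows the difference; the employees table and employee_id column are only illustrative.
-- Literals: each distinct value produces a new statement text, so each execution is hard parsed.
SELECT * FROM employees WHERE employee_id = 101;
SELECT * FROM employees WHERE employee_id = 102;
-- Bind variable: the statement text stays identical, so repeated executions are soft parsed against the shared cursor.
VARIABLE emp_id NUMBER
EXEC :emp_id := 101
SELECT * FROM employees WHERE employee_id = :emp_id;
EXEC :emp_id := 102
SELECT * FROM employees WHERE employee_id = :emp_id;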
Saturday, July 6, 2019
Difference between Hadoop 2.x and Hadoop 3.x
Feature | Hadoop 2.x | Hadoop 3.x |
Java version | Java 7 | Java 8 or later (required) |
Fault tolerance | Achieved with replication | Achieved via erasure coding |
Storage | Replication (factor 3) for data reliability, which increases disk usage. For example, a file of 3 blocks occupies 3 x 3 = 9 blocks on disk, i.e. 3x the raw data, a 200% storage overhead. | Erasure coding for data reliability. Blocks are not replicated; instead HDFS computes parity blocks for the file blocks and, whenever file blocks get corrupted, recreates them from the remaining data and parity blocks. Storage overhead is reduced to about 50%: with the Reed-Solomon (6,3) policy, 6 data blocks need only 3 parity blocks, i.e. overhead = 3/6 * 100 = 50%. |
YARN Timeline Service | Timeline Service v1.x has scalability issues as data grows: it uses a single writer and a single storage instance and does not scale beyond small clusters. | Timeline Service v2 provides better scalability, reliability and usability, with scalable back-end storage and a distributed writer architecture. |
Heap size management | HADOOP_HEAPSIZE has to be configured manually. | New ways to configure daemon heap sizes, including auto-tuning based on the memory of the host. HADOOP_HEAPSIZE is deprecated; use HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN in hadoop-env.sh instead (memory units can be specified). |
Standby NameNode | Supports only one standby NameNode, so the cluster tolerates the failure of a single NameNode. | Supports two or more standby NameNodes; only one NameNode is active and the others remain in standby. |
Containers | Works on the principle of guaranteed containers: a container starts running immediately because the requested resources are guaranteed to be available. Drawbacks: (1) Feedback delays - when a container finishes it notifies the RM, the AM is notified only after the RM schedules a new container on that node, and only then can the AM start it, so there is a notification delay between RM and AM. (2) Allocated vs. utilized resources - the resources the RM allocates to a container can be under-utilized (for example, a 4 GB container may use only 2 GB), which reduces effective resource utilization. | Adds opportunistic containers: containers wait in a queue at the node when resources are not yet available. Opportunistic containers have lower priority than guaranteed containers and can be preempted so that resources remain available for guaranteed containers. |
Default port numbers | Several services used the Linux ephemeral port range (32768-61000), which could cause Hadoop services to fail at startup when a port was already taken. | Ports were moved out of the ephemeral range. NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820. Secondary NameNode ports: 50091 -> 9869, 50090 -> 9868. DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864. |
Data load balancer | A single DataNode manages many disks, and the disks fill unevenly during normal writes; adding or replacing disks can create significant skew within a DataNode. The HDFS balancer in Hadoop 2.x cannot handle this intra-node skew. | New intra-DataNode balancing handles this situation. The hdfs diskbalancer CLI invokes the intra-DataNode balancer; enable it by setting dfs.disk.balancer.enabled=true on all DataNodes. |
File system support | | |
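The erasure coding and intra-DataNode disk balancer features above are driven from the hdfs CLI. A rough sketch follows; the /data path, the dn01.example.com hostname and the plan file path are placeholders, and the real plan path is printed by the -plan step.
## erasure coding: list available policies and apply one to a directory
hdfs ec -listPolicies
hdfs ec -setPolicy -path /data -policy RS-6-3-1024k
hdfs ec -getPolicy -path /data
## intra-DataNode disk balancer (requires dfs.disk.balancer.enabled=true on the DataNodes)
hdfs diskbalancer -plan dn01.example.com
hdfs diskbalancer -execute /system/diskbalancer/<timestamp>/dn01.example.com.plan.json
hdfs diskbalancer -query dn01.example.com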
Thursday, September 6, 2018
What are Edge Nodes or Gateway Nodes in Hadoop?
Edge nodes are the interface between the Hadoop cluster and the outside network, through which Hadoop users can store files in the cluster. Because an edge node acts as a gateway to the cluster, it is sometimes referred to as a gateway node as well.
Commonly, edge nodes are used to run cluster administration tools and client applications. They are kept separate from the cluster nodes that run the HDFS, MapReduce and other components, mainly to keep the cluster's computing resources isolated from the outside world.
Running edge nodes within the cluster also allows centralized management of the Hadoop configuration on the cluster nodes, which reduces the effort administrators need to keep the configuration files up to date.
Hadoop itself provides only limited security, even if your cluster operates on a LAN or WAN behind a security firewall. Consider a cluster-specific firewall to fully protect the non-public data in the Hadoop cluster.