Monday, February 17, 2025

What are the steps involved in migrating an application to the cloud?

Below are some of the steps to take care of while performing any application or database migration to the cloud.

  1. Identify Application and Database - Architecture and Flow

    • Check all the on-prem application functionalities

    • Focus mainly on how the application services work together seamlessly

    • Write down the requirement specification to enable decision making by the application architect/cloud architect

  2. Identify Application and Database – Coding Complexity

    • Check whether the application is written in an old legacy language, e.g. COBOL, Fortran, C, C++, etc.

    • Check the code coverage, the code availability, and the availability of people with expertise in the code

    • Consider the complexity of the application code

    • Consider how data communication happens in the current application

    • Consider which security features are covered in the current application

  3. Discussion – Tech Team / Cloud Vendor

    • Discussion with Tech Team on Application Design and Operations

    • Draw the flow charts and application diagrams accordingly

    • Define a timeline for a pilot development and deployment

    • Prepare a checklist for the application and database

    • Identify which cloud vendors, e.g. AWS, Azure, GCP, OCI, etc., to approach first

    • Kick off meetings with the cloud vendors and choose which one to go with

  4. Discussion – Tech Team/Business Team

    • Meet with Application/Tech Team management to finalize the further approach for cloud migration

  5. Finalize – Tech Stack and Operating Model in Cloud

    • Check and compare the cloud vendors' tech solutions together with the Application and Business Teams, across data, infra, reporting, and other aspects.

    • Consider tech solutions based on your application's scalability and complexity and on the budget to migrate and operate

    • Consider current tech solutions that also provide SaaS services in the cloud and identify their operating/charging model

  6. Finalize – Cloud Vendor and Charging Model

    • Meet with the cloud vendors, e.g. AWS, Azure, GCP, OCI, etc.

    • Discuss the current application checklist, application flow requirements, budgeting, and all other details

    • Ask the cloud vendors for their tech solutions, use cases, and service model

    • Verify whether all aspects are fulfilled by the cloud vendors' tech solutions or not

    • Identify the cloud vendors' charging models (long-term commitment, pay-per-use, or any other) and carefully understand the charging pattern

    • Ideally, choose a pay-per-usage resource charging model

  7. Empower – Tech Team / Migration Team

    • Identify or gather tech talent based on the cloud vendor's tech solutions and form a migration team responsible for the migration to the cloud

    • Provide training to in-house teams on the chosen tech solutions

    • Partner with the cloud vendor on the training services they offer to upskill in-house teams

    • The migration team should have more than two architects (cloud/on-prem)

    • Provision a migration team lead to supervise and guide the team on migration activities

  8. Standards – Migration/Cloud

    • Validate the earlier prepared application and database checklist against the cloud

    • Choose an Agile/Scrum/Kanban model to track the migration process

    • In my view, a big-bang approach is not suitable for migration

    • Divide the application based on most-used to least-used functionalities, if possible

    • Leverage an IaC tool, e.g. Terraform, OpenTofu, Pulumi, etc. (see the Pulumi sketch after this list)

    • Consider a hybrid approach for network connectivity, DNS, load balancing, etc.

    • Define security/Identity and Access Management (IAM) configs and group/pool-level access in the cloud

  9. Execution – Pilot / POC

    • Have the migration team perform a pilot/POC with the cloud vendor using the chosen IaC tool

    • Test at least 2-3 application flows

  10. Standards – Define Strategies

    • The migration team is responsible for defining enterprise-wide standards for application/database migration, along with the environments to create in the cloud

    • Define strategies for load balancing, automated stress testing, and high availability (HA) of the application/database in the cloud

  11. Migrate – Business/Application/Tech/Migration Team

    • Obtain management approvals upon successful execution of the application pilot/POC

    • Migrate the remaining workloads to the cloud

    • Create and validate DEV first, then QA and TEST; PROD comes last, with no manual changes allowed to cloud resources

  12. Perform – BACKUP/DR

    • Consider backup and DR for application and database workloads as well

    • Define storage retention policies in the cloud

    • Test backup and DR in the cloud at least once every 4-5 months, i.e. 2-3 times a year

  13. Optimize – application/database

    • Implement performance optimization techniques such as caching, in-memory storage, etc. at the database/application level (see the caching sketch after this list)

    • Increase network bandwidth, machine size, disks, if required

    • Add more metrics to monitor performance
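
To make the caching point in step 13 a little more concrete, here is a minimal Python sketch using functools.lru_cache to memoize an expensive lookup at the application level. The function name and the returned data are placeholders, not part of any specific application.

from functools import lru_cache

@lru_cache(maxsize=1024)
def get_customer_profile(customer_id: int) -> dict:
    # Placeholder for an expensive database or API call; in a real application
    # this would hit the primary database, and the cache serves repeats from memory.
    return {"id": customer_id, "tier": "standard"}

profile = get_customer_profile(42)        # first call misses the cache and runs the lookup
profile = get_customer_profile(42)        # second call is served from memory
print(get_customer_profile.cache_info())  # CacheInfo(hits=1, misses=1, ...)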
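
And for the IaC point in step 8, below is a minimal sketch of a Pulumi program in Python that declares a single GCP storage bucket as code. The resource name, location, and the choice of GCP/Pulumi are illustrative assumptions; Terraform or OpenTofu would express the same idea in HCL.

import pulumi
import pulumi_gcp as gcp

# A bucket declared as code: `pulumi up` creates or updates it, so the same
# definition can be reviewed, versioned, and repeated per environment.
backup_bucket = gcp.storage.Bucket(
    "app-backup-bucket",   # logical resource name (illustrative)
    location="US",         # assumed multi-region
    force_destroy=False,   # refuse to delete the bucket while it still holds objects
)

# Export the generated bucket name so other stacks or scripts can reference it.
pulumi.export("backup_bucket_name", backup_bucket.name)

Because the whole definition lives in source control, the same program can be promoted unchanged from DEV to QA to PROD, which supports the "no manual changes in cloud resources" rule in step 11.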

Sunday, February 2, 2025

What are the application deployment strategies?

 1. Recreate Deployment

“Version 1 is taken down, and version 2 is rolled out in its place.”

The recreate deployment strategy terminates all the pods/application instances and replaces them with the new version. This can be useful in situations where the old and new versions of the application cannot run at the same time.

 

Users experience some downtime because the application must be stopped before the new version is running. The application state is entirely renewed since the instances are completely replaced. If we do decide to roll back, it may cause even more downtime.

 

2. Rolling Deployment (also called ramped, incremental update, rolling update, or rolling upgrade)

“Version 2 is slowly rolled out in phases, replacing version 1.”

The rolling deployment strategy involves updating the software version in a phased manner, typically by deploying to a subset of servers at a time. This allows for incremental updates and minimizes the impact of any potential issues.

 

If a problem is detected, it can be addressed before moving on to the next set of servers. This strategy ensures a smoother deployment process and reduces the risk of downtime. After all instances have been replaced, the older version is shut down by the Ops/Dev team, and the new version takes charge of all production traffic.

 

Rolling deployments are the default strategy in Kubernetes (k8s), designed to reduce downtime: pods running the old version of the application are replaced with pods running the new version without taking the whole application down.
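
As a toy illustration of that phased replacement (not the actual Kubernetes controller), here is a self-contained Python sketch where instances are updated in small batches and a health check between batches can stop the rollout early. The fleet, batch size, and health check are invented for the example.

def rolling_update(instances, new_version, batch_size=2, is_healthy=lambda inst: True):
    """Replace instances with new_version in batches, aborting on an unhealthy batch."""
    updated = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for inst in batch:
            inst["version"] = new_version     # stand-in for "redeploy this instance"
        if not all(is_healthy(inst) for inst in batch):
            return updated, False             # stop the rollout; the rest keep the old version
        updated.extend(batch)
    return updated, True

fleet = [{"name": f"web-{i}", "version": "v1"} for i in range(6)]
done, ok = rolling_update(fleet, "v2", batch_size=2)
print(ok, [inst["version"] for inst in fleet])  # True and all 'v2' if every batch stayed healthy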

 

3. Shadow Deployment

“Version 2 receives real-world traffic alongside version 1 but does not affect the responses returned to users.”

In this deployment strategy, developers deploy the new version alongside the old version. However, users cannot access the new version immediately; the latest version hides in the shadows. To test how the new version will handle requests when live, developers fork (copy) the traffic hitting the old version and send it to the shadow version as well.

 

As a result, the Ops/Dev team should be careful to ensure that the forked traffic does not create duplicate live requests, since two versions of the same system are running simultaneously.

Shadow deployment allows engineers to monitor system performance and conduct stability tests, but it is expensive and complex to set up and can create serious problems.
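
A minimal Python sketch of the mirroring idea, with both services reduced to placeholder functions: every request is answered by the live version, a copy is forked to the shadow version, and the shadow response is only recorded, never returned to the user.

def handle_live(request):
    return {"status": 200, "body": "live response"}      # placeholder for version 1

def handle_shadow(request):
    return {"status": 200, "body": "shadow response"}    # placeholder for version 2

shadow_log = []

def serve(request):
    live_response = handle_live(request)                 # only this response reaches the user
    try:
        shadow_response = handle_shadow(dict(request))   # fork a copy of the request
        shadow_log.append((request, shadow_response))    # compare/monitor offline
    except Exception as err:                             # shadow failures must never affect users
        shadow_log.append((request, {"error": str(err)}))
    return live_response

print(serve({"path": "/checkout", "user": "u1"}))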

 

4. A/B Testing

“Version 2 is released to a specific subset of users under specific conditions to test and validate.”

A/B testing is a deployment strategy that involves deploying two different versions of the application simultaneously to different user groups. The new version is only available to a limited number of users, who are selected according to certain conditions and configuration parameters. UI/UX settings, request type, location, device type, and operating system can all serve as parameters for selecting these users.

 

By measuring the performance and user feedback of each version, Ops/Dev teams can gather valuable data to support management decision making. The use of real-time statistical data can help developers make informed decisions about their application. However, A/B testing is difficult to set up and requires a high-end load balancer and robust infrastructure.

 

This strategy allows for data-driven optimization and helps ensure that only the most effective features or changes are rolled out to all users.
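
One common way to pick that subset is deterministic hashing of a stable user attribute, so the same user always sees the same variant. The sketch below is an illustrative Python example; the 10% split and the mobile-only condition are assumptions, not a recommendation.

import hashlib

def assign_variant(user_id: str, device_type: str, rollout_percent: int = 10) -> str:
    """Deterministically assign a user to variant 'A' (current) or 'B' (new version)."""
    if device_type != "mobile":          # example condition: only mobile users are eligible
        return "A"
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100       # stable bucket in [0, 100) per user
    return "B" if bucket < rollout_percent else "A"

for uid in ["alice", "bob", "carol"]:
    print(uid, assign_variant(uid, "mobile"))   # the same user always gets the same variant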

 

5. Blue-Green / Red-Black Deployment

“Version 2 is released alongside version 1; after testing passes, traffic is switched to version 2.”

The Blue-Green / Red-Black deployment strategy involves maintaining two identical production environments, one currently running (blue/red) and one inactive (green/black). Stable or older versions of the application are always blue or red, while newer versions are green or black.

 

When a new version of the application is ready, it is deployed to the green/black environment. Once it has been tested and verified there, the load balancer switches traffic from the blue/red environment to the green/black one, making a smooth transition, minimizing downtime, and ensuring a seamless user experience.

 

While this strategy offers a quick update or rollout of a new application version, it has the disadvantage of being expensive, since both the new and old versions must run simultaneously. Dev/Ops teams mostly use this method in mobile/web app development and deployment.
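
The cutover itself can be reduced to flipping a single routing pointer once the idle environment passes its checks. Here is a small illustrative Python sketch; the environment URLs and the health check are placeholders.

environments = {
    "blue": "https://blue.example.internal",    # currently serving production (placeholder URL)
    "green": "https://green.example.internal",  # new version, idle until the switch (placeholder URL)
}
active = "blue"

def green_is_healthy() -> bool:
    return True  # stand-in for smoke tests / health probes against the green environment

def switch_traffic():
    global active
    if green_is_healthy():
        active = "green"            # atomic pointer flip; blue stays up for instant rollback
    return environments[active]

print(switch_traffic())             # traffic now routed to the green environment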

 

6. Canary Deployment

“Version 2 is released to a specific subset of users, then proceeds to a full rollout on success.”

The canary deployment strategy involves deploying a new version of the application to a small subset of users or servers, known as the canary group. The team responsible for deployment gradually redirects traffic from the older version to the new one; for instance, at a certain stage in the process, 90% of production traffic may still go through the older version while only 10% goes through the newer one.

 

This allows for early testing and feedback. If the new version performs well, it can gradually be rolled out to the rest of the users or servers, reducing the risk of having to roll back.

 

The canary deployment strategy reduces the risk of widespread issues and allows for rapid rollbacks if necessary. It enables better performance monitoring, faster and easier application rollbacks, and facilitates automation in the deployment pipeline. However, it has a slow deployment cycle, requires more time, and demands a robust infrastructure.

This approach allows Dev/Ops Teams to evaluate the stability of the new version by using live traffic from a subset of end-users at varying levels throughout production.
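
Here is an illustrative Python sketch of the ramp described above: the canary's share of traffic grows in steps, and the rollout only advances while an (assumed) error-rate metric stays under a threshold. The ramp steps and threshold are invented for the example.

import random

RAMP_STEPS = [5, 10, 25, 50, 100]      # percent of traffic sent to the canary at each stage
ERROR_THRESHOLD = 0.01                 # abort the rollout if the canary error rate exceeds 1%

def route(canary_percent: int) -> str:
    """Send roughly canary_percent of requests to the new version."""
    return "canary" if random.uniform(0, 100) < canary_percent else "stable"

def canary_error_rate() -> float:
    return 0.0                         # stand-in for metrics pulled from monitoring

for percent in RAMP_STEPS:
    if canary_error_rate() > ERROR_THRESHOLD:
        print("rollback: canary unhealthy at", percent, "percent")
        break
    print(f"routing ~{percent}% of traffic to the canary")
    sample = [route(percent) for _ in range(1000)]
    print("  canary share in sample:", sample.count("canary") / len(sample))
else:
    print("canary promoted to 100% of traffic")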

Saturday, November 27, 2021

Difference between Dataproc, Dataflow, Dataprep and Datafusion.

Dataproc

  • It supports manual provisioning of clusters.

  • If your systems are Hadoop dependent, Dataproc is a good fit: it was created as an extension service for Hadoop, and real-time data collection with the Hadoop and Spark integration feature is more prominent in it.

  • If one prefers a hands-on DevOps approach, choose Dataproc.

  • Simple and easy to use.

  • Used for the Data Science and ML ecosystem.

Dataflow

  • It supports automatic provisioning of clusters.

  • It is based on Apache Beam and is used for data lake ingestion, data cleaning, and processing workloads in parallel; it mainly merges the programming and execution models.

  • If you prefer a serverless, fully managed, no-ops approach, select Dataflow.

  • Relatively difficult to use.

  • Used for batch and streaming data processing.

Dataprep

  • It helps to prepare and clean the data for future use.

  • It is mainly seen as a data processing tool, with visual analytics as its plus point; it is mainly used with Bigtable and BigQuery. If you are only looking to find anomalies or redundancy in the data, Dataprep is a good choice.

  • On the other hand, it is UI-driven, fully managed, and follows a no-ops approach.

  • Easy to use.

  • Used for UI-driven data processing with multiple source data integrations.

DataFusion

  • It supports major Hadoop distributions (MapR, Hortonworks) and clouds (AWS, GCP, Azure) to build pipelines.

  • It is based on CDAP, an open-source pipeline development tool, and offers a visual tool to build ETL/ELT pipelines.

  • In GCP it uses a Cloud Dataproc cluster to perform jobs and comes with multiple prebuilt connectors to connect sources to sinks.

  • Very easy to use.

  • It gives you codeless pipeline development, and its enterprise readiness makes data lineage and metadata management much easier.
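
Since Dataflow runs Apache Beam pipelines, a minimal Beam word-count in Python gives a feel for the programming model. The GCS paths below are placeholders; the same pipeline runs locally with the default runner, or on Dataflow by passing --runner=DataflowRunner along with project, region, and temp_location options.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # With no runner specified this executes locally; Dataflow options come from the command line.
    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")            # placeholder path
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Pair" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcount")     # placeholder path
        )

if __name__ == "__main__":
    run()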

Tuesday, August 3, 2021

Difference between Structured, Semi-Structured and Unstructured data

Basic

  • Structured data: Data whose elements are addressable for effective analysis and organized into formatted tables, schemas, or a repository, typically a database.

  • Semi-structured data: Data that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database.

  • Unstructured data: Data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database; there are alternative platforms for storing and managing it.

Databases

  • Structured data: RDBMS like Oracle, MySQL, PostgreSQL. Commonly stored in data warehouses.

  • Semi-structured data: Non-RDBMS / NoSQL databases like MongoDB, DynamoDB, Riak, Redis, etc. Follows the Hadoop methodology. Commonly stored in data lakes and data marts.

  • Unstructured data: NoSQL databases like MongoDB, Cassandra, HBase, CouchDB, DynamoDB, Riak, Redis, etc. Stores character and binary data such as pictures, audio, video, PDFs, log files, satellite images, scientific images, radar data, etc. Commonly stored in data lakes and data marts.

Scalability

  • Structured data: Very difficult to scale the DB schema; horizontal and vertical scaling can be applied.

  • Semi-structured data: Scaling is simpler than for structured data.

  • Unstructured data: More scalable.

Transactions

  • Structured data: Mature transactions and various concurrency techniques; supports ACID.

  • Semi-structured data: Transactions are adapted from the DBMS world but are not mature.

  • Unstructured data: No transaction management and no concurrency.

Flexibility

  • Structured data: Schema dependent and less flexible; data has a predefined format; schema on write.

  • Semi-structured data: More flexible than structured data but less flexible than unstructured data; a variety of data shapes and sizes; schema on read.

  • Unstructured data: Most flexible, with an absence of schema; a variety of data shapes and sizes; schema on read.

Query performance

  • Structured data: Structured queries allow complex joins.

  • Semi-structured data: Queries over anonymous nodes are possible.

  • Unstructured data: Only textual queries are possible.

Version management

  • Structured data: Versioning over tuples, rows, and tables.

  • Semi-structured data: Versioning over tuples or graphs is possible.

  • Unstructured data: Versioned as a whole.

Robustness

  • Structured data: Very robust.

  • Semi-structured data: Newer technology, not very widespread.

  • Unstructured data: Newer technology, not very widespread.
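
As a small illustration of the schema-on-write versus schema-on-read rows above, the Python snippet below contrasts a fixed-column row with JSON documents whose fields are interpreted only when read. The records are invented examples.

import json

# Structured: a fixed schema is enforced when data is written (schema on write).
columns = ("id", "name", "email")
row = (1, "Asha", "asha@example.com")          # must match the column list exactly
print(dict(zip(columns, row)))

# Semi-structured: each JSON document carries its own shape (schema on read).
docs = [
    '{"id": 2, "name": "Ravi", "tags": ["gcp", "bigquery"]}',
    '{"id": 3, "name": "Mina", "email": "mina@example.com"}',
]
for raw in docs:
    record = json.loads(raw)
    # Fields are interpreted at read time; missing ones are handled by the reader.
    print(record["name"], record.get("email", "no email"), record.get("tags", []))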

Thursday, June 17, 2021

Compare star schema and snowflake schema in data warehouse modeling

Star Schema vs Snowflake Schema

  • Star schema has a simple database design; snowflake schema has a very complex database design.

  • Star schema uses a de-normalized data structure; snowflake schema uses a normalized data structure.

  • Queries run faster against a star schema; they run slower against a snowflake schema in comparison.

  • A star schema contains one fact table surrounded by dimension tables; in a snowflake schema, the fact table is surrounded by dimension tables which are in turn surrounded by further dimension tables.

  • A star schema is easy to understand and provides optimal disk usage; a snowflake schema uses smaller disk space.

  • In a star schema, only a single join creates the relationship between the fact table and any dimension table; a snowflake schema requires many joins to fetch the data.

  • In a star schema, a single dimension table contains aggregated data; in a snowflake schema, data is split into different dimension tables.

  • A star schema has a high level of data redundancy; a snowflake schema has very low data redundancy.

  • Cube processing is faster with a star schema; it might be slow with a snowflake schema because of the complex joins.

  • In a star schema, hierarchies for the dimensions are stored in the dimension table; in a snowflake schema, hierarchies are divided into separate tables.

  • A star schema offers higher-performing queries using star-join query optimization, and tables may be connected with multiple dimensions; a snowflake schema is represented by a centralized fact table which is unlikely to be connected with multiple dimensions.

Sunday, May 30, 2021

Standard SQL vs Legacy SQL - functions

Standard SQL

#standardSQL
SELECT
  repository.url
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT 5;

Legacy SQL

#legacySQL
SELECT
  repository.url
FROM
  [bigquery-public-data.samples.github_nested]
LIMIT 5;

Function equivalents (Standard SQL ↔ Legacy SQL):

Numeric

  • SAFE_CAST(x AS INT64) ↔ INTEGER(x)
  • SAFE_CAST(x AS INT64) ↔ CAST(x AS INTEGER)
  • APPROX_COUNT_DISTINCT(x) ↔ COUNT(DISTINCT x)
  • COUNT(DISTINCT x) ↔ EXACT_COUNT_DISTINCT(x)
  • APPROX_QUANTILES(x, buckets) ↔ QUANTILES(x, buckets + 1)
  • APPROX_TOP_COUNT(x, num) ↔ TOP(x, num), COUNT(*)
  • MOD(x, y) ↔ x % y

Datetime

  • TIMESTAMP_DIFF(t1, t2, DAY) ↔ DATEDIFF(t1, t2)
  • CURRENT_TIMESTAMP ↔ NOW
  • FORMAT_TIMESTAMP(fmt, t) ↔ STRFTIME_UTC_USEC(t, fmt)
  • TIMESTAMP_TRUNC(t, DAY) ↔ UTC_USEC_TO_DAY(t)
  • REGEXP_CONTAINS(s, pattern) ↔ REGEXP_MATCH(s, pattern)
  • x IS NULL ↔ IS_NULL(x)

Strings

  • SAFE_CAST(x AS STRING) ↔ STRING(x)
  • SAFE_CAST(x AS STRING) ↔ CAST(x AS STRING)
  • SUBSTR(s, 0, len) ↔ LEFT(s, len)
  • SUBSTR(s, -len) ↔ RIGHT(s, len)
  • STRPOS(s, "abc") > 0 or s LIKE '%abc%' ↔ s CONTAINS "abc"
  • STRING_AGG(s, sep) ↔ GROUP_CONCAT_UNQUOTED(s, sep)
  • IFNULL(LOGICAL_OR(x), false) ↔ SOME(x)
  • IFNULL(LOGICAL_AND(x), true) ↔ EVERY(x)

Arrays

  • ARRAY_AGG(x) ↔ NEST(x)
  • ANY_VALUE(x) ↔ ANY(x)
  • arr[SAFE_ORDINAL(index)] ↔ NTH(index, arr) WITHIN RECORD
  • ARRAY_LENGTH(arr) ↔ COUNT(arr) WITHIN RECORD

Url / IP Address Functions

  • NET.HOST(url) ↔ HOST(url)
  • NET.PUBLIC_SUFFIX(url) ↔ TLD(url)
  • NET.REG_DOMAIN(url) ↔ DOMAIN(url)
  • NET.IPV4_TO_INT64(NET.IP_FROM_STRING(addr_string)) ↔ PARSE_IP(addr_string)
  • NET.IP_TO_STRING(NET.IPV4_FROM_INT64(addr_int64 & 0xFFFFFFFF)) ↔ FORMAT_IP(addr_int64)
  • NET.IP_FROM_STRING(addr_string) ↔ PARSE_PACKED_IP(addr_string)
  • NET.IP_TO_STRING(addr_bytes) ↔ FORMAT_PACKED_IP(addr_bytes)