
Senior Data Engineer with over 10 years of experience in banking, finance, and telecom for Fortune 500 clients, with certifications in AWS and Databricks. I leverage this experience to design, build, and implement scalable analytics solutions, data engineering workflows, and complex data pipelines on the cloud, and I work with cross-functional teams to deliver high-quality data products and services for the banking and finance sector using technologies such as DevOps tooling, Confluent Kafka, Spark, and ML technologies including OpenAI.
Extensive professional experience in software architecture, design, development, and integration.
Design, build, and implementation of scalable analytics solutions with data engineering workflows and complex data pipelines, both on-premise and on AWS Cloud.
Developing large-scale distributed applications using Hadoop (HDP and CDH), MapReduce, Hive, Spark, Spark Streaming, and Kafka.
Building an enterprise cloud data platform on AWS.
Creating enterprise Data Lake and modern Data Warehouse capabilities and patterns.
Creating DevOps CI/CD pipelines using Git, Jenkins, Docker, and Kubernetes.
AWS Data Engineer | Synechron Technologies
Lead Data Engineer | Synechron Technologies
Module Lead | National Stock Exchange Information Technology (NSE-IT)
Senior Big Data Developer | Collabera Technologies
Software Engineer | NatWest Group (Royal Bank of Scotland)
Lead AWS Data Engineer | NatWest Group (Royal Bank of Scotland)
Senior Data Engineer | Wipro Technologies
Specialist | Wipro Technologies
Senior Data Engineer | Ericsson Global India Services Pvt. Ltd.
Freelancer | IIT Bombay
Associate Software Engineer | Gopisoft Pvt. Ltd.
Assistant Engineer | Ericsson Global India Services Pvt. Ltd.
Big Data & Streaming: HDFS, Sqoop, Hive, Impala, Oozie, Spark, Kafka, Airflow, Cloudera, Databricks, AWS EMR
Cloud Platforms: AWS, Azure, GCP
Databases: Oracle, MySQL, Teradata, SQL Server, PostgreSQL
DevOps & Build: Maven, Gradle, Git, Bitbucket, Jira, Jenkins, Docker, Kubernetes, Terraform
Integration: REST API
IDEs & Notebooks: Eclipse, IBM RAD, PyCharm, IntelliJ IDEA, Jupyter Notebook
Python Libraries: Scikit-learn, NumPy, Pandas, SciPy, Pyplot, Beautiful Soup, Matplotlib, Seaborn
Project Description: New data lake using the AWS environment
Tech Stack: Hive, Spark, Python, AWS (EC2, S3, RDS, Redshift, SNS, EMR, Glue, DMS), Airflow, Terraform
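For illustration only, a minimal Airflow DAG along the lines of this project could look like the sketch below; the bucket, Glue job, and DAG names are placeholders rather than the actual project configuration, and it assumes the Amazon provider package for Airflow.

# Hypothetical Airflow DAG: wait for raw files in S3, then run an AWS Glue job.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="datalake_daily_ingest",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_raw = S3KeySensor(
        task_id="wait_for_raw_file",
        bucket_name="example-raw-bucket",       # placeholder bucket
        bucket_key="incoming/{{ ds }}/*.csv",
        wildcard_match=True,
    )

    raw_to_curated = GlueJobOperator(
        task_id="raw_to_curated",
        job_name="raw_to_curated",              # placeholder Glue job name
        region_name="us-east-1",
    )

    wait_for_raw >> raw_to_curated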
Project 1: Customer Care IVR Automation
Description: Automate the current IVR process to remove dependency on customer care executives
Tech Stack: Confluent Kafka, Spark, Python, Machine Learning, IBM RTC, Jenkins, REST API, Web Services, HSM
Project 2: Data Warehouse Migration
Description: Create a data lake that serves as the single, comprehensive source of information to improve decision-making
Tech Stack: Sqoop, Hive, Spark, Python, AWS (EC2, EMR)
Project 1: Charging System Tariff Plans Regression | U Mobile, Malaysia
Description: Constant tariff updates create a persistent need to track their impact on ROI, which determines end-user satisfaction.
Tech Stack: Sqoop, Hive, Spark, Python
Project 2: New Data Lake for LTE Network
Description: Create a new data lake that serves as the single, comprehensive source of information for critical KPIs
Project 3: Portal Development
Tech Stack: Core Java, JSP, Servlets, MySQL
Project 1: Scilab Textbook Companion | IIT Bombay | Under Ministry of HRD Project
Project 2: Spoken Tutorials | IIT Bombay | Under Ministry of HRD Project
I have 10 years of work experience. I started my career as a Java developer for the initial 2 years, and for roughly the past 8 years I have been working as a data engineer. In terms of skill set, I have worked on Hive and PySpark, and on Kafka for real-time use cases. Other tools I have worked with over the past few years are Databricks and orchestration tooling. On the DevOps side, I am also involved in creating CI/CD pipelines and provisioning new infrastructure through Terraform. That is my overall experience.
If I need to optimize a PySpark job that joins two large datasets, I will first analyze the data and set the partitioning key so that no major shuffling happens. With that partitioning key in place, I will check whether there is any data skew. If skew is still present after setting the required partitioning key, I will add a salting key to distribute the data uniformly, so that when I join the two large datasets the job does not get stuck. Since I am working with large datasets, I will also choose a memory-optimized cluster instead of a compute-optimized one. In addition, I will tune the relevant Spark properties at submission time, and then write the output to Snowflake.
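As a rough sketch of the salting idea described here (the paths, table names, column names, and salt count are illustrative assumptions, not from an actual project):

# Salt the join key on the large, skewed side so hot key values spread
# across many partitions, and replicate the other side once per salt value.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skewed-join-demo").getOrCreate()

SALT_BUCKETS = 16                                               # tuning knob

big = spark.read.parquet("s3://example-bucket/transactions/")   # placeholder path
dim = spark.read.parquet("s3://example-bucket/accounts/")       # placeholder path

# Random salt on the skewed side ...
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ... and the other side exploded once per salt value so every pair still matches.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
dim_salted = dim.crossJoin(salts)

joined = big_salted.join(dim_salted, on=["account_id", "salt"], how="inner")
joined.write.mode("overwrite").parquet("s3://example-bucket/curated/joined/")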
The basic process of tuning a PySpark application: first, we can increase or decrease the number of shuffle partitions, which defaults to 200. Second, if one side of a join holds a small dataset, we can broadcast that smaller DataFrame to the bigger one. Third, if we are joining big tables, we need to check whether the partitioning key we set leaves any data skew; if skew is still there, we can use a salting technique so the application is optimized accordingly. Beyond that, we need to choose the Spark properties according to our requirements.
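A minimal sketch of the first two levers mentioned above, adjusting the shuffle partition count and broadcasting the smaller side of a join; the paths and column names are made up for illustration:

# Tune shuffle parallelism and broadcast the small lookup table in a join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Raise (or lower) the shuffle partition count to match the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "400")

orders = spark.read.parquet("s3://example-bucket/orders/")         # large table
countries = spark.read.parquet("s3://example-bucket/countries/")   # small lookup

# Broadcasting the small side avoids a full shuffle of the large table.
enriched = orders.join(broadcast(countries), on="country_code", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/orders_enriched/")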
I have not worked on Snowflake.
For monitoring, whenever we hit a failure condition and we are using AWS services, we can use SNS so that the failure raises an alert to our email. That way we do not need to watch the job constantly; we just monitor our inbox for failure emails and rectify those errors. That is how we can monitor ETL job failures and delays. We can also add a delay condition in the program itself, so that if the job exceeds a given time it sends an alert as well.
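One possible shape for this, assuming an SNS topic with an email subscription already exists; the topic ARN and job name below are placeholders:

# Publish a failure alert to SNS so the team is notified by email.
import boto3

sns = boto3.client("sns", region_name="us-east-1")

def notify_failure(job_name: str, error: Exception) -> None:
    """Send a failure alert so nobody has to watch the job manually."""
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:etl-alerts",  # placeholder ARN
        Subject=f"ETL failure: {job_name}",
        Message=f"Job {job_name} failed with error: {error}",
    )

def run_job() -> None:
    raise RuntimeError("simulated failure")      # stand-in for the real ETL step

if __name__ == "__main__":
    try:
        run_job()
    except Exception as exc:                     # alert on any failure, then re-raise
        notify_failure("daily_load", exc)
        raise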
To build a serverless data processing pipeline, as soon as the data lands in the S3 bucket we can put a trigger on it that invokes a Lambda function. The Lambda can do some preprocessing or derive parameters, and then, if we are using AWS services, trigger a Glue job, which launches another serverless step for the processing. In this way we can design a serverless architecture using Lambda and Glue, and, if needed, DynamoDB for the metadata information.
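A hypothetical Lambda handler for this flow might look like the following, assuming the S3 bucket is configured to send object-created events to the function; the Glue job name and DynamoDB table are placeholders:

# S3 event -> Lambda -> start a Glue job, recording basic run metadata in DynamoDB.
from urllib.parse import unquote_plus

import boto3

glue = boto3.client("glue")
dynamodb = boto3.resource("dynamodb")
metadata_table = dynamodb.Table("pipeline_runs")        # placeholder table

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])   # keys arrive URL-encoded

        # Kick off the (placeholder) Glue job for the newly landed object.
        run = glue.start_job_run(
            JobName="process_landed_file",
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )

        # Keep a small audit trail of what was triggered.
        metadata_table.put_item(
            Item={"object_key": key, "glue_run_id": run["JobRunId"]}
        )
    return {"status": "triggered"}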
To build a data quality framework, we can create unit test cases using Python and PySpark and run the unit testing locally, so that we can prepare test cases for the expected data and check for any data quality issues on our local environment itself.
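A small example of what such locally runnable checks could look like with pytest and a local SparkSession; the sample data and rules are illustrative:

# Local data-quality checks: null keys and key uniqueness on a tiny sample.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("dq-tests").getOrCreate()

@pytest.fixture
def customers(spark):
    # Small hand-built sample standing in for a real extract.
    return spark.createDataFrame(
        [(1, "alice@example.com"), (2, "bob@example.com")],
        ["customer_id", "email"],
    )

def test_no_null_keys(customers):
    assert customers.filter(customers.customer_id.isNull()).count() == 0

def test_keys_are_unique(customers):
    assert customers.count() == customers.select("customer_id").distinct().count()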
If we need to automate the backup process for a NoSQL database, one example is DynamoDB, which is a managed NoSQL database on AWS. There is an option where AWS itself takes the backup, so it is taken care of automatically. Otherwise, we can do the backup manually as well: at the end of each day we create a backup, possibly in a different region, so that we can still access the data in case of a disaster.
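A sketch of both options for DynamoDB with boto3, using a placeholder table name: enabling point-in-time recovery (managed by AWS) and taking an explicit end-of-day backup:

# Turn on continuous backups and take a named on-demand backup for a table.
import datetime
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

TABLE = "orders"                                    # placeholder table name

# Continuous, AWS-managed backups (point-in-time recovery).
dynamodb.update_continuous_backups(
    TableName=TABLE,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# An explicit end-of-day backup, e.g. scheduled from a small cron job or Lambda.
stamp = datetime.date.today().isoformat()
dynamodb.create_backup(TableName=TABLE, BackupName=f"{TABLE}-{stamp}")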
The strategy is basically to migrate the existing Python ETL and add multithreading capability through Python. First, we create a generalized framework so that any ETL pipeline can run through it, and whatever tables it processes run multithreaded, i.e. in parallel. By creating a generalized framework and running it with multithreading, we get enhanced parallel processing capability.
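A bare-bones version of that idea using Python's ThreadPoolExecutor; the table list and load function are placeholders for the real ETL logic:

# Run one load function per table in parallel with a thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

TABLES = ["customers", "accounts", "transactions"]     # illustrative table list

def load_table(table: str) -> str:
    # Placeholder for the real extract/transform/load of a single table.
    print(f"loading {table}")
    return table

def run_parallel(tables, max_workers: int = 4) -> None:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                future.result()
                print(f"{table}: done")
            except Exception as exc:
                print(f"{table}: failed with {exc}")

if __name__ == "__main__":
    run_parallel(TABLES)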
If we need to process the data in real time, we can either use Lambda for some preprocessing steps or ingest the data using Kinesis. Within the Kinesis service, I think we can also process the data itself, so in that case we do not even need to go to Lambda; both the ingestion and the processing can happen within Kinesis.
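For example, a Lambda function subscribed to a Kinesis stream could process records as they arrive; the payload fields and the "processing" rule below are assumptions for illustration:

# Lambda consumer for a Kinesis stream: decode each record and apply a simple rule.
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Placeholder processing step: flag large transactions.
        if payload.get("amount", 0) > 10_000:
            print(f"ALERT: large transaction {payload}")
    return {"records_processed": len(event["Records"])}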
To design for high availability and disaster recovery, we can create a Multi-AZ architecture, so that the system runs across multiple Availability Zones and we get the high availability option.
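As one concrete, purely illustrative piece of such a setup, an RDS instance can be provisioned with a Multi-AZ standby via boto3; all identifiers below are placeholders:

# Provision an RDS instance with a synchronous standby in a second Availability Zone.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="core-banking-db",        # placeholder name
    DBInstanceClass="db.m5.large",
    Engine="postgres",
    MasterUsername="admin_user",
    MasterUserPassword="change-me-please",         # use Secrets Manager in practice
    AllocatedStorage=100,
    MultiAZ=True,                                  # standby replica in another AZ
    BackupRetentionPeriod=7,                       # automated backups for DR
)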