
Senior Data Engineer with over 10 years of experience in banking, finance, and telecom for Fortune 500 clients, with certifications in AWS and Databricks. I leverage this experience to design, build, and implement scalable analytics solutions, data engineering workflows, and complex data pipelines on the cloud, and I work with cross-functional teams to deliver high-quality data products and services for the banking and finance sector using technologies such as DevOps tooling, Confluent Kafka, Spark, and ML technologies including OpenAI.
Extensive professional experience in software architecture, design, development, and integration.
Design, build, and implementation of scalable analytics solutions with data engineering workflows and complex data pipelines, both on-premise and on AWS Cloud.
Developing large-scale distributed applications using Hadoop (HDP and CDH), MapReduce, Hive, Spark, Spark Streaming, and Kafka.
Building an enterprise cloud data platform on AWS.
Creating enterprise Data Lake and modern Data Warehouse capabilities and patterns.
Creating DevOps CI/CD pipelines using Git, Jenkins, Docker, and Kubernetes.
AWS Data Engineer | Synechron Technologies
Lead Data Engineer | Synechron Technologies
Module Lead | National Stock Exchange Information Technology (NSE-IT)
Senior Big Data Developer | Collabera Technologies
Software Engineer | NatWest Group (Royal Bank of Scotland)
Lead AWS Data Engineer | NatWest Group (Royal Bank of Scotland)
Senior Data Engineer | Wipro Technologies
Specialist | Wipro Technologies
Senior Data Engineer | Ericsson Global India Services Pvt. Ltd.
Freelancer | IIT Bombay
Associate Software Engineer | Gopisoft Pvt. Ltd.
Assistant Engineer | Ericsson Global India Services Pvt. Ltd.
Big Data & Streaming: HDFS, Sqoop, Hive, Impala, Oozie, Spark, Kafka, Airflow, Cloudera, Databricks, AWS EMR
Cloud Platforms: AWS, Azure, GCP
Databases: Oracle, MySQL, Teradata, SQL Server, PostgreSQL
DevOps & Build: Maven, Gradle, Git, Bitbucket, Jira, Jenkins, Docker, Kubernetes, Terraform
Integration: REST API
IDEs & Notebooks: Eclipse, IBM RAD, PyCharm, IntelliJ IDEA, Jupyter Notebook
Python Libraries: Scikit-learn, NumPy, Pandas, SciPy, Pyplot, Beautiful Soup, Matplotlib, Seaborn
Project Description: New data lake using the AWS environment
Tech Stack: Hive, Spark, Python, AWS (EC2, S3, RDS, Redshift, SNS, EMR, Glue, DMS), Airflow, Terraform
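For illustration only, a minimal Airflow DAG along the lines of this project could look like the sketch below; the bucket, Glue job, and DAG names are placeholders rather than the actual project configuration, and it assumes the Amazon provider package for Airflow.

# Hypothetical Airflow DAG: wait for raw files in S3, then run an AWS Glue job.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="datalake_daily_ingest",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_raw = S3KeySensor(
        task_id="wait_for_raw_file",
        bucket_name="example-raw-bucket",       # placeholder bucket
        bucket_key="incoming/{{ ds }}/*.csv",
        wildcard_match=True,
    )

    raw_to_curated = GlueJobOperator(
        task_id="raw_to_curated",
        job_name="raw_to_curated",              # placeholder Glue job name
        region_name="us-east-1",
    )

    wait_for_raw >> raw_to_curated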
Project 1: Customer Care IVR Automation
Description: Automate the current IVR process to remove dependency on customer care executives
Tech Stack: Confluent Kafka, Spark, Python, Machine Learning, IBM RTC, Jenkins, REST API, Web Services, HSM
Project 2: Data Warehouse Migration
Description: Create a data lake that serves as the single, comprehensive source of information to improve decision-making
Tech Stack: Sqoop, Hive, Spark, Python, AWS (EC2, EMR)
Project 1: Charging System Tariff Plans Regression | U Mobile, Malaysia
Description: Constant tariff updates create a persistent need to track their impact on ROI, which determines end-user satisfaction.
Tech Stack: Sqoop, Hive, Spark, Python
Project 2: New Data Lake for LTE Network
Description: Create a new data lake that serves as the single, comprehensive source of information for critical KPIs
Project 3: Portal Development
Tech Stack: Core Java, JSP, Servlets, MySQL
Project 1: Scilab Textbook Companion | IIT Bombay | Under Ministry of HRD Project
Project 2: Spoken Tutorials | IIT Bombay | Under Ministry of HRD Project
I have 10 years of work experience. I started my career as a Java developer for the initial 2 years, and for roughly the past 8 years I have been working as a data engineer. In terms of skill set, I have worked on Hive and PySpark, and on Kafka for real-time use cases. Other tools I have worked with over the past few years are Databricks and orchestration tooling. On the DevOps side, I am also involved in creating CI/CD pipelines and provisioning new infrastructure through Terraform. That is my overall experience.
If I need to optimize a PySpark job that joins two large datasets, I will first analyze the data and set the partitioning key so that no major shuffling happens. With that partitioning key in place, I will check whether there is any data skew. If skew is still present after setting the required partitioning key, I will add a salting key to distribute the data uniformly, so that when I join the two large datasets the job does not get stuck. Since I am working with large datasets, I will also choose a memory-optimized cluster instead of a compute-optimized one. In addition, I will tune the relevant Spark properties at submission time, and then write the output to Snowflake.
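As a rough sketch of the salting idea described here (the paths, table names, column names, and salt count are illustrative assumptions, not from an actual project):

# Salt the join key on the large, skewed side so hot key values spread
# across many partitions, and replicate the other side once per salt value.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skewed-join-demo").getOrCreate()

SALT_BUCKETS = 16                                               # tuning knob

big = spark.read.parquet("s3://example-bucket/transactions/")   # placeholder path
dim = spark.read.parquet("s3://example-bucket/accounts/")       # placeholder path

# Random salt on the skewed side ...
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ... and the other side exploded once per salt value so every pair still matches.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
dim_salted = dim.crossJoin(salts)

joined = big_salted.join(dim_salted, on=["account_id", "salt"], how="inner")
joined.write.mode("overwrite").parquet("s3://example-bucket/curated/joined/")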
The basic process of tuning a PySpark application: first, we can increase or decrease the number of shuffle partitions, which defaults to 200. Second, if one side of a join holds a small dataset, we can broadcast that smaller DataFrame to the bigger one. Third, if we are joining big tables, we need to check whether the partitioning key we set leaves any data skew; if skew is still there, we can use a salting technique so the application is optimized accordingly. Beyond that, we need to choose the Spark properties according to our requirements.
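A minimal sketch of the first two levers mentioned above, adjusting the shuffle partition count and broadcasting the smaller side of a join; the paths and column names are made up for illustration:

# Tune shuffle parallelism and broadcast the small lookup table in a join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Raise (or lower) the shuffle partition count to match the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "400")

orders = spark.read.parquet("s3://example-bucket/orders/")         # large table
countries = spark.read.parquet("s3://example-bucket/countries/")   # small lookup

# Broadcasting the small side avoids a full shuffle of the large table.
enriched = orders.join(broadcast(countries), on="country_code", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/orders_enriched/")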
I have not worked on Snowflake.
For monitoring, whenever we hit a failure condition and we are using AWS services, we can use SNS so that the failure raises an alert to our email. That way we do not need to watch the job constantly; we just monitor our inbox for failure emails and rectify those errors. That is how we can monitor ETL job failures and delays. We can also add a delay condition in the program itself, so that if the job exceeds a given time it sends an alert as well.
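One possible shape for this, assuming an SNS topic with an email subscription already exists; the topic ARN and job name below are placeholders:

# Publish a failure alert to SNS so the team is notified by email.
import boto3

sns = boto3.client("sns", region_name="us-east-1")

def notify_failure(job_name: str, error: Exception) -> None:
    """Send a failure alert so nobody has to watch the job manually."""
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:etl-alerts",  # placeholder ARN
        Subject=f"ETL failure: {job_name}",
        Message=f"Job {job_name} failed with error: {error}",
    )

def run_job() -> None:
    raise RuntimeError("simulated failure")      # stand-in for the real ETL step

if __name__ == "__main__":
    try:
        run_job()
    except Exception as exc:                     # alert on any failure, then re-raise
        notify_failure("daily_load", exc)
        raise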
To build a serverless data processing pipeline, as soon as the data lands in the S3 bucket we can put a trigger on it that invokes a Lambda function. The Lambda can do some preprocessing or derive parameters, and then, if we are using AWS services, trigger a Glue job, which launches another serverless step for the processing. In this way we can design a serverless architecture using Lambda and Glue, and, if needed, DynamoDB for the metadata information.
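A hypothetical Lambda handler for this flow might look like the following, assuming the S3 bucket is configured to send object-created events to the function; the Glue job name and DynamoDB table are placeholders:

# S3 event -> Lambda -> start a Glue job, recording basic run metadata in DynamoDB.
from urllib.parse import unquote_plus

import boto3

glue = boto3.client("glue")
dynamodb = boto3.resource("dynamodb")
metadata_table = dynamodb.Table("pipeline_runs")        # placeholder table

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])   # keys arrive URL-encoded

        # Kick off the (placeholder) Glue job for the newly landed object.
        run = glue.start_job_run(
            JobName="process_landed_file",
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )

        # Keep a small audit trail of what was triggered.
        metadata_table.put_item(
            Item={"object_key": key, "glue_run_id": run["JobRunId"]}
        )
    return {"status": "triggered"}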
To build a data quality framework, we can create unit test cases using Python and PySpark and run the unit testing locally, so that we can prepare test cases for the expected data and check for any data quality issues on our local environment itself.
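A small example of what such locally runnable checks could look like with pytest and a local SparkSession; the sample data and rules are illustrative:

# Local data-quality checks: null keys and key uniqueness on a tiny sample.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("dq-tests").getOrCreate()

@pytest.fixture
def customers(spark):
    # Small hand-built sample standing in for a real extract.
    return spark.createDataFrame(
        [(1, "alice@example.com"), (2, "bob@example.com")],
        ["customer_id", "email"],
    )

def test_no_null_keys(customers):
    assert customers.filter(customers.customer_id.isNull()).count() == 0

def test_keys_are_unique(customers):
    assert customers.count() == customers.select("customer_id").distinct().count()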
If we need to automate the backup process for a NoSQL database, one example is DynamoDB, which is a managed NoSQL database on AWS. There is an option where AWS itself takes the backup, so it is taken care of automatically. Otherwise, we can do the backup manually as well: at the end of each day we create a backup, possibly in a different region, so that we can still access the data in case of a disaster.
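A sketch of both options for DynamoDB with boto3, using a placeholder table name: enabling point-in-time recovery (managed by AWS) and taking an explicit end-of-day backup:

# Turn on continuous backups and take a named on-demand backup for a table.
import datetime
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

TABLE = "orders"                                    # placeholder table name

# Continuous, AWS-managed backups (point-in-time recovery).
dynamodb.update_continuous_backups(
    TableName=TABLE,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# An explicit end-of-day backup, e.g. scheduled from a small cron job or Lambda.
stamp = datetime.date.today().isoformat()
dynamodb.create_backup(TableName=TABLE, BackupName=f"{TABLE}-{stamp}")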
The strategy is basically to migrate the existing Python ETL and add multithreading capability through Python. First, we create a generalized framework so that any ETL pipeline can run through it, and whatever tables it processes run multithreaded, i.e. in parallel. By creating a generalized framework and running it with multithreading, we get enhanced parallel processing capability.
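A bare-bones version of that idea using Python's ThreadPoolExecutor; the table list and load function are placeholders for the real ETL logic:

# Run one load function per table in parallel with a thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

TABLES = ["customers", "accounts", "transactions"]     # illustrative table list

def load_table(table: str) -> str:
    # Placeholder for the real extract/transform/load of a single table.
    print(f"loading {table}")
    return table

def run_parallel(tables, max_workers: int = 4) -> None:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                future.result()
                print(f"{table}: done")
            except Exception as exc:
                print(f"{table}: failed with {exc}")

if __name__ == "__main__":
    run_parallel(TABLES)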
If we need to process the data in real time, we can either use Lambda for some preprocessing steps or ingest the data using Kinesis. Within the Kinesis service, I think we can also process the data itself, so in that case we do not even need to go to Lambda; both the ingestion and the processing can happen within Kinesis.
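For example, a Lambda function subscribed to a Kinesis stream could process records as they arrive; the payload fields and the "processing" rule below are assumptions for illustration:

# Lambda consumer for a Kinesis stream: decode each record and apply a simple rule.
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Placeholder processing step: flag large transactions.
        if payload.get("amount", 0) > 10_000:
            print(f"ALERT: large transaction {payload}")
    return {"records_processed": len(event["Records"])}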
To design for high availability and disaster recovery, we can create a Multi-AZ architecture, so that the system runs across multiple Availability Zones and we get the high availability option.
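As one concrete, purely illustrative piece of such a setup, an RDS instance can be provisioned with a Multi-AZ standby via boto3; all identifiers below are placeholders:

# Provision an RDS instance with a synchronous standby in a second Availability Zone.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="core-banking-db",        # placeholder name
    DBInstanceClass="db.m5.large",
    Engine="postgres",
    MasterUsername="admin_user",
    MasterUserPassword="change-me-please",         # use Secrets Manager in practice
    AllocatedStorage=100,
    MultiAZ=True,                                  # standby replica in another AZ
    BackupRetentionPeriod=7,                       # automated backups for DR
)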