Vetted Talent

Aswathy Raj

Diligent engineer with 12+ years of experience, including contributions in data science and engineering, development of software frameworks, platforms, and applications, and customer interaction with multilingual and multicultural clients. An effective team player, well versed in various platforms, programming languages, and databases. Extensive experience in all phases of software development, under both waterfall and agile project life cycles.

  • Role

    Senior Data Engineer

  • Years of Experience

    12.00 years

Skillsets

  • Reporting & documentation
  • GitHub
  • Implementation Support
  • IT infra management
  • Jira
  • Jupyter Notebook
  • MySQL
  • NSIS
  • Providing product demo
  • PyCharm
  • Eclipse
  • Requirement gathering
  • SQLite
  • SVN
  • Visual Studio
  • Data insights & strategy
  • Data analytics dashboard
  • Support software development
  • Tender proposals
  • Jaspersoft iReport Designer 5.1.0
  • Ant script
  • Snowflake - 4 Years
  • Java - 2 Years
  • Python - 4 Years
  • MS SQL - 2 Years
  • SQL - 8 Years
  • Redshift - 1 Year
  • ETL - 4 Years
  • S3 - 4 Years
  • PySpark
  • Azure DevOps - 1 Year
  • Tcl/Tk script
  • AWS - 4 Years
  • Airflow
  • Business Intelligence
  • CI/CD implementation
  • Client Management
  • Databricks

Vetted For

11 Skills
  • Senior Data Engineer (AI Screening): 62%
  • Skills assessed: BigQuery, AWS, Big Data Technology, ETL, NoSQL, PySpark, Snowflake, Embedded Linux, Problem Solving Attitude, Python, SQL
  • Score: 56/90

Professional Summary

12.00 Years
  • Oct, 2020 - Present (5 yr)

    Consultant | Data Science Engineer

    Sinergia Media Labs
  • Jan, 2013 - May, 2016 (3 yr 4 months)

    IT Consultant

    AI Rawahy Technical Services
  • Sep, 2006 - Nov, 2010 (4 yr 2 months)

    Software Engineer

    Huawei

Applications & Tools Known

  • Airflow
  • PyCharm
  • Jupyter Notebook
  • Eclipse
  • Visual Studio
  • GitHub
  • SVN

Work History

12.00 Years

Consultant | Data Science Engineer

Sinergia Media Labs
Oct, 2020 - Present (5 yr)
    Developing machine learning applications, selecting datasets, implementing ML algorithms, running tests, maintaining databases, filtering data, and preparing analysis reports.

IT Consultant

AI Rawahy Technical Services
Jan, 2013 - May, 2016 (3 yr 4 months)
    Planning project activities, managing projects end to end, handling technical aspects, and imparting training.

Software Engineer

Huawei
Sep, 2006 - Nov, 2010 (4 yr 2 months)
    Design and development of functional and technical solutions, removing corrupted data, rapid application development, compiling code, executing tests, defining pipeline steps, and integrating code changes.

Achievements

  • Awarded for the Tableau Dataset Migration (Customer Appreciation)
  • Achieved CEO Team Award in January 2023
  • Bagged Monthly Shining Star Award for exceptional performance under pressure, meeting strict timelines, and delivering quality results during March 2022
  • Successfully worked in China for 6 months to implement urgent requirements, completing tasks to a high standard
  • Awarded for contributions towards making projects CI (Continuous Integration) compliant, including setting up Cruise Control, writing scripts in ANT, XSL, and XML, providing technical assistance, and conducting training sessions

Major Projects

8 Projects

Techstyle

Mar, 2024 - Present (1 yr 7 months)
    Techstyle is an American fashion brand that operates in the e-commerce domain.
    Technologies: Amazon MWAA (Managed Workflows for Apache Airflow), Python, SQL, Snowflake, MS SQL Server, PyCharm, GitHub
    Accountabilities:
    • Continuously oversee active processes, promptly identifying and resolving any issues or failures to ensure seamless operations.
    • Design and implement new features and data pipelines to enhance functionality and efficiency.
    • Conduct rigorous data validation to ensure accuracy, consistency, and integrity across all datasets.
    • Troubleshoot and fix any failures, optimizing system performance for improved reliability and speed.

NBC (National Broadcasting Company)

Nov, 2021 - Feb, 2024 (2 yr 3 months)
    National Broadcasting Company is an American commercial broadcast television and radio network.
    Technology Used: Python, PySpark, SQL, Amazon S3, Databricks, Airflow, Snowflake, MySQL
    Accountabilities:
    • Developed frameworks and pipelines for capturing data from APIs and other sources, storing it in Amazon S3, and loading it into Snowflake tables after transformation
    • Optimized and migrated Tableau datasets using PySpark code
    • Managed the migration from SnapLogic to Airflow and Python, and implemented distributed processing with PySpark and Spark SQL in Databricks
    • Added and maintained ETL pipelines in Airflow and optimized Spark SQL queries to reduce reporting query run times
    • Created and managed Delta tables and evaluated data optimization technologies
    • Developed an audit framework integrated with Python scripts
    • Trained new hires on domains and pipelines

Social Pulse

Mar, 2023 - Sep, 2023 (6 months)
    In-house project to leverage data from various social media endpoints, such as YouTube, Facebook, Instagram, Twitter, LinkedIn, and TikTok, to provide a reporting dashboard.
    Technology Used: Python, Redshift, Amazon S3, AWS QuickSight, React, Node.js
    Accountabilities:
    • Architected and managed the project until completion and ensured the development of the framework and pipelines for data capturing from AP

Amgen

Jun, 2021 - Oct, 2021 (4 months)
    Amgen is an American multinational biopharmaceutical company. The data science project was carried out to identify the factors leading to customer/patient dropout for one of their drugs, Otezla.
    Technology Used: ML, Python, SQL | Platforms: Databricks
    Accountabilities:
    • Analysed data in a Data Lake with over 300 tables to understand the pharma domain
    • Prepared two aggregated datasets: one at the customer level and another at the patient level
    • Conducted Exploratory Data Analysis, handled missing values, and encoded categorical data
    • Performed feature engineering for feature elimination and developed 12 machine-learning models for classification and clustering
    • Created an ML pipeline for model retraining

Indventor

Oct, 2020 - Jun, 2021 (8 months)
    Indventor is a low-cost ventilator based on the Bag Valve Mask, which is the standard method of providing rescue ventilation to patients.

In-House Project, Indventor

Oct, 2020 - Jun, 2021 (8 months)
    Indventor is a low-cost ventilator based on the Bag Valve Mask, which is the standard method of providing rescue ventilation to patients.
    Technology Used: Python, SQL, Selenium
    Accountabilities:
    • Reviewed the documents and code and delivered the client-side product presentation
    • Researched features as per customer requests and created UiPath RPA flows

Ministry of Agriculture and Fisheries, Oman

Jan, 2013 - May, 2016 (3 yr 4 months)
    The Ministry of Agriculture and Fisheries works to enrich the fields related to agriculture, livestock, and fisheries. The project aimed to centralize the data from various regions.
    Technology Used: Core Java, Jaspersoft iReport, Windows / Software configurations
    Accountabilities:
    • Understood the project architecture and the functionality of the Fisheries Licensing module
    • Interacted with the ministry to clarify requirements and ensure alignment, and analysed the legacy database for migration to a new database
    • Designed and created license cards, certificates, and statistical reports using iReport, and integrated these reports into the application
    • Deployed the database and application on the ministry's centralized server

Security Solutions, Huawei

Sep, 2006 - Nov, 2010 (4 yr 2 months)
    Huawei is known worldwide for its work in telecommunications. The project aimed to enhance the security offered at the IP layer, and the product supports IKEv1 as well as IKEv2. I worked on a project that developed applications to enhance the security of the telecom servers.
    Technology Used: SQLite, C++, Tcl/Tk, Core Java
    Accountabilities:
    • Built projects to enhance code quality by developing an on-the-fly feedback and correction system for Eclipse
    • Created CI/CD pipelines for projects, incorporating command-mode integration of code quality and QA tools
    • Provided training for the team on building continuous integration systems for projects
    • Developed automation suites for building libraries across various platforms and boards
    • Implemented and managed Continuous Integration (CI) processes and conducted training for the project team
    • Developed and implemented GUI-specific code, analysed new requirements, and designed solutions for implementation
    • Enhanced coding skills in Core Java and Swing and gained proficiency in Oracle database administration
    • Extended customer support for LGT, LVM & LMT and implemented logging and auditing policies
    • Created XML configuration files based on CIS Benchmarks
    • Parsed, retrieved, and wrote XML configuration files and conducted training sessions on using the plug-in

Education

  • M. Tech. (Data Science and Engineering)

    BITS Pilani, India (2022)
  • B.Tech. (Computer Science and Engineering)

    MG University, India (2006)

Certifications

  • Data Warehousing Workshop - Snowflake - October 2024 (Credential ID 119306090)

  • Academy Accreditation - Databricks Lakehouse Fundamentals - March 2023

  • Basics of Natural Language Processing using Python - March 2021 (NIELIT) (Credential ID olc3190)

  • Exploratory Data Analysis with Python and Pandas (Coursera) - March 2021 (Credential ID yv8396ns2l25)

  • Databases and SQL for Data Science by IBM (Coursera) - Nov 2019 (Credential ID usxtmtufvyt8)

  • Introduction to Git and GitHub by Google (Coursera) - October 2020 (Credential ID nuwxqp5a3gte)

AI-interview Questions & Answers

Could you help me understand more about your background by giving a brief introduction about yourself?
Sure. I am currently working as a data engineer, remotely, and I have a total of 11 years of experience spanning multiple domains. My current domain is e-commerce. Before this, I worked in the media and entertainment domain for close to 2.5 years, and prior to that for a pharmaceutical client, which involved both data engineering and data science. Earlier, I worked in the telecom domain, creating security solutions for server applications. I love solving problems, especially on the technical side. I am a quick learner of tools: I can grasp the technical documentation and get things working. I love doing POCs, exploring new tools and assessing their feasibility in our current environment. I am also passionate about guiding juniors and maintaining a seamless work atmosphere. I am a good team player and have been working remotely for the past 3 years, which I am very happy about. Before that I worked in an office, and it has been no different: I connect and work with colleagues just as I did in the office and build a rapport with them, which has been a great value addition for sharing knowledge and getting assistance when running into issues. Thank you.

How do you implement a data quality framework using PySpark to ensure the integrity of ETL process data?
First, I would say that data quality is essential for the downstream processes, and we mainly use PySpark to process big data. To ensure quality, the mandatory fields used downstream need proper checks, such as null checks. Using PySpark, we can implement these null checks on the incoming columns. We can also enforce a schema while reading the data, which ensures that the data in each column is of the appropriate type. If any other mandatory checks are needed, those can be accommodated in the schema as well.
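
The two checks named in this answer (null checks on mandatory columns and a declared schema) can be sketched in plain Python. This is only an illustration under stated assumptions: the column names, the `SCHEMA` mapping, and the `validate_rows` helper are hypothetical and do not come from the interview. In PySpark the same idea would typically be expressed with a `StructType` schema passed to the reader and `isNull()` filters on the mandatory columns.

```python
from typing import Any

# Hypothetical schema: column name -> (expected type, is the column mandatory?)
SCHEMA: dict[str, tuple[type, bool]] = {
    "order_id": (int, True),
    "customer_id": (int, True),
    "coupon_code": (str, False),
}


def validate_rows(rows: list[dict[str, Any]]):
    """Split rows into (valid, rejected); each rejected row carries a reason."""
    valid, rejected = [], []
    for row in rows:
        reason = None
        for column, (expected_type, mandatory) in SCHEMA.items():
            value = row.get(column)
            if value is None:
                # Null check: only mandatory columns reject the row.
                if mandatory:
                    reason = f"null in mandatory column {column}"
                    break
                continue
            # Schema check: the value must have the declared type.
            if not isinstance(value, expected_type):
                reason = f"{column} is not {expected_type.__name__}"
                break
        if reason is None:
            valid.append(row)
        else:
            rejected.append((row, reason))
    return valid, rejected
```

Keeping the rejected rows together with a reason, rather than dropping them silently, lets the counts of each failure type be reported to downstream consumers.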

How do you perform deduplication on a dataset in Snowflake that has been ingested from an ETL pipeline incorrectly multiple times?
This has come up multiple times at my workplace. Almost all of our loads into Snowflake are incremental, and the corresponding tables have timestamp columns, which help identify the duplicate data. We would perform a count analysis on how and when the data was ingested into the table. Each load is expected to complete at a specific time, so, depending on the counts and the timestamps, we would detect any discrepancies and delete the duplicated data from the Snowflake table. Sorry for the interruption in between.
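
The count analysis and timestamp-based delete described in this answer can be sketched as follows. SQLite (from the Python standard library) stands in for Snowflake so the example runs locally; the `orders` table, its columns, and both helper functions are hypothetical. The sketch assumes that `order_id` is the business key and that the earliest-loaded copy is the one to keep. In Snowflake itself, a comparable result is often obtained with `ROW_NUMBER()` and `QUALIFY`, which this sketch does not use.

```python
import sqlite3


def load_counts(conn: sqlite3.Connection):
    """Count analysis: number of rows ingested per load timestamp."""
    return conn.execute(
        "SELECT loaded_at, COUNT(*) FROM orders "
        "GROUP BY loaded_at ORDER BY loaded_at"
    ).fetchall()


def remove_repeated_loads(conn: sqlite3.Connection) -> int:
    """Keep the earliest-loaded copy of each order_id; return rows deleted."""
    cursor = conn.execute(
        """
        DELETE FROM orders
        WHERE EXISTS (
            SELECT 1 FROM orders AS earlier
            WHERE earlier.order_id = orders.order_id
              AND earlier.loaded_at < orders.loaded_at
        )
        """
    )
    conn.commit()
    return cursor.rowcount
```

Running `load_counts` first shows which load timestamps carry an unexpected number of rows, so the delete can be reviewed before it is applied.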

If you were to design an ideal pipeline that handles time series data, which design patterns would you implement and why?
To be frank, I have not handled time series data as such until now, so I am not sure I am hitting the right point of this question, but this is what comes to mind. We would need specific checks on when the data is going to be populated. The scheduler should run at the appropriate times, and there should be checks that the data is loaded at the specific time points. The data should be loaded into the lake, whichever tool we have, whether it is Databricks Delta Lake or Snowflake, and it needs a count or a timestamp that ensures the data has been populated appropriately at the right times. In terms of design patterns, that is one pattern I would suggest. And if there are any discrepancies, for example if a load has failed at a particular point in time, alerts should be posted to the stakeholders' appropriate channels to let them know that the data has failed to load.
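
The pattern suggested in this answer (check that a load exists for every expected time point, and alert when one is missing) can be sketched as below. The function names, the table name, and the fixed-interval schedule are assumptions made for illustration; how the alert is actually delivered to a channel is left out.

```python
from datetime import datetime, timedelta


def find_missing_loads(loaded_at, start, end, interval):
    """Return the expected load times in [start, end] that have no load."""
    seen = set(loaded_at)
    missing = []
    expected = start
    while expected <= end:
        if expected not in seen:
            missing.append(expected)
        expected += interval
    return missing


def build_alert(table, missing):
    """Build the alert text for stakeholders, or None if nothing is missing."""
    if not missing:
        return None
    times = ", ".join(ts.isoformat() for ts in missing)
    return f"{table}: {len(missing)} expected load(s) missing at {times}"
```

A scheduler task could call `find_missing_loads` with the load timestamps read from the target table and post the result of `build_alert` whenever it is not `None`.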

How do you detect and handle skewness in a large dataset when performing data transformation using PySpark?
I am not that familiar with the term skewness in data engineering, but in data science we say there is skewness when the data is not evenly distributed. So I suppose that how the data is partitioned when it is stored is what might cause the skewness. If it is not partitioned properly, a few partitions might hold a much larger amount of data than the others, so when the data is read later, the latency will be high for the larger partitions. If we encounter such issues while reading, with a few partitions much larger than the others, then we should choose the partitions appropriately. In Databricks there is a feature for this, though I am not able to recollect the name completely. I think we can set it on around 4 columns initially, and then a total of 30 columns, for which statistics are tracked. The data is then stored ordered by the values of those columns, and that order is a great help when the data is fetched. Say we are filtering the data on a value: if the value sits in a specific layer of the stack, the query goes to that location and fetches the data from there, using information from the metadata. With my current knowledge, that is how I would handle data skewness.
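
The uneven partition sizes described in this answer can be detected by counting rows per key, as sketched below in plain Python. The second helper shows key salting, a commonly used mitigation that the speaker did not mention and that is swapped in here purely as an illustration. The function names, the `factor` threshold, and the bucket count are all assumptions. In PySpark the counting step would typically be a `groupBy` on the key followed by `count()`.

```python
import random
from collections import Counter
from statistics import median


def find_skewed_keys(keys, factor=5.0):
    """Return keys whose row count exceeds `factor` times the median count."""
    counts = Counter(keys)
    typical = median(counts.values())
    return {key: n for key, n in counts.items() if n > factor * typical}


def salt_key(key, skewed, buckets=8, rng=random):
    """Spread a hot key over `buckets` sub-keys; leave other keys unchanged."""
    if key in skewed:
        return f"{key}#{rng.randrange(buckets)}"
    return key
```

When salting is used for a join, the other side of the join has to be replicated once per bucket for the hot keys so that every salted sub-key still finds its match.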

When optimizing SQL queries for reporting purposes in Snowflake, what best practices would you follow?
Snowflake stores data in micro-partitions. So, for any query we write, the filtering should happen first in the inner query rather than in the outer query; that is, every inner query should carry its filter conditions as early as possible. If the filter conditions are applied only in the outermost query, the query takes up more resources, all of the data gets loaded into memory, and the warehouse is throttled. So the best practice is to push as many filter conditions as possible into the subqueries and leave only the computation to the outer query. That way, only the relevant data is actually fetched for the computation.

How would you optimize a PySpark job that has to join 2 large datasets and write the output to a Snowflake table?
I am not sure if my approach is correct, but what can be done is to perform the join, write the result as a Parquet file, and then copy that Parquet file into the corresponding Snowflake table. That way, the processing will be faster. For the join of the 2 large datasets, the filter conditions should be applied to the appropriate subqueries first, and then the join should happen on those queries, basically on the partition columns. All the filters should be on the partition columns too, so that the data is fetched faster. In the case of Databricks Delta, if you fetch the data based on the partition columns, the statistics of the data are already available, so it need not scan the other partitions. It can go directly to the corresponding partition, fetch the data, and apply the computational logic.

What AWS services would you leverage for constructing a serverless data processing pipeline? And how would Lambda functions fit in the architecture?
One option is AWS managed Airflow, which is a full service that can pick up data from multiple data sources and store it to multiple endpoints as well. We can leverage that end to end, and I am currently using it in my project. As for how Lambda functions fit in the architecture: they come into the picture when we have to do some computation or transformation on the data, independent of loading data into a table. We have some source data, we do some processing on it, and we have to store the result in another location. The Lambda function can be called from an API, or it can be triggered: when data becomes available in a particular location, the function is triggered so that the data gets processed and stored in the appropriate location. I have not personally used Lambda functions, but I suppose they can also spin up EC2 machines or any of the other AWS services needed for the processing. Once the result is available in the other location, the downstream systems can use it.
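
The triggered case described in this answer (data lands in a location and a function processes it) can be sketched as a Lambda handler for an S3 "object created" notification. The handler below only reads the bucket and key from the event; `process_object` is a hypothetical placeholder for the real transformation and write step, and no AWS SDK calls are made, so the sketch runs locally.

```python
from urllib.parse import unquote_plus


def process_object(bucket: str, key: str) -> str:
    """Hypothetical placeholder for the real transformation and write step."""
    return f"processed s3://{bucket}/{key}"


def handler(event, context):
    """Entry point invoked once per S3 notification event."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_object(bucket, key))
    return {"processed": results}
```

Keeping the transformation in a separate function makes it possible to test it without constructing an event, and to reuse it if the same step is later called from an API instead of a trigger.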

What strategy would you use to migrate Python ETL scripts running on legacy systems to utilize PySpark for enhanced parallel processing capabilities?
Yes, this is something we did for the project with our media and entertainment client. We had Python scripts running in the SnapLogic tool, which scheduled them and triggered EMR jobs at the back end. What we did was bring in Databricks and leverage its parallel, big data processing capability; we coded in PySpark and used Spark SQL as well. The transition is seamless: even though they are Python scripts, internally the data is a Spark DataFrame, so all the processing happens in parallel in the background. The data can then be loaded into the Lakehouse or saved in Parquet format, however you want it. So this can very well be achieved using Databricks, and I have done it already.

What key metrics would you use to measure and improve the performance of an ETL pipeline that frequently handles JSON and CSV data files?
From a performance perspective, multiple things come into consideration. How fast is the pipeline able to process the data? What is the failure rate? When JSON files come in and we encounter new elements, does it fail? Or, in CSV, if we encounter new columns, does it fail? How well have we generalized it? The main thing is to have a schema in place with the appropriate data types, which makes sure that we are expecting data of a given type in the source and only then process it. That can be ensured to improve the quality and also the performance of the ETL pipeline.
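
The metrics named in this answer (processing speed, failure rate, and behaviour on new JSON elements or CSV columns) can be sketched with the standard library. The expected column list and all function names are hypothetical, and the JSON helper assumes each document is a single flat object.

```python
import csv
import io
import json

# Hypothetical expected schema for the incoming files.
EXPECTED_COLUMNS = ["id", "name", "amount"]


def schema_drift(columns):
    """Compare observed column names with the expected ones."""
    return {
        "missing": [c for c in EXPECTED_COLUMNS if c not in columns],
        "unexpected": [c for c in columns if c not in EXPECTED_COLUMNS],
    }


def csv_drift(text):
    """Schema drift for a CSV file, based on its header row."""
    header = next(csv.reader(io.StringIO(text)))
    return schema_drift(header)


def json_drift(text):
    """Schema drift for a JSON document holding one flat object."""
    return schema_drift(list(json.loads(text).keys()))


def run_metrics(rows_processed, rows_failed, seconds):
    """Throughput and failure rate for one pipeline run."""
    total = rows_processed + rows_failed
    return {
        "throughput_rows_per_s": rows_processed / seconds,
        "failure_rate": rows_failed / total if total else 0.0,
    }
```

Recording the drift result per file makes it visible when a source starts sending new elements, before that change shows up as a rising failure rate.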

Can you propose a method for real-time data processing using AWS Lambda and Kinesis for a data-driven application?
As I said before, AWS Lambda is something that I have not used myself. But in the project I worked on, there was a data science aspect. We had all the models ready and deployed, and the Lambda functions, scheduled on a regular basis, would take in the data, generate the results, and store them in the target table. That target table would then be used to show customers the best products they can buy. As for Kinesis, it was used for iTrouble. iTrouble would send the data batch-wise, and Kinesis was configured so that we received the data through it for processing, and that data would land in our S3 location.