
As a Senior Data Engineer at EY, I lead the delivery of high-quality data solutions to our clients, collaborating with cross-functional teams and leveraging Azure Databricks, Spark, Python, and Azure Cloud technologies. I have saved time and money by automating data integrity validation and infrastructure management. In my role as an Azure Data Engineer at EY, I engineered a new analytics platform with a central marketplace for analytics discovery and a robust engine for deploying repeatable data solutions.
I have a Bachelor of Technology in Computer Science from APJ Abdul Kalam Technological University, where I gained a solid foundation in data engineering, data integration, data transformation, and machine learning algorithms. I am also certified in Python, Cloud, and Databricks. I am passionate about learning and exploring new tech opportunities, and I am always eager to expand my skillset and knowledge.
Senior Analyst (Data Engineer)
Ernst & Young Analytic Hub - Actionable Insights at scale
Associate Analyst (Data Engineer)
EY GDS
Python

SQL

ETL

Azure Databricks

Machine Learning

Azure DevOps

Azure Data Factory

Scikit-Learn

Azure Key Vault

Pandas

OpenCV

GitHub

BigData

Spark

Power BI

Matplotlib

PowerShell
Sure. So my name is Mohammed, and I'm a data engineer. I'm from the 2020 batch, and since then I have been exploring the data field. Initially I did a machine learning internship of about six months. Then, in April 2021, I joined as a data engineer, and since then I have been working as an Azure data engineer. I have mostly been handling data engineering operations end to end: receiving the data, transforming it, and serving the insights for the analysts to consume. I have mostly worked on batch requirements, and it has been around three years so far. I have handled various clients, and the requirements were mostly on Azure cloud, with a tech stack of Spark, SQL, Azure Data Factory, Azure Databricks, Key Vault, Azure Functions, and similar technologies.
Sure. So it depends on what sort of ETL we are trying to design. If we are designing an OLAP system that will ultimately be consumed by Power BI or another reporting system, then the design will mostly fall into this shape: the inputs are received from some set of sources and put into a landing container or landing area. From there we can follow the standard approach of a medallion architecture, with bronze, silver, and gold (curated) layers. The data lands in the bronze containers and serves as the source of truth; if we want, we can enrich the data and keep it in the next, silver layer, and the final, gold layer holds the curated data. Based on this, the design I intend is that we can utilize a data lake or blob storage together with Data Factory and a SQL server, where the SQL server is used for the end consumption. So we keep the data available in the landing containers, keep the cleaned data in the silver layer, and on top of that we can keep a gold layer as well. From the gold layer (or we can skip the gold layer if we choose), we push the transformed data into the SQL server. That's what I can think of.
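To make the layering concrete, here is a minimal PySpark sketch of the medallion flow described above, assuming a Databricks-style environment with Delta and a JDBC driver available; the container paths, table names, and connection details are hypothetical placeholders.

# Minimal sketch of a bronze -> silver -> gold -> SQL flow (paths and names are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-etl").getOrCreate()

# Bronze: land the raw files as-is and keep them as the source of truth.
raw = spark.read.option("header", True).csv("abfss://landing@datalake.dfs.core.windows.net/sales/")
raw.write.mode("append").format("delta").save("abfss://bronze@datalake.dfs.core.windows.net/sales/")

# Silver: clean and enrich (dedupe, fix types, drop bad rows).
bronze = spark.read.format("delta").load("abfss://bronze@datalake.dfs.core.windows.net/sales/")
silver = (bronze.dropDuplicates(["order_id"])
                .withColumn("order_date", F.to_date("order_date"))
                .filter(F.col("amount").isNotNull()))
silver.write.mode("overwrite").format("delta").save("abfss://silver@datalake.dfs.core.windows.net/sales/")

# Gold: aggregate into the curated, report-ready shape.
gold = silver.groupBy("region", "order_date").agg(F.sum("amount").alias("total_amount"))

# Push the curated data into Azure SQL for Power BI / reporting consumption.
(gold.write.format("jdbc")
     .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=reporting")  # placeholder
     .option("dbtable", "dbo.SalesDaily")
     .option("user", "sql_user").option("password", "<from-key-vault>")
     .mode("overwrite").save())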
Okay. So if we want a high-throughput web scraping service, then basically we need a highly available service. We may have to think of queue-related services on the Azure side, or else a Kafka-style component, which would help us with high availability and near-real-time capability. If we want a high-throughput, concurrent application, that is the sort of design and the sort of tooling we can utilize. Based on this, we can have multiple servers, each running different scraping scripts, together producing a distributed scraping system.
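As an illustration of the queue idea, here is a minimal sketch of one scraping worker that pulls URLs from an Azure Storage queue; several such workers could run concurrently on different servers. The queue name, connection string, and output path are hypothetical placeholders.

# One scraping worker: pulls URLs from a queue so many workers can run concurrently.
import requests
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<storage-connection-string>", "scrape-urls")

for msg in queue.receive_messages(visibility_timeout=300):
    url = msg.content
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        # Persist the raw payload for the downstream landing layer.
        with open(f"/tmp/{abs(hash(url))}.html", "w", encoding="utf-8") as f:
            f.write(resp.text)
        queue.delete_message(msg)          # only remove the message once the scrape succeeded
    except requests.RequestException:
        pass                               # message becomes visible again later and is retried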
So, if we have scraped data coming from the scraping system, we are expecting some files to be produced as part of the web scraping, and how we handle them depends on the nature of those files. If we receive the scraped files as CSV, JSON, or XML, we can have them dropped into an SFTP location, which can be picked up on a batch schedule. From there they can be taken into a landing layer, and we can do the cleaning on top of that. Basically, we can utilize ADF with its copy capability, and we can use Databricks, if we have that component, to clean up the data. We can use linked services and datasets to move the data from the SFTP location into landing, and then from landing into a silver layer using Databricks with an appropriate linked service. After the cleaning and transformation, we can move the data into the SQL database with a Databricks script. That is one approach. If we don't want Databricks in between, considering cost or anything else, we can clean the data using ADF data flow activities, which help us do basic transformations and cleanups, and at the sink level we can have the SQL database, most probably loading the scraped data incrementally. So the components are mostly: an SFTP location, a time-based trigger that loads into our landing area (a data lake or blob storage), then from blob storage an ADF copy activity or data flow activity, with the blob storage as the source and the SQL database as the sink.
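Here is a minimal Databricks-style sketch of the cleaning step between landing and silver, followed by an incremental append into Azure SQL; the paths, column names, and the append-only incremental strategy are assumptions for illustration.

# Cleaning step: landing (raw scraped CSV) -> silver -> incremental load to Azure SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

landing = (spark.read.option("header", True)
                .csv("abfss://landing@datalake.dfs.core.windows.net/scraped/2024-06-01/"))  # placeholder batch folder

cleaned = (landing.dropna(subset=["url"])
                  .dropDuplicates(["url"])
                  .withColumn("scraped_at", F.to_timestamp("scraped_at")))

cleaned.write.mode("append").format("delta").save("abfss://silver@datalake.dfs.core.windows.net/scraped/")

# Incremental load: append only this batch into the SQL sink.
(cleaned.write.format("jdbc")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=staging")  # placeholder
        .option("dbtable", "dbo.ScrapedPages")
        .option("user", "sql_user").option("password", "<from-key-vault>")
        .mode("append").save())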
So in DevOps, if we want to deploy any repo-configured item like this, what we can do is set up some PowerShell code. Basically, how it works is: since we have CI/CD built in, we set up a collection of stages. Suppose we have three environments, say dev, test, and prod; for these three environments we can set up three stages, and for these three stages we can have three pipelines, each taking from the artifacts. For the SQL schema, we should keep a main branch, and in the design I would say: let's keep the main branch for prod, a test branch for the testing environment, and a develop branch for the dev environment, with feature branches cut from develop. From these three branches the artifacts can be taken, and those artifacts can be deployed with the Azure container deployment activity; using that activity we can package all these SQL scripts and deploy them into our Azure SQL infrastructure. So there will be a PowerShell activity to perform the deployment, and three stages in the pipeline. If we raise a PR from a feature branch into develop, we can have the trigger fire on merge, so when the PR goes into dev it automatically triggers and deploys into dev. For the test and prod environments we can add security gates as well: we can put approval criteria in place, to be signed off by the leads or similar. We can configure those rules and set up the pipeline in that way.
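The transcript describes the per-stage deployment step as PowerShell; purely for illustration, here is a Python equivalent of what that step might do, applying the SQL scripts from the published artifact to the target environment. The connection string, artifact path, and the naive GO-splitting are assumptions.

# Sketch of a per-stage schema deployment step (hypothetical paths and connection details).
import glob
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=myserver.database.windows.net;DATABASE=appdb;"
            "UID=deploy_user;PWD=<from-pipeline-secret>")

def deploy_schema(artifact_dir: str) -> None:
    """Apply every .sql file in the published artifact to the target environment."""
    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        cursor = conn.cursor()
        for path in sorted(glob.glob(f"{artifact_dir}/*.sql")):
            script = open(path, encoding="utf-8").read()
            # Naive GO-batch splitting, good enough for simple schema scripts.
            for batch in (b.strip() for b in script.split("\nGO")):
                if batch:
                    cursor.execute(batch)

deploy_schema("artifacts/sql")   # each stage (dev/test/prod) points this at its own server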
Okay. So the SQL query is frequently timing out, and a timeout can be due to different sorts of reasons. First of all, both time-based analysis and storage-based analysis can be performed: the volume of data being processed by the query can be one reason, and the complexity of the calculation can be another. Based on that, the nature of the query we have written is the first place to start the root cause analysis. If we have complex, nested queries, we have to identify which part of the query the time is being spent in. We should ideally be following the basic principles: select only the necessary columns; avoid subqueries, or if we do use one, make sure it returns a limited set of data; use proper join conditions; and use proper indexing, implementing it if it is missing and joining on indexed columns. That sort of basic evaluation should be done to see where the query spends its time. We can also take the same query and try it in a SQL editor, or check the execution plan to see which part of the query is taking too long and what can be done to optimize it further. And if a large amount of data is being scanned, causing a bottleneck that makes the query run out of time, then we may have to figure out a way to avoid the large scan. We may have to check the granularity: if we are querying at the sales order item level, we may have to consider going back to the sales order level so that we scan less data. If a sales order has ten items, let's not scan the entire item-level data. That is the direction we may have to think in and then go deeper.
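One way to see which part of a slow query is expensive is to pull the estimated execution plan before running it. This is a small sketch assuming pyodbc against SQL Server; the connection string and the example query are hypothetical, and the plan XML would then be inspected for table scans, missing indexes, or spills.

# Fetch the estimated execution plan for a slow query (placeholders throughout).
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};"
                      "SERVER=myserver.database.windows.net;DATABASE=sales;"
                      "UID=analyst;PWD=<secret>", autocommit=True)
cursor = conn.cursor()

cursor.execute("SET SHOWPLAN_XML ON")          # subsequent batches return the plan instead of executing
cursor.execute("""
    SELECT soh.SalesOrderID, SUM(sod.LineTotal) AS order_total
    FROM Sales.SalesOrderHeader soh
    JOIN Sales.SalesOrderDetail sod ON sod.SalesOrderID = soh.SalesOrderID
    WHERE soh.OrderDate >= '2024-01-01'
    GROUP BY soh.SalesOrderID
""")
plan_xml = cursor.fetchone()[0]                # the showplan XML for the query above
cursor.execute("SET SHOWPLAN_XML OFF")
print(plan_xml[:2000])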
Okay. So first of all, the basic thing: selecting star is not the best way; we should select only the necessary columns. Second, I can see that the WHERE clause has been placed at the end, with the sell start date filter and the other filter pushed to the end. The product number, I expect, is coming from the product table, along with the product model, which means both the product number and sell start date filters come from the product table. So I expect we can easily apply those filters before joining, so that less data goes into the join. Those are the two main findings: select the necessary columns instead of star, and push the filters up before the join. I can also see it is already a LEFT JOIN, and the left join should perform fine, so that part is mostly okay. On top of that we can think of other things to incorporate if we want, but those two should definitely be done.
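Since the full query isn't shown here, this is only a hypothetical rewrite illustrating the two findings: selecting just the needed columns and pushing the product-table filters below the join. The table and column names are AdventureWorks-style assumptions, not the actual schema.

# Hypothetical rewrite: explicit columns, product-side filters applied before the joins.
import pyodbc

conn = pyodbc.connect("<connection-string>")   # placeholder

rewritten = """
SELECT p.ProductNumber, pm.Name AS ProductModel, sod.OrderQty, sod.LineTotal
FROM (
    SELECT ProductID, ProductNumber, ProductModelID
    FROM SalesLT.Product
    WHERE SellStartDate >= '2006-01-01'          -- filter applied before the join
      AND ProductNumber LIKE 'FR-%'
) AS p
LEFT JOIN SalesLT.ProductModel pm ON pm.ProductModelID = p.ProductModelID
LEFT JOIN SalesLT.SalesOrderDetail sod ON sod.ProductID = p.ProductID
"""
rows = conn.cursor().execute(rewritten).fetchall()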
So, in the Azure Data Factory copy activity, we have a source of delimited text; the location is an Azure Blob Storage path, container/dataset.csv; it's text format with this column delimiter; treat empty as null is set, so we are keeping empty values as null; and the skip line count means the first line is being skipped. In the sink we have Azure SQL, dbo.ImportTable, and in the translator the mapping is name as string, age as int, and start date as datetime. So the column mapping looks good overall, but there is an issue where some input fields fail due to incorrect data type mapping. Name as string, age as int, and start date as datetime all look reasonable to me; start date is fine as datetime. Age is the one most likely to have a problem, for example if people are including months or something along with it. Apart from that, I can't get much insight from this level of detail; it's better if we can see the data as well, which will give us the proper insight easily. So we should examine the data and identify what went wrong. The possible causes are things like age containing floating-point values, or start date containing only a date, in which case we may not need datetime at all if no timestamp details are coming in. These sorts of issues can be there, but seeing the data would help us figure it out very easily.
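As a quick way to "see the data", here is a small pandas sketch that flags the rows which would break the int and datetime mappings. The file path and column names mirror the copy-activity settings above and are placeholders.

# Find rows where 'age' is not a clean integer or 'start date' doesn't parse as a datetime.
import pandas as pd

df = pd.read_csv("dataset.csv")   # placeholder for the source file

age_numeric = pd.to_numeric(df["age"], errors="coerce")
bad_age = df[age_numeric.isna() | (age_numeric % 1 != 0)]          # non-numeric or fractional ages

start_parsed = pd.to_datetime(df["start date"], errors="coerce")
bad_start = df[df["start date"].notna() & start_parsed.isna()]     # values that won't fit DATETIME

print("rows with problematic age values:\n", bad_age.head())
print("rows with problematic start date values:\n", bad_start.head())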
Multi-regional data replication inconsistency. So, if we want to implement multi-regional replication on a SQL database, this question needs a bit of clarity for me: are we expecting to replicate the data in the SQL DB out to multiple regions, or are we expecting data coming in from multiple regions into the SQL database? If it is data coming from multiple regions into the SQL database, and we should handle it in the database, then the approach should be this: we should have a flag column to distinguish which region each piece of data is coming from. If we have sales order data coming from regions like the Northeast, East Asia, and the Middle East, then we should keep the data in a shape that includes a flag column, which helps us easily identify whether a row came from the Middle East or from East Asia. That single column can be populated from some flag already present in the data, or derived easily from the file names or the timestamp details. That is how I would think of handling the data consistently in the database. Yeah, I think that makes sense.
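Here is a minimal PySpark sketch of the region-flag idea: each incoming row is tagged with the region it came from, derived in this example from the source file path. The paths and the region keywords are hypothetical.

# Tag each row with its source region so rows stay distinguishable after merging in SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = (spark.read.option("header", True)
               .csv("abfss://landing@datalake.dfs.core.windows.net/sales/*/")
               .withColumn("source_file", F.input_file_name())
               .withColumn("region",
                           F.when(F.col("source_file").contains("middle_east"), "Middle East")
                            .when(F.col("source_file").contains("east_asia"), "East Asia")
                            .otherwise("Northeast")))

# The region column now travels with the data into the SQL database.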
So, a PHP system. Okay. If we have a PHP system, then first of all I think it should be an on-premises system. If the data is available on-premises, then we can use ADF copy activities to pull the data, by setting up a linked service and a connection to the PHP system, and the dataset can pull the data directly into SQL Server for the initial load. So a lift-and-shift style copy can help with that. Or else we could utilize the Azure file service itself, which can lift and shift the whole data very easily. We may have to figure out which of these two would work, or explore and see if there are any other options available that might help us. So, yeah, those approaches: either ADF, or Azure file storage, or else exploring further on the Azure stack.
Okay, so if we want to implement CI/CD for ADF, it's pretty easy in DevOps. Basically, we can set up a branching strategy in ADF, backed by Azure Repos. Once Azure Repos is available and set up, we will probably have two branches: by default we will have the adf_publish branch, and we can have a main branch. Then we can create a pipeline that basically takes the ARM template and deploys it. That's the approach. If we go deeper into the deployment, we can create two artifacts, one from the main branch and one from the adf_publish branch. From the adf_publish branch we get the ARM template, which is selected in the ARM template deployment activity; in that activity we choose the ARM template and the ARM template parameter file, both taken from the adf_publish artifact. Then we connect this to whichever data factory we intend to deploy to, and we configure the branch and trigger. One more thing, and this is why I said the main branch should be utilized: Azure recommends stopping and restarting the triggers before and after the deployment. So let's have a PowerShell script that stops the triggers prior to the deployment, that is, before the ARM template deploy activity; we can give it the parameterization, and the boilerplate script that Azure provides on its website can be used. After the ARM template deploy activity, we can have the same PowerShell activity with parameters to restart the triggers. On top of that, we can set the branch per stage: if we are pushing into dev, it should deploy to dev; for the test environment we should have another stage underneath that deploys based on the testing branch; and the main branch is obviously for prod. So this is what I can think of.
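The transcript refers to Microsoft's pre/post-deployment PowerShell script; as an illustrative alternative under the same idea, here is a Python sketch using the azure-mgmt-datafactory SDK to stop all triggers before the ARM template deployment and start them again afterwards. The subscription, resource group, and factory names are placeholders.

# Stop ADF triggers before deployment and restart them afterwards (hypothetical names).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, ADF = "rg-data-platform", "adf-prod"

def set_triggers(start: bool) -> None:
    """Stop or start every trigger in the target data factory."""
    for trig in client.triggers.list_by_factory(RG, ADF):
        if start:
            client.triggers.begin_start(RG, ADF, trig.name).result()
        else:
            client.triggers.begin_stop(RG, ADF, trig.name).result()

set_triggers(start=False)   # pre-deployment
# ... ARM template deployment of the adf_publish artifact happens here ...
set_triggers(start=True)    # post-deployment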