
Experienced Azure Data Engineer with a proven track record in the IT industry. Over two years of hands-on experience in designing, implementing, and managing data solutions on Azure. Specialized in data ingestion, transformation, storage, and analytics using Azure services like Data Factory, SQL Database, Databricks, and Synapse Analytics. Skilled at collaborating with cross-functional teams to gather requirements, architect data solutions, and deliver high-quality outcomes.
Azure Data Engineer
SLB
Azure Data Factory
Microsoft Power BI
Data Warehouse
SQL
Azure Databricks
Logic Apps
Azure Virtual Machines
MicroStrategy
Data integration and data transformation (ETL):
Data Modelling and data visualization:
Financial report covering the actual, plan, forecast, and DSO data of the company across the globe, with a focus on automation and standardization.
Benefits: single version of truth, simplified and timely data availability.
Tools: Azure Data Factory, Azure Data Lake Service, Azure Databricks, Azure SQL Warehouse, Azure SQL DB, Azure Virtual Machines, Azure Logic Apps, Power BI, MicroStrategy, SQL.
Created an OLAP cube by employing the processes below:
Provides visibility into the global daily and month-end GL cash and bank balances across multiple bank accounts worldwide for the Treasury team.
Tools: Azure Data Factory, Azure Data Flows, Azure Data Lake Service, Azure SQL Warehouse, Azure SQL DB, Logic Apps, SQL, Automation Runbook (PowerShell), AAS, Power BI.
Created an OLAP cube by employing the processes below:
Worked on an automated framework for 20 projects with failure and data-delay alerts covering the end-to-end ETL through the OLAP cube refresh, along with a consolidated Power BI dashboard that tracks real-time and historical CIM progress, reducing manual intervention by 90%.
Tools: Azure Data Factory, Azure Logic Apps, Microsoft Power BI.
Hi, I'm Malvika. I've been working as a data engineer at Schlumberger, now known as SLB, for the last two years and seven months. In that time I've explored many areas of data engineering on Azure and gained a lot of experience and exposure. Beyond project work, I take part in other activities in the organization and help keep the company culture strong, and I'm enthusiastic about learning new things. I joined the data engineering team at SLB straight after my studies and was initially involved in automation work. One example: the monitoring team used to go into each data factory and each corporate information model (the Azure Analysis Services models) and do a lot of manual checks just to build a checklist confirming everything was running fine. Instead, we built a mechanism that automatically updates the status of the different flows in a Power BI report, driven by backend tables and web activities, and we automated the chain so that as soon as the fact and dimension tables are loaded, the Azure Analysis Services (OLAP cube) refresh kicks off. After that I moved on to development projects. My first project covered SLB's bank accounts and related information; we built a model for it based on the business requirements, and I used Data Flows for the ETL transformations. The next project was a financial reporting dashboard, also built on an Azure Analysis Services cube; there I was exposed to Databricks, using PySpark and SQL for the ETL, which helped me learn more of the Azure stack. I keep upskilling so I can contribute better in my roles, and right now I'm looking for opportunities to explore beyond what I've done so far and to contribute with the knowledge I've acquired and am still acquiring. I'm confident I can adapt to different teams and technologies involving data engineering techniques like data modeling and ETL. That's it from my end. Thank you so much for listening.
What I would do here is make sure that every pipeline run is recorded somewhere, say in a configuration table. Before a pipeline starts, we check whether it has already run for the given run ID. If it has not, we run it and insert an entry recording that the run for that pipeline, in that data factory, has completed; if it has already run, we skip it. That way, in case of intermittent issues, there are no repeated runs and resources are saved: only one run happens for a particular run ID even if there is a failure along the way. For data availability, before each run we can also check the maximum timestamp of the data available in the source and compare it with the maximum timestamp already loaded in our staging tables, for example in Azure SQL Data Warehouse. Only if newer data is available do we let the load happen; otherwise we avoid reloading data that is already present. So when the process restarts after a failure, pipelines that already ran are skipped, only new data is picked up, and there are no unnecessary runs.
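As a rough illustration of the run-log and watermark checks described above (not the exact implementation used on the project), here is a minimal Python sketch using pyodbc; the table and column names (pipeline_run_log, modified_ts, src.orders, stg.orders) are assumptions made for the example.

```python
# Minimal sketch of the run-ID log plus watermark check described above.
# All object names are illustrative placeholders.
import pyodbc

def should_run(cursor, pipeline_name: str, run_id: str) -> bool:
    """Skip the pipeline if this run_id has already completed successfully."""
    cursor.execute(
        "SELECT COUNT(*) FROM dbo.pipeline_run_log "
        "WHERE pipeline_name = ? AND run_id = ? AND status = 'Succeeded'",
        pipeline_name, run_id,
    )
    return cursor.fetchone()[0] == 0

def new_data_available(cursor, source_table: str, stage_table: str) -> bool:
    """Only load when the source watermark is newer than what staging already holds."""
    cursor.execute(f"SELECT MAX(modified_ts) FROM {source_table}")
    source_max = cursor.fetchone()[0]
    cursor.execute(f"SELECT MAX(modified_ts) FROM {stage_table}")
    stage_max = cursor.fetchone()[0]
    return stage_max is None or (source_max is not None and source_max > stage_max)

def run_pipeline(conn_str: str, pipeline_name: str, run_id: str) -> None:
    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()
        if not should_run(cur, pipeline_name, run_id):
            return  # already ran for this run_id -- no repeat work
        if not new_data_available(cur, "src.orders", "stg.orders"):
            return  # nothing new to load
        # ... copy / transform steps would go here ...
        cur.execute(
            "INSERT INTO dbo.pipeline_run_log (pipeline_name, run_id, status) "
            "VALUES (?, ?, 'Succeeded')",
            pipeline_name, run_id,
        )
        conn.commit()
```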
With the time travel mechanism, different versions of the table are stored, so we can use a previous version to restore the data in case of a failure or data loss. We can also add checks based on the approximate amount of data expected in each incremental load: if that volume is not met, we restore the previous version. The same applies to dimension tables. Compared to fact tables, dimension tables usually don't hold huge volumes, so we know roughly how many rows an incremental load should bring and how many rows should already exist; if that minimum is not met, we restore the previous version of the dimension or fact table and don't expose the current data to users. We can also send a notification mail indicating that a data loss or data mismatch has been detected and that the volume doesn't match the threshold, so someone can go and check what happened. Highlighting these cases helps us understand when there is data loss and gives confidence that the process itself has recovered the data, so the current data available to users is free of data loss.
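A minimal sketch of that rollback idea, assuming a Databricks/Delta Lake setup where time travel and RESTORE are available; the table name and the row-count threshold are illustrative, not taken from the project.

```python
# Sketch: roll a Delta table back to its previous version when a load looks too small.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TABLE = "finance.dim_account"        # placeholder table name
MIN_EXPECTED_ROWS = 100_000          # rough lower bound expected after a healthy load

current_count = spark.table(TABLE).count()

if current_count < MIN_EXPECTED_ROWS:
    # Look up the table history; the first row is the most recent version.
    history = spark.sql(f"DESCRIBE HISTORY {TABLE}").select("version").collect()
    if len(history) > 1:
        previous_version = history[1]["version"]  # version written before the suspect load
        # Roll the table back so users keep seeing the last good data.
        spark.sql(f"RESTORE TABLE {TABLE} TO VERSION AS OF {previous_version}")
        # In the real pipeline an alert mail / Logic App call would fire here.
        print(f"Row count {current_count} below threshold; restored version {previous_version}")
```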
Yes, we can do this by comparing different versions, with a threshold on how far back we compare. The amount of data changed in each run can be tracked in a Power BI report with a threshold on it: if the data suddenly doubles, or there is a sudden drop, that gets highlighted in the report, and we can use it to track the historic changes. The data factory can be designed so that the number of rows inserted or updated in each run is calculated and stored in a separate table, and the Power BI report is connected to that table. Based on the change in data volume or in the number of updates, we can detect changes across historical runs, and if a run goes beyond the threshold we can send an alert. The same information can also be used for auditing, to see how the data has changed over time.
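To make the idea concrete, here is a hedged PySpark sketch of logging per-run row counts into an audit table that a Power BI report could sit on top of; the audit table name, its schema, and the 50% change threshold are assumptions, not details from the answer.

```python
# Sketch: record each run's row count and flag large swings between runs.
from datetime import datetime
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

AUDIT_TABLE = "monitoring.etl_run_audit"  # assumed to exist already
CHANGE_THRESHOLD = 0.5                    # alert if the count moves by more than 50%

def log_and_check(table_name: str, run_id: str) -> None:
    current_count = spark.table(table_name).count()

    # Previous run's count for the same table, if any.
    previous = (
        spark.table(AUDIT_TABLE)
        .filter(f"table_name = '{table_name}'")
        .orderBy("logged_at", ascending=False)
        .limit(1)
        .collect()
    )

    # Append the new measurement; the Power BI report reads from this table.
    spark.createDataFrame(
        [Row(table_name=table_name, run_id=run_id,
             row_count=current_count, logged_at=datetime.utcnow())]
    ).write.mode("append").saveAsTable(AUDIT_TABLE)

    if previous:
        prev_count = previous[0]["row_count"]
        if prev_count and abs(current_count - prev_count) / prev_count > CHANGE_THRESHOLD:
            print(f"ALERT: {table_name} changed from {prev_count} to {current_count} rows")
```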
For high-volume data, the first thing to consider is the number of partitions used while the data is being processed, and whether the volume can be reduced by using an incremental load instead of loading the whole data set. We also have to make sure processing happens in parallel rather than sequentially: if the data is organized by year-month, we can process all the year-months in parallel instead of one after another. Another thing to consider is Kryo serialization, where the data is serialized into a compact binary form before it is taken from the source into the staging tables, so smaller, more compact chunks are transferred and the data is processed much sooner. So the main options that come to mind are partitioning the data and processing the partitions through parallel pipelines rather than sequential ones, and using Kryo serialization for the transfer. We can also use two tables: load and transform into a temporary table while the online table stays available for other processes, and then switch the online table and the temporary table (or just copy the final data across) once the transformation is done, since the transformation is what takes most of the time. That way no table lock occurs to slow the process down. Those are the things at the top of my mind right now.
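A short PySpark sketch of the serialization and partitioning ideas above, under assumed paths and column names (year_month, load_ts); the watermark value would normally come from a control table, and the table-switch step would live on the SQL side rather than in this code.

```python
# Sketch: Kryo serialization, incremental filtering, and month-wise partitioning.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("high-volume-load")
    # Kryo keeps serialized/shuffled data more compact than the Java default.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

last_watermark = "2024-01-31"  # would normally come from a control/config table

incremental = (
    spark.read.format("delta").load("/mnt/raw/transactions")
    .filter(f"load_ts > '{last_watermark}'")   # incremental load, not a full reload
    .repartition("year_month")                 # spread months so they process in parallel
)

(
    incremental.write
    .format("delta")
    .mode("append")
    .partitionBy("year_month")
    .save("/mnt/stage/transactions")
)
```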
Okay, migrate an existing data model to Snowflake while ensuring minimal downtime. When we do the migration, we start in the lower environments, and before switching from the existing data model to Snowflake we make sure it has been properly tested there. We also send proper email notifications to the users. One approach is to run both models in parallel initially, and verify that the relationships between the different tables are intact and the data loads are happening correctly. All of these checks can be done, but I'd like to learn more about this afterwards, since I don't have an in-depth understanding of Snowflake migrations yet and would like to explore the topic further.
What I can see here is that in our CI/CD process we made sure changes are first deployed to UAT, and only then deployed to production. If a change moves straight to production, that can cause a lot of issues.
Here we are going to the orders table and updating the status to Dispatched for every order that is placed, so the whole data set is taken into consideration and everything with a placed order is changed to Dispatched. And we are not committing these changes, which means they are not actually finalized. While the data is still being processed, if other statements run against the table without it being locked, incorrect data might get copied. That is the main impact I can see in this.
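As an illustration of how the same update could be made safe, here is a hedged Python/pyodbc sketch with an explicit transaction, a scoped WHERE clause, and rollback on failure; the connection string, table, and column names are placeholders.

```python
# Sketch: scope the update and commit or roll back explicitly so other sessions
# never read half-applied changes. All names are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=orders_db", autocommit=False)
try:
    cur = conn.cursor()
    # Only move orders that are actually in the 'Placed' state.
    cur.execute(
        "UPDATE dbo.orders SET status = 'Dispatched' WHERE status = 'Placed'"
    )
    conn.commit()      # make the change visible only once it is complete
except pyodbc.Error:
    conn.rollback()    # leave the table untouched if anything fails mid-way
    raise
finally:
    conn.close()
```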
I think here we can have two different pipelines: one takes the real-time data from Kafka or a similar source, and the other handles the batch processing. If we need an integration between the two, we keep them as separate pipelines, one consuming data continuously and the other triggered on a schedule, and then merge the data together, adding one extra flag column that says whether a record came from the batch process or the real-time stream. For the historic data, we can first move it into the table that will receive the real-time streaming data, and then load the real-time streaming data on top of that existing data in the same table.
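A rough PySpark sketch of that two-pipeline layout, assuming Kafka for the real-time feed and Delta tables for storage; the topic name, message schema, and paths are illustrative assumptions.

```python
# Sketch: batch load of historic data plus a streaming append, both tagged with a source flag.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, lit
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

# Batch pipeline: load the historic data once (or on a schedule) with a flag.
historic = (
    spark.read.format("delta").load("/mnt/raw/orders_history")
    .withColumn("source_type", lit("batch"))
)
historic.write.format("delta").mode("append").save("/mnt/curated/orders")

# Streaming pipeline: continuously append real-time events with a different flag.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("o"))
    .select("o.*")
    .withColumn("source_type", lit("stream"))
)

(
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .outputMode("append")
    .start("/mnt/curated/orders")
)
```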
Okay, implement version control mechanisms in your Azure Data Factory pipeline deployments to prevent data loss. Initially the changes are made in our local feature branch; we then push them to the integration branch and on to the master branch, and from there they are deployed to UAT and then to production. With these different branches we can restore or revert changes. Maintaining the CI/CD repository this way ensures the previous versions are saved and can be restored. Another thing we can do is take a backup of the existing master branch before making changes, so that if something goes wrong we can restore from that backup, or revert to whichever branch was pushed to the upper environments before the particular change was made. That mechanism can be used in this case.