
Software Engineer, Tata Consultancy Services
Software Engineer, Cognizant
Software Engineer, Cognizant Technology Solutions Pvt. Ltd

Skills: SQL, AWS S3, PuTTY, Jira, MS-Excel
Could you help me understand more about your background by giving a brief introduction to yourself?

Hi, this is Harsha. I have 5.2 years of total IT experience. In the initial days of my career I worked as an ETL developer. Then a migration project came into our assignment, we were trained in Snowflake, and I now have 3 years of Snowflake experience. I worked on migration projects for an American distributor, a healthcare client, and, most recently, Discover Financial Services. For the past year I have been working with DBT, a newer ELT tool in which we create models and perform the transformations.

In Snowflake I have worked with features like Time Travel, data sharing, table creation, materialized views, data loading and unloading, and zero-copy cloning. I have also worked on streams for change data capture and tasks for scheduling.

In DBT I have worked on creating models, and seeds for loading data: if we want to load a small portion of data, we can upload it from a file through the seed concept. I implemented SCD Type 2 using snapshots, and I have worked with the different materializations: table, view, and incremental. We also have tests: unique, not-null, column, and schema tests. I created the DBT documentation, we maintain the DBT project's versioning through Git, and we do code deployment in DBT itself. That is a brief overview of my background.
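A minimal sketch of the streams-plus-tasks pattern mentioned above (a stream for change data capture, a task for scheduling); the table, stream, task, and warehouse names are illustrative assumptions, not objects from those projects.

-- Stream captures inserts/updates/deletes on the source table
create or replace stream orders_stream on table raw_orders;

-- Task applies the captured changes on a schedule, skipping runs with no new data
create or replace task load_orders_task
  warehouse = transform_wh
  schedule  = '15 minute'
when
  system$stream_has_data('ORDERS_STREAM')
as
  insert into orders_target            -- target table assumed to have matching columns
  select order_id, order_amount, metadata$action
  from orders_stream;

alter task load_orders_task resume;    -- tasks are created suspended by default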
What are the fundamental differences between using DBT and traditional ETL tools in a Snowflake environment?

Using DBT (Data Build Tool) compared to traditional ETL tools in a Snowflake environment introduces several fundamental differences in approach, workflow, and capabilities. Here are the key differences.

SQL-centric transformation versus GUI-based ETL: DBT is SQL-centric. It focuses on SQL-based transformations, where analysts and data engineers use SQL to transform and model data directly within the data warehouse; the transformations are defined in SQL models and executed as SQL statements. Workflow management involves writing and managing SQL code in a version-controlled environment like Git, providing transparency and control over the transformation logic. Traditional ETL tools often use a graphical user interface to design data transformation workflows visually, with prebuilt components for data extraction, transformation, and loading. Their workflow typically includes designing data flows, scheduling jobs, and managing transformations inside the ETL tool's environment.

In-database processing versus external processing: DBT leverages Snowflake's in-database processing, executing SQL transformations directly within the data warehouse. This reduces data movement and takes advantage of Snowflake's scalability and performance for transformation tasks; transformations run close to the data, which optimizes performance. Traditional ETL tools often extract data from source systems, perform transformations externally, and then load the transformed data into the warehouse, which can require additional infrastructure and resources outside the data warehouse and lead to complexity and performance overhead.

Code reusability and version control: DBT promotes reusability of SQL code. Traditional ETL tools offer some versioning support and code reuse through components and templates, but the level of granularity and control over code versioning may vary.

DevOps and DataOps practices: DBT aligns with modern DevOps and DataOps practices, integrating with version control tools like Git and enabling continuous integration and continuous deployment. Traditional ETL tools may require separate practices and tools for version control, CI/CD, testing, and integration into broader DevOps and DataOps workflows, which takes additional effort and customization.
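As a concrete illustration of the SQL-centric point above, a dbt model is just a version-controlled SELECT statement that dbt materializes inside Snowflake. This is a minimal sketch; the model and source names (stg_orders, a 'raw' source with an 'orders' table) are illustrative assumptions.

-- models/stg_orders.sql (hypothetical example)
{{ config(materialized = 'view') }}

select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ source('raw', 'orders') }}   -- assumes a source named 'raw' is declared in a sources .yml
where order_date is not null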
What methods can you use to optimize Snowflake storage cost while ensuring data availability for querying?

Optimizing Snowflake storage cost while ensuring data availability for querying involves several strategies.

Data compression: Snowflake automatically compresses data before storing it, which reduces the storage footprint. It applies compression techniques such as run-length encoding and dictionary encoding to optimize storage.

Columnar storage: leverage Snowflake's columnar storage format, where data is stored column-wise rather than row-wise. This format improves compression ratios and reduces storage requirements, especially for analytic workloads.

Data retention policies: achieved through data archiving and data purging.

Storage and query optimization: apply clustering keys on big tables to improve query performance, partition large tables around frequently used queries, and do query pruning, i.e. optimize SQL queries to minimize the data scanned by using selective predicates and avoiding unnecessary joins and aggregations. Snowflake's query profiling tools can help identify inefficient queries.

Materialized views: use materialized views to precompute and store aggregations or frequently used datasets, reducing the need to recompute them repeatedly.

Cost monitoring and governance: monitor usage and billing reports in Snowflake, and define governance policies and best practices for data storage, access control, and storage usage monitoring.

Data lifecycle management: data tiering, automated workflows, and similar practices.
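Hedged SQL examples of the storage and performance levers mentioned above; the table, view, and column names (sales, staging_sales, daily_sales_mv) are illustrative assumptions.

-- Clustering key on a large, frequently filtered table
alter table sales cluster by (sale_date, region);

-- Shorter Time Travel retention on a staging table to reduce storage cost
alter table staging_sales set data_retention_time_in_days = 1;

-- Materialized view that precomputes a frequently used aggregate
create or replace materialized view daily_sales_mv as
select sale_date, sum(amount) as total_amount
from sales
group by sale_date;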
What are the critical considerations when converting an ETL process to ELT within the Snowflake environment?

When converting an ETL process to ELT within a Snowflake environment, there are critical considerations we have to address for a successful migration and optimal performance. We should consider the data volume and complexity, and the storage and compute resources.

Storage and compute resources: evaluate the storage requirements for loading raw data into Snowflake, considering factors such as data retention policies, compression, and data archiving strategies. Determine the compute resources needed for executing the transformations within Snowflake, considering Snowflake's virtual warehouse options and concurrency.

Data loading and ingestion strategy: use the COPY INTO command for bulk loading and high-performance data ingestion, and use Snowpipe for real-time, continuous ingestion pipelines.

Transformation strategies: SQL-based transformations using CTEs, temporary tables, stored procedures, and user-defined functions, plus parallel processing, utilizing Snowflake's parallel processing capabilities to execute transformations concurrently across the designed transformation workflows.

Other considerations: data governance and security, performance monitoring and optimization, resource scaling, change management, and testing. These are the things we should consider when converting an ETL process to ELT.
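A sketch of the two ingestion paths mentioned above, bulk loading with COPY INTO and continuous ingestion with Snowpipe; the stage, table, and pipe names (raw_stage, raw_orders, orders_pipe) are illustrative assumptions.

-- Bulk load from a stage
copy into raw_orders
from @raw_stage/orders/
file_format = (type = 'CSV' skip_header = 1);

-- Continuous ingestion: Snowpipe runs the same COPY whenever new files arrive
create or replace pipe orders_pipe auto_ingest = true as
  copy into raw_orders
  from @raw_stage/orders/
  file_format = (type = 'CSV' skip_header = 1);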
Describe an approach for schema evolution management in Snowflake when dealing with frequently changing source data structures.

Managing schema evolution in Snowflake when dealing with frequently changing source data structures involves implementing a flexible, iterative approach to handle schema changes seamlessly.

Schema versioning: implement version control for the database schema using schema migration scripts, and store the schema definition changes in version-control repositories. Adopt consistent naming conventions, based on timestamps or semantic versioning, and clearly document schema changes and their impact on downstream processes.

Automated schema management: automated scripting plus continuous integration and continuous deployment.

Schema evolution patterns: forward-compatible schema evolution strategies.

Data transformation and migration: data mapping and incremental data loading.

Data quality and validation: data profiling and schema testing.

Monitoring and alerting: schema change monitoring and performance monitoring.

Documentation and communication: document the changes and communicate them properly to the downstream teams that consume the data.
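A minimal sketch of the versioned migration script idea above, for an additive source change; the file name and column are illustrative assumptions.

-- migrations/V2024_03__add_discount_to_orders.sql (hypothetical)
alter table raw_orders
  add column discount_amount number(10, 2) default 0;   -- additive, backward-compatible change

comment on column raw_orders.discount_amount is
  'Added after the source system started sending discount values';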
When implementing a CI/CD pipeline for data processes, how would you ensure that data integrity is not compromised during deployment?

Ensuring that data integrity is not compromised during deployment in a CI/CD pipeline for data processing involves implementing rigorous testing, validation, and rollback mechanisms. How do we achieve this?

Automated testing: unit testing, integration testing, and regression testing.

Data validation: data profiling, data quality checks, and schema validation.

Rollback mechanisms: automated deployments and rollback scripts.

Version control: using Git.

Environment isolation: segregating the environments and masking data.

Monitoring and alerting: deployment monitoring and data quality monitoring while the tests are applied. These are the parts of a CI/CD pipeline for data processes.
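One way to wire the data-quality checks above into the pipeline is a dbt singular test: a SELECT that returns the rows that violate a rule, so any returned row fails the run and blocks promotion. A minimal sketch; the model and column names (fct_orders, revenue) are illustrative assumptions.

-- tests/assert_revenue_non_negative.sql (hypothetical)
select
    order_id,
    revenue
from {{ ref('fct_orders') }}
where revenue < 0      -- any row returned here fails the test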
In the following dbt model code snippet, we expect to create a transformed table with calculated revenue. However, the transformation is producing incorrect results. What is wrong with the logic, and how can it be corrected?

There are two problems. First, a syntax issue: the column names "order ID" and "total amount" contain spaces, which can cause syntax errors; column names with spaces should be enclosed in double quotes (or brackets, depending on the SQL dialect). Second, a calculation error: the revenue calculation is incorrect and should multiply the total amount by 0.1. So the fixes are to enclose the column names with spaces, "order ID" and "total amount", in double quotes, and to correct the revenue calculation by multiplying the total amount by 0.1.
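A hypothetical reconstruction of the corrected model (the original snippet is not shown in the transcript), illustrating the two fixes described: quoted column names and revenue computed as total amount times 0.1.

{{ config(materialized = 'table') }}

select
    "order ID"            as order_id,
    "total amount"        as total_amount,
    "total amount" * 0.1  as revenue        -- corrected calculation
from {{ ref('stg_orders') }}                -- upstream model name is an assumption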
Review this SQL code snippet in Snowflake, which is intended to return the number of unique customer IDs from the sales table. However, it is not executing correctly. What is wrong with the query, and how do you fix it?

The query uses UNIQUE, but we have to use DISTINCT here to get the unique customer ID records from the sales table. Applying the DISTINCT keyword instead of UNIQUE, for example SELECT DISTINCT customer_id FROM sales, will fix the issue.
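Since the question asks for the number of unique customer IDs, wrapping the corrected expression in COUNT is likely the intent; a sketch (the original snippet is not shown, and the column name is assumed):

-- Count of unique customer IDs
select count(distinct customer_id) as unique_customers
from sales;

-- Or, returning the distinct IDs themselves, as described in the answer above
select distinct customer_id
from sales;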
When designing Snowflake data warehouses, how do you approach balancing cost and performance across different warehouse sizes?

Balancing cost with performance across different warehouse sizes involves considering factors such as workload requirements, concurrency, query complexity, and budget constraints. We should understand the workload characteristics by analyzing the workload and doing query profiling. We right-size warehouses by selecting sizes based on performance requirements and use a scaling strategy. We do cost optimization through cost modeling and resource utilization, performance testing through benchmarking and load testing, and ongoing performance monitoring and cost monitoring. Finally, we do iterative optimization with continuous improvement and feedback loops. These are the steps we can take while designing Snowflake data warehouses.
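A hedged example of the sizing and scaling levers above: a right-sized warehouse with aggressive auto-suspend and bounded multi-cluster scale-out (multi-cluster requires the appropriate Snowflake edition). The name and values are illustrative assumptions, not recommendations.

create or replace warehouse reporting_wh
  warehouse_size    = 'MEDIUM'
  auto_suspend      = 60        -- suspend after 60 idle seconds to save credits
  auto_resume       = true
  min_cluster_count = 1
  max_cluster_count = 3         -- scale out only under concurrency pressure
  scaling_policy    = 'STANDARD';

-- Resize temporarily for a heavy batch window, then scale back down afterwards
alter warehouse reporting_wh set warehouse_size = 'LARGE';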
How would you automate the deployment of data models in Snowflake using DBT in conjunction with a CI/CD framework?

For automating the deployment of data models in Snowflake using DBT in conjunction with a CI/CD framework, we first have to set up the DBT project: initialize the project and define the Snowflake connection in the profiles.yml file. Then we implement the CI/CD workflow using version control with Git and configure CI/CD in a tool like Jenkins by providing a pipeline configuration file. The pipeline runs step by step: check out the code, install the dependencies, compile the dbt models, run the dbt tests, and then deploy to Snowflake. We have to handle environment-specific configuration through environment variables and parameterized deployments, handle errors and roll back using a rollback strategy, and do monitoring and reporting through pipeline monitoring and deployment reports. These are the pieces of using DBT in conjunction with CI/CD.
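For the environment-specific configuration step, one common dbt pattern is to branch on the built-in target variable inside a model so dev/CI targets process less data than prod. A minimal sketch; the source and date column are illustrative assumptions.

-- models/stg_events.sql (hypothetical)
select *
from {{ source('raw', 'events') }}
{% if target.name != 'prod' %}
  -- In dev/CI targets (as defined in profiles.yml), limit to recent data only
  where event_date >= dateadd('day', -3, current_date)
{% endif %}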
In what ways can DBT be integrated with Snowflake to streamline and enhance the data transformation processes?

DBT can be integrated with Snowflake to streamline and enhance data transformations in several ways.

Native Snowflake integration: a Snowflake adapter is available for DBT, which we connect through. DBT can also leverage Snowflake-specific features such as Time Travel, Fail-safe, and automatic clustering to optimize queries and workflows and help ensure data quality.

Incremental data loading: DBT supports incremental models by enabling incremental loading strategies in Snowflake. Incremental materialization is available in DBT, so we can update precomputed datasets in Snowflake incrementally.

Data warehouse automation, version control, and collaboration: through Git integration, plus code reusability within DBT.

Testing and documentation: automated testing, and generated DBT documentation, which helps in finding and fixing errors and issues.

Performance optimization: query profiling, materializations, and caching.

Monitoring and alerting: logging and monitoring of errors, and alerting and notification channels.

These are the ways we can integrate DBT with Snowflake to streamline and enhance the transformation process.
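A minimal sketch of the incremental materialization mentioned above; the model, key, and column names are illustrative assumptions.

-- models/fct_orders.sql (hypothetical)
{{ config(
    materialized = 'incremental',
    unique_key   = 'order_id'
) }}

select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already in the target table
  where order_date > (select max(order_date) from {{ this }})
{% endif %}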