
Senior Associate II, Data Science at Kyndryl Solutions
Software Engineer II at Terralogic Software Solutions
System Engineer at Huawei Technologies
Project Engineer Intern at MD Electricals
Associate Engineer at Tata Communications
Customer Service Executive at Tata Communications Transformations Services
MySQL

Oracle Cloud

Azure Machine Learning Studio

GitHub
Hi, good morning. I have 3+ years of experience in data science and AI, covering machine learning, deep learning, and end-to-end deployment. I did my graduation at university, a Bachelor of Engineering, and I passed out in the 2019 batch. Then I started my career at Tata Communications as an Associate Engineer. There I worked on the chatbot development team as an NLP data analyst, analyzing data with NLP techniques like text summarization, machine translation, and NER, and all the related tasks.
Then I shifted to TCTS, which is a subsidiary of Tata Communications, as a Customer Service Executive coordinating with the project delivery department. I had to make sure the data architect team's work went smoothly, and I had to detect and fix the bugs that came into project delivery.
Then I switched to Huawei Technologies, where I worked on a prediction system. The product is a network management system: when a network fiber goes down, an alarm should be raised. Based on that, we built one model, the prediction system, that predicts the alarms from streaming data and past data.
My last company was Terralogic Software Solutions, where I was a Software Engineer II and handled the AI-related tasks. They have their own HRM system, so I analyzed their data and built machine learning models to improve it. I also worked on the customer chatbot and voice system. Starting from a small dataset, we deployed the chatbot using NLP techniques and models like GPT-3, GPT-3.5, and GPT-4. Based on that, we are trying to improve the chatbot and grow our model.
So, basically, regularization techniques are implemented in a neural network to combat overfitting and improve model generalizability. There are several techniques. First, there are L1 and L2 regularization. L1 regularization is the lasso: it adds the sum of the absolute values of the weights to the loss function. This encourages sparsity, driving some weights to 0 and effectively reducing model complexity. L2 regularization is what we call the ridge technique: it adds the sum of the squared weights to the loss function. It penalizes large weights, pushing them towards 0 but not eliminating them entirely. Those are the main L1 and L2 regularization techniques. For the implementation, a Keras layer takes a regularizer argument, for example kernel_regularizer set to a regularizer object, and we choose the right hyperparameter, meaning lambda, the regularization strength, through techniques like grid search.
Then dropout. It randomly drops neurons during training, forcing the network to rely less on individual features and learn more robust representations. It is implemented by setting the dropout rate in a dropout layer, for example Dropout(0.2), and we choose the dropout rate depending on the problem and network complexity.
There is also early stopping. For that, we monitor the performance on a validation set during training and stop training if validation performance doesn't improve for a defined period. This prevents the model from memorizing the training data, so it generalizes better. We use the EarlyStopping callback in Keras with an appropriate patience value for a validation metric like accuracy or loss.
Then data augmentation. It artificially increases the diversity of the training data through techniques like image flipping, cropping, or adding noise, which makes the model more robust to variations in unseen data.
We also work with weight decay. It gradually reduces the magnitude of the weights during training, similar to L2 regularization but often implemented separately. It is implemented by setting the weight decay parameter in the optimizer, for example 0.01 if you use the Adam optimizer, and we choose the decay rate based on the dataset size and network complexity. So we have to find the optimal combination of techniques and hyperparameters for the model.
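To make the penalty terms and the patience rule concrete, here is a minimal framework-free sketch. The function names and sample numbers are made up for illustration; in practice the Keras regularizers and the EarlyStopping callback mentioned in the answer would do this work.

```python
def l1_penalty(weights, lam):
    """Lasso term added to the loss: lam * sum(|w|)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge term added to the loss: lam * sum(w^2)."""
    return lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical weights and validation-loss history, for illustration only.
weights = [0.5, -1.5, 0.0, 2.0]
print("L1 term:", l1_penalty(weights, lam=0.01))
print("L2 term:", l2_penalty(weights, lam=0.01))

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
print("stop at epoch:", early_stopping_epoch(val_losses, patience=2))
```

Note that the later improvement to 0.50 is never reached with a patience of 2, which is the trade-off the patience value controls.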
So, to reduce overfitting in a machine learning model using Python: Python offers a toolkit for addressing overfitting in machine learning models. There are some common approaches. First, data-centric techniques. Data augmentation artificially expands your dataset with random transformations like rotations, flips, and cropping. Feature engineering and selection means analyzing your features to identify the most relevant ones.
Then early stopping. With that technique you train your model with a separate validation set and stop training when validation performance starts to decline.
On the model side, there is regularization to avoid overfitting: penalize large parameter values in your model to prevent overfitting on specific aspects of the data. L1 and L2 regularization are common approaches, implemented easily with libraries like scikit-learn. Another technique is dropout, specifically for neural networks: randomly deactivate neurons during training, forcing the model to learn robust representations that don't rely on specific features. Keras offers dropout layers for this purpose.
Then model complexity reduction. Use simpler models with fewer layers or neurons; start with basic models and gradually increase complexity only if it is necessary.
Another is the ensembling technique, where we combine the predictions from multiple models that are trained differently, for example with different hyperparameters or training splits. It is used to ensure better generalization, and libraries like scikit-learn provide tools for bagging and boosting techniques.
So those are the main approaches to avoid overfitting in a machine learning model. But it is very important that the best approach depends on the specific dataset and model, so we can experiment with different techniques and evaluate their impact on our model performance to find the optimal approach.
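The ensembling idea mentioned above can be sketched as bagging with a very small base learner. This is only an illustration under stated assumptions: the base model is a one-variable least-squares line, and the helper names (`fit_line`, `bagged_models`, `bagged_predict`) and the synthetic data are invented for this example. A real project would more likely use scikit-learn's bagging and boosting tools.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample (all x equal): fall back to a constant
        return y_bar, 0.0
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def bagged_models(xs, ys, n_models=25, seed=0):
    """Fit one line per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all models in the ensemble."""
    return mean(a + b * x for a, b in models)


# Synthetic noisy data around y = 2x + 1, for illustration only.
noise = random.Random(42)
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + noise.gauss(0, 0.5) for x in xs]

models = bagged_models(xs, ys)
print("ensemble prediction at x=4:", bagged_predict(models, 4.0))
```

Averaging over resamples reduces the variance of the prediction, which is the generalization benefit the answer refers to.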
So, as per my knowledge, normalizing data directly within SQL queries for machine learning purposes is not a good approach, because it can lead to several issues: data leakage, inefficiency, limited techniques given SQL's functionality, and data integrity problems. But there are some approaches that we use. First, preprocess the data outside the SQL queries, using Python libraries: normalize the training and testing sets separately before feeding them into your machine learning pipeline. This ensures data integrity and allows for flexible and optimized calculations. Second, store the normalized data: if performance is critical, consider precalculating and storing normalized data in separate tables for faster access during model training and evaluation. However, remember to keep track of the normalization parameters. Third, database extensions: some database systems offer extensions or stored procedures specifically designed for data preprocessing tasks, but their options and capabilities vary. Machine learning emphasizes careful data handling and avoiding leakage. Normalizing within SQL queries might appear convenient, but it can compromise this principle and ultimately lead to suboptimal model performance.
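One common way to avoid the leakage described above is to compute the normalization parameters on the training split only and reuse them for the test split. The sketch below assumes simple z-score standardization; the function names and sample values are hypothetical. The same idea is what scikit-learn's scalers implement with separate fit and transform steps.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Compute (mean, std) from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant column, which would give a zero divisor.
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, params):
    """Apply previously fitted parameters; never refit on test data."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, for illustration only.
train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 20.0]

params = fit_standardizer(train)           # keep track of these parameters
train_scaled = standardize(train, params)
test_scaled = standardize(test, params)    # test data never influences params

print("parameters:", params)
print("scaled test values:", test_scaled)
```

Storing `params` alongside the model is the "keep track of the normalization parameters" step: the same values must be applied at inference time.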
So, as per my knowledge, SQL window functions are indeed powerful tools for performing calculations across related rows. We start by defining the window. The OVER clause specifies the window over which the function operates. That involves partitioning: define how to group rows logically using PARTITION BY, for example when calculating moving averages within each customer. Then ordering: specify the order of rows within each partition using ORDER BY. This determines the direction of the window, like chronological order for time series data.
Then we choose the function. There are functions like SUM, AVG, MIN, and MAX, which calculate aggregate values within the window for metrics like total sales or average price. Then ROW_NUMBER, RANK, and DENSE_RANK, which assign sequential ranks or positions to rows based on order or values within the window. Another one is LEAD or LAG, which access values from specific positions ahead of or behind the current row, useful for calculating differences or lags. Cumulative sums or averages are calculated from the beginning of the window up to the current row.
To specify the range, the ROWS or RANGE clause defines the extent of the window relative to the current row. A bound of n PRECEDING reaches the n rows before the current row, and a bound of n FOLLOWING reaches the n rows after the current row. ROWS BETWEEN m PRECEDING AND n FOLLOWING includes the rows from m positions before to n positions after. RANGE BETWEEN m PRECEDING AND n FOLLOWING is similar to ROWS, but specifies the extent in terms of the ORDER BY values, for example 2 days preceding.
If we take one example: SELECT customer_id, order_date, product_price, AVG(product_price) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_average_price FROM orders.
This query calculates the moving average product price for each customer, considering the current price and the 2 preceding orders within the same customer group. Window functions can be complex, so make sure you understand their logic and their impact on performance. Different window functions serve specific purposes, so choose the one that fits the task. Also, it can help to use a common table expression to precalculate window results, which improves readability.
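The moving-average query from the answer can be tried end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: the `orders` table layout and the sample rows are invented for illustration, and window functions require SQLite 3.25 or newer, which recent Python builds bundle.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, product_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-01-01", 10.0),
        (1, "2023-01-02", 20.0),
        (1, "2023-01-03", 30.0),
        (1, "2023-01-04", 40.0),
        (2, "2023-01-01", 100.0),
        (2, "2023-01-05", 200.0),
    ],
)

# Average of the current order and up to 2 preceding orders, per customer.
query = """
SELECT customer_id, order_date, product_price,
       AVG(product_price) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_average_price
FROM orders
ORDER BY customer_id, order_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the frame is defined with ROWS, the first order of each customer averages over itself only, and the window never crosses into another customer's partition.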
So, handling variable correlations is crucial when we are developing a multivariate linear regression model, because it can significantly impact the validity and interpretability of our results. We have to identify the correlated variables first. So, the correlation matrix: we calculate a correlation matrix to visualize the pairwise correlations between all the variables, and we look for high correlations, for example above 0.8 or 0.9, which might indicate redundancy. Scatter plots are also important: we create scatter plots between pairs of variables to graphically assess the nature of the correlation, linear or otherwise.
Then we have to understand the impact of correlation. With multicollinearity, high correlation can lead to unstable estimates, inflated standard errors, and difficulty interpreting individual variable coefficients. There is also the suppression effect, where one variable might mask the true relationship of another with the dependent variable.
Then we have to apply some strategies. Remove the highly correlated variables: if two variables are highly correlated and offer similar information, consider removing one based on domain knowledge or feature importance analysis. Combine the variables: create a new variable by combining highly correlated ones if it makes logical sense based on the context. Use dimensionality reduction: techniques like principal component analysis extract uncorrelated components from our variables to reduce the dimensionality. We can use regularization, that is L1 and L2 in the regression model, which penalizes large coefficients and reduces the effect of multicollinearity. Another is interpretation with caution: even with correlation addressed, be cautious when interpreting individual coefficients in highly correlated settings.
We also have to monitor the variance inflation factor to quantitatively assess multicollinearity severity, and evaluate the impact of the different strategies on our model performance. We can also use visualization techniques like partial dependence plots to understand the combined effect of multiple variables on the dependent variable. So handling variable correlation requires understanding these potential issues and applying the appropriate strategy based on our specific task, data, and research goals.
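The correlation check and the variance inflation factor mentioned above can be sketched without any libraries. The column names and values below are hypothetical. The VIF helper covers only the two-predictor case, where the R-squared from regressing one predictor on the other equals the squared Pearson correlation; with more predictors, a full regression per variable is needed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of {name: values}."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r_squared = pearson(x1, x2) ** 2
    return float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)


# Hypothetical predictors: x2 is almost exactly 2 * x1, x3 is unrelated.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
    "x3": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0],
}

matrix = correlation_matrix(data)
high = [pair for pair, r in matrix.items() if pair[0] < pair[1] and abs(r) > 0.8]
print("highly correlated pairs:", high)
print("VIF for x1 vs x2:", vif_two_predictors(data["x1"], data["x2"]))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the threshold is a convention rather than a strict rule.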
So the LEFT JOIN clause ensures that all customers are included in the result even if they have no sales on 2023-01-01. The condition b.sales_date = '2023-01-01' uses a proper date literal to compare the sales date. The a.signup_date <= b.sales_date condition filters for customers who signed up on or before the sales date. So the output of the Python code shows that the customer with ID 1 has sales of 100 on 2023-01-01, and their sign-up date satisfies the condition because it is equal to the sales date.
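The query under discussion is not part of the transcript, so the following is only a guess at its shape, run through Python's sqlite3 module. The table names, column names, and rows are hypothetical. Placing the date conditions in the ON clause rather than in WHERE is also an assumption: it is what keeps customers without a matching sale in the result with NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, signup_date TEXT);
    CREATE TABLE sales (customer_id INTEGER, sales_date TEXT, amount REAL);
    INSERT INTO customers VALUES (1, '2023-01-01'), (2, '2022-06-15'), (3, '2023-02-01');
    INSERT INTO sales VALUES (1, '2023-01-01', 100), (3, '2023-01-01', 50);
    """
)

# ISO-formatted date strings compare correctly as text in SQLite.
query = """
SELECT a.customer_id, a.signup_date, b.sales_date, b.amount
FROM customers AS a
LEFT JOIN sales AS b
       ON a.customer_id = b.customer_id
      AND b.sales_date = '2023-01-01'
      AND a.signup_date <= b.sales_date
ORDER BY a.customer_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Customer 2 has no sale and customer 3 signed up after the sale date, so both appear with NULL sales columns. If the same conditions were moved to a WHERE clause, those two rows would be filtered out and the join would behave like an inner join.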
So, when we are deploying a predictive model in a production environment, we have to choose the right state management approach, and that depends on several factors. There is model complexity: simple models with few parameters might not need complex state management. Another is the update frequency: frequently updated models need real-time, reactive state management. Another is scalability and high availability: consider distributed solutions for large-scale deployments. And integration with existing infrastructure: leverage existing tools and frameworks whenever possible.
There are some common state management approaches and their potential applications. Key-value stores, such as Memcached: they offer fast retrieval and high availability and are simple for static model parameters, so they are used for storing model coefficients, intermediate calculations, and user-specific state. Then distributed file systems, such as HDFS: they are scalable, durable, and handle large datasets, so they are suitable for model checkpoints, but they may not be optimized for frequent updates and can introduce latency. They are used for storing large models, training data, and historical predictions. Also database management systems, that is SQL databases like MySQL and PostgreSQL: they give structured storage and data querying capabilities, but they are less performant for frequent updates than key-value stores. They are used for storing model metadata, user-specific context, and model training logs. We also use model serving frameworks, such as TensorFlow Serving: it is designed for model inference, optimized for performance, and handles different model formats. It can be complex to manage because it requires specific expertise, but it is used for deploying the model itself, managing model versions, and handling inference requests.
There are also state management libraries such as DVC, used for versioning, experiment tracking, model debugging, and deployment pipelines. They track model training and deployment history and manage the different model versions. So we can use a combination of these approaches: key-value stores for model parameters, distributed file systems for storing models and large datasets, databases for structured data and logs, model serving frameworks for deploying and serving the model, and state management libraries for tracking. Apart from that, we need some additional security: implement robust authentication and authorization mechanisms to protect the model state. Then monitoring: we need to monitor state updates, model performance, and resource usage. Disaster recovery is also needed, so that we can recover from system failures.
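As a rough illustration of the key-value approach with version tracking, the sketch below keeps model parameters in an in-memory dictionary that stands in for a real key-value store. The class name, key layout, and metadata fields are all hypothetical; a production system would back this with an actual store and add the authentication and monitoring discussed above.

```python
import json
import time


class ModelStateStore:
    """Versioned model state on top of a key-value mapping (a dict stand-in)."""

    def __init__(self):
        self._data = {}  # string keys to JSON strings, as a key-value store would hold

    def latest_version(self, model_name):
        """Return the newest stored version number, or 0 if none exists."""
        return int(self._data.get(f"{model_name}:latest", "0"))

    def save(self, model_name, params, metadata=None):
        """Store params as a new version and return its version number."""
        version = self.latest_version(model_name) + 1
        record = {
            "params": params,
            "metadata": metadata or {},
            "saved_at": time.time(),
        }
        self._data[f"{model_name}:v{version}"] = json.dumps(record)
        self._data[f"{model_name}:latest"] = str(version)
        return version

    def load(self, model_name, version=None):
        """Load the params of a specific version, defaulting to the latest."""
        if version is None:
            version = self.latest_version(model_name)
        raw = self._data.get(f"{model_name}:v{version}")
        if raw is None:
            raise KeyError(f"no state for {model_name} version {version}")
        return json.loads(raw)["params"]


store = ModelStateStore()
store.save("churn_model", {"coef": [0.2, -1.1], "intercept": 0.5})
store.save("churn_model", {"coef": [0.3, -0.9], "intercept": 0.4}, {"note": "retrained"})

print("latest:", store.load("churn_model"))
print("rollback to v1:", store.load("churn_model", version=1))
```

Keeping every version under its own key makes rollback a simple read, at the cost of storage growth that a real deployment would need to prune.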
There are several approaches we can take to dynamically adapt a data visualization based on user data selection in Python, each with its own strengths and weaknesses. One way is using web frameworks and libraries like Dash, Plotly, and Bokeh. Then there is server-side rendering and streaming with Flask, Django, or FastAPI, which requires a deeper understanding of server-side programming. Another is prebuilt dashboards: Tableau, Power BI, Google Data Studio. They are easy to use and have a wide range of features. When we use Python, we get libraries like Matplotlib, Seaborn, or Plotly, and we can use those as well. If the data updates frequently, consider server-side rendering or event-driven approaches for faster updates. We also have to consider the complexity of the visualization: for simple charts we can use Matplotlib, and for more complex views, Plotly. There are also some JavaScript libraries for charts, such as Vega-Lite, but those don't come under Python. Plotly for Python is also powerful for interactive visualization, with an extensive range of charts, good documentation and community, and integration with data sources, and it allows custom layouts and interactions. Altair is also declarative, built on Vega-Lite, for interactive visualization and concise code. So we can use pandas for data manipulation and cleaning, NumPy for numerical computation, and Matplotlib for basic static visualization.
So, by leveraging Azure's ML services, we enhance our machine learning models. There are some ways we can leverage this. There is data management and storage: I use Azure Databricks to collaborate on data exploration, analysis, and model training in a unified platform. Then we perform the training and experimentation. For training, we use the Azure Machine Learning service, which trains and manages machine learning workflows in the cloud, to find the best-performing model on our dataset. For experimentation, with Azure Machine Learning we track, compare, and reproduce machine learning experiments for efficient model development. For deployment and serving, we use Azure Functions, which deploys lightweight models that respond to events or triggers in real time. Sometimes we also use Azure Kubernetes Service, which deploys our trained model as containerized microservices for scalable and secure inference. We use Kubernetes very rarely; mostly we use Azure Functions. Then we monitor performance in Azure Machine Learning studio: there we monitor the model performance and health metrics of our inference service, and we understand how our model makes predictions with explanations and insights. So those are basically the services we use, but it depends on the machine learning model and which services we take for our specific requirement.
So, integrating Power BI visualizations into a Python-based data analytics workflow can be achieved through several methods. There is the Power BI REST API: Power BI provides a REST API that allows developers to embed Power BI reports and dashboards into custom applications, including Python-based ones, so we can use the API to interact with Power BI, including accessing datasets and dashboards programmatically. We can use Power BI Embedded as well: it allows us to embed Power BI reports and dashboards directly into our applications, and we can use a Python SDK for Power BI Embedded to integrate these visualizations into our Python-based applications. Then there is Power BI Desktop: while Power BI Desktop itself is not directly compatible with Python, you can export data from Power BI Desktop to various file formats, for example CSV or Excel, and then import and analyze that data using Python libraries such as NumPy. Python also has powerful data visualization libraries like Plotly. We can extract data from Power BI, either through exporting or using the APIs, and then visualize it using these Python libraries within our Python-based data analytics workflow. We can use Azure Databricks also: Power BI dataflows allow you to prepare and transform data within Power BI, and then you can use Azure Databricks, which supports Python, for advanced analytics and machine learning tasks. We can then visualize the results back in Power BI. Also, we can use Jupyter notebooks for embedded analytics: we can embed Power BI reports directly into a Jupyter notebook and interact with them using Python code. So the choice among these methods depends on our specific requirements, considering factors such as data volume, real-time updates, licensing, and integration complexity. Based on those parameters, we can choose the method that we want.