
I am currently working as a Data Scientist at Grainger Canada, deploying machine learning models for customer-facing applications and working towards productionizing LLMs. I hold a Master's in Data Science and two degrees in Mathematics, and I have worked as an Analytics Manager in the FinTech domain. I'm a Databricks Certified Machine Learning Associate.
I've developed machine learning models for recommending mutual fund schemes and used clustering methods to segment customers. Presently, I'm interested in MLOps and LLMs. I actively take part in Kaggle competitions to discover novel methods for tackling varied data problems. My philosophy is to practice what I have learned and to learn what I have not read before.
Data Scientist, Grainger Canada
Data Scientist, Grainger
Data Analytics Manager, SBI Funds Management Pvt. Ltd.
Data Analytics Manager, SBI Mutual Funds
SOLR
MLflow
Databricks
Streamlit
GitHub Copilot
Locust
Splunk
Grafana
CRM
Power BI
Google Colab
SQL Server
Google Analytics
ETL
Flask
Superset
Led the development and integration of end-to-end ML microservices, deploying production-ready models that powered Grainger Canada's web search using named entity recognition and text classification algorithms
Designed an LLM-based system to generate descriptions of website products, fine-tuning on existing high-quality descriptions, incorporating manual review for quality and improvement
Refactored the category prediction fastText model by migrating it from an on-premises server to the cloud through MLflow and Databricks, integrating version control and ensuring reproducibility, leading to a 37% reduction in model training time and an 8% enhancement in recall
Deployed an ML API that uses a GPT-3 powered model to assess parsed search queries from an established NER model and flag incorrect query labels, notably reducing the time needed to identify and correct labelling errors
Mentored a team of two junior data scientists and successfully reduced tech-debt of existing ML pipelines by 80%
Executed health check scripts and stress testing for REST APIs using Locust, implemented efficient logging with Splunk, and utilized Grafana dashboards to monitor model drift and performance
Fine-tuned BERT for classifying the sentiment of 20,000 user reviews, integrating the results into the internal CRM tool and improving the support team's response time to negative feedback
Implemented clustering and segmentation for over 10 million investors and 37,000 brokers using K-Means and DBSCAN to send targeted in-app notifications, increasing app usage by 25%
Managed a team of two data analysts in writing efficient SQL queries and reducing time spent on building reports
Created an item-item collaborative filtering model to recommend products to customers in the broker platform for upselling and cross-selling schemes, leading to an 11% increase in digital sales
Built Power BI visualization reports with data from SQL Server & Google Analytics within an ETL framework
Generated a reproducible machine learning pipeline in Google Colab using computer vision algorithms to detect head impact events, combining object-detection and activity-recognition models
Using YOLOv5 & ResNet 3D, manual video analysis time decreased by 60%, with a recall score of 0.91
Hi, I'm Rohit. I've been working as a data scientist at Grainger for the past 2 years, and my total work experience is 5 years, all in the data science domain. My academic background is in mathematics: I have a master's in mathematics and a master's in data science as well. My interests lie in both natural language processing and computer vision. In my current job I'm working on improving the search experience of our customers. In my previous job I worked on a plethora of machine learning models, ranging from customer sentiment analysis to recommendations and customer segmentation.
Okay. In my academic project, during my Master of Data Science, the problem was to detect head impacts between players in a sports video. It was a twofold problem: first, we had to detect helmets in the video frame by frame, and second, we had to detect whether two helmets interact and whether that interaction is a head impact or not. So it was a combination of two computer vision solutions: YOLOv5, which detected the helmets, and a ResNet 3D model, which detected whether there was an impact. We primarily used PyTorch and torchvision.
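A minimal sketch of that two-stage approach, assuming pretrained YOLOv5 weights pulled from the ultralytics hub and an r3d_18 backbone whose final layer is swapped for a binary impact/no-impact head; the clip length and preprocessing details are illustrative:

```python
# Stage 1: YOLOv5 finds helmets per frame; Stage 2: a ResNet 3D classifier
# judges whether a short clip around a candidate contact contains an impact.
import torch
import torchvision

# Helmet detector (pretrained YOLOv5 from the ultralytics hub).
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Activity-recognition backbone; final layer swapped for binary impact/no-impact.
impact_model = torchvision.models.video.r3d_18(weights="DEFAULT")
impact_model.fc = torch.nn.Linear(impact_model.fc.in_features, 2)
impact_model.eval()

def detect_helmets(frame):
    """Run YOLOv5 on a single HxWx3 uint8 frame and return bounding boxes."""
    results = detector(frame)
    return results.xyxy[0]  # tensor of [x1, y1, x2, y2, conf, class]

def classify_impact(clip):
    """clip: float tensor of shape (3, T, H, W), e.g. 16 frames around a contact."""
    with torch.no_grad():
        logits = impact_model(clip.unsqueeze(0))           # add batch dimension
        return torch.softmax(logits, dim=1)[0, 1].item()   # probability of "impact"
```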
Okay. In terms of libraries, when you're doing anomaly detection with PyTorch, you'll basically use the torch library, and on top of that you can use metrics from scikit-learn. Since anomaly detection is a prediction problem, you have to assess whether recall or precision is more important for you, and based on that, instead of using just accuracy, you could use precision, recall, and even the F1 score to some extent. In terms of validation, we'd have to assess what kind of dataset we have. We can use a cross-validation strategy, where we make folds in the dataset: say we make 5 folds, so we train on 4 folds and validate on the 5th, and then rotate so a different fold is held out each time. Doing that 5 times means training the PyTorch model 5 times, and the validation score would be the average we take across the 5 folds.
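A rough sketch of that 5-fold setup with scikit-learn metrics; the `train_one_fold` helper is a hypothetical stand-in for the actual PyTorch training loop:

```python
# 5-fold cross-validation with precision/recall/F1 averaged across folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import precision_score, recall_score, f1_score

def cross_validate(X, y, train_one_fold, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = train_one_fold(X[train_idx], y[train_idx])   # fit on 4 folds
        preds = model.predict(X[val_idx])                    # evaluate on the held-out fold
        scores.append({
            "precision": precision_score(y[val_idx], preds),
            "recall": recall_score(y[val_idx], preds),
            "f1": f1_score(y[val_idx], preds),
        })
    # Report the average across folds as the validation score.
    return {k: float(np.mean([s[k] for s in scores])) for k in scores[0]}
```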
Okay. Vector databases can be really helpful when we want to derive better context for our machine learning problem. Say the model is trying to generate product descriptions. Instead of just using a plain large language model, it is a better idea to ground it in your own data, for example the product descriptions from the catalog you already have. Each product description could be a dictionary, and you store each individual dictionary in a vector database using some chunking strategy. When you pass in a prompt, you retrieve, say, the 10 best chunks or dictionaries from the vector database, which augment your prompt and give you better retrieval and better product descriptions.
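A minimal retrieval sketch of that idea, using sentence-transformers embeddings and plain numpy similarity in place of a real vector database; the model name and catalog entries are illustrative assumptions:

```python
# Embed product records, keep the vectors, and pull the top-k most similar
# records to ground the generation prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    {"sku": "A100", "text": "Cordless drill, 20V, two-speed gearbox, LED work light."},
    {"sku": "B200", "text": "Impact driver, brushless motor, 1/4 inch hex chuck."},
]
vectors = embedder.encode([item["text"] for item in catalog], normalize_embeddings=True)

def retrieve(query, k=2):
    """Return the k catalog entries whose embeddings are closest to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q                      # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [catalog[i] for i in top]

# The retrieved chunks are then prepended to the LLM prompt as context.
context = "\n".join(item["text"] for item in retrieve("20 volt cordless drill"))
prompt = f"Write a product description using this context:\n{context}"
```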
Which Python tools would be useful? I think it depends on the use case. Say you have a problem where you need it to be really fast and inference happens in real time rather than in batches, so the latency you can afford in the machine learning pipeline is very small. Then it makes sense to use a lighter approach to tokenization and text sentiment classification: for tokenization, text cleaning, and processing you might use spaCy, which is lightweight, and for sentiment analysis you could go with bag of words or a lighter word2vec-style strategy. But say the problem is not real time, it happens in batches, and you are not really worried about inference time. Then it makes sense to use a bigger model. A popular assortment of models is available on Hugging Face, so you can use the transformers library: load a pretrained tokenizer and a pretrained text classification model, for example via AutoTokenizer and AutoModelForSequenceClassification, and use those two tools for tokenization and sentiment analysis.
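A small sketch of both options, assuming the default models that spaCy and the Hugging Face `pipeline` helper ship with:

```python
# Lightweight path for latency-sensitive cases vs. a pretrained transformer
# when inference time matters less.
import spacy
from transformers import pipeline

# Lightweight path: rule-based tokenization, no neural components loaded.
nlp = spacy.blank("en")
tokens = [t.text for t in nlp("The delivery was late but the product works great.")]

# Heavier path: pretrained transformer for sentiment classification.
sentiment = pipeline("sentiment-analysis")  # downloads a default fine-tuned model
print(sentiment("The delivery was late but the product works great."))
```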
What is my approach for training on an imbalanced dataset? I think it makes sense to resample. Say the imbalanced dataset has two labels, positive and negative, and the negative one has the lower count: positives are 90% and negatives are 10%. There's a higher propensity that the model will be skewed towards the positive class. So either you downsample the 90% positive cases or you upsample the 10% negative cases; that helps reduce the gap between the positive and negative examples in the training set and reduces the imbalance. In terms of ensuring the model performance remains robust, you make sure the distribution of the features you feed into the network stays similar, and if you're using any randomization to pick samples, you fix a seed. For example, if you're using scikit-learn anywhere, set the random_state, say 24 or 25, and keep it fixed, so that any time you downsample or upsample you get the same training examples and the results stay reproducible. You can use PyTorch or TensorFlow on top of that to train the model.
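A minimal upsampling sketch with a fixed seed; the column names and 90/10 split are illustrative:

```python
# Upsample the minority class to match the majority count; random_state fixes the draw
# so the resampled training set is reproducible.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "text": ["good"] * 90 + ["bad"] * 10,
    "label": [1] * 90 + [0] * 10,          # 90% positive, 10% negative
})

majority = df[df.label == 1]
minority = df[df.label == 0]

minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced.label.value_counts())
```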
Given the Python code, what issue would prevent the SVC pipeline from running? I think one of the problems here might be that since you're applying PCA, it's important that all the values in your X_train dataset are numeric. So it makes sense to either restrict the PCA step in the pipeline to the numeric columns, because PCA cannot deal with categorical values, or, if you intend to use the categorical values, encode them into numeric fields with a label encoder and then run PCA on the entire set. It all depends on how the X_train set you have is structured.
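A sketch of that fix under assumed column names, with scikit-learn's `OrdinalEncoder` standing in for the label encoder so everything is numeric before PCA (one-hot encoding is another common choice for nominal categories):

```python
# Encode categorical columns to numbers before PCA, since PCA only handles numeric features.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

numeric_cols = ["price", "weight"]       # illustrative column names
categorical_cols = ["category"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),   # everything is numeric after this step
    ("pca", PCA(n_components=2)),
    ("svc", SVC()),
])
```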
Looking at this section of code, a basic error it won't be able to handle is when the maximum and the minimum values are the same. Then maximum minus minimum is 0, and that is used as the denominator here, so it would return an error. There might be cases where the maximum and the minimum of the dataset are the same, so we should try and catch that case here to avoid the error.
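A small sketch of the guard described above, assuming the code in question is a plain min-max normalization function:

```python
# Guard min-max scaling against a zero range (max == min makes the denominator zero).
import numpy as np

def min_max_scale(values):
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant column: return zeros instead of dividing by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)
```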
How should the Python pipeline adapt if visual data is added? You'd have to combine two pipelines here, side by side, since you're trying to extract insights from visual data. There are two ways to go about it. One: convert the visual data, say an image, into a flat vector using a CNN, and then append that to the NLP embedding you have for the textual dataset. The other: use a vision LLM, where the vision transformer generates descriptions of the images; you use that first pipeline to produce descriptions from the images and then append them to the textual data you have, so you end up with a string of textual data plus text derived from the visual data. Then it becomes a single NLP problem, which you can use to generate a summary, assess the mood, or anything else you want to do with it. So there are two approaches: convert the image or other visual data into a flat vector and append it to the NLP embedding, or, which I would say is the better idea, generate a description of the image, as long as you want it to be, and append it to the textual data. Then you have a single piece of textual data, and it becomes much easier to build one NLP pipeline for it.
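A minimal sketch of the first approach, flattening CNN image features and concatenating them with a text embedding; the choice of encoders and the resulting dimensions are illustrative assumptions:

```python
# Fuse a flattened CNN image vector with a sentence embedding into one joint feature vector.
import torch
import torchvision
from sentence_transformers import SentenceTransformer

# Image branch: ResNet-18 with the classification head removed -> 512-d feature vector.
resnet = torchvision.models.resnet18(weights="DEFAULT")
image_encoder = torch.nn.Sequential(*list(resnet.children())[:-1])
image_encoder.eval()

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-d sentence embedding

def fuse(image_tensor, text):
    """image_tensor: (3, 224, 224) normalized image; text: raw string."""
    with torch.no_grad():
        img_vec = image_encoder(image_tensor.unsqueeze(0)).flatten(1)   # (1, 512)
    txt_vec = torch.tensor(text_encoder.encode([text]))                 # (1, 384)
    return torch.cat([img_vec, txt_vec], dim=1)                         # (1, 896) joint vector
```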
Can I illustrate why version control matters? Version control is extremely important if you want to fall back to a working state. Any time you make a major or minor release, version control helps you revert to a version that was actually working. There are cases when you are working in a remote environment and some changes might break the pipeline, so it is better to have a fallback mechanism: a previous version of the code that you have in GitLab, which you can pick up and rebuild the entire pipeline from, because it was already working. It is also very important to have plenty of integration and unit tests attached, so you can confirm that the previous pipeline was working before you move on to the next one.
How would I ensure the model performs in real time on streaming data? To ensure real-time performance when you're using streaming data, it matters a lot how you store that data, so that it becomes easier to retrieve it as well. If you're using a vanilla search engine, there might be cases where you have to search for the answer in a much larger space. A better approach would be to use an NLP algorithm on top, such as entity recognition or multi-label classification, so that you can recognize valuable signals in the query that help the search engine narrow the space it has to scan and keep the algorithm running in real time.
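A small sketch of that idea, assuming spaCy's pretrained `en_core_web_sm` pipeline is installed; the extracted labels and the way they map to search filters are illustrative:

```python
# Run lightweight NER on the incoming query and use the extracted entities as
# structured filters so the search backend scans a smaller space.
import spacy

nlp = spacy.load("en_core_web_sm")   # small pretrained pipeline with an NER component

def parse_query(query):
    """Extract entity labels from a search query to use as structured filters."""
    doc = nlp(query)
    return {ent.label_: ent.text for ent in doc.ents}

filters = parse_query("DeWalt cordless drill under 200 dollars")
# e.g. might yield {"ORG": "DeWalt", "MONEY": "200 dollars"} -> passed to the search backend
```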