
Bharath Shroff

Vetted Talent
Results-driven professional with 6 years of experience in AI, data science, and software engineering, consistently leveraging cutting-edge technologies to drive innovation. Proven expertise in automating financial and data processes, building scalable solutions, and delivering actionable insights for global stakeholders. Skilled in AI/ML, Python, RAG, Next.js, and cloud platforms such as Databricks and Azure. Adept at enhancing decision-making through advanced analytics, end-to-end application development, and agile methodologies, with a strong foundation in project management and client-focused solutions.
  • Role

    Data Scientist

  • Years of Experience

    6 years

  • Professional Portfolio

    View here

Skillsets

  • RESTful APIs - 5 Years
  • React JS - 2 Years
  • Scala - 1 Year
  • Next JS - 1 Year
  • Selenium - 2 Years
  • MLOps - 1 Year
  • LLMs - 1 Year
  • K-Means - 1 Year
  • Backend - 2 Years
  • Financial Reports - 1 Year
  • Node JS - 1 Year
  • Power BI - 2 Years
  • MySQL - 5 Years
  • Git - 4 Years
  • RAG
  • Data Engineering - 5 Years
  • Tableau - 1 Year
  • Reporting - 3 Years
  • Relational Databases - 5 Years
  • PyTorch - 1 Year
  • Python - 6 Years
  • SQL - 5 Years
  • PySpark - 5 Years
  • Cloud - 1 Year
  • Databricks - 5 Years
  • Odoo
  • Big Data - 5 Years
  • MLflow - 1 Year
  • JavaScript - 4 Years
  • React Native - 1 Year
  • Finance - 1 Year
  • AI - 3 Years
  • Data Warehousing - 5 Years
  • Azure - 2 Years

Vetted For

10 Skills
  • Python Developer (AI/ML & Cloud Services) - Remote (AI Screening)
  • Result: 66%
  • Skills assessed: GCP/Azure, Microservices, Django/Flask, Neo4j, RESTful APIs, AWS, Docker, Kubernetes, Machine Learning, Python
  • Score: 59/90

Professional Summary

6 Years
  • Aug, 2024 - Present 1 yr 2 months

    Contract Data Scientist

    MCSquared AI
  • Aug, 2024 - Oct, 2024 2 months

    AI Innovation Specialist - Finance

    Trilogy
  • May, 2022 - Jul, 2024 2 yr 2 months

    Full Time Data Scientist

    MCSquared AI
  • Aug, 2021 - Apr, 2022 8 months

    Full Stack Developer Volunteer

    Isha Foundation
  • Jun, 2019 - Jul, 2021 2 yr 1 month

    Associate IT Consultant

    ITC Infotech
  • May, 2018 - Jul, 2018 2 months

    RnD Intern

    DELL EMC
  • May, 2016 - Jul, 2016 2 months

    RnD Intern

    Computer Institute of Japan

Applications & Tools Known

  • Odoo
  • Apache
  • NumPy
  • WordPress
  • Palantir Foundry
  • Databricks
  • Azure Data Factory
  • Power BI
  • Next JS
  • LangChain
  • React Native
  • Git
  • DevOps
  • Selenium
  • PowerShell
  • Scala
  • Kaggle
  • Scrapy
  • SVM
  • Naive Bayes
  • Tkinter

Work History

6 Years

Contract Data Scientist

MCSquared AI
Aug, 2024 - Present 1 yr 2 months
    Led the team in building a Databricks pipeline feeding a map-view dashboard of lead-proximity hotspots around business-provided site locations, leveraging the Bing Maps API and third-party real-world data sources such as Citeline, HealthVerity, and IQVIA.

AI Innovation Specialist - Finance

Trilogy
Aug, 2024 - Oct, 2024 2 months
    Derived financial insights via an LLM chatbot built with a React frontend and an Express JS backend; the chatbot updated the RAG vector DB on new file uploads, reducing manual analysis time by an hour.

Full Time Data Scientist

MCSquared AI
May, 2022 - Jul, 2024 2 yr 2 months
    Deployed a machine learning survival model to production on Databricks, replacing the previous XGBoost model. Built on the medallion architecture, the pipeline retrains itself monthly on new data and automatically archives models or promotes them to production via champion-model comparison, using MLflow for model versioning and the C-score to evaluate model performance.

Full Stack Developer Volunteer

Isha Foundation
Aug, 2021 - Apr, 2022 8 months
    Developed a web application using the open-source, Python-based Odoo framework, streamlining processes and digitizing multiple forms that hundreds of visitors previously had to fill in by hand, saving hours of work for both visitors and staff.

Associate IT Consultant

ITC Infotech
Jun, 2019 - Jul, 2021 2 yr 1 month
    Deployed end-to-end modules using Git-based DevOps for continuous deployment across the four stages (DEV -> QA -> UAT -> PROD), ensuring seamless transitions and operational efficiency for MLOps.

RnD Intern

DELL EMC
May, 2018 - Jul, 2018 2 months
    Developed Python scripts for automated reporting, flagging approximately 100 high-priority reports daily and improving the efficiency of report management.

RnD Intern

Computer Institute of Japan
May, 2016 - Jul, 2016 2 months
    Helped improve the accuracy of multi-class email classification, achieving 70%+ accuracy.

Achievements

  • Football Secretary (IIT Hyderabad)
  • Inter IIT Football Captain
  • Participated in Table Tennis Inter-Departmental / Inter-Year Tournaments

Major Projects

6 Projects

Melanoma Classification

    Achieved an 85% AUC in identifying melanoma using Convolutional Neural Network (CNN) models.

Network Traffic Analysis (ITC Infotech)

Oct, 2020 - Oct, 2020
    Extracted insights by transforming 6 million+ rows of open-source Apache server access logs and visualizing them through plots showing traffic originating from 10 different countries. Done as part of PySpark training.
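
A minimal PySpark sketch in the spirit of this project, assuming logs in the Apache Common Log Format; the file path and regular expressions are illustrative:

```python
# Parse Apache access logs with PySpark and aggregate requests per host.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("access-logs").getOrCreate()
logs = spark.read.text("apache_logs/*.log")  # hypothetical path

parsed = logs.select(
    regexp_extract("value", r"^(\S+)", 1).alias("host"),
    regexp_extract("value", r"\[([^\]]+)\]", 1).alias("timestamp"),
    regexp_extract("value", r'"\S+\s+(\S+)', 1).alias("endpoint"),
    regexp_extract("value", r"\s(\d{3})\s", 1).alias("status"),
)

# Requests per host: the kind of aggregate behind the country-level plots.
parsed.groupBy("host").count().orderBy("count", ascending=False).show(10)
```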

Machine Learning Library from scratch

Aug, 2020 - Aug, 2020
    Implemented several ML algorithms using only NumPy (plus Python's math library), with the intention of developing a deep understanding of how they work: 3 regression models and 3 classification models, with no other existing libraries, plus 9 normalization algorithms for data standardization.
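
As an illustration of the approach, a minimal NumPy-only linear regression trained by gradient descent (a sketch in the project's spirit, not its actual code):

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.01, epochs=1000):
    """Learn weights w and bias b by minimizing mean squared error."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        error = X @ w + b - y
        w -= lr * (2 / n) * (X.T @ error)  # gradient w.r.t. weights
        b -= lr * (2 / n) * error.sum()    # gradient w.r.t. bias
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.5]) + 0.5 + rng.normal(scale=0.1, size=200)
w, b = fit_linear_regression(X, y)
print(w, b)  # approximately [3.0, -1.5] and 0.5
```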

Image classification of fruits

May, 2020 - Jul, 2020 2 months
    Multi-class classification of fruits from images, using a Kaggle dataset of 90,380 annotated images and pretrained models such as VGG, ResNet, AlexNet, and MobileNet to produce a mobile-deployable model.
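
A hedged sketch of the transfer-learning setup such a project might use, with a pretrained MobileNetV2 backbone; the class count and training details are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 131  # hypothetical number of fruit classes

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
for param in model.features.parameters():
    param.requires_grad = False  # freeze the pretrained backbone

# Replace the classifier head so it predicts the fruit classes.
model.classifier[1] = nn.Linear(model.last_channel, NUM_CLASSES)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```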

Tic-tac-toe Extended 2player

Apr, 2019 - Apr, 2019
    Implementation of an advanced, two-layered version of Tic-Tac-Toe in Python, currently for two human players. I learnt this variant from a friend, with whom I used to play it on the backs of our notebooks, and implemented it as a side project during college; adding an ML opponent is an ambitious future goal.

IITH Main Website

Jan, 2019 - Mar, 2019 2 months
    Built our college website from scratch using WordPress templating, which included integrating content from over 10 departments.

Education

  • Bachelor of Technology, Mechanical Engineering

    Indian Institute of Technology (IIT) Hyderabad (2019)

Certifications

  • Microsoft Certified: Azure Data Engineer Associate (DP-200, DP-201) | Microsoft | 2021

AI-interview Questions & Answers

Hi, my name is Bharath Shroff and I'm from Bangalore, Karnataka. I started my career as an Associate IT Consultant, where my responsibilities were essentially those of a data engineering role, and I worked with two clients. For the first client, I helped build an event-driven Azure Data Factory pipeline: every day a file upload would trigger a sequence of notebooks that took the raw data, applied transformations, generated analytics, and pushed the results to Power BI and Synapse Analytics for consumption by further stakeholders. The second engagement was mainly on Azure Databricks, creating a similar data pipeline.

After that, I worked at Isha Foundation for a considerable time, where I built a website that digitized a very manual process: every visitor to the Isha Yoga Center had to fill in a handwritten form, which used to take hours of work from the team and from the participants. We created a digital profile storing that information, integrated the different activities, such as accommodation and other programs, through their APIs, and built a common website where a visitor can simply book online. For this I used Python and Odoo, an open-source framework, which gave me a lot of full-stack exposure since I developed both the backend and the frontend.

Then I switched to MCSquared, where I worked as a data scientist, again with two clients. The first client had their own data platform, Palantir, where I worked on preparing Contour visualizations, essentially a POC on visualizations the stakeholders would be interested in; that also involved health checks on the data: data monitoring, data-drift monitoring, and similar KPIs. The second client was again on Databricks, and the work involved identifying data vendors to buy data from and, combined with the client's proprietary data, running competitive and other analyses to help grow their business.

In my latest project, the one I'm currently working on, we have built an LLM agent that you can ask questions; it generates SQL queries and fetches the answers from the required database. So yes, it's been a good journey with very varied experiences and tech stacks. Thank you.

How do you instrument and improve the reliability of a distributed task? On AWS, I'm not sure which service is the exact equivalent of Azure Data Factory, but I'm assuming AWS SageMaker would be a close match: it can orchestrate pipelines of notebooks, with AWS Glue jobs containing the actual Python machine learning and data-processing logic, which would help orchestrate and automate the whole pipeline.
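
A platform-agnostic sketch of the instrumentation side: structured logging plus retries with exponential backoff. The task function and its parameters are illustrative, not tied to any AWS service:

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_retries(task, *args, max_attempts: int = 4, base_delay: float = 1.0):
    """Run a flaky distributed task, logging each attempt and backing off on failure."""
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            result = task(*args)
            log.info("task=%s attempt=%d ok duration=%.2fs",
                     task.__name__, attempt, time.monotonic() - start)
            return result
        except Exception:
            log.exception("task=%s attempt=%d failed", task.__name__, attempt)
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```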

Redis cache is one of the industry-leading standards here, and it would help us drastically optimize the performance of any cloud platform. Edge caching goes further by storing certain relevant data on edge devices, giving near-real-time retrieval speed. And if the AI model itself is small enough to be hosted on the edge device, then the round trip, where each query goes back to the server, the server uses the AI model to generate the response, and serves it back, is eliminated entirely; minimizing the model size so it can be hosted on an edge device greatly reduces both that latency and the server load.
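
A minimal cache-aside sketch with redis-py; the local Redis instance and the compute_prediction model call are hypothetical:

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def compute_prediction(user_id: str) -> dict:
    # Hypothetical expensive model call.
    return {"user": user_id, "score": 0.87}

def get_prediction(user_id: str) -> dict:
    key = f"prediction:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: skip the model
    result = compute_prediction(user_id)
    cache.setex(key, 300, json.dumps(result))   # expire after 5 minutes
    return result
```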

When designing a low-latency API serving machine learning predictions, the user-interface and user-experience side matters too: streaming has been shown to improve perceived latency. As the response is being generated, start showing each word as it arrives, and once the whole response is complete, apply the final formatting; that is what the major chat UIs do. Using vector databases also helps speed up retrieval.
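
A minimal FastAPI streaming sketch; token_stream is a hypothetical stand-in for a real model's streaming generator:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream(prompt: str):
    # Hypothetical stand-in for a model's token-by-token output.
    for token in ["Streaming", " reduces", " perceived", " latency."]:
        yield token
        await asyncio.sleep(0.05)  # simulate per-token generation delay

@app.get("/generate")
async def generate(prompt: str):
    # Tokens are flushed to the client as they are produced, so the user
    # sees output immediately instead of waiting for the full completion.
    return StreamingResponse(token_stream(prompt), media_type="text/plain")
```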

Now, structuring a Python codebase with SOLID principles in mind: in an ML project it is important to accommodate flexibility in the data, in the training of the model, and in retraining as the data updates. What I have used is the medallion architecture of bronze, silver, and gold layers: the bronze layer contains the raw data; the silver layer contains the feature engineering and extraction, essentially all the features we want to feed into the machine learning model; and the gold layer holds the filtered, model-ready data and is where the predictions are created. Beyond that, we obviously want a retraining process. I may be a bit biased toward MLflow, but Apache Airflow or similar orchestration would also work: retrain the model on new data, then use a champion-model comparison on metrics relevant to the particular use case to either archive the previous model or continue with the champion, whichever performs better. All of this builds a self-sustaining pipeline that maintains the data as well as the quality of predictions, and the accuracy improves because the more data an ML model has, the better it gets.
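
A hedged sketch of the champion/challenger promotion step using MLflow's model registry; the model name, scores, and stage-based workflow are illustrative (newer MLflow versions favor aliases over stages):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "survival-model"  # hypothetical registry name

def promote_if_better(challenger_version: str, challenger_score: float,
                      champion_score: float) -> None:
    """Promote the challenger to Production only if it beats the champion."""
    if challenger_score > champion_score:
        client.transition_model_version_stage(
            name=MODEL_NAME,
            version=challenger_version,
            stage="Production",
            archive_existing_versions=True,  # auto-archive the old champion
        )
    else:
        client.transition_model_version_stage(
            name=MODEL_NAME,
            version=challenger_version,
            stage="Archived",
        )
```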

What strategy would you employ to optimise a Python application's interaction with S3? Parallel processing is the major one: it handles the computation without causing blockages, which is essential for user experience. A Python application is sequential by default, so use multiprocessing or multithreading within the app itself so that each prediction (or each user) accesses S3 on its own thread, independently and in parallel. S3 natively supports parallel reads and writes, so parallelising the application would significantly optimise its interaction with S3.
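
A minimal sketch of parallel S3 reads with boto3 and a thread pool; the bucket and key names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # hypothetical
keys = [f"predictions/part-{i}.json" for i in range(32)]

def fetch(key: str) -> bytes:
    # Each call runs on its own thread; S3 handles parallel reads natively.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

with ThreadPoolExecutor(max_workers=16) as pool:
    bodies = list(pool.map(fetch, keys))
```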

Find any incorrect parts of the following query for a knowledge graph (SELECT ?property ?value ...). I have worked mostly with SQL, so I don't know graph query languages well, but `?property ?value` is not valid SQL, at least. The backslash-escaped quotes don't make sense either; that is not correct Python syntax, and plain triple quotes would do. As for the query itself, I'm not sure the commas belong there, and the WHERE clause doesn't state what the actual condition should be. So the query doesn't look right to me.
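
For contrast, the `?property ?value` pattern belongs to SPARQL rather than SQL; a minimal valid version, sketched with rdflib under an assumed graph file and entity URI:

```python
from rdflib import Graph

g = Graph()
g.parse("knowledge_graph.ttl", format="turtle")  # hypothetical file

# ?property and ?value are SPARQL variables; no commas in the SELECT
# clause, and the triple pattern in WHERE supplies the condition.
query = """
SELECT ?property ?value
WHERE {
    <http://example.org/entity/42> ?property ?value .
}
"""
for prop, value in g.query(query):
    print(prop, value)
```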

Neo4j is basically a graph-based database framework, so it suits any use case that involves maintaining relationships in a node-and-edge representation, like a social media network where you have friends who are friends of friends, and so on: one node is connected to another node the way your friend is connected to another friend. Machine learning on this setup inherently knows about these relationships, so it can leverage similar nodes not only through individual node attributes but through the relationships as well. A usual table structure would require additional work to integrate the relationship aspect, since explaining how one row is related to another row is not straightforward to teach an ML model using a tabular or columnar structure.
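
A minimal friend-of-friend query sketch with the official neo4j Python driver; the URI, credentials, labels, and relationship type are illustrative:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof:Person)
WHERE fof.name <> $name
RETURN DISTINCT fof.name AS friend_of_friend
"""

with driver.session() as session:
    # Traverses two FRIEND hops to find friends of friends of one person.
    for record in session.run(CYPHER, name="Alice"):
        print(record["friend_of_friend"])
driver.close()
```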

...that can enhance ML prediction capabilities for a system designed with this strategy: Neo4j, like I said before, is a graph-based database, so implementing a knowledge graph on it is very straightforward. Assuming the use case is genuinely suited to a graph, Neo4j natively supports nodes and relationships, so the structure is easily captured by a machine learning model, which can immediately learn how the knowledge graph is organized and leverage it for predictions.

Scikit-learn: in the project I worked on, we initially used XGBoost alongside scikit-learn, but based on the use case a survival model was a much better fit. There is a companion library, scikit-survival, which we adopted to tailor-fit our use case; it just made sense instead of the traditional ML models, which are mostly suited to classification problems, or regression of course.
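
A minimal scikit-survival sketch: fitting a Cox proportional hazards model and scoring it by concordance index (the C-score mentioned elsewhere); the toy data is illustrative:

```python
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

# Survival targets are structured arrays of (event_observed, time).
y = np.array(
    [(True, 12.0), (False, 30.0), (True, 5.0),
     (False, 18.0), (True, 24.0), (False, 9.0)],
    dtype=[("event", bool), ("time", float)],
)
X = np.array([[1.0, 0.2], [0.3, 1.1], [2.2, 0.5],
              [0.1, 0.9], [1.5, 0.4], [0.7, 0.8]])

model = CoxPHSurvivalAnalysis()
model.fit(X, y)

risk_scores = model.predict(X)
c_index = concordance_index_censored(y["event"], y["time"], risk_scores)[0]
print(f"C-index: {c_index:.3f}")
```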

FastAPI, since I've worked with it, natively supports asynchronous execution, although there is a tricky part: if you manually declare an endpoint as an async function and it contains blocking code, it effectively becomes sequential, because it blocks the event loop. That was a major topic of confusion, which I recall being clarified in a talk at a PyCon in Ireland or somewhere, on how exactly to use this for asynchronous work. So basically you just define such functions as they are, without manually specifying async, and because FastAPI natively supports async it will automatically run them in a way that stays asynchronous. It is important to keep any API asynchronous so that one user's query does not block another user's query, and to optimize the server load and compute so there is no idle time for the CPU.
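
A small sketch of the distinction described above; fetch_blocking stands in for any blocking call, such as a synchronous DB client:

```python
import time
from fastapi import FastAPI

app = FastAPI()

def fetch_blocking() -> str:
    time.sleep(2)  # stands in for blocking I/O
    return "done"

@app.get("/good")
def good():
    # Plain `def`: FastAPI runs this in a threadpool, so the blocking
    # call does not stall other requests.
    return {"status": fetch_blocking()}

@app.get("/bad")
async def bad():
    # `async def` runs on the event loop; the blocking call here
    # stalls every other request for 2 seconds.
    return {"status": fetch_blocking()}
```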