
Divyansh Srivastav

Vetted Talent
Passionate DevOps Engineer with a decade of experience provisioning and maintaining cloud infrastructure, formulating and implementing cloud solutions, and setting up CI/CD pipelines.
  • Role

    DevOps Manager

  • Years of Experience

    10.4 years

Skillsets

  • DevOps - 10 Years
  • AWS (Amazon Web Services) - 10 Years
  • Azure - 8 Years
  • Azure DevOps - 7 Years
  • GCP - 4 Years
  • Cloud Infrastructure - 9 Years
  • Kubernetes - 7 Years
  • Docker - 8 Years
  • Terraform - 6 Years
  • Infrastructure as Code - 8 Years
  • Ansible - 5 Years
  • CI/CD Pipelines (architecting & implementing) - 10 Years
  • Jenkins - 8 Years
  • Git - 10 Years
  • Bash / Shell Scripting - 10 Years
  • SQL - 10 Years
  • Linux System Administration & Performance Tuning - 10 Years
  • Apache / Nginx / Caddy Web Servers
  • Serverless Framework
  • Prometheus
  • Grafana
  • Nagios
  • ELK Stack
  • Security - 5 Years

Vetted For

9 Skills
  • Senior Software Engineer - ML/AI Screening
  • Score: 79% (71/90)
  • Skills assessed: Kubeflow, Seldon, Spark, AWS, Docker, Kubernetes, Machine Learning, Problem-Solving Attitude, Python

Professional Summary

10.4 Years
  • Apr, 2024 - Present 1 yr 6 months

    Senior Technical Leader - DevOps

    Espire Infolabs Private Limited
  • Aug, 2023 - Mar, 2024 7 months

    DevOps Lead Engineer

    Accuknox India Pvt Ltd.
  • Dec, 2018 - Jul, 2023 4 yr 7 months

    DevOps Technical Lead

    Celestial Systems Pvt Ltd.
  • Jan, 2018 - Nov, 2018 10 months

    DevOps Engineer

    One.com India Private Limited
  • Sep, 2017 - Jan, 2018 4 months

    System Engineer

    Value First Digital Media Private Limited
  • Nov, 2016 - Aug, 2017 9 months

    Linux System Administrator

    CHI Networks
  • Aug, 2014 - Nov, 2016 2 yr 3 months

    Senior Analyst

    HCL Technologies Ltd.

Applications & Tools Known

  • Amazon Web Services
  • Azure
  • Kubernetes
  • Docker
  • CI/CD Tools
  • Terraform
  • Ansible
  • Bash
  • Prometheus
  • Grafana
  • ELK Stack
  • Linux
  • Apache
  • Nginx
  • Git
  • Jenkins
  • GitHub Actions
  • ArgoCD
  • Travis CI
  • Serverless Framework

Work History

10.4 Years

Senior Technical Leader - DevOps

Espire Infolabs Private Limited
Apr, 2024 - Present 1 yr 6 months
    Architecting and implementing Azure Cloud Infrastructure to enforce standards, manage compliance, and ensure Azure Well-Architected Framework adherence. Developing, optimizing, and maintaining Terraform code to provision and manage scalable Azure infrastructure using AKS, Azure Database, and other Azure services.

DevOps Lead Engineer

Accuknox India Pvt Ltd.
Aug, 2023 - Mar, 2024 7 months
    Led and managed the DevOps Team, actively participating in sprint planning sessions. Translated product requirements into DevOps solutions, ensuring alignment with business objectives.

DevOps Technical Lead

Celestial Systems Pvt Ltd.
Dec, 2018 - Jul, 2023 4 yr 7 months
    Understood customers' DevOps requirements and designed the complete DevOps workflow from the first commit to production.

DevOps Engineer

One.com India Private Limited
Jan, 2018 - Nov, 2018 10 months
    Maintained the IT infrastructure of the web hosting platform, handled configuration management, and planned and set up CI/CD pipelines.

System Engineer

Value First Digital Media Private Limited
Sep, 2017 - Jan, 2018 4 months
    Linux server administration and maintenance in production. Managed the Amazon Web Services cloud platform.

Linux System Administrator

CHI Networks
Nov, 2016 - Aug, 2017 9 months
    Linux server administration and performance tuning. Deployed LAMP stacks and the web hosting platform.

Senior Analyst

HCL Technologies Ltd.
Aug, 2014 - Nov, 2016 2 yr 3 months
    Learned about the ITIL process - Incident and change management. Resolved tickets and performed changes within the stipulated time.

Achievements

  • Implemented GitOps in multiple projects
  • Provisioning and managing Kubernetes clusters
  • Automation of Infrastructure provisioning via Terraform
  • Configuration management through Ansible
  • Linux System Administration & performance tuning
  • Translated product requirements into DevOps solutions, ensuring alignment with business objectives.
  • Led and managed the DevOps Team, actively participating in sprint planning sessions.
  • Conducted interviews to hire multiple DevOps positions, contributing to team expansion and talent acquisition initiatives.

Major Projects

1 Project

Multi-tenant SaaS application architecture

    Architected a multi-tenant design for a SaaS application, in conformance with the Azure Well-Architected Framework, using Azure Kubernetes Service.

Education

  • Bachelor of Technology in Instrumentation and Control Engineering

    SRM University (2014)

Certifications

  • HashiCorp Certified: Terraform Associate (003), certificate number: b9850132-3657-45f0-945c-415279080f65

  • Certified Kubernetes Administrator (CKA), certificate number: lf-q22q5yjxxh

  • Red Hat Certified Engineer (RHCE), certificate number: 160-039-657

AI-interview Questions & Answers

Could you help me understand more about your background and experience? Sure. My name is Divyansh Srivastav, and I have a decade of experience in the DevOps and cloud space. Over these ten years I have worked with a variety of companies: product-based companies such as ValueFirst Digital Media and my current organization, AccuKnox; one of the leading web hosting companies, One.com; and classical service-based companies such as HCL. Working in these companies has given me good exposure to and experience with DevOps tools, frameworks, and technology as a whole. I have also held management positions in my last two organizations. I led the DevOps team in my previous organization, where I built the team from scratch and grew it from 2 to 22, and I am leading the team in my current organization as well, where I am working to build out the DevOps framework and processes. I am proficient in designing the entire DevOps workflow from the first commit to production, in managing Kubernetes clusters, and in creating infrastructure architectures on the cloud for various business use cases, and I can also set up observability for microservices and infrastructure if needed. All in all, my experience spans both the technical and the managerial aspects: I can handle the hands-on technical work, and I can also lead a team and mentor junior DevOps engineers when needed. My detailed skill set is mentioned in my resume, which you can go through for a better understanding. Thanks.

Propose a logging strategy for a Python machine learning application on Kubernetes that balances detail with storage considerations. A logging strategy for any application is crucial, and for a Python-based application I would go with a centralized logging system: one server, or log aggregator, that the logs from all the application instances are shipped to. The ELK stack is a very efficient and powerful log aggregator, so we can set up an ELK server, install an agent on the instances where the Python application runs, and export the logs to the centralized server, where they are processed through Logstash, indexed in Elasticsearch, and viewed in Kibana. For exporting the logs there are different agents; we can use Beats or Fluentd, depending on which is more efficient for our use case, and we can evaluate that in a POC. In most of the cases I have seen with text log files, Beats works well, so I would go with that. For storage, it is better to offload all the logs to one place, the centralized server, and rotate them after a month or 45 days, depending on the retention policy in place. After rotation we can offload the logs to an S3 bucket or something of that sort to reduce the EBS storage cost if we are on AWS, or the disk storage cost if we are on Azure. So the strategy is simple: ship the logs from the servers to a centralized server, then rotate them after a certain interval to a cheaper archive tier.
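
As a small illustration of the application side of this strategy, here is a minimal Python sketch, assuming the service logs single-line JSON to stdout so a node-level agent such as Filebeat or Fluentd can ship the lines to the ELK stack; the logger name and log fields are illustrative assumptions, not taken from the original answer.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line, which log shippers parse easily."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


# Log to stdout: on Kubernetes the node-level agent picks this up and forwards it.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("ml-api")   # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("prediction served")       # example log event
```

Keeping each event on one JSON line avoids multi-line parsing headaches in Logstash and keeps the index mapping predictable.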

Develop a strategy to implement A/B testing of new Python machine learning models in a Kubernetes environment while ensuring minimal impact on production. It is a very good question. We can use an advanced deployment methodology such as canary deployment, and we can do it in combination with GitOps. If I were to implement this, I would use a GitOps workflow with ArgoCD as the tool and then implement canary-based deployment. In a canary deployment only a certain percentage of the traffic is exposed to the new version while the rest keeps running on the existing version; if the new version is working well, we gradually increase the percentage of traffic exposed to it. This can be done well with the Argo ecosystem: Argo Rollouts, used alongside ArgoCD, has built-in support for canary deployments, where we specify what percentage of traffic goes to the new version and what percentage stays on the previous one. So for A/B testing, going ahead with canary deployment makes sense.
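
To make the traffic-splitting idea concrete, here is a hedged Python sketch of weighted routing between a stable and a canary model. In a real canary rollout the split is enforced by the deployment tooling (for example Argo Rollouts) rather than application code, and the endpoint URLs and weight below are assumptions.

```python
import random

# Hypothetical model endpoints; in a real canary these would be two Kubernetes services.
STABLE_URL = "http://model-stable.default.svc/predict"
CANARY_URL = "http://model-canary.default.svc/predict"

CANARY_WEIGHT = 0.10  # 10% of traffic goes to the new model version


def pick_endpoint() -> str:
    """Return the endpoint for one request, honouring the canary weight."""
    return CANARY_URL if random.random() < CANARY_WEIGHT else STABLE_URL


if __name__ == "__main__":
    # Rough check that the observed split is close to the configured weight.
    sample = [pick_endpoint() for _ in range(10_000)]
    print("canary share:", sample.count(CANARY_URL) / len(sample))
```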

How can you optimize resource allocation in a Kubernetes cluster running heavy Python-based machine learning workloads without overprovisioning? The very first step is to analyze the requirements: benchmark the application and see how much CPU or GPU that particular workload needs. Once we have the benchmarking data, say we are working with TensorFlow and it needs one GPU to process one scan, then we know that with multiprocessing it would need, say, two or three GPUs at a time. Based on that kind of benchmarking, we can put resource requests into the configuration for those ML model pods, which lets us allocate resources without overprovisioning. We can also put limits, so that if a pod tries to go beyond, say, three or four GPUs, it is not allowed to. So in my opinion benchmarking the application is very important: know how much memory, CPU, and GPU the models need, and then configure the resource requests and limits based on that. For benchmarking we can use tools like Locust, and there are other tools that can help determine those numbers as well.
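
As one way to express those benchmarked requests and limits programmatically, here is a sketch using the official Kubernetes Python client; the container name, image, and CPU/memory/GPU figures are placeholders to be replaced with real benchmark results, not values from the original answer.

```python
from kubernetes import client

# Requests reserve capacity for the pod; limits cap what it may consume.
# The figures below are placeholders for actual benchmark data.
resources = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
    limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
)

container = client.V1Container(
    name="ml-inference",                              # hypothetical container name
    image="registry.example.com/ml-inference:latest", # hypothetical image
    resources=resources,
)
```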

Describe the process of monitoring the health of a Python-based machine learning application on Kubernetes and triggering alerts based on custom metrics. We can do this with Prometheus. We set up monitoring with an open-source tool like Prometheus, which gives us metric data such as CPU, memory, and other performance metrics, and then we send alerts based on those Prometheus metrics. In fact, autoscaling can also be driven by Prometheus metric data through the HPA, but that is a separate topic; the question here is simply how we monitor. So for this answer: we configure Prometheus, collect metric data from all the running Python instances and their pods using the appropriate Prometheus exporter, and set alerts based on that. We can also integrate Grafana to visualize those metrics, and alerting can be configured at the Grafana level too, so we can visualize the thresholds in panels and charts. There are also custom exporters that help expose custom metrics; for a Python machine learning application, for example, we might want to know how many models are currently in process or in the queue, or how much CPU and memory one model run takes. We can configure such an exporter, export that data to Prometheus, and build Grafana dashboards on top of it. I think this will work without any issues.
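
To illustrate the custom-metrics part, here is a minimal sketch using the prometheus_client library; the metric names, port, and simulated work are assumptions for illustration. Prometheus would scrape the exposed /metrics endpoint, and alert rules (or Grafana alerts) would be written against these series.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Custom metrics for a hypothetical ML inference service.
PREDICTIONS = Counter("ml_predictions_total", "Total predictions served")
QUEUE_DEPTH = Gauge("ml_request_queue_depth", "Requests currently waiting")
LATENCY = Histogram("ml_prediction_latency_seconds", "Prediction latency")


def handle_request() -> None:
    QUEUE_DEPTH.inc()
    with LATENCY.time():                            # observe how long the prediction takes
        time.sleep(random.uniform(0.01, 0.05))      # stand-in for real model inference
    QUEUE_DEPTH.dec()
    PREDICTIONS.inc()


if __name__ == "__main__":
    start_http_server(8000)                         # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```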

How would you leverage Kubernetes features to scale a Python-based machine learning inference workload efficiently? As I said, there are two levels of autoscaling in Kubernetes: pod-level autoscaling, horizontal or vertical, and cluster autoscaling. The parameters that decide when to scale matter, so once we have Prometheus in place and have the custom metric data, we can integrate it with the HPA and autoscale on a standard metric or a custom metric. Once that is configured, the pods will scale on their own, based on the scaling configuration and the conditions we put into it. That scales up to a limit, because every node can run only a certain number of pods; once the node threshold is reached we need the next level, which is the node or cluster autoscaler. For cluster autoscaling we have two options: the open-source Cluster Autoscaler, or something more sophisticated like Karpenter. Karpenter is a tool developed by the AWS team itself, and it manages autoscaling differently: instead of working through node groups, it interacts directly with the EC2 APIs and provisions instances itself. Along with autoscaling it also helps with cost optimization, because it consolidates nodes after a certain period of time, as defined in its configuration. So two levels of autoscaling have to be configured, horizontal or vertical pod autoscaling plus cluster autoscaling, and with that I think we can achieve a decent level of scaling.
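
As a sketch of the pod-level piece, an HPA can be created with the Kubernetes Python client; the deployment name, namespace, and thresholds below are assumptions, and scaling on custom Prometheus metrics would additionally need the autoscaling/v2 API plus a metrics adapter, which this sketch does not cover.

```python
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    api_version="autoscaling/v1",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="ml-inference-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="ml-inference",
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,   # scale out above 70% average CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa,
)
```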

Find the problem in this Dockerfile that could break the build or fail to leverage caching layers for a Python application. In this Dockerfile, first of all, requirements.txt is not copied to the image, yet we are running the command pip install -r requirements.txt. That by itself will break the Docker build, because requirements.txt will not exist inside the image at that point. That is the first problem I see with this Dockerfile. The best solution is to copy requirements.txt into the working directory first, then run pip install -r requirements.txt so that all the Python modules are installed, and only after that copy the rest of the code and build the application. That ordering also lets Docker cache the dependency layer, so the modules are not reinstalled on every code change. That is how the Dockerfile should be structured for a Python application; everything else seems fine.

Outline the process of converting stateless machine learning APIs in Python into stateful services in Kubernetes for complex processing needs. For converting any stateless service that has to be deployed inside a Kubernetes cluster into a stateful one, we would have to make use of StatefulSets. StatefulSets give each pod a stable, unique identity and ordered numbering, and they attach a persistent volume to every pod, so the pod retains its state even if it crashes. So StatefulSets have to be considered and configured for deploying these services as stateful inside the Kubernetes environment. Beyond that, it depends on the overall use case; as the description is very limited I cannot think of any other specific solution at the moment, to be very honest. But StatefulSets should be used for this purpose, and they can be combined with Services, or perhaps an Ingress, to expose those Python services. Under the hood, StatefulSets would have to be utilized.
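
As a rough sketch of that conversion with the Kubernetes Python client, the following builds a StatefulSet object; the names, image, and replica count are assumptions, and volumeClaimTemplates (omitted here) would give each pod its own persistent volume for retained state.

```python
from kubernetes import client

labels = {"app": "ml-api"}   # hypothetical selector labels

statefulset = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="ml-api"),
    spec=client.V1StatefulSetSpec(
        service_name="ml-api-headless",   # headless Service giving pods stable DNS names
        replicas=3,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="ml-api",
                        image="registry.example.com/ml-api:latest",  # hypothetical image
                    )
                ]
            ),
        ),
        # volume_claim_templates would be added here so each pod gets its own
        # PersistentVolumeClaim for state that must survive restarts.
    ),
)
```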

Design a fault-tolerant, highly available Kubernetes solution that serves Python-based machine learning models for critical, real-time applications. For a fault-tolerant, highly available Kubernetes cluster that serves Python-based machine learning models for critical real-time applications, I think we would first have to use GPU-based instances if the machine learning models require them, and the nodes should be spread across multiple availability zones. So after selecting the right kind of nodes, we should create the cluster in such a way that the nodes are scattered across different availability zones, so that even if a node in one availability zone goes down or is experiencing an issue, traffic can be served from another availability zone. Likewise, when we deploy the pods we should consider pod topology spread constraints: we should define the spread topology so that there is at least one pod of the model on each node, and even if one pod crashes, traffic can be served by pods in other availability zones. Then we should adopt a decent level of autoscaling, so that if a particular pod is bombarded with requests and cannot serve the traffic, it can be served from other pods. Apart from this, we can implement probes; there are different kinds, such as startup, liveness, and readiness probes. These ensure that unless a pod is completely up and ready to receive traffic, it is not exposed to traffic. The liveness and readiness probes help us divert traffic to pods only when they are ready and healthy, and they also let us know when pods are unhealthy so traffic can be routed to other healthy pods in the cluster. That is how we can design for fault tolerance and high availability. There are a lot of other things that come into play depending on how the entire application is designed and what other components are involved in the overall setup, but this is where we can start, and then we can brainstorm further to create an extensive solution.
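
As a small illustration of the probe part of this design, here is a sketch with the Kubernetes Python client attaching liveness and readiness probes to a container; the health-check paths, port, and timings are assumptions.

```python
from kubernetes import client

container = client.V1Container(
    name="ml-model",                                   # hypothetical container name
    image="registry.example.com/ml-model:latest",      # hypothetical image
    ports=[client.V1ContainerPort(container_port=8080)],
    # Restart the container if it stops answering its health endpoint.
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        initial_delay_seconds=30,
        period_seconds=10,
    ),
    # Only route traffic once the model is loaded and ready to serve.
    readiness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/ready", port=8080),
        initial_delay_seconds=10,
        period_seconds=5,
    ),
)
```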

Explain the role of MLflow in simplifying the management of Python machine learning model life cycles within a Kubernetes-based platform. Uh...

Evaluate how Spark can boost data processing for training large-scale machine learning workloads in a cloud environment like AWS. Spark is capable of processing large amounts of data, and machine learning workloads have to process large amounts of data for training, so we can integrate Spark and utilize its high processing power to prepare those large datasets, which can then be fed to the models. I am not very sure about this, but that is what comes to my mind as of now.
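
As a rough sketch of what that integration could look like, a PySpark job might preprocess raw data on S3 into training-ready features; the bucket paths, column names, and aggregations below are assumptions, not details from the original answer.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On AWS this would typically run on EMR or as Spark-on-Kubernetes.
spark = SparkSession.builder.appName("feature-prep").getOrCreate()

# Hypothetical raw events stored as Parquet in S3.
raw = spark.read.parquet("s3a://example-bucket/raw-events/")

features = (
    raw.filter(F.col("label").isNotNull())             # drop unlabeled rows
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("user_id", "event_date")
       .agg(
           F.count("*").alias("event_count"),
           F.avg("value").alias("avg_value"),
       )
)

# Write the features back to S3 for the training job to consume.
features.write.mode("overwrite").parquet("s3a://example-bucket/training-features/")

spark.stop()
```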