Work Projects
Viewership Prediction Engine - NIIT Technologies Ltd
The media organisation manages 22 television channels, covering broadcasting, new show/series production, logistics, and financial aspects. The broadcasting division includes an advertisement sales department, whose primary responsibility is to sell the advertisement slots of every show/series across all channels. Traditionally, sales prices were pitched primarily on TRP (Television Rating Points).
Business Problem:
Advertisement sales prices are based on TRP figures, but TRPs are obtained from devices attached to only a few thousand television sets in cities whose populations run into millions of households. They therefore present unreliable viewership counts that do not represent actual viewership numbers at ground level. Additionally, TRP gives past numbers, not a future forecast. To make better decisions on ad sales prices, and to account for additional factors like holidays, festival seasons, and regional population counts, a better and more accurate viewership model was required.
Data Sources:
1. Broadcasting devices:
Channels are broadcast from the organisation to operators (local agents), and from operators to end users (customers). The operators manage devices that receive signals from satellites and reroute them to the end users. The devices are channel specific and log sensor data every 30 seconds about the number of connections established with them. The devices generate around 6.2 GB of data per day on average.
2. DTH partners:
Additional data is sourced from DTH partners, which provide services directly to around 300 million customers and supply around 1 GB of processed data per day.
3. Online platforms and mobile apps:
The online platforms and mobile apps account for a customer base of around 25 million, and the apps generate around 1 GB of data per day on average in the form of logs and databases.
Responsibilities:
Constructed multi-layered data funnels using Python, HDFS, Spark, Hive and Sqoop to collect and store data from sensor devices, application servers and DTH partners.
Extracted and transformed the sensor data from raw text lines to the required format with Hive regular expressions.
Merged and joined the financial, logistics, sensor, and DTH datasets into a single dataset with 1500 features using Hive.
Applied statistical and exploratory methods like missing value treatment, imputation, outlier detection, scaling and feature engineering with Python.
Processed and prepared data with 700 features and 20 million records for modeling.
Initially trained a random forest regressor as a base model to extract important features and cross-validate against other models.
Trained regressors using LightGBM (RMSE < 0.17) and XGBoost (RMSE < 0.07) on the processed data for viewership prediction (see the sketch after this list).
Built and deployed the final model through internal web applications, used by the advertisement division for viewership forecasts.
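The bullets above describe a two-stage approach: a random forest to rank features, then gradient-boosted regressors on the reduced set. Below is a minimal sketch of that flow; the file name, column names, and hyperparameters are hypothetical, not the production configuration.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical Hive-prepared dataset with a "viewership" target column.
df = pd.read_csv("viewership_features.csv")
X, y = df.drop(columns=["viewership"]), df["viewership"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stage 1: random forest base model to rank and select important features.
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
top = X.columns[np.argsort(rf.feature_importances_)[::-1][:700]]

# Stage 2: gradient-boosted regressors on the reduced feature set.
for name, model in {
    "LightGBM": lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05),
    "XGBoost": xgb.XGBRegressor(n_estimators=500, learning_rate=0.05),
}.items():
    model.fit(X_train[top], y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test[top])))
    print(f"{name}: RMSE = {rmse:.4f}")
```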
Academic Projects
House Prices Prediction - Machine Learning Course, Summer 2019
The objective of the project was to predict house prices on a given data set.
Performed EDA steps such as data cleaning, missing value treatment, scaling, and feature engineering.
Wrote custom functions for KNN, Naive Bayes and K-Means clustering (a KNN sketch follows this list).
Trained Decision Tree, Random Forest, Linear Regression, GradientBoost, LightGBM, and XGBoost models.
Submitted to the Kaggle competition, with RMSE < 0.1350 and a leaderboard position under 2000.
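As a sketch of the custom-functions bullet, here is what a from-scratch KNN regressor might look like; the implementation below is illustrative, not the original coursework code.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict each query target as the mean of its k nearest
    training points under Euclidean distance (from-scratch sketch)."""
    preds = []
    for q in X_query:
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]       # indices of the k closest rows
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Usage: y_hat = knn_predict(X_train, y_train, X_test, k=7)
```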
Rock Paper Scissors Programming Competition - Machine Learning Course, Summer 2019
A machine learning algorithm to predict the opponent's next move and output the counter move to win the game.
Developed and trained the prediction model using Bayes' probability.
The model predicts outcomes based on the likelihood of a new event occurring given past events (see the sketch after this list).
Secured second place among 15 teams in the competition, winning 250 of 300 rounds played.
Competed in the RPS competition and ranked under 1000 on the leaderboard.
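A minimal sketch of the Bayes-style predictor described above, assuming the model estimates P(next move | previous move) from observed frequencies and plays the counter; the class and method names are illustrative.

```python
from collections import Counter, defaultdict

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class BayesRpsBot:
    """Predict the opponent's next move from conditional move
    frequencies and return the move that beats it (sketch only)."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # prev move -> next move counts
        self.prev = None

    def play(self):
        if self.prev is None or not self.transitions[self.prev]:
            return "rock"  # no history yet, fall back to a default
        likely = self.transitions[self.prev].most_common(1)[0][0]
        return BEATS[likely]

    def observe(self, opponent_move):
        if self.prev is not None:
            self.transitions[self.prev][opponent_move] += 1
        self.prev = opponent_move
```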
Cloud-based Web Application on AWS - A combined project for Big Data Analytics & Cloud Computing, Spring 2019
Deployed Elastic MapReduce (EMR) clusters on the AWS cloud for data analysis (see the sketch after this list).
Designed and developed a web application on AWS with services like EC2, S3, Load Balancing, and EMR.
Scheduled various scripts in Hive, Spark and SparkSQL for data processing and report generation.
Built a dashboard displaying reports such as AWS resource status, the scheduler queue, and completed activities.
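A minimal sketch of launching an EMR cluster from Python with boto3, roughly matching the deployment bullet above; the region, instance types, release label, and S3 bucket are illustrative placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # illustrative region

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.29.0",                      # illustrative release
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",              # hypothetical bucket
)
print("Started cluster:", response["JobFlowId"])
```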
Sentiment Prediction with Multiprocessing & Multithreading - Advanced Operating Systems, Fall 2019
Performed exploratory and sentiment analysis on Twitter data in serial and multiprocessing setups (a sketch follows this list).
Compared and documented serial and multiprocessing performance across systems with varying configurations and operating systems.
Developed a custom round-robin algorithm for fetching tweets in serial and parallel processing.
Applied NLP methods like sentence segmentation, word tokenization, text lemmatization, and stop word identification.
Visualized the sentiments with word clouds.
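A minimal sketch of the serial-versus-multiprocessing comparison, using NLTK's VADER analyzer as a stand-in sentiment scorer (the original analyzer is not named in the write-up).

```python
from multiprocessing import Pool
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time setup: nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()

def score(tweet):
    """Compound VADER sentiment score in [-1, 1] for one tweet."""
    return analyzer.polarity_scores(tweet)["compound"]

def score_serial(tweets):
    return [score(t) for t in tweets]

def score_parallel(tweets, workers=4):
    with Pool(processes=workers) as pool:  # one process per worker
        return pool.map(score, tweets)

if __name__ == "__main__":
    tweets = ["I love this show!", "Worst episode ever.", "It was okay."]
    print(score_serial(tweets))
    print(score_parallel(tweets))
```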
Research Projects
Instalytics - Instagram analysis of posts and comments
Instalytics is an Instagram analytics tool for performing analysis and deriving insights from Instagram posts.
Developed a Python bot to crawl through Instagram posts and extract information on likes, comments and followers.
Cleaned and processed the raw data into a structured format.
Employed various analytical methods to gain insights on post frequency, trends, and the most liked and commented posts.
Displayed the insights through interactive visualizations using the Plotly library (see the sketch after this list).
Future work: media labelling of Instagram posts, and sentiment prediction of user posts and comments using deep neural network algorithms.
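As an illustration of the Plotly step, here is a minimal sketch of an interactive engagement chart; the DataFrame contents are made-up stand-ins for the crawled data.

```python
import pandas as pd
import plotly.express as px

# Hypothetical crawled data: one row per Instagram post.
posts = pd.DataFrame({
    "date": pd.to_datetime(["2019-06-01", "2019-06-08", "2019-06-15"]),
    "likes": [120, 340, 210],
    "comments": [14, 52, 30],
})

# Interactive grouped bar chart of engagement per post.
fig = px.bar(posts, x="date", y=["likes", "comments"],
             barmode="group", title="Post engagement over time")
fig.show()
```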
Resume/CV Parser
The script parses a given resume/CV in PDF format and outputs the data into classified labels for easier reading.
Data has been collected and gathered from multiple sources.
Trained spaCy's NER model for label classification (a training sketch follows this list).
Top classified labels: Name and Contact Information, Designation, Location, Experience, Skills.
Accumulating more data to improve the accuracy of the model.
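A minimal sketch of training a custom NER component, assuming the spaCy 3 training API; the annotated example and label names are hypothetical stand-ins for the collected resume data.

```python
import random
import spacy
from spacy.training import Example

# Hypothetical annotated sample: (text, {"entities": [(start, end, label)]}).
TRAIN_DATA = [
    ("Jane Doe, Senior Data Scientist, Chicago",
     {"entities": [(0, 8, "NAME"), (10, 31, "DESIGNATION"),
                   (33, 40, "LOCATION")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):                      # small demo training loop
    random.shuffle(TRAIN_DATA)
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer)

doc = nlp("John Smith, Machine Learning Engineer, Boston")
print([(ent.text, ent.label_) for ent in doc.ents])
```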
Natural Language Processing - Analysis and labelling of human text and speech patterns
A research project analyzing human speech, talks, and writing to find patterns, gain insight into thought processes, and classify them into different categories of mindsets such as optimistic, realistic, positive and criticizing, to name a few. Collaborating with linguistic experts, psychology experts and other enthusiastic data scientists on the work.
Grouped and segmented the text and speech data based on linguistic and geospatial points.
Used NLTK on region-specific data to expand a domain-specific knowledge base for semantic analysis.
Identified and documented common traits based on a variety of behavioural aspects.
Career Profile
Strong analytical and problem-solving skills, and the ability to digest and interpret complex concepts.
Attention to detail and the ability to learn quickly.
Strong credentials and highly proficient in Python, R, Sklearn, TensorFlow, Keras, spaCy, NLTK, NumPy, Pandas, Matplotlib, Seaborn, Plotly, REST APIs, and SOAP services.
Deep expertise in statistical methods like data normalization, random/up/down-sampling, and feature engineering.
Experienced in training classification & regression machine learning algorithms like XGBoost, LightGBM, Random Forest, CNN (1D, 2D), RNN, LSTM, VGG16, SVM, and DBSCAN.
Strong credentials with a variety of Amazon Web Services product offerings (EC2, S3, Load Balancing, CloudSearch, ElastiCache).
Demonstrated experience in bringing critical applications from design through production and support.
Excellent interpersonal, written and verbal communication skills. Proven ability to collaborate with internal and external teams globally, including ensuring appropriate communications with management.