Work Projects
Viewership Prediction Engine - NIIT Technologies Ltd
The media organisation manages 22 television channels, covering broadcasting, new show/series production, logistics, and financial aspects. The broadcasting division includes an advertisement sales department, whose primary responsibility is to sell the advertisement slots of every show/series across all channels. Traditionally, sales prices were pitched primarily on TRP (Television Rating Points).
Business Problem:
Advertisement sales prices are based on TRP figures, but TRPs are obtained from devices attached to only a few thousand television sets in cities whose populations run into millions of households. They therefore present unreliable viewership counts that do not represent actual viewership numbers at ground level. Additionally, TRP gives past numbers, not a future forecast. To make better decisions on ad sales prices, and to account for additional factors like holidays, festival seasons, and regional population counts, a better and more accurate viewership model was required.
Data Sources:
1. Broadcasting devices:
Channels are broadcast from the organisation to operators (local agents), and from operators to end users (customers). The operators manage devices that receive signals from satellites and reroute them to the end users. The devices are channel specific and log sensor data every 30 seconds about the number of connections established with them. The devices generate around 6.2 GB of data per day on average.
2. DTH partners:
Additional data is sourced from DTH partners, which provide services directly to around 300 million customers and supply around 1 GB of processed data per day.
3. Online platforms and mobile apps:
The online platforms and mobile apps account for a customer base of around 25 million, and the apps generate around 1 GB of data per day on average in the form of logs and databases.
Responsibilities:
Constructed multi-layered data funnels using Python, HDFS, Spark, Hive and Sqoop to collect and store data from sensor devices, application servers and DTH partners.
Extracted and transformed the sensor data from raw text lines to the required format with Hive regular expressions.
Merged and joined the financial, logistics, sensor, and DTH datasets into a single dataset with 1500 features using Hive.
Applied statistical and exploratory methods like missing value treatment, imputation, outlier detection, scaling and feature engineering with Python.
Processed and prepared data with 700 features and 20 million records for modeling.
Initially trained a random forest regressor as a base model to extract important features and cross-validate against other models.
Trained regressors using LightGBM (RMSE < 0.17) and XGBoost (RMSE < 0.07) on the processed data for viewership prediction (see the sketch after this list).
Built and deployed the final model through internal web applications, used by the advertisement division for viewership forecasts.
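The bullets above describe a two-stage approach: a random forest to rank features, then gradient-boosted regressors on the reduced set. Below is a minimal sketch of that flow; the file name, column names, and hyperparameters are hypothetical, not the production configuration.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical Hive-prepared dataset with a "viewership" target column.
df = pd.read_csv("viewership_features.csv")
X, y = df.drop(columns=["viewership"]), df["viewership"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stage 1: random forest base model to rank and select important features.
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
top = X.columns[np.argsort(rf.feature_importances_)[::-1][:700]]

# Stage 2: gradient-boosted regressors on the reduced feature set.
for name, model in {
    "LightGBM": lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05),
    "XGBoost": xgb.XGBRegressor(n_estimators=500, learning_rate=0.05),
}.items():
    model.fit(X_train[top], y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test[top])))
    print(f"{name}: RMSE = {rmse:.4f}")
```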
Academic Projects
House Prices Prediction - Machine Learning Course, Summer 2019
The objective of the project was to predict house prices on a given data set.
Performed EDA steps such as data cleaning, missing value treatment, scaling, and feature engineering.
Wrote custom functions for KNN, Naive Bayes and K-Means clustering (a KNN sketch follows this list).
Trained Decision Tree, Random Forest, Linear Regression, GradientBoost, LightGBM, and XGBoost models.
Submitted to the Kaggle competition, with RMSE < 0.1350 and a leaderboard position under 2000.
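As a sketch of the custom-functions bullet, here is what a from-scratch KNN regressor might look like; the implementation below is illustrative, not the original coursework code.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict each query target as the mean of its k nearest
    training points under Euclidean distance (from-scratch sketch)."""
    preds = []
    for q in X_query:
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]       # indices of the k closest rows
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Usage: y_hat = knn_predict(X_train, y_train, X_test, k=7)
```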
Rock Paper Scissors Programming Competition - Machine Learning Course, Summer 2019
A machine learning algorithm to predict the opponent's next move and output the counter move to win the game.
Developed and trained the prediction model using Bayes' probability.
The model predicts outcomes based on the likelihood of a new event occurring given past events (see the sketch after this list).
Secured second place among 15 teams in the competition, winning 250 of 300 rounds played.
Competed in the RPS competition and ranked under 1000 on the leaderboard.
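A minimal sketch of the Bayes-style predictor described above, assuming the model estimates P(next move | previous move) from observed frequencies and plays the counter; the class and method names are illustrative.

```python
from collections import Counter, defaultdict

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class BayesRpsBot:
    """Predict the opponent's next move from conditional move
    frequencies and return the move that beats it (sketch only)."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # prev move -> next move counts
        self.prev = None

    def play(self):
        if self.prev is None or not self.transitions[self.prev]:
            return "rock"  # no history yet, fall back to a default
        likely = self.transitions[self.prev].most_common(1)[0][0]
        return BEATS[likely]

    def observe(self, opponent_move):
        if self.prev is not None:
            self.transitions[self.prev][opponent_move] += 1
        self.prev = opponent_move
```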
Cloud-based Web Application on AWS - A combined project for Big Data Analytics & Cloud Computing, Spring 2019
Deployed Elastic MapReduce (EMR) clusters on the AWS cloud for data analysis (see the sketch after this list).
Designed and developed a web application on AWS with services like EC2, S3, Load Balancing, and EMR.
Scheduled various scripts in Hive, Spark and SparkSQL for data processing and report generation.
Built a dashboard displaying reports such as AWS resource status, the scheduler queue, and completed activities.
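A minimal sketch of launching an EMR cluster from Python with boto3, roughly matching the deployment bullet above; the region, instance types, release label, and S3 bucket are illustrative placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # illustrative region

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.29.0",                      # illustrative release
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",              # hypothetical bucket
)
print("Started cluster:", response["JobFlowId"])
```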
Sentiment Prediction with Multiprocessing & Multithreading - Advanced Operating Systems, Fall 2019
Performed exploratory and sentiment analysis on Twitter data in serial and multiprocessing setups (a sketch follows this list).
Compared and documented serial and multiprocessing performance across systems with varying configurations and operating systems.
Developed a custom round-robin algorithm for fetching tweets in serial and parallel processing.
Applied NLP methods like sentence segmentation, word tokenization, text lemmatization, and stop word identification.
Visualized the sentiments with word clouds.
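A minimal sketch of the serial-versus-multiprocessing comparison, using NLTK's VADER analyzer as a stand-in sentiment scorer (the original analyzer is not named in the write-up).

```python
from multiprocessing import Pool
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time setup: nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()

def score(tweet):
    """Compound VADER sentiment score in [-1, 1] for one tweet."""
    return analyzer.polarity_scores(tweet)["compound"]

def score_serial(tweets):
    return [score(t) for t in tweets]

def score_parallel(tweets, workers=4):
    with Pool(processes=workers) as pool:  # one process per worker
        return pool.map(score, tweets)

if __name__ == "__main__":
    tweets = ["I love this show!", "Worst episode ever.", "It was okay."]
    print(score_serial(tweets))
    print(score_parallel(tweets))
```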
Research Projects
Instalytics - Instagram analysis of posts and comments
Instalytics is an Instagram analytics tool for performing analysis and deriving insights from Instagram posts.
Developed a Python bot to crawl through Instagram posts and extract information on likes, comments and followers.
Cleaned and processed the raw data into a structured format.
Employed various analytical methods to gain insights on post frequency, trends, and the most liked and commented posts.
Displayed the insights through interactive visualizations using the Plotly library (see the sketch after this list).
Future work: media labelling of Instagram posts, and sentiment prediction of user posts and comments using deep neural network algorithms.
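As an illustration of the Plotly step, here is a minimal sketch of an interactive engagement chart; the DataFrame contents are made-up stand-ins for the crawled data.

```python
import pandas as pd
import plotly.express as px

# Hypothetical crawled data: one row per Instagram post.
posts = pd.DataFrame({
    "date": pd.to_datetime(["2019-06-01", "2019-06-08", "2019-06-15"]),
    "likes": [120, 340, 210],
    "comments": [14, 52, 30],
})

# Interactive grouped bar chart of engagement per post.
fig = px.bar(posts, x="date", y=["likes", "comments"],
             barmode="group", title="Post engagement over time")
fig.show()
```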
Resume/CV Parser
The script parses a given resume/CV in PDF format and outputs the data into classified labels for easier reading.
Data has been collected and gathered from multiple sources.
Trained spaCy's NER model for label classification (a training sketch follows this list).
Top classified labels: Name and Contact Information, Designation, Location, Experience, Skills.
Accumulating more data to improve the accuracy of the model.
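A minimal sketch of training a custom NER component, assuming the spaCy 3 training API; the annotated example and label names are hypothetical stand-ins for the collected resume data.

```python
import random
import spacy
from spacy.training import Example

# Hypothetical annotated sample: (text, {"entities": [(start, end, label)]}).
TRAIN_DATA = [
    ("Jane Doe, Senior Data Scientist, Chicago",
     {"entities": [(0, 8, "NAME"), (10, 31, "DESIGNATION"),
                   (33, 40, "LOCATION")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):                      # small demo training loop
    random.shuffle(TRAIN_DATA)
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer)

doc = nlp("John Smith, Machine Learning Engineer, Boston")
print([(ent.text, ent.label_) for ent in doc.ents])
```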
Natural Language Processing - Analysis and labelling of human text and speech patterns
A research project analyzing human speech, talks, and writing to find patterns, gain insight into thought processes, and classify them into different categories of mindsets such as optimistic, realistic, positive and criticizing, to name a few. Collaborating with linguistic experts, psychology experts and other enthusiastic data scientists on the work.
Grouped and segmented the text and speech data based on linguistic and geospatial points.
Used NLTK on region-specific data to expand a domain-specific knowledge base for semantic analysis.
Identified and documented common traits based on a variety of behavioural aspects.
Career Profile
Strong analytical and problem-solving skills, and the ability to digest and interpret complex concepts.
Attention to detail and the ability to learn quickly.
Strong credentials and highly proficient in Python, R, Sklearn, TensorFlow, Keras, spaCy, NLTK, NumPy, Pandas, Matplotlib, Seaborn, Plotly, REST APIs, and SOAP services.
Deep expertise in statistical methods like data normalization, random/up/down-sampling, and feature engineering.
Experienced in training classification & regression machine learning algorithms like XGBoost, LightGBM, Random Forest, CNN (1D, 2D), RNN, LSTM, VGG16, SVM, and DBSCAN.
Strong credentials with a variety of Amazon Web Services product offerings (EC2, S3, Load Balancing, CloudSearch, ElastiCache).
Demonstrated experience in bringing critical applications from design through production and support.
Excellent interpersonal, written and verbal communication skills. Proven ability to collaborate with internal and external teams globally, including ensuring appropriate communications with management.