<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Abhijit Annaldas | Machine Learning Blog</title>
    <description>For the love of data and machines that can learn</description>
    <link>http://abhijitannaldas.com/ml/</link>
    <atom:link href="http://abhijitannaldas.com/ml/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 07 Jan 2024 06:41:22 +0000</pubDate>
    <lastBuildDate>Sun, 07 Jan 2024 06:41:22 +0000</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>Dimensionality Reduction and Principal Component Analysis (PCA) Explained</title>
        <description>&lt;p&gt;Data is seldom clean and ready for machine learning or predictive modelling. Data preprocessing is a time-consuming and non-trivial effort in any predictive modelling task. A recent &lt;a href=&quot;https://www.kaggle.com/surveys/2017&quot; target=&quot;_blank&quot;&gt;kaggle survey&lt;/a&gt; reports that dirty data is the biggest barrier!&lt;/p&gt;

&lt;p&gt;Once the data is cleaned and pre-processed, the next challenge is finding the important features of the data, engineering new features and ignoring less important or &lt;em&gt;irrelevant&lt;/em&gt; features for the predictive modelling task at hand. Not every piece of data is equally important, and some may not be relevant in the context of the problem being solved. Some attributes may be noise or duplication in some form, and sometimes a particular transformation of an attribute (or an engineered feature) could be much more relevant than the original attribute.&lt;/p&gt;

&lt;p&gt;Popular ways to find these important features are…&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Hand crafted feature engineering - deriving features based on existing features, domain expertise and sometimes &lt;a href=&quot;#a&quot; title=&quot;We need to be careful with data augmentation, check for data leak&quot;&gt;data augmentation&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Feature Selection - selecting a subset of important features, ignoring features that don’t contribute towards predictions.&lt;/li&gt;
  &lt;li&gt;Dimensionality Reduction (we’ll discuss PCA in this post) - a linear transformation of the attributes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of the above three, sometimes any one strategy might suffice, or a combination of all three could be used. If all three strategies are used, I believe the above sequence is an appropriate approach; please share in the comments what you think.&lt;/p&gt;

&lt;p&gt;PCA has several advantages. It helps find the most effective transformation of existing attributes through a linear transformation technique. It helps with dimensionality reduction, which makes things faster by reducing the size of the dataset to be stored and processed. PCA, with dimensionality reduction, also makes visualization easier to some extent, as it’s hard to visualize and understand data beyond 3 dimensions. Intuitively we can say – &lt;em&gt;it is a change in viewpoint (linear transformation) of a scene (vector space), which gives a much clearer shot of the objects (data) in the scene!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s consider an analogy for some intuition: a business which sells flooring products for homes needs some insights. Let’s consider two pieces of information (as two make it easy to draw and visualize) from the real estate data, viz., the plot length and plot width of the house.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/avannaldas/avannaldas.github.io/master/uploads/pca-1.png&quot; alt=&quot;House Plot Length and Width&quot; title=&quot;Original Attributes&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Before doing PCA, the data is normalized by subtracting the mean and is usually scaled as well; the data looks something like this after normalizing:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/avannaldas/avannaldas.github.io/master/uploads/pca-2.png&quot; alt=&quot;House Plot Length and Width Scaled&quot; title=&quot;Normalized data&quot; /&gt;&lt;/p&gt;

&lt;p&gt;PCA identifies two principal components, represented by red and orange lines below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/avannaldas/avannaldas.github.io/master/uploads/pca-3.png&quot; alt=&quot;Principal Components&quot; title=&quot;Two Principal Components&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this case, PCA identified an attribute that seems to be analogous to plot area, without us engineering the feature ourselves. It seems a little obvious in this case due to the nature of the chosen example, but PCA does find dimensions (attributes) that aren’t always obvious to us.&lt;/p&gt;

&lt;p&gt;Now, let’s understand how PCA works.&lt;/p&gt;

&lt;p&gt;PCA finds a linear transformation which &lt;strong&gt;maximizes variance&lt;/strong&gt;, which is equivalent to minimizing the projection error of the data onto the new axis. The direction found this way is the first principal component, and each subsequent component is orthogonal to the previous ones. Each principal component is an eigenvector of the covariance matrix. In the diagram below, the green lines are projections from the blue data points onto the red line, the first principal component. I’ve reduced the number of data points from the previous diagram to make the projections easier to show, as well as draw :).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/avannaldas/avannaldas.github.io/master/uploads/pca-4.png&quot; alt=&quot;Projection Error&quot; title=&quot;Projection error shown with green projections for first principal component&quot; /&gt;&lt;/p&gt;

&lt;p&gt;PCA orders the dimensions from most important to least important, so the later dimensions may have eigenvalues close to zero. Hence, when we choose &lt;em&gt;k&lt;/em&gt; of the &lt;em&gt;n&lt;/em&gt; dimensions, we get the &lt;em&gt;k&lt;/em&gt; most important ones.&lt;/p&gt;

&lt;p&gt;Computing PCA has three main steps…&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Compute covariance matrix of all the attributes&lt;/li&gt;
  &lt;li&gt;Perform SVD on the covariance matrix computed in step 1&lt;/li&gt;
  &lt;li&gt;Step two returns three matrices viz., &lt;em&gt;U&lt;/em&gt;, &lt;em&gt;S&lt;/em&gt; and &lt;em&gt;V&lt;/em&gt;. Matrix &lt;em&gt;U&lt;/em&gt; is an &lt;em&gt;n&lt;/em&gt; x &lt;em&gt;n&lt;/em&gt; matrix, where &lt;em&gt;n&lt;/em&gt; is the total number of dimensions/principal components. We choose the first &lt;em&gt;k&lt;/em&gt; columns of the matrix, which represent the &lt;em&gt;k&lt;/em&gt; dimensions we need.&lt;/li&gt;
&lt;/ol&gt;
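
&lt;p&gt;The three steps above can be sketched with NumPy. This is a minimal illustration on toy data; the variables, values and shapes are my own, not from the post:&lt;/p&gt;

```python
import numpy as np

# Toy data: 100 houses, n = 2 attributes (plot length and width)
rng = np.random.default_rng(0)
length = rng.normal(50, 10, 100)
width = length * 0.5 + rng.normal(0, 2, 100)
X = np.column_stack([length, width])

# Normalize: subtract the mean and scale
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 1: covariance matrix of all attributes (n x n)
cov = np.cov(X, rowvar=False)

# Step 2: SVD of the covariance matrix
U, S, Vt = np.linalg.svd(cov)

# Step 3: keep the first k columns of U and project the data onto them
k = 1
X_reduced = X @ U[:, :k]   # shape (100, k)
print(S / S.sum())         # fraction of variance carried by each component
```

&lt;p&gt;Because the two toy attributes are strongly correlated, the first component (roughly the “plot area” direction) carries almost all of the variance.&lt;/p&gt;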

&lt;h3 id=&quot;dimensionality-reduction-vs-feature-selection&quot;&gt;Dimensionality Reduction vs Feature Selection&lt;/h3&gt;
&lt;p&gt;With PCA we pick the &lt;em&gt;k&lt;/em&gt; most important of the &lt;em&gt;n&lt;/em&gt; dimensions. &lt;em&gt;k&lt;/em&gt; is a parameter determined based on the problem and experience. These &lt;em&gt;k&lt;/em&gt; attributes are optimal for representing insights, deriving correlations and making predictions.&lt;/p&gt;

&lt;p&gt;Feature Selection, on the other hand, is picking a subset of all features (and engineered features) with higher importance. Importance can be calculated using different techniques. Some of the popular techniques are available in &lt;a href=&quot;http://scikit-learn.org/stable/modules/feature_selection.html&quot; target=&quot;_blank&quot;&gt;sklearn&lt;/a&gt;, and sometimes the learning algorithm itself provides the importances of different features.&lt;/p&gt;

&lt;p&gt;These two techniques are very different from each other. While dimensionality reduction finds the best features, they aren’t actual features (attributes) we can relate to; they are linear transformations of the existing features. Hence it’s difficult to understand what drives the predictions and outcomes. On the contrary, with feature selection, we have a much better understanding of the attributes/features and their significance w.r.t. the outcome.&lt;/p&gt;
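
&lt;p&gt;As a quick sketch of the feature-selection side, here is one of the sklearn techniques linked above (&lt;code&gt;SelectKBest&lt;/code&gt;) on synthetic data; the dataset sizes and &lt;em&gt;k&lt;/em&gt; are my own choices for illustration:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=42)

# Keep the 3 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 3)
print(selector.get_support())  # boolean mask of the kept columns
```

&lt;p&gt;Note the contrast with PCA: the three columns kept here are original, interpretable attributes, not linear combinations of them.&lt;/p&gt;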

&lt;p&gt;Thanks for reading, please share your thoughts in comments.&lt;/p&gt;
</description>
        <pubDate>Mon, 13 Nov 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/dimensionality-reduction-and-principal-component-analysis-pca-explained.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/dimensionality-reduction-and-principal-component-analysis-pca-explained.html</guid>
        <!--
        
        
        -->
      </item>
    
      <item>
        <title>DBSCAN Clustering</title>
        <description>&lt;h3 id=&quot;density-based-spatial-clustering-of-applications-with-noise-dbscan&quot;&gt;Density Based Spatial Clustering of Applications with Noise (DBSCAN)&lt;/h3&gt;

&lt;p&gt;DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses on the proximity and density of observations to form clusters. This is very different from KMeans, where an observation becomes part of the cluster represented by the nearest centroid. DBSCAN can identify outliers: observations which don’t belong to any cluster. Since DBSCAN also identifies the number of clusters, it is very useful for unsupervised learning when we don’t know how many clusters the data might contain.&lt;/p&gt;

&lt;p&gt;K-Means clustering may cluster loosely related observations together. Every observation eventually becomes part of some cluster, even if the observations are scattered far away in the vector space. Since clusters depend on the mean value of the cluster members, each data point plays a role in forming them. A slight change in the data points &lt;em&gt;might&lt;/em&gt; affect the clustering outcome. This problem is greatly reduced in DBSCAN due to the way its clusters are formed.&lt;/p&gt;

&lt;p&gt;In DBSCAN, clustering happens based on two important parameters viz.,&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;neighbourhood (n)&lt;/strong&gt; - cutoff distance of a point from a core point (discussed below) for it to be considered part of a cluster. Commonly referred to as &lt;em&gt;epsilon&lt;/em&gt; (abbreviated as &lt;em&gt;eps&lt;/em&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;minimum points (m)&lt;/strong&gt; - minimum number of points required to form a cluster. Commonly referred to as &lt;em&gt;minPts&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are three types of points after the DBSCAN clustering is complete viz.,&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Core&lt;/strong&gt; - This is a point which has at least &lt;em&gt;m&lt;/em&gt; points within distance &lt;em&gt;n&lt;/em&gt; from itself.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Border&lt;/strong&gt; - This is a point which has at least one Core point within distance &lt;em&gt;n&lt;/em&gt;, but fewer than &lt;em&gt;m&lt;/em&gt; points within distance &lt;em&gt;n&lt;/em&gt; from itself.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Noise&lt;/strong&gt; - This is a point which is neither a Core nor a Border point: it has fewer than &lt;em&gt;m&lt;/em&gt; points within distance &lt;em&gt;n&lt;/em&gt; from itself and no Core point in its neighbourhood.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DBSCAN clustering can be summarized in following steps…&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;For each point &lt;em&gt;P&lt;/em&gt; in the dataset, count the points &lt;em&gt;pts&lt;/em&gt; within distance &lt;em&gt;n&lt;/em&gt;.
    &lt;ul&gt;
      &lt;li&gt;if &lt;em&gt;pts&lt;/em&gt; &amp;gt;= &lt;em&gt;m&lt;/em&gt;, label &lt;em&gt;P&lt;/em&gt; as a &lt;em&gt;Core&lt;/em&gt; point&lt;/li&gt;
      &lt;li&gt;if &lt;em&gt;pts&lt;/em&gt; &amp;lt; &lt;em&gt;m&lt;/em&gt; and a core point is within distance &lt;em&gt;n&lt;/em&gt;, label &lt;em&gt;P&lt;/em&gt; a &lt;em&gt;Border&lt;/em&gt; point&lt;/li&gt;
      &lt;li&gt;if &lt;em&gt;pts&lt;/em&gt; &amp;lt; &lt;em&gt;m&lt;/em&gt;, label &lt;em&gt;P&lt;/em&gt; a &lt;em&gt;Noise&lt;/em&gt; point&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;For the sake of explainability, let’s refer to &lt;strong&gt;a &lt;em&gt;Core&lt;/em&gt; point and all the points within distance &lt;em&gt;n&lt;/em&gt; of it&lt;/strong&gt; as a Core-Set. All overlapping Core-Sets are grouped together into one cluster, much like multiple individual graphs being connected to form one connected graph.&lt;/li&gt;
&lt;/ol&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;img style=&quot;max-width:100%;&quot; src=&quot;https://github.com/avannaldas/avannaldas.github.io/raw/master/uploads/dbscan.png&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Since clustering entirely depends on the parameters &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;m&lt;/em&gt; (above), choosing these values correctly is very important. While good domain knowledge helps in choosing good values for these parameters, there are also approaches by which they can be reasonably approximated without deep expertise in the domain.&lt;/p&gt;

&lt;p&gt;See DBSCAN demo in &lt;a href=&quot;http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html&quot; target=&quot;_blank&quot;&gt;sklearn examples&lt;/a&gt; and try it with &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html&quot; target=&quot;_blank&quot;&gt;sklearn.cluster.DBSCAN&lt;/a&gt; in sklearn.&lt;/p&gt;
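
&lt;p&gt;A minimal sketch of the steps above using &lt;code&gt;sklearn.cluster.DBSCAN&lt;/code&gt; (the dataset and the &lt;em&gt;eps&lt;/em&gt;/&lt;em&gt;min_samples&lt;/em&gt; values are my own picks for illustration):&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape centroid-based KMeans handles poorly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighbourhood distance n; min_samples is the minimum points m
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_  # -1 marks Noise points; other values are cluster ids
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(labels == -1))
```

&lt;p&gt;Notice that the number of clusters is an output of the algorithm, not an input as in K-Means.&lt;/p&gt;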

</description>
        <pubDate>Fri, 20 Oct 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/dbscan-clustering-in-machine-learning.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/dbscan-clustering-in-machine-learning.html</guid>
        <!--
        
        
        -->
      </item>
    
      <item>
        <title>K-Means vs KNN</title>
        <description>&lt;h1 id=&quot;k-means-vs-knn&quot;&gt;K-Means vs KNN&lt;/h1&gt;

&lt;p&gt;K-Means (K-Means Clustering) and KNN (K-Nearest Neighbours) are often confused with each other in Machine Learning. In this post, I’ll explain some attributes of, and the differences between, these two popular Machine Learning techniques.
&lt;br /&gt;&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;K-Means&lt;/th&gt;
      &lt;th&gt;KNN&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;It is an &lt;strong&gt;Unsupervised&lt;/strong&gt; learning technique&lt;/td&gt;
      &lt;td&gt;It is a &lt;strong&gt;Supervised&lt;/strong&gt; learning technique&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;It is used for &lt;strong&gt;Clustering&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;It is used mostly for &lt;strong&gt;Classification&lt;/strong&gt;, and sometimes even for &lt;strong&gt;Regression&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;‘K’ in K-Means is the number of clusters the algorithm is trying to identify/learn from the data. The clusters are often unknown since this is used with Unsupervised learning.&lt;/td&gt;
      &lt;td&gt;‘K’ in KNN is the number of nearest neighbours used to classify (or, in the case of a continuous variable/regression, predict) a test sample&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;It is typically used for scenarios like understanding population demographics, market segmentation, social media trends, anomaly detection, etc., where the clusters are unknown to begin with.&lt;/td&gt;
      &lt;td&gt;It is used for classification and regression of known data, where usually the target attribute/variable is known beforehand.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;In the training phase of K-Means, K observations are arbitrarily selected (known as centroids). Each point in the vector space is assigned to the cluster represented by the nearest centroid (by Euclidean distance). Once the clusters are formed, the centroid of each cluster is updated to the mean of all cluster members, and cluster formation restarts with the new centroids. This repeats until the centroids themselves are the means of their clusters, i.e., updating centroids to the mean no longer changes them. The prediction for a test observation is based on the nearest centroid.&lt;/td&gt;
      &lt;td&gt;KNN doesn’t have a training phase as such. The prediction for a test observation is based on its K nearest neighbours (often by Euclidean distance), using weighted averages/votes.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
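
&lt;p&gt;The unsupervised-vs-supervised distinction in the table shows up directly in code: K-Means never sees the labels, while KNN needs them. A minimal sketch with sklearn (toy blob data of my own choosing):&lt;/p&gt;

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs

# Toy data: 150 points around 3 well-separated centres, 2 features
X, y = make_blobs(n_samples=150, centers=3, random_state=7)

# K-Means: unsupervised -- fit() sees only X, never the labels y
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print(km.cluster_centers_.shape)  # (3, 2): one learned centroid per cluster

# KNN: supervised -- fit() needs both X and the known labels y
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))         # class labels of the 5 nearest-neighbour votes
```

&lt;p&gt;Note that the cluster ids K-Means assigns are arbitrary; they need not match the label values in &lt;em&gt;y&lt;/em&gt;, even when the grouping itself is identical.&lt;/p&gt;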

&lt;p&gt;&lt;br /&gt;
&lt;i&gt;You can find a bare minimum KMeans algorithm implementation from scratch &lt;a href=&quot;https://github.com/avannaldas/ML-from-scratch/blob/master/avlearn/cluster.py&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/i&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;i&gt;Learn more about &lt;a href=&quot;https://en.wikipedia.org/wiki/K-means_clustering&quot; target=&quot;_blank&quot;&gt;K-Means Clustering&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm&quot; target=&quot;_blank&quot;&gt;K-Nearest Neighbors&lt;/a&gt;&lt;/i&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 23 Sep 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/kmeans-vs-knn-in-machine-learning.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/kmeans-vs-knn-in-machine-learning.html</guid>
        <!--
        
        
        -->
      </item>
    
      <item>
        <title>Data preprocessing in Machine Learning</title>
        <description>&lt;p&gt;Data preprocessing is an important step in solving every machine learning problem. Most datasets used with Machine Learning problems need to be processed / cleaned / transformed before a Machine Learning algorithm can be trained on them. The most commonly used preprocessing techniques are few, e.g. missing value imputation, encoding categorical variables, scaling, etc., and they are easy to understand. But when we actually deal with the data, things often get clunky. Every dataset is different and poses unique challenges.&lt;/p&gt;

&lt;p&gt;In this post, I’m not explaining preprocessing techniques, but sharing a few tips based on my experiences.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Too many nulls - When most (over 60% to 70%) of the values in a column are null, it’s better to drop the column. This percentage/threshold can be decided based on problem and experience.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Same values/skew - Sometimes, a majority of values in a column might be the same value, with very few different values. We need to check whether the occurrence of such values is due to a skew in the dataset or natural for that dataset. If it’s skewed, the dataset should be resampled (sub-sampled or over-sampled, as appropriate). If it’s not a skew and the values occur naturally that way, it’s better to drop the column.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Data types - Check the datatypes of the columns, particularly date columns and type cast appropriately.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Missing value imputation - Usually median is used with numeric columns and mode with non-numeric columns.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When a column doesn’t have missing values - A column may not have any null values in the train dataset, yet quite possibly have null values in the test dataset. Hence, it’s important to review the columns/data and set up missing value imputation for all columns that can possibly have missing values, even if the train dataset doesn’t have any.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Categorical Attributes -
    &lt;ul&gt;
      &lt;li&gt;When the number of unique values in a categorical column is too high, check the value counts of each of those values. Group rarely occurring values into a single value like ‘Other’ before encoding.&lt;/li&gt;
      &lt;li&gt;When the number of unique values is huge and the values are roughly equally distributed, try to find related values and see if multiple categorical values can be clubbed into one (grouping), thereby reducing the count of categorical values.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Related Attributes - If there are multiple attributes carrying the same information at different granularity, like city and state, it’s better to keep a column like state and delete the city column. Alternatively, keeping both columns and assessing feature importance might help in eliminating one of them.&lt;/li&gt;
&lt;/ul&gt;
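
&lt;p&gt;A few of the tips above, sketched in pandas. The column names, the toy values and the 60% null threshold are hypothetical choices for illustration, not from any real dataset:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":         [25, np.nan, 40, 35, np.nan, 50],
    "city":        ["Pune", "Pune", "Mumbai", "Delhi", "Pune", "Goa"],
    "mostly_null": [np.nan, np.nan, 1.0, np.nan, np.nan, np.nan],
})

# Too many nulls: drop columns that are more than ~60% null
null_frac = df.isnull().mean()
df = df.drop(columns=null_frac[null_frac > 0.6].index)

# Missing value imputation: median for numeric columns
df["age"] = df["age"].fillna(df["age"].median())

# Rare categories: group values that occur only once into 'Other'
counts = df["city"].value_counts()
rare = counts[counts == 1].index
df["city"] = df["city"].replace(list(rare), "Other")

print(df)
```

&lt;p&gt;In a real pipeline the imputation values (median, mode, etc.) should be computed on the train set only and reused on the test set, per the tip about columns that are clean in train but not in test.&lt;/p&gt;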

&lt;p style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;https://github.com/avannaldas/avannaldas.github.io/raw/master/uploads/Label-Encoder-vs-One-Hot-Encoder.png&quot; alt=&quot;alt text&quot; title=&quot;LabelEncoder vs OneHotEncoder&quot; style=&quot;max-width:850px&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Machine Learning Practitioners/Data Scientists - Please share your thoughts, anything you’d add or do differently/better.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
</description>
        <pubDate>Fri, 08 Sep 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/data-preprocessing-in-machine-learning.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/data-preprocessing-in-machine-learning.html</guid>
        <!--
        
        
        -->
      </item>
    
      <item>
        <title>Machine Learning - Beyond the buzz!</title>
        <description>&lt;p&gt;Machine Learning and Data Science are among the hottest topics across all disciplines these days, and have created a lot of interest among people. Machine Learning has immense potential, and we keep seeing jaw-dropping accomplishments day by day. Here I’ll share my understanding of the field and what it would take to make a career in Machine Learning/Data Science. I hope this can be of some help in taking an important decision about your career.&lt;/p&gt;

&lt;p&gt;To some extent, the hype around the topic is justified. But the hype and buzz shouldn’t be the trigger to choose Machine Learning as a career option. The buzz and the glamour that pull one into the field fade away as soon as one gets their feet wet in Machine Learning, unless there is a strong objective/goal that Machine Learning will help achieve. Choosing/making a career in Machine Learning shouldn’t be the objective in itself.&lt;/p&gt;

&lt;p&gt;I first heard about Machine Learning 3 years ago and I didn’t see any compelling reason at that point of time to seriously consider the field. In early 2016, I came to know about something which forever changed what Machine Learning means to me. I read about diagnosing diabetic retinopathy from retina images using Machine Learning. That was a wow moment for me. I saw immense potential of doing great work, positively impacting people’s lives which is visible first hand, so direct and close from the work I can do. The mere thought that Machine Learning cuts across multiple fields, disciplines and different walks of life is fascinating to me. That’s when I became serious about Machine Learning.&lt;/p&gt;

&lt;p&gt;Once you have a solid reason and a goal, below is a short one-liner depiction of the path to Machine Learning mastery. Though shortcuts may be taken from time to time, every shortcut has its own cost, which will need to be paid by taking a break and coming back to learning. Honestly, I’m yet to complete my first iteration of the loop even after almost 18 months.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Linear Algebra -&amp;gt; Calculus -&amp;gt; Probability -&amp;gt; Statistics -&amp;gt; Statistical Learning Theory -&amp;gt; Optimization Techniques -&amp;gt; Machine Learning/Deep Learning -&amp;gt; Programming Language -&amp;gt; Data Preprocessing, Analysis and Exploratory Data Analysis -&amp;gt; Mastering Machine Learning/Deep Learning libraries/frameworks -&amp;gt; relentless practice -&amp;gt; keeping up with the advancements -&amp;gt; loop back to the topic that you realize you need to revisit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are at a juncture where you are thinking of career options, here’s my advice. Assess yourself; you are the best person to evaluate what you can do. You don’t have to go by the hype and ride the same wave as everyone else. You can achieve mastery, do great work, make a lasting impact on the world, be proud of yourself and make your loved ones proud of you in any field. It need not be Machine Learning (or whatever is hottest at the moment). Some of the fields that I think are going to be big…&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Clean Energy - Global warming, depleting fossil fuels, increasing energy needs and a lot of other problems have one common solution - clean energy.&lt;/li&gt;
  &lt;li&gt;Quantum Computing - Once this is a reality, everything is gonna change. All the equations of the (tech) life we take for granted today will be turned upside down.&lt;/li&gt;
  &lt;li&gt;Health, Fitness, Diet and Nutrition - This will only become more and more important day by day.&lt;/li&gt;
  &lt;li&gt;Agricultural Advancements - with changing weather, water and ecological landscape, it’s important to evolve techniques of agriculture.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sun, 03 Sep 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/machine-learning-beyond-the-buzz.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/machine-learning-beyond-the-buzz.html</guid>
        <!--
        
        
        -->
      </item>
    
      <item>
        <title>Installing Tensorflow-GPU on Windows</title>
        <description>&lt;p&gt;To use the tensorflow library on a GPU, the NVIDIA CUDA Toolkit and cuDNN libraries need to be installed first. Installing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorflow-gpu&lt;/code&gt; is straightforward. Installing the NVIDIA CUDA Toolkit and cuDNN is slightly tricky; we need to ensure a couple of things so that the CUDA Toolkit, which tensorflow-gpu uses under the hood, works.&lt;/p&gt;

&lt;p&gt;Let’s begin…&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Download &lt;a href=&quot;https://developer.nvidia.com/cuda-downloads&quot; target=&quot;_blank&quot;&gt;CUDA Toolkit 8.0&lt;/a&gt; and &lt;a href=&quot;https://developer.nvidia.com/cudnn&quot; target=&quot;_blank&quot;&gt;cuDNN v5.1 for CUDA Toolkit 8.0&lt;/a&gt; from NVIDIA Developer portal&lt;/li&gt;
  &lt;li&gt;Install CUDA Toolkit 8.0
    &lt;ul&gt;
      &lt;li&gt;If the GPU is a recent one, the CUDA Toolkit may not have support for it yet. This shouldn’t stop you from using the CUDA Toolkit, but the installation warns that it couldn’t find compatible GPU hardware. In that case, go for a custom installation and choose not to install the drivers. This should be easy to identify in the custom install screen.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Copy the contents of the downloaded cuDNN archive into the CUDA Toolkit installation folder.&lt;/li&gt;
  &lt;li&gt;Add below two paths to the system ‘PATH’ environment variable (if CUDA Toolkit installation didn’t add it):
    &lt;ul&gt;
      &lt;li&gt;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64&lt;/li&gt;
      &lt;li&gt;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\CUPTI\libx64&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Ensure below three environment variables are available with value &lt;em&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0&lt;/code&gt;&lt;/em&gt; or wherever the CUDA toolkit is installed. Create the missing variables.
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUDA_HOME&lt;/code&gt; - this needs to be added manually&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUDA_PATH&lt;/code&gt; - this is usually created after installing CUDA Toolkit 8.0&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUDA_PATH_V8_0&lt;/code&gt; - this is usually created after installing CUDA Toolkit 8.0&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Installing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorflow-gpu&lt;/code&gt;
    &lt;ul&gt;
      &lt;li&gt;If tensorflow package without gpu support is installed, uninstall it.&lt;/li&gt;
      &lt;li&gt;Install tensorflow-gpu using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install tensorflow-gpu&lt;/code&gt; OR &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda install tensorflow-gpu&lt;/code&gt; (if using anaconda). If already installed, uninstall it and then reinstall. It didn’t work for me until I reinstalled.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tflearn&lt;/code&gt; is required, install it with one of the below commands
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install tflearn&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda install -c derickl tflearn&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the official tensorflow documentation guide for tensorflow-gpu setup &lt;a href=&quot;https://www.tensorflow.org/install/install_windows#requirements_to_run_tensorflow_with_gpu_support&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Thu, 17 Aug 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/installing-tensorflow-gpu-on-windows.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/installing-tensorflow-gpu-on-windows.html</guid>
        <!--
        
        
        -->
      </item>
    
      <item>
        <title>Learning from my first kaggle competition</title>
        <description>&lt;p&gt;My first Kaggle competition &lt;a href=&quot;https://www.kaggle.com/c/instacart-market-basket-analysis/&quot; target=&quot;_blank&quot;&gt;Instacart Market Basket Analysis&lt;/a&gt; has concluded. Though my ranking wasn’t impressive, I’ve learned a lot in this competition, from everyone. I’m happy that I’ve come away knowing quite a few things.&lt;/p&gt;

&lt;p&gt;Here are a few things I learned in this competition. I’ve missed some of these this time and hope to do better next time.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Begin by understanding the data, then spend maximum time on feature engineering up front, starting with a notebook and pen and hoping to leave no stone unturned. Though feature engineering ideas can come at any point of time, in my opinion separating this crucial step out as the first milestone helps in concentrating on building the final solution. Validate that merges/joins are happening the way they are intended to happen.&lt;/li&gt;
  &lt;li&gt;It’s important to use the correct evaluation metric during the model training.&lt;/li&gt;
  &lt;li&gt;Use the cross validation strategy (when) available natively with the library rather than doing it separately.&lt;/li&gt;
  &lt;li&gt;Don’t use a small early-stopping rounds value for XGBoost, LightGBM, etc.&lt;/li&gt;
  &lt;li&gt;Always keep one more held-out set apart from the cross-validation set used for model training and cross-validation scoring. Score on different metrics with the held-out set.&lt;/li&gt;
  &lt;li&gt;Keep training iterations (n_estimators, epochs) small and the learning rate a little bigger when just evaluating results/improvements locally; this saves a lot of time. They can be fine-tuned when required.&lt;/li&gt;
  &lt;li&gt;Create a custom data loader for large datasets which can load an appropriately sampled train set and multiple evaluation sets specified as percentages; this can be very useful when playing around with different models and approaches on such large datasets. This was my biggest roadblock as I was working with a huge dataset towards the end of the competition. Sampling a smaller dataset would have saved a lot of time; when working on such huge datasets it’s easy to exhaust even 32GB of RAM.&lt;/li&gt;
  &lt;li&gt;Work on two different approaches simultaneously if training is going to take a large amount of time.&lt;/li&gt;
  &lt;li&gt;When writing a submission file, always write a companion text file with the model, parameters and any other details that help you reproduce it (I once hit a logloss of 0.20+ but wasn’t able to reproduce it again).&lt;/li&gt;
  &lt;li&gt;Feature importance differs by algorithm (what LightGBM thinks is important may not hold for XGBoost), so feature selection (elimination) should be based on the specific model/algorithm.&lt;/li&gt;
  &lt;li&gt;To gain the most from a competition, it helps to focus on learning one thing. In this competition, the solution could have been time-series based, built on deep neural networks, on ensembling techniques, on feature engineering, etc. Focusing on learning one thing may not win the competition, but it eventually leads to invaluable learning of something new. I focused on stacking multiple models, feature selection and hyper-parameter tuning; these didn’t give me a good rank, but I learned them.&lt;/li&gt;
&lt;/ul&gt;
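Two of these lessons (keeping a final held-out set apart from the CV data, and writing a reproducibility sidecar next to every submission) can be sketched in Python. The file name, model name and parameter values below are purely illustrative:

```python
import json
import random

# Lesson: keep a final held-out set that is never touched during training/CV.
random.seed(42)
rows = list(range(1000))   # stand-in for row indices of the full dataset
random.shuffle(rows)
holdout = rows[:100]       # scored separately, on different metrics
train_cv = rows[100:]      # used for training and cross validation

# Lesson: every submission file gets a sidecar with everything
# needed to reproduce it later.
metadata = {
    "model": "lightgbm",   # illustrative values only
    "params": {"learning_rate": 0.1, "n_estimators": 200},
    "features": ["f1", "f2", "f3"],
    "cv_score": 0.412,
}
with open("submission_001.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```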
</description>
        <pubDate>Tue, 15 Aug 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/learning-from-my-first-kaggle-competition.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/learning-from-my-first-kaggle-competition.html</guid>
      </item>
    
      <item>
        <title>Significance of Interpretable Predictive Models</title>
        <description>&lt;p&gt;Humans have curious minds. We often ask why something is the way it is. It’s an innate quality that helps us understand the things around us, learn and grow.&lt;/p&gt;

&lt;p&gt;In Machine Learning, when a predictive model predicts something, we usually only get a confidence level associated with the prediction, in terms of probability. Beyond this, most predictive models aren’t very explanatory in nature. Every predictive algorithm answers &lt;em&gt;What?&lt;/em&gt; but most miss out on answering &lt;em&gt;How?&lt;/em&gt; and &lt;em&gt;Why?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider the problem of predicting a returning customer for a business that sells personal computers, peripherals and accessories. A machine learning model is built on past data comprising customer transactions, demographics, etc., to predict whether a customer will return to buy certain products. To do this, the model first finds patterns (the training phase) within the sample data (the training dataset). When the model is used to predict for an unseen customer, it finds which learned patterns that customer matches. Let’s consider one unseen customer C’s case…&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;[Input] C has purchased an ultra-notebook of the kind typically bought by frequently travelling professionals (assume we have all the details: order date, configuration, etc.)&lt;/li&gt;
  &lt;li&gt;[Prediction] Customer C will be a returning customer, and he will buy product P.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the above case, even if we knew the complete details of customer C (omitted here for simplicity), like PC configuration, order details, date, cost, etc., we wouldn’t be able to make out why customer C would return and buy product P.&lt;/p&gt;

&lt;p&gt;If the business knew what it takes (the patterns learned by the model) for a customer to return and buy product P, that could be of incredible value: the business could take decisions that turn more customers into returning customers. Without that, all the business can do is maintain inventory for predicted demand or carry out targeted marketing. Looking at the magnitude of impact, predictions help capitalize on returning customers, while the learned insights help the business retain more of them.&lt;/p&gt;

&lt;p&gt;In this case, the underlying patterns the model learned can be visualized in the form of a decision tree explaining the series of criteria that led to the outcome, or, at a much higher level, as reasoning given out in natural language.&lt;/p&gt;
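As a sketch of the kind of decision-tree explanation described above, scikit-learn’s export_text can print a fitted tree’s splits as human-readable rules. The customer features here (is_ultrabook_buyer, num_past_orders) are made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the customer example: two hypothetical features,
# target is whether the customer returned (1) or not (0).
X = [[1, 5], [1, 4], [0, 1], [0, 0], [1, 3], [0, 2]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the learned splits as readable if/else rules,
# the kind of explanation a business could actually act on.
rules = export_text(tree, feature_names=["is_ultrabook_buyer", "num_past_orders"])
print(rules)
```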

&lt;p&gt;Predictive models today are interpretable to varying extents, based on the nature of the underlying model. Some models, like deep neural networks, are barely interpretable at all.&lt;/p&gt;
</description>
        <pubDate>Fri, 21 Jul 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/significance-of-interpretable-predictive-models.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/significance-of-interpretable-predictive-models.html</guid>
      </item>
    
      <item>
        <title>Pondering about Artificial General Intelligence</title>
        <description>&lt;p&gt;Researchers in the fields of Machine Learning and Artificial Intelligence are working relentlessly in a quest to solve intelligence and build Artificial General Intelligence, aka Strong Artificial Intelligence. We see progress every once in a while as we witness jaw-dropping feats accomplished by intelligent systems built at various industrial and academic research labs. Two such feats are an &lt;a href=&quot;https://blogs.microsoft.com/next/2017/06/14/divide-conquer-microsoft-researchers-used-ai-master-ms-pac-man&quot; target=&quot;_blank&quot;&gt;AI system which mastered Ms. Pac-Man&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/AlphaGo&quot; target=&quot;_blank&quot;&gt;AlphaGo&lt;/a&gt; beating a human expert at the game of Go.&lt;/p&gt;

&lt;p&gt;Recent feats in AI systems represent great progress towards solving intelligence. Had we imagined such breakthroughs a few years ago, we would probably have considered those AI systems to be Artificial General Intelligence. Every breakthrough brings many learnings and opens new possibilities; when researchers find answers, they also find new questions. Solutions to such great problems are usually built on mathematical and scientific principles that have been around for quite some time. Once a breakthrough is achieved, the new knowledge/expertise is within reach, and it is no longer seen as Artificial General Intelligence because we know we aren’t there yet. But we don’t know how far we are from solving intelligence, so it appears to be a target that keeps moving farther away.&lt;/p&gt;

&lt;p&gt;I’ve been thinking about &lt;strong&gt;what would qualify as Artificial General Intelligence, and whether it will come from fundamental sciences yet to be invented or discovered.&lt;/strong&gt; There may be a field of study we are yet to discover that spans mathematics, computer science, neuroscience, cognitive science, statistics, probability, linguistics, etc. Humans, even as toddlers, demonstrate qualities that are studied in depth across these subjects, and a lot of work has gone into understanding how a baby’s brain works and how babies learn and grow. Children are the best manifestation of acquiring intelligence, which is exactly what Artificial General Intelligence is supposed to do; nothing else beats evolutionary growth. This brings me to the thought that intelligence can perhaps be solved by understanding a baby’s brain development without constraining ourselves to one field of research. We should treat understanding and reproducing human intelligence as interdisciplinary research, not as separate research areas in AI, neuroscience, statistics, probability, etc.&lt;/p&gt;

&lt;p&gt;Here are a few references/talks about how children think. Be ready to be awe-inspired if you haven’t explored this area before…&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.ted.com/talks/alison_gopnik_what_do_babies_think&quot; target=&quot;_blank&quot;&gt;What do babies think - Alison Gopnik - Ted Talk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.ted.com/talks/laura_schulz_the_surprisingly_logical_minds_of_babies&quot; target=&quot;_blank&quot;&gt;The surprisingly logical minds of babies - Laura Schulz - Ted Talk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.ted.com/talks/patricia_kuhl_the_linguistic_genius_of_babies&quot; target=&quot;_blank&quot;&gt;The linguistic genius of babies - Patricia Kuhl - Ted Talk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.slate.com/articles/health_and_science/science/2012/10/how_do_children_learn_so_quickly_bayesian_statistics_and_probabilities_help.html&quot; target=&quot;_blank&quot;&gt;Why Your 4-Year-Old Is As Smart as Nate Silver&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 19 Jul 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/pondering-about-artificial-general-intelligence.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/pondering-about-artificial-general-intelligence.html</guid>
      </item>
    
      <item>
        <title>Intuition Behind Encoded Vector Representation</title>
        <description>&lt;p&gt;In this post I will share the intuition behind encoded vectors, like the one-hot encoding used for categorical variables and the word embeddings used to represent words in Natural Language Processing. We represent non-numeric data as vectors for a very obvious reason: non-numeric values cannot be used in the computations required to learn/predict patterns with Machine Learning.&lt;/p&gt;

&lt;p&gt;These computations are almost always operations on vectors in multi-dimensional space (hyperspace). All the attributes of the data need to be represented as numeric values in the form of a vector. When an attribute is not numeric in nature and doesn’t have a magnitude/scale or anything of that sort, we encode its values as vectors.&lt;/p&gt;

&lt;p&gt;A categorical value is encoded as a one-hot vector, in other words a unit vector pointing along only one axis/dimension of the hyperspace. A word embedding, on the other hand, is a vector (which may or may not be a unit vector) that points into the hyperspace based on the dimensions it represents. For example, if the dimensions are animal, pet, fur, etc., dog and cat could be represented using similar or identical vectors, depending on the embedding strategy. If the vector represents a single word in a vocabulary, it can be a unit vector pointing along the single axis that word is represented by.&lt;/p&gt;

&lt;p&gt;Two things happen when non-numeric values are encoded. First, the encoded attribute adds new dimensions to the hyperspace, equal to the number of distinct values the attribute can take. Secondly, these vectors help cluster together similar observations (records), as they affect where an observation sits in the hyperspace.&lt;/p&gt;
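A minimal hand-rolled sketch of one-hot encoding makes both effects concrete: each distinct value becomes a new axis, and each observation becomes a unit vector along its own axis.

```python
# One-hot encoding by hand: every distinct category becomes one new
# dimension, and every value becomes a unit vector along its axis.
def one_hot(values):
    categories = sorted(set(values))            # fixed axis order
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

colors = ["red", "green", "blue", "green"]
vectors = one_hot(colors)
# 4 observations, each now a unit vector in a 3-dimensional space
# (axes in sorted order: blue, green, red)
print(vectors)  # -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```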

&lt;p&gt;Every encoding adds multiple dimensions to the hyperspace the problem is being solved in, and with them a level of complexity. To make this explicit, imagine a number line. Add a dimension and it’s an X-Y plane. Add one more and it’s an X-Y-Z space. Add one more and it’s a hyperspace, beyond which it’s hard to intuitively imagine the representation. It goes without saying that we should exercise caution while adding encoded representations.&lt;/p&gt;

&lt;p&gt;This post is the result of a sudden realization of the intuition behind encoded vector representations. These thoughts are how I understand this, and I could be incorrect. Please let me know what you think in the comments; they will help anyone who lands here, including me. It’s all about learning from each other, isn’t it?&lt;/p&gt;
</description>
        <pubDate>Wed, 05 Jul 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/intuition-behind-encoded-vector-representation.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/intuition-behind-encoded-vector-representation.html</guid>
      </item>
    
      <item>
        <title>Applied vs Theoretical Machine Learning</title>
        <description>&lt;p&gt;One can approach learning Machine Learning in one of two ways: Applied Machine Learning and Theoretical Machine Learning. The two paths are very different and empower an individual in different ways to make a difference and solve problems.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Applied Machine Learning is about understanding Machine Learning concepts at an abstract level, sufficient to solve problems by applying machine learning. This involves gaining expertise in the tools and libraries that implement Machine Learning algorithms at their core.&lt;/li&gt;
  &lt;li&gt;Theoretical Machine Learning, on the other hand, is about understanding the underlying algorithms, mathematics, probability theory, statistics and many other subjects/concepts at a fundamental level.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;applied-machine-learning&quot;&gt;Applied Machine Learning&lt;/h4&gt;
&lt;p&gt;Applied machine learning is about solving real-world problems. This is where the potential and impact of inventions/discoveries made through advancements in theoretical machine learning are realized. It’s all about data, and about seeing a difference in the lives of people first hand. Once a person understands the basics and the main concepts of machine learning, he/she can get started with applied machine learning. Expertise in applied machine learning comes with practice and with solving problem after problem; it takes understanding of the data and of the challenge a person/institution/society is facing. It is motivating to see results relatively quickly as one solves one problem after another.&lt;/p&gt;

&lt;h4 id=&quot;theoretical-machine-learning&quot;&gt;Theoretical Machine Learning&lt;/h4&gt;
&lt;p&gt;While Theoretical Machine Learning is exciting to learn, it is much more vast than applied machine learning at the point where one begins learning. As one studies, the subjects/concepts touch on many other concepts and subjects, and every new piece of theory might come with things that need a deeper look to understand. It elicits curiosity to learn more and intimidates at the same time. In the initial stages, it leaves the learner with the feeling of not knowing a lot of things, which is true. The path (well, it’s a graph :)) is long and needs a lot of learning before picking a particular subject/area to dive deeper into for further expertise/specialization.&lt;/p&gt;

&lt;p&gt;With Theoretical Machine Learning, the rewards (read: satisfaction/results/problem solving) might be slow, but they are very satisfying when achieved. One might invent a new algorithm or improve ways of doing things. The new discoveries/inventions may be realized as speedups of existing solutions, solutions to unsolved problems, etc. In other words, it opens the door to possibilities that are realized by applying the new inventions (applied machine learning).&lt;/p&gt;

&lt;p&gt;I’m trying to strike a balance between learning theoretical and applied machine learning. Both approaches excite me equally, albeit in different ways. Applied Machine Learning excites me because I am able to see the results of my work quickly and first hand, while Theoretical Machine Learning gives me the satisfaction of knowing something new at the end of the day (though I may or may not pursue the research path in the long run). If you are aware of reinforcement learning, the problem I’m facing can be aptly put as the &lt;em&gt;exploration (theoretical machine learning) versus exploitation (applied machine learning) dilemma&lt;/em&gt; :)&lt;/p&gt;

&lt;p&gt;If you are a self-taught (or just beginning) data scientist, you will understand that there are times when you are at full throttle (doing, seeing results, learning a lot) and times when things aren’t moving as fast. You may be left with the feeling that you are wasting a lot of time, that you aren’t making any progress, or that what you are reading/learning feels irrelevant. I’ve been through such times and I guess they might happen again. There will be ups and downs; it’s not easy to build expertise in such a field while keeping a day job. And never, ever ask whether it’s all worth it! When things aren’t moving, give it some time and hang on through the difficult stretches. Get back on track every time you feel derailed!&lt;/p&gt;

&lt;p&gt;Last but not least, these are just my views, coming straight from my heart. I don’t mean that one is superior to the other, or that one is more hard work, more rewarding or more satisfying than the other. Both are equally important, and both need each other.&lt;/p&gt;

&lt;p&gt;Machine Learning Practitioners and Researchers, please share your valuable advice/inputs in comments…&lt;/p&gt;
</description>
        <pubDate>Wed, 21 Jun 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/applied-vs-theoretical-machine-learning.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/applied-vs-theoretical-machine-learning.html</guid>
      </item>
    
      <item>
        <title>Introduction to Ensemble Learning Methods in ML</title>
        <description>&lt;p&gt;Machine Learning is advancing at a rapid pace, and it never ceases to surprise with new breakthroughs from time to time, be it IBM Watson’s Jeopardy! win or DeepMind’s AlphaGo’s recent win over a human expert in the game of Go. It may well be the next revolution in the history of mankind after the industrial revolution. Andrew Ng, best known for his introductory Machine Learning course, has aptly said: AI is the new electricity! It comes with unimaginable opportunities, only to be discovered in time.&lt;/p&gt;

&lt;p&gt;Many winning solutions in data science competitions nowadays are built using techniques known as Ensemble Learning. The dictionary meaning of ensemble is a &lt;em&gt;group&lt;/em&gt;. Likewise, in Machine Learning, an ensemble refers to multiple (different) models working together as a group. In real life, too, a group is often more intelligent than an individual: it’s quite natural to believe a piece of information that comes from multiple sources, and we often look for multiple opinions when one expert opinion isn’t convincing enough to act upon. The idea in Machine Learning is somewhat similar.&lt;/p&gt;

&lt;p&gt;While any type of Machine Learning algorithm can be used with ensembling techniques, most often some form of decision trees and/or random forests is used. In this post, we’ll go through a quick introduction to information entropy, decision trees and popular ensemble techniques.&lt;/p&gt;

&lt;h4 id=&quot;decision-tree-and-information-entropy&quot;&gt;&lt;strong&gt;Decision Tree and Information Entropy&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Entropy is a measure of impurity/noise in data. When we navigate through the data, reducing noise to find a piece of information/truth, we say that we have reached a pure form of the data, meaning zero entropy.&lt;/p&gt;

&lt;p&gt;Say, for example, we have data on all flight departures from a particular airport for the last year: airline carrier, departure gate, departure time, aircraft model and whether the flight was delayed. There could be many more attributes, but let’s consider these for now. Given this information, for future flights we would want to predict whether the flight will be &lt;em&gt;delayed&lt;/em&gt;. The existing data in our case has some entropy, and when we segregate it into two buckets based on whether the flight was delayed, we can say we have data with zero impurity in this particular respect.&lt;/p&gt;

&lt;p&gt;It might seem straightforward to identify the items in these two buckets: just split on whether the flight was &lt;em&gt;delayed&lt;/em&gt;. But the point is that for future flights that piece of information is missing, so we cannot do the split easily; this is the information impurity (entropy).&lt;/p&gt;

&lt;p&gt;We use decision trees to uncover the circumstances under which flights were delayed: though we don’t have the &lt;em&gt;delayed&lt;/em&gt; value for future flights, we do have the other fields, which give some idea of it. To give a quick sense of how decision trees do this, suppose we notice that 8 in every 10 flights departing from a particular gate were delayed (perhaps because of the airport layout and the gate’s location). If we make two buckets based on whether the departure was from that gate, we get one bucket where 80% of flights will likely be delayed, while we don’t know much about the other. This additional information we uncovered is known as information gain (a reduction in entropy); for future departures from that gate, we know there is an 80% chance the flight will be delayed. Decision trees learn to ask the right questions in the right order to identify the items in these buckets as quickly as possible, i.e., maximizing information gain (entropy reduction) in the fewest possible steps (questions/splits). With every split, the decision tree knows the probability of each outcome (delayed or not delayed) for both branches. Usually this sequence of questions (conditions identified) generalizes well enough to predict fairly accurately for future flights.&lt;/p&gt;
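The gate example can be worked out numerically. A minimal sketch of entropy and information gain in Python, with toy numbers mirroring the 8-in-10 figure above (the gate labels and counts are made up for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# Toy flight data: 1 = delayed, 0 = on time.
# 8 of the 10 flights departing from gate A were delayed.
gate    = ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"]
delayed = [ 1,   1,   1,   1,   1,   1,   1,   1,   0,   0,   0,   0,   0,   1 ]

parent = entropy(delayed)
# Split on the question "departed from gate A?" and compute
# the weighted entropy of the two child buckets.
left  = [d for g, d in zip(gate, delayed) if g == "A"]
right = [d for g, d in zip(gate, delayed) if g != "A"]
children = (len(left) / len(delayed)) * entropy(left) + \
           (len(right) / len(delayed)) * entropy(right)
info_gain = parent - children   # reduction in entropy from this split
print(round(info_gain, 3))      # -> 0.193
```

A decision tree greedily picks, at each node, the question with the largest such gain.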

&lt;h4 id=&quot;ensemble-methods&quot;&gt;&lt;strong&gt;Ensemble Methods&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Some of the common Ensemble Learning techniques are…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bagging&lt;/strong&gt;: Trains multiple models on random subsets of the samples and combines their results. The results can be combined in many ways, for instance a majority vote for classification and an average for regression. Learn more &lt;a href=&quot;https://en.wikipedia.org/wiki/Bootstrap_aggregating&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boosting&lt;/strong&gt;: Trains models iteratively, each one improving the predictions on observations the previous ones misclassified. Learn more &lt;a href=&quot;https://en.wikipedia.org/wiki/Boosting_(machine_learning)&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stacking&lt;/strong&gt;: Multiple algorithms are trained and their outputs (predictions) become features for the final model, sometimes referred to as a meta learner. Learn more &lt;a href=&quot;https://en.wikipedia.org/wiki/Ensemble_learning#Stacking&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
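The majority-vote combining step mentioned under bagging can be sketched in a few lines of Python; the three per-model prediction lists below are hypothetical:

```python
from collections import Counter

# Three (hypothetical) classifiers each predict a label per observation;
# the ensemble keeps the most common answer for each one.
def majority_vote(predictions_per_model):
    ensemble = []
    for preds in zip(*predictions_per_model):
        ensemble.append(Counter(preds).most_common(1)[0][0])
    return ensemble

model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # -> [1, 0, 1, 1]
```

For regression, the analogous combination would simply average the three predictions instead of voting.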

&lt;h4 id=&quot;further-learning&quot;&gt;Further Learning&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.khanacademy.org/computing/computer-science/informationtheory&quot; target=&quot;_blank&quot;&gt;Information Theory&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Entropy_(information_theory)&quot; target=&quot;_blank&quot;&gt;Information Entropy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Information_gain_in_decision_trees&quot; target=&quot;_blank&quot;&gt;Information Gain&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Ensemble_learning&quot; target=&quot;_blank&quot;&gt;Ensemble Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://xgboost.readthedocs.io/en/latest/&quot; target=&quot;_blank&quot;&gt;XGBoost&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Microsoft/LightGBM&quot; target=&quot;_blank&quot;&gt;LightGBM&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html&quot; target=&quot;_blank&quot;&gt;H2O&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.import.io/post/how-to-win-a-kaggle-competition/&quot; target=&quot;_blank&quot;&gt;Blog post about winning Kaggle competitions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;mentioned-in-post&quot;&gt;Mentioned in post&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Watson_(computer)#Jeopardy.21&quot; target=&quot;_blank&quot;&gt;IBM Watson winning jeopardy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/AlphaGo&quot; target=&quot;_blank&quot;&gt;AlphaGo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Andrew_Ng&quot; target=&quot;_blank&quot;&gt;Andrew Ng&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.coursera.org/learn/machine-learning&quot; target=&quot;_blank&quot;&gt;Andrew Ng’s Machine Learning Course&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://twitter.com/andrewyng/status/735874952008589312?lang=en&quot; target=&quot;_blank&quot;&gt;AI is the new electricity!&lt;/a&gt; tweet by Andrew Ng&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sat, 03 Jun 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/introduction-to-ensemble-learning-methods-in-ml.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/introduction-to-ensemble-learning-methods-in-ml.html</guid>
      </item>
    
      <item>
        <title>Building a Data Science Portfolio</title>
        <description>&lt;p&gt;Having a good portfolio is very important to an individual’s success. It brings opportunities and helps you get in touch with great people. Networking and new connections can bring a lot of mutual learning, as people with a like mindset who have worked on similar problems get in touch. It’s a win-win for everyone.&lt;/p&gt;

&lt;p&gt;I would recommend doing lots and lots of hands-on projects. At the beginner level, having different kinds of projects/datasets/problems helps maximize learning. At the intermediate/expert level, or when specializing, doing many different kinds of projects related to the specialization helps strengthen those skills.&lt;/p&gt;

&lt;p&gt;Secondly, as you work and gain expertise, you will build your own arsenal of code snippets that you might see yourself reusing often. Consider spinning them out into tools/libraries to give back to the community.&lt;/p&gt;

&lt;p&gt;Once you feel comfortable, start competing in hackathons. There are several opportunities online for all levels of expertise. One of the most notable is Kaggle.com. Start working on the challenges over there.&lt;/p&gt;

&lt;p&gt;GitHub is, no doubt, a nerd’s portfolio! Consider pushing most of your work to GitHub. If you don’t want to push something to a public GitHub repo, consider writing about it on your blog instead.&lt;/p&gt;

&lt;p&gt;Some tips for good presentation of repositories (portfolio)…&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Each project should have its own repository (needless to say, but I’ve seen people stuffing code into the same repo under a blanket name)&lt;/li&gt;
  &lt;li&gt;A neat, short README for each repository, briefly explaining the problem statement and the solution, preferably a single page at most.&lt;/li&gt;
  &lt;li&gt;Apart from the code and the introductory README, document the solution approach in detail, to show how the solution was built. It should include…
    &lt;ul&gt;
      &lt;li&gt;Problem statement&lt;/li&gt;
      &lt;li&gt;Info about dataset&lt;/li&gt;
      &lt;li&gt;Visualizations of data&lt;/li&gt;
      &lt;li&gt;Train, cross-validation, test and prediction performance charts&lt;/li&gt;
      &lt;li&gt;Accuracy, metrics and results&lt;/li&gt;
      &lt;li&gt;Closing notes: challenges faced, possible enhancements, etc.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;If the code is in a Jupyter notebook, the code and the detailed solution approach can be neatly presented together.&lt;/li&gt;
  &lt;li&gt;Last but not least, keep sharing your knowledge through a blog (as I’m doing!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the best!&lt;/p&gt;
</description>
        <pubDate>Fri, 26 May 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/building-a-data-science-portfolio.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/building-a-data-science-portfolio.html</guid>
      </item>
    
      <item>
        <title>My first Machine Learning Hackathon</title>
        <description>

&lt;h4 id=&quot;tldr&quot;&gt;&lt;em&gt;tl;dr&lt;/em&gt;&lt;/h4&gt;
&lt;blockquote&gt;
  &lt;p&gt;Sharing my Machine Learning hackathon participation experience. Hackathons are the best way to practice and get hands-on experience; they bring out the best in us every time, no exceptions. Look for hackathons that work for you: it’s better to work alongside other people than to solve problems in silos (for learning, at least).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hackathons magically raise the enthusiasm and excitement of solving a problem and take the game to an altogether different level. Last week I solved my first machine learning problem in an online hackathon, and I think hackathons bring out the best in us.&lt;/p&gt;

&lt;p&gt;HackerEarth hosted a &lt;a href=&quot;https://www.hackerearth.com/problem/machine-learning/bank-fears-loanliness/&quot; target=&quot;_blank&quot;&gt;Machine Learning Challenge&lt;/a&gt; where the task was to predict the probability of a loan being defaulted, based on a dataset of over 500,000 (5 lakh) records with 45 attributes/columns.&lt;/p&gt;

&lt;p&gt;Though I do solve machine learning problems now and then, I was still mostly in learning mode. Not anymore: this was the first decent, moderately difficult problem I solved, and the learnings have been immense. I solved the challenge in Python, achieving 97.6% accuracy; the solution is posted on &lt;a href=&quot;https://github.com/avannaldas/Loan-Defaulter-Prediction-Machine-Learning&quot; target=&quot;_blank&quot;&gt;GitHub&lt;/a&gt;. I got a sense of what it takes to improve accuracy point by point, pushing the limits and getting the most insight out of the data. That happens in hackathons because there is a leaderboard to compare numbers against, no matter where you stand on it. Seeing your accuracy ranked against other solutions is far more encouraging than solving the problem in silos: one might be content with 95% accuracy, but when we see that more is possible with the same dataset, we push the limits of what we think we can do. Throughout the 10-day hackathon I moved around the leaderboard, starting at 8th, rising to 4th at one point, and finally finishing 19th.&lt;/p&gt;

</description>
        <pubDate>Tue, 28 Mar 2017 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/my-first-machine-learning-hackathon.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/my-first-machine-learning-hackathon.html</guid>
      </item>
    
      <item>
        <title>Mathematical Thinking</title>
        <description>&lt;p&gt;Mathematics is a fascinating subject. That was not true for me just two days ago, when I started learning mathematics: not because I loved it, but because I realized it’s a very important subject. I need very strong fundamentals in mathematics if I want to learn Machine Learning, which I picked up 4 days ago! Isn’t that interesting? Yes, you need to pivot and change course to learn whatever it takes, and I prefer understanding the basics.&lt;/p&gt;

&lt;p&gt;So when I started learning mathematics two days ago, I learnt a few things from the math gurus who share their knowledge with the world through the internet (some useful links at the end of the page). I started loving mathematics and am now learning it in a way I had never imagined. The way mathematics is taught is the reason people hate it. That mathematics is inherently abstract is a wrong belief people hold in general; what is true is that the way people learn it, or have been taught it, is abstract.&lt;/p&gt;

&lt;p&gt;Learning mathematics can be great fun, and mathematics is all around us. Here are a few common misconceptions about it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Some people have an innate ability to do well in math; they are “math people”&lt;/li&gt;
  &lt;li&gt;Mathematics cannot be related to real life or learned through analogies&lt;/li&gt;
  &lt;li&gt;Mathematics is purely abstract&lt;/li&gt;
  &lt;li&gt;We need to memorize all the formulae&lt;/li&gt;
  &lt;li&gt;We need to memorize all the rules&lt;/li&gt;
  &lt;li&gt;Mathematics is all about numbers, rules and methods&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are tons of such misconceptions, and they make people treat mathematics differently and keep it at bay. To learn mathematics, it’s important to note a few points:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Anyone can learn mathematics and start loving it, provided the learning approach is changed&lt;/li&gt;
  &lt;li&gt;Mathematics is a study of patterns (as Mathematician Keith Devlin says)&lt;/li&gt;
  &lt;li&gt;It’s very important to understand and internalize the concepts rather than memorizing the formulae, procedures and methods of solving something&lt;/li&gt;
  &lt;li&gt;Solving a mathematical problem is not always fast; it takes time. And that’s where most people give up.&lt;/li&gt;
  &lt;li&gt;Imagine reading an article about anything and not understanding it even after trying very hard for a reasonable time. Difficult to imagine, right? We generally understand what we read within a few minutes; in the rare cases where the text is hard to comprehend or the concept is a bit tough, it takes a little longer, but we get there. Mathematics is different: it usually takes much longer. And that is exactly where mathematical understanding deepens, when we persevere.&lt;/li&gt;
  &lt;li&gt;It’s important to keep in mind that it’s perfectly fine to struggle with a problem. Struggle is where the search for different pathways and patterns begins; it also deepens the understanding of the concept and improves our relationship with numbers!&lt;/li&gt;
  &lt;li&gt;If you have ever solved a mathematical problem in a totally different way, even accidentally, you can imagine the satisfaction and happiness of finding a pathway to the solution that wasn’t taught in class or described in the textbook you were using. This is how learning should happen, naturally and intuitively. And then there would never be another boring math problem!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below are some quick references where you can start learning mathematics and change your perception about it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=3icoSeGqQtY&quot;&gt;How you can be good at math, and other surprising facts about learning - Jo Boaler - TEDxStanford&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=ytVneQUA5-c&quot;&gt;Five Principles of Extraordinary Math Teaching - Dan Finkel - TEDxRainier&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.coursera.org/learn/mathematical-thinking&quot;&gt;Introduction to Mathematical Thinking – Coursera.org&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.ted.com/topics/math&quot;&gt;Ted Talks about Mathematics&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Wed, 10 Aug 2016 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/mathematical-thinking.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/mathematical-thinking.html</guid>
      </item>
    
      <item>
        <title>Mathematics for Machine Learning</title>
<description>&lt;p&gt;I’ve just started learning Machine Learning, and I stumbled upon mathematical expressions I had never seen before! So I took a break from it and turned to first learning the required mathematics before getting back into Machine Learning.&lt;/p&gt;

&lt;p&gt;Good fundamentals in math subjects like Calculus and Linear Algebra help immensely in learning Machine Learning. As I started looking into what I needed to learn, questions popped up: where to start? What to learn? In what sequence? It took me a few days to figure out what mathematics needs to be studied for Machine Learning.&lt;/p&gt;

&lt;p&gt;Well, we are lucky to have numerous structured online learning resources and open-source learning content today. If you’d like to understand the math behind machine learning, you can get started quickly with Linear Algebra and Calculus basics. Links are at the end of the page; the first two would be sufficient to get started.&lt;/p&gt;

&lt;p&gt;Since it took me a few days to figure all this out when I started looking into Machine Learning, I hope this gives you a good head start.&lt;/p&gt;

&lt;p&gt;Useful links…&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.khanacademy.org/math/linear-algebra&quot; target=&quot;_blank&quot;&gt;Linear Algebra - Khan Academy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.coursera.org/learn/calculus1&quot; target=&quot;_blank&quot;&gt;Calculus One - Coursera.org&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://lagunita.stanford.edu/courses/Education/EDUC115-S/Spring2014/about&quot; target=&quot;_blank&quot;&gt;How to learn math - Stanford&lt;/a&gt; - A short course which introduces to a different approach of learning math&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://fastml.com/math-for-machine-learning/&quot; target=&quot;_blank&quot;&gt;Math for machine learning&lt;/a&gt; - An interesting blog post&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/ml/useful-stuff/&quot;&gt;Useful Stuff&lt;/a&gt; - Page of this blog&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 08 Aug 2016 00:00:00 +0000</pubDate>
        <link>http://abhijitannaldas.com/ml/mathematics-for-machine-learning.html</link>
        <guid isPermaLink="true">http://abhijitannaldas.com/ml/mathematics-for-machine-learning.html</guid>
      </item>
    
  </channel>
</rss>
