Abhijit AnnaldasFor the love of data and machines that can learn
http://abhijitannaldas.com/
Fri, 23 Jun 2017 11:29:14 +0000Fri, 23 Jun 2017 11:29:14 +0000Jekyll v3.4.3Applied vs Theoretical Machine Learning<p>One can approach learning Machine Learning in one of two ways: Applied Machine Learning or Theoretical Machine Learning. The two paths are very different and empower an individual in different ways to solve problems and make a difference.</p>
<ul>
<li>Applied Machine Learning is about understanding Machine Learning concepts at an abstract level, sufficient to solve problems using machine learning (applying machine learning). This involves gaining expertise in the tools and libraries that implement Machine Learning algorithms at their core.</li>
<li>Theoretical Machine Learning, on the other hand, is about understanding the underlying algorithms, mathematics, probability theory, statistics and many other subjects/concepts at a fundamental level.</li>
</ul>
<h4 id="applied-machine-learning">Applied Machine Learning</h4>
<p>Applied machine learning is about solving real world problems. This is where the potential and impact of new inventions/discoveries made through advancements in theoretical machine learning are realized. It’s all about data and seeing a difference in the lives of people first hand. Once a person understands the basics of machine learning and the main concepts, he/she can get started with applied machine learning. Expertise in applied machine learning comes with practice, solving one problem after another. It takes an understanding of the data and of the challenge a person/institution/society is facing. It is motivating to see results relatively quickly as one solves one problem after another.</p>
<h4 id="theoretical-machine-learning">Theoretical Machine Learning</h4>
<p>While Theoretical Machine Learning is exciting to learn, it is much more vast than applied machine learning, especially when one begins learning. As one studies, the subjects/concepts touch upon many other concepts and subjects. Every new piece of theoretical study might come with things that need a deeper look to understand further. It elicits curiosity to learn more and intimidates at the same time. In the initial stages, it’ll leave the learner with a feeling of not knowing a lot of things, which is true. In the case of theoretical machine learning, the path (well, it’s a graph :)) is long and needs a lot of learning before picking a particular subject/area to dive deeper into for further expertise/specialization.</p>
<p>With Theoretical Machine Learning, the rewards (read as satisfaction/results/problem solving) might come slowly, but are very satisfying when achieved. One might invent a new algorithm or improve existing ways of doing things. The new discoveries/inventions may be realized as speedups of existing solutions, solutions to unsolved problems, etc. In other words, it opens the door to possibilities realized by applying the new inventions (applied machine learning).</p>
<p>I’m trying to strike a balance between learning theoretical and applied machine learning. Both approaches excite me equally, albeit in different ways. Applied Machine Learning excites me because I am able to see the results of my work quickly and first hand. And Theoretical Machine Learning gives me the satisfaction of knowing something new at the end of the day (though I may or may not pursue the research path in the long run). If you are aware of reinforcement learning, the problem I’m facing can be aptly put as the <em>exploration (theoretical machine learning) versus exploitation (applied machine learning) dilemma</em> :)</p>
<p>If you are a self-taught (or just beginning) data scientist, you will understand that there are times when you are on full throttle (read as doing/seeing results/learning a lot) and times when things aren’t moving as fast. You may be left with a feeling that you are wasting a lot of time, that you aren’t making any progress, or that what you are reading/learning feels irrelevant. I’ve been through such times and I guess they might happen again. There will be ups and downs; it’s not easy to build expertise in such a field while keeping a day job. And never, ever wonder whether it’s all worth it! When things aren’t moving, just give it some time and hang on through the difficult stretch. Get back on track every time you feel derailed!</p>
<p>Last but not least, these are just my views, coming straight from my heart. I don’t mean that one is superior to the other, or that one is more hard work, more rewarding, or more satisfying than the other. Both are equally important and each needs the other.</p>
<p>Machine Learning practitioners and researchers, please share your valuable advice/inputs in the comments…</p>
Wed, 21 Jun 2017 00:00:00 +0000
http://abhijitannaldas.com/applied-vs-theoretical-machine-learning.html
http://abhijitannaldas.com/applied-vs-theoretical-machine-learning.htmlIntroduction to Ensemble Learning Methods in ML<p>Machine Learning is advancing at a rapid pace and never ceases to surprise with new breakthroughs from time to time, be it IBM Watson’s Jeopardy! win or the recent win of DeepMind’s AlphaGo over a human expert in the game of Go. It is certainly the next revolution in the history of mankind after the industrial revolution. Andrew Ng, best known for his introductory Machine Learning course, has rightly said: AI is the new electricity! It comes with unimaginable opportunities, only to be discovered by time.</p>
<p>A significant number of data science competition winning solutions today are built using various advanced forms of Decision Trees (ensembles of decision trees, to be more precise). An ensemble in the ML context can be understood in one line as a combination of different decision tree/random forest predictors.</p>
<p>In this post, I’ll briefly introduce some basics of Decision Trees, Random Forests and Ensemble Methods.</p>
<h4 id="information-entropy">Information Entropy</h4>
<p>With decision trees, we try to extract information/knowledge from the data we have so that we can apply it to present/future scenarios. We ask questions based on existing data to reduce information entropy (noise) and reveal information in its pure form (meaning zero entropy). The number of questions we need to ask to arrive at a conclusion/truth is directly proportional to the entropy.</p>
<p>Entropy is a measure of impurity/noise in data. When we navigate through the data reducing noise to find a piece of information/truth, we say that we have reached a pure form of data, meaning zero entropy.</p>
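<p>This definition can be made concrete in a few lines of code. The sketch below is my own illustration (not code from the original post): it computes Shannon entropy in bits for a list of class labels. A pure sample has zero entropy, while an evenly mixed two-class sample carries one full bit of uncertainty.</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    # Sum p * log2(1/p) over the classes; written as log2(total/count)
    # so a perfectly pure sample yields a clean 0.0 (not negative zero).
    return sum((count / total) * log2(total / count)
               for count in Counter(labels).values())

print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 (pure, zero entropy)
print(entropy(["yes", "yes", "no", "no"]))    # 1.0 (maximally mixed)
```

Each question a decision tree asks aims to drive this number toward zero in the resulting subsets.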
<h4 id="decision-trees">Decision Trees</h4>
<p>At the most fundamental level, decision trees are a series of very efficient if-then-else decisions (and much more, in fact!), similar to a programming construct. If you are new to Machine Learning, think of an analogy: how would you code when there are 10 variables whose values you need to consider across different cases? Most likely, you’d code so as to keep the nesting/complexity to a minimum. Decision Trees, too, ensure that the tree depth remains minimal. The actual decisions within the tree aren’t just Yes/No decisions. Every decision has associated weights/probabilities which carry information about both the left and right subtrees and their entropy levels. The probability at a particular tree node indicates the chance of a particular outcome at that level and below in the tree. At every decision in the tree, the reduction in entropy is known as Information Gain. Information Gain can be intuitively understood as the increasing level of understanding about a piece of information. In ML, the decision tree is built from historical data, and the quest for a particular piece of information about new data is answered using the tree, to uncover some truth or make a fair prediction of an event or value. This piece of information (event or value) is known as the target variable. A decision tree can help find the value/probability/fair guess of the target variable it was built for, based on the other data points involved.</p>
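<p>Information Gain, as described above, can be computed directly: it is the parent node’s entropy minus the size-weighted entropy of the child nodes a candidate split produces. A minimal sketch (my own toy illustration, with hypothetical labels):</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    total = len(parent)
    children = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(parent) - children

# Splitting a 50/50 node into two pure children recovers the full 1 bit.
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0
```

A tree-building algorithm evaluates many candidate splits this way and greedily picks the one with the highest gain at each node.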
<h4 id="ensemble-methods-decision-forest">Ensemble Methods (Decision Forest)</h4>
<p>Ensembles are another way of using Decision Trees. Ensemble literally means a group, and hence an ensemble of Decision Trees is known as a Decision Forest. So why do we want to group together multiple trees? It’s quite natural to believe a piece of information that comes from multiple sources. We look for multiple opinions in our daily life when one expert opinion isn’t convincing enough for us to act upon. Ensembles do the same thing: they take into consideration what multiple trees predict for a particular decision, and the majority decision of all those trees combined is considered final.</p>
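<p>The majority-decision idea is simple enough to sketch in a few lines. This is my own toy illustration (the labels are hypothetical), not code from any particular library:</p>

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one prediction per tree into a single ensemble decision."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical trees vote on one loan application; the majority wins.
tree_predictions = ["default", "repay", "default"]
print(majority_vote(tree_predictions))  # default
```

Real ensemble methods differ mainly in how the individual trees are trained and how their votes are weighted, which is what the next section touches on.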
<p>There are many ways decision tree ensembles work, based on how individual decision tree results are considered, weighted, evaluated and combined.</p>
<p>Some of the popular ensemble methods are <a href="https://en.wikipedia.org/wiki/AdaBoost" target="_blank">AdaBoost</a>, <a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating" target="_blank">Bagging</a>, <a href="https://en.wikipedia.org/wiki/Random_forest" target="_blank">Random Forest</a>, <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)" target="_blank">Boosting</a> and <a href="https://en.wikipedia.org/wiki/Gradient_boosting" target="_blank">Gradient Boosting</a>. Of all these, Gradient Boosting (Gradient Boosted Decision Trees) is the most recent and usually the most effective. Popular implementations of Gradient Boosted Decision Trees are available in libraries such as XGBoost, LightGBM and H2O, to name a few.</p>
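<p>Of the methods above, Bagging (and Random Forest, which builds on it) starts from a simple idea: train each tree on a bootstrap sample, i.e. a random draw from the training data of the same size, with replacement. A minimal sketch of that sampling step, assuming nothing beyond the standard library:</p>

```python
import random

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(42)  # fixed seed so the example is reproducible
data = list(range(10))
sample = bootstrap_sample(data, rng)
# Same size as the original, but some points repeat and others are left out,
# which gives each tree in the ensemble a slightly different view of the data.
print(len(sample), sorted(set(sample)))
```

Training many trees on such samples and averaging (or voting over) their predictions is what reduces the variance of a single deep tree.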
<h4 id="further-learning">Further Learning</h4>
<ul>
<li><a href="https://www.khanacademy.org/computing/computer-science/informationtheory" target="_blank">Information Theory</a></li>
<li><a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)" target="_blank">Information Entropy</a></li>
<li><a href="https://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a></li>
<li><a href="https://en.wikipedia.org/wiki/Ensemble_learning" target="_blank">Ensemble Learning</a></li>
<li><a href="https://en.wikipedia.org/wiki/Gradient_boosting" target="_blank">Gradient Boosting</a></li>
</ul>
<h4 id="mentioned-in-post">Mentioned in post</h4>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Andrew_Ng" target="_blank">Andrew Ng</a></li>
<li><a href="https://www.coursera.org/learn/machine-learning" target="_blank">Andrew Ng’s Machine Learning Course</a></li>
<li><a href="https://xgboost.readthedocs.io/en/latest/" target="_blank">XGBoost</a></li>
<li><a href="https://github.com/Microsoft/LightGBM" target="_blank">LightGBM</a></li>
<li><a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html" target="_blank">H2O</a></li>
<li><a href="https://twitter.com/andrewyng/status/735874952008589312?lang=en" target="_blank">AI is the new electricity!</a> tweet by Andrew Ng</li>
<li><a href="https://www.import.io/post/how-to-win-a-kaggle-competition/" target="_blank">Blog post about winning Kaggle competitions</a></li>
<li><a href="https://en.wikipedia.org/wiki/AlphaGo" target="_blank">AlphaGo</a></li>
<li><a href="https://en.wikipedia.org/wiki/Watson_(computer)#Jeopardy.21" target="_blank">IBM Watson winning jeopardy</a></li>
</ul>
Sat, 03 Jun 2017 00:00:00 +0000
http://abhijitannaldas.com/introduction-to-ensemble-learning-methods-in-ml.html
http://abhijitannaldas.com/introduction-to-ensemble-learning-methods-in-ml.htmlBuilding a Data Science Portfolio<p>Having a good portfolio is very important to an individual’s success. It brings opportunities and helps you get in touch with great people. Networking/new connections can bring a lot of mutual learning. People with a like mindset, those who have worked on similar problems, will get in touch. It’s a win-win for everyone.</p>
<p>I would recommend doing lots-n-lots of hands-on projects. At the beginner level, having different kinds of projects/datasets/problems helps maximize learning. At the intermediate/expert level, or when specializing, doing a lot of different kinds of projects related to the specialization under consideration helps strengthen skills.</p>
<p>Secondly, as you work and gain expertise, you will build your own arsenal of code snippets that you might see yourself reusing often. Consider spinning them out into tools/libraries to give back to the community.</p>
<p>Once you feel comfortable, start competing in hackathons. There are several opportunities online for all levels of expertise; one of the most notable is Kaggle.com. Start working on the challenges there.</p>
<p>GitHub is, no doubt, a nerd’s portfolio! Consider pushing most of your work to GitHub. If you don’t want to push your work to a GitHub public repo, consider writing about it on your blog.</p>
<p>Some tips for good presentation of repositories (portfolio)…</p>
<ul>
<li>Each project should have its own repository (needless to say, but I’ve seen people stuffing code into the same repo under a blanket name)</li>
<li>A neat and short ‘read me’ for each repository, explaining the problem statement and the solution briefly, preferably a single page at most.</li>
<li>Apart from the code and the introductory read me, document the solution approach in detail. The purpose of this is to show how the solution was built. It should include…
<ul>
<li>Problem statement</li>
<li>Info about dataset</li>
<li>Visualizations of data</li>
<li>Train, cross-validation, test and prediction performance charts</li>
<li>Accuracy, metrics and results</li>
<li>Closing notes: challenges faced, possible enhancements, etc.</li>
</ul>
</li>
<li>If the code is in a Jupyter notebook, the code and the detailed solution approach can be neatly presented together.</li>
<li>Last but not least, keep sharing your knowledge through a blog (as I’m doing!)</li>
</ul>
<p>All the best!</p>
Fri, 26 May 2017 00:00:00 +0000
http://abhijitannaldas.com/building-a-data-science-portfolio.html
http://abhijitannaldas.com/building-a-data-science-portfolio.htmlMy first Machine Learning Hackathon<p><br /></p>
<h4 id="tldr"><em>tl;dr</em></h4>
<blockquote>
<p>Sharing my Machine Learning hackathon participation experience. Hackathons are the best way to practice and get hands-on experience. They bring out the best in us every time, no exceptions. Look for hackathons that work for you; it’s better to work alongside people than to solve in silos (for learning, at least).</p>
</blockquote>
<p>Hackathons magically raise the enthusiasm and excitement of solving a problem. They take the game to an altogether different level. Last week I solved my first machine learning problem in an online hackathon. I think hackathons bring out the best in us.</p>
<p>HackerEarth hosted a <a href="https://www.hackerearth.com/problem/machine-learning/bank-fears-loanliness/" target="_blank">Machine Learning Challenge</a> where the task was to predict the probability of a loan being defaulted, based on a dataset of over 5 lakh (500,000) records with 45 attributes/columns.</p>
<p>Though I solve some machine learning problems now and then, I was still mostly in learning mode. Not anymore: this was the first decent, moderately difficult problem I solved, and the learnings have been immense. I solved the challenge in Python, achieving 97.6% accuracy; the solution is posted on <a href="https://github.com/avannaldas/Loan-Defaulter-Prediction-Machine-Learning" target="_blank">GitHub</a>. I got a sense of what it takes to improve accuracy point by point, pushing the limits and getting the most insight out of the data. And it all happens in hackathons, where there is a leaderboard to compare numbers, no matter where you stand on it. It’s encouraging to see your accuracy figures ranked against other solutions, as opposed to solving the problem in silos. One might be content with 95% accuracy, but when we see it’s possible to do more with the same dataset, we push the limits of what we think we can do. Throughout the 10-day hackathon I moved around the leaderboard: starting at 8th, rising to 4th at one point, and finally finishing at 19th.</p>
Tue, 28 Mar 2017 00:00:00 +0000
http://abhijitannaldas.com/my-first-machine-learning-hackathon.html
http://abhijitannaldas.com/my-first-machine-learning-hackathon.htmlMy Data Science Journey<p>I am Abhijit Annaldas, a Software Engineer who recently fell seriously in love with Data Science. I’ve been learning a lot about Mathematics, Machine Learning and Deep Learning lately. And sometimes, for a change, I read/watch random physics/science topics ranging from astronomical concepts to basic physics. I recently came to know about the Feynman Technique, so I thought I’d share what I learn through this blog. I’m planning to get my hands dirty with the things I read/learn about and use GitHub as my playground, both for doing and for talking.</p>
<p>Apart from my recently found love, I’m a Software Engineer at Microsoft India. I have been blogging about various/random technology topics on my <a target="_blank" href="http://abhijitannaldas.com/">other blog</a>. Since I wanted to start with a clean slate, all about data science, I’m starting a new blog here.</p>
Tue, 20 Dec 2016 00:00:00 +0000
http://abhijitannaldas.com/my-data-science-journey.html
http://abhijitannaldas.com/my-data-science-journey.htmlMathematical Thinking<p>Mathematics is a fascinating subject. That was not true for me just two days ago, when I started learning mathematics. Not because I loved it, but because I realized it’s a very important subject: I need very strong fundamentals in Mathematics if I want to learn Machine Learning, which I picked up 4 days ago! Isn’t that interesting? Yes, you need to pivot and change course to learn whatever it takes. I prefer understanding the basics.</p>
<p>So when I started learning Mathematics 2 days ago, I realized and learnt a few things from the Math gurus who share their knowledge with the world through the internet (some useful links at the end of the page). I started loving Mathematics and am now learning it in a way I had never imagined. The way mathematics is taught is the reason people hate it. That Mathematics is abstract in nature is a wrong belief people hold in general; on the flip side, it is true that the way people learn, or have been taught, is abstract in nature.</p>
<p>Learning mathematics can be great fun. Mathematics is all around us. Here are a few more common misconceptions about Mathematics:</p>
<ol>
<li>Some people have inherent capabilities to do well in math (they are ‘math people’)</li>
<li>Mathematics can never be related to real life or learned with analogies</li>
<li>Mathematics is just abstract</li>
<li>We need to memorize all the formulae</li>
<li>We need to memorize all the rules</li>
<li>Mathematics is all about numbers, rules, methods</li>
</ol>
<p>There are tons of such misconceptions about mathematics, which make people treat it differently and keep it at bay. To learn mathematics, it’s important to note a few points:</p>
<ol>
<li>Anyone can learn mathematics and start loving it, provided the learning approach is changed</li>
<li>Mathematics is a study of patterns (as Mathematician Keith Devlin says)</li>
<li>It’s very important to understand and internalize the concepts rather than memorizing the formulae, procedures and methods of solving something</li>
<li>Solving a mathematical problem is not always fast; it takes time. And that’s where most people give up.</li>
<li>Imagine reading an article (about anything) and not understanding it even after trying very hard for a reasonable time. Difficult to imagine, right? We generally understand what we read very quickly, usually within a few minutes; in rare cases, where we cannot comprehend the text easily or the concept is tough, it takes a little longer, but we do understand it. This is not the case with Mathematics: it usually takes longer. And that’s where mathematical understanding deepens, when we persevere.</li>
<li>It’s important to keep in mind that it’s perfectly fine to struggle with a problem. Struggling is where the search for different pathways and patterns begins. This struggle also deepens the understanding of the concept and improves one’s relationship with numbers!</li>
<li>If you have ever solved a mathematical problem in a totally different way, even accidentally, you know the realization: oh, it can be solved this way! You can imagine the satisfaction and happiness of finding a pattern/pathway to a solution that wasn’t taught in class or described in the textbook you were referring to. This is how learning should be: natural and intuitive. And there would never be another boring math problem!</li>
</ol>
<p>Below are some quick references where you can start learning mathematics and change your perception about it:</p>
<ol>
<li><a href="https://www.youtube.com/watch?v=3icoSeGqQtY">How you can be good at math, and other surprising facts about learning - Jo Boaler - TEDxStanford</a></li>
<li><a href="https://www.youtube.com/watch?v=ytVneQUA5-c">Five Principles of Extraordinary Math Teaching - Dan Finkel - TEDxRainier</a></li>
<li><a href="https://www.coursera.org/learn/mathematical-thinking">Introduction to Mathematical Thinking – Coursera.org</a></li>
<li><a href="http://www.ted.com/topics/math">Ted Talks about Mathematics</a></li>
</ol>
Wed, 10 Aug 2016 00:00:00 +0000
http://abhijitannaldas.com/mathematical-thinking.html
http://abhijitannaldas.com/mathematical-thinking.htmlMathematics for Machine Learning<p>I’ve just started learning Machine Learning. I stumbled upon mathematical expressions I had never seen before! So I took a break from it and turned to first learn the required Mathematics before getting into Machine Learning.</p>
<p>Good fundamentals in subjects like Calculus and Linear Algebra help immensely in learning Machine Learning. As I started looking into what I needed to learn, questions like where to start, what to learn, and in what sequence started popping up. It took me a few days to figure out what Mathematics needs to be studied for Machine Learning.</p>
<p>Well, we are lucky to have numerous structured online learning resources and open-source learning content today. In case you’d like to understand the math behind machine learning, you can get started quickly with Linear Algebra and Calculus basics; links are at the end of the page, and the first two are sufficient to get started.</p>
<p>Since it took me a few days to figure all this out when I started looking into Machine Learning, I hope this gives you a good head start.</p>
<p>Useful links…</p>
<ul>
<li><a href="https://www.khanacademy.org/math/linear-algebra" target="_blank">Linear Algebra - Khan Academy</a></li>
<li><a href="https://www.coursera.org/learn/calculus1" target="_blank">Calculus One - Coursera.org</a></li>
<li><a href="https://lagunita.stanford.edu/courses/Education/EDUC115-S/Spring2014/about" target="_blank">How to learn math - Stanford</a> - A short course which introduces to a different approach of learning math</li>
<li><a href="http://fastml.com/math-for-machine-learning/" target="_blank">Math for machine learning</a> - An interesting blog post</li>
<li><a href="https://avannaldas.github.io/useful-stuff/">Useful Stuff</a> - Page of this blog</li>
</ul>
Mon, 08 Aug 2016 00:00:00 +0000
http://abhijitannaldas.com/mathematics-for-machine-learning.html
http://abhijitannaldas.com/mathematics-for-machine-learning.html