What is Python? Python is one of the preferred programming languages for working in Test Automation, Web Scraping, Data Analytics, and Machine Learning domains. Python is easy to learn, highly readable, and simple to use. It has a clean, English-like syntax that requires less code and lets the programmer focus on the business logic. The sections below cover Python history, features, domains, why to learn Python, how to install and run Py...
Essential Libraries and Tools Understanding what scikit-learn is and how to use it is important, but there are a few other libraries that will enhance your experience. Scikit-learn is built on top of the NumPy and SciPy scientific Python libraries. In addition to knowing about NumPy and SciPy, we will be using Pandas and matplotlib. We will also introduce the Jupyter Notebook, which is a browser-based interactive programming environmen...
Motivation Cross-validation is a way to validate your model against new data. The most effective forms of cross-validation involve repeatedly testing a model against a dataset until every point or combination of points has been used to validate the model, though this comes with performance trade-offs. We discussed several methods of splitting a dataset for cross-validation: Holdout Method: Splitting a percent of data off as test data K-Fo...
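As a minimal sketch of the holdout method named above, the snippet below shuffles a dataset and splits a fraction of it off as test data. The function name `holdout_split` and its parameters `test_fraction` and `seed` are hypothetical, chosen only for illustration; they are not taken from the original text or from any library.

```python
import random


def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and hold out `test_fraction` of it as test data.

    Hypothetical helper for illustration only; returns (train, test).
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(data)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one point so the test set is never empty.
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(10), test_fraction=0.3)
print(len(train), len(test))
```

Every point lands in exactly one of the two lists, so the model is never evaluated on data it was trained on.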
Motivation When we are presented with a data set, we try to figure out what it means. We look for connections between the data points and see if we can find any patterns. Sometimes those patterns are hard to see, so we use code to help us find them. There are lots of different patterns data can follow, so it helps if we can narrow down those options and write less code to analyze them. One of those patterns is a linear relationship. If we can find this pattern ...
Overview When using machine learning, there are many ways to go wrong. Some of the most common issues in machine learning are overfitting and underfitting. To understand these concepts, let’s imagine a machine learning model that is trying to learn to classify numbers and has access to a training set of data and a testing set of data. Overfitting A model suffers from overfitting when it has learned too much from the training...
Motivation Consider the following scenario. You are making a peanut butter sandwich and are trying to adjust ingredients so that it has the best taste. You might consider the type of bread, type of peanut butter, or peanut-butter-to-bread ratio in your decision-making process. But would you consider other factors like how warm it is in the room, what you had for breakfast, or what color socks you’re wearing? You probably wouldn’t, as these things don...
Apriori Algorithm for Association Rule Mining Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. In this lab we will study the theory behind the Apriori algorithm and will later implement the Apriori algorithm in Python. Theory of Apriori Algorithm There are three major components of the Apriori algorithm: 1. Support 2. Confidence 3. Lift We ...
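The three components listed above can be sketched on a toy dataset. This example assumes the standard textbook definitions: support is the fraction of transactions containing an itemset, confidence is support(A and B) / support(A), and lift is confidence(A -> B) / support(B). The transactions and the function names are made up for illustration, and this computes only the metrics, not the full Apriori candidate-generation procedure.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(support({"bread", "milk"}, transactions))  # 0.5
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

A lift below 1, as in this toy data, suggests the two items appear together less often than they would if they were independent.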
FP-growth algorithm Have you ever gone to a search engine, typed in a word or part of a word, and the search engine automatically completed the search term for you? Perhaps it recommended something you didn’t even know existed, and you searched for that instead. This requires a way to find frequent itemsets efficiently. The FP-growth algorithm finds frequent itemsets or pairs (sets of things that commonly occur together) by storing the dataset in a special structure calle...
Introduction Logistic regression is a method for binary classification. It works to divide points in a dataset into two distinct classes, or categories. For simplicity, let’s call them class A and class B. The model will give us the probability that a given point belongs in class B. If that probability is lower than 50%, we classify the point in class A. Otherwise, it falls in class B. It’s also important to note that logistic regression is better for this purpose th...
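The decision rule described above can be sketched as follows. This assumes the usual logistic (sigmoid) function applied to a weighted sum of the features. The weights, bias, and function names are hypothetical values picked for illustration, not a trained model.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_of_b(features, weights, bias):
    """Estimated probability that a point belongs to class B."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Class A if the probability of B is below the threshold, otherwise class B."""
    return "B" if probability_of_b(features, weights, bias) >= threshold else "A"


# Hypothetical one-feature model: weight 1.0, bias 0.0.
weights, bias = [1.0], 0.0
print(classify([2.0], weights, bias))   # B
print(classify([-2.0], weights, bias))  # A
```

In practice the weights and bias would be learned from labeled training data rather than set by hand.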
Motivation Decision trees are easily created, visualized, and interpreted. Because of this, they are typically the first method used to model a dataset. The hierarchical structure and categorical nature of a decision tree make it highly intuitive to implement. Decision trees expand logarithmically based on the number of data points you have, meaning larger datasets will impact the tree creation process less than other classifiers. Because of the tree structure...
Introduction K-Nearest Neighbors (KNN) is a basic classifier for machine learning. A classifier takes an already labeled data set and then tries to label new data points with one of the categories. So, we are trying to identify what class an object is in. To do this, we look at the closest points (neighbors) to the object, and the class that holds the majority among those neighbors is the class we assign to the object. The k is the number of nearest neighbors to ...
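The majority-vote idea described above can be sketched in a few lines. This example assumes Euclidean distance (via `math.dist`, available in Python 3.8 and later) and a small made-up dataset. The function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter


def knn_classify(point, labeled_points, k=3):
    """Label `point` with the majority class among its k nearest neighbors.

    `labeled_points` is a list of (coordinates, label) pairs.
    """
    # Sort the labeled points by Euclidean distance to the query and keep the closest k.
    neighbors = sorted(labeled_points, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up labeled data: two well-separated clusters.
labeled_points = [
    ((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
    ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue"),
]

print(knn_classify((0.5, 0.5), labeled_points))  # red
print(knn_classify((5.5, 5.5), labeled_points))  # blue
```

An odd k is a common choice for two-class problems because it avoids tied votes.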
What is it? Naive Bayes is a classification technique that uses probabilities we already know to determine how to classify input. These probabilities are related to existing classes and what features they have. In the example above, we choose the class that most resembles our input as its classification. This technique is based on Bayes’ Theorem. If you’re unfamiliar with what Bayes’ Theorem is, don’t worry! We will explain it in the next sect...