Beginner’s Guide to Data Science: 10 Basic Concepts to Learn
10 basic concepts of data science that every beginner should know.
Data science is a blend of tools, algorithms, and machine learning principles used to discover hidden patterns in raw data. What sets it apart from statistics is that data scientists also use advanced machine learning algorithms to predict the occurrence of future events. A data scientist will look at the data from many angles, sometimes from angles not considered before.
Data Visualization
Data visualization is one of the most important branches of data science and one of the main tools used to analyze and study relationships between variables. Visualizations such as scatter plots, line graphs, bar plots, histograms, Q-Q plots, smooth density plots, box plots, pair plots, and heat maps can be used for descriptive analytics. Data visualization is also used throughout machine learning: in data preprocessing and analysis, feature selection, model building, model testing, and model evaluation.
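As a quick illustration of a few of these plot types, here is a minimal sketch assuming Python with matplotlib and seaborn (neither library is prescribed by this article) and seaborn's bundled iris dataset, which is fetched on first use:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load a small sample dataset bundled with seaborn.
iris = sns.load_dataset("iris")

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Scatter plot: relationship between two variables.
axes[0].scatter(iris["sepal_length"], iris["petal_length"], alpha=0.6)
axes[0].set(xlabel="sepal length", ylabel="petal length", title="Scatter plot")

# Histogram: distribution of a single variable.
axes[1].hist(iris["sepal_length"], bins=20)
axes[1].set(xlabel="sepal length", title="Histogram")

# Box plot: spread and potential outliers per species.
sns.boxplot(data=iris, x="species", y="sepal_length", ax=axes[2])
axes[2].set(title="Box plot")

plt.tight_layout()
plt.show()
```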
Outliers
An outlier is a data point that is very different from the rest of the dataset. Outliers are often just bad data, created by a malfunctioning sensor, a contaminated experiment, or human error in recording the data. Sometimes, however, an outlier indicates something real, such as a malfunction in a system. Outliers are common and should be expected in large datasets. One common way to detect them is with a box plot.
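Box plots flag points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule in Python with NumPy, using made-up numbers:

```python
import numpy as np

# Toy data with one obvious outlier (hypothetical values).
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# A box plot flags points beyond 1.5 * IQR from the quartiles as outliers.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [102]
```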
Data Imputation
Most datasets contain missing values. The easiest way to deal with missing data is simply to throw away the affected data points, but removing samples can discard valuable information. A better option is often interpolation, which estimates a missing value from the other training samples in the dataset. One of the most common interpolation techniques is mean imputation, where a missing value is replaced with the mean of the entire feature column.
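A minimal sketch of mean imputation, assuming scikit-learn (one of several libraries that provide it) and a small hypothetical feature matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Feature matrix with missing entries (hypothetical values).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 6.0]])

# Replace each missing value with the mean of its feature column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# Column means: (1 + 7 + 4) / 3 = 4.0 and (2 + 3 + 6) / 3 ≈ 3.67
```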
Data Scaling
Data scaling helps improve the quality and predictive power of a model. It is achieved by normalizing or standardizing real-valued input and output variables. There are two main types of data scaling: normalization, which rescales values into a fixed range such as [0, 1], and standardization, which rescales values to have zero mean and unit variance.
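A minimal sketch of both approaches, assuming scikit-learn and a hypothetical single-feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])  # hypothetical single feature

# Normalization: rescale values into the [0, 1] range.
print(MinMaxScaler().fit_transform(X).ravel())   # [0.    0.444 1.   ]

# Standardization: rescale to zero mean and unit variance.
print(StandardScaler().fit_transform(X).ravel()) # roughly [-1.18 -0.09  1.27]
```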
Principal Component Analysis
Large datasets with hundreds or thousands of features often contain redundancy, especially when the features are correlated with one another. Training a model on a high-dimensional dataset with too many features can sometimes lead to overfitting. Principal Component Analysis (PCA) is a statistical method used for feature extraction on high-dimensional, correlated data. The basic idea of PCA is to transform the original feature space into the space of the principal components, keeping only the components that capture most of the variance in the data.
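A minimal sketch with scikit-learn (an assumption, not the article's prescription), projecting the four correlated iris features onto two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # four correlated features

# Standardize first, since PCA is sensitive to feature scales.
X_std = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto the top 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```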
Linear Discriminant Analysis
The goal of Linear Discriminant Analysis (LDA) is to find the feature subspace that optimizes class separability while reducing dimensionality. Because it uses the known class labels to find that subspace, LDA is a supervised algorithm.
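A minimal sketch with scikit-learn, reducing the iris features to two discriminant axes; note that, unlike PCA, the class labels y are passed to fit:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unlike PCA, LDA uses the class labels y to find directions
# that maximize the separation between the classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)  # (150, 2)
```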
Data Partitioning
In machine learning, the dataset is often partitioned into training and testing sets. The model is trained on the training dataset and then tested on the testing dataset. The testing dataset thus acts as the unseen dataset, which can be used to estimate a generalization error (the error expected when the model is applied to a real-world dataset after the model has been deployed).
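A minimal sketch using scikit-learn's train_test_split, holding out 30% of the samples as the testing set (the split ratio and random seed here are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples as the unseen testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```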
Supervised Learning
Supervised learning algorithms learn by studying the relationship between the feature variables and a known target variable. Supervised learning has two subcategories: regression, where the target variable is continuous, and classification, where the target variable is discrete.
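A minimal classification sketch with scikit-learn (logistic regression is just one possible choice of supervised algorithm); for a continuous target, one would instead use a regressor such as LinearRegression:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Discrete target (classification): predict the iris species from its measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)          # learn from labeled examples
print(clf.score(X_test, y_test))   # accuracy on held-out data
```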
Unsupervised Learning
Unsupervised learning deals with unlabeled data or data of unknown structure. Using unsupervised learning techniques, one can explore the structure of the data and extract meaningful information without the guidance of a known outcome variable or reward function. K-means clustering is an example of an unsupervised learning algorithm.
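A minimal sketch with scikit-learn's KMeans on synthetic, unlabeled data (the number of clusters and the generated blobs are arbitrary choices for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-means discovers the cluster structure without any target labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster assignments for the first samples
print(kmeans.cluster_centers_)  # learned cluster centers
```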
Reinforcement Learning
In reinforcement learning, the goal is to develop a system (an agent) that improves its performance through interactions with its environment. Since the information about the current state of the environment typically includes a so-called reward signal, reinforcement learning can be thought of as a field related to supervised learning; the difference is that the feedback is not a ground-truth label but a reward that measures how good an action was.
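As a toy illustration, here is a minimal sketch of Q-learning, one classic reinforcement learning algorithm, on a hypothetical five-state corridor (the environment, rewards, and hyperparameters are all made up for the example):

```python
import random

# Toy environment: states 0..4 in a row; reaching state 4 yields reward 1.
N_STATES = 5
ACTIONS = [+1, -1]                     # move right or left
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: usually exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0  # the reward signal
        # Move the estimate toward the reward plus the discounted future value.
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy moves right (+1) from every state.
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)])
```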