[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9790":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":15,"starSnapshotCount":15,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},9790,"datascience","sreeharierk\u002Fdatascience","sreeharierk","This repository is a compilation of free resources for learning Data Science.","https:\u002F\u002Ftwitter.com\u002Fsreeharierk",null,5155,528,378,5,0,3,39.17,"GNU General Public License v3.0",false,"main",true,[23,24,25,26,27,28,29,30,31],"artificial-intelligence","computer-vision","data-science","datascienceproject","deeplearning","machine-learning","machine-learning-algorithms","natural-language-processing","neural-networks","2026-06-12 02:02:12","# Data-Scientist-Roadmap (2021)\n\n![roadmap-picture](http:\u002F\u002Fnirvacana.com\u002Fthoughts\u002Fwp-content\u002Fuploads\u002F2013\u002F07\u002FRoadToDataScientist1.png)\n\n****\n\n# 1_ Fundamentals\n\n\n## 1_ Matrices & Algebra fundamentals\n\n### About\n\nIn mathematics, a matrix is a __rectangular array of numbers, symbols, or expressions, arranged in rows and columns__. 
A submatrix of a matrix is obtained by deleting any collection of rows and\u002For columns.\n\n![matrix-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fb\u002Fbb\u002FMatrix.svg)\n\n### Operations\n\nThere are a number of basic operations that can be applied to modify matrices:\n\n* [Addition](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatrix_addition)\n* [Scalar Multiplication](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FScalar_multiplication)\n* [Transposition](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTranspose)\n* [Multiplication](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatrix_multiplication)\n\n\n## 2_ Hash function, binary tree, O(n)\n\n### Hash function\n\n#### Definition\n\nA hash function is __any function that can be used to map data of arbitrary size to data of fixed size__. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookup by detecting duplicated records in a large file.\n\n![hash-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F5\u002F58\u002FHash_table_4_1_1_0_0_1_0_LL.svg)\n\n### Binary tree\n\n#### Definition\n\nIn computer science, a binary tree is __a tree data structure in which each node has at most two children__, which are referred to as the left child and the right child.\n\n![binary-tree-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Ff\u002Ff7\u002FBinary_tree.svg)\n\n### O(n)\n\n#### Definition\n\nIn computer science, big O notation is used to __classify algorithms according to how their running time or space requirements grow as the input size grows__. 
In analytic number theory, big O notation is often used to __express a bound on the difference between an arithmetical function and a better understood approximation__.\n\n## 3_ Relational algebra, DB basics\n\n### Definition\n\nRelational algebra is a family of algebras with a __well-founded semantics used for modelling the data stored in relational databases__, and defining queries on it.\n\nThe main application of relational algebra is providing a theoretical foundation for __relational databases__, particularly query languages for such databases, chief among which is SQL.\n\n### Natural join\n\n#### About\n\nIn SQL, a natural join between two tables can be performed if:\n\n* At least one column has the same name in both tables\n* These two columns have the same data type\n    * CHAR (character)\n    * INT (integer)\n    * FLOAT (floating point numeric data)\n    * VARCHAR (variable-length character string)\n    \n#### MySQL request\n\n        SELECT \u003CCOLUMNS>\n        FROM \u003CTABLE_1>\n        NATURAL JOIN \u003CTABLE_2>\n\n        SELECT \u003CCOLUMNS>\n        FROM \u003CTABLE_1>, \u003CTABLE_2>\n        WHERE TABLE_1.ID = TABLE_2.ID\n\n## 4_ Inner, Outer, Cross, theta-join\n\n### Inner join\n\nThe INNER JOIN keyword selects records that have matching values in both tables.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      INNER JOIN table2 ON table1.column_name = table2.column_name;\n\n![inner-join-image](https:\u002F\u002Fwww.w3schools.com\u002Fsql\u002Fimg_innerjoin.gif)\n\n### Outer join\n\nThe FULL OUTER JOIN keyword returns all records when there is a match in either the left (table1) or the right (table2) table records.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      FULL OUTER JOIN table2 ON table1.column_name = table2.column_name; \n\n![outer-join-image](https:\u002F\u002Fwww.w3schools.com\u002Fsql\u002Fimg_fulljoin.gif)\n\n### Left join\n\nThe LEFT JOIN keyword returns all records from the left table 
(table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      LEFT JOIN table2 ON table1.column_name = table2.column_name;\n\n![left-join-image](https:\u002F\u002Fwww.w3schools.com\u002Fsql\u002Fimg_leftjoin.gif)\n\n### Right join\n\nThe RIGHT JOIN keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      RIGHT JOIN table2 ON table1.column_name = table2.column_name;\n\n![right-join-image](https:\u002F\u002Fwww.w3schools.com\u002Fsql\u002Fimg_rightjoin.gif)\n\n## 5_ CAP theorem\n\nIt is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:\n \n* Every read receives the most recent write or an error.\n* Every request receives a (non-error) response – without guarantee that it contains the most recent write.\n* The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.\n\nIn other words, the CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP Theorem is quite different from the consistency guaranteed in ACID database transactions.\n\n## 6_ Tabular data\n\nTabular data is __opposed to relational__ data, such as data stored in an SQL database.\n\nIn tabular data, __everything is arranged in columns and rows__. Every row has the same number of columns (except for missing values, which can be substituted by \"N\u002FA\").\n\nThe __first line__ of tabular data is most of the time a __header__, describing the content of each column.\n\nThe most used format of tabular data in data science is __CSV__. 
Each column is separated from its neighbours by a delimiter character (a tab, a comma, ...).\n\n## 7_ Entropy\n\nEntropy is a __measure of uncertainty__. High entropy means the data has high variance and thus contains a lot of information and\u002For noise.\n\nFor instance, __a constant function where f(x) = 4 for all x has no entropy and is easily predictable__, has little information, has no noise and can be succinctly represented. Similarly, f(x) = ~4 has some entropy while f(x) = random number has very high entropy due to noise.\n\n## 8_ Data frames & series\n\nA data frame is used for storing data tables. It is a list of vectors of equal length.\n\nA series is an ordered sequence of data points.\n\n## 9_ Sharding\n\n*Sharding* is **horizontal (row-wise) database partitioning** as opposed to **vertical (column-wise) partitioning** which is *Normalization*\n\nWhy use Sharding?\n\n1. Database systems with large data sets or high throughput applications can challenge the capacity of a single server.\n2. Two methods to address the growth: Vertical Scaling and Horizontal Scaling\n3. Vertical Scaling\n\n    * Involves increasing the capacity of a single server\n    * But due to technological and economical restrictions, a single machine may not be sufficient for the given workload.\n\n4. Horizontal Scaling\n    * Involves dividing the dataset and load over multiple servers, adding additional servers to increase capacity as required\n    * While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed high-capacity server. 
\n    * The idea is to use concepts from distributed systems to achieve scale\n    * But it comes with the same tradeoff of increased complexity that comes hand in hand with distributed systems.\n    * Many database systems provide horizontal scaling by sharding the datasets.\n\n## 10_ OLAP\n\nOnline analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing. \n\nOLAP is part of the __broader category of business intelligence__, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include __business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture__.\n\nThe term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).\n\n## 11_ Multidimensional Data model\n\n## 12_ ETL\n\n* Extract\n  * extracting the data from multiple heterogeneous source system(s)\n  * data validation to confirm whether the data pulled has the correct\u002Fexpected values in a given domain\n\n* Transform\n  * extracted data is fed into a pipeline which applies multiple functions on top of data\n  * these functions intend to convert the data into the format which is accepted by the end system\n  * involves cleaning the data to remove noise, anomalies and redundant data\n* Load\n  * loads the transformed data into the end target\n\n## 13_ Reporting vs BI vs Analytics\n\n## 14_ JSON and XML\n\n### JSON\n\nJSON is a language-independent data format. 
Example describing a person:\n\t\n\t{\n\t  \"firstName\": \"John\",\n\t  \"lastName\": \"Smith\",\n\t  \"isAlive\": true,\n\t  \"age\": 25,\n\t  \"address\": {\n\t    \"streetAddress\": \"21 2nd Street\",\n\t    \"city\": \"New York\",\n\t    \"state\": \"NY\",\n\t    \"postalCode\": \"10021-3100\"\n\t  },\n\t  \"phoneNumbers\": [\n\t    {\n\t      \"type\": \"home\",\n\t      \"number\": \"212 555-1234\"\n\t    },\n\t    {\n\t      \"type\": \"office\",\n\t      \"number\": \"646 555-4567\"\n\t    },\n\t    {\n\t      \"type\": \"mobile\",\n\t      \"number\": \"123 456-7890\"\n\t    }\n\t  ],\n\t  \"children\": [],\n\t  \"spouse\": null\n\t}\n\n### XML\n\nExtensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.\n \n \t\u003CCATALOG>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Bloodroot\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Sanguinaria canadensis\u003C\u002FBOTANICAL>\n\t    \u003CZONE>4\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Shady\u003C\u002FLIGHT>\n\t    \u003CPRICE>$2.44\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>031599\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Columbine\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Aquilegia canadensis\u003C\u002FBOTANICAL>\n\t    \u003CZONE>3\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Shady\u003C\u002FLIGHT>\n\t    \u003CPRICE>$9.37\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>030699\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Marsh Marigold\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Caltha palustris\u003C\u002FBOTANICAL>\n\t    \u003CZONE>4\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Sunny\u003C\u002FLIGHT>\n\t    \u003CPRICE>$6.81\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>051799\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t\u003C\u002FCATALOG>\n\n## 15_ NoSQL\n\nNoSQL is opposed to relational databases (stands for __N__ot 
__O__nly __SQL__). Data are not structured and there's no notion of keys between tables.\n\nAny kind of data can be stored in a noSQL database (JSON, CSV, ...) without thinking about a complex relational schema.\n\n__Commonly used noSQL stacks__: Cassandra, MongoDB, Redis, Oracle noSQL ...\n\n## 16_ Regex\n\n### About\n\n__Reg__ ular __ex__ pressions (__regex__) are commonly used in computing.\n\nThey can be used for a wide range of tasks:\n* Replacing text\n* Extracting information from a text (email, phone number, etc)\n* Listing files with the .txt extension ..\n\nhttp:\u002F\u002Fregexr.com\u002F is a good website for experimenting with regex.\n\n### Usage\n\nTo use them in [Python](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fre.html), just import:\n\n    import re\n\n## 17_ Vendor landscape\n\n## 18_ Env Setup\n\n# 2_ Statistics\n\n\n[Statistics-101 for data noobs](https:\u002F\u002Fmedium.com\u002F@debuggermalhotra\u002Fstatistics-101-for-data-noobs-2e2a0e23a5dc)\n\n## 1_ Pick a dataset\n\n### Datasets repositories\n\n#### Generalists\n\n- [KAGGLE](https:\u002F\u002Fwww.kaggle.com\u002Fdatasets)\n- [Google](https:\u002F\u002Ftoolbox.google.com\u002Fdatasetsearch)\n\n#### Medical\n\n- [PMC](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002F)\n\n#### Other languages\n\n##### French\n\n- [DATAGOUV](https:\u002F\u002Fwww.data.gouv.fr\u002Ffr\u002F)\n\n## 2_ Descriptive statistics\n\n### Mean\n\nIn probability and statistics, population mean and expected value are used synonymously to refer to one __measure of the central tendency either of a probability distribution or of the random variable__ characterized by that distribution.\n\nFor a data set, the terms arithmetic mean, mathematical expectation, and sometimes average are used synonymously to refer to a central value of a discrete set of numbers: specifically, the __sum of the values divided by the number of 
values__.\n\n![mean_formula](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Fbd2f5fb530fc192e4db7a315777f5bbb5d462c90)\n\n### Median\n\nThe median is the value __separating the higher half of a data sample, a population, or a probability distribution, from the lower half__. In simple terms, it may be thought of as the \"middle\" value of a data set.\n\n### Descriptive statistics in Python\n\n[Numpy](http:\u002F\u002Fwww.numpy.org\u002F) is a Python library widely used for statistical analysis.\n\n#### Installation\n\n    pip3 install numpy\n\n#### Utilization\n    \n    import numpy\n\n## 3_ Exploratory data analysis\n\nThis step includes visualization and analysis of data. \n\nRaw data may possess improper distributions of data which may lead to issues moving forward.\n\nDuring application we must also know the distribution of the data, for instance whether the data is linearly or spirally distributed.\n\n[Guide to EDA in Python](https:\u002F\u002Ftowardsdatascience.com\u002Fdata-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-1-eda-49ce99e36655)\n\n##### Libraries in Python \n\n[Matplotlib](https:\u002F\u002Fmatplotlib.org\u002F)\n\nLibrary used to plot graphs in Python\n\n__Installation__:\n\n    pip3 install matplotlib\n\n__Utilization__:\n\n    import matplotlib.pyplot as plt\n\n[Pandas](https:\u002F\u002Fpandas.pydata.org\u002F)\n\nLibrary used to work with large datasets in Python\n\n__Installation__:\n\n    pip3 install pandas\n\n__Utilization__:\n\n    import pandas as pd\n    \n[Seaborn](https:\u002F\u002Fseaborn.pydata.org\u002F)\n\nYet another graph plotting library in Python.\n\n__Installation__:\n\n    pip3 install seaborn\n\n__Utilization__:\n\n    import seaborn as sns\n\n\n#### PCA\n\nPCA stands for principal component analysis.\n\nWe often need to know the shape of the data distribution, as we have seen previously. 
We need to plot the data to do so.\n\nData can be multidimensional, that is, a dataset can have multiple features. \n\nWe can plot only two dimensional data, so, for multidimensional data, we project the multidimensional distribution in two dimensions, preserving the principal components of the distribution, in order to get an idea of the actual distribution through the 2D plot. \n\nIt is used for dimensionality reduction also. Often it is seen that several features do not significantly contribute any important insight to the data distribution. Such features create complexity and increase the dimensionality of the data. Such features are not considered, which reduces the dimensionality of the data.\n\n[Mathematical Explanation](https:\u002F\u002Fmedium.com\u002Ftowards-artificial-intelligence\u002Fdemystifying-principal-component-analysis-9f13f6f681e6)\n\n[Application in Python](https:\u002F\u002Ftowardsdatascience.com\u002Fdata-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-2-pca-feature-92f8f6ec8c8)\n\n## 4_ Histograms\n\nHistograms are a representation of the distribution of numerical data. The procedure consists of binning the numeric values using range divisions, i.e., the entire range in which the data varies is split into several fixed intervals. The count or frequency of occurrences of the numbers falling in each bin is then represented.\n\n[Histograms](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHistogram)\n\n![plot](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F1\u002F1d\u002FExample_histogram.png\u002F220px-Example_histogram.png)\n\nIn Python, __Pandas__, __Matplotlib__ and __Seaborn__ can be used to create histograms.\n\n## 5_ Percentiles & outliers\n\n### Percentiles\n\nPercentiles are numerical measures in statistics which represent what percentage of the data falls below a given value in a numerical data distribution. 
\n\nFor instance, if we say the 70th percentile, it means that 70% of the data in the distribution is below the given numerical value. \n\n[Percentiles](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPercentile)\n\n### Outliers\n\nOutliers are (numerical) data points which differ significantly from the other data points. They differ from the majority of points in the distribution. Such points may distort the central measures of the distribution, such as the mean and, to a lesser extent, the median. So, they need to be detected and removed.\n\n[Outliers](https:\u002F\u002Fwww.itl.nist.gov\u002Fdiv898\u002Fhandbook\u002Fprc\u002Fsection1\u002Fprc16.htm)\n\n__Box Plots__ can be used to detect outliers in the data. They can be created using the __Seaborn__ library.\n\n![Image_Box_Plot](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F612\u002F1*105IeKBRGtyPyMy3-WQ8hw.png)\n  \n## 6_ Probability theory\n\n__Probability__ is the likelihood of an event in a random experiment. For instance, if a fair coin is tossed, the chance of getting a head is 50%, so the probability is 0.5.\n\n__Sample Space__: It is the set of all possible outcomes of a random experiment. \n__Favourable Outcomes__: The set of outcomes we are looking for in a random experiment\n\n__Probability = (Number of Favourable Outcomes) \u002F (Total Number of Outcomes in the Sample Space)__\n\n__Probability theory__ is a branch of mathematics that is associated with the concept of probability.\n\n[Basics of Probability](https:\u002F\u002Ftowardsdatascience.com\u002Fbasic-probability-theory-and-statistics-3105ab637213)\n\n## 7_ Bayes theorem\n\n### Conditional Probability:\n\nIt is the probability of one event occurring, given that another event has already occurred. 
So, it gives a sense of the relationship between two events and the probabilities of the occurrences of those events.\n\nIt is given by:\n\n__P( A | B )__ : Probability of occurrence of A, given that B has occurred.\n\nThe formula is given by: \n\n![formula](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F74cbddb93db29a62d522cd6ab266531ae295a0fb)\n\nSo, P(A|B) is equal to the probability of occurrence of both A and B, divided by the probability of occurrence of B.\n\n[Guide to Conditional Probability](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FConditional_probability)\n\n### Bayes Theorem\n\nBayes theorem provides a way to calculate conditional probability. Bayes theorem is widely used in machine learning, most notably in Bayesian classifiers.  \n\nAccording to Bayes theorem, the probability of A given that B has already occurred is the probability of A multiplied by the probability of B given that A has already occurred, divided by the probability of B.\n\n__P(A|B) =  P(A).P(B|A) \u002F P(B)__\n\n\n[Guide to Bayes Theorem](https:\u002F\u002Fmachinelearningmastery.com\u002Fbayes-theorem-for-machine-learning\u002F)\n\n\n## 8_ Random variables\n\nRandom variables are the numeric outcomes of an experiment or of random events. They normally take values from a set. 
\n\nThere are two main types of Random Variables:\n\n__Discrete Random Variables__: Such variables take only a countable number of distinct values (finite or countably infinite).\n\n__Continuous Random Variables__: Such variables can take any value within an interval, so they have uncountably many possible values.\n\n\n## 9_ Cumul Dist Fn (CDF)\n\nIn probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable __X__, or just distribution function of __X__, evaluated at __x__, is the probability that __X__ will take a value less than or equal to __x__.\n\nThe cumulative distribution function of a real-valued random variable X is the function given by:\n\n![CDF](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Ff81c05aba576a12b4e05ee3f4cba709dd16139c7)\n\nResource:\n\n[Wikipedia](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCumulative_distribution_function)\n\n## 10_ Continuous distributions\n\nA continuous distribution describes the probabilities of the possible values of a continuous random variable. A continuous random variable is a random variable with a set of possible values (known as the range) that is infinite and uncountable.\n\n## 11_ Skewness\n\nSkewness is the measure of asymmetry in the data distribution or a random variable distribution about its mean. \n\nSkewness can be positive, negative or zero. 
\n\n![skewed image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Ff\u002Ff8\u002FNegative_and_positive_skew_diagrams_%28English%29.svg\u002F446px-Negative_and_positive_skew_diagrams_%28English%29.svg.png)\n\n__Negative skew__: The distribution is concentrated on the right; the left tail is longer.\n\n__Positive skew__: The distribution is concentrated on the left; the right tail is longer.\n\nVariation of central tendency measures are shown below.\n\n\n![cet](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Fc\u002Fcc\u002FRelationship_between_mean_and_median_under_different_skewness.png\u002F434px-Relationship_between_mean_and_median_under_different_skewness.png)\n\nData distributions are often skewed, which may cause trouble when processing the data. __A positively skewed distribution of positive values can often be made more symmetric by taking the log of the values__.\n\n##### Skew Distribution\n\n![Skew](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F379\u002F1*PLSczKIQRc8ZtlvHED-6mQ.png)\n\n##### Log of the Skew Distribution.\n\n![log](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F376\u002F1*4GFayBYKIiqAcyI69wIFzA.png)\n\n\n[Guide to Skewness](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSkewness)\n\n\n## 12_ ANOVA\n\nANOVA stands for __analysis of variance__. \n\nIt is used to compare among groups of data distributions.\n\nOften we are provided with huge data. They are too huge to work with. The total data is called the __Population__.\n\nIn order to work with them, we pick random smaller groups of data. They are called __Samples__.\n\nANOVA is used to compare the variance among these groups or samples. \n\nThe variance of a group is given by:\n\n![var](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F446\u002F1*yzAMFVIEFysMKwuT0YHrZw.png)\n\nThe differences in the collected samples are observed using the differences between the means of the groups. 
We often use the __t-test__ to compare the means and also to check if the samples belong to the same population.\n\nNow, a t-test can only compare two groups. But often we get more groups or samples.\n\nIf we try to use the t-test for more than two groups, we have to perform t-tests multiple times, once for each pair. This is where ANOVA is used.\n\nANOVA has two components:\n\n__1.Variation within each group__\n\n__2.Variation between groups__\n\nIt works on a ratio called the  __F-Ratio__\n\nIt is given by:\n\n![F-ratio](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F491\u002F1*I5dSwtUICySQ5xvKmq6M8A.png)\n\nThe F ratio shows how much of the total variation comes from the variation between groups and how much comes from the variation within groups. If much of the variation comes from the variation between groups, it is more likely that the means of the groups are different. However, if most of the variation comes from the variation within groups, then we can conclude that the elements within a group differ, rather than the groups as a whole. The larger the F ratio, the more likely that the groups have different means.\n\n\nResources:\n\n[Definition](https:\u002F\u002Fstatistics.laerd.com\u002Fstatistical-guides\u002Fone-way-anova-statistical-guide.php)\n\n[GUIDE 1](https:\u002F\u002Ftowardsdatascience.com\u002Fanova-analysis-of-variance-explained-b48fee6380af)\n\n[Details](https:\u002F\u002Fmedium.com\u002F@StepUpAnalytics\u002Fanova-one-way-vs-two-way-6b3ff87d3a94)\n\n\n## 13_ Prob Den Fn (PDF)\n\nIt stands for probability density function. 
\n\n__In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.__\n\nThe probability density function (PDF) P(x) of a continuous distribution is defined as the derivative of the (cumulative) distribution function D(x).\n\nThe probability that the random variable falls within a given range is given by the integral of the PDF over that range.\n\n![PDF](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F45fd7691b5fbd323f64834d8e5b8d4f54c73a6f8)\n\n## 14_ Central Limit theorem\n\n## 15_ Monte Carlo method\n\n## 16_ Hypothesis Testing\n\n### Types of curves\n\nWe need to know about two distribution curves first.\n\nDistribution curves reflect the probability of finding an instance or a sample of a population at a certain value of the distribution.\n\n__Normal Distribution__\n\n![normal distribution](https:\u002F\u002Fsciences.usca.edu\u002Fbiology\u002Fzelmer\u002F305\u002Fnorm\u002Fstanorm.jpg)\n\nThe normal distribution represents how the data is distributed. In this case, most of the data samples in the distribution are scattered at and around the mean of the distribution. A few instances are scattered or present at the long tail ends of the distribution.\n\nA few points about Normal Distributions are:\n\n1. The curve is always bell-shaped. This is because most of the data is found around the mean, so the probability of finding a sample at the mean or central value is higher.\n\n2. The curve is symmetric\n\n3. The area under the curve is always 1. This is because all the points of the distribution must be present under the curve\n\n4. For a Normal Distribution, the mean and median coincide. 
\n\n__Standard Normal Distribution__\n\nA standard normal distribution is a normal distribution that satisfies the following conditions.\n\n1. Mean of the distribution is 0\n\n2. The Standard Deviation of the distribution is equal to 1.\n\nThe idea of Hypothesis Testing works completely on the data distributions.\n\n### Hypothesis Testing\n\nHypothesis testing is a statistical method that is used in making statistical decisions using experimental data. A hypothesis is basically an assumption that we make about a population parameter.\n\nFor example, say, we take the hypothesis that boys in a class are taller than girls. \n\nThe above statement is just an assumption on the population of the class.\n\n__Hypothesis__ is just an assumptive proposal or statement made on the basis of observations made on a set of information or data. \n\nWe initially propose two mutually exclusive statements based on the population of the sample data. \n\nThe initial one is called __NULL HYPOTHESIS__. It is denoted by H0.\n\nThe second one is called __ALTERNATE HYPOTHESIS__. It is denoted by H1 or Ha. It is used as a contrary to Null Hypothesis. \n\nBased on the instances of the population we accept or reject the NULL Hypothesis and correspondingly we reject or accept the ALTERNATE Hypothesis.\n \n#### Level of Significance\n\nIt is the threshold we use to decide whether to accept or reject the NULL hypothesis. When we consider a hypothesis on a population, it is not the case that 100% of the instances of the population abide by the assumption, so we decide a __level of significance as a cutoff degree, i.e., if our level of significance is 5%, and (100-5)% = 95% of the data abides by the assumption, we accept the Hypothesis.__\n\n__It is said with 95% confidence, the hypothesis is accepted__\n\n![curve](https:\u002F\u002Fi.stack.imgur.com\u002Fd8iHd.png)\n\nThe non-reject region is called __acceptance region or beta region__. The rejection regions are called __critical or alpha regions__. 
__alpha__ denotes the __level of significance__.\n\nIf the level of significance is 5%, the two alpha regions have (2.5+2.5)% of the population and the beta region has the 95%. \n\nThe acceptance and rejection gives rise to two kinds of errors:\n\n__Type-I Error:__ NULL Hypothesis is true, but wrongly rejected.\n\n__Type-II Error:__ NULL Hypothesis is false, but wrongly accepted.\n\n![hypothesis](https:\u002F\u002Fmicrobenotes.com\u002Fwp-content\u002Fuploads\u002F2020\u002F07\u002FGraphical-representation-of-type-1-and-type-2-errors.jpg)\n\n### Tests for Hypothesis\n\n__One Tailed Test__: \n\n![One-tailed](https:\u002F\u002Fprwatech.in\u002Fblog\u002Fwp-content\u002Fuploads\u002F2019\u002F07\u002Fonetailtest.png)\n\nThis is a test for Hypothesis, where the rejection region is only on one side of the sampling distribution. The rejection region may be in the right tail end or in the left tail end.\n\nThe idea is if we say our level of significance is 5% and we consider a hypothesis \"Height of Boys in a class is \u003C=6 ft\". We consider the hypothesis true if at most 5% of our population are more than 6 feet. So, this will be one-tailed as the test condition only restricts one tail end, the end with height > 6ft. \n\n__Two Tailed Test__: \n\n![Two Tailed](https:\u002F\u002Fi0.wp.com\u002Fwww.real-statistics.com\u002Fwp-content\u002Fuploads\u002F2012\u002F11\u002Ftwo-tailed-significance-testing.png)\n\nIn this case, the rejection region extends at both tail ends of the distribution.\n\nThe idea is if we say our level of significance is 5% and we consider a hypothesis \"Height of Boys in a class is !=6 ft\".\n\nHere, we can accept the NULL hypothesis iff at most 5% of the population is less than or greater than 6 feet. So, it is evident that the critical region will be at both tail ends and the region is 5% \u002F 2 = 2.5% at both ends of the distribution. 
\n\n\n\n## 17_ p-Value\n\nBefore we jump into P-values we need to look at another important topic in the context: Z-test.\n\n### Z-test\n\nWe need to know two terms: __Population and Sample.__\n\n__Population__ describes the entire available data distributed. So, it refers to all records provided in the dataset.\n\n__Sample__ is said to be a group of data points randomly picked from a population or a given distribution. The size of the sample can be any number of data points, given by __sample size.__\n\n__Z-test__ is simply used to determine if a given sample distribution belongs to a given population. \n\nNow,for Z-test we have to use __Standard Normal Form__ for the standardized comparison measures.\n\n![std1](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F700\u002F1*VYCN5b-Zubr4rrc9k37SAg.png)\n\nAs we already have seen, standard normal form is a normal form with mean=0 and standard deviation=1.\n\nThe __Standard Deviation__ is a measure of how much differently the points are distributed around the mean.\n\n![std2](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F640\u002F1*kzFQaZ08dTjlPq1zrcJXgg.png)\n\nIt states that approximately 68% , 95% and 99.7% of the data lies within 1, 2 and 3 standard deviations of a normal distribution respectively.\n\nNow, to convert the normal distribution to standard normal distribution we need a standard score called Z-Score.\nIt is given by:\n\n![Z-score](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F125\u002F1*X--kDNyurDEo2zKbSDDf-w.png)\n\nx = value that we want to standardize\n\nµ = mean of the distribution of x\n\nσ = standard deviation of the distribution of x\n\nWe need to know another concept __Central Limit Theorem__.\n\n##### Central Limit Theorem \n\n_The theorem states that the mean of the sampling distribution of the sample means is equal to the population mean irrespective if the distribution of population where sample size is greater than 30._\n\nAnd\n\n_The sampling distribution of sampling mean will also 
follow the normal distribution._\n\nSo, it states that if we draw several samples of size above 30 from a distribution, compute the mean of each sample, and use those sample means to create a new distribution, the mean of this sampling distribution is equal to the original population mean.\n\nAccording to the theorem, if we draw samples of size N from a population with population mean μ and population standard deviation σ, the following condition holds:\n\n![std3](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F121\u002F0*VPW964abYGyevE3h.png)\n\ni.e., the mean of the distribution of sample means is equal to the population mean μ.\n\nThe standard deviation of the sample means is given by:\n\n![std4](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F220\u002F0*EMx4C_A9Efsd6Ef6.png)\n\nThe above term, σ divided by the square root of N, is also called the standard error.\n\nWe use the theory discussed above for the Z-test. If the sample mean lies close to the population mean, we say that the sample belongs to the population, and if it lies far from the population mean, we say the sample is taken from a different population.\n\nTo do this we compute the z statistic and check whether its absolute value is greater than or less than 1.96 (considering a two tailed test with level of significance = 5%).\n\n![los](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F424\u002F0*C9XaCIUWoJaBSMeZ.gif)\n\n![std5](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F137\u002F1*DRiPmBtjK4wmidq9Ha440Q.png)\n \n The above formula gives the Z-statistic, the difference between the sample mean and the population mean divided by the standard error:\n\nz = z statistic\n\nX̄ = sample mean\n\nμ = population mean\n\nσ = population standard deviation\n\nn = sample size\n\nNow, as the Z-score is used to standardize the distribution, it gives us an idea of how the data is distributed overall.\n\n### P-values\n\nIt is used to check if the results are statistically significant based on the significance level.  \n\nSay, we perform an experiment and collect observations or data. 
Now, we state a primary hypothesis (the null hypothesis) and a second hypothesis that contradicts the first one, called the alternative hypothesis.\n\nThen we decide a level of significance, which serves as a threshold for our null hypothesis. The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one we actually observed. Say the p-value of our test is 0.02: it means that, if the null hypothesis were true, results this extreme would occur only 2% of the time. It is not the probability that either hypothesis is true. \n\nNow, the level of significance comes into play to decide whether a p-value of 0.02 is small enough. It can be thought of as the level of endurance of the null hypothesis. If the p-value is greater than the level of significance, we accept (fail to reject) the null hypothesis. For a two tailed test at a 5% level of significance, this corresponds to the test statistic lying outside the 2.5% rejection regions at both ends of the distribution. \n\nBut if the p-value is less than or equal to the level of significance, as with 0.02 \u003C 0.05 here, we say that the result is __statistically significant, and we reject the null hypothesis.__\n\nResources:\n\n1. https:\u002F\u002Fmedium.com\u002Fanalytics-vidhya\u002Feverything-you-should-know-about-p-value-from-scratch-for-data-science-f3c0bfa3c4cc\n\n2. https:\u002F\u002Ftowardsdatascience.com\u002Fp-values-explained-by-data-scientist-f40a746cfc8\n\n3. https:\u002F\u002Fmedium.com\u002Fanalytics-vidhya\u002Fz-test-demystified-f745c57c324c\n\n## 18_ Chi2 test\n\nThe Chi2 test is extensively used in data science and machine learning problems for feature selection.\n\nA chi-square test is used in statistics to test the independence of two events. So, it is used to check whether features are independent of each other. Often dependent features are used, which do not convey much additional information but add dimensionality to the feature space.\n\nIt is one of the most common ways to examine relationships between two or more categorical variables.\n\nIt involves calculating a number, called the chi-square statistic - χ2. 
It follows a chi-square distribution.\n\nIt is given as the summation, over all categories, of the squared difference between the observed value and the expected value, divided by the expected value.\n\n![Chi2](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F266\u002F1*S8rfFkmLhDbOz4RGNwuz6g.png)\n\n\nResources:\n\n[Definitions](https:\u002F\u002Fwww.investopedia.com\u002Fterms\u002Fc\u002Fchi-square-statistic.asp)\n\n[Guide 1](https:\u002F\u002Ftowardsdatascience.com\u002Fchi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223)\n\n[Guide 2](https:\u002F\u002Fmedium.com\u002Fswlh\u002Fwhat-is-chi-square-test-how-does-it-work-3b7f22c03b01)\n\n[Example of Operation](https:\u002F\u002Fmedium.com\u002F@kuldeepnpatel\u002Fchi-square-test-of-independence-bafd14028250)\n\n\n## 19_ Estimation\n\n## 20_ Confid Int (CI)\n\n## 21_ MLE\n\n## 22_ Kernel Density estimate\n\nIn statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.\n\nA kernel density estimate can be regarded as another way to represent the probability distribution. \n\n![KDE1](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F2\u002F2a\u002FKernel_density.svg\u002F250px-Kernel_density.svg.png)\n\nIt consists of choosing a kernel function. Three are most commonly used.\n\n1. Gaussian \n\n2. Box\n\n3. Triangular\n\nThe kernel function depicts the probability of finding a data point near a given observation. So, it is highest at the observation itself and decreases as we move away from it.\n\nWe place a kernel function over each data point and finally sum these functions to get the density estimate of the distributed data points. In practice, it adds up the kernel function values at a particular point on the axis. 
It is as shown below.\n\n![KDE 2](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F4\u002F41\u002FComparison_of_1D_histogram_and_KDE.png\u002F500px-Comparison_of_1D_histogram_and_KDE.png)\n\nNow, the kernel function is given by:\n\n![kde3](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Ff3b09505158fb06033aabf9b0116c8c07a68bf31)\n\nwhere K is the kernel — a non-negative function — and h > 0 is a smoothing parameter called the bandwidth. \n\nThe 'h' or the bandwidth is the parameter, on which the curve varies.\n\n![kde4](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Fe\u002Fe5\u002FComparison_of_1D_bandwidth_selectors.png\u002F220px-Comparison_of_1D_bandwidth_selectors.png)\n\nKernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.\n\nResources:\n\n[Basics](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=x5zLaWT5KPs)\n\n[Advanced](https:\u002F\u002Fjakevdp.github.io\u002FPythonDataScienceHandbook\u002F05.13-kernel-density-estimation.html)\n\n## 23_ Regression\n\nRegression tasks deal with predicting the value of a __dependent variable__ from a set of __independent variables.__\n\nSay, we want to predict the price of a car. So, it becomes a dependent variable say Y, and the features like engine capacity, top speed, class, and company become the independent variables, which helps to frame the equation to obtain the price.\n\nIf there is one feature say x. 
If the dependent variable y is linearly dependent on x, then it can be given by __y=mx+c__, where m is the coefficient of the independent variable in the equation and c is the intercept or bias.\n\nThe image shows the types of regression\n\n![types](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F2001\u002F1*dSFn-uIYDhDfdaG5GXlB3A.png)\n\n[Guide to Regression](https:\u002F\u002Ftowardsdatascience.com\u002Fa-deep-dive-into-the-concept-of-regression-fb912d427a2e)\n\n## 24_ Covariance\n\n### Variance\nThe variance is a measure of how dispersed or spread out the set is. If the variance is zero, all the elements in the dataset are the same. If the variance is low, the data are only slightly dissimilar. If the variance is very high, the data in the dataset are largely dissimilar. \n\nMathematically, it is a measure of how far each value in the data set is from the mean.\n\nVariance (sigma^2) is given by the summation of the squared distances of each point from the mean, divided by the number of points.\n\n![formula var](https:\u002F\u002Fcdn.sciencebuddies.org\u002FFiles\u002F474\u002F9\u002FDefVarEqn.jpg)\n\n### Covariance\n\nCovariance gives us an idea about the degree of association between two random variables. Now, we know random variables create distributions. A distribution is a set of values or data points which the variable takes, and we can easily represent it as a vector in the vector space.\n\nFor vectors, covariance is the dot product of the two mean-centered vectors, divided by the number of points. The value of covariance can vary from negative infinity to positive infinity. If the two distributions or vectors grow in the same direction the covariance is positive, and if they grow in opposite directions it is negative. The sign gives the direction of variation and the magnitude gives the amount of variation.  
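As a small numeric sketch of the sign behaviour described above: the function below computes the covariance of two lists, assuming the population form (dividing by the number of points, which is an assumption made here). The function name and the sample values are made up for illustration.

```python
def covariance(xs, ys):
    # Population covariance: mean of the products of each pair's
    # deviations from their respective means (divides by n).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # grows together with hours
falling = [10, 8, 6, 4, 2]   # shrinks as hours grows

cov_up = covariance(hours, rising)
cov_down = covariance(hours, falling)
print(cov_up, cov_down)
```

The first pair moves in the same direction and yields a positive value; the second pair moves in opposite directions and yields a negative value of the same magnitude.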
\n\nCovariance is given by:\n\n![cov_form](https:\u002F\u002Fcdn.corporatefinanceinstitute.com\u002Fassets\u002Fcovariance1.png)\n\nwhere Xi and Yi denote the i-th points of the two distributions, X-bar and Y-bar represent the mean values of the two distributions, and n represents the number of values or data points in each distribution. \n\n## 25_ Correlation\n\nCovariance measures the total relation of the variables, namely both direction and magnitude. Correlation is a scaled measure of covariance. It is dimensionless and independent of scale. It shows both the direction and the strength of the linear relationship between the two variables.\n\nMathematically, if we represent the distributions as mean-centered vectors, correlation is the cosine of the angle between the vectors. The value of correlation varies from +1 to -1. +1 is said to be a perfect positive correlation and -1 is said to be a perfect negative correlation. 0 implies no linear correlation; it does not by itself guarantee that the two variables are independent of each other. \n\nCorrelation is given by:\n\n![corr](https:\u002F\u002Fcdn.corporatefinanceinstitute.com\u002Fassets\u002Fcovariance3.png)\n\nWhere:\n\nρ(X,Y) – the correlation between the variables X and Y\n\nCov(X,Y) – the covariance between the variables X and Y\n\nσX – the standard deviation of the X-variable\n\nσY – the standard deviation of the Y-variable\n\nStandard deviation is given by the square root of the variance.\n\n## 26_ Pearson coeff\n\n## 27_ Causation\n\n## 28_ Least2-fit\n\n## 29_ Euclidean Distance\n\n__Euclidean Distance is the most used and standard measure for the distance between two points.__\n\nIt is given as the square root of the sum of squares of the differences between the coordinates of two points.\n\n__The Euclidean distance between two points in Euclidean space is a number, the length of a line segment between the two points. 
It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is occasionally called the Pythagorean distance.__\n\n__In the Euclidean plane, let point p have Cartesian coordinates (p_{1},p_{2}) and let point q have coordinates (q_{1},q_{2}). Then the distance between p and q is given by:__\n\n![eucladian](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F9c0157084fd89f5f3d462efeedc47d3d7aa0b773)\n\n\n# 3_ Programming\n\n## 1_ Python Basics\n\n### About\n\nPython is a high-level programming langage. I can be used in a wide range of works.\n\nCommonly used in data-science, [Python](https:\u002F\u002Fwww.python.org\u002F)  has a huge set of libraries, helpful to quickly do something.\n\nMost of informatics systems already support Python, without installing anything.\n\n### Execute a script\n\n* Download the .py file on your computer\n* Make it executable (_chmod +x file.py_ on Linux)\n* Open a terminal and go to the directory containing the python file\n* _python file.py_ to run with Python2 or _python3 file.py_ with Python3\n\n## 2_ Working in excel\n\n## 3_ R setup \u002F R studio\n\n### About\n\nR is a programming language specialized in statistics and mathematical visualizations.\n\nIt can be used with manually created scripts using the terminal, or directly in the R console.\n\n### Installation\n\n#### Linux\n\n\tsudo apt-get install r-base\n\t\n\tsudo apt-get install r-base-dev\n\n#### Windows\n\nDownload the .exe setup available on [CRAN](https:\u002F\u002Fcran.rstudio.com\u002Fbin\u002Fwindows\u002Fbase\u002F) website.\n\n### R-studio\n\nRstudio is a graphical interface for R. 
It is available for free on [their website](https:\u002F\u002Fwww.rstudio.com\u002Fproducts\u002Frstudio\u002Fdownload\u002F).\n\nThis interface is divided in 4 main areas :\n\n![rstudio](https:\u002F\u002Fowi.usgs.gov\u002FR\u002Ftraining-curriculum\u002Fintro-curriculum\u002Fstatic\u002Fimg\u002Frstudio.png)\n\n* The top left is the script you are working on (highlight code you want to execute and press Ctrl + Enter)\n* The bottom left is the console to instant-execute some lines of codes\n* The top right is showing your environment (variables, history, ...)\n* The bottom right show figures you plotted, packages, help ... The result of code execution\n\n## 4_ R basics\n\nR is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.\n\nThe R language is widely used among statisticians and data miners for developing statistical software and data analysis.\n\nPolls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years.\n\n## 5_ Expressions\n\n## 6_ Variables\n\n## 7_ IBM SPSS\n\n## 8_ Rapid Miner\n\n## 9_ Vectors\n\n## 10_ Matrices\n\n## 11_ Arrays\n\n## 12_ Factors\n\n## 13_ Lists\n\n## 14_ Data frames\n\n## 15_ Reading CSV data\n\nCSV is a format of __tabular data__ comonly used in data science. Most of structured data will come in such a format.\n\nTo __open a CSV file__ in Python, just open the file as usual :\n\t\n\traw_file = open('file.csv', 'r')\n\t\n* 'r': Reading, no modification on the file is possible\n* 'w': Writing, every modification will erease the file \n* 'a': Adding, every modification will be made at the end of the file\n\n### How to read it ?\n\nMost of the time, you will parse this file line by line and do whatever you want on this line. 
If you want to store data to use them later, build lists or dictionnaries.\n\nTo read such a file row by row, you can use :\n\n* Python [library csv](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fcsv.html)\n* Python [function open](https:\u002F\u002Fdocs.python.org\u002F2\u002Flibrary\u002Ffunctions.html#open)\n\n## 16_ Reading raw data\n\n## 17_ Subsetting data\n\n## 18_ Manipulate data frames\n\n## 19_ Functions\n\nA function is helpful to execute redondant actions.\n\nFirst, define the function:\n\n\tdef MyFunction(number):\n\t\t\"\"\"This function will multiply a number by 9\"\"\"\n\t\tnumber = number * 9\n\t\treturn number\n\n## 20_ Factor analysis\n\n## 21_ Install PKGS\n\nPython actually has two mainly used distributions. Python2 and python3.\n\n### Install pip\n\nPip is a library manager for Python. Thus, you can easily install most of the packages with a one-line command. To install pip, just go to a terminal and do:\n\t\n\t# __python2__\n\tsudo apt-get install python-pip\n\t# __python3__\n\tsudo apt-get install python3-pip\n\t\nYou can then install a library with [pip](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fpip?) via a terminal doing:\n\n\t# __python2__ \n\tsudo pip install [PCKG_NAME]\n\t# __python3__ \n\tsudo pip3 install [PCKG_NAME]\n\nYou also can install it directly from the core (see 21_install_pkgs.py)\n\n\n# 4_ Machine learning\n\n## 1_ What is ML ?\n\n### Definition\n\nMachine Learning is part of the Artificial Intelligences study. 
It concerns the conception, development and implementation of sophisticated methods, allowing a machine to achieve really hard tasks that are nearly impossible to solve with classic algorithms.\n\nMachine learning mostly consists of three families of algorithms:\n\n![ml](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F561\u002F0*qlvUmkmkeefqe_Mk)\n\n### Utilisation examples\n\n* Computer vision\n* Search engines\n* Financial analysis\n* Document classification\n* Music generation\n* Robotics ...\n\n## 2_ Numerical var\n\nVariables which can take integer or continuous real values. They can take infinitely many values.\n\nThese types of variables are mostly used for features which involve measurements. For example, the heights of all students in a class.\n\n## 3_ Categorical var\n\nVariables that take finite discrete values. They take a fixed set of values, in order to classify a data item.\n\nThey act like assigned labels. For example: labelling the students of a class according to gender: 'Male' and 'Female'\n\n## 4_ Supervised learning\n\nSupervised learning is the machine learning task of inferring a function from __labeled training data__. \n\nThe training data consist of a __set of training examples__. \n\nIn supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). \n\nA supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. \n\nIn other words:\n\nSupervised learning learns from a set of labeled examples. From the instances and the labels, supervised learning models try to find the relationships among the features used to describe an instance, and learn how each feature contributes to the label corresponding to an instance. 
On receiving an unseen instance, the goal of supervised learning is to label the instance based on its feature correctly.\n\n__An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances__.\n\n## 5_ Unsupervised learning\n\nUnsupervised machine learning is the machine learning task of inferring a function to describe hidden structure __from \"unlabeled\" data__ (a classification or categorization is not included in the observations). \n\nSince the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.\n\nUnsupervised learning deals with data instances only. This approach tries to group data and form clusters based on the similarity of features. If two instances have similar features and placed in close proximity in feature space, there are high chances the two instances will belong to the same cluster. On getting an unseen instance, the algorithm will try to find, to which cluster the instance should belong based on its feature.\n\nResource:\n\n[Guide to unsupervised learning](https:\u002F\u002Ftowardsdatascience.com\u002Fa-dive-into-unsupervised-learning-bf1d6b5f02a7)\n\n## 6_ Concepts, inputs and attributes\n\nA machine learning problem takes in the features of a dataset as input.\n\nFor supervised learning, the model trains on the data and then it is ready to perform. So, for supervised learning, apart from the features we also need to input  the corresponding labels of the data points to let the model train on them.\n\nFor unsupervised learning, the models simply perform by just citing complex relations among data items and grouping them accordingly. So, unsupervised learning do not need a labelled dataset. 
The input is only the feature section of the dataset.\n\n## 7_ Training and test data\n\nIf we train a supervised machine learning model using a dataset, the model captures the dependencies of that particular data set very deeply. So, the model will tend to perform well on that same data, and its score there won't be a proper measure of how well the model performs. \n\nTo know how well the model performs, we must train and test the model on different datasets. The dataset we train the model on is called the training set, and the dataset we test the model on is called the test set.\n\nWe normally split the provided dataset to create the training and test sets. The test-to-training split ratio is typically 3:7 or 2:8 depending on the data, the larger part being the training data.\n\n#### sklearn.model_selection.train_test_split is used for splitting the data.\n\nSyntax:\n\n    from sklearn.model_selection import train_test_split\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n  \n[Sklearn docs](https:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fgenerated\u002Fsklearn.model_selection.train_test_split.html)\n\n## 8_ Classifiers\n\nClassification is one of the most important and most common machine learning problems. Classification problems can be both supervised and unsupervised problems.\n\nClassification problems involve labelling data points as belonging to a particular class based on the feature set corresponding to the particular data point.\n\nClassification tasks can be performed using both machine learning and deep learning techniques.\n\nMachine learning classification techniques include Logistic Regression, SVMs, and Classification trees. The models used to perform the classification are called classifiers.\n\n## 9_ Prediction\n\nThe output generated by a machine learning model for a particular problem is called its prediction. \n\nThere are two major kinds of predictions, corresponding to two types of problem: \n\n1. Classification\n\n2. 
Regression\n\nIn classification, the prediction is mostly a class or label to which a data point belongs.\n\nIn regression, the prediction is a number, a continuous numeric value, because regression problems deal with predicting a value. For example, predicting the price of a house.\n\n## 10_ Lift\n\n## 11_ Overfitting\n\nOften we train our model so much, or make our model so complex, that it fits the training data too tightly.\n\nThe training data often contains outliers or represents misleading patterns in the data. Fitting the training data with such irregularities too deeply causes the model to lose its ability to generalize. The model performs very well on the training set but not so well on the test set. \n\n![overfitting](https:\u002F\u002Fhackernoon.com\u002Fhn-images\u002F1*xWfbNW3arf39wxk4ZkI2Mw.png)\n\nAs we can see, training beyond a certain point decreases the training error but increases the testing error.\n\nA hypothesis h1 is said to overfit iff there exists another hypothesis h where h gives more error than h1 on the training data and less error than h1 on the test data.\n\n## 12_ Bias & variance\n\nBias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. It leads to high error on both training and test data.\n\nVariance is the variability of the model prediction for a given data point, a value which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data which it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.\n\n\nBasically, high variance causes overfitting and high bias causes underfitting. We want our model to have low bias and low variance to perform well. 
We need to avoid a model with high variance and high bias.\n\n![bias&variance](https:\u002F\u002Fcommunity.alteryx.com\u002Ft5\u002Fimage\u002Fserverpage\u002Fimage-id\u002F52874iE986B6E19F3248CF?v=1.0)\n\nWe can see that for low bias and low variance our model predicts all the data points correctly. In the last image, with high bias and high variance, the model predicts no data point correctly.\n\n![B&v2](https:\u002F\u002Fadolfoeliazat.com\u002Fwp-content\u002Fuploads\u002F2020\u002F07\u002FBias-Variance-tradeoff-in-Machine-Learning.png)\n\nWe can see from the graph that the error increases when the model is either too complex or too simple. The bias increases with simpler models and the variance increases with more complex models.\n\nThis is one of the most important tradeoffs in machine learning.\n\n\n\n## 13_ Tree and classification\n\nWe have previously talked about classification. We have seen that the most used methods are Logistic Regression, SVMs and decision trees. Now, if the decision boundary is linear, methods like logistic regression and SVM serve best, but it is a completely different scenario when the decision boundary is non-linear; this is where decision trees are used.\n\n![tree](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FZena_Hira\u002Fpublication\u002F279274803\u002Ffigure\u002Ffig4\u002FAS:324752402075653@1454438414424\u002FLinear-versus-nonlinear-classification-problems.png)\n\nThe first image shows a linear decision boundary and the second image shows a non-linear decision boundary.\n\nIn such cases, with non-linear boundaries, the condition-based approach of decision trees works very well for classification problems. 
The algorithm creates conditions on features to drive and reach a decision, so is independent of functions.\n\n![tree2](https:\u002F\u002Fdatabricks.com\u002Fwp-content\u002Fuploads\u002F2014\u002F09\u002Fdecision-tree-example.png)\n\nDecision tree approach for classification\n\n## 14_ Classification rate\n\n## 15_ Decision tree\n\nDecision Trees are some of the most used machine learning algorithms. They are used for both classification and Regression. They can be used for both linear and non-linear data, but they are mostly used for non-linear data. Decision Trees as the name suggests works on a set of decisions derived from the data and its behavior. It does not use a linear classifier or regressor, so its performance is independent of the linear nature of the data. \n\nOne of the other most important reasons to use tree models is that they are very easy to interpret.\n\nDecision Trees can be used for both classification and regression. The methodologies are a bit different, though principles are the same. The decision trees use the CART algorithm (Classification and Regression Trees)\n\nResource:\n\n[Guide to Decision Tree](https:\u002F\u002Ftowardsdatascience.com\u002Fa-dive-into-decision-trees-a128923c9298)\n\n\n## 16_ Boosting\n\n#### Ensemble Learning\n\nIt is the method used to enhance the performance of the Machine learning models by combining several number of models or weak learners. They provide improved efficiency.\n\nThere are two types of ensemble learning:\n\n__1. Parallel ensemble learning or bagging method__\n\n__2. Sequential ensemble learning or boosting method__\n\nIn parallel method or bagging technique, several weak classifiers are created in parallel. The training datasets are created randomly on a bootstrapping basis from the original dataset. The datasets used for the training and creation phases are weak classifiers. 
Later during predictions, the reults from all the classifiers are bagged together to provide the final results.\n\n![bag](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F850\u002F1*_pfQ7Xf-BAwfQXtaBbNTEg.png)\n\nEx: Random Forests\n\nIn sequential learning or boosting weak learners are created one after another and the data sample set are weighted in such a manner that during creation, the next learner focuses on the samples that were wrongly predicted by the previous classifier. So, at each step, the classifier improves and learns from its previous mistakes or misclassifications.\n\n![boosting](https:\u002F\u002Fwww.kdnuggets.com\u002Fwp-content\u002Fuploads\u002FBudzik-fig2-ensemble-learning.jpg)\n\nThere are mostly three types of boosting algorithm:\n\n__1. Adaboost__\n\n__2. Gradient Boosting__\n\n__3. XGBoost__\n\n__Adaboost__ algorithm works in the exact way describe. It creates a weak learner, also known as stumps, they are not full grown trees, but contain a single node based on which the classification is done. The misclassifications are observed and they are weighted more than the correctly classified ones while training the next weak learner. \n\n__sklearn.ensemble.AdaBoostClassifier__ is used for the application of the classifier on real data in python.\n\n![adaboost](https:\u002F\u002Fars.els-cdn.com\u002Fcontent\u002Fimage\u002F3-s2.0-B9780128177365000090-f09-18-9780128177365.jpg)\n\nReources:\n\n[Understanding](https:\u002F\u002Fblog.paperspace.com\u002Fadaboost-optimizer\u002F#:~:text=AdaBoost%20is%20an%20ensemble%20learning,turn%20them%20into%20strong%20ones.)\n\n\n__Gradient Boosting__ algorithm starts with a node giving 0.5 as output for both classification and regression. It serves as the first stump or weak learner. We then observe the Errors in predictions. Now, we create other learners or decision trees to actually predict the errors based on the conditions. The errors are called Residuals. 
Our final output is:\n\n__0.5 (Provided by the first learner) + The error provided by the second tree or learner.__\n\nNow, if we use this method, it learns the predictions too tightly, and loses generalization. In order to avoid that gradient boosting uses a learning parameter _alpha_. \n\nSo, the final results after two learners is obtained as:\n\n__0.5 (Provided by the first learner) + _alpha_ X (The error provided by the second tree or learner.)__\n\nWe can see that using the added portion we take a small leap towards the correct results. We continue adding learners until the point we are very close to the actual value given by the training set.\n\nOverall the equation becomes:\n\n\n__0.5 (Provided by the first learner) + _alpha_ X (The error provided by the second tree or learner.)+ _alpha_ X (The error provided by the third tree or learner.)+.............__\n\n\n__sklearn.ensemble.GradientBoostingClassifier__ used to apply gradient boosting in python\n\n![GBM](https:\u002F\u002Fwww.elasticfeed.com\u002Fwp-content\u002Fuploads\u002F09cc1168a39db0c0d6ea1c66d27ecfd3.jpg)\n\nResource:\n\n[Guide](https:\u002F\u002Fmedium.com\u002Fmlreview\u002Fgradient-boosting-from-scratch-1e317ae4587d) \n\n## 17_ Naïves Bayes classifiers\n\nThe Naive Bayes classifiers are a collection of classification algorithms based on __Bayes’ Theorem.__\n\nBayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. 
It is given by:\n\n![bayes](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F87c061fe1c7430a5201eef3fa50f9d00eac78810)\n\nWhere P(A|B) is the probability of A occurring given that B has already occurred, and P(B|A) is the probability of B occurring given that A has occurred.\n\n[Scikit-learn Guide](https:\u002F\u002Fgithub.com\u002Fabr-98\u002Fdata-scientist-roadmap\u002Fedit\u002Fmaster\u002F04_Machine-Learning\u002FREADME.md)\n\nThere are two main types of Naive Bayes:\n\n__1. Gaussian Naive Bayes__\n\n__2. Multinomial Naive Bayes.__\n\n#### Multinomial Naive Bayes\n\nThis method is used mostly for document classification, for example classifying an article as coming from a sports magazine or a film magazine. It is also used for differentiating actual mails from spam mails. It uses the frequency of the words used in each class of document to make a decision.\n\nFor example, the words \"Dear\" and \"friends\" are used a lot in actual mails, while \"offer\" and \"money\" are used a lot in \"Spam\" mails. The method calculates the probability of occurrence of each word in actual mails and in spam mails using the training examples. So, the probability of occurrence of \"money\" is much higher in spam mails, and so on. \n\nNow, we calculate the probability of a mail being a spam mail using the occurrence of words in it. \n\n#### Gaussian Naive Bayes\n\nWhen the predictors take continuous rather than discrete values, we assume that these values are sampled from a Gaussian distribution.\n\n![gnb](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F422\u002F1*AYsUOvPkgxe3j1tEj2lQbg.gif)\n\nIt links the Gaussian distribution and Bayes’ theorem. \n\nResources:\n\n[GUIDE](https:\u002F\u002Fyoutu.be\u002FH3EjCKtlVog)\n\n## 18_ K-Nearest neighbor\n\nThe K-nearest neighbour algorithm is one of the most basic yet essential algorithms. It is a memory-based approach and not a model-based one. \n\nKNN is used in both supervised and unsupervised learning. 
It simply locates the data points in the feature space and uses distance as a similarity metric.\n\nThe smaller the distance between two data points, the more similar the points are. \n\nIn the K-NN classification algorithm, the point to classify is plotted in the feature space and assigned the class of its K nearest neighbours. K is a user-chosen parameter. It specifies how many points we should consider while deciding the label of the point concerned. If K is more than 1, we take the label that is in the majority.\n\nIf the dataset is very large, we can use a large k. A large k is less affected by noise and generates smooth boundaries. For a small dataset, a small k should be used. A small k helps to capture the variation in boundaries better.\n\n![knn](https:\u002F\u002Fwww.mathworks.com\u002Fmatlabcentral\u002Fmlc-downloads\u002Fdownloads\u002Fsubmissions\u002F46117\u002Fversions\u002F4\u002Fscreenshot.jpg)\n\nResource:\n\n[GUIDE](https:\u002F\u002Ftowardsdatascience.com\u002Fmachine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)\n\n## 19_ Logistic regression\n\nRegression is one of the most important concepts used in machine learning.\n\n[Guide to regression](https:\u002F\u002Ftowardsdatascience.com\u002Fa-deep-dive-into-the-concept-of-regression-fb912d427a2e)\n\nLogistic Regression is one of the most widely used classification algorithms for linearly separable data points. Logistic Regression is used when the dependent variable is categorical. \n\nIt uses the linear regression equation:\n\n__Y = w1x1 + w2x2 + w3x3 + … + wkxk__\n\nin a modified format:\n\n__Y = 1 \u002F (1 + e^-(w1x1 + w2x2 + w3x3 + … + wkxk))__\n\nThis modification ensures the value always stays between 0 and 1, which makes it suitable for classification.\n\nThe above equation is called the __Sigmoid__ function. 
The function looks like:\n\n![Logreg](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F700\u002F1*HXCBO-Wx5XhuY_OwMl0Phw.png)\n\nThe loss function used is called log loss or binary cross-entropy.\n\n__Loss = -Y_actual * log(h(x)) - (1 - Y_actual) * log(1 - h(x))__\n\nIf Y_actual = 1, the first term gives the error; otherwise the second term does.\n\n![loss](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F700\u002F1*GZiV3ph20z0N9QSwQTHKqg.png)\n\nLogistic Regression can also be used for multiclass classification, via softmax regression or one-vs-all logistic regression.\n\n[Guide to logistic Regression](https:\u002F\u002Ftowardsdatascience.com\u002Flogistic-regression-detailed-overview-46c4da4303bc)\n\n\n__sklearn.linear_model.LogisticRegression__ is used to apply logistic regression in Python.\n\n## 20_ Ranking\n\n## 21_ Linear regression\n\nRegression tasks deal with predicting the value of a dependent variable from a set of independent variables, i.e., the provided features. Say we want to predict the price of a car. The price becomes the dependent variable, say Y, and features like engine capacity, top speed, class, and company become the independent variables, which help to frame the equation that gives the price.\n\n\nNow, suppose there is one feature, say x. If the dependent variable y is linearly dependent on x, then it can be given by y=mx+c, where m is the coefficient of the feature in the equation and c is the intercept or bias. Both m and c are the model parameters.\n\nWe use a loss function or cost function called Mean Squared Error (MSE). It is given by the squared difference between the actual and the predicted value of the dependent variable, averaged over the m training examples (here m denotes the number of examples, not the slope, and the extra factor of 1\u002F2 is a common convention that simplifies the gradient).\n\n__MSE = 1\u002F(2m) * Σ (Y_actual - Y_pred)²__\n\nIf we observe the function, we see it is a parabola, i.e., the function is convex in nature. 
This convexity is the property that Gradient Descent relies on to obtain the values of the model parameters.\n\n![loss](https:\u002F\u002Fmiro.medium.com\u002Fmax\u002F2238\u002F1*Xgk6XI4kEcSmDaEAxqB1CA.png)\n\nThe image shows the loss function.\n\nTo get the correct estimate of the model parameters we use the method of __Gradient Descent__.\n\n[Guide to Gradient Descent](https:\u002F\u002Ftowardsdatascience.com\u002Fan-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2)\n\n[Guide to linear Regression](https:\u002F\u002Ftowardsdatascience.com\u002Flinear-regression-detailed-view-ea73175f6e86)\n\n__sklearn.linear_model.LinearRegression__ is used to apply linear regression in Python.\n\n## 22_ Perceptron\n\nThe perceptron was the first model, described in the 1950s.\n\nThis is a __binary classifier__, i.e. it can't separate more than 2 groups, and those groups have to be __linearly separable__.\n\nThe perceptron __works like a biological neuron__. It calculates an activation value; if this value is positive it returns 1, and 0 otherwise.\n\n## 23_ Hierarchical clustering\n\nThe hierarchical algorithms are so called because they build tree-like structures to form clusters. These algorithms also use a distance-based approach for cluster creation.\n\nThe most popular algorithms are:\n\n__Agglomerative Hierarchical clustering__\n\n__Divisive Hierarchical clustering__\n\n__Agglomerative Hierarchical clustering__: In this type of hierarchical clustering, each point initially starts as a cluster, and slowly the nearest or most similar cluster","该项目是一个汇集了免费数据科学学习资源的仓库。它涵盖了从基础数学概念如矩阵运算、哈希函数到计算机科学中的二叉树、时间复杂度分析等知识点，同时也包括了关系代数和数据库操作的基础教程。项目内容详尽，适合初学者系统性地构建数据科学知识体系，也适用于有一定基础的数据科学家深入理解和复习相关理论。此外，由于其包含的人工智能、机器学习等领域的内容，对于希望在这些方向上发展的技术人员也非常有帮助。",2,"2026-06-11 03:24:46","top_topic"]