## Wednesday, February 8, 2017

### Machine Learning K Nearest Neighbors KNN Algorithm

At a high level, KNN looks for the k most similar stored instances (similarity defined by a chosen distance function), retrieves their labels, and then decides the final label by a vote among those k neighbors. The voting mechanism can be a simple majority, or it may weigh neighbors unevenly (e.g. by distance). The most commonly used distance function is Euclidean distance, which is the default in sklearn. FYI: Euclidean distance is a special case of Minkowski distance with p=2.
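The description above (find the nearest instances, then take a majority vote) can be sketched in plain Python. This is a minimal, unoptimized sketch, not sklearn's implementation; the function name `knn_predict` and the toy data are made up for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    train: list of (point, label) pairs, where point is a tuple of floats.
    query: a tuple of floats.
    """
    # Euclidean distance (Minkowski with p=2)
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Sort training points by distance to the query and keep the k closest
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]

    # Simple majority vote over the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "a"), ((1.5, 1.8), "a"),
         ((5.0, 5.2), "b"), ((6.0, 5.5), "b")]
print(knn_predict(train, (1.2, 1.1), k=3))  # two "a" votes vs one "b" -> "a"
```

Note that this brute-force version scans all n points per query; the sorted-data / binary-search trick discussed below only applies directly in one dimension.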

You can modify the `weights` parameter: `uniform` treats all neighbors equally, while `distance` weights each neighbor by the inverse of its distance under the specified distance function. You can also supply a custom weighting function.
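To make the difference between the two weighting modes concrete, here is a small sketch of the vote-tallying step. The `1/distance` weighting is an assumption modeled on sklearn's `weights='distance'` behavior; the function `weighted_vote` is a hypothetical helper, not a real sklearn API:

```python
from collections import defaultdict

def weighted_vote(neighbors, weights="uniform"):
    """Tally votes from (distance, label) pairs.

    weights="uniform": every neighbor gets one vote.
    weights="distance": each vote counts 1/distance, so closer
    neighbors matter more (assumed to mirror sklearn's 'distance' mode).
    """
    scores = defaultdict(float)
    for dist, label in neighbors:
        if weights == "distance":
            scores[label] += 1.0 / dist if dist > 0 else float("inf")
        else:
            scores[label] += 1.0
    return max(scores, key=scores.get)

neighbors = [(0.5, "a"), (2.0, "b"), (2.5, "b")]
print(weighted_vote(neighbors, "uniform"))   # "b" wins 2 votes to 1
print(weighted_vote(neighbors, "distance"))  # "a" wins: 2.0 vs 0.5 + 0.4
```

The same query can flip labels depending on the mode: one very close neighbor can outvote two distant ones under distance weighting.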

KNN: the k nearest neighbors to a given point.

- k: the number of neighbors
- n: the number of data points; the data is sorted to speed up the algorithm
- distance: Euclidean distance (the shortest line connecting the points), or a custom function that defines the distance
- visualize: a scatter plot where each point is the center of a circle whose radius includes a certain number of neighbors

KNN is a query-intensive algorithm, not a learning-intensive one.

Performance in Big O notation: log(n) is binary search, 1 is constant, n is linear. Intuition: KNN stores all the data, then performs a binary search on the data when querying; linear regression only stores the model y = mx + b.

Key concepts: LEARN vs QUERY, running time, and space. Example for 1-NN (1-nearest neighbor): a 1-dimensional sorted list, e.g. [1, 2, 4, 7, 8, 9].

| Algorithm | Phase | Running Time | Space |
| --- | --- | --- | --- |
| 1-NN | learn | 1 | n |
| 1-NN | query | log(n) | 1 |
| k-NN | learn | 1 | n |
| k-NN | query | log(n) + k | 1 |
| Linear regression | learn | n | 1 |
| Linear regression | query | 1 | 1 |

k-NN moves all the data to storage without learning, so learning time is 1 (constant in Big O notation) and storage space is n for the number of data points. A query is a binary search log(n) to find one point, plus the k items next to it in the sorted list.
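The log(n) query on a sorted 1-dimensional list can be demonstrated with Python's `bisect` module, which does the binary search. This is a sketch for the 1-NN, 1-D case only; `one_nn` is a made-up name for illustration:

```python
import bisect

def one_nn(sorted_data, query):
    """1-NN on a sorted 1-D list: O(log n) query via binary search."""
    i = bisect.bisect_left(sorted_data, query)
    # The nearest neighbor is one of the values adjacent to the insertion point
    candidates = sorted_data[max(0, i - 1):i + 1]
    return min(candidates, key=lambda x: abs(x - query))

data = [1, 2, 4, 7, 8, 9]  # "learning" is just storing the sorted list
print(one_nn(data, 5))     # -> 4 (|5-4| = 1 beats |5-7| = 2)
print(one_nn(data, 7.6))   # -> 8
```

For k > 1 you would take the k values surrounding the insertion point, which is the log(n) + k query time in the table.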