Teaching assistants: Debjit Paria, Kapil Pause
Evaluation:
Assignments 40%, midsemester exam 20%, final exam 40%
Copying is fatal
Text and reference books:
Web Data Mining by Bing Liu.
Foundations of Data Science by Avrim Blum, John Hopcroft and Ravi Kannan.
Machine Learning by Tom Mitchell.
C4.5: Programs for Machine Learning by Ross Quinlan.
Artificial Intelligence: A Modern Approach by Stuart J Russell and Peter Norvig.
An Introduction to Information Retrieval by Christopher D Manning, Prabhakar Raghavan and Hinrich Schütze.
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron, O'Reilly, 2nd edition (2019).
Here is a tentative list of topics.
Supervised learning: Frequent itemsets, association rules, regression, decision trees, naive Bayes, SVM, classifier evaluation, expectation maximization, ensemble classifiers.
Unsupervised learning: Clustering, outlier detection.
Text mining: Basic ideas from information retrieval, TF-IDF model, PageRank, HITS.
Other topics: Probabilistic graphical models, Bayesian networks, Markov models, neural networks, ranking and social choice, …
Lecture 1, 7 Jan 2020: Class notes
Frequent itemsets, a-priori algorithm
Lecture 2, 9 Jan 2020: Class notes
A-priori algorithm, association rules
Lecture 3, 14 Jan 2020: Class notes
Class association rules, supervised learning, decision trees
Lecture 4, 16 Jan 2020: Class notes
Decision trees, different impurity measures – entropy, Gini index
Lecture 5, 21 Jan 2020: Class notes
Decision trees, information gain and information gain ratio
Discretizing continuous attributes
Classification and regression trees (CART)
Classifier evaluation
Lecture 6, 23 Jan 2020: Class notes
Classifier evaluation: confusion matrix, precision, recall, F-score
Regression: gradient descent
Lecture 7, 28 Jan 2020: Class notes
Logistic regression
Overfitting, tree pruning
Lecture 8, 29 Jan 2020
Using the Python scikit-learn library for regression and decision trees
Lecture 9, 04 Feb 2020: Class notes
Naive Bayesian classifiers, generative probabilistic models and parameter estimation, text classification: Boolean document model
Lecture 10, 06 Feb 2020: Class notes
Text classification: multinomial (bag of words) document model
Clustering: K-Means
Lecture 11, 11 Feb 2020: Class notes
Clustering: K-Means, hierarchical, distance functions for non-numeric attributes, normalizing values
Density-based clustering
Lecture 12, 18 Feb 2020: Class notes
Density-based local outlier detection
Applications of clustering: image segmentation, preprocessing for classification
Mixture models, Expectation-Maximization, clustering using mixture of Gaussians
Lecture 13, 20 Feb 2020: Class notes
Semi-supervised learning: EM for text classification, nearest-neighbour label extrapolation for MNIST data
Using the Python scikit-learn library for unsupervised learning
Lecture 14, 3 Mar 2020: Class notes
Perceptron algorithm
Kernel methods
Lecture 15, 5 Mar 2020: Class notes
Support vector machines
Kernel methods
Lecture 16, 12 Mar 2020: Class notes
Neural networks: Multilayer perceptrons, sigmoid neurons, network architecture, learning weights, universality
Lecture 17: Lecture notes, Lecture video
Neural networks: Backpropagation
Lecture 18: Lecture notes, Lecture video
Neural networks
Cross-entropy cost function
Case study: MNIST, recognizing handwritten digits
Deep learning: structuring networks in layers, convolutional networks
Lecture 19: Lecture notes, Lecture video
Ensemble classifiers: Bagging, Random Forests
Lecture 20: Lecture notes, Lecture video
Boosting
Lecture 21: Lecture notes, Lecture video
Information retrieval
Basic definitions, Boolean document model, term-document matrix, evaluating Boolean queries, inverted index, postings, stop words, lemmatization and stemming
Ranking: parametric indices and zone-based scoring, TF-IDF weighting, vector space model and cosine similarity
Lecture 22: Lecture notes, Lecture video
Anchor text, web as a graph, social network analysis, PageRank
Lecture 23: Lecture notes, Lecture video
Singular Value Decomposition, Latent Semantic Indexing, Principal Component Analysis