- Sept 8: The modeling spectrum
- read: HTF pp 415-420

- Sept 13: A picture of high dimensions
- No reading. Start on the "new" first homework assignment.

- Sept 15: Nearest neighbors in high dimensions
- read: Johnson-Lindenstrauss lemma.
- read: Database friendly projections
- read: Yuval Peres's chapter on Johnson-Lindenstrauss from his lecture notes.
- You now know enough to complete homework 1.
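A toy sketch (mine, not from the readings) of what the lemma buys you: a random Gaussian projection down to a few hundred dimensions approximately preserves all pairwise distances.

```python
import numpy as np

# Johnson-Lindenstrauss sketch with made-up sizes: project n points
# from dimension d down to k; pairwise distances survive up to a
# (1 +/- eps) factor with high probability.
rng = np.random.default_rng(0)
n, d, k = 50, 1000, 300
X = rng.normal(size=(n, d))

# Random projection, scaled so squared lengths are preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# Compare pairwise distances among the first 10 points before and after.
ratios = []
for i in range(10):
    for j in range(i + 1, 10):
        orig = np.linalg.norm(X[i] - X[j])
        proj = np.linalg.norm(Y[i] - Y[j])
        ratios.append(proj / orig)
max_distortion = max(abs(r - 1.0) for r in ratios)
```

With k = 300 the distortion here comes out well under 10 percent.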

- Sept 20: Stepwise regression
- start homework 2
- read: Best basis problem
- read: William J. Welch, "Algorithmic Complexity: Three NP-Hard Problems in Computational Statistics," (.pdf) J. Statist. Comput. Simul. 1982. (Added 2014: there is more recent work on the NP-completeness of variable selection. Natarajan has one, Michael Jordan has a piece, and I'm working on one.)
- read: Wikipedia's article on NP-complete. Someone should add the Welch result to the list of NP-complete problems. Do not get distracted by the page on sudoku.
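A minimal sketch (toy data, not the homework) of greedy forward stepwise selection: each step adds the predictor that most reduces the residual sum of squares, sidestepping the NP-hard best-subset search.

```python
import numpy as np

# Toy data: 10 predictors, true signal only in columns 2 and 5.
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[2, 5]] = [3.0, -2.0]
y = X @ beta + rng.normal(size=n)

# Greedy forward stepwise: add the variable with the biggest RSS drop.
selected = []
for _ in range(3):
    best_j, best_rss = None, np.inf
    for j in range(p):
        if j in selected:
            continue
        cols = X[:, selected + [j]]
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        rss = np.sum((y - cols @ coef) ** 2)
        if rss < best_rss:
            best_j, best_rss = j, rss
    selected.append(best_j)
```

The greedy pass recovers columns 2 and 5 here, but in general it can miss the best subset, which is the point of the NP-hardness readings.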

- Sept 22: Bonferroni
- read carefully the first 4 sections of Risk inflation.
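The correction itself fits in one line; a toy illustration with made-up p-values:

```python
# Bonferroni: to hold the family-wise error rate at alpha over m
# tests, compare each p-value to alpha / m (here 0.05 / 20 = 0.0025).
alpha, m = 0.05, 20
pvals = [0.001, 0.01, 0.04, 0.20, 0.0001]
rejected = [p for p in pvals if p < alpha / m]
```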

- Sept 27: Risk Inflation (Homework 1 due)
- look over: Donoho and Johnstone (1994) Ideal Denoising in an Orthonormal Basis Chosen from a Library of Bases (.pdf)

- Sept 29: Curve fitting and wavelets
- Read carefully: Donoho and Johnstone's (1994) wavelet paper.

- Oct 4: Proper scoring rules
- Read: General method for comparing probability assessors, by Mark Schervish.

- Oct 6: Alternative scoring rules and calibration (Homework 2 due)

- Oct 11: Spam: Bag of words and Naive-Bayes
- Talk in OPIM at Noon in G50 (free food)
- Read NYT data mining article (mirror)
- Read: Madigan's paper on Naive Bayes
- Read: as usual, the wiki pages on Bayesian filtering and Naive Bayes.
- Read about the Good-Turing estimator for rare event probabilities.
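A from-scratch toy spam filter in the bag-of-words / Naive Bayes spirit (the corpus is made up, and I use simple Laplace smoothing for brevity where Good-Turing would handle rare words better):

```python
import math
from collections import Counter

# Tiny labeled corpus (invented for illustration).
spam_docs = ["buy cheap pills now", "cheap cheap offer now"]
ham_docs = ["meeting at noon", "lecture notes posted now"]

def word_counts(docs):
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

spam_c, ham_c = word_counts(spam_docs), word_counts(ham_docs)
vocab = set(spam_c) | set(ham_c)

def log_likelihood(doc, counts):
    total = sum(counts.values())
    # add-one (Laplace) smoothing over the vocabulary
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in doc.split())

def classify(doc):
    # equal class priors here, so compare likelihoods directly
    if log_likelihood(doc, spam_c) > log_likelihood(doc, ham_c):
        return "spam"
    return "ham"
```

The "naive" part is the product over words: every word is treated as conditionally independent given the class.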

- Oct 13: Tails
- Read: proof of Chernoff used in class.
- Read: either Hal White's original sandwich estimator paper, or the GEE paper by Liang and Zeger.
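A quick simulation (not the proof from class) checking the Chernoff-Hoeffding tail bound P(mean >= 1/2 + t) <= exp(-2nt^2) for fair coin flips:

```python
import math
import random

# Compare the empirical tail probability to the Hoeffding bound.
random.seed(0)
n, t, trials = 100, 0.1, 2000
bound = math.exp(-2 * n * t * t)   # exp(-2), about 0.135

exceed = 0
for _ in range(trials):
    mean = sum(random.random() < 0.5 for _ in range(n)) / n
    if mean >= 0.5 + t:
        exceed += 1
empirical = exceed / trials
```

The empirical tail comes out well below the bound; the bound is loose but decays exponentially in n, which is what matters.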

- Oct 17/18: (Fall break)
- Oct 20: Sandwich estimator

- Oct 25: graphs
- Oct 27: SFS: streaming feature selection

- Nov 1: Alpha spending
- Homework 3 due; start on homework 4.
- Read the one-page background on alpha spending.
- Read: Excess discovery count (.ps) by Bob and me

- Nov 3: FDR and EDC
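For reference, the standard Benjamini-Hochberg step-up procedure for FDR control (a sketch with made-up p-values; the excess discovery count of the paper is a different statistic):

```python
# Benjamini-Hochberg: sort p-values, find the largest rank k with
# p_(k) <= (k/m) * q, and reject the k smallest.
def benjamini_hochberg(pvals, q=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])

rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60], q=0.05)
```

Note it rejects the first two tests but not 0.039, whose threshold at rank 3 is 0.03.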

- Nov 8: Support vector machines (guest lecture by Jon)
- Nov 10: RKHS
- Homework 4 due
- A nice source of information about support vectors and RKHS.
- In particular, see RKHS and regression using RKHS. You might find it easier to read the PDF files rather than the HTML file.
- Read section 5.8 of Hastie, Tibshirani, Friedman that I handed out.
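By the representer theorem, fitting a function in an RKHS reduces to kernel ridge regression: f(x) = sum_i a_i K(x, x_i) with a = (K + lambda I)^{-1} y. A sketch with a Gaussian (RBF) kernel on toy data, bandwidth and penalty made up:

```python
import numpy as np

# Toy 1-D regression data: noisy sine.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=40)

def rbf(a, b, width=0.1):
    # Gaussian kernel matrix K[i, j] = exp(-|a_i - b_j|^2 / (2 width^2))
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * width ** 2))

lam = 1e-3
K = rbf(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

def f(xnew):
    # representer-theorem form: weighted sum of kernels at the data
    return rbf(np.atleast_1d(xnew), x) @ alpha

fit_rmse = np.sqrt(np.mean((f(x) - y) ** 2))
```

The ridge penalty lambda is what keeps the RKHS norm of the fit under control; drive it to zero and you interpolate the noise.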

- Nov 15: Support vector machines: part II (guest lecture by Jon)
- Nov 17: Fitting RKHS using LS

- Nov 22: Clustering
- read handout: pages 412-413 and 461-464 of Hastie, Tibshirani, Friedman.
- Read Tali Tishby and Eyal Krupka's NIPS paper.
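One concrete clustering algorithm, sketched from scratch on toy one-dimensional data (not the handout's notation): Lloyd's k-means alternates assigning points to the nearest center with moving each center to the mean of its cluster.

```python
import numpy as np

# Two well-separated 1-D Gaussian clusters, centers near 0 and 5.
rng = np.random.default_rng(3)
pts = np.concatenate([rng.normal(0.0, 0.3, 50), rng.normal(5.0, 0.3, 50)])
centers = np.array([1.0, 4.0])   # deliberately poor starting guesses

for _ in range(10):
    # assignment step: nearest center for each point
    labels = np.argmin(np.abs(pts[:, None] - centers[None, :]), axis=1)
    # update step: each center moves to the mean of its cluster
    centers = np.array([pts[labels == k].mean() for k in range(2)])
```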

- Nov 29: Trees
- read handout: pages 266-289 of Hastie, Tibshirani, Friedman.

- Dec 1: Information theory
- read Bob's gentle introduction to information theory.
- Feel free to talk to Bob, Adi or me about information theory.
- A book on information theory that is better than Harry Potter
- For general information on information theory.
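The starting point is Shannon entropy, H(p) = -sum_i p_i log2 p_i, measured in bits; a two-line computation:

```python
import math

# Entropy of a discrete distribution, in bits (0 * log 0 taken as 0).
def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

fair_coin = entropy_bits([0.5, 0.5])   # a fair coin carries exactly 1 bit
biased = entropy_bits([0.9, 0.1])      # a biased coin carries less
```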

- Dec 6: Alternative models of data
- read Rick's and my review paper.
- My first Annals of Statistics paper was in this area. Guess what? It was on regression! I've written several papers on this.

- Dec 8: Summary

- Dec 19: Homework 5 due
- Dec 21: Late date for homework 5.

Exactly what data mining is depends on whom you talk to. For example, Andrew Moore takes a very wide view of data mining. He includes lovely topics from economics (e.g., game theory) to topics from classical AI (e.g., the A* algorithm). This will contrast with the approach I will take: I'll focus much more narrowly on statistical regression.

I've written a crude outline of what the course will cover.

Last modified: Thu Sep 25 13:00:07 EDT 2014