Machine Learning
General ML Software
- Weka http://www.cs.waikato.ac.nz/~ml/weka/ (see also http://sourceforge.net/projects/weka/)
- Reinforcement Learning Demo - a small OpenGL program I made that demonstrates reinforcement learning http://students.cs.unipi.gr/~p02132/projects.php?mcat=2 (Windows only)
Matlab Code
unsupervised
- k-means
- principal component analysis (PCA)
- Multidimensional Scaling (MDS)
- CPCM (clustering by predicting cluster membership): learning a metric for unsupervised learning
- Bayes Net Toolbox (BNT) http://bnt.sourceforge.net/
supervised
- linear regression
- logistic regression
- stepwise feature selection
- streamwise feature selection
- KNN
- Netlab neural network code http://www.ncrg.aston.ac.uk/netlab/
other
- more general Matlab functions and tricks
- Spider, a Matlab toolkit and interface to Weka http://www.kyb.tuebingen.mpg.de/bs/people/spider/
- EM for the Bernoulli distribution
Data Sources
- UCI Machine Learning Repository - http://www.ics.uci.edu/~mlearn/MLRepository.html
- The StatLib Datasets Archive - http://lib.stat.cmu.edu/datasets/
- KDD cups http://www.kdnuggets.com/datasets/kddcup.html
- Weka data sets http://weka.sourceforge.net/wiki/index.php/Datasets or http://www.cs.waikato.ac.nz/~ml/weka/
- NetflixPrize
- facial images: the Yale and FERET databases http://www.itl.nist.gov/iad/humanid/feret/feret_master.html
- handwritten digits (MNIST) http://yann.lecun.com/exdb/mnist/
- Reuters-21578 Text Categorization Corpus - http://www.daviddlewis.com/resources/testcollections/reuters21578/
- CMU Computer Vision Test Images http://www.cs.cmu.edu/~cil/v-images.html
- UPenn Motion Capture Data
- CMU Motion Capture Data
- Animal dissimilarities, as estimated by human subjects. Contact Rob Goldstone, rgoldsto at indiana.edu, for permission to use.
- random links http://www.kdnuggets.com/datasets/
Data in Matlab format
- Credit Card Application Data
- Credit card data 1 - UC Irvine credit card data with 1/0 indicator for missing attributes
- Credit card data 2 - UC Irvine credit card data with missing attributes replaced by mean
- Original source - Description of features in crx.names file
- Credit conversion information
- Congressional Voting Data
- Congressional votes data 1 - UC Irvine congressional votes data with 1/0 indicator for missing attributes
- Congressional votes data 2 - UC Irvine congressional votes data with missing attributes replaced by mean
- Original source - Description of features in .names file. Note that the missing-data counts in the source description seem to be wrong. For instance, it reports 0 missing values for attribute 2, although the 2nd instance is clearly missing attribute 2. The other information looks right.
- House votes conversion information
- Perl Conversion Script
- download tar/gz
- While converting some data from the UC Irvine machine learning repository, I ended up writing a generic conversion script that deals with things like discrete variables and missing data and produces a file that Matlab can read. There are actually two Perl scripts in this archive: the first adds a 1/0 indicator for any attribute with missing data, and the second replaces missing data with the mean value of that attribute. Both scripts look through the data set and replace any discrete lettered feature with 0/1 attributes, as discussed in class. I think these scripts can handle a lot of the data in the UC Irvine repository, but they are still not fully generic. Let me know if there are any obvious problems with them, and feel free to fix them up. -Mike (mike.mattozzi at gmail.com).
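The two conversion strategies described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl scripts from the archive. The function names (`convert`, `expand_column`, `is_number`) are hypothetical, and several details are assumptions: missing values are taken to be marked with "?", a column counts as numeric when every non-missing value parses as a number, and in indicator mode a missing value is filled with 0 next to its indicator column.

```python
MISSING = "?"  # assumption: missing values are marked with "?"


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def expand_column(values):
    """Turn one raw column into numeric columns; None marks a missing value.

    A numeric column is kept as a single column. A discrete (lettered)
    column becomes one 0/1 column per distinct level.
    """
    present = [v for v in values if v != MISSING]
    if all(is_number(v) for v in present):
        return [[None if v == MISSING else float(v) for v in values]]
    levels = sorted(set(present))
    return [
        [None if v == MISSING else float(v == level) for v in values]
        for level in levels
    ]


def convert(rows, use_indicator):
    """Convert rows of strings into rows of floats.

    use_indicator=True:  fill missing values with 0 and append one 1/0
                         "was missing" column per affected attribute.
    use_indicator=False: replace missing values with the column mean.
    """
    out_columns = []
    for j in range(len(rows[0])):
        raw = [row[j] for row in rows]
        missing = [v == MISSING for v in raw]
        for col in expand_column(raw):
            known = [x for x in col if x is not None]
            if use_indicator or not known:
                fill = 0.0
            else:
                fill = sum(known) / len(known)
            out_columns.append([fill if x is None else x for x in col])
        if use_indicator and any(missing):
            out_columns.append([1.0 if m else 0.0 for m in missing])
    # Transpose the list of columns back into a list of rows.
    return [list(row) for row in zip(*out_columns)]


rows = [["a", "1.0"], ["b", "?"], ["a", "3.0"]]
mean_filled = convert(rows, use_indicator=False)
# mean_filled == [[1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 0.0, 3.0]]
with_flags = convert(rows, use_indicator=True)
# with_flags == [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 3.0, 0.0]]
```

One consequence of the numeric test in this sketch is that a discrete feature coded with digits (for example "1", "2", "3") would be treated as numeric, so such columns would need to be listed explicitly in a real converter.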
How to handle different problems (FAQ)
Statistics background
- relation between
- MLE
- error
- information content
- p-value
- chi-squared
- Bennett's inequality
- sandwich estimator
- Phenomena in High Dimensions
- Statistical Theory for Clustering
- Consistency
- Stability
- Central limit theorems for k-means
Machine Learning background
A good place to start is the textbook The Elements of Statistical Learning, together with Andrew Moore's notes. Alternatively, Duda, Hart, and Stork's Pattern Classification gives an excellent engineering view of the area.
- People at Penn should take CIS520
