Machine Learning

From Knowledge Discovery

General ML Software

Matlab Code

unsupervised

supervised

other

Data Sources

Data in matlab format

  • Credit Card Application Data
  • Congressional Voting Data
    • Congressional votes data 1 - UC Irvine congressional votes data with 1/0 indicator for missing attributes
    • Congressional votes data 2 - UC Irvine congressional votes data with missing attributes replaced by mean
    • Original source - Description of features in .names file. Note that the missing-data counts in the source description appear to be wrong: it reports 0 missing values for attribute 2, yet the 2nd instance is clearly missing attribute 2. The rest of the information looks right.
    • House votes conversion information
  • Perl Conversion Script
    • download tar/gz
    • While converting some data from the UC Irvine machine learning repository, I ended up writing a generic conversion script that handles discrete variables and missing data and produces a file Matlab can load. There are actually two Perl scripts in this archive: the first adds a 1/0 indicator for any attribute with missing data, and the second replaces missing data with the mean value of that attribute. Both scripts scan the data set and replace each discrete lettered feature with 0/1 attributes, as discussed in class. I think these scripts can handle a lot of the data in the UC Irvine repository, but they are still not very generic. Let me know if there are any obvious problems with them, and feel free to fix them up. -Mike (mike.mattozzi at gmail.com).
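
For readers who want to see the two missing-data strategies side by side, here is a minimal Python sketch of the same idea. It is not the Perl scripts from the archive, only an illustration under stated assumptions: missing values are marked with "?", every column has at least one observed value, and the function name convert and its strategy argument are made up for this example. Discrete features become one 0/1 column per observed level; with the "mean" strategy, a missing discrete value is filled with the mean of each 0/1 column, which is one plausible reading of "replace with the mean" and may differ from what the Perl script does.

```python
from statistics import mean

MISSING = "?"  # assumed missing-value marker


def is_number(s):
    """Return True if the string parses as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def convert(rows, strategy="indicator"):
    """Turn rows of string fields into rows of floats.

    strategy="indicator": missing numeric values become 0 and a 1/0
        "was missing" column is appended for each column that has gaps.
    strategy="mean": missing values are replaced by the column mean.
    Hypothetical helper for illustration only.
    """
    out = [[] for _ in rows]
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        present = [v for v in col if v != MISSING]
        has_missing = len(present) < len(col)

        if all(is_number(v) for v in present):
            # Numeric column: parse, filling gaps with the mean or 0.
            fill = mean([float(v) for v in present]) if strategy == "mean" else 0.0
            encoded = [[float(v) if v != MISSING else fill] for v in col]
        else:
            # Discrete column: one 0/1 column per observed level.
            levels = sorted(set(present))
            encoded = [[1.0 if v == lv else 0.0 for lv in levels] for v in col]
            if strategy == "mean":
                means = [mean([1.0 if v == lv else 0.0 for v in present])
                         for lv in levels]
                encoded = [list(means) if v == MISSING else e
                           for v, e in zip(col, encoded)]

        if has_missing and strategy == "indicator":
            # Append the 1/0 missing-value flag for this column.
            encoded = [e + [1.0 if v == MISSING else 0.0]
                       for v, e in zip(col, encoded)]

        for o, e in zip(out, encoded):
            o.extend(e)
    return out


# Toy input: one y/n vote column and one numeric column, each with a gap.
rows = [["y", "1.0"], ["n", "?"], ["?", "3.0"]]
with_indicator = convert(rows, "indicator")
with_mean = convert(rows, "mean")
```

The resulting rows are purely numeric, so they can be written out as whitespace-separated text, a format Matlab's load command reads directly.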

How to handle different problems (FAQ)

Statistics background

  • relation between
    • MLE
    • error
    • information content
    • p-value
    • chi-squared
  • Bennett's inequality
  • sandwich estimator
  • Phenomena in High Dimensions
  • Statistical Theory for Clustering
    • Consistency
    • Stability
    • Central limit theorems for k-means

Machine Learning background

Good places to start are the textbook The Elements of Statistical Learning and Andrew Moore's notes. Alternatively, Duda, Hart, and Stork's book Pattern Classification offers an excellent engineering view of the area.

  • People at Penn should take CIS520

CIS520 2006
