CIS5520 Project Proposals
---
On 11/20/06, Tanmoy Chakraborty <tanmoy at seas.upenn.edu> wrote:
> We are merging two groups for the project. The four people involved are
> Tanmoy (myself), Varun, Savi and Ketaki.
>
> We shall use the "Internet Usage Data" available in the UC Irvine KDD
> archive. The data consists of demographic details of internet users in
> 1997. There are about 10,000 observations and about 70-80 features. Most of
> the features are discrete and take 5-10 values (the country feature takes
> more than 100 values). We shall expand the set of features to make them
> binary (as you had described to us), which will generate about 500 features.
> Then we shall run PCA and clustering on them.
>
> We shall keep one of the features (age group, sex, etc.) as the output
> column and try to learn it from the other features, using the various
> learning methods that we have covered. (A decision tree seems to be a good
> choice, especially when combined with boosting.) We can also hold out
> different features in turn and try to predict each one. For example, we can
> treat age group as the output and try to predict it, and we can likewise
> treat sex as the output.
>
> Tanmoy.

Sender: ungar at cis.upenn.edu
Source: http://gosset.wharton.upenn.edu/pipermail/email2wiki/2006-December/000009.html
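The binary expansion described in the proposal above can be sketched as follows. This is only an illustration under assumptions: the function name `one_hot_expand`, the toy rows, and the column meanings are made up here and are not taken from the Internet Usage Data.

```python
def one_hot_expand(rows):
    """Expand rows of discrete values into 0/1 indicator columns.

    rows: list of equal-length lists of values (one list per respondent).
    Returns (names, matrix), where names[j] is a (column_index, value) pair
    describing indicator column j.
    """
    n_cols = len(rows[0])
    # Collect the distinct values seen in each original column, in sorted order.
    levels = [sorted({row[c] for row in rows}) for c in range(n_cols)]
    names = [(c, v) for c in range(n_cols) for v in levels[c]]
    matrix = [[1 if row[c] == v else 0 for (c, v) in names] for row in rows]
    return names, matrix


# Toy data: the columns (age group, sex, country) are hypothetical.
rows = [
    ["18-25", "F", "US"],
    ["26-35", "M", "IN"],
    ["18-25", "M", "US"],
]
names, X = one_hot_expand(rows)
print(names)
print(X)
```

Each original feature with m distinct values becomes m indicator columns, which is how 70-80 discrete features can grow to several hundred binary ones. PCA or clustering would then be run on the resulting matrix.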
---
---------- Forwarded message ----------
From: distasio at seas.upenn.edu <distasio at seas.upenn.edu>
Date: Nov 18, 2006 4:41 PM
Subject: CIS 520 final project proposal
To: ungar at cis.upenn.edu

Dear Professor Ungar,

Below is my proposal for the final project for CIS 520.

Sincerely,
Joseph Distasio

For my final project I propose to implement what is called stepwise k-nn,
which is a combination of normal stepwise regression and k nearest neighbors.
In normal stepwise regression, one picks the feature to add based on which
one produces the largest reduction in error. This error is calculated by
running linear regression on the data, using only the features that have
been added to the model so far. In stepwise k-nn, rather than using linear
regression to compute the error, one uses k nearest neighbors, which makes
its prediction by averaging the outputs of the k closest observations. As in
normal stepwise regression, the feature that produces the lowest error is the
one that is added to the model.

This method will be coded up in Matlab and run on two sets of data. To put
its performance in context, k nearest neighbors and normal stepwise
regression will also be coded up and run on the same two sets of data. The
results will then be compared to see whether stepwise k-nn is an improvement
over the other two methods (cross validation will be used to determine the
best value of k). If the results are inconclusive, more data sets will be
used in an attempt to find out which method is better.

It is expected that stepwise k-nn will ultimately produce better results for
two reasons. First, it will attempt to exclude the features that generate a
lot of error, while plain k nearest neighbors will still use these features
in its model. Second, it uses k nearest neighbors to generate its model,
which tends to fit more data sets well than the linear regression used by
normal stepwise regression.

I am planning on using the following two data sets for my final project:

http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.html
http://kdd.ics.uci.edu/databases/tic/tic.html

--
http://www.cis.upenn.edu/~ungar
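A minimal sketch of the stepwise k-nn idea from the proposal above, in Python rather than the Matlab the proposal plans to use. Several details are assumptions made for illustration and are not stated in the proposal: the error is leave-one-out mean squared error, distances are squared Euclidean, and selection stops as soon as no remaining feature lowers the error. The value of k is fixed here, whereas the proposal would choose it by cross validation.

```python
def knn_predict(train_X, train_y, query, k, features):
    """Average the outputs of the k training rows closest to query,
    measuring squared Euclidean distance over the chosen features only."""
    dists = []
    for x, y in zip(train_X, train_y):
        d = sum((x[f] - query[f]) ** 2 for f in features)
        dists.append((d, y))
    dists.sort(key=lambda pair: pair[0])
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def loo_error(X, y, k, features):
    """Leave-one-out mean squared error of k-NN on the chosen features."""
    total = 0.0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        pred = knn_predict(rest_X, rest_y, X[i], k, features)
        total += (pred - y[i]) ** 2
    return total / len(X)


def stepwise_knn(X, y, k):
    """Greedy forward selection: add the feature that lowers the error most,
    and stop when no remaining feature improves it (assumed stopping rule)."""
    selected = []
    remaining = list(range(len(X[0])))
    best_err = float("inf")
    while remaining:
        err, feat = min((loo_error(X, y, k, selected + [f]), f) for f in remaining)
        if err >= best_err:
            break
        selected.append(feat)
        remaining.remove(feat)
        best_err = err
    return selected, best_err


# Toy data: feature 0 determines the output, feature 1 is unrelated noise.
X = [[0, 5], [1, 1], [2, 9], [3, 2], [4, 7], [5, 3], [6, 8], [7, 0]]
y = [0, 1, 2, 3, 4, 5, 6, 7]
selected, err = stepwise_knn(X, y, k=2)
print(selected, err)
```

On this toy data the noise feature is left out, which is the behaviour the proposal expects to give stepwise k-nn an edge over plain k nearest neighbors.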
---
---------- Forwarded message ----------
From: Roy Anati <royanati at seas.upenn.edu>
Date: Nov 17, 2006 11:25 AM
Subject: CIS 520 - Project Proposal
To: "UNGAR, DR LYLE H" <UNGAR at cis.upenn.edu>

Dear Prof. Ungar,

Please find below my proposal for this term's CIS 520 project.

Sincerely,
Roy Anati

I intend to analyse different methods of feature selection on one object
recognition data set and one digit recognition data set. This analysis will
involve extracting features from both data sets using various methods. A
classifier will then be created for each feature type and data set, plus one
classifier for the raw data. Subsequently we will compare classification
results across the different feature extraction methods.

I'm planning on using the following image data sets:

Caltech 101 [1]: The Caltech 101 data set contains 9,197 images from 101
different categories. These include various objects such as faces, chairs
and more. Each category contains from 40 to 800 images, with most categories
having about 50 images. Most of the images are roughly 300 x 200 pixels. A
great advantage of this data set is that in most of the images the object is
centred and in a typical pose. Each image will first be resized (using
interpolation) to a single uniform size, in this case 300 x 200 pixels. The
images will then be converted to gray-scale to reduce dimensionality.

Optical Recognition of Handwritten Digits (original, unnormalized version)
[2]: This data set contains image information for handwritten digits. Each
digit is represented as a 32 x 32 bitmap image. The set contains handwriting
from 43 people, and the images fall into 10 different categories (digits
0..9).

The following methods will be used to generate features:

Principal Component Analysis: Since the data is pixel data and the features
are weighted equally, it seems logical to normalize the data by centering
each feature at mean 0 and scaling it to variance 1. PCA will then be used
to generate the final features. An advantage of using PCA is that it
transforms the feature space to reduce the amount of correlation between
features.

Discrete Cosine Transform: Used in the JPEG compression format, the DCT
generates frequency-based features for a given image. The DCT encodes the
data in an image in ascending order of frequency. Note that low frequencies
represent coarse image features (such as the average), whereas higher
frequencies represent variations in image color over a finer scale.

Possible modifications: employing different color spaces, using feature
reduction, generating features on image blocks.

After these features have been generated, the following classifier will be
trained on the given data:

K-Nearest Neighbours: For a given parameter k, we define the estimated value
of an observation as a function of its k nearest neighbours. Since this
model is to perform classification instead of regression, we will use a
voting system to classify test images instead of simple averaging. An
advantage of this system is that it requires no training, except for the
choice of the parameter k, which will be determined using cross-validation.
On the other hand, testing an image is a costly operation, because it
requires calculating the distance between the test image and every training
image.

Possible modifications: using different distance metrics, different
classification functions.

[1] L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models
from few training examples: an incremental Bayesian approach tested on 101
object categories. IEEE CVPR 2004, Workshop on Generative-Model Based
Vision. 2004.

[2] Alpaydin, C. Kaynak, Department of Computer Engineering, Bogazici
University, 80815 Istanbul, Turkey. alpaydin at boun.edu.tr. September 1998.

Sender: ungar at cis.upenn.edu
Source: http://gosset.wharton.upenn.edu/pipermail/email2wiki/2006-December/000018.html
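The DCT features and the voting classifier described in the proposal above could look roughly like the sketch below. It assumes small square gray-scale images stored as nested lists; the function names, the orthonormal DCT-II scaling, and the choice to keep a square block of low-frequency coefficients are illustrative assumptions, not details from the proposal. The direct four-loop transform is far too slow for 300 x 200 images and is only meant to show the idea.

```python
import math
from collections import Counter


def dct2_features(image, keep):
    """2-D DCT-II of a square image; return the keep x keep block of
    lowest-frequency coefficients, flattened row by row."""
    n = len(image)
    feats = []
    for u in range(keep):
        for v in range(keep):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (image[x][y]
                              * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                              * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            # Orthonormal scaling, so the (0, 0) term is n times the mean pixel.
            cu = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            cv = math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
            feats.append(cu * cv * total)
    return feats


def knn_classify(train_feats, train_labels, query, k):
    """Label query by majority vote among its k nearest training examples."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_feats[i], query)))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def flat_image(value, n=4):
    """Hypothetical toy image: an n x n patch with one constant gray level."""
    return [[value] * n for _ in range(n)]


train = [flat_image(0.1), flat_image(0.2), flat_image(0.8), flat_image(0.9)]
labels = ["dim", "dim", "bright", "bright"]
train_feats = [dct2_features(img, keep=2) for img in train]
query_feats = dct2_features(flat_image(0.75), keep=2)
print(knn_classify(train_feats, labels, query_feats, k=3))
```

Every test image is compared against every training image, which is the testing cost the proposal points out.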
Gesture Classification from Motion Capture Data

Our plan is to analyze motion capture data from the Human Modeling and
Simulation motion capture lab at Penn. Motion capture data is primarily used
for capturing motions to be used in an animation or as an analysis tool in
biomechanics research. We wish to use the data and machine learning
techniques to recognize the type of motion being performed. We will begin
with a set of 9 simple full-body motions. We will then run a variety of
methods on the captured data and discuss their performance. The methods
include: K-nearest neighbors, forward stepwise regression, principal
component regression, ridge regression, and Uniform Cost Search.
Additionally, we will use K-means to cluster the data in an unsupervised way
and compare the results with the hand-annotated classification and with some
of the methods mentioned above.

Our data consists of motions that we previously captured with the LiveActor
Motion Capture device in the Human Modeling and Simulation motion capture
lab at Penn. An actor (Joe) wears LED markers near each joint (30 were used
for this experiment), which are captured at a rate of 30 frames per second,
while the operator (Steve) starts and stops the capture session for each
motion. We captured 9 distinct motion gestures:

1) circles
2) pushups
3) kicks
4) punches
5) waves
6) judo chops
7) jumping jacks
8) walks (in-place)
9) come-heres

The data we are working with is the 3-D positional data (seen above) from
the 30 markers. For each marker, we have an X, Y, Z coordinate in 3-D space.
We then sampled 10 frames from each motion, discarding the other frames, so
that every motion has the same length. These coordinates constitute our
features. Each motion was hand-labeled with one of the 9 gesture classes.

Motion capture data is also susceptible to noise in the capturing process. A
major concern is overfitting, since we have many features and not much data.
Researchers: Joe Kider and Steven Crowe
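To make the feature layout above concrete, here is a small sketch of turning one captured motion into a fixed-length feature row. The write-up does not say how the 10 frames were chosen, so evenly spaced sampling is an assumption here, as are the function names and the synthetic clip.

```python
def sample_frames(frames, n_samples=10):
    """Pick n_samples evenly spaced frames, always keeping the first and last.
    Assumes the clip has at least n_samples frames."""
    last = len(frames) - 1
    picks = [round(i * last / (n_samples - 1)) for i in range(n_samples)]
    return [frames[p] for p in picks]


def to_feature_vector(frames, n_samples=10):
    """Flatten the sampled frames into one row: frame by frame,
    marker by marker, then x, y, z."""
    row = []
    for frame in sample_frames(frames, n_samples):
        for (x, y, z) in frame:
            row.extend([x, y, z])
    return row


# Synthetic clip: 45 frames of 30 markers, with made-up coordinates.
clip = [[(float(t), float(m), 0.0) for m in range(30)] for t in range(45)]
row = to_feature_vector(clip)
print(len(row))
```

With 30 markers, 3 coordinates and 10 frames, each motion becomes a row of 900 numbers, which shows why having too many features for the amount of data is a concern.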
---------- Forwarded message ----------
From: Lyle Ungar <ungar at cis.upenn.edu>
Date: Dec 2, 2006 11:37 PM
Subject: Fwd: CIS 520 - Project Proposal
To: email2wiki at gosset.wharton.upenn.edu

---------- Forwarded message ----------
From: wwoodwor at seas.upenn.edu <wwoodwor at seas.upenn.edu>
Date: Nov 20, 2006 7:24 PM
Subject: CIS 520 - Project Proposal
To: ungar at cis.upenn.edu
Cc: thomas.p.barker at gmail.com

Dr. Ungar:

Included in this email is the project proposal for me (Bill Woodworth) and
Thomas Barker. Let us know if you have any questions or concerns.

Thanks,
Bill

-------------------------------

Thomas Barker
Bill Woodworth

Proposed data sets:

Forest Cover Type (from http://kdd.ics.uci.edu/summary.data.type.html)
This is an extensive data set spanning over half a million records. Each
observation is classified into one of seven "cover types." Unlike the spam
data set we have been using, "forest cover type" will give us an opportunity
to examine the behavior of various classification methods on non-binary,
non-real-valued outputs.

Pollution datasets: NO2 and PM10 (from http://lib.stat.cmu.edu/datasets/)
Both of these data sets are significantly smaller than the one described
above, and contain measurements of factors contributing to pollution from a
Norwegian data set. Both data sets share the same features, but have
different response measures (which are real-valued). Given this
relationship, it might be interesting to examine the results of training on
one of the data sets and testing on the other, or to combine the data sets
into one larger set and measure performance that way.

Proposed methods:

We would like to observe how well methods such as PCA and ridge regression
(using cross validation to choose the number of principal components and to
determine lambda) compare to more sophisticated methods, such as a search
with belief networks or possibly a clustering approach, on each of the
different-sized data sets. We plan to compare these approaches across the
three data sets, with specific attention paid to how variations in the
number of observations and in the complexity of the classifiers affect the
overall performance. This will likely include varying the size of our
validation and training sets and the way in which they are selected,
particularly in the case of the pollution data sets.

Sender: ungar at cis.upenn.edu
Source: http://gosset.wharton.upenn.edu/pipermail/email2wiki/2006-December/000019.html
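Choosing lambda for ridge regression by cross validation, as the proposal above suggests, might be sketched like this. The fold assignment (every folds-th row), the absence of an intercept term, the candidate lambda values and the toy data are all assumptions made for illustration. The same loop could be reused to pick the number of principal components.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ridge_fit(X, y, lam):
    """Weights minimizing ||Xw - y||^2 + lam * ||w||^2 (no intercept term)."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(p)]
    return solve(A, b)


def cv_error(X, y, lam, folds):
    """Mean squared error of ridge regression under k-fold cross validation."""
    total = 0.0
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        w = ridge_fit(train_X, train_y, lam)
        for i in test_idx:
            pred = sum(wj * xj for wj, xj in zip(w, X[i]))
            total += (pred - y[i]) ** 2
    return total / len(X)


def choose_lambda(X, y, lambdas, folds=4):
    """Return the candidate lambda with the lowest cross-validated error."""
    return min(lambdas, key=lambda lam: cv_error(X, y, lam, folds))


# Toy data: y = 2*x0 + 1*x1 exactly, with no noise.
X = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 2], [3, 1], [2, 3], [4, 0]]
y = [2 * a + b for a, b in X]
print(choose_lambda(X, y, [0.0, 1.0, 10.0]))
```

Because the toy data is noise-free, no shrinkage is needed and the smallest candidate wins. On noisy data such as the pollution sets, a larger lambda may give the lower validation error.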
---
I think this should work. There is also another similar data set, the "Yale
faces" data. Since you are processing the data already, you should also try
predicting who the person is (a separate model).

If you have time, you might also use non-negative matrix factorization as
well as PCA. I'll try to put up code for that.

You may find that when you compute the eigenvectors you need to compute only
the dominant eigenvectors, rather than all of them, in order to have MATLAB
converge in finite time.

- lyle

On 12/9/06, Jack Sim <ssimzie at gmail.com> wrote:
> Dear Professor Lyle Ungar,
>
> Below is my proposal for the final project.
> I hope you remember me: I was the one who submitted the proposal as a hard
> copy only, and I have not received a reply from you yet.
>
> I should have sent this e-mail much earlier, since I have already started
> coding for this project, but I think your advice would still be valuable
> for moving the project forward.
>
> Best regards,
> Jiwoong Sim
>
> ----------------------------------------------------------------------------
>
> Final Project Description
>
> Jiwoong Sim
>
> In this project, I will focus on machine learning using human facial images
> as input data. I picked this topic because I wanted to be able to verify
> the learned model directly by human perception. In many machine learning
> applications, the learned models are not intuitive enough for a human to
> comprehend. Although this is not always required, I wanted to pick a topic
> where a human could verify the learned model and reason about the result.
> For this purpose, images looked like the best choice of input, and since we
> covered human face images in the HW on PCA, I searched for a public face
> image data set.
>
> The public data set I found is "The Japanese Female Facial Expression
> (JAFFE) Database", which can be found at http://www.kasrl.org/jaffe.html.
> This data set contains 213 images posed by 10 female models. The advantage
> of this data set is that each image has 5 different scores for different
> types of emotion. The Cohn-Kanade facial expression database from CMU
> (http://vasc.ri.cmu.edu/idb/html/face/facial_expression/) also seems to
> satisfy the requirements for the project. I'll experiment on both data
> sets.
>
> The plan for this project is to perform PCA on this data set and find the
> correlation between the principal components and each emotion score. Using
> this learned model, we could not only classify a test input but also
> compose a new face image using the emotion scores as parameters, and verify
> how the composed image changes as the emotion score changes. A classifier
> will also be trained using the coefficients of the computed principal
> components.
>
> An expected problem for this project is that the number of input images may
> not be sufficient for learning a good model. The total number of input
> images might not be a problem, but there are only 10 models in the
> pictures, so the differences between models could overwhelm the differences
> between emotions. If this proves to be true, I will have to find another
> database or change the goal of the project (such as identifying the model
> from the input image, which would not be as interesting as the initial
> proposal).
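Lyle's advice in the reply above, to compute only the dominant eigenvectors, can be illustrated with power iteration. The sketch below is one way to do it and is not the code he mentions putting up: the function names, the random start vector, the fixed iteration count and the toy data are all assumptions. It never forms the full pixel-by-pixel covariance matrix, which matters when each face image has thousands of pixels.

```python
import math
import random


def top_components(X, n_components, iters=200, seed=0):
    """Leading principal directions of the rows of X, found one at a time by
    power iteration with deflation instead of a full eigendecomposition.
    n_components should not exceed the rank of the centered data."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(p)]
    data = [[row[j] - mean[j] for j in range(p)] for row in X]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for _ in range(iters):
            # Multiply v by (data^T data) without building that p x p matrix.
            scores = [sum(a * b for a, b in zip(row, v)) for row in data]
            w = [sum(scores[i] * data[i][j] for i in range(n)) for j in range(p)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        components.append(v)
        # Deflate: remove the found direction from every row before the next pass.
        for row in data:
            s = sum(a * b for a, b in zip(row, v))
            for j in range(p):
                row[j] -= s * v[j]
    return mean, components


def project(x, mean, components):
    """Coefficients of one observation on the principal directions."""
    centered = [a - m for a, m in zip(x, mean)]
    return [sum(a * b for a, b in zip(centered, c)) for c in components]


# Toy data standing in for flattened images: most variance lies along (1, 1, 0).
X = [[-2.0, -2.0, 0.1], [-1.0, -1.0, -0.1], [0.0, 0.0, 0.1],
     [1.0, 1.0, -0.1], [2.0, 2.0, 0.0]]
mean, comps = top_components(X, 1)
print(comps[0])
print(project(X[4], mean, comps))
```

The coefficients returned by `project` are the kind of per-image values that the proposal would correlate with the emotion scores or feed to a classifier. The sign of each component is arbitrary.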
