CIS5520 Project Proposals

From Knowledge Discovery

---

On 11/20/06, Tanmoy Chakraborty <tanmoy at seas.upenn.edu> wrote:
> We are merging 2 groups to do the project: the 4 people involved in the
> project are Tanmoy (myself), Varun, Savi and Ketaki.
>
> We shall use the "Internet Usage Data" available in the UC Irvine KDD
> archive. The data consists of demographic details of internet users in
> 1997. There are about 10,000 observations and about 70-80 features. Most of
> the features are discrete and can take 5-10 values (with the country feature
> taking more than 100 values). We shall expand the set of features to make it
> binary (as you had described to us), which will generate about 500 features.
> Then we shall run PCA and clustering on them.
> We shall keep one of the features (age group, sex, etc.) as the
> observation (output) column, and try to learn it from the other features
> using the various learning methods that we have learnt. (A decision tree
> seems to be a good choice, especially when applied with boosting.) We can
> also hold out different features in turn and try to predict each one: for
> example, we can take age group as the output and predict it, or take sex
> as the output and predict it.
>
> Tanmoy.
>



Sender: ungar at cis.upenn.edu
Source: http://gosset.wharton.upenn.edu/pipermail/email2wiki/2006-December/000009.html
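
The binary feature expansion described in the message above might look
roughly like the following. This is only a sketch: the function name
one_hot_expand and the toy records are hypothetical and are not taken from
the survey data. Each discrete feature becomes one 0/1 column per distinct
value, which is how 70-80 discrete features can grow to several hundred
binary columns.

```python
def one_hot_expand(rows, columns):
    """Expand discrete columns into binary indicator features.

    rows: list of dicts mapping column name -> discrete value.
    Returns (feature_names, binary_rows).
    """
    # Collect the distinct values seen in each column, sorted so the
    # output column order is deterministic.
    levels = {c: sorted({row[c] for row in rows}) for c in columns}
    feature_names = [f"{c}={v}" for c in columns for v in levels[c]]
    binary_rows = [
        [1 if row[c] == v else 0 for c in columns for v in levels[c]]
        for row in rows
    ]
    return feature_names, binary_rows


# Hypothetical toy records; the real data set has about 10,000 rows.
rows = [
    {"age_group": "21-25", "sex": "F", "country": "US"},
    {"age_group": "26-30", "sex": "M", "country": "IN"},
    {"age_group": "21-25", "sex": "M", "country": "US"},
]
names, X = one_hot_expand(rows, ["age_group", "sex", "country"])
print(names)
print(X)
```

To predict a held-out feature such as age group, that column would be left
out of the expansion and kept as the output.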

---

---------- Forwarded message ----------
From: distasio at seas.upenn.edu <distasio at seas.upenn.edu>
Date: Nov 18, 2006 4:41 PM
Subject: CIS 520 final project proposal
To: ungar at cis.upenn.edu


Dear Professor Ungar,
  Below is my proposal for the final project for CIS 520.

Sincerely,
Joseph Distasio


     For my final project I propose to implement what is called
stepwise k-nn, which is a combination of normal stepwise regression and
k nearest neighbors.  In normal stepwise regression, one picks the
feature to add based on which one produces the largest reduction in
error.  This error is calculated by running linear regression on the
data, using only the features that have currently been added to the
model.  In stepwise k-nn, rather than using linear regression to
generate the error, one uses k nearest neighbors, which makes its
prediction by averaging the output of the k closest observations.  As in
normal stepwise regression, the feature that produces the lowest error
will be the one that is added to the model.  This method will be coded
up in Matlab and run on two sets of data.  To put its performance in
context, k nearest neighbors and normal stepwise regression will also
be coded up and run on the same two sets of data.  These results will
then be compared to see whether stepwise k-nn is an improvement over
the other two methods (cross validation will be used to determine the
best value of k).  If the results are inconclusive, more data sets will
be used in an attempt to find out which method is better.  It is
expected that stepwise k-nn will ultimately produce better results for
two reasons.  First, it will attempt to exclude the features that
generate a lot of error, while k nearest neighbors will still use these
features in its model.  Secondly, it makes use of k nearest neighbors
to generate its model, which tends to fit better on more data sets than
linear regression, which is used by normal stepwise regression.
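
The selection loop described above could be sketched as follows. The
proposal plans a Matlab implementation; this Python version is only an
illustration, and it assumes leave-one-out squared error as the error
measure and a fixed k, neither of which the proposal specifies. All
function names are hypothetical.

```python
import math


def knn_predict(train_X, train_y, x, features, k):
    """Average the outputs of the k training rows closest to x,
    measuring Euclidean distance on the selected features only."""
    dists = sorted(
        (math.sqrt(sum((row[f] - x[f]) ** 2 for f in features)), y)
        for row, y in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def loo_error(X, y, features, k):
    """Leave-one-out mean squared error of k-nn on the chosen features."""
    total = 0.0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        pred = knn_predict(rest_X, rest_y, X[i], features, k)
        total += (pred - y[i]) ** 2
    return total / len(X)


def stepwise_knn(X, y, k):
    """Greedy forward selection: repeatedly add the feature that gives
    the lowest leave-one-out error, stopping when nothing improves."""
    selected, best_err = [], float("inf")
    remaining = list(range(len(X[0])))
    while remaining:
        err, f = min((loo_error(X, y, selected + [f], k), f) for f in remaining)
        if err >= best_err:
            break
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err
```

In this sketch a feature that only adds noise to the distance computation
raises the leave-one-out error and is therefore never added, which is the
first of the two expected advantages named above.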

 I am planning on using the following two data sets for my final project:

http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.html
http://kdd.ics.uci.edu/databases/tic/tic.html



http://www.cis.upenn.edu/~ungar

---

---------- Forwarded message ----------
From: Roy Anati <royanati at seas.upenn.edu>
Date: Nov 17, 2006 11:25 AM
Subject: CIS 520 - Project Proposal
To: "UNGAR, DR LYLE H" <UNGAR at cis.upenn.edu>


Dear Prof. Ungar,

Please find below my proposal for this term's CIS 520 project.

Sincerely,
Roy Anati

I intend to analyse different methods of feature selection on one
object recognition data set and one digit recognition data set. This
analysis will involve extracting features from both data sets using
various methods, after which a classifier will be created for each
feature type and data set, plus one classifier for the raw data.
Subsequently, we will compare classification results for the
different feature extraction methods.

I'm planning on using the following image data sets:

Caltech 101[1]: The Caltech 101 data set contains 9,197 images from 101
different categories. These include various objects like faces, chairs
and more. Each category contains from 40 to 800 images, with about 50
images in most categories. Most of the images are roughly
300 X 200 pixels. A great advantage of this data set is that in most of
the images the object is centred and in a typical pose. Each image is
first going to be resized to a single uniform size (using
interpolation), in this case 300 X 200 pixels. Further, the images are
going to be converted to grayscale to reduce dimensionality.

Optical Recognition of Handwritten Digits (Original, unnormalized version)[2]:
This data set contains image information for handwritten digits. Each
digit is represented as a 32X32 bitmap image. This set contains
handwriting from 43 people. The database contains images from 10
different categories (digits 0..9).

The following methods will be used to generate features:

Principal Component Analysis:
Since the data involved is pixel data, the features are weighted
equally, so it seems logical to normalize the data by centering
the mean at 0 and scaling the variance to 1.  PCA is then going to be
used to generate the final features. An advantage of using PCA is that it
transforms the feature space to reduce the amount of correlation
between features.
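
The normalization step described above might be sketched like this. It
assumes each image has already been flattened into a list of pixel
intensities of equal length; the function name standardize is hypothetical.
Pixels that are constant across all images have zero variance, so the sketch
leaves them at zero rather than dividing by zero.

```python
import math


def standardize(images):
    """Scale each pixel position to zero mean and unit variance.

    images: list of equal-length lists of pixel intensities.
    Pixels that never vary are set to zero instead of dividing by zero.
    """
    n = len(images)
    d = len(images[0])
    means = [sum(img[j] for img in images) / n for j in range(d)]
    # Population standard deviation of each pixel position.
    stds = [
        math.sqrt(sum((img[j] - means[j]) ** 2 for img in images) / n)
        for j in range(d)
    ]
    return [
        [(img[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for img in images
    ]
```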

Discrete Cosine Transform:
Used in the JPEG compression format, the DCT generates frequency-based
features for a given image. The DCT encodes the data inherent in an image
in ascending order of frequency. Note that low frequencies represent
coarse image features (such as the average) whereas higher frequencies
represent variations in image color over a finer scale.

Possible modifications: Employing different color spaces, using feature
reduction, generating features on image blocks.
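
As a rough illustration of DCT features, the sketch below computes an
orthonormal 2-D DCT-II of a small image block directly from its definition
and keeps only the lowest-frequency coefficients. The function names and the
choice of a square low-frequency corner are assumptions made for this
example; the proposal does not say which coefficients will be kept. The
direct formula is slow and is meant for clarity only.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a rectangular block (list of rows)."""
    rows, cols = len(block), len(block[0])

    def scale(index, size):
        # Normalization that makes the transform orthonormal.
        return math.sqrt((1 if index == 0 else 2) / size)

    coeffs = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            total = 0.0
            for i in range(rows):
                for j in range(cols):
                    total += (
                        block[i][j]
                        * math.cos(math.pi * (2 * i + 1) * u / (2 * rows))
                        * math.cos(math.pi * (2 * j + 1) * v / (2 * cols))
                    )
            out_row.append(scale(u, rows) * scale(v, cols) * total)
        coeffs.append(out_row)
    return coeffs


def low_frequency_features(block, keep):
    """Keep the top-left keep x keep corner, i.e. the lowest frequencies."""
    coeffs = dct_2d(block)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]
```

For a block of constant intensity, only the first coefficient (the one
proportional to the average) is nonzero, which matches the remark above that
low frequencies carry the coarse image content.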

After these features have been generated, the following classifier will
be trained on the given data:

K-Nearest Neighbours:
For a given parameter k, we define the estimated value of an
observation as a function of its k nearest neighbours. Since this
model is to perform classification instead of regression, we will use a
voting system to classify test images instead of simple
averaging. An advantage of this system is that it requires no training,
except for choosing the parameter k, which will be determined using
cross-validation. On the other hand, testing an image is a costly
operation that requires calculating the distance between the test image
and every training image.
Possible modifications: Using different distance metrics, different
classification functions.
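
A minimal sketch of the voting classifier described above, assuming
Euclidean distance and leave-one-out cross-validation for choosing k (the
proposal does not fix either choice). The function names are hypothetical.
Note that every prediction scans the whole training set, which is the cost
mentioned above.

```python
import math
from collections import Counter


def knn_classify(train_X, train_labels, x, k):
    """Label x by majority vote among its k nearest training points.

    Ties in the vote go to the label met first among the nearest points.
    """
    by_distance = sorted(
        zip(train_X, train_labels),
        key=lambda pair: math.dist(pair[0], x),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def choose_k(train_X, train_labels, candidates):
    """Pick the candidate k with the best leave-one-out accuracy."""

    def accuracy(k):
        hits = 0
        for i in range(len(train_X)):
            rest_X = train_X[:i] + train_X[i + 1:]
            rest_y = train_labels[:i] + train_labels[i + 1:]
            if knn_classify(rest_X, rest_y, train_X[i], k) == train_labels[i]:
                hits += 1
        return hits / len(train_X)

    return max(candidates, key=accuracy)
```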

[1] L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual
models from few training examples: an incremental Bayesian approach
tested on 101 object categories. IEEE. CVPR 2004, Workshop on
Generative-Model Based Vision. 2004

[2] Alpaydin, C. Kaynak, Department of Computer Engineering, Bogazici
University, 80815 Istanbul Turkey, alpaydin at boun.edu.tr, September 1998.
 

Sender: ungar at cis.upenn.edu
Source: http://gosset.wharton.upenn.edu/pipermail/email2wiki/2006-December/000018.html

---

Gesture Classification from Motion Capture Data

Our plan is to analyze motion capture data from the Human Modeling
and Simulation motion capture lab at Penn. Motion capture data is
primarily used for capturing motions to be used in an animation or
as an analysis tool in biomechanics research. We wish to use the
data and machine learning techniques to recognize the type of motions
being performed.

We will begin with a set of 9 simple full body motions. We will then
run a variety of methods on the captured data and discuss their
performance. Some methods include: K-nearest neighbors, forward
stepwise regression, principal component regression, ridge regression,
Uniform Cost Search. Additionally, we will use K-means to cluster the data
in an unsupervised way and compare the results to the hand annotated
classification on some of the methods mentioned previously.

Our data consists of motions that we previously captured with the
LiveActor Motion Capture device in the Human Modeling and Simulation
motion capture lab at Penn. An actor (Joe) wears LED markers placed near
each joint (30 were used for this experiment), which are captured at a
rate of 30 frames per second, while the operator (Steve) starts and stops
the capture session for each motion.

We captured 9 distinct motion gestures:
1) circles
2) pushups
3) kicks
4) punches
5) waves
6) judo chops
7) jumping jacks
8) walks (in-place)
9) come-heres

The data we are working with is the 3-D positional data (scene above)
from the 30 markers. For each marker, we have an X, Y, Z coordinate in 3-D
space. We then sampled each motion at 10 frames, throwing out the other
frames and ensuring that the motions were all the same length. These will
constitute our features. The motions were hand labeled with the appropriate
class for each of the 9 gestures. Also, motion capture data is
susceptible to noise in the capturing process. A major concern is
overfitting the data, since there are many features and not much data.
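
The feature construction described above could be sketched as follows. With
30 markers, 3 coordinates per marker and 10 sampled frames, each motion
becomes a vector of 30 x 3 x 10 = 900 numbers. The text does not say how the
10 frames were chosen, so evenly spaced sampling is an assumption here, and
the function names are hypothetical.

```python
def resample_frames(frames, target=10):
    """Pick `target` evenly spaced frames, including the first and last."""
    if len(frames) < target:
        raise ValueError("motion is shorter than the target length")
    last = len(frames) - 1
    indices = [round(i * last / (target - 1)) for i in range(target)]
    return [frames[i] for i in indices]


def motion_to_features(frames, target=10):
    """Flatten the sampled frames of (x, y, z) markers into one vector."""
    sampled = resample_frames(frames, target)
    return [coord for frame in sampled for marker in frame for coord in marker]


# Hypothetical 1.5 second capture: 45 frames of 30 markers each.
frames = [[(float(t), 0.0, float(m)) for m in range(30)] for t in range(45)]
features = motion_to_features(frames)
print(len(features))
```

With only a handful of recordings per gesture, 900 features per example
makes the overfitting concern raised above concrete.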

Researchers:  Joe Kider and Steven Crowe

---------- Forwarded message ----------
From: Lyle Ungar <ungar at cis.upenn.edu>
Date: Dec 2, 2006 11:37 PM
Subject: Fwd: CIS 520 - Project Proposal
To: email2wiki at gosset.wharton.upenn.edu


---------- Forwarded message ----------
From: wwoodwor at seas.upenn.edu <wwoodwor at seas.upenn.edu>
Date: Nov 20, 2006 7:24 PM
Subject: CIS 520 - Project Proposal
To: ungar at cis.upenn.edu
Cc: thomas.p.barker at gmail.com


Dr. Ungar:

Included in this email is the project proposal for me (Bill Woodworth)
and Thomas Barker.

Let us know if you have any questions or concerns.

Thanks,
Bill

-------------------------------
Thomas Barker
Bill Woodworth


Proposed data sets:

Forest Cover Type (from http://kdd.ics.uci.edu/summary.data.type.html)

This is an extensive data set spanning over a half million records.
Each observation is classified into one of seven "cover types." Unlike
the spam dataset we have been using, "forest cover type" will provide
us a unique opportunity to examine the behavior of various
classification methods on non-binary, non-real-valued outputs.

Pollution datasets - NO2 and PM10 (from http://lib.stat.cmu.edu/datasets/)

Both of these datasets are significantly smaller than the one described
above, and represent measurements of factors contributing to pollution
from a Norwegian data set. Both data sets share the same features, but
have different response measures (which are real-valued). Given this
relationship, it might be interesting to examine the results of
training on one of the data sets and testing on the other, or to
combine the data sets into one larger set and measure performance that
way.


Proposed methods:

We would like to observe how well methods such as PCA and ridge
regression (using cross validation for the number of principal
components and to determine lambda) compare to more sophisticated
methods such as a search with belief networks or possibly a clustering
approach on each of the different sized data sets.

We plan to compare these approaches across the three datasets, with
specific attention paid to how variations in the number of observations
and in the complexity of the classifiers affect the overall
performance. This will likely include varying the size and way in which
our validation and training data sets are selected, particularly in the
case of the pollution datasets.
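
Choosing lambda for ridge regression by cross validation, as mentioned
above, might be sketched like this. The sketch assumes the closed-form ridge
solution without a separate intercept term, interleaved folds, and strictly
positive lambda candidates (so the linear system is always solvable). All
function names are hypothetical.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w


def ridge_fit(X, y, lam):
    """Ridge weights from (X^T X + lam I) w = X^T y, with lam > 0."""
    d = len(X[0])
    A = [
        [sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve(A, b)


def cv_error(X, y, lam, folds=5):
    """Mean squared error of ridge under k-fold cross validation."""
    total = 0.0
    for f in range(folds):
        # Every folds-th row, starting at row f, forms the held-out fold.
        test_idx = set(range(f, len(X), folds))
        train_X = [r for i, r in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        w = ridge_fit(train_X, train_y, lam)
        for i in test_idx:
            pred = sum(wi * xi for wi, xi in zip(w, X[i]))
            total += (pred - y[i]) ** 2
    return total / len(X)


def choose_lambda(X, y, candidates, folds=5):
    """Return the candidate lambda with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_error(X, y, lam, folds))
```

The same loop could be reused to pick the number of principal components by
swapping the lambda candidates for component counts.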






Sender: ungar at cis.upenn.edu
Source: http://gosset.wharton.upenn.edu/pipermail/email2wiki/2006-December/000019.html

---

I think this should work.

There is also another similar data set, the "yale faces" data.

Since you are processing the data already, you should also try
predicting who the person is (a separate model).  If you have time,
you might also use non-negative matrix factorization as well as PCA.

I'll try to put up code for that.

You may find that when you compute the eigenvectors you need to
compute the dominant eigenvectors, rather than all of them, in order
to have Matlab converge in finite time.

_ lyle
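
One way to compute only a dominant eigenvector, as suggested above, is power
iteration. The message does not name a specific routine, so this is an
illustrative choice rather than the method intended there. The function name
and the default iteration limits are hypothetical. The sketch assumes a
symmetric matrix (such as a covariance matrix) and a starting vector that is
not orthogonal to the dominant eigenvector.

```python
import math


def dominant_eigenvector(matrix, iterations=500, tol=1e-12):
    """Power iteration: repeatedly multiply and renormalize a vector until
    it settles on the eigenvector with the largest-magnitude eigenvalue."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        # Rayleigh quotient estimate of the eigenvalue for the new vector.
        mw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        new_value = sum(a * b for a, b in zip(w, mw))
        converged = abs(new_value - eigenvalue) < tol
        v, eigenvalue = w, new_value
        if converged:
            break
    return eigenvalue, v


value, vector = dominant_eigenvector([[4.0, 2.0], [2.0, 3.0]])
print(value, vector)
```

Each step costs one matrix-vector product, which is far cheaper than a full
eigendecomposition when only the top few components are needed. In Matlab,
the built-in eigs function serves a similar purpose.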

On 12/9/06, Jack Sim <ssimzie at gmail.com> wrote:
> Dear Professor Lyle Ungar,
>
> Below is my proposal for the final project.
> I hope you remember me; I was the one who submitted the proposal as a hard
> copy only, and I have not received a reply from you yet.
>
> I should have sent this e-mail much earlier, since I have actually started
> coding for this project.
> But I think your advice would still be valuable for me in advancing
> the project.
>
> Best regards,
> Jiwoong Sim
>
> ----------------------------------------------------------------------------
>
>
> Final Project Description
>
> Jiwoong Sim
>
>
>
> In this project, I will focus on machine learning using human facial
> images as input data. I've picked this topic because I wanted to verify the
> learned model directly by human perception. In many machine learning
> applications, learned models are not intuitive enough for a human to
> comprehend. Although this is not always required in the machine learning
> process, I wanted to pick a topic in which a human could verify the learned
> model and reason about the result. For this purpose, images as
> input looked like the best choice, and we have also covered human face images
> as input in the HW on the PCA section, so I've searched for a public face
> image data set.
>
>
>
> The public data set I've found is "The Japanese Female Facial
> Expression (JAFFE) Database", which can be found at
> http://www.kasrl.org/jaffe.html. This data set contains 213 images posed by
> 10 female models. The advantage of this data set is that each image has 5
> different scores for different types of emotion. Also, the Cohn-Kanade facial
> expression database from CMU
> (http://vasc.ri.cmu.edu/idb/html/face/facial_expression/)
> seems to satisfy the requirements for the project. I'll experiment on both
> data sets.
>
>
>
> The plan for this project is to perform PCA on this data set and find the
> correlation between the principal components and each emotion score. Using
> this learned model, we could not only classify a test input but also
> compose a new face output using the emotion score as a parameter, and verify
> how the composed image changes as the emotion score changes. Classification
> will also be performed using the coefficients of the computed principal
> components.
>
>
>
> An expected problem for this project is that the number of input images may
> not be sufficient for learning a good model. The total number of input images
> might not be a problem, but there are only 10 models in these pictures, so the
> differences between models could overwhelm the differences between emotions.
> If this supposition proves to be true, I will have to find another
> database or change the purpose of learning in this project (such as
> identifying the model from an input image, which would not be as interesting
> as the initial proposal).

