Research

Overview
The Penn Data Mining Group develops principled means of learning models of the large and complex data sets widely found in both scientific and commercial relational databases. (Think of all the records a pharmaceutical research lab or a hospital has.) Due to the size and sparsity of these data and the limited amount of labeled training data and limited number of experiments that can be run, it is critical to incorporate prior knowledge into the learning process.
Prior knowledge can reside in text (e.g. scientific literature), in relational or other databases, or in known facts in physics (conservation of energy, forms of spiral nebulae) or biology (known interactions between genes or proteins). Much of our work involves finding better ways to use this knowledge in generative (e.g. belief nets, HMMs) or discriminative (e.g. regression or maximum entropy) modeling.
A key difficulty in learning from such data is that one generally does not know a priori which of the tens or hundreds of thousands of potential features to use in a model. In such cases, we use the structure of the data to systematically generate new features, and then test them for significance. These significance tests, in turn, suggest which subspaces of the feature space to explore further.
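As a rough illustration of the generate-then-test loop, the following Python sketch screens candidate features with a significance test. The specific test (a permutation test on Pearson correlation), the Bonferroni correction, and all function names are assumptions made for this example; the text does not say which tests the group actually uses.

```python
import random

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(feature, target, n_perm=500, seed=0):
    """Estimate how often a shuffled target reaches the observed |correlation|."""
    rng = random.Random(seed)
    observed = abs(correlation(feature, target))
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(correlation(feature, shuffled)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

def screen_features(candidates, target, alpha=0.05):
    """Keep candidates whose p-value passes a Bonferroni-corrected threshold (assumed)."""
    threshold = alpha / len(candidates)
    kept = {}
    for name, values in candidates.items():
        p = permutation_p_value(values, target)
        if p <= threshold:
            kept[name] = p
    return kept
```

In a setting like the one described, the names of the surviving features could then indicate which region of the feature space to expand next; that feedback step is not shown here.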
We are applying our methods to bioinformatics, robotics, multi-agent systems, and collaborative filtering. We draw on and contribute to a wide variety of modern statistical and machine learning methods, including maximum entropy methods, reinforcement learning, clustering, and probabilistic and statistical methods of learning from relational data.


Bioinformatics

Our work in bioinformatics falls into two main areas: incorporating richer forms of biological knowledge into generative models, and learning from data held in multiple relational databases and data implicit in scientific articles.

Incorporating biological knowledge into generative models. The prior knowledge may concern the proximity of genes along a chromosome, for predicting which genes were horizontally transferred, or which nucleotide distributions are or are not conserved across organisms, for gene identification in novel organisms.

When learning gene regulatory networks, the prior knowledge can describe which genes up- or down-regulate other genes, which genes are transcription factors, and the typical number of genes regulating and being regulated by genes of a given type. This knowledge can be used to reduce the number of experiments needed to build probabilistic models, such as Bayesian belief networks of gene regulation.

Statistical Relational Learning (SRL, described below) has tremendous potential for use in bioinformatics. As more genomes are sequenced and more gene expression data become available, cross-species comparisons are becoming increasingly important. Relational data linking genes, proteins, protein function and localization, and expression data are a perfect application for SRL. Many other problems, such as predicting protein-protein interactions, are best addressed by combining multiple types of data: results of careful experiments, as found in hand-curated databases or in scientific publications, and less accurate but more plentiful results from mass experiments.

The bioinformatics literature, as exemplified by articles abstracted in Medline, is of great utility to scientists, and potentially to computers, building models of gene function. However, the literature is vast. We are developing maximum entropy-based information extraction tools. Because it is impossible to hand-label more than a tiny fraction of the available text, it is critical to make use of databases that have already been manually extracted from the literature, and to develop new active learning techniques for determining which sentences to have annotators label.
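One common way to decide which sentences to send to annotators is uncertainty sampling: label the sentences on which the current model is least sure. The Python sketch below assumes a classifier has already produced class probabilities for each unlabeled sentence. The function names and the entropy criterion are illustrative assumptions, not a description of the group's own active learning techniques.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` sentences with the most uncertain predictions.

    `predictions` maps a sentence id to the model's class probabilities
    for that sentence (hypothetical input format).
    """
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:budget]
```

For example, given predictions of `[0.99, 0.01]`, `[0.5, 0.5]`, and `[0.8, 0.2]` for three sentences, the second and third would be chosen first, since a near-certain prediction gains little from a human label.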


Clustering and Collaborative Filtering

The value of a cluster must be determined by its use; otherwise, evaluation of unsupervised clustering methods is highly problematic. Clusters can improve predictive accuracy (e.g. when used as new features in SRL, or as an input to a program for finding gene splice sites), can improve performance on a reinforcement learning task (e.g. robotic navigation), or can produce better recommendations of new products or items, leading to increased consumer purchases or satisfaction. We integrate purchase history similarity (collaborative filtering) with item-based similarity (content-based recommendation). We have also developed a new measure of recommender performance, the CROC curve, which is appropriate when one is recommending items to many people.
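To make the idea of combining the two kinds of similarity concrete, here is a minimal Python sketch that blends a purchase-history score with a content score. The weighted average, the use of Jaccard similarity for both parts, and the names `buyers`, `keywords`, and `weight` are all assumptions for illustration; the group's actual integration method is not specified here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hybrid_similarity(item_x, item_y, buyers, keywords, weight=0.5):
    """Blend collaborative and content-based similarity between two items.

    buyers[item]   -> set of users who purchased the item (collaborative signal)
    keywords[item] -> set of descriptive terms for the item (content signal)
    weight         -> share given to the collaborative part (assumed parameter)
    """
    collaborative = jaccard(buyers[item_x], buyers[item_y])
    content = jaccard(keywords[item_x], keywords[item_y])
    return weight * collaborative + (1 - weight) * content

def recommend(item, buyers, keywords, k=2, weight=0.5):
    """Return the k items most similar to `item` under the blended score."""
    others = [i for i in buyers if i != item]
    others.sort(
        key=lambda i: hybrid_similarity(item, i, buyers, keywords, weight),
        reverse=True,
    )
    return others[:k]
```

Setting `weight` to 1 reduces this to pure collaborative filtering and setting it to 0 gives pure content-based recommendation, which makes it easy to compare the two extremes against a blend.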


Feature Selection

coming soon...


Information Extraction

coming soon...


Reinforcement Learning for Robotics and Multi-agent Systems

Learning in systems that require physical experiments, as opposed to simulations, requires new approaches to reinforcement learning (RL). We have developed policy gradient-based RL methods that quickly learn locally optimal control policies for mode-shifting controllers. We have also studied different forms of learning coordination in multi-agent systems.
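For readers unfamiliar with policy gradient methods, the following Python sketch shows the basic update on a toy problem: a softmax policy choosing among a few actions with fixed rewards. It is a textbook REINFORCE-style update, not the group's method for mode-shifting controllers, and the function names, learning rate, and reward setup are assumptions chosen for the example.

```python
import math
import random

def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_policy(rewards, steps=500, lr=0.1, seed=0):
    """Run a REINFORCE-style update on a bandit with fixed per-action rewards."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rewards[action]
        for a in range(len(prefs)):
            # Gradient of log pi(action) with respect to preference a.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * reward * grad_log
    return softmax(prefs)
```

Each trial here stands in for one physical experiment, which is why sample efficiency matters: the policy only improves when an action is actually tried and its reward observed.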

In current work, we are learning new definitions of state space, and spatial landmarks, using methods similar to those described below under Statistical Relational Learning. Different clusters of percepts (new state space representations) are generated and then tested for their utility in speeding learning and enhancing agent performance.


Statistical Relational Learning

We are developing techniques for automatically generating and selecting features from data stored in relational databases in order to build accurate regression models. This involves generating features by systematic search through a space of SQL queries against a database, generating new relations in the database using clustering, and selecting the most predictive features while avoiding overfitting. This is nontrivial: we generate tens of thousands of features, and avoid generating hundreds of thousands by using information on which features have been selected so far to guide the search for future features.
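The feature generation step can be pictured with a small Python sketch that enumerates aggregate SQL queries over a relational schema and turns each query into one candidate feature column. The schema (`paper`, `citation`, `author`), the two aggregates, and the function names are hypothetical, and the sketch enumerates a tiny fixed space rather than performing the guided search described above.

```python
import sqlite3

def build_demo_db():
    """Create a small in-memory database with a hypothetical citation schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE paper(id INTEGER PRIMARY KEY, label INTEGER);
        CREATE TABLE citation(paper_id INTEGER, cited_id INTEGER);
        CREATE TABLE author(paper_id INTEGER, name TEXT);
        INSERT INTO paper VALUES (1, 1), (2, 0), (3, 1);
        INSERT INTO citation VALUES (1, 2), (1, 3), (3, 2);
        INSERT INTO author VALUES (1, 'a'), (1, 'b'), (2, 'a'), (3, 'c');
    """)
    return conn

# Search space: every (related table, column) pair crossed with every aggregate.
RELATED = [("citation", "cited_id"), ("author", "name")]
AGGREGATES = {"count": "COUNT({col})", "n_distinct": "COUNT(DISTINCT {col})"}

def generate_features(conn):
    """Return {feature name: one value per paper, ordered by paper id}."""
    features = {}
    for table, col in RELATED:
        for agg_name, agg_sql in AGGREGATES.items():
            query = (
                f"SELECT p.id, {agg_sql.format(col='r.' + col)} "
                f"FROM paper AS p LEFT JOIN {table} AS r ON r.paper_id = p.id "
                f"GROUP BY p.id ORDER BY p.id"
            )
            # LEFT JOIN keeps papers with no related rows; their aggregate is 0.
            features[f"{agg_name}_{table}_{col}"] = [v for _, v in conn.execute(query)]
    return features
```

Each generated column could then be passed to a feature selection step, and the outcome of that step used to decide which further joins or aggregates are worth issuing.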

We have applied SRL to the task of predicting publication venue for articles in CiteSeer and paper citations in articles in physics. In both cases, we have many articles with authors, citations, publication location, and the text of the abstract and title. In future work, we will extend these methods to problems in bioinformatics.

This work also has implications for cognitive science. Questions of which concepts, such as object permanence, are innate vs. learned can be addressed, at least obliquely, by showing that complex concepts can be learned by recursive application of relatively simple statistical learning rules.
