Research
Bioinformatics
Our work in bioinformatics falls into two main areas: incorporating richer forms of biological knowledge into generative models, and learning from data in multiple relational databases and from data implicit in scientific articles.
Incorporating biological knowledge into generative models
The prior knowledge may take the form of the proximity of genes along a chromosome, for predicting which genes were horizontally transferred, or of which nucleotide distributions are or are not conserved across organisms, for identifying genes in novel organisms.
When learning gene regulatory networks, the prior knowledge can cover which genes up- or down-regulate other genes, which genes are transcription factors, and the typical number of genes regulating and being regulated by genes of a given type. This knowledge can be used to reduce the number of experiments needed to build probabilistic models of gene regulation, such as Bayesian belief networks.
Statistical Relational Learning (SRL, described below) has tremendous potential for use in bioinformatics. As more genomes are sequenced and more gene expression data becomes available, cross-species comparisons are becoming increasingly important. Relational data linking genes, proteins, protein function and localization, and expression data are a natural application for SRL. Many other problems, such as predicting protein-protein interactions, are best addressed by combining multiple types of data: results of careful experiments, as found in hand-curated databases or in scientific publications, and less accurate but more plentiful results of mass experiments.
The bioinformatics literature, as exemplified by articles abstracted in Medline, is of great utility to scientists, and potentially to computers, building models of gene function. However, the literature is vast. We are developing maximum entropy-based information extraction tools. Because it is impossible to hand-label more than a tiny fraction of the available text, it is critical to make use of databases that have already been manually extracted from the literature, and to develop new active learning techniques for determining which sentences to have annotators label.
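The sentence-selection step can be illustrated with a minimal sketch. It assumes uncertainty sampling, one common active learning strategy and not necessarily the one being developed here: given a classifier's predicted label probabilities for each unlabeled sentence (a maximum entropy model produces such probabilities), the sentences with the highest predictive entropy are sent to annotators. The function names and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_annotation(predictions, k):
    """Return the ids of the k sentences whose predicted label
    distribution is most uncertain (highest entropy)."""
    ranked = sorted(predictions,
                    key=lambda sid: entropy(predictions[sid]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs: sentence id -> probabilities over two labels.
predictions = {
    "s1": [0.98, 0.02],
    "s2": [0.55, 0.45],
    "s3": [0.80, 0.20],
    "s4": [0.50, 0.50],
}

chosen = select_for_annotation(predictions, 2)
print(chosen)
```

In this toy example the two sentences closest to a 50/50 prediction are chosen, so annotator effort goes where the current model is least sure.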
Clustering and Collaborative Filtering
The value of a cluster must be determined by its use; otherwise, evaluation of unsupervised clustering methods is highly problematic. Clusters can improve predictive accuracy (e.g. when used as new features in SRL, or as input to a program for finding gene splice sites), can improve performance on a reinforcement learning task (e.g. robotic navigation), or can produce better recommendations of new products or items, leading to increased consumer purchases or satisfaction. We integrate purchase history similarity (collaborative filtering) with item-based similarity (content-based recommendation). We have also developed a new measure of recommender performance, the CROC curve, which is appropriate when one is recommending items to many people.
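One simple way to picture the integration of the two kinds of similarity is a weighted blend. The sketch below assumes cosine similarity and a linear mix with a hypothetical weight alpha; the text does not say how the two signals are actually combined, so this is an illustration only. The item names and vectors are invented toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_similarity(item_i, item_j, purchases, content, alpha=0.5):
    """Blend purchase-history similarity with content similarity.
    alpha is an assumed mixing weight, not a value from the text."""
    collaborative = cosine(purchases[item_i], purchases[item_j])
    content_based = cosine(content[item_i], content[item_j])
    return alpha * collaborative + (1.0 - alpha) * content_based

# Toy data: one purchase indicator per customer, one entry per content feature.
purchases = {"A": [1, 1, 0, 0], "B": [1, 1, 0, 0], "C": [0, 0, 0, 0]}  # C is new
content = {"A": [1, 0, 1], "B": [0, 1, 0], "C": [1, 0, 1]}

sim_ab = hybrid_similarity("A", "B", purchases, content)
sim_ac = hybrid_similarity("A", "C", purchases, content)
sim_bc = hybrid_similarity("B", "C", purchases, content)
print(sim_ab, sim_ac, sim_bc)
```

Item C has no purchase history, so pure collaborative filtering could not relate it to anything; the content term still links it to A.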
Feature Selection
coming soon...
Information Extraction
coming soon...
Reinforcement Learning for Robotics and Multi-agent Systems
Learning in systems that require physical experiments, as opposed to simulations, requires new approaches to reinforcement learning (RL). We have developed policy gradient-based RL methods which quickly learn locally optimal control policies for mode-shifting controllers. We have also studied different forms of learning coordination in multi-agent systems.
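To make the idea of a policy gradient method concrete, here is a minimal sketch of a textbook REINFORCE-style update with a running-average baseline. It is not the method developed in this work: it assumes a controller that picks one of two modes through a softmax policy, and the function run_trial is a hypothetical stand-in for a single physical experiment with noisy reward.

```python
import math
import random

def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_trial(action, rng):
    """Hypothetical stand-in for one physical experiment: noisy reward."""
    mean_reward = [0.2, 1.0][action]  # assumed: mode 1 is the better mode
    return mean_reward + rng.gauss(0.0, 0.1)

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]   # policy parameters, one preference per mode
    baseline = 0.0       # running average of reward, reduces variance
    for _ in range(steps):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        reward = run_trial(action, rng)
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. preference k is 1[k == action] - pi(k).
        for k in range(len(prefs)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * advantage * grad_log_pi
        baseline += 0.1 * (reward - baseline)
    return softmax(prefs)

final_probs = train()
print(final_probs)
```

Each update uses only the reward observed in one trial, which is why methods of this family suit settings where every sample costs a real experiment.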
In current work, we are learning new definitions of state space, and spatial landmarks, using methods similar to those described below under SRL. Different clusters of percepts (new state space representations) are generated and then tested for their utility in speeding learning and enhancing agent performance.
Statistical Relational Learning
We are developing techniques for automatically generating and selecting features from data stored in relational databases in order to build accurate regression models. This involves generating features by systematic search through a space of SQL queries against a database, generating new relations in the database using clustering, and selecting the most predictive features while avoiding overfitting. This is nontrivial: we generate tens of thousands of features, and we avoid generating hundreds of thousands by using information on which features have been selected so far to guide the search for further features.
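The generate-then-select loop can be sketched on a toy scale. The example below makes several assumptions not taken from the text: a hypothetical two-table schema (paper, cites), a tiny query space built from join direction and aggregate function, and a plain correlation threshold as the selection rule. The actual approach searches a far larger space, guides the search with previously selected features, and guards against overfitting more carefully.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paper (id INTEGER PRIMARY KEY, year INTEGER, target REAL);
CREATE TABLE cites (src INTEGER, dst INTEGER);
INSERT INTO paper VALUES (1, 1998, 0.0), (2, 1999, 1.0), (3, 2000, 2.0), (4, 2001, 3.0);
INSERT INTO cites VALUES (2, 1), (3, 1), (3, 4), (4, 1), (4, 2), (4, 3);
""")

def candidate_queries():
    """Enumerate a small space of aggregate queries: join direction x aggregate."""
    for name, mine, other in (("out", "src", "dst"), ("in", "dst", "src")):
        for agg in ("COUNT(q.id)", "AVG(q.year)"):
            sql = (
                f"SELECT p.id, COALESCE({agg}, 0) FROM paper p "
                f"LEFT JOIN cites c ON c.{mine} = p.id "
                f"LEFT JOIN paper q ON q.id = c.{other} "
                f"GROUP BY p.id ORDER BY p.id"
            )
            yield f"{name}:{agg}", sql

def pearson(xs, ys):
    """Pearson correlation; 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

target = [row[0] for row in conn.execute("SELECT target FROM paper ORDER BY id")]

# Score every generated feature by |correlation| with the regression target.
scores = {}
for name, sql in candidate_queries():
    values = [row[1] for row in conn.execute(sql)]
    scores[name] = abs(pearson(values, target))

# Assumed selection rule: keep features above a fixed threshold.
selected = [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s >= 0.9]
print(selected)
```

In this toy database the target equals the number of outgoing citations, so only the outgoing-count feature passes the threshold.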
We have applied SRL to the tasks of predicting publication venue for articles in CiteSeer and paper citations in physics articles. In both cases, we have many articles with authors, citations, publication location, and the text of the abstract and title. In future work, we will extend these methods to problems in bioinformatics.
This work also has implications for cognitive science. Questions of which concepts, such as object permanence, are innate vs. learned can be addressed, at least obliquely, by showing that complex concepts can be learned by recursive application of relatively simple statistical learning rules.
