Showing 6 changed files with 477 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
## Bioinformatics | ||
|
||
Bioinformatics has emerged as an exciting new research area giving rise to numerous | ||
challenging computational problems whose successful solution will ultimately impact | ||
every aspect of our everyday life. This is currently one of the lab's main research | ||
thrust areas and is primarily designed to develop and apply data-mining and | ||
knowledge-based techniques to solve various problems arising in this field. | ||
|
||
Our ongoing research has led to the development of clustering and classification | ||
algorithms suitable for analyzing gene expression data, DNA- and | ||
protein-sequence-based classification algorithms, highly accurate remote homology | ||
recognition and fold prediction algorithms, scalable clustering algorithms for | ||
protein sequences, and algorithms that predict various aspects of a protein's | ||
secondary and tertiary structure based on its primary sequence. | ||
|
||
Many problems arising in bioinformatics can be formulated as classification or | ||
prediction problem instances whose goal is to gain some higher-level knowledge from | ||
primary information. Examples of such problems are gene prediction, promoter | ||
identification, protein family assignment, gene functional assignment, secondary | ||
structure prediction, fold-recognition, tertiary structure prediction, etc. | ||
Developing effective algorithms for these problems usually involves two steps. The | ||
first step is that of identifying the signals present in the data that capture the | ||
key physical/chemical/biological properties of the various objects and classes, | ||
whereas the second step is that of developing supervised machine learning algorithms | ||
that can properly model and exploit them toward the goal of building accurate | ||
classifiers. Within this context, our research is focused on identifying the right | ||
set of signals for various problems, developing novel classification algorithms, and | ||
analyzing whether or not there are sufficiently strong signals present in the | ||
datasets to allow for the effective use of computational techniques in the first | ||
place. | ||
|
||
Our current research in this area is focusing on protein function and structure | ||
prediction. Some of the specific research projects are: | ||
|
||
* Improve the performance of remote homology recognition and fold prediction algorithms | ||
by designing novel and effective kernel methods that combine various observed and/or | ||
predicted signals. | ||
|
||
* Enhance the effectiveness of local structure prediction algorithms by designing | ||
structural alphabets that combine predictability with structure reproducibility. | ||
|
||
* Improve the performance of ab initio | ||
|
||
The research over the years has been funded by a number of Federal agencies including | ||
ARL, NSF, and NIH. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
## Chemical Informatics | ||
|
||
Discovering new drugs is an expensive and challenging process. Any new drug should | ||
not only produce the desired response to the disease but should do so with minimal | ||
side effects and be superior to the existing drugs in the market. The goal of this | ||
project is to develop effective and efficient algorithms for analyzing chemical | ||
compound databases and identifying biologically active compounds. | ||
|
||
One of the key steps in the drug design process is the identification of the chemical | ||
compounds (hit compounds) that display the desired and reproducible behavior against | ||
the specific biomolecular target; this step represents a significant hurdle in the early | ||
stages of drug discovery. The 1990s saw the widespread adoption of high-throughput | ||
screening (HTS), which uses highly automated techniques to conduct the biological | ||
assays and can be used to screen a large number of compounds. Although the number of | ||
compounds that can be evaluated by these methods is very large, these numbers are | ||
small in comparison to the millions of drug-like compounds that exist or can be | ||
synthesized by combinatorial chemistry methods. Moreover, in most cases it is hard to | ||
find all desirable properties in a single compound and medicinal chemists are | ||
interested in not just identifying the hits but studying what part of the chemical | ||
compound leads to desirable behavior, so that new compounds can be rationally | ||
synthesized (lead development). | ||
|
||
Computational techniques that build models to correctly assign chemical compounds to | ||
various classes of interest can address these limitations. They have extensive | ||
applications in pharmaceutical research and are used extensively to replace or | ||
supplement HTS-based approaches. | ||
|
||
Our most recent research is concentrated on the following areas: | ||
|
||
* Develop computationally efficient algorithms to mine large databases of molecular | ||
graphs and identify key substructures present in active (inactive) compounds. | ||
|
||
* Develop sophisticated feature selection and generation algorithms that combine | ||
multiple criteria to identify and synthesize a set of substructure-based features | ||
that simultaneously simplify the representation of the original compounds while | ||
retaining and exposing their key features. | ||
|
||
* Develop kernel-based clustering and classification approaches that take into | ||
account the relationships between these substructures at different levels of | ||
granularity and complexity. | ||
|
||
The research over the years has been funded by a number of Federal agencies including | ||
NIH (primary agency), ARL, and NSF. | ||
|
||
### Software | ||
|
||
Our research thus far has resulted in the development of computationally efficient | ||
algorithms to find frequent substructures in molecular graphs (either topological or | ||
geometric). The topological version of this algorithm, called FSG, is currently | ||
available as part of our pattern discovery toolkit PAFI, which can be downloaded and | ||
used for educational and research purposes. | ||
|
||
Another recent development is the AFGen program that operates on a database of | ||
chemical compounds and generates their descriptor-based representation by considering | ||
all bounded length acyclic fragments that they contain. These descriptors are quite | ||
effective in capturing the structural characteristics of chemical compounds. | ||
Experiments in the context of SVM-based classification and ranked-retrieval show that | ||
these descriptors consistently and statistically outperform previously developed | ||
schemes based on the widely used fingerprint- and MACCS keys-based descriptors, as | ||
well as recently introduced descriptors obtained by mining and analyzing the | ||
structure of the molecular graphs. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
## Data Mining | ||
|
||
The goal of this project is to develop effective and computationally efficient | ||
algorithms for analyzing large volumes of data. The ultimate purpose of these | ||
analyses is to discover key and actionable information and gain insights about the | ||
underlying processes/systems that created the data (or are being described by the | ||
data). | ||
|
||
This emerging discipline is becoming increasingly important as advances in data | ||
collection have led to the explosive growth in the amount of available data. Data | ||
mining algorithms are used extensively to analyze business, commerce, scientific, | ||
engineering, and security data and dramatically improve the effectiveness of | ||
applications in areas such as marketing, predictive modeling, life sciences, | ||
information retrieval, and engineering. | ||
|
||
Our research was initially focused on developing high-performance scalable parallel | ||
algorithms for solving core data mining problems, but in recent years it has expanded | ||
to include research on fundamental data mining algorithms in the areas of data | ||
clustering, classification, pattern discovery, sequence mining, and graph mining, and their | ||
applications in information retrieval, collaborative filtering, and bioinformatics. | ||
|
||
Our latest research is focusing on the following areas: | ||
|
||
* Algorithms for finding meaningful clusters in large sparse graphs like those arising | ||
in relational/social networks and the web. | ||
|
||
* Large-margin and kernel-based classification algorithms with an emphasis on | ||
algorithms that can learn arbitrary output spaces. | ||
|
||
* Algorithms that can mine large and complex graphs. | ||
|
||
The research over the years has been funded by a number of Federal agencies including | ||
ARL, NSF, and NIH. | ||
|
||
|
||
### Software | ||
|
||
Many of the algorithms that we developed have been made available to the public in | ||
the form of stand-alone programs and libraries that are used extensively at many | ||
academic, government, and industrial sites. These include the CLUTO clustering | ||
toolkit that implements different classes of feature- and similarity-based clustering | ||
algorithms, the SUGGEST and SLIM libraries of scalable collaborative-filtering based | ||
recommendation algorithms, and the PAFI pattern finding toolkit that contains various | ||
algorithms to find frequent patterns in transaction, sequence, and graph databases. | ||
|
||
All of these tools are available for download from our Software page. | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
## Graph Partitioning | ||
|
||
This project is the longest running research activity in the lab and dates back to | ||
the time of George's PhD work. The fundamental problem that it is trying to solve is | ||
that of splitting a large irregular graph into k parts. This problem has | ||
applications in many different areas, including parallel/distributed computing (load | ||
balancing of computations), scientific computing (fill-reducing matrix re-orderings), | ||
EDA algorithms for VLSI CAD (placement), data mining (clustering), social network | ||
analysis (community discovery), pattern recognition, relationship network analysis, | ||
etc. | ||
|
||
The partitioning is usually done so that it satisfies certain constraints and | ||
optimizes certain objectives. The most common constraint is that of producing | ||
equal-size partitions, whereas the most common objective is that of minimizing the | ||
number of cut edges (i.e., the edges that straddle partition boundaries). However, in | ||
many cases, different application areas tend to require their own type of constraints | ||
and objectives, which makes the problem all the more interesting and challenging! | ||
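
The constraint and objective described above can be made concrete with a small sketch. The function names `edge_cut` and `imbalance` and the toy graph are illustrative assumptions, not part of any of the lab's tools; the code only evaluates a given partition, it does not compute one.

```python
from collections import Counter

def edge_cut(edges, part):
    """Count the edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(part, k):
    """Largest part size divided by the ideal size n/k (1.0 means perfectly balanced)."""
    sizes = Counter(part.values())
    ideal = len(part) / k
    return max(sizes.get(p, 0) for p in range(k)) / ideal

# Toy graph (an assumption for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(edge_cut(edges, part))   # 1
print(imbalance(part, 2))      # 1.0
```

Splitting the graph between the two triangles cuts a single edge while keeping both parts the same size, which is the kind of solution the equal-size constraint and the cut-edge objective favor.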
|
||
The research in the lab is focusing on a class of algorithms that have come to be | ||
known as multilevel graph partitioning algorithms. These algorithms solve the problem | ||
by following an approximate-and-solve paradigm, which is very effective for this as | ||
well as other (combinatorial) optimization problems. | ||
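
As a rough illustration of the approximate-and-solve idea, the sketch below performs one coarsening step and projects a coarse partition back to the original graph. It assumes a greedy heavy-edge matching for coarsening, which is one commonly used heuristic; the text above does not say which scheme is used, and the names `coarsen` and `project` are hypothetical. Refinement after projection is omitted.

```python
def coarsen(adj):
    """One coarsening level: greedily merge each vertex with its heaviest unmatched neighbor.

    adj maps vertex -> {neighbor: edge_weight}. Returns (coarse_adj, cmap), where
    cmap maps each original vertex to the coarse vertex that contains it.
    """
    cmap = {}
    next_id = 0
    for u in sorted(adj):
        if u in cmap:
            continue
        candidates = [(w, v) for v, w in adj[u].items() if v not in cmap and v != u]
        cmap[u] = next_id
        if candidates:
            _, v = max(candidates)      # heaviest edge to an unmatched neighbor
            cmap[v] = next_id
        next_id += 1

    # Sum the weights of edges that connect different coarse vertices.
    coarse = {c: {} for c in range(next_id)}
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            cu, cv = cmap[u], cmap[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, cmap

def project(coarse_part, cmap):
    """Carry a partition of the coarse graph back to the original vertices."""
    return {u: coarse_part[c] for u, c in cmap.items()}

# Toy 4-cycle (an assumption for illustration) with two heavy and two light edges.
adj = {
    0: {1: 5, 3: 1},
    1: {0: 5, 2: 1},
    2: {1: 1, 3: 5},
    3: {2: 5, 0: 1},
}
coarse, cmap = coarsen(adj)
print(coarse)                           # {0: {1: 2}, 1: {0: 2}}
print(project({0: 0, 1: 1}, cmap))      # {0: 0, 1: 0, 2: 1, 3: 1}
```

Contracting the heavy edges leaves a two-vertex graph that is trivial to split, and the projected partition cuts only the two light edges of the original graph.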
|
||
Over the years we have focused on, and produced good solutions for, a number of | ||
graph-partitioning related problems. These include partitioning algorithms for graphs | ||
corresponding to finite element meshes, multilevel nested dissection, parallel | ||
graph/mesh partitioning, dynamic/adaptive graph repartitioning, multi-constraint and | ||
multi-objective partitioning, and circuit and hypergraph partitioning. | ||
|
||
Our latest research is focusing on three key areas: | ||
|
||
* Mesh/graph partitioning algorithms that take into account the fine-grain characteristics of | ||
the underlying parallel computer and can deal with heterogeneous computing and | ||
communication capabilities. | ||
|
||
* Partitioning/load-balancing algorithms for mesh-less or mesh/particles scientific | ||
simulations. | ||
|
||
* Partitioning algorithms for scale-free graphs and/or graphs whose degree distribution | ||
follows a power-law curve. | ||
|
||
The research over the years has been funded by a number of Federal agencies including | ||
DOE, ARO, ARL, and NSF, and companies including IBM, SGI, and Cray. | ||
|
||
|
||
### Software | ||
|
||
A direct outcome of our graph partitioning research is the development of the METIS | ||
family of multilevel partitioning programs and libraries. This includes | ||
computationally efficient and highly effective tools for partitioning very large | ||
graphs on serial and parallel computers as well as tools for partitioning | ||
hypergraphs, especially those corresponding to netlists of VLSI circuits. | ||
|
||
In addition, some ideas inspired by the multilevel optimization algorithms | ||
developed for graph partitioning found their way into two other research projects. The | ||
first is the work on MGridGen, a tool to generate a sequence of coarse grids for | ||
geometric multigrid-based preconditioners. The second is the work on CLUTO, our data | ||
clustering tool, which contains effective graph-partitioning based clustering | ||
algorithms. In fact, the quality of the partitionings produced by CLUTO is in | ||
general better than that produced by the serial graph partitioning algorithm in | ||
METIS. | ||
|
||
All of these tools are available for download from our Software page. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
## Software Tools Developed by Lab Members | ||
|
||
Over the years, the research in the lab has resulted in the development of a number | ||
of software tools and libraries for key problems in the areas of parallel processing, | ||
data mining, bioinformatics, and collaborative filtering. | ||
|
||
It is our general policy to make these tools available to the research community for | ||
use in their own research and/or non-commercial applications. | ||
|
||
Here is a list of software tools that can be downloaded. | ||
|
||
|
||
### METIS: A Family of Multilevel Partitioning Algorithms | ||
|
||
This is a collection of serial and parallel programs & libraries that can be used to | ||
partition unstructured graphs, finite element meshes, and hypergraphs, both on | ||
serial as well as on parallel computers. | ||
|
||
|
||
### CLUTO: Software for Clustering High-Dimensional DataSets | ||
|
||
This is a collection of computationally efficient and high-quality data clustering | ||
and cluster analysis programs & libraries that are well suited for high-dimensional | ||
data sets. | ||
|
||
|
||
### BDMPI: Big Data Message Passing Interface | ||
|
||
BDMPI is a message passing library and associated runtime system for developing | ||
out-of-core distributed computing applications for problems whose aggregate memory | ||
requirements exceed the amount of memory that is available on the underlying | ||
computing cluster. | ||
|
||
|
||
### SLIM - Sparse Linear Methods for Top-N Recommender Systems | ||
|
||
This is a library that implements a set of top-N recommendation methods that learn an | ||
item-item similarity matrix using sparse linear models. | ||
|
||
|
||
### NERSTRAND - Multi-threaded modularity-based graph clustering | ||
|
||
This is a program that implements various serial and parallel modularity-based graph | ||
clustering algorithms based on the multilevel paradigm. These algorithms can produce | ||
high-quality clustering solutions and can scale to very large graphs. | ||
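
For readers unfamiliar with the objective, here is a minimal sketch of how the modularity of a given clustering can be evaluated. It assumes an unweighted, undirected graph without self-loops and uses the standard Newman modularity definition; the function name is hypothetical and this is not NERSTRAND's code or interface.

```python
from collections import defaultdict

def modularity(edges, cluster):
    """Modularity Q = sum over clusters of (intra_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    intra = defaultdict(int)    # edges with both endpoints in the cluster
    degree = defaultdict(int)   # sum of vertex degrees in the cluster
    for u, v in edges:
        degree[cluster[u]] += 1
        degree[cluster[v]] += 1
        if cluster[u] == cluster[v]:
            intra[cluster[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "a" for v in range(6)}

print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

Putting each triangle in its own cluster gives Q = 5/14, while lumping all vertices together gives Q = 0, so a modularity-based method would prefer the first clustering.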
|
||
|
||
### SPLATT - Parallel Sparse Tensor Decomposition | ||
|
||
This is a program that implements serial and parallel algorithms for sparse tensor | ||
decomposition. | ||
|
||
|
||
### L2AP - Fast Cosine Similarity Search With Prefix L-2 Norm Bounds | ||
|
||
This is a program that implements various fast algorithms for finding the set of | ||
all pairs of similar vectors (e.g., documents) whose similarity is greater than a | ||
user-specified threshold. | ||
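
To make the problem statement concrete, the sketch below solves the all-pairs task by brute force. It is only a baseline that compares every pair of vectors; it does not use the prefix L2-norm bounds that give L2AP its speed, and the function names and sample documents are assumptions for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def all_pairs(vectors, threshold):
    """Return (i, j, similarity) for every pair whose cosine similarity exceeds threshold."""
    result = []
    for (i, a), (j, b) in combinations(enumerate(vectors), 2):
        s = cosine(a, b)
        if s > threshold:
            result.append((i, j, s))
    return result

docs = [
    {"graph": 1, "partition": 1},
    {"graph": 1, "partition": 1, "mesh": 1},
    {"tensor": 1},
]
pairs = all_pairs(docs, 0.5)
print([(i, j) for i, j, _ in pairs])   # [(0, 1)]
```

The brute-force version performs a quadratic number of comparisons, which is exactly the cost that pruning-based methods aim to avoid on large collections.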
|
||
|
||
### L2Knng - Fast K-Nearest Neighbor Graph Construction with L2-Norm Pruning | ||
|
||
This is a program that provides high-performance implementations of several methods | ||
for constructing the K-nearest neighbor graph of a set of vectors based on cosine | ||
similarity. | ||
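
As a point of reference, a K-nearest neighbor graph under cosine similarity can be built by exhaustive comparison, as in the sketch below. This is an assumed baseline for illustration only, without any of the pruning that L2Knng applies; `knn_graph` is a hypothetical name.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors given as equal-length lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_graph(vectors, k):
    """Map each vector index to the indices of its k most similar other vectors."""
    graph = {}
    for i, a in enumerate(vectors):
        sims = [(cosine(a, b), j) for j, b in enumerate(vectors) if j != i]
        graph[i] = [j for _, j in heapq.nlargest(k, sims)]
    return graph

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(knn_graph(vectors, 1))   # {0: [1], 1: [0], 2: [3], 3: [2]}
```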
|
||
|
||
### PAFI: Software for Finding Patterns in Diverse Datasets | ||
|
||
This is a collection of computationally efficient programs for finding frequent | ||
patterns in transactional, sequential, and graph datasets. | ||
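
The simplest of these settings, frequent itemsets in transactional data, can be sketched with a textbook level-wise (Apriori-style) search. This is an illustrative assumption about how such patterns can be found, not a description of PAFI's algorithms or command-line interface.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s)
        # Join frequent k-itemsets that differ in exactly one item to form (k+1)-candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        # Keep candidates whose k-subsets are all frequent and that meet the support threshold.
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, len(c) - 1))
                 and support(c) >= min_support]
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = frequent_itemsets(baskets, 3)
print(sorted("".join(sorted(s)) for s in found))   # ['a', 'ab', 'ac', 'b', 'bc', 'c']
```

With a minimum support of 3, every single item and every pair is frequent, but the triple `{a, b, c}` appears in only two transactions and is dropped.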
|
||
|
||
### AFGen: Fragment-based Descriptors for Chemical Compounds | ||
|
||
AFGen is a program that takes as input a set of chemical compounds and generates | ||
their vector-space representation based on the set of fragment-based descriptors they | ||
contain. The descriptor space consists of graph fragments that can have three | ||
different types of topologies: paths (PF), acyclic subgraphs (AF), and arbitrary | ||
topology subgraphs (GF). This vector-based representation can be used for different | ||
tasks in cheminformatics including similarity search, virtual screening, and library | ||
design. | ||
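
To give a feel for path-based fragments (PF), the sketch below counts all simple paths of bounded length in a small labeled graph and uses them as descriptor counts. It is a simplified illustration under assumed input conventions (atom labels as a list, bonds as `(u, v, symbol)` triples); it is not AFGen's algorithm, file format, or output.

```python
from collections import Counter

def path_fragments(atoms, bonds, max_bonds):
    """Count simple paths with 1..max_bonds bonds, keyed by a canonical label string."""
    adj = {i: [] for i in range(len(atoms))}
    for u, v, b in bonds:
        adj[u].append((v, b))
        adj[v].append((u, b))

    counts = Counter()

    def extend(path, labels):
        # labels alternates atom and bond symbols, e.g. ["C", "-", "C", "=", "O"]
        if len(path) > 1:
            # Pick the smaller of the two reading directions as the canonical key.
            key = min("".join(labels), "".join(reversed(labels)))
            counts[key] += 1
        if len(path) - 1 == max_bonds:
            return
        for v, b in adj[path[-1]]:
            if v not in path:
                extend(path + [v], labels + [b, atoms[v]])

    for start in adj:
        extend([start], [atoms[start]])
    # Every undirected path is discovered once from each end, so halve the counts.
    return {key: c // 2 for key, c in counts.items()}

# Heavy-atom skeleton C-C=O, used here purely as a toy input.
atoms = ["C", "C", "O"]
bonds = [(0, 1, "-"), (1, 2, "=")]
print(path_fragments(atoms, bonds, 2))
```

Each distinct key becomes one dimension of the vector-space representation, and its count is the value of that dimension for the compound.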
|
||
These descriptors are quite effective in capturing the structural characteristics of | ||
chemical compounds. Experiments in the context of SVM-based classification and | ||
ranked-retrieval show that these descriptors consistently and statistically | ||
outperform previously developed schemes based on the widely used fingerprint- and | ||
MACCS keys-based descriptors, as well as recently introduced descriptors obtained by | ||
mining and analyzing the structure of the molecular graphs. | ||
|
||
**Getting the files** | ||
|
||
* [afgen-2.0.0.tar.gz Linux (i686/x86_64)](files/afgen/afgen-2.0.0.tar.gz) | ||
|
||
**Installing** | ||
|
||
On Unix systems, after downloading AFGen you need to uncompress and untar it. This is | ||
achieved by executing the following commands: | ||
|
||
gunzip afgen-2.0.0.tar.gz | ||
tar -xvf afgen-2.0.0.tar | ||
|
||
At this point you should have a directory named afgen-2.0.0. This directory contains | ||
AFGen's stand-alone programs, its documentation, and a sample dataset. | ||
|
||
**Documentation** | ||
|
||
Instructions describing how to use AFGen can be found at afgen-2.0.0/doc/index.html. | ||
|
||
|
||
|
||
### SUGGEST: A top-N Recommender Engine | ||
|
||
SUGGEST is a Top-N recommendation engine that implements a variety of recommendation | ||
algorithms. Top-N recommender systems, a personalized information filtering | ||
technology, are used to identify a set of N items that will be of interest to a | ||
certain user. In recent years, top-N recommender systems have been used in a number | ||
of different applications, such as to recommend products a customer will most likely buy; | ||
recommend movies, TV programs, or music a user will find enjoyable; identify | ||
web-pages that will be of interest; or even suggest alternate ways of searching for | ||
information. | ||
|
||
The algorithms implemented by SUGGEST are based on collaborative filtering, which is | ||
the most successful and widely used framework for building recommender systems. | ||
SUGGEST implements two classes of collaborative filtering-based top-N recommendation | ||
algorithms, called user-based and item-based. | ||
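
The item-based variant can be sketched as follows. The example assumes binary purchase data and cosine similarity between items, and it scores each unseen item by summing its similarities to the items the user already has; the function name and data are hypothetical, and this is not SUGGEST's API.

```python
import math
from collections import defaultdict

def item_based_top_n(baskets, user_items, n):
    """Recommend up to n unseen items, ranked by summed item-item cosine similarity."""
    users_of = defaultdict(set)             # item -> set of users who have it
    for user, items in baskets.items():
        for item in items:
            users_of[item].add(user)
    users_of = dict(users_of)

    def sim(i, j):
        common = len(users_of[i] & users_of[j])
        return common / math.sqrt(len(users_of[i]) * len(users_of[j]))

    scores = {}
    for candidate in users_of:
        if candidate in user_items:
            continue
        scores[candidate] = sum(sim(candidate, owned)
                                for owned in user_items if owned in users_of)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:n] if score > 0]

baskets = {
    "u1": {"bread", "butter", "jam"},
    "u2": {"bread", "butter"},
    "u3": {"bread", "jam"},
    "u4": {"tea", "milk"},
    "u5": {"bread", "butter"},
}
print(item_based_top_n(baskets, {"bread"}, 2))   # ['butter', 'jam']
```

A user-based method would instead find users with similar baskets and recommend what they bought; the item-based form shown here only needs the item-item similarities, which can be precomputed.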
|
||
SUGGEST is currently distributed in a binary format and consists of a stand-alone | ||
executable program and a library, which can be used to call SUGGEST's routines | ||
directly from another application. | ||
|
||
|
||
|
||
|
||
### MGridGen: Multilevel Serial & Parallel Coarse Grid Construction Library | ||
|
||
MGridGen is a serial library written entirely in ANSI C that implements (serial) | ||
algorithms for obtaining a sequence of successive coarse grids that are well-suited | ||
for geometric multigrid methods. The quality of the elements of the coarse grids is | ||
optimized using a multilevel framework. It is portable on most Unix systems that have | ||
an ANSI C compiler. | ||
|
||
An MPI-based parallel version of MGridGen, called ParMGridGen, has also been | ||
developed that extends the functionality provided by MGridGen and is especially | ||
suited for large scale numerical simulations. It is written entirely in ANSI C and | ||
MPI and is portable on most parallel computers that support MPI. | ||
|
||
[Source code](https://github.com/mrklein/ParMGridGen) | ||
|
||
|
||
|
||
### PSPASES: A Parallel Sparse Direct Solver | ||
|
||
PSPASES (Parallel SPArse Symmetric dirEct Solver) is a high performance, scalable, | ||
parallel, MPI-based library, intended for solving linear systems of equations | ||
involving sparse symmetric positive definite matrices. The library provides various | ||
interfaces to solve the system using the four phases of the direct method of solution: | ||
compute fill-reducing ordering, perform symbolic factorization, compute numerical | ||
factorization, and solve triangular systems of equations. The library efficiently | ||
implements the scalable parallel algorithms developed by lab members and our | ||
collaborators to compute each of the phases. | ||
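
The last two phases can be illustrated on a tiny dense system. The sketch below is a serial, dense Cholesky factorization followed by forward and backward substitution; it assumes a symmetric positive definite input and skips the ordering and symbolic factorization phases, which only matter for sparse matrices. It is not PSPASES code, and the function names are hypothetical.

```python
import math

def cholesky(A):
    """Numerical factorization: return lower-triangular L with A = L * L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve(L, b):
    """Triangular solves: forward substitution with L, then backward with L^T."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

A = [[4, 2], [2, 3]]
b = [2, 1]
x = solve(cholesky(A), b)
print(x)
```

For this system the solution is x = (0.5, 0), which can be checked by substituting back into both equations.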
|
||
|
||
|
||
|
||
|