adding glaros
karypis committed Jul 7, 2024
1 parent f53e08d commit ed3edfe
Showing 6 changed files with 477 additions and 0 deletions.
45 changes: 45 additions & 0 deletions glaros/projects/bio.md
@@ -0,0 +1,45 @@
## Bioinformatics

Bioinformatics has emerged as an exciting research area, giving rise to numerous
challenging computational problems whose successful solution will ultimately impact
every aspect of our everyday lives. This is currently one of the lab's main research
thrusts, and our work in it is primarily aimed at developing and applying data-mining
and knowledge-based techniques to solve various problems arising in this field.

Our ongoing research has led to the development of clustering and classification
algorithms suitable for analyzing gene expression data, DNA- and
protein-sequence-based classification algorithms, highly accurate remote homology
recognition and fold prediction algorithms, scalable clustering algorithms for
protein sequences, and algorithms that predict various aspects of a protein's
secondary and tertiary structure based on its primary sequence.

Many problems arising in bioinformatics can be formulated as classification or
prediction problem instances whose goal is to gain some higher-level knowledge from
primary information. Examples of such problems are gene prediction, promoter
identification, protein family assignment, gene functional assignment, secondary
structure prediction, fold recognition, tertiary structure prediction, etc.
Developing effective algorithms for these problems usually involves two steps. The
first step is that of identifying the signals present in the data that capture the
key physical/chemical/biological properties of the various objects and classes,
whereas the second step is that of developing supervised machine learning algorithms
that can properly model and exploit them toward the goal of building accurate
classifiers. Within this context, our research is focused on identifying the right
set of signals for various problems, developing novel classification algorithms, and
analyzing whether or not there are sufficiently strong signals present in the
datasets to allow for the effective use of computational techniques in the first
place.
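As a toy illustration of the two-step approach described above, the sketch below extracts k-mer count features from short sequences (the signal-identification step) and trains a simple nearest-centroid classifier on them (the supervised-learning step). The data and the method are hypothetical stand-ins, not the lab's algorithms:

```python
from collections import Counter

def kmer_features(seq, k=2):
    """Step 1: identify signals -- counts of all length-k substrings."""
    return Counter(seq[i:i+k] for i in range(len(seq) - k + 1))

def train_centroids(labeled_seqs, k=2):
    """Step 2: supervised learning -- average feature vector per class."""
    sums, counts = {}, {}
    for seq, label in labeled_seqs:
        acc = sums.setdefault(label, Counter())
        acc.update(kmer_features(seq, k))
        counts[label] = counts.get(label, 0) + 1
    return {label: {f: v / counts[label] for f, v in acc.items()}
            for label, acc in sums.items()}

def classify(seq, centroids, k=2):
    """Assign the class whose centroid best matches the sequence's k-mers."""
    feats = kmer_features(seq, k)
    def score(c):
        return sum(feats[f] * w for f, w in c.items())
    return max(centroids, key=lambda label: score(centroids[label]))

# Hypothetical training data: AT-rich vs. GC-rich DNA sequences.
train = [("ATATATAT", "AT-rich"), ("TATATTAA", "AT-rich"),
         ("GCGCGGCC", "GC-rich"), ("CCGGCGGC", "GC-rich")]
centroids = train_centroids(train)
print(classify("ATTATA", centroids))  # -> AT-rich
```

Real problems replace the k-mer counts with physicochemical and evolutionary signals and the centroid rule with far stronger learners, but the division of labor is the same.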

Our current research in this area is focusing on protein function and structure
prediction. Some of the specific research projects are:

* Improve the performance of remote homology recognition and fold prediction algorithms
by designing novel and effective kernel methods that combine various observed and/or
predicted signals.

* Enhance the effectiveness of local structure prediction algorithms by designing
structural alphabets that combine predictability with structure reproducibility.

* Improve the performance of ab initio structure prediction algorithms.
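The kernel methods mentioned in the first project above combine sequence-derived signals inside a kernel function. A minimal sketch of one classical sequence kernel, the k-spectrum kernel (the inner product of k-mer count vectors), on hypothetical toy sequences:

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum kernel: inner product of the k-mer count vectors
    of two sequences; larger values mean more shared k-mers."""
    cx = Counter(x[i:i+k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i+k] for i in range(len(y) - k + 1))
    return sum(cx[m] * cy[m] for m in cx)

# Hypothetical protein fragments: near-identical vs. unrelated.
print(spectrum_kernel("MKTAYIAKQR", "MKTAYIAKQA"))
print(spectrum_kernel("MKTAYIAKQR", "GGGGGGGGGG"))  # no shared 3-mers -> 0
```

Kernels of this shape can be fed directly to any kernel classifier (e.g., an SVM), and combining several such kernels is one way to merge observed and predicted signals.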

The research over the years has been funded by a number of Federal agencies including
ARL, NSF, and NIH.
61 changes: 61 additions & 0 deletions glaros/projects/cheminfo.md
@@ -0,0 +1,61 @@
## Chemical Informatics

Discovering new drugs is an expensive and challenging process. Any new drug should
not only produce the desired response to the disease but should do so with minimal
side effects and be superior to the existing drugs in the market. The goal of this
project is to develop effective and efficient algorithms for analyzing chemical
compound databases and identifying biologically active compounds.

One of the key steps in the drug design process is the identification of the chemical
compounds (hit compounds) that display the desired and reproducible behavior against
the specific biomolecular target; this identification represents a significant hurdle
in the early stages of drug discovery. The 1990s saw the widespread adoption of
high-throughput screening (HTS), which uses highly automated techniques to conduct
the biological assays and can be used to screen a large number of compounds. Although
the number of compounds that can be evaluated by these methods is very large, it is
small in comparison to the millions of drug-like compounds that exist or can be
synthesized by combinatorial chemistry methods. Moreover, in most cases it is hard to
find all desirable properties in a single compound, and medicinal chemists are
interested not just in identifying the hits but in studying what part of the chemical
compound leads to the desirable behavior, so that new compounds can be rationally
synthesized (lead development).

Computational techniques that build models to correctly assign chemical compounds to
various classes of interest can address these limitations. They have extensive
applications in pharmaceutical research and are widely used to replace or supplement
HTS-based approaches.

Our most recent research is currently concentrated on the following areas:

* Develop computationally efficient algorithms to mine large databases of molecular
graphs and identify key substructures present in active (inactive) compounds.

* Develop sophisticated feature selection and generation algorithms that combine
multiple criteria to identify and synthesize a set of substructure-based features
that simultaneously simplify the representation of the original compounds while
retaining and exposing their key features.

* Develop kernel-based clustering and classification approaches that take into
account the relationships between these substructures at different levels of
granularity and complexity.
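Once key substructures have been identified, compounds can be compared through the substructures they share. A minimal sketch of this idea using the standard Tanimoto (Jaccard) coefficient on hypothetical substructure sets (not the lab's kernels):

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two substructure-feature sets:
    shared substructures divided by all substructures present in either."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical compounds described by the substructures they contain.
aspirin_like = {"benzene", "carboxyl", "ester"}
ibuprofen_like = {"benzene", "carboxyl", "isobutyl"}
sugar_like = {"pyranose", "hydroxyl"}

print(tanimoto(aspirin_like, ibuprofen_like))  # shares 2 of 4 features -> 0.5
print(tanimoto(aspirin_like, sugar_like))      # -> 0.0
```

Set-overlap similarities of this kind are the simplest case; kernel-based approaches generalize them by weighting substructures by size, frequency, or topology.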

The research over the years has been funded by a number of Federal agencies including
NIH (primary agency), ARL, and NSF.

### Software

Our research thus far has resulted in the development of computationally efficient
algorithms to find frequent substructures in molecular graphs (either topological or
geometric). The topological version of this algorithm, called FSG, is currently
available as part of our pattern discovery toolkit PAFI, which can be downloaded and
used for educational and research purposes.

Another recent development is the AFGen program, which operates on a database of
chemical compounds and generates their descriptor-based representation by considering
all bounded-length acyclic fragments that they contain. These descriptors are quite
effective in capturing the structural characteristics of chemical compounds.
Experiments in the context of SVM-based classification and ranked retrieval show that
these descriptors consistently and statistically significantly outperform previously
developed schemes based on the widely used fingerprint- and MACCS keys-based
descriptors, as well as recently introduced descriptors obtained by mining and
analyzing the structure of the molecular graphs.
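The descriptor-generation idea can be sketched in miniature: enumerate the bounded-length fragments of a small molecular graph and collect their label strings. The sketch below handles only simple paths and uses hypothetical atom labels; AFGen's actual fragment types and canonicalization are more involved:

```python
def path_fragments(atoms, bonds, max_len=3):
    """Enumerate simple paths of up to max_len bonds in a molecular graph
    and record their atom-label strings as path-based descriptors."""
    adj = {i: [] for i in atoms}
    for u, v in bonds:
        adj[u].append(v)
        adj[v].append(u)
    frags = set()

    def dfs(path):
        labels = [atoms[i] for i in path]
        # Canonicalize: a path read in either direction is the same fragment.
        frags.add(min("".join(labels), "".join(reversed(labels))))
        if len(path) <= max_len:  # path of p atoms has p-1 bonds
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    dfs(path + [nxt])

    for start in atoms:
        dfs([start])
    return frags

# Hypothetical three-atom molecule: C-C-O.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1), (1, 2)]
print(sorted(path_fragments(atoms, bonds)))
```

Counting how often each fragment occurs per compound turns these sets into the vector-space representation used for similarity search and classification.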
48 changes: 48 additions & 0 deletions glaros/projects/dm.md
@@ -0,0 +1,48 @@
## Data Mining

The goal of this project is to develop effective and computationally efficient
algorithms for analyzing large volumes of data. The ultimate purpose of these
analyses is to discover key and actionable information and gain insights about the
underlying processes/systems that created the data (or are being described by the
data).

This emerging discipline is becoming increasingly important as advances in data
collection have led to the explosive growth in the amount of available data. Data
mining algorithms are used extensively to analyze business, commerce, scientific,
engineering, and security data and dramatically improve the effectiveness of
applications in areas such as marketing, predictive modeling, life sciences,
information retrieval, and engineering.

Our research was initially focused on developing high-performance, scalable, parallel
algorithms for solving core data mining problems, but in recent years it has expanded
to include research on fundamental data mining algorithms in the areas of data
clustering, classification, pattern discovery, sequence mining, and graph mining, and
their applications in information retrieval, collaborative filtering, and
bioinformatics.

Our latest research is focusing on the following areas:

* Algorithms for finding meaningful clusters in large sparse graphs like those arising
in relational/social networks and the web.

* Large-margin and kernel-based classification algorithms with an emphasis towards
algorithms that can learn arbitrary output spaces.

* Algorithms that can mine large and complex graphs.
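As a toy illustration of the first bullet, the sketch below clusters a small weighted similarity graph by dropping weak edges and taking connected components. This is a hypothetical baseline, not one of the lab's algorithms:

```python
def clusters_by_threshold(edges, threshold):
    """Toy clustering of a weighted sparse graph: drop edges whose weight
    is below the threshold, then take connected components as clusters."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, [])
        adj.setdefault(v, [])
        if w >= threshold:
            adj[u].append(v)
            adj[v].append(u)
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:           # iterative depth-first traversal
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            seen.add(x)
            stack.extend(adj[x])
        comps.append(comp)
    return comps

# Hypothetical similarity graph: two tight groups joined by one weak edge.
edges = [("a", "b", .9), ("b", "c", .8), ("a", "c", .7),
         ("c", "d", .1),                     # weak bridge between groups
         ("d", "e", .9), ("e", "f", .8), ("d", "f", .7)]
print(clusters_by_threshold(edges, 0.5))
```

Real graph-clustering algorithms optimize global objectives (cut, modularity) rather than a single threshold, but the sparse-graph representation is the same.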

The research over the years has been funded by a number of Federal agencies including
ARL, NSF, and NIH.


### Software

Many of the algorithms that we developed have been made available to the public in
the form of stand-alone programs and libraries that are used extensively in many
academic, government, and industrial sites. This includes the CLUTO clustering
toolkit that implements different classes of feature- and similarity-based clustering
algorithms, the SUGGEST and SLIM libraries of scalable collaborative-filtering based
recommendation algorithms, and the PAFI pattern finding toolkit that contains various
algorithms to find frequent patterns in transaction, sequence, and graph databases.

All of these tools are available for download from our Software page.


63 changes: 63 additions & 0 deletions glaros/projects/gp.md
@@ -0,0 +1,63 @@
## Graph Partitioning

This project is the longest-running research activity in the lab and dates back to
the time of George's PhD work. The fundamental problem that it tries to solve is that
of splitting a large irregular graph into k parts. This problem has applications in
many different areas, including parallel/distributed computing (load balancing of
computations), scientific computing (fill-reducing matrix re-orderings), EDA
algorithms for VLSI CAD (placement), data mining (clustering), social network
analysis (community discovery), pattern recognition, relationship network analysis,
etc.

The partitioning is usually done so that it satisfies certain constraints and
optimizes certain objectives. The most common constraint is that of producing
equal-size partitions, whereas the most common objective is that of minimizing the
number of cut edges (i.e., the edges that straddle partition boundaries). However, in
many cases, different application areas tend to require their own types of
constraints and objectives, thus making the problem all the more interesting and
challenging!
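The standard constraint and objective just described are easy to state in code. A minimal sketch that computes the edge cut and the load balance of a given partition of a toy graph:

```python
def edge_cut(adj, part):
    """Number of edges whose endpoints fall in different parts."""
    return sum(1 for u in adj for v in adj[u]
               if u < v and part[u] != part[v])

def balance(part, k):
    """Largest part size relative to the ideal size n/k (1.0 is perfect)."""
    sizes = [list(part.values()).count(p) for p in range(k)]
    return max(sizes) / (len(part) / k)

# A 6-vertex graph split into two equal parts across a single bridge (2-3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(edge_cut(adj, part), balance(part, 2))  # -> 1 1.0
```

Multi-constraint and multi-objective formulations generalize exactly these two quantities: multiple weights per vertex and multiple cost measures per edge.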

The research in the lab is focusing on a class of algorithms that have come to be
known as multilevel graph partitioning algorithms. These algorithms solve the problem
by following an approximate-and-solve paradigm, which is very effective for this as
well as other (combinatorial) optimization problems.
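One ingredient of the multilevel paradigm is the coarsening phase, commonly performed via heavy-edge matching: matched vertex pairs are collapsed into single vertices of the next-coarser graph. A minimal sketch of one such coarsening step on a toy weighted graph (a simplification, not METIS's implementation):

```python
def heavy_edge_matching(adj):
    """One coarsening step: greedily match each unmatched vertex with the
    unmatched neighbor connected by the heaviest edge."""
    matched = {}
    for u in sorted(adj):
        if u in matched:
            continue
        best, best_w = None, -1
        for v, w in adj[u]:
            if v not in matched and v != u and w > best_w:
                best, best_w = v, w
        if best is not None:
            matched[u] = best
            matched[best] = u
        else:
            matched[u] = u  # no free neighbor; vertex maps to itself

    return matched

# Toy weighted graph: adjacency lists of (neighbor, edge_weight) pairs.
adj = {0: [(1, 5), (2, 1)], 1: [(0, 5), (3, 1)],
       2: [(0, 1), (3, 4)], 3: [(1, 1), (2, 4)]}
print(heavy_edge_matching(adj))  # 0 pairs with 1 (weight 5), 2 with 3 (weight 4)
```

Collapsing heavy edges first keeps most of the edge weight inside coarse vertices, so a good partition of the coarse graph projects to a good partition of the original.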

Over the years we have focused on and produced good solutions for a number of
graph-partitioning-related problems. These include partitioning algorithms for graphs
corresponding to finite element meshes, multilevel nested dissection, parallel
graph/mesh partitioning, dynamic/adaptive graph repartitioning, multi-constraint and
multi-objective partitioning, and circuit and hypergraph partitioning.

Our latest research is focusing on three key areas:

* Mesh/graph partitioning algorithms that take into account the fine-grain characteristics of
the underlying parallel computer and can deal with heterogeneous computing and
communication capabilities.

* Partitioning/load-balancing algorithms for meshless or mesh/particle scientific
simulations.

* Partitioning algorithms for scale-free graphs and/or graphs whose degree distribution
follows a power-law curve.

The research over the years has been funded by a number of Federal agencies including
DOE, ARO, ARL, NSF and companies including IBM, SGI, and Cray.


### Software

A direct outcome of our graph partitioning research is the development of the METIS
family of multilevel partitioning programs and libraries. This includes
computationally efficient and highly effective tools for partitioning very large
graphs on serial and parallel computers as well as tools for partitioning
hypergraphs, especially those corresponding to netlists of VLSI circuits.

In addition, some ideas inspired by the multilevel optimization algorithms developed
for graph partitioning found their way into two other research projects. The first is
the work on MGridGen, a tool that generates a sequence of coarse grids for geometric
multigrid-based preconditioners. The second is the work on CLUTO, our data clustering
tool, which contains effective graph-partitioning-based clustering algorithms. In
fact, the quality of the partitionings produced by CLUTO is in general better than
that of the partitionings produced by the serial graph partitioning algorithm in
METIS.

All of these tools are available for download from our Software page.
167 changes: 167 additions & 0 deletions glaros/software/overview.md
@@ -0,0 +1,167 @@
## Software Tools Developed by Lab Members

Over the years, the research in the lab has resulted in the development of a number
of software tools and libraries for key problems in the areas of parallel processing,
data mining, bioinformatics, and collaborative filtering.

It is our general policy to make these tools available to the research community for
use in their own research and/or non-commercial applications.

Here is a list of software tools that can be downloaded.


### METIS: A Family of Multilevel Partitioning Algorithms

This is a collection of serial and parallel programs & libraries that can be used to
partition unstructured graphs, finite element meshes, and hypergraphs, both on serial
and on parallel computers.


### CLUTO: Software for Clustering High-Dimensional Datasets

This is a collection of computationally efficient and high-quality data clustering
and cluster analysis programs & libraries that are well suited for high-dimensional
datasets.


### BDMPI: Big Data Message Passing Interface

BDMPI is a message passing library and associated runtime system for developing
out-of-core distributed computing applications for problems whose aggregate memory
requirements exceed the amount of memory that is available on the underlying
computing cluster.


### SLIM - Sparse Linear Methods for Top-N Recommender Systems

This is a library that implements a set of top-N recommendation methods that learn an
item-item similarity matrix using sparse linear models.


### NERSTRAND - Multi-threaded modularity-based graph clustering

This is a program that implements various serial and parallel modularity-based graph
clustering algorithms based on the multilevel paradigm. These algorithms can produce
high-quality clustering solutions and can scale to very large graphs.


### SPLATT - Parallel Sparse Tensor Decomposition

This is a program and library that implements efficient parallel algorithms for the
factorization of sparse tensors, such as the canonical polyadic decomposition (CPD),
and can scale to very large tensors.


### L2AP - Fast Cosine Similarity Search With Prefix L-2 Norm Bounds

This is a program that implements various fast algorithms for finding the set of all
pairs of similar vectors (e.g., documents) whose similarity is greater than a
user-specified threshold.
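The problem L2AP solves can be stated as a brute-force baseline: compare every pair of sparse vectors and keep those above the threshold. The sketch below does exactly that, without L2AP's norm-based pruning, on hypothetical document vectors:

```python
import math

def all_pairs(vectors, t):
    """Brute-force all-pairs similarity search: return every pair of
    sparse vectors with cosine similarity >= t."""
    def cos(a, b):
        dot = sum(a[i] * b[i] for i in a if i in b)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb)
    names = sorted(vectors)
    return [(u, v) for i, u in enumerate(names) for v in names[i + 1:]
            if cos(vectors[u], vectors[v]) >= t]

# Hypothetical sparse document vectors: term index -> weight.
docs = {"d1": {0: 1.0, 1: 1.0}, "d2": {0: 1.0, 1: 0.9}, "d3": {5: 1.0}}
print(all_pairs(docs, 0.9))  # -> [('d1', 'd2')]
```

The baseline is quadratic in the number of vectors; the point of prefix-filtering methods with L2-norm bounds is to skip most of these comparisons while returning the identical result set.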


### L2Knng - Fast K-Nearest Neighbor Graph Construction with L2-Norm Pruning

This is a program that provides high-performance implementations of several methods
for constructing the K-nearest neighbor graph of a set of vectors based on cosine
similarity.


### PAFI: Software for Finding Patterns in Diverse Datasets

This is a collection of computationally efficient programs for finding frequent
patterns in transactional, sequential, and graph datasets.


### AFGen: Fragment-based Descriptors for Chemical Compounds

AFGen is a program that takes as input a set of chemical compounds and generates
their vector-space representation based on the set of fragment-based descriptors they
contain. The descriptor space consists of graph fragments that can have three
different types of topologies: paths (PF), acyclic subgraphs (AF), and arbitrary
topology subgraphs (GF). This vector-based representation can be used for different
tasks in cheminformatics including similarity search, virtual screening, and library
design.

These descriptors are quite effective in capturing the structural characteristics of
chemical compounds. Experiments in the context of SVM-based classification and
ranked retrieval show that these descriptors consistently and statistically
significantly outperform previously developed schemes based on the widely used
fingerprint- and MACCS keys-based descriptors, as well as recently introduced
descriptors obtained by mining and analyzing the structure of the molecular graphs.

**Getting the files**

* [afgen-2.0.0.tar.gz Linux (i686/x86_64)](files/afgen/afgen-2.0.0.tar.gz)

**Installing**

On Unix systems, after downloading AFGen you need to uncompress and untar it. This is
achieved by executing the following commands:

gunzip afgen-2.0.0.tar.gz
tar -xvf afgen-2.0.0.tar

At this point you should have a directory named afgen-2.0.0. This directory contains
AFGen's stand-alone programs, its documentation, and a sample dataset.

**Documentation**

Instructions describing how to use AFGen can be found at afgen-2.0.0/doc/index.html.



### SUGGEST: A top-N Recommender Engine

SUGGEST is a top-N recommendation engine that implements a variety of recommendation
algorithms. Top-N recommender systems, a personalized information filtering
technology, are used to identify a set of N items that will be of interest to a
certain user. In recent years, top-N recommender systems have been used in a number
of different applications, such as recommending products a customer will most likely
buy; recommending movies, TV programs, or music a user will find enjoyable;
identifying web pages that will be of interest; or even suggesting alternate ways of
searching for information.

The algorithms implemented by SUGGEST are based on collaborative filtering, which is
the most successful and most widely used framework for building recommender systems.
SUGGEST implements two classes of collaborative filtering-based top-N recommendation
algorithms: user-based and item-based.

SUGGEST is currently distributed in binary format and consists of a stand-alone
executable program and a library that can be used to call SUGGEST's routines
directly from another application.
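The item-based class of algorithms can be sketched in miniature: score each unseen item by its similarity, over user co-occurrence, to the items a user has already consumed. A hypothetical toy example (not SUGGEST's actual interface or algorithms):

```python
import math

def item_based_topn(ratings, user, n=2):
    """Item-based top-N sketch: rank unseen items by their cosine
    similarity (over co-consuming users) to the user's own items."""
    items = {}
    for u, its in ratings.items():
        for i in its:
            items.setdefault(i, set()).add(u)

    def sim(a, b):
        ua, ub = items[a], items[b]
        return len(ua & ub) / math.sqrt(len(ua) * len(ub))

    seen = ratings[user]
    scores = {i: sum(sim(i, j) for j in seen)
              for i in items if i not in seen}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]

# Hypothetical implicit-feedback data: user -> set of items consumed.
ratings = {"u1": {"A", "B", "C"}, "u2": {"A", "B", "D"},
           "u3": {"B", "C", "D"}, "u4": {"A", "C"}}
print(item_based_topn(ratings, "u4"))  # -> ['B', 'D']
```

User-based algorithms invert the same idea, finding users similar to the target user and recommending what they consumed.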




### MGridGen: Multilevel Serial & Parallel Coarse Grid Construction Library

MGridGen is a library written entirely in ANSI C that implements serial algorithms
for obtaining a sequence of successive coarse grids that are well suited for
geometric multigrid methods. The quality of the elements of the coarse grids is
optimized using a multilevel framework. It is portable to most Unix systems that have
an ANSI C compiler.

An MPI-based parallel version of MGridGen, called ParMGridGen, has also been
developed that extends the functionality provided by MGridGen and is especially
suited for large scale numerical simulations. It is written entirely in ANSI C and
MPI and is portable on most parallel computers that support MPI.

[Source code](https://github.com/mrklein/ParMGridGen)



### PSPASES: A Parallel Sparse Direct Solver

PSPASES (Parallel SPArse Symmetric dirEct Solver) is a high-performance, scalable,
parallel, MPI-based library intended for solving linear systems of equations
involving sparse symmetric positive definite matrices. The library provides various
interfaces for solving the system using the four phases of the direct method of
solution: computing a fill-reducing ordering, performing symbolic factorization,
computing the numerical factorization, and solving the triangular systems of
equations. The library efficiently implements the scalable parallel algorithms
developed by lab members and our collaborators for each of these phases.
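Minus the parallelism, sparsity, ordering, and symbolic machinery, the last two phases can be illustrated on a tiny dense SPD system. This is a pure-Python sketch of the underlying mathematics, not PSPASES's interface:

```python
import math

def cholesky(A):
    """Numerical-factorization phase: A = L * L^T for an SPD matrix A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(A[i][i] - s) if i == j
                       else (A[i][j] - s) / L[j][j])
    return L

def solve(L, b):
    """Triangular-solve phase: forward solve L y = b, back solve L^T x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

A = [[4.0, 2.0], [2.0, 3.0]]   # symmetric positive definite
b = [10.0, 8.0]
x = solve(cholesky(A), b)
print(x)  # x satisfies A x = b
```

The fill-reducing ordering and symbolic factorization phases exist precisely because, on sparse matrices, a naive factorization like this one would create and store far more nonzeros than necessary.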




