adding glaros

karypis · Jul 7, 2024 · ed3edfe · ed3edfe
1 parent f53e08d
commit ed3edfe

Showing 6 changed files with 477 additions and 0 deletions.
diff --git a/glaros/projects/bio.md b/glaros/projects/bio.md
@@ -0,0 +1,45 @@
+## Bioinformatics
+
+Bioinformatics has emerged as an exciting new research area giving rise to numerous
+challenging computational problems whose successful solution will ultimately impact
+every aspect of our everyday life. This is currently one of the lab's main research
+thrust areas and is primarily designed to develop and apply data-mining and
+knowledge-based techniques to solve various problems arising in this field.
+
+Our ongoing research has led to the development of clustering and classification
+algorithms suitable for analyzing gene expression data, DNA- and
+protein-sequence-based classification algorithms, highly accurate remote homology
+recognition and fold prediction algorithms, scalable clustering algorithms for
+protein sequences, and algorithms that predict various aspects of a protein's
+secondary and tertiary structure based on its primary sequence.
+
+Many problems arising in bioinformatics can be formulated as classification or
+prediction problem instances whose goal is to gain some higher-level knowledge from
+primary information. Examples of such problems are gene prediction, promoter
+identification, protein family assignment, gene functional assignment, secondary
+structure prediction, fold-recognition, tertiary structure prediction, etc.
+Developing effective algorithms for these problems usually involves two steps. The
+first step is that of identifying the signals present in the data that capture the
+key physical/chemical/biological properties of the various objects and classes,
+whereas the second step is that of developing supervised machine learning algorithms
+that can properly model and exploit them toward the goal of building accurate
+classifiers. Within this context, our research is focused on identifying the right
+set of signals for various problems, developing novel classification algorithms, and
+analyzing whether or not there are sufficiently strong signals present in the
+datasets to allow for the effective use of computational techniques in the first
+place.
+
+Our current research in this area is focusing on protein function and structure
+prediction. Some of the specific research projects are:
+
+* Improve the performance of remote homology recognition and fold prediction algorithms
+  by designing novel and effective kernel methods that combine various observed and/or
+  predicted signals.
+
+* Enhance the effectiveness of local structure prediction algorithms by designing
+  structural alphabets that combine predictability with structure reproducibility.
+
+* Improve the performance of ab initio
+
+The research over the years has been funded by a number of Federal agencies including
+ARL, NSF, and NIH.
diff --git a/glaros/projects/cheminfo.md b/glaros/projects/cheminfo.md
@@ -0,0 +1,61 @@
+## Chemical Informatics
+
+Discovering new drugs is an expensive and challenging process. Any new drug should
+not only produce the desired response to the disease but should do so with minimal
+side effects and be superior to the existing drugs in the market. The goal of this
+project is to develop effective and efficient algorithms for analyzing chemical
+compound databases and identifying biologically active compounds.
+
+One of the key steps in the drug design process is the identification of the chemical
+compounds (hit compounds) that display the desired and reproducible behavior against
+the specific biomolecular target. This step represents a significant hurdle in the early
+stages of drug discovery. The 1990s saw the widespread adoption of high-throughput
+screening (HTS), which uses highly automated techniques to conduct the biological
+assays and can be used to screen a large number of compounds. Although the number of
+compounds that can be evaluated by these methods is very large, these numbers are
+small in comparison to the millions of drug-like compounds that exist or can be
+synthesized by combinatorial chemistry methods. Moreover, in most cases it is hard to
+find all desirable properties in a single compound and medicinal chemists are
+interested in not just identifying the hits but studying what part of the chemical
+compound leads to desirable behavior, so that new compounds can be rationally
+synthesized (lead development).
+
+Computational techniques that build models to correctly assign chemical compounds to
+various classes of interest can address these limitations, have extensive
+applications in pharmaceutical research, and are widely used to replace or
+supplement HTS-based approaches.
+
+Our most recent research is currently concentrated on the following areas:
+
+* Develop computationally efficient algorithms to mine large databases of molecular 
+  graphs and identify key substructures present in active (inactive) compounds.
+
+* Develop sophisticated feature selection and generation algorithms that combine
+  multiple criteria to identify and synthesize a set of substructure-based features
+  that simultaneously simplify the representation of the original compounds while
+  retaining and exposing their key features.
+
+* Develop kernel-based clustering and classification approaches that take into
+  account the relationships between these substructures at different levels of
+  granularity and complexity.
+
+The research over the years has been funded by a number of Federal agencies including
+NIH (primary agency), ARL, and NSF.
+
+### Software
+
+Our research thus far has resulted in the development of computationally efficient
+algorithms to find frequent substructures in molecular graphs (either topological or
+geometric). The topological version of this algorithm, called FSG, is currently
+available as part of our pattern discovery toolkit PAFI, which can be downloaded and
+used for educational and research purposes.
+
+Another recent development is the AFGen program that operates on a database of
+chemical compounds and generates their descriptor-based representation by considering
+all bounded length acyclic fragments that they contain. These descriptors are quite
+effective in capturing the structural characteristics of chemical compounds.
+Experiments in the context of SVM-based classification and ranked-retrieval show that
+these descriptors consistently and statistically outperform previously developed
+schemes based on the widely used fingerprint- and MACCS keys-based descriptors, as
+well as recently introduced descriptors obtained by mining and analyzing the
+structure of the molecular graphs.
diff --git a/glaros/projects/dm.md b/glaros/projects/dm.md
@@ -0,0 +1,48 @@
+## Data Mining
+
+The goal of this project is to develop effective and computationally efficient
+algorithms for analyzing large volumes of data. The ultimate purpose of these
+analyses is to discover key and actionable information and gain insights about the
+underlying processes/systems that created the data (or are being described by the
+data).
+
+This emerging discipline is becoming increasingly important as advances in data
+collection have led to the explosive growth in the amount of available data. Data
+mining algorithms are used extensively to analyze business, commerce, scientific,
+engineering, and security data and dramatically improve the effectiveness of
+applications in areas such as marketing, predictive modeling, life sciences,
+information retrieval, and engineering.
+
+Our research was initially focused on developing high-performance scalable parallel
+algorithms for solving core data mining problems but in recent years, it has expanded
+to include research on fundamental data mining algorithms in the areas of data
+clustering, classification, pattern discovery, sequence mining, graph mining, and their
+applications in information retrieval, collaborative filtering, and bioinformatics.
+
+Our latest research is focusing on the following areas:
+
+* Algorithms for finding meaningful clusters in large sparse graphs like those arising
+in relational/social networks and the web.
+
+* Large-margin and kernel-based classification algorithms with an emphasis on
+algorithms that can learn arbitrary output spaces.
+
+* Algorithms that can mine large and complex graphs.
+
+The research over the years has been funded by a number of Federal agencies including
+ARL, NSF, and NIH.
+
+
+### Software
+
+Many of the algorithms that we developed have been made available to the public in
+the form of stand-alone programs and libraries that are used extensively in many
+academic, government, and industrial sites. This includes the CLUTO clustering
+toolkit that implements different classes of feature- and similarity-based clustering
+algorithms, the SUGGEST and SLIM libraries of scalable collaborative-filtering based
+recommendation algorithms, and the PAFI pattern finding toolkit that contains various
+algorithms to find frequent patterns in transaction, sequence, and graph databases.
+
+All of these tools are available for download from our Software page.
+
+
diff --git a/glaros/projects/gp.md b/glaros/projects/gp.md
@@ -0,0 +1,63 @@
+## Graph Partitioning
+
+This project is the longest running research activity in the lab and dates back to
+the time of George's PhD work. The fundamental problem that it tries to solve is
+that of splitting a large irregular graph into k parts. This problem has
+applications in many different areas, including parallel/distributed computing (load
+balancing of computations), scientific computing (fill-reducing matrix re-orderings),
+EDA algorithms for VLSI CAD (placement), data mining (clustering), social network
+analysis (community discovery), pattern recognition, relationship network analysis,
+etc.
+
+The partitioning is usually done so that it satisfies certain constraints and
+optimizes certain objectives. The most common constraint is that of producing
+equal-size partitions, whereas the most common objective is that of minimizing the
+number of cut edges (i.e., the edges that straddle partition boundaries). However, in
+many cases, different application areas tend to require their own type of constraints
+and objectives, making the problem all the more interesting and challenging!
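The constraint and objective described above can be made concrete with a small sketch. It is illustrative only and is not METIS code; the function names `edge_cut` and `imbalance` and the toy graph are assumptions introduced here.

```python
from collections import Counter


def edge_cut(edges, part):
    """Count the edges whose endpoints are assigned to different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def imbalance(part, k):
    """Largest part size divided by the ideal size n/k (1.0 is perfectly balanced)."""
    sizes = Counter(part.values())
    ideal = len(part) / k
    return max(sizes.values()) / ideal


# Hypothetical graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

cut = edge_cut(edges, part)      # only edge (2, 3) straddles the boundary
balance = imbalance(part, k=2)   # both parts hold 3 of the 6 vertices
```

A partitioner searches for an assignment that keeps `imbalance` close to 1 while making `edge_cut` as small as possible.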
+
+The research in the lab is focusing on a class of algorithms that have come to be
+known as multilevel graph partitioning algorithms. These algorithms solve the problem
+by following an approximate-and-solve paradigm, which is very effective for this as
+well as other (combinatorial) optimization problems.
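In the multilevel setting, the approximation step typically shrinks the graph by merging pairs of adjacent vertices, the small graph is partitioned, and the result is projected back and refined. The sketch below shows a single coarsening level using a greedy heavy-edge matching, which is one common heuristic; the function name `coarsen` and the example graph are assumptions, and this is not the lab's implementation.

```python
def coarsen(weights):
    """One coarsening level via greedy heavy-edge matching.

    weights maps an undirected edge (u, v) to a positive weight.
    Returns (cmap, coarse), where cmap maps each fine vertex to its coarse
    vertex and coarse holds the summed weights between coarse vertices.
    """
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    cmap = {}
    next_id = 0
    for u in sorted(adj):
        if u in cmap:
            continue
        # Unmatched neighbors of u, as (weight, vertex) pairs.
        candidates = [(w, v) for v, w in adj[u].items() if v not in cmap]
        cmap[u] = next_id
        if candidates:
            _, v = max(candidates)  # heaviest incident edge wins
            cmap[v] = next_id
        next_id += 1

    coarse = {}
    for (u, v), w in weights.items():
        cu, cv = cmap[u], cmap[v]
        if cu != cv:  # edges inside a merged pair disappear
            key = (min(cu, cv), max(cu, cv))
            coarse[key] = coarse.get(key, 0) + w
    return cmap, coarse


# Hypothetical 4-cycle with one heavy edge (0, 2).
weights = {(0, 1): 1, (0, 2): 3, (1, 3): 2, (2, 3): 1}
cmap, coarse = coarsen(weights)
```

Applying `coarsen` repeatedly yields a sequence of ever smaller graphs; a partition of the smallest one induces a partition of the original through the `cmap` tables.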
+
+Over the years we have focused on and produced good solutions for a number of
+graph-partitioning related problems. These include partitioning algorithms for graphs
+corresponding to finite element meshes, multilevel nested dissection, parallel
+graph/mesh partitioning, dynamic/adaptive graph repartitioning, multi-constraint and
+multi-objective partitioning, and circuit and hypergraph partitioning.
+
+Our latest research is focusing on three key areas:
+
+* Mesh/graph partitioning algorithms that take into account the fine-grain characteristics of
+the underlying parallel computer and can deal with heterogeneous computing and
+communication capabilities.
+
+* Partitioning/load-balancing algorithms for mesh-less or mesh/particles scientific
+simulations.
+
+* Partitioning algorithms for scale-free graphs and/or graphs whose degree distribution
+follows a power-law curve.
+
+The research over the years has been funded by a number of Federal agencies including
+DOE, ARO, ARL, NSF and companies including IBM, SGI, and Cray.
+
+
+### Software
+
+A direct outcome of our graph partitioning research is the development of the METIS
+family of multilevel partitioning programs and libraries. This includes
+computationally efficient and highly effective tools for partitioning very large
+graphs on serial and parallel computers as well as tools for partitioning
+hypergraphs, especially those corresponding to netlists of VLSI circuits.
+
+In addition, some ideas inspired from the multilevel optimization algorithms
+developed for graph partitioning found their way into two other research projects. The
+first is the work on MGridGen, a tool to generate a sequence of coarse grids for
+geometric multigrid-based preconditioners. The second is the work on CLUTO, our data
+clustering tool, which contains effective graph-partitioning based clustering
+algorithms. In fact, the quality of the partitionings produced by CLUTO is in
+general better than those produced by the serial graph partitioning algorithm in
+METIS.
+
+All of these tools are available for download from our Software page.
diff --git a/glaros/software/overview.md b/glaros/software/overview.md
@@ -0,0 +1,167 @@
+## Software Tools Developed by Lab Members
+
+Over the years, the research in the lab has resulted in the development of a number
+of software tools and libraries for key problems in the areas of parallel processing,
+data mining, bioinformatics, and collaborative filtering.
+
+It is our general policy to make these tools available to the research community for
+use in their own research and/or non-commercial applications.
+
+Here is a list of software tools that can be downloaded.
+
+
+### METIS: A Family of Multilevel Partitioning Algorithms
+
+This is a collection of serial and parallel programs & libraries that can be used to
+partition unstructured graphs, finite element meshes, and hypergraphs, both on
+serial as well as on parallel computers.
+
+
+### CLUTO: Software for Clustering High-Dimensional DataSets
+
+This is a collection of computationally efficient and high-quality data clustering
+and cluster analysis programs & libraries that are well suited for high-dimensional
+data sets.
+
+
+### BDMPI: Big Data Message Passing Interface
+
+BDMPI is a message passing library and associated runtime system for developing
+out-of-core distributed computing applications for problems whose aggregate memory
+requirements exceed the amount of memory that is available on the underlying
+computing cluster.
+
+
+### SLIM - Sparse Linear Methods for Top-N Recommender Systems
+
+This is a library that implements a set of top-N recommendation methods that learn an
+item-item similarity matrix using sparse linear models.
+
+
+### NERSTRAND - Multi-threaded modularity-based graph clustering
+
+This is a program that implements various serial and parallel modularity-based graph
+clustering algorithms based on the multilevel paradigm. These algorithms can produce
+high-quality clustering solutions and can scale to very large graphs.
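Modularity compares the fraction of edges that fall inside clusters with the fraction expected if edges were placed at random while preserving vertex degrees. The sketch below computes that score for a given clustering of an unweighted, undirected graph; it only evaluates a clustering and does not reproduce NERSTRAND's multilevel search. The function name and the sample graph are assumptions.

```python
def modularity(edges, cluster):
    """Modularity Q = sum over clusters of (e_c / m) - (d_c / 2m)^2.

    e_c is the number of edges inside cluster c, d_c is the sum of the
    degrees of its vertices, and m is the total number of edges.
    """
    m = len(edges)
    intra = {}
    degree = {}
    for u, v in edges:
        cu, cv = cluster[u], cluster[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Hypothetical graph: two triangles joined by one edge, clustered by triangle.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cluster = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, cluster)

# Putting every vertex in one cluster gives a score of zero.
q_single = modularity(edges, {v: "all" for v in range(6)})
```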
+
+
+### SPLATT - Parallel Sparse Tensor Decomposition
+
+This is a program that implements various serial and parallel algorithms for
+computing decompositions of sparse tensors.
+
+
+### L2AP - Fast Cosine Similarity Search With Prefix L-2 Norm Bounds
+
+This is a program that implements various fast algorithms for finding the set of
+all pairs of similar vectors (e.g., documents) whose similarity is greater than a
+user-specified threshold.
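The problem being solved can be stated as a brute-force baseline: normalize every vector, compare all pairs, and keep those above the threshold. The sketch below does exactly that and nothing more; it contains none of the pruning bounds that make L2AP fast. The sparse-dictionary representation, the function names, and the sample documents are assumptions, and vectors are assumed to be non-zero.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a sparse vector (feature -> value) to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {k: x / norm for k, x in vec.items()}


def all_pairs(vectors, threshold):
    """Return (i, j, similarity) for every pair whose cosine exceeds threshold."""
    unit = [normalize(v) for v in vectors]
    result = []
    for i, j in combinations(range(len(unit)), 2):
        a, b = unit[i], unit[j]
        if len(a) > len(b):
            a, b = b, a  # iterate over the shorter vector
        sim = sum(x * b[k] for k, x in a.items() if k in b)
        if sim > threshold:
            result.append((i, j, sim))
    return result


# Hypothetical term-count vectors for three short documents.
docs = [
    {"graph": 1, "partition": 1},
    {"graph": 1, "partition": 1, "mesh": 1},
    {"tensor": 1},
]
pairs = all_pairs(docs, threshold=0.5)
```

The baseline performs a number of comparisons that grows quadratically with the number of vectors, which is the cost that bound-based pruning aims to avoid.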
+
+
+### L2Knng - Fast K-Nearest Neighbor Graph Construction with L2-Norm Pruning
+
+This is a program that provides high-performance implementations of several methods
+for constructing the K-nearest neighbor graph of a set of vectors based on cosine
+similarity.
+
+
+### PAFI: Software for Finding Patterns in Diverse Datasets
+
+This is a collection of computationally efficient programs for finding frequent
+patterns in transactional, sequential, and graph datasets.
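For the transactional case, frequent pattern discovery can be illustrated with a small level-wise search that grows itemsets one item at a time and keeps those that occur in enough transactions. This is a generic sketch under the assumption of set-valued transactions and an absolute support threshold; it is not PAFI's code, and the function name is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(kept)
        # Join frequent k-itemsets that differ in one item into (k+1)-candidates.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return frequent


# Hypothetical market-basket data.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
patterns = frequent_itemsets(baskets, min_support=2)
```

Every single item and every pair is frequent in this example, while the triple `{a, b, c}` appears in only one basket and is dropped.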
+
+
+### AFGen: Fragment-based Descriptors for Chemical Compounds
+
+AFGen is a program that takes as input a set of chemical compounds and generates
+their vector-space representation based on the set of fragment-based descriptors they
+contain. The descriptor space consists of graph fragments that can have three
+different types of topologies: paths (PF), acyclic subgraphs (AF), and arbitrary
+topology subgraphs (GF). This vector-based representation can be used for different
+tasks in cheminformatics including similarity search, virtual screening, and library
+design.
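To illustrate the simplest of the three topologies, the sketch below enumerates labeled path fragments of bounded length in a molecular graph and counts them, producing a sparse vector of fragment frequencies. It is an assumption-laden illustration, not AFGen's algorithm or output format: the input layout, the function name `path_fragments`, and the canonical form (the smaller of a label sequence and its reverse) are all choices made here.

```python
from collections import Counter


def path_fragments(atoms, bonds, max_bonds):
    """Count labeled simple paths containing 1..max_bonds bonds.

    atoms is a list of atom labels; bonds is a list of (i, j, bond_label).
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j, b in bonds:
        adj[i].append((j, b))
        adj[j].append((i, b))

    counts = Counter()

    def extend(path, labels):
        if len(path) > 1:
            # Canonical key: a path reads the same from either end.
            key = min(tuple(labels), tuple(reversed(labels)))
            counts[key] += 1
        if len(path) - 1 == max_bonds:
            return
        for nxt, b in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt], labels + [b, atoms[nxt]])

    for start in adj:
        extend([start], [atoms[start]])
    # Each path is discovered once from each end, so halve the counts.
    return {k: v // 2 for k, v in counts.items()}


# Heavy atoms of ethanol (C-C-O) with single bonds written as "-".
descriptors = path_fragments(["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")], max_bonds=2)
```

Mapping each distinct key to a column index turns these counts into the kind of vector-space representation described above.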
+
+These descriptors are quite effective in capturing the structural characteristics of
+chemical compounds. Experiments in the context of SVM-based classification and
+ranked-retrieval show that these descriptors consistently and statistically
+outperform previously developed schemes based on the widely used fingerprint- and
+MACCS keys-based descriptors, as well as recently introduced descriptors obtained by
+mining and analyzing the structure of the molecular graphs.
+
+**Getting the files**
+
+* [afgen-2.0.0.tar.gz Linux (i686/x86_64)](files/afgen/afgen-2.0.0.tar.gz)
+
+**Installing**
+
+On Unix systems, after downloading AFGen you need to uncompress and untar it. This is
+achieved by executing the following commands:
+
+    gunzip afgen-2.0.0.tar.gz
+    tar -xvf afgen-2.0.0.tar
+
+At this point you should have a directory named afgen-2.0.0. This directory contains
+AFGen's stand-alone programs, its documentation, and a sample dataset.
+
+**Documentation**
+
+Instructions describing how to use AFGen can be found at afgen-2.0.0/doc/index.html.
+
+
+
+### SUGGEST: A top-N Recommender Engine
+
+SUGGEST is a Top-N recommendation engine that implements a variety of recommendation
+algorithms. Top-N recommender systems, a personalized information filtering
+technology, are used to identify a set of N items that will be of interest to a
+certain user. In recent years, top-N recommender systems have been used in a number
+of different applications, such as to recommend products a customer will most likely buy;
+recommend movies, TV programs, or music a user will find enjoyable; identify
+web-pages that will be of interest; or even suggest alternate ways of searching for
+information.
+
+The algorithms implemented by SUGGEST are based on collaborative filtering, which is
+the most successful and widely used framework for building recommender systems.
+SUGGEST implements two classes of collaborative filtering-based top-N recommendation
+algorithms, called user-based and item-based.
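The item-based idea can be sketched in a few lines: items are similar when the same users chose them, and a candidate item is scored by its total similarity to the items the user already has. The code below is a minimal illustration that assumes binary purchase data and cosine similarity between items; it is not SUGGEST's API, and all names are hypothetical.

```python
import math


def item_based_top_n(baskets, user, n):
    """Recommend up to n unseen items to `user` using item-item cosine similarity."""
    # Invert user -> items into item -> set of users.
    users_of = {}
    for u, items in baskets.items():
        for i in items:
            users_of.setdefault(i, set()).add(u)

    def cosine(i, j):
        common = len(users_of[i] & users_of[j])
        return common / math.sqrt(len(users_of[i]) * len(users_of[j]))

    seen = baskets[user]
    scores = {}
    for cand in users_of:
        if cand in seen:
            continue
        score = sum(cosine(cand, i) for i in seen)
        if score > 0:
            scores[cand] = score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


# Hypothetical purchase histories.
baskets = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "d"},
    "u4": {"a"},
}
recommended = item_based_top_n(baskets, "u4", n=2)
```

A user-based variant would instead find the users most similar to `u4` and recommend the items that those neighbors chose most often.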
+
+SUGGEST is currently distributed in a binary format and consists of a stand-alone
+executable program and a library, which can be used to call SUGGEST's routines
+directly from another application.
+
+
+
+
+### MGridGen: Multilevel Serial & Parallel Coarse Grid Construction Library
+
+MGridGen is a serial library written entirely in ANSI C that implements (serial)
+algorithms for obtaining a sequence of successive coarse grids that are well-suited
+for geometric multigrid methods. The quality of the elements of the coarse grids is
+optimized using a multilevel framework. It is portable on most Unix systems that have
+an ANSI C compiler.
+
+An MPI-based parallel version of MGridGen, called ParMGridGen, has also been
+developed that extends the functionality provided by MGridGen and is especially
+suited for large scale numerical simulations. It is written entirely in ANSI C and
+MPI and is portable on most parallel computers that support MPI.
+
+[Source code](https://github.com/mrklein/ParMGridGen)
+
+
+
+### PSPASES: A Parallel Sparse Direct Solver
+
+PSPASES (Parallel SPArse Symmetric dirEct Solver) is a high performance, scalable,
+parallel, MPI-based library, intended for solving linear systems of equations
+involving sparse symmetric positive definite matrices. The library provides various
+interfaces to solve the system using the four phases of the direct method of solution:
+compute fill-reducing ordering, perform symbolic factorization, compute numerical
+factorization, and solve triangular systems of equations. The library efficiently
+implements the scalable parallel algorithms developed by lab members and our
+collaborators to compute each of the phases.
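The last two phases can be shown on a tiny dense example: a Cholesky factorization A = L·Lᵀ followed by a forward and a backward triangular solve. This serial sketch assumes a small dense symmetric positive definite matrix, so the ordering and symbolic phases, which exist to control fill in sparse factors, do not appear. The function names are hypothetical and unrelated to the PSPASES interface.

```python
import math


def cholesky(a):
    """Numerical factorization: return lower-triangular L with A = L * L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def solve(l, b):
    """Triangular solves: L y = b (forward), then L^T x = y (backward)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


# Small symmetric positive definite system whose solution is x = [1, 1].
a = [[4.0, 2.0], [2.0, 3.0]]
b = [6.0, 5.0]
factor = cholesky(a)
x = solve(factor, b)
```

Once `factor` is computed it can be reused for any number of right-hand sides, which is why factorization and solve are exposed as separate phases.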
+
+
+
+
+