Openings:We are looking for graduate students, post-docs, and data scientists, including those interested in doing bench experiments. If interested, please email me your CV and a brief statement of how our research interests overlap. Thanks.
We work at the interface of theoretical computer science, machine learning, and systems biology. We are interested in:
- Models of the evolution and development of complex biological systems, including molecular interactions in the cell and neural interactions in the brain;
- Robustness and flexibility of networks and signaling pathways in the face of environmental noise, failures, and perturbations;
- "Algorithms in nature", i.e. how collections of molecules, cells, and organisms process information and solve interesting distributed computing problems.
S. Dasgupta, C. F. Stevens, and S. Navlakha (2017). A neural algorithm for a fundamental computing problem. Science (in press).
[abstract]Abstract: Similarity search, such as identifying similar images in a database or similar documents on the Web, is a fundamental computing problem faced by large-scale information retrieval systems. We discovered that the fruit fly olfactory circuit solves this problem with a variant of a computer science algorithm (called locality-sensitive hashing). The fly circuit assigns similar neural activity patterns to similar odors, so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly algorithm, however, uses three computational ingredients that depart from traditional approaches. These ingredients can be translated to improve the performance of computer similarity searches. This perspective helps illuminate the logic supporting an important sensory function, and provides a conceptually new algorithm for solving a fundamental computational problem.
S. Navlakha, Z. Bar-Joseph, and A. L. Barth (2017). Network design and the brain. Trends Cogn. Sci (in press).
[abstract]Abstract: Neural circuits have evolved to accommodate similar information processing challenges as those faced by engineered systems. Here, we compare neural versus engineering strategies for constructing networks. During circuit development, synapses are massively over-produced and then pruned back over time, whereas in engineered networks, connections are initially sparse and are then added over time. We provide a computational perspective on these two different approaches, including discussion of how and why they are used, insights that one can provide the other, and areas for future joint investigation. By thinking algorithmically about the goals, constraints, and optimization principles employed by neural circuits, we can develop brain-derived strategies for enhancing network design, while also stimulating experimental hypotheses about circuit development and function.
A. Conn, U. Pedmale, J. Chory, and S. Navlakha (2017). High-resolution laser scanning reveals plant architectures that reflect universal network design principles. Cell Syst., 5, 1:53-62.e3.
[abstract]Abstract: Transport networks serve critical functions in biological and engineered systems, and yet their design requires trade-offs between competing objectives. Due to their sessile lifestyle, plants need to optimize their architecture to efficiently acquire and distribute resources, while also minimizing costs in building infrastructure. To understand how plants resolve this design tradeoff, we used high-precision three-dimensional laser scanning to map the architectures of tomato, tobacco, or sorghum plants grown in several environmental conditions and through multiple developmental time-points, scanning in total 505 architectures from 37 plants. Using a graph-theoretic algorithm we developed to evaluate design strategies, we find that plant architectures lie along the Pareto front between two simple length-based objectives -- minimizing total branch length and minimizing nutrient transport distance -- thereby conferring a selective fitness advantage for plant transport processes. The location along the Pareto front can distinguish amongst species and conditions, suggesting that during evolution, natural selection may employ common network design principles despite different optimization trade-offs.
A. Conn, U. Pedmale, J. Chory, C. F. Stevens, and S. Navlakha (2017). A statistical description of plant shoot architecture. Current Biol., 27, 14:2078-2088.e3.
[abstract]Abstract: Plant architectures can be characterized statistically by their spatial density function, which specifies the probability of finding a branch at each location in the territory occupied by a plant. Using high-precision 3D scanning, we analyzed 557 plant shoot architectures, representing 3 species, grown across 3-5 environmental conditions, and through 20-30 developmental time-points. We found two elegant properties in the spatial density functions of these architectures: all functions could be nearly modified in one direction without affecting the density in orthogonal directions (called separability), and all functions shared the same underlying shape, aside from stretching and compression (called self-similarity). Surprisingly, despite their striking visual diversity, we discovered that all architectures could be described as variations on a single underlying function: a Gaussian density function truncated at roughly two standard deviations. We also observed systematic variation in the spatial density functions across species, growth conditions, and time, which suggests functional specialization despite following the same general design form.
J. Y. Suen and S. Navlakha (2017). Using inspiration from synaptic plasticity rules to optimize traffic flow in distributed engineered networks. Neural Comput., 29, 5:1204-1228.
[abstract]Abstract: Controlling the flow and routing of data is a fundamental problem in many distributed networks, including transportation systems, integrated circuits, and the Internet. In the brain, synaptic plasticity rules have been discovered that regulate network activity in response to environmental inputs, which enable circuits to be stable yet flexible. Here, we develop a new neuro-inspired model for network flow control that only depends on modifying edge weights in an activity-dependent manner. We show how two fundamental plasticity rules (long-term potentiation and long-term depression) can be cast as a distributed gradient descent algorithm for regulating traffic flow in engineered networks. We then characterize, both via simulation and analytically, how different forms of edge-weight update rules affect network routing efficiency and robustness. We find a close correspondence between certain classes of synaptic weight update rules derived experimentally in the brain and rules commonly used in engineering, suggesting common principles to both.
S. Navlakha (2017). Learning the structural vocabulary of a network. Neural Comput., 29, 2:287-312.
[abstract]Abstract: Networks have become instrumental in deciphering how information is processed and transferred within systems in almost every scientific field today. Nearly all network analyses, however, have relied on humans to devise structural features of networks believed to be most discriminative for an application. We present a framework for comparing and classifying networks without human-crafted features using deep learning. After training, autoencoders contain hidden units that encode a robust structural vocabulary for succinctly describing graphs. We use this feature vocabulary to tackle several network mining problems and find improved predictive performance versus many popular features used today. These problems include uncovering growth mechanisms driving the evolution of networks, predicting protein network fragility, and identifying environmental niches for metabolic networks. Deep learning offers a principled approach for mining complex networks and tackling graph-theoretic problems.
S. Singh, S. Rashid, S. Navlakha, and Z. Bar-Joseph (2016). Distributed gradient descent in bacterial food search. Proc. 20th Intl. Conf. on Research in Computational Molecular Biology (RECOMB).
[abstract]Abstract: Communication and coordination play a major role in the ability of bacterial cells to adapt to ever changing environments and conditions. Recent work has shown that such coordination underlies several aspects of bacterial responses including their ability to develop antibiotic resistance. Here we develop a new distributed gradient descent method that helps explain how bacterial cells collectively search for food in harsh environments using extremely limited communication and computational complexity. This method can also be used for computational tasks when agents are facing similarly restricted conditions. We formalize the communication and computation assumptions required for successful coordination and prove that the method we propose leads to convergence even when using a dynamically changing interaction network. The proposed method improves upon prior models suggested for bacterial foraging despite making fewer assumptions. Simulation studies and analysis of experimental data illustrate the ability of the method to explain and further predict several aspects of bacterial swarm food search.
S. Chandrasekaran, S. Navlakha, N. J. Audette, D. D. McCreary, J. Suhan, Z. Bar-Joseph, and A. L. Barth (2015). Unbiased, high-throughput electron microscopy analysis of experience-dependent synaptic changes in the neocortex. J. Neurosci., 35(50): 16450-16462.
[abstract]Abstract: Neocortical circuits can be altered by sensory and motor experience, with experimental evidence supporting both anatomical and electrophysiological changes in synaptic properties. Previous studies have focused on changes in specific neurons or pathways—for example, the thalamocortical circuitry, layer 4–3 (L4–L3) synapses, or in the apical dendrites of L5 neurons— but a broad-scale analysis of experience-induced changes across the cortical column has been lacking. Without this comprehensive approach, a full understanding of how cortical circuits adapt during learning or altered sensory input will be impossible. Here we adapt an electron microscopy technique that selectively labels synapses, in combination with a machine-learning algorithm for semiautomated synapse detection, to perform an unbiased analysis of developmental and experience-dependent changes in synaptic properties across an entire cortical column in mice. Synapse density and length were compared across development and during whisker-evoked plasticity. Between postnatal days 14 and 18, synapse density significantly increases most in superficial layers, and synapse length increases in L3 and L5B. Removal of all but a single whisker row for 24 h led to an apparent increase in synapse density in L2 and a decrease in L6, and a significant increase in length in L3. Targeted electrophysiological analysis of changes in miniature EPSC and IPSC properties in L2 pyramidal neurons showed that mEPSC frequency nearly doubled in the whisker-spared column, a difference that was highly significant. Together, this analysis shows that data-intensive analysis of column-wide changes in synapse properties can generate specific and testable hypotheses about experience-dependent changes in cortical organization.
S. Navlakha, A. L. Barth, and Z. Bar-Joseph (2015). Decreasing-rate pruning optimizes the construction of efficient and robust networks. PLoS Comput. Biol., 11(7): e1004347.
[abstract]Abstract: Robust, efficient, and low-cost networks are advantageous in both biological and engineered systems. During neural network development in the brain, synapses are massively over-produced and then pruned-back over time. This strategy is not commonly used when designing engineered networks, since adding connections that will soon be removed is considered wasteful. Here, we show that for large distributed routing networks, network function is markedly enhanced by hyper-connectivity followed by aggressive pruning and that the global rate of pruning, a developmental parameter not previously studied by experimentalists, plays a critical role in optimizing network structure. We first used high-throughput image analysis techniques to quantify the rate of pruning in the mammalian neocortex across a broad developmental time window and found that the rate is decreasing over time. Based on these results, we analyzed a model of computational routing networks and show using both theoretical analysis and simulations that decreasing rates lead to more robust and efficient networks compared to other rates. We also present an application of this strategy to improve the distributed design of airline networks. Thus, inspiration from neural network formation suggests effective ways to design distributed networks across several domains.
S. Navlakha and Z. Bar-Joseph (2015). Distributed information processing in biological and computational systems. Commun. ACM, 58(1), 94–102.
[abstract]Abstract: Two converging technologies have led to increased interest in the relationship between information processing in biological and distributed computational systems. On the one hand, we can now experimentally study biological networks at a much finer detail than ever before. On the other, mobile and sensor networks have become pervasive and require new algorithms to improve their robustness and handle their dynamic nature. In this review, we focus on the similarities and differences between information processing problems in both domains. We discuss the constraints that affect the communication models under which these systems operate, the trade-off between run-time and robustness, and the various algorithmic principles utilized by both domains. Our hope is that this review will allow researchers to identify the most promising biological and computational problems that can be studied concurrently to improve our understanding of both fields.
S. Navlakha, C. Faloutsos, Z. Bar-Joseph (2015). MassExodus: Modeling evolving networks in harsh environments. Proc. of the European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Databases (ECML/PKDD). Journal version in Data Mining and Knowledge Discovery (DAMI), 29(5), 1-22.
[abstract]Abstract: Consider networks in harsh environments, where nodes may be lost due to failure, attack, or infection -- how is the topology affected by such events? Can we mimic and measure the effect? We propose a new generative model of network evolution in dynamic and harsh environments. Our model can reproduce the range of topologies observed across known robust and fragile biological networks, as well as several additional transport, communication, and social networks. We also develop a new optimization measure to evaluate robustness based on preserving high connectivity following random or adversarial bursty node loss. Using this measure, we evaluate the robustness of several real-world networks and propose a new distributed algorithm to construct secure networks operating within malicious environments.
S. Navlakha, X. He, C. Faloutsos, Z. Bar-Joseph (2014). Topological properties of robust biological and computational networks. J. R. Soc. Interface, 11(96):20140283.
[abstract]Abstract: Network robustness is an important principle in biology and engineering. Previous studies of global networks have identified both redundancy and sparseness as topological properties used by robust networks. By focusing on molecular subnetworks, or modules, we show that module topology is tightly linked to the level of environmental variability (noise) the module expects to encounter. Modules internal to the cell that are less exposed to environmental noise are more connected and less robust than external modules. A similar design principle is used by several other biological networks. We propose a simple change to the evolutionary gene duplication model which gives rise to the rich range of module topologies observed within real networks. We apply these observations to evaluate and design communication networks that are specifically optimized for noisy or malicious environments. Combined, joint analysis of biological and computational networks leads to novel algorithms and insights benefiting both fields.
S. Navlakha, P. Ahammad, E. W. Myers (2013). Unsupervised segmentation of noisy electron microscopy images using salient watersheds and region merging. BMC Bioinformatics, 14:294.
[abstract]Abstract: Segmenting electron microscopy (EM) images of cellular and subcellular processes in the nervous system is a key step in many bioimaging pipelines involving classification and labeling of ultrastructures. However, fully automated techniques to segment images are often susceptible to noise and heterogeneity in EM images (e.g. different histological preparations, different organisms, different brain regions, etc.). Supervised techniques to address this problem are often helpful but require large sets of training data, which are often difficult to obtain in practice, especially across many conditions. We propose a new, principled unsupervised algorithm to segment EM images using a two-step approach: edge detection via salient watersheds following by robust region merging. We performed experiments to gather EM neuroimages of two organisms (mouse and fruit fly) using different histological preparations and generated manually curated ground-truth segmentations. We compared our algorithm against several state-of-the-art unsupervised segmentation algorithms and found superior performance using two standard measures of under-and over-segmentation error. Our algorithm is general and may be applicable to other large-scale segmentation problems for bioimages.
S. Navlakha, J. Suhan, A. L. Barth, and Z. Bar-Joseph (2013). A high-throughput framework to detect synapses in electron microscopy images. Proc. 21st Intl. Conf. on Intelligent Systems for Molecular Biology/12th European Conference on Computational Biology (ISMB/ECCB), 2013. Journal version in Bioinformatics, 29(13):i9-17.
[abstract]Abstract: Synaptic connections underlie learning and memory in the brain and are dynamically formed and eliminated during development and in response to stimuli. Quantifying changes in overall number and strength of synapses is an important pre-requisite for studying connectivity and plasticity in these cases or in diseased conditions. Unfortunately, most techniques to detect such changes are either low-throughput (e.g. electrophysiology), prone to error and difficult to automate (e.g. standard electron microscopy), or too coarse (e.g. MRI) to provide accurate and large-scale measurements. To facilitate high-throughput analyses, we used a 50-year-old experimental technique to selectively stain for synapses in electron microscopy (EM) images and developed a machine learning framework to automatically detect synapses in these images. To validate our method we experimentally imaged brain tissue of the somatosensory cortex in four mice. We detected thousands of synapses in these images and demonstrate the accuracy of our approach using cross-validation with manually labeled data and by comparing against existing algorithms and against tools that process standard EM images. We also used a semi-supervised algorithm that leverages unlabeled data to overcome sample heterogeneity and improve performance. Our algorithms are highly efficient and scalable and are freely available for others to use.
S. Navlakha, A. Gitter, and Z. Bar-Joseph (2012). A network-based approach for predicting missing pathway interactions. PLoS Comput. Biol., 8(8):e1002640.
[abstract]Abstract: Embedded within large-scale protein interaction networks are signaling pathways that encode response cascades in the cell. Unfortunately, even for well-studied species like S. cerevisiae, only a fraction of all true protein interactions are known, which makes it difficult to reason about the exact flow of signals and the corresponding causal relations in the network. To help address this problem, we introduce a framework for predicting new interactions that aid connectivity between upstream proteins (sources) and downstream transcription factors (targets) of a particular pathway. Our algorithms attempt to globally minimize the distance between sources and targets by finding a small set of shortcut edges to add to the network. Unlike existing algorithms for predicting general protein interactions, by focusing on proteins involved in specific responses our approach homes-in on pathway-consistent interactions by reducing source-target distances. We applied our method to extend pathways in osmotic stress response in yeast and identified several missing interactions, some of which are supported by published reports. We also performed experiments that support a novel interaction not previously reported. Our framework is general and may be applicable to edge prediction problems in other domains.
S. Navlakha and Z. Bar-Joseph (2011). Algorithms in nature: the convergence of systems biology and computational thinking. Nature/EMBO Mol. Syst. Biol., 7:546.
[abstract]Abstract: Computer science and biology have enjoyed a long and fruitful relationship for decades. Biologists rely on computational methods to analyze and integrate large data sets, while several computational methods were inspired by the high-level design principles of biological systems. Recently, these two directions have been converging. In this review, we argue that thinking computationally about biological processes may lead to more accurate models, which in turn can be used to improve the design of algorithms. We discuss the similar mechanisms and requirements shared by computational and biological processes and then present several recent studies that apply this joint analysis strategy to problems related to coordination, network analysis, and tracking and vision. We also discuss additional biological processes that can be studied in a similar manner and link them to potential computational problems. With the rapid accumulation of data detailing the inner workings of biological systems, we expect this direction of coupling biological and computational studies to greatly expand in the future.
A. Thor, P. Anderson, L. Raschid, S. Navlakha, B. Saha, S. Khuller, and X-N. Zhang (2011). Link prediction for annotation graphs using graph summarization. Proc. 10th Intl. Semantic Web Conference (ISWC), 7031:714-729.
[abstract]Abstract: Annotation graph datasets are a natural representation of scientific knowledge. They are common in the life sciences where genes or proteins are annotated with controlled vocabulary terms (CV terms) from ontologies. The W3C Linking Open Data (LOD) initiative and semantic Web technologies are playing a leading role in making such datasets widely available. Scientists can mine these datasets to discover patterns of annotation. While ontology alignment and integration across datasets has been explored in the context of the semantic Web, there is no current approach to mine such patterns in annotation graph datasets. In this paper, we propose a novel approach for link prediction; it is a preliminary task when discovering more complex patterns. Our prediction is based on a complementary methodology of graph summarization (GS) and dense subgraphs (DSG). GS can exploit and summarize knowledge captured within the ontologies and in the annotation patterns. DSG uses the ontology structure, in particular the distance between CV terms, to filter the graph, and to find promising subgraphs. We develop a scoring function based on multiple heuristics to rank the predictions. We perform an extensive evaluation on Arabidopsis thaliana genes.
R. Patro, E. Sefer, J. Malin, G. Marçais, S. Navlakha, and C. Kingsford (2011). Parsimonious reconstruction of network evolution. Workshop on Algorithms in Bioinformatics (WABI).. Journal version in Algorithms Mol. Biol., 7(1):25, 2012.
[abstract]Abstract: We consider the problem of reconstructing a maximally parsimonious history of network evolution under models that support gene duplication and loss and independent interaction gain and loss. We introduce a combinatorial framework for encoding network histories, and we give a fast procedure that, given a set of duplication histories, in practice ï¬nds network histories with close to the minimum number of interaction gain or loss events. In contrast to previous studies, our method does not require knowing the relative ordering of unrelated duplication events. Results on simulated histories suggest that common ancestral networks can be accurately reconstructed using this parsimony approach.
S. Navlakha and C. Kingsford (2011). Network archaeology: uncovering ancient networks from present-day interactions. PLoS Comput. Biol., 7(4):e1001119. Presented at the 6th RECOMB Systems Biology Satellite Conference, 2010.
[abstract]Abstract: What proteins interacted in a long-extinct ancestor of yeast? How have different members of a protein complex assembled together over time? Our ability to answer such questions has been limited by the unavailability of ancestral protein-protein interaction (PPI) networks. To overcome this limitation, we propose several novel algorithms to reconstruct the growth history of a present-day network. Our likelihood-based method finds a probable previous state of the graph by applying an assumed growth model backwards in time. This approach retains node identities so that the history of individual nodes can be tracked. Using this methodology, we estimate protein ages in the yeast PPI network that are in good agreement with sequence-based estimates of age and with structural features of protein complexes. Further, by comparing the quality of the inferred histories for several different growth models (duplication-mutation with complementarity, forest fire, and preferential attachment), we provide additional evidence that a duplication-based model captures many features of PPI network growth better than models designed to mimic social network growth. From the reconstructed history, we model the arrival time of extant and ancestral interactions and predict that complexes have significantly re-wired over time and that new edges tend to form within existing complexes. We also hypothesize a distribution of per-protein duplication rates, track the change of the network's clustering coefficient, and predict paralogous relationships between extant proteins that are likely to be complementary to the relationships inferred using sequence alone. Finally, we infer plausible parameters for the model, thereby predicting the relative probability of various evolutionary events. The success of these algorithms indicates that parts of the history of the yeast PPI are encoded in its present-day form.
G. Duggal, S. Navlakha, M. Girvan, and C. Kingsford (2010). Uncovering many views of biological networks using ensembles of graph partitions. Proc. 1st Intl. Workshop on Discovering, Summarizing, and Using Multiple Clusterings (KDD MultiClust), 2010:9.
[abstract]Abstract: Densely interacting regions of biological networks often correspond to functional modules such as protein complexes. Most algorithms proposed to uncover modules, however, produce one clustering that only reveals a single view of how the cell is organized. We describe two new methods to find ensembles of provably near-optimal modularity partitions that lie within a heuristically constrained search space. We also show how to count the number of solutions in this space that exist within a bounded modularity range. We apply our algorithms to a protein interaction network for S. cerevisiae and show how fine-grained differences between near-optimal partitions can be used to define robust communities. We also propose a technique to find structurally diverse nearoptimal solutions and show that these different partitions are enriched for different biological functions. Our results indicate that near-optimal solutions can represent alternative and complementary views of the network's structure.
J. R. White, S. Navlakha, N. Nagarajan, M. Ghodsi, C. Kingsford, and M. Pop (2010). Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinformatics, 11:152.
[abstract]Abstract: Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, there exists significant variability in analysis methods employed in these studies. Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters. Our results highlight the need for systematic and open evaluation of data analysis methodologies, especially as targeted 16S rRNA diversity studies are increasingly relying on high-throughput sequencing technologies. All data and results from our study are available through the JGI FAMeS website.
S. Navlakha and C. Kingsford (2010). The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057-63.
[abstract]Abstract: Understanding the association between genetic diseases and their causal genes is an important problem concerning human health. With the recent influx of high-throughput data describing interactions between gene products, scientists have been provided a new avenue through which these associations can be inferred. Despite the recent interest in this problem, however, there is little understanding of the relative benefits and drawbacks underlying the proposed techniques. We assessed the utility of physical protein interactions for determining geneâ€“disease associations by examining the performance of seven recently developed computational methods (plus several of their variants). We found that random-walk approaches individually outperform clustering and neighborhood approaches, although most methods make predictions not made by any other method. We show how combining these methods into a consensus method yields Pareto optimal performance. We also quantified how a diffuse topological distribution of disease-related proteins negatively affects prediction quality and are thus able to identify diseases especially amenable to network-based predictions and others for which additional information sources are absolutely required. The predictions made by each algorithm considered are available online.
S. Navlakha and C. Kingsford (2010). Exploring biological network dynamics with ensembles of graph partitions. Proc. 15th Intl. Pacific Symposium on Biocomputing (PSB), 166-177.
[abstract]Abstract: Unveiling the modular structure of biological networks can reveal important organizational patterns in the cell. Many graph partitioning algorithms have been proposed towards this end. However, most approaches only consider a single, optimal decomposition of the network. In this work, we make use of the multitude of near-optimal clusterings in order to explore the dynamics of network clusterings and how those dynamics relate to the structure of the underlying network. We recast the modularity optimization problem as an integer linear program with diversity constraints. These constraints produce an ensemble of dissimilar but still highly modular clusterings. We apply our approach to four social and biological networks and show how optimal and near-optimal solutions can be used in conjunction to identify deeper community structure in the network, including inter-community dynamics, communities that are especially resilient to change, and core-and-peripheral community members.
S. Navlakha, J. White, N. Nagarajan, M. Pop, and C. Kingsford (2009). Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information. Proc. 14th Intl. Conf. on Research in Computational Molecular Biology (RECOMB), 5541:400–417. Journal version in J. Comput. Biol., 17(3):503–516, 2010.
[abstract]Abstract: Hierarchical clustering is a popular method for grouping to- gether similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decom- posing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation of information metric. Our approach is general and does not require the user to enter the number of clusters desired. We apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For these two applications, we test the quality of our clusters by using them to predict complex and species membership, respectively. We find that our approach generally outperforms the commonly used heuristic methods.
S. Navlakha, M.C. Schatz, and C. Kingsford (2009). Revealing biological modules via graph summarization. J. Comput. Biol., 16(2):253-64. Presented at the 5th RECOMB Systems Biology Satellite Conference, 2008.
[abstract]Abstract: The division of a protein interaction network into biologically meaningful modules can aid with automated detection of protein complexes and prediction of biological processes and can uncover the global organization of the cell. We propose the use of a graph summarization (GS) technique, based on graph compression, to cluster protein interaction graphs into biologically relevant modules. The method is motivated by defining a biological module as a set of proteins that have similar sets of interaction partners. We show this definition, put into practice by a GS algorithm, reveals modules that are more biologically enriched than those found by other methods. We also apply GS to predict complex memberships, biological processes, and co-complexed pairs and show that in most settings GS is preferable over existing methods of protein interaction graph clustering.
S. Navlakha, R. Rastogi, and N. Shrivastava (2008). Graph
summarization with bounded error. Proc. 34th Intl. Conf. on Management of Data (SIGMOD), 419-432.
[abstract]Abstract: We propose a highly compact two-part representation of a given graph G consisting of a graph summary and a set of corrections. The graph summary is an aggregate graph in which each node corresponds to a set of nodes in G, and each edge represents the edges between all pair of nodes in the two sets. On the other hand, the corrections portion specifies the list of edge-corrections that should be applied to the summary to recreate G. Our representations allow for both lossless and lossy graph compression with bounds on the introduced error. Further, in combination with the MDL principle, they yield highly intuitive coarse-level summaries of the input graph G. We develop algorithms to construct highly compressed graph representations with small sizes and guaranteed accuracy, and validate our approach through an extensive set of experiments with multiple real-life graph data sets. To the best of our knowledge, this is the first work to compute graph summaries using the MDL principle, and use the summaries (along with corrections) to compress graphs with bounded error.
H. Haidarian-Shahri, G. M. Namata, S. Navlakha, A. Deshpande, and N. Roussopoulos (2007). A graph-based approach to vehicle tracking in traffic camera video streams. Proc. 5th Intl. Workshop on Data Management for Sensor Networks (VLDB DMSN).
[abstract]Abstract: Vehicle tracking has a wide variety of applications from law enforcement to traffic planning and public safety. However, the image resolution of the videos available from most traffic camera systems, make it difficult to track vehicles based on unique identifiers like license plates. In many cases, vehicles with similar attributes are indistinguishable from one another due to image quality issues. Often, network bandwidth and power constraints limit the frame rate, as well. In this paper, we discuss the challenges of performing vehicle tracking queries over video streams from ubiquitous traffic cameras. We identify the limitations of tracking vehicles individually in such conditions and provide a novel graph-based approach using the identity of neighboring vehicles to improve the performance. We evaluate our approach using streaming video feeds from live traffic cameras available on the Internet. The results show that vehicle tracking is feasible, even for low quality and low frame rate traffic cameras. Additionally, exploitation of the attributes of neighboring vehicles significantly improves the performance.