## Book Chapters

S. Martin, C. Westergaard, and J. White (2016), "

**Visualizing Wind Farm Wake Losses Using SCADA Data**,"

*Whither Turbulence and Big Data in the 21st Century?* A. Pollard, L. Castillo, L. Danaila, and M. Glauser, Eds., Springer-Verlag (publisher, software)

As wind farms scale to include more and more turbines, questions about turbine wake interactions become increasingly important. Turbine wakes reduce wind speed, and downwind turbines suffer decreased performance. The cumulative effect of the wakes throughout a wind farm will therefore decrease the performance for the entire farm. These interactions are dynamic and complicated, and it is difficult to quantify the overall effect of the wakes. In this paper, Supervisory Control and Data Acquisition (SCADA) data from an existing wind farm is analyzed in order to explore methods for documenting wake interactions.

P. J. Crossno, T. M. Shead, M. A. Sielicki, W. L. Hunt, S. Martin, and M.-Y. Hseih (2015), "

**Slycat Ensemble Analysis of Electrical Circuit Simulations**,"

*Topological and Statistical Methods for Complex Data*, J. Bennet, F. Vivodtzev, and V. Pascucci, Eds., Springer-Verlag (publisher)

Slycat is a framework for developing web-based applications for visualization of data mining results. In this chapter, we describe the use of Slycat for ensemble analysis of electrical circuit simulations. We use Canonical Correlation Analysis (CCA) to model relationships between input and output variables, providing a visualization of an entire ensemble at once. The tight integration of analysis and visualization allows users to iteratively explore their data, forming and testing hypotheses about how simulation input parameters are driving output results in their ensembles.

H. K. Ho, L. Zhang, K. Ramamohanarao, and S. Martin (2013), "

**A Survey of Machine Learning Methods for Secondary and Supersecondary Protein Structure Prediction**,"

*Protein Supersecondary Structure*, A. Kister, Ed., Humana Press (publisher)

In this chapter we provide a survey of protein secondary and supersecondary structure prediction using methods from machine learning. Our focus is on machine learning methods applicable to β-hairpin and β-sheet prediction, but we also discuss methods for more general supersecondary structure prediction. We provide background on the secondary and supersecondary structures that we discuss, the features used to describe them, and the basic theory behind the machine learning methods used. We survey the machine learning methods available for secondary and supersecondary structure prediction and compare them where possible.

M. Misra, S. Martin, and J.-L. Faulon (2011), "

**Graphs: Flexible Representations of Molecular Structures and Biological Networks**,"

*Computational Approaches in Cheminformatics and Bioinformatics*, R. Guha and A. Bender, Eds., Wiley & Sons. (publisher)

The past two decades have seen a large accumulation of biological sequences and chemical compounds in many publicly available databases. For a long time, the two communities of bioinformatics and cheminformatics have developed in parallel, working largely with sequence data alone or mainly in the chemical space, respectively. As the need to study biological networks has increased, however, a concurrent need to develop tools and algorithms capable of handling the combined sequence and chemical space has arisen. We present here a graph-based technique, named molecular signature, which is sufficiently adaptable to permit combined description, for high-throughput analyses, of both sequences and chemicals.

S. Martin (2010), "

**Machine Learning based Bioinformatics Algorithms: Application to Chemicals**,"

*Handbook of Cheminformatics Algorithms*, J.-L. Faulon and A. Bender, Eds., CRC Press. (publisher)

In this chapter we present a targeted overview of clustering, classification, and regression algorithms. Our overview targets algorithms that have been used in either bioinformatics or chemoinformatics applications. In particular, we compare and contrast the efforts in both fields.

S. Martin, W. M. Brown, and J.-L. Faulon (2008), "

**Predicting Protein Interactions using Product Kernels**,"

*Advances in Biochemical Engineering/Biotechnology: Protein-Protein Interactions*, M. Werther and H. Seitz, Eds., vol. 110, Springer-Verlag. (publisher, presentation)

In this chapter, we provide a brief discussion of the relative merits of different experimental and computational methods available for identifying protein interactions. We then focus on the application of our particular (computational) method using Support Vector Machine product kernels. We describe our method in detail and discuss the application of the method for predicting protein-protein interactions, Beta-strand interactions, and protein-chemical interactions.

G. S. Davidson, S. Martin, K. Boyack, B. N. Wylie, J. Martinez, A. Aragon, M. Werner-Washburne, M. Mosquera-Caro, and C. L. Willman (2007), "

**Robust Methods in Microarray Analysis**,"

*Genomics and Proteomics Engineering in Medicine and Biology*, M. Akay, Ed., Wiley/IEEE. (publisher)

High throughput analysis techniques are required in order to make good use of the genomic sequences that have recently become available for many species, including humans. Unfortunately, microarray data are also notoriously inaccurate, and it is possible to spend far too much time contemplating the results of a given microarray analysis method, only to arrive at a dead end. In this chapter, we discuss several methods for microarray analysis we have developed, which are meant to provide more accurate results and/or quality assessments of the results obtained.

## Journal Articles

S. Martin, Harry D. Pratt III, and Travis M. Anderson (2017), "

**Screening for High Conductivity/Low Viscosity Ionic Liquids Using Product Descriptors**," *Molecular Informatics*, 36(7): 1600125 (14pp). (journal, data)

Ionic liquids (ILs) consist of cation-anion pairs. Despite this fact, current efforts to predict IL properties using quantitative structure property relationships (QSPRs) treat the cations and anions separately, ignoring potential cross-correlations. Here we consider a method for treating ILs as pairs using product descriptors for QSPRs, a concept borrowed from the prediction of protein-protein interactions in bioinformatics. We demonstrate the method by predicting electrical conductivity, viscosity, and melting point on a dataset taken from the ILThermo database on June 18th, 2014.

E. Andries and S. Martin (2013), "

**Sparse Methods in Spectroscopy: An Introduction, Overview, and Perspective**," *Applied Spectroscopy*, 67(6):579-589. (journal)

Multivariate calibration methods such as partial least-squares build calibration models that are not parsimonious: all variables (either wavelengths or samples) are used to define a calibration model. In high-dimensional or large sample size settings, interpretable analysis aims to reduce model complexity by finding a small subset of variables that significantly influences the model. The term "sparsity," as used here, refers to the calibration models having many zero-valued regression coefficients. Only the variables associated with non-zero coefficients influence the model. In this paper, we briefly review the regression problems associated with sparse models and discuss their spectroscopic applications.

S. Martin (2012), "

**Lattice Enumeration for Inverse Molecular Design Using the Signature Descriptor**," *Journal of Chemical Information and Modeling*, 52(7):1787-1797. (journal, software)

We describe an inverse quantitative structure activity relationship (QSAR) framework developed for the design of molecular structures with desired properties. This framework uses chemical fragments encoded with a molecular descriptor known as a signature. It solves a system of linear constrained Diophantine equations to reorganize the fragments into novel molecular structures. The method has been previously applied to problems in drug and materials design but has inherent computational limitations due to the necessity of solving the Diophantine constraints. We propose a new approach to overcome these limitations using the Fincke-Pohst algorithm for lattice enumeration. We benchmark the new approach against previous results on LFA-1/ICAM-1 inhibitory peptides, linear homopolymers, and hydrofluoroether foam blowing agents.

S. Martin and J.-P. Watson (2011), "

**Non-Manifold Surface Reconstruction from High Dimensional Point Cloud Data**," *Computational Geometry: Theory and Applications*, 44(8):427-441. (journal, software)

We describe an algorithm capable of reconstructing a non-manifold surface embedded as a point cloud in a high dimensional space. Our algorithm will work for non-orientable surfaces, and for surfaces with certain types of self-intersection. The self-intersections must be ordinary double curves and are fitted locally by intersecting planes using a degenerate quadratic surface.

S. Martin, A. Thompson, E. A. Coutsias, and J.-P. Watson (2010), "Topology of Cyclo-Octane Energy Landscape," Journal of Chemical Physics 132:234115. (journal, software, presentation)

Understanding energy landscapes is a major challenge in chemistry and biology. Although a wide variety of methods have been invented and applied to this problem, very little is understood about the actual mathematical structures underlying such landscapes. We have discovered an example of an energy landscape which is nonmanifold, demonstrating previously unknown mathematical complexity. The example occurs in the energy landscape of cyclo-octane, which was found to have the structure of a reducible algebraic variety, composed of the union of a sphere and a Klein bottle, intersecting in two rings.

S. Martin, G. Chandler, and M. S. Derzon (2008), "Simulation of High Pressure Micro-Capillary 3He Counters," Journal of Physics G: Nuclear and Particle Physics 35:115103. (journal)

Low pressure (1-4 atm) cylindrical 3He counters are widely used as neutron detectors. These detectors are relatively large (1-2.5 cm diameter) and can be subject to noise induced by microphonics. Meanwhile, new advancements in micro-fabrication are enabling the manufacture of high pressure (over 3000 atm) micro-capillaries (~100 micron diameter). Can these micro-capillaries be used as accurate and high-efficiency 3He counters? To investigate these questions, we have developed a mathematical model/computer simulation.

W. M. Brown, S. Martin, S. N. Pollock, E. A. Coutsias, and J.-P. Watson (2008), "Algorithmic Dimensionality Reduction for Molecular Structure Analysis," Journal of Chemical Physics 129(6):064118. (journal)

Linear dimensionality reduction approaches have been used to exploit the redundancy in a Cartesian coordinate representation of molecular motion by producing low-dimensional representations of molecular motion. Here, we investigate the effectiveness of several automated algorithms for nonlinear dimensionality reduction for representation of trans,trans-1,2,4-trifluorocyclooctane conformation, a molecule whose structure can be described on a 2-manifold in a Cartesian coordinate phase space.

W. M. Brown, A. Sasson, D. R. Bellew, L. A. Hunsaker, S. Martin, A. Leitao, L. M. Deck, D. L. Vander Jagt, and T. Oprea (2008), "Efficient Calculation of Molecular Properties from Simulation using Kernel Molecular Dynamics," Journal of Chemical Information and Modeling 48(8):1626-1637. (journal)

Understanding the relationship between chemical structure and function is a ubiquitous problem in chemistry and biology. Here, we present a novel approach that uses aspects of simulation and informatics in order to formulate structure-property relationships. We show how supervised learning can be utilized to overcome the sampling problem in simulation approaches. Likewise, we show how learning can be achieved based on molecular descriptions that are rooted in the physics of dynamic intermolecular forces.

J.-L. Faulon, M. Misra, S. Martin, K. Sale, and R. Sapra (2008), "Genome Scale Enzyme-Metabolite and Drug-Target Interaction Predictions using the Signature Molecular Descriptor," Bioinformatics 24(2):225-233. (journal, pdf)

Identifying protein enzymatic or pharmacological activities is an important area of research in biology and chemistry. Biological and chemical databases are increasingly being populated with linkages between protein sequences and chemical structures. There is now sufficient information to apply machine-learning techniques to predict interactions between chemicals and proteins at a genome scale. Current machine-learning techniques use as input either protein sequences and structures or chemical information. We propose here a method to infer protein-chemical interactions using heterogeneous input consisting of both protein sequence and chemical information.

S. Martin, Z. Zhang, A. Martino, and J.-L. Faulon (2007), "Boolean Dynamics of Genetic Regulatory Networks Inferred from Microarray Time Series Data," Bioinformatics 23(7):866-874. (journal, pdf, supplement)

Methods available for the inference of genetic regulatory networks strive to produce a single network, usually by optimizing some quantity to fit the experimental observations. In this paper we investigate the possibility that multiple networks can be inferred, all resulting in similar dynamics. This idea is motivated by theoretical work which suggests that biological networks are robust and adaptable to change, and that the overall behavior of a genetic regulatory network might be captured in terms of dynamical basins of attraction.

S. Martin, Z. Mao, L. S. Chan, and S. Rasheed (2007), "Inferring Protein-Protein Interaction Networks using Protein Complex Data," International Journal of Bioinformatics Research and Applications 3(4):480-492. Expanded version of BIOT 2006 conference paper with same authors. (journal)

Present-day approaches for the determination of protein-protein interaction networks are usually based on two-hybrid experimental measurements. Here we consider a computational method that uses another type of experimental data: instead of direct information about protein-protein interactions, we consider data in the form of protein complexes. We propose a method for using these complexes to provide predictions of protein-protein interactions. When applied to a dataset obtained from a cat melanoma cell line, we find that we are able to predict when a protein pair belongs to a complex with 96% accuracy.

S. Martin, R. D. Carr, and J.-L. Faulon (2006), "Random Removal of Edges from Scale Free Graphs," Physica A 371(2):870-876. (journal)

It has been discovered that many naturally occurring networks (the internet, the power grid of the western US, various biological networks, etc.) satisfy a power-law degree distribution. Such scale-free networks have many interesting properties, one of which is robustness to random damage. This problem has been analyzed from the point of view of node deletion and connectedness. Recently, it has also been considered from the point of view of node deletion and scale preservation. In this paper we consider the problem from the point of view of edge deletion and scale preservation. In agreement with the work on node deletion and scale preservation, we show that a scale-free graph should not be expected to remain scale free when edges are removed at random.

C. Wilson, G. S. Davidson, S. Martin, E. Andries, J. Potter, R. Harvey, K. Ar, Y. Xu, K. J. Kopecky, D. P. Ankerst, H. Gundacker, M. L. Slovak, M. Mosquera-Caro, I-M. Chen, D. L. Stirewalt, M. Murphy, F. A. Shultz, H. Kang, X. Wang, J. P. Radich, F. R. Appelbaum, S. R. Atlas, J. Godwin, and C. L. Willman (2006), â€œGene Expression Profiling of Adult Acute Myeloid Leukemia Identifies Novel Biologic Clusters for Risk Classification and Outcome Prediction,

To determine whether gene expression profiling could improve risk classification and outcome prediction in older acute myeloid leukemia (AML) patients, expression profiles were obtained in pretreatment leukemic samples from 170 patients whose median age was 65 years. These expression profiles were analyzed using unsupervised clustering methods were used to classify patients into 6 cluster groups that varied significantly in rates of resistant disease. These gene expression signatures provide insights into novel groups of AML not predicted by traditional studies that impact prognosis and potential therapy.

W. M. Brown, S. Martin, Mark D. Rintoul, and J.-L. Faulon (2006), "Designing Novel Polymers with Targeted Properties using the Signature Molecular Descriptor," Journal of Chemical Information and Modeling 46(2): 826-835. (journal)

A method for solving the inverse quantitative structure-property relationship (QSPR) problem is presented which facilitates the design of novel polymers with targeted properties. Here, we demonstrate the efficacy of the approach using the targeted design of polymers exhibiting a desired glass transition temperature, heat capacity, and density. We show how the inverse problem can be solved to design poly(N-methyl hexamethylene sebacamide) despite the fact that the polymer was used not used in the training of this model.

W. M. Brown, S. Martin, J. Chabarek, C. Strauss, and J.-L. Faulon (2006), "Prediction of Beta-Strand Packing Interactions using the Signature Product," Journal of Molecular Modeling 12(3):355-361. (journal, poster)

The prediction of Beta-sheet topology requires the consideration of long-range interactions between Beta-strands that are not necessarily consecutive in sequence. Since these interactions are difficult to simulate using ab initio methods, we propose a supplementary method able to assign Beta-sheet topology using only sequence information. Our method is based on the signature molecular descriptor, which has been used previously to predict protein-protein interactions successfully, and to develop quantitative structure-activity relationships for small organic drugs and peptide inhibitors.

J.-L. Faulon, W. M. Brown, and S. Martin (2005), "Reverse Engineering Chemical Structures from Molecular Descriptors: How Many Soluti

Physical, chemical and biological properties are the ultimate information of interest for chemical compounds. Molecular descriptors that map structural information to activities and properties are obvious candidates for information sharing. In this paper, we consider the feasibility of using molecular descriptors to safely exchange chemical information in such a way that the original chemical structures cannot be reverse engineered.

S. Martin, D. Roe, and J.-L. Faulon (2005), "Predicting Protein-Protein Interactions using Signature Products,"

Proteome-wide prediction of protein-protein interaction is a difficult and important problem in biology. Although there have been recent advances in both experimental and computational methods for predicting protein-protein interactions, we are only beginning to see a confluence of these techniques. In this paper, we describe a very general, high-throughput method for predicting protein-protein interactions. Our method combines a sequence-based description of proteins with experimental information that can be gathered from any type of protein-protein interaction screen.

C. Churchwell, M. D. Rintoul, S. Martin, D. P. Visco Jr., A. Kotu, R. S. Larson, L. O. Sillerud, D. C. Brown, and J.-L. Faulon (2004), "The Signature Molecular Descriptor 3. Inverse-Quantitative Structure-Activity Relationship of ICAM-1 Inhibitory Peptides,"

We present a methodology for solving the inverse-quantitative structureâ€“activity relationship (QSAR) problem using the molecular descriptor called signature. First, we create a QSAR equation that correlates the occurrence of a signature to the activity values using a stepwise multilinear regression technique. Second, we construct constraint equations, specifically the graphicality and consistency equations, which facilitate the reconstruction of the solution compounds directly from the signatures. Third, we solve the set of constraint equations, which are both linear and Diophantine in nature. Last, we reconstruct and enumerate the solution molecules and calculate their activity values from the QSAR equation.

S. Martin, M. Kirby, and R. Miranda (2000), "Symmetric Veronese Classifiers with Application to Materials Design,"

To solve the materials classification problem, we propose a fast, exhaustive approach. We propose to test every feature (chemical property), every pair of features, every three features, etc., against every classifier architecture from a certain group of classifiers known as Support Vector Machines. This approach generalizes Pierre Villars' work to higher dimensions and more operations. We have duplicated his result in identifying the Mendeleev Number as the single best feature, and we have produced a new result for the case of two features: namely, we have identified the Mendeleev number with the valence electron number as the best combination of two features.

Treatment of acute lymphoblastic leukemia (ALL) involves the assignment of patients to risk groups based on cytogentic abnormalities. Here we report the results of a gene expression experiment in which we have discovered that the predictions of karyotype are insensitive, in that there are a large number of false positive classifications among patients with poorly defined cytogenetic abnormalities.

Low pressure (1-4 atm) cylindrical 3He counters are widely used as neutron detectors. These detectors are relatively large (1-2.5 cm diameter) and can be subject to noise induced by microphonics. Meanwhile, new advancements in micro-fabrication are enabling the manufacture of high pressure (over 3000 atm) micro-capillaries (~100 micron diameter). Can these micro-capillaries be used as accurate and high-efficiency 3He counters? To investigate these questions, we have developed a mathematical model/computer simulation.

W. M. Brown, S. Martin, S. N. Pollock, E. A. Coutsias, and J.-P. Watson (2008), "Algorithmic Dimensionality Reduction for Molecular Structure Analysis," Journal of Chemical Physics 129(6):064118. (journal)

Linear dimensionality reduction approaches have been used to exploit the redundancy in a Cartesian coordinate representation of molecular motion by producing low-dimensional representations of molecular motion. Here, we investigate the effectiveness of several automated algorithms for nonlinear dimensionality reduction for representation of trans,trans-1,2,4-triflouorocyclooctane conformation - a molecule whose structure can be described on a 2-manifold in a Cartesian coordinate phase space

W. M. Brown, A. Sasson, D. R. Bellew, L. A. Hunsaker, S. Martin, A. Leitao, L. M. Deck, D. L. Vander Jagt, and T. Oprea (2008), "Efficient Calculation of Molecular Properties from Simulation using Kernel Molecular Dynamics," Journal of Chemical Information and Modeling 48(8):1626-1637. (journal)

Understanding the relationship between chemical structure and function is a ubiquitous problem in chemistry and biology. Here, we present a novel approach that uses aspects of simulation and informatics in order to formulate structure-property relationships. We show how supervised learning can be utilized to overcome the sampling problem in simulation approaches. Likewise, we show how learning can be achieved based on molecular descriptions that are rooted in the physics of dynamic intermolecular forces.

J.-L. Faulon, M. Misra, S. Martin, K. Sale, and R. Sapra (2008), "Genome Scale Enzyme-Metabolite and Drug-Target Interaction Predictions using the Signature Molecular Descriptor," Bioinformatics 24(2):225-233. (journal, pdf)

Identifying protein enzymatic or pharmacological activities are important areas of research in biology and chemistry. Biological and chemical databases are increasingly being populated with linkages between protein sequences and chemical structures. There is now sufficient information to apply machine-learning techniques to predict interactions between chemicals and proteins at a genome scale. Current machine-learning techniques use as input either protein sequences and structures or chemical information. We propose here a method to infer protein-chemical interactions using heterogeneous input consisting of both protein sequence and chemical information.

S. Martin, Z. Zhang, A. Martino, and J.-L. Faulon (2007), "Boolean Dynamics of Genetic Regulatory Networks Inferred from Microarray Time Series Data," Bioinformatics 23(7):866-874. (journal, pdf, supplement)

Methods available for the inference of genetic regulatory networks strive to produce a single network, usually by optimizing some quantity to fit the experimental observations. In this paper we investigate the possibility that multiple networks can be inferred, all resulting in similar dynamics. This idea is motivated by theoretical work which suggests that biological networks are robust and adaptable to change, and that the overall behavior of a genetic regulatory network might be captured in terms of dynamical basins of attraction.

S. Martin, Z. Mao, L. S. Chan, and S. Rasheed (2007), "Inferring Protein-Protein Interaction Networks using Protein Complex Data," International Journal of Bioinformatics Research and Applications 3(4):480-492. Expanded version of BIOT 2006 conference paper with same authors. (journal)

Present day approaches for the determination of protein-protein interaction networks are usually based on two hybrid experimental measurements. Here we consider a computational method that uses another type of experimental data: instead of direct information about protein-protein interactions, we consider data in the form of protein complexes. We propose a method for using these complexes to provide predictions of protein-protein interactions. When applied to a dataset obtained from a cat melanoma cell line we find that we are able to predict when a protein pair belongs to a complex with 96% accuracy.

S. Martin, R. D. Carr, and J.-L. Faulon (2006), "Random Removal of Edges from Scale Free Graphs," Physica A 371(2):870-876. (journal)

It has been discovered that many naturally occurring networks (the internet, the power grid of the western US, various biological networks, etc.) satisfy a power-law degree distribution. Such scale-free networks have many interesting properties, one of which is robustness to random damage. This problem has been analyzed from the point of view of node deletion and connectedness. Recently, it has also been considered from the point of view of node deletion and scale preservation. In this paper we consider the problem from the point of view of edge deletion and scale preservation. In agreement with the work on node deletion and scale preservation, we show that a scale-free graph should not be expected to remain scale free when edges are removed at random.

C. Wilson, G. S. Davidson, S. Martin, E. Andries, J. Potter, R. Harvey, K. Ar, Y. Xu, K. J. Kopecky, D. P. Ankerst, H. Gundacker, M. L. Slovak, M. Mosquera-Caro, I-M. Chen, D. L. Stirewalt, M. Murphy, F. A. Shultz, H. Kang, X. Wang, J. P. Radich, F. R. Appelbaum, S. R. Atlas, J. Godwin, and C. L. Willman (2006), â€œGene Expression Profiling of Adult Acute Myeloid Leukemia Identifies Novel Biologic Clusters for Risk Classification and Outcome Prediction,

*Blood*108(2): 685-696. (journal, pdf)To determine whether gene expression profiling could improve risk classification and outcome prediction in older acute myeloid leukemia (AML) patients, expression profiles were obtained in pretreatment leukemic samples from 170 patients whose median age was 65 years. These expression profiles were analyzed using unsupervised clustering methods were used to classify patients into 6 cluster groups that varied significantly in rates of resistant disease. These gene expression signatures provide insights into novel groups of AML not predicted by traditional studies that impact prognosis and potential therapy.

W. M. Brown, S. Martin, Mark D. Rintoul, and J.-L. Faulon (2006), "Designing Novel Polymers with Targeted Properties using the Signature Molecular Descriptor," Journal of Chemical Information and Modeling 46(2): 826-835. (journal)

A method for solving the inverse quantitative structure-property relationship (QSPR) problem is presented which facilitates the design of novel polymers with targeted properties. Here, we demonstrate the efficacy of the approach using the targeted design of polymers exhibiting a desired glass transition temperature, heat capacity, and density. We show how the inverse problem can be solved to design poly(N-methyl hexamethylene sebacamide) despite the fact that the polymer was not used in the training of this model.

W. M. Brown, S. Martin, J. Chabarek, C. Strauss, and J.-L. Faulon (2006), "Prediction of Beta-Strand Packing Interactions using the Signature Product," Journal of Molecular Modeling 12(3):355-361. (journal, poster)

The prediction of Beta-sheet topology requires the consideration of long-range interactions between Beta-strands that are not necessarily consecutive in sequence. Since these interactions are difficult to simulate using ab initio methods, we propose a supplementary method able to assign Beta-sheet topology using only sequence information. Our method is based on the signature molecular descriptor, which has been used previously to predict protein-protein interactions successfully, and to develop quantitative structure-activity relationships for small organic drugs and peptide inhibitors.

J.-L. Faulon, W. M. Brown, and S. Martin (2005), "Reverse Engineering Chemical Structures from Molecular Descriptors: How Many Solutions?," *Journal of Computer Aided Molecular Design* 19(9-10):637-650. (journal)

Physical, chemical and biological properties are the ultimate information of interest for chemical compounds. Molecular descriptors that map structural information to activities and properties are obvious candidates for information sharing. In this paper, we consider the feasibility of using molecular descriptors to safely exchange chemical information in such a way that the original chemical structures cannot be reverse engineered.

S. Martin, D. Roe, and J.-L. Faulon (2005), "Predicting Protein-Protein Interactions using Signature Products," *Bioinformatics* 21(2):218-226. (journal, pdf, software)

Proteome-wide prediction of protein-protein interaction is a difficult and important problem in biology. Although there have been recent advances in both experimental and computational methods for predicting protein-protein interactions, we are only beginning to see a confluence of these techniques. In this paper, we describe a very general, high-throughput method for predicting protein-protein interactions. Our method combines a sequence-based description of proteins with experimental information that can be gathered from any type of protein-protein interaction screen.

C. Churchwell, M. D. Rintoul, S. Martin, D. P. Visco Jr., A. Kotu, R. S. Larson, L. O. Sillerud, D. C. Brown, and J.-L. Faulon (2004), "The Signature Molecular Descriptor 3. Inverse-Quantitative Structure-Activity Relationship of ICAM-1 Inhibitory Peptides," *Journal of Molecular Graphics and Modeling* 43(3):721-734. (journal)

We present a methodology for solving the inverse-quantitative structure-activity relationship (QSAR) problem using the molecular descriptor called signature. First, we create a QSAR equation that correlates the occurrence of a signature to the activity values using a stepwise multilinear regression technique. Second, we construct constraint equations, specifically the graphicality and consistency equations, which facilitate the reconstruction of the solution compounds directly from the signatures. Third, we solve the set of constraint equations, which are both linear and Diophantine in nature. Last, we reconstruct and enumerate the solution molecules and calculate their activity values from the QSAR equation.

S. Martin, M. Kirby, and R. Miranda (2000), "Symmetric Veronese Classifiers with Application to Materials Design," *Engineering Applications of Artificial Intelligence* 13(5):513-520. (journal)

To solve the materials classification problem, we propose a fast, exhaustive approach. We propose to test every feature (chemical property), every pair of features, every three features, etc., against every classifier architecture from a certain group of classifiers known as Support Vector Machines. This approach generalizes Pierre Villars' work to higher dimensions and more operations. We have duplicated his result in identifying the Mendeleev Number as the single best feature, and we have produced a new result for the case of two features: namely, we have identified the Mendeleev number with the valence electron number as the best combination of two features.

## Letter to the Editor

S. Martin, M. P. Mosquera-Caro, J. W. Potter, G. S. Davidson, E. Andries, H. Kang, P. Helman, R. L. Veroff, S. R. Atlas, M. Murphy, X. Wang, K. Ar, Y. Xu, I-M. Chen, F. A. Schultz, C. S. Wilson, R. Harvey, E. Bedrick, J. Shuster, A. J. Carroll, B. Camitta, and C. L. Willman (2007), "Gene Expression Overlap affects Karyotype Prediction in Pediatric ALL," Leukemia 21:1341-1344. (journal)

Treatment of acute lymphoblastic leukemia (ALL) involves the assignment of patients to risk groups based on cytogenetic abnormalities. Here we report the results of a gene expression experiment in which we have discovered that the predictions of karyotype are insensitive, in that there are a large number of false positive classifications among patients with poorly defined cytogenetic abnormalities.

## Conference Proceedings

Matthew F. Barone, Julia Ling, Kenny Chowdhary, Warren Davis, Jeffrey Fike, and Shawn Martin (2017), "**Machine Learning Models of Errors in Large Eddy Simulation Predictions of Surface Pressure Fluctuations**," 47th AIAA Fluid Dynamics Conference, 3979. (proceedings)

We investigate a novel application of deep neural networks to modeling of errors in prediction of surface pressure fluctuations beneath a compressible, turbulent flow. In this context, the truth solution is given by Direct Numerical Simulation (DNS) data, while the predictive model is a wall-modeled Large Eddy Simulation (LES). The neural network provides a means to map relevant statistical flow-features within the LES solution to errors in prediction of wall pressure spectra. We simulate a number of flat plate turbulent boundary layers using both DNS and wall-modeled LES to build up a database with which to train the neural network. We then apply machine learning techniques to develop an optimized neural network model for the error in terms of relevant flow features.

S. Martin and T.-T. Quach (2016), "Interactive Visualization of Multivariate Time Series Data," Human Computer Interaction International (HCII), Foundations of Augmented Cognition: 322-332. (proceedings, software)

Organizing multivariate time series data for presentation to an analyst is a challenging task. Rather than providing a monolithic single use machine learning solution, we have developed a system that encourages analyst interaction. This system, Dial-A-Cluster (DAC), uses multidimensional scaling to provide a visualization of the datapoints depending on distance measures provided for each time series. The analyst can interactively adjust the relative influence of each time series to change the visualization and resulting clusters. Additional computations are provided which optimize the visualization according to metadata of interest and rank time series measurements according to their influence on analyst selected clusters.

E. Goodman, J. Ingram, S. Martin, and D. Grunwald (2015), "**Using Bipartite Anomaly Features for Cyber Security Applications**," International Conference on Machine Learning Applications (ICMLA): 301-306. (proceedings)

In this paper we use anomaly scores derived from a technique for bipartite graphs as features for a supervised machine learning algorithm for two cyber security problems: classifying Short Message Service (SMS) text messages as either spam or non-spam and detecting malicious lateral movement within a network. We examine the UCI SMS Spam Collection Data Set for the spam problem and use an authentication graph from Los Alamos National Laboratory. By using the anomaly scores we are able to improve the area under the curve (AUC) for the receiver operating characteristic (ROC) up to 27.5% for the spam data and 21.4% for the authentication data.

X. Fu, S. Martin, S. Mills, and B. McCane (2013), "Improved Spectral Clustering Using Adaptive Mahalanobis Distance," 2nd IAPR Asian Conference on Pattern Recognition: 171-175. (proceedings)

In manifold clustering, data are sampled from multiple manifolds and the goal is to partition the data accordingly. Spectral clustering algorithms have been developed to solve this problem, but they tend to fail when the underlying manifolds are very close to each other and/or they intersect. We propose an improvement to spectral clustering algorithms using adaptive neighborhoods computed using Mahalanobis distance. We show the effectiveness of this approach on some artificial data. We further incorporate the modification into recent related algorithms and compare the results on datasets in motion segmentation, handwritten digit recognition, and object rotation.

S. Martin and L. Szymanski (2013), "Singularity Resolution for Dimension Reduction," Image and Vision Computing New Zealand (IVCNZ): 19-24. (proceedings)

Manifold clustering is often used to partition a multiple manifold dataset prior to the application of manifold learning. Thus manifold clustering can be seen as a pre-processing step for eliminating singularities in a dataset before doing dimension reduction. In this paper, we propose an algorithm for resolving singularities prior to dimension reduction. We achieve singularity resolution using algebraic blow ups as motivation. With this type of singularity resolution, we are able to simultaneously perform manifold clustering and learning.

S. Martin, V. Subramanya, and S. Mills (2012), "Using Graph Layout to Generalise Focus+Context Image Magnification and Distortion," Image and Vision Computing New Zealand (IVCNZ): 97-102. (proceedings, presentation)

We present a novel framework for performing distortion-oriented focus+context image magnification. Our framework uses algorithms from graph drawing to manipulate the mesh underlying an image. Specifically, we apply a spectral graph layout algorithm to a weighted graph, where vertices in the graph correspond to pixels in the image, and edges connect directly adjacent vertices/pixels. By assigning appropriate weights to the edges, we can replicate the results of previous distortion-oriented approaches. In addition, we can perform image-aware distortion by using pixel values to influence the edge weights of our graph. We compare our approach to previous methods and demonstrate new results using image-based edge weighting schemes.

S. Martin, W. M. Brown, R. Klavans, and K. Boyack (2011), "OpenOrd: An Open-Source Toolbox for Large Graph Layout," Visualization and Data Analysis (VDA): 7868-06. (proceedings, software)

We document an open-source toolbox for drawing large-scale undirected graphs. This toolbox is based on a previously implemented closed-source algorithm known as VxOrd. Our toolbox, which we call OpenOrd, extends the capabilities of VxOrd to large graph layout by incorporating edge-cutting, a multi-level approach, average-link clustering, and a parallel implementation. At each level, vertices are grouped using force-directed layout and average-link clustering. The clustered vertices are then re-drawn and the process is repeated. When a suitable drawing of the coarsened graph is obtained, the algorithm is reversed to obtain a drawing of the original graph. This approach results in layouts of large graphs which incorporate both local and global structure.

S. Martin and S. McKenna (2007), "Predicting Building Contamination using Machine Learning," International Conference on Machine Learning and Applications (ICMLA): 192-197. (proceedings, presentation)

Potential events involving biological or chemical contamination of buildings are of major concern in the area of homeland security. Tools are needed to provide rapid, onsite predictions of contaminant levels given only approximate measurements in limited locations throughout a building. In principle, such tools could use calculations based on physical process models to provide accurate predictions. In practice, however, physical process models are too complex and computationally costly to be used in a real-time scenario. We investigate the feasibility of using machine learning to provide easily computed but approximate models that would be applicable in the field.

J. Joo, S. Plimpton, S. Martin, L. Swiler, and J.-L. Faulon (2007), "Sensitivity Analysis of a Computational Model of the IKK-NF-kB-IkBa-A20 Signal Transduction Network," Annals of the New York Academy of Sciences 1115:221-239. (proceedings)

The NF-kB signaling network plays an important role in many different compartments of the immune system during immune activation. Using a computational model of the NF-kB signaling network involving two negative regulators, IkBa and A20, we performed sensitivity analyses with three different sampling methods and present a ranking of the kinetic rate variables by the strength of their influence on the NF-kB signaling response. We also present a classification of temporal-response profiles of nuclear NF-kB concentration into six clusters, which can be regrouped to three biologically relevant clusters.

S. Martin (2006), "An Approximate Version of Kernel PCA," Proceedings of the 5th International Conference on Machine Learning and Applications (ICMLA):239-244. (proceedings, presentation, poster)

We propose an analog of kernel principal component analysis (kernel PCA). Our algorithm is based on an approximation of PCA which uses Gram-Schmidt orthonormalization. We combine this approximation with support vector machine kernels to obtain a nonlinear generalization of PCA. By using our approximation to PCA we are able to provide a more easily computed (in the case of many data points) and readily interpretable version of kernel PCA.

S. Martin, Z. Mao, L. S. Chan, and S. Rasheed (2006), "Protein Interactions Extrapolated from Feline Protein Complexes," Proceedings of the 3rd Biotechnology and Bioinformatics Symposium (BIOT):45-52. (pdf, presentation)

The determination of protein-protein interaction networks is a difficult problem in biology. Present day approaches to this problem are usually based on two-hybrid experimental measurements coupled with refinement and extrapolation using computational techniques. Here we consider a computational method for similar refinement and extrapolation using experimental data from which protein interactions cannot be directly inferred.

S. Martin (2006), "The Numerical Stability of Kernel Methods," Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (AIMATH):P01. (pdf, presentation)

Kernel methods use kernel functions to provide nonlinear versions of different methods in machine learning and data mining, such as Principal Component Analysis and Support Vector Machines. These kernel functions require the calculation of some or all of the entries of a matrix of the form $X^{T}X$. The formation of this type of matrix is known to result in potential numerical instability in the case of least squares problems. How does the computation of the kernel matrix impact the stability of kernel methods? We investigate this question in detail in the case of kernel PCA and also provide some analysis of kernel use in Support Vector Machines.
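The instability mentioned in this abstract can be illustrated with a standard textbook example. The sketch below is not taken from the paper; the helper names `gram` and `det2` and the test matrix are assumptions chosen for illustration. It forms the Gram matrix of a small full-rank matrix in double precision and shows that the result is exactly singular, because the small entries are lost when they are squared and added to 1.

```python
def gram(X):
    """Return the Gram matrix X^T X for X given as a list of rows."""
    n_cols = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n_cols)]
            for i in range(n_cols)]


def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


# The two columns of X are linearly independent for any eps != 0.
eps = 1e-8
X = [[1.0, 1.0],
     [eps, 0.0],
     [0.0, eps]]

# In exact arithmetic the diagonal entries are 1 + eps**2, but eps**2 is
# below double-precision resolution near 1, so it is rounded away.
G = gram(X)
print(G)        # every entry equals 1.0
print(det2(G))  # 0.0: the computed Gram matrix is singular
```

Any method that works from the computed matrix alone cannot distinguish this data from rank-one data, which is the kind of information loss the abstract refers to.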

S. Martin (2005), "Training Support Vector Machines using Gilbert's Algorithm," Proceedings of the 5th IEEE International Conference on Data Mining (ICDM):306-313. (proceedings, presentation, software)

Support vector machines are classifiers designed around the computation of an optimal separating hyperplane. This hyperplane is typically obtained by solving a constrained quadratic programming problem, but may also be located by solving a nearest point problem. Gilbert's algorithm can be used to solve this nearest point problem but is unreasonably slow. In this paper we present a modified version of Gilbert's algorithm for the fast computation of the support vector machine hyperplane.
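For readers unfamiliar with the nearest point formulation, the following is a minimal sketch of the basic, unmodified Gilbert iteration for finding the point of a convex hull closest to the origin. It is not the modified algorithm from the paper, and the function name `gilbert_min_norm`, the tolerance, and the iteration cap are assumptions made for illustration. For separable data, the SVM nearest point problem can be posed in this form by taking the points to be differences between examples of the two classes.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def gilbert_min_norm(points, tol=1e-9, max_iter=10000):
    """Approximate the point in the convex hull of `points` nearest the origin."""
    w = list(points[0])
    for _ in range(max_iter):
        # Support point: the input point with the smallest inner product with w.
        p = min(points, key=lambda q: dot(w, q))
        # Optimality gap; it is zero when no input point improves on w.
        gap = dot(w, w) - dot(w, p)
        if gap <= tol:
            break
        # Move w to the point nearest the origin on the segment from w to p.
        d = [pi - wi for pi, wi in zip(p, w)]
        t = min(1.0, max(0.0, -dot(w, d) / dot(d, d)))
        w = [wi + t * di for wi, di in zip(w, d)]
    return w


# Usage: the hull of these two points is a segment whose nearest point
# to the origin is its midpoint.
nearest = gilbert_min_norm([(1.0, 0.0), (0.0, 1.0)])
print(nearest)
```

Each iteration only moves along a single segment, so on harder inputs the iterates can zigzag toward the solution, which is consistent with the slowness the abstract describes.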

S. Martin and A. Backer (2005), "Estimating Manifold Dimension by Inversion Error," Proceedings of the 20th annual ACM Symposium on Applied Computing (SAC):22-26. (proceedings, presentation)

There has been recent interest in the application of a class of nonlinear dimensionality reduction algorithms which assume that a dataset has been sampled from a manifold. From this assumption, it follows that estimating the dimension of the manifold is the first step in analyzing an image dataset. Once an estimate of the dimension is obtained, it is used as a parameter for the nonlinear dimensionality reduction algorithm. In this paper, we consider reversing this approach. Instead of estimating the dimension of the manifold in order to obtain a low dimensional representation, we consider producing low dimensional representations in order to estimate the dimensionality of the manifold.

S. Martin, M. Kirby, and R. Miranda (2000), "**Kernel/Feature Selection for Support Vector Machines Applied to Materials Design**," Proceedings of 9th IFAC Symposium on Artificial Intelligence in Real Time Control (AIRTC):29-34. (pdf)

Support Vector Machines are classifiers with architectures determined by kernel functions. In these proceedings we propose a method for selecting the best SVM kernel for a given classification problem. Our method searches for the best kernel by remapping the data via a kernel variant of the classical Gram-Schmidt orthonormalization procedure then using Fisher's linear discriminant on the remapped data.

## Extended Abstracts

S. Martin, W. M. Brown, J.-L. Faulon, D. Weis, D. Visco, and J. Kenneke (2005), "Inverse Design of Large Molecules using Linear Diophantine Equations," Proceedings of the 4th IEEE Computational Systems Bioinformatics Workshops (CSBW):11-16. (proceedings, poster)

We have previously developed a method for the inverse design of small ligands. A key step in our method involves computing the Hilbert basis of a system of linear Diophantine equations. In our previous application, the ligands considered were small peptide rings, so that the resulting system of Diophantine equations was relatively small and easy to solve. When considering larger molecules, however, the Diophantine system is larger and more difficult to solve. In this work we present a method for reducing the system of Diophantine equations before they are solved, allowing the inverse design of larger compounds.

S. Martin, G. S. Davidson, E. E. May, J.-L. Faulon, and M. Werner-Washburne (2004), "**Inferring Genetic Networks from Microarray Data**," Proceedings of the 3rd IEEE Computational Systems Bioinformatics (CSB):566-569. (proceedings, poster)

In theory, it should be possible to infer realistic genetic networks from time series microarray data. In practice, however, network discovery has proved problematic. The three major challenges are 1) inferring the network; 2) estimating the stability of the inferred network; and 3) making the network visually accessible to the user. Here we describe a method, tested on publicly available time series microarray data, which addresses these concerns.

J.-L. Faulon, S. Martin, and R. D. Carr (2004), "**Dynamical Robustness in Gene Regulatory Networks**," Proceedings of the 3rd IEEE Computational Systems Bioinformatics (CSB):626-627. (proceedings, pdf, poster)

We investigate the robustness of biological networks, emphasizing gene regulatory networks. We define the robustness of a dynamical network as the magnitude of perturbation in terms of rates and concentrations that will not change the steady state dynamics of the network. We find the number of dynamical networks versus their dynamical robustness follows a power law.

## Dissertation and M.Sc. Paper

S. Martin (2001), **Techniques in Support Vector Classification**, Ph.D. Dissertation, Colorado State University. (pdf)

Here we consider three problems in Support Vector Classification: feature selection, kernel selection, and training. Feature selection is done using Fisher's discriminant adapted to SVMs. Kernel selection is done using a kernel version of Gram-Schmidt orthonormalization, and training is done using a geometrical interpretation of the quadratic optimization program normally used to solve for the SVM.

S. Martin (1997), "Concerning the Quadratic Relations which define the Grassman Manifold," M.S. Paper, Colorado State University. (pdf)

The Plücker embedding gives a bijective correspondence between the d-planes of a projective space $P^n$ and the points of the Grassman Manifold in a higher dimensional space $P^N$. The Grassman Manifold can be defined as the set of points in $P^N$ whose homogeneous coordinates satisfy certain quadratic relations, those relations being generated by sequences in {0,...,n}. Here we present a minimal set of generating sequences for the quadratic relations and subsequently investigate the linear independence of said relations.

Last Updated Dec. 18, 2017.