Scientific Computing with Python
Austin, Texas • July 6-12, 2015

SciPy 2015 Accepted Posters

The SciPy organizing committee is in awe of the work the SciPy community is doing, and we greatly appreciate everyone who submitted a topic for this year's conference. If your submission couldn't be slated into one of the limited number of poster spots, we hope you'll take advantage of the lightning talk and Birds of a Feather (BoF) sessions to share your work.

The submissions selected for poster presentation at SciPy 2015 are listed alphabetically below. You can also see the talk selections on the conference website.


15 Years of Python in the Classroom

Doug Blank, Bryn Mawr College

As a computer science educator at an all-women's college, I use the right language for any course, owing no special allegiance to Python. If a better language came along, I would switch to it in a second. However, more often than not, Python has shown its pedagogical utility. Over the past 15 years, I have used Python in the classroom in a variety of ways: from robots to art, and from the core language to the implementation of other languages, Python has been the language of choice. In this talk I explore 15 years of using Python, including my most recent uses with Jupyter.

Analysis and Visualization of Imaging Experiments with IPython, sklearn and bokeh

Blake Borgeson, Recursion Pharmaceuticals

In the past 12 months, our team of 8 (up from 4 at the start) has, in 50 separate experiments, imaged human cells from 50,000 genetic and drug perturbations. From those images, we've segmented over 100M human cells and extracted 1,000 features from each one, yielding 100B measurements. From those measurements, we've identified drugs that may be useful treatments against rare genetic diseases. We've also found creative ways of visualizing the measurements, our quantitative results, the images, and often results combined with images. This gives the biologists on our team insights only possible because of the 1000-dimensional feature space we're lucky enough to live in, conveyed intuitively enough for the results to be both understandable and actionable. The cooperative cycle of building these tools together with the biologists has been critical, but the speed at which we've been able to both perform analysis and visualize results has been hugely boosted by the ecosystem of scientific tools built on Python that has been developing and maturing over the past several years. I'll demonstrate several aspects of our analysis using an IPython notebook during the talk, and share some of the additions we've made to our notebooks and to bokeh that help make our data science results more reliable, relatable, and remember-able!

Analyzing Genomic Data with PyEnsembl and Varcode

Alex Rubinsteyn, Mount Sinai School of Medicine
Tim O'Donnell, Mount Sinai School of Medicine

PyEnsembl and Varcode are two new libraries being developed at Mount Sinai's Hammerlab to facilitate the analysis of genomic variants with an eye toward Pythonic interfaces, data representations, and coding conventions. PyEnsembl provides access to genomic sequence and annotation data. PyEnsembl's API consists primarily of objects such as Gene, Transcript, and Exon, along with methods for querying these objects by properties such as their chromosomal locations. PyEnsembl can be used to answer fundamental questions such as "which genes overlap a genomic location?" and "what is the nucleotide sequence of a particular transcript?". Varcode sits on top of PyEnsembl and uses it to annotate genomic mutations. Varcode can be used to quickly answer questions such as "what's the degree of overlap between two sets of mutations" and "which genes are affected by each mutation?". Additionally, Varcode can predict the altered amino acid sequence arising from a mutation, which is useful for predicting properties of the mutant protein (such as presentation to the adaptive immune system). This talk will show some basic examples of PyEnsembl and Varcode in action and give a brief glimpse of how they can be used as part of a personalized cancer vaccine pipeline.
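
The flavor of these queries is easy to sketch. The snippet below is a hedged illustration based on the libraries' documented usage; the release number, coordinates, and example variant are placeholders chosen for illustration, not excerpts from the poster.

```python
from pyensembl import EnsemblRelease
from varcode import Variant

# Annotation data must be downloaded once, e.g.:
#   pyensembl install --release 77 --species human
data = EnsemblRelease(77)

# "Which genes overlap a genomic location?"
print(data.gene_names_at_locus(contig="17", position=41243547))

# Annotate a (made-up) mutation and ask how each transcript is affected.
variant = Variant(contig="17", start=41243547, ref="A", alt="T", ensembl=data)
for effect in variant.effects():
    print(effect.short_description)
```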

A Bioinformatic Pipeline to Search for Genetic Variants of Wood Frog Environmental Response Genes

Jordan Brooker, Research Technician, Vassar College

Wood frogs, Rana sylvatica, exhibit the ability to freeze during hibernation and reanimate when temperatures rise. These and other phenotypic differences might be explained by variation in the DNA sequences of genes that play roles in response to environmental stress. We created a bioinformatic pipeline to identify genes of interest related to stress response and have identified possible SNP locations.

A Cloud Service to Record Simulation Metadata

Yannick Congo, NIST/Blaise Pascal University
Jonathan Guyer, NIST

The notion of capturing each execution of a script or workflow and its associated metadata is enormously appealing and should be at the heart of any attempt to make scientific simulations reproducible. In view of this, we are developing a backend and frontend service to store and view simulation metadata records using a robust, schema-agnostic approach. See https://gist.github.com/wd15/11f722a546b018525957
This presentation was authored by Daniel Wheeler, Ph.D. and Yannick Congo. It will be presented by Yannick Congo and Jonathan Guyer, Ph.D.

Comparing and Evaluating Clustering Methods for Protein Simulations

Jan H. Meinke, Forschungszentrum Juelich

Understanding protein folding is a prerequisite for understanding diseases like Alzheimer's, Parkinson's, Mad Cow, and many others. Simulations have contributed significantly to our knowledge of protein folding. The volume and complexity of data generated by these simulations, however, require new ways to analyze the data. Clustering protein structures offers a way of projecting the data onto a manageable set, but it is a difficult task. The large number of coordinates (dimensions), which may be periodic, makes it algorithmically challenging, and the large number of structures makes it computationally challenging. A variety of partitional and hierarchical cluster algorithms are available in Python, e.g., in scipy, sklearn, and msmbuilder. Density-based algorithms can be found, e.g., in sklearn and individual packages such as pymafia. This talk presents a comparison of the quality, speed, and complexity of different clustering algorithms available in Python, including the subspace clustering algorithm MAFIA, for clustering protein structures. The quality of the clusters is evaluated in terms of similarity measures as well as physical properties of the structures within a cluster. Finally, an example of an analysis based on clustering of a Monte Carlo simulation of the folding of a small protein is presented.
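
As a flavor of the scikit-learn side of such a comparison, the sketch below runs two of the algorithm families mentioned above on mock data and scores them with a similarity-style measure. It is an illustration, not the poster's benchmark; the data, parameters, and score are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Mock data: 300 "structures" described by 30 features each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 30)) for c in (0, 3, 6)])

for name, algo in [("k-means", KMeans(n_clusters=3, random_state=0)),
                   ("DBSCAN", DBSCAN(eps=2.0, min_samples=5))]:
    labels = algo.fit_predict(X)
    n = len(set(labels) - {-1})          # ignore DBSCAN's noise label (-1)
    score = silhouette_score(X, labels) if n > 1 else float("nan")
    print(name, "clusters:", n, "silhouette:", round(score, 3))
```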

Computational Astrophysical Hydrodynamics for Students

Michael Zingale, Stony Brook University

We describe a simple python-based hydrodynamics code for teaching students about simulation techniques used in computational astrophysics. Solvers for advection, compressible hydrodynamics, diffusion, incompressible, and low Mach number hydrodynamics are provided. The design of the code encourages experimentation and provides a simple starting point for students to learn core techniques.

Detecting Carcinogenic Somatic Mutations Using Scikit-learn

Chak-Pong Chung

In this joint project between the Institute for Pure & Applied Mathematics (IPAM) at UCLA and the Beijing Genomics Institute (BGI), we use scikit-learn to detect somatic SNPs in chromosome 1 of a breast cancer patient. The data is provided by BGI. We are given two sets of data from one patient: one set from the normal tissue and another set from the tumor tissue.
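
A hedged sketch of the general supervised setup implied above (the features, labels, and classifier choice are placeholders; the actual BGI pipeline is not shown):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later releases

rng = np.random.RandomState(0)
X = rng.rand(1000, 12)        # per-site features (coverage, base quality, ...)
y = rng.randint(0, 2, 1000)   # 1 = somatic SNP, 0 = not (mock labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy: %.2f" % clf.score(X_te, y_te))
```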

Development of Large Image Analysis Workflows in the Context of “SlideAtlas”

Dhanannjay "Djay" Deo, Kitware Inc
Brian Helba, Kitware Inc

We will discuss the lifecycle of developing a large image analysis application in the context of "SlideAtlas", an open-source interactive visualization and analysis platform for large image datasets.

Epithelial Tissue Simulation with Initial Conditions Taken From In Vivo Images

Melvyn Drag, Ohio State University

The subject of my Master's thesis was the simulation of epithelial tissue morphogenesis. My model, like all other models in this field, takes as input an initial mesh of polygons representing biological cells. Cleverly chosen forces are then applied to the vertices of the mesh, and a numerical integrator moves the vertices toward equilibrium. At present, the initial meshes are generated through Voronoi tessellations or through random perturbation of a regular mesh. I have used Python to transform images of living tissues into files readable by epithelial tissue simulators. Hopefully, as the models mature, the simulators will be able to take this input and output video that exactly mirrors what videos of living tissue show.
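
As a toy illustration of the force-and-integrate loop described above (not the thesis code), the sketch below relaxes a mock mesh with a simple spring force and forward-Euler steps; the connectivity and force law are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
vertices = rng.rand(100, 2)                      # mock mesh vertex positions
# Hypothetical connectivity: three random "neighbors" per vertex.
neighbors = [rng.choice(100, 3, replace=False) for _ in range(100)]

dt, k = 0.1, 1.0                                 # time step and spring constant
for step in range(500):
    targets = np.array([vertices[nb].mean(axis=0) for nb in neighbors])
    forces = k * (targets - vertices)            # the "cleverly chosen" forces
    vertices += dt * forces                      # forward-Euler integration step
```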

ESPResSo++: A Parallel Python Module for Soft Matter Simulations

Horacio Vargas Guzman, Max Planck Institute for Polymer Research
Torsten Stuehn, Max Planck Institute for Polymer Research

Fundamental research in condensed matter sciences often calls for computer simulations to get a profound understanding of the physical and chemical properties of the underlying model systems. Two widely used methods in the field are molecular dynamics (MD) and Monte Carlo (MC) simulations. We present ESPResSo++, an extensible, flexible, fast, and parallel simulation software for soft matter research. ESPResSo++ consists of a collection of C++ classes embedded in a Python framework, mainly targeting the simulation and analysis of coarse-grained atomistic or bead-spring models. Since ESPResSo++ itself is a Python module, it can easily be combined with other scientific Python modules like numpy and matplotlib, among others. The distribution and invocation of Python objects in a multi-CPU environment is performed by Parallel Method Invocation (PMI). A single Python script or an IPython notebook contains the complete workflow of system setup, simulation, and analysis for a scientific problem, which greatly enhances the reproducibility and documentation of scientific projects.

Fast, Flexible, and Scalable? Molecular Dynamics Simulations and Python

Wendell Smith, Yale University

Python projects are widely known for being easy to use, flexible, and scalable, but speed is often seen as an Achilles heel. If performance is truly a primary concern, how do you use Python? I will show how one can put the bottleneck code of a molecular dynamics simulation in a C++ library, wrap it in Python, and use the Python module for everything that takes more time to code than to run.
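
The pattern is straightforward to sketch with ctypes; the library and function names below are hypothetical, and the poster's own bindings may use a different tool (Cython, Boost.Python, etc.).

```python
# C++ side, compiled as e.g.:  g++ -O3 -shared -fPIC forces.cpp -o libforces.so
#   extern "C" void compute_forces(const double* pos, double* force, int n);
import ctypes
import numpy as np

lib = ctypes.CDLL("./libforces.so")   # hypothetical shared library
lib.compute_forces.argtypes = [
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_int,
]

def compute_forces(positions):
    """Call the C++ bottleneck; everything else stays in Python."""
    positions = np.ascontiguousarray(positions, dtype=np.float64)
    forces = np.zeros_like(positions)
    lib.compute_forces(
        positions.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        forces.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        positions.size,
    )
    return forces
```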

From DataFrame to Web Application in 10 minutes

Adam Hajari, Next Big Sound

Any data-driven project can benefit greatly from a simple, interactive, and easily accessible user interface. Whether your project is in the prototyping stage or you just want a way to quickly get your ideas and research to an audience unfamiliar with the command line, this talk will show you how to turn your Python code into interactive web applications in under 10 minutes.
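
As one common minimal pattern (not necessarily the toolchain used in this talk), a DataFrame can be served with Flask in a handful of lines:

```python
import pandas as pd
from flask import Flask

app = Flask(__name__)
df = pd.DataFrame({"artist": ["A", "B"], "plays": [1200, 3400]})  # placeholder data

@app.route("/")
def table():
    # Render the DataFrame as an HTML table.
    return df.to_html()

if __name__ == "__main__":
    app.run(debug=True)
```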

MDSynthesis: a Python Package Enabling Data-driven Molecular Dynamics Research

David Dotson, Arizona State University

Molecular dynamics (MD) simulations permit atomically-detailed access to the mechanical behavior of proteins, giving a sequence of structural snapshots--a timeseries of thousands to millions of atom positions. Most protein studies routinely generate terabytes of data spread over many (possibly hundreds of) simulation trajectories, performed under a wide variety of conditions. Adding further complexity, it is necessary to store intermediate data obtained from the raw trajectories--often timeseries for specific structural or thermodynamic quantities of interest. This data management problem serves as a barrier to answering scientific questions. MDSynthesis, a Python package that handles the tedious logistics of intermediate data storage and retrieval, addresses this problem. MDSynthesis features persistent container objects that use the robust MDAnalysis library to dissect individual simulation trajectories and HDF5 as the format of choice for data storage. These containers are built for aggregation, including convenience methods for quickly combining and comparing datasets (pandas, numpy, and pure Python structures) across hundreds of simulations in arbitrary ways. This makes MD data exploration feasible, and the abstraction the containers provide makes it easier to write analysis code that works across many variants of a simulation system. The package is actively developed and freely available under the GPLv2 from https://github.com/Becksteinlab/MDSynthesis.
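
A hedged usage sketch of the container idea; the attribute names below are drawn from the project's documentation of the time and may differ between versions, and the container name and values are placeholders.

```python
import mdsynthesis as mds
import pandas as pd

sim = mds.Sim("adk_equilibrium")       # create or re-load a persistent container
sim.tags.add("equilibrium", "apo")     # tag it for later aggregation

# Store an intermediate result (e.g., an RMSD timeseries) to HDF5...
rmsd = pd.Series([0.8, 0.4, 0.3], name="rmsd")
sim.data["rmsd"] = rmsd

# ...and get it back by name, from any later Python session.
recovered = sim.data["rmsd"]
```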

MetOceanTrack: A Desktop GUI to Simulate The Dispersal of Invasive Species in Coastal and Continental Shelf Regions

Andre Lobato, Metocean Solutions LTD.

A desktop application has been developed, using the PyQt library, to allow scientists to simulate the dispersal of propagules of invasive species within New Zealand coastal areas. The simulation is performed by a proprietary Lagrangian particle-tracking model (ERCore), written in Python and Fortran. The software allows visualisation of propagule dispersion animations, as well as heat maps for density estimation. The software will be made freely available for use in academic research.

Modeling the Outcomes of Mortgage Loan Applications

Vahan Grigoryan, Consumer Financial Protection Bureau

Understanding the mortgage loan application approval process used by financial institutions is important for both government regulators and the general public. This talk describes a machine learning model that predicts historical outcomes of those applications based on SciPy and scikit-learn libraries.

octant - A Python Package for Working with Ocean Model Datasets

Robert Hetland, Texas A&M University

Octant is a Python package containing a set of tools for working with ocean model datasets. Tools include grid generation, creation of input files, and standard analysis. The grid generation component wraps an existing C package and uses simple matplotlib GUI tools to create numerical curvilinear grids for regional ocean modeling quickly and visually; the grid parameters can then be ported over to a script for reproducibility and later refinement. Other components of the package are composed in a flexible manner, so that model variables and coordinates can be used much like numpy arrays while efficiently querying potentially large model output datasets. For example, the vertical coordinate system can be indexed as an array that is calculated on demand, so that disk access and memory usage are reduced to only what is necessary. The design goal is a small set of efficient tools that operate within the standard numpy/scipy/matplotlib ecosystem, eschewing monolithic GUIs, large data structures, and overly complex analysis tools.
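
The on-demand coordinate idea can be illustrated with a deliberately simplified sigma-coordinate sketch; this is illustrative Python, not octant's actual implementation.

```python
import numpy as np

class LazyDepth(object):
    """z[k] = s[k] * h: each vertical level is computed only when requested,
    so the full 3-D depth array is never held in memory."""
    def __init__(self, s, h):
        self.s = np.asarray(s)   # vertical (sigma) coordinate, shape (N,)
        self.h = np.asarray(h)   # bathymetry, shape (M, L)

    def __getitem__(self, k):
        # Only the requested level(s) are materialized.
        return self.s[k, None, None] * self.h

z = LazyDepth(np.linspace(-1.0, 0.0, 30), 100.0 * np.random.rand(64, 128))
level = z[10]     # one level: shape (64, 128)
block = z[5:10]   # five levels: shape (5, 64, 128)
```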

OOF 3D: Modular Software Design in the Face of Major Feature Changes

Andrew Reid, NIST

The Object-Oriented Finite Element package (OOF) is a Python and C++ finite element package designed to allow materials scientists to easily and quickly explore structure-property relationships in real microstructures. Initially built as a 2D code, this project recently completed the construction and release of a 3D version. The initial 2D code was written to be "3D aware" with the idea of easing this transition, and in addition, the authors endeavored to maintain an appropriate level of abstraction and encapsulation for the various objects. Nevertheless, numerous challenges arose in the design and execution of the 3D version. The lessons, good and bad, of the collision of the 3D plan with its implementation will be discussed.

Patient Signals - Building and Deploying Predictive Apps in Healthcare

Corey Chivers, University of Pennsylvania Health System

Patient Signals is a platform for building and deploying machine learning models into predictive applications using real-time clinical data.

Perceptual Colormaps in matplotlib with an Application in Oceanography

Kristen Thyng, Assistant Research Scientist, Texas A&M University

Perceptual colormaps are those in which humans perceive a step increase in the colormap as a commensurate increase in the plotted variable. The lightness value of a color is the measure by which we perceive this change properly. This talk will include a discussion of which functional relationship (linear, power law, etc.) of the lightness parameter in a colormap best represents one's data. Hue is another variable that can help represent one's data in a colormap, and it can be used in conjunction with changing lightness as an additional indicator. A sample set of oceanographic data will be used to illustrate how to choose a good colormap for a given application by considering whether to use a sequential or diverging colormap, which colors might be intuitive for viewers of the image, how to use both lightness and hue as different indicators in the same colormap, and how to choose intuitive colors while retaining a sequential perceptual representation.
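
One simple way to inspect a colormap's lightness curve is sketched below, using CIELAB L* via scikit-image; the speaker's own tooling may differ (e.g., CAM02-UCS via colorspacious), and the colormap choice here is just an example.

```python
import numpy as np
import matplotlib.pyplot as plt
from skimage import color

cmap = plt.get_cmap("cubehelix")                        # example sequential map
rgb = cmap(np.linspace(0, 1, 256))[np.newaxis, :, :3]   # sample it, drop alpha
lightness = color.rgb2lab(rgb)[0, :, 0]                 # CIELAB L* in [0, 100]

plt.plot(lightness)
plt.xlabel("colormap index")
plt.ylabel("L* (lightness)")
plt.title("A perceptual colormap should have a monotonic L* curve")
plt.show()
```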

Performance of PyCUDA (Python in GPU High Performance Computing)

Roberto Colistete Junior, UFES - Federal University of Espirito Santo (Brazil)
Ramon Giostri Campos, UFES - Federal University of Espirito Santo (Brazil)

High-performance computing with GPUs and CUDA can be done from Python via PyCUDA. Here I will show the advantages and performance of PyCUDA compared with C/C++ and Wolfram Mathematica with CUDA, using a calculation example from Type Ia supernova cosmology.
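
A minimal PyCUDA example of the kind of GPU kernel such a comparison exercises; the cosmology calculation itself is not reproduced here.

```python
import numpy as np
import pycuda.autoinit                    # noqa: F401 -- initializes a GPU context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# An element-wise kernel written in CUDA C but built and launched from Python.
axpb = ElementwiseKernel(
    "double a, double *x, double b, double *y",
    "y[i] = a * x[i] + b",
    "axpb")

x = gpuarray.to_gpu(np.linspace(0.0, 1.0, 1 << 20))
y = gpuarray.empty_like(x)
axpb(np.float64(2.0), x, np.float64(3.0), y)
print(y.get()[:3])
```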

Python Tool to Load, Process, and Plot Conductivity, Temperature, Depth (CTD) Data

Filipe Fernandes, SECOORA

The [python-ctd](https://github.com/ocefpaf/python-ctd) module provides a set of tools to load and process hydrographic data (CTD and XBT) using pandas DataFrames.
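
A hedged usage sketch based on the project's README of that era; the entry points may have changed between versions, and the file name is a placeholder.

```python
from ctd import DataFrame

cast = DataFrame.from_cnv("cast.cnv")    # Sea-Bird CNV file -> pandas DataFrame
downcast, upcast = cast.split()          # separate the down- and up-casts
downcast = downcast.despike().bindata(delta=1.0)   # basic processing steps
```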

PyUgrid: Handling NetCDF for Unstructured Grids

Chris Calloway, UNC RENCI

UGRID is a proposed extension to the Climate and Forecast (CF) conventions for NetCDF metadata. UGRID specifies a metadata convention for describing the topology of unstructured grids, such as finite element meshes, common to storm model output. PyUgrid is a Python package for reading, writing, and analyzing NetCDF datasets conforming to the UGRID specification.
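
A hedged sketch of reading a UGRID-conforming file; the constructor and attribute names are assumptions based on the project's examples, and the file name is a placeholder.

```python
import pyugrid

ug = pyugrid.UGrid.from_ncfile("triangular_mesh.nc")  # placeholder file
print(ug.nodes.shape)    # (n_nodes, 2) node locations
print(ug.faces.shape)    # (n_faces, 3) triangle connectivity
```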

Reassortment Primes Influenza for Ecotype Switches

Eric Ma, Massachusetts Institute of Technology

The influenza virus is capable of undergoing genomic reassortment. Much has been intuited about the role of influenza reassortment in its evolution, and the role of host ecology in reassortment, but little has actually been quantified. Here, I will present a network reconstruction method that I have developed in collaboration with my colleagues at MIT to globally quantify the amount of reassortment that influenza has undergone, as well as a hyper-graph inference method to clarify the role of host ecology in enabling reassortment. We find that there are highly promiscuous subtypes that donate or accept genes with other viral subtypes, that reassortment is most prevalent in wild birds, and that reassortment events are consistently involved in the life history of viruses that switch between ecotypes, raising implications for surveillance efforts. In this talk, I will also share how we are making our work achieve the goal of being "highly reproducible".

Relation: The Missing Container

James Larkin, Noblis
Scott James, Noblis

How one simple container can perform a variety of common tasks, including tagging, aliasing, partitioning, and inversion.

Risk Analysis of Privacy Invasion in Social Network using Python with SciPy

Jun-Bum Park, UST(Korea University of Science and Technology) & ETRI(Electronics and Telecommunications Research Institute)

The benefit of online used-goods transactions is that everyone who uses the Internet can shop for and sell goods easily. But there is a risk of privacy invasion when you expose personal information in an online market. A user exposes his email address or mobile phone number, as well as other information, and an attacker can use that information to identify a target user by linking it to a social network service. If the personal information of a targeted user is linked to a social network service, an attacker can engage in crimes such as stalking, phishing, and fraud using the linked information. In this work, we use Python to evaluate the risk of linked personal information obtained from online used-goods transaction sites in South Korea.

rsfmodel - A Frictional Modeling Tool for Fault and Laboratory Data Analysis

John Leeman, Ph.D. Candidate, Penn State University
Ryan May, UCAR/Unidata

Authored by John Leeman, Ryan May, Chris Marone and Demian Saffer at Penn State University.
We have developed a tool to model transitional frictional behavior, important to industrial problems of granular material handling and natural problems such as earthquake physics. Without Python and its well-tested numerical packages, we would have been unable to implement such a clean and easy-to-use solution. This is much more desirable than each scientist writing their own model with differing amounts of unit testing and error checking. The interface also encourages trying different solving methods and numerical formulations of the problem.

A Scientific Game Engine in the Cloud

Oliver Nagy

Azrael is a game engine for engineers. Unlike traditional engines it emphasises accurate physics, runs in the Cloud, has a language-agnostic API, and is written (mostly) in Python. Its main purpose is to make it easy for engineers to build, study, and control complex systems, for instance an autopilot for a spaceship; or a fleet thereof; flying in formation; through an asteroid belt... I will show a live demo to illustrate the concept. It implements a simple control algorithm to first manoeuvre an object to a pre-defined position in space, and then maintain that position despite random collisions with other objects. For more information, demos, and the code, please visit https://github.com/olitheolix/azrael.

SciPy and Real-time Big Data for Site Optimization

Winnie Cheng, Chief Data Scientist, Bankrate

We share our industry experience in building a real-time big data system that leverages the SciPy stack for website optimization at Bankrate.com. We discuss the design considerations behind creating a system that achieves low latency and handles large volumes of data, yet has the flexibility to explore different machine learning algorithms in Python.

TLS Benchmarking with IPython

Christopher Niemira, AOL

This proposal is for a poster that shows real-world techniques in use at AOL; it is not based on a research paper. TLS (née SSL) has been very much in the news of late. And while the conversation about encryption was, for many years, an argument about whether proper key management was worth the effort, today it is about how quickly and effectively practitioners can react to the latest threats. This poster will illustrate AOL's process and toolkit for evaluating cryptographic solutions for use in production. In particular, it will show how we determine the "cost" of enabling different encryption schemes (TLS versions, cipher suites, ECC curves, etc.) on various platforms by using IPython.Parallel as a synchronous benchmark driver, and how we leverage the SciPy stack to analyze the results visually. It will also show how we validate entropy and bulk throughput. Of particular significance is that this poster will show how the principles of reproducible research are relevant in industry, especially in the fields of software/hardware evaluation and QA.
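
The driver pattern can be sketched as follows. The host, port, cipher suites, and measured quantity (handshake latency only) are simplifications for illustration, not AOL's actual toolkit.

```python
from ipyparallel import Client   # the "IPython.parallel" package of the time

def handshake_time(host, port, ciphers, n=20):
    """Average TLS handshake time (seconds) for one cipher configuration."""
    import socket, ssl, time     # imported inside: the function ships to engines
    total = 0.0
    for _ in range(n):
        ctx = ssl.create_default_context()
        ctx.set_ciphers(ciphers)
        start = time.time()
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                pass             # handshake completes on wrap
        total += time.time() - start
    return ciphers, total / n

suites = ["ECDHE-RSA-AES128-GCM-SHA256", "AES256-SHA"]   # placeholder suites
view = Client().load_balanced_view()                     # connect to engines
results = view.map_sync(handshake_time,
                        ["example.com"] * len(suites),   # placeholder host
                        [443] * len(suites),
                        suites)
for ciphers, seconds in results:
    print("%-30s %.4f s" % (ciphers, seconds))
```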

Topological Data Analysis using Python

Dan Dickinson

Topological data analysis is an active and growing research area in mathematics, and has many exciting potential applications in diverse fields. Of particular interest to data scientists is the Mapper algorithm, which allows users to explore and extract meaningful insights from high dimensional data sets using intuitive interactive graphs. This poster covers an implementation written in Python, and highlights an application of the algorithm and resulting graph.
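
A toy sketch of the Mapper idea (cover the filter range with overlapping intervals, cluster each preimage, and connect clusters that share points); it is illustrative, not the poster's implementation, and the filter, cover, and clusterer are arbitrary choices.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.rand(500, 5)                       # mock high-dimensional point cloud
f = X[:, 0]                                # filter (lens) function

nodes = []
intervals = np.linspace(f.min(), f.max(), 6)
for lo, hi in zip(intervals[:-1], intervals[1:]):
    pad = 0.25 * (hi - lo)                 # 25% overlap between cover intervals
    idx = np.where((f >= lo - pad) & (f <= hi + pad))[0]
    if idx.size < 2:
        continue
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X[idx])
    nodes.extend(set(idx[labels == lab]) for lab in set(labels) - {-1})

# Connect any two clusters that share a data point.
edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
         if nodes[i] & nodes[j]]
print(len(nodes), "nodes,", len(edges), "edges")
```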

Toolset and Workflow for Verification, Validation and Uncertainty Quantification in Large-scale Engineering Applications

Damon McDougall, UT Austin

Predictive engineering is difficult. First we identify an observable physical phenomenon and an unobservable quantity of interest for which a prediction is desired. Then we attempt to recreate the observable phenomenon in an experimental setting; observations of the phenomenon contain errors. Next, we conjure a mathematical model for the observable physical phenomenon, potentially making modelling errors. Implementing this mathematical model on a computer leads to a numerical model that is not exact due to discretization errors. This numerical model is often high-dimensional and computationally expensive. Using output from the numerical model we compare with experimental observations and decide if the model is valid or not. If not, we must introduce a model for the inadequacy and calibrate this inadequacy by writing down a well-posed Bayesian inverse problem. Lastly, we propagate any uncertain parameters of the calibrated inadequacy model through to a prediction of a quantity of interest, while accounting for all uncertainty. The above procedure can be distilled down to three main components. The first is verification, the act of ensuring the numerical model solves a discrete version of the mathematical model. The second is validation, the process of ensuring the mathematical model and observations from a real experimental setting are consistent with each other. The third is uncertainty quantification, which ensures that uncertain parameters are understood in a distributional sense and that these distributions are propagated to the quantity of interest. This talk explores how to deal with each of these error-prone and time-consuming issues in turn. We concentrate on the specific problem of porous media flow but keep the ideas and workflow general. For each topic, I will showcase some of the software I use on a daily basis to do predictive engineering, some of which are scientific Python libraries and some of which are not.
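
A toy end-to-end illustration of the calibrate-then-propagate step (a linear model with a grid posterior, nothing like the poster's porous-media problem; all numbers are placeholders):

```python
import numpy as np

rng = np.random.RandomState(1)
truth, sigma = 2.0, 0.1
x = np.linspace(0, 1, 20)
data = truth * x + sigma * rng.randn(20)            # mock noisy observations

# Calibrate: Bayesian posterior over the model parameter on a grid (flat prior).
theta = np.linspace(0.0, 4.0, 2000)
resid = data[None, :] - theta[:, None] * x[None, :]
log_post = -0.5 * (resid ** 2).sum(axis=1) / sigma ** 2
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Propagate: the quantity of interest is the model at x = 2 (an extrapolation).
qoi = theta * 2.0
mean = (post * qoi).sum()
std = np.sqrt((post * (qoi - mean) ** 2).sum())
print("QoI = %.3f +/- %.3f" % (mean, std))
```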

The Use of Python in the Large-Scale Analysis and Identification of Potential Antimicrobial Peptides

Shaylyn Scott, George Mason University

Recent advancements in mass spectrometry technology and proteomics methods allow for the large-scale sequencing of peptides and small proteins from complex mixtures. However, analysis of the copious amount of peptide sequences produced can be challenging, and new software tools are needed to efficiently dissect the substantial volume of sequence data. One such software tool, PEAKS7, was used to assign de novo sequences for peptides harvested from complex biological mixtures (plasma) based on their respective ETD & HCD MS/MS spectra; where applicable, these sequences were validated against a database, such as a transcriptome or genome. However, the manual identification of effective antimicrobial peptides from the high volume of sequence data has proven to be a difficult and time-consuming task. A Python script was developed to aid in the identification of potential antimicrobial peptides. The script generates a FASTA file of the sequences, from which it sends POST requests to an antimicrobial peptide database called CAMP, which uses SVM, RF, and DA predictor algorithms to predict whether a specific peptide sequence will likely correspond to a peptide with antimicrobial properties. The script then calculates physicochemical properties of each sequence. All of the data is automatically exported to an Excel file and sorted by certain physicochemical parameters, including hydrophobicity and net charge, or by the number of algorithms that predicted a sequence to be antimicrobial. This script has dramatically reduced the time that it takes to process the peptide sequence data, reducing the time from weeks to minutes per dataset. This streamlined process has led to the prediction and experimental validation of several novel antimicrobial peptides.
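
The per-sequence property step is easy to sketch. Biopython stands in here for whatever property code the script actually used, the CAMP request itself is omitted, and the peptide is a placeholder.

```python
import pandas as pd
from Bio.SeqUtils.ProtParam import ProteinAnalysis

sequences = ["GIGKFLHSAKKFGKAFVGEIMNS"]            # placeholder peptide list
rows = []
for seq in sequences:
    pa = ProteinAnalysis(seq)
    rows.append({"sequence": seq,
                 "gravy": pa.gravy(),              # hydrophobicity (GRAVY) index
                 "pI": pa.isoelectric_point()})

# Export, sorted by hydrophobicity, as the script does with its results.
pd.DataFrame(rows).sort_values("gravy").to_excel("peptides.xlsx", index=False)
```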

Using Python to Manage 800,000 WAAS Data Points per Second

Benjamin Potter, Federal Aviation Administration

WAAS engineers have produced a system that handles more than 800,000 data points per second. The system was built using Python, Graphite, and ZeroMQ, and allows all engineers to easily review and analyze critical WAAS data.

Using Python and Jupyter Notebooks for a Biomedical Imaging Phenotyping Service

Brian Chapman, University of Utah
John Roberts, University of Utah
Stuart Schulthies, University of Utah

In this presentation, we describe a centralized medical image processing service based on Python and Jupyter notebooks. The image processing service facilitates extracting quantitative features from medical images that can be combined with other observable characteristics for patient phenotyping.

Using Self Organizing Maps to Visualize and Cluster High Dimensional Data

Richard Xie, Endgame

Cyber-security analysis often involves high dimensional data. In this talk, I will present how data scientists at Endgame use the Self Organizing Map (SOM) technique in Python to reduce the dimensionality for visualizing and clustering large data sets. I will demonstrate that some computationally expensive clustering methods, like hierarchical clustering, become feasible after dimensionality reduction.
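
An illustrative sketch with the MiniSom package (one of several Python SOM implementations; the poster's in-house tooling may differ, and the data and map size are placeholders):

```python
import numpy as np
from minisom import MiniSom
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = rng.rand(10000, 50)                    # mock high-dimensional data

# Train a 20x20 map: 10,000 points collapse onto 400 prototype vectors.
som = MiniSom(20, 20, 50, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 5000)

# Hierarchical clustering on the 400 prototypes (rather than the raw
# points) is what makes the expensive step feasible.
codebook = som.get_weights().reshape(-1, 50)   # older versions expose .weights
labels = fcluster(linkage(codebook, method="ward"), t=5, criterion="maxclust")
```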

A Web-based Visualization Platform for Microbiome Metaproteomics

Sandip Chatterjee, The Scripps Research Institute

The human distal gut harbors tens of trillions of microorganisms that have recently been shown to be associated with human health and disease. We have recently described an analysis method to identify proteins in the microbiome using mass spectrometry and proteomic scoring against a comprehensive protein search database. Further work has resulted in the collection of additional data and functional annotation of these unique datasets. To simplify and make sense of these data, we present an open-source, web-based data visualization platform built using the Python libraries Flask and Bokeh.

xray: netCDF Made Joyful

Stephan Hoyer, The Climate Corporation

xray is a new library that provides an in-memory representation of the netCDF file format in Python. xray extends the ease-of-use and speed of pandas to the Common Data Model. For example, we take a pragmatic approach to metadata that makes it both easy and rewarding to utilize and preserve labels. This talk will focus on a tour of xray's capabilities in the context of a meteorological example, including its ability to scale to files that do not fit into memory. In the process, we will highlight the distinctions between our approach and that of other Python libraries (e.g., Iris and netCDF4) for working with netCDF data.
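
A brief sketch of the style of analysis described, using the library's documented open/groupby pattern; the file and variable names are placeholders.

```python
import xray  # renamed to "xarray" in later releases

ds = xray.open_dataset("temps.nc")           # lazy access: scales past memory
seasonal = ds["tmax"].groupby("time.season").mean("time")  # labels preserved
df = seasonal.to_dataframe()                 # round-trip to pandas when needed
```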