Scientific Computing with Python
Austin, Texas • July 6-12, 2015
 

SciPy 2015 Accepted Talks

The SciPy organizing committee is in awe of the work the SciPy community is doing, and we greatly appreciate everyone who submitted a topic for this year's conference. If your submission could not be slated into the limited number of main conference talk sessions, we encourage you to take advantage of the lightning talk and Birds of a Feather (BoF) sessions to share your work.

The submissions selected for the main conference tracks and mini-symposia presentations are listed alphabetically below. The final agenda and scheduled times will be posted the first week of June. You may also see the accepted posters here.



Accelerating Python with the Numba JIT Compiler

Stanley Seibert, Continuum Analytics

Numba is an open source, cross-platform just-in-time compiler for Python functions. In a wide range of critical functions found in scientific applications, Numba can generate code nearly as fast as C or FORTRAN. Although originally aimed at NumPy arrays, Numba has expanded to work with other Python types. Come learn how Numba can speed up your application!
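
A minimal sketch of the kind of usage the talk covers (the example function is made up for illustration; numba.jit is the package's actual entry point):

    import numpy as np
    from numba import jit

    @jit(nopython=True)   # compile to machine code; raise rather than fall back to object mode
    def sum_of_squares(x):
        total = 0.0
        for i in range(x.shape[0]):
            total += x[i] * x[i]
        return total

    x = np.random.rand(1000000)
    print(sum_of_squares(x))   # first call triggers compilation; later calls run at native speed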

Agent-Based Modeling in Python with Mesa

Jackie Kazil
David Masad, George Mason University

Agent-based modeling is currently a hole in Python’s robust and growing scientific computing ecosystem. Mesa is a new open-source package meant to fill that gap. It allows users to quickly create agent-based models using built-in core components (such as spatial grids and agent schedulers) or customized implementations; visualize them using an innovative browser-based interface; and analyze their results using Python’s extensive data analysis tools. Mesa is being developed by a group of modeling practitioners with experience in academia, government, and the private sector. It is also completely open-source, encouraging users to contribute to what we hope will be a growing repository of model components which others can reuse and expand upon in future research.
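
As a rough sketch of the package's shape (Agent, Model, and the RandomActivation scheduler are Mesa names; the MoneyAgent example and the exact constructor and step signatures are assumptions, so check Mesa's documentation before relying on them):

    import random
    from mesa import Agent, Model
    from mesa.time import RandomActivation

    class MoneyAgent(Agent):
        # Hypothetical agent: wealth drifts by one unit per step
        def __init__(self, unique_id, model):
            super(MoneyAgent, self).__init__(unique_id, model)
            self.wealth = 1

        def step(self):
            self.wealth += random.choice([-1, 1])

    class MoneyModel(Model):
        def __init__(self, n):
            self.schedule = RandomActivation(self)   # built-in agent scheduler
            for i in range(n):
                self.schedule.add(MoneyAgent(i, self))

        def step(self):
            self.schedule.step()   # advance every agent once, in random order

    model = MoneyModel(10)
    for _ in range(5):
        model.step()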

ASDF, a New Scientific Data Format

Perry Greenfield, Space Telescope Science Institute
Michael Droettboom, Space Telescope Science Institute
Erik Bray, Space Telescope Science Institute

We have developed an alternate disk format for astronomical data based on YAML for the metadata and structural data organization, but also with efficient support for binary data in many forms. The original goal for the Advanced Scientific Data Format (ASDF) was to define a format that circumvented many of the limits imposed by the current astronomical standard format, FITS (Flexible Image Transport System). While the format is intended to be used for astronomical data, there is intrinsically nothing in it that is specific to astronomy, and it should prove useful in a wide variety of scientific and engineering fields. We contrast it with HDF5, highlighting the advantages it provides over this widely used scientific format. In particular, we illustrate that it is much more self-documenting, friendly to smaller, character-based data files, extensible, and more easily adaptable to validation tools, even for custom conventions. We have written a draft specification and have a Python implementation for supporting the format that handles all non-object-based NumPy array variants. We have developed a schema mechanism for validating that the files conform to the required elements for the format, including locally defined conventions that supply schema definitions for those conventions, and the tools for validating the files against these schema definitions.
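
A short sketch of what reading and writing such a file looks like with the Python implementation (assuming the asdf package's AsdfFile interface; the file and tree keys are made up):

    import numpy as np
    import asdf

    # Metadata lives in a human-readable YAML tree; arrays become binary blocks
    tree = {"instrument": "demo", "data": np.arange(100)}
    asdf.AsdfFile(tree).write_to("example.asdf")

    with asdf.open("example.asdf") as af:
        print(af.tree["instrument"], af.tree["data"].shape)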

Astropy in 2015

Erik Tollerud, Hubble Fellow, Astropy/Yale University

The Astropy Project is a community effort to develop a single core package for Astronomy in Python and foster interoperability between Python astronomy packages. I will give a status update on the Astropy core package over the last year, which includes the v1.0 release, as well as plans for the core library in the next year. I will also describe some of the "affiliated packages": Python packages that use Astropy and are associated with the community, but are not actually part of the core library itself. In particular I will focus on the recent growth of the packages involved in this effort, and the tools we have provided to make it easier for working scientists to provide and maintain their own domain-specific packages.
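
For flavor, two staples of the core package, units and coordinates (the values below are arbitrary):

    from astropy import units as u
    from astropy.coordinates import SkyCoord

    # Quantities carry their units through arithmetic and conversions
    print((10 * u.pc).to(u.lyr))

    # Coordinates convert between frames by name
    c = SkyCoord(ra=10.68458 * u.deg, dec=41.26917 * u.deg, frame="icrs")
    print(c.galactic)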

Automated Image Quality Monitoring with IQMon

Josh Walawender, Instrument Astronomer, Subaru Telescope

Automated telescopes are capable of generating images more quickly than they can be inspected by a human, but detailed information on the performance of the telescope is valuable for monitoring and tuning its operation. The IQMon (Image Quality Monitor) package was developed to provide basic image quality metrics of automated telescopes in near real time.

Basic Sound Processing in Python

Allen Downey, Professor of Computer Science, Olin College

Digital signal processing (DSP) has applications in all areas of engineering and science, but DSP methods are not widely known. Python provides an opportunity to make DSP more accessible. In this talk, I present an introduction to DSP focused on sound-processing applications. I present tools for working with digital signals using NumPy, SciPy and IPython. Examples include spectral analysis of music, spectrograms, noise, filtering, and system characterization. This material is based on Think DSP, a work-in-progress book available at think-dsp.com.
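
A small example of the kind of spectral analysis the talk builds up to, using only NumPy (the synthetic two-tone signal is invented for illustration):

    import numpy as np

    fs = 44100                                   # sample rate, Hz
    t = np.arange(fs) / float(fs)                # one second of samples
    wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

    spectrum = np.abs(np.fft.rfft(wave))         # magnitude spectrum
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / fs)
    print("dominant frequency: %.0f Hz" % freqs[spectrum.argmax()])   # ~440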

A Better Default Colormap for Matplotlib

Stéfan van der Walt, University of California, Berkeley
Nathaniel Smith, University of California, Berkeley

The default colormap in Matplotlib is the colorful rainbow map called Jet, which is deficient in many ways: small changes in the data sometimes produce large perceptual differences and vice versa; its lightness gradient is non-monotonic; and it is not particularly robust to color-blind viewing. Thus, a new default colormap is needed, but no obvious candidate has been found. Here, we present our proposed new default colormap for Matplotlib, and expose the theory, tools, data exploration and motivations behind its design.
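
Comparing a candidate map against Jet is a one-keyword change; this sketch assumes a Matplotlib build that ships "viridis", the perceptually uniform map that came out of this effort:

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.rand(30, 30)
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.imshow(data, cmap="jet")        # the old default: non-uniform lightness
    ax2.imshow(data, cmap="viridis")    # a perceptually uniform alternative
    plt.show()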

The Biopython Project: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics

Tiago Antao

Computational biology is seeing an explosion of large datasets coming from genomics and proteomics. In this presentation I will introduce the Biopython project, a large-scale, mature free software project for modern bioinformatics analysis.
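
A taste of the library's core sequence object (the DNA string is arbitrary):

    from Bio.Seq import Seq

    dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
    print(dna.reverse_complement())
    print(dna.translate())   # translate the DNA to a protein sequence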

Blaze + Odo

Phillip Cloud, Continuum Analytics

Blaze separates expressions from computation. Odo moves complex data resources from point A to point B. Together Blaze and Odo smooth over many of the complexities of computing with large data warehouse technologies like Redshift, Impala and HDFS. Because we designed Blaze and Odo with PyData in mind they also integrate well with pandas, numpy, and a host of other foundational libraries. We show examples of both Blaze and Odo in action and discuss the design behind each library.
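
A sketch of the two libraries together (the CSV file and its name/amount columns are hypothetical, and the Blaze entry point is the interactive Data wrapper as I recall it from this era):

    import pandas as pd
    from blaze import Data, by
    from odo import odo

    # Odo: move data between formats and stores with a single call
    df = odo("accounts.csv", pd.DataFrame)

    # Blaze: write the expression once; it can run against many backends
    t = Data(df)
    print(by(t.name, total=t.amount.sum()))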

Building a Cloud-based Data Science Platform with Python

Stephen Hoover, Data Scientist, Civis Analytics

Python is a powerful, easy-to-use language which has a wide range of numerical and machine-learning open-source libraries. At Civis Analytics, we've built a cloud-based platform for data science which empowers analysts to extract insights from their data without specialized machine learning or coding knowledge. The platform itself runs on Amazon Web Services, and the machine learning workflows at the core of the platform are coded in Python. Open-source Python libraries such as pandas, numpy, statsmodels, and scikit-learn let our data scientists focus on high-level workflows and greatly accelerate our development process. In this talk, I'll discuss why Civis likes Python for powering predictive models. I'll talk about how we use Python open-source libraries to help with data analysis, and some of the challenges we've overcome along the way. Issues such as interfacing with other code, handling corner cases, and good coding practices become even more important in a production environment.

Causal Bayesian NetworkX

Michael Pacer, Graduate Student, University of California, Berkeley

Causal Bayesian networks are a powerful framework that has been useful in the rational analysis of human causal reasoning. These probabilistic graphical models excel at representing conditional independence relationships and interventions defined in terms of network modifications. I will discuss the formal properties of causal Bayesian networks, why they are useful, and what their weaknesses are. I will present an implementation in NetworkX that is capable of characterizing structure manipulations over sets of graphs, a JSON-compatible format for storing distribution information in NetworkX node attributes, and a Monte-Carlo sampler that uses the format to sample from conditional probability distributions defined on the network.
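
To give the flavor of the approach, a sketch of storing JSON-compatible distribution metadata on NetworkX nodes and applying an intervention as a graph modification (the node names and attribute layout are illustrative, not the talk's actual format):

    import networkx as nx

    G = nx.DiGraph()
    G.add_edge("Rain", "WetGrass")
    G.add_edge("Sprinkler", "WetGrass")
    # JSON-compatible distribution info stored in node attributes
    G.node["Rain"]["cpd"] = {"prior": [0.8, 0.2]}
    G.node["Sprinkler"]["cpd"] = {"prior": [0.6, 0.4]}

    # The intervention do(Sprinkler=on): sever incoming edges, fix the value
    G_do = G.copy()
    G_do.remove_edges_from(list(G_do.in_edges("Sprinkler")))
    G_do.node["Sprinkler"]["cpd"] = {"prior": [0.0, 1.0]}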

Characterizing the Seafloor with Python as a Toolbox

Johanna Hansen, Woods Hole Oceanographic Institution

This presentation will describe a newly developed workflow for extracting insight about the deep ocean. The workflow can automatically process image, sonar, and sensor data for quick interpretation while providing the foundation for further iterative analysis. I'll cover how we utilized scikit-learn, pandas, and MBSystem among other libraries to move away from commercial software while reducing the time needed to gain understanding of the data.

Circumventing the Linker: Using SciPy's BLAS and LAPACK within Cython

Ian Henriksen, Brigham Young University

In this talk I will discuss SciPy's new Cython API for BLAS and LAPACK; how it provides a model for linking directly against Fortran, and how a similar approach can be used to export low-level APIs that do not require any linking on the part of the user.
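
For example, a Cython module can now cimport BLAS routines straight from SciPy with no linker configuration. A minimal Cython sketch (the wrapper function is illustrative; SciPy >= 0.16 assumed):

    # dotprod.pyx -- cimport BLAS directly from SciPy
    from scipy.linalg.cython_blas cimport ddot

    def dot(double[::1] x, double[::1] y):
        """Dot product via the Fortran BLAS ddot, no linking step required."""
        cdef int n = x.shape[0]
        cdef int inc = 1
        return ddot(&n, &x[0], &inc, &y[0], &inc)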

A Cloud Service to Record Simulation Metadata

Yannick Congo, NIST/Blaise Pascal University
Jonathan Guyer, NIST

The notion of capturing each execution of a script or workflow and its associated metadata is enormously appealing and should be at the heart of any attempt to make scientific simulations reproducible. In view of this, we are developing a backend and frontend service to both store and view simulation metadata records using a robust, data-schema-agnostic approach. See https://gist.github.com/wd15/11f722a546b018525957
This presentation was authored by Daniel Wheeler, Ph.D. and Yannick Congo. It will be presented by Yannick Congo and Jonathan Guyer, Ph.D.

Congress & New Media: A Data Driven Study of Congressional Public Relations

Brian Smith, Ball State University - Department of Political Science

The primary focus of this study is to provide a comprehensive time series analysis of the United States Congress and the extent to which supporting congressional staff have provided the means to pursue modern New Media strategies. Related research and supporting data suggest that turnover activity for House and Senate seats tends to introduce a higher concentration of official New Media functions with each succeeding session of Congress among its individual members, legislative committees, and congressional leadership (though not proportionally). Through an extensive collection and examination of congressional records, this study explores the hierarchical concentration of New Media skillsets found within the staff makeup of select congressional groups. The findings of this study offer supporting evidence to established congressional research regarding how modern public messaging strategies are employed to serve each of the national political parties as well as individual political actors.

Connecting Circuits for Asynchronous Data Workflows

Carson Farmer, University of Colorado, Boulder

Real-time data from environmental sensor networks, social media platforms, animal tracking systems, and more are becoming increasingly available to scientists and practitioners alike. These 'data-streams' provide unprecedented access to real-time information about changes in human and environmental systems. In this example-based talk on streaming data in Python, we outline how an asynchronous, component-based framework such as circuits can facilitate real-time interaction with streaming data sources that leverages the analytical power of the full SciPy/PyData stack.

Dask - Out-of-core NumPy/Pandas through Task Scheduling

James Crist, Graduate Student, University of Minnesota

Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We describe dask, dask.array, dask.dataframe, as well as task scheduling generally.
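
The API mirrors NumPy; a small sketch:

    import dask.array as da

    # 8 GB of doubles, never materialized all at once
    x = da.random.normal(10, 0.1, size=(100000, 10000), chunks=(1000, 1000))
    y = x.mean(axis=0)      # builds a task graph; nothing is computed yet
    print(y.compute())      # executes the graph in parallel, block by block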

Deep Learning: Tips from the Road

Kyle Kastner, Graduate Student, Université de Montréal

Modern neural networks use a host of specially built components to learn powerful, data driven representations of our world. By carefully choosing which components to use for a given task, it is possible to have systems which greatly outperform more naive designs. This talk will cover core components, concepts, and uses of modern "deep learning" systems. Highlights include: initialization and optimization; convolution; recurrence; variational training; and dedicated memory.

Deep Learning Versus Segmentation and ML to Classify Biological Images

Cedric St-Jean, Recursion Pharmaceuticals

We take high-resolution pictures of cells under various conditions, and use various machine learning and statistical techniques to extract signal from the biological noise. I will describe the current state-of-the-art, as well as our own models and results. I will focus on comparing traditional ML classifiers on ad hoc extracted image features with our more recent deep learning forays.

Deploying Python Machine Learning Models in Production

Krishna Sridhar, Dato

Applications using machine learning are quickly becoming the norm. However, for Python data scientists, deploying machine learning models in production is a challenge. This talk examines the current options and talks about one solution in particular, Dato Predictive Services. In a hands-on setting, we will walk through code that easily deploys arbitrary Python code as a REST service.

Dexy and Docker for Scientific Reproducibility

Ana Nelson, Cosmify, Dexy

Scientific reproducibility requires convenient tools for automating both the computational environment and the project workflow all the way from code execution to document generation. Dexy is an open source Python tool for reproducible project workflows in multiple languages, and Docker provides a comprehensive way to specify and generate a standardized computational environment. In this highly practical example-based talk, learn from the author of Dexy how to use these two great tools together to make your work fully reproducible and also well-documented.

DistArray: Distributed Array Computing for Python

Robert Grant, Scientific Software Developer, Enthought
Kurt Smith, Scientific Software Developer, Enthought

DistArray brings the strength of NumPy to data-parallel high-performance computing. It is an up-and-coming Python package that provides *distributed* NumPy-like multidimensional arrays, ufuncs, and hdf5-aware IO. DistArray builds on widely-used Python HPC libraries like IPython Parallel, MPI4Py, and h5py. The project also supports the new Distributed Array Protocol for sharing distributed arrays with external libraries, such as Trilinos, without copying.

A Distributed, Standards-Based Framework for Searching, Accessing, Analyzing and Visualizing Met-Ocean Data: Application to Hurricane Sandy

Richard Signell, USGS
Filipe Fernandes, SECOORA
Kyle Wilcox, Axiom Data Science
Andrew Yan, USGS

A complete distributed, standards-based framework for searching, accessing, analyzing and visualizing meteorologic and oceanographic data will be described, with application to sharing Hurricane Sandy data.

Docker for Improved Reproducibility of Scientific Python Analyses

Matthew McCormick, Kitware

Docker allows scientists to build a reproducible computational environment that can be easily created, versioned, archived, and shared. In this talk we give an introduction to basic Docker concepts and describe how to use it with the scientific Python stack.

Eigenvector Spatial Filtering using NumPy and ArcGIS

Bryan Chastain, University of Texas, Dallas

Ordinary Least Squares regression techniques are often applied to model spatial phenomena. While these techniques are easy to use and understand, they frequently contain spatially autocorrelated residuals, indicating a misspecification error. Several techniques have been proposed to address this issue, including Geographically Weighted Regression (GWR), Spatial Autoregressive models (SAR/CAR), Bayesian Spatially Varying Coefficients (SVC), and others. However, recent work has shown the Eigenvector Spatial Filtering (ESF) approach to be an unbiased, efficient and consistent estimator for linear regression that often outperforms many of these other techniques (Griffith et al., 2009; Griffith and Chun, 2014). Until now, ESF libraries have only been available for R and SAS (Bivand, 2008). This paper demonstrates the ESF approach in Python, which, through PySAL, streamlines the process of getting GIS data into a NumPy-based regression model.

ETE: A Python Programming Toolkit for the Analysis and Visualization of Trees

Jaime Huerta-Cepas, EMBL

ETE is a Python programming toolkit for the analysis, manipulation and visualization of hierarchical trees. It provides methods for tree traversing, annotation, pruning, splitting, rooting, querying and topology comparison. ETE is currently developed as a bioinformatics tool in the field of phylogenomics, also providing specific functionality such as NCBI taxonomy database integration, gene orthology and paralogy prediction, tree reconciliation and phylostratigraphic methods. Finally, ETE’s treeview module provides built-in tree rendering features to produce rectangular and circular trees in a highly customizable and programmatic way. All functionality, including tree visualization, has full support for IPython notebooks - http://etetoolkit.org/ipython_notebook.

Examining Malware with Python

Phil Roth, Endgame

Endgame uses Python extensively for building APIs, constructing data flows, and of course data science. In this talk, I will describe generally how the data science team at Endgame uses Python to build models around security data. I will also describe in detail how Endgame uses Python to process, analyze, and categorize malware.

Exploring Open Access Weather Radar with the Python ARM Toolkit

Jonathan Helmus, Argonne National Laboratory
Scott Collis, Argonne National Laboratory

A number of organizations provide free and open access to data from their weather radars, including the NEXRAD network operated by the National Weather Service, the Federal Aviation Administration’s Terminal Doppler Weather Radars, as well as the scanning cloud and precipitation radars operated by the Atmospheric Radiation Measurement Climate Research Facility (ARM). This data contains a wealth of meteorological and climatological information. However, until recently, tools to access this data through Python were not readily available. The Python ARM Radar Toolkit (Py-ART) is an open source Python module which can read, visualize and analyze data from a number of these weather radars. This presentation will give an overview of how Py-ART can be used to explore and extract meaningful scientific insights from these open access radar datasets.

Exploring Open Source Community Dynamics with BigBang

Sebastian Benthall, Berkeley School of Information

Open mailing lists, version control systems, and issue trackers are rich sources of data for computational social science. Using BigBang, a toolkit built using scientific Python, we explore the dynamics of open source communities. In particular, we look at the relationship between responsiveness in community fora and the onboarding of new members. We also look at emergent properties of the sociotechnical network defined by contributors and the software they are working on.

flotilla: IPython Widgets + Computation + Dataviz = Data-Driven Conversations

Olga Botvinnik

See how IPython widgets can help you quickly iterate on a scientific analysis. This talk will showcase flotilla, a package for integrated data analysis, computation and visualization. To demonstrate, I will use RNA sequencing data from post-mortem brain samples.
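
The underlying widget mechanics look like this (a generic interact example, not flotilla's own API; the import path is the IPython 3-era location, which later moved to ipywidgets):

    import numpy as np
    import matplotlib.pyplot as plt
    from IPython.html.widgets import interact   # ipywidgets in later releases

    def plot_wave(freq=1.0):
        x = np.linspace(0, 2 * np.pi, 200)
        plt.plot(x, np.sin(freq * x))

    interact(plot_wave, freq=(0.5, 5.0))   # re-runs the analysis as the slider moves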

A Framework for Analyzing GADGET Simulation Data in Pandas

Jacob Hummel, University of Texas, Austin

We present a pandas-based framework for analyzing astrophysical simulation data produced by the smoothed-particle hydrodynamics code GADGET.

From 1-Day Release to 1-Min Release (or How to Recover Time with Automation)

Damian Avila, Continuum Analytics

Releasing a complex, multi-language package can be a long and error-prone process. And if you want to support hungry users with recent development builds (because working from source is not trivial), the situation quickly turns into a nightmare. So, how can we solve this problem? The answer: automation. This talk is about how we turned a 1-day release process into a 1-minute task, where we just issue a line at the prompt and everything gets done "magically" without further intervention.

From Zero to Hero in Two Years, Open Collaborative Radar Software and the Secret to our Success

Scott Collis, Argonne National Laboratory
Nick Guy, University of Wyoming
Anderson Gama, SIMEPAR - Sistema Meteorológico do Paraná
Cesar Beneti, SIMEPAR - Sistema Meteorológico do Paraná
Stephen Nesbitt, University of Illinois
Scott Giangrande, Brookhaven National Laboratory
Maik Hiestermann, University of Potsdam
Kai Muehlbauer, University of Bonn

As computer languages have undergone swift development in recent decades, much of the oceanic and atmospheric community has been slow to shift from strongholds such as Fortran. While there have always been a number of open source solutions for the radar meteorology community to process data, projects have been stove-piped. Scientists tend to build applications on top of these software stacks in order to publish, and the resultant code often is not used outside of the scientist’s institution or direct collaborators. This presentation will outline an alternate path using community based open source software. We will discuss the Python-ARM Radar Toolkit and recent community success in performing common radar processing tasks by combining multiple tools often written on different continents.

Geodynamic Simulations in HPC with Python

Nicola Creati, OGS
Roberto Vidmar, OGS
Paolo Sterzai, OGS

A software suite to simulate large scale geodynamic processes on HPC facilities without C or Fortran knowledge.

Global Hydrology Analysis Using Python

Mattheus Ueckermann, Creare

We present the release of pyDEM, an open source Python/Cython library for world-wide geospatial terrain analysis. Past terrain-analysis libraries have been hindered by slow run-time and an inability to account for digital elevation model tile boundaries (for example, artifacts are produced from no drainage of water between neighboring tiles)—limiting them to analysis of only a few thousand square kilometers. pyDEM’s fast run-time and artifact-free analysis enables hydrological terrain analysis of the entire land surface of the Earth, producing a terabyte of hydrological-analysis data in a few days on a single compute node.

HDF5 is Eating the World

Andrew Collette, University of Colorado, Boulder

Over the past few years, HDF5 has rapidly emerged as the "go-to" technology for storing and sharing large volumes of numerical data in Python, and now powers dozens of packages from pandas to Astropy. We discuss the explosion of Python science and engineering tools which have standardized on HDF5, explore why you should use it for your next project, and look ahead to the highly-anticipated parallel features in the next major release of HDF5 and their impact on the Python community.
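
The h5py idiom at the center of much of this ecosystem (file, dataset, and attribute names are made up):

    import numpy as np
    import h5py

    with h5py.File("experiment.h5", "w") as f:
        dset = f.create_dataset("readings", data=np.random.rand(1000, 3))
        dset.attrs["units"] = "volts"            # metadata travels with the data

    with h5py.File("experiment.h5", "r") as f:
        print(f["readings"][:5], f["readings"].attrs["units"])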

HoloViews: Building Complex Visualizations Easily for Reproducible Science

Jean-Luc R. Stevens, University of Edinburgh
Philipp Rudiger, University of Edinburgh
James A. Bednar, University of Edinburgh

Scientific visualization typically requires large amounts of custom coding that obscures the underlying principles of the work and makes it more difficult to share and reproduce the results. Here we describe how the new HoloViews Python package, combined with the IPython Notebook, provides a rich interface for flexible and nearly code-free visualization of your results while storing a full record of the process for later reproduction.
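
A sketch of the declarative style in an IPython Notebook cell (data is synthetic; the notebook_extension call follows the HoloViews docs of this era):

    import numpy as np
    import holoviews as hv
    hv.notebook_extension()            # enable rendering in the IPython Notebook

    xs = np.linspace(0, 2 * np.pi, 100)
    curve = hv.Curve((xs, np.sin(xs)), label="signal")
    image = hv.Image(np.random.rand(50, 50), label="field")
    curve + image                      # a composed, fully rendered layout, with no plotting code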

The James Webb Space Telescope Data Calibration Pipeline

Howard Bushouse, Space Telescope Science Institute

The James Webb Space Telescope (JWST) is the successor to the Hubble Space Telescope (HST) and is currently expected to be launched in late 2018. The Space Telescope Science Institute (STScI) is developing the pipeline systems that will be used to provide routine calibration of the science data received from JWST. The JWST calibration pipelines use a processing environment provided by a Python module called "stpipe" that provides many common services to each calibration step, relieving step developers from having to implement such functionality. The stpipe module provides multi-level logging, command-line option handling, parameter validation and persistence, and I/O management. Individual steps are written as Python classes that can be invoked individually from within Python or from the stpipe command line. Pipelines are created as a set of step classes, with stpipe handling the flow of data between steps. The stpipe environment includes the use of standard data models. The data models, defined using JSON Schema, provide a means of validating the correct format of the data files presented to the pipeline, as well as presenting an abstract interface to isolate the calibration steps from details of how the data are stored on disk.

Jupyter/IPython: State of Multiuser and Real-Time Collaboration

Matthias Bussonnier, UC Berkeley BIDS / IPython / Jupyter
Kester Tong, Google

Having real-time multi-user editing of documents is now taken for granted by most users of web-based technologies. However, in the case of a scientific document backed by a live running kernel, there are a lot more technical and design challenges that have to be taken into account. In this talk we discuss our vision for integrating real-time collaboration into the Jupyter/IPython notebook, and how we plan to implement it, both from a user’s point of view and from a technical point of view. We also describe the current state of real-time collaboration and the future schedule for the different components.

Keep on Releasin': Continuous Delivery for Open Source

Philip Elson, Met Office

“How often should our software be released?” is a discussion which has been had on most project mailing lists. Typically the answer is a compromise between short release cycles and balancing the significant overhead of actually doing the release itself. In this talk I will look at entirely automating regular releases by combining tools such as docker and conda with freely available CI platforms such as Travis-CI, AppVeyor and CircleCI. I will demonstrate recent work on continuous delivery (CD) for open source projects, discuss some of the implications that an automated release process can have and show you that CD is an attainable goal for the software that we develop.

klepto: Unified Persistent Storage to Memory, Database, or Disk

Michael McKerns, UQ Foundation

*klepto* is a new Python package that provides a unified programming interface for caching and archiving to memory, disk, or database. *klepto* provides a dictionary interface to caches and archives, where all caches can also be applied to any Python callable as a decorator. *klepto* can be used to create dual caching strategies for speed and robustness, with design abstractions for things like multiple Python processes using local memory caching asynchronously coupled to longer-term centralized storage on disk.
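
A sketch of the dictionary-and-decorator style described above (names follow klepto's documentation as I recall it; treat the exact signatures as assumptions):

    from klepto import lru_cache
    from klepto.archives import file_archive

    # A dict-like archive backed by a file on disk
    archive = file_archive("results.pkl")
    archive["run1"] = 0.93
    archive.dump()                     # push the in-memory cache to disk
    archive.load()                     # pull archived entries back into memory

    # The same archive as long-term storage behind an in-memory LRU cache
    @lru_cache(cache=archive)
    def expensive(x):
        return x ** 2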

Librosa: Audio and Music Signal Analysis in Python

Brian McFee, New York University

This talk covers the basics of audio and music analysis with librosa, and provides an overview and historical background of the project.
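
Two of the basics the talk walks through, beat tracking and feature extraction (using the example clip bundled with librosa releases of this era):

    import librosa

    y, sr = librosa.load(librosa.util.example_audio_file())
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    print("tempo: %.1f BPM, MFCC shape: %s" % (tempo, str(mfcc.shape)))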

The Need for New Tools for Materials Discovery

Christopher Wilmer, University of Pittsburgh

Exploring “hypothetical” materials is an emerging area in engineering research. Simple hypothetical systems, like binary alloys or small organic molecules, have been the focus of research so far, but designing large supramolecular structures (such as molecular machines) or complex hierarchical materials (that are common in Nature) has been challenging due to the limitations of existing tools. In this talk, we discuss progress towards more advanced tools for hypothetical materials research.

An Open-Source Data Archive for Expert-Annotated Dermoscopic Images

Brian Helba, Kitware Inc

We discuss the development and operation of an open-source web-based publicly-available data archive, providing images of skin lesions, associated diagnostic metadata, and annotations of clinical features by leading experts in dermatology.
This talk was authored by:
Brian Helba, Kitware, Inc.
David Gutman, Emory University
Rich Stoner, University of California, San Diego
Allan Halpern, Memorial Sloan Kettering Cancer Center
Michael Marchetti, Memorial Sloan Kettering Cancer Center
Stephen Dusza, Memorial Sloan Kettering Cancer Center

Optimal Control and Parameter Identification of Dynamical Systems with Direct Collocation using SymPy

Jason Moore, Lead Developer, PyDy

There are a variety of techniques for approaching the optimal control and parameter identification problems of dynamical systems. Traditionally, discrete methods have been utilized for linear systems, and various shooting optimization techniques for non-linear systems. More recently, the direct collocation method has been used to formulate these two problems in terms of a non-linear programming (NLP) problem where large-scale sparse optimizers can be utilized to find the optimal solution. The method has proved valuable because the computation time can be reduced by many orders of magnitude relative to shooting, local minima are less of a problem, and unstable systems can easily be dealt with. The translation of an optimal control or parameter identification problem into a non-linear programming problem is not trivial. I will present a lightweight Python package that translates high-level symbolic descriptions of a dynamic system and the optimization objectives into an efficient implementation of an NLP problem which can then be passed to a variety of solvers, such as the open source IPOPT. This package, opty, allows the user to define a problem in very few lines of code which directly mirror the math that defines the high-level description of the problem. opty can be used to solve a wide variety of problems and I will demonstrate its effectiveness and ease of use on both classic problems and some research-grade problems in the biomechanics and vehicle dynamics domains.

pgmpy: Probabilistic Graphical Models using Python

Ankur Ankan
Abinash Panda

Probabilistic Graphical Models (PGMs) allow compact representation of Joint Probability Distributions by exploiting the independence conditions between the random variables. Furthermore, querying marginal or conditional probabilities using PGMs is much cheaper computationally than using the full distribution. pgmpy is a Python library that allows us to create different types of graphical models, run inference over them, and learn optimal parameters from data.
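
A minimal sketch of that workflow on a two-node rain/wet-grass network (import paths and the expected shapes of the CPD arrays have shifted across pgmpy releases, so treat the exact layout below as an assumption):

    from pgmpy.models import BayesianModel
    from pgmpy.factors import TabularCPD
    from pgmpy.inference import VariableElimination

    model = BayesianModel([("Rain", "WetGrass")])
    cpd_rain = TabularCPD("Rain", 2, [[0.8], [0.2]])
    cpd_wet = TabularCPD("WetGrass", 2,
                         [[0.9, 0.2],      # P(WetGrass=0 | Rain=0), P(WetGrass=0 | Rain=1)
                          [0.1, 0.8]],     # P(WetGrass=1 | ...)
                         evidence=["Rain"], evidence_card=[2])
    model.add_cpds(cpd_rain, cpd_wet)

    infer = VariableElimination(model)
    print(infer.query(["Rain"], evidence={"WetGrass": 1}))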

The Polyglot Beaker Notebook

Scott Draves, Two Sigma Open Source

The Beaker Notebook is a new open source tool for collaborative data science. Like IPython, Beaker uses a notebook-based metaphor for idea flow. However, Beaker was designed to be polyglot from the ground up. That is, a single notebook may contain cells from multiple different languages that communicate with one another through a unique feature called autotranslation. You can set a variable in a Python cell and then read that variable in a subsequent R cell, and everything just works – magically. Beaker comes with built-in support for Python, R, Scala, Groovy, Julia, and Javascript. In addition, Beaker also supports multiple kinds of cells for text, like HTML, LaTeX, Markdown, and our own visualization library that allows for the plotting of large data sets. This talk will motivate the design, review the architecture, and include a live demo of Beaker in action.

Practical Integration of Processing, Inversion and Visualization of Magnetotelluric Geophysical Data

Gudni Rosenkjaer, University of British Columbia

Having well-contained software to perform essential tasks is beneficial in our geophysical workflow. We'll discuss the ongoing implementation of geophysical software for the Magnetotelluric problem that builds on the SimPEG framework (Simulation and Parameter Estimation in Geophysics), and show how we implement our code and workflows to deal with the overhead of processing data, running physical simulations, completing inversions, and visualizing the results.

pycalphad: Computational Thermodynamics in Python

Richard Otis, Pennsylvania State University

All physical simulations which treat diffusion and/or phase transformations assume some kind of thermodynamic model for calculating driving forces and predicting the relative stability of phases. It has been difficult to couple these models to scientific Python codes due to a lack of library support. Here we present pycalphad, a free and MIT-licensed pure Python library for designing thermodynamic models, calculating phase diagrams and investigating phase equilibria within the CALculation of PHAse Diagrams (CALPHAD) method. The library provides routines for reading Thermo-Calc database (TDB) files and for solving the multi-component, multi-phase Gibbs energy minimization problem. Beyond coupling with simulation codes, the purpose of this project is to provide any interested people the ability to tinker with and improve CALPHAD models without having to be a computer scientist or expert programmer.

PyEDA: Data Structures and Algorithms for Electronic Design Automation

Chris Drake, Engineer, Google

I will present PyEDA, a Python library for electronic design automation (EDA). PyEDA provides both a high level interface to the representation of Boolean functions, and blazingly-fast C extensions for fundamental algorithms where performance is essential. You will learn about Boolean satisfiability (SAT), binary decision diagrams, techniques for formal equivalence checking, constraint solving, and more.
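
For instance, the expression interface and SAT solving look like this (the formula is arbitrary):

    from pyeda.inter import exprvar, expr2bdd

    a, b, c = exprvar("a"), exprvar("b"), exprvar("c")
    f = (a & ~b) | c

    print(f.satisfy_one())   # one satisfying assignment, via the built-in SAT interface
    bdd = expr2bdd(f)        # the same function as a binary decision diagram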

PyNeb: Nebular Analysis Tools for Astrophysics

Richard Shaw, National Optical Astronomy Observatory

PyNeb, a new open-source package for the study of gaseous nebulae, provides a powerful toolkit for visualizing and exploring atomic data, determining physical diagnostics and ionic abundances of ionized gas. PyNeb is the successor to the IRAF "nebular" package, which has seen wide use over the last 20 years. I will demonstrate the capabilities of this package and describe some recent science results that have already made use of PyNeb. See the package homepage at: http://www.iac.es/proyecto/PyNeb/

PyRK: A Python Package for Nuclear Reactor Kinetics

Kathryn Huff, Fellow, University of California, Berkeley

In this work, PyRK, a new Python package for nuclear reactor kinetics, is introduced and demonstrated. PyRK (Python for Reactor Kinetics) has been designed for coupled thermal-hydraulics and neutronics in 0-dimensional, transient nuclear reactor analysis. PyRK is intended for analysis of many commonly studied transient scenarios, including normal reactor startup and shutdown as well as abnormal scenarios such as Beyond Design Basis Events (BDBEs) like Accident Transients Without Scram (ATWS). This package allows nuclear engineers to rapidly prototype nuclear reactor control and safety systems in the context of their novel nuclear reactor designs. One application of this package presented here captures the evolution of accident scenarios in a novel reactor design, the Pebble-Bed, Fluoride-Salt-Cooled, High-Temperature Reactor.

PySPLIT: a Package for the Generation, Analysis, and Visualization of HYSPLIT Air Parcel Trajectories

Mellissa Cross

The HYSPLIT (HYbrid Single-Particle Lagrangian Integrated Trajectory) modeling utility is ubiquitous in the meteorological community. It outputs air parcel paths projected forward or backward in time, hereafter referred to as trajectories. Previously, there were no packages available to facilitate work with HYSPLIT in the mainstream scientific Python ecosystem. This talk introduces PySPLIT, a Python package that facilitates the HYSPLIT trajectory analysis workflow by providing an elegant, intuitive API for generating, inspecting, and plotting trajectory paths and data.

PyStruct - Structured Prediction in Python

Andreas Mueller, New York University Center for Data Science

Structured prediction is a generalization of classification to sequences, graphs and more general output spaces. It is a well-established method in computer vision and natural language processing, in a variety of flavors. The talk will introduce basic concepts and show how to perform structured prediction with the PyStruct library.

Python in Data Science Research and Education

Randy Paffenroth, Worcester Polytechnic Institute

This talk will focus on the use of Python, IPython notebooks, scikit-learn, NumPy, SciPy, and pandas in Data Science graduate education and research. In particular, we will focus on how Python can be used in a Data Science Master’s program and how that work can then be transitioned into research projects.

Python Derived Imaging Biomarkers of Dementia

Ross Mitchell, Professor of Radiology, Mayo Clinic College of Medicine, Mayo Clinic Arizona

Alzheimer’s disease (AD) is the most common neurodegenerative dementia and a major public health issue. A non-invasive MRI based biomarker of disease progression would greatly aid therapy evaluation. We used a variety of Python libraries to quantify tissue structure and composition from brain MR images. Our image analysis method achieved an accuracy of 95% in differentiating between normal and AD groups. We believe our new analysis method could improve the diagnostic and predictive power of clinical trials.

Python as a First Programming Language for Biomedical Scientists

Jeannie Irwin, University of Pittsburgh

In this presentation we will discuss our decade-long experience of using Python as a first programming language for graduate students in biomedical science programs. We will also discuss our forward-looking plans for Python as data science becomes a more embedded element in healthcare.

Python in Tidal Energy: Three Tools Used in a Collaboration on Array Optimization

Kristen Thyng, Assistant Research Scientist, Texas A&M University

Tidal energy is a means of generating electricity by utilizing fast moving currents via rotating turbines. An outstanding question in this field is how to arrange turbines within a given permitted lease area to maximize some goal. Ultimately, the goal is to generate electricity at a low enough cost and a high enough rate of return on investment. These goals may be affected by considerations such as cable costs, depth of the sea water, characteristics of the flow area, array layout, wake interactions, and turbine characteristics. Additional considerations may need to include limiting impact to the environment. A collaboration between the authors aims to improve the methodology and understanding of tidal array optimization. A Python-based tool from each author is used in this study. Dr. Funke runs 2D finite element simulations (using FEniCS) with tidal turbine farms modeled as bottom friction (OpenTidalFarm) in order to optimize the locations of the turbines with respect to some goal (such as power generation). Dr. Roc ties in the economics and engineering side of the problem with his Python-based GUI and function which adds constraints onto the optimization work done in OpenTidalFarm. Finally, Dr. Thyng examines the changes in the system flow fields due to the turbine farm which could have potential environmental consequences. In future work, limiting these environmental impacts will also be incorporated into the OpenTidalFarm farm placement optimization.

Python Tools for Space Mission Data Analyses

Michael Aye, LASP

Python tools for the analysis of several kinds of planetary space mission data are being presented. Missions concerned include the Mars Reconnaissance Orbiter, Lunar Reconnaissance Orbiter and the Cassini Saturn mission. Topics include data categorization, geo-referencing, cluster analysis and embarrassingly parallel applications.

Qiita: Report of Progress Towards an Open Access Microbiome Data Analysis and Visualization Platform

Adam Robbins-Pianka, University of Colorado, Boulder
Yoshiki Vazquez-Baeza, University of California, San Diego

Qiita (canonically pronounced “cheetah“) is a new, free, open source platform for running microbial community analysis using QIIME and EMPeror through a browser-based GUI. We also introduce a central deployment of the system at qiita.microbio.me, where open-access is emphasized and data can be shared and used in meta-analyses incorporating any samples available on the system. We will discuss the design of the system’s backend, including data storage practices for multiomics data, metadata, results, as well as system data. Lastly we will showcase EMPeror in a real-world meta-analysis, and demonstrate the benefits of using standardized sample metadata.

Rapid Accurate and Simple Segmentation of Objects in Medical Images

Ross Mitchell, Professor of Radiology, Mayo Clinic College of Medicine, Mayo Clinic Arizona

The clinically standard approaches to assess cancer treatment response rely upon simplified, diameter-based estimates of lesion size in medical images. These estimates suffer from low accuracy, and poor correlation with treatment response. We recently developed a segmentation algorithm that leverages the massive parallelism of commodity GPUs. Though rapid, the user must still tune image intensity parameters for accurate segmentation. We used Python to implement a classification-based method that only requires users to label foreground and background seed points in the medical image volumes. Our method achieved high accuracy when segmenting complex biological structures. Our algorithm could improve treatment monitoring in cancer clinical trials.

RESTful HDF

John Readey, HDF Group

In the SciPy community, HDF5 has rapidly emerged as the de facto technology for storing and sharing large volumes of numerical data. With the HDF Server (h5serv) project, a Python-based web service, HDF5 data can now be read and written over HTTP using a REST API. As a web service, this opens up many avenues to utilize HDF in ways that would have been difficult to achieve previously. In addition to the service itself, h5pyd is a package written specifically for Python clients that enables access to the h5serv REST API in a way that is compatible with the popular h5py package.
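
Client code keeps the h5py idioms; only the file constructor changes (the domain name, endpoint, and dataset path below are placeholders):

    import h5pyd

    # Open an HDF5 "file" by domain name, served over HTTP by h5serv
    f = h5pyd.File("tall.data.hdfgroup.org", "r", endpoint="http://127.0.0.1:5000")
    dset = f["/g1/g1.1/dset1.1.1"]
    print(dset[0, :])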

Scientific Computing for Undergraduates: Getting Students into the Kitchen

Amelia McBee Henriksen, Brigham Young University

Teaching undergraduate students applied mathematics without providing them a strong foundation in scientific computing is like teaching students to cook without ever letting them near a kitchen: it just doesn’t work. This presentation expounds on interdisciplinary methods toward teaching applied mathematics, using Brigham Young University’s Applied and Computational Mathematics Emphasis as a vehicle for that discussion.

Scientific Data Analysis and Visualization with VTK and ParaView

Cory Quammen, Kitware

I discuss the integration of Python in VTK and ParaView, open-source software packages for producing 3D visualizations of scientific data.

Scientific Python using Mobile OS

Roberto Colistete Junior, UFES - Federal University of Espirito Santo (Brazil)

Python and its scientific modules are available in some mobile OS (operating systems), so more than one billion smartphones and tablets can run scientific Python. Here I will present scientific Python from the perspective of users, developers and maintainers.

scikit-bio: A Bioinformatics Library for Data Scientists, Students, and Developers

Jai Ram Rideout, Northern Arizona University
Evan Bolyen, Northern Arizona University

Python is widely used in computational biology, with many high profile bioinformatics software projects being largely or entirely written in Python. Until now, there has not been a bioinformatics library that has a stable, consistent API; interoperates with the scientific Python computing stack; supports high-performance operations; and integrates directly with bioinformatics educational materials. This year we present the first beta version of scikit-bio, which will be released in conjunction with SciPy 2015. This talk will introduce the core bioinformatics data structures in scikit-bio, including representations of biological sequences, multiple sequence alignments, and phylogenetic trees, and demonstrate several key features.
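
A glimpse of the sequence API (the sequence is arbitrary; method names follow the scikit-bio documentation for this beta):

    from skbio import DNA

    seq = DNA("ACCGGTTAAC")
    print(seq.reverse_complement())
    print(seq.gc_content())          # fraction of G/C bases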

Signal Processing and Communications: Teaching and Research Using IPython Notebook

Mark Wickert, University of Colorado

This talk will take the audience through the story of how an electrical and computer engineering faculty member has come to embrace Python, in particular IPython Notebook, as an analysis and simulation tool for both teaching and research in signal processing and communications. Legacy tools such as MATLAB are well established (entrenched) in this discipline, but engineers need to be aware of alternatives, especially in the case of Python where there is such a vibrant community of developers and a seemingly well designed language. In this talk case studies will also be used to describe domain specific code modules that are being developed to support both lecture and lab oriented courses going through the conversion from MATLAB to Python.

Spatial Income Inequality Dynamics in PySAL

Wei Kang, GeoDa Center for Geospatial Analysis & Computation
Sergio Rey, GeoDa Center for Geospatial Analysis and Computation

Discrete Markov chain theory has been widely applied to the study of regional income distribution dynamics and convergence. Despite its popularity, several issues pertaining to spatial effects and discretization remain unexamined. The spatial dynamics module in the Python Spatial Analysis Library (PySAL) is intended to address these issues by developing and implementing enhanced Markov methods including spatial Markov chain (SMC), fractional Markov chain, rank-based Markov chain and LISA-Markov chain. This talk will be centered on SMC and simulation strategies of SMC for examining properties of tests for spatial dependence in income inequality dynamics.

So How ARE Scientists Using Python on Supercomputers?

Michael Milligan, Minnesota Supercomputing Institute, University of Minnesota

The Minnesota Supercomputing Institute is a large facility at a large research university, with over 3000 users from over 100 departments. As such, we serve the "long tail" of scientific computing: the numerous users whose applications are too small for (or are prototypes for) headline national supercomputers, but too big for a laptop. Improved monitoring capabilities now allow us to sample in detail the software being run by our users, including learning how often and in what capacity Python is being used. In this talk we will share statistics, case studies, and lessons learned from supporting diverse applications of Python for an extremely heterogeneous scientific user base.

State of the Library: matplotlib

Thomas Caswell, Brookhaven National Laboratory
Benjamin Root

News, recent enhancements, and future plans of matplotlib.

Statistical Learning of Human Brain Structure in DIPY

Ariel Rokem, The University of Washington eScience Institute
Franco Pestilli, Indiana University
Bagrat Amirbekian, University of California, San Francisco
Stefan van der Walt, University of California, Berkeley
Brian Wandell, Stanford University
Eleftherios Garyfallidis, University of Sherbrooke

Diffusion MRI in Python (Dipy; http://dipy.org) is a free, open-source software library for the analysis of diffusion MRI (dMRI), a medical imaging technique that is used to make inferences about the structure, connectivity, and tissue properties of the human brain. In this presentation, I will discuss the application of principles from statistical learning to dMRI, and the implementation of these techniques in Dipy.

Story Time with Bokeh

Bryan Van de Ven, Continuum Analytics

With support from the DARPA XDATA Initiative, and contributions from community members, the Bokeh visualization library (http://bokeh.pydata.org) has grown into a large, successful open source project with heavy interest and following on GitHub (https://github.com/bokeh/bokeh). The principal goals of Bokeh are to provide capability to developers and domain experts to:
* easily create novel and powerful visualizations
* extract insight from remote, possibly large data sets
* publish to the web for others to explore and interact with
This talk will describe how the architecture of Bokeh enables these goals, and demonstrate how it can be leveraged by anyone using Python for analysis to visualize and present their work.
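
A minimal, standalone example of the web-publishing workflow (the output file name is arbitrary):

    import numpy as np
    from bokeh.plotting import figure, output_file, show

    x = np.linspace(0, 4 * np.pi, 200)

    output_file("sine.html")       # a self-contained HTML page, shareable on the web
    p = figure(title="sine", tools="pan,wheel_zoom,box_zoom,reset")
    p.line(x, np.sin(x), line_width=2)
    show(p)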

Striplog: Wrangling 1D Subsurface Data

Matt Hall, Agile Geoscience

Striplog is a new library attempting to make it easier to work with 1D subsurface data, especially irregularly sampled, interval-based qualitative data such as cuttings descriptions, special core analyses, and stratigraphic intervals. It was conceived and built for a particular use case (translating cuttings descriptions into striplogs), but could be the start of a new framework for working with these and other kinds of well data. There is very little open source software in this space today, so this is a rallying cry for contributions to a framework that could be useful in academia, government, and industry.

Structural Cohesion: Visualization and Heuristics for Fast Computation with NetworkX and matplotlib

Jordi Torrents, NetworkX

The structural cohesion model is a powerful sociological conception of cohesion in social groups, but its diffusion in empirical literature has been hampered by operationalization and computational problems. We present useful heuristics for computing structural cohesion that allow a speed-up of one order of magnitude over the algorithms currently available. Both the heuristics and the exact algorithm have been implemented in NetworkX by the first author. Using as examples three large collaboration networks (co-maintenance of Debian packages, co-authorship in Nuclear Theory and High-Energy Theory), we will illustrate our approach to measure structural cohesion in relatively large networks. We also introduce a novel graphical representation of the structural cohesion analysis, implemented using matplotlib, to quickly spot differences across networks.

Teaching with IPython/Jupyter Notebooks and JupyterHub

Jessica Hamrick, University of California, Berkeley
Min Ragan-Kelley, IPython
Kyle Kelley, Rackspace

How does one teach a class of 220 students using the IPython/Jupyter notebook? In this talk, I will describe an answer to this question: by using JupyterHub, a multi-user platform for hosting and running IPython/Jupyter notebooks.

Touch your Data! Color 3D Printing with Python

Joe Kington, Chevron

3D printing can be a communication tool, not a gimmick. Here's how to use Python libraries to build color 3D printable models from scientific datasets.

Towards a Better Documentation System for Scientific Python

Carlos Cordoba, Continuum Analytics

Our aim in this talk is to present a new library (tentatively called oinspect) which takes a docstring written in reStructuredText (or otherwise) and creates two different representations of it: a rich one (based on HTML and including images, highlighted doctests and rendered LaTeX) and a plain text one (including extra information, like an object's class and constructor docstrings).

TrendVis: An Elegant Interface for Dense, Sparkline-Like, Quantitative Visualizations of Multiple Series Using Matplotlib

Mellissa Cross

TrendVis is a Python package that uses matplotlib to create a highly customizable, highly readable, information-dense plot style for the visualization of multiple datasets against a common parameter.

Typing Arrays with DyND

Mark Wiebe, Thinkbox Software

When working with array data, flexibility is key, but exact layout in memory is often essential for performance and to feed hungry algorithms. In this talk, I will show how DyND provides expressive, detailed control through its type system, while letting you play in the Python style you know and love.

UDL: Unified Interface for Deep Learning

Haitham Elmarakeby, Virginia Tech University

UDL is a Python object-oriented library that provides a unified interface for using and integrating deep learning libraries. The library is a set of classes implemented on top of well-known libraries such as Pylearn2 and Caffe. Using UDL enables users to easily supply their own data sets, train models, get predictions, and score performance. We provide a simple use case that shows how UDL makes it easy to integrate different components implemented in Pylearn2, Caffe, and scikit-learn in the same pipeline.

Using Python to Span the Gap between Education, Research, and Industry Applications in Geophysics

Lindsey Heagy, University of British Columbia: Geophysical Inversion Facility

As researchers, we require tools that facilitate exploration of scientific concepts and methodologies. Additionally, these methodologies and concepts must be disseminated to practitioners and students. We will discuss how we have used the Python environment to “package” our geophysical software framework SimPEG (Simulation and Parameter Estimation in Geophysics) at various levels of abstraction tailored to researchers, students and practitioners.

VisPy: Harnessing The GPU For Fast, High-Level Visualization

Luke Campagnola, University of North Carolina, Chapel Hill

The growing availability of large, multidimensional data sets has created demand for high-performance, interactive visualization tools. VisPy leverages the GPU to provide fast, interactive, and beautiful visualizations in a high-level API. This presentation will introduce the main features, architecture, and techniques used in VisPy.

Visualizing Physiological Signals in Real Time

Sebastian Sepulveda, Universidad de Valparaíso

This work presents software to visualize and record physiological signals, such as electrocardiography (ECG) and electromyography (EMG), in real time. The software is also capable of real-time analysis, such as filtering and spectral estimation.

Welcome to the Algo Wars: Leveraging Design Thinking for Building Scalable Enterprise Intelligent Systems using Python

Zubin Dowlaty, Mu Sigma
Subir Mansukhani, Mu Sigma
Bharat Upadrasta, Mu Sigma

Join us to learn how Python is playing an essential role in the building of operational predictive systems in Fortune 500 enterprises. We will discuss concepts and best-practice architectures we have witnessed that all data scientists must learn and implement today in order to be successful in production. We will also explore lessons from an interdisciplinary perspective: Design Thinking, and how we can benefit from it. Recent production deployments in the Internet of Things and high-frequency trading verticals will be used as reference points. Get prepared for the algo wars that are upon us. Anticipation denotes intelligence.

What the FORTRAN is ** Doing in Python?

En Zyme, Proteasome Digest

Exponentiation is just extended multiplication, or is it? By the same token, multiplication is merely repeated addition, and subtraction is simply adding the negative of a number. Which leaves division as a recursive function utilizing subtraction. All of this works fine in theory. We should only need one operation, addition, and the ability to put a minus sign in front of a number. In reality, Real numbers don't really exist, there are only a finite number of Integers, negative zero is not always the same as positive zero, and Complex numbers really are. And to make matters worse we can never divide by zero, or can we? This talk is a retrospective and prospective of the intricacies of reification of 'number' and the futility of mathematical operations. Examples drawn from FORTRAN, C, Julia, and Haskell will be compared and contrasted with Python through the lenses of NumPy and SciPy, and the reductive demands of the Sciences.

Who Needs Standard Stars Anyway? Telluric Model Fitting with TelFit

Kevin Gullikson, University of Texas

Many astronomical spectra are contaminated by absorption from the Earth's atmosphere, so-called telluric contamination. I will introduce and describe TelFit, a python package to accurately model, fit, and remove the telluric contamination from observed spectra.

Widgets and Astropy: Accomplishing Productive Research with Undergraduates

Matthew Craig, Minnesota State University Moorhead

This talk describes a set of IPython notebooks with a widget interface that are based on astropy, a community-developed package of fundamental tools for astronomy. The widget interface makes astropy a much more useful tool for undergraduates and other non-experts doing research in astronomy, filling a niche for software that connects beginners to research-grade code.

Will Millennials Ever Get Married? Survival Analysis and Marriage Data

Allen Downey, Olin College

Recent studies report that an increasing share of Americans have never married, which suggests that current young adults might marry at lower rates than previous generations. Using data from a national survey, we find that successive generations are getting married later, but our predictions suggest that the fraction of people who eventually marry will not change substantially. Our analysis uses pandas for data extraction and cleaning, bootstrap methods for working with stratified surveys, lifelines for survival analysis, and time series analysis with statsmodels. All code and data for this study are in a public repository.
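
The survival-analysis step rests on lifelines' Kaplan-Meier estimator; a toy sketch with invented numbers (a 0 marks a respondent not yet married, i.e. a censored observation):

    import numpy as np
    from lifelines import KaplanMeierFitter

    age = np.array([22, 25, 31, 28, 40, 35])     # age at marriage, or age at interview
    married = np.array([1, 1, 0, 1, 0, 1])       # 0 = never married so far (censored)

    kmf = KaplanMeierFitter()
    kmf.fit(age, event_observed=married)
    print(kmf.survival_function_)                # estimated fraction still unmarried by age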

Wrapping C and C++ Libraries with CastXML

Brad King, Kitware
Bill Hoffman, Kitware
Matthew McCormick, Kitware
Michka Popoff

Automated wrapping of C and C++ libraries requires parsing of the source code and generation of an abstract syntax tree that can be consumed by other tools. For this task we introduce CastXML, the next generation of GCC-XML, based on the Clang parser.

xray: N-D Labeled Arrays and Datasets

Stephan Hoyer, The Climate Corporation

xray is an open source project and Python package that aims to extend the labeled data power of pandas from tabular to physical datasets, by providing N-dimensional variants of the core pandas data structures. On top of the NumPy array, xray adds labeled dimensions (e.g., "time") and coordinate values (e.g., "2015-04-10"), which it uses to enable a host of operations powered by these labels: selection, aggregation, alignment, broadcasting, split-apply-combine, interoperability with pandas and serialization to netCDF/HDF5. Recently, xray has been integrated with dask, which provides easy parallelism and allows xray's labeled data operations to scale to data that does not fit into memory.
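
A short sketch of those labeled operations (the values are random and the coordinates invented):

    import numpy as np
    import pandas as pd
    import xray

    temps = xray.DataArray(
        np.random.rand(3, 2),
        coords={"time": pd.date_range("2015-04-10", periods=3),
                "city": ["SF", "NYC"]},
        dims=("time", "city"),
        name="temperature")

    print(temps.sel(city="SF"))       # label-based selection
    print(temps.mean(dim="time"))     # aggregate over a named dimension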