MODULE 1: MAKING SENSE OF UNSTRUCTURED DATA¶
Case Study - Data Analysis with Human-Generated Text¶
Instructor: Tamara Broderick
In this document, we walk through some tips to help you carry out your own analysis of MIT EECS faculty data using stochastic variational inference (SVI) for LDA.
- Scraping your own dataset
- Pre-processing the dataset
- Implementing your own LDA code
Implementing your own SVI-LDA code
Latent Dirichlet allocation (LDA) is a generative statistical model used in natural language processing, and can be used to discover ‘topics’ in a large set of documents. It was first presented by David Blei, Andrew Ng, and Michael Jordan.
The key idea is that if we see a ‘topic’ as a collection of certain words, we can look at each document as a mixture of topics, where the proportion of each topic depends on the proportion of words in the document that are associated with that topic. For example, the ‘sports’ topic may consist of the words: tennis, football, gymnastics. Given a set of documents, we can compute the posterior distribution over the topics. In the original LDA paper, this is done using a coordinate descent algorithm for mean-field variational inference; later on, researchers also used Gibbs sampling and expectation propagation. In this tutorial we will be looking only at Stochastic Variational Inference (SVI) for LDA. SVI was first published in 2013 by Matt Hoffman, David Blei, Chong Wang, and John Paisley.
Traditional coordinate-descent variational inference requires each update to be carried out with all of the data, and these updates become inefficient when the dataset gets large, as each update scales linearly with the size of the data. The key idea of SVI is to update the global variational parameters more frequently. Using local and global parameters, and given a dataset with a known number of data points, we can randomly take one data point at a time, update its local parameters, and project that change onto the global parameters. Like traditional coordinate-descent variational inference, this is repeated until the result converges, i.e., until the change in the global parameters is smaller than a certain threshold. The implementation we will be talking about is a naive implementation of the algorithm described in the original paper.
Variable Notation
Here we provide a brief overview of the input variables for LDA and SVI. Variables that can be set are the following:
• λ: what we want in the end (the variational parameter for the posterior distribution over words for each topic)
• vocab: this is the overall vocabulary we will have in the docs
• K: this is the number of topics we want to get in the end
• D: this is the total number of documents
• α: parameter for per-document topic distribution
• η: parameter for per-topic vocab distribution
• τ: delay that down-weights early iterations
• κ: forgetting rate; controls how quickly old information is forgotten. The larger the value, the more slowly old information is forgotten.
• max_iterations: the maximum number of iterations the updates should run for. We usually set a check such that if the difference between two consecutive values of λ is smaller than a certain threshold, we say the algorithm has converged. However, sometimes this threshold is set too small, so we also set a maximum number of iterations to avoid the updates running forever.
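To make the roles of λ, η, τ, and κ concrete, here is a minimal sketch (our own, not code from the original paper) of the stochastic update of the global parameter λ in SVI-LDA. It assumes the local, per-document E-step has already produced expected word–topic counts for one randomly sampled document; that helper is left as a placeholder.

def svi_global_update(lam, doc_stats, t, D, eta, tau, kappa):
    """One stochastic update of the global variational parameter lambda (λ).

    lam       : (K, V) NumPy array of current Dirichlet parameters, one row per topic
    doc_stats : (K, V) expected word-topic counts from the local (E-step) update
                for a single randomly sampled document
    t         : current iteration number
    D         : total number of documents in the corpus
    """
    rho = (tau + t) ** (-kappa)       # step size: tau delays, kappa controls decay
    lam_hat = eta + D * doc_stats     # estimate as if the whole corpus looked
                                      # like this one sampled document
    return (1.0 - rho) * lam + rho * lam_hat

The convex combination (1 − ρ)·λ + ρ·λ̂ is where τ and κ enter: τ down-weights the early, noisy iterations, and κ sets how fast the step size ρ decays as t grows.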
LDA Generative Model
We review the LDA generative model here. LDA assumes there are K topics shared across the corpus, and each document exhibits them in different proportions. It models a corpus w of size D as follows:
• Draw distribution over vocabulary βk ~ Dirichlet(η) for topics k ∈ {1…K}
• For each document d ∈ {1…D} :
– Draw topic proportions θd ~ Dirichlet(α);
– For each word W_dn in the document:
    Draw topic indicator Z_dn ~ Multinomial(θd)
    Draw word W_dn ~ Multinomial(β_Zdn)
Note that this model follows the ‘bag of words’ assumption, such that given the topic proportions, each word drawn is independent of any other words in the document.
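To make these sampling steps concrete, here is a small simulation of the generative process using NumPy (our own illustrative sketch: the toy vocabulary and the values of K, D, α, η, and the Poisson document length are made up for this example).

import numpy as np

rng = np.random.default_rng(0)

vocab = ['tennis', 'football', 'gymnastics', 'neuron', 'circuit', 'protein']
V, K, D = len(vocab), 2, 5     # toy vocabulary size, number of topics, documents
alpha, eta = 0.5, 0.5          # symmetric Dirichlet hyperparameters

# Draw a distribution over the vocabulary for each topic: beta_k ~ Dirichlet(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))  # per-document topic proportions
    N_d = rng.poisson(8)                        # document length (illustrative)
    words = []
    for n in range(N_d):
        z_dn = rng.choice(K, p=theta_d)         # topic indicator for word n
        w_dn = rng.choice(V, p=beta[z_dn])      # word drawn from topic z_dn
        words.append(vocab[w_dn])
    corpus.append(words)

print(corpus)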
Libraries¶
from bs4 import BeautifulSoup
from requests import get
import nltk
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
import collections
import pandas as pd
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
LDA Generative Model¶
Priors:
- Distribution over vocabulary for topic k in {1..K}: beta[k] ~ Dirichlet(V, eta)
- Distribution over topics (latent variables): theta ~ Dirichlet(K, alpha)
For each document:
- Choose number of words: N ~ Poisson(ξ)
For each of the N words w:
- Choose a topic: z ~ Cat(K, theta)
- Choose a word: w ~ Cat(V, beta[z])
Note: This model follows the ‘bag of words’ assumption, such that given the topic proportions, each word drawn is independent of any other words in the document.
Variational Inference¶
To use variational inference, the edges between θ (theta), z, and w are removed in the variational family to make inference on the LDA model tractable.
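Concretely, the mean-field variational family used in the original SVI paper factorizes over the hidden variables; this is one way to state the "removed edges" above:

q(β, θ, z) = ∏k q(βk | λk) · ∏d q(θd | γd) · ∏d,n q(zdn | φdn)

where each q(βk | λk) and q(θd | γd) is a Dirichlet distribution and each q(zdn | φdn) is a multinomial; λ is the global variational parameter, while γ and φ are the local (per-document) ones.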
Global Variables¶
faculty_url = 'https://www.eecs.mit.edu/role/faculty/?fwp_role=faculty'
arXiv_format = 'arxiv.org/find/{}/1/au:+{}_{}/0/1/0/all/0/1' # arxiv.org/find/(subject)/1/au:+(lastname)_(initial)/0/1/0/all/0/1
search_url_format = 'https://arxiv.org/search/?query="{}"&searchtype=author'
subjects = {'Computer Science': 'Computer Science',
'Electrical Engineering': 'Electrical Engineering and Systems Science',
'Physics': 'Physics'}
all_papers_columns = ['Name', 'Abstract']
Web Scraping¶
- Get Faculty Names
Using BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/), and by analyzing the structure of the source code of the MIT EECS faculty page, we can scrape the list of MIT EECS faculty names. Using this information, we can construct the queries we send to arXiv. A possible format for an arXiv search for papers by an author is the following:
arxiv.org/find/(subject)/1/au:+(lastname)_(initial)/0/1/0/all/0/1
You could therefore adapt the names you scraped, and query through all the relevant arXiv search pages.
Within the arXiv source code, look for <span class="list-identifier">, which gives the identifier for the papers listed in your query results. Similarly, look for the tag for the “Abstract” within each paper and scrape the abstract for each paper you find.
Note that you might want to scrape more information than you need and then do some local processing with the text you have instead.
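As a small illustration, here is one possible way (our own sketch, not part of the original notebook) to build such a query URL from a scraped name, assuming a "Firstname Lastname" format and a subject code such as 'cs':

arXiv_format = 'arxiv.org/find/{}/1/au:+{}_{}/0/1/0/all/0/1'

def build_arxiv_query(name, subject='cs'):
    # Split "Firstname Lastname" into the last name and the first initial
    first, last = name.split()[0], name.split()[-1]
    return 'https://' + arXiv_format.format(subject, last.lower(), first[0].lower())

print(build_arxiv_query('Tamara Broderick'))
# 'https://arxiv.org/find/cs/1/au:+broderick_t/0/1/0/all/0/1'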
from urllib.request import Request, urlopen
faculty_url = "https://www.eecs.mit.edu/role/faculty/?fwp_role=faculty"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(faculty_url,headers=hdr)
page = urlopen(req)
faculty_page_content = BeautifulSoup(page,'html.parser')
#print(faculty_page_content)
names = [x.text for x in faculty_page_content.find_all("h5")]
print(names)
['Hal Abelson', 'Elfar Adalsteinsson', 'Fadel Adib', 'Anant Agarwal', 'Pulkit Agrawal', 'Akintunde Akinwande', 'Mohammad Alizadeh', 'Saman Amarasinghe', 'Jacob Andreas', 'Dimitri Antoniadis', 'Ahmad Bahai', 'Hari Balakrishnan', 'Marc A Baldo', 'Regina Barzilay', 'Stephen Bates', 'Sara Beery', 'Adam Belay', 'Karl Berggren', 'Dimitri Bertsekas', 'Robert Berwick', 'Sangeeta Bhatia', 'Duane Boning', 'Guy Bresler', 'Tamara Broderick', 'Vladimir Bulović', 'Michael Carbin', 'Vincent Chan', 'Anantha Chandrakasan', 'YuFeng (Kevin) Chen', 'Adam Chlipala', 'Isaac Chuang', 'Sam Coday', 'Connor Wilson Coley', 'Henry Corrigan-Gibbs', 'Munther Dahleh', 'Luca Daniel', 'Constantinos Daskalakis', 'Randall Davis', 'Jesús del Alamo', 'Christina Delimitrou', 'Erik Demaine', 'Srini Devadas', 'David DeWitt', 'Priya Donti', 'Frederic Durand', 'Joel Emer', 'Dirk Englund', 'Gabriele Farina', 'Dennis Freeman', 'William Freeman', 'James Fujimoto', 'Mohsen Ghaffari', 'Marzyeh Ghassemi', 'Manya Ghobadi', 'David Gifford', 'Shafrira Goldwasser', 'Polina Golland', 'Martha Gray', 'W. Eric L Grimson', 'John Guttag', 'Dylan Hadfield-Menell', 'Peter Hagelstein', 'Jongyoon Han', 'Ruonan Han', 'Song Han', 'Kaiming He', 'Thomas Heldt', 'Sam Hopkins', 'Berthold Horn', 'Qing Hu', 'Daniel Huttenlocher', 'Marija Ilic', 'Piotr Indyk', 'Phillip Isola', 'Tommi Jaakkola', 'Daniel Jackson', 'Patrick Jaillet', 'Stefanie Jegelka', 'Ericmoore Jossou', 'M. Frans Kaashoek', 'Leslie Kaelbling', 'Yael Kalai', 'David Karger', 'Dina Katabi', 'Manolis Kellis', 'Yoon Kim', 'James Kirtley, Jr.', 'Leslie Kolodziejski', 'Mina Konakovic Lukovic', 'Jing Kong', 'Tim Kraska', 'Tushar Krishna', 'Butler Lampson', 'Jeffrey Lang', 'Hae-Seung Lee', 'Steven Leeb', 'Charles Leiserson', 'Laura Lewis', 'Jae Lim', 'Barbara Liskov', 'Kuikui Liu', 'Luqiao Liu', 'Tomás Lozano-Pérez', 'Nancy Lynch', 'Samuel Madden', 'Aleksander Mądry', 'Thomas Magnanti', 'Wojciech Matusik', 'Muriel Médard', 'Alexandre Megretski', 'Silvio Micali', 'Robert Miller', 'Robert Morris', 'Stefanie Mueller', 'Sendhil Mullainathan', 'Anand V Natarajan', 'Farnaz Niroui', 'Jelena Notaros', 'Kevin P. O’Brien', 'William D. Oliver', 'Alan Oppenheim', 'Terry Orlando', 'Asu Ozdaglar', 'Tomás Palacios', 'Pablo Parrilo', 'David Perreault', 'Yury Polyanskiy', 'Jonathan Ragan-Kelley', 'Manish Raghavan', 'Rajeev Ram', 'L. Rafael Reif', 'Negar Reiskarimian', 'Martin Rinard', 'Ronald Rivest', 'Ronitt Rubinfeld', 'Daniela Rus', 'Daniel Sanchez', 'Arvind Satyanarayan', 'Nidhi Seethapathi', 'Devavrat Shah', 'Jeffrey Shapiro', 'Nir Shavit', 'Julian Shun', 'Vincent Sitzmann', 'Tess Smidt', 'Charles Sodini', 'Armando Solar-Lezama', 'Justin Solomon', 'David Sontag', 'Suvrit Sra', 'Michael Stonebraker', 'Collin Stultz', 'Gerald Sussman', 'Vivienne Sze', 'Peter Szolovits', 'Russell Tedrake', 'Bruce Tidor', 'Antonio Torralba', 'John Tsitsiklis', 'Caroline Uhler', 'Vinod Vaikuntanathan', 'George Verghese', 'Joel Voldman', 'Martin Wainwright', 'Cardinal Warde', 'Jacob White', 'Ryan Williams', 'Virginia Vassilevska Williams', 'Alan Willsky', 'Ashia Wilson', 'Gregory Wornell', 'Mengjia Yan', 'Guangyu Robert Yang', 'Sixian You', 'Nickolai Zeldovich', 'Lizhong Zheng', 'Victor Zue']
- Scrape Papers
# scraping abstracts
def scrapeArXiV(names):
    papers = list()
    for name in names:
        # Query the arXiv author search page for this faculty member
        search_url = search_url_format.format(name.replace(' ', '+'))
        papers_author = get(search_url)
        papers_author_content = BeautifulSoup(papers_author.content, 'html.parser')
        papers_author_body = papers_author_content.body
        # Each search hit is an <li class="arxiv-result"> element
        results = papers_author_body.find_all("li", class_="arxiv-result")
        abstracts = [result.find("span", class_="abstract-full") for result in results]
        # unwrap() is called for its side effect of removing the <a> tag wrapper;
        # the abstract text is then the leading content of each span
        abstracts_content = [abstract.a.unwrap() for abstract in abstracts]
        abstracts_content = [abstract.contents[0] for abstract in abstracts]
        if abstracts_content:
            papers = papers + abstracts_content
    return papers
papers = scrapeArXiV(names)
papers
['\n Our increasing reliance on digital technology for personal, economic, and government affairs has made it essential to secure the communications and devices of private citizens, businesses, and governments. This has led to pervasive use of cryptography across society. Despite its evident advantages, law enforcement and national security agencies have argued that the spread of cryptography has hindered access to evidence and intelligence. Some in industry and government now advocate a new technology to access targeted data: client-side scanning (CSS). Instead of weakening encryption or providing law enforcement with backdoor keys to decrypt communications, CSS would enable on-device analysis of data in the clear. If targeted information were detected, its existence and, potentially, its source, would be revealed to the agencies; otherwise, little or no information would leave the client device. Its proponents claim that CSS is a solution to the encryption versus public safety debate: it offers privacy -- in the sense of unimpeded end-to-end encryption -- and the ability to successfully investigate serious crime. In this report, we argue that CSS neither guarantees efficacious crime prevention nor prevents surveillance. Indeed, the effect is the opposite. CSS by its nature creates serious security and privacy risks for all society while the assistance it can provide for law enforcement is at best problematic. There are multiple ways in which client-side scanning can fail, can be evaded, and can be abused.\n ', '\n In fetal-brain MRI, head-pose changes between prescription and acquisition present a challenge to obtaining the standard sagittal, coronal and axial views essential to clinical assessment. As motion limits acquisitions to thick slices that preclude retrospective resampling, technologists repeat ~55-second stack-of-slices scans (HASTE) with incrementally reoriented field of view numerous times, deducing the head pose from previous stacks. To address this inefficient workflow, we propose a robust head-pose detection algorithm using full-uterus scout scans (EPI) which take ~5 seconds to acquire. Our ~2-second procedure automatically locates the fetal brain and eyes, which we derive from maximally stable extremal regions (MSERs). The success rate of the method exceeds 94% in the third trimester, outperforming a trained technologist by up to 20%. The pipeline may be used to automatically orient the anatomical sequence, removing the need to estimate the head pose from 2D views and reducing delays during which motion can occur.\n ', '\n Fetal motion is unpredictable and rapid on the scale of conventional MR scan times. Therefore, dynamic fetal MRI, which aims at capturing fetal motion and dynamics of fetal function, is limited to fast imaging techniques with compromises in image quality and resolution. Super-resolution for dynamic fetal MRI is still a challenge, especially when multi-oriented stacks of image slices for oversampling are not available and high temporal resolution for recording the dynamics of the fetus or placenta is desired. Further, fetal motion makes it difficult to acquire high-resolution images for supervised learning methods. To address this problem, in this work, we propose STRESS (Spatio-Temporal Resolution Enhancement with Simulated Scans), a self-supervised super-resolution framework for dynamic fetal MRI with interleaved slice acquisitions. 
Our proposed method simulates an interleaved slice acquisition along the high-resolution axis on the originally acquired data to generate pairs of low- and high-resolution images. Then, it trains a super-resolution network by exploiting both spatial and temporal correlations in the MR time series, which is used to enhance the resolution of the original data. Evaluations on both simulated and in utero data show that our proposed method outperforms other self-supervised super-resolution methods and improves image quality, which is beneficial to other downstream tasks and evaluations.\n ', '\n Image denoising is of great importance for medical imaging system, since it can improve image quality for disease diagnosis and downstream image analyses. In a variety of applications, dynamic imaging techniques are utilized to capture the time-varying features of the subject, where multiple images are acquired for the same subject at different time points. Although signal-to-noise ratio of each time frame is usually limited by the short acquisition time, the correlation among different time frames can be exploited to improve denoising results with shared information across time frames. With the success of neural networks in computer vision, supervised deep learning methods show prominent performance in single-image denoising, which rely on large datasets with clean-vs-noisy image pairs. Recently, several self-supervised deep denoising models have been proposed, achieving promising results without needing the pairwise ground truth of clean images. In the field of multi-image denoising, however, very few works have been done on extracting correlated information from multiple slices for denoising using self-supervised deep learning methods. In this work, we propose Deformed2Self, an end-to-end self-supervised deep learning framework for dynamic imaging denoising. It combines single-image and multi-image denoising to improve image quality and use a spatial transformer network to model motion between different slices. Further, it only requires a single noisy image with a few auxiliary observations at different time frames for training and inference. Evaluations on phantom and in vivo data with different noise statistics show that our method has comparable performance to other state-of-the-art unsupervised or self-supervised denoising methods and outperforms under high noise levels.\n ', "\n Purpose: To develop a scan-specific model that estimates and corrects k-space errors made when reconstructing accelerated Magnetic Resonance Imaging (MRI) data.\n Methods: Scan-Specific Artifact Reduction in k-space (SPARK) trains a convolutional-neural-network to estimate and correct k-space errors made by an input reconstruction technique by back-propagating from the mean-squared-error loss between an auto-calibration signal (ACS) and the input technique's reconstructed ACS. First, SPARK is applied to GRAPPA and demonstrates improved robustness over other scan-specific models, such as RAKI and residual-RAKI. Subsequent experiments demonstrate that SPARK synergizes with residual-RAKI to improve reconstruction performance. 
SPARK also improves reconstruction quality when applied to advanced acquisition and reconstruction techniques like 2D virtual coil (VC-) GRAPPA, 2D LORAKS, 3D GRAPPA without an integrated ACS region, and 2D/3D wave-encoded images.\n Results: SPARK yields 1.5x - 2x RMSE reduction when applied to GRAPPA and improves robustness to ACS size for various acceleration rates in comparison to other scan-specific techniques. When applied to advanced reconstruction techniques such as residual-RAKI, 2D VC-GRAPPA and LORAKS, SPARK achieves up to 20% RMSE improvement. SPARK with 3D GRAPPA also improves performance by ~2x and perceived image quality without a fully sampled ACS region. Finally, SPARK synergizes with non-cartesian 2D and 3D wave-encoding imaging by reducing RMSE between 20-25% and providing qualitative improvements.\n Conclusion: SPARK synergizes with physics-based acquisition and reconstruction techniques to improve accelerated MRI by training scan-specific models to estimate and correct reconstruction errors in k-space.\n ", '\n Fetal MRI is heavily constrained by unpredictable and substantial fetal motion that causes image artifacts and limits the set of viable diagnostic image contrasts. Current mitigation of motion artifacts is predominantly performed by fast, single-shot MRI and retrospective motion correction. Estimation of fetal pose in real time during MRI stands to benefit prospective methods to detect and mitigate fetal motion artifacts where inferred fetal motion is combined with online slice prescription with low-latency decision making. Current developments of deep reinforcement learning (DRL), offer a novel approach for fetal landmarks detection. In this task 15 agents are deployed to detect 15 landmarks simultaneously by DRL. The optimization is challenging, and here we propose an improved DRL that incorporates priors on physical structure of the fetal body. First, we use graph communication layers to improve the communication among agents based on a graph where each node represents a fetal-body landmark. Further, additional reward based on the distance between agents and physical structures such as the fetal limbs is used to fully exploit physical structure. Evaluation of this method on a repository of 3-mm resolution in vivo data demonstrates a mean accuracy of landmark estimation within 10 mm of ground truth as 87.3%, and a mean error of 6.9 mm. The proposed DRL for fetal pose landmark search demonstrates a potential clinical utility for online detection of fetal motion that guides real-time mitigation of motion artifacts as well as health diagnosis during MRI of the pregnant mother.\n ', '\n We demonstrate that neural network layers that explicitly combine frequency and image feature representations are a versatile building block for analysis of imaging data acquired in the frequency space. Our work is motivated by the challenges arising in MRI acquisition where the signal is a corrupted Fourier transform of the desired image. The joint learning schemes proposed and analyzed in this paper enable both correction of artifacts native to the frequency space and manipulation of image space representations to reconstruct coherent image structures. This is in contrast to most current deep learning approaches for image reconstruction that apply learned data manipulations solely in the frequency space or solely in the image space. 
We demonstrate the advantages of joint convolutional learning on three diverse tasks: image reconstruction from undersampled acquisitions, motion correction, and image denoising in brain and knee MRI. We further demonstrate advantages of the joint learning approaches across training schemes using a wide variety of loss functions. Unlike purely image based and purely frequency based architectures, the joint models produce consistently high quality output images across all tasks and datasets. Joint image and frequency space feature representations promise to significantly improve modeling and reconstruction of images acquired in the frequency space. Our code is available at https://github.com/nalinimsingh/interlacer.\n ', '\n Fetal brain MRI is useful for diagnosing brain abnormalities but is challenged by fetal motion. The current protocol for T2-weighted fetal brain MRI is not robust to motion so image volumes are degraded by inter- and intra- slice motion artifacts. Besides, manual annotation for fetal MR image quality assessment are usually time-consuming. Therefore, in this work, a semi-supervised deep learning method that detects slices with artifacts during the brain volume scan is proposed. Our method is based on the mean teacher model, where we not only enforce consistency between student and teacher models on the whole image, but also adopt an ROI consistency loss to guide the network to focus on the brain region. The proposed method is evaluated on a fetal brain MR dataset with 11,223 labeled images and more than 200,000 unlabeled images. Results show that compared with supervised learning, the proposed method can improve model accuracy by about 6\\% and outperform other state-of-the-art semi-supervised learning methods. The proposed method is also implemented and evaluated on an MR scanner, which demonstrates the feasibility of online image quality assessment and image reacquisition during fetal MR scans.\n ', '\n Purpose: To improve the image quality of highly accelerated multi-channel MRI data by learning a joint variational network that reconstructs multiple clinical contrasts jointly.\n Methods: Data from our multi-contrast acquisition was embedded into the variational network architecture where shared anatomical information is exchanged by mixing the input contrasts. Complementary k-space sampling across imaging contrasts and Bunch-Phase/Wave-Encoding were used for data acquisition to improve the reconstruction at high accelerations. At 3T, our joint variational network approach across T1w, T2w and T2-FLAIR-weighted brain scans was tested for retrospective under-sampling at R=6 (2D) and R=4x4 (3D) acceleration. Prospective acceleration was also performed for 3D data where the combined acquisition time for whole brain coverage at 1 mm isotropic resolution across three contrasts was less than three minutes.\n Results: Across all test datasets, our joint multi-contrast network better preserved fine anatomical details with reduced image-blurring when compared to the corresponding single-contrast reconstructions. 
Improvement in image quality was also obtained through complementary k-space sampling and Bunch-Phase/Wave-Encoding where the synergistic combination yielded the overall best performance as evidenced by exemplarily slices and quantitative error metrics.\n Conclusion: By leveraging shared anatomical structures across the jointly reconstructed scans, our joint multi-contrast approach learnt more efficient regularizers which helped to retain natural image appearance and avoid over-smoothing. When synergistically combined with advanced encoding techniques, the performance was further improved, enabling up to R=16-fold acceleration with good image quality. This should help pave the way to very rapid high-resolution brain exams.\n ', '\n We propose Nonlinear Dipole Inversion (NDI) for high-quality Quantitative Susceptibility Mapping (QSM) without regularization tuning, while matching the image quality of state-of-the-art reconstruction techniques. In addition to avoiding over-smoothing that these techniques often suffer from, we also obviate the need for parameter selection. NDI is flexible enough to allow for reconstruction from an arbitrary number of head orientations, and outperforms COSMOS even when using as few as 1-direction data. This is made possible by a nonlinear forward-model that uses the magnitude as an effective prior, for which we derived a simple gradient descent update rule. We synergistically combine this physics-model with a Variational Network (VN) to leverage the power of deep learning in the VaNDI algorithm. This technique adopts the simple gradient descent rule from NDI and learns the network parameters during training, hence requires no additional parameter tuning. Further, we evaluate NDI at 7T using highly accelerated Wave-CAIPI acquisitions at 0.5 mm isotropic resolution and demonstrate high-quality QSM from as few as 2-direction data.\n ', '\n The performance and diagnostic utility of magnetic resonance imaging (MRI) in pregnancy is fundamentally constrained by fetal motion. Motion of the fetus, which is unpredictable and rapid on the scale of conventional imaging times, limits the set of viable acquisition techniques to single-shot imaging with severe compromises in signal-to-noise ratio and diagnostic contrast, and frequently results in unacceptable image quality. Surprisingly little is known about the characteristics of fetal motion during MRI and here we propose and demonstrate methods that exploit a growing repository of MRI observations of the gravid abdomen that are acquired at low spatial resolution but relatively high temporal resolution and over long durations (10-30 minutes). We estimate fetal pose per frame in MRI volumes of the pregnant abdomen via deep learning algorithms that detect key fetal landmarks. Evaluation of the proposed method shows that our framework achieves quantitatively an average error of 4.47 mm and 96.4\\% accuracy (with error less than 10 mm). Fetal pose estimation in MRI time series yields novel means of quantifying fetal movements in health and disease, and enables the learning of kinematic models that may enhance prospective mitigation of fetal motion artifacts during MRI acquisition.\n ', '\n We present a robust method to correct for motion in volumetric in-utero MRI time series. Time-course analysis for in-utero volumetric MRI time series often suffers from substantial and unpredictable fetal motion. Registration provides voxel correspondences between images and is commonly employed for motion correction. 
Current registration methods often fail when aligning images that are substantially different from a template (reference image). To achieve accurate and robust alignment, we make a Markov assumption on the nature of motion and take advantage of the temporal smoothness in the image data. Forward message passing in the corresponding hidden Markov model (HMM) yields an estimation algorithm that only has to account for relatively small motion between consecutive frames. We evaluate the utility of the temporal model in the context of in-utero MRI time series alignment by examining the accuracy of propagated segmentation label maps. Our results suggest that the proposed model captures accurately the temporal dynamics of transformations in in-utero MRI time series.\n ', '\n We present a robust method to correct for motion and deformations for in-utero volumetric MRI time series. Spatio-temporal analysis of dynamic MRI requires robust alignment across time in the presence of substantial and unpredictable motion. We make a Markov assumption on the nature of deformations to take advantage of the temporal structure in the image data. Forward message passing in the corresponding hidden Markov model (HMM) yields an estimation algorithm that only has to account for relatively small motion between consecutive frames. We demonstrate the utility of the temporal model by showing that its use improves the accuracy of the segmentation propagation through temporal registration. Our results suggest that the proposed model captures accurately the temporal dynamics of deformations in in-utero MRI time series.\n ', '\n A linear inverse problem is proposed that requires the determination of multiple unknown signal vectors. Each unknown vector passes through a different system matrix and the results are added to yield a single observation vector. Given the matrices and lone observation, the objective is to find a simultaneously sparse set of unknown vectors that solves the system. We will refer to this as the multiple-system single-output (MSSO) simultaneous sparsity problem. This manuscript contrasts the MSSO problem with other simultaneous sparsity problems and conducts a thorough initial exploration of algorithms with which to solve it. Seven algorithms are formulated that approximately solve this NP-Hard problem. Three greedy techniques are developed (matching pursuit, orthogonal matching pursuit, and least squares matching pursuit) along with four methods based on a convex relaxation (iteratively reweighted least squares, two forms of iterative shrinkage, and formulation as a second-order cone program). The algorithms are evaluated across three experiments: the first and second involve sparsity profile recovery in noiseless and noisy scenarios, respectively, while the third deals with magnetic resonance imaging radio-frequency excitation pulse design.\n ', "\n We present the design, implementation, and evaluation of RF-Grasp, a robotic system that can grasp fully-occluded objects in unknown and unstructured environments. Unlike prior systems that are constrained by the line-of-sight perception of vision and infrared sensors, RF-Grasp employs RF (Radio Frequency) perception to identify and locate target objects through occlusions, and perform efficient exploration and complex manipulation tasks in non-line-of-sight settings.\n RF-Grasp relies on an eye-in-hand camera and batteryless RFID tags attached to objects of interest. 
It introduces two main innovations: (1) an RF-visual servoing controller that uses the RFID's location to selectively explore the environment and plan an efficient trajectory toward an occluded target, and (2) an RF-visual deep reinforcement learning network that can learn and execute efficient, complex policies for decluttering and grasping.\n We implemented and evaluated an end-to-end physical prototype of RF-Grasp. We demonstrate it improves success rate and efficiency by up to 40-50% over a state-of-the-art baseline. We also demonstrate RF-Grasp in novel tasks such mechanical search of fully-occluded objects behind obstacles, opening up new possibilities for robotic manipulation. Qualitative results (videos) available at rfgrasp.media.mit.edu\n ", '\n Broadening access to both computational and educational resources is critical to diffusing machine-learning (ML) innovation. However, today, most ML resources and experts are siloed in a few countries and organizations. In this paper, we describe our pedagogical approach to increasing access to applied ML through a massive open online course (MOOC) on Tiny Machine Learning (TinyML). We suggest that TinyML, ML on resource-constrained embedded devices, is an attractive means to widen access because TinyML both leverages low-cost and globally accessible hardware, and encourages the development of complete, self-contained applications, from data collection to deployment. To this end, a collaboration between academia (Harvard University) and industry (Google) produced a four-part MOOC that provides application-oriented instruction on how to develop solutions using TinyML. The series is openly available on the edX MOOC platform, has no prerequisites beyond basic programming, and is designed for learners from a global variety of backgrounds. It introduces pupils to real-world applications, ML algorithms, data-set engineering, and the ethical considerations of these technologies via hands-on programming and deployment of TinyML applications in both the cloud and their own microcontrollers. To facilitate continued learning, community building, and collaboration beyond the courses, we launched a standalone website, a forum, a chat, and an optional course-project competition. We also released the course materials publicly, hoping they will inspire the next generation of ML practitioners and educators and further broaden access to cutting-edge ML technologies.\n ', '\n Heart Failure is a major component of healthcare expenditure and a leading cause of mortality worldwide. Despite higher inter-rater variability, endomyocardial biopsy (EMB) is still regarded as the standard technique, used to identify the cause (e.g. ischemic or non-ischemic cardiomyopathy, coronary artery disease, myocardial infarction etc.) of unexplained heart failure. In this paper, we focus on identifying cardiomyopathy as ischemic or non-ischemic. For this, we propose and implement a new unified architecture comprising CNN (inception-V3 model) and bidirectional LSTM (BiLSTM) with self-attention mechanism to predict the ischemic or non-ischemic to classify cardiomyopathy using histopathological images. The proposed model is based on self-attention that implicitly focuses on the information outputted from the hidden layers of BiLSTM. 
Through our results we demonstrate that this framework carries a high learning capacity and is able to improve the classification performance.\n ', '\n In-hand object reorientation has been a challenging problem in robotics due to high dimensional actuation space and the frequent change in contact state between the fingers and the objects. We present a simple model-free framework that can learn to reorient objects with both the hand facing upwards and downwards. We demonstrate the capability of reorienting over 2000 geometrically different objects in both cases. The learned policies show strong zero-shot transfer performance on new objects. We provide evidence that these policies are amenable to real-world operation by distilling them to use observations easily available in the real world. The videos of the learned policies are available at: https://taochenshh.github.io/projects/in-hand-reorientation.\n ', "\n In state-of-the-art self-supervised learning (SSL) pre-training produces semantically good representations by encouraging them to be invariant under meaningful transformations prescribed from human knowledge. In fact, the property of invariance is a trivial instance of a broader class called equivariance, which can be intuitively understood as the property that representations transform according to the way the inputs transform. Here, we show that rather than using only invariance, pre-training that encourages non-trivial equivariance to some transformations, while maintaining invariance to other transformations, can be used to improve the semantic quality of representations. Specifically, we extend popular SSL methods to a more general framework which we name Equivariant Self-Supervised Learning (E-SSL). In E-SSL, a simple additional pre-training objective encourages equivariance by predicting the transformations applied to the input. We demonstrate E-SSL's effectiveness empirically on several popular computer vision benchmarks. Furthermore, we demonstrate usefulness of E-SSL for applications beyond computer vision; in particular, we show its utility on regression problems in photonics science. We will release our code.\n ", "\n Today's robotic quadruped systems can robustly walk over a diverse range of rough but continuous terrains, where the terrain elevation varies gradually. Locomotion on discontinuous terrains, such as those with gaps or obstacles, presents a complementary set of challenges. In discontinuous settings, it becomes necessary to plan ahead using visual inputs and to execute agile behaviors beyond robust walking, such as jumps. Such dynamic motion results in significant motion of onboard sensors, which introduces a new set of challenges for real-time visual processing. The requirement for agility and terrain awareness in this setting reinforces the need for robust control. We present Depth-based Impulse Control (DIC), a method for synthesizing highly agile visually-guided locomotion behaviors. DIC affords the flexibility of model-free learning but regularizes behavior through explicit model-based optimization of ground reaction forces. We evaluate the proposed method both in simulation and in the real world.\n ", "\n The current dominant paradigm for robotic manipulation involves two separate stages: manipulator design and control. Because the robot's morphology and how it can be controlled are intimately linked, joint optimization of design and control can significantly improve performance. 
Existing methods for co-optimization are limited and fail to explore a rich space of designs. The primary reason is the trade-off between the complexity of designs that is necessary for contact-rich tasks against the practical constraints of manufacturing, optimization, contact handling, etc. We overcome several of these challenges by building an end-to-end differentiable framework for contact-aware robot design. The two key components of this framework are: a novel deformation-based parameterization that allows for the design of articulated rigid robots with arbitrary, complex geometry, and a differentiable rigid body simulator that can handle contact-rich scenarios and computes analytical gradients for a full spectrum of kinematic and dynamic parameters. On multiple manipulation tasks, our framework outperforms existing methods that either only optimize for control or for design using alternate representations or co-optimize using gradient-free methods.\n ", '\n Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.\n ', '\n Current model-based reinforcement learning methods struggle when operating from complex visual scenes due to their inability to prioritize task-relevant features. To mitigate this problem, we propose learning Task Informed Abstractions (TIA) that explicitly separates reward-correlated visual features from distractors. For learning TIA, we introduce the formalism of Task Informed MDP (TiMDP) that is realized by training two models that learn visual features via cooperative reconstruction, but one model is adversarially dissociated from the reward signal. Empirical evaluation shows that TIA leads to significant performance gains over state-of-the-art methods on many visual control tasks where natural and unconstrained visual distractions pose a formidable challenge.\n ', '\n A majority of microrobots are constructed using compliant materials that are difficult to model analytically, limiting the utility of traditional model-based controllers. Challenges in data collection on microrobots and large errors between simulated models and real robots make current model-based learning and sim-to-real transfer methods difficult to apply. 
We propose a novel framework residual model learning (RML) that leverages approximate models to substantially reduce the sample complexity associated with learning an accurate robot model. We show that using RML, we can learn a model of the Harvard Ambulatory MicroRobot (HAMR) using just 12 seconds of passively collected interaction data. The learned model is accurate enough to be leveraged as "proxy-simulator" for learning walking and turning behaviors using model-free reinforcement learning algorithms. RML provides a general framework for learning from extremely small amounts of interaction data, and our experiments with HAMR clearly demonstrate that RML substantially outperforms existing techniques.\n ', '\n Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate the hypothesis that deeper networks are inductively biased to find solutions with lower rank embeddings. We conjecture that this bias exists because the volume of functions that maps to low-rank embedding increases with depth. We show empirically that our claim holds true on finite width linear and non-linear models and show that these are the solutions that generalize well. We then show that the low-rank simplicity bias exists even after training, using a wide variety of commonly used optimizers. We found this phenomenon to be resilient to initialization, hyper-parameters, and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance without changing the effective model capacity. Practically, we demonstrate that simply linearly over-parameterizing standard models at training time can improve performance on image classification tasks, including ImageNet.\n ', '\n We present a framework for solving long-horizon planning problems involving manipulation of rigid objects that operates directly from a point-cloud observation, i.e. without prior object models. Our method plans in the space of object subgoals and frees the planner from reasoning about robot-object interaction dynamics by relying on a set of generalizable manipulation primitives. We show that for rigid bodies, this abstraction can be realized using low-level manipulation skills that maintain sticking contact with the object and represent subgoals as 3D transformations. To enable generalization to unseen objects and improve planning performance, we propose a novel way of representing subgoals for rigid-body manipulation and a graph-attention based neural network architecture for processing point-cloud inputs. We experimentally validate these choices using simulated and real-world experiments on the YuMi robot. Results demonstrate that our method can successfully manipulate new objects into target configurations requiring long-term planning. Overall, our framework realizes the best of the worlds of task-and-motion planning (TAMP) and learning-based approaches. Project website: https://anthonysimeonov.github.io/rpo-planning-framework/.\n ', "\n Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. 
However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon yielding better learning in theory, and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. Visualizations are available at https://sites.google.com/view/opal-iclr\n ", '\n When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient\'s variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale\'s convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular "linear learning rate scaling" rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale\'s qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.\n ', '\n Research in developmental psychology consistently shows that children explore the world thoroughly and efficiently and that this exploration allows them to learn. In turn, this early learning supports more robust generalization and intelligent behavior later in life. While much work has gone into developing methods for exploration in machine learning, artificial agents have not yet reached the high standard set by their human counterparts. In this work we propose using DeepMind Lab (Beattie et al., 2016) as a platform to directly compare child and agent behaviors and to develop new exploration techniques. We outline two ongoing experiments to demonstrate the effectiveness of a direct comparison, and outline a number of open research questions that we believe can be tested using this methodology.\n ', '\n Learning robotic manipulation tasks using reinforcement learning with sparse rewards is currently impractical due to the outrageous data requirements. 
Many practical tasks require manipulation of multiple objects, and the complexity of such tasks increases with the number of objects. Learning from a curriculum of increasingly complex tasks appears to be a natural solution, but unfortunately, does not work for many scenarios. We hypothesize that the inability of the state-of-the-art algorithms to effectively utilize a task curriculum stems from the absence of inductive biases for transferring knowledge from simpler to complex tasks. We show that graph-based relational architectures overcome this limitation and enable learning of complex tasks when provided with a simple curriculum of tasks with increasing numbers of objects. We demonstrate the utility of our framework on a simulated block stacking task. Starting from scratch, our agent learns to stack six blocks into a tower. Despite using step-wise sparse rewards, our method is orders of magnitude more data-efficient and outperforms the existing state-of-the-art method that utilizes human demonstrations. Furthermore, the learned policy exhibits zero-shot generalization, successfully stacking blocks into taller towers and previously unseen configurations such as pyramids, without any further training.\n ', '\n We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.\n ', "\n We present an approach for building an active agent that learns to segment its visual observations into individual objects by interacting with its environment in a completely self-supervised manner. The agent uses its current segmentation model to infer pixels that constitute objects and refines the segmentation model by interacting with these pixels. The model learned from over 50K interactions generalizes to novel objects and backgrounds. To deal with noisy training signal for segmenting objects obtained by self-supervised interactions, we propose robust set loss. A dataset of robot's interactions along-with a few human labeled examples is provided as a benchmark for future research. We test the utility of the learned segmentation model by providing results on a downstream vision-based control task of rearranging multiple objects into target configurations from visual inputs alone. Videos, code, and robotic interaction dataset are available at https://pathak22.github.io/seg-by-interaction/\n ", "\n The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. 
Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at https://pathak22.github.io/zeroshot-imitation/\n ", '\n What makes humans so good at solving seemingly complex video games? Unlike computers, humans bring in a great deal of prior knowledge about the world, enabling efficient decision making. This paper investigates the role of human priors for solving video games. Given a sample game, we conduct a series of ablation studies to quantify the importance of various priors on human performance. We do this by modifying the video game environment to systematically mask different types of visual information that could be used by humans as priors. We find that removal of some prior knowledge causes a drastic degradation in the speed with which human players solve the game, e.g. from 2 minutes to over 20 minutes. Furthermore, our results indicate that general priors, such as the importance of objects and visual consistency, are critical for efficient game-play. Videos and the game manipulations are available at https://rach0012.github.io/humanRL_website/\n ', '\n Automated cardiac image interpretation has the potential to transform clinical practice in multiple ways including enabling low-cost serial assessment of cardiac function in the primary care and rural setting. We hypothesized that advances in computer vision could enable building a fully automated, scalable analysis pipeline for echocardiogram (echo) interpretation. Our approach entailed: 1) preprocessing; 2) convolutional neural networks (CNN) for view identification, image segmentation, and phasing of the cardiac cycle; 3) quantification of chamber volumes and left ventricular mass; 4) particle tracking to compute longitudinal strain; and 5) targeted disease detection. CNNs accurately identified views (e.g. 99% for apical 4-chamber) and segmented individual cardiac chambers. Cardiac structure measurements agreed with study report values (e.g. mean absolute deviations (MAD) of 7.7 mL/kg/m2 for left ventricular diastolic volume index, 2918 studies). We computed automated ejection fraction and longitudinal strain measurements (within 2 cohorts), which agreed with commercial software-derived values [for ejection fraction, MAD=5.3%, N=3101 studies; for strain, MAD=1.5% (n=197) and 1.6% (n=110)], and demonstrated applicability to serial monitoring of breast cancer patients for trastuzumab cardiotoxicity. Overall, we found that, compared to manual measurements, automated measurements had superior performance across seven internal consistency metrics with an average increase in the Spearman correlation coefficient of 0.05 (p=0.02). Finally, we developed disease detection algorithms for hypertrophic cardiomyopathy and cardiac amyloidosis, with C-statistics of 0.93 and 0.84, respectively. 
Our pipeline lays the groundwork for using automated interpretation to support point-of-care handheld cardiac ultrasound and large-scale analysis of the millions of echos archived within healthcare systems.\n ', "\n In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch. Demo video and code available at https://pathak22.github.io/noreward-rl/\n ", '\n Manipulation of deformable objects, such as ropes and cloth, is an important but challenging problem in robotics. We present a learning-based system where a robot takes as input a sequence of images of a human manipulating a rope from an initial to goal configuration, and outputs a sequence of actions that can reproduce the human demonstration, using only monocular images as input. To perform this task, the robot learns a pixel-level inverse dynamics model of rope manipulation directly from images in a self-supervised manner, using about 60K interactions with the rope collected autonomously by the robot. The human demonstration provides a high-level plan of what to do and the low-level inverse model is used to execute the plan. We show that by combining the high and low-level plans, the robot can successfully manipulate a rope into a variety of target shapes using only a sequence of human-provided images for direction.\n ', '\n When encountering novel objects, humans are able to infer a wide range of physical properties such as mass, friction and deformability by interacting with them in a goal driven way. This process of active interaction is in the same spirit as a scientist performing experiments to discover hidden facts. Recent advances in artificial intelligence have yielded machines that can achieve superhuman performance in Go, Atari, natural language processing, and complex control problems; however, it is not clear that these systems can rival the scientific intuition of even a young child. In this work we introduce a basic set of tasks that require agents to estimate properties such as mass and cohesion of objects in an interactive simulated environment where they can manipulate the objects and observe the consequences. We found that state of art deep reinforcement learning methods can learn to perform the experiments necessary to discover such hidden properties. 
By systematically manipulating the problem difficulty and the cost incurred by the agent for performing experiments, we found that agents learn different strategies that balance the cost of gathering information against the cost of making mistakes in different situations.\n ', '\n The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks begs the question: what are the properties of the ImageNet dataset that are critical for learning good, general-purpose features? This work provides an empirical investigation of various facets of this question: Is more pre-training data always better? How does feature quality depend on the number of training examples per class? Does adding more object classes improve performance? For the same data budget, how should the data be split into classes? Is fine-grained recognition necessary for learning good features? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class? To answer these and related questions, we pre-trained CNN features on various subsets of the ImageNet dataset and evaluated transfer performance on PASCAL detection, PASCAL action classification, and SUN scene classification tasks. Our overall findings suggest that most changes in the choice of pre-training data long thought to be critical do not significantly affect transfer performance.\n ', "\n We investigate an experiential learning paradigm for acquiring an internal model of intuitive physics. Our model is evaluated on a real-world robotic manipulation task that requires displacing objects to target locations by poking. The robot gathered over 400 hours of experience by executing more than 100K pokes on different objects. We propose a novel approach based on deep neural networks for modeling the dynamics of robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics. The inverse model objective provides supervision to construct informative visual features, which the forward model can then predict and in turn regularize the feature space for the inverse model. The interplay between these two objectives creates useful, accurate models that can then be used for multi-step decision making. This formulation has the additional benefit that it is possible to learn forward models in an abstract feature space and thus alleviate the need of predicting pixels. Our experiments show that this joint modeling approach outperforms alternative methods.\n ", '\n The ability to plan and execute goal specific actions in varied, unexpected settings is a central requirement of intelligent agents. In this paper, we explore how an agent can be equipped with an internal model of the dynamics of the external world, and how it can use this model to plan novel actions by running multiple internal simulations ("visual imagination"). Our models directly process raw visual input, and use a novel object-centric prediction formulation based on visual glimpses centered on objects (fixations) to enforce translational invariance of the learned physical laws. 
The agent gathers training data through random interaction with a collection of different environments, and the resulting model can then be used to plan goal-directed actions in novel environments that the agent has not seen before. We demonstrate that our agent can accurately plan actions for playing a simulated billiards game, which requires pushing a ball into a target position or into collision with another ball.\n ', '\n Hierarchical feature extractors such as Convolutional Networks (ConvNets) have achieved impressive performance on a variety of classification tasks using purely feedforward processing. Feedforward architectures can learn rich representations of the input space but do not explicitly model dependencies in the output spaces, that are quite structured for tasks such as articulated human pose estimation or object segmentation. Here we propose a framework that expands the expressive power of hierarchical feature extractors to encompass both input and output spaces, by introducing top-down feedback. Instead of directly predicting the outputs in one go, we use a self-correcting model that progressively changes an initial solution by feeding back error predictions, in a process we call Iterative Error Feedback (IEF). IEF shows excellent performance on the task of articulated pose estimation in the challenging MPII and LSP benchmarks, matching the state-of-the-art without requiring ground truth scale annotation.\n ', '\n The dominant paradigm for feature learning in computer vision relies on training neural networks for the task of object recognition using millions of hand labelled images. Is it possible to learn useful features for a diverse set of visual tasks using any other form of supervision? In biology, living organisms developed the ability of visual perception for the purpose of moving and acting in the world. Drawing inspiration from this observation, in this work we investigate if the awareness of egomotion can be used as a supervisory signal for feature learning. As opposed to the knowledge of class labels, information about egomotion is freely available to mobile agents. We show that given the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt using class-label as supervision on visual tasks of scene recognition, object recognition, visual odometry and keypoint matching.\n ', '\n The human brain is adept at solving difficult high-level visual processing problems such as image interpretation and object recognition in natural scenes. Over the past few years neuroscientists have made remarkable progress in understanding how the human brain represents categories of objects and actions in natural scenes. However, all current models of high-level human vision operate on hand annotated images in which the objects and actions have been assigned semantic tags by a human operator. No current models can account for high-level visual function directly in terms of low-level visual input (i.e., pixels). To overcome this fundamental limitation we sought to develop a new class of models that can predict human brain activity directly from low-level visual input (i.e., pixels). We explored two classes of models drawn from computer vision and machine learning. The first class of models was based on Fisher Vectors (FV) and the second was based on Convolutional Neural Networks (ConvNets). 
We find that both classes of models accurately predict brain activity in high-level visual areas, directly from pixels and without the need for any semantic tags or hand annotation of images. This is the first time that such a mapping has been obtained. The fit models provide a new platform for exploring the functional principles of human vision, and they show that modern methods of computer vision and machine learning provide important tools for characterizing brain function.\n ', '\n In the last two years, convolutional neural networks (CNNs) have achieved an impressive suite of results on standard recognition datasets and tasks. CNN-based features seem poised to quickly replace engineered representations, such as SIFT and HOG. However, compared to SIFT and HOG, we understand much less about the nature of the features learned by large CNNs. In this paper, we experimentally probe several aspects of CNN feature learning in an attempt to help practitioners gain useful, evidence-backed intuitions about how to apply CNNs to computer vision problems.\n ', '\n Accurately maintaining digital street maps is labor-intensive. To address this challenge, much work has studied automatically processing geospatial data sources such as GPS trajectories and satellite images to reduce the cost of maintaining digital maps. An end-to-end map update system would first process geospatial data sources to extract insights, and second leverage those insights to update and improve the map. However, prior work largely focuses on the first step of this pipeline: these map extraction methods infer road networks from scratch given geospatial data sources (in effect creating entirely new maps), but do not address the second step of leveraging this extracted information to update the existing digital map data. In this paper, we first explain why current map extraction techniques yield low accuracy when extended to update existing maps. We then propose a novel method that leverages the progression of satellite imagery over time to substantially improve accuracy. Our approach first compares satellite images captured at different times to identify portions of the physical road network that have visibly changed, and then updates the existing map accordingly. We show that our change-based approach reduces map update error rates four-fold.\n ', '\n The success of blockchains has sparked interest in large-scale deployments of Byzantine fault tolerant (BFT) consensus protocols over wide area networks. A central feature of such networks is variable communication bandwidth across nodes and across time. We present DispersedLedger, an asynchronous BFT protocol that provides near-optimal throughput in the presence of such variable network bandwidth. The core idea of DispersedLedger is to enable nodes to propose, order, and agree on blocks of transactions without having to download their full content. By enabling nodes to agree on an ordered log of blocks, with a guarantee that each block is available within the network and unmalleable, DispersedLedger decouples bandwidth-intensive block downloads at different nodes, allowing each to make progress at its own pace. We build a full system prototype and evaluate it on real-world and emulated networks. 
Our results on a geo-distributed wide-area deployment across the Internet shows that DispersedLedger achieves 2x better throughput and 74% reduction in latency compared to HoneyBadger, the state-of-the-art asynchronous protocol.\n ', '\n This paper studies the problem of allocating tasks from different customers to vehicles in mobility platforms, which are used for applications like food and package delivery, ridesharing, and mobile sensing. A mobility platform should allocate tasks to vehicles and schedule them in order to optimize both throughput and fairness across customers. However, existing approaches to scheduling tasks in mobility platforms ignore fairness.\n We introduce Mobius, a system that uses guided optimization to achieve both high throughput and fairness across customers. Mobius supports spatiotemporally diverse and dynamic customer demands. It provides a principled method to navigate inherent tradeoffs between fairness and throughput caused by shared mobility. Our evaluation demonstrates these properties, along with the versatility and scalability of Mobius, using traces gathered from ridesharing and aerial sensing applications. Our ridesharing case study shows that Mobius can schedule more than 16,000 tasks across 40 customers and 200 vehicles in an online manner.\n ', '\n Video compression is a critical component of Internet video delivery. Recent work has shown that deep learning techniques can rival or outperform human-designed algorithms, but these methods are significantly less compute and power-efficient than existing codecs. This paper presents a new approach that augments existing codecs with a small, content-adaptive super-resolution model that significantly boosts video quality. Our method, SRVC, encodes video into two bitstreams: (i) a content stream, produced by compressing downsampled low-resolution video with the existing codec, (ii) a model stream, which encodes periodic updates to a lightweight super-resolution neural network customized for short segments of the video. SRVC decodes the video by passing the decompressed low-resolution video frames through the (time-varying) super-resolution model to reconstruct high-resolution video frames. Our results show that to achieve the same PSNR, SRVC requires 16% of the bits-per-pixel of H.265 in slow mode, and 2% of the bits-per-pixel of DVC, a recent deep learning-based video compression scheme. SRVC runs at 90 frames per second on a NVIDIA V100 GPU.\n ', "\n Training high-accuracy object detection models requires large and diverse annotated datasets. However, creating these data-sets is time-consuming and expensive since it relies on human annotators. We design, implement, and evaluate TagMe, a new approach for automatic object annotation in videos that uses GPS data. When the GPS trace of an object is available, TagMe matches the object's motion from GPS trace and the pixels' motions in the video to find the pixels belonging to the object in the video and creates the bounding box annotations of the object. TagMe works using passive data collection and can continuously generate new object annotations from outdoor video streams without any human annotators. We evaluate TagMe on a dataset of 100 video clips. We show TagMe can produce high-quality object annotations in a fully-automatic and low-cost way. 
Compared with the traditional human-in-the-loop solution, TagMe can produce the same amount of annotations at a much lower cost, e.g., up to 110x.\n ", "\n Credit networks rely on decentralized, pairwise trust relationships (channels) to exchange money or goods. Credit networks arise naturally in many financial systems, including the recent construct of payment channel networks in blockchain systems. An important performance metric for these networks is their transaction throughput. However, predicting the throughput of a credit network is nontrivial. Unlike traditional communication channels, credit channels can become imbalanced; they are unable to support more transactions in a given direction once the credit limit has been reached. This potential for imbalance creates a complex dependency between a network's throughput and its topology, path choices, and the credit balances (state) on every channel. Even worse, certain combinations of these factors can lead the credit network to deadlocked states where no transactions can make progress. In this paper, we study the relationship between the throughput of a credit network and its topology and credit state. We show that the presence of deadlocks completely characterizes a network's throughput sensitivity to different credit states. Although we show that identifying deadlocks in an arbitrary topology is NP-hard, we propose a peeling algorithm inspired by decoding algorithms for erasure codes that upper bounds the severity of the deadlock. We use the peeling algorithm as a tool to compare the performance of different topologies as well as to aid in the synthesis of topologies robust to deadlocks.\n ", '\n The increasing use of cloud computing for latency-sensitive applications has sparked renewed interest in providing tight bounds on network tail latency. Achieving this in practice at reasonable network utilization has proved elusive, due to a combination of highly bursty application demand, faster link speeds, and heavy-tailed message sizes. While priority scheduling can be used to reduce tail latency for some traffic, this comes at a cost of much worse delay behavior for all other traffic on the network. Most operators choose to run their networks at very low average utilization, despite the added cost, and yet still suffer poor tail behavior.\n This paper takes a different approach. We build a system, swp, to help operators (and network designers) to understand and control tail latency without relying on priority scheduling. As network workload changes, swp is designed to give real-time advice on the network switch configurations needed to maintain tail latency objectives for each traffic class. The core of swp is an efficient model for simulating the combined effect of traffic characteristics, end-to-end congestion control, and switch scheduling on service-level objectives (SLOs), along with an optimizer that adjusts switch-level scheduling weights assigned to each class. Using simulation across a diverse set of workloads with different SLOs, we show that to meet the same SLOs as swp provides, FIFO would require 65% greater link capacity, and 79% more for scenarios with tight SLOs on bursty traffic classes.\n ', "\n Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. 
We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark, which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5x better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data.\n ", '\n Databases employ indexes to filter out irrelevant records, which reduces scan overhead and speeds up query execution. However, this optimization is only available to queries that filter on the indexed attribute. To extend these speedups to queries on other attributes, database systems have turned to secondary and multi-dimensional indexes. Unfortunately, these approaches are restrictive: secondary indexes have a large memory footprint and can only speed up queries that access a small number of records, and multi-dimensional indexes cannot scale to more than a handful of columns. We present Cortex, an approach that takes advantage of correlations to extend the reach of primary indexes to more attributes. Unlike prior work, Cortex can adapt itself to any existing primary index, whether single or multi-dimensional, to harness a broad variety of correlations, such as those that exist between more than two attributes or have a large number of outliers. We demonstrate that on real datasets exhibiting these diverse types of correlations, Cortex matches or outperforms traditional secondary indexes with $5\\times$ less space, and it is $2-8\\times$ faster than existing approaches to indexing correlations.\n ', "\n Queues allow network operators to control traffic: where queues build, they can enforce scheduling and shaping policies. In the Internet today, however, there is a mismatch between where queues build and where control is most effectively enforced; queues build at bottleneck links that are often not under the control of the data sender. To resolve this mismatch, we propose a new kind of middlebox, called Bundler. Bundler uses a novel inner control loop between a sendbox (in the sender's site) and a receivebox (in the receiver's site) to determine the aggregate rate for the bundle, leaving the end-to-end connections and their control loops intact. Enforcing this sending rate ensures that bottleneck queues that would have built up from the bundle's packets now shift from the bottleneck to the sendbox. 
The sendbox then exercises control over its traffic by scheduling packets to achieve higher-level objectives. We have implemented Bundler in Linux and evaluated it with real-world and emulation experiments. We find that Bundler allows the sender-chosen policy to be effective: when configured to implement Stochastic Fairness Queueing (SFQ), it improves median flow completion time (FCT) by between 28% and 97% across various scenarios.\n ", "\n Client-side video players employ adaptive bitrate (ABR) algorithms to optimize user quality of experience (QoE). We evaluate recently proposed RL-based ABR methods in Facebook's web-based video streaming platform. Real-world ABR contains several challenges that requires customized designs beyond off-the-shelf RL algorithms -- we implement a scalable neural network architecture that supports videos with arbitrary bitrate encodings; we design a training method to cope with the variance resulting from the stochasticity in network conditions; and we leverage constrained Bayesian optimization for reward shaping in order to optimize the conflicting QoE objectives. In a week-long worldwide deployment with more than 30 million video streaming sessions, our RL approach outperforms the existing human-engineered ABR algorithms.\n ", '\n Inferring road graphs from satellite imagery is a challenging computer vision task. Prior solutions fall into two categories: (1) pixel-wise segmentation-based approaches, which predict whether each pixel is on a road, and (2) graph-based approaches, which predict the road graph iteratively. We find that these two approaches have complementary strengths while suffering from their own inherent limitations.\n In this paper, we propose a new method, Sat2Graph, which combines the advantages of the two prior categories into a unified framework. The key idea in Sat2Graph is a novel encoding scheme, graph-tensor encoding (GTE), which encodes the road graph into a tensor representation. GTE makes it possible to train a simple, non-recurrent, supervised model to predict a rich set of features that capture the graph structure directly from an image. We evaluate Sat2Graph using two large datasets. We find that Sat2Graph surpasses prior methods on two widely used metrics, TOPO and APLS. Furthermore, whereas prior work only infers planar road graphs, our approach is capable of inferring stacked roads (e.g., overpasses), and does so robustly.\n ', '\n Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. 
In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.\n ', '\n Real-time video inference on edge devices like mobile phones and drones is challenging due to the high computation cost of Deep Neural Networks. We present Adaptive Model Streaming (AMS), a new approach to improving performance of efficient lightweight models for video inference on edge devices. AMS uses a remote server to continually train and adapt a small model running on the edge device, boosting its performance on the live video using online knowledge distillation from a large, state-of-the-art model. We discuss the challenges of over-the-network model adaptation for video inference, and present several techniques to reduce communication cost of this approach: avoiding excessive overfitting, updating a small fraction of important model parameters, and adaptive sampling of training frames at edge devices. On the task of video semantic segmentation, our experimental results show 0.4--17.8 percent mean Intersection-over-Union improvement compared to a pre-trained model across several video datasets. Our prototype can perform video segmentation at 30 frames-per-second with 40 milliseconds camera-to-label latency on a Samsung Galaxy S10+ mobile phone, using less than 300 Kbps uplink and downlink bandwidth on the device.\n ', '\n Several programming languages use garbage collectors (GCs) to automatically manage memory for the programmer. Such collectors must decide when to look for unreachable objects to free, which can have a large performance impact on some applications. In this preliminary work, we propose a design for a learned garbage collector that autonomously learns over time when to perform collections. By using reinforcement learning, our design can incorporate user-defined reward functions, allowing an autonomous garbage collector to learn to optimize the exact metric the user desires (e.g., request latency or queries per second). We conduct an initial experimental study on a prototype, demonstrating that an approach based on tabular Q learning may be promising.\n ', '\n Query optimization remains one of the most challenging problems in data management systems. Recent efforts to apply machine learning techniques to query optimization challenges have been promising, but have shown few practical gains due to substantive training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties and drawing upon a long history of research in multi-armed bandits, we introduce Bao (the BAndit Optimizer). Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints. Bao combines modern tree convolutional neural networks with Thompson sampling, a decades-old and well-studied reinforcement learning algorithm. As a result, Bao automatically learns from its mistakes and adapts to changes in query workloads, data, and schema. Experimentally, we demonstrate that Bao can quickly (an order of magnitude faster than previous approaches) learn strategies that improve end-to-end query execution performance, including tail latency. 
In cloud environments, we show that Bao can offer both reduced costs and better performance compared with a sophisticated commercial system.\n ', '\n Inferring road attributes such as lane count and road type from satellite imagery is challenging. Often, due to the occlusion in satellite imagery and the spatial correlation of road attributes, a road attribute at one position on a road may only be apparent when considering far-away segments of the road. Thus, to robustly infer road attributes, the model must integrate scattered information and capture the spatial correlation of features along roads. Existing solutions that rely on image classifiers fail to capture this correlation, resulting in poor accuracy. We find this failure is caused by a fundamental limitation -- the limited effective receptive field of image classifiers. To overcome this limitation, we propose RoadTagger, an end-to-end architecture which combines both Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) to infer road attributes. The usage of graph neural networks allows information propagation on the road network graph and eliminates the receptive field limitation of image classifiers. We evaluate RoadTagger on both a large real-world dataset covering 688 km^2 area in 20 U.S. cities and a synthesized micro-dataset. In the evaluation, RoadTagger improves inference accuracy over the CNN image classifier based approaches. RoadTagger also demonstrates strong robustness against different disruptions in the satellite imagery and the ability to learn complicated inductive rules for aggregating scattered information along the road network.\n ', '\n Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system.\n ', '\n Street maps are a crucial data source that help to inform a wide range of decisions, from navigating a city to disaster relief and urban planning. However, in many parts of the world, street maps are incomplete or lag behind new construction. Editing maps today involves a tedious process of manually tracing and annotating roads, buildings, and other map features.\n Over the past decade, many automatic map inference systems have been proposed to automatically extract street map data from satellite imagery, aerial imagery, and GPS trajectory datasets. 
However, automatic map inference has failed to gain traction in practice due to two key limitations: high error rates (low precision), which manifest in noisy inference outputs, and a lack of end-to-end system design to leverage inferred data to update existing street maps.\n At MIT and QCRI, we have developed a number of algorithms and approaches to address these challenges, which we combined into a new system we call Mapster. Mapster is a human-in-the-loop street map editing system that incorporates three components to robustly accelerate the mapping process over traditional tools and workflows: high-precision automatic map inference, data refinement, and machine-assisted map editing.\n Through an evaluation on a large-scale dataset including satellite imagery, GPS trajectories, and ground-truth map data in forty cities, we show that Mapster makes automation practical for map editing, and enables the curation of map datasets that are more complete and up-to-date at less cost.\n ', "\n Bitcoin is the first fully decentralized permissionless blockchain protocol and achieves a high level of security: the ledger it maintains has guaranteed liveness and consistency properties as long as the adversary has less compute power than the honest nodes. However, its throughput is only 7 transactions per second and the confirmation latency can be up to hours. Prism is a new blockchain protocol which is designed to achieve a natural scaling of Bitcoin's performance while maintaining its full security guarantees. We present an implementation of Prism which achieves a throughput of 70,000 transactions per second and confirmation latencies of tens of seconds.\n ", "\n Effective congestion control for data center networks is becoming increasingly challenging with a growing amount of latency sensitive traffic, much fatter links, and extremely bursty traffic. Widely deployed algorithms, such as DCTCP and DCQCN, are still far from optimal in many plausible scenarios, particularly for tail latency. Many operators compensate by running their networks at low average utilization, dramatically increasing costs.\n In this paper, we argue that we have reached the practical limits of end-to-end congestion control. Instead, we propose, implement, and evaluate a new congestion control architecture called Backpressure Flow Control (BFC). BFC provides per-hop per-flow flow control, but with bounded state, constant-time switch operations, and careful use of buffers. We demonstrate BFC's feasibility by implementing it on Tofino2, a state-of-the-art P4-based programmable hardware switch. In simulation, we show that BFC achieves near optimal throughput and tail latency behavior even under challenging conditions such as high network load and incast cross traffic. Compared to existing end-to-end schemes, BFC achieves 2.3 - 60 X lower tail latency for short flows and 1.6 - 5 X better average completion time for long flows.\n ", '\n We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. Unlike prior approaches that only find a device placement for a specific computation graph, Placeto can learn generalizable device placement policies that can be applied to any graph. 
We propose two key ideas in our approach: (1) we represent the policy as performing iterative placement improvements, rather than outputting a placement in one shot; (2) we use graph embeddings to capture relevant information about the structure of the computation graph, without relying on node labels for indexing. These ideas allow Placeto to train efficiently and generalize to unseen graphs. Our experiments show that Placeto requires up to 6.1x fewer training steps to find placements that are on par with or better than the best placements found by prior approaches. Moreover, Placeto is able to learn a generalizable placement policy for any given family of graphs, which can then be used without any retraining to predict optimized placements for unseen graphs from the same family. This eliminates the large overhead incurred by prior RL approaches whose lack of generalizability necessitates re-training from scratch every time a new graph is to be placed.\n ', '\n Mapping road networks today is labor-intensive. As a result, road maps have poor coverage outside urban centers in many countries. Systems to automatically infer road network graphs from aerial imagery and GPS trajectories have been proposed to improve coverage of road maps. However, because of high error rates, these systems have not been adopted by mapping communities. We propose machine-assisted map editing, where automatic map inference is integrated into existing, human-centric map editing workflows. To realize this, we build Machine-Assisted iD (MAiD), where we extend the web-based OpenStreetMap editor, iD, with machine-assistance functionality. We complement MAiD with a novel approach for inferring road topology from aerial imagery that combines the speed of prior segmentation approaches with the accuracy of prior iterative graph construction methods. We design MAiD to tackle the addition of major, arterial roads in regions where existing maps have poor coverage, and the incremental improvement of coverage in regions where major roads are already mapped. We conduct two user studies and find that, when participants are given a fixed time to map roads, they are able to add as much as 3.5x more roads with MAiD.\n ', "\n Symbol detection for Massive Multiple-Input Multiple-Output (MIMO) is a challenging problem for which traditional algorithms are either impractical or suffer from performance limitations. Several recently proposed learning-based approaches achieve promising results on simple channel models (e.g., i.i.d. Gaussian). However, their performance degrades significantly on real-world channels with spatial correlation. We propose MMNet, a deep learning MIMO detection scheme that significantly outperforms existing approaches on realistic channels with the same or lower computational complexity. MMNet's design builds on the theory of iterative soft-thresholding algorithms and uses a novel training algorithm that leverages temporal and spectral correlation to accelerate training. Together, these innovations allow MMNet to train online for every realization of the channel. On i.i.d. Gaussian channels, MMNet requires two orders of magnitude fewer operations than existing deep learning schemes but achieves near-optimal performance. On spatially-correlated channels, it achieves the same error rate as the next-best learning scheme (OAMPNet) at 2.5dB lower SNR and with at least 10x less computational complexity. 
MMNet is also 4--8dB better overall than a classic linear scheme like the minimum mean square error (MMSE) detector.\n ", '\n We propose Accel-Brake Control (ABC), a simple and deployable explicit congestion control protocol for network paths with time-varying wireless links. ABC routers mark each packet with an "accelerate" or "brake", which causes senders to slightly increase or decrease their congestion windows. Routers use this feedback to quickly guide senders towards a desired target rate. ABC requires no changes to header formats or user devices, but achieves better performance than XCP. ABC is also incrementally deployable; it operates correctly when the bottleneck is a non-ABC router, and can coexist with non-ABC traffic sharing the same bottleneck link. We evaluate ABC using a Wi-Fi implementation and trace-driven emulation of cellular links. ABC achieves 30-40% higher throughput than Cubic+Codel for similar delays, and 2.2X lower delays than BBR on a Wi-Fi path. On cellular network paths, ABC achieves 50% higher throughput than Cubic+Codel.\n ', '\n Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query executions plans. Neo bootstraps its query optimization model from existing optimizers and continues to learn from incoming queries, building upon its successes and learning from its failures. Furthermore, Neo naturally adapts to underlying data patterns and is robust to estimation errors. Experimental results demonstrate that Neo, even when bootstrapped from a simple optimizer like PostgreSQL, can learn a model that offers similar performance to state-of-the-art commercial optimizers, and in some cases even surpass them.\n ', "\n Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load.\n ", '\n Despite growing adoption of cryptocurrencies, making fast payments at scale remains a challenge. Payment channel networks (PCNs) such as the Lightning Network have emerged as a viable scaling solution. 
However, completing payments on PCNs is challenging: payments must be routed on paths with sufficient funds. As payments flow over a single channel (link) in the same direction, the channel eventually becomes depleted and cannot support further payments in that direction; hence, naive routing schemes like shortest-path routing can deplete key payment channels and paralyze the system. Today\'s PCNs also route payments atomically, worsening the problem. In this paper, we present Spider, a routing solution that "packetizes" transactions and uses a multi-path transport protocol to achieve high-throughput routing in PCNs. Packetization allows Spider to complete even large transactions on low-capacity payment channels over time, while the multi-path congestion control protocol ensures balanced utilization of channels and fairness across flows. Extensive simulations comparing Spider with state-of-the-art approaches shows that Spider requires less than 25% of the funds to successfully route over 95% of transactions on balanced traffic demands, and offloads 4x more transactions onto the PCN on imbalanced demands.\n ', '\n Prior research has proposed technical solutions to use peer-to-peer (P2P) content delivery to serve Internet video, showing that it can reduce costs to content providers. Yet, such methods have not become widespread except for a few niche instances. An important challenge is incentivization: what tangible benefits does P2P content delivery offer users who bring resources to the table? In this paper, we ask whether monetary incentives can help attract peers in P2P content delivery systems. We commissioned a professional survey of people around the United States to answer several relevant questions. We found that 51% of the 876 respondents--substantially larger than our expectations--answered "yes" to whether they would participate for suitable financial incentives. Encouraged by the results of the survey, we propose Gringotts, a system to structure incentives and securely incorporate P2P delivery into content delivery systems. Gringotts provides a novel Proof of Delivery mechanism that allows content providers to verify correct delivery of their files, and shows how to use cryptocurrency to pay peers while guarding against liars and Sybil attacks.\n ', '\n We consider reinforcement learning in input-driven environments, where an exogenous, stochastic input process affects the dynamics of the system. Input processes arise in many applications, including queuing systems, robotics control with disturbances, and object tracking. Since the state dynamics and rewards depend on the input process, the state alone provides limited information for the expected future returns. Therefore, policy gradient methods with standard state-dependent baselines suffer high variance during training. We derive a bias-free, input-dependent baseline to reduce this variance, and analytically show its benefits over state-dependent baselines. We then propose a meta-learning approach to overcome the complexity of learning a baseline that depends on a long sequence of inputs. Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies.\n ', "\n Homa is a new transport protocol for datacenter networks. 
It provides exceptionally low latency, especially for workloads with a high volume of very short messages, and it also supports large messages and high network utilization. Homa uses in-network priority queues to ensure low latency for short messages; priority allocation is managed dynamically by each receiver and integrated with a receiver-driven flow control mechanism. Homa also uses controlled overcommitment of receiver downlinks to ensure efficient bandwidth utilization at high load. Our implementation of Homa delivers 99th percentile round-trip times less than 15μs for short messages on a 10 Gbps network running at 80% load. These latencies are almost 100x lower than the best published measurements of an implementation. In simulations, Homa's latency is roughly equal to pFabric and significantly better than pHost, PIAS, and NDP for almost all message sizes and workloads. Homa can also sustain higher network loads than pFabric, pHost, or PIAS.\n ", '\n This paper introduces Nimbus, a robust technique to detect whether the cross traffic competing with a flow is "elastic", and shows that this elasticity detector improves congestion control. If cross traffic is inelastic, then a sender can control queueing delays while achieving high throughput, but in the presence of elastic traffic, it may lose throughput if it attempts to control packet delay. To estimate elasticity, Nimbus modulates the flow\'s sending rate with sinusoidal pulses that create small traffic fluctuations at the bottleneck link, and measures the frequency response of the rate of the cross traffic. Our results on emulated and real-world paths show that congestion control using elasticity detection achieves throughput comparable to Cubic, but with delays that are 50-70 ms lower when cross traffic is inelastic. Nimbus detects the nature of the cross traffic more accurately than Copa, and is usable as a building block by other end-to-end algorithms.\n ', '\n Neural networks have been shown to be an effective tool for learning algorithms over graph-structured data. However, graph representation techniques---that convert graphs to real-valued vectors for use with neural networks---are still in their infancy. Recent works have proposed several approaches (e.g., graph convolutional networks), but these methods have difficulty scaling and generalizing to graphs with different sizes and shapes. We present Graph2Seq, a new technique that represents vertices of graphs as infinite time-series. By not limiting the representation to a fixed dimension, Graph2Seq scales naturally to graphs of arbitrary sizes and shapes. Graph2Seq is also reversible, allowing full recovery of the graph structure from the sequences. By analyzing a formal computational model for graph representation, we show that an unbounded sequence is necessary for scalability. Our experimental results with Graph2Seq show strong generalization and new state-of-the-art performance on a variety of graph combinatorial optimization problems.\n ', '\n Mapping road networks is currently both expensive and labor-intensive. High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity. We show that these segmentation methods have high error rates because noisy CNN outputs are difficult to correct. 
We propose RoadTracer, a new method to automatically construct accurate road network maps from aerial images. RoadTracer uses an iterative search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We compare our approach with a segmentation method on fifteen cities, and find that at a 5% error rate, RoadTracer correctly captures 45% more junctions across these cities.\n ', '\n As its price per bit drops, SSD is increasingly becoming the default storage medium for cloud application databases. However, it has not become the preferred storage medium for key-value caches, even though SSD offers more than 10x lower price per bit and sufficient performance compared to DRAM. This is because key-value caches need to frequently insert, update and evict small objects. This causes excessive writes and erasures on flash storage, since flash only supports writes and erasures of large chunks of data. These excessive writes and erasures significantly shorten the lifetime of flash, rendering it impractical to use for key-value caches. We present Flashield, a hybrid key-value cache that uses DRAM as a "filter" to minimize writes to SSD. Flashield performs light-weight machine learning profiling to predict which objects are likely to be read frequently before getting updated; these objects, which are prime candidates to be stored on SSD, are written to SSD in large chunks sequentially. In order to efficiently utilize the cache\'s available memory, we design a novel in-memory index for the variable-sized objects stored on flash that requires only 4 bytes per object in DRAM. We describe Flashield\'s design and implementation and, we evaluate it on a real-world cache trace. Compared to state-of-the-art systems that suffer a write amplification of 2.5x or more, Flashield maintains a median write amplification of 0.5x without any loss of hit rate or throughput.\n ', '\n Switches today provide a small set of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign.\n Our design builds on the observation that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further, in many scheduling algorithms these decisions can be made when packets are enqueued. We leverage this observation to build a programmable scheduler using a single abstraction: the push-in first-out queue (PIFO), a priority queue that maintains the scheduling order and time for such algorithms.\n We show that a programmable scheduler using PIFOs lets us program a wide variety of scheduling algorithms. We present a detailed hardware design for this scheduler for a 64-port 10 Gbit/s shared-memory switch with <4% chip area overhead on a 16-nm standard-cell library. Our design lets us program many sophisticated algorithms, such as a 5-level hierarchical scheduler with programmable scheduling algorithms at each level.\n ', "\n Many algorithms for congestion control, scheduling, network measurement, active queue management, security, and load balancing require custom processing of packets as they traverse the data plane of a network switch. To run at line rate, these data-plane algorithms must be in hardware. 
With today's switch hardware, algorithms cannot be changed, nor new algorithms installed, after a switch has been built.\n This paper shows how to program data-plane algorithms in a high-level language and compile those programs into low-level microcode that can run on emerging programmable line-rate switching chipsets. The key challenge is that these algorithms create and modify algorithmic state. The key idea to achieve line-rate programmability for stateful algorithms is the notion of a packet transaction : a sequential code block that is atomic and isolated from other such code blocks. We have developed this idea in Domino, a C-like imperative language to express data-plane algorithms. We show with many examples that Domino provides a convenient and natural way to express sophisticated data-plane algorithms, and show that these algorithms can be run at line rate with modest estimated die-area overhead.\n ", "\n Hybrid switching - in which a high bandwidth circuit switch (optical or wireless) is used in conjunction with a low bandwidth packet switch - is a promising alternative to interconnect servers in today's large scale data-centers. Circuit switches offer a very high link rate, but incur a non-trivial reconfiguration delay which makes their scheduling challenging. In this paper, we demonstrate a lightweight, simple and nearly-optimal scheduling algorithm that trades-off configuration costs with the benefits of reconfiguration that match the traffic demands. The algorithm has strong connections to submodular optimization, has performance at least half that of the optimal schedule and strictly outperforms state of the art in a variety of traffic demand settings. These ideas naturally generalize: we see that indirect routing leads to exponential connectivity; this is another phenomenon of the power of multi hop routing, distinct from the well-known load balancing effects.\n ", '\n This paper presents a practical approach to rapidly introduce new dataplane functionality into networks: End-hosts embed tiny programs into packets to actively query and manipulate a network\'s internal state. We show how this "tiny packet program" (TPP) interface gives end-hosts unprecedented visibility into network behavior, enabling them to work with the network to achieve a common goal. Our design leverages what each component does best: (a) switches forward and execute tiny packet programs (at most 5 instructions) at line rate, and (b) end-hosts perform arbitrary computation on network state, which are easy to evolve. Using a hardware prototype on a NetFPGA, we show our design is feasible, at a reasonable cost. By implementing three different research proposals, we show that TPPs are also useful. And finally, we present an architecture in which they can be made secure.\n ', '\n This document describes an attempt to develop a compiler-based approach for computations with symmetric tensors. Given a computation and the symmetries of its input tensors, we derive formulas for random access under a storage scheme that eliminates redundancies; construct intermediate representations to describe the loop structure; and translate this information, using the taco tensor algebra compiler, into code. While we achieve a framework for reasoning about a fairly general class of symmetric computations, the resulting code is not performant when the symmetries are misaligned.\n ', '\n The low-energy spectrum and scattering of two-nucleon systems are studied with lattice quantum chromodynamics using a variational approach. 
A wide range of interpolating operators are used: dibaryon operators built from products of plane-wave nucleons, hexaquark operators built from six localized quarks, and quasi-local operators inspired by two-nucleon bound-state wavefunctions in low-energy effective theories. Sparsening techniques are used to compute the timeslice-to-all quark propagators required to form correlation-function matrices using products of these operators. Projection of these matrices onto irreducible representations of the cubic group, including spin-orbit coupling, is detailed. Variational methods are applied to constrain the low-energy spectra of two-nucleon systems in a single finite volume with quark masses corresponding to a pion mass of 806 MeV. Results for S- and D-wave phase shifts in the isospin singlet and triplet channels are obtained under the assumption that partial-wave mixing is negligible. Tests of interpolating-operator dependence are used to investigate the reliability of the energy spectra obtained and highlight both the strengths and weaknesses of variational methods. These studies and comparisons to previous studies using the same gauge-field ensemble demonstrate that interpolating-operator dependence can lead to significant effects on the two-nucleon energy spectra obtained using both variational and non-variational methods, including missing energy levels and other discrepancies. While this study is inconclusive regarding the presence of two-nucleon bound states at this quark mass, it provides robust upper bounds on two-nucleon energy levels that can be improved in future calculations using additional interpolating operators and is therefore a step toward reliable nuclear spectroscopy from the underlying Standard Model of particle physics.\n ', '\n Enabling compilers to automatically optimize code has been a longstanding goal for the compiler community. Efficiently solving this problem requires using precise cost models. These models predict whether applying a sequence of code transformations reduces the execution time of the program. Building an analytical cost model to do so is hard in modern x86 architectures due to the complexity of the microarchitecture. In this paper, we present a novel deep learning based cost model for automatic code optimization. This model was integrated in a search method and implemented in the Tiramisu compiler to select the best code transformations. The input of the proposed model is a set of simple features representing the unoptimized code and a sequence of code transformations. The model predicts the speedup expected when the code transformations are applied. Unlike previous models, the proposed one works on full programs and does not rely on any heavy feature engineering. The proposed model has only 16% of mean absolute percentage error in predicting speedups on full programs. The proposed model enables Tiramisu to automatically find code transformations that match or are better than state-of-the-art compilers without requiring the same level of heavy feature engineering required by those compilers.\n ', '\n The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations and hardware platforms to use. 
The GraphIt programming language makes it easy for the programmer to write the algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high performance code for GPUs. Programmers must resort to re-implementing the entire algorithm from scratch in a low-level language with an entirely different set of abstractions and optimizations in order to achieve high performance on GPUs.\n We propose GG, an extension to the GraphIt compiler framework, that achieves high performance on both CPUs and GPUs using the same algorithm specification. GG significantly expands the optimization space of GPU graph processing frameworks with a novel GPU scheduling language and compiler that enables combining graph optimizations for GPUs. GG also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balancing by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate GG on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11x speedup over state-of-the-art GPU graph processing frameworks, and is the fastest on 66 out of the 90 experiments.\n ', '\n We present a new algorithm for transposing sparse tensors called Quesadilla. The algorithm converts the sparse tensor data structure to a list of coordinates and sorts it with a fast multi-pass radix algorithm that exploits knowledge of the requested transposition and the tensors input partial coordinate ordering to provably minimize the number of parallel partial sorting passes. We evaluate both a serial and a parallel implementation of Quesadilla on a set of 19 tensors from the FROSTT collection, a set of tensors taken from scientific and data analytic applications. We compare Quesadilla and a generalization, Top-2-sadilla to several state of the art approaches, including the tensor transposition routine used in the SPLATT tensor factorization library. In serial tests, Quesadilla was the best strategy for 60% of all tensor and transposition combinations and improved over SPLATT by at least 19% in half of the combinations. In parallel tests, at least one of Quesadilla or Top-2-sadilla was the best strategy for 52% of all tensor and transposition combinations.\n ', '\n In this paper, we demonstrate a compiler that can optimize sparse and recurrent neural networks, both of which are currently outside of the scope of existing neural network compilers (sparse neural networks here stand for networks that can be accelerated with sparse tensor algebra techniques). Our demonstration includes a mapping of sparse and recurrent neural networks to the polyhedral model along with an implementation of our approach in TIRAMISU, our state-of-the-art polyhedral compiler. We evaluate our approach on a set of deep learning benchmarks and compare our results with hand-optimized industrial libraries. Our results show that our approach at least matches Intel MKL-DNN and in some cases outperforms it by 5x (on multicore-CPUs).\n ', "\n This paper shows how to generate code that efficiently converts sparse tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL, and many others. 
We decompose sparse tensor conversion into three logical phases: coordinate remapping, analysis, and assembly. We then develop a language that precisely describes how different formats group together and order a tensor's nonzeros in memory. This lets a compiler emit code that performs complex remappings of nonzeros when converting between formats. We also develop a query language that can extract statistics about sparse tensors, and we show how to emit efficient analysis code that computes such queries. Finally, we define an abstract interface that captures how data structures for storing a tensor can be efficiently assembled given specific statistics about the tensor. Disparate formats can implement this common interface, thus letting a compiler emit optimized sparse tensor conversion code for arbitrary combinations of many formats without hard-coding for any specific combination.\n Our evaluation shows that the technique generates sparse tensor conversion routines with performance between 1.00 and 2.01$\\times$ that of hand-optimized versions in SPARSKIT and Intel MKL, two popular sparse linear algebra libraries. And by emitting code that avoids materializing temporaries, which both libraries need for many combinations of source and target formats, our technique outperforms those libraries by 1.78 to 4.01$\\times$ for CSC/COO to DIA/ELL conversion.\n ", '\n We address the problem of optimizing mixed sparse and dense tensor algebra in a compiler. We show that standard loop transformations, such as strip-mining, tiling, collapsing, parallelization and vectorization, can be applied to irregular loops over sparse iteration spaces. We also show how these transformations can be applied to the contiguous value arrays of sparse tensor data structures, which we call their position space, to unlock load-balanced tiling and parallelism.\n We have prototyped these concepts in the open-source TACO system, where they are exposed as a scheduling API similar to the Halide domain-specific language for dense computations. Using this scheduling API, we show how to optimize mixed sparse/dense tensor algebra expressions, how to generate load-balanced code by scheduling sparse tensor algebra in position space, and how to generate sparse tensor algebra GPU code. Our evaluation shows that our transformations let us generate good code that is competitive with many hand-optimized implementations from the literature.\n ', '\n Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedup over their unordered counterparts by reducing redundant work. This paper introduces a new priority-based extension to GraphIt, a domain-specific language for writing graph applications, to simplify writing high-performance parallel ordered graph algorithms. The extension enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. We extend the compiler with new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together different rounds of ordered algorithms to reduce synchronization overhead, resulting in $1.2\\times$--3$\\times$ speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. 
With the extension, GraphIt achieves up to 3$\\times$ speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.\n ', "\n Modern microprocessors are equipped with Single Instruction Multiple Data (SIMD) or vector instructions which expose data level parallelism at a fine granularity. Programmers exploit this parallelism by using low-level vector intrinsics in their code. However, once programs are written using vector intrinsics of a specific instruction set, the code becomes non-portable. Modern compilers are unable to analyze and retarget the code to newer vector instruction sets. Hence, programmers have to manually rewrite the same code using vector intrinsics of a newer generation to exploit higher data widths and capabilities of new instruction sets. This process is tedious, error-prone and requires maintaining multiple code bases. We propose Revec, a compiler optimization pass which revectorizes already vectorized code, by retargeting it to use vector instructions of newer generations. The transformation is transparent, happening at the compiler intermediate representation level, and enables performance portability of hand-vectorized code.\n Revec can achieve performance improvements in real-world performance critical kernels. In particular, Revec achieves geometric mean speedups of 1.160$\\times$ and 1.430$\\times$ on fast integer unpacking kernels, and speedups of 1.145$\\times$ and 1.195$\\times$ on hand-vectorized x265 media codec kernels when retargeting their SSE-series implementations to use AVX2 and AVX-512 vector instructions respectively. We also extensively test Revec's impact on 216 intrinsic-rich implementations of image processing and stencil kernels relative to hand-retargeting.\n ", "\n Predicting the number of clock cycles a processor takes to execute a block of assembly instructions in steady state (the throughput) is important for both compiler designers and performance engineers. Building an analytical model to do so is especially complicated in modern x86-64 Complex Instruction Set Computer (CISC) machines with sophisticated processor microarchitectures in that it is tedious, error prone, and must be performed from scratch for each processor generation. In this paper we present Ithemal, the first tool which learns to predict the throughput of a set of instructions. Ithemal uses a hierarchical LSTM--based approach to predict throughput based on the opcodes and operands of instructions in a basic block. We show that Ithemal is more accurate than state-of-the-art hand-written tools currently used in compiler backends and static machine code analyzers. In particular, our model has less than half the error of state-of-the-art analytical models (LLVM's llvm-mca and Intel's IACA). Ithemal is also able to predict these throughput values just as fast as the aforementioned tools, and is easily ported across a variety of processor microarchitectures with minimal developer effort.\n ", '\n Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for in-flight memory requests. 
These resources, however, often exhibit poor utilization rates on workloads with large working sets, e.g., in-memory databases, key-value stores, and graph analytics, as compilers and hardware struggle to expose ILP and MLP from the instruction stream automatically.\n In this paper, we introduce the IMLP (Instruction and Memory Level Parallelism) task programming model. IMLP tasks execute as coroutines that yield execution at annotated long-latency operations, e.g., memory accesses, divisions, or unpredictable branches. IMLP tasks are interleaved on a single thread, and integrate well with thread parallelism and vectorization. Our DSL embedded in C++, Cimple, allows exploration of task scheduling and transformations, such as buffering, vectorization, pipelining, and prefetching.\n We demonstrate state-of-the-art performance on core algorithms used in in-memory databases that operate on arrays, hash tables, trees, and skip lists. Cimple applications reach 2.5x throughput gains over hardware multithreading on a multi-core, and 6.4x single thread speedup.\n ', '\n The performance bottlenecks of graph applications depend not only on the algorithm and the underlying hardware, but also on the size and structure of the input graph. Programmers must try different combinations of a large set of techniques to develop the best implementation for a specific algorithm and type of graph. Existing graph frameworks lack flexibility, supporting only a limited set of optimizations.\n This paper introduces GraphIt, a new DSL for graph computations that generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures. GraphIt separates what is computed (algorithm) from how it is computed (schedule). Programmers specify the algorithm using an algorithm language, and performance optimizations are specified using a scheduling language. The algorithm language simplifies expressing the algorithms. We formulate graph optimizations, including edge traversal direction, data layout, parallelization, cache, NUMA, and kernel fusion optimizations, as tradeoffs among locality, parallelism, and work-efficiency. The scheduling language enables programmers to easily search through this complicated tradeoff space by composing together optimizations. We also built an autotuner to automatically find high-performance schedules. The compiler uses a new scheduling representation, the graph iteration space, to model, compose, and ensure the validity of the large number of optimizations. GraphIt outperforms the next fastest of six state-of-the-art shared-memory frameworks (Ligra, Green-Marl, GraphMat, Galois, Gemini, and Grazelle) on 24 out of 32 experiments by up to 4.8$\\times$, and is never more than 43% slower than the fastest framework on the other experiments. GraphIt also reduces the lines of code by up to an order of magnitude compared to the next fastest framework.\n ', '\n This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. 
Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.\n ', '\n This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any tensor algebra expression on any combination of the aforementioned formats.\n To demonstrate our technique, we have implemented it in the taco tensor algebra compiler. Our modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations. Furthermore, by extending taco to support a wider range of formats specialized for different application and data characteristics, we can improve end-user application performance. For example, if input data is provided in the COO format, our technique allows computing a single matrix-vector multiplication directly with the data in COO, which is up to 3.6$\\times$ faster than by first converting the data to CSR.\n ', "\n Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local, and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.\n ", '\n In this position paper, we describe our vision of the future of machine programming through a categorical examination of three pillars of research. Those pillars are: (i) intention, (ii) invention, and (iii) adaptation. 
Intention emphasizes advancements in the human-to-computer and computer-to-machine-learning interfaces. Invention emphasizes the creation or refinement of algorithms or core hardware and software building blocks through machine learning (ML). Adaptation emphasizes advances in the use of ML-based constructs to autonomously evolve software.\n ', '\n High-performance DSL developers work hard to take advantage of modern hardware. The DSL compilers have to build their own complex middle-ends before they can target a common back-end such as LLVM, which only handles single instruction streams with SIMD instructions. We introduce Tiramisu, a common middle-end that can generate efficient code for modern processors and accelerators such as multicores, GPUs, FPGAs and distributed clusters. Tiramisu introduces a novel three-level IR that separates the algorithm, how that algorithm is executed, and where intermediate data are stored. This separation simplifies optimization and makes targeting multiple hardware architectures from the same algorithm easier. As a result, DSL compilers can be made considerably less complex with no loss of performance while immediately targeting multiple hardware or hardware combinations such as distributed nodes with both CPUs and GPUs. We evaluated Tiramisu by creating a new middle-end for the Halide and Julia compilers. We show that Tiramisu extends Halide and Julia with many new capabilities including the ability to: express new algorithms (such as recurrent filters and non-rectangular iteration spaces), perform new complex loop nest transformations (such as wavefront parallelization, loop shifting and loop fusion) and generate efficient code for more architectures (such as combinations of distributed clusters, multicores, GPUs and FPGAs). Finally, we demonstrate that Tiramisu can generate very efficient code that matches the highly optimized Intel MKL gemm (generalized matrix multiplication) implementation, we also show speedups reaching 4X in Halide and 16X in Julia due to optimizations enabled by Tiramisu.\n ', '\n This paper shows how to optimize sparse tensor algebraic expressions by introducing temporary tensors, called workspaces, into the resulting loop nests. We develop a new intermediate language for tensor operations called concrete index notation that extends tensor index notation. Concrete index notation expresses when and where sub-computations occur and what tensor they are stored into. We then describe the workspace optimization in this language, and how to compile it to sparse code by building on prior work in the literature.\n We demonstrate the importance of the optimization on several important sparse tensor kernels, including sparse matrix-matrix multiplication (SpMM), sparse tensor addition (SpAdd), and the matricized tensor times Khatri-Rao product (MTTKRP) used to factorize tensors. Our results show improvements over prior work on tensor algebra compilation and brings the performance of these kernels on par with state-of-the-art hand-optimized implementations. For example, SpMM was not supported by prior tensor algebra compilers, the performance of MTTKRP on the nell-2 data set improves by 35%, and MTTKRP can for the first time have sparse results.\n ', '\n Data analytics applications combine multiple functions from different libraries and frameworks. 
Even when each function is optimized in isolation, the performance of the combined application can be an order of magnitude below hardware limits due to extensive data movement across these functions. To address this problem, we propose Weld, a new interface between data-intensive libraries that can optimize across disjoint libraries and functions. Weld exposes a lazily-evaluated API where diverse functions can submit their computations in a simple but general intermediate representation that captures their data-parallel structure. It then optimizes data movement across these functions and emits efficient code for diverse hardware. Weld can be integrated into existing frameworks such as Spark, TensorFlow, Pandas and NumPy without changing their user-facing APIs. We demonstrate that Weld can speed up applications using these frameworks by up to 29x.\n ', '\n Modern hardware systems are heavily underutilized when running large-scale graph applications. While many in-memory graph frameworks have made substantial progress in optimizing these applications, we show that it is still possible to achieve up to 4 $\\times$ speedups over the fastest frameworks by greatly improving cache utilization. Previous systems have applied out-of-core processing techniques from the memory/disk boundary to the cache/DRAM boundary. However, we find that blindly applying such techniques is ineffective because of the much smaller performance gap between DRAM and cache. We present two techniques that take advantage of the cache with minimal or no instruction overhead. The first, frequency based clustering, groups together frequently accessed vertices to improve the utilization of each cache line with no runtime overhead. The second, CSR segmenting, partitions the graph to restrict all random accesses to the cache, makes all DRAM access sequential, and merges partition results using a very low overhead cache-aware merge. Both techniques can be easily implemented on top of optimized graph frameworks. Our techniques combined give speedups of up to 4 $\\times$ for PageRank, Label Propagation and Collaborative Filtering, and 2 $\\times$ for Betweenness Centrality over the best published results\n ', '\n After a neural sequence model encounters an unexpected token, can its behavior be predicted? We show that RNN and transformer language models exhibit structured, consistent generalization in out-of-distribution contexts. We begin by introducing two idealized models of generalization in next-word prediction: a local context model in which generalization is consistent with the last word observed, and a global context model in which generalization is consistent with the global structure of the input. In experiments in English, Finnish, Mandarin, and random regular languages, we demonstrate that neural language models interpolate between these two forms of generalization: their predictions are well-approximated by a log-linear combination of local and global predictive distributions. We then show that, in some languages, noise mediates the two forms of generalization: noise applied to input tokens encourages global generalization, while noise in history representations encourages local generalization. Finally, we offer a preliminary theoretical explanation of these results by proving that the observed interpolation behavior is expected in log-linear models with a particular feature correlation structure. 
These results help explain the effectiveness of two popular regularization schemes and show that aspects of sequence model generalization can be understood and controlled.\n ', '\n Few-shot class incremental learning -- the problem of updating a trained classifier to discriminate among an expanded set of classes with limited labeled data -- is a key challenge for machine learning systems deployed in non-stationary environments. Existing approaches to the problem rely on complex model architectures and training procedures that are difficult to tune and re-use. In this paper, we present an extremely simple approach that enables the use of ordinary logistic regression classifiers for few-shot incremental learning. The key to this approach is a new family of subspace regularization schemes that encourage weight vectors for new classes to lie close to the subspace spanned by the weights of existing classes. When combined with pretrained convolutional feature extractors, logistic regression models trained with subspace regularization outperform specialized, state-of-the-art approaches to few-shot incremental image classification by up to 22% on the miniImageNet dataset. Because of its simplicity, subspace regularization can be straightforwardly extended to incorporate additional background information about the new classes (including class names and descriptions specified in natural language); these further improve accuracy by up to 2%. Our results show that simple geometric regularization of class representations offers an effective tool for continual learning.\n ', "\n A large body of recent work has identified transformations in the latent spaces of generative adversarial networks (GANs) that consistently and interpretably transform generated images. But existing techniques for identifying these transformations rely on either a fixed vocabulary of pre-specified visual concepts, or on unsupervised disentanglement techniques whose alignment with human judgments about perceptual salience is unknown. This paper introduces a new method for building open-ended vocabularies of primitive visual concepts represented in a GAN's latent space. Our approach is built from three components: (1) automatic identification of perceptually salient directions based on their layer selectivity; (2) human annotation of these directions with free-form, compositional natural language descriptions; and (3) decomposition of these annotations into a visual concept vocabulary, consisting of distilled directions labeled with single words. Experiments show that concepts learned with our approach are reliable and composable -- generalizing across classes, contexts, and observers, and enabling fine-grained manipulation of image style and content.\n ", '\n We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making. We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions, and these descriptions generate sequences of low-level actions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks, using only a small number of seed annotations to ground language in action. 
In trained models, the space of natural language commands indexes a combinatorial library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals. We evaluate this approach in the ALFRED household simulation environment, providing natural language annotations for only 10% of demonstrations. It completes more than twice as many tasks as a standard approach to learning from demonstrations, matching the performance of instruction following models with access to ground-truth plans during both training and evaluation.\n ', '\n Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains -- string editing, image composition, and abstract reasoning about scenes -- even when no natural language hints are available at test time.\n ', '\n Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulations -- including shuffling word order within sentences and deleting all words other than nouns -- remove less than 15% of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.\n ', "\n Sequence-to-sequence transduction is the core problem in language processing applications as diverse as semantic parsing, machine translation, and instruction following. The neural network models that provide the dominant solution to these problems are brittle, especially in low-resource settings: they fail to generalize correctly or systematically from small datasets. Past work has shown that many failures of systematic generalization arise from neural models' inability to disentangle lexical phenomena from syntactic ones. To address this, we augment neural decoders with a lexical translation mechanism that generalizes existing copy mechanisms to incorporate learned, decontextualized, token-level translation rules. We describe how to initialize this mechanism using a variety of lexicon learning algorithms, and show that it improves systematic generalization on a diverse set of sequence modeling tasks drawn from cognitive science, formal semantics, and machine translation.\n ", "\n Does the effectiveness of neural language models derive entirely from accurate modeling of surface word co-occurrence statistics, or do these models represent and reason about the world they describe? 
In BART and T5 transformer language models, we identify contextual word representations that function as models of entities and situations as they evolve throughout a discourse. These neural representations have functional similarities to linguistic models of dynamic semantics: they support a linear readout of each entity's current properties and relations, and can be manipulated with predictable effects on language generation. Our results indicate that prediction in pretrained neural language models is supported, at least in part, by dynamic representations of meaning and implicit simulation of entity state, and that this behavior can be learned with only text as training data. Code and data are available at https://github.com/belindal/state-probes .\n ", "\n Black-box probing models can reliably extract linguistic features like tense, number, and syntactic role from pretrained word representations. However, the manner in which these features are encoded in representations remains poorly understood. We present a systematic study of the linear geometry of contextualized word representations in ELMO and BERT. We show that a variety of linguistic features (including structured dependency relationships) are encoded in low-dimensional subspaces. We then refine this geometric picture, showing that there are hierarchical relations between the subspaces encoding general linguistic categories and more specific ones, and that low-dimensional feature encodings are distributed rather than aligned to individual neurons. Finally, we demonstrate that these linear subspaces are causally related to model behavior, and can be used to perform fine-grained manipulation of BERT's output distribution.\n ", '\n The past decade has witnessed a groundbreaking rise of machine learning for human language analysis, with current methods capable of automatically accurately recovering various aspects of syntax and semantics - including sentence structure and grounded word meaning - from large data collections. Recent research showed the promise of such tools for analyzing acoustic communication in nonhuman species. We posit that machine learning will be the cornerstone of future collection, processing, and analysis of multimodal streams of data in animal communication studies, including bioacoustic, behavioral, biological, and environmental data. Cetaceans are unique non-human model species as they possess sophisticated acoustic communications, but utilize a very different encoding system that evolved in an aquatic rather than terrestrial medium. Sperm whales, in particular, with their highly-developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding make for an excellent starting point for advanced machine learning tools that can be applied to other animals in the future. This paper details a roadmap toward this goal based on currently existing technology and multidisciplinary scientific community effort. We outline the key elements required for the collection and processing of massive bioacoustic data of sperm whales, detecting their basic communication units and language-like higher-level structures, and validating these models through interactive playback experiments. 
The technological capabilities developed by such an undertaking are likely to yield cross-applications and advancements in broader communities investigating non-human communication and animal behavioral research.\n ', "\n When intelligent agents communicate to accomplish shared goals, how do these goals shape the agents' language? We study the dynamics of learning in latent language policies (LLPs), in which instructor agents generate natural-language subgoal descriptions and executor agents map these descriptions to low-level actions. LLPs can solve challenging long-horizon reinforcement learning problems and provide a rich model for studying task-oriented language use. But previous work has found that LLP training is prone to semantic drift (use of messages in ways inconsistent with their original natural language meanings). Here, we demonstrate theoretically and empirically that multitask training is an effective counter to this problem: we prove that multitask training eliminates semantic drift in a well-studied family of signaling games, and show that multitask training of neural LLPs in a complex strategy game reduces drift while improving sample efficiency.\n ", '\n Synthesizing programs from examples requires searching over a vast, combinatorial space of possible programs. In this search process, a key challenge is representing the behavior of a partially written program before it can be executed, to judge if it is on the right track and predict where to search next. We introduce a general technique for representing partially written programs in a program synthesis engine. We take inspiration from the technique of abstract interpretation, in which an approximate execution model is used to determine if an unfinished program will eventually satisfy a goal specification. Here we learn an approximate execution model implemented as a modular neural network. By constructing compositional program representations that implicitly encode the interpretation semantics of the underlying programming language, we can represent partial programs using a flexible combination of concrete execution state and learned neural representations, using the learned approximate semantics when concrete semantics are not known (in unfinished parts of the program). We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs, such as loops and higher-order functions, and can be used to synthesize programs more accurately for a given search budget than pure neural approaches in several domains.\n ', '\n Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data -- particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. 
Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems -- instruction following (SCAN) and morphological analysis (SIGMORPHON 2018) -- where R&R enables learning of new constructions and tenses from as few as eight initial examples.\n ', '\n We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset and code for replicating experiments are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.\n ', '\n We propose and demonstrate a novel machine learning algorithm that assesses pulmonary edema severity from chest radiographs. While large publicly available datasets of chest radiographs and free-text radiology reports exist, only limited numerical edema severity labels can be extracted from radiology reports. This is a significant challenge in learning such models for image classification. To take advantage of the rich information present in the radiology reports, we develop a neural network model that is trained on both images and free-text to assess pulmonary edema severity from chest radiographs at inference time. Our experimental results suggest that the joint image-text representation learning improves the performance of pulmonary edema assessment compared to a supervised model trained on images only. We also show the use of the text for explaining the image classification by the joint model. To the best of our knowledge, our approach is the first to leverage free-text radiology reports for improving the image model performance in this application. Our code is available at https://github.com/RayRuizhiLiao/joint_chestxray.\n ', '\n We present a randomized controlled trial for a model-in-the-loop regression task, with the goal of measuring the extent to which (1) good explanations of model predictions increase human accuracy, and (2) faulty explanations decrease human trust in the model. We study explanations based on visual saliency in an image-based age prediction task for which humans and learned models are individually capable but not highly proficient and frequently disagree. Our experimental design separates model quality from explanation quality, and makes it possible to compare treatments involving a variety of explanations of varying levels of quality. We find that presenting model predictions improves human accuracy. 
However, visual explanations of various kinds fail to significantly alter human accuracy or trust in the model - regardless of whether explanations characterize an accurate model, an inaccurate one, or are generated randomly and independently of the input image. These findings suggest the need for greater evaluation of explanations in downstream decision making tasks, better design-based tools for presenting explanations to users, and better approaches for generating explanations.\n ', '\n We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. Compared to prior work that uses atomic labels as explanations, analyzing neurons compositionally allows us to more precisely and expressively characterize their behavior. We use this procedure to answer several questions on interpretability in models for vision and natural language processing. First, we examine the kinds of abstractions learned by neurons. In image classification, we find that many neurons learn highly abstract but semantically coherent visual concepts, while other polysemantic neurons detect multiple unrelated features; in natural language inference (NLI), neurons learn shallow lexical heuristics from dataset biases. Second, we see whether compositional explanations give us insight into model performance: vision neurons that detect human-interpretable concepts are positively correlated with task performance, while NLI neurons that fire for shallow heuristics are negatively correlated with task performance. Finally, we show how compositional explanations provide an accessible way for end users to produce simple "copy-paste" adversarial examples that change model behavior in predictable ways.\n ', "\n Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general purpose technique for ``simulation-to-real'' transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop models that can interpret natural utterances without natural training data. We begin with a synthetic data generation procedure, and train a model that can accurately interpret utterances produced by the data generator. To generalize to natural utterances, we automatically find projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, our approach matches or outperforms state-of-the-art models trained on natural language data in several domains. These results suggest that simulation-to-real transfer is a practical framework for developing NLP applications, and that improved models for transfer might provide wide-ranging improvements in downstream tasks.\n ", '\n Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates. Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world. 
It is this shared experience that makes utterances meaningful.\n Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication.\n ', '\n Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts ("greet the pink brontosaurus by the ferris wheel"). Modern neural networks, by contrast, struggle to interpret novel compositions. In this paper, we introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding. Going beyond a related benchmark that focused on syntactic aspects of generalization, gSCAN defines a language grounded in the states of a grid world, facilitating novel evaluations of acquiring linguistically motivated rules. For example, agents must understand how adjectives such as \'small\' are interpreted relative to the current world state or how adverbs such as \'cautiously\' combine with new verbs. We test a strong multi-modal baseline model and a state-of-the-art compositional method finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules.\n ', '\n To be successful in real-world tasks, Reinforcement Learning (RL) needs to exploit the compositional, relational, and hierarchical structure of the world, and learn to transfer it to the task at hand. Recent advances in representation learning for language make it possible to build models that acquire world knowledge from text corpora and integrate this knowledge into downstream decision making problems. We thus argue that the time is right to investigate a tight integration of natural language understanding into RL in particular. We survey the state of the field, including work on instruction following, text games, and learning from textual domain knowledge. Finally, we call for the development of new environments as well as further investigation into the potential uses of recent Natural Language Processing (NLP) techniques for such tasks.\n ', '\n We propose a simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models. Under this protocol, synthetic training examples are constructed by taking real training examples and replacing (possibly discontinuous) fragments with other fragments that appear in at least one similar environment. The protocol is model-agnostic and useful for a variety of tasks. Applied to neural sequence-to-sequence models, it reduces error rate by as much as 87% on diagnostic tasks from the SCAN dataset and 16% on a semantic parsing task. Applied to n-gram language models, it reduces perplexity by roughly 1% on small corpora in several languages.\n ', '\n We improve the informativeness of models for conditional text generation using techniques from computational pragmatics. These techniques formulate language production as a game between speakers and listeners, in which a speaker should generate output text that a listener can use to correctly identify the original input that the text describes. 
While such approaches are widely used in cognitive science and grounded language learning, they have received less attention for more standard language generation tasks. We consider two pragmatic modeling methods for text generation: one where pragmatics is imposed by information preservation, and another where pragmatics is imposed by explicit modeling of distractors. We find that these methods improve the performance of strong existing systems for abstractive summarization and generation from structured meaning representations.\n ', "\n Many machine learning algorithms represent input data with vector embeddings or discrete codes. When inputs exhibit compositional structure (e.g. objects built from parts or procedures from subroutines), it is natural to ask whether this compositional structure is reflected in the inputs' learned representations. While the assessment of compositionality in languages has received significant attention in linguistics and adjacent fields, the machine learning literature lacks general-purpose tools for producing graded measurements of compositional structure in more general (e.g. vector-valued) representation spaces. We describe a procedure for evaluating compositionality by measuring how well the true representation-producing model can be approximated by a model that explicitly composes a collection of inferred representational primitives. We use the procedure to provide formal and empirical characterizations of compositional structure in a variety of settings, exploring the relationship between compositionality and learning dynamics, human judgments, representational similarity, and generalization.\n ", '\n Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration. Instruction following from natural language instructions provides an appealing alternative: in the same way that we can specify goals to other humans simply by speaking or writing, we would like to be able to specify tasks for our machines. However, a single instruction may be insufficient to fully communicate our intent or, even if it is, may be insufficient for an autonomous agent to actually understand how to perform the desired task. In this work, we propose an interactive formulation of the task specification problem, where iterative language corrections are provided to an autonomous agent, guiding it in acquiring the desired skill. Our proposed language-guided policy learning algorithm can integrate an instruction and a sequence of corrections to acquire new skills very quickly. In our experiments, we show that this method can enable a policy to follow instructions and corrections for simulated navigation and manipulation tasks, substantially outperforming direct, non-interactive instruction following.\n ', "\n In complex inferential tasks like question answering, machine learning models must confront two challenges: the need to implement a compositional reasoning process, and, in many applications, the need for this reasoning process to be interpretable to assist users in both development and prediction. 
Existing models designed to produce interpretable traces of their decision-making process typically require these traces to be supervised at training time. In this paper, we present a novel neural modular approach that performs compositional reasoning by automatically inducing a desired sub-task decomposition without relying on strong supervision. Our model allows linking different reasoning tasks through shared modules that handle common routines across tasks. Experiments show that the model is more interpretable to human evaluators compared to other state-of-the-art models: users can better understand the model's underlying reasoning procedure and predict when it will succeed or fail based on observing its intermediate outputs.\n ", '\n Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.\n ', '\n We show that explicit pragmatic inference aids in correctly generating and following natural language instructions for complex, sequential tasks. Our pragmatics-enabled models reason about why speakers produce certain instructions, and about how listeners will react upon hearing them. Like previous pragmatic models, we use learned base listener and speaker models to build a pragmatic speaker that uses the base listener to simulate the interpretation of candidate descriptions, and a pragmatic listener that reasons counterfactually about alternative descriptions. We extend these models to tasks with sequential structure. Evaluation of language generation and interpretation shows that pragmatic inference improves state-of-the-art listener models (at correctly interpreting human instructions) and speaker models (at producing instructions correctly interpreted by humans) in diverse settings.\n ', '\n Deep reinforcement learning has achieved many recent successes, but our understanding of its strengths and limitations is hampered by the lack of rich environments in which we can fully characterize optimal behavior, and correspondingly diagnose individual actions against such a characterization. Here we consider a family of combinatorial games, arising from work of Erdos, Selfridge, and Spencer, and we propose their use as environments for evaluating and comparing different approaches to reinforcement learning. 
These games have a number of appealing features: they are challenging for current learning approaches, but they form (i) a low-dimensional, simply parametrized environment where (ii) there is a linear closed form solution for optimal behavior from any state, and (iii) the difficulty of the game can be tuned by changing environment parameters in an interpretable way. We use these Erdos-Selfridge-Spencer games not only to compare different algorithms, but test for generalization, make comparisons to supervised learning, analyse multiagent play, and even develop a self play algorithm. Code can be found at: https://github.com/rubai5/ESS_Game\n ', "\n The named concepts and compositional operators present in natural language provide a rich source of information about the kinds of abstractions humans use to navigate the world. Can this linguistic background knowledge improve the generality and efficiency of learned classifiers and control policies? This paper aims to show that using the space of natural language strings as a parameter space is an effective way to capture natural task structure. In a pretraining phase, we learn a language interpretation model that transforms inputs (e.g. images) into outputs (e.g. labels) given natural language descriptions. To learn a new concept (e.g. a classifier), we search directly in the space of descriptions to minimize the interpreter's loss on training examples. Crucially, our models do not require language data to learn these concepts: language is used only in pretraining to impose structure on subsequent learning. Results on image classification, text editing, and reinforcement learning show that, in all settings, models with a linguistic parameterization outperform those without.\n ", '\n We investigate the compositional structure of message vectors computed by a deep network trained on a communication game. By comparing truth-conditional representations of encoder-produced message vectors to human-produced referring expressions, we are able to identify aligned (vector, utterance) pairs with the same meaning. We then search for structured relationships among these aligned pairs to discover simple vector space transformations corresponding to negation, conjunction, and disjunction. Our results suggest that neural representations are capable of spontaneously developing a "syntax" with functional analogues to qualitative properties of natural language.\n ', '\n In this work, we present a minimal neural model for constituency parsing based on independent scoring of labels and spans. We show that this model is not only compatible with classical dynamic programming techniques, but also admits a novel greedy top-down inference algorithm based on recursive partitioning of the input. We demonstrate empirically that both prediction schemes are competitive with recent work, and when combined with basic extensions to the scoring model are capable of achieving state-of-the-art single-model performance on the Penn Treebank (91.79 F1) and strong performance on the French Treebank (82.23 F1).\n ', "\n Several approaches have recently been proposed for learning decentralized deep multiagent policies that coordinate via a differentiable communication channel. While these policies are effective for many tasks, interpretation of their induced communication strategies has remained a challenge. Here we propose to interpret agents' messages by translating them. Unlike in typical machine translation problems, we have no parallel data to learn from. 
Instead we develop a translation model based on the insight that agent messages and natural language strings mean the same thing if they induce the same belief about the world in a listener. We present theoretical guarantees and empirical evidence that our approach preserves both the semantics and pragmatics of messages by ensuring that players communicating through a translation layer do not suffer a substantial loss in reward relative to players with a common language.\n ", '\n Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question.\n ', '\n People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.\n ', '\n We describe a framework for multitask deep reinforcement learning guided by policy sketches. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them---specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). 
To learn from sketches, we present a model that associates every subtask with a modular subpolicy, and jointly maximizes reward over full task-specific policies by tying parameters across shared subpolicies. Optimization is accomplished via a decoupled actor--critic training objective that facilitates learning common behaviors from multiple dissimilar reward functions. We evaluate the effectiveness of our approach in three environments featuring both discrete and continuous control, and with sparse rewards that can be obtained only after completing a number of high-level subgoals. Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.\n ', '\n We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics. Like previous learned approaches to language generation, our model uses a simple feature-driven architecture (here a pair of neural "listener" and "speaker" models) to ground language in the world. Like inference-driven approaches to pragmatics, our model actively reasons about listener behavior when selecting utterances. For training, our approach requires only ordinary captions, annotated _without_ demonstration of the pragmatic behavior the model ultimately exhibits. In human evaluations on a referring expression game, our approach succeeds 81% of the time, compared to a 69% success rate using existing techniques.\n ', '\n We describe a question answering model that applies to both images and structured knowledge bases. The model uses natural language strings to automatically assemble neural networks from a collection of composable modules. Parameters for these modules are learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision. Our approach, which we term a dynamic neural model network, achieves state-of-the-art results on benchmark datasets in both visual and structured domains.\n ', '\n Visual question answering is fundamentally compositional in nature---a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning *neural module networks*, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.\n ', "\n This paper describes an alignment-based model for interpreting natural language instructions in context. We approach instruction following as a search over plans, scoring sequences of actions conditioned on structured observations of text and the environment. 
By explicitly modeling both the low-level compositional structure of individual actions and the high-level structure of full plans, we are able to learn both grounded representations of sentence meaning and pragmatic constraints on interpretation. To demonstrate the model's flexibility, we apply it to a diverse set of benchmark tasks. On every task, we outperform strong task-specific baselines, and achieve several new state-of-the-art results.\n ", '\n Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as "self-normalization", which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove generalization bounds on the estimated variance of normalizers and upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions.\n ', '\n We investigate models for learning the class of context-free and context-sensitive languages (CFLs and CSLs). We begin with a brief discussion of some early hardness results which show that unrestricted language learning is impossible, and unrestricted CFL learning is computationally infeasible; we then briefly survey the literature on algorithms for learning restricted subclasses of the CFLs. Finally, we introduce a new family of subclasses, the principled parametric context-free grammars (and a corresponding family of principled parametric context-sensitive grammars), which roughly model the "Principles and Parameters" framework in psycholinguistics. We present three hardness results: first, that the PPCFGs are not efficiently learnable given equivalence and membership oracles, second, that the PPCFGs are not efficiently learnable from positive presentations unless P = NP, and third, that the PPCSGs are not efficiently learnable from positive presentations unless integer factorization is in P.\n ', "\n Crystalline materials with broken inversion symmetry can exhibit a spontaneous electric polarization, which originates from a microscopic electric dipole moment. Long-range polar or anti-polar order of such permanent dipoles gives rise to ferroelectricity or antiferroelectricity, respectively. However, the recently discovered antiferroelectrics of fluorite structure (HfO$_2$ and ZrO$_2$) are different: A non-polar phase transforms into a polar phase by spontaneous inversion symmetry breaking upon the application of an electric field. Here, we show that this structural transition in antiferroelectric ZrO$_2$ gives rise to a negative capacitance, which is promising for overcoming the fundamental limits of energy efficiency in electronics. 
Our findings provide insight into the thermodynamically 'forbidden' region of the antiferroelectric transition in ZrO$_2$ and extend the concept of negative capacitance beyond ferroelectricity. This shows that negative capacitance is a more general phenomenon than previously thought and can be expected in a much broader range of materials exhibiting structural phase transitions.\n ", '\n This paper presents a succinct review of attempts in the literature to use game theory to model decision making scenarios relevant to defence applications. Game theory has been proven as a very effective tool in modelling decision making processes of intelligent agents, entities, and players. It has been used to model scenarios from diverse fields such as economics, evolutionary biology, and computer science. In defence applications, there is often a need to model and predict actions of hostile actors, and players who try to evade or out-smart each other. Modelling how the actions of competitive players shape the decision making of each other is the forte of game theory. In past decades, there have been several studies which applied different branches of game theory to model a range of defence-related scenarios. This paper provides a structured review of such attempts, and classifies existing literature in terms of the kind of warfare modelled, the types of game used, and the players involved. The presented analysis provides a concise summary about the state-of-the-art with regards to the use of game theory in defence applications, and highlights the benefits and limitations of game theory in the considered scenarios.\n ', '\n While different language models are ubiquitous in NLP, it is hard to contrast their outputs and identify which contexts one can handle better than the other. To address this question, we introduce LMdiff, a tool that visually compares probability distributions of two models that differ, e.g., through finetuning, distillation, or simply training with different parameter sizes. LMdiff allows the generation of hypotheses about model behavior by investigating text instances token by token and further assists in choosing these interesting text instances by identifying the most interesting phrases from large corpora. We showcase the applicability of LMdiff for hypothesis generation across multiple case studies. A demo is available at http://lmdiff.net .\n ', '\n In fluid physics, data-driven models to enhance or accelerate solution methods are becoming increasingly popular for many application domains, such as alternatives to turbulence closures, system surrogates, or for new physics discovery. In the context of reduced order models of high-dimensional time-dependent fluid systems, machine learning methods grant the benefit of automated learning from data, but the burden of a model lies on its reduced-order representation of both the fluid state and physical dynamics. In this work, we build a physics-constrained, data-driven reduced order model for the Navier-Stokes equations to approximate spatio-temporal turbulent fluid dynamics. The model design choices mimic numerical and physical constraints by, for example, implicitly enforcing the incompressibility constraint and utilizing continuous Neural Ordinary Differential Equations for tracking the evolution of the differential equation. We demonstrate this technique on three-dimensional, moderate Reynolds number turbulent fluid flow. 
In assessing the statistical quality and characteristics of the machine-learned model through rigorous diagnostic tests, we find that our model is capable of reconstructing the dynamics of the flow over large integral timescales, favoring accuracy at the larger length scales. More significantly, comprehensive diagnostics suggest that physically-interpretable model parameters, corresponding to the representations of the fluid state and dynamics, have attributable and quantifiable impact on the quality of the model predictions and computational complexity.\n ', "\n The web is global, but privacy laws differ by country. Which set of privacy rules do websites follow? We empirically study this question by detecting and analyzing cookie notices in an automated way. We crawl 1,500 European, American, and Canadian websites from each of 18 countries. We detect cookie notices on 40 percent of websites in our sample. We treat the presence or absence of cookie notices, as well as visual differences, as proxies for differences in privacy rules. Using a series of regression models, we find that the website's Top Level Domain explains a substantial portion of the variance in cookie notice metrics, but the user's vantage point does not. This suggests that websites follow one set of privacy rules for all their users. There is one exception to this finding: cookie notices differ when accessing .com domains from inside versus outside of the EU. We highlight ways in which future research could build on our preliminary findings.\n ", '\n A TeV scale leptoquark (LQ) is one of the promising explanations of the recent anomalies in the semileptonic decays of $B$ mesons. Among the various LQs, the vector $\\rm{U}_1$ is capable of explaining the anomalies in both $R_{D^{(*)}}$ and $R_{K^{(*)}}$ observables. We use the current LHC data to put bounds on the parameter space of $\\rm{U}_1$ relevant for the anomalies. Precise bounds are drawn by recasting the latest $ττ$ and $μμ$ searches by the ATLAS and CMS collaborations. We find that it is imperative to include the resonant production modes for obtaining limits in the low mass regions. For higher mass points, the non-resonant production modes play a dominant role.\n ', "\n We show how to derive state-of-the-art unsupervised neural machine translation systems from generatively pre-trained language models. Our method consists of three steps: few-shot amplification, distillation, and backtranslation. We first use the zero-shot translation ability of large pre-trained language models to generate translations for a small set of unlabeled sentences. We then amplify these zero-shot translations by using them as few-shot demonstrations for sampling a larger synthetic dataset. This dataset is distilled by discarding the few-shot demonstrations and then fine-tuning. During backtranslation, we repeatedly generate translations for a set of inputs and then fine-tune a single language model on both directions of the translation task at once, ensuring cycle-consistency by swapping the roles of gold monotext and generated translations when fine-tuning. By using our method to leverage GPT-3's zero-shot translation capability, we achieve a new state-of-the-art in unsupervised translation on the WMT14 English-French benchmark, attaining a BLEU score of 42.1.\n ", '\n Natural language descriptions sometimes accompany visualizations to better communicate and contextualize their insights, and to improve their accessibility for readers with disabilities. 
However, it is difficult to evaluate the usefulness of these descriptions, and how effectively they improve access to meaningful information, because we have little understanding of the semantic content they convey, and how different readers receive this content. In response, we introduce a conceptual model for the semantic content conveyed by natural language descriptions of visualizations. Developed through a grounded theory analysis of 2,147 sentences, our model spans four levels of semantic content: enumerating visualization construction properties (e.g., marks and encodings); reporting statistical concepts and relations (e.g., extrema and correlations); identifying perceptual and cognitive phenomena (e.g., complex trends and patterns); and elucidating domain-specific insights (e.g., social and political context). To demonstrate how our model can be applied to evaluate the effectiveness of visualization descriptions, we conduct a mixed-methods evaluation with 30 blind and 90 sighted readers, and find that these reader groups differ significantly on which semantic content they rank as most useful. Together, our model and findings suggest that access to meaningful information is strongly reader-specific, and that research in automatic visualization captioning should orient toward descriptions that more richly communicate overall trends and statistics, sensitive to reader preferences. Our work further opens a space of research on natural language as a data interface coequal with visualization.\n ', '\n The Tensor-Train (TT) format is a highly compact low-rank representation for high-dimensional tensors. TT is particularly useful when representing approximations to the solutions of certain types of parametrized partial differential equations. For many of these problems, computing the solution explicitly would require an infeasible amount of memory and computational time. While the TT format makes these problems tractable, iterative techniques for solving the PDEs must be adapted to perform arithmetic while maintaining the implicit structure. The fundamental operation used to maintain feasible memory and computational time is called rounding, which truncates the internal ranks of a tensor already in TT format. We propose several randomized algorithms for this task that are generalizations of randomized low-rank matrix approximation algorithms and provide significant reduction in computation compared to deterministic TT-rounding algorithms. Randomization is particularly effective in the case of rounding a sum of TT-tensors (where we observe 20x speedup), which is the bottleneck computation in the adaptation of GMRES to vectors in TT format. We present the randomized algorithms and compare their empirical accuracy and computational time with deterministic alternatives.\n ', '\n We give combinatorial proofs of two multivariate Cayley--Hamilton type theorems. The first one is due to Phillips (Amer. J. Math., 1919) involving $2k$ matrices, of which $k$ commute pairwise. The second one regards the mixed discriminant, a matrix function which has generated a lot of interest in recent times. Recently, the Cayley--Hamilton theorem for mixed discriminants was proved by Bapat and Roy (Comb. Math. and Comb. Comp., 2017). We prove a Phillips-type generalization of the Bapat--Roy theorem involving $2nk$ matrices, where $n$ is the size of the matrices, among which $nk$ commute pairwise. Our proofs generalize the univariate proof of Straubing (Disc. 
Math., 1983) for the original Cayley--Hamilton theorem in a nontrivial way, and involve decorated permutations and decorated paths.\n ', "\n Here, we develop a deep learning algorithm for solving Principal-Agent (PA) mean field games with market-clearing conditions -- a class of problems that have thus far not been studied and one that poses difficulties for standard numerical methods. We use an actor-critic approach to optimization, where the agents form a Nash equilibrium according to the principal's penalty function, and the principal evaluates the resulting equilibria. The inner problem's Nash equilibrium is obtained using a variant of the deep backward stochastic differential equation (BSDE) method modified for McKean-Vlasov forward-backward SDEs that includes dependence on the distribution over both the forward and backward processes. The outer problem's loss is further approximated by a neural net by sampling over the space of penalty functions. We apply our approach to a stylized PA problem arising in Renewable Energy Certificate (REC) markets, where agents may rent clean energy production capacity, trade RECs, and expand their long-term capacity to navigate the market at maximum profit. Our numerical results illustrate the efficacy of the algorithm and lead to interesting insights into the nature of optimal PA interactions in the mean-field limit of these markets.\n ", '\n We investigate a novel interplay between the decay and annihilation of a particle whose mass undergoes a large shift during a first order phase transition, leading to the particles becoming trapped in the false vacuum and enhancing their annihilation rates as the bubbles of true vacuum expand. This opens up a large region of the parameter space where annihilations can be important. We apply this scenario to baryogenesis, where we find that annihilations can be enhanced enough to generate the required baryon asymmetry even for relatively tiny annihilation cross sections with modest CP asymmetries.\n ', '\n In this work, we consider the estimation of single mode Gaussian states using four different measurement schemes namely: i) homodyne measurement, ii) sequential measurement, iii) Arthurs-Kelly scheme, and iv) heterodyne measurement, with a view to compare their relative performance. To that end, we work in the phase space formalism, specifically at the covariance matrix level, which provides an elegant and intuitive way to explicitly carry out involved calculations. We show that the optimal performance of the Arthurs-Kelly scheme and the sequential measurement is equal to the heterodyne measurement. While the heterodyne measurement outperforms the homodyne measurement in the mean estimation of squeezed state ensemble, the homodyne measurement outperforms the heterodyne measurement for variance estimation of squeezed state ensemble up to a certain range of squeezing parameter. We then modify the Hamiltonian in the Arthurs-Kelly scheme, such that the two meters can have correlations and show that the optimal performance is achieved when the meters are uncorrelated. We expect that the results will be useful in analyzing various quantum information and quantum communication protocols.\n ', '\n We employ the compressed sensing (CS) algorithm and a heavily reduced data set to experimentally perform true quantum process tomography (QPT) on an NMR quantum processor. We obtain the estimate of the process matrix $χ$ corresponding to various two- and three-qubit quantum gates with a high fidelity. 
The CS algorithm is implemented using two different operator bases, namely, the standard Pauli basis and the Pauli-error basis. We experimentally demonstrate that the performance of the CS algorithm is significantly better in the Pauli-error basis, where the constructed $χ$ matrix is maximally sparse. We compare the standard least square (LS) optimization QPT method with the CS-QPT method and observe that, provided an appropriate basis is chosen, the CS-QPT method performs significantly better as compared to the LS-QPT method. In all the cases considered, we obtained experimental fidelities greater than 0.9 from a reduced data set, which was approximately five to six times smaller in size than a full data set. We also experimentally characterized the reduced dynamics of a two-qubit subsystem embedded in a three-qubit system, and used the CS-QPT method to characterize processes corresponding to the evolution of two-qubit states under various $J$-coupling interactions.\n ', "\n For a fixed integer $t \\geq 2$, we consider the irreducible characters of representations of the classical groups of types A, B, C and D, namely $\\text{GL}_{tn}, \\text{SO}_{2tn+1}, \\text{Sp}_{2tn}$ and $\\text{O}_{2tn}$, evaluated at elements $ω^k x_i$ for $0 \\leq k \\leq t-1$ and $1 \\leq i \\leq n$, where $ω$ is a primitive $t$'th root of unity. The case of $\\text{GL}_{tn}$ was considered by D. Prasad (Israel J. Math., 2016). In this article, we give a uniform approach for all cases. We also look at $\\text{GL}_{tn+1}$ where we specialize the elements as before and set the last variable to $1$. In each case, we characterize partitions for which the character value is nonzero in terms of what we call $z$-asymmetric partitions, where $z$ is an integer which depends on the group. Moreover, if the character value is nonzero, we prove that it factorizes into characters of smaller classical groups. The proof uses Cauchy-type determinant formulas for these characters and involves a careful study of the beta sets of partitions. We also give product formulas for general $z$-asymmetric partitions and $z$-asymmetric $t$-cores. Lastly, we show that there are infinitely many $z$-asymmetric $t$-cores for $t \\geq 3$.\n ", '\n Resource disaggregation has gained huge popularity in recent years. Existing works demonstrate how to disaggregate compute, memory, and storage resources. We, for the first time, demonstrate how to disaggregate network resources by proposing a new distributed hardware framework called SuperNIC. Each SuperNIC connects a small set of endpoints and consolidates network functionalities for these endpoints. We prototyped SuperNIC with FPGA and demonstrate its performance and cost benefits with real network functions and customized disaggregated applications.\n ', '\n Machine Learning is being extensively used in hydrology, especially streamflow prediction of basins/watersheds. Basin characteristics are essential for modeling the rainfall-runoff response of these watersheds and therefore data-driven methods must take into account this ancillary characteristics data. However there are several limitations, namely uncertainty in the measured characteristics, partially missing characteristics for some of the basins or unknown characteristics that may not be present in the known measured set. In this paper we present an inverse model that uses a knowledge-guided self-supervised learning algorithm to infer basin characteristics using the meteorological drivers and streamflow response data. 
We evaluate our model on the CAMELS dataset and the results validate its ability to reduce measurement uncertainty, impute missing characteristics, and identify unknown characteristics.\n ', '\n A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, the computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task: sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between each component in such hyperboles. Next, we leverage the COMeT and reverse COMeT models to do commonsense and counterfactual inference. We then generate multiple hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles creatively with high success rate and intensity scores.\n ', '\n Scaffold based drug discovery (SBDD) is a technique for drug discovery which pins chemical scaffolds as the framework of design. Scaffolds, or molecular frameworks, organize the design of compounds into local neighborhoods. We formalize scaffold based drug discovery into a network design. Utilizing docking data from SARS-CoV-2 virtual screening studies and JAK2 kinase assay data, we showcase how a scaffold based conception of chemical space is intuitive for design. Lastly, we highlight the utility of scaffold based networks for chemical space as a potential solution to the intractable enumeration problem of chemical space by working inductively on local neighborhoods.\n ', '\n Deep neural networks such as AlphaFold and RoseTTAFold predict remarkably accurate structures of proteins compared to other algorithmic approaches. It is known that biologically small perturbations in the protein sequence do not lead to drastic changes in the protein structure. In this paper, we demonstrate that RoseTTAFold does not exhibit such robustness despite its high accuracy, and biologically small perturbations for some input sequences result in radically different predicted protein structures. This raises the challenge of detecting when these predicted protein structures cannot be trusted. We define the robustness measure for the predicted structure of a protein sequence to be the inverse of the root-mean-square distance (RMSD) between the predicted structure and the structure of its adversarially perturbed sequence. We use adversarial attack methods to create adversarial protein sequences, and show that the RMSD in the predicted protein structure ranges from 0.119Å to 34.162Å when the adversarial perturbations are bounded by 20 units in the BLOSUM62 distance. This demonstrates very high variance in the robustness measure of the predicted structures. We show that the magnitude of the correlation (0.917) between our robustness measure and the RMSD between the predicted structure and the ground truth is high, that is, the predictions with low robustness measure cannot be trusted. This is the first paper demonstrating the susceptibility of RoseTTAFold to adversarial attacks.\n ', "\n The six-vertex model is an important toy-model in statistical mechanics for two-dimensional ice with a natural parameter $Δ$. When $Δ= 0$, the so-called free-fermion point, the model is in natural correspondence with domino tilings of the Aztec diamond. 
Although this model is integrable for all $Δ$, there has been very little progress in understanding its statistics in the scaling limit for other values. In this work, we focus on the six-vertex model with domain wall boundary conditions at $Δ= 1/2$, where it corresponds to alternating sign matrices (ASMs). We consider the level lines in a height function representation of ASMs. We show that the maximum of the topmost level line for a uniformly random ASM has the GOE Tracy--Widom distribution after appropriate rescaling. A key ingredient in our proof is Zeilberger's proof of the ASM conjecture. As far as we know, this is the first edge fluctuation result away from the tangency points for the domain-wall six-vertex model when we are not in the free fermion case.\n ", '\n The training of neural networks using different deep learning frameworks may lead to drastically differing accuracy levels despite the use of the same neural network architecture and identical training hyperparameters such as learning rate and choice of optimization algorithms. Currently, our ability to build standardized deep learning models is limited by the availability of a suite of neural network and corresponding training hyperparameter benchmarks that expose differences between existing deep learning frameworks. In this paper, we present a living dataset of models and hyperparameters, called CrossedWires, that exposes semantic differences between two popular deep learning frameworks: PyTorch and Tensorflow. The CrossedWires dataset currently consists of models trained on CIFAR10 images using three different computer vision architectures: VGG16, ResNet50 and DenseNet121 across a large hyperparameter space. Using hyperparameter optimization, each of the three models was trained on 400 sets of hyperparameters suggested by the HyperSpace search algorithm. The CrossedWires dataset includes PyTorch and Tensorflow models with test accuracies as different as 0.681 on syntactically equivalent models and identical hyperparameter choices. The 340 GB dataset and benchmarks presented here include the performance statistics, training curves, and model weights for all 1200 hyperparameter choices, resulting in 2400 total models. The CrossedWires dataset provides an opportunity to study semantic differences between syntactically equivalent models across popular deep learning frameworks. Further, the insights obtained from this study can enable the development of algorithms and tools that improve reliability and reproducibility of deep learning frameworks. The dataset is freely available at https://github.com/maxzvyagin/crossedwires through a Python API and direct download link.\n ', '\n Deep Neural Networks are actively being used in the design of autonomous Cyber-Physical Systems (CPSs). The advantage of these models is their ability to handle high-dimensional state-space and learn compact surrogate representations of the operational state spaces. However, the problem is that the sampled observations used for training the model may never cover the entire state space of the physical environment, and as a result, the system will likely operate in conditions that do not belong to the training distribution. These conditions that do not belong to the training distribution are referred to as Out-of-Distribution (OOD). Detecting OOD conditions at runtime is critical for the safety of CPS. 
In addition, it is also desirable to identify the context or the feature(s) that are the source of OOD to select an appropriate control action to mitigate the consequences that may arise because of the OOD condition. In this paper, we study this problem as a multi-labeled time series OOD detection problem over images, where the OOD is defined both sequentially across short time windows (change points) as well as across the training data distribution. A common approach to solving this problem is the use of multi-chained one-class classifiers. However, this approach is expensive for CPSs that have limited computational resources and require short inference times. Our contribution is an approach to design and train a single $β$-Variational Autoencoder detector with a partially disentangled latent space sensitive to variations in image features. We use the feature sensitive latent variables in the latent space to detect OOD images and identify the most likely feature(s) responsible for the OOD. We demonstrate our approach using an Autonomous Vehicle in the CARLA simulator and a real-world automotive dataset called nuImages.\n ', '\n Let $K/F$ be a finite Galois extension of fields with $Gal(K/F)=Γ$. In an earlier work of Timothy Kohl, the author enumerated dihedral Hopf-Galois structures acting on dihedral extensions. Dihedral group is one particular example of semidirect product of $\\mathbb{Z}_n$ and $\\mathbb{Z}_2$. In this article we count the number of Hopf-Galois structures with Galois group $Γ$ of type $G$, where $Γ,G$ are groups of the form $\\mathbb{Z}_N\\rtimes_φ\\mathbb{Z}_2$ when $N$ is odd with radical of $N$ being a Burnside number. As an application we also find the corresponding number of skew braces.\n ', '\n In this article, we study the regularity of integral closure of powers of edge ideals. We obtain a lower bound for the regularity of integral closure of powers of edge ideals in terms of induced matching number of graphs. We prove that the regularity of integral closure of powers of edge ideals of graphs with at most two odd cycles is the same as the regularity of their powers.\n ', '\n Current methods for viral discovery target evolutionarily conserved proteins that accurately identify virus families but remain unable to distinguish the zoonotic potential of newly discovered viruses. Here, we apply an attention-enhanced long-short-term memory (LSTM) deep neural net classifier to a highly conserved viral protein target to predict zoonotic potential across betacoronaviruses. The classifier performs with a 94% accuracy. Analysis and visualization of attention at the sequence and structure-level features indicate possible association between important protein-protein interactions governing viral replication in zoonotic betacoronaviruses and zoonotic transmission.\n ', "\n While computers play an increasingly important role in every aspect of our lives, their inability to understand what tasks users are physically performing makes a wide range of applications, including health monitoring and context-specific assistance, difficult or impossible. With Human Activity Recognition (HAR), applications could track if a patient took his pills and detect the behavioral changes associated with diseases such as Alzheimer's. Current systems for HAR require diverse sensors (e.g., cameras, microphones, proximity sensors, and accelerometers) placed throughout the environment to provide detailed observations needed for high-accuracy HAR. 
The difficulty of instrumenting an environment with these sensors makes this approach impractical.\n This project considers whether recent advances in indoor localization (Wi-Fi Round Trip Time) enable high-accuracy HAR using only a smartphone. My design, called Location-Enhanced HAR (LEHAR), uses machine learning to combine acceleration, audio, and location data to detect common human activities. A LEHAR prototype, designed to recognize a dozen common activities conducted in a typical household, achieved an F1-score of 0.965. In contrast, existing approaches, which use only acceleration or audio data, obtained F1-scores of 0.660 and 0.865, respectively, on the same activities. In addition, the F1-score of existing designs dropped significantly as more activities were added for recognition, while LEHAR was able to maintain high accuracy. The results show that using a combination of acceleration, audio, and Wi-Fi Round Trip Time localization can enable a highly accurate and easily deployable HAR system.\n ", "\n Millions of people around the world face motor impairments due to Parkinson's, cerebral palsy, muscular dystrophy and other physical disabilities. The goal of this project is to increase the usable surface-area of devices for users with these disabilities by creating a simple, inexpensive, and portable way to enable high accuracy touch interaction with large surfaces such as a table or even a wall.\n This project uses a novel approach that analyzes the acoustic signals at four piezoelectric microphones placed on the interactive surface to identify sounds related to the same event (e.g., a finger tap) at each of the microphones. ALTo (Acoustic Localized Touch) uses the results of this signal processing to compute the time difference of arrival (TDOA) across the microphones. The collected TDOA data is used to compute an approximate location of a sound source (e.g., a finger tap) using a collection of hyperbolic equations.\n An experimental evaluation of a system prototype was used to identify a number of software and signal processing optimizations needed to significantly improve accuracy and create a usable system. The results of the research indicate that it is possible to detect the location of a touch with high accuracy. The ALTo prototype achieves an accuracy of 1.45cm in the x-direction and 2.72cm the y-direction which is within the range for the target usage (i.e., those with motor impairments).\n ", '\n We introduce the problem of finding a satisfying assignment to a CNF formula that must further belong to a prescribed input subspace. Equivalent formulations of the problem include finding a point outside a union of subspaces (the Union-of-Subspace Avoidance (USA) problem), and finding a common zero of a system of polynomials over $\\F_2$ each of which is a product of affine forms.\n We focus on the case of k-CNF formulas (the k-SUB-SAT problem). Clearly, it is no easier than k-SAT, and might be harder. Indeed, via simple reductions we show NP-hardness for k=2 and W[1]-hardness parameterized by the co-dimension of the subspace. We also prove that the optimization version Max-2-SUB-SAT is NP-hard to approximate better than the trivial 3/4 ratio even on satisfiable instances.\n On the algorithmic front, we investigate fast exponential algorithms which give non-trivial savings over brute-force algorithms. 
We give a simple branching algorithm with runtime $1.5^r$ for 2-SUB-SAT, where $r$ is the subspace dimension, and an $O^*(1.4312^n)$ time algorithm where $n$ is the number of variables.\n For $k$ more than 2, while known algorithms for solving a system of degree $k$ polynomial equations already imply a solution with runtime $2^{r(1-1/2k)}$, we explore a more combinatorial approach. For instance, based on the notion of critical variables, we give an algorithm with running time ${n\\choose {\\le t}} 2^{n-n/k}$, where $n$ is the number of variables and $t$ is the co-dimension of the subspace. This improves upon the running time of the polynomial equations approach for small co-dimension. Our algorithm also achieves polynomial space in contrast to the algebraic approach that uses exponential space.\n ', '\n Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical fixes in the dataset creation process. The premise of our work is that these efforts can be more effective if informed by an understanding of how datasets are used in practice in the research community. We study three influential face and person recognition datasets - DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild (LFW) - by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach that can mitigate these harms, making recommendations to dataset creators, conference program committees, dataset users, and the broader research community.\n ', '\n Simulation can enable the study of recommender system (RS) evolution while circumventing many of the issues of empirical longitudinal studies; simulations are comparatively easier to implement, are highly controlled, and pose no ethical risk to human participants. How simulation can best contribute to scientific insight about RS alongside qualitative and quantitative empirical approaches is an open question. Philosophers and researchers have long debated the epistemological nature of simulation compared to wholly theoretical or empirical methods. Simulation is often implicitly or explicitly conceptualized as occupying a middle ground between empirical and theoretical approaches, allowing researchers to realize the benefits of both. However, what is often ignored in such arguments is that without firm grounding in any single methodological tradition, simulation studies have no agreed upon scientific norms or standards, resulting in a patchwork of theoretical motivations, approaches, and implementations that are difficult to reconcile. In this position paper, we argue that simulation studies of RS are conceptually similar to empirical experimental approaches and therefore can be evaluated using the standards of empirical research methods. Using this empirical lens, we argue that the combination of high heterogeneity in approaches and low transparency in methods in simulation studies of RS has limited their interpretability, generalizability, and replicability. 
We contend that by adopting standards and practices common in empirical disciplines, simulation researchers can mitigate many of these weaknesses.\n ', '\n Spending on cybersecurity products and services is expected to top 123 billion U.S. dollars for 2020, more than double the 55 billion U.S. dollars spent in 2011.1 In that same period, cyber breaches quadrupled. Organizations globally face increasing liabilities, while boards of directors grapple with a seemingly Sisyphean challenge. Cyber Crossroads was born out of these alarming trends and a realization that the world cannot go on funneling finite resources into an indefinite, intractable problem. Cyber Crossroads brings together expertise from across the world, spanning aspects of the cyber problem (including technology, legal, risk, and economic) with the goal of creating a Cyber Standard of Care built through a global, not-for-profit research collaborative with no commercial interests. A Cyber Standard of Care should be applicable across industries and regardless of the organization size. It should be practical and implementable, with no requirement to purchase any product/service. Cyber Standard of Care should be woven into the existing governance fabric of the organization and it should not be yet another technical checklist, but a process/governance framework that can stand over time. To achieve this, we engaged with cyber risk experts and practitioners with a variety of relevant expertise, secured the advice/guidance of regulators and legal experts across jurisdictions, and interviewed leaders from 56 organizations globally to understand their challenges and identify best practices.\n ', '\n We confirm the planetary nature of TOI-532b, using a combination of precise near-infrared radial velocities with the Habitable-zone Planet Finder, TESS light curves, ground based photometric follow-up, and high-contrast imaging. TOI-532 is a faint (J$\\sim 11.5$) metal-rich M dwarf with Teff = $3957\\pm69$ K and [Fe/H] = $0.38\\pm0.04$; it hosts a transiting gaseous planet with a period of $\\sim 2.3$ days. Joint fitting of the radial velocities with the TESS and ground-based transits reveal a planet with radius of $5.82\\pm0.19$ R$_{\\oplus}$, and a mass of $61.5_{-9.3}^{+9.7}$ M$_{\\oplus}$. TOI-532b is the largest and most massive super Neptune detected around an M dwarf with both mass and radius measurements, and it bridges the gap between the Neptune-sized planets and the heavier Jovian planets known to orbit M dwarfs. It also follows the previously noted trend between gas giants and host star metallicity for M dwarf planets. In addition, it is situated at the edge of the Neptune desert in the Radius--Insolation plane, helping place constraints on the mechanisms responsible for sculpting this region of planetary parameter space.\n ', '\n In this paper, we introduce a method for multivariate function approximation using function evaluations, Chebyshev polynomials, and tensor-based compression techniques via the Tucker format. We develop novel randomized techniques to accomplish the tensor compression, provide a detailed analysis of the computational costs, provide insight into the error of the resulting approximations, and discuss the benefits of the proposed approaches. 
We also apply the tensor-based function approximation to develop low-rank matrix approximations to kernel matrices that describe pairwise interactions between two sets of points; the resulting low-rank approximations are efficient to compute and store (the complexity is linear in the number of points). We have detailed numerical experiments on example problems involving multivariate function approximation, low-rank matrix approximations of kernel matrices involving well-separated clusters of sources and target points, and a global low-rank approximation of kernel matrices with an application to Gaussian processes.\n ', '\n Univariate Weibull distribution is a well-known lifetime distribution and has been widely used in reliability and survival analysis. In this paper, we introduce a new family of bivariate generalized Weibull (BGW) distributions, whose univariate marginals are exponentiated Weibull distribution. Different statistical quantiles like marginals, conditional distribution, conditional expectation, product moments, correlation and a measure component reliability are derived. Various measures of dependence and statistical properties along with ageing properties are examined. Further, the copula associated with BGW distribution and its various important properties are also considered. The methods of maximum likelihood and Bayesian estimation are employed to estimate unknown parameters of the model. A Monte Carlo simulation and real data study are carried out to demonstrate the performance of the estimators and results have proven the effectiveness of the distribution in real-life situations\n ', "\n Uncertainties in machine learning are a significant roadblock for its application in safety-critical cyber-physical systems (CPS). One source of uncertainty arises from distribution shifts in the input data between training and test scenarios. Detecting such distribution shifts in real-time is an emerging approach to address the challenge. The high dimensional input space in CPS applications involving imaging adds extra difficulty to the task. Generative learning models are widely adopted for the task, namely out-of-distribution (OoD) detection. To improve the state-of-the-art, we studied existing proposals from both machine learning and CPS fields. In the latter, safety monitoring in real-time for autonomous driving agents has been a focus. Exploiting the spatiotemporal correlation of motion in videos, we can robustly detect hazardous motion around autonomous driving agents. Inspired by the latest advances in the Variational Autoencoder (VAE) theory and practice, we tapped into the prior knowledge in data to further boost OoD detection's robustness. Comparison studies over nuScenes and Synthia data sets show our methods significantly improve detection capabilities of OoD factors unique to driving scenarios, 42% better than state-of-the-art approaches. Our model also generalized near-perfectly, 97% better than the state-of-the-art across the real-world and simulation driving data sets experimented. Finally, we customized one proposed method into a twin-encoder model that can be deployed to resource limited embedded devices for real-time OoD detection. Its execution time was reduced over four times in low-precision 8-bit integer inference, while detection capability is comparable to its corresponding floating-point model.\n ", '\n Highly complex deep learning models are increasingly integrated into modern cyber-physical systems (CPS), many of which have strict safety requirements. 
One problem arising from this is that deep learning lacks interpretability, operating as a black box. The reliability of deep learning is heavily impacted by how well the model training data represents runtime test data, especially when the input space dimension is high as natural images. In response, we propose a robust out-of-distribution (OOD) detection framework. Our approach detects unusual movements from driving video in real-time by combining classical optic flow operation with representation learning via variational autoencoder (VAE). We also design a method to locate OOD factors in images. Evaluation on a driving simulation data set shows that our approach is statistically more robust than related works.\n ', '\n It is known that testing isomorphism of chordal graphs is as hard as the general graph isomorphism problem. Every chordal graph can be represented as the intersection graph of some subtrees of a tree. The leafage of a chordal graph, is defined to be the minimum number of leaves in the representing tree. We construct a fixed-parameter tractable algorithm testing isomorphism of chordal graphs with bounded leafage. The key point is a fixed-parameter tractable algorithm finding the automorphism group of a colored order-3 hypergraph with bounded sizes of color classes of vertices.\n ', '\n Non-topological solitons such as Q-balls and Q-shells have been studied for scalar fields invariant under global and gauged U(1) symmetries. We generalize this framework to include a Proca mass for the gauge boson, which can arise either from spontaneous symmetry breaking or via the Stückelberg mechanism. A heavy (light) gauge boson leads to solitons reminiscent of the global (gauged) case, but for intermediate values these Proca solitons exhibit completely novel features such as disconnected regions of viable parameter space and Q-shells with unbounded radius. We provide numerical solutions and excellent analytic approximations for both Proca Q-balls and Q-shells. These allow us to not only demonstrate the novel features numerically, but also understand and predict their origin analytically.\n ', "\n Saliency methods -- techniques to identify the importance of input features on a model's output -- are a common first step in understanding neural network behavior. However, interpreting saliency requires tedious manual inspection to identify and aggregate patterns in model behavior, resulting in ad hoc or cherry-picked analysis. To address these concerns, we present Shared Interest: a set of metrics for comparing saliency with human annotated ground truths. By providing quantitative descriptors, Shared Interest allows ranking, sorting, and aggregation of inputs thereby facilitating large-scale systematic analysis of model behavior. We use Shared Interest to identify eight recurring patterns in model behavior including focusing on a sufficient subset of ground truth features or being distracted by contextual features. Working with representative real-world users, we show how Shared Interest can be used to rapidly develop or lose trust in a model's reliability, uncover issues that are missed in manual analyses, and enable interactive probing of model behavior.\n ", '\n Simulation has emerged as a popular method to study the long-term societal consequences of recommender systems. This approach allows researchers to specify their theoretical model explicitly and observe the evolution of system-level outcomes over time. 
However, performing simulation-based studies often requires researchers to build their own simulation environments from the ground up, which creates a high barrier to entry, introduces room for implementation error, and makes it difficult to disentangle whether observed outcomes are due to the model or the implementation.\n We introduce T-RECS, an open-sourced Python package designed for researchers to simulate recommendation systems and other types of sociotechnical systems in which an algorithm mediates the interactions between multiple stakeholders, such as users and content creators. To demonstrate the flexibility of T-RECS, we perform a replication of two prior simulation-based studies of sociotechnical systems. We additionally show how T-RECS can be used to generate novel insights with minimal overhead. Our tool promotes reproducibility in this area of research, provides a unified language for simulating sociotechnical systems, and removes the friction of implementing simulations from scratch.\n ', '\n We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $\\ell_p$-norm with $1 \\leq p < 2$. We study the $\\ell_p$ norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A \\in \\mathbb{R}^{d \\times n}$ ($n \\gg d$), our algorithm achieves a multiplicative $k^{\\frac{1}{p} - \\frac{1}{2}}\\text{poly}(\\log nd)$-approximation to the error with respect to the best possible column subset of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $\\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $\\ell_{1, 2}$ norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.\n ', '\n Turbulent flow control has numerous applications and building reduced-order models (ROMs) of the flow and the associated feedback control laws is extremely challenging. Despite the complexity of building data-driven ROMs for turbulence, the superior representational capacity of deep neural networks has demonstrated considerable success in learning ROMs. Nevertheless, these strategies are typically devoid of physical foundations and often lack interpretability. Conversely, the Proper Orthogonal Decomposition (POD) based Galerkin projection (GP) approach for ROM has been popular in many problems owing to its theoretically consistent and explainable physical foundations. However, a key limitation is that the ordinary differential equations (ODEs) arising from GP ROMs are highly susceptible to instabilities due to truncation of POD modes and lead to deterioration in temporal predictions. In this work, we propose a \textit{differentiable programming} approach that blends the strengths of both these strategies, by embedding neural networks explicitly into the GP ODE structure, termed Neural Galerkin projection. 
We demonstrate this approach on the isentropic Navier-Stokes equations for compressible flow over a cavity at a moderate Mach number. When provided the structure of the projected equations, we show that the Neural Galerkin approach implicitly learns stable ODE coefficients from POD coefficients and demonstrates significantly longer and accurate time horizon predictions, when compared to the classical POD-GP assisted by calibration. We observe that the key benefits of this differentiable programming-based approach include increased flexibility in physics-based learning, very low computational costs, and a significant increase in interpretability, when compared to purely data-driven neural networks.\n ', '\n The experimental implementation of selective quantum process tomography (SQPT) involves computing individual elements of the process matrix with the help of a special set of states called quantum 2-design states. However, the number of experimental settings required to prepare input states from quantum 2-design states to selectively and precisely compute a desired element of the process matrix is still high, and hence constructing the corresponding unitary operations in the lab is a daunting task. In order to reduce the experimental complexity, we mathematically reformulated the standard SQPT problem, which we term the modified SQPT (MSQPT) method. We designed the generalized quantum circuit to prepare the required set of input states and formulated an efficient measurement strategy aimed at minimizing the experimental cost of SQPT. We experimentally demonstrated the MSQPT protocol on the IBM QX2 cloud quantum processor and selectively characterized various two- and three-qubit quantum gates.\n ', '\n Blockchain analysis is essential for understanding how cryptocurrencies like Bitcoin are used in practice, and address clustering is a cornerstone of blockchain analysis. However, current techniques rely on heuristics that have not been rigorously evaluated or optimized. In this paper, we tackle several challenges of change address identification and clustering. First, we build a ground truth set of transactions with known change from the Bitcoin blockchain that can be used to validate the efficacy of individual change address detection heuristics. Equipped with this data set, we develop new techniques to predict change outputs with low false positive rates. After applying our prediction model to the Bitcoin blockchain, we analyze the resulting clustering and develop ways to detect and prevent cluster collapse. Finally, we assess the impact our enhanced clustering has on two exemplary applications.\n ', '\n We consider efficient methods for computing solutions to dynamic inverse problems, where both the quantities of interest and the forward operator (measurement process) may change at different time instances but we want to solve for all the images simultaneously. We are interested in large-scale ill-posed problems that are made more challenging by their dynamic nature and, possibly, by the limited amount of available data per measurement step. To remedy these difficulties, we apply regularization methods that enforce simultaneous regularization in space and time (such as edge enhancement at each time instant and proximity at consecutive time instants) and achieve this with low computational cost and enhanced accuracy. 
More precisely, we develop iterative methods based on a majorization-minimization (MM) strategy with quadratic tangent majorant, which allows the resulting least squares problem to be solved with a generalized Krylov subspace (GKS) method; the regularization parameter can be defined automatically and efficiently at each iteration. Numerical examples from a wide range of applications, such as limited-angle computerized tomography (CT), space-time image deblurring, and photoacoustic tomography (PAT), illustrate the effectiveness of the described approaches.\n ', '\n Machine learning (ML) is actively finding its way into modern cyber-physical systems (CPS), many of which are safety-critical real-time systems. It is well known that ML outputs are not reliable when testing data are novel with regards to model training and validation data, i.e., out-of-distribution (OOD) test data. We implement an unsupervised deep neural network-based OOD detector on a real-time embedded autonomous Duckiebot and evaluate detection performance. Our OOD detector produces a success rate of 87.5% for emergency stopping a Duckiebot on a braking test bed we designed. We also provide case analysis on computing resource challenges specific to the Robot Operating System (ROS) middleware on the Duckiebot.\n ', '\n Let $\\mathbb{K}$ be a field and $R = \\mathbb{K}[x_1, \\ldots, x_n]$. We obtain an improved upper bound for asymptotic resurgence of squarefree monomial ideals in $R$. We study the effect on the resurgence when sum, product and intersection of ideals are taken. We obtain sharp upper and lower bounds for the resurgence and asymptotic resurgence of cover ideals of finite simple graphs in terms of associated combinatorial invariants. We also explicitly compute the resurgence and asymptotic resurgence of cover ideals of several classes of graphs. We characterize a graph being bipartite in terms of the resurgence and asymptotic resurgence of edge and cover ideals. We also compute explicitly the resurgence and asymptotic resurgence of edge ideals of some classes of graphs.\n ', '\n We consider the global optimization of nonconvex mixed-integer quadratic programs with linear equality constraints. In particular, we present a new class of convex quadratic relaxations which are derived via quadratic cuts. To construct these quadratic cuts, we solve a separation problem involving a linear matrix inequality with a special structure that allows the use of specialized solution algorithms. Our quadratic cuts are nonconvex, but define a convex feasible set when intersected with the equality constraints. We show that our relaxations are an outer-approximation of a semi-infinite convex program which under certain conditions is equivalent to a well-known semidefinite program relaxation. The new relaxations are implemented in the global optimization solver BARON, and tested by conducting numerical experiments on a large collection of problems. Results demonstrate that, for our test problems, these relaxations lead to a significant improvement in the performance of BARON.\n ', '\n Explaining Graph Neural Networks predictions to end users of AI applications in easily understandable terms remains an unsolved problem. In particular, we do not have well developed methods for automatically evaluating explanations, in ways that are closer to how users consume those explanations. 
Based on recent application trends and our own experiences in real world problems, we propose automatic evaluation approaches for GNN Explanations.\n ', '\n We propose a statistical framework to integrate radiological magnetic resonance imaging (MRI) and genomic data to identify the underlying radiogenomic associations in lower grade gliomas (LGG). We devise a novel imaging phenotype by dividing the tumor region into concentric spherical layers that mimics the tumor evolution process. MRI data within each layer is represented by voxel--intensity-based probability density functions which capture the complete information about tumor heterogeneity. Under a Riemannian-geometric framework these densities are mapped to a vector of principal component scores which act as imaging phenotypes. Subsequently, we build Bayesian variable selection models for each layer with the imaging phenotypes as the response and the genomic markers as predictors. Our novel hierarchical prior formulation incorporates the interior-to-exterior structure of the layers, and the correlation between the genomic markers. We employ a computationally-efficient Expectation--Maximization-based strategy for estimation. Simulation studies demonstrate the superior performance of our approach compared to other approaches. With a focus on the cancer driver genes in LGG, we discuss some biologically relevant findings. Genes implicated with survival and oncogenesis are identified as being associated with the spherical layers, which could potentially serve as early-stage diagnostic markers for disease monitoring, prior to routine invasive approaches.\n ', '\n Accurately maintaining digital street maps is labor-intensive. To address this challenge, much work has studied automatically processing geospatial data sources such as GPS trajectories and satellite images to reduce the cost of maintaining digital maps. An end-to-end map update system would first process geospatial data sources to extract insights, and second leverage those insights to update and improve the map. However, prior work largely focuses on the first step of this pipeline: these map extraction methods infer road networks from scratch given geospatial data sources (in effect creating entirely new maps), but do not address the second step of leveraging this extracted information to update the existing digital map data. In this paper, we first explain why current map extraction techniques yield low accuracy when extended to update existing maps. We then propose a novel method that leverages the progression of satellite imagery over time to substantially improve accuracy. Our approach first compares satellite images captured at different times to identify portions of the physical road network that have visibly changed, and then updates the existing map accordingly. We show that our change-based approach reduces map update error rates four-fold.\n ', '\n This paper studies the problem of allocating tasks from different customers to vehicles in mobility platforms, which are used for applications like food and package delivery, ridesharing, and mobile sensing. A mobility platform should allocate tasks to vehicles and schedule them in order to optimize both throughput and fairness across customers. However, existing approaches to scheduling tasks in mobility platforms ignore fairness.\n We introduce Mobius, a system that uses guided optimization to achieve both high throughput and fairness across customers. Mobius supports spatiotemporally diverse and dynamic customer demands. 
It provides a principled method to navigate inherent tradeoffs between fairness and throughput caused by shared mobility. Our evaluation demonstrates these properties, along with the versatility and scalability of Mobius, using traces gathered from ridesharing and aerial sensing applications. Our ridesharing case study shows that Mobius can schedule more than 16,000 tasks across 40 customers and 200 vehicles in an online manner.\n ', "\n Training high-accuracy object detection models requires large and diverse annotated datasets. However, creating these data-sets is time-consuming and expensive since it relies on human annotators. We design, implement, and evaluate TagMe, a new approach for automatic object annotation in videos that uses GPS data. When the GPS trace of an object is available, TagMe matches the object's motion from GPS trace and the pixels' motions in the video to find the pixels belonging to the object in the video and creates the bounding box annotations of the object. TagMe works using passive data collection and can continuously generate new object annotations from outdoor video streams without any human annotators. We evaluate TagMe on a dataset of 100 video clips. We show TagMe can produce high-quality object annotations in a fully-automatic and low-cost way. Compared with the traditional human-in-the-loop solution, TagMe can produce the same amount of annotations at a much lower cost, e.g., up to 110x.\n ", '\n Throughout the course of the COVID-19 pandemic, several countries have developed and released contact tracing and exposure notification smartphone applications (apps) to help slow the spread of the disease. To support such apps, Apple and Google have released Exposure Notification Application Programming Interfaces (APIs) to infer device (user) proximity using Bluetooth Low Energy (BLE) beacons. The Private Automated Contact Tracing (PACT) team has shown that accurately estimating the distance between devices using only BLE radio signals is challenging. This paper describes the design and implementation of the SonicPACT protocol to use near-ultrasonic signals on commodity iOS and Android smartphones to estimate distances using time-of-flight measurements. The protocol allows Android and iOS devices to interoperate, augmenting and improving the current exposure notification APIs. Our initial experimental results are promising, suggesting that SonicPACT should be considered for implementation by Apple and Google.\n ', "\n Queues allow network operators to control traffic: where queues build, they can enforce scheduling and shaping policies. In the Internet today, however, there is a mismatch between where queues build and where control is most effectively enforced; queues build at bottleneck links that are often not under the control of the data sender. To resolve this mismatch, we propose a new kind of middlebox, called Bundler. Bundler uses a novel inner control loop between a sendbox (in the sender's site) and a receivebox (in the receiver's site) to determine the aggregate rate for the bundle, leaving the end-to-end connections and their control loops intact. Enforcing this sending rate ensures that bottleneck queues that would have built up from the bundle's packets now shift from the bottleneck to the sendbox. The sendbox then exercises control over its traffic by scheduling packets to achieve higher-level objectives. We have implemented Bundler in Linux and evaluated it with real-world and emulation experiments. 
We find that Bundler allows the sender-chosen policy to be effective: when configured to implement Stochastic Fairness Queueing (SFQ), it improves median flow completion time (FCT) by between 28% and 97% across various scenarios.\n ", '\n Inferring road graphs from satellite imagery is a challenging computer vision task. Prior solutions fall into two categories: (1) pixel-wise segmentation-based approaches, which predict whether each pixel is on a road, and (2) graph-based approaches, which predict the road graph iteratively. We find that these two approaches have complementary strengths while suffering from their own inherent limitations.\n In this paper, we propose a new method, Sat2Graph, which combines the advantages of the two prior categories into a unified framework. The key idea in Sat2Graph is a novel encoding scheme, graph-tensor encoding (GTE), which encodes the road graph into a tensor representation. GTE makes it possible to train a simple, non-recurrent, supervised model to predict a rich set of features that capture the graph structure directly from an image. We evaluate Sat2Graph using two large datasets. We find that Sat2Graph surpasses prior methods on two widely used metrics, TOPO and APLS. Furthermore, whereas prior work only infers planar road graphs, our approach is capable of inferring stacked roads (e.g., overpasses), and does so robustly.\n ', '\n Inferring road attributes such as lane count and road type from satellite imagery is challenging. Often, due to the occlusion in satellite imagery and the spatial correlation of road attributes, a road attribute at one position on a road may only be apparent when considering far-away segments of the road. Thus, to robustly infer road attributes, the model must integrate scattered information and capture the spatial correlation of features along roads. Existing solutions that rely on image classifiers fail to capture this correlation, resulting in poor accuracy. We find this failure is caused by a fundamental limitation -- the limited effective receptive field of image classifiers. To overcome this limitation, we propose RoadTagger, an end-to-end architecture which combines both Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) to infer road attributes. The usage of graph neural networks allows information propagation on the road network graph and eliminates the receptive field limitation of image classifiers. We evaluate RoadTagger on both a large real-world dataset covering 688 km^2 area in 20 U.S. cities and a synthesized micro-dataset. In the evaluation, RoadTagger improves inference accuracy over the CNN image classifier based approaches. RoadTagger also demonstrates strong robustness against different disruptions in the satellite imagery and the ability to learn complicated inductive rules for aggregating scattered information along the road network.\n ', '\n Street maps are a crucial data source that help to inform a wide range of decisions, from navigating a city to disaster relief and urban planning. However, in many parts of the world, street maps are incomplete or lag behind new construction. Editing maps today involves a tedious process of manually tracing and annotating roads, buildings, and other map features.\n Over the past decade, many automatic map inference systems have been proposed to automatically extract street map data from satellite imagery, aerial imagery, and GPS trajectory datasets. 
However, automatic map inference has failed to gain traction in practice due to two key limitations: high error rates (low precision), which manifest in noisy inference outputs, and a lack of end-to-end system design to leverage inferred data to update existing street maps.\n At MIT and QCRI, we have developed a number of algorithms and approaches to address these challenges, which we combined into a new system we call Mapster. Mapster is a human-in-the-loop street map editing system that incorporates three components to robustly accelerate the mapping process over traditional tools and workflows: high-precision automatic map inference, data refinement, and machine-assisted map editing.\n Through an evaluation on a large-scale dataset including satellite imagery, GPS trajectories, and ground-truth map data in forty cities, we show that Mapster makes automation practical for map editing, and enables the curation of map datasets that are more complete and up-to-date at less cost.\n ', '\n Mapping road networks today is labor-intensive. As a result, road maps have poor coverage outside urban centers in many countries. Systems to automatically infer road network graphs from aerial imagery and GPS trajectories have been proposed to improve coverage of road maps. However, because of high error rates, these systems have not been adopted by mapping communities. We propose machine-assisted map editing, where automatic map inference is integrated into existing, human-centric map editing workflows. To realize this, we build Machine-Assisted iD (MAiD), where we extend the web-based OpenStreetMap editor, iD, with machine-assistance functionality. We complement MAiD with a novel approach for inferring road topology from aerial imagery that combines the speed of prior segmentation approaches with the accuracy of prior iterative graph construction methods. We design MAiD to tackle the addition of major, arterial roads in regions where existing maps have poor coverage, and the incremental improvement of coverage in regions where major roads are already mapped. We conduct two user studies and find that, when participants are given a fixed time to map roads, they are able to add as much as 3.5x more roads with MAiD.\n ', "\n To reduce transmit power, increase throughput, and improve communication range, radio systems---such as IoT sensor networks, Wi-Fi and cellular networks---benefit from the ability to direct their signals, to ensure that more of the transmitted power reaches the receiver. Many modern systems beamform with antenna arrays for this purpose. However, a radio's ability to direct its signal is fundamentally limited by its size. Unfortunately practical challenges limit the size of modern radios, and consequently, their ability to beamform. In many settings, radios on devices must be small and inexpensive; today, these settings are unable to benefit from high-precision beamforming.\n To address this problem, we introduce RFocus, which moves beamforming functions from the radio endpoints to the environment. RFocus includes a two-dimensional surface with a rectangular array of simple elements, each of which functions as an RF switch. Each element either lets the signal through or reflects it. The surface does not emit any power of its own. The state of the elements is set by a software controller to maximize the signal strength at a receiver, with a novel optimization algorithm that uses signal strength measurements from the receiver. 
The RFocus surface can be manufactured as an inexpensive thin wallpaper, requiring no wiring. This solution requires only a method to communicate received signal strengths periodically to the RFocus controller. Our prototype implementation improves the median signal strength by 10.5x, and the median channel capacity by 2.1x.\n ", '\n We propose Accel-Brake Control (ABC), a simple and deployable explicit congestion control protocol for network paths with time-varying wireless links. ABC routers mark each packet with an "accelerate" or "brake", which causes senders to slightly increase or decrease their congestion windows. Routers use this feedback to quickly guide senders towards a desired target rate. ABC requires no changes to header formats or user devices, but achieves better performance than XCP. ABC is also incrementally deployable; it operates correctly when the bottleneck is a non-ABC router, and can coexist with non-ABC traffic sharing the same bottleneck link. We evaluate ABC using a Wi-Fi implementation and trace-driven emulation of cellular links. ABC achieves 30-40% higher throughput than Cubic+Codel for similar delays, and 2.2X lower delays than BBR on a Wi-Fi path. On cellular network paths, ABC achieves 50% higher throughput than Cubic+Codel.\n ', '\n Prior research has proposed technical solutions to use peer-to-peer (P2P) content delivery to serve Internet video, showing that it can reduce costs to content providers. Yet, such methods have not become widespread except for a few niche instances. An important challenge is incentivization: what tangible benefits does P2P content delivery offer users who bring resources to the table? In this paper, we ask whether monetary incentives can help attract peers in P2P content delivery systems. We commissioned a professional survey of people around theUnited States to answer several relevant questions. We found that 51% of the 876 respondents--substantially larger than our expectations--answered "yes" to whether they would participate for suitable financial incentives. Encouraged by the results of the survey, we propose Gringotts, a system to structure incentives and securely incorporate P2P delivery into content delivery systems. Gringotts provides a novel Proof of Delivery mechanism that allows content providers to verify correct delivery of their files, and shows how to use cryptocurrency to pay peers while guarding against liars and Sybil attacks.\n ', '\n This paper introduces Nimbus, a robust technique to detect whether the cross traffic competing with a flow is "elastic", and shows that this elasticity detector improves congestion control. If cross traffic is inelastic, then a sender can control queueing delays while achieving high throughput, but in the presence of elastic traffic, it may lose throughput if it attempts to control packet delay. To estimate elasticity, Nimbus modulates the flow\'s sending rate with sinusoidal pulses that create small traffic fluctuations at the bottleneck link, and measures the frequency response of the rate of the cross traffic. Our results on emulated and real-world paths show that congestion control using elasticity detection achieves throughput comparable to Cubic, but with delays that are 50-70 ms lower when cross traffic is inelastic. Nimbus detects the nature of the cross traffic more accurately than Copa, and is usable as a building block by other end-to-end algorithms.\n ', '\n Mapping road networks is currently both expensive and labor-intensive. 
High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity. We show that these segmentation methods have high error rates because noisy CNN outputs are difficult to correct. We propose RoadTracer, a new method to automatically construct accurate road network maps from aerial images. RoadTracer uses an iterative search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We compare our approach with a segmentation method on fifteen cities, and find that at a 5% error rate, RoadTracer correctly captures 45% more junctions across these cities.\n ', '\n Switches today provide a small set of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign.\n Our design builds on the observation that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further, in many scheduling algorithms these decisions can be made when packets are enqueued. We leverage this observation to build a programmable scheduler using a single abstraction: the push-in first-out queue (PIFO), a priority queue that maintains the scheduling order and time for such algorithms.\n We show that a programmable scheduler using PIFOs lets us program a wide variety of scheduling algorithms. We present a detailed hardware design for this scheduler for a 64-port 10 Gbit/s shared-memory switch with <4% chip area overhead on a 16-nm standard-cell library. Our design lets us program many sophisticated algorithms, such as a 5-level hierarchical scheduler with programmable scheduling algorithms at each level.\n ', "\n Many algorithms for congestion control, scheduling, network measurement, active queue management, security, and load balancing require custom processing of packets as they traverse the data plane of a network switch. To run at line rate, these data-plane algorithms must be in hardware. With today's switch hardware, algorithms cannot be changed, nor new algorithms installed, after a switch has been built.\n This paper shows how to program data-plane algorithms in a high-level language and compile those programs into low-level microcode that can run on emerging programmable line-rate switching chipsets. The key challenge is that these algorithms create and modify algorithmic state. The key idea to achieve line-rate programmability for stateful algorithms is the notion of a packet transaction : a sequential code block that is atomic and isolated from other such code blocks. We have developed this idea in Domino, a C-like imperative language to express data-plane algorithms. We show with many examples that Domino provides a convenient and natural way to express sophisticated data-plane algorithms, and show that these algorithms can be run at line rate with modest estimated die-area overhead.\n ", "\n This paper presents an analysis of spinal codes, a class of rateless codes proposed recently. 
We prove that spinal codes achieve Shannon capacity for the binary symmetric channel (BSC) and the additive white Gaussian noise (AWGN) channel with an efficient polynomial-time encoder and decoder. They are the first rateless codes with proofs of these properties for BSC and AWGN. The key idea in the spinal code is the sequential application of a hash function over the message bits. The sequential structure of the code turns out to be crucial for efficient decoding. Moreover, counter to the wisdom of having an expander structure in good codes, we show that the spinal code, despite its sequential structure, achieves capacity. The pseudo-randomness provided by a hash function suffices for this purpose. Our proof introduces a variant of Gallager's result characterizing the error exponent of random codes for any memoryless channel. We present a novel application of these error-exponent results within the framework of an efficient sequential code. The application of a hash function over the message bits provides a methodical and effective way to de-randomize Shannon's random codebook construction.\n ", '\n This paper describes the implementation and evaluation of an operating system module, the Congestion Manager (CM), which provides integrated network flow management and exports a convenient programming interface that allows applications to be notified of, and adapt to, changing network conditions. We describe the API by which applications interface with the CM, and the architectural considerations that factored into the design. To evaluate the architecture and API, we describe our implementations of TCP; a streaming layered audio/video application; and an interactive audio application using the CM, and show that they achieve adaptive behavior without incurring much end-system overhead. All flows including TCP benefit from the sharing of congestion information, and applications are able to incorporate new functionality such as congestion control and adaptive behavior.\n ', '\n Magnetic domain walls are information tokens in both logic and memory devices, and hold particular interest in applications such as neuromorphic accelerators that combine logic in memory. Here, we show that devices based on the electrical manipulation of magnetic domain walls are capable of implementing linear, as well as programmable nonlinear, functions. Unlike other approaches, domain-wall-based devices are ideal for application to both synaptic weight generators and thresholding in deep neural networks. Prototype micrometer-size devices operate with 8 ns current pulses and the energy consumption required for weight modulation is < 16 pJ. Both speed and energy consumption compare favorably to other synaptic nonvolatile devices, with the expected energy dissipation for scaled 20 nm devices close to that of biological neurons.\n ', '\n Lead halide-based perovskite thin films have attracted great attention due to the explosive increase in perovskite solar cell efficiencies. The same optoelectronic properties that make perovskites ideal absorber materials in solar cells are also beneficial in other light-harvesting applications and make them prime candidates as triplet sensitizers in upconversion via triplet-triplet annihilation in rubrene. 
In this contribution, we take advantage of long carrier lifetimes and carrier diffusion lengths in perovskite thin films, their high absorption cross sections throughout the visible spectrum, as well as the strong spin-orbit coupling owing to the abundance of heavy atoms to sensitize the upconverter rubrene. Employing bulk perovskite thin films as the absorber layer and spin-mixer in inorganic/organic heterojunction upconversion devices allows us to forego the additional tunneling barrier owing from the passivating ligands required for colloidal sensitizers. Our bilayer device exhibits an upconversion efficiency in excess of 3% under 785 nm illumination.\n ', '\n High-quality micro-lasers are key ingredients in non-linear optics, communication, sensing and low-threshold solar-pumped lasers. However, such micro-lasers exhibit negligible absorption of free-space broadband pump light. Recently, this limitation was lifted by cascade energy transfer, in which the absorption and quality factor are modulated with wavelength, enabling non-resonant pumping of high-quality micro-lasers and solar-pumped laser to operate at record low solar concentration. Here, we present a generic theoretical framework for modeling the absorption, emission and energy transfer of incoherent radiation between cascade sensitizer and laser gain media. Our model is based on linear equations of the modified net radiation method and is therefore robust, fast converging and has low complexity. We apply this formalism to compute the optimal parameters of low-threshold solar-pumped lasers. It is revealed that the interplay between the absorption and self-absorption of such lasers defines the optimal pump absorption below the maximal value, which is in contrast to conventional lasers for which full pump absorption is desired. Numerical results are compared to experimental data on a sensitized Nd:YAG cavity, and quantitative agreement with theoretical models is found. Our work modularizes the gain and sensitizing components and paves the way for the optimal design of broadband-pumped high-quality micro-lasers and efficient solar-pumped lasers.\n ', '\n Plexcitons are polaritonic modes that result from the strong coupling between excitons and plasmons. We consider plexcitons emerging from the interaction of excitons in an organic molecular layer with surface plasmons in a metallic film. We predict the emergence of Dirac cones in the two-dimensional bandstructure of plexcitons due to the inherent alignment of the excitonic transitions in the organic layer. These Dirac cones may open up in energy by simultaneously interfacing the metal with a magneto-optical layer and subjecting the whole system to a perpendicular magnetic field. The resulting energy gap becomes populated with topologically protected one-way modes which travel at the interface of this plexcitonic system. Our theoretical proposal suggests that plexcitons are a convenient and simple platform for the exploration of exotic phases of matter as well as of novel ways to direct energy flow at the nanoscale.\n ', '\n The formation of 360° magnetic domain walls (360DWs) in Co and Ni80Fe20 thin film wires was demonstrated experimentally for different wire widths, by successively injecting two 180° domain walls (180DWs) into the wire. 
For narrow wires (less than 50 nm wide for Co), edge roughness prevented the combination of the 180DWs into a 360DW, and for wide wires (200 nm for Co) the 360DW collapsed, but over an intermediate range of wire widths, reproducible 360DW formation occurred. The annihilation and dissociation of 360DWs was demonstrated by applying a magnetic field parallel to the wire, showing that annihilation fields were several times higher than dissociation fields in agreement with micromagnetic modeling. The annihilation of a 360DW by current pulsing was demonstrated.\n ', '\n Organic light emitting devices and solar cells are machines that create, manipulate and destroy excited states in organic semiconductors. It is crucial to characterize these excited states, or excitons, to optimize device performance in applications like displays and solar energy harvesting. This is complicated if the excited state is a triplet because the electronic transition is dark with a vanishing oscillator strength. As a consequence, triplet state spectroscopy must usually be performed at cryogenic temperatures to reduce competition from non-radiative rates. Here, we control non-radiative rates by engineering a solid-state host matrix containing the target molecule, allowing the observation of phosphorescence at room temperature and alleviating constraints of cryogenic experiments. We test these techniques on a wide range of materials with functionalities spanning multi-exciton generation (singlet exciton fission), organic light emitting device host materials, and thermally activated delayed fluorescence type emitters. Control of non-radiative modes in the matrix surrounding a target molecule may also have broader applications in light emitting and photovoltaic devices.\n ', '\n We report highly efficient, simultaneous fluorescence and phosphorescence (74% yield) at room temperature from a single molecule ensemble of (BzP)PB dispersed into a polymer host. The slow phosphorescence (208 ms lifetime) is very efficient (50%) at room temperature and only possible because the non-radiative rate for the triplet state is extremely low. The ability of an organic molecule to function as an efficient dual state emitter at room temperature is unusual and opens new fields of applications including the use as broadband down-conversion emitters, optical sensors and attenuators, exciton probes, and spin-independent intermediates for Förster resonant energy transfer.\n ', '\n Searching for novel molecular compounds with desired properties is an important problem in drug discovery. Many existing frameworks generate molecules one atom at a time. We instead propose a flexible editing paradigm that generates molecules using learned molecular fragments--meaningful substructures of molecules. To do so, we train a variational autoencoder (VAE) to encode molecular fragments in a coherent latent space, which we then utilize as a vocabulary for editing molecules to explore the complex chemical property space. Equipped with the learned fragment vocabulary, we propose Fragment-based Sequential Translation (FaST), which learns a reinforcement learning (RL) policy to iteratively translate model-discovered molecules into increasingly novel molecules while satisfying desired properties. 
Empirical evaluation shows that FaST significantly improves over state-of-the-art methods on benchmark single/multi-objective molecular optimization tasks.\n ', '\n Generating the periodic structure of stable materials is a long-standing challenge for the material design community. This task is difficult because stable materials only exist in a low-dimensional subspace of all possible periodic arrangements of atoms: 1) the coordinates must lie in the local energy minimum defined by quantum mechanics, and 2) global stability also requires the structure to follow the complex, yet specific bonding preferences between different atom types. Existing methods fail to incorporate these factors and often lack proper invariances. We propose a Crystal Diffusion Variational Autoencoder (CDVAE) that captures the physical inductive bias of material stability. By learning from the data distribution of stable materials, the decoder generates materials in a diffusion process that moves atomic coordinates towards a lower energy state and updates atom types to satisfy bonding preferences between neighbors. Our model also explicitly encodes interactions across periodic boundaries and respects permutation, translation, rotation, and periodic invariances. We significantly outperform past methods in three tasks: 1) reconstructing the input structure, 2) generating valid, diverse, and realistic materials, and 3) generating materials that optimize a specific property. We also provide several standard datasets and evaluation metrics for the broader machine learning community.\n ', '\n Antibodies are versatile proteins that bind to pathogens like viruses and stimulate the adaptive immune system. The specificity of antibody binding is determined by complementarity-determining regions (CDRs) at the tips of these Y-shaped proteins. In this paper, we propose a generative model to automatically design the CDRs of antibodies with enhanced binding specificity or neutralization capabilities. Previous generative approaches formulate protein design as a structure-conditioned sequence generation task, assuming the desired 3D structure is given a priori. In contrast, we propose to co-design the sequence and 3D structure of CDRs as graphs. Our model unravels a sequence autoregressively while iteratively refining its predicted global structure. The inferred structure in turn guides subsequent residue choices. For efficiency, we model the conditional dependence between residues inside and outside of a CDR in a coarse-grained manner. Our method achieves superior log-likelihood on the test set and outperforms previous baselines in designing antibodies capable of neutralizing the SARS-CoV-2 virus.\n ', '\n While unbiased machine learning models are essential for many applications, bias is a human-defined concept that can vary across tasks. Given only input-label pairs, algorithms may lack sufficient information to distinguish stable (causal) features from unstable (spurious) features. However, related tasks often share similar biases -- an observation we may leverage to develop stable classifiers in the transfer setting. In this work, we explicitly inform the target classifier about unstable features in the source tasks. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. We achieve robustness by clustering data of the target task according to this representation and minimizing the worst-case risk across these clusters. 
We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task, outperforming the best baseline by 22.9% in absolute accuracy across 12 transfer settings. Our code is available at https://github.com/YujiaBao/Tofu.\n ', "\n Prediction of a molecule's 3D conformer ensemble from the molecular graph holds a key role in areas of cheminformatics and drug discovery. Existing generative models have several drawbacks including lack of modeling important molecular geometry elements (e.g. torsion angles), separate optimization stages prone to error accumulation, and the need for structure fine-tuning based on approximate classical force-fields or computationally expensive methods such as metadynamics with approximate quantum mechanics calculations at each geometry. We propose GeoMol--an end-to-end, non-autoregressive and SE(3)-invariant machine learning approach to generate distributions of low-energy molecular 3D conformers. Leveraging the power of message passing neural networks (MPNNs) to capture local and global graph information, we predict local atomic 3D structures and torsion angles, avoiding unnecessary over-parameterization of the geometric degrees of freedom (e.g. one angle per non-terminal bond). Such local predictions suffice both for the training loss computation, as well as for the full deterministic conformer assembly (at test time). We devise a non-adversarial optimal transport based loss function to promote diverse conformer generation. GeoMol predominantly outperforms popular open-source, commercial, or state-of-the-art machine learning (ML) models, while achieving significant speed-ups. We expect such differentiable 3D structure generators to significantly impact molecular modeling and related applications.\n ", '\n Balancing the needs of data privacy and predictive utility is a central challenge for machine learning in healthcare. In particular, privacy concerns have led to a dearth of public datasets, complicated the construction of multi-hospital cohorts and limited the utilization of external machine learning resources. To remedy this, new methods are required to enable data owners, such as hospitals, to share their datasets publicly, while preserving both patient privacy and modeling utility. We propose NeuraCrypt, a private encoding scheme based on random deep neural networks. NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data-owner, and publishes both the encoded data and associated labels publicly. From a theoretical perspective, we demonstrate that sampling from a sufficiently rich family of encoding functions offers a well-defined and meaningful notion of privacy against a computationally unbounded adversary with full knowledge of the underlying data-distribution. We propose to approximate this family of encoding functions through random deep neural networks. Empirically, we demonstrate the robustness of our encoding to a suite of adversarial attacks and show that NeuraCrypt achieves competitive accuracy to non-private baselines on a variety of x-ray tasks. Moreover, we demonstrate that multiple hospitals, using independent private encoders, can collaborate to train improved x-ray models. Finally, we release a challenge dataset to encourage the development of new attacks on NeuraCrypt.\n ', '\n We propose Predict then Interpolate (PI), a simple algorithm for learning correlations that are stable across environments. 
The algorithm follows from the intuition that when using a classifier trained on one environment to make predictions on examples from another environment, its mistakes are informative as to which correlations are unstable. In this work, we prove that by interpolating the distributions of the correct predictions and the wrong predictions, we can uncover an oracle distribution where the unstable correlation vanishes. Since the oracle interpolation coefficients are not accessible, we use group distributionally robust optimization to minimize the worst-case risk across all such interpolations. We evaluate our method on both text classification and image classification. Empirical results demonstrate that our algorithm is able to learn robust classifiers (outperforms IRM by 23.85% on synthetic environments and 12.41% on natural environments). Our code and data are available at https://github.com/YujiaBao/Predict-then-Interpolate.\n ', '\n We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.\n ', '\n Communicating new research ideas involves highlighting similarities and differences with past work. Authors write fluent, often long sections to survey the distinction of a new paper with related work. In this work we model generating related work sections while being cognisant of the motivation behind citing papers. Our content planning model generates a tree of cited papers before a surface realization model lexicalizes this skeleton. Our model outperforms several strong state-of-the-art summarization and multi-document summarization models on generating related work on an ACL Anthology (AA) based dataset which we contribute.\n ', '\n We present a method for generating comparative summaries that highlights similarities and contradictions in input documents. The key challenge in creating such summaries is the lack of large parallel training data required for training typical summarization systems. To this end, we introduce a hybrid generation approach inspired by traditional concept-to-text systems. To enable accurate comparison between different sources, the model first learns to extract pertinent relations from input documents. The content planning component uses deterministic operators to aggregate these relations after identifying a subset for inclusion into a summary. The surface realization component lexicalizes this information using a text-infilling language model. By separately modeling content selection and realization, we can effectively train them with limited annotations. We implemented and tested the model in the domain of nutrition and health -- rife with inconsistencies. 
Compared to conventional methods, our framework leads to more faithful, relevant and aggregation-sensitive summarization -- while being equally fluent.\n ', '\n We introduce \\emph{Nutri-bullets}, a multi-document summarization task for health and nutrition. First, we present two datasets of food and health summaries from multiple scientific studies. Furthermore, we propose a novel \\emph{extract-compose} model to solve the problem in the regime of limited parallel data. We explicitly select key spans from several abstracts using a policy network, followed by composing the selected spans to present a summary via a task specific language model. Compared to state-of-the-art methods, our approach leads to more faithful, relevant and diverse summarization -- properties imperative to this application. For instance, on the BreastCancer dataset our approach gets a more than 50\\% improvement on relevance and faithfulness.\\footnote{Our code and data is available at \\url{https://github.com/darsh10/Nutribullets.}}\n ', '\n Typical fact verification models use retrieved written evidence to verify claims. Evidence sources, however, often change over time as more information is gathered and revised. In order to adapt, models must be sensitive to subtle differences in supporting evidence. We present VitaminC, a benchmark infused with challenging cases that require fact verification models to discern and adjust to slight factual changes. We collect over 100,000 Wikipedia revisions that modify an underlying fact, and leverage these revisions, together with additional synthetically constructed ones, to create a total of over 400,000 claim-evidence pairs. Unlike previous resources, the examples in VitaminC are contrastive, i.e., they contain evidence pairs that are nearly identical in language and content, with the exception that one supports a given claim while the other does not. We show that training using this design increases robustness -- improving accuracy by 10% on adversarial fact verification and 6% on adversarial natural language inference (NLI). Moreover, the structure of VitaminC leads us to define additional tasks for fact-checking resources: tagging relevant words in the evidence for verifying the claim, identifying factual revisions, and providing automatic edits via factually consistent text generation.\n ', '\n We develop a novel approach to conformal prediction when the target task has limited data available for training. Conformal prediction identifies a small set of promising output candidates in place of a single prediction, with guarantees that the set contains the correct answer with high probability. When training data is limited, however, the predicted set can easily become unusably large. In this work, we obtain substantially tighter prediction sets while maintaining desirable marginal guarantees by casting conformal prediction as a meta-learning paradigm over exchangeable collections of auxiliary tasks. Our conformalization algorithm is simple, fast, and agnostic to the choice of underlying model, learning algorithm, or dataset. We demonstrate the effectiveness of this approach across a number of few-shot classification and regression tasks in natural language processing, computer vision, and computational chemistry for drug discovery.\n ', '\n Drug combinations play an important role in therapeutics due to its better efficacy and reduced toxicity. 
Recent approaches have applied machine learning to identify synergistic combinations for cancer, but they are not applicable to new diseases with limited combination data. Given that drug synergy is closely tied to biological targets, we propose a \\emph{biological bottleneck} model that jointly learns drug-target interaction and synergy. The model consists of two parts: a drug-target interaction and target-disease association module. This design enables the model to \\emph{explain} how a biological target affects drug synergy. By utilizing additional biological information, our model achieves 0.78 test AUC in drug synergy prediction using only 90 COVID drug combinations for training. We experimentally tested the model predictions in the U.S. National Center for Advancing Translational Sciences (NCATS) facilities and discovered two novel drug combinations (Remdesivir + Reserpine and Remdesivir + IQ-1S) with strong synergy in vitro.\n ', '\n The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with a Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rather than merely provide generic information about an image. In this task, we use question-answer (QA) pairs---a natural expression of information need---from users, instead of reference captions, for both training and post-inference evaluation. We show that it is possible to use reinforcement learning to directly optimize for the intended information need, by rewarding outputs that allow a question answering model to provide correct answers to sampled user questions. We convert several visual question answering datasets into CapWAP datasets, and demonstrate that under a variety of scenarios our purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.\n ', '\n Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the favored position by the current scholarship.\n ', '\n In this paper, we present a novel approach for conformal prediction (CP), in which we aim to identify a set of promising prediction candidates -- in place of a single prediction. 
This set is guaranteed to contain a correct answer with high probability, and is well-suited for many open-ended classification tasks. In the standard CP paradigm, the predicted set can often be unusably large and also costly to obtain. This is particularly pervasive in settings where the correct answer is not unique, and the number of total possible answers is high. We first expand the CP correctness criterion to allow for additional, inferred "admissible" answers, which can substantially reduce the size of the predicted set while still providing valid performance guarantees. Second, we amortize costs by conformalizing prediction cascades, in which we aggressively prune implausible labels early on by using progressively stronger classifiers -- again, while still providing valid performance guarantees. We demonstrate the empirical effectiveness of our approach for multiple applications in natural language processing and computational chemistry for drug discovery.\n ', '\n In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. Building on the recent success of graph neural networks for learning molecular embeddings and flow-based models for image generation, we propose Mol2Image: a flow-based generative model for molecule to cell image synthesis. To generate cell features at different resolutions and scale to high-resolution images, we develop a novel multi-scale flow architecture based on a Haar wavelet image pyramid. To maximize the mutual information between the generated images and the molecular interventions, we devise a training strategy based on contrastive learning. To evaluate our model, we propose a new set of metrics for biological image generation that are robust, interpretable, and relevant to practitioners. We show quantitatively that our method learns a meaningful embedding of the molecular intervention, which is translated into an image representation reflecting the biological effects of the intervention.\n ', '\n Retrosynthesis prediction is a fundamental problem in organic synthesis, where the task is to identify precursor molecules that can be used to synthesize a target molecule. A key consideration in building neural models for this task is aligning model design with strategies adopted by chemists. Building on this viewpoint, this paper introduces a graph-based approach that capitalizes on the idea that the graph topology of precursor molecules is largely unaltered during a chemical reaction. The model first predicts the set of graph edits transforming the target into incomplete molecules called synthons. Next, the model learns to expand synthons into complete molecules by attaching relevant leaving groups. This decomposition simplifies the architecture, making its predictions more interpretable, and also amenable to manual correction. Our model achieves a top-1 accuracy of $53.7\\%$, outperforming previous template-free and semi-template-based methods.\n ', "\n Current graph neural network (GNN) architectures naively average or sum node embeddings into an aggregated graph representation -- potentially losing structural or semantic information. We here introduce OT-GNN, a model that computes graph embeddings using parametric prototypes that highlight key facets of different graph aspects. Towards this goal, we successfully combine optimal transport (OT) with parametric graph models. 
Graph representations are obtained from Wasserstein distances between the set of GNN node embeddings and ``prototype'' point clouds as free parameters. We theoretically prove that, unlike traditional sum aggregation, our function class on point clouds satisfies a fundamental universal approximation theorem. Empirically, we address an inherent collapse optimization issue by proposing a noise contrastive regularizer to steer the model towards truly exploiting the OT geometry. Finally, we outperform popular methods on several molecular property prediction tasks, while exhibiting smoother graph representations.\n ", '\n Many biochemical applications such as molecular property prediction require models to generalize beyond their training domains (environments). Moreover, natural environments in these tasks are structured, defined by complex descriptors such as molecular scaffolds or protein families. Therefore, most environments are either never seen during training, or contain only a single training example. To address these challenges, we propose a new regret minimization (RGM) algorithm and its extension for structured environments. RGM builds from invariant risk minimization (IRM) by recasting simultaneous optimality condition in terms of predictive regret, finding a representation that enables the predictor to compete against an oracle with hindsight access to held-out environments. The structured extension adaptively highlights variation due to complex environments via specialized domain perturbations. We evaluate our method on multiple applications: molecular property prediction, protein homology and stability prediction and show that RGM significantly outperforms previous state-of-the-art baselines.\n ', '\n Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where unanticipated imprecision wastes valuable time and resources. The need for UQ is especially acute for neural models, which are becoming increasingly standard yet are challenging to interpret. While several approaches to UQ have been proposed in the literature, there is no clear consensus on the comparative performance of these models. In this paper, we study this question in the context of regression tasks. We systematically evaluate several methods on five benchmark datasets using multiple complementary performance metrics. Our experiments show that none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple datasets. While we believe these results show that existing UQ methods are not sufficient for all common use-cases and demonstrate the benefits of further research, we conclude with a practical recommendation as to which existing techniques seem to perform well relative to others.\n ', '\n Effective property prediction methods can help accelerate the search for COVID-19 antivirals either through accurate in-silico screens or by effectively guiding on-going at-scale experimental efforts. However, existing prediction tools have limited ability to accommodate scarce or fragmented training data currently available. In this paper, we introduce a novel approach to learn predictors that can generalize or extrapolate beyond the heterogeneous data. Our method builds on and extends recently proposed invariant risk minimization, adaptively forcing the predictor to avoid nuisance variation. 
We achieve this by continually exercising and manipulating latent representations of molecules to highlight undesirable variation to the predictor. To test the method we use a combination of three data sources: SARS-CoV-2 antiviral screening data, molecular fragments that bind to SARS-CoV-2 main protease and large screening data for SARS-CoV-1. Our predictor outperforms state-of-the-art transfer learning methods by significant margin. We also report the top 20 predictions of our model on Broad drug repurposing hub.\n ', '\n Generative models in molecular design tend to be richly parameterized, data-hungry neural models, as they must create complex structured objects as outputs. Estimating such models from data may be challenging due to the lack of sufficient training data. In this paper, we propose a surprisingly effective self-training approach for iteratively creating additional molecular targets. We first pre-train the generative model together with a simple property predictor. The property predictor is then used as a likelihood model for filtering candidate structures from the generative model. Additional targets are iteratively produced and used in the course of stochastic EM iterations to maximize the log-likelihood that the candidate structures are accepted. A simple rejection (re-weighting) sampler suffices to draw posterior samples since the generative model is already reasonable after pre-training. We demonstrate significant gains over strong baselines for both unconditional and conditional molecular design. In particular, our approach outperforms the previous state-of-the-art in conditional molecular design by over 10% in absolute gain. Finally, we show that our approach is useful in other domains as well, such as program synthesis.\n ', '\n Drug discovery aims to find novel compounds with specified chemical property profiles. In terms of generative modeling, the goal is to learn to sample molecules in the intersection of multiple property constraints. This task becomes increasingly challenging when there are many property constraints. We propose to offset this complexity by composing molecules from a vocabulary of substructures that we call molecular rationales. These rationales are identified from molecules as substructures that are likely responsible for each property of interest. We then learn to expand rationales into a full molecule using graph generative models. Our final generative model composes molecules as mixtures of multiple rationale completions, and this mixture is fine-tuned to preserve the properties of interest. We evaluate our model on various drug design tasks and demonstrate significant improvements over state-of-the-art baselines in terms of accuracy, diversity, and novelty of generated compounds.\n ', '\n Graph generation techniques are increasingly being adopted for drug discovery. Previous graph generation approaches have utilized relatively small molecular building blocks such as atoms or simple cycles, limiting their effectiveness to smaller molecules. Indeed, as we demonstrate, their performance degrades significantly for larger molecules. In this paper, we propose a new hierarchical graph encoder-decoder that employs significantly larger and more flexible graph motifs as basic building blocks. Our encoder produces a multi-resolution representation for each molecule in a fine-to-coarse fashion, from atoms to connected motifs. Each level integrates the encoding of constituents below with the graph at that level. 
Our autoregressive coarse-to-fine decoder adds one motif at a time, interleaving the decision of selecting a new motif with the process of resolving its attachments to the emerging molecule. We evaluate our model on multiple molecule generation tasks, including polymers, and show that our model significantly outperforms previous state-of-the-art baselines.\n ', '\n We propose Blank Language Model (BLM), a model that generates sequences by dynamically creating and filling in blanks. The blanks control which part of the sequence to expand, making BLM ideal for a variety of text editing and rewriting tasks. The model can start from a single blank or partially completed text with blanks at specified locations. It iteratively determines which word to place in a blank and whether to insert new blanks, and stops generating when no blanks are left to fill. BLM can be efficiently trained using a lower bound of the marginal data likelihood. On the task of filling missing text snippets, BLM significantly outperforms all other baselines in terms of both accuracy and fluency. Experiments on style transfer and damaged ancient text restoration demonstrate the potential of this framework for a wide range of applications.\n ', '\n Automatic question generation can benefit many applications ranging from dialogue systems to reading comprehension. While questions are often asked with respect to long documents, there are many challenges with modeling such long documents. Many existing techniques generate questions by effectively looking at one sentence at a time, leading to questions that are easy and not reflective of the human process of question generation. Our goal is to incorporate interactions across multiple sentences to generate realistic questions for long documents. In order to link a broad document context to the target answer, we represent the relevant context via a multi-stage attention mechanism, which forms the foundation of a sequence to sequence model. We outperform state-of-the-art methods on question generation on three question-answering datasets -- SQuAD, MS MARCO and NewsQA.\n ', '\n We propose a new model for making generalizable and diverse retrosynthetic reaction predictions. Given a target compound, the task is to predict the likely chemical reactants to produce the target. This generative task can be framed as a sequence-to-sequence problem by using the SMILES representations of the molecules. Building on top of the popular Transformer architecture, we propose two novel pre-training methods that construct relevant auxiliary tasks (plausible reactions) for our problem. Furthermore, we incorporate a discrete latent variable model into the architecture to encourage the model to produce a diverse set of alternative predictions. On the 50k subset of reaction examples from the United States patent literature (USPTO-50k) benchmark dataset, our model greatly improves performance over the baseline, while also generating predictions that are more diverse.\n ', '\n Online encyclopediae like Wikipedia contain large amounts of text that need frequent corrections and updates. The new information may contradict existing content in encyclopediae. In this paper, we focus on rewriting such dynamically changing articles. This is a challenging constrained generation task, as the output must be consistent with the new information and fit into the rest of the existing document. 
To this end, we propose a two-step solution: (1) We identify and remove the contradicting components in a target text for a given claim, using a neutralizing stance model; (2) We expand the remaining text to be consistent with the given claim, using a novel two-encoder sequence-to-sequence model with copy attention. Applied to a Wikipedia fact update dataset, our method successfully generates updated sentences for new claims, achieving the highest SARI score. Furthermore, we demonstrate that generating synthetic data through such rewritten sentences can successfully augment the FEVER fact-checking training dataset, leading to a relative error reduction of 13%.\n ', '\n This paper explores the task of leveraging typology in the context of cross-lingual dependency parsing. While this linguistic information has shown great promise in pre-neural parsing, results for neural architectures have been mixed. The aim of our investigation is to better understand this state-of-the-art. Our main findings are as follows: 1) The benefit of typological information is derived from coarsely grouping languages into syntactically-homogeneous clusters rather than from learning to leverage variations along individual typological dimensions in a compositional manner; 2) Typology consistent with the actual corpus statistics yields better transfer performance; 3) Typological similarity is only a rough proxy of cross-lingual transferability with respect to parsing.\n ', '\n Recent developments in neural language models (LMs) have raised concerns about their potential misuse for automatically spreading misinformation. In light of these concerns, several studies have proposed to detect machine-generated fake news by capturing their stylistic differences from human-written text. These approaches, broadly termed stylometry, have found success in source attribution and misinformation detection in human-written texts. However, in this work, we show that stylometry is limited against machine-generated misinformation. While humans speak differently when trying to deceive, LMs generate stylistically consistent text, regardless of underlying motive. Thus, though stylometry can successfully prevent impersonation by identifying text provenance, it fails to distinguish legitimate LM applications from those that introduce false information. We create two benchmarks demonstrating the stylistic similarity between malicious and legitimate uses of LMs, employed in auto-completion and editing-assistance settings. Our findings highlight the need for non-stylometry approaches in detecting machine-generated misinformation, and open up the discussion on the desired evaluation benchmarks.\n ', '\n In this paper, we explore meta-learning for few-shot text classification. Meta-learning has shown strong performance in computer vision, where low-level patterns are transferable across learning tasks. However, directly applying this approach to text is challenging--lexical features highly informative for one task may be insignificant for another. Thus, rather than learning solely from words, our model also leverages their distributional signatures, which encode pertinent word occurrence patterns. Our model is trained within a meta-learning framework to map these signatures into attention scores, which are then used to weight the lexical representations of words. 
We demonstrate that our model consistently outperforms prototypical networks learned on lexical knowledge (Snell et al., 2017) in both few-shot text classification and relation classification by a significant margin across six benchmark datasets (20.0% on average in 1-shot classification).\n ', '\n Fact verification requires validating a claim in the context of evidence. We show, however, that in the popular FEVER dataset this might not necessarily be the case. Claim-only classifiers perform competitively with top evidence-aware models. In this paper, we investigate the cause of this phenomenon, identifying strong cues for predicting labels solely based on the claim, without considering any evidence. We create an evaluation set that avoids those idiosyncrasies. The performance of FEVER-trained models significantly drops when evaluated on this test set. Therefore, we introduce a regularization method which alleviates the effect of bias in the training data, obtaining improvements on the newly created test set. This work is a step towards a more sound evaluation of reasoning capabilities in fact verification models.\n ', '\n The problem of accelerating drug discovery relies heavily on automatic tools to optimize precursor molecules to afford them with better biochemical properties. Our work in this paper substantially extends prior state-of-the-art on graph-to-graph translation methods for molecular optimization. In particular, we realize coherent multi-resolution representations by interweaving the encoding of substructure components with the atom-level encoding of the original molecular graph. Moreover, our graph decoder is fully autoregressive, and interleaves each step of adding a new substructure with the process of resolving its attachment to the emerging molecule. We evaluate our model on multiple molecular optimization tasks and show that our model significantly outperforms previous state-of-the-art baselines.\n ', '\n In this paper we propose a novel neural approach for automatic decipherment of lost languages. To compensate for the lack of strong supervision signal, our model design is informed by patterns in language change documented in historical linguistics. The model utilizes an expressive sequence-to-sequence model to capture character-level correspondences between cognates. To effectively train the model in an unsupervised manner, we innovate the training procedure by formalizing it as a minimum-cost flow problem. When applied to the decipherment of Ugaritic, we achieve a 5.5% absolute improvement over state-of-the-art results. We also report the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where our model correctly translates 67.3% of cognates.\n ', '\n Generative autoencoders offer a promising approach for controllable text generation by leveraging their latent sentence representations. However, current models struggle to maintain coherent latent spaces required to perform meaningful text manipulations via latent vector operations. Specifically, we demonstrate by example that neural encoders do not necessarily map similar sentences to nearby latent vectors. A theoretical explanation for this phenomenon establishes that high capacity autoencoders can learn an arbitrary mapping between sequences and associated latent representations. To remedy this issue, we augment adversarial autoencoders with a denoising objective where original sentences are reconstructed from perturbed versions (referred to as DAAE). 
We prove that this simple modification guides the latent space geometry of the resulting model by encouraging the encoder to map similar texts to similar latent representations. In empirical comparisons with various types of autoencoders, our model provides the best trade-off between generation quality and reconstruction capacity. Moreover, the improved geometry of the DAAE latent space enables zero-shot text style transfer via simple latent vector arithmetic.\n ', '\n Much of the recent work on learning molecular representations has been based on Graph Convolution Networks (GCN). These models rely on local aggregation operations and can therefore miss higher-order graph properties. To remedy this, we propose Path-Augmented Graph Transformer Networks (PAGTN) that are explicitly built on longer-range dependencies in graph-structured data. Specifically, we use path features in molecular graphs to create global attention layers. We compare our PAGTN model against the GCN model and show that our model consistently outperforms GCNs on molecular property prediction datasets including quantum chemistry (QM7, QM8, QM9), physical chemistry (ESOL, Lipophilictiy) and biochemistry (BACE, BBBP).\n ', '\n PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools monitoring and prioritizing the literature to understand the clinical implications of the pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations. METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated dataset for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) which learns a linear decision rule based on the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) which learns a complex nonlinear decision rule based on the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS: For penetrance classification, we annotated 3740 paper titles and abstracts and used 60% for training the model, 20% for tuning the model, and 20% for evaluating the model. The SVM model achieves 89.53% accuracy (percentage of papers that were correctly classified) while the CNN model achieves 88.95 % accuracy. For prevalence classification, we annotated 3753 paper titles and abstracts. The SVM model achieves 89.14% accuracy while the CNN model achieves 89.13 % accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.\n ', '\n How do we know if a particular medical treatment actually works? Ideally one would consult all available evidence from relevant clinical trials. Unfortunately, such results are primarily disseminated in natural language scientific articles, imposing substantial burden on those trying to make sense of them. In this paper, we present a new task and corpus for making this unstructured evidence actionable. 
The task entails inferring reported findings from a full-text article describing a randomized controlled trial (RCT) with respect to a given intervention, comparator, and outcome of interest, e.g., inferring if an article provides evidence supporting the use of aspirin to reduce risk of stroke, as compared to placebo.\n We present a new corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs. Results using a suite of models --- ranging from heuristic (rule-based) approaches to attentive neural architectures --- demonstrate the difficulty of the task, which we believe largely owes to the lengthy, technical input texts. To facilitate further work on this important, challenging problem we make the corpus, documentation, a website and leaderboard, and code for baselines and evaluation available at http://evidence-inference.ebm-nlp.com/.\n ', '\n Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.\n ', '\n We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces. This mapping readily supports processing of a target language, improving transfer by context-aware embeddings. Our experimental results demonstrate the effectiveness of this approach for zero-shot and few-shot learning of dependency parsing. Specifically, our method consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.\n ', '\n We view molecular optimization as a graph-to-graph translation problem. The goal is to learn to map from one molecular graph to another with better properties based on an available corpus of paired molecules. Since molecules can be optimized in different ways, there are multiple viable translations for each input graph. A key challenge is therefore to model diverse translation outputs. 
Our primary contributions include a junction tree encoder-decoder for learning diverse graph translations along with a novel adversarial training method for aligning distributions of molecules. Diverse output distributions in our model are explicitly realized by low-dimensional latent vectors that modulate the translation process. We evaluate our model on multiple molecular optimization tasks and show that our model outperforms previous state-of-the-art baselines.\n ', '\n Most modern Information Extraction (IE) systems are implemented as sequential taggers and only model local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. In this paper, we introduce GraphIE, a framework that operates over a graph representing a broad set of dependencies between textual units (i.e. words or sentences). The algorithm propagates information between connected nodes through graph convolutions, generating a richer representation that can be exploited to improve word-level predictions. Evaluation on three different tasks --- namely textual, social media and visual information extraction --- shows that GraphIE consistently outperforms the state-of-the-art sequence tagging model by a significant margin.\n ', '\n We propose a mixture-of-experts approach for unsupervised domain adaptation from multiple sources. The key idea is to explicitly capture the relationship between a target example and different source domains. This relationship, expressed by a point-to-set metric, determines how to combine predictors trained on various domains. The metric is learned in an unsupervised fashion using meta-training. Experimental results on sentiment analysis and part-of-speech tagging demonstrate that our approach consistently outperforms multiple baselines and can robustly handle negative transfer.\n ', '\n Attention-based models are successful when trained on large amounts of data. In this paper, we demonstrate that even in the low-resource scenario, attention can be learned effectively. To this end, we start with discrete human-annotated rationales and map them into continuous attention. Our central hypothesis is that this mapping is general across domains, and thus can be transferred from resource-rich domains to low-resource ones. Our model jointly learns a domain-invariant representation and induces the desired mapping between rationales and attention. Our empirical results validate this hypothesis and show that our approach delivers significant gains over state-of-the-art baselines, yielding over 15% average error reduction on benchmark datasets.\n ', '\n In this position paper, we describe our vision of the future of machine programming through a categorical examination of three pillars of research. Those pillars are: (i) intention, (ii) invention, and(iii) adaptation. Intention emphasizes advancements in the human-to-computer and computer-to-machine-learning interfaces. Invention emphasizes the creation or refinement of algorithms or core hardware and software building blocks through machine learning (ML). Adaptation emphasizes advances in the use of ML-based constructs to autonomously evolve software.\n ', '\n We seek to automate the design of molecules based on specific chemical properties. In computational terms, this task involves continuous embedding and generation of molecular graphs. Our primary contribution is the direct realization of molecular graphs, a task previously approached by generating linear SMILES strings instead of graphs. 
Our junction tree variational autoencoder generates molecular graphs in two phases, by first generating a tree-structured scaffold over chemical substructures, and then combining them into a molecule with a graph message passing network. This approach allows us to incrementally expand molecules while maintaining chemical validity at every step. We evaluate our model on multiple tasks ranging from molecular generation to optimization. Across these tasks, our model outperforms previous state-of-the-art baselines by a significant margin.\n ', '\n The prediction of organic reaction outcomes is a fundamental problem in computational chemistry. Since a reaction may involve hundreds of atoms, fully exploring the space of possible transformations is intractable. The current solution utilizes reaction templates to limit the space, but it suffers from coverage and efficiency issues. In this paper, we propose a template-free approach to efficiently explore the space of product molecules by first pinpointing the reaction center -- the set of nodes and edges where graph edits occur. Since only a small number of atoms contribute to reaction center, we can directly enumerate candidate products. The generated candidates are scored by a Weisfeiler-Lehman Difference Network that models high-order interactions between changes occurring at nodes across the molecule. Our framework outperforms the top-performing template-based approach with a 10\\% margin, while running orders of magnitude faster. Finally, we demonstrate that the model accuracy rivals the performance of domain experts.\n ', '\n In this paper, we explore the utilization of natural language to drive transfer for reinforcement learning (RL). Despite the wide-spread application of deep RL techniques, learning generalized policy representations that work across domains remains a challenging problem. We demonstrate that textual descriptions of environments provide a compact intermediate channel to facilitate effective policy transfer. Specifically, by learning to ground the meaning of text to the dynamics of the environment such as transitions and rewards, an autonomous agent can effectively bootstrap policy learning on a new domain given its description. We employ a model-based RL approach consisting of a differentiable planning module, a model-free component and a factorized state representation to effectively use entity descriptions. Our model outperforms prior work on both transfer and multi-task scenarios in a variety of different environments. For instance, we achieve up to 14% and 11.5% absolute improvement over previously existing models in terms of average and initial rewards, respectively.\n ', '\n Controlling thermal transport is important for a range of devices and technologies, from phase change memories to next-generation electronics. This is especially true in nano-scale devices where thermal transport is altered by the influence of surfaces and changes in dimensionality. In superconducting nanowire single-photon detectors, the thermal boundary conductance (TBC) between the nanowire and the substrate it is fabricated on influences most of the performance metrics that make these detectors attractive for applications. This includes the maximum count rate, latency, jitter, and quantum efficiency. Despite its importance, the study of TBC in superconducting nanowire devices has not been done systematically, primarily due to the lack of a straightforward characterization method. 
Here, we show that simple electrical measurements can be used to estimate the TBC between nanowires and substrates and that these measurements match acoustic mismatch theory across a variety of substrates. Numerical simulations allow us to refine our understanding, however, open questions remain. This work should enable thermal engineering in superconducting nanowire electronics and cryogenic detectors for improved device performance.\n ', '\n We describe optimization of a cryogenic magnetometer that uses nonlinear kinetic inductance in superconducting nanowires as the sensitive element instead of a superconducting quantum interference device (SQUID). The circuit design consists of a loop geometry with two nanowires in parallel, serving as the inductive section of a lumped LC resonator similar to a kinetic inductance detector (KID). This device takes advantage of the multiplexing capability of the KID, allowing for a natural frequency multiplexed readout. The Kinetic Inductance Magnetometer (KIM) is biased with a DC magnetic flux through the inductive loop. A perturbing signal will cause a flux change through the loop, and thus a change in the induced current, which alters the kinetic inductance of the nanowires, causing the resonant frequency of the KIM to shift. This technology has applications in astrophysics, material science, and the medical field for readout of Metallic Magnetic Calorimeters (MMCs), axion detection, and magnetoencephalography (MEG).\n ', '\n A subset of QuantISED Sensor PIs met virtually on May 26, 2020 to discuss a response to a charge by the DOE Office of High Energy Physics. In this document, we summarize the QuantISED sensor community discussion, including a consideration of HEP science enabled by quantum sensors, describing the distinction between Quantum 1.0 and Quantum 2.0, and discussing synergies/complementarity with the new DOE NQI centers and with research supported by other SC offices.\n Quantum 2.0 advances in sensor technology offer many opportunities and new approaches for HEP experiments. The DOE HEP QuantISED program could support a portfolio of small experiments based on these advances. QuantISED experiments could use sensor technologies that exemplify Quantum 2.0 breakthroughs. They would strive to achieve new HEP science results, while possibly spinning off other domain science applications or serving as pathfinders for future HEP science targets. QuantISED experiments should be led by a DOE laboratory, to take advantage of laboratory technical resources, infrastructure, and expertise in the safe and efficient construction, operation, and review of experiments.\n The QuantISED PIs emphasized that the quest for HEP science results under the QuantISED program is distinct from the ongoing DOE HEP programs on the energy, intensity, and cosmic frontiers. There is robust evidence for the existence of particles and phenomena beyond the Standard Model, including dark matter, dark energy, quantum gravity, and new physics responsible for neutrino masses, cosmic inflation, and the cosmic preference for matter over antimatter. Where is this physics and how do we find it? 
The QuantISED program can exploit new capabilities provided by quantum technology to probe these kinds of science questions in new ways and over a broader range of science parameters than can be achieved with conventional techniques.\n ', '\n It has previously been shown that 2D spectral mammography can be used to discriminate between (likely benign) cystic and (potentially malignant) solid lesions in order to reduce unnecessary recalls in mammography. One limitation of the technique is, however, that the composition of overlapping tissue needs to be interpolated from a region surrounding the lesion. The purpose of this investigation was to demonstrate that lesion characterization can be done with spectral tomosynthesis, and to investigate whether the 3D information available in tomosynthesis can reduce the uncertainty from the interpolation of surrounding tissue. A phantom experiment was designed to simulate a cyst and a tumor, where the tumor was overlaid with a structure that made it mimic a cyst. In 2D, the two targets appeared similar in composition, whereas spectral tomosynthesis revealed the exact compositional difference. However, the loss of discrimination signal due to spread from the plane of interest was of the same strength as the reduction of anatomical noise. Results from a preliminary investigation on clinical tomosynthesis images of solid lesions yielded results that were consistent with the phantom experiments, but were still to some extent inconclusive. We conclude that lesion characterization is feasible in spectral tomosynthesis, but more data, as well as refinement of the calibration and discrimination algorithms, are needed to draw final conclusions about the benefit compared to 2D.\n ', '\n Spectral imaging is the acquisition of multiple images of an object at different energy spectra. In mammography, dual-energy imaging (spectral imaging with two energy levels) has been investigated for several applications, in particular material decomposition, which allows for quantitative analysis of breast composition and quantitative contrast-enhanced imaging. Material decomposition with dual-energy imaging is based on the assumption that there are two dominant photon interaction effects that determine linear attenuation: the photoelectric effect and Compton scattering. This assumption limits the number of basis materials, i.e. the number of materials that are possible to differentiate between, to two. However, Rayleigh scattering may account for more than 10% of the linear attenuation in the mammography energy range. In this work, we show that a modified version of a scanning multi-slit spectral photon-counting mammography system is able to acquire three images at different spectra and can be used for triple-energy imaging. We further show that triple-energy imaging in combination with the efficient scatter rejection of the system enables measurement of Rayleigh scattering, which adds an additional energy dependency to the linear attenuation and enables material decomposition with three basis materials. Three available basis materials have the potential to improve virtually all applications of spectral imaging.\n ', '\n The development of new x-ray imaging techniques often requires prior knowledge of tissue attenuation, but the sources of such information are sparse. We have measured the attenuation of adipose breast tissue using spectral imaging, in vitro and in vivo. 
For the in-vitro measurement, fixed samples of adipose breast tissue were imaged on a spectral mammography system, and the energy-dependent x-ray attenuation was measured in terms of equivalent thicknesses of aluminum and poly-methyl methacrylate (PMMA). For the in-vivo measurement, a similar procedure was applied on a number of spectral screening mammograms. The results of the two measurements agreed well and were consistent with published attenuation data and with measurements on tissue-equivalent material.\n ', '\n Measurements of breast density have the potential to improve the efficiency and reduce the cost of screening mammography through personalized screening. Breast density has traditionally been evaluated from the dense area in a mammogram, but volumetric assessment methods, which measure the volumetric fraction of fibro-glandular tissue in the breast, are potentially more consistent and physically sound. The purpose of the present study is to evaluate a method for measuring the volumetric breast density using photon-counting spectral tomosynthesis. The performance of the method was evaluated using phantom measurements and clinical data from a small population (n=18). The precision was determined to 2.4 percentage points (pp) of volumetric breast density. Strong correlations were observed between contralateral (R^2=0.95) and ipsilateral (R^2=0.96) breast-density measurements. The measured breast density was anti-correlated to breast thickness, as expected, and exhibited a skewed distribution in the range [3.7%, 55%] and with a median of 18%. We conclude that the method yields promising results that are consistent with expectations. The relatively high precision of the method may enable novel applications such as treatment monitoring.\n ', '\n We fabricate superconducting ion traps with niobium and niobium nitride and trap single 88Sr ions at cryogenic temperatures. The superconducting transition is verified and characterized by measuring the resistance and critical current using a 4-wire measurement on the trap structure, and observing change in the rf reflection. The lowest observed heating rate is 2.1(3) quanta/sec at 800 kHz at 6 K and shows no significant change across the superconducting transition, suggesting that anomalous heating is primarily caused by noise sources on the surface. This demonstration of superconducting ion traps opens up possibilities for integrating trapped ions and molecular ions with superconducting devices.\n ', '\n We investigate the recovery of superconducting NbN-nanowire photon counters after detection of an optical pulse at a wavelength of 1550 nm, and present a model that quantitatively accounts for our observations. The reset time is found to be limited by the large kinetic inductance of these nanowires, which forces a tradeoff between counting rate and either detection efficiency or active area. Devices of usable size and high detection efficiency are found to have reset times orders of magnitude longer than their intrinsic photoresponse time.\n ', '\n Quantum optical techniques may yield immersion fluids with high indices of refraction without absorption. We describe one such technique in which a probe field experiences a large index of refraction with amplification rather than absorption, and examine its practicality for an immersion lithography application. Enhanced index can be observed in a three-level system with a tunable, near-resonant, coherent probe and incoherent pump field that inverts population of the probe transition. 
This observation contradicts the common belief that large indices of refraction are impossible without absorption, however it is well in accord with existing electromagnetic theory and practice. Calculations show that a refractive index >> 2 is possible with practical experimental parameters. A scheme with an incoherent mixture of pumped and unpumped atoms is also examined, and is seen to have a lower refractive index (~2) accompanied by neither gain nor loss.\n ', "\n A numerical method for solving Schrodinger's equation based upon a Baker-Campbell-Hausdorff (BCH) expansion of the time evolution operator is presented herein. The technique manifestly preserves wavefunction norm, and it can be applied to problems in any number of spatial dimensions. We also identify a particular dimensionless ratio of potential to kinetic energies as a key coupling constant. This coupling establishes characteristic length and time scales for a large class of low energy quantum states, and it guides the choice of step sizes in numerical work. Using the BCH method in conjunction with an imaginary time rotation, we compute low energy eigenstates for several quantum systems coupled to non-trivial background potentials. The approach is subsequently applied to the study of 1D propagating wave packets and 2D bound state time development. Failures of classical expectations uncovered by simulations of these simple systems help develop quantum intuition.\n Finally, we investigate the response of a Superconducting Quantum Interference Device (SQUID) to a time dependent potential. We discuss how to engineer the potential's energy and time scales so that the SQUID acts as a quantum NOT gate. The notional simulation we present for this gate provides useful insight into the design of one candidate building block for a quantum computer.\n ", '\n In this paper we aim to provide analysis and insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. In particular, through a unifying abstract mathematical framework, we show that the principal AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.\n ', '\n We introduce a contractive abstract dynamic programming framework and related policy iteration algorithms, specifically designed for sequential zero-sum games and minimax problems with a general structure. Aside from greater generality, the advantage of our algorithms over alternatives is that they resolve some long-standing convergence difficulties of the "natural" policy iteration algorithm, which have been known since the Pollatschek and Avi-Itzhak method [PoA69] for finite-state Markov games. Mathematically, this "natural" algorithm is a form of Newton\'s method for solving Bellman\'s equation, but Newton\'s method, contrary to the case of single-player DP problems, is not globally convergent in the case of a minimax problem, because the Bellman operator may have components that are neither convex nor concave. 
Our algorithms address this difficulty by introducing alternating player choices, and by using a policy-dependent mapping with a uniform sup-norm contraction property, similar to earlier works by Bertsekas and Yu [BeY10], [BeY12], [YuB13]. Moreover, our algorithms allow a convergent and highly parallelizable implementation, which is based on state space partitioning, and distributed asynchronous policy evaluation and policy improvement operations within each set of the partition. Our framework is also suitable for the use of reinforcement learning methods based on aggregation, which may be useful for large-scale problem instances.\n ', '\n In this paper we propose an on-line policy iteration (PI) algorithm for finite-state infinite horizon discounted dynamic programming, whereby the policy improvement operation is done on-line, only for the states that are encountered during operation of the system. This allows the continuous updating/improvement of the current policy, thus resulting in a form of on-line PI that incorporates the improved controls into the current policy as new states and controls are generated. The algorithm converges in a finite number of stages to a type of locally optimal policy, and suggests the possibility of variants of PI and multiagent PI where the policy improvement is simplified. Moreover, the algorithm can be used with on-line replanning, and is also well-suited for on-line PI algorithms with value and policy approximations.\n ', "\n In this paper we consider infinite horizon discounted dynamic programming problems with finite state and control spaces, partial state observations, and a multiagent structure. We discuss and compare algorithms that simultaneously or sequentially optimize the agents' controls by using multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation. Our methods specifically address the computational challenges of partially observable multiagent problems. In particular: 1) We consider rollout algorithms that dramatically reduce required computation while preserving the key cost improvement property of the standard rollout method. The per-step computational requirements for our methods are on the order of $O(Cm)$ as compared with $O(C^m)$ for standard rollout, where $C$ is the maximum cardinality of the constraint set for the control component of each agent, and $m$ is the number of agents. 2) We show that our methods can be applied to challenging problems with a graph structure, including a class of robot repair problems whereby multiple robots collaboratively inspect and repair a system under partial information. 3) We provide a simulation study that compares our methods with existing methods, and demonstrate that our methods can handle larger and more complex partially observable multiagent problems (state space size $10^{37}$ and control space size $10^{7}$, respectively). Finally, we incorporate our multiagent rollout algorithms as building blocks in an approximate policy iteration scheme, where successive rollout policies are approximated by using neural network classifiers. While this scheme requires a strictly off-line implementation, it works well in our computational experiments and produces additional significant performance improvement over the single online rollout iteration method.\n ", '\n We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. 
In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order, with knowledge of the choices of the preceding agents in the order. As a result, the amount of computation for each policy improvement grows linearly with the number of agents, as opposed to exponentially for the standard all-agents-at-once method. For the case of a finite-state discounted problem, we showed convergence to an agent-by-agent optimal policy. In this paper, this result is extended to value iteration and optimistic versions of policy iteration, as well as to more general DP problems where the Bellman operator is a contraction mapping, such as stochastic shortest path problems with all policies being proper.\n ', "\n We consider an extension of the rollout algorithm that applies to constrained deterministic dynamic programming, including challenging combinatorial optimization problems. The algorithm relies on a suboptimal policy, called base heuristic. Under suitable assumptions, we show that if the base heuristic produces a feasible solution, the rollout algorithm has a cost improvement property: it produces a feasible solution, whose cost is no worse than the base heuristic's cost.\n We then focus on multiagent problems, where the control at each stage consists of multiple components (one per agent), which are coupled either through the cost function or the constraints or both. We show that the cost improvement property is maintained with an alternative implementation that has greatly reduced computational requirements, and makes possible the use of rollout in problems with many agents. We demonstrate this alternative algorithm by applying it to layered graph problems that involve both a spatial and a temporal structure. We consider in some detail a prominent example of such problems: multidimensional assignment, where we use the auction algorithm for 2-dimensional assignment as a base heuristic. This auction algorithm is particularly well-suited for our context, because through the use of prices, it can advantageously use the solution of an assignment problem as a starting point for solving other related assignment problems, and this can greatly speed up the execution of the rollout algorithm.\n ", '\n In this paper we consider infinite horizon discounted dynamic programming problems with finite state and control spaces, and partial state observations. We discuss an algorithm that uses multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation. This algorithm is also used for policy improvement in an approximate policy iteration scheme, where successive policies are approximated by using a neural network classifier. A novel feature of our approach is that it is well suited for distributed computation through an extended belief space formulation and the use of a partitioned architecture, which is trained with multiple neural networks. We apply our methods in simulation to a class of sequential repair problems where a robot inspects and repairs a pipeline with potentially several rupture sites under partial information about the state of the pipeline.\n ', '\n We propose a new aggregation framework for approximate dynamic programming, which provides a connection with rollout algorithms, approximate policy iteration, and other single and multistep lookahead methods. 
The central novel characteristic is the use of a bias function $V$ of the state, which biases the values of the aggregate cost function towards their correct levels. The classical aggregation framework is obtained when $V\\equiv0$, but our scheme works best when $V$ is a known reasonably good approximation to the optimal cost function $J^*$.\n When $V$ is equal to the cost function $J_μ$ of some known policy $μ$ and there is only one aggregate state, our scheme is equivalent to the rollout algorithm based on $μ$ (i.e., the result of a single policy improvement starting with the policy $μ$). When $V=J_μ$ and there are multiple aggregate states, our aggregation approach can be used as a more powerful form of improvement of $μ$. Thus, when combined with an approximate policy evaluation scheme, our approach can form the basis for a new and enhanced form of approximate policy iteration.\n When $V$ is a generic bias function, our scheme is equivalent to approximation in value space with lookahead function equal to $V$ plus a local correction within each aggregate state. The local correction levels are obtained by solving a low-dimensional aggregate DP problem, yielding an arbitrarily close approximation to $J^*$, when the number of aggregate states is sufficiently large. Except for the bias function, the aggregate DP problem is similar to the one of the classical aggregation framework, and its algorithmic solution by simulation or other methods is nearly identical to one for classical aggregation, assuming values of $V$ are available when needed.\n ', "\n We consider finite and infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. We introduce an approach, whereby at every stage, each agent's decision is made by executing a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of local computation required at every stage by each agent is independent of the number of agents, while the amount of total computation (over all agents) grows linearly with the number of agents. By contrast, with the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the drastic reduction in required computation, we show that our algorithm has the fundamental cost improvement property of rollout: an improved performance relative to the base policy. We also discuss possibilities to improve further the method's computational efficiency through limited agent coordination and parallelization of the agents' computations. Finally, we explore related approximate policy iteration algorithms for infinite horizon problems, and we prove that the cost improvement property steers the algorithm towards convergence to an agent-by-agent optimal policy.\n ", '\n In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. 
In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation, than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.\n ', "\n We consider discrete-time infinite horizon deterministic optimal control problems with nonnegative cost per stage, and a destination that is cost-free and absorbing. The classical linear-quadratic regulator problem is a special case. Our assumptions are very general, and allow the possibility that the optimal policy may not be stabilizing the system, e.g., may not reach the destination either asymptotically or in a finite number of steps. We introduce a new unifying notion of stable feedback policy, based on perturbation of the cost per stage, which in addition to implying convergence of the generated states to the destination, quantifies the speed of convergence. We consider the properties of two distinct cost functions: $\\jstar$, the overall optimal, and $\\hat J$, the restricted optimal over just the stable policies. Different classes of stable policies (with different speeds of convergence) may yield different values of $\\hat J$. We show that for any class of stable policies, $\\hat J$ is a solution of Bellman's equation, and we characterize the smallest and the largest solutions: they are $\\jstar$, and $J^+$, the restricted optimal cost function over the class of (finitely) terminating policies. We also characterize the regions of convergence of various modified versions of value and policy iteration algorithms, as substitutes for the standard algorithms, which may not work in general.\n ", "\n We consider stochastic shortest path problems with infinite state and control spaces, a nonnegative cost per stage, and a termination state. We extend the notion of a proper policy, a policy that terminates within a finite expected number of steps, from the context of finite state space to the context of infinite state space. We consider the optimal cost function $J^*$, and the optimal cost function $\\hat J$ over just the proper policies. We show that $J^*$ and $\\hat J$ are the smallest and largest solutions of Bellman's equation, respectively, within a suitable class of Lyapounov-like functions. If the cost per stage is bounded, these functions are those that are bounded over the effective domain of $\\hat J$. The standard value iteration algorithm may be attracted to either $J^*$ or $\\hat J$, depending on the initial condition.\n ", '\n We consider large linear and nonlinear fixed point problems, and solution with proximal algorithms. 
We show that there is a close connection between two seemingly different types of methods from distinct fields: 1) Proximal iterations for linear systems of equations, which are prominent in numerical analysis and convex optimization, and 2) Temporal difference (TD) type methods, such as TD(lambda), LSTD(lambda), and LSPE(lambda), which are central in simulation-based approximate dynamic programming/reinforcement learning (DP/RL), and its recent prominent successes in large-scale game contexts, among others.\n One benefit of this connection is a new and simple way to accelerate the standard proximal algorithm by extrapolation towards the TD iteration, which generically has a faster convergence rate. Another benefit is the potential integration into the proximal algorithmic context of several new ideas that have emerged in the DP/RL context. We discuss some of the possibilities, and in particular, algorithms that project each proximal iterate onto the subspace spanned by a small number of basis functions, using low-dimensional calculations and simulation. A third benefit is that insights and analysis from proximal algorithms can be brought to bear on the enhancement of TD methods.\n The linear fixed point methodology can be extended to nonlinear fixed point problems involving a contraction, thus providing guaranteed and potentially substantial acceleration of the proximal and forward backward splitting algorithms at no extra cost. Moreover, the connection of proximal and TD methods can be extended to nonlinear (nondifferentiable) fixed point problems through new proximal-like algorithms that involve successive linearization, similar to policy iteration in DP.\n ', '\n We consider challenging dynamic programming models where the associated Bellman equation, and the value and policy iteration algorithms commonly exhibit complex and even pathological behavior. Our analysis is based on the new notion of regular policies. These are policies that are well-behaved with respect to value and policy iteration, and are patterned after proper policies, which are central in the theory of stochastic shortest path problems. We show that the optimal cost function over regular policies may have favorable value and policy iteration properties, which the optimal cost function over all policies need not have. We accordingly develop a unifying methodology to address long standing analytical and algorithmic issues in broad classes of undiscounted models, including stochastic and minimax shortest path problems, as well as positive cost, negative cost, risk-sensitive, and multiplicative cost problems.\n ', '\n In this paper we consider shortest path problems in a directed graph where the transitions between nodes are subject to uncertainty. We use a minimax formulation, where the objective is to guarantee that a special destination state is reached with a minimum cost path under the worst possible instance of the uncertainty. Problems of this type arise, among others, in planning and pursuit-evasion contexts, and in model predictive control. Our analysis makes use of the recently developed theory of abstract semicontractive dynamic programming models. 
We investigate questions of existence and uniqueness of solution of the optimality equation, existence of optimal paths, and the validity of various algorithms patterned after the classical methods of value and policy iteration, as well as a Dijkstra-like algorithm for problems with nonnegative arc lengths.

In this paper we consider a broad class of infinite horizon discrete-time optimal control models that involve a nonnegative cost function and an affine mapping in their dynamic programming equation. They include as special cases classical models such as stochastic undiscounted nonnegative cost problems, stochastic multiplicative cost problems, and risk-sensitive problems with exponential cost. We focus on the case where the state space is finite and the control space has some compactness properties. We assume that the affine mapping has a semicontractive character, whereby for some policies it is a contraction, while for others it is not. In one line of analysis, we impose assumptions that guarantee that the latter policies cannot be optimal. Under these assumptions, we prove strong results that resemble those for discounted Markovian decision problems, such as the uniqueness of solution of Bellman's equation, and the validity of forms of value and policy iteration. In the absence of these assumptions, the results are weaker and unusual in character: the optimal cost function need not be a solution of Bellman's equation, and an optimal policy may not be found by value or policy iteration. Instead the optimal cost function over just the contractive policies solves Bellman's equation, and can be computed by a variety of algorithms.

We consider minimization of the sum of a large number of convex functions, and we propose an incremental aggregated version of the proximal algorithm, which bears similarity to the incremental aggregated gradient and subgradient methods that have received a lot of recent attention. Under cost function differentiability and strong convexity assumptions, we show linear convergence for a sufficiently small constant stepsize. This result also applies to distributed asynchronous variants of the method, involving bounded interprocessor communication delays.

We then consider dual versions of incremental proximal algorithms, which are incremental augmented Lagrangian methods for separable equality-constrained optimization problems. Contrary to the standard augmented Lagrangian method, these methods admit decomposition in the minimization of the augmented Lagrangian, and update the multipliers far more frequently. Our incremental aggregated augmented Lagrangian methods bear similarity to several known decomposition algorithms, including the alternating direction method of multipliers (ADMM) and more recent variations. We compare these methods in terms of their properties, and highlight their potential advantages and limitations.

We also address the solution of separable inequality-constrained optimization problems through the use of nonquadratic augmented Lagrangians such as the exponential, and we dually consider a corresponding incremental aggregated version of the proximal algorithm that uses nonquadratic regularization, such as an entropy function. We finally propose a closely related linearly convergent method for minimization of large differentiable sums subject to an orthant constraint, which may be viewed as an incremental aggregated version of the mirror descent method.

We survey incremental methods for minimizing a sum $\sum_{i=1}^m f_i(x)$ consisting of a large number of convex component functions $f_i$. Our methods consist of iterations applied to single components, and have proved very effective in practice. We introduce a unified algorithmic framework for a variety of such methods, some involving gradient and subgradient iterations, which are known, and some involving combinations of subgradient and proximal methods, which are new and offer greater flexibility in exploiting the special structure of $f_i$. We provide an analysis of the convergence and rate of convergence properties of these methods, including the advantages offered by randomization in the selection of components. We also survey applications in inference/machine learning, signal processing, and large-scale and distributed optimization.

In this paper we discuss $\lambda$-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE($\lambda$), LSTD($\lambda$), or TD($\lambda$) for policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.

In this paper, we consider discrete-time infinite horizon problems of optimal control to a terminal set of states. These are the problems that are often taken as the starting point for adaptive dynamic programming. Under very general assumptions, we establish the uniqueness of solution of Bellman's equation, and we provide convergence results for value and policy iteration.

We consider Newton methods for common types of single commodity and multi-commodity network flow problems. Despite the potentially very large dimension of the problem, they can be implemented using the conjugate gradient method and low-dimensional network operations, as shown nearly thirty years ago. We revisit these methods, compare them to more recent proposals, and describe how they can be implemented in a distributed computing system. We also discuss generalizations, including the treatment of arc gains, linear side constraints, and related special structures.

In this paper, we propose a new lower approximation scheme for POMDP with discounted and average cost criterion. The approximating functions are determined by their values at a finite number of belief points, and can be computed efficiently using value iteration algorithms for finite-state MDP. While for discounted problems several lower approximation schemes have been proposed earlier, ours seems the first of its kind for average cost problems. We focus primarily on the average cost case, and we show that the corresponding approximation can be computed efficiently using multi-chain algorithms for finite-state MDP. We give a preliminary analysis showing that regardless of the existence of the optimal average cost $J$ in the POMDP, the approximation obtained is a lower bound of the liminf optimal average cost function, and can also be used to calculate an upper bound on the limsup optimal average cost function, as well as bounds on the cost of executing the stationary policy associated with the approximation. We show the convergence of the cost approximation when the optimal average cost is constant and the optimal differential cost is continuous.

Using Phylogenetic Algebraic Geometry, we analyze computationally the phylogenetic tree of subfamilies of the Indo-European language family, using data of syntactic structures. The two main sources of syntactic data are the SSWL database and Longobardi's recent data of syntactic parameters. We compute phylogenetic invariants and likelihood functions for two sets of Germanic languages, a set of Romance languages, a set of Slavic languages and a set of early Indo-European languages, and we compare the results with what is known through historical linguistics.

Measuring the distance between two bacterial genomes under the inversion process is usually done by assuming all inversions to occur with equal probability. Recently, an approach to calculating inversion distance using group theory was introduced, and is effective for the model in which only very short inversions occur. In this paper, we show how to use the group-theoretic framework to establish minimal distance for any weighting on the set of inversions, generalizing previous approaches. To do this we use the theory of rewriting systems for groups, and exploit the Knuth--Bendix algorithm, the first time this theory has been introduced into genome rearrangement problems.

The central idea of the approach is to use existing group theoretic methods to find an initial path between two genomes in genome space (for instance using only short inversions), and then to deform this path to optimality using a confluent system of rewriting rules generated by the Knuth--Bendix algorithm.

Modellers of large scale genome rearrangement events, in which segments of DNA are inverted, moved, swapped, or even inserted or deleted, have found a natural syntax in the language of permutations. Despite this, there has been a wide range of modelling choices, assumptions and interpretations that make navigating the literature a significant challenge. Indeed, even authors of papers that use permutations to model genome rearrangement can struggle to interpret each others' work, because of subtle differences in basic assumptions that are often deeply ingrained (and consequently sometimes not even mentioned). In this paper, we describe the different ways in which permutations have been used to model genomes and genome rearrangement events, presenting some features and limitations of each approach, and show how the various models are related.
This paper will help researchers navigate the landscape of genome rearrangement models, and make it easier for authors to present clear and consistent models.\n ", '\n Establishing a distance between genomes is a significant problem in computational genomics, because its solution can be used to establish evolutionary relationships including phylogeny.\n The "double cut and join" (DCJ) model of chromosomal rearrangement proposed by Yancopoulos et al. has received attention as it can model inversions, translocations, fusion and fission on a multichromosomal genome that may contain both linear and circular chromosomes. In this paper, we realize the DCJ operator as a group action on the space of multichromosomal genomes. We study this group action, deriving some properties of the group and finding group-theoretic analogues for the key results in the DCJ theory.\n ', '\n We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks to deep reinforcement learning (DRL) and is also important for rolling out real-world RL agent under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found, which is guaranteed to obtain the worst case agent reward. For DRL settings, this leads to a novel empirical adversarial attack to RL agents via a learned adversary that is much stronger than previous ones. To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework. Additionally, inspired by the analysis of state-adversarial Markov decision process (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find a LSTM based policy can be more robust under adversaries. Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL.\n ', '\n Recent papers have demonstrated that ensemble stumps and trees could be vulnerable to small input perturbations, so robustness verification and defense for those models have become an important research problem. However, due to the structure of decision trees, where each node makes decision purely based on one feature value, all the previous works only consider the $\\ell_\\infty$ norm perturbation. To study robustness with respect to a general $\\ell_p$ norm perturbation, one has to consider the correlation between perturbations on different features, which has not been handled by previous algorithms. In this paper, we study the problem of robustness verification and certified defense with respect to general $\\ell_p$ norm perturbations for ensemble decision stumps and trees. For robustness verification of ensemble stumps, we prove that complete verification is NP-complete for $p\\in(0, \\infty)$ while polynomial time algorithms exist for $p=0$ or $\\infty$. For $p\\in(0, \\infty)$ we develop an efficient dynamic programming based algorithm for sound verification of ensemble stumps. For ensemble trees, we generalize the previous multi-level robustness verification algorithm to $\\ell_p$ norm. 
We demonstrate the first certified defense method for training ensemble stumps and trees with respect to $\\ell_p$ norm perturbations, and verify its effectiveness empirically on real datasets.\n ', '\n Multi-stage training and knowledge transfer, from a large-scale pretraining task to various finetuning tasks, have revolutionized natural language processing and computer vision resulting in state-of-the-art performance improvements. In this paper, we develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data. With this score, we can identify the pretraining examples in the pretraining task that contribute most to a prediction in the finetuning task. The proposed multi-stage influence function generalizes the original influence function for a single model in (Koh & Liang, 2017), thereby enabling influence computation through both pretrained and finetuned models. We study two different scenarios with the pretrained embeddings fixed or updated in the finetuning tasks. We test our proposed method in various experiments to show its effectiveness and potential applications.\n ', '\n A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises. Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions. Several works have shown this vulnerability via adversarial attacks, but existing approaches on improving the robustness of DRL under this setting have limited success and lack for theoretical principles. We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks. We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization which can be applied to a large family of DRL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG) and deep Q networks (DQN), for both discrete and continuous action control problems. We significantly improve the robustness of PPO, DDPG and DQN agents under a suite of strong white box adversarial attacks, including new attacks of our own. Additionally, we find that a robust policy noticeably improves DRL performance even without an adversary in a number of environments. Our code is available at https://github.com/chenhongge/StateAdvDRL.\n ', '\n Training neural networks with verifiable robustness guarantees is challenging. Several existing approaches utilize linear relaxation based neural network output bounds under perturbation, but they can slow down training by a factor of hundreds depending on the underlying network architectures. Meanwhile, interval bound propagation (IBP) based training is efficient and significantly outperforms linear relaxation based methods on many tasks, yet it may suffer from stability issues since the bounds are much looser especially at the beginning of training. In this paper, we propose a new certified adversarial training method, CROWN-IBP, by combining the fast IBP bounds in a forward bounding pass and a tight linear relaxation based bound, CROWN, in a backward bounding pass. CROWN-IBP is computationally efficient and consistently outperforms IBP baselines on training verifiably robust neural networks. 
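Since the CROWN-IBP abstract above relies on interval bound propagation (IBP) for its fast forward bounding pass, a minimal sketch of plain IBP may help fix ideas. It assumes a fully connected ReLU network given as lists of numpy weight matrices and bias vectors (names illustrative), with an $\ell_\infty$ ball of radius eps around the input; this is not the paper's implementation, only the textbook interval-arithmetic recursion it builds on.

```python
import numpy as np

def ibp_bounds(weights, biases, x, eps):
    """Propagate elementwise lower/upper bounds through affine + ReLU layers."""
    lower, upper = x - eps, x + eps
    for i, (W, b) in enumerate(zip(weights, biases)):
        center = (lower + upper) / 2.0
        radius = (upper - lower) / 2.0
        center, radius = W @ center + b, np.abs(W) @ radius    # interval arithmetic for the affine map
        lower, upper = center - radius, center + radius
        if i < len(weights) - 1:                               # ReLU on every hidden layer
            lower, upper = np.maximum(lower, 0.0), np.maximum(upper, 0.0)
    return lower, upper                                        # bounds on the output logits
```

The backward, linear-relaxation (CROWN-style) pass described in the abstract serves to tighten these cheap but loose interval bounds.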
We conduct large scale experiments on MNIST and CIFAR datasets, and outperform all previous linear relaxation and bound propagation based certified defenses in $\ell_\infty$ robustness. Notably, we achieve 7.02% verified test error on MNIST at $ε=0.3$, and 66.94% on CIFAR-10 with $ε=8/255$. Code is available at https://github.com/deepmind/interval-bound-propagation (TensorFlow) and https://github.com/huanzhang12/CROWN-IBP (PyTorch).

We study the robustness verification problem for tree-based models, including decision trees, random forests (RFs) and gradient boosted decision trees (GBDTs). Formal robustness verification of decision tree ensembles involves finding the exact minimal adversarial perturbation or a guaranteed lower bound of it. Existing approaches find the minimal adversarial perturbation by a mixed integer linear programming (MILP) problem, which takes exponential time and so is impractical for large ensembles. Although this verification problem is NP-complete in general, we give a more precise complexity characterization. We show that there is a simple linear time algorithm for verifying a single tree, and for tree ensembles, the verification problem can be cast as a max-clique problem on a multi-partite graph with bounded boxicity. For low dimensional problems when boxicity can be viewed as constant, this reformulation leads to a polynomial time algorithm. For general problems, by exploiting the boxicity of the graph, we develop an efficient multi-level verification algorithm that can give tight lower bounds on the robustness of decision tree ensembles, while allowing iterative improvement and any-time termination. On RF/GBDT models trained on 10 datasets, our algorithm is hundreds of times faster than the previous approach that requires solving MILPs, and is able to give tight robustness verification bounds on large GBDTs with hundreds of deep trees.

Although adversarial examples and model robustness have been extensively studied in the context of linear models and neural networks, research on this issue in tree-based models and how to make tree-based models robust against adversarial examples is still limited. In this paper, we show that tree-based models are also vulnerable to adversarial examples and develop a novel algorithm to learn robust trees. At its core, our method aims to optimize the performance under the worst-case perturbation of input features, which leads to a max-min saddle point problem. Incorporating this saddle point objective into the decision tree building procedure is non-trivial due to the discrete nature of trees --- a naive approach to finding the best split according to this saddle point objective will take exponential time. To make our approach practical and scalable, we propose efficient tree building algorithms by approximating the inner minimizer in this saddle point problem, and present efficient implementations for classical information gain based trees as well as state-of-the-art tree boosting models such as XGBoost. Experimental results on real world datasets demonstrate that the proposed algorithms can substantially improve the robustness of tree-based models against adversarial examples.

The adversarial training procedure proposed by Madry et al. (2018) is one of the most effective methods to defend against adversarial examples in deep neural networks (DNNs). In our paper, we shed some light on the practicality and the hardness of adversarial training by showing that the effectiveness (robustness on the test set) of adversarial training has a strong correlation with the distance between a test point and the manifold of training data embedded by the network. Test examples that are relatively far away from this manifold are more likely to be vulnerable to adversarial attacks. Consequently, an adversarial training based defense is susceptible to a new class of attacks, the "blind-spot attack", where the input images reside in "blind-spots" (low density regions) of the empirical distribution of training data but are still on the ground-truth data manifold. For MNIST, we found that these blind-spots can be easily found by simply scaling and shifting image pixel values. Most importantly, for large datasets with high dimensional and complex data manifolds (CIFAR, ImageNet, etc.), the existence of blind-spots in adversarial training makes defending on any valid test examples difficult due to the curse of dimensionality and the scarcity of training data. Additionally, we find that blind-spots also exist in provable defenses including (Wong & Kolter, 2018) and (Sinha et al., 2018) because these trainable robustness certificates can only be practically optimized on a limited set of training data.

Verifying the robustness property of a general Rectified Linear Unit (ReLU) network is an NP-complete problem [Katz, Barrett, Dill, Julian and Kochenderfer CAV17]. Although finding the exact minimum adversarial distortion is hard, giving a certified lower bound of the minimum distortion is possible. Currently available methods for computing such a bound are either time-consuming or deliver low-quality bounds that are too loose to be useful. In this paper, we exploit the special structure of ReLU networks and provide two computationally efficient algorithms, Fast-Lin and Fast-Lip, that are able to certify non-trivial lower bounds of minimum distortions, by bounding the ReLU units with appropriate linear functions (Fast-Lin) or by bounding the local Lipschitz constant (Fast-Lip). Experiments show that (1) our proposed methods deliver bounds close to (the gap is 2-3X) the exact minimum distortion found by Reluplex in small MNIST networks, while our algorithms are more than 10,000 times faster; (2) our methods deliver similar quality of bounds (the gap is within 35% and usually around 10%; sometimes our bounds are even better) for larger networks compared to the methods based on solving linear programming problems, but our algorithms are 33-14,000 times faster; (3) our method is capable of solving large MNIST and CIFAR networks up to 7 layers with more than 10,000 neurons within tens of seconds on a single CPU core.

In addition, we show that, in fact, there is no polynomial time algorithm that can approximately find the minimum $\ell_1$ adversarial distortion of a ReLU network with a $0.99\ln n$ approximation ratio unless $\mathsf{NP} = \mathsf{P}$, where $n$ is the number of neurons in the network.

In this paper, we propose a novel method to estimate and characterize spatial variations on dies or wafers. This new technique exploits recent developments in matrix completion, enabling estimation of spatial variation across wafers or dies with a small number of randomly picked sampling points while still achieving fairly high accuracy.
This new approach can be easily generalized, including for estimation of mixed spatial and structure or device type information.\n ', '\n This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the fact that staircase properties have a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks that can emulate any SQ or PAC algorithms as recently shown.\n ', '\n We consider the problem of learning a tree-structured Ising model from data, such that subsequent predictions computed using the model are accurate. Concretely, we aim to learn a model such that posteriors $P(X_i|X_S)$ for small sets of variables $S$ are accurate. Since its introduction more than 50 years ago, the Chow-Liu algorithm, which efficiently computes the maximum likelihood tree, has been the benchmark algorithm for learning tree-structured graphical models. A bound on the sample complexity of the Chow-Liu algorithm with respect to the prediction-centric local total variation loss was shown in [BK19]. While those results demonstrated that it is possible to learn a useful model even when recovering the true underlying graph is impossible, their bound depends on the maximum strength of interactions and thus does not achieve the information-theoretic optimum. In this paper, we introduce a new algorithm that carefully combines elements of the Chow-Liu algorithm with tree metric reconstruction methods to efficiently and optimally learn tree Ising models under a prediction-centric loss. Our algorithm is robust to model misspecification and adversarial corruptions. In contrast, we show that the celebrated Chow-Liu algorithm can be arbitrarily suboptimal.\n ', '\n Let $Φ$ be a uniformly random $k$-SAT formula with $n$ variables and $m$ clauses. We study the algorithmic task of finding a satisfying assignment of $Φ$. It is known that satisfying assignments exist with high probability up to clause density $m/n = 2^k \\log 2 - \\frac12 (\\log 2 + 1) + o_k(1)$, while the best polynomial-time algorithm known, the Fix algorithm of Coja-Oghlan, finds a satisfying assignment at the much lower clause density $(1 - o_k(1)) 2^k \\log k / k$. This prompts the question: is it possible to efficiently find a satisfying assignment at higher clause densities?\n We prove that the class of low degree polynomial algorithms cannot find a satisfying assignment at clause density $(1 + o_k(1)) κ^* 2^k \\log k / k$ for a universal constant $κ^* \\approx 4.911$. 
This class encompasses Fix, message passing algorithms including Belief and Survey Propagation guided decimation (with bounded or mildly growing number of rounds), and local algorithms on the factor graph. This is the first hardness result for any class of algorithms at clause density within a constant factor of that achieved by Fix. Our proof establishes and leverages a new many-way overlap gap property tailored to random $k$-SAT.\n ', '\n This paper studies the problem of estimating the means $\\pmθ_{*}\\in\\mathbb{R}^{d}$ of a symmetric two-component Gaussian mixture $δ_{*}\\cdot N(θ_{*},I)+(1-δ_{*})\\cdot N(-θ_{*},I)$ where the weights $δ_{*}$ and $1-δ_{*}$ are unequal. Assuming that $δ_{*}$ is known, we show that the population version of the EM algorithm globally converges if the initial estimate has non-negative inner product with the mean of the larger weight component. This can be achieved by the trivial initialization $θ_{0}=0$. For the empirical iteration based on $n$ samples, we show that when initialized at $θ_{0}=0$, the EM algorithm adaptively achieves the minimax error rate $\\tilde{O}\\Big(\\min\\Big\\{\\frac{1}{(1-2δ_{*})}\\sqrt{\\frac{d}{n}},\\frac{1}{\\|θ_{*}\\|}\\sqrt{\\frac{d}{n}},\\left(\\frac{d}{n}\\right)^{1/4}\\Big\\}\\Big)$ in no more than $O\\Big(\\frac{1}{\\|θ_{*}\\|(1-2δ_{*})}\\Big)$ iterations (with high probability). We also consider the EM iteration for estimating the weight $δ_{*}$, assuming a fixed mean $θ$ (which is possibly mismatched to $θ_{*}$). For the empirical iteration of $n$ samples, we show that the minimax error rate $\\tilde{O}\\Big(\\frac{1}{\\|θ_{*}\\|}\\sqrt{\\frac{d}{n}}\\Big)$ is achieved in no more than $O\\Big(\\frac{1}{\\|θ_{*}\\|^{2}}\\Big)$ iterations. These results robustify and complement recent results of Wu and Zhou obtained for the equal weights case $δ_{*}=1/2$.\n ', '\n A recent line of work has studied the relationship between the Wishart matrix $X^\\top X$, where $X\\in \\mathbb{R}^{d\\times n}$ has i.i.d. standard Gaussian entries, and the corresponding Gaussian matrix with independent entries above the diagonal. Jiang and Li (2015) and Bubeck et al. (2016) showed that these two matrix ensembles converge in total variation whenever $d/n^3\\to \\infty$, and Bubeck et al. (2016) showed this to be sharp. In this paper we aim to identify the precise threshold for $d$ in terms of $n$ for subsets of Wishart matrices to converge in total variation to independent Gaussians. It turns out that the combinatorial structure of the revealed entries, viewed as the adjacency matrix of a graph $G$, characterizes the distance from fully independent. Specifically, we show that the threshold for $d$ depends on the number of various small subgraphs in $G$. So, even when the number of revealed entries is fixed, the threshold can vary wildly depending on their configuration. Convergence of masked Wishart to independent Gaussians thus inherently involves an interplay between both probabilistic and combinatorial phenomena. Our results determine the sharp threshold for a large family of $G$, including Erdős-Rényi $G\\sim \\mathcal{G}(n,p)$ at all values $p\\gtrsim n^{-2}\\mathrm{polylog}(n)$. Our proof techniques are both combinatorial and information theoretic, which together allow us to carefully unravel the dependencies in the masked Wishart ensemble.\n ', '\n Researchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. 
A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.\n ', '\n We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $τ_{\\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results establish that in general, optimization with Markovian data is strictly harder than optimization with independent data and a trivial algorithm (SGD-DD) that works with only one in every $\\tildeΘ(τ_{\\mathsf{mix}})$ samples, which are approximately independent, is minimax optimal. In fact, it is strictly better than the popular Stochastic Gradient Descent (SGD) method with constant step-size which is otherwise minimax optimal in the regression with independent data setting.\n Beyond a worst case analysis, we investigate whether structured datasets seen in practice such as Gaussian auto-regressive dynamics can admit more efficient optimization schemes. Surprisingly, even in this specific and natural setting, Stochastic Gradient Descent (SGD) with constant step-size is still no better than SGD-DD. Instead, we propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate. Our improved rate serves as one of the first results where an algorithm outperforms SGD-DD on an interesting Markov chain and also provides one of the first theoretical analyses to support the use of experience replay in practice.\n ', '\n Restricted Boltzmann Machines (RBMs) are a common family of undirected graphical models with latent variables. An RBM is described by a bipartite graph, with all observed variables in one layer and all latent variables in the other. We consider the task of learning an RBM given samples generated according to it. The best algorithms for this task currently have time complexity $\\tilde{O}(n^2)$ for ferromagnetic RBMs (i.e., with attractive potentials) but $\\tilde{O}(n^d)$ for general RBMs, where $n$ is the number of observed variables and $d$ is the maximum degree of a latent variable. Let the MRF neighborhood of an observed variable be its neighborhood in the Markov Random Field of the marginal distribution of the observed variables. In this paper, we give an algorithm for learning general RBMs with time complexity $\\tilde{O}(n^{2^s+1})$, where $s$ is the maximum number of latent variables connected to the MRF neighborhood of an observed variable. This is an improvement when $s < \\log_2 (d-1)$, which corresponds to RBMs with sparse latent variables. 
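For readers who have not met the model discussed in the surrounding abstract, the following sketch shows the two conditional distributions that make an RBM's bipartite structure convenient to work with: given one layer, the units of the other layer are conditionally independent. It assumes binary 0/1 units and is purely illustrative, not code from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs_step(v, W, b, c, rng=None):
    """One block-Gibbs step for a binary RBM with observed units v and latent units h.
    W has shape (num_observed, num_latent); b and c are the observed/latent biases."""
    rng = rng or np.random.default_rng()
    p_h = sigmoid(c + v @ W)              # P(h_j = 1 | v): latent units independent given v
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + h @ W.T)            # P(v_i = 1 | h): observed units independent given h
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```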
Furthermore, we give a version of this learning algorithm that recovers a model with small prediction error and whose sample complexity is independent of the minimum potential in the Markov Random Field of the observed variables. This is of interest because the sample complexity of current algorithms scales with the inverse of the minimum potential, which cannot be controlled in terms of natural properties of the RBM.\n ', '\n We prove sharp dimension-free representation results for neural networks with $D$ ReLU layers under square loss for a class of functions $\\mathcal{G}_D$ defined in the paper. These results capture the precise benefits of depth in the following sense:\n 1. The rates for representing the class of functions $\\mathcal{G}_D$ via $D$ ReLU layers is sharp up to constants, as shown by matching lower bounds.\n 2. For each $D$, $\\mathcal{G}_{D} \\subseteq \\mathcal{G}_{D+1}$ and as $D$ grows the class of functions $\\mathcal{G}_{D}$ contains progressively less smooth functions.\n 3. If $D^{\\prime} < D$, then the approximation rate for the class $\\mathcal{G}_D$ achieved by depth $D^{\\prime}$ networks is strictly worse than that achieved by depth $D$ networks.\n This constitutes a fine-grained characterization of the representation power of feedforward networks of arbitrary depth $D$ and number of neurons $N$, in contrast to existing representation results which either require $D$ growing quickly with $N$ or assume that the function being represented is highly smooth. In the latter case similar rates can be obtained with a single nonlinear layer. Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions, and indeed, the main technical novelty is to fully exploit the fact that deep networks can produce highly oscillatory functions with few activation functions.\n ', '\n Inference problems with conjectured statistical-computational gaps are ubiquitous throughout modern statistics, computer science and statistical physics. While there has been success evidencing these gaps from the failure of restricted classes of algorithms, progress towards a more traditional reduction-based approach to computational complexity in statistical inference has been limited. Existing reductions have largely been limited to inference problems with similar structure -- primarily mapping among problems representable as a sparse submatrix signal plus a noise matrix, which are similar to the common hardness assumption of planted clique.\n The insight in this work is that a slight generalization of the planted clique conjecture -- secret leakage planted clique -- gives rise to a variety of new average-case reduction techniques, yielding a web of reductions among problems with very different structure. Using variants of the planted clique conjecture for specific forms of secret leakage planted clique, we deduce tight statistical-computational tradeoffs for a diverse range of problems including robust sparse mean estimation, mixtures of sparse linear regressions, robust sparse linear regression, tensor PCA, variants of dense $k$-block stochastic block models, negatively correlated sparse PCA, semirandom planted dense subgraph, detection in hidden partition models and a universality principle for learning sparse mixtures. In particular, a $k$-partite hypergraph variant of the planted clique conjecture is sufficient to establish all of our computational lower bounds. 
Our techniques also reveal novel connections to combinatorial designs and to random matrix theory. This work gives the first evidence that an expanded set of hardness assumptions, such as for secret leakage planted clique, may be a key first step towards a more complete theory of reductions among statistical problems.

We develop a corrective mechanism for neural network approximation: the total available non-linear units are divided into multiple groups and the first group approximates the function under consideration, the second group approximates the error in approximation produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together, and so on. This technique yields several new representation and learning results for neural networks. First, we show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels for arbitrary points under a Euclidean distance separation condition using $\tilde{O}(n)$ ReLUs, which is optimal in $n$ up to logarithmic factors. Next, we give a powerful representation result for two-layer neural networks with ReLUs and smoothed ReLUs which can achieve a squared error of at most $ε$ with $O(C(a,d)ε^{-1/(a+1)})$ neurons for $a \in \mathbb{N}\cup\{0\}$ when the function is smooth enough (roughly when it has $Θ(ad)$ bounded derivatives). In certain cases $d$ can be replaced with effective dimension $q \ll d$. Previous results of this type implement Taylor series approximation using deep architectures. We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions. Lastly, we obtain the first $O(\mathrm{subpoly}(1/ε))$ upper bound on the number of neurons required for a two layer network to learn low degree polynomials up to squared error $ε$ via gradient descent. Even though deep networks can express these polynomials with $O(\mathrm{polylog}(1/ε))$ neurons, the best learning bounds on this problem require $\mathrm{poly}(1/ε)$ neurons.

Random graphs with latent geometric structure are popular models of social and biological networks, with applications ranging from network user profiling to circuit design. These graphs are also of purely theoretical interest within computer science, probability and statistics. A fundamental initial question regarding these models is: when are these random graphs affected by their latent geometry and when are they indistinguishable from simpler models without latent structure, such as the Erdős-Rényi graph $\mathcal{G}(n, p)$? We address this question for two of the most well-studied models of random graphs with latent geometry -- the random intersection and random geometric graph.

Our results are as follows: (1) we prove that the random intersection graph converges in total variation to $\mathcal{G}(n, p)$ when $d = \tilde{ω}(n^3)$, and does not if $d = o(n^3)$, resolving an open problem in Fill et al. (2000), Rybarczyk (2011) and Kim et al.
(2018); (2) we provide conditions under which the matrix of intersection sizes of random family of sets converges in total variation to a symmetric matrix with independent Poisson entries, yielding the first total variation convergence result for $τ$-random intersection graphs to $\\mathcal{G}(n, p)$; and (3) we show that the random geometric graph on $\\mathbb{S}^{d - 1}$ with edge density $p$ converges in total variation to $\\mathcal{G}(n, p)$ when $d = \\tildeω\\left(\\min\\{ pn^3, p^2 n^{7/2} \\} \\right)$, yielding the first progress towards a conjecture of Bubeck et al. (2016). The first of these three results was obtained simultaneously and independently by Bubeck, Racz and Richey.\n ', '\n This paper develops several average-case reduction techniques to show new hardness results for three central high-dimensional statistics problems, implying a statistical-computational gap induced by robustness, a detection-recovery gap and a universality principle for these gaps. A main feature of our approach is to map to these problems via a common intermediate problem that we introduce, which we call Imbalanced Sparse Gaussian Mixtures. We assume the planted clique conjecture for a version of the planted clique problem where the position of the planted clique is mildly constrained, and from this obtain the following computational lower bounds: (1) a $k$-to-$k^2$ statistical-computational gap for robust sparse mean estimation, providing the first average-case evidence for a conjecture of Li (2017) and Balakrishnan et al. (2017); (2) a tight lower bound for semirandom planted dense subgraph, which shows that a semirandom adversary shifts the detection threshold in planted dense subgraph to the conjectured recovery threshold; and (3) a universality principle for $k$-to-$k^2$ gaps in a broad class of sparse mixture problems that includes many natural formulations such as the spiked covariance model.\n Our main approach is to introduce several average-case techniques to produce structured and Gaussianized versions of an input graph problem, and then to rotate these high-dimensional Gaussians by matrices carefully constructed from hyperplanes in $\\mathbb{F}_r^t$. For our universality result, we introduce a new method to perform an algorithmic change of measure tailored to sparse mixtures. We also provide evidence that the mild promise in our variant of planted clique does not change the complexity of the problem.\n ', '\n We consider the problem of counting $k$-cliques in $s$-uniform Erdos-Renyi hypergraphs $G(n,c,s)$ with edge density $c$, and show that its fine-grained average-case complexity can be based on its worst-case complexity. We prove the following:\n 1. Dense Erdos-Renyi graphs and hypergraphs: Counting $k$-cliques on $G(n,c,s)$ with $k$ and $c$ constant matches its worst-case time complexity up to a $\\mathrm{polylog}(n)$ factor. Assuming randomized ETH, it takes $n^{Ω(k)}$ time to count $k$-cliques in $G(n,c,s)$ if $k$ and $c$ are constant.\n 2. Sparse Erdos-Renyi graphs and hypergraphs: When $c = Θ(n^{-α})$, we give several algorithms exploiting the sparsity of $G(n, c, s)$ that are faster than the best known worst-case algorithms. Complementing this, based on a fine-grained worst-case assumption, our results imply a different average-case phase diagram for each fixed $α$ depicting a tradeoff between a runtime lower bound and $k$. 
Surprisingly, in the hypergraph case ($s \\ge 3$), these lower bounds are tight against our algorithms exactly when $c$ is above the Erdős-Rényi $k$-clique percolation threshold.\n This is the first worst-case-to-average-case hardness reduction for a problem on Erdős-Rényi hypergraphs that we are aware of. We also give a variant of our result for computing the parity of the $k$-clique count that tolerates higher error probability.\n ', '\n In the past decade, sparse principal component analysis has emerged as an archetypal problem for illustrating statistical-computational tradeoffs. This trend has largely been driven by a line of research aiming to characterize the average-case complexity of sparse PCA through reductions from the planted clique (PC) conjecture - which conjectures that there is no polynomial-time algorithm to detect a planted clique of size $K = o(N^{1/2})$ in $\\mathcal{G}(N, \\frac{1}{2})$. All previous reductions to sparse PCA either fail to show tight computational lower bounds matching existing algorithms or show lower bounds for formulations of sparse PCA other than its canonical generative model, the spiked covariance model. Also, these lower bounds all quickly degrade with the exponent in the PC conjecture. Specifically, when only given the PC conjecture up to $K = o(N^α)$ where $α< 1/2$, there is no sparsity level $k$ at which these lower bounds remain tight. If $α\\le 1/3$ these reductions fail to even show the existence of a statistical-computational tradeoff at any sparsity $k$. We give a reduction from PC that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. We also show the surprising result that weaker forms of the PC conjecture up to clique size $K = o(N^α)$ for any given $α\\in (0, 1/2]$ imply tight computational lower bounds for sparse PCA at sparsities $k = o(n^{α/3})$. This shows that even a mild improvement in the signal strength needed by the best known polynomial-time sparse PCA algorithms would imply that the hardness threshold for PC is subpolynomial. This is the first instance of a suboptimal hardness assumption implying optimal lower bounds for another problem in unsupervised learning.\n ', '\n In the general submatrix detection problem, the task is to detect the presence of a small $k \\times k$ submatrix with entries sampled from a distribution $\\mathcal{P}$ in an $n \\times n$ matrix of samples from $\\mathcal{Q}$. This formulation includes a number of well-studied problems, such as biclustering when $\\mathcal{P}$ and $\\mathcal{Q}$ are Gaussians and the planted dense subgraph formulation of community detection when the submatrix is a principal minor and $\\mathcal{P}$ and $\\mathcal{Q}$ are Bernoulli random variables. These problems all seem to exhibit a universal phenomenon: there is a statistical-computational gap depending on $\\mathcal{P}$ and $\\mathcal{Q}$ between the minimum $k$ at which this task can be solved and the minimum $k$ at which it can be solved in polynomial time. Our main result is to tightly characterize this computational barrier as a tradeoff between $k$ and the KL divergences between $\\mathcal{P}$ and $\\mathcal{Q}$ through average-case reductions from the planted clique conjecture. These computational lower bounds hold given mild assumptions on $\\mathcal{P}$ and $\\mathcal{Q}$ arising naturally from classical binary hypothesis testing. 
Our results recover and generalize the planted clique lower bounds for Gaussian biclustering in Ma-Wu (2015) and Brennan et al. (2018) and for the sparse and general regimes of planted dense subgraph in Hajek et al. (2015) and Brennan et al. (2018). This yields the first universality principle for computational lower bounds obtained through average-case reductions.\n ', '\n Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) have a wide range of applications and have attracted a tremendous amount of attention in the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but an explicit connection between the two had not been made. We show how to efficiently transform a black-box solver for SLR into an algorithm for SPCA: assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, the SPCA algorithm derived from it achieves near state of the art guarantees for testing and for support recovery for the single spiked covariance model as obtained by the current best polynomialtime algorithms. Our reduction not only highlights the inherent similarity between the two problems, but also, from a practical standpoint, allows one to obtain a collection of algorithms for SPCA directly from known algorithms for SLR. We provide experimental results on simulated data comparing our proposed framework to other algorithms for SPCA.\n ', '\n The prototypical high-dimensional statistics problem entails finding a structured signal in noise. Many of these problems exhibit an intriguing phenomenon: the amount of data needed by all known computationally efficient algorithms far exceeds what is needed for inefficient algorithms that search over all possible structures. A line of work initiated by Berthet and Rigollet in 2013 has aimed to explain these statistical-computational gaps by reducing from conjecturally hard average-case problems in computer science. However, the delicate nature of average-case reductions has limited the applicability of this approach. In this work we introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture using natural problems as intermediates. These include tight lower bounds for Planted Independent Set, Planted Dense Subgraph, Sparse Spiked Wigner, Sparse PCA, a subgraph variant of the Stochastic Block Model and a biased variant of Sparse PCA. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we consider.\n ', '\n Graphical models are a rich language for describing high-dimensional distributions in terms of their dependence structure. While there are algorithms with provable guarantees for learning undirected graphical models in a variety of settings, there has been much less progress in the important scenario when there are latent variables. Here we study Restricted Boltzmann Machines (or RBMs), which are a popular model with wide-ranging applications in dimensionality reduction, collaborative filtering, topic modeling, feature extraction and deep learning.\n The main message of our paper is a strong dichotomy in the feasibility of learning RBMs, depending on the nature of the interactions between variables: ferromagnetic models can be learned efficiently, while general models cannot. 
In particular, we give a simple greedy algorithm based on influence maximization to learn ferromagnetic RBMs with bounded degree. In fact, we learn a description of the distribution on the observed variables as a Markov Random Field. Our analysis is based on tools from mathematical physics that were developed to show the concavity of magnetization. Our algorithm extends straightforwardly to general ferromagnetic Ising models with latent variables.

Conversely, we show that even for a constant number of latent variables with constant degree, without ferromagneticity the problem is as hard as sparse parity with noise. This hardness result is based on a sharp and surprising characterization of the representational power of bounded degree RBMs: the distribution on their observed variables can simulate any bounded order MRF. This result is of independent interest since RBMs are the building blocks of deep belief networks.

Most information storage devices write data by modifying the local state of matter, in the hope that sub-atomic local interactions stabilize the state for a sufficiently long time, thereby allowing later recovery. Motivated to explore how temporal evolution of physical states in magnetic storage media affects their capacity, this work initiates the study of information retention in locally-interacting particle systems. The system dynamics follow the stochastic Ising model (SIM) over a 2-dimensional $\sqrt{n}\times\sqrt{n}$ grid. The initial spin configuration $X_0$ serves as the user-controlled input. The output configuration $X_t$ is produced by running $t$ steps of Glauber dynamics. Our main goal is to evaluate the information capacity $I_n(t):=\max_{p_{X_0}}I(X_0;X_t)$ when time $t$ scales with the system's size $n$. While the positive (but low) temperature regime is our main interest, we start by exploring the simpler zero-temperature dynamics.

We first show that at zero temperature, order of $\sqrt{n}$ bits can be stored in the system indefinitely by coding over stable, striped configurations. While $\sqrt{n}$ is order optimal for infinite time, backing off to $t<\infty$, higher orders of $I_n(t)$ are achievable. First, linear coding arguments imply that $I_n(t) = Θ(n)$ for $t=O(n)$. To go beyond the linear scale, we develop a droplet-based achievability scheme that reliably stores $Ω\left(n/\log n\right)$ bits for $t=O(n\log n)$ time ($\log n$ can be replaced with any $o(n)$ function). Moving to the positive but low temperature regime, two main results are provided. First, we show that an initial configuration drawn from the Gibbs measure cannot retain more than a single bit for $t\geq \exp(Cβn^{1/4+ε})$ time. On the other hand, when scaling time with the inverse temperature $β$, the stripe-based coding scheme is shown to retain its bits for $e^{cβ}$ time.

We study the problem of testing, using only a single sample, between mean field distributions (like Curie-Weiss, Erdős-Rényi) and structured Gibbs distributions (like Ising models on sparse graphs and Exponential Random Graphs). Our goal is to test without knowing the parameter values of the underlying models: only the structure of dependencies is known. We develop a new approach that applies to both the Ising and Exponential Random Graph settings based on a general and natural statistical test.
The test can distinguish the hypotheses with high probability above a certain threshold in the (inverse) temperature parameter, and is optimal in that below the threshold no test can distinguish the hypotheses.\n The thresholds do not correspond to the presence of long-range order in the models. By aggregating information at a global scale, our test works even at very high temperatures.\n The proofs are based on distributional approximation and sharp concentration of quadratic forms, when restricted to Hamming spheres. The restriction to Hamming spheres is necessary, since otherwise any scalar statistic is useless without explicit knowledge of the temperature parameter. At the same time, this restriction radically changes the behavior of the functions under consideration, resulting in a much smaller variance than in the independent setting; this makes it hard to directly apply standard methods (i.e., Stein's method) for concentration of weakly dependent variables. Instead, we carry out an additional tensorization argument using a Markov chain that respects the symmetry of the Hamming sphere.\n ", "\n We develop a new technique, based on Stein's method, for comparing two stationary distributions of irreducible Markov Chains whose update rules are `close enough'. We apply this technique to compare Ising models on $d$-regular expander graphs to the Curie-Weiss model (complete graph) in terms of pairwise correlations and more generally $k$th order moments. Concretely, we show that $d$-regular Ramanujan graphs approximate the $k$th order moments of the Curie-Weiss model to within average error $k/\\sqrt{d}$ (averaged over the size $k$ subsets). The result applies even in the low-temperature regime; we also derive some simpler approximation results for functionals of Ising models that hold only at high enough temperatures.\n ", "\n We consider an online model for recommendation systems, with each user being recommended an item at each time-step and providing 'like' or 'dislike' feedback. Each user may be recommended a given item at most once. A latent variable model specifies the user preferences: both users and items are clustered into types. All users of a given type have identical preferences for the items, and similarly, items of a given type are either all liked or all disliked by a given user. We assume that the matrix encoding the preferences of each user type for each item type is randomly generated; in this way, the model captures structure in both the item and user spaces, the amount of structure depending on the number of each of the types. The measure of performance of the recommendation system is the expected number of disliked recommendations per user, defined as expected regret. We propose two algorithms inspired by user-user and item-item collaborative filtering (CF), modified to explicitly make exploratory recommendations, and prove performance guarantees in terms of their expected regret. For two regimes of model parameters, with structure only in item space or only in user space, we prove information-theoretic lower bounds on regret that match our upper bounds up to logarithmic factors. Our analysis elucidates system operating regimes in which existing CF algorithms are nearly optimal.\n ", '\n We study the problem of learning a tree Ising model from samples such that subsequent predictions made using the model are accurate. The prediction task considered in this paper is that of predicting the values of a subset of variables given values of some other subset of variables. 
Virtually all previous work on graphical model learning has focused on recovering the true underlying graph. We define a distance ("small set TV" or ssTV) between distributions $P$ and $Q$ by taking the maximum, over all subsets $\\mathcal{S}$ of a given size, of the total variation between the marginals of $P$ and $Q$ on $\\mathcal{S}$; this distance captures the accuracy of the prediction task of interest. We derive non-asymptotic bounds on the number of samples needed to get a distribution (from the same class) with small ssTV relative to the one generating the samples. One of the main messages of this paper is that far fewer samples are needed than for recovering the underlying tree, which means that accurate predictions are possible using the wrong tree.\n ', '\n There is much empirical evidence that item-item collaborative filtering works well in practice. Motivated to understand this, we provide a framework to design and analyze various recommendation algorithms. The setup amounts to online binary matrix completion, where at each time a random user requests a recommendation and the algorithm chooses an entry to reveal in the user\'s row. The goal is to minimize regret, or equivalently to maximize the number of +1 entries revealed at any time. We analyze an item-item collaborative filtering algorithm that can achieve fundamentally better performance compared to user-user collaborative filtering. The algorithm achieves good "cold-start" performance (appropriately defined) by quickly making good recommendations to new users about whom there is little information.\n ', '\n In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. We first observe that the notoriously difficult problem of learning parities with noise can be captured as a special case of learning graphical models. This leads to an unconditional computational lower bound of $Ω(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of so-called statistical algorithms recently introduced by Feldman et al (2013). The lower bound suggests that the $O(p^d)$ runtime required to exhaustively search over neighborhoods cannot be significantly improved without restricting the class of models.\n Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., many recent papers on structure learning assume that the model has the correlation decay property. Indeed, focusing on ferromagnetic Ising models, Bento and Montanari (2009) showed that all known low-complexity algorithms fail to learn simple graphs when the interaction strength exceeds a number related to the correlation decay threshold. Our second set of results gives a class of repelling (antiferromagnetic) models that have the opposite behavior: very strong interaction allows efficient learning in time $O(p^2)$. We provide an algorithm whose performance interpolates between $O(p^2)$ and $O(p^{d+2})$ depending on the strength of the repulsion.\n ', '\n Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the "online" setting, where items are recommended to users over time. 
We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. We assume there to be $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly $\\log(km)$ initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing $k$. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).\n ', '\n We consider the problem of reconstructing the graph underlying an Ising model from i.i.d. samples. Over the last fifteen years this problem has been of significant interest in the statistics, machine learning, and statistical physics communities, and much of the effort has been directed towards finding algorithms with low computational cost for various restricted classes of models. Nevertheless, for learning Ising models on general graphs with $p$ nodes of degree at most $d$, it is not known whether or not it is possible to improve upon the $p^{d}$ computation needed to exhaustively search over all possible neighborhoods for each node.\n In this paper we show that a simple greedy procedure allows to learn the structure of an Ising model on an arbitrary bounded-degree graph in time on the order of $p^2$. We make no assumptions on the parameters except what is necessary for identifiability of the model, and in particular the results hold at low-temperatures as well as for highly non-uniform models. The proof rests on a new structural property of Ising models: we show that for any node there exists at least one neighbor with which it has a high mutual information. This structural property may be of independent interest.\n ', '\n In this paper we consider the problem of learning undirected graphical models from data generated according to the Glauber dynamics. The Glauber dynamics is a Markov chain that sequentially updates individual nodes (variables) in a graphical model and it is frequently used to sample from the stationary distribution (to which it converges given sufficient time). Additionally, the Glauber dynamics is a natural dynamical model in a variety of settings. This work deviates from the standard formulation of graphical model learning in the literature, where one assumes access to i.i.d. samples from the distribution.\n Much of the research on graphical model learning has been directed towards finding algorithms with low computational cost. As the main result of this work, we establish that the problem of reconstructing binary pairwise graphical models is computationally tractable when we observe the Glauber dynamics. 
Specifically, we show that a binary pairwise graphical model on $p$ nodes with maximum degree $d$ can be learned in time $f(d)p^2\\log p$, for a function $f(d)$, using nearly the information-theoretic minimum number of samples.\n ', '\n We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan (2008)) but no proof was known.\n Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).\n ', '\n We study vector space interference alignment for the MIMO interference channel with no time or frequency diversity, and no symbol extensions. We prove both necessary and sufficient conditions for alignment. In particular, we characterize the feasibility of alignment for the symmetric three-user channel where all users transmit along d dimensions, all transmitters have M antennas and all receivers have N antennas, as well as feasibility of alignment for the fully symmetric (M=N) channel with an arbitrary number of users.\n An implication of our results is that the total degrees of freedom available in a K-user interference channel, using only spatial diversity from the multiple antennas, is at most 2. This is in sharp contrast to the K/2 degrees of freedom shown to be possible by Cadambe and Jafar with arbitrarily large time or frequency diversity.\n Moving beyond the question of feasibility, we additionally discuss computation of the number of solutions using Schubert calculus in cases where there are a finite number of solutions.\n ', "\n We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Brujin graph based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of sequenced genomes, including the GAGE datasets. The results are based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun sequencing analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by Hybridization.\n ", '\n DNA sequencing is the basic workhorse of modern day biology and medicine. 
Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number provides a fundamental limit to the performance of {\\em any} assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer admits a critical phenomena in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA sequence is sufficient to reconstruct. The threshold is computed in terms of the Renyi entropy rate of the DNA sequence. We also study the impact of noise in the read process on the performance.\n ', '\n This paper studies vector space interference alignment for the three-user MIMO interference channel with no time or frequency diversity. The main result is a characterization of the feasibility of interference alignment in the symmetric case where all transmitters have M antennas and all receivers have N antennas. If N >= M and all users desire d transmit dimensions, then alignment is feasible if and only if (2r+1)d <= max(rN,(r+1)M) for all nonnegative integers r. The analogous result holds with M and N switched if M >= N.\n It turns out that, just as for the 3-user parallel interference channel \\cite{BT09}, the length of alignment paths captures the essence of the problem. In fact, for each feasible value of M and N the maximum alignment path length dictates both the converse and achievability arguments.\n One of the implications of our feasibility criterion is that simply counting equations and comparing to the number of variables does not predict feasibility. Instead, a more careful investigation of the geometry of the alignment problem is required. The necessary condition obtained by counting equations is implied by our new feasibility criterion.\n ', '\n Determining the feasibility conditions for vector space interference alignment in the K-user MIMO interference channel with constant channel coefficients has attracted much recent attention yet remains unsolved. The main result of this paper is restricted to the symmetric square case where all transmitters and receivers have N antennas, and each user desires d transmit dimensions. We prove that alignment is possible if and only if the number of antennas satisfies N>= d(K+1)/2. We also show a necessary condition for feasibility of alignment with arbitrary system parameters. An algebraic geometry approach is central to the results.\n ', '\n Exponential random graphs are used extensively in the sociology literature. This model seeks to incorporate in random graphs the notion of reciprocity, that is, the larger than expected number of triangles and other small subgraphs. Sampling from these distributions is crucial for parameter estimation hypothesis testing, and more generally for understanding basic features of the network model itself. In practice sampling is typically carried out using Markov chain Monte Carlo, in particular either the Glauber dynamics or the Metropolis-Hasting procedure.\n In this paper we characterize the high and low temperature regimes of the exponential random graph model. 
We establish that in the high temperature regime the mixing time of the Glauber dynamics is $Θ(n^2 \\log n)$, where $n$ is the number of vertices in the graph; in contrast, we show that in the low temperature regime the mixing is exponentially slow for any local Markov chain. Our results, moreover, give a rigorous basis for criticisms made of such models. In the high temperature regime, where sampling with MCMC is possible, we show that any finite collection of edges are asymptotically independent; thus, the model does not possess the desired reciprocity property, and is not appreciably different from the Erdős-Rényi random graph.\n ', '\n Recently, Etkin, Tse, and Wang found the capacity region of the two-user Gaussian interference channel to within one bit/s/Hz. A natural goal is to apply this approach to the Gaussian interference channel with an arbitrary number of users. We make progress towards this goal by finding the capacity region of the many-to-one and one-to-many Gaussian interference channels to within a constant number of bits. The result makes use of a deterministic model to provide insight into the Gaussian channel. The deterministic model makes explicit the dimension of signal scale. A central theme emerges: the use of lattice codes for alignment of interfering signals on the signal scale.\n ', '\n This paper explores the two-user Gaussian interference channel through the lens of a natural deterministic channel model. The main result is that the deterministic channel uniformly approximates the Gaussian channel, the capacity regions differing by a universal constant. The problem of finding the capacity of the Gaussian channel to within a constant error is therefore reduced to that of finding the capacity of the far simpler deterministic channel. Thus, the paper provides an alternative derivation of the recent constant gap capacity characterization of Etkin, Tse, and Wang. Additionally, the deterministic model gives significant insight towards the Gaussian channel.\n ', '\n Markov random fields are used to model high dimensional distributions in a number of applied areas. Much recent interest has been devoted to the reconstruction of the dependency structure from independent samples from the Markov random fields. We analyze a simple algorithm for reconstructing the underlying graph defining a Markov random field on $n$ nodes and maximum degree $d$ given observations. We show that under mild non-degeneracy conditions it reconstructs the generating graph with high probability using $Θ(d ε^{-2}δ^{-4} \\log n)$ samples where $ε,δ$ depend on the local interactions. For most local interaction $\\eps,δ$ are of order $\\exp(-O(d))$.\n Our results are optimal as a function of $n$ up to a multiplicative constant depending on $d$ and the strength of the local interactions. Our results seem to be the first results for general models that guarantee that {\\em the} generating model is reconstructed. Furthermore, we provide explicit $O(n^{d+2} ε^{-2}δ^{-4} \\log n)$ running time bound. In cases where the measure on the graph has correlation decay, the running time is $O(n^2 \\log n)$ for all fixed $d$. We also discuss the effect of observing noisy samples and show that as long as the noise level is low, our algorithm is effective. On the other hand, we construct an example where large noise implies non-identifiability even for generic noise and interactions. 
Finally, we briefly show that in some simple cases, models with hidden nodes can also be recovered.\n ', '\n Models like LASSO and ridge regression are extensively used in practice due to their interpretability, ease of use, and strong theoretical guarantees. Cross-validation (CV) is widely used for hyperparameter tuning in these models, but do practical optimization methods minimize the true out-of-sample loss? A recent line of research promises to show that the optimum of the CV loss matches the optimum of the out-of-sample loss (possibly after simple corrections). It remains to show how tractable it is to minimize the CV loss. In the present paper, we show that, in the case of ridge regression, the CV loss may fail to be quasiconvex and thus may have multiple local optima. We can guarantee that the CV loss is quasiconvex in at least one case: when the spectrum of the covariate matrix is nearly flat and the noise in the observed responses is not too high. More generally, we show that quasiconvexity status is independent of many properties of the observed data (response norm, covariate-matrix right singular vectors and singular-value scaling) and has a complex dependence on the few that remain. We empirically confirm our theory using simulated experiments.\n ', '\n Hierarchical Bayesian methods enable information sharing across multiple related regression problems. While standard practice is to model regression parameters (effects) as (1) exchangeable across datasets and (2) correlated to differing degrees across covariates, we show that this approach exhibits poor statistical performance when the number of covariates exceeds the number of datasets. For instance, in statistical genetics, we might regress dozens of traits (defining datasets) for thousands of individuals (responses) on up to millions of genetic variants (covariates). When an analyst has more covariates than datasets, we argue that it is often more natural to instead model effects as (1) exchangeable across covariates and (2) correlated to differing degrees across datasets. To this end, we propose a hierarchical model expressing our alternative perspective. We devise an empirical Bayes estimator for learning the degree of correlation between datasets. We develop theory that demonstrates that our method outperforms the classic approach when the number of covariates dominates the number of datasets, and corroborate this result empirically on several high-dimensional multiple regression and classification problems.\n ', '\n Bayesian models based on the Dirichlet process and other stick-breaking priors have been proposed as core ingredients for clustering, topic modeling, and other unsupervised learning tasks. Prior specification is, however, relatively difficult for such models, given that their flexibility implies that the consequences of prior choices are often relatively opaque. Moreover, these choices can have a substantial effect on posterior inferences. Thus, considerations of robustness need to go hand in hand with nonparametric modeling. In the current paper, we tackle this challenge by exploiting the fact that variational Bayesian methods, in addition to having computational advantages in fitting complex nonparametric models, also yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models. 
In particular, we demonstrate how to assess the sensitivity of conclusions to the choice of concentration parameter and stick-breaking distribution for inferences under Dirichlet process mixtures and related mixture models. We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.\n ', "\n There is a growing interest in the estimation of the number of unseen features, mostly driven by applications in biological sciences. A recent work brought out the upside and the downside of the popular stable-Beta process prior, and generalizations thereof, in Bayesian nonparametric inference for the unseen-features problem: i) the downside lies in the limited use of the sampling information in the posterior distributions, which depend on the observable sample only through the sample size; ii) the upside lies in the analytical tractability and interpretability of the posterior distributions, which are simple Poisson distributions whose parameters are simple to compute, and depend on the sample size and the prior's parameter. In this paper, we introduce and investigate an alternative nonparametric prior, referred to as the stable-Beta scaled process prior, which is the first prior that allows to enrich the posterior distribution of the number of unseen features, through the inclusion of the sampling information on the number of distinct features in the observable sample, while maintaining the same analytical tractability and interpretability as the stable-Beta process prior. Our prior leads to a negative Binomial posterior distribution, whose parameters depends on the sample size, the observed number of distinct features and the prior's parameter, providing estimates that are simple, linear in the sampling information and computationally efficient. We apply our approach to synthetic and real genetic data, showing that it outperforms parametric and nonparametric competitors in terms of estimation accuracy.\n ", '\n Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. Unfortunately, methods that simultaneously express sparsity, nonlinearity, and interactions are computationally intractable -- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We show that suitable interaction models have a kernel representation, namely there exists a "kernel trick" to perform variable selection and estimation in $O$(# covariates) time. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, high-dimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.\n ', '\n Gaussian processes (GPs) are used to make medical and scientific decisions, including in cardiac care and monitoring of carbon dioxide emissions. But the choice of GP kernel is often somewhat arbitrary. In particular, uncountably many kernels typically align with qualitative prior knowledge (e.g. function smoothness or stationarity). 
But in practice, data analysts choose among a handful of convenient standard kernels (e.g. squared exponential). In the present work, we ask: Would decisions made with a GP differ under other, qualitatively interchangeable kernels? We show how to formulate this sensitivity analysis as a constrained optimization problem over a finite-dimensional space. We can then use standard optimizers to identify substantive changes in relevant decisions made with a GP. We demonstrate in both synthetic and real-world examples that decisions made with a GP can exhibit substantial sensitivity to kernel choice, even when prior draws are qualitatively interchangeable to a user.\n ', '\n Computational couplings of Markov chains provide a practical route to unbiased Monte Carlo estimation that can utilize parallel computation. However, these approaches depend crucially on chains meeting after a small number of transitions. For models that assign data into groups, e.g. mixture models, the obvious approaches to couple Gibbs samplers fail to meet quickly. This failure owes to the so-called "label-switching" problem; semantically equivalent relabelings of the groups contribute well-separated posterior modes that impede fast mixing and cause large meeting times. We here demonstrate how to avoid label switching by considering chains as exploring the space of partitions rather than labelings. Using a metric on this space, we employ an optimal transport coupling of the Gibbs conditionals. This coupling outperforms alternative couplings that rely on labelings and, on a real dataset, provides estimates more precise than usual ergodic averages in the limited time regime. Code is available at github.com/tinnguyen96/coupling-Gibbs-partition.\n ', '\n Modern statistics provides an ever-expanding toolkit for estimating unknown parameters. Consequently, applied statisticians frequently face a difficult decision: retain a parameter estimate from a familiar method or replace it with an estimate from a newer or complex one. While it is traditional to compare estimators using risk, such comparisons are rarely conclusive in realistic settings. In response, we propose the "c-value" as a measure of confidence that a new estimate achieves smaller loss than an old estimate on a given dataset. We show that it is unlikely that a computed c-value is large and that the new estimate has larger loss than the old. Therefore, just as a small p-value provides evidence to reject a null hypothesis, a large c-value provides evidence to use a new estimate in place of the old. For a wide class of problems and estimators, we show how to compute a c-value by first constructing a data-dependent high-probability lower bound on the difference in loss. The c-value is frequentist in nature, but we show that it can provide a validation of Bayesian estimates in real data applications involving hierarchical models and Gaussian processes.\n ', '\n We propose a method to assess the sensitivity of econometric analyses to the removal of a small fraction of the data. Manually checking the influence of all possible small subsets is computationally infeasible, so we provide an approximation to find the most influential subset. Our metric, the "Approximate Maximum Influence Perturbation," is automatically computable for common methods including (but not limited to) OLS, IV, MLE, GMM, and variational Bayes. We provide finite-sample error bounds on approximation performance. At minimal extra cost, we provide an exact finite-sample lower bound on sensitivity. 
We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, is not reflected in standard errors, does not disappear asymptotically, and is not due to misspecification. While some empirical applications are robust, results of several economics papers can be overturned by removing less than 1% of the sample.\n ', '\n Bayesian nonparametric priors based on completely random measures (CRMs) offer a flexible modeling approach when the number of latent components in a dataset is unknown. However, managing the infinite dimensionality of CRMs typically requires practitioners to derive ad-hoc algorithms, preventing the use of general-purpose inference methods and often leading to long compute times. We propose a general but explicit recipe to construct a simple finite-dimensional approximation that can replace the infinite-dimensional CRMs. Our independent finite approximation (IFA) is a generalization of important cases that are used in practice. The independence of atom weights in our approximation (i) makes the construction well-suited for parallel and distributed computation and (ii) facilitates more convenient inference schemes. We quantify the approximation error between IFAs and the target nonparametric prior. We compare IFAs with an alternative approximation scheme -- truncated finite approximations (TFAs), where the atom weights are constructed sequentially. We prove that, for worst-case choices of observation likelihoods, TFAs are a more efficient approximation than IFAs. However, in real-data experiments with image denoising and topic modeling, we find that IFAs perform very similarly to TFAs in terms of task-specific accuracy metrics.\n ', "\n Many recent advances in machine learning are driven by a challenging trifecta: large data size $N$; high dimensions; and expensive algorithms. In this setting, cross-validation (CV) serves as an important tool for model assessment. Recent advances in approximate cross validation (ACV) provide accurate approximations to CV with only a single model fit, avoiding traditional CV's requirement for repeated runs of expensive algorithms. Unfortunately, these ACV methods can lose both speed and accuracy in high dimensions -- unless sparsity structure is present in the data. Fortunately, there is an alternative type of simplifying structure that is present in most data: approximate low rank (ALR). Guided by this observation, we develop a new algorithm for ACV that is fast and accurate in the presence of ALR data. Our first key insight is that the Hessian matrix -- whose inverse forms the computational bottleneck of existing ACV methods -- is ALR. We show that, despite our use of the \\emph{inverse} Hessian, a low-rank approximation using the largest (rather than the smallest) matrix eigenvalues enables fast, reliable ACV. Our second key insight is that, in the presence of ALR data, error in existing ACV methods roughly grows with the (approximate, low) rank rather than with the (full, high) dimension. These insights allow us to prove theoretical guarantees on the quality of our proposed algorithm -- along with fast-to-compute upper bounds on its error. We demonstrate the speed and accuracy of our method, as well as the usefulness of our bounds, on a range of real and simulated data sets.\n ", '\n Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. A common suggestion is to use a finite mixture model (FMM) with a prior on the number of components. 
Past work has shown the resulting FMM component-count posterior is consistent; that is, the posterior concentrates on the true, generating number of components. But consistency requires the assumption that the component likelihoods are perfectly specified, which is unrealistic in practice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM component-count posterior diverges: the posterior probability of any particular finite number of components converges to 0 in the limit of infinite data. Contrary to intuition, posterior-density consistency is not sufficient to establish this result. We develop novel sufficient conditions that are more realistic and easily checkable than those common in the asymptotics literature. We illustrate practical consequences of our theory on simulated and real data.\n ', '\n Many modern data analyses benefit from explicitly modeling dependence structure in data -- such as measurements across time or space, ordered words in a sentence, or genes in a genome. A gold standard evaluation technique is structured cross-validation (CV), which leaves out some data subset (such as data within a time interval or data in a geographic region) in each fold. But CV here can be prohibitively slow due to the need to re-run already-expensive learning algorithms many times. Previous work has shown approximate cross-validation (ACV) methods provide a fast and provably accurate alternative in the setting of empirical risk minimization. But this existing ACV work is restricted to simpler models by the assumptions that (i) data across CV folds are independent and (ii) an exact initial model fit is available. In structured data analyses, both these assumptions are often untrue. In the present work, we address (i) by extending ACV to CV schemes with dependence structure between the folds. To address (ii), we verify -- both theoretically and empirically -- that ACV quality deteriorates smoothly with noise in the initial fit. We demonstrate the accuracy and computational benefits of our proposed methods on a diverse set of real-world applications.\n ', '\n While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, then, scientists face a natural trade-off between quantity and quality; they can spend resources to sequence a greater number of genomes (quantity) or spend resources to sequence genomes with increased accuracy (quality). Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible, and thus as many new scientific insights as possible. In this paper, we consider the common setting where scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. We introduce a Bayesian nonparametric methodology to predict the number of new variants in the follow-up study based on the pilot study. When experimental conditions are kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than three recent proposals, and competitive with a more classic proposal. Unlike existing methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. 
We demonstrate how this distinction allows our method to be used for (i) more realistic predictions and (ii) optimal allocation of a fixed budget between quality and quantity.\n ', '\n Variational inference has become an increasingly attractive fast alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, a major obstacle to the widespread use of variational methods is the lack of post-hoc accuracy measures that are both theoretically justified and computationally efficient. In this paper, we provide rigorous bounds on the error of posterior mean and uncertainty estimates that arise from full-distribution approximations, as in variational inference. Our bounds are widely applicable, as they require only that the approximating and exact posteriors have polynomial moments. Our bounds are also computationally efficient for variational inference because they require only standard values from variational objectives, straightforward analytic calculations, and simple Monte Carlo estimates. We show that our analysis naturally leads to a new and improved workflow for validated variational inference. Finally, we demonstrate the utility of our proposed workflow and error bounds on a robust regression problem and on a real-data example with a widely used multilevel hierarchical model.\n ', '\n Cross validation (CV) and the bootstrap are ubiquitous model-agnostic tools for assessing the error or variability of machine learning and statistical estimators. However, these methods require repeatedly re-fitting the model with different weighted versions of the original dataset, which can be prohibitively time-consuming. For sufficiently regular optimization problems the optimum depends smoothly on the data weights, and so the process of repeatedly re-fitting can be approximated with a Taylor series that can be often evaluated relatively quickly. The first-order approximation is known as the "infinitesimal jackknife" in the statistics literature and has been the subject of recent interest in machine learning for approximate CV. In this work, we consider high-order approximations, which we call the "higher-order infinitesimal jackknife" (HOIJ). Under mild regularity conditions, we provide a simple recursive procedure to compute approximations of all orders with finite-sample accuracy bounds. Additionally, we show that the HOIJ can be efficiently computed even in high dimensions using forward-mode automatic differentiation. We show that a linear approximation with bootstrap weights approximation is equivalent to those provided by asymptotic normal approximations. Consequently, the HOIJ opens up the possibility of enjoying higher-order accuracy properties of the bootstrap using local approximations. Consistency of the HOIJ for leave-one-out CV under different asymptotic regimes follows as corollaries from our finite-sample bounds under additional regularity assumptions. The generality of the computation and bounds motivate the name "higher-order Swiss Army infinitesimal jackknife."\n ', "\n Exchangeability -- in which the distribution of an infinite sequence is invariant to reorderings of its elements -- implies the existence of a simple conditional independence structure that may be leveraged in the design of probabilistic models, efficient inference algorithms, and randomization-based testing procedures. 
In practice, however, this assumption is too strong an idealization; the distribution typically fails to be exactly invariant to permutations and de Finetti's representation theory does not apply. Thus there is the need for a distributional assumption that is both weak enough to hold in practice, and strong enough to guarantee a useful underlying representation. We introduce a relaxed notion of local exchangeability -- where swapping data associated with nearby covariates causes a bounded change in the distribution. We prove that locally exchangeable processes correspond to independent observations from an underlying measure-valued stochastic process. We thereby show that de Finetti's theorem is robust to perturbation and provide further justification for the Bayesian modelling approach. Using this probabilistic result, we develop three novel statistical procedures for (1) estimating the underlying process via local empirical measures, (2) testing via local randomization, and (3) estimating the canonical premetric of local exchangeability. These three procedures extend the applicability of previous exchangeability-based methods without sacrificing rigorous statistical guarantees. The paper concludes with examples of popular statistical models that exhibit local exchangeability.\n ", "\n Leave-one-out cross-validation (LOOCV) can be particularly accurate among cross-validation (CV) variants for machine learning assessment tasks -- e.g., assessing methods' error or variability. But it is expensive to re-fit a model $N$ times for a dataset of size $N$. Previous work has shown that approximations to LOOCV can be both fast and accurate -- when the unknown parameter is of small, fixed dimension. But these approximations incur a running time roughly cubic in dimension -- and we show that, besides computational issues, their accuracy dramatically deteriorates in high dimensions. Authors have suggested many potential and seemingly intuitive solutions, but these methods have not yet been systematically evaluated or compared. We find that all but one perform so poorly as to be unusable for approximating LOOCV. Crucially, though, we are able to show, both empirically and theoretically, that one approximation can perform well in high dimensions -- in cases where the high-dimensional parameter exhibits sparsity. Under interpretable assumptions, our theory demonstrates that the problem can be reduced to working within an empirically recovered (small) support. This procedure is straightforward to implement, and we prove that its running time and error depend on the (small) support size even when the full parameter dimension is large.\n ", '\n Due to the ease of modern data collection, applied statisticians often have access to a large set of covariates that they wish to relate to some observed outcome. Generalized linear models (GLMs) offer a particularly interpretable framework for such an analysis. In these high-dimensional problems, the number of covariates is often large relative to the number of observations, so we face non-trivial inferential uncertainty; a Bayesian approach allows coherent quantification of this uncertainty. Unfortunately, existing methods for Bayesian inference in GLMs require running times roughly cubic in parameter dimension, and so are limited to settings with at most tens of thousand parameters. We propose to reduce time and memory costs with a low-rank approximation of the data in an approach we call LR-GLM. 
When used with the Laplace approximation or Markov chain Monte Carlo, LR-GLM provides a full Bayesian posterior approximation and admits running times reduced by a full factor of the parameter dimension. We rigorously establish the quality of our approximation and show how the choice of rank allows a tunable computational-statistical trade-off. Experiments support our theory and demonstrate the efficacy of LR-GLM on real large-scale datasets.\n ', '\n Discovering interaction effects on a response of interest is a fundamental problem faced in biology, medicine, economics, and many other scientific disciplines. In theory, Bayesian methods for discovering pairwise interactions enjoy many benefits such as coherent uncertainty quantification, the ability to incorporate background knowledge, and desirable shrinkage properties. In practice, however, Bayesian methods are often computationally intractable for even moderate-dimensional problems. Our key insight is that many hierarchical models of practical interest admit a particular Gaussian process (GP) representation; the GP allows us to capture the posterior with a vector of O(p) kernel hyper-parameters rather than O(p^2) interactions and main effects. With the implicit representation, we can run Markov chain Monte Carlo (MCMC) over model hyper-parameters in time and memory linear in p per iteration. We focus on sparsity-inducing models and show on datasets with a variety of covariate behaviors that our method: (1) reduces runtime by orders of magnitude over naive applications of MCMC, (2) provides lower Type I and Type II error relative to state-of-the-art LASSO-based approaches, and (3) offers improved computational scaling in high dimensions relative to existing Bayesian and LASSO-based approaches.\n ', '\n Until recently, transcriptomics was limited to bulk RNA sequencing, obscuring the underlying expression patterns of individual cells in favor of a global average. Thanks to technological advances, we can now profile gene expression across thousands or millions of individual cells in parallel. This new type of data has led to the intriguing discovery that individual cell profiles can reflect the imprint of time or dynamic processes. However, synthesizing this information to reconstruct dynamic biological phenomena from data that are noisy, heterogenous, and sparse---and from processes that may unfold asynchronously---poses a complex computational and statistical challenge. Here, we develop a full generative model for probabilistically reconstructing trees of cellular differentiation from single-cell RNA-seq data. Specifically, we extend the framework of the classical Dirichlet diffusion tree to simultaneously infer branch topology and latent cell states along continuous trajectories over the full tree. In tandem, we construct a novel Markov chain Monte Carlo sampler that interleaves Metropolis-Hastings and message passing to leverage model structure for efficient inference. Finally, we demonstrate that these techniques can recover latent trajectories from simulated single-cell transcriptomes. While this work is motivated by cellular differentiation, we derive a tractable model that provides flexible densities for any data (coupled with an appropriate noise model) that arise from continuous evolution along a latent nonparametric tree.\n ', '\n Bayesian models based on the Dirichlet process and other stick-breaking priors have been proposed as core ingredients for clustering, topic modeling, and other unsupervised learning tasks. 
Prior specification is, however, relatively difficult for such models, given that their flexibility implies that the consequences of prior choices are often relatively opaque. Moreover, these choices can have a substantial effect on posterior inferences. Thus, considerations of robustness need to go hand in hand with nonparametric modeling. In the current paper, we tackle this challenge by exploiting the fact that variational Bayesian methods, in addition to having computational advantages in fitting complex nonparametric models, also yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models. In particular, we demonstrate how to assess the sensitivity of conclusions to the choice of concentration parameter and stick-breaking distribution for inferences under Dirichlet process mixtures and related mixture models. We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.\n ', '\n Kernel methods offer the flexibility to learn complex relationships in modern, large data sets while enjoying strong theoretical guarantees on quality. Unfortunately, these methods typically require cubic running time in the data set size, a prohibitive cost in the large-data setting. Random feature maps (RFMs) and the Nystrom method both consider low-rank approximations to the kernel matrix as a potential solution. But, in order to achieve desirable theoretical guarantees, the former may require a prohibitively large number of features J+, and the latter may be prohibitively expensive for high-dimensional problems. We propose to combine the simplicity and generality of RFMs with a data-dependent feature selection scheme to achieve desirable theoretical approximation properties of Nystrom with just O(log J+) features. Our key insight is to begin with a large set of random features, then reduce them to a small number of weighted features in a data-dependent, computationally efficient way, while preserving the statistical guarantees of using the original large set of features. We demonstrate the efficacy of our method with theory and experiments--including on a data set with over 50 million observations. In particular, we show that our method achieves small kernel matrix approximation error and better test set accuracy with provably fewer random features than state-of-the-art methods.\n ', '\n Bayesian inference typically requires the computation of an approximation to the posterior distribution. An important requirement for an approximate Bayesian inference algorithm is to output high-accuracy posterior mean and uncertainty estimates. Classical Monte Carlo methods, particularly Markov Chain Monte Carlo, remain the gold standard for approximate Bayesian inference because they have a robust finite-sample theory and reliable convergence diagnostics. However, alternative methods, which are more scalable or apply to problems where Markov Chain Monte Carlo cannot be used, lack the same finite-data approximation theory and tools for evaluating their accuracy. In this work, we develop a flexible new approach to bounding the error of mean and uncertainty estimates of scalable inference algorithms. Our strategy is to control the estimation errors in terms of Wasserstein distance, then bound the Wasserstein distance via a generalized notion of Fisher distance. 
Unlike computing the Wasserstein distance, which requires access to the normalized posterior distribution, the Fisher distance is tractable to compute because it requires access only to the gradient of the log posterior density. We demonstrate the usefulness of our Fisher distance approach by deriving bounds on the Wasserstein error of the Laplace approximation and Hilbert coresets. We anticipate that our approach will be applicable to many other approximate inference methods such as the integrated Laplace approximation, variational inference, and approximate Bayesian computation.

The scraped dataset continues in this format: each element of the Python list is a single abstract string. The remaining entries cover, among other things, Bayesian coresets and scalable posterior inference, variational Bayes with its covariance and robustness corrections, Bayesian nonparametric models for clustering and networks, neural network pruning and the lottery ticket hypothesis, probabilistic programming languages and compilers, and quantum-dot and perovskite optoelectronics. Two short sketches follow: one showing one way a list of abstracts like this can be scraped, and one showing how to pre-process it into the inputs LDA expects.
Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.

We propose a novel approach to improving software security called Cryptographic Path Hardening, which is aimed at hiding security vulnerabilities in software from attackers through the use of provably secure and obfuscated cryptographic devices to harden paths in programs. By "harden" we mean that certain error-checking if-conditionals in a given program P are replaced by equivalent, cryptographically obfuscated conditionals; by "hiding" we mean that adversaries cannot use semi-automatic program analysis techniques to reason about the hardened program paths and thus cannot discover as-yet-unknown errors along those paths, except perhaps through black-box dictionary attacks or random testing (which we can never prevent). Other than these unpreventable attack methods, we can make program analysis aimed at error-finding "provably hard" for a resource-bounded attacker, in the same sense that cryptographic schemes are hard to break. Unlike security-through-obscurity, in Cryptographic Path Hardening we use provably-secure crypto devices to hide errors and our mathematical arguments of security are the same as the standard ones used in cryptography. One application of Cryptographic Path Hardening is that software patches or filters often reveal enough information to an attacker that they can be used to construct error-revealing inputs to exploit an unpatched version of the program. By "hardening" the patch we make it difficult for the attacker to analyze the patched program to construct error-revealing inputs, and thus prevent him from potentially constructing exploits.

Let $E \subseteq R^n$ be a closed set of Hausdorff dimension $α$. For $m \geq n$, let $\{B_1,\ldots,B_k\}$ be $n \times (m-n)$ matrices. We prove that if the system of matrices $B_j$ is non-degenerate in a suitable sense, $α$ is sufficiently close to $n$, and if $E$ supports a probability measure obeying appropriate dimensionality and Fourier decay conditions, then for a range of $m$ depending on $n$ and $k$, the set $E$ contains a translate of a non-trivial $k$-point configuration $\{B_1y,\ldots,B_ky\}$. As a consequence, we are able to establish existence of certain geometric configurations in Salem sets (such as parallelograms in $R^n$ and isosceles right triangles in $R^2$). This can be viewed as a multidimensional analogue of an earlier result of Laba and Pramanik on 3-term arithmetic progressions in subsets of $R$.

We study the spindown of isolated neutron stars from initially rapid rotation rates, driven by two factors: (i) gravitational wave emission due to r-modes and (ii) magnetic braking.
In the context of isolated neutron stars, we present the first study including self-consistently the magnetic damping of r-modes in the spin evolution. We track the spin evolution employing the RNS code, which accounts for the rotating structure of neutron stars for various equations of state. We find that, despite the strong damping due to the magnetic field, r-modes alter the braking rate from pure magnetic braking for B<10^{13}G. For realistic values of the saturation amplitude, the r-mode can also decrease the time to reach the threshold central density for quark deconfinement. Within a phenomenological model, we assess the gravitational waveform that would result from r-mode driven spindown of a magnetized neutron star. To contrast with the persistent signal during the spindown phase, we also present a preliminary estimate of the transient gravitational wave signal from an explosive quark-hadron phase transition, which can be a signal for the deconfinement of quarks inside neutron stars.\n ', '\n Throughout the course of the COVID-19 pandemic, several countries have developed and released contact tracing and exposure notification smartphone applications (apps) to help slow the spread of the disease. To support such apps, Apple and Google have released Exposure Notification Application Programming Interfaces (APIs) to infer device (user) proximity using Bluetooth Low Energy (BLE) beacons. The Private Automated Contact Tracing (PACT) team has shown that accurately estimating the distance between devices using only BLE radio signals is challenging. This paper describes the design and implementation of the SonicPACT protocol to use near-ultrasonic signals on commodity iOS and Android smartphones to estimate distances using time-of-flight measurements. The protocol allows Android and iOS devices to interoperate, augmenting and improving the current exposure notification APIs. Our initial experimental results are promising, suggesting that SonicPACT should be considered for implementation by Apple and Google.\n ', "\n The growing popularity of cloud-based machine learning raises a natural question about the privacy guarantees that can be provided in such a setting. Our work tackles this problem in the context where a client wishes to classify private images using a convolutional neural network (CNN) trained by a server. Our goal is to build efficient protocols whereby the client can acquire the classification result without revealing their input to the server, while guaranteeing the privacy of the server's neural network.\n To this end, we design Gazelle, a scalable and low-latency system for secure neural network inference, using an intricate combination of homomorphic encryption and traditional two-party computation techniques (such as garbled circuits). Gazelle makes three contributions. First, we design the Gazelle homomorphic encryption library which provides fast algorithms for basic homomorphic operations such as SIMD (single instruction multiple data) addition, SIMD multiplication and ciphertext permutation. Second, we implement the Gazelle homomorphic linear algebra kernels which map neural network layers to optimized homomorphic matrix-vector multiplication and convolution routines. 
Third, we design optimized encryption switching protocols which seamlessly convert between homomorphic and garbled circuit encodings to enable implementation of complete neural network inference. We evaluate our protocols on benchmark neural networks trained on the MNIST and CIFAR-10 datasets and show that Gazelle outperforms the best existing systems such as MiniONN (ACM CCS 2017) by 20 times and Chameleon (Crypto Eprint 2017/1164) by 30 times in online runtime. Similarly when compared with fully homomorphic approaches like CryptoNets (ICML 2016) we demonstrate three orders of magnitude faster online run-time.

Ultra-wideband (UWB) communication is an emerging wireless technology that promises high data rates over short distances and precise locationing. The large available bandwidth and the constraint of a maximum power spectral density drive a unique set of system challenges. This paper addresses these challenges using two UWB transceivers and a discrete prototype platform.

Wireless microsensor networks, which have been the topic of intensive research in recent years, are now emerging in industrial applications. An important milestone in this transition has been the release of the IEEE 802.15.4 standard that specifies interoperable wireless physical and medium access control layers targeted to sensor node radios. In this paper, we evaluate the potential of an 802.15.4 radio for use in an ultra low power sensor node operating in a dense network. Starting from measurements carried out on the off-the-shelf radio, effective radio activation and link adaptation policies are derived. It is shown that, in a typical sensor network scenario, the average power per node can be reduced down to 211 μW. Next, the energy consumption breakdown between the different phases of a packet transmission is presented, indicating which part of the transceiver architecture can most effectively be optimized in order to further reduce the radio power, enabling self-powered wireless microsensor networks.

RISC-V is a relatively new, open instruction set architecture with a mature ecosystem and an official formal machine-readable specification. It is therefore a promising playground for formal-methods research. However, we observe that different formal-methods research projects are interested in different aspects of RISC-V and want to simplify, abstract, approximate, or ignore the other aspects. Often, they also require different encoding styles, resulting in each project starting a new formalization from scratch. We set out to identify the commonalities between projects and to represent the RISC-V specification as a program with holes that can be instantiated differently by different projects. Our formalization of the RISC-V specification is written in Haskell and leverages existing tools rather than requiring new domain-specific tools, contrary to other approaches. To our knowledge, it is the first RISC-V specification able to serve as the interface between a processor-correctness proof and a compiler-correctness proof, while supporting several other projects with diverging requirements as well.

It is a neat result from functional programming that libraries of parser combinators can support rapid construction of decoders for quite a range of formats. With a little more work, the same combinator program can denote both a decoder and an encoder. Unfortunately, the real world is full of gnarly formats, as with the packet formats that make up the standard Internet protocol stack. Most past parser-combinator approaches cannot handle these formats, and the few exceptions require redundancy -- one part of the natural grammar needs to be hand-translated into hints in multiple parts of a parser program. We show how to recover very natural and nonredundant format specifications, covering all popular network packet formats and generating both decoders and encoders automatically. The catch is that we use the Coq proof assistant to derive both kinds of artifacts using tactics, automatically, in a way that guarantees that they form inverses of each other. We used our approach to reimplement packet processing for a full Internet protocol stack, inserting our replacement into the OCaml-based MirageOS unikernel, resulting in minimal performance degradation.

We describe our experience implementing a broad category-theory library in Coq. Category theory and computational performance are not usually mentioned in the same breath, but we have needed substantial engineering effort to teach Coq to cope with large categorical constructions without slowing proof script processing unacceptably. In this paper, we share the lessons we have learned about how to represent very abstract mathematical objects and arguments in Coq and how future proof assistants might be designed to better support such reasoning. One particular encoding trick to which we draw attention allows category-theoretic arguments involving duality to be internalized in Coq's logic with definitional equality. Ours may be the largest Coq development to date that uses the relatively new Coq version developed by homotopy type theorists, and we reflect on which new features were especially helpful.

We describe a method for building composable and extensible verification procedures within the Coq proof assistant. Unlike traditional methods that rely on run-time generation and checking of proofs, we use verified-correct procedures with Coq soundness proofs.
Though they are internalized in Coq's logic, our provers support sound extension by users with hints over new domains, enabling automated reasoning about user-defined abstract predicates. We maintain soundness by developing an architecture for modular packaging, construction, and composition of hint databases, which had previously only been implemented in Coq at the level of its dynamically typed, proof-generating tactic language. Our provers also include rich handling of unification variables, enabling integration with other tactic-based deduction steps within Coq. We have implemented our techniques in MirrorShard, an open-source framework for reflective verification. We demonstrate its applicability by instantiating it to separation logic in order to reason about imperative program verification.\n ", '\n We report on the implementation of a certified compiler for a high-level hardware description language (HDL) called Fe-Si (FEatherweight SynthesIs). Fe-Si is a simplified version of Bluespec, an HDL based on a notion of guarded atomic actions. Fe-Si is defined as a dependently typed deep embedding in Coq. The target language of the compiler corresponds to a synthesisable subset of Verilog or VHDL. A key aspect of our approach is that input programs to the compiler can be defined and proved correct inside Coq. Then, we use extraction and a Verilog back-end (written in OCaml) to get a certified version of a hardware design.\n ', "\n Receiver operating characteristics (ROCs) are a well-established representation of the tradeoff between detection and false alarm probabilities in classical binary hypothesis testing. We use classical ROCs as motivation for two types of operating characteristics for binary hypothesis testing in quantum systems -- decision operating characteristics (QDOCs) and measurement operating characteristics (QMOCs). Both are described in the context of a framework we propose that encompasses the typical formulations of binary hypothesis testing in both the classical and quantum scenarios. We interpret Helstrom's well-known result regarding discrimination between two quantum density operators with minimum probability of error in this framework. We also present a generalization of previous results regarding the correspondence between classical Parseval frames and quantum measurements. The derivation naturally leads to a constructive procedure for generating many different measurements besides Helstrom's optimal measurement, some standard and others non-standard, that achieve minimum probability of error.\n ", "\n Massive open online courses (MOOCs) promise to make rigorous higher education accessible to everyone, but prior research has shown that registrants tend to come from backgrounds of higher socioeconomic status. We study geographically granular economic patterns in about 76,000 U.S. registrations for about 600 HarvardX and MITx courses between 2012 and 2018, identifying registrants' locations using both IP geolocation and user-reported mailing addresses. By either metric, we find higher registration rates among postal codes with greater prosperity or population density. However, we also find evidence of bias in IP geolocation: it makes greater errors, both geographically and economically, for users from more economically distressed areas; it disproportionately places users in prosperous areas; and it underestimates the regressive pattern in MOOC registration. 
Researchers should use IP geolocation in MOOC studies with care, and consider the possibility of similar economic biases affecting its other academic, commercial, and legal uses.\n ", '\n The efficient simulation of quantum systems is a primary motivating factor for developing controllable quantum machines. For addressing systems with underlying bosonic structure, it is advantageous to utilize a naturally bosonic platform. Optical photons passing through linear networks may be configured to perform quantum simulation tasks, but the efficient preparation and detection of multiphoton quantum states of light in linear optical systems are challenging. Here, we experimentally implement a boson sampling protocol for simulating molecular vibronic spectra [Nature Photonics $\\textbf{9}$, 615 (2015)] in a two-mode superconducting device. In addition to enacting the requisite set of Gaussian operations across both modes, we fulfill the scalability requirement by demonstrating, for the first time in any platform, a high-fidelity single-shot photon number resolving detection scheme capable of resolving up to 15 photons per mode. Furthermore, we exercise the capability of synthesizing non-Gaussian input states to simulate spectra of molecular ensembles in vibrational excited states. We show the re-programmability of our implementation by extracting the spectra of photoelectron processes in H$_2$O, O$_3$, NO$_2$, and SO$_2$. The capabilities highlighted in this work establish the superconducting architecture as a promising platform for bosonic simulations, and by combining them with tools such as Kerr interactions and engineered dissipation, enable the simulation of a wider class of bosonic systems.\n ', '\n Experimentally realizable quantum computers are rapidly approaching the threshold of quantum supremacy. Quantum Hamiltonian simulation promises to be one of the first practical applications for which such a device could demonstrate an advantage over all classical systems. However, these early devices will inevitably remain both noisy and small, precluding the use of quantum error correction. We use high-performance classical tools to construct, optimize, and simulate quantum circuits subject to realistic error models in order to empirically determine the "simulation capacity" of near-term simulation experiments implemented via quantum signal processing (QSP), describing the relationship between simulation time, system size, and resolution of QSP circuits which are optimally configured to balance algorithmic precision and external noise. From simulation capacity models, we estimate maximum tolerable error rate for meaningful simulation experiments on a near-term quantum computer.\n By exploiting symmetry inherent to the QSP circuit, we further demonstrate that its capacity for quantum simulation can be increased by at least two orders of magnitude if errors are systematic and unitary. We find that a device with $ε^2=10^{-5}$ systematic amplitude errors could meaningfully simulate systems up to $n\\approx16$ with an expected failure rate below $10\\%$, whereas the largest system a device with a stochastic error rate of $p_ε=10^{-5}$ could meaningfully simulate with the same rate of failure is between $n=3$ and $n=5$ (depending on the stochastic channel). 
Extrapolating from empirical results, we estimate that one would typically need a stochastic error rate below $p_ε=10^{-8}$ to perform a meaningful $n=50$ simulation experiment with a failure rate below $10\\%$, while the same experiment could tolerate systematic unitary errors with strength $ε^2\\approx10^{-6}$.\n ', "\n To understand the fundamental trade-offs between training stability, temporal dynamics and architectural complexity of recurrent neural networks~(RNNs), we directly analyze RNN architectures using numerical methods of ordinary differential equations~(ODEs). We define a general family of RNNs--the ODERNNs--by relating the composition rules of RNNs to integration methods of ODEs at discrete time steps. We show that the degree of RNN's functional nonlinearity $n$ and the range of its temporal memory $t$ can be mapped to the corresponding stage of Runge-Kutta recursion and the order of time-derivative of the ODEs. We prove that popular RNN architectures, such as LSTM and URNN, fit into different orders of $n$-$t$-ODERNNs. This exact correspondence between RNN and ODE helps us to establish the sufficient conditions for RNN training stability and facilitates more flexible top-down designs of new RNN architectures using large varieties of toolboxes from numerical integration of ODEs. We provide such an example: Quantum-inspired Universal computing Neural Network~(QUNN), which reduces the required number of training parameters from polynomial in both data length and temporal memory length to only linear in temporal memory length.\n ", '\n Compared to humans, machine learning models generally require significantly more training examples and fail to extrapolate from experience to solve previously unseen challenges. To help close this performance gap, we augment single-task neural networks with a meta-recognition model which learns a succinct model code via its autoencoder structure, using just a few informative examples. The model code is then employed by a meta-generative model to construct parameters for the task-specific model. We demonstrate that for previously unseen tasks, without additional training, this Meta-Learning Autoencoder (MeLA) framework can build models that closely match the true underlying models, with loss significantly lower than given by fine-tuned baseline networks, and performance that compares favorably with state-of-the-art meta-learning algorithms. MeLA also adds the ability to identify influential training examples and predict which additional data will be most valuable to acquire to improve model prediction.\n ', '\n Each time a learner in a self-paced online course seeks to answer an assessment question, it takes some time for the student to read the question and arrive at an answer to submit. If multiple attempts are allowed, and the first answer is incorrect, it takes some time to provide a second answer. Here we study the distribution of such "response times." We find that the log-normal statistical model for such times, previously suggested in the literature, holds for online courses. Users who, according to this model, tend to take longer on submits are more likely to complete the course, have a higher level of engagement, and achieve a higher grade. 
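As a small illustration of the log-normal response-time model described above, the following sketch fits a log-normal distribution to synthetic response times by fitting a normal distribution to their logarithms; the two simulated groups and their parameters are invented, not data from the study.

    # Minimal log-normal response-time fit on synthetic data (NumPy only).
    # Fitting a log-normal reduces to fitting a normal to log response times.
    import numpy as np

    rng = np.random.default_rng(2)
    # Hypothetical response times in seconds for two groups of learners.
    fast = rng.lognormal(mean=2.5, sigma=0.8, size=4000)
    slow = rng.lognormal(mean=3.2, sigma=0.8, size=4000)

    def fit_lognormal(times):
        logs = np.log(times)
        return logs.mean(), logs.std()        # MLE of the underlying normal

    for name, times in [("fast submitters", fast), ("slow submitters", slow)]:
        mu, sigma = fit_lognormal(times)
        median = np.exp(mu)                   # median response time implied by the fit
        print(f"{name}: mu={mu:.2f}, sigma={sigma:.2f}, median={median:.1f}s")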
This finding can be the basis for designing interventions in online courses, such as MOOCs, which would encourage "fast" users to slow down.\n ', '\n We present a novel set of reversible modular multipliers applicable to quantum computing, derived from three classical techniques: 1) traditional integer division, 2) Montgomery residue arithmetic, and 3) Barrett reduction. Each multiplier computes an exact result for all binary input values, while maintaining the asymptotic resource complexity of a single (non-modular) integer multiplier. We additionally conduct an empirical resource analysis of our designs in order to determine the total gate count and circuit depth of each fully constructed circuit, with inputs as large as 2048 bits. Our comparative analysis considers both circuit implementations which allow for arbitrary (controlled) rotation gates, as well as those restricted to a typical fault-tolerant gate set.\n ', '\n A non-Clifford gate is required for universal quantum computation, and, typically, this is the most error-prone and resource intensive logical operation on an error-correcting code. Small, single-qubit rotations are popular choices for this non-Clifford gate, but certain three-qubit gates, such as Toffoli or controlled-controlled-Z (CCZ), are equivalent options that are also more suited for implementing some quantum algorithms, for instance those with coherent classical subroutines. Here, we calculate error rates and resource overheads for implementing logical CCZ with pieceable fault-tolerance, a non-transversal method for implementing logical gates. We provide a comparison with a non-local magic-state scheme on a concatenated code and a local magic-state scheme on the surface code. We find the pieceable fault-tolerance scheme particularly advantaged over magic states on concatenated codes and in certain regimes over magic states on the surface code. Our results suggest that pieceable fault-tolerance is a promising candidate for fault-tolerance in a near-future quantum computer.\n ', '\n Noisy PN learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate rho1 for positive examples and rho0 for negative examples. We propose Rank Pruning (RP) to solve noisy PN learning and the open problem of estimating the noise rates, i.e. the fraction of wrong positive and negative labels. Unlike prior solutions, RP is time-efficient and general, requiring O(T) for any unrestricted choice of probabilistic classifier with T fitting time. We prove RP has consistent noise estimation and equivalent expected risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. RP achieves state-of-the-art noise estimation and F1, error, and AUC-PR for both MNIST and CIFAR datasets, regardless of the amount of noise and performs similarly impressively when a large portion of training examples are noise drawn from a third distribution. To highlight, RP with a CNN classifier can predict if an MNIST digit is a "one"or "not" with only 0.25% error, and 0.46 error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples.\n ', '\n Classical imaging works by scattering photons from an object to be imaged, and achieves resolution scaling as $1/\\sqrt{t}$, with $t$ the imaging time. 
By contrast, the laws of quantum mechanics allow one to utilize quantum coherence to obtain imaging resolution that can scale as quickly as $1/t$ -- the so-called "Heisenberg limit." However, ambiguities in the obtained signal often preclude taking full advantage of this quantum enhancement, while imaging techniques designed to be unambiguous often lose this optimal Heisenberg scaling. Here, we demonstrate an imaging technique which combines unambiguous detection of the target with Heisenberg scaling of the resolution. We also demonstrate a binary search algorithm which can efficiently locate a coherent target using the technique, resolving a target trapped ion to within 0.3% of the $1/e^2$ diameter of the excitation beam.\n ', '\n We report on a method for measuring the branching ratios of dipole transitions of trapped atomic ions by performing nested sequences of population inversions. This scheme is broadly applicable and does not use ultrafast pulsed or narrow linewidth lasers. It is simple to perform and insensitive to experimental variables such as laser and magnetic field noise as well as ion heating. To demonstrate its effectiveness, we make the most accurate measurements thus far of the branching ratios of both 5P1/2 and 5P3/2 states in 88Sr+ with sub-1% uncertainties. We measure 17.175(27) for the branching ratio of 5P1/2-5S1/2, 15.845(71) for 5P3/2-5S1/2, and 0.05609(21) for 5P3/2-4D5/2, ten- fold and thirty-fold improvements in precision for 5P1/2 and 5P3/2 branching ratios respectively over the best previous experimental values.\n ', '\n We describe a cheating strategy enabled by the features of massive open online courses (MOOCs) and detectable by virtue of the sophisticated data systems that MOOCs provide. The strategy, Copying Answers using Multiple Existences Online (CAMEO), involves a user who gathers solutions to assessment questions using a "harvester" account and then submits correct answers using a separate "master" account. We use "clickstream" learner data to detect CAMEO use among 1.9 million course participants in 115 MOOCs from two universities. Using conservative thresholds, we estimate CAMEO prevalence at 1,237 certificates, accounting for 1.3% of the certificates in the 69 MOOCs with CAMEO users. Among earners of 20 or more certificates, 25% have used the CAMEO strategy. CAMEO users are more likely to be young, male, and international than other MOOC certificate earners. We identify preventive strategies that can decrease CAMEO rates and show evidence of their effectiveness in science courses.\n ', '\n Quantum computers are able to outperform classical algorithms. This was long recognized by the visionary Richard Feynman who pointed out in the 1980s that quantum mechanical problems were better solved with quantum machines. It was only in 1994 that Peter Shor came up with an algorithm that is able to calculate the prime factors of a large number vastly more efficiently than known possible with a classical computer. This paradigmatic algorithm stimulated the flourishing research in quantum information processing and the quest for an actual implementation of a quantum computer. Over the last fifteen years, using skillful optimizations, several instances of a Shor algorithm have been implemented on various platforms and clearly proved the feasibility of quantum factoring. For general scalability, though, a different approach has to be pursued. Here, we report the realization of a fully scalable Shor algorithm as proposed by Kitaev. 
For this, we demonstrate factoring the number fifteen by effectively employing and controlling seven qubits and four "cache-qubits", together with the implementation of generalized arithmetic operations, known as modular multipliers. The scalable algorithm has been realized with an ion-trap quantum computer exhibiting success probabilities in excess of 90%.\n ', '\n We study the vacuum-induced degradation of high-finesse optical cavities with mirror coatings composed of SiO$_2$-Ta$_{2}$O$_{5}$ dielectric stacks, and present methods to protect these coatings and to recover their initial quality factor. For separate coatings with reflectivities centered at 370 nm and 422 nm, a vacuum-induced continuous increase in optical loss occurs if the surface-layer coating is made of Ta$_{2}$O$_{5}$, while it does not occur if it is made of SiO$_2$. The incurred optical loss can be reversed by filling the vacuum chamber with oxygen at atmospheric pressure, and the recovery rate can be strongly accelerated by continuous laser illumination at 422 nm. Both the degradation and the recovery processes depend strongly on temperature. We find that a 1 nm-thick layer of SiO$_2$ passivating the Ta$_{2}$O$_{5}$ surface layer is sufficient to reduce the degradation rate by more than a factor of 10, strongly supporting surface oxygen depletion as the primary degradation mechanism.\n ', '\n Massive Open Online Courses are an exciting new avenue for instruction and research, yet they are full of unknowns. In the Spring of 2013, MITx released its first introductory physics MOOC through the edX platform, generating a total enrollment of 43,000 students from around the world. We describe the population of participants in terms of their age, gender, level of education, and country of origin, highlighting both the diversity of 8.02x enrollees as well as gender gap and retention. Using three midterm exams and the final as waypoints, we highlight performance by different demographic subpopulations and their retention rates. Our work is generally aimed at making a bridge between available MOOC data and topics associated with the Physics Education Research community.\n ', '\n We present a novel hybrid system where an optical cavity is integrated with a microfabricated planar-electrode ion trap. The trap electrodes produce a tunable periodic potential allowing the trapping of up to 50 separate ion chains spaced by 160 $μ$m along the cavity axis. Each chain can contain up to 20 individually addressable Yb\\textsuperscript{+} ions coupled to the cavity mode. We demonstrate deterministic distribution of ions between the sites of the electrostatic periodic potential and control of the ion-cavity coupling. The measured strength of this coupling should allow access to the strong collective coupling regime with $\\lesssim$10 ions. The optical cavity could serve as a quantum information bus between ions or be used to generate a strong wavelength-scale periodic optical potential.\n ', '\n Graphs are closely related to quantum error-correcting codes: every stabilizer code is locally equivalent to a graph code, and every codeword stabilized code can be described by a graph and a classical code. For the construction of good quantum codes of relatively large block length, concatenated quantum codes and their generalizations play an important role. 
We develop a systematic method for constructing concatenated quantum codes based on "graph concatenation", where graphs representing the inner and outer codes are concatenated via a simple graph operation called "generalized local complementation." Our method applies to both binary and non-binary concatenated quantum codes as well as their generalizations.\n ', '\n Many-body entangled quantum states studied in condensed matter physics can be primary resources for quantum information, allowing any quantum computation to be realized using measurements alone, on the state. Such a universal state would be remarkably valuable, if only it were thermodynamically stable and experimentally accessible, by virtue of being the unique ground state of a physically reasonable Hamiltonian made of two-body, nearest neighbor interactions. We introduce such a state, composed of six-state particles on a hexagonal lattice, and describe a general method for analyzing its properties based on its projected entangled pair state representation.\n ', '\n The codeword stabilized ("CWS") quantum codes formalism presents a unifying approach to both additive and nonadditive quantum error-correcting codes (arXiv:0708.1021). This formalism reduces the problem of constructing such quantum codes to finding a binary classical code correcting an error pattern induced by a graph state. Finding such a classical code can be very difficult. Here, we consider an algorithm which maps the search for CWS codes to a problem of identifying maximum cliques in a graph. While solving this problem is in general very hard, we prove three structure theorems which reduce the search space, specifying certain admissible and optimal ((n,K,d)) additive codes. In particular, we find there does not exist any ((7,3,3)) CWS code though the linear programming bound does not rule it out. The complexity of the CWS search algorithm is compared with the contrasting method introduced by Aggarwal and Calderbank (arXiv:cs/0610159).\n ', '\n We produce large numbers of low-energy ions by photoionization of laser-cooled atoms inside a surface-electrode-based Paul trap. The isotope-selective trap loading rate of $4\\times10^{5}$ Yb$^{+}$ ions/s exceeds that attained by photoionization (electron impact ionization) of an atomic beam by four (six) orders of magnitude. Traps as shallow as 0.13 eV are easily loaded with this technique. The ions are confined in the same spatial region as the laser-cooled atoms, which will allow the experimental investigation of interactions between cold ions and cold atoms or Bose-Einstein condensates.\n ', '\n Nuclear Magnetic Resonance (NMR) has provided a valuable experimental testbed for quantum information processing (QIP). Here, we briefly review the use of nuclear spins as qubits, and discuss the current status of NMR-QIP. Advances in the techniques available for control are described along with the various implementations of quantum algorithms and quantum simulations that have been performed using NMR. The recent application of NMR control techniques to other quantum computing systems are reviewed before concluding with a description of the efforts currently underway to transition to solid state NMR systems that hold promise for scalable architectures.\n ', '\n The Schur basis on n d-dimensional quantum systems is a generalization of the total angular momentum basis that is useful for exploiting symmetry under permutations or collective unitary rotations. 
We present efficient (size poly(n,d,log(1/ε)) for accuracy ε) quantum circuits for the Schur transform, which is the change of basis between the computational and the Schur bases. These circuits are based on efficient circuits for the Clebsch-Gordan transformation. We also present an efficient circuit for a limited version of the Schur transform in which one needs only to project onto different Schur subspaces. This second circuit is based on a generalization of phase estimation to any nonabelian finite group for which there exists a fast quantum Fourier transform.\n ', '\n Systematic errors in quantum operations can be the dominating source of imperfection in achieving control over quantum systems. This problem, which has been well studied in nuclear magnetic resonance, can be addressed by replacing single operations with composite sequences of pulsed operations, which cause errors to cancel by symmetry. Remarkably, this can be achieved without knowledge of the amount of error epsilon. Independent of the initial state of the system, current techniques allow the error to be reduced to O(epsilon^3). Here, we extend the composite pulse technique to cancel errors to O(epsilon^n), for arbitrary n.\n ', '\n We report the realization of a nuclear magnetic resonance computer with three quantum bits that simulates an adiabatic quantum optimization algorithm. Adiabatic quantum algorithms offer new insight into how quantum resources can be used to solve hard problems. This experiment uses a particularly well suited three quantum bit molecule and was made possible by introducing a technique that encodes general instances of the given optimization problem into an easily applicable Hamiltonian. Our results indicate an optimal run time of the adiabatic algorithm that agrees well with the prediction of a simple decoherence model.\n ', '\n We present a quantum digital signature scheme whose security is based on fundamental principles of quantum physics. It allows a sender (Alice) to sign a message in such a way that the signature can be validated by a number of different people, and all will agree either that the message came from Alice or that it has been tampered with. To accomplish this task, each recipient of the message must have a copy of Alice\'s "public key," which is a set of quantum states whose exact identity is known only to Alice. Quantum public keys are more difficult to deal with than classical public keys: for instance, only a limited number of copies can be in circulation, or the scheme becomes insecure. However, in exchange for this price, we achieve unconditionally secure digital signatures. Sending an m-bit message uses up O(m) quantum bits for each recipient of the public key. We briefly discuss how to securely distribute quantum public keys, and show the signature scheme is absolutely secure using one method of key distribution. The protocol provides a model for importing the ideas of classical public key cryptography into the quantum world.\n ', '\n The clock synchronization problem is to determine the time difference $Δ$ between two spatially separated clocks. When message delivery times between the two clocks are uncertain, $O(2^{2n})$ classical messages must be exchanged between the clocks to determine $n$ digits of $Δ$. 
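The composite-pulse idea summarized above (errors cancelling by symmetry, without knowing the error size) can be checked numerically. The sketch below compares a single rotation carrying a systematic amplitude error against the standard BB1 composite sequence; the target angle and error values are arbitrary choices for illustration, and this is not the generalized O(epsilon^n) construction of the paper.

    # Numeric check of composite-pulse error cancellation. With every rotation
    # angle scaled by (1 + eps), a single pulse is wrong at first order in eps,
    # while the BB1 sequence is designed to cancel the low-order error terms
    # (the printed errors shrink much faster than linearly in eps).
    import numpy as np

    I2 = np.eye(2, dtype=complex)
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    sy = np.array([[0, -1j], [1j, 0]], dtype=complex)

    def pulse(angle, phase, eps=0.0):
        """Rotation by `angle` about an axis in the xy-plane, with amplitude error eps."""
        a = angle * (1 + eps)
        axis = np.cos(phase) * sx + np.sin(phase) * sy
        return np.cos(a / 2) * I2 - 1j * np.sin(a / 2) * axis

    theta = np.pi / 2                          # target rotation angle
    target = pulse(theta, 0.0)                 # ideal, error-free pulse

    def bb1(theta, eps):
        phi = np.arccos(-theta / (4 * np.pi))
        seq = pulse(theta, 0.0, eps)
        for ang, ph in [(np.pi, phi), (2 * np.pi, 3 * phi), (np.pi, phi)]:
            seq = pulse(ang, ph, eps) @ seq
        return seq

    def error(U):
        return np.linalg.norm(U - target)      # operator distance from the ideal rotation

    for eps in [0.1, 0.05, 0.025]:
        naive_err = error(pulse(theta, 0.0, eps))
        bb1_err = error(bb1(theta, eps))
        print(f"eps={eps:5.3f}  naive={naive_err:.2e}  BB1={bb1_err:.2e}")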
On the other hand, as we show, there exists a quantum algorithm to obtain $n$ digits of $Δ$ while communicating only O(n) quantum messages.

Using nuclear magnetic resonance techniques, we experimentally investigated the effects of applying a two bit phase error detection code to preserve quantum information in nuclear spin systems. Input states were stored with and without coding, and the resulting output states were compared with the originals and with each other. The theoretically expected result, net reduction of distortion and conditional error probabilities to second order, was indeed observed, despite imperfect coding operations which increased the error probabilities by approximately 5%. Systematic study of the deviations from the ideal behavior provided quantitative measures of different sources of error, and good agreement was found with a numerical model. Theoretical questions in quantum error correction in bulk nuclear spin systems including fidelity measures, signal strength and syndrome measurements are discussed.

In bulk quantum computation one can manipulate a large number of indistinguishable quantum computers by parallel unitary operations and measure expectation values of certain observables with limited sensitivity. The initial state of each computer in the ensemble is known but not pure. Methods for obtaining effective pure input states by a series of manipulations have been described by Gershenfeld and Chuang (logical labeling) and Cory et al. (spatial averaging) for the case of quantum computation with nuclear magnetic resonance. We give a different technique called temporal averaging. This method is based on classical randomization, requires no ancilla qubits and can be implemented in nuclear magnetic resonance without using gradient fields. We introduce several temporal averaging algorithms suitable for both high temperature and low temperature bulk quantum computing and analyze the signal to noise behavior of each.

This paper presents a new protocol for solving the private heavy-hitters problem. In this problem, there are many clients and a small set of data-collection servers. Each client holds a private bitstring. The servers want to recover the set of all popular strings, without learning anything else about any client's string. A web-browser vendor, for instance, can use our protocol to figure out which homepages are popular, without learning any user's homepage. We also consider the simpler private subset-histogram problem, in which the servers want to count how many clients hold strings in a particular set without revealing this set to the clients. Our protocols use two data-collection servers and, in a protocol run, each client sends only a single message to the servers. Our protocols protect client privacy against arbitrary misbehavior by one of the servers and our approach requires no public-key cryptography (except for secure channels), nor general-purpose multiparty computation. Instead, we rely on incremental distributed point functions, a new cryptographic tool that allows a client to succinctly secret-share the labels on the nodes of an exponentially large binary tree, provided that the tree has a single non-zero path.
Along the way, we develop new general tools for providing malicious security in applications of distributed point functions.\n In an experimental evaluation with two servers on opposite sides of the U.S., the servers can find the 200 most popular strings among a set of 400,000 client-held 256-bit strings in 54 minutes. Our protocols are highly parallelizable. We estimate that with 20 physical machines per logical server, our protocols could compute heavy hitters over ten million clients in just over one hour of computation.\n ", "\n We present the design and implementation of SafetyPin, a system for encrypted mobile-device backups. Like existing cloud-based mobile-backup systems, including those of Apple and Google, SafetyPin requires users to remember only a short PIN and defends against brute-force PIN-guessing attacks using hardware security protections. Unlike today's systems, SafetyPin splits trust over a cluster of hardware security modules (HSMs) in order to provide security guarantees that scale with the number of HSMs. In this way, SafetyPin protects backed-up user data even against an attacker that can adaptively compromise many of the system's constituent HSMs. SafetyPin provides this protection without sacrificing scalability or fault tolerance. Decentralizing trust while respecting the resource limits of today's HSMs requires a synthesis of systems-design principles and cryptographic tools. We evaluate SafetyPin on a cluster of 100 low-cost HSMs and show that a SafetyPin-protected recovery takes 1.01 seconds. To process 1B recoveries a year, we estimate that a SafetyPin deployment would need 3,100 low-cost HSMs.\n ", '\n Existing systems for metadata-hiding messaging that provide cryptographic privacy properties have either high communication costs, high computation costs, or both. In this paper, we introduce Express, a metadata-hiding communication system that significantly reduces both communication and computation costs. Express is a two-server system that provides cryptographic security against an arbitrary number of malicious clients and one malicious server. In terms of communication, Express only incurs a constant-factor overhead per message sent regardless of the number of users, whereas previous cryptographically-secure systems Pung and Riposte had communication costs proportional to roughly the square root of the number of users. In terms of computation, Express only uses symmetric key cryptographic primitives and makes both practical and asymptotic improvements on protocols employed by prior work. These improvements enable Express to increase message throughput, reduce latency, and consume over 100x less bandwidth than Pung and Riposte, dropping the end to end cost of running a realistic whistleblowing application by 6x.\n ', "\n We present True2F, a system for second-factor authentication that provides the benefits of conventional authentication tokens in the face of phishing and software compromise, while also providing strong protection against token faults and backdoors. To do so, we develop new lightweight two-party protocols for generating cryptographic keys and ECDSA signatures, and we implement new privacy defenses to prevent cross-origin token-fingerprinting attacks. To facilitate real-world deployment, our system is backwards-compatible with today's U2F-enabled web services and runs on commodity hardware tokens after a firmware modification. 
A True2F-protected authentication takes just 57ms to complete on the token, compared with 23ms for unprotected U2F.\n ", "\n This paper presents Prio, a privacy-preserving system for the collection of aggregate statistics. Each Prio client holds a private data value (e.g., its current location), and a small set of servers compute statistical functions over the values of all clients (e.g., the most popular location). As long as at least one server is honest, the Prio servers learn nearly nothing about the clients' private data, except what they can infer from the aggregate statistics that the system computes. To protect functionality in the face of faulty or malicious clients, Prio uses secret-shared non-interactive proofs (SNIPs), a new cryptographic technique that yields a hundred-fold performance improvement over conventional zero-knowledge approaches. Prio extends classic private aggregation techniques to enable the collection of a large class of useful statistics. For example, Prio can perform a least-squares regression on high-dimensional client-provided data without ever seeing the data in the clear.\n ", '\n Atom is an anonymous messaging system that protects against traffic-analysis attacks. Unlike many prior systems, each Atom server touches only a small fraction of the total messages routed through the network. As a result, the system\'s capacity scales near-linearly with the number of servers. At the same time, each Atom user benefits from "best possible" anonymity: a user is anonymous among all honest users of the system, against an active adversary who controls the entire network, a portion of the system\'s servers, and any number of malicious users. The architectural ideas behind Atom have been known in theory, but putting them into practice requires new techniques for (1) avoiding the reliance on heavy general-purpose multi-party computation protocols, (2) defeating active attacks by malicious servers at minimal performance cost, and (3) handling server failure and churn.\n Atom is most suitable for sending a large number of short messages, as in a microblogging application or a high-security communication bootstrapping ("dialing") for private messaging systems. We show that, on a heterogeneous network of 1,024 servers, Atom can transit a million Tweet-length messages in 28 minutes. This is over 23x faster than prior systems with similar privacy guarantees.\n ', "\n Website publishers can derive enormous performance benefits and cost savings by directing traffic to their sites through content distribution networks (CDNs). However, publishers who use CDNs today must trust their CDN not to modify the site's JavaScript, CSS, images or other media en route to end users. A CDN that violates this trust could inject ads into websites, downsample media to save bandwidth or, worse, inject malicious JavaScript code to steal user secrets it could not otherwise access. We present Stickler, a system for website publishers that guarantees the end-to-end authenticity of content served to end users while simultaneously allowing publishers to reap the benefits of CDNs. Crucially, Stickler achieves these guarantees without requiring modifications to the browser.\n ", '\n This paper presents Riposte, a new system for anonymous broadcast messaging. Riposte is the first such system, to our knowledge, that simultaneously protects against traffic-analysis attacks, prevents anonymous denial-of-service by malicious clients, and scales to million-user anonymity sets. 
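The aggregation style that Prio (described above) builds on can be illustrated with the classic two-server additive secret-sharing trick. The sketch below shows only that basic idea, with made-up values and an arbitrarily chosen prime modulus; it omits Prio's secret-shared non-interactive proofs and everything else that makes the real system robust to malicious clients.

    # Two-server additive secret sharing for private aggregation. Each client
    # splits its value into two random shares; each server only ever sees one
    # share per client, yet the sum of the two servers' totals equals the sum
    # of all client values.
    import secrets

    P = 2**61 - 1                      # prime modulus (choice is illustrative)

    def share(value):
        r = secrets.randbelow(P)
        return r, (value - r) % P      # share_0 + share_1 = value  (mod P)

    client_values = [3, 17, 5, 42]     # e.g., each client's private count
    server0_total, server1_total = 0, 0
    for v in client_values:
        s0, s1 = share(v)
        server0_total = (server0_total + s0) % P
        server1_total = (server1_total + s1) % P

    aggregate = (server0_total + server1_total) % P
    print("aggregate:", aggregate, "expected:", sum(client_values))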
To achieve these properties, Riposte makes novel use of techniques used in systems for private information retrieval and secure multi-party computation. For latency-tolerant workloads with many more readers than writers (e.g. Twitter, Wikileaks), we demonstrate that a three-server Riposte cluster can build an anonymity set of 2,895,216 users in 32 hours.\n ', "\n The security of any cryptosystem relies on the secrecy of the system's secret keys. Yet, recent experimental work demonstrates that tens of thousands of devices on the Internet use RSA and DSA secrets drawn from a small pool of candidate values. As a result, an adversary can derive the device's secret keys without breaking the underlying cryptosystem. We introduce a new threat model, under which there is a systemic solution to such randomness flaws. In our model, when a device generates a cryptographic key, it incorporates some random values from an entropy authority into its cryptographic secrets and then proves to the authority, using zero-knowledge-proof techniques, that it performed this operation correctly. By presenting an entropy-authority-signed public-key certificate to a third party (like a certificate authority or SSH client), the device can demonstrate that its public key incorporates randomness from the authority and is therefore drawn from a large pool of candidate values. Where possible, our protocol protects against eavesdroppers, entropy authority misbehavior, and devices attempting to discredit the entropy authority. To demonstrate the practicality of our protocol, we have implemented and evaluated its performance on a commodity wireless home router. When running on a home router, our protocol incurs a 2.1x slowdown over conventional RSA key generation and it incurs a 4.4x slowdown over conventional EC-DSA key generation.\n ", '\n We present the design and prototype implementation of ConScript, a framework for using JavaScript to allow casual Web users to participate in an anonymous communication system. When a Web user visits a cooperative Web site, the site serves a JavaScript application that instructs the browser to create and submit "dummy" messages into the anonymity system. Users who want to send non-dummy messages through the anonymity system use a browser plug-in to replace these dummy messages with real messages. Creating such conscripted anonymity sets can increase the anonymity set size available to users of remailer, e-voting, and verifiable shuffle-style anonymity systems. We outline ConScript\'s architecture, we address a number of potential attacks against ConScript, and we discuss the ethical issues related to deploying such a system. Our implementation results demonstrate the practicality of ConScript: a workstation running our ConScript prototype JavaScript client generates a dummy message for a mix-net in 81 milliseconds and it generates a dummy message for a DoS-resistant DC-net in 156 milliseconds.\n ', '\n The DC-nets approach to anonymity has long held attraction for its strength against traffic analysis, but practical implementations remain vulnerable to internal disruption or "jamming" attacks requiring time-consuming tracing procedures to address. We present Verdict, the first practical anonymous group communication system built using proactively verifiable DC-nets: participants use public key cryptography to construct DC-net ciphertexts, and knowledge proofs to detect and exclude misbehavior before disruption. 
We compare three alternative constructions for verifiable DC-nets, one using bilinear maps and two based on simpler ElGamal encryption. While verifiable DC-nets incur higher computation overheads due to the public-key cryptography involved, our experiments suggest Verdict is practical for anonymous group messaging or microblogging applications, supporting groups of 100 clients at 1 second per round or 1000 clients at 10 seconds per round. Furthermore, we show how existing symmetric-key DC-nets can "fall back" to a verifiable DC-net to quickly identify misbehavior, improving on previous detection schemes by two orders of magnitude.\n ', '\n Users often wish to participate in online groups anonymously, but misbehaving users may abuse this anonymity to spam or disrupt the group. Messaging protocols such as Mix-nets and DC-nets leave online groups vulnerable to denial-of-service and Sybil attacks, while accountable voting protocols are unusable or inefficient for general anonymous messaging.\n We present the first general messaging protocol that offers provable anonymity with accountability for moderate-size groups, and efficiently handles unbalanced loads where few members have much data to transmit in a given round. The N group members first cooperatively shuffle an NxN matrix of pseudorandom seeds, then use these seeds in N "pre-planned" DC-nets protocol runs. Each DC-nets run transmits the variable-length bulk data comprising one member\'s message, using the minimum number of bits required for anonymity under our attack model. The protocol preserves message integrity and one-to-one correspondence between members and messages, makes denial-of-service attacks by members traceable to the culprit, and efficiently handles large and unbalanced message loads. A working prototype demonstrates the protocol\'s practicality for anonymous messaging in groups of 40+ member nodes.\n ', '\n Matrix completion is the study of recovering an underlying matrix from a sparse subset of noisy observations. Traditionally, it is assumed that the entries of the matrix are "missing completely at random" (MCAR), i.e., each entry is revealed at random, independent of everything else, with uniform probability. This is likely unrealistic due to the presence of "latent confounders", i.e., unobserved factors that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. For example, in the context of movie recommender systems -- a canonical application for matrix completion -- a user who vehemently dislikes horror films is unlikely to ever watch horror films. In general, these confounders yield "missing not at random" (MNAR) data, which can severely impact any inference procedure that does not correct for this bias. We develop a formal causal model for matrix completion through the language of potential outcomes, and provide novel identification arguments for a variety of causal estimands of interest. We design a procedure, which we call "synthetic nearest neighbors" (SNN), to estimate these causal estimands. We prove finite-sample consistency and asymptotic normality of our estimator. Our analysis also leads to new theoretical results for the matrix completion literature. In particular, we establish entry-wise, i.e., max-norm, finite-sample consistency and asymptotic normality results for matrix completion with MNAR data. As a special case, this also provides entry-wise bounds for matrix completion with MCAR data. 
Across simulated and real data, we demonstrate the efficacy of our proposed estimator.\n ', '\n Access to capital is a major constraint for economic growth in the developing world. Yet those attempting to lend in this space face high defaults due to their inability to distinguish creditworthy borrowers from the rest. In this paper, we propose two novel scoring mechanisms that incentivize community members to truthfully report their signal on the creditworthiness of others in their community. We first design a truncated asymmetric scoring-rule for a setting where the lender has no liquidity constraints. We then derive a novel, strictly-proper VCG scoring mechanism for the liquidity-constrained setting. Whereas Chen et al. [2011] give an impossibility result for an analogous setting in which sequential reports are made in the context of decision markets, we achieve a positive result through appeal to interim beliefs about the reports of others in a setting with simultaneous reports. Moreover, the use of VCG methods allows for the integration of linear belief aggregation methods.\n ', "\n Most of the new technological changes in power systems are expected to take place in distribution grids. The enormous potential for distribution flexibility could meet the transmission system's needs, changing the paradigm of generator-centric energy and ancillary services provided to a demand-centric one, by placing more importance on smaller resources, such as flexible demands and electric vehicles. For unlocking such capabilities, it is essential to understand the aggregated flexibility that can be harvested from the large population of new technologies located in distribution grids. Distribution grids, therefore, could provide aggregated flexibility at the transmission level. To date, most computational methods for estimating the aggregated flexibility at the interface between distribution grids and transmission grids have the drawback of requiring significant computational time, which hinders their applicability. This paper presents a new algorithm, coined QuickFlex, for constructing the flexibility domain of distribution grids. Contrary to previous methods, the flexibility domain accuracy can be selected a priori. Our method requires few iterations for constructing the flexibility region. The number of iterations needed is mainly independent of the distribution grid's input size and flexible elements. Numerical experiments are performed in four grids ranging from 5 nodes to 123 nodes. It is shown that QuickFlex outperforms existing proposals in the literature in both speed and accuracy.\n ", '\n In this work, inspired by secret sharing schemes, we introduce a privacy-preserving approach for network consensus, by which all nodes in a network can reach an agreement on their states without exposing the individual state to neighbors. With the privacy degree defined for the agents, the proposed method makes the network resistant to the collusion of any given number of neighbors, and protects the consensus procedure from communication eavesdropping. Unlike existing works, the proposed privacy-preserving algorithm is resilient to node failures. When a node fails, the method offers the possibility of rebuilding the lost node via the information kept in its neighbors, even though none of the neighbors knows the exact state of the failing node. 
Moreover, it is shown that the proposed method can achieve consensus and average consensus almost surely, when the agents have arbitrary privacy degrees and a common privacy degree, respectively. To illustrate the theory, two numerical examples are presented.\n ', "\n This paper proposes a privacy-preserving algorithm to solve the average consensus problem based on Shamir's secret sharing scheme, in which a network of agents reach an agreement on their states without exposing their individual state until an agreement is reached. Unlike other methods, the proposed algorithm renders the network resistant to the collusion of any given number of neighbors (even with all neighbors' colluding). Another virtue of this work is that such a method can protect the network consensus procedure from eavesdropping.\n ", "\n The design of data markets has gained importance as firms increasingly use predictions from machine learning models to streamline operations, yet need to externally acquire training data to fit such models. One aspect that has received limited consideration is the externality a firm faces when data is allocated to competing firms. Such externalities couple firms' optimal allocations, despite the inherent free replicability of data. In this paper, we demonstrate that the presence of externalities increases the optimal revenue of a monopolistic data seller by letting firms pay to prevent allocations to other competing firms. This is shown by first reducing the combinatorial problem of allocating and pricing multiple datasets to the auction of a single digital good. We achieve this by modeling utility for data solely through the increase in prediction accuracy it provides. Then, we find the welfare and revenue maximizing mechanisms, highlighting how the form of firms' private information - whether they know the externalities they exert on others or vice-versa - affects their overall structures. In all cases, the optimal allocation rule is a single threshold (one per firm), where either all data is allocated or none is.\n ", "\n Personal data is essential in showing users targeted ads - the economic backbone of the web. Still, there are major inefficiencies in how data is transacted online: (1) users don't decide what information is released nor get paid for this privacy loss; (2) algorithmic advertisers are stuck in inefficient long-term contracts where they purchase user data without knowing the value it provides. This paper proposes a system, Zorro, which aims to rectify aforementioned two problems.\n As the main contribution, we provide a natural, 'absolute' definition of 'Value of Data' (VoD) - for any quantity of interest, it is the delta between an individual's value and population mean. The challenge remains how to operationalize this definition, independently of a buyer's model for VoD. We propose a model-agnostic solution, relying on matrix estimation, and use it to estimate click-through-rate (CTR), as an example.\n Regarding (2), Zorro empowers advertisers to measure value of user data on a query-by-query basis and based only on the increase in accuracy it provides in estimating CTR. In contrast advertisers currently engage in inefficient long-term data contracts with third party data sellers. We highlight two results on a large ad-click dataset: (i) our system has R^2=0.58, in line with best-in-class results for related problems (e.g. content recommendation). 
Crucially, our system is model-agnostic - we estimate CTR without accessing an advertiser's proprietary models, a required property of any such pricing system;(ii) our experiments show selling user data has incremental value ranging from 30%-69% depending on ad category. Roughly, this translates to at least USD 16 Billion loss in value for advertisers if user data is not provided.\n Regarding (1), in addition to allowing users to get paid for data sharing, we extend our mathematical framework to when users provide explicit intent.\n ", '\n Spurred by the growth of transportation network companies and increasing data capabilities, vehicle routing and ride-matching algorithms can improve the efficiency of private transportation services. However, existing routing solutions do not address where drivers should travel after dropping off a passenger and before receiving the next passenger ride request, i.e., during the between-ride period. We address this problem by developing an efficient algorithm to find the optimal policy for drivers between rides in order to maximize driver profits. We model the road network as a graph, and we show that the between-ride routing problem is equivalent to a stochastic shortest path problem, an infinite dynamic program with no discounting. We prove under reasonable assumptions that an optimal routing policy exists that avoids cycles; policies of this type can be efficiently found. We present an iterative approach to find an optimal routing policy. Our approach can account for various factors, including the frequency of passenger ride requests at different locations, traffic conditions, and surge pricing. We demonstrate the effectiveness of the approach by implementing it on road network data from Boston and New York City.\n ', "\n In this work, we aim to design a data marketplace; a robust real-time matching mechanism to efficiently buy and sell training data for Machine Learning tasks. While the monetization of data and pre-trained models is an essential focus of industry today, there does not exist a market mechanism to price training data and match buyers to sellers while still addressing the associated (computational and other) complexity. The challenge in creating such a market stems from the very nature of data as an asset: (i) it is freely replicable; (ii) its value is inherently combinatorial due to correlation with signal in other data; (iii) prediction tasks and the value of accuracy vary widely; (iv) usefulness of training data is difficult to verify a priori without first applying it to a prediction task. As our main contributions we: (i) propose a mathematical model for a two-sided data market and formally define the key associated challenges; (ii) construct algorithms for such a market to function and analyze how they meet the challenges defined. We highlight two technical contributions: (i) a new notion of 'fairness' required for cooperative games with freely replicable goods; (ii) a truthful, zero regret mechanism to auction a class of combinatorial goods based on utilizing Myerson's payment function and the Multiplicative Weights algorithm. These might be of independent interest.\n ", '\n Voltage control plays an important role in the operation of electricity distribution networks, especially when there is a large penetration of renewable energy resources. In this paper, we focus on voltage control through reactive power compensation and study how different information structures affect the control performance. 
In particular, we first show that using only voltage measurements to determine reactive power compensation is insufficient to maintain voltage in the acceptable range. Then we propose two fully decentralized and robust algorithms by adding additional information, which can stabilize the voltage in the acceptable range. The one with higher complexity can further minimize a cost of reactive power compensation in a particular form. Both algorithms use only local measurements and local variables and require no communication. In addition, the two algorithms are robust against heterogeneous update rates and delays.\n ', '\n Consider a stationary discrete random process with alphabet size d, which is assumed to be the output process of an unknown stationary Hidden Markov Model (HMM). Given the joint probabilities of finite length strings of the process, we are interested in finding a finite state generative model to describe the entire process. In particular, we focus on two classes of models: HMMs and quasi-HMMs, which is a strictly larger class of models containing HMMs. In the main theorem, we show that if the random process is generated by an HMM of order less than or equal to k, and whose transition and observation probability matrices are in general position, namely almost everywhere on the parameter space, both the minimal quasi-HMM realization and the minimal HMM realization can be efficiently computed based on the joint probabilities of all the length N strings, for $N > 4\\lceil \\log_d(k) \\rceil + 1$. In this paper, we also aim to compare and connect the two lines of literature: realization theory of HMMs, and the recent development in learning latent variable models with tensor decomposition techniques.\n ', '\n We extend a relaxation technique due to Bertsimas and Nino-Mora for the restless bandit problem to the case where arbitrary costs penalize switching between the bandits. We also construct a one-step lookahead policy using the solution of the relaxation. Computational experiments and a bound for approximate dynamic programming provide some empirical support for the heuristic.\n ', '\n In this work, we propose a method for the compression of the coupling matrix in volume-surface integral equation (VSIE) formulations. VSIE methods are used for electromagnetic analysis in magnetic resonance imaging (MRI) applications, for which the coupling matrix models the interactions between the coil and the body. We showed that these effects can be represented as independent interactions between remote elements in 3D tensor formats, and subsequently decomposed with the Tucker model. Our method can work in tandem with the adaptive cross approximation technique to provide fast solutions of VSIE problems. We demonstrated that our compression approaches can enable the use of VSIE matrices of prohibitive memory requirements, by allowing the effective use of modern graphical processing units (GPUs) to accelerate the arising matrix-vector products. This is critical to enable numerical MRI simulations at clinical voxel resolutions in a feasible computation time. 
In this paper, we demonstrate that the VSIE matrix-vector products needed to calculate the electromagnetic field produced by an MRI coil inside a numerical body model with $1$ mm$^3$ voxel resolution could be performed in $\\sim 33$ seconds on a GPU, after compressing the associated coupling matrix from $\\sim 80$ TB to $\\sim 43$ MB.\n ', '\n Recent works have developed several methods of defending neural networks against adversarial attacks with certified guarantees. However, these techniques can be computationally costly due to the use of certification during training. We develop a new regularizer that is both more efficient than existing certified defenses, requiring only one additional forward propagation through a network, and can be used to train networks with similar certified accuracy. Through experiments on MNIST and CIFAR-10 we demonstrate improvements in training speed and comparable certified accuracy compared to state-of-the-art certified defenses.\n ', '\n This paper develops a predictive modeling algorithm, denoted as Real-Time Vector Fitting (RTVF), which is capable of approximating the real-time linearized dynamics of multi-input multi-output (MIMO) dynamical systems via rational transfer function matrices. Based on a generalization of the well-known Time-Domain Vector Fitting (TDVF) algorithm, RTVF is suitable for online modeling of dynamical systems which experience both initial-state decay contributions in the measured output signals and concurrently active input signals. These adaptations were specifically contrived to meet the needs currently present in the electrical power systems community, where real-time modeling of low frequency power system dynamics is becoming an increasingly coveted tool by power system operators. After introducing and validating the RTVF scheme on synthetic test cases, this paper presents a series of numerical tests on high-order closed-loop generator systems in the IEEE 39-bus test system.\n ', '\n DC microgrids are prone to small-signal instabilities due to the presence of tightly regulated loads. This paper develops a decentralized stability certificate which is capable of certifying the small-signal stability of an islanded DC network containing such loads. Utilizing a novel homotopy approach, the proposed standards ensure that no system eigenmodes are able to cross into the unstable right half plane for a continuous range of controller gain levels. The resulting "standards" can be applied to a variety of grid components which meet the specified, but non-unique, criteria. These standards thus take a step towards offering plug-and-play operability of DC microgrids. The proposed theorems are explicitly illustrated and numerically validated on multiple DC microgrid test-cases containing both buck and boost converter dynamics.\n ', '\n This paper applies a custom model order reduction technique to the distribution grid state estimation problem. Specifically, the method targets the situation where, due to pseudo-measurement uncertainty, it is advantageous to run the state estimation solver potentially thousands of times over sampled input perturbations in order to compute probabilistic bounds on the underlying system state. This routine, termed the Accelerated Probabilistic State Estimator (APSE), efficiently searches for the solutions of sequential state estimation problems in a low dimensional subspace with a reduced order model (ROM). 
When a sufficiently accurate solution is not found, the APSE reverts to a conventional QR factorization-based Gauss-Newton solver. It then uses the resulting solution to perform a simple basis expansion of the low-dimensional subspace, thus improving the reduced model solver. Simulated test results, collected from the unbalanced three-phase 8500-node distribution grid, show the resulting algorithm to be almost an order of magnitude faster than a comparable full-order Gauss-Newton solver and thus potentially fast enough for real-time use.\n ', '\n This paper develops a computationally efficient algorithm which speeds up the probabilistic power flow (PPF) problem by exploiting the inherently low-rank nature of the voltage profile in electrical power distribution networks. The algorithm is accordingly termed the Accelerated-PPF (APPF), since it can accelerate "any" sampling-based PPF solver. As the APPF runs, it concurrently generates a low-dimensional subspace of orthonormalized solution vectors. This subspace is used to construct and update a reduced order model (ROM) of the full nonlinear system, resulting in a highly efficient simulation for future voltage profiles. When constructing and updating the subspace, the power flow problem must still be solved on the full nonlinear system. In order to accelerate the computation of these solutions, a Neumann expansion of a modified power flow Jacobian is implemented. Applicable when load bus injections are small, this Neumann expansion allows for a considerable speed up of Jacobian system solves during the standard Newton iterations. APPF test results, from experiments run on the full IEEE 8500-node test feeder, are finally presented.\n ', '\n Randomized smoothing is a recently proposed defense against adversarial attacks that has achieved SOTA provable robustness against $\\ell_2$ perturbations. A number of publications have extended the guarantees to other metrics, such as $\\ell_1$ or $\\ell_\\infty$, by using different smoothing measures. Although the current framework has been shown to yield near-optimal $\\ell_p$ radii, the total safety region certified by the current framework can be arbitrarily small compared to the optimal. In this work, we propose a framework to improve the certified safety region for these smoothed classifiers without changing the underlying smoothing scheme. The theoretical contributions are as follows: 1) We generalize the certification for randomized smoothing by reformulating certified radius calculation as a nested optimization problem over a class of functions. 2) We provide a method to calculate the certified safety region using $0^{th}$-order and $1^{st}$-order information for Gaussian-smoothed classifiers. We also provide a framework that generalizes the calculation for certification using higher-order information. 3) We design efficient, high-confidence estimators for the relevant statistics of the first-order information. Combining the theoretical contributions 2) and 3) allows us to certify safety regions that are significantly larger than the ones provided by the current methods. On CIFAR10 and Imagenet datasets, the new regions certified by our approach achieve significant improvements on general $\\ell_1$ certified radii and on the $\\ell_2$ certified radii for color-space attacks ($\\ell_2$ restricted to 1 channel) while also achieving smaller improvements on the general $\\ell_2$ certified radii. 
Our framework can also provide a way to circumvent the current impossibility results on achieving higher magnitude of certified radii without requiring the use of data-dependent smoothing techniques.\n ', '\n Deep neural networks, including reinforcement learning agents, have been proven vulnerable to small adversarial changes in the input, thus making deploying such networks in the real world problematic. In this paper, we propose RADIAL-RL, a method to train reinforcement learning agents with improved robustness against any $l_p$-bounded adversarial attack. By simply minimizing an upper bound of the loss functions under worst case adversarial perturbation derived from efficient robustness verification methods, we significantly improve robustness of RL-agents trained on Atari-2600 games and show that RADIAL-RL can beat state-of-the-art robust training algorithms when evaluated against PGD-attacks. We also propose a new evaluation method, Greedy Worst-Case Reward (GWC), for measuring attack agnostic robustness of RL agents. GWC can be evaluated efficiently and it serves as a good estimate of the reward under the worst possible sequence of adversarial attacks; in particular, GWC accounts for the importance of each action and their temporal dependency, improving upon previous approaches that only evaluate whether each single action can change under input perturbations. Our code is available at https://github.com/tuomaso/radial_rl.\n ', '\n Recent works have empirically shown that there exist adversarial examples that can be hidden from neural network interpretability (namely, making network interpretation maps visually similar), or interpretability is itself susceptible to adversarial attacks. In this paper, we theoretically show that with a proper measurement of interpretation, it is actually difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy, as confirmed by experiments on MNIST, CIFAR-10 and Restricted ImageNet. Spurred by that, we develop an interpretability-aware defensive scheme built only on promoting robust interpretation (without the need for resorting to adversarial loss minimization). We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation in particular.\n ', "\n Objective: Global Maxwell Tomography (GMT) is a recently introduced volumetric technique for noninvasive estimation of electrical properties (EP) from magnetic resonance measurements. Previous work evaluated GMT using ideal radiofrequency (RF) excitations. The aim of this simulation study was to assess GMT performance with a realistic RF coil. Methods: We designed a transmit-receive RF coil with $8$ decoupled channels for $7$T head imaging. We calculated the RF transmit field ($B_1^+$) inside heterogeneous head models for different RF shimming approaches, and used them as input for GMT to reconstruct EP for all voxels. Results: Coil tuning/decoupling remained relatively stable when the coil was loaded with different head models. Mean error in EP estimation changed from $7.5\\%$ to $9.5\\%$ and from $4.8\\%$ to $7.2\\%$ for relative permittivity and conductivity, respectively, when changing head model without re-tuning the coil. Results slightly improved when an SVD-based RF shimming algorithm was applied, in place of excitation with one coil at a time. 
Despite errors in EP, RF transmit field ($B_1^+$) and absorbed power could be predicted with less than $0.5\\%$ error over the entire head. GMT could accurately detect a numerically inserted tumor. Conclusion: This work demonstrates that GMT can reliably reconstruct EP in realistic simulated scenarios using a tailored 8-channel RF coil design at $7$T. Future work will focus on construction of the coil and optimization of GMT's robustness to noise, to enable in-vivo GMT experiments. Significance: GMT could provide accurate estimations of tissue EP, which could be used as biomarkers and could enable patient-specific estimation of RF power deposition, which is an unsolved problem for ultra-high-field magnetic resonance imaging.\n ", '\n The fragility of modern machine learning models has drawn a considerable amount of attention from both academia and the public. While immense interests were in either crafting adversarial attacks as a way to measure the robustness of neural networks or devising worst-case analytical robustness verification with guarantees, few methods could enjoy both scalability and robustness guarantees at the same time. As an alternative to these attempts, randomized smoothing adopts a different prediction rule that enables statistical robustness arguments which easily scale to large networks. However, in this paper, we point out the side effects of current randomized smoothing workflows. Specifically, we articulate and prove two major points: 1) the decision boundaries of smoothed classifiers will shrink, resulting in disparity in class-wise accuracy; 2) applying noise augmentation in the training process does not necessarily resolve the shrinking issue due to the inconsistent learning objectives.\n ', '\n Verifying robustness of neural networks given a specified threat model is a fundamental yet challenging task. While current verification methods mainly focus on the $\\ell_p$-norm threat model of the input instances, robustness verification against semantic adversarial attacks inducing large $\\ell_p$-norm perturbations, such as color shifting and lighting adjustment, are beyond their capacity. To bridge this gap, we propose \\textit{Semantify-NN}, a model-agnostic and generic robustness verification approach against semantic perturbations for neural networks. By simply inserting our proposed \\textit{semantic perturbation layers} (SP-layers) to the input layer of any given model, \\textit{Semantify-NN} is model-agnostic, and any $\\ell_p$-norm based verification tools can be used to verify the model robustness against semantic perturbations. We illustrate the principles of designing the SP-layers and provide examples including semantic perturbations to image classification in the space of hue, saturation, lightness, brightness, contrast and rotation, respectively. In addition, an efficient refinement technique is proposed to further significantly improve the semantic certificate. Experiments on various network architectures and different datasets demonstrate the superior verification performance of \\textit{Semantify-NN} over $\\ell_p$-norm-based verification frameworks that naively convert semantic perturbation to $\\ell_p$-norm. The results show that \\textit{Semantify-NN} can support robustness verification against a wide range of semantic perturbations.\n Code available https://github.com/JeetMo/Semantify-NN\n ', '\n The rapid growth of deep learning applications in real life is accompanied by severe safety concerns. 
To mitigate this uneasy phenomenon, much research has been done providing reliable evaluations of the fragility level in different deep neural networks. Apart from devising adversarial attacks, quantifiers that certify safeguarded regions have also been designed in the past five years. The summarizing work of Salman et al. unifies a family of existing verifiers under a convex relaxation framework. We draw inspiration from such work and further demonstrate the optimality of deterministic CROWN (Zhang et al. 2018) solutions in a given linear programming problem under mild constraints. Given this theoretical result, the computationally expensive linear programming based method is shown to be unnecessary. We then propose an optimization-based approach \\textit{FROWN} (\\textbf{F}astened C\\textbf{ROWN}): a general algorithm to tighten robustness certificates for neural networks. Extensive experiments on various networks trained individually verify the effectiveness of FROWN in safeguarding larger robust regions.\n ', '\n Deep neural networks are known to be fragile to small adversarial perturbations. This issue becomes more critical when a neural network is interconnected with a physical system in a closed loop. In this paper, we show how to combine recent works on neural network certification tools (which are mainly used in static settings such as image classification) with robust control theory to certify a neural network policy in a control loop. Specifically, we give a sufficient condition and an algorithm to ensure that the closed loop state and control constraints are satisfied when the persistent adversarial perturbation is l-infinity norm bounded. Our method is based on finding a positively invariant set of the closed loop dynamical system, and thus we do not require the differentiability or the continuity of the neural network policy. Along with the verification result, we also develop an effective attack strategy for neural network control systems that outperforms exhaustive Monte-Carlo search significantly. We show that our certification algorithm works well on learned models and achieves 5 times better result than the traditional Lipschitz-based method to certify the robustness of a neural network policy on a cart pole control problem.\n ', '\n The vulnerability to adversarial attacks has been a critical issue for deep neural networks. Addressing this issue requires a reliable way to evaluate the robustness of a network. Recently, several methods have been developed to compute $\\textit{robustness quantification}$ for neural networks, namely, certified lower bounds of the minimum adversarial perturbation. Such methods, however, were devised for feed-forward networks, e.g. multi-layer perceptron or convolutional networks. It remains an open problem to quantify robustness for recurrent networks, especially LSTM and GRU. For such networks, there exist additional challenges in computing the robustness quantification, such as handling the inputs at multiple steps and the interaction between gates and states. In this work, we propose $\\textit{POPQORN}$ ($\\textbf{P}$ropagated-$\\textbf{o}$ut$\\textbf{p}$ut $\\textbf{Q}$uantified R$\\textbf{o}$bustness for $\\textbf{RN}$Ns), a general algorithm to quantify robustness of RNNs, including vanilla RNNs, LSTMs, and GRUs. 
We demonstrate its effectiveness on different network architectures and show that the robustness quantification on individual steps can lead to new insights.\n ', "\n With deep neural networks providing state-of-the-art machine learning models for numerous machine learning tasks, quantifying the robustness of these models has become an important area of research. However, most of the research literature merely focuses on the \\textit{worst-case} setting where the input of the neural network is perturbed with noises that are constrained within an $\\ell_p$ ball; and several algorithms have been proposed to compute certified lower bounds of minimum adversarial distortion based on such worst-case analysis. In this paper, we address these limitations and extend the approach to a \\textit{probabilistic} setting where the additive noises can follow a given distributional characterization. We propose a novel probabilistic framework PROVEN to PRObabilistically VErify Neural networks with statistical guarantees -- i.e., PROVEN certifies the probability that the classifier's top-1 prediction cannot be altered under any constrained $\\ell_p$ norm perturbation to a given input. Importantly, we show that it is possible to derive closed-form probabilistic certificates based on current state-of-the-art neural network robustness verification frameworks. Hence, the probabilistic certificates provided by PROVEN come naturally and with almost no overhead when obtaining the worst-case certified lower bounds from existing methods such as Fast-Lin, CROWN and CNN-Cert. Experiments on small and large MNIST and CIFAR neural network models demonstrate our probabilistic approach can achieve up to around $75\\%$ improvement in the robustness certification with at least a $99.99\\%$ confidence compared with the worst-case robustness certificate delivered by CROWN.\n ", '\n Verifying robustness of neural network classifiers has attracted great interests and attention due to the success of deep neural networks and their unexpected vulnerability to adversarial perturbations. Although finding minimum adversarial distortion of neural networks (with ReLU activations) has been shown to be an NP-complete problem, obtaining a non-trivial lower bound of minimum distortion as a provable robustness guarantee is possible. However, most previous works only focused on simple fully-connected layers (multilayer perceptrons) and were limited to ReLU activations. This motivates us to propose a general and efficient framework, CNN-Cert, that is capable of certifying robustness on general convolutional neural networks. Our framework is general -- we can handle various architectures including convolutional layers, max-pooling layers, batch normalization layer, residual blocks, as well as general activation functions; our approach is efficient -- by exploiting the special structure of convolutional layers, we achieve up to 17 and 11 times of speed-up compared to the state-of-the-art certification algorithms (e.g. Fast-Lin, CROWN) and 366 times of speed-up compared to the dual-LP approach while our algorithm obtains similar or even better verification bounds. In addition, CNN-Cert generalizes state-of-the-art algorithms e.g. Fast-Lin and CROWN. 
We demonstrate by extensive experiments that our method outperforms state-of-the-art lower-bound-based certification algorithms in terms of both bound quality and speed.\n ', '\n Finding minimum distortion of adversarial examples and thus certifying robustness in neural network classifiers for given data points is known to be a challenging problem. Nevertheless, recently it has been shown to be possible to give a non-trivial certified lower bound of minimum adversarial distortion, and some recent progress has been made towards this direction by exploiting the piece-wise linear nature of ReLU activations. However, a generic robustness certification for general activation functions still remains largely unexplored. To address this issue, in this paper we introduce CROWN, a general framework to certify robustness of neural networks with general activation functions for given input data points. The novelty in our algorithm consists of bounding a given activation function with linear and quadratic functions, hence allowing it to tackle general activation functions including but not limited to four popular choices: ReLU, tanh, sigmoid and arctan. In addition, we facilitate the search for a tighter certified lower bound by adaptively selecting appropriate surrogates for each neuron activation. Experimental results show that CROWN on ReLU networks can notably improve the certified lower bounds compared to the current state-of-the-art algorithm Fast-Lin, while having comparable computational efficiency. Furthermore, CROWN also demonstrates its effectiveness and flexibility on networks with general activation functions, including tanh, sigmoid and arctan.\n ', '\n CLEVER (Cross-Lipschitz Extreme Value for nEtwork Robustness) is an Extreme Value Theory (EVT) based robustness score for large-scale deep neural networks (DNNs). In this paper, we propose two extensions on this robustness score. First, we provide a new formal robustness guarantee for classifier functions that are twice differentiable. We apply extreme value theory on the new formal robustness guarantee and the estimated robustness is called second-order CLEVER score. Second, we discuss how to handle gradient masking, a common defensive technique, using CLEVER with Backward Pass Differentiable Approximation (BPDA). With BPDA applied, CLEVER can evaluate the intrinsic robustness of neural networks of a broader class -- networks with non-differentiable input transformations. We demonstrate the effectiveness of CLEVER with BPDA in experiments on a 121-layer Densenet model trained on the ImageNet dataset.\n ', '\n Verifying the robustness property of a general Rectified Linear Unit (ReLU) network is an NP-complete problem [Katz, Barrett, Dill, Julian and Kochenderfer CAV17]. Although finding the exact minimum adversarial distortion is hard, giving a certified lower bound of the minimum distortion is possible. Current available methods of computing such a bound are either time-consuming or delivering low quality bounds that are too loose to be useful. In this paper, we exploit the special structure of ReLU networks and provide two computationally efficient algorithms Fast-Lin and Fast-Lip that are able to certify non-trivial lower bounds of minimum distortions, by bounding the ReLU units with appropriate linear functions Fast-Lin, or by bounding the local Lipschitz constant Fast-Lip. 
Experiments show that (1) our proposed methods deliver bounds close to (the gap is 2-3X) exact minimum distortion found by Reluplex in small MNIST networks while our algorithms are more than 10,000 times faster; (2) our methods deliver similar quality of bounds (the gap is within 35% and usually around 10%; sometimes our bounds are even better) for larger networks compared to the methods based on solving linear programming problems but our algorithms are 33-14,000 times faster; (3) our method is capable of solving large MNIST and CIFAR networks up to 7 layers with more than 10,000 neurons within tens of seconds on a single CPU core.\n In addition, we show that, in fact, there is no polynomial time algorithm that can approximately find the minimum $\\ell_1$ adversarial distortion of a ReLU network with a $0.99\\ln n$ approximation ratio unless $\\mathsf{NP}$=$\\mathsf{P}$, where $n$ is the number of neurons in the network.\n ', '\n The robustness of neural networks to adversarial examples has received great attention due to security implications. Despite various attack approaches to crafting visually imperceptible adversarial examples, little has been developed towards a comprehensive measure of robustness. In this paper, we provide a theoretical justification for converting robustness analysis into a local Lipschitz constant estimation problem, and propose to use the Extreme Value Theory for efficient evaluation. Our analysis yields a novel robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. The proposed CLEVER score is attack-agnostic and computationally feasible for large neural networks. Experimental results on various networks, including ResNet, Inception-v3 and MobileNet, show that (i) CLEVER is aligned with the robustness indication measured by the $\\ell_2$ and $\\ell_\\infty$ norms of adversarial examples from powerful attacks, and (ii) defended networks using defensive distillation or bounded ReLU indeed achieve better CLEVER scores. To the best of our knowledge, CLEVER is the first attack-independent robustness metric that can be applied to any neural network classifier.\n ', '\n We propose a new algorithm for the computation of a singular value decomposition (SVD) low-rank approximation of a matrix in the Matrix Product Operator (MPO) format, also called the Tensor Train Matrix format. Our tensor network randomized SVD (TNrSVD) algorithm is an MPO implementation of the randomized SVD algorithm that is able to compute dominant singular values and their corresponding singular vectors. In contrast to the state-of-the-art tensor-based alternating least squares SVD (ALS-SVD) and modified alternating least squares SVD (MALS-SVD) matrix approximation methods, TNrSVD can be up to 17 times faster while achieving the same accuracy. In addition, our TNrSVD algorithm also produces accurate approximations in particular cases where both ALS-SVD and MALS-SVD fail to converge. We also propose a new algorithm for the fast conversion of a sparse matrix into its corresponding MPO form, which is up to 509 times faster than the standard Tensor Train SVD (TT-SVD) method while achieving machine precision accuracy. The efficiency and accuracy of both algorithms are demonstrated in numerical experiments.\n ', '\n Fabrication process variations are a major source of yield degradation in the nano-scale design of integrated circuits (IC), microelectromechanical systems (MEMS) and photonic circuits. 
Stochastic spectral methods are a promising technique to quantify the uncertainties caused by process variations. Despite their superior efficiency over Monte Carlo for many design cases, these algorithms suffer from the curse of dimensionality; i.e., their computational cost grows very fast as the number of random parameters increases. In order to solve this challenging problem, this paper presents a high-dimensional uncertainty quantification algorithm from a big-data perspective. Specifically, we show that the huge number of (e.g., $1.5 \\times 10^{27}$) simulation samples in standard stochastic collocation can be reduced to a very small one (e.g., $500$) by exploiting some hidden structures of a high-dimensional data array. This idea is formulated as a tensor recovery problem with sparse and low-rank constraints; and it is solved with an alternating minimization approach. Numerical results show that our approach can simulate efficiently some ICs, as well as MEMS and photonic problems with over 50 independent random parameters, whereas the traditional algorithm can only handle several random parameters.\n ', '\n Many critical EDA problems suffer from the curse of dimensionality, i.e. the very fast-scaling computational burden produced by large number of parameters and/or unknown variables. This phenomenon may be caused by multiple spatial or temporal factors (e.g. 3-D field solvers discretizations and multi-rate circuit simulation), nonlinearity of devices and circuits, large number of design or optimization parameters (e.g. full-chip routing/placement and circuit sizing), or extensive process variations (e.g. variability/reliability analysis and design for manufacturability). The computational challenges generated by such high dimensional problems are generally hard to handle efficiently with traditional EDA core algorithms that are based on matrix and vector computation. This paper presents "tensor computation" as an alternative general framework for the development of efficient EDA algorithms and tools. A tensor is a high-dimensional generalization of a matrix and a vector, and is a natural choice for both storing and solving efficiently high-dimensional EDA problems. This paper gives a basic tutorial on tensors, demonstrates some recent examples of EDA applications (e.g., nonlinear circuit modeling and high-dimensional uncertainty quantification), and suggests further open EDA problems where the use of tensor computation could be of advantage.\n ', '\n Stochastic spectral methods have become a popular technique to quantify the uncertainties of nano-scale devices and circuits. They are much more efficient than Monte Carlo for certain design cases with a small number of random parameters. However, their computational cost significantly increases as the number of random parameters increases. This paper presents a big-data approach to solve high-dimensional uncertainty quantification problems. Specifically, we simulate integrated circuits and MEMS at only a small number of quadrature samples, then, a huge number of (e.g., $1.5 \\times 10^{27}$) solution samples are estimated from the available small-size (e.g., $500$) solution samples via a low-rank and tensor-recovery method. 
Numerical results show that our algorithm can easily extend the applicability of tensor-product stochastic collocation to IC and MEMS problems with over 50 random parameters, whereas the traditional algorithm can only handle several random parameters.\n ', '\n This paper presents a tensor-recovery method to solve probabilistic power flow problems. Our approach generates a high-dimensional and sparse generalized polynomial-chaos expansion that provides useful statistical information. The result can also speed up other essential routines in power systems (e.g., stochastic planning, operations and controls).\n Instead of simulating a power flow equation at all quadrature points, our approach only simulates an extremely small subset of samples. We suggest a model to exploit the underlying low-rank and sparse structure of high-dimensional simulation data arrays, making our technique applicable to power systems with many random parameters. We also present a numerical method to solve the resulting nonlinear optimization problem.\n Our algorithm is implemented in MATLAB and is verified by several benchmarks in MATPOWER $5.1$. Accurate results are obtained for power systems with up to $50$ independent random parameters, with a speedup factor up to $9\\times 10^{20}$.\n ', '\n Uncertainties have become a major concern in integrated circuit design. In order to avoid the huge number of repeated simulations in conventional Monte Carlo flows, this paper presents an intrusive spectral simulator for statistical circuit analysis. Our simulator employs the recently developed generalized polynomial chaos expansion to perform uncertainty quantification of nonlinear transistor circuits with both Gaussian and non-Gaussian random parameters. We modify the nonintrusive stochastic collocation (SC) method and develop an intrusive variant called stochastic testing (ST) method to accelerate the numerical simulation. Compared with the stochastic Galerkin (SG) method, the resulting coupled deterministic equations from our proposed ST method can be solved in a decoupled manner at each time point. At the same time, ST uses fewer samples and allows more flexible time step size controls than directly using a nonintrusive SC solver. These two properties make ST more efficient than SG and than existing SC methods, and more suitable for time-domain circuit simulation. Simulation results of several digital, analog and RF circuits are reported. Since our algorithm is based on generic mathematical models, the proposed ST algorithm can be applied to many other engineering problems.\n ', '\n Stochastic spectral methods are efficient techniques for uncertainty quantification. Recently they have shown excellent performance in the statistical analysis of integrated circuits. In stochastic spectral methods, one needs to determine a set of orthonormal polynomials and a proper numerical quadrature rule. The former are used as the basis functions in a generalized polynomial chaos expansion. The latter is used to compute the integrals involved in stochastic spectral methods. Obtaining such information requires knowing the density function of the random input {\\it a-priori}. However, individual system components are often described by surrogate models rather than density functions. In order to apply stochastic spectral methods in hierarchical uncertainty quantification, we first propose to construct physically consistent closed-form density functions by two monotone interpolation schemes. 
Then, by exploiting the special forms of the obtained density functions, we determine the generalized polynomial-chaos basis functions and the Gauss quadrature rules that are required by a stochastic spectral simulator. The effectiveness of our proposed algorithm is verified by both synthetic and practical circuit examples.\n ', '\n This brief paper proposes an uncertainty quantification method for the periodic steady-state (PSS) analysis with both Gaussian and non-Gaussian variations. Our stochastic testing formulation for the PSS problem provides superior efficiency over both Monte Carlo methods and existing spectral methods. The numerical implementation of a stochastic shooting Newton solver is presented for both forced and autonomous circuits. Simulation results on some analog/RF circuits are reported to show the effectiveness of our proposed algorithms.\n ', '\n Due to significant manufacturing process variations, the performance of integrated circuits (ICs) has become increasingly uncertain. Such uncertainties must be carefully quantified with efficient stochastic circuit simulators. This paper discusses the recent advances of stochastic spectral circuit simulators based on generalized polynomial chaos (gPC). Such techniques can handle both Gaussian and non-Gaussian random parameters, showing remarkable speedup over Monte Carlo for circuits with a small or medium number of parameters. We focus on the recently developed stochastic testing and the application of conventional stochastic Galerkin and stochastic collocation schemes to nonlinear circuit problems. The uncertainty quantification algorithms for static, transient and periodic steady-state simulations are presented along with some practical simulation results. Some open problems in this field are discussed.\n ', "\n Process variations are a major concern in today's chip design since they can significantly degrade chip performance. To predict such degradation, existing circuit and MEMS simulators rely on Monte Carlo algorithms, which are typically too slow. Therefore, novel fast stochastic simulators are highly desired. This paper first reviews our recently developed stochastic testing simulator that can achieve speedup factors of hundreds to thousands over Monte Carlo. Then, we develop a fast hierarchical stochastic spectral simulator to simulate a complex circuit or system consisting of several blocks. We further present a fast simulation approach based on anchored ANOVA (analysis of variance) for some design problems with many process variations. This approach can reduce the simulation cost and can identify which variation sources have strong impacts on the circuit's performance. The simulation results of some circuit and MEMS examples are reported to show the effectiveness of our simulator\n ", '\n Hierarchical uncertainty quantification can reduce the computational cost of stochastic circuit simulation by employing spectral methods at different levels. This paper presents an efficient framework to simulate hierarchically some challenging stochastic circuits/systems that include high-dimensional subsystems. Due to the high parameter dimensionality, it is challenging to both extract surrogate models at the low level of the design hierarchy and to handle them in the high-level simulation. In this paper, we develop an efficient ANOVA-based stochastic circuit/MEMS simulator to extract efficiently the surrogate models at the low level. 
In order to avoid the curse of dimensionality, we employ tensor-train decomposition at the high level to construct the basis functions and Gauss quadrature points. As a demonstration, we verify our algorithm on a stochastic oscillator with four MEMS capacitors and 184 random parameters. This challenging example is simulated efficiently by our simulator at the cost of only 10 minutes in MATLAB on a regular personal computer.\n ', "\n Machine learning has developed a variety of tools for learning and representing high-dimensional distributions with structure. Recent years have also seen big advances in designing multi-item mechanisms. Akin to overfitting, however, these mechanisms can be extremely sensitive to the Bayesian prior that they target, which becomes problematic when that prior is only approximately known. We consider a multi-item mechanism design problem where the bidders' value distributions can be approximated by a topic model. Our solution builds on a recent robustification framework by Brustle et al., which disentangles the statistical challenge of estimating a multi-dimensional prior from the task of designing a good mechanism for it, robustifying the performance of the latter against the estimation error of the former. We provide an extension of the framework that allows us to exploit the expressive power of topic models to reduce the effective dimensionality of the mechanism design problem.\n ", '\n We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\\rm poly}(\\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\\rm poly}(\\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Peng, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\\tilde{O}\\left(\\frac 1T\\right)$.\n ', '\n We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where e.g. these observations are collected on a spatial domain, a temporal domain, or a social network, which induce dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow these dependencies to be substantial, i.e., do not assume that the Markov Random Field capturing these dependencies is in high temperature. As our main contribution we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds in logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e. external fields and interaction strengths) of Ising models from a {\\em single} sample. 
{We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies, across three text classification datasets: Cora, Citeseer and Pubmed.}\n ', '\n We show a statistical version of Taylor\'s theorem and apply this result to non-parametric density estimation from truncated samples, which is a classical challenge in Statistics \\cite{woodroofe1985estimating, stute1993almost}. The single-dimensional version of our theorem has the following implication: "For any distribution $P$ on $[0, 1]$ with a smooth log-density function, given samples from the conditional distribution of $P$ on $[a, a + \\varepsilon] \\subset [0, 1]$, we can efficiently identify an approximation to $P$ over the \\emph{whole} interval $[0, 1]$, with quality of approximation that improves with the smoothness of $P$."\n To the best of knowledge, our result is the first in the area of non-parametric density estimation from truncated samples, which works under the hard truncation model, where the samples outside some survival set $S$ are never observed, and applies to multiple dimensions. In contrast, previous works assume single dimensional data where each sample has a different survival set $S$ so that samples from the whole support will ultimately be collected.\n ', '\n We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.\n ', '\n The use of min-max optimization in adversarial training of deep neural network classifiers and training of generative adversarial networks has motivated the study of nonconvex-nonconcave optimization objectives, which frequently arise in these applications. Unfortunately, recent results have established that even approximate first-order stationary points of such objectives are intractable, even under smoothness conditions, motivating the study of min-max objectives with additional structure. We introduce a new class of structured nonconvex-nonconcave min-max optimization problems, proposing a generalization of the extragradient algorithm which provably converges to a stationary point. The algorithm applies not only to Euclidean spaces, but also to general $\\ell_p$-normed finite-dimensional real vector spaces. We also discuss its stability under stochastic oracles and provide bounds on its sample complexity. 
Our iteration complexity and sample complexity bounds either match or improve the best known bounds for the same or less general nonconvex-nonconcave settings, such as those that satisfy variational coherence or in which a weak solution to the associated variational inequality problem is assumed to exist.\n ', '\n We show that $n$-variable tree-structured Ising models can be learned computationally-efficiently to within total variation distance $ε$ from an optimal $O(n \\ln n/ε^2)$ samples, where $O(\\cdot)$ hides an absolute constant which, importantly, does not depend on the model being learned - neither its tree nor the magnitude of its edge strengths, on which we place no assumptions. Our guarantees hold, in fact, for the celebrated Chow-Liu [1968] algorithm, using the plug-in estimator for estimating mutual information. While this (or any other) algorithm may fail to identify the structure of the underlying model correctly from a finite sample, we show that it will still learn a tree-structured model that is $ε$-close to the true one in total variation distance, a guarantee called "proper learning."\n Our guarantees do not follow from known results for the Chow-Liu algorithm and the ensuing literature on learning graphical models, including a recent renaissance of algorithms on this learning challenge, which only yield asymptotic consistency results, or sample-inefficient and/or time-inefficient algorithms, unless further assumptions are placed on the graphical model, such as bounds on the "strengths" of the model\'s edges/hyperedges. While we establish guarantees for a widely known and simple algorithm, the analysis that this algorithm succeeds and is sample-optimal is quite complex, requiring a hierarchical classification of the edges into layers with different reconstruction guarantees, depending on their strength, combined with delicate uses of the subadditivity of the squared Hellinger distance over graphical models to control the error accumulation.\n ', '\n We study the question of obtaining last-iterate convergence rates for no-regret learning algorithms in multi-player games. We show that the optimistic gradient (OG) algorithm with a constant step-size, which is no-regret, achieves a last-iterate rate of $O(1/\\sqrt{T})$ with respect to the gap function in smooth monotone games. This result addresses a question of Mertikopoulos & Zhou (2018), who asked whether extra-gradient approaches (such as OG) can be applied to achieve improved guarantees in the multi-agent learning setting. The proof of our upper bound uses a new technique centered around an adaptive choice of potential function at each iteration. We also show that the $O(1/\\sqrt{T})$ rate is tight for all $p$-SCLI algorithms, which includes OG as a special case. As a byproduct of our lower bound analysis we additionally present a proof of a conjecture of Arjevani et al. (2015) which is more direct than previous approaches.\n ', '\n We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = w^T x + ε$ and its corresponding vector of covariates $x \\in R^k$ are only revealed if the dependent variable falls in some subset $S \\subseteq R$; otherwise the existence of the pair $(x, y)$ is hidden. 
This problem has remained a challenge since the early works of [Tobin 1958, Amemiya 1973, Hausman and Wise 1977], its applications are abundant, and its history dates back even further to the work of Galton, Pearson, Lee, and Fisher. While consistent estimators of the regression coefficients have been identified, the error rates are not well-understood, especially in high dimensions.\n Under a thickness assumption about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vector $w$ from $n$ revealed samples that attains $l_2$ error $\\tilde{O}(\\sqrt{k/n})$. Our estimator uses Projected Stochastic Gradient Descent (PSGD) without replacement on the negative log-likelihood of the truncated sample. For the statistically efficient estimation we only need oracle access to the set $S$.In order to achieve computational efficiency we need to assume that $S$ is a union of a finite number of intervals but still can be complicated. PSGD without replacement must be restricted to an appropriately defined convex cone to guarantee that the negative log-likelihood is strongly convex, which in turn is established using concentration of matrices on variables with sub-exponential tails. We perform experiments on simulated data to illustrate the accuracy of our estimator.\n As a corollary, we show that SGD learns the parameters of single-layer neural networks with noisy activation functions.\n ', '\n Despite its important applications in Machine Learning, min-max optimization of nonconvex-nonconcave objectives remains elusive. Not only are there no known first-order methods converging even to approximate local min-max points, but the computational complexity of identifying them is also poorly understood. In this paper, we provide a characterization of the computational complexity of the problem, as well as of the limitations of first-order methods in constrained min-max optimization problems with nonconvex-nonconcave objectives and linear constraints.\n As a warm-up, we show that, even when the objective is a Lipschitz and smooth differentiable function, deciding whether a min-max point exists, in fact even deciding whether an approximate min-max point exists, is NP-hard. More importantly, we show that an approximate local min-max point of large enough approximation is guaranteed to exist, but finding one such point is PPAD-complete. The same is true of computing an approximate fixed point of Gradient Descent/Ascent.\n An important byproduct of our proof is to establish an unconditional hardness result in the Nemirovsky-Yudin model. We show that, given oracle access to some function $f : P \\to [-1, 1]$ and its gradient $\\nabla f$, where $P \\subseteq [0, 1]^d$ is a known convex polytope, every algorithm that finds a $\\varepsilon$-approximate local min-max point needs to make a number of queries that is exponential in at least one of $1/\\varepsilon$, $L$, $G$, or $d$, where $L$ and $G$ are respectively the smoothness and Lipschitzness of $f$ and $d$ is the dimension. This comes in sharp contrast to minimization problems, where finding approximate local minima in the same setting can be done with Projected Gradient Descent using $O(L/\\varepsilon)$ many queries. 
Our result is the first to show an exponential separation between these two fundamental optimization problems.\n ', '\n We propose a new method for inferring the governing stochastic ordinary differential equations (SODEs) by observing particle ensembles at discrete and sparse time instants, i.e., multiple "snapshots". Particle coordinates at a single time instant, possibly noisy or truncated, are recorded in each snapshot but are unpaired across the snapshots. By training a physics-informed generative model that generates "fake" sample paths, we aim to fit the observed particle ensemble distributions with a curve in the probability measure space, which is induced from the inferred particle dynamics. We employ different metrics to quantify the differences between distributions, e.g., the sliced Wasserstein distances and the adversarial losses in generative adversarial networks (GANs). We refer to this method as generative "ensemble-regression" (GER), in analogy to the classic "point-regression", where we infer the dynamics by performing regression in the Euclidean space. We illustrate the GER by learning the drift and diffusion terms of particle ensembles governed by SODEs with Brownian motions and Levy processes up to 100 dimensions. We also discuss how to treat cases with noisy or truncated observations. Apart from systems consisting of independent particles, we also tackle nonlocal interacting particle systems with unknown interaction potential parameters by constructing a physics-informed loss function. Finally, we investigate scenarios of paired observations and discuss how to reduce the dimensionality in such cases by proving a convergence theorem that provides theoretical support.\n ', '\n As in standard linear regression, in truncated linear regression, we are given access to observations $(A_i, y_i)_i$ whose dependent variable equals $y_i= A_i^{\\rm T} \\cdot x^* + η_i$, where $x^*$ is some fixed unknown vector of interest and $η_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some "truncation set" $S \\subset \\mathbb{R}$. The goal is to recover $x^*$ under some favorable conditions on the $A_i$\'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples, which attains an optimal $\\ell_2$ reconstruction error of $O(\\sqrt{(k \\log n)/m})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) [Daskalakis et al. 2018], where no regularization is needed due to the low-dimensionality of the data, and (2) [Wainright 2009], where the objective function is simple due to the absence of truncation. 
In order to deal with both truncation and high-dimensionality at the same time, we develop new techniques that not only generalize the existing ones but we believe are of independent interest.\n ', '\n Generative neural networks have been empirically found very promising in providing effective structural priors for compressed sensing, since they can be trained to span low-dimensional data manifolds in high-dimensional signal spaces. Despite the non-convexity of the resulting optimization problem, it has also been shown theoretically that, for neural networks with random Gaussian weights, a signal in the range of the network can be efficiently, approximately recovered from a few noisy measurements. However, a major bottleneck of these theoretical guarantees is a network expansivity condition: that each layer of the neural network must be larger than the previous by a logarithmic factor. Our main contribution is to break this strong expansivity assumption, showing that constant expansivity suffices to get efficient recovery algorithms, besides it also being information-theoretically necessary. To overcome the theoretical bottleneck in existing approaches we prove a novel uniform concentration theorem for random functions that might not be Lipschitz but satisfy a relaxed notion which we call "pseudo-Lipschitzness." Using this theorem we can show that a matrix concentration inequality known as the Weight Distribution Condition (WDC), which was previously only known to hold for Gaussian matrices with logarithmic aspect ratio, in fact holds for constant aspect ratios too. Since the WDC is a fundamental matrix concentration inequality in the heart of all existing theoretical guarantees on this problem, our tighter bound immediately yields improvements in all known results in the literature on compressed sensing with deep generative priors, including one-bit recovery, phase retrieval, low-rank matrix recovery, and more.\n ', "\n There have been two separate lines of work on estimating Ising models: (1) estimating them from multiple independent samples under minimal assumptions about the model's interaction matrix; and (2) estimating them from one sample in restrictive settings. We propose a unified framework that smoothly interpolates between these two settings, enabling significantly richer estimation guarantees from one, a few, or many samples.\n Our main theorem provides guarantees for one-sample estimation, quantifying the estimation error in terms of the metric entropy of a family of interaction matrices. As corollaries of our main theorem, we derive bounds when the model's interaction matrix is a (sparse) linear combination of known matrices, or it belongs to a finite set, or to a high-dimensional manifold. In fact, our main result handles multiple independent samples by viewing them as one sample from a larger model, and can be used to derive estimation bounds that are qualitatively similar to those obtained in the afore-described multiple-sample literature. Our technical approach benefits from sparsifying a model's interaction network, conditioning on subsets of variables that make the dependencies in the resulting conditional distribution sufficiently weak. 
We use this sparsification technique to prove strong concentration and anti-concentration results for the Ising model, which we believe have applications beyond the scope of this paper.\n ", '\n We provide approximation guarantees for a linear-time inferential framework for Gaussian processes, using two low-rank kernel approximations based on random Fourier features and truncation of Mercer expansions. In particular, we bound the Kullback-Leibler divergence between the idealized Gaussian process and the one resulting from a low-rank approximation to its kernel. Additionally, we present strong evidence that these two approximations, enhanced by an initial automatic feature extraction through deep neural networks, outperform a broad range of state-of-the-art methods in terms of time efficiency, negative log-predictive density, and root mean squared error.\n ', '\n Spin glass models, such as the Sherrington-Kirkpatrick, Hopfield and Ising models, are all well-studied members of the exponential family of discrete distributions, and have been influential in a number of application domains where they are used to model correlation phenomena on networks. Conventionally these models have quadratic sufficient statistics and consequently capture correlations arising from pairwise interactions. In this work we study extensions of these to models with higher-order sufficient statistics, modeling behavior on a social network with peer-group effects. In particular, we model binary outcomes on a network as a higher-order spin glass, where the behavior of an individual depends on a linear function of their own vector of covariates and some polynomial function of the behavior of others, capturing peer-group effects. Using a {\\em single}, high-dimensional sample from such model our goal is to recover the coefficients of the linear function as well as the strength of the peer-group effects. The heart of our result is a novel approach for showing strong concavity of the log pseudo-likelihood of the model, implying statistical error rate of $\\sqrt{d/n}$ for the Maximum Pseudo-Likelihood Estimator (MPLE), where $d$ is the dimensionality of the covariate vectors and $n$ is the size of the network (number of nodes). Our model generalizes vanilla logistic regression as well as the peer-effect models studied in recent works, and our results extend these results to accommodate higher-order interactions.\n ', '\n Generative Adversarial Networks (GANs) are modern methods to learn the underlying distribution of a data set. GANs have been widely used in sample synthesis, de-noising, domain transfer, etc. GANs, however, are designed in a model-free fashion where no additional information about the underlying distribution is available. In many applications, however, practitioners have access to the underlying independence graph of the variables, either as a Bayesian network or a Markov Random Field (MRF). We ask: how can one use this additional information in designing model-based GANs? In this paper, we provide theoretical foundations to answer this question by studying subadditivity properties of probability divergences, which establish upper bounds on the distance between two high-dimensional distributions by the sum of distances between their marginals over (local) neighborhoods of the graphical structure of the Bayes-net or the MRF. We prove that several popular probability divergences satisfy some notion of subadditivity under mild conditions. 
These results lead to a principled design of a model-based GAN that uses a set of simple discriminators on the neighborhoods of the Bayes-net/MRF, rather than a giant discriminator on the entire network, providing significant statistical and computational benefits. Our experiments on synthetic and real-world datasets demonstrate the benefits of our principled design of model-based GANs.\n ', '\n We identify the first static credible mechanism for multi-item additive auctions that achieves a constant factor of the optimal revenue. This is one instance of a more general framework for designing two-part tariff auctions, adapting the duality framework of Cai et al [CDW16]. Given a (not necessarily incentive compatible) auction format $A$ satisfying certain technical conditions, our framework augments the auction with a personalized entry fee for each bidder, which must be paid before the auction can be accessed. These entry fees depend only on the prior distribution of bidder types, and in particular are independent of realized bids. Our framework can be used with many common auction formats, such as simultaneous first-price, simultaneous second-price, and simultaneous all-pay auctions. If all-pay auctions are used, we prove that the resulting mechanism is credible in the sense that the auctioneer cannot benefit by deviating from the stated mechanism after observing agent bids. If second-price auctions are used, we obtain a truthful $O(1)$-approximate mechanism with fixed entry fees that are amenable to tuning via online learning techniques. Our results for first price and all-pay are the first revenue guarantees of non-truthful mechanisms in multi-dimensional environments; an open question in the literature [RST17].\n ', '\n In this paper we study the smooth convex-concave saddle point problem. Specifically, we analyze the last iterate convergence properties of the Extragradient (EG) algorithm. It is well known that the ergodic (averaged) iterates of EG converge at a rate of $O(1/T)$ (Nemirovski, 2004). In this paper, we show that the last iterate of EG converges at a rate of $O(1/\\sqrt{T})$. To the best of our knowledge, this is the first paper to provide a convergence rate guarantee for the last iterate of EG for the smooth convex-concave saddle point problem. Moreover, we show that this rate is tight by proving a lower bound of $Ω(1/\\sqrt{T})$ for the last iterate. This lower bound therefore shows a quadratic separation of the convergence rates of ergodic and last iterates in smooth convex-concave saddle point problems.\n ', "\n We study the sample complexity of learning revenue-optimal multi-item auctions. We obtain the first set of positive results that go beyond the standard but unrealistic setting of item-independence. In particular, we consider settings where bidders' valuations are drawn from correlated distributions that can be captured by Markov Random Fields or Bayesian Networks -- two of the most prominent graphical models. We establish parametrized sample complexity bounds for learning an up-to-$\\varepsilon$ optimal mechanism in both models, which scale polynomially in the size of the model, i.e.~the number of items and bidders, and only exponential in the natural complexity measure of the model, namely either the largest in-degree (for Bayesian Networks) or the size of the largest hyper-edge (for Markov Random Fields).\n We obtain our learnability results through a novel and modular framework that involves first proving a robustness theorem. 
We show that, given only ``approximate distributions'' for bidder valuations, we can learn a mechanism whose revenue is nearly optimal simultaneously for all ``true distributions'' that are close to the ones we were given in Prokhorov distance. Thus, to learn a good mechanism, it suffices to learn approximate distributions. When item values are independent, learning in Prokhorov distance is immediate, hence our framework directly implies the main result of Gonczarowski and Weinberg. When item values are sampled from more general graphical models, we combine our robustness theorem with novel sample complexity results for learning Markov Random Fields or Bayesian Networks in Prokhorov distance, which may be of independent interest. Finally, in the single-item case, our robustness result can be strengthened to hold under an even weaker distribution distance, the Lévy distance.\n ", '\n Generative adversarial networks (GANs) are a widely used framework for learning generative models. Wasserstein GANs (WGANs), one of the most successful variants of GANs, require solving a minmax optimization problem to global optimality, but are in practice successfully trained using stochastic gradient descent-ascent. In this paper, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity.\n ', "\n Statistical learning theory has largely focused on learning and generalization given independent and identically distributed (i.i.d.) samples. Motivated by applications involving time-series data, there has been a growing literature on learning and generalization in settings where data is sampled from an ergodic process. This work has also developed complexity measures, which appropriately extend the notion of Rademacher complexity to bound the generalization error and learning rates of hypothesis classes in this setting. Rather than time-series data, our work is motivated by settings where data is sampled on a network or a spatial domain, and thus do not fit well within the framework of prior work. We provide learning and generalization bounds for data that are complexly dependent, yet their distribution satisfies the standard Dobrushin's condition. Indeed, we show that the standard complexity measures of Gaussian and Rademacher complexities and VC dimension are sufficient measures of complexity for the purposes of bounding the generalization error and learning rates of hypothesis classes in our setting. Moreover, our generalization bounds only degrade by constant factors compared to their i.i.d. analogs, and our learnability bounds degrade by log factors in the size of the training set.\n ", '\n The standard linear and logistic regression models assume that the response variables are independent, but share the same linear relationship to their corresponding vectors of covariates. The assumption that the response variables are independent is, however, too strong. In many applications, these responses are collected on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects. Regression with dependent responses has thus received a lot of attention in the Statistics and Economics literature, but there are no strong consistency results unless multiple independent samples of the vectors of dependent responses can be collected from these models. 
We present computationally and statistically efficient methods for linear and logistic regression models when the response variables are dependent on a network. Given one sample from a networked linear or logistic regression model and under mild assumptions, we prove strong consistency results for recovering the vector of coefficients and the strength of the dependencies, recovering the rates of standard regression under independent observations. We use projected gradient descent on the negative log-likelihood, or negative log-pseudolikelihood, and establish their strong convexity and consistency using concentration of measure for dependent random variables.\n ', "\n Asynchronous Gibbs sampling has been recently shown to be fast-mixing and an accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin's condition~\\cite{DeSaOR16}. We investigate whether it can be used to accurately estimate expectations of functions of {\\em all the variables} of the model. Under the same condition, we show that the synchronous (sequential) and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by $O(τ\\log n),$ where $n$ is the number of variables in the graphical model, and $τ$ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance, can be estimated with a bias that grows logarithmically in $n$. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all variables in the model. Using recent concentration of measure results, we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. We perform experiments on a multi-processor machine to empirically illustrate our theoretical findings.\n ", "\n We analyze linear independence of rank one tensors produced by tensor powers of randomly perturbed vectors. This enables efficient decomposition of sums of high-order tensors. Our analysis builds upon [BCMV14] but allows for a wider range of perturbation models, including discrete ones. We give an application to recovering assemblies of neurons.\n Assemblies are large sets of neurons representing specific memories or concepts. The size of the intersection of two assemblies has been shown in experiments to represent the extent to which these memories co-occur or these concepts are related; the phenomenon is called association of assemblies. This suggests that an animal's memory is a complex web of associations, and poses the problem of recovering this representation from cognitive data. Motivated by this problem, we study the following more general question: Can we reconstruct the Venn diagram of a family of sets, given the sizes of their $\\ell$-wise intersections? We show that as long as the family of sets is randomly perturbed, it is enough for the number of measurements to be polynomially larger than the number of nonempty regions of the Venn diagram to fully reconstruct the diagram.\n ", '\n We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy the parameters of a multivariate normal distribution from truncated samples. 
Truncated samples from a $d$-variate normal ${\\cal N}(\\mathbf{μ},\\mathbf{Σ})$ mean that a sample is only revealed if it falls in some subset $S \\subseteq \\mathbb{R}^d$; otherwise the samples are hidden and their count in proportion to the revealed samples is also hidden. We show that the mean $\\mathbf{μ}$ and covariance matrix $\\mathbf{Σ}$ can be estimated with arbitrary accuracy in polynomial-time, as long as we have oracle access to $S$, and $S$ has non-trivial measure under the unknown $d$-variate normal distribution. Additionally, we show that without oracle access to $S$, any non-trivial estimation is impossible.\n ', '\n Motivated by applications in Game Theory, Optimization, and Generative Adversarial Networks, recent work of Daskalakis et al \\cite{DISZ17} and follow-up work of Liang and Stokes \\cite{LiangS18} have established that a variant of the widely used Gradient Descent/Ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits last-iterate convergence to saddle points in {\\em unconstrained} convex-concave min-max optimization problems. We show that the same holds true in the more general problem of {\\em constrained} min-max optimization under a variant of the no-regret Multiplicative-Weights-Update method called "Optimistic Multiplicative-Weights Update (OMWU)". This answers an open question of Syrgkanis et al \\cite{SALS15}.\n The proof of our result requires fundamentally different techniques from those that exist in no-regret learning literature and the aforementioned papers. We show that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU is locally (asymptotically) stable converging to the exact solution. We believe that our techniques will be useful in the analysis of the last iterate of other learning algorithms.\n ', '\n Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of \\{OGDA\\}-stable critical points is a superset of \\{GDA\\}-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.\n ', "\n We consider testing and learning problems on causal Bayesian networks as defined by Pearl (Pearl, 2009). 
Given a causal Bayesian network $\\mathcal{M}$ on a graph with $n$ discrete variables and bounded in-degree and bounded `confounded components', we show that $O(\\log n)$ interventions on an unknown causal Bayesian network $\\mathcal{X}$ on the same graph, and $\\tilde{O}(n/ε^2)$ samples per intervention, suffice to efficiently distinguish whether $\\mathcal{X}=\\mathcal{M}$ or whether there exists some intervention under which $\\mathcal{X}$ and $\\mathcal{M}$ are farther than $ε$ in total variation distance. We also obtain sample/time/intervention efficient algorithms for: (i) testing the identity of two unknown causal Bayesian networks on the same graph; and (ii) learning a causal Bayesian network on a given graph. Although our algorithms are non-adaptive, we show that adaptivity does not help in general: $Ω(\\log n)$ interventions are necessary for testing the identity of two unknown causal Bayesian networks on the same graph, even adaptively. Our algorithms are enabled by a new subadditivity inequality for the squared Hellinger distance between two causal Bayesian networks.\n ", '\n We study revenue optimization in a repeated auction between a single seller and a single buyer. Traditionally, the design of repeated auctions requires strong modeling assumptions about the bidder behavior, such as it being myopic, infinite lookahead, or some specific form of learning behavior. Is it possible to design mechanisms which are simultaneously optimal against a multitude of possible buyer behaviors? We answer this question by designing a simple state-based mechanism that is simultaneously approximately optimal against a $k$-lookahead buyer for all $k$, a buyer who is a no-regret learner, and a buyer who is a policy-regret learner. Against each type of buyer our mechanism attains a constant fraction of the optimal revenue attainable against that type of buyer. We complement our positive results with almost tight impossibility results, showing that the revenue approximation tradeoffs achieved by our mechanism for different lookahead attitudes are near-optimal.\n ', '\n We propose a new type of attack for finding adversarial examples for image classifiers. Our method exploits spanners, i.e. deep neural networks whose input space is low-dimensional and whose output range approximates the set of images of interest. Spanners may be generators of GANs or decoders of VAEs. The key idea in our attack is to search over latent code pairs to find ones that generate nearby images with different classifier outputs. We argue that our attack is stronger than searching over perturbations of real images. Moreover, we show that our stronger attack can be used to reduce the accuracy of Defense-GAN to 3\\%, resolving an open problem from the well-known paper by Athalye et al. We combine our attack with normal adversarial training to obtain the most robust known MNIST classifier, significantly improving the state of the art against PGD attacks. Our formulation involves solving a min-max problem, where the min player sets the parameters of the classifier and the max player is running our attack, and is thus searching for adversarial examples in the {\\em low-dimensional} input space of the spanner.\n All code and models are available at \\url{https://github.com/ajiljalal/manifold-defense.git}\n ', '\n We address the issue of limit cycling behavior in training Generative Adversarial Networks and propose the use of Optimistic Mirror Descent (OMD) for training Wasserstein GANs. 
Recent theoretical results have shown that optimistic mirror descent (OMD) can enjoy faster regret rates in the context of zero-sum games. Training WGANs is exactly such a context: solving a zero-sum game with simultaneous no-regret dynamics. Moreover, we show that optimistic mirror descent addresses the limit cycling problem in training WGANs. We formally show that in the case of bi-linear zero-sum games the last iterate of OMD dynamics converges to an equilibrium, in contrast to GD dynamics which are bound to cycle. We also portray the huge qualitative difference between GD and OMD dynamics with toy examples, even when GD is modified with many adaptations proposed in the recent literature, such as gradient penalty or momentum. We apply OMD WGAN training to a bioinformatics problem of generating DNA sequences. We observe that models trained with OMD achieve consistently smaller KL divergence with respect to the true underlying distribution than models trained with GD variants. Finally, we introduce a new algorithm, Optimistic Adam, which is an optimistic variant of Adam. We apply it to WGAN training on CIFAR10 and observe improved performance in terms of inception score as compared to Adam.\n ', '\n We prove near-tight concentration of measure for polynomial functions of the Ising model under high temperature. For any degree $d$, we show that a degree-$d$ polynomial of an $n$-spin Ising model exhibits exponential tails that scale as $\\exp(-r^{2/d})$ at radius $r=\\tilde{Ω}_d(n^{d/2})$. Our concentration radius is optimal up to logarithmic factors for constant $d$, improving known results by polynomial factors in the number of spins. We demonstrate the efficacy of polynomial functions as statistics for testing the strength of interactions in social networks in both synthetic and real world data.\n ', '\n We provide algorithms that learn simple auctions whose revenue is approximately optimal in multi-item multi-bidder settings, for a wide range of valuations including unit-demand, additive, constrained additive, XOS, and subadditive. We obtain our learning results in two settings. The first is the commonly studied setting where sample access to the bidders\' distributions over valuations is given, for both regular distributions and arbitrary distributions with bounded support. Our algorithms require polynomially many samples in the number of items and bidders. The second is a more general max-min learning setting that we introduce, where we are given "approximate distributions," and we seek to compute an auction whose revenue is approximately optimal simultaneously for all "true distributions" that are close to the given ones. These results are more general in that they imply the sample-based results, and are also applicable in settings where we have no sample access to the underlying distributions but have estimated them indirectly via market research or by observation of previously run, potentially non-truthful auctions.\n Our results hold for valuation distributions satisfying the standard (and necessary) independence-across-items property. They also generalize and improve upon recent works, which have provided algorithms that learn approximately optimal auctions in more restricted settings with additive, subadditive and unit-demand valuations using sample access to distributions. We generalize these results to the complete unit-demand, additive, and XOS setting, to i.i.d. 
subadditive bidders, and to the max-min setting.\n Our results are enabled by new uniform convergence bounds for hypothesis classes under product measures. Our bounds result in exponential savings in sample complexity compared to bounds derived by bounding the VC dimension, and are of independent interest.\n ', '\n Given samples from an unknown distribution $p$ and a description of a distribution $q$, are $p$ and $q$ close or far? This question of "identity testing" has received significant attention in the case of testing whether $p$ and $q$ are equal or far in total variation distance. However, in recent work, the following questions have been critical to solving problems at the frontiers of distribution testing:\n -Alternative Distances: Can we test whether $p$ and $q$ are far in other distances, say Hellinger?\n -Tolerance: Can we test when $p$ and $q$ are close, rather than equal? And if so, close in which distances?\n Motivated by these questions, we characterize the complexity of distribution testing under a variety of distances, including total variation, $\\ell_2$, Hellinger, Kullback-Leibler, and $χ^2$. For each pair of distances $d_1$ and $d_2$, we study the complexity of testing if $p$ and $q$ are close in $d_1$ versus far in $d_2$, with a focus on identifying which problems allow strongly sublinear testers (i.e., those with complexity $O(n^{1 - γ})$ for some $γ > 0$ where $n$ is the size of the support of the distributions $p$ and $q$). We provide matching upper and lower bounds for each case. We also study these questions in the case where we only have samples from $q$ (equivalence testing), showing qualitative differences from identity testing in terms of when tolerance can be achieved. Our algorithms fall into the classical paradigm of $χ^2$-statistics, but require crucial changes to handle the challenges introduced by each distance we consider. Finally, we survey other recent results in an attempt to serve as a reference for the complexity of various distribution testing problems.\n ', '\n Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a single trajectory of a Markov Chain. In particular, we observe a single trajectory X0,...,Xt,... of an unknown, symmetric, and finite state Markov Chain M. We do not control the starting state X0, and we cannot restart the chain. Given our single trajectory, the goal is to test whether M is identical to a model Markov Chain M0, or far from it under an appropriate notion of difference. We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos [Kaz78], which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain M0 are O(n) in the size of the state space n.\n ', '\n We develop differentially private hypothesis testing methods for the small sample regime. 
Given a sample $\\cal D$ from a categorical distribution $p$ over some domain $Σ$, an explicitly described distribution $q$ over $Σ$, some privacy parameter $\\varepsilon$, accuracy parameter $α$, and requirements $β_{\\rm I}$ and $β_{\\rm II}$ for the type I and type II errors of our test, the goal is to distinguish between $p=q$ and $d_{\\rm{TV}}(p,q) \\geq α$.\n We provide theoretical bounds for the sample size $|{\\cal D}|$ so that our method both satisfies $(\\varepsilon,0)$-differential privacy, and guarantees $β_{\\rm I}$ and $β_{\\rm II}$ type I and type II errors. We show that differential privacy may come for free in some regimes of parameters, and we always beat the sample complexity resulting from running the $χ^2$-test with noisy counts, or standard approaches such as repetition for endowing non-private $χ^2$-style statistics with differential privacy guarantees. We experimentally compare the sample complexity of our method to that of recently proposed methods for private hypothesis testing.\n ', "\n Banach's fixed point theorem for contraction maps has been widely used to analyze the convergence of iterative methods in non-convex problems. It is a common experience, however, that iterative maps fail to be globally contracting under the natural metric in their domain, making the applicability of Banach's theorem limited. We explore how generally we can apply Banach's fixed point theorem to establish the convergence of iterative methods when pairing it with carefully designed metrics.\n Our first result is a strong converse of Banach's theorem, showing that it is a universal analysis tool for establishing global convergence of iterative methods to unique fixed points, and for bounding their convergence rate. In other words, we show that, whenever an iterative map globally converges to a unique fixed point, there exists a metric under which the iterative map is contracting and which can be used to bound the number of iterations until convergence. We illustrate our approach in the widely used power method, providing a new way of bounding its convergence rate through contraction arguments.\n We next consider the computational complexity of Banach's fixed point theorem. Making the proof of our converse theorem constructive, we show that computing a fixed point whose existence is guaranteed by Banach's fixed point theorem is CLS-complete. We thus provide the first natural complete problem for the class CLS, which was defined in [Daskalakis, Papadimitriou 2011] to capture the complexity of problems such as P-matrix LCP, computing KKT-points, and finding mixed Nash equilibria in congestion and network coordination games.\n ", '\n We show that the square Hellinger distance between two Bayesian networks on the same directed graph, $G$, is subadditive with respect to the neighborhoods of $G$. Namely, if $P$ and $Q$ are the probability distributions defined by two Bayesian networks on the same DAG, our inequality states that the square Hellinger distance, $H^2(P,Q)$, between $P$ and $Q$ is upper bounded by the sum, $\\sum_v H^2(P_{\\{v\\} \\cup Π_v}, Q_{\\{v\\} \\cup Π_v})$, of the square Hellinger distances between the marginals of $P$ and $Q$ on every node $v$ and its parents $Π_v$ in the DAG. Importantly, our bound does not involve the conditionals but the marginals of $P$ and $Q$. 
We derive a similar inequality for more general Markov Random Fields.\n As an application of our inequality, we show that distinguishing whether two Bayesian networks $P$ and $Q$ on the same (but potentially unknown) DAG satisfy $P=Q$ vs $d_{\\rm TV}(P,Q)>ε$ can be performed from $\\tilde{O}(|Σ|^{3/4(d+1)} \\cdot n/ε^2)$ samples, where $d$ is the maximum in-degree of the DAG and $Σ$ the domain of each variable of the Bayesian networks. If $P$ and $Q$ are defined on potentially different and potentially unknown trees, the sample complexity becomes $\\tilde{O}(|Σ|^{4.5} n/ε^2)$, whose dependence on $n, ε$ is optimal up to logarithmic factors. Lastly, if $P$ and $Q$ are product distributions over $\\{0,1\\}^n$ and $Q$ is known, the sample complexity becomes $O(\\sqrt{n}/ε^2)$, which is optimal up to constant factors.\n ', '\n Given samples from an unknown multivariate distribution $p$, is it possible to distinguish whether $p$ is the product of its marginals versus $p$ being far from every product distribution? Similarly, is it possible to distinguish whether $p$ equals a given distribution $q$ versus $p$ and $q$ being far from each other? These problems of testing independence and goodness-of-fit have received enormous attention in statistics, information theory, and theoretical computer science, with sample-optimal algorithms known in several interesting regimes of parameters. Unfortunately, it has also been understood that these problems become intractable in large dimensions, necessitating exponential sample complexity.\n Motivated by the exponential lower bounds for general distributions as well as the ubiquity of Markov Random Fields (MRFs) in the modeling of high-dimensional distributions, we initiate the study of distribution testing on structured multivariate distributions, and in particular the prototypical example of MRFs: the Ising Model. We demonstrate that, in this structured setting, we can avoid the curse of dimensionality, obtaining sample and time efficient testers for independence and goodness-of-fit. One of the key technical challenges we face along the way is bounding the variance of functions of the Ising model.\n ', '\n The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity result in less than 1\\% error estimation of the means. In the finite sample regime, we show that, under a random initialization, $\\tilde{O}(d/ε^2)$ samples suffice to compute the unknown vectors to within $ε$ in Mahalanobis distance, where $d$ is the dimension. 
In particular, the error rate of the EM based estimator is $\\tilde{O}\\left(\\sqrt{d \\over n}\\right)$ where $n$ is the number of samples, which is optimal up to logarithmic factors.\n ', '\n We consider the problem of a revenue-maximizing seller with m items for sale to n additive bidders with hard budget constraints, assuming that the seller has some prior distribution over bidder values and budgets. The prior may be correlated across items and budgets of the same bidder, but is assumed independent across bidders. We target mechanisms that are Bayesian Incentive Compatible, but that are ex-post Individually Rational and ex-post budget respecting. Virtually no such mechanisms are known that satisfy all these conditions and guarantee any revenue approximation, even with just a single item. We provide a computationally efficient mechanism that is a $3$-approximation with respect to all BIC, ex-post IR, and ex-post budget respecting mechanisms. Note that the problem is NP-hard to approximate better than a factor of 16/15, even in the case where the prior is a point mass \\cite{ChakrabartyGoel}. We further characterize the optimal mechanism in this setting, showing that it can be interpreted as a distribution over virtual welfare maximizers.\n We prove our results by making use of a black-box reduction from mechanism to algorithm design developed by \\cite{CaiDW13b}. Our main technical contribution is a computationally efficient $3$-approximation algorithm for the algorithmic problem that results by an application of their framework to this problem. The algorithmic problem has a mixed-sign objective and is NP-hard to optimize exactly, so it is surprising that a computationally efficient approximation is possible at all. In the case of a single item ($m=1$), the algorithmic problem can be solved exactly via exhaustive search, leading to a computationally efficient exact algorithm and a stronger characterization of the optimal mechanism as a distribution over virtual value maximizers.\n ', '\n We provide a characterization of revenue-optimal dynamic mechanisms in settings where a monopolist sells k items over k periods to a buyer who realizes his value for item i in the beginning of period i. We require that the mechanism satisfies a strong individual rationality constraint, requiring that the stage utility of each agent be positive during each period. We show that the optimum mechanism can be computed by solving a nested sequence of static (single-period) mechanisms that optimize a tradeoff between the surplus of the allocation and the buyer\'s utility. We also provide a simple dynamic mechanism that obtains at least half of the optimal revenue. The mechanism either ignores history and posts the optimal monopoly price in each period, or allocates with a probability that is independent of the current report of the agent and is based only on previous reports. Our characterization extends to multi-agent auctions. We also formulate a discounted infinite horizon version of the problem, where we study the performance of "Markov mechanisms."\n ', '\n An $(n,k)$-Poisson Multinomial Distribution (PMD) is the distribution of the sum of $n$ independent random vectors supported on the set ${\\cal B}_k=\\{e_1,\\ldots,e_k\\}$ of standard basis vectors in $\\mathbb{R}^k$. 
We show that any $(n,k)$-PMD is ${\\rm poly}\\left({k\\over σ}\\right)$-close in total variation distance to the (appropriately discretized) multi-dimensional Gaussian with the same first two moments, removing the dependence on $n$ from the Central Limit Theorem of Valiant and Valiant. Interestingly, our CLT is obtained by bootstrapping the Valiant-Valiant CLT itself through the structural characterization of PMDs shown in recent work by Daskalakis, Kamath, and Tzamos. In turn, our stronger CLT can be leveraged to obtain an efficient PTAS for approximate Nash equilibria in anonymous games, significantly improving the state of the art, and matching qualitatively the running time dependence on $n$ and $1/\\varepsilon$ of the best known algorithm for two-strategy anonymous games. Our new CLT also enables the construction of covers for the set of $(n,k)$-PMDs, which are proper and whose size is shown to be essentially optimal. Our cover construction combines our CLT with the Shapley-Folkman theorem and recent sparsification results for Laplacian matrices by Batson, Spielman, and Srivastava. Our cover size lower bound is based on an algebraic geometric construction. Finally, leveraging the structural properties of the Fourier spectrum of PMDs we show that these distributions can be learned from $O_k(1/\\varepsilon^2)$ samples in ${\\rm poly}_k(1/\\varepsilon)$-time, removing the quasi-polynomial dependence of the running time on $1/\\varepsilon$ from the algorithm of Daskalakis, Kamath, and Tzamos.\n ', '\n A line of recent work provides welfare guarantees of simple combinatorial auction formats, such as selling m items via simultaneous second price auctions (SiSPAs) (Christodoulou et al. 2008, Bhawalkar and Roughgarden 2011, Feldman et al. 2013). These guarantees hold even when the auctions are repeatedly executed and players use no-regret learning algorithms. Unfortunately, off-the-shelf no-regret algorithms for these auctions are computationally inefficient as the number of actions is exponential. We show that this obstacle is insurmountable: there are no polynomial-time no-regret algorithms for SiSPAs, unless RP$\\supseteq$ NP, even when the bidders are unit-demand. Our lower bound raises the question of how good outcomes polynomially-bounded bidders may discover in such auctions.\n To answer this question, we propose a novel concept of learning in auctions, termed "no-envy learning." This notion is founded upon Walrasian equilibrium, and we show that it is both efficiently implementable and results in approximately optimal welfare, even when the bidders have fractionally subadditive (XOS) valuations (assuming demand oracles) or coverage valuations (without demand oracles). No-envy learning outcomes are a relaxation of no-regret outcomes, which maintain their approximate welfare optimality while endowing them with computational tractability. Our results extend to other auction formats that have been studied in the literature via the smoothness paradigm.\n Our results for XOS valuations are enabled by a novel Follow-The-Perturbed-Leader algorithm for settings where the number of experts is infinite, and the payoff function of the learner is non-linear. This algorithm has applications outside of auction settings, such as in security games. Our result for coverage valuations is based on a novel use of convex rounding schemes and a reduction to online convex optimization.\n ', '\n Reconstructing the tree of life from molecular sequences is a fundamental problem in computational biology. 
Modern data sets often contain a large number of genes, which can complicate the reconstruction problem due to the fact that different genes may undergo different evolutionary histories. This is the case in particular in the presence of horizontal genetic transfer (HGT), where a gene is inherited from a distant species rather than an immediate ancestor. Such an event produces a gene tree which is distinct from, but related to, the species phylogeny.\n In previous work, a natural stochastic model of HGT was introduced and studied. It was shown, both in simulation and theoretical studies, that a species phylogeny can be reconstructed from gene trees despite surprisingly high rates of HGT under this model. Rigorous lower and upper bounds on this achievable rate were also obtained, but a large gap remained. Here we close this gap, up to a constant. Specifically we show that a species phylogeny can be reconstructed correctly from gene trees even when, on each gene, each edge of the species tree has a constant probability of being the location of an HGT event. Our new reconstruction algorithm, which relies only on unrooted gene tree topologies, builds the tree recursively from the leaves and runs in polynomial time.\n We also provide a matching bound in the negative direction (up to a constant) and extend our results to some cases where gene trees are not perfectly known.\n ', '\n Given samples from an unknown distribution $p$, is it possible to distinguish whether $p$ belongs to some class of distributions $\\mathcal{C}$ versus $p$ being far from every distribution in $\\mathcal{C}$? This fundamental question has received tremendous attention in statistics, focusing primarily on asymptotic analysis, and more recently in information theory and theoretical computer science, where the emphasis has been on small sample size and computational complexity. Nevertheless, even for basic properties of distributions such as monotonicity, log-concavity, unimodality, independence, and monotone-hazard rate, the optimal sample complexity is unknown.\n We provide a general approach via which we obtain sample-optimal and computationally efficient testers for all these distribution families. At the core of our approach is an algorithm which solves the following problem: Given samples from an unknown distribution $p$, and a known distribution $q$, are $p$ and $q$ close in $χ^2$-distance, or far in total variation distance?\n The optimality of our testers is established by providing matching lower bounds with respect to both $n$ and $\\varepsilon$. Finally, a necessary building block for our testers and an important byproduct of our work are the first known computationally efficient proper learners for discrete log-concave and monotone hazard rate distributions.\n ', "\n An $(n,k)$-Poisson Multinomial Distribution (PMD) is the distribution of the sum of $n$ independent random vectors supported on the set ${\\cal B}_k=\\{e_1,\\ldots,e_k\\}$ of standard basis vectors in $\\mathbb{R}^k$. We prove a structural characterization of these distributions, showing that, for all $\\varepsilon >0$, any $(n, k)$-Poisson multinomial random vector is $\\varepsilon$-close, in total variation distance, to the sum of a discretized multidimensional Gaussian and an independent $(\\text{poly}(k/\\varepsilon), k)$-Poisson multinomial random vector. Our structural characterization extends the multi-dimensional CLT of Valiant and Valiant, by simultaneously applying to all approximation requirements $\\varepsilon$. 
In particular, it overcomes factors depending on $\\log n$ and, importantly, the minimum eigenvalue of the PMD's covariance matrix from the distance to a multidimensional Gaussian random variable.\n We use our structural characterization to obtain an $\\varepsilon$-cover, in total variation distance, of the set of all $(n, k)$-PMDs, significantly improving the cover size of Daskalakis and Papadimitriou, and obtaining the same qualitative dependence of the cover size on $n$ and $\\varepsilon$ as the $k=2$ cover of Daskalakis and Papadimitriou. We further exploit this structure to show that $(n,k)$-PMDs can be learned to within $\\varepsilon$ in total variation distance from $\\tilde{O}_k(1/\\varepsilon^2)$ samples, which is near-optimal in terms of dependence on $\\varepsilon$ and independent of $n$. In particular, our result generalizes the single-dimensional result of Daskalakis, Diakonikolas, and Servedio for Poisson Binomials to arbitrary dimension.\n ", "\n We show that computing the revenue-optimal deterministic auction in unit-demand single-buyer Bayesian settings, i.e. the optimal item-pricing, is computationally hard even in single-item settings where the buyer's value distribution is a sum of independently distributed attributes, or multi-item settings where the buyer's values for the items are independent. We also show that it is intractable to optimally price the grand bundle of multiple items for an additive bidder whose values for the items are independent. These difficulties stem from implicit definitions of a value distribution. We provide three instances of how different properties of implicit distributions can lead to intractability: the first is a #P-hardness proof, while the remaining two are reductions from the SQRT-SUM problem of Garey, Graham, and Johnson. While simple pricing schemes can oftentimes approximate the best scheme in revenue, they can have drastically different underlying structure. We argue therefore that either the specification of the input distribution must be highly restricted in format, or it is necessary for the goal to be mere approximation to the optimal scheme's revenue instead of computing properties of the scheme itself.\n ", "\n Optimal mechanisms have been provided in quite general multi-item settings, as long as each bidder's type distribution is given explicitly by listing every type in the support along with its associated probability. In the implicit setting, e.g. when the bidders have additive valuations with independent and/or continuous values for the items, these results do not apply, and it was recently shown that exact revenue optimization is intractable, even when there is only one bidder. Even for item distributions with special structure, optimal mechanisms have been surprisingly rare and the problem is challenging even in the two-item case. In this paper, we provide a framework for designing optimal mechanisms using optimal transport theory and duality theory. We instantiate our framework to obtain conditions under which only pricing the grand bundle is optimal in multi-item settings (complementing the work of [Manelli and Vincent 2006], as well as to characterize optimal two-item mechanisms. 
We use our results to derive closed-form descriptions of the optimal mechanism in several two-item settings, exhibiting also a setting where a continuum of lotteries is necessary for revenue optimization but a closed-form representation of the mechanism can still be found efficiently using our framework.\n ", '\n The Clock Drawing Test (CDT) is a rapid, inexpensive, and popular neuropsychological screening tool for cognitive conditions. The Digital Clock Drawing Test (dCDT) uses novel software to analyze data from a digitizing ballpoint pen that reports its position with considerable spatial and temporal precision, making possible the analysis of both the drawing process and final product. We developed methodology to analyze pen stroke data from these drawings, and computed a large collection of features which were then analyzed with a variety of machine learning techniques. The resulting scoring systems were designed to be more accurate than the systems currently used by clinicians, but just as interpretable and easy to use. The systems also allow us to quantify the tradeoff between accuracy and interpretability. We created automated versions of the CDT scoring systems currently used by clinicians, allowing us to benchmark our models, which indicated that our machine learning models substantially outperformed the existing scoring systems.\n ', '\n We describe a sketch interpretation system that detects and classifies clock numerals created by subjects taking the Clock Drawing Test, a clinical tool widely used to screen for cognitive impairments (e.g., dementia). We describe how it balances appearance and context, and document its performance on some 2,000 drawings (about 24K clock numerals) produced by a wide spectrum of patients. We calibrate the utility of different forms of context, describing experiments with Conditional Random Fields trained and tested using a variety of features. We identify context that contributes to interpreting otherwise ambiguous or incomprehensible strokes. We describe ST-slices, a novel representation that enables "unpeeling" the layers of ink that result when people overwrite, which often produces ink impossible to analyze if only the final drawing is examined. We characterize when ST-slices work, calibrate their impact on performance, and consider their breadth of applicability.\n ', '\n Spiral Galaxies is a pencil-and-paper puzzle played on a grid of unit squares: given a set of points called centers, the goal is to partition the grid into polyominoes such that each polyomino contains exactly one center and is $180^\\circ$ rotationally symmetric about its center. We show that this puzzle is NP-complete even if the polyominoes are restricted to be (a) rectangles of arbitrary size or (b) 1$\\times$1, 1$\\times$3, and 3$\\times$1 rectangles. The proof for the latter variant also implies NP-completeness of finding a non-crossing matching in modified grid graphs where edges connect vertices of distance $2$. Moreover, we prove NP-completeness of the design problem of minimizing the number of centers such that there exist a set of Spiral Galaxies that exactly cover a given shape.\n ', '\n Metric Multidimensional scaling (MDS) is a classical method for generating meaningful (non-linear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. 
In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into low dimensions. Despite its ubiquity, our theoretical understanding of MDS remains limited as its objective function is highly non-convex. In this paper, we prove that minimizing the Kamada-Kawai objective is NP-hard and give a provable approximation algorithm for optimizing it, which in particular is a PTAS on low-diameter graphs. We supplement this result with experiments suggesting possible connections between our greedy approximation algorithm and gradient-based methods.\n ', '\n We prove NP-completeness of Yin-Yang / Shiromaru-Kuromaru pencil-and-paper puzzles. Viewed as a graph partitioning problem, we prove NP-completeness of partitioning a rectangular grid graph into two induced trees (normal Yin-Yang), or into two induced connected subgraphs (Yin-Yang without $2 \\times 2$ rule), subject to some vertices being pre-assigned to a specific tree/subgraph.\n ', '\n We show how to edge-unfold a new class of convex polyhedra, specifically a new class of prismatoids (the convex hull of two parallel convex polygons, called the top and base), by constructing a nonoverlapping "petal unfolding" in two new cases: (1) when the top and base are sufficiently far from each other; and (2) when the base is a rectangle and all other faces are nonobtuse triangles. The latter result extends a previous result by O\'Rourke that the petal unfolding of a prismatoid avoids overlap when the base is a triangle (possibly obtuse) and all other faces are nonobtuse triangles. We also illustrate the difficulty of extending this result to a general quadrilateral base by giving a counterexample to our technique.\n ', '\n We prove that any finite polyhedral manifold in 3D can be continuously flattened into 2D while preserving intrinsic distances and avoiding crossings, answering a 19-year-old open problem, if we extend standard folding models to allow for countably infinite creases. The most general cases previously known to be continuously flattenable were convex polyhedra and semi-orthogonal polyhedra. For non-orientable manifolds, even the existence of an instantaneous flattening (flat folded state) is a new result. Our solution extends a method for flattening semi-orthogonal polyhedra: slice the polyhedron along parallel planes and flatten the polyhedral strips between consecutive planes. We adapt this approach to arbitrary nonconvex polyhedra by generalizing strip flattening to nonorthogonal corners and slicing along a countably infinite number of parallel planes, with slices densely approaching every vertex of the manifold. We also show that the area of the polyhedron that needs to support moving creases (which are necessary for closed polyhedra by the Bellows Theorem) can be made arbitrarily small.\n ', "\n We study Snipperclips, a computer puzzle game whose objective is to create a target shape with two tools. The tools start as constant-complexity shapes, and each tool can snip (i.e., subtract its current shape from) the other tool. We study the computational problem of, given a target shape represented by a polygonal domain of $n$ vertices, is it possible to create it as one of the tools' shape via a sequence of snip operations? If so, how many snip operations are required? 
We consider several variants of the problem (such as allowing the tools to be disconnected and/or using an undo operation) and bound the number of operations needed for each of the variants.\n ", '\n Given a graph where every vertex has exactly one labeled token, how can we most quickly execute a given permutation on the tokens? In (sequential) token swapping, the goal is to use the shortest possible sequence of swaps, each of which exchanges the tokens at the two endpoints of an edge of the graph. In parallel token swapping, the goal is to use the fewest rounds, each of which consists of one or more swaps on the edges of a matching. We prove that both of these problems remain NP-hard when the graph is restricted to be a tree.\n These token swapping problems have been studied by disparate groups of researchers in discrete mathematics, theoretical computer science, robot motion planning, game theory, and engineering. Previous work establishes NP-completeness on general graphs (for both problems); polynomial-time algorithms for simple graph classes such as cliques, stars, paths, and cycles; and constant-factor approximation algorithms in some cases. The two natural cases of sequential and parallel token swapping in trees were first studied over thirty years ago (as "sorting with a transposition tree") and over twenty-five years ago (as "routing permutations via matchings"), yet their complexities were previously unknown.\n We also show limitations on approximation of sequential token swapping on trees: we identify a broad class of algorithms that encompass all three known polynomial-time algorithms that achieve the best known approximation factor (which is $2$) and show that no such algorithm can achieve an approximation factor less than $2$.\n ', '\n We prove that Strings-and-Coins -- the combinatorial two-player game generalizing the dual of Dots-and-Boxes -- is strongly PSPACE-complete on multigraphs. This result improves the best previous result, NP-hardness, argued in Winning Ways. Our result also applies to the Nimstring variant, where the winner is determined by normal play; indeed, one step in our reduction is the standard reduction (also from Winning Ways) from Nimstring to Strings-and-Coins.\n ', '\n When can $n$ given numbers be combined using arithmetic operators from a given subset of $\\{+, -, \\times, ÷\\}$ to obtain a given target number? We study three variations of this problem of Arithmetic Expression Construction: when the expression (1) is unconstrained; (2) has a specified pattern of parentheses and operators (and only the numbers need to be assigned to blanks); or (3) must match a specified ordering of the numbers (but the operators and parenthesization are free). For each of these variants, and many of the subsets of $\\{+,-,\\times,÷\\}$, we prove the problem NP-complete, sometimes in the weak sense and sometimes in the strong sense. Most of these proofs make use of a "rational function framework" which proves equivalence of these problems for values in rational functions with values in positive integers.\n ', '\n We prove that the classic falling-block video game Tetris (both survival and board clearing) remains NP-complete even when restricted to 8 columns, or to 4 rows, settling open problems posed over 15 years ago [BDH+04]. Our reduction is from 3-Partition, similar to the previous reduction for unrestricted board sizes, but with a better packing of buckets. On the positive side, we prove that 2-column Tetris (and 1-row Tetris) is polynomial. 
We also prove that the generalization of Tetris to larger $k$-omino pieces is NP-complete even when the board starts empty, even when restricted to 3 columns or 2 rows or constant-size pieces. Finally, we present an animated Tetris font.\n ', "\n A closed quasigeodesic is a closed loop on the surface of a polyhedron with at most $180^\\circ$ of surface on both sides at all points; such loops can be locally unfolded straight. In 1949, Pogorelov proved that every convex polyhedron has at least three (non-self-intersecting) closed quasigeodesics, but the proof relies on a nonconstructive topological argument. We present the first finite algorithm to find a closed quasigeodesic on a given convex polyhedron, which is the first positive progress on a 1990 open problem by O'Rourke and Wyman. The algorithm's running time is pseudopolynomial, namely $O\\left({n^2 \\over \\varepsilon^2} {L \\over \\ell} b\\right)$ time, where $\\varepsilon$ is the minimum curvature of a vertex, $L$ is the length of the longest edge, $\\ell$ is the smallest distance within a face between a vertex and a nonincident edge (minimum feature size of any face), and $b$ is the maximum number of bits of an integer in a constant-size radical expression of a real number representing the polyhedron. We take special care with the model of computation, introducing the $O(1)$-expression RAM and showing that it can be implemented in the standard word RAM.\n ", '\n Given a set of point sites, a sona drawing is a single closed curve, disjoint from the sites and intersecting itself only in simple crossings, so that each bounded region of its complement contains exactly one of the sites. We prove that it is NP-hard to find a minimum-length sona drawing for $n$ given points, and that such a curve can be longer than the TSP tour of the same points by a factor $> 1.5487875$. When restricted to tours that lie on the edges of a square grid, with points in the grid cells, we prove that it is NP-hard even to decide whether such a tour exists. These results answer questions posed at CCCG 2006.\n ', '\n We present new examples of topologically convex edge-ununfoldable polyhedra, i.e., polyhedra that are combinatorially equivalent to convex polyhedra, yet cannot be cut along their edges and unfolded into one planar piece without overlap. One family of examples is acutely triangulated, i.e., every face is an acute triangle. Another family of examples is stacked, i.e., the result of face-to-face gluings of tetrahedra. Both families achieve another natural property, which we call very ununfoldable: for every $k$, there is an example such that every nonoverlapping multipiece edge unfolding has at least $k$ pieces.\n ', '\n Suppose an "escaping" player moves continuously at maximum speed 1 in the interior of a region, while a "pursuing" player moves continuously at maximum speed $r$ outside the region. For what $r$ can the first player escape the region, that is, reach the boundary a positive distance away from the pursuing player, assuming optimal play by both players? We formalize a model for this infinitesimally alternating 2-player game that we prove has a unique winner in any region with locally rectifiable boundary, avoiding pathological behaviors (where both players can have "winning strategies") previously identified for pursuit-evasion games such as the Lion and Man problem in certain metric spaces. 
For some regions, including both equilateral triangle and square, we give exact results for the critical speed ratio, above which the pursuing player can win and below which the escaping player can win (and at which the pursuing player can win). For simple polygons, we give a simple formula and polynomial-time algorithm that is guaranteed to give a 10.89898-approximation to the critical speed ratio, and we give a pseudopolynomial-time approximation scheme for arbitrarily approximating the critical speed ratio. On the negative side, we prove NP-hardness of the problem for polyhedral domains in 3D, and prove stronger results (PSPACE-hardness and NP-hardness even to approximate) for generalizations to multiple escaping and pursuing players.\n ', '\n We prove PSPACE-completeness of all but one problem in a large space of pulling-block problems where the goal is for the agent to reach a target destination. The problems are parameterized by whether pulling is optional, the number of blocks which can be pulled simultaneously, whether there are fixed blocks or thin walls, and whether there is gravity. We show NP-hardness for the remaining problem, Pull?-1FG (optional pulling, strength 1, fixed blocks, with gravity).\n ', '\n A door gadget has two states and three tunnels that can be traversed by an agent (player, robot, etc.): the "open" and "close" tunnel sets the gadget\'s state to open and closed, respectively, while the "traverse" tunnel can be traversed if and only if the door is in the open state. We prove that it is PSPACE-complete to decide whether an agent can move from one location to another through a planar assembly of such door gadgets, removing the traditional need for crossover gadgets and thereby simplifying past PSPACE-hardness proofs of Lemmings and Nintendo games Super Mario Bros., Legend of Zelda, and Donkey Kong Country. Our result holds in all but one of the possible local planar embeddings of the open, close, and traverse tunnels within a door gadget; in the one remaining case, we prove NP-hardness.\n We also introduce and analyze a simpler type of door gadget, called the self-closing door. This gadget has two states and only two tunnels, similar to the "open" and "traverse" tunnels of doors, except that traversing the traverse tunnel also closes the door. In a variant called the symmetric self-closing door, the "open" tunnel can be traversed if and only if the door is closed. We prove that it is PSPACE-complete to decide whether an agent can move from one location to another through a planar assembly of either type of self-closing door. Then we apply this framework to prove new PSPACE-hardness results for eight different 3D Mario games and Sokobond.\n ", "\n Can an infinite-strength magnetic beacon always ``catch'' an iron ball, when the beacon is a point required to remain nonstrictly outside a polygon, and the ball is a point always moving instantaneously and maximally toward the beacon subject to staying nonstrictly within the same polygon? Kouhestani and Rappaport [JCDCG 2017] gave an algorithm for determining whether a ball-capturing beacon strategy exists, while conjecturing that such a strategy always exists. 
We disprove this conjecture by constructing orthogonal and general-position polygons in which the ball and the beacon can never be united.\n ", '\n We analyze the computational complexity of motion planning through local "input/output" gadgets with separate entrances and exits, and a subset of allowed traversals from entrances to exits, each of which changes the state of the gadget and thereby the allowed traversals. We study such gadgets in the 0-, 1-, and 2-player settings, in particular extending past motion-planning-through-gadgets work to 0-player games for the first time, by considering "branchless" connections between gadgets that route every gadget\'s exit to a unique gadget\'s entrance. Our complexity results include containment in L, NL, P, NP, and PSPACE; as well as hardness for NL, P, NP, and PSPACE. We apply these results to show PSPACE-completeness for certain mechanics in Factorio, [the Sequence], and a restricted version of Trainyard, improving prior results. This work strengthens prior results on switching graphs and reachability switching games.\n ', '\n We give an overview of the 2020 Computational Geometry Challenge, which targeted the problem of partitioning the convex hull of a given planar point set P into the smallest number of convex faces, such that no point of P is contained in the interior of a face.\n ', '\n Consider $n^2-1$ unit-square blocks in an $n \\times n$ square board, where each block is labeled as movable horizontally (only), movable vertically (only), or immovable -- a variation of Rush Hour with only $1 \\times 1$ cars and fixed blocks. We prove that it is PSPACE-complete to decide whether a given block can reach the left edge of the board, by reduction from Nondeterministic Constraint Logic via 2-color oriented Subway Shuffle. By contrast, polynomial-time algorithms are known for deciding whether a given block can be moved by one space, or when each block either is immovable or can move both horizontally and vertically. Our result answers a 15-year-old open problem by Tromp and Cilibrasi, and strengthens previous PSPACE-completeness results for Rush Hour with vertical $1 \\times 2$ and horizontal $2 \\times 1$ movable blocks and 4-color Subway Shuffle.\n ', '\n In the Nikoli pencil-and-paper game Tatamibari, a puzzle consists of an $m \\times n$ grid of cells, where each cell possibly contains a clue among +, -, |. The goal is to partition the grid into disjoint rectangles, where every rectangle contains exactly one clue, rectangles containing + are square, rectangles containing - are strictly longer horizontally than vertically, rectangles containing | are strictly longer vertically than horizontally, and no four rectangles share a corner. We prove this puzzle NP-complete, establishing a Nikoli gap of 16 years. Along the way, we introduce a gadget framework for proving hardness of similar puzzles involving area coverage, and show that it applies to an existing NP-hardness proof for Spiral Galaxies. We also present a mathematical puzzle font for Tatamibari.\n ', '\n Recursed is a 2D puzzle platform video game featuring treasure chests that, when jumped into, instantiate a room that can later be exited (similar to function calls), optionally generating a jar that returns back to that room (similar to continuations). We prove that Recursed is RE-complete and thus undecidable (not recursive) by a reduction from the Post Correspondence Problem. 
Our reduction is "practical": the reduction from PCP results in fully playable levels that abide by all constraints governing levels (including the 15x20 room size) designed for the main game. Our reduction is also "efficient": a Turing machine can be simulated by a Recursed level whose size is linear in the encoding size of the Turing machine and whose solution length is polynomial in the running time of the Turing machine.\n ', '\n We present the first universal reconfiguration algorithm for transforming a modular robot between any two facet-connected square-grid configurations using pivot moves. More precisely, we show that five extra "helper" modules ("musketeers") suffice to reconfigure the remaining $n$ modules between any two given configurations. Our algorithm uses $O(n^2)$ pivot moves, which is worst-case optimal. Previous reconfiguration algorithms either require less restrictive "sliding" moves, do not preserve facet-connectivity, or for the setting we consider, could only handle a small subset of configurations defined by a local forbidden pattern. Configurations with the forbidden pattern do have disconnected reconfiguration graphs (discrete configuration spaces), and indeed we show that they can have an exponential number of connected components. But forbidding the local pattern throughout the configuration is far from necessary, as we show that just a constant number of added modules (placed to be freely reconfigurable) suffice for universal reconfigurability. We also classify three different models of natural pivot moves that preserve facet-connectivity, and show separations between these models.\n ', '\n An open problem of Manuel Abellanas asks whether every set of disjoint closed unit disks in the plane can be connected by a conveyor belt, which means a tight simple closed curve that touches the boundary of each disk, possibly multiple times. We prove three main results. First, for unit disks whose centers are both $x$-monotone and $y$-monotone, or whose centers have $x$-coordinates that differ by at least two units, a conveyor belt always exists and can be found efficiently. Second, it is NP-complete to determine whether disks of varying radii have a conveyor belt, and it remains NP-complete when we constrain the belt to touch disks exactly once. Third, any disjoint set of $n$ disks of arbitrary radii can be augmented by $O(n)$ "guide" disks so that the augmented system has a conveyor belt touching each disk exactly once, answering a conjecture of Demaine, Demaine, and Palop.\n ', '\n It is unknown whether every polycube (polyhedron constructed by gluing cubes face-to-face) has an edge unfolding, that is, cuts along edges of the cubes that unfolds the polycube to a single nonoverlapping polygon in the plane. Here we construct polycubes that have no *edge zipper unfolding* where the cut edges are further restricted to form a path.\n ', '\n Self-assembly refers to the process by which small, simple components mix and combine to form complex structures using only local interactions. Designed as a hybrid between tile assembly models and cellular automata, the Tile Automata (TA) model was recently introduced as a platform to help study connections between various models of self-assembly. 
However, in this paper we present a result in which we use TA to simulate arbitrary systems within the amoebot model, a theoretical model of programmable matter in which the individual components are relatively simple state machines that are able to sense the states of their neighbors and to move via series of expansions and contractions. We show that for every amoebot system, there is a TA system capable of simulating the local information transmission built into amoebot particles, and that the TA "macrotiles" used to simulate its particles are capable of simulating movement (via attachment and detachment operations) while maintaining the necessary properties of amoebot particle systems. The TA systems are able to utilize only the local interactions of state changes and binding and unbinding along tile edges, but are able to fully simulate the dynamics of these programmable matter systems.\n ', '\n We study the problem of deciding whether a crease pattern can be folded by simple folds (folding along one line at a time) under the infinite all-layers model introduced by [Akitaya et al., 2017], in which each simple fold is defined by an infinite line and must fold all layers of paper that intersect this line. This model is motivated by folding in manufacturing such as sheet-metal bending. We improve on [Arkin et al., 2004] by giving a deterministic $O(n)$-time algorithm to decide simple foldability of 1D crease patterns in the all-layers model. Then we extend this 1D result to 2D, showing that simple foldability in this model can be decided in linear time for unassigned axis-aligned orthogonal crease patterns on axis-aligned 2D orthogonal paper. On the other hand, we show that simple foldability is strongly NP-complete if a subset of the creases have a mountain-valley assignment, even for an axis-aligned rectangle of paper.\n ', '\n We build a general theory for characterizing the computational complexity of motion planning of robot(s) through a graph of "gadgets", where each gadget has its own state defining a set of allowed traversals which in turn modify the gadget\'s state. We study two families of such gadgets, one which naturally leads to motion planning problems with polynomially bounded solutions, and another which leads to polynomially unbounded (potentially exponential) solutions. We also study a range of competitive game-theoretic scenarios, from one player controlling one robot to teams of players each controlling their own robot and racing to achieve their team\'s goal. Under small restrictions on these gadgets, we fully characterize the complexity of bounded 1-player motion planning (NL vs. NP-complete), unbounded 1-player motion planning (NL vs. PSPACE-complete), and bounded 2-player motion planning (P vs. PSPACE-complete), and we partially characterize the complexity of unbounded 2-player motion planning (P vs. EXPTIME-complete), bounded 2-team motion planning (P vs. NEXPTIME-complete), and unbounded 2-team motion planning (P vs. undecidable). These results can be seen as an alternative to Constraint Logic (which has already proved useful as a basis for hardness reductions), providing a wide variety of agent-based gadgets, any one of which suffices to prove a problem hard.\n ', '\n We characterize when two conic curved creases are compatible with each other, when the rule lines must converge to conic foci and reflect at the crease. 
Namely, two conics are compatible (can be connected by rule segments in a foldable curved crease pattern) if and only if they have equal or reciprocal eccentricity. Thus, circles (eccentricity 0) and parabolas (eccentricity 1) are compatible with only themselves (when scaled from a focus), and ellipses (eccentricity strictly between 0 and 1) and hyperbolas (eccentricity above 1) are compatible with themselves and each other (but only in specific pairings). The foundation of this result is a general condition relating any two curved creases connected by rule segments. We also use our characterization to analyze several curved crease designs.\n ', '\n In this paper, we show that deciding rigid foldability of a given crease pattern using all creases is weakly NP-hard by a reduction from Partition, and that deciding rigid foldability with optional creases is strongly NP-hard by a reduction from 1-in-3 SAT. Unlike flat foldability of origami or flexibility of other kinematic linkages, whose complexity originates in the complexity of the layer ordering and possible self-intersection of the material, rigid foldability from a planar state is hard even though there is no potential self-intersection. In fact, the complexity comes from the combinatorial behavior of the different possible rigid folding configurations at each vertex. The results underpin the fact that it is harder to fold from an unfolded sheet of paper than to unfold a folded state back to a plane, frequently encountered problem when realizing folding-based systems such as self-folding matter and reconfigurable robots.\n ', '\n Cookie Clicker is a popular online incremental game where the goal of the game is to generate as many cookies as possible. In the game you start with an initial cookie generation rate, and you can use cookies as currency to purchase various items that increase your cookie generation rate. In this paper, we analyze strategies for playing Cookie Clicker optimally. While simple to state, the game gives rise to interesting analysis involving ideas from NP-hardness, approximation algorithms, and dynamic programming.\n ', '\n We prove computational intractability of variants of checkers: (1) deciding whether there is a move that forces the other player to win in one move is NP-complete; (2) checkers where players must always be able to jump on their turn is PSPACE-complete; and (3) cooperative versions of (1) and (2) are NP-complete. We also give cooperative checkers puzzles whose solutions are the letters of the alphabet.\n ', '\n We develop a new framework for generalizing approximation algorithms from the structural graph algorithm literature so that they apply to graphs somewhat close to that class (a scenario we expect is common when working with real-world networks) while still guaranteeing approximation ratios. The idea is to $\\textit{edit}$ a given graph via vertex- or edge-deletions to put the graph into an algorithmically tractable class, apply known approximation algorithms for that class, and then $\\textit{lift}$ the solution to apply to the original graph. 
We give a general characterization of when an optimization problem is amenable to this approach, and show that it includes many well-studied graph problems, such as Independent Set, Vertex Cover, Feedback Vertex Set, Minimum Maximal Matching, Chromatic Number, ($\\ell$-)Dominating Set, Edge ($\\ell$-)Dominating Set, and Connected Dominating Set.\n To enable this framework, we develop new editing algorithms that find the approximately-fewest edits required to bring a given graph into one of several important graph classes (in some cases, also approximating the target parameter of the family). For bounded degeneracy, we obtain a bicriteria $(4,4)$-approximation which also extends to a smoother bicriteria trade-off. For bounded treewidth, we obtain a bicriteria $(O(\\log^{1.5} n), O(\\sqrt{\\log w}))$-approximation, and for bounded pathwidth, we obtain a bicriteria $(O(\\log^{1.5} n), O(\\sqrt{\\log w} \\cdot \\log n))$-approximation. For treedepth $2$ (also related to bounded expansion), we obtain a $4$-approximation. We also prove complementary hardness-of-approximation results assuming $\\mathrm{P} \\neq \\mathrm{NP}$: in particular, these problems are all log-factor inapproximable, except the last which is not approximable below some constant factor ($2$ assuming UGC).\n ', "\n We consider the computational complexity of reconfiguration problems, in which one is given two combinatorial configurations satisfying some constraints, and is asked to transform one into the other using elementary transformations, while satisfying the constraints at all times. Such problems appear naturally in many contexts, such as model checking, motion planning, enumeration and sampling, and recreational mathematics. We provide hardness results for problems in this family, in which the constraints and operations are particularly simple. More precisely, we prove the PSPACE-completeness of the following decision problems:\n $\\bullet$ Given two satisfying assignments to a planar monotone instance of Not-All-Equal 3-SAT, can one assignment be transformed into the other by single variable `flips' (assignment changes), preserving satisfiability at every step?\n $\\bullet$ Given two subsets of a set S of integers with the same sum, can one subset be transformed into the other by adding or removing at most three elements of S at a time, such that the intermediate subsets also have the same sum?\n $\\bullet$ Given two points in $\\{0,1\\}^n$ contained in a polytope P specified by a constant number of linear inequalities, is there a path in the n-hypercube connecting the two points and contained in P?\n These problems can be interpreted as reconfiguration analogues of standard problems in NP. Interestingly, the instances of the NP problems that appear as input to the reconfiguration problems in our reductions can be shown to lie in P. In particular, the elements of S and the coefficients of the inequalities defining P can be restricted to have logarithmic bit-length.\n ", '\n We analyze the computational complexity of the many types of pencil-and-paper-style puzzles featured in the 2016 puzzle video game The Witness. In all puzzles, the goal is to draw a simple path in a rectangular grid graph from a start vertex to a destination vertex. 
The different puzzle types place different constraints on the path: preventing some edges from being visited (broken edges); forcing some edges or vertices to be visited (hexagons); forcing some cells to have certain numbers of incident path edges (triangles); or forcing the regions formed by the path to be partially monochromatic (squares), have exactly two special cells (stars), or be singly covered by given shapes (polyominoes) and/or negatively counting shapes (antipolyominoes). We show that any one of these clue types (except the first) is enough to make path finding NP-complete ("witnesses exist but are hard to find"), even for rectangular boards. Furthermore, we show that a final clue type (antibody), which necessarily "cancels" the effect of another clue in the same region, makes path finding $Σ_2$-complete ("witnesses do not exist"), even with a single antibody (combined with many anti/polyominoes), and the problem gets no harder with many antibodies. On the positive side, we give a polynomial-time algorithm for monomino clues, by reducing to hexagon clues on the boundary of the puzzle, even in the presence of broken edges, and solving "subset Hamiltonian path" for terminals on the boundary of an embedded planar graph in polynomial time.\n ', '\n We analyze the computational complexity of optimally playing the two-player board game Push Fight, generalized to an arbitrary board and number of pieces. We prove that the game is PSPACE-hard to decide who will win from a given position, even for simple (almost rectangular) hole-free boards. We also analyze the mate-in-1 problem: can the player win in a single turn? One turn in Push Fight consists of up to two "moves" followed by a mandatory "push". With these rules, or generalizing the number of allowed moves to any constant, we show mate-in-1 can be solved in polynomial time. If, however, the number of moves per turn is part of the input, the problem becomes NP-complete. On the other hand, without any limit on the number of moves per turn, the problem becomes polynomially solvable again.\n ', '\n We prove that path puzzles with complete row and column information--or equivalently, 2D orthogonal discrete tomography with Hamiltonicity constraint--are strongly NP-complete, ASP-complete, and #P-complete. Along the way, we newly establish ASP-completeness and #P-completeness for 3-Dimensional Matching and Numerical 3-Dimensional Matching.\n ', '\n We prove that two polygons $A$ and $B$ have a reversible hinged dissection (a chain hinged dissection that reverses inside and outside boundaries when folding between $A$ and $B$) if and only if $A$ and $B$ are two noncrossing nets of a common polyhedron. Furthermore, monotone reversible hinged dissections (where all hinges rotate in the same direction when changing from $A$ to $B$) correspond exactly to noncrossing nets of a common convex polyhedron. By envelope/parcel magic, it becomes easy to design many hinged dissections.\n ', '\n We study the problem of folding a polyomino $P$ into a polycube $Q$, allowing faces of $Q$ to be covered multiple times. First, we define a variety of folding models according to whether the folds (a) must be along grid lines of $P$ or can divide squares in half (diagonally and/or orthogonally), (b) must be mountain or can be both mountain and valley, (c) can remain flat (forming an angle of $180^\\circ$), and (d) must lie on just the polycube surface or can have interior faces as well. 
Second, we give all the inclusion relations among all models that fold on the grid lines of $P$. Third, we characterize all polyominoes that can fold into a unit cube, in some models. Fourth, we give a linear-time dynamic programming algorithm to fold a tree-shaped polyomino into a constant-size polycube, in some models. Finally, we consider the triangular version of the problem, characterizing which polyiamonds fold into a regular tetrahedron.\n ', '\n This paper initiates the study of I/O algorithms (minimizing cache misses) from the perspective of fine-grained complexity (conditional polynomial lower bounds). Specifically, we aim to answer why sparse graph problems are so hard, and why the Longest Common Subsequence problem gets a savings of a factor of the size of cache times the length of a cache line, but no more. We take the reductions and techniques from complexity and fine-grained complexity and apply them to the I/O model to generate new (conditional) lower bounds as well as faster algorithms. We also prove the existence of a time hierarchy for the I/O model, which motivates the fine-grained reductions.\n Using fine-grained reductions, we give an algorithm for distinguishing 2 vs. 3 diameter and radius that runs in $O(|E|^2/(MB))$ cache misses, which for sparse graphs improves over the previous $O(|V|^2/B)$ running time. We give new reductions from radius and diameter to Wiener index and median. We show meaningful reductions between problems that have linear-time solutions in the RAM model. The reductions use low I/O complexity (typically $O(n/B)$), and thus help to finely capture the relationship between "I/O linear time" $Θ(n/B)$ and RAM linear time $Θ(n)$. We generate new I/O assumptions based on the difficulty of improving sparse graph problem running times in the I/O model. We create conjectures that the current best known algorithms for Single Source Shortest Paths (SSSP), diameter, and radius are optimal. From these I/O-model assumptions, we show that many of the known reductions in the word-RAM model can naturally extend to hold in the I/O model as well (e.g., a lower bound on the I/O complexity of Longest Common Subsequence that matches the best known running time). Finally, we prove an analog of the Time Hierarchy Theorem in the I/O model.\n ', '\n We analyze a directed variation of the book embedding problem when the page partition is prespecified and the nodes on the spine must be in topological order (upward book embedding). Given a directed acyclic graph and a partition of its edges into $k$ pages, can we linearly order the vertices such that the drawing is upward (a topological sort) and each page avoids crossings? We prove that the problem is NP-complete for $k\\ge 3$, and for $k\\ge 4$ even in the special case when each page is a matching. By contrast, the problem can be solved in linear time for $k=2$ pages when pages are restricted to matchings. The problem comes from Jack Edmonds (1997), motivated as a generalization of the map folding problem from computational origami.\n ', '\n The 15 puzzle is a classic reconfiguration puzzle with fifteen uniquely labeled unit squares within a $4 \\times 4$ board in which the goal is to slide the squares (without ever overlapping) into a target configuration. 
By generalizing the puzzle to an $n \\times n$ board with $n^2-1$ squares, we can study the computational complexity of problems related to the puzzle; in particular, we consider the problem of determining whether a given end configuration can be reached from a given start configuration via at most a given number of moves. This problem was shown NP-complete in Ratner and Warmuth (1990). We provide an alternative simpler proof of this fact by reduction from the rectilinear Steiner tree problem.\n ', '\n In 2007, Arkin et al. initiated a systematic study of the complexity of the Hamiltonian cycle problem on square, triangular, or hexagonal grid graphs, restricted to polygonal, thin, superthin, degree-bounded, or solid grid graphs. They solved many combinations of these problems, proving them either polynomially solvable or NP-complete, but left three combinations open. In this paper, we prove two of these unsolved combinations to be NP-complete: Hamiltonicity of Square Polygonal Grid Graphs and Hamiltonicity of Hexagonal Thin Grid Graphs. We also consider a new restriction, where the grid graph is both thin and polygonal, and prove that Hamiltonicity then becomes polynomially solvable for square, triangular, and hexagonal grid graphs.\n ', '\n In this paper, we introduce a new problem called Tree-Residue Vertex-Breaking (TRVB): given a multigraph $G$ some of whose vertices are marked "breakable," is it possible to convert $G$ into a tree via a sequence of "vertex-breaking" operations (replacing a degree-$k$ breakable vertex by $k$ degree-$1$ vertices, disconnecting the $k$ incident edges)?\n We characterize the computational complexity of TRVB with any combination of the following additional constraints: $G$ must be planar, $G$ must be a simple graph, the degree of every breakable vertex must belong to an allowed list $B$, and the degree of every unbreakable vertex must belong to an allowed list $U$. The two results which we expect to be most generally applicable are that (1) TRVB is polynomially solvable when breakable vertices are restricted to have degree at most $3$; and (2) for any $k \\ge 4$, TRVB is NP-complete when the given multigraph is restricted to be planar and to consist entirely of degree-$k$ breakable vertices. To demonstrate the use of TRVB, we give a simple proof of the known result that Hamiltonicity in max-degree-$3$ square grid graphs is NP-hard.\n We also demonstrate a connection between TRVB and the Hypergraph Spanning Tree problem. This connection allows us to show that the Hypergraph Spanning Tree problem in $k$-uniform $2$-regular hypergraphs is NP-complete for any $k \\ge 4$, even when the incidence graph of the hypergraph is planar.\n ', "\n In this paper, we prove that optimally solving an $n \\times n \\times n$ Rubik's Cube is NP-complete by reducing from the Hamiltonian Cycle problem in square grid graphs. This improves the previous result that optimally solving an $n \\times n \\times n$ Rubik's Cube with missing stickers is NP-complete. We prove this result first for the simpler case of the Rubik's Square---an $n \\times n \\times 1$ generalization of the Rubik's Cube---and then proceed with a similar but more complicated proof for the Rubik's Cube case.\n ", '\n This paper addresses the problem of finding minimum forcing sets in origami. The origami material folds flat along straight lines called creases that can be labeled as mountains or valleys. A forcing set is a subset of creases that force all the other creases to fold according to their labels. 
The result is a flat folding of the origami material. In this paper we develop a linear time algorithm that finds minimum forcing sets in one dimensional origami.\n ', '\n We study the complexity of symmetric assembly puzzles: given a collection of simple polygons, can we translate, rotate, and possibly flip them so that their interior-disjoint union is line symmetric? On the negative side, we show that the problem is strongly NP-complete even if the pieces are all polyominos. On the positive side, we show that the problem can be solved in polynomial time if the number of pieces is a fixed constant.\n ', "\n A conflict-free k-coloring of a graph assigns one of k different colors to some of the vertices such that, for every vertex v, there is a color that is assigned to exactly one vertex among v and v's neighbors. Such colorings have applications in wireless networking, robotics, and geometry, and are well-studied in graph theory. Here we study the natural problem of the conflict-free chromatic number chi_CF(G) (the smallest k for which conflict-free k-colorings exist). We provide results both for closed neighborhoods N[v], for which a vertex v is a member of its neighborhood, and for open neighborhoods N(v), for which vertex v is not a member of its neighborhood.\n For closed neighborhoods, we prove the conflict-free variant of the famous Hadwiger Conjecture: If an arbitrary graph G does not contain K_{k+1} as a minor, then chi_CF(G) <= k. For planar graphs, we obtain a tight worst-case bound: three colors are sometimes necessary and always sufficient. We also give a complete characterization of the computational complexity of conflict-free coloring. Deciding whether chi_CF(G)<= 1 is NP-complete for planar graphs G, but polynomial for outerplanar graphs. Furthermore, deciding whether chi_CF(G)<= 2 is NP-complete for planar graphs G, but always true for outerplanar graphs. For the bicriteria problem of minimizing the number of colored vertices subject to a given bound k on the number of colors, we give a full algorithmic characterization in terms of complexity and approximation for outerplanar and planar graphs.\n For open neighborhoods, we show that every planar bipartite graph has a conflict-free coloring with at most four colors; on the other hand, we prove that for k in {1,2,3}, it is NP-complete to decide whether a planar bipartite graph has a conflict-free k-coloring. Moreover, we establish that any general} planar graph has a conflict-free coloring with at most eight colors.\n ", '\n We prove the computational intractability of rotating and placing $n$ square tiles into a $1 \\times n$ array such that adjacent tiles are compatible--either equal edge colors, as in edge-matching puzzles, or matching tab/pocket shapes, as in jigsaw puzzles. Beyond basic NP-hardness, we prove that it is NP-hard even to approximately maximize the number of placed tiles (allowing blanks), while satisfying the compatibility constraint between nonblank tiles, within a factor of 0.9999999851. (On the other hand, there is an easy $1 \\over 2$-approximation.) This is the first (correct) proof of inapproximability for edge-matching and jigsaw puzzles. Along the way, we prove NP-hardness of distinguishing, for a directed graph on $n$ nodes, between having a Hamiltonian path (length $n-1$) and having at most $0.999999284 (n-1)$ edges that form a vertex-disjoint union of paths. 
We use this gap hardness and gap-preserving reductions to establish similar gap hardness for $1 \\times n$ jigsaw and edge-matching puzzles.\n ', '\n We classify the computational complexity of the popular video games Portal and Portal 2. We isolate individual mechanics of the game and prove NP-hardness, PSPACE-completeness, or (pseudo)polynomiality depending on the specific game mechanics allowed. One of our proofs generalizes to prove NP-hardness of many other video games such as Half-Life 2, Halo, Doom, Elder Scrolls, Fallout, Grand Theft Auto, Left 4 Dead, Mass Effect, Deus Ex, Metal Gear Solid, and Resident Evil.\n These results build on the established literature on the complexity of video games.\n ', "\n Fully Homomorphic Encryption (FHE) allows computing on encrypted data, enabling secure offloading of computation to untrusted servers. Though it provides ideal security, FHE is expensive when executed in software, 4 to 5 orders of magnitude slower than computing on unencrypted data. These overheads are a major barrier to FHE's widespread adoption. We present F1, the first FHE accelerator that is programmable, i.e., capable of executing full FHE programs. F1 builds on an in-depth architectural analysis of the characteristics of FHE computations that reveals acceleration opportunities. F1 is a wide-vector processor with novel functional units deeply specialized to FHE primitives, such as modular arithmetic, number-theoretic transforms, and structured permutations. This organization provides so much compute throughput that data movement becomes the bottleneck. Thus, F1 is primarily designed to minimize data movement. The F1 hardware provides an explicitly managed memory hierarchy and mechanisms to decouple data movement from execution. A novel compiler leverages these mechanisms to maximize reuse and schedule off-chip and on-chip data movement. We evaluate F1 using cycle-accurate simulations and RTL synthesis. F1 is the first system to accelerate complete FHE programs and outperforms state-of-the-art software implementations by gmean 5400x and by up to 17000x. These speedups counter most of FHE's overheads and enable new applications, like real-time private deep learning in the cloud.\n ", '\n In this paper, we consider the problem of designing Differentially Private (DP) algorithms for Stochastic Convex Optimization (SCO) on heavy-tailed data. The irregularity of such data violates some key assumptions used in almost all existing DP-SCO and DP-ERM methods, resulting in failure to provide the DP guarantees. To better understand this type of challenge, we provide in this paper a comprehensive study of DP-SCO under various settings. First, we consider the case where the loss function is strongly convex and smooth. For this case, we propose a method based on the sample-and-aggregate framework, which has an excess population risk of $\\tilde{O}(\\frac{d^3}{nε^4})$ (after omitting other factors), where $n$ is the sample size and $d$ is the dimensionality of the data. Then, we show that with some additional assumptions on the loss functions, it is possible to reduce the \\textit{expected} excess population risk to $\\tilde{O}(\\frac{ d^2}{ nε^2 })$. To lift these additional conditions, we also provide a gradient smoothing and trimming based scheme to achieve excess population risks of $\\tilde{O}(\\frac{ d^2}{nε^2})$ and $\\tilde{O}(\\frac{d^\\frac{2}{3}}{(nε^2)^\\frac{1}{3}})$ for strongly convex and general convex loss functions, respectively, \\textit{with high probability}. 
Experiments suggest that our algorithms can effectively deal with the challenges caused by data irregularity.\n ', "\n Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today's most important applications and propose a plan for the development of a DBOS proof of concept.\n ", '\n Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine granularity tasks make them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.\n ', '\n Mapping road networks is currently both expensive and labor-intensive. High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity. We show that these segmentation methods have high error rates because noisy CNN outputs are difficult to correct. We propose RoadTracer, a new method to automatically construct accurate road network maps from aerial images. RoadTracer uses an iterative search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We compare our approach with a segmentation method on fifteen cities, and find that at a 5% error rate, RoadTracer correctly captures 45% more junctions across these cities.\n ', "\n This paper introduces the CondorJ2 cluster management system. 
Traditionally, cluster management systems such as Condor employ a process-oriented approach with little or no use of modern database system technology. In contrast, CondorJ2 employs a data-centric, 3-tier web-application architecture for all system functions (e.g., job submission, monitoring and scheduling; node configuration, monitoring and management, etc.) except for job execution. Employing a data-oriented approach allows the core challenge (i.e., managing and coordinating a large set of distributed computing resources) to be transformed from a relatively low-level systems problem into a more abstract, higher-level data management problem. Preliminary results suggest that CondorJ2's use of standard 3-tier software represents a significant step forward to the design and implementation of large clusters (1,000 to 10,000 nodes).\n ", '\n This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.\n ', '\n A group of senior database researchers gathers every few years to assess the state of database research and to point out problem areas that deserve additional focus. This report summarizes the discussion and conclusions of the sixth ad-hoc meeting held May 4-6, 2003 in Lowell, Mass. It observes that information management continues to be a critical component of most complex software systems. It recommends that database researchers increase focus on: integration of text, data, code, and streams; fusion of information from heterogeneous data sources; reasoning about uncertain data; unsupervised data mining for interesting correlations; information privacy; and self-adaptation and repair.\n ', '\n The database research community is rightly proud of success in basic research, and its remarkable record of technology transfer. Now the field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data. The database research community should embrace a broader research agenda -- broadening the definition of database management to embrace all the content of the Web and other online data stores, and rethinking our fundamental assumptions in light of technology shifts. To accelerate this transition, we recommend changing the way research results are evaluated and presented. In particular, we advocate encouraging more speculative and long-range work, moving conferences to a poster format, and publishing all research literature on the Web.\n ', '\n Many application domains, spanning from computational photography to medical imaging, require recovery of high-fidelity images from noisy, incomplete or partial/compressed measurements. 
State of the art methods for solving these inverse problems combine deep learning with iterative model-based solvers, a concept known as deep algorithm unfolding. By combining a-priori knowledge of the forward measurement model with learned (proximal) mappings based on deep networks, these methods yield solutions that are both physically feasible (data-consistent) and perceptually plausible. However, current proximal mappings only implicitly learn such image priors. In this paper, we propose to make these image priors fully explicit by embedding deep generative models in the form of normalizing flows within the unfolded proximal gradient algorithm. We demonstrate that the proposed method outperforms competitive baselines on various image recovery tasks, spanning from image denoising to inpainting and deblurring.\n ', '\n We study the low-rank phase retrieval problem, where we try to recover a $d_1\\times d_2$ low-rank matrix from a series of phaseless linear measurements. This is a fourth-order inverse problem, as we are trying to recover factors of matrix that have been put through a quadratic nonlinearity after being multiplied together.\n We propose a solution to this problem using the recently introduced technique of anchored regression. This approach uses two different types of convex relaxations: we replace the quadratic equality constraints for the phaseless measurements by a search over a polytope, and enforce the rank constraint through nuclear norm regularization. The result is a convex program that works in the space of $d_1 \\times d_2$ matrices.\n We analyze two specific scenarios. In the first, the target matrix is rank-$1$, and the observations are structured to correspond to a phaseless blind deconvolution. In the second, the target matrix has general rank, and we observe the magnitudes of the inner products against a series of independent Gaussian random matrices. In each of these problems, we show that the anchored regression returns an accurate estimate from a near-optimal number of measurements given that we have access to an anchor matrix of sufficient quality. We also show how to create such an anchor in the phaseless blind deconvolution problem, again from an optimal number of measurements, and present a partial result in this direction for the general rank problem.\n ', '\n This work considers the problem of super-resolution. The goal is to resolve a Dirac distribution from knowledge of its discrete, low-pass, Fourier measurements. Classically, such problems have been dealt with parameter estimation methods. Recently, it has been shown that convex-optimization based formulations facilitate a continuous time solution to the super-resolution problem. Here we treat super-resolution from low-pass measurements in Phase Space. The Phase Space transformation parametrically generalizes a number of well known unitary mappings such as the Fractional Fourier, Fresnel, Laplace and Fourier transforms. Consequently, our work provides a general super- resolution strategy which is backward compatible with the usual Fourier domain result. We consider low-pass measurements of Dirac distributions in Phase Space and show that the super-resolution problem can be cast as Total Variation minimization. 
Remarkably, even though our setting is quite general, the bounds on the minimum separation distance of Dirac distributions are comparable to existing methods.\n ', '\n The goal of subspace learning is to find a $k$-dimensional subspace of $\\mathbb{R}^d$, such that the expected squared distance between instance vectors and the subspace is as small as possible. In this paper we study subspace learning in a partial information setting, in which the learner can only observe $r \\le d$ attributes from each instance vector. We propose several efficient algorithms for this task, and analyze their sample complexity.\n ', '\n This paper develops a novel framework for phase retrieval, a problem which arises in X-ray crystallography, diffraction imaging, astronomical imaging and many other applications. Our approach combines multiple structured illuminations together with ideas from convex programming to recover the phase from intensity measurements, typically from the modulus of the diffracted wave. We demonstrate empirically that any complex-valued object can be recovered from the knowledge of the magnitude of just a few diffracted patterns by solving a simple convex optimization problem inspired by the recent literature on matrix completion. More importantly, we also demonstrate that our noise-aware algorithms are stable in the sense that the reconstruction degrades gracefully as the signal-to-noise ratio decreases. Finally, we introduce some theory showing that one can design very simple structured illumination patterns such that three diffracted figures uniquely determine the phase of the object we wish to recover.\n ', '\n Sparse modeling is a powerful framework for data analysis and processing. Traditionally, encoding in this framework is performed by solving an L1-regularized linear regression problem, commonly referred to as Lasso or Basis Pursuit. In this work we combine the sparsity-inducing property of the Lasso model at the individual feature level, with the block-sparsity property of the Group Lasso model, where sparse groups of features are jointly encoded, obtaining a sparsity pattern hierarchically structured. This results in the Hierarchical Lasso (HiLasso), which shows important practical modeling advantages. We then extend this approach to the collaborative case, where a set of simultaneously coded signals share the same sparsity pattern at the higher (group) level, but not necessarily at the lower (inside the group) level, obtaining the collaborative HiLasso model (C-HiLasso). Such signals then share the same active groups, or classes, but not necessarily the same active set. This model is very well suited for applications such as source identification and separation. An efficient optimization procedure, which guarantees convergence to the global optimum, is developed for these new models. The underlying presentation of the new framework and optimization approach is complemented with experimental examples and theoretical results regarding recovery guarantees for the proposed models.\n ', '\n Motivation: Array Comparative Genomic Hybridization (aCGH) is used to scan the entire genome for variations in DNA copy number. A central task in the analysis of aCGH data is the segmentation into groups of probes sharing the same DNA copy number. Some well known segmentation methods suffer from very long running times, preventing interactive data analysis. 
Results: We suggest a new segmentation method based on wavelet decomposition and thresholding, which detects significant breakpoints in the data. Our algorithm is over 1,000 times faster than leading approaches, with similar performance. Another key advantage of the proposed method is its simplicity and flexibility. Due to its intuitive structure it can be easily generalized to incorporate several types of side information. Here we consider two extensions which include side information indicating the reliability of each measurement, and compensating for a changing variability in the measurement noise. The resulting algorithm outperforms existing methods, both in terms of speed and performance, when applied on real high density CGH data. Availability: Implementation is available under software tab at: http://www.ee.technion.ac.il/Sites/People/YoninaEldar/ Contact: yonina@ee.technion.ac.il\n ', '\n The problem addressed is to design a detector which is maximally sensitive to specific quantum states. Here we concentrate on quantum state detection using the worst-case a posteriori probability of detection as the design criterion. This objective is equivalent to asking the question: if the detector declares that a specific state is present, what is the probability of that state actually being present? We show that maximizing this worst-case probability (maximizing the smallest possible value of this probability) is a quasiconvex optimization over the matrices of the POVM (positive operator valued measure) which characterize the measurement apparatus. We also show that with a given POVM, the optimization is quasiconvex in the matrix which characterizes the Kraus operator sum representation (OSR) in a fixed basis. We use Lagrange Duality Theory to establish the optimality conditions for both deterministic and randomized detection. We also examine the special case of detecting a single pure state. Numerical aspects of using convex optimization for quantum state detection are also discussed.\n ', '\n As deep neural network (DNN) models grow ever-larger, they can achieve higher accuracy and solve more complex problems. This trend has been enabled by an increase in available compute power; however, efforts to continue to scale electronic processors are impeded by the costs of communication, thermal management, power delivery and clocking. To improve scalability, we propose a digital optical neural network (DONN) with intralayer optical interconnects and reconfigurable input values. The near path-length-independence of optical energy consumption enables information locality between a transmitter and arbitrarily arranged receivers, which allows greater flexibility in architecture design to circumvent scaling limitations. In a proof-of-concept experiment, we demonstrate optical multicast in the classification of 500 MNIST images with a 3-layer, fully-connected network. We also analyze the energy consumption of the DONN and find that optical data transfer is beneficial over electronics when the spacing of computational units is on the order of >10 micrometers.\n ', '\n High-performance and safety-critical system architects must accurately evaluate the application-level silent data corruption (SDC) rates of processors to soft errors. Such an evaluation requires error propagation all the way from particle strikes on low-level state up to the program output. 
Existing approaches that rely on low-level simulations with fault injection cannot evaluate full applications because of their slow speeds, while application-level accelerated fault testing in accelerated particle beams is often impractical. We present a new two-level methodology for application resilience evaluation that overcomes these challenges. The proposed approach decomposes application failure rate estimation into (1) identifying how particle strikes in low-level unprotected state manifest at the architecture-level, and (2) measuring how such architecture-level manifestations propagate to the program output. We demonstrate the effectiveness of this approach on GPU architectures. We also show that using just one of the two steps can overestimate SDC rates and produce different trends---the composition of the two is needed for accurate reliability modeling.\n ', '\n A recent trend in DNN development is to extend the reach of deep learning applications to platforms that are more resource and energy constrained, e.g., mobile devices. These endeavors aim to reduce the DNN model size and improve the hardware processing efficiency, and have resulted in DNNs that are much more compact in their structures and/or have high data sparsity. These compact or sparse models are different from the traditional large ones in that there is much more variation in their layer shapes and sizes, and often require specialized hardware to exploit sparsity for performance improvement. Thus, many DNN accelerators designed for large DNNs do not perform well on these models. In this work, we present Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs. To deal with the widely varying layer shapes and sizes, it introduces a highly flexible on-chip network, called hierarchical mesh, that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, which improves the utilization of the computation resources. Furthermore, Eyeriss v2 can process sparse data directly in the compressed domain for both weights and activations, and therefore is able to improve both processing speed and energy efficiency with sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than the original Eyeriss running MobileNet. We also present an analysis methodology called Eyexam that provides a systematic way of understanding the performance limits for DNN processors as a function of specific characteristics of the DNN model and accelerator design; it applies these characteristics as sequential steps to increasingly tighten the bound on the performance limits.\n ', '\n Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator applied during inference. 
Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to the multiplier array, where they are extensively reused. In addition, the accumulation of multiplication products are performed in a novel accumulator array. Our results show that on contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.\n ', '\n Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.\n This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.\n The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.\n ', '\n Computer vision enables a wide range of applications in robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy and/or latency concerns. Accordingly, energy-efficient embedded vision hardware delivering real-time and robust performance is crucial. While deep learning is gaining popularity in several computer vision algorithms, a significant energy consumption difference exists compared to traditional hand-crafted approaches. In this paper, we provide an in-depth analysis of the computation, energy and accuracy trade-offs between learned features such as deep Convolutional Neural Networks (CNN) and hand-crafted features such as Histogram of Oriented Gradients (HOG). This analysis is supported by measurements from two chips that implement these algorithms. 
Our goal is to understand the source of the energy discrepancy between the two approaches and to provide insight about the potential areas where CNNs can be improved and eventually approach the energy-efficiency of HOG while maintaining its outstanding performance accuracy.\n ', '\n Machine learning plays a critical role in extracting meaningful information out of the zettabytes of sensor data collected every day. For some applications, the goal is to analyze and understand the data to identify trends (e.g., surveillance, portable/wearable electronics); in other applications, the goal is to take immediate action based on the data (e.g., robotics/drones, self-driving cars, smart Internet of Things). For many of these applications, local embedded processing near the sensor is preferred over the cloud due to privacy or latency concerns, or limitations in the communication bandwidth. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to throughput and accuracy requirements. Furthermore, flexibility is often required such that the processing can be adapted for different applications or environments (e.g., update the weights and model in the classifier). In many applications, machine learning often involves transforming the input data into a higher dimensional space, which, along with programmable weights, increases data movement and consequently energy consumption. In this paper, we will discuss how these challenges can be addressed at various levels of hardware design ranging from architecture, hardware-friendly algorithms, mixed-signal circuits, and advanced technologies (including memories and sensors).\n ', '\n Next generation optoelectronic systems will require the efficient transfer of optical signals between many discrete photonic components integrated onto a single substrate. While modern assembly processes can easily integrate thousands of electrical components onto a single board, photonic assembly is far more challenging due to the wavelength-scale alignment tolerances required. Here we address this problem by introducing an alignment-free photonic coupler robust to x, y, z displacement and angular misalignment. This alignment-free coupler engineers a translationally-invariant evanescent interaction between waveguides by intersecting them at an angle, which enables a lateral and angular alignment tolerance fundamentally larger than non-evanescent approaches such as edge coupling. We show that our approach can function as a universal photonic connector interfacing photonic integrated circuits and microchiplets across different platforms. As a potential use case, we describe a new type of self-aligning photonic circuit board enabled by our approach that greatly simplifies assembly of complex optoelectronic systems.\n ', '\n Group convolutions and cross-correlations, which are equivariant to the actions of group elements, are commonly used in mathematics to analyze or take advantage of symmetries inherent in a given problem setting. Here, we provide efficient quantum algorithms for performing linear group convolutions and cross-correlations on data stored as quantum states. Runtimes for our algorithms are logarithmic in the dimension of the group thus offering an exponential speedup compared to classical algorithms when input data is provided as a quantum state and linear operations are well conditioned. 
Motivated by the rich literature on quantum algorithms for solving algebraic problems, our theoretical framework opens a path for quantizing many algorithms in machine learning and numerical methods that employ group operations.\n ', '\n Component errors limit the scaling of multiport interferometers based on MZI meshes. These errors arise because imperfect MZIs cannot be perfectly programmed to the cross state. Here, we introduce two modified mesh architectures that overcome this limitation: (1) a 3-splitter MZI for generic errors, and (2) a broadband MZI+Crossing design for correlated errors. Because these designs allow for perfect realization of the cross state, the matrix fidelity no longer decreases with mesh size, allowing scaling to arbitrarily large meshes. The proposed architectures support progressive self-configuration, are more compact than previous MZI-doubling schemes, and do not require additional phase shifters. This eliminates a major obstacle to the development of very-large-scale linear photonic circuits.\n ', "\n Quantum sensors based on spin defect ensembles have seen rapid development in recent years, with a wide array of target applications. Historically, these sensors have used optical methods to prepare or read out quantum states. However, these methods are limited to optically-polarizable spin defects, and the spin ensemble size is typically limited by the available optical power or acceptable optical heat load. We demonstrate a solid-state sensor employing a non-optical state preparation technique, which harnesses thermal population imbalances induced by the defect's zero-field splitting. Readout is performed using the recently-demonstrated microwave cavity readout technique, resulting in a sensor architecture that is entirely non-optical and broadly applicable to all solid-state paramagnetic defects with a zero-field splitting. The implementation in this work uses Cr$^{3+}$ defects in a sapphire (Al$_2$O$_3$) crystal and a simple microwave architecture where the host crystal also serves as the high quality-factor resonator. This approach yields a near-unity filling factor and high single-spin-photon coupling, producing a magnetometer with a broadband sensitivity of 9.7 pT/$\\sqrt{\\text{Hz}}$.\n ", '\n Magnetic interaction between photons and dipoles is essential in electronics, sensing, spectroscopy, and quantum computing. However, its weak strength often requires resonators to confine and store the photons. Here, we present mode engineering techniques to create resonators with ultrasmall mode volume and ultrahigh quality factor. In particular, we show that it is possible to achieve an arbitrarily small mode volume only limited by materials or fabrication with minimal Q degradation. We compare mode-engineered cavities in a trade-off space and show that the magnetic interaction can be strengthened more than $10^{16}$ times compared to free space. These methods enable new applications from high-cooperativity microwave-spin coupling in quantum computing or compact electron paramagnetic resonance (EPR) sensors to fundamental science such as dark matter searches.\n ', "\n We introduce a constructive algorithm for universal linear electromagnetic transformations between the $N$ input and $N$ output modes of a dielectric slab. The approach uses out-of-plane phase modulation programmed down to $N^2$ degrees of freedom. 
The total area of these modulators equals that of the entire slab: our scheme satisfies the minimum area constraint for programmable linear optical transformations. We also present error correction schemes that enable high-fidelity unitary transformations at large $N$. This ``programmable multimode interferometer'' (ProMMI) thus translates the algorithmic simplicity of Mach-Zehnder meshes into a holographically programmed slab, yielding DoF-limited compactness and error tolerance while eliminating the dominant sidewall-related optical losses and directional-coupler-related patterning challenges.\n ", '\n Realistic multiport interferometers (beamsplitter meshes) are sensitive to component imperfections, and this sensitivity increases with size. Self-configuration techniques can be employed to correct these imperfections, but not all techniques are equal. This paper highlights the importance of algorithmic stability in self-configuration. Naive approaches based on sequentially setting matrix elements are unstable and perform poorly for large meshes, while techniques based on power ratios perform well in all cases, even in the presence of large errors. Based on this insight, we propose a self-configuration scheme for triangular meshes that requires only external detectors and works without prior knowledge of the component imperfections. This scheme extends to the rectangular mesh by adding a single array of detectors along the diagonal.\n ', '\n Multiport interferometers based on integrated beamsplitter meshes are widely used in photonic technologies. While the rectangular mesh is favored for its compactness and uniformity, its geometry resists conventional self-configuration approaches, which are essential to programming large meshes in the presence of fabrication error. Here, we present a new configuration algorithm, related to the $2\\times 2$ block decomposition of a unitary matrix, that overcomes this limitation. Our proposed algorithm is robust to errors, requires no prior knowledge of the process variations, and relies only on external sources and detectors. We show that self-configuration using this technique reduces the effect of fabrication errors by the same quadratic factor observed in triangular meshes. This relaxes a significant limit to the size of multiport interferometers, removing a major roadblock to the scaling of optical quantum and machine-learning hardware.\n ', '\n Group-IV color centers in diamond are a promising light-matter interface for quantum networking devices. The negatively charged tin-vacancy center (SnV) is particularly interesting, as its large spin-orbit coupling offers strong protection against phonon dephasing and robust cyclicity of its optical transitions towards spin-photon entanglement schemes. Here, we demonstrate multi-axis coherent control of the SnV spin qubit via an all-optical stimulated Raman drive between the ground and excited states. We use coherent population trapping and optically driven electronic spin resonance to confirm coherent access to the qubit at 1.7 K, and obtain spin Rabi oscillations at a rate of $Ω/2π$=3.6(1) MHz. All-optical Ramsey interferometry reveals a spin dephasing time of $T_2^*$=1.3(3)$μ$s and two-pulse dynamical decoupling already extends the spin coherence time to $T_2$=0.33(14) ms. 
Combined with transform-limited photons and integration into photonic nanostructures, our results make the SnV a competitive spin-photon building block for quantum networks.\n ', '\n Recent advances in photonic integrated circuits (PICs) have enabled a new generation of "programmable many-mode interferometers" (PMMIs) realized by cascaded Mach Zehnder Interferometers (MZIs) capable of universal linear-optical transformations on N input-output optical modes. PMMIs serve critical functions in photonic quantum information processing, quantum-enhanced sensor networks, machine learning and other applications. However, PMMI implementations reported to date rely on thermo-optic phase shifters, which limit applications due to slow response times and high power consumption. Here, we introduce a large-scale PMMI platform, based on a 200 mm CMOS process, that uses aluminum nitride (AlN) piezo-optomechanical actuators coupled to silicon nitride (SiN) waveguides, enabling low-loss propagation with phase modulation at greater than 100 MHz in the visible to near-infrared wavelengths. Moreover, the vanishingly low holding-power consumption of the piezo-actuators enables these PICs to operate at cryogenic temperatures, paving the way for a fully integrated device architecture for a range of quantum applications.\n ', '\n With the ability to transfer and process quantum information, large-scale quantum networks will enable a suite of fundamentally new applications, from quantum communications to distributed sensing, metrology, and computing. This perspective reviews requirements for quantum network nodes and color centers in diamond as suitable node candidates. We give a brief overview of state-of-the-art quantum network experiments employing color centers in diamond, and discuss future research directions, focusing in particular on the control and coherence of qubits that distribute and store entangled states, and on efficient spin-photon interfaces. We discuss a route towards large-scale integrated devices combining color centers in diamond with other photonic materials and give an outlook towards realistic future quantum network protocol implementations and applications.\n ', '\n A quantum random access memory (qRAM) is considered an essential computing unit to enable polynomial speedups in quantum information processing. Proposed implementations include using neutral atoms and superconducting circuits to construct a binary tree, but these systems still require demonstrations of the elementary components. Here, we propose a photonic integrated circuit (PIC) architecture integrated with solid-state memories as a viable platform for constructing a qRAM. We also present an alternative scheme based on quantum teleportation and extend it to the context of quantum networks. Both implementations rely on already demonstrated components: electro-optic modulators, a Mach-Zehnder interferometer (MZI) network, and nanocavities coupled to artificial atoms for spin-based memory writing and retrieval. Our approaches furthermore benefit from built-in error-detection based on photon heralding. Detailed theoretical analysis of the qRAM efficiency and query fidelity shows that our proposal presents viable near-term designs for a general qRAM.\n ', "\n Increasing demand for algorithms that can learn quickly and efficiently has led to a surge of development within the field of artificial intelligence (AI). 
An important paradigm within AI is reinforcement learning (RL), where agents interact with environments by exchanging signals via a communication channel. Agents can learn by updating their behaviour based on obtained feedback. The crucial question for practical applications is how fast agents can learn to respond correctly. An essential figure of merit is therefore the learning time. While various works have made use of quantum mechanics to speed up the agent's decision-making process, a reduction in learning time has not been demonstrated yet. Here we present a RL experiment where the learning of an agent is boosted by utilizing a quantum communication channel with the environment. We further show that the combination with classical communication enables the evaluation of such an improvement, and additionally allows for optimal control of the learning progress. This novel scenario is therefore demonstrated by considering hybrid agents, that alternate between rounds of quantum and classical communication. We implement this learning protocol on a compact and fully tunable integrated nanophotonic processor. The device interfaces with telecom-wavelength photons and features a fast active feedback mechanism, allowing us to demonstrate the agent's systematic quantum advantage in a setup that could be readily integrated within future large-scale quantum communication networks.\n ", '\n Programmable photonic circuits of reconfigurable interferometers can be used to implement arbitrary operations on optical modes, facilitating a flexible platform for accelerating tasks in quantum simulation, signal processing, and artificial intelligence. A major obstacle to scaling up these systems is static fabrication error, where small component errors within each device accrue to produce significant errors within the circuit computation. Mitigating this error usually requires numerical optimization dependent on real-time feedback from the circuit, which can greatly limit the scalability of the hardware. Here we present a deterministic approach to correcting circuit errors by locally correcting hardware errors within individual optical gates. We apply our approach to simulations of large scale optical neural networks and infinite impulse response filters implemented in programmable photonics, finding that they remain resilient to component error well beyond modern day process tolerances. Our results highlight a new avenue for scaling up programmable photonics to hundreds of modes within current day fabrication processes.\n ', '\n Quantum emitters in diamond are leading optically-accessible solid-state qubits. Among these, Group IV-vacancy defect centers have attracted great interest as coherent and stable optical interfaces to long-lived spin states. Theory indicates that their inversion symmetry provides first-order insensitivity to stray electric fields, a common limitation for optical coherence in any host material. Here we experimentally quantify this electric field dependence via an external electric field applied to individual tin-vacancy (SnV) centers in diamond. These measurements reveal that the permanent electric dipole moment and polarizability are at least four orders of magnitude smaller than for the diamond nitrogen vacancy (NV) centers, representing the first direct measurement of the inversion symmetry protection of a Group IV defect in diamond. 
Moreover, we show that by modulating the electric-field-induced dipole we can use the SnV as a nanoscale probe of local electric field noise, and we employ this technique to highlight the effect of spectral diffusion on the SnV.\n ', '\n The monolayer transition metal dichalcogenides are an emergent semiconductor platform exhibiting rich excitonic physics with coupled spin-valley degree of freedom and optical addressability. Here, we report a new series of low energy excitonic emission lines in the photoluminescence spectrum of ultraclean monolayer WSe2. These excitonic satellites are composed of three major peaks with energy separations matching known phonons, and appear only with electron doping. They possess homogenous spatial and spectral distribution, strong power saturation, and anomalously long population (> 6 $μ$s) and polarization lifetimes (> 100 ns). Resonant excitation of the free inter- and intra-valley bright trions leads to opposite optical orientation of the satellites, while excitation of the free dark trion resonance suppresses the satellites photoluminescence. Defect-controlled crystal synthesis and scanning tunneling microscopy measurements provide corroboration that these features are dark excitons bound to dilute donors, along with associated phonon replicas. Our work opens opportunities to engineer homogenous single emitters and explore collective quantum optical phenomena using intrinsic donor-bound excitons in ultraclean 2D semiconductors.\n ', '\n We propose a field-based design for dielectric antennas to interface diamond color centers with a Gaussian propagating far field. This antenna design enables an efficient spin-photon interface with a Purcell factor exceeding 400 and a 93% mode overlap to a 0.4 numerical aperture far-field Gaussian mode. The antenna design is robust to fabrication imperfections, such as variations in the dimensions of the dielectric perturbations and the emitter dipole location. The field-based dielectric antenna design provides an efficient free-space interface to closely packed arrays of quantum memories for multiplexed quantum repeaters, arrayed quantum sensors, and modular quantum computers.\n ', '\n Nitrogen vacancy (NV) centers in diamond have emerged as a leading quantum sensor platform, combining exceptional sensitivity with nanoscale spatial resolution by optically detected magnetic resonance (ODMR). Because fluorescence-based ODMR techniques are limited by low photon collection efficiency and modulation contrast, there has been growing interest in infrared (IR)-absorption-based readout of the NV singlet state transition. IR readout can improve contrast and collection efficiency, but it has thus far been limited to long-pathlength geometries in bulk samples due to the small absorption cross section of the NV singlet state. Here, we amplify the IR absorption by introducing a resonant diamond metallodielectric metasurface that achieves a quality factor of Q ~ 1,000. This "plasmonic quantum sensing metasurface" (PQSM) combines localized surface plasmon polariton resonances with long-range Rayleigh-Wood anomaly modes and achieves the desired balance between field localization and sensing volume to optimize spin readout sensitivity. From combined electromagnetic and rate-equation modeling, we estimate a sensitivity below 1 nT/Hz$^{1/2}$ per um$^2$ of sensing area using numbers for present-day NV diamond samples and fabrication techniques. 
The proposed PQSM enables a new form of microscopic ODMR sensing with infrared readout near the spin-projection-noise-limited sensitivity, making it appealing for the most demanding applications such as imaging through scattering tissue and spatially-resolved chemical NMR detection.\n ', '\n Josephson junctions (JJs) are ubiquitous superconducting devices, enabling high sensitivity magnetometers and voltage amplifiers, as well as forming the basis of high performance cryogenic computer and superconducting quantum computers. While JJ performance can be degraded by quasiparticles (QPs) formed from broken Cooper pairs, this phenomenon also opens opportunities to sensitively detect electromagnetic radiation. Here we demonstrate single near-infrared photon detection by coupling photons to the localized surface plasmons of a graphene-based JJ. Using the photon-induced switching statistics of the current-biased JJ, we reveal the critical role of QPs generated by the absorbed photon in the detection mechanism. The photon-sensitive JJ will enable a high-speed, low-power optical interconnect for future JJ-based computing architectures.\n ', '\n Quantum algorithms for both differential equation solving and for machine learning potentially offer an exponential speedup over all known classical algorithms. However, there also exist obstacles to obtaining this potential speedup in useful problem instances. The essential obstacle for quantum differential equation solving is that outputting useful information may require difficult post-processing, and the essential obstacle for quantum machine learning is that inputting the training set is a difficult task just by itself. In this paper, we demonstrate, when combined, these difficulties solve one another. We show how the output of quantum differential equation solving can serve as the input for quantum machine learning, allowing dynamical analysis in terms of principal components, power spectra, and wavelet decompositions. To illustrate this, we consider continuous time Markov processes on epidemiological and social networks. These quantum algorithms provide an exponential advantage over existing classical Monte Carlo methods.\n ', '\n The majority of sources of coherent optical radiation rely on laser oscillators driven by population inversion. Despite their technological importance in communications, medicine, industry, and other fields, it remains a challenge to access the spectral range of 0.1-10 THz (the "terahertz gap"), a frequency band for applications ranging from spectroscopy to security and high-speed wireless communications. Here, we propose a way to produce coherent radiation spanning the THz gap by efficient second-harmonic generation (SHG) in low-loss dielectric structures, starting from technologically mature electronic oscillators (EOs) in the ~100 GHz range. To achieve this goal, we introduce hybrid THz-band dielectric cavity designs that combine (1) extreme field concentration in high-quality-factor resonators with (2) nonlinear materials enhanced by phonon resonances. We theoretically predict conversion efficiencies of >$10^3$ %/W and the potential to bridge the THz gap with 1 W of input power. 
This approach enables efficient, cascaded parametric frequency converters, representing a new generation of light sources extensible into the mid-IR spectrum and beyond.\n ', '\n Heat capacity is an invaluable quantity in condensed matter physics, yet it has been so far experimentally inaccessible in two-dimensional (2D) van der Waals (vdW) materials, owing to their ultra-fast thermal relaxation times and the lack of suitable nano-scale thermometers. Here, we demonstrate a novel thermal relaxation calorimetry scheme that allows the first measurements of the electronic heat capacity of graphene Ce. It is enabled by the grouping of a radio-frequency Johnson noise thermometer, which can measure the electronic temperature Te with a measurement sensitivity of δTe ~ 20 mK, and an ultra-fast photo-mixed optical heater, which can simultaneously modulate Te with a frequency of up to Ω=0.2 THz. This combination allows record sensitive and record fast measurements of the electronic heat capacity Ce < 10^(-19) J/K, with an electronic thermal relaxation time τe < 10^(-13), representing orders of magnitude improvements as compared to previous state-of-the-art calorimeters. These features embody a breakthrough in heat capacity metrology of nano-scale and low-dimensional systems, and provide a new avenue for the investigation of their thermodynamic quantities.\n ', '\n Solid-state quantum emitters (QEs) are fundamental in photonic-based quantum information processing. There is strong interest to develop high-quality QEs in III-nitride semiconductors because of their sophisticated manufacturing driven by large and growing applications in optoelectronics, high voltage power transistors, and microwave amplifiers. Here, we report the generation and direct integration of QEs in an aluminium nitride-based photonic integrated circuit platform. For individual waveguide-integrated QEs, we measure an off-chip count rate exceeding $6 \\times 10^{4}$ counts per second (cps) (saturation rate > $8.6 \\times 10^{4}$ cps). In an unpatterned thin-film sample, we measure antibunching with $g^{(2)}(0) \\sim 0.05$ and photon count rates exceeding $8 \\times 10^{5}$ cps (saturation rate > $1 \\times 10^{6}$ cps). Although spin and detailed optical linewidth measurements are left for future work, these results already show the potential for high-quality QEs monolithically integrated in a wide range of III-nitride device technologies that would enable new quantum device opportunities and industrial scalability.\n ', '\n As deep neural network (DNN) models grow ever-larger, they can achieve higher accuracy and solve more complex problems. This trend has been enabled by an increase in available compute power; however, efforts to continue to scale electronic processors are impeded by the costs of communication, thermal management, power delivery and clocking. To improve scalability, we propose a digital optical neural network (DONN) with intralayer optical interconnects and reconfigurable input values. The near path-length-independence of optical energy consumption enables information locality between a transmitter and arbitrarily arranged receivers, which allows greater flexibility in architecture design to circumvent scaling limitations. In a proof-of-concept experiment, we demonstrate optical multicast in the classification of 500 MNIST images with a 3-layer, fully-connected network. 
We also analyze the energy consumption of the DONN and find that optical data transfer is beneficial over electronics when the spacing of computational units is on the order of >10 micrometers.\n ', '\n Integrating and manipulating the nano-optoelectronic properties of Van der Waals heterostructures can enable unprecedented platforms for photodetection and sensing. The main challenge of infrared photodetectors is to funnel the light into a small nanoscale active area and efficiently convert it into an electrical signal. Here, we overcome all of those challenges in one device, by efficient coupling of a plasmonic antenna to hyperbolic phonon-polaritons in hexagonal-BN to highly concentrate mid-infrared light into a graphene pn-junction. We balance the interplay of the absorption, electrical and thermal conductivity of graphene via the device geometry. This novel approach yields remarkable device performance featuring room temperature high sensitivity (NEP of 82 pW-per-square-root-Hz) and fast rise time of 17 nanoseconds (setup-limited), among others, hence achieving a combination currently not present in the state-of-the-art graphene and commercial mid-infrared detectors. We also develop a multiphysics model that shows excellent quantitative agreement with our experimental results and reveals the different contributions to our photoresponse, thus paving the way for further improvement of these types of photodetectors even beyond mid-infrared range.\n ', '\n We develop a protocol for entanglement generation in the quantum internet that allows a repeater node to use $n$-qubit Greenberger-Horne-Zeilinger (GHZ) projective measurements that can fuse $n$ successfully-entangled {\\em links}, i.e., two-qubit entangled Bell pairs shared across $n$ network edges, incident at that node. Implementing $n$-fusion, for $n \\ge 3$, is in principle not much harder than $2$-fusions (Bell-basis measurements) in solid-state qubit memories. If we allow even $3$-fusions at the nodes, we find---by developing a connection to a modified version of the site-bond percolation problem---that despite lossy (hence probabilistic) link-level entanglement generation, and probabilistic success of the fusion measurements at nodes, one can generate entanglement between end parties Alice and Bob at a rate that stays constant as the distance between them increases. We prove that this powerful network property is not possible to attain with any quantum networking protocol built with Bell measurements and multiplexing alone. We also design a two-party quantum key distribution protocol that converts the entangled states shared between two nodes into a shared secret, at a key generation rate that is independent of the distance between the two parties.\n ', '\n We present a joint theoretical and experimental characterization of thermo-refractive noise in high quality factor ($Q$), small mode volume ($V$) optical microcavities. Analogous to well-studied stability limits imposed by Brownian motion in macroscopic Fabry-Perot resonators, microcavity thermo-refractive noise gives rise to a mode volume-dependent maximum effective quality factor. State-of-the-art fabricated microcavities are found to be within one order of magnitude of this bound. 
We confirm the assumptions of our theory by measuring the noise spectrum of high-$Q/V$ silicon photonic crystal cavities and apply our results to estimate the optimal performance of proposed room temperature, all-optical qubits using cavity-enhanced bulk material nonlinearities.\n ', '\n The past decade has seen tremendous progress in experimentally realizing the building blocks of quantum repeaters. Repeater architectures with multiplexed quantum memories have been proposed to increase entanglement distribution rates, but an open challenge is to maintain entanglement fidelity over long-distance links. Here, we address this with a quantum router architecture comprising many quantum memories connected in a photonic switchboard to broker entanglement flows across quantum networks. We compute the rate and fidelity of entanglement distribution under this architecture using an event-based simulator, finding that the router improves the entanglement fidelity as multiplexing depth increases without a significant drop in the entanglement distribution rate. Specifically, the router permits channel-loss-invariant fidelity, i.e. the same fidelity achievable with lossless links. Furthermore, this scheme automatically prioritizes entanglement flows across the full network without requiring global network information. The proposed architecture uses present-day photonic technology, opening a path to near-term deployable multi-node quantum networks.\n ', "\n We propose an integrated photonics device for mapping qubits encoded in the polarization of a photon onto the spin state of a solid-state defect coupled to a photonic crystal cavity: a `Polarization-Encoded Photon-to-Spin Interface' (PEPSI). We perform a theoretical analysis of the state fidelity's dependence on the device's polarization extinction ratio and atom-cavity cooperativity. Furthermore, we explore the rate-fidelity trade-off through analytical and numerical models. In simulation, we show that our design enables efficient, high fidelity photon-to-spin mapping.\n ", '\n We introduce a method for high-fidelity quantum state transduction between a superconducting microwave qubit and the ground state spin system of a solid-state artificial atom, mediated via an acoustic bus connected by piezoelectric transducers. Applied to present-day experimental parameters for superconducting circuit qubits and diamond silicon vacancy centers in an optimized phononic cavity, we estimate quantum state transduction with fidelity exceeding 99\\% at a MHz-scale bandwidth. By combining the complementary strengths of superconducting circuit quantum computing and artificial atoms, the hybrid architecture provides high-fidelity qubit gates with long-lived quantum memory, high-fidelity measurement, large qubit number, reconfigurable qubit connectivity, and high-fidelity state and gate teleportation through optical quantum networks.\n ', '\n The great promise of quantum computers comes with the dual challenges of building them and finding their useful applications. We argue that these two challenges should be considered together, by co-designing full-stack quantum computer systems along with their applications in order to hasten their development and potential for scientific discovery. In this context, we identify scientific and community needs, opportunities, a sampling of a few use case studies, and significant challenges for the development of quantum computers for science over the next 2--10 years. 
This document is written by a community of university, national laboratory, and industrial researchers in the field of Quantum Information Science and Technology, and is based on a summary from a U.S. National Science Foundation workshop on Quantum Computing held on October 21--22, 2019 in Alexandria, VA.\n ', '\n Just as classical information technology rests on a foundation built of interconnected information-processing systems, quantum information technology (QIT) must do the same. A critical component of such systems is the interconnect, a device or process that allows transfer of information between disparate physical media, for example, semiconductor electronics, individual atoms, light pulses in optical fiber, or microwave fields. While interconnects have been well engineered for decades in the realm of classical information technology, quantum interconnects (QuICs) present special challenges, as they must allow the transfer of fragile quantum states between different physical parts or degrees of freedom of the system. The diversity of QIT platforms (superconducting, atomic, solid-state color center, optical, etc.) that will form a quantum internet poses additional challenges. As quantum systems scale to larger size, the quantum interconnect bottleneck is imminent, and is emerging as a grand challenge for QIT. For these reasons, it is the position of the community represented by participants of the NSF workshop on Quantum Interconnects that accelerating QuIC research is crucial for sustained development of a national quantum science and technology program. Given the diversity of QIT platforms, materials used, applications, and infrastructure required, a convergent research program including partnership between academia, industry and national laboratories is required.\n This document is a summary from a U.S. National Science Foundation supported workshop held on 31 October - 1 November 2019 in Alexandria, VA. Attendees were charged to identify the scientific and community needs, opportunities, and significant challenges for quantum interconnects over the next 2-5 years.\n ', '\n The goal of integrated quantum photonics is to combine components for the generation, manipulation, and detection of non-classical light in a phase stable and efficient platform. Solid-state quantum emitters have recently reached outstanding performance as single photon sources. In parallel, photonic integrated circuits have been advanced to the point that thousands of components can be controlled on a chip with high efficiency and phase stability. Consequently, researchers are now beginning to combine these leading quantum emitters and photonic integrated circuit platforms to realize the best properties of each technology. In this article, we review recent advances in integrated quantum photonics based on such hybrid systems. Although hybrid integration solves many limitations of individual platforms, it also introduces new challenges that arise from interfacing different materials. We review various issues in solid-state quantum emitters and photonic integrated circuits, the hybrid integration techniques that bridge these two systems, and methods for chip-based manipulation of photons and emitters. Finally, we discuss the remaining challenges and future prospects of on-chip quantum photonics with integrated quantum emitters.\n ', "\n A central challenge in developing quantum computers and long-range quantum networks lies in the distribution of entanglement across many individually controllable qubits. 
Colour centres in diamond have emerged as leading solid-state 'artificial atom' qubits, enabling on-demand remote entanglement, coherent control of over 10 ancillae qubits with minute-long coherence times, and memory-enhanced quantum communication. A critical next step is to integrate large numbers of artificial atoms with photonic architectures to enable large-scale quantum information processing systems. To date, these efforts have been stymied by qubit inhomogeneities, low device yield, and complex device requirements. Here, we introduce a process for the high-yield heterogeneous integration of 'quantum micro-chiplets' (QMCs) -- diamond waveguide arrays containing highly coherent colour centres -- with an aluminium nitride (AlN) photonic integrated circuit (PIC). Our process enables the development of a 72-channel defect-free array of germanium-vacancy (GeV) and silicon-vacancy (SiV) colour centres in a PIC. Photoluminescence spectroscopy reveals long-term stable and narrow average optical linewidths of 54 MHz (146 MHz) for GeV (SiV) emitters, close to the lifetime-limited linewidth of 32 MHz (93 MHz). Additionally, inhomogeneities in the individual qubits can be compensated in situ with integrated tuning of the optical frequencies over 100 GHz. The ability to assemble large numbers of nearly indistinguishable artificial atoms into phase-stable PICs provides an architecture toward multiplexed quantum repeaters and general-purpose quantum computers.\n ", '\n Sensitive microwave detectors are critical instruments in radioastronomy, dark matter axion searches, and superconducting quantum information science. The conventional strategy towards higher-sensitivity bolometry is to nanofabricate an ever-smaller device to augment the thermal response. However, this direction is increasingly more difficult to obtain efficient photon coupling and maintain the material properties in a device with a large surface-to-volume ratio. Here we advance this concept to an ultimately thin bolometric sensor based on monolayer graphene. To utilize its minute electronic specific heat and thermal conductivity, we develop a superconductor-graphene-superconductor (SGS) Josephson junction bolometer embedded in a microwave resonator of resonant frequency 7.9 GHz with over 99\\% coupling efficiency. From the dependence of the Josephson switching current on the operating temperature, charge density, input power, and frequency, we demonstrate a noise equivalent power (NEP) of 7 $\\times 10^{-19}$ W/Hz$^{1/2}$, corresponding to an energy resolution of one single photon at 32 GHz and reaching the fundamental limit imposed by intrinsic thermal fluctuation at 0.19 K.\n ', '\n The ability to communicate quantum information over long distances is of central importance in quantum science and engineering. For example, it enables secure quantum key distribution (QKD) relying on fundamental principles that prohibit the "cloning" of unknown quantum states. While QKD is being successfully deployed, its range is currently limited by photon losses and cannot be extended using straightforward measure-and-repeat strategies without compromising its unconditional security. Alternatively, quantum repeaters, which utilize intermediate quantum memory nodes and error correction techniques, can extend the range of quantum channels. However, their implementation remains an outstanding challenge, requiring a combination of efficient and high-fidelity quantum memories, gate operations, and measurements. 
Here we report the experimental realization of memory-enhanced quantum communication. We use a single solid-state spin memory integrated in a nanophotonic diamond resonator to implement asynchronous Bell-state measurements. This enables a four-fold increase in the secret key rate of measurement device independent (MDI)-QKD over the loss-equivalent direct-transmission method while operating megahertz clock rates. Our results represent a significant step towards practical quantum repeaters and large-scale quantum networks.\n ', '\n Spatial light modulators (SLMs) are central to numerous applications ranging from high-speed displays to adaptive optics, structured illumination microscopy, and holography. After decades of advances, SLM arrays based on liquid crystals can now reach large pixel counts exceeding 10^6 with phase-only modulation with a pixel pitch of less than 10 μm and reflectance around 75%. However, the rather slow modulation speed in such SLMs (below hundreds of Hz) presents limitations for many applications. Here we propose an SLM architecture that can achieve high pixel count with high-resolution phase-only modulation at high speed in excess of GHz. The architecture consists of a tunable two-dimensional array of vertically oriented, one-sided microcavities that are tuned through an electro-optic material such as barium titanate (BTO). We calculate that the optimized microcavity design achieves a π phase shift under an applied bias voltage below 10 V, while maintaining nearly constant reflection amplitude. As two model applications, we consider high-speed 2D beam steering as well as beam forming. The outlined design methodology could also benefit future design of spatial light modulators with other specifications (for example amplitude modulators). This high-speed SLM architecture promises a wide range of new applications ranging from fully tunable metasurfaces to optical computing accelerators, high-speed interconnects, true 2D phased array beam steering, and quantum computing with cold atom arrays.\n ', '\n The ability to control excitons in semiconductors underlies numerous proposed applications, from excitonic circuits for computing and communications to polariton condensates to energy transport in photovoltaics. 2D semiconductors are particularly promising for room-temperature applications due to their large exciton binding energy. Their enormous stretchability gives rise to a strain-engineerable bandgap that has been used to induce static exciton flux in predetermined structures. However, dynamic control of exciton flux represents an outstanding challenge. Here, we introduce a method to tune the bandgap of suspended 2D semiconductors by applying a local strain gradient with a nanoscale tip. This strain allows us to locally and reversibly shift the exciton energy and to steer the exciton flux over micron-scale distances, as observed by wide-field imaging and time-resolved photoluminescence spectroscopy. We anticipate that the ability to strongly and dynamically modulate the bandgap of a semiconductor at the nanoscale not only marks an important experimental tool but will also open a broad range of new applications from information processing to energy conversion.\n ', '\n Quantum algorithms for Noisy Intermediate-Scale Quantum (NISQ) machines have recently emerged as new promising routes towards demonstrating near-term quantum advantage (or supremacy) over classical systems. 
In these systems samples are typically drawn from probability distributions which --- under plausible complexity-theoretic conjectures --- cannot be efficiently generated classically. Rather than first define a physical system and then determine computational features of the output state, we ask the converse question: given direct access to the quantum state, what features of the generating system can we efficiently learn? In this work we introduce the Variational Quantum Unsampling (VQU) protocol, a nonlinear quantum neural network approach for verification and inference of near-term quantum circuits outputs. In our approach one can variationally train a quantum operation to unravel the action of an unknown unitary on a known input state; essentially learning the inverse of the black-box quantum dynamics. While the principle of our approach is platform independent, its implementation will depend on the unique architecture of a specific quantum processor. Here, we experimentally demonstrate the VQU protocol on a quantum photonic processor. Alongside quantum verification, our protocol has broad applications; including optimal quantum measurement and tomography, quantum sensing and imaging, and ansatz validation.\n ', '\n Photonic integrated circuits (PICs) enable miniaturization of optical quantum circuits because several optic and electronic functionalities can be added on the same chip. Single photon emitters (SPEs) are central building blocks for such quantum circuits and several approaches have been developed to interface PICs with a host material containing SPEs. SPEs embedded in 2D transition metal dichalcogenides have unique properties that make them particularly appealing as PIC-integrated SPEs. They can be easily interfaced with PICs and stacked together to create complex heterostructures. Since the emitters are embedded in a monolayer there is no total internal reflection, enabling very high light extraction efficiencies without the need of any additional processing to allow efficient single photon transfer between the host and the underlying PIC. Arrays of 2D-based SPEs can moreover be fabricated deterministically through STEM patterning or strain engineering. Finally, 2D materials grown with high wafer-scale uniformity are becoming more readily available, such that they can be matched at the wafer level with underlying PICs. Here we report on the integration of a WSe$_{2}$ monolayer onto a Silicon Nitride (SiN) chip. We demonstrate the coupling of SPEs with the guided mode of a SiN waveguide and study how the on-chip single photon extraction can be maximized by interfacing the 2D-SPE with an integrated dielectric cavity. Our approach allows the use of optimized PIC platforms without the need for additional processing in the host material. In combination with improved wafer-scale CVD growth of 2D materials, this approach provides a promising route towards scalable quantum photonic chips.\n ', '\n Nanodiamonds hosting colour centres are a promising material platform for various quantum technologies. The fabrication of non-aggregated and uniformly-sized nanodiamonds with systematic integration of single quantum emitters has so far been lacking. Here, we present a top-down fabrication method to produce 30.0$\\pm$5.4 nm uniformly-sized single-crystal nanodiamonds by block copolymer self-assembled nanomask patterning together with directional and isotropic reactive ion etching. We show detected emission from bright single nitrogen vacancy centres hosted in the fabricated nanodiamonds. 
The lithographically precise patterning of large areas of diamond by self-assembled masks and their release into uniformly sized nanodiamonds open up new possibilities for quantum information processing and sensing.\n ', '\n Recent success in deep neural networks has generated strong interest in hardware accelerators to improve speed and energy consumption. This paper presents a new type of photonic accelerator based on coherent detection that is scalable to large ($N \\gtrsim 10^6$) networks and can be operated at high (GHz) speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using the massive spatial multiplexing enabled by standard free-space optical components. In contrast to previous approaches, both weights and inputs are optically encoded so that the network can be reprogrammed and trained on the fly. Simulations of the network using models for digit- and image-classification reveal a "standard quantum limit" for optical neural networks, set by photodetector shot noise. This bound, which can be as low as 50 zJ/MAC, suggests performance below the thermodynamic (Landauer) limit for digital irreversible computation is theoretically possible in this device. The proposed accelerator can implement both fully-connected and convolutional networks. We also present a scheme for back-propagation and training that can be performed in the same hardware. This architecture will enable a new class of ultra-low-energy processors for deep learning.\n ', '\n We consider the free carrier dispersion effect in a semiconductor nanocavity in the limit of discrete photoexcited electron-hole pairs. This analysis reveals the possibility of ultrafast, incoherent transduction and gain from a single photon signal to a strong coherent probe field. Homodyne detection of the displaced probe field enables a new method for room temperature, photon-number-resolving single photon detection. In particular, we estimate that a single photon absorbed within a silicon nanocavity can, within tens of picoseconds, be detected with $\\sim 99\\%$ efficiency and a dark count rate on the order of kHz assuming a mode volume $V_\\text{eff}\\sim 10^{-2}(λ/n_\\text{Si})^3$ for a 4.5 micron probe wavelength and a loaded quality factor $Q$ on the order of $10^4$.\n ', '\n Solid-state quantum emitters that couple coherent optical transitions to long-lived spin qubits are essential for quantum networks. Here we report on the spin and optical properties of individual tin-vacancy (SnV) centers in diamond nanostructures. Through cryogenic magneto-optical and spin spectroscopy, we verify the inversion-symmetric electronic structure of the SnV, identify spin-conserving and spin-flipping transitions, characterize transition linewidths, measure electron spin lifetimes and evaluate the spin dephasing time. We find that the optical transitions are consistent with the radiative lifetime limit even in nanofabricated structures. The spin lifetime is phononlimited with an exponential temperature scaling leading to $T_1$ $>$ 10 ms, and the coherence time, $T_2$ reaches the nuclear spin-bath limit upon cooling to 2.9 K. These spin properties exceed those of other inversion-symmetric color centers for which similar values require millikelvin temperatures. 
With a combination of coherent optical transitions and long spin coherence without dilution refrigeration, the SnV is a promising candidate for feasible and scalable quantum networking applications.\n ', '\n Large-scale quantum technologies require exquisite control over many individual quantum systems. Typically, such systems are very sensitive to environmental fluctuations, and diagnosing errors via measurements causes unavoidable perturbations. In this work we present an in situ frequency locking technique that monitors and corrects frequency variations in single photon sources based on microring resonators. By using the same classical laser fields required for photon generation as a probe to diagnose variations in the resonator frequency, our protocol applies feedback control to correct photon frequency errors in parallel to the optical quantum computation without disturbing the physical qubit. We implement our technique on a silicon photonic device and demonstrate sub 1 pm frequency stabilization in the presence of applied environmental noise, corresponding to a fractional frequency drift of <1 % of a photon linewidth. Using these methods we demonstrate feedback controlled quantum state engineering. By distributing a single local oscillator across a single chip or network of chips, our approach enables frequency locking of many single photon sources for large-scale photonic quantum technologies.\n ', '\n The inability of conventional electronic architectures to efficiently solve large combinatorial problems motivates the development of novel computational hardware. There has been much effort recently toward developing novel, application-specific hardware, across many different fields of engineering, such as integrated circuits, memristors, and photonics. However, unleashing the true potential of such novel architectures requires the development of featured algorithms which optimally exploit their fundamental properties. We here present the Photonic Recurrent Ising Sampler (PRIS), a heuristic method tailored for parallel architectures that allows for fast and efficient sampling from distributions of combinatorially hard Ising problems. Since the PRIS relies essentially on vector-to-fixed matrix multiplications, we suggest the implementation of the PRIS in photonic parallel networks, which realize these operations at an unprecedented speed. The PRIS provides sample solutions to the ground state of arbitrary Ising models, by converging in probability to their associated Gibbs distribution. By running the PRIS at various noise levels, we probe the critical behavior of universality classes and their critical exponents. In addition to the attractive features of photonic networks, the PRIS relies on intrinsic dynamic noise and eigenvalue dropout to find ground states more efficiently. Our work suggests speedups in heuristic methods via photonic implementations of the PRIS. We also hint at a broader class of (meta)heuristic algorithms derived from the PRIS, such as combined simulated annealing on the noise and eigenvalue dropout levels. Our algorithm can also be implemented in a competitive manner on fast parallel electronic hardware, such as FPGAs and ASICs.\n ', "\n Recently, Grange et al. [Phys. Rev. Lett. 114, 193601 (2015)] showed the possibility of single photon generation with high indistinguishability from a quantum emitter, despite strong pure dephasing, by `funneling' emission into a photonic cavity.
Here, we show that a cascaded two-cavity system can further improve the photon characteristics and greatly reduce the Q-factor requirement to levels achievable with present-day technology. Our approach leverages recent advances in nanocavities with ultrasmall mode volume and does not require ultrafast excitation of the emitter. These results were obtained by numerical and closed-form analytical models with strong emitter dephasing, representing room-temperature quantum emitters.\n ", '\n Physically motivated quantum algorithms for specific near-term quantum hardware will likely be the next frontier in quantum information science. Here, we show how many of the features of neural networks for machine learning can naturally be mapped into the quantum optical domain by introducing the quantum optical neural network (QONN). Through numerical simulation and analysis we train the QONN to perform a range of quantum information processing tasks, including newly developed protocols for quantum optical state compression, reinforcement learning, and black-box quantum simulation. We consistently demonstrate our system can generalize from only a small set of training data onto states for which it has not been trained. Our results indicate QONNs are a powerful design tool for quantum optical systems and, leveraging advances in integrated quantum photonics, a promising architecture for next generation quantum processors.\n ', '\n We report on quantum emission from Pb-related color centers in diamond following ion implantation and high temperature vacuum annealing. First-principles calculations predict a negatively-charged Pb-vacancy center in a split-vacancy configuration, with a zero-phonon transition around 2.3 eV. Cryogenic photoluminescence measurements performed on emitters in nanofabricated pillars reveal several transitions, including a prominent doublet near 520 nm. The splitting of this doublet, 2 THz, exceeds that reported for other group-IV centers. These observations are consistent with the PbV center, which is expected to have the combination of narrow optical transitions and stable spin states, making it a promising system for quantum network nodes.\n ', '\n Medium-scale ensembles of coupled qubits offer a platform for near-term quantum technologies including computing, sensing, and the study of mesoscopic quantum systems. Atom-like emitters in solids have emerged as promising quantum memories, with demonstrations of spin-spin entanglement by optical and magnetic interactions. Magnetic coupling in particular is attractive for efficient and deterministic entanglement gates, but raises the problem of individual spin addressing at the necessary nanometer-scale separation. Current super-resolution techniques can reach this resolution, but are destructive to the states of nearby qubits. Here, we demonstrate the measurement of individual qubit states in a sub-diffraction cluster by selectively exciting spectrally distinguishable nitrogen vacancy (NV) centers. We demonstrate super-resolution localization of single centers with nanometer spatial resolution, as well as individual control and readout of spin populations. These measurements indicate a readout-induced crosstalk on non-addressed qubits below $4\times10^{-2}$.
This approach opens the door to high-speed control and measurement of qubit registers in mesoscopic spin clusters, with applications ranging from entanglement-enhanced sensors to error-corrected qubit registers to multiplexed quantum repeater nodes.\n ', '\n We use a sample of 350 star-forming galaxies at $1.25<z<2.66$ from the MOSFIRE Deep Evolution Field survey to demonstrate an improved Voronoi binning technique that we use to study the properties of resolved stellar populations in $z\\sim2$ galaxies. Stellar population and dust maps are constructed from the high-resolution CANDELS/3D-HST multi-band imaging. Rather than constructing the layout of resolved elements (i.e., Voronoi bins) from the S/N distribution of the $H_{160}$-band alone, we introduce a modified Voronoi binning method that additionally incorporates the S/N distribution of several resolved filters. The SED-derived resolved E(B-V)$_{\\text{stars}}$, stellar population ages, SFRs, and stellar masses that are inferred from the Voronoi bins constructed from multiple filters are generally consistent with the properties inferred from the integrated photometry within the uncertainties, with the exception of the inferred E(B-V)$_{\\text{stars}}$ from our $z\\sim1.5$ sample due to their UV slopes being unconstrained by the resolved photometry. The results from our multi-filter Voronoi binning technique are compared to those derived from a "traditional" single-filter Voronoi binning approach. We find that single-filter binning produces inferred E(B-V)$_{\\text{stars}}$ that are systematically redder by 0.02 mag on average, but could differ by up to 0.20 mag, and could be attributed to poorly constrained resolved photometry covering the UV slope. Overall, we advocate that our methodology produces more reliable SED-derived parameters due to the best-fit resolved SEDs being better constrained at all resolved wavelengths--particularly those covering the UV slope.\n ', '\n We present results from the MOSFIRE Deep Evolution Field (MOSDEF) survey on broad flux from the nebular emission lines H$α$, [NII], [OIII], H$β$, and [SII]. The sample consists of 127 star-forming galaxies at $1.37 < z < 2.61$ and 84 galaxies at $2.95 < z < 3.80$. We decompose the emission lines using narrow ($\\text{FWHM} < 275 \\ \\text{km s}^{-1}$) and broad ($\\text{FWHM} > 300 \\ \\text{km s}^{-1}$) Gaussian components for individual galaxies and stacks. Broad emission is detected at $>3σ$ in $<10$% of galaxies and the broad flux accounts for 10-70% of the total flux. We find a slight increase in broad to narrow flux ratio with mass but note that we cannot reliably detect broad emission with $\\text{FWHM} < 275 \\ \\text{km s}^{-1}$, which may be significant at low masses. Notably, there is a correlation between higher signal-to-noise (S/N) spectra and a broad component detection indicating a S/N dependence in our ability to detect broad flux. When placed on the N2-BPT diagram ([OIII]/H$β$ vs. [NII]/H$α$) the broad components of the stacks are shifted towards higher [OIII]/H$β$ and [NII]/$α$ ratios compared to the narrow component. We compare the location of the broad components to shock models and find that the broad component could be due to shocks, but we do not rule out other possibilities such as the presence of an AGN. We estimate the mass loading factor (mass outflow rate/star formation rate) assuming the broad component is a photoionized outflow and find that the mass loading factor increases as a function of mass which agrees with previous studies. 
We show that adding emission from shocked gas to $z\sim0$ SDSS spectra shifts galaxies towards the location of $z\sim2$ galaxies on several emission line diagnostic diagrams.\n ', '\n Using observations from the first two years of the MOSFIRE Deep Evolution Field (MOSDEF) survey, we study 13 AGN-driven outflows detected from a sample of 67 X-ray, IR and/or optically-selected AGN at $z \sim 2$. The AGN have bolometric luminosities of $\sim10^{44}-10^{46} ~\mathrm{erg~s^{-1}}$, including both quasars and moderate-luminosity AGN. We detect blueshifted, ionized gas outflows in the H$β$, [OIII], H$α$, and/or [NII] emission lines of $19\%$ of the AGN, while only 1.8\% of the MOSDEF galaxies have similarly-detected outflows. The outflow velocities span $\sim$300 to 1000 km s$^{-1}$. Eight of the 13 outflows are spatially extended on similar scales as the host galaxies, with spatial extents of 2.5 to 11.0 kpc. Outflows are detected uniformly across the star-forming main sequence, showing little trend with the host galaxy SFR. Line ratio diagnostics indicate that the outflowing gas is photoionized by the AGN. We do not find evidence for positive AGN feedback, in either our small MOSDEF sample or a much larger SDSS sample, using the BPT diagram. Given that a galaxy with an AGN is ten times more likely to have a detected outflow, the outflowing gas is photoionized by the AGN, and estimates of the mass and energy outflow rates indicate that stellar feedback is insufficient to drive at least some of these outflows, they are very likely to be AGN-driven. The outflows have mass-loading factors of the order of unity, suggesting that they help regulate star formation in their host galaxies, though they may be insufficient to fully quench it.\n ', '\n We present results on the variation of 7.7 micron Polycyclic Aromatic Hydrocarbon (PAH) emission in galaxies spanning a wide range in metallicity at z ~ 2. For this analysis, we use rest-frame optical spectra of 476 galaxies at 1.37 < z < 2.61 from the MOSFIRE Deep Evolution Field (MOSDEF) survey to infer metallicities and ionization states. Spitzer/MIPS 24 micron and Herschel/PACS 100 and 160 micron observations are used to derive rest-frame 7.7 micron luminosities (L(7.7)) and total IR luminosities (L(IR)), respectively. We find significant trends between the ratio of L(7.7) to L(IR) (and to dust-corrected SFR) and both metallicity and [OIII]/[OII] (O32) emission-line ratio. The latter is an empirical proxy for the ionization parameter. These trends indicate a paucity of PAH emission in low metallicity environments with harder and more intense radiation fields. Additionally, L(7.7)/L(IR) is significantly lower in the youngest quartile of our sample (ages of 500 Myr) compared to older galaxies, which may be a result of the delayed production of PAHs by AGB stars. The relative strength of L(7.7) to L(IR) is also lower by a factor of ~ 2 for galaxies with masses $M_* < 10^{10}M_{\odot}$, compared to the more massive ones. We demonstrate that commonly-used conversions of L(7.7) (or 24 micron flux density; f(24)) to L(IR) underestimate the IR luminosity by more than a factor of 2 at $M_*$ ~ $10^{9.6-10.0} M_{\odot}$. We adopt a mass-dependent conversion of L(7.7) to L(IR) with L(7.7)/L(IR)= 0.09 and 0.22 for $M_* < 10^{10}$ and $> 10^{10} M_{\odot}$, respectively. Based on the new scaling, the SFR-$M_*$ relation has a shallower slope than previously derived.
Our results also suggest a higher IR luminosity density at z ~ 2 than previously measured, corresponding to a ~ 30% increase in the SFR density.\n ', '\n Convolutional Neural Network (CNN) has been successful in image recognition tasks, and recent works shed light on how CNN separates different classes with the learned inter-class knowledge through visualization. In this work, we instead visualize the intra-class knowledge inside CNN to better understand how an object class is represented in the fully-connected layers.\n To invert the intra-class knowledge into more interpretable images, we propose a non-parametric patch prior upon previous CNN visualization models. With it, we show how different "styles" of templates for an object class are organized by CNN in terms of location and content, and represented in a hierarchical and ensemble way. Moreover, such intra-class knowledge can be used in many interesting applications, e.g. style-based image retrieval and style-based object completion.\n ', "\n Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, as the ``shortcuts'' inherent in the datasets, do not contribute to the *task-specific information* (TSI) of the classification tasks. While it is essential to look at the model performance, it is also important to understand the datasets. In this paper, we consider this question: Apart from the information introduced by the shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge -- modulo a set of predefined shortcuts -- that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets, saying that, apart from a set of ``shortcut features'', classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pair.\n ", "\n Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible ``dead-ends'' of a state space. We focus on the condition of patients in the intensive care unit, where a ``medical dead-end'' indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate ``treatment security'' as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery and confirmation. Our empirical results discover that dead-ends exist in real clinical data among septic patients, and further reveal gaps between secure treatments and those that were administered.\n ", '\n Machine learning models achieve state-of-the-art performance on many supervised learning tasks.
However, prior evidence suggests that these models may learn to rely on shortcut biases or spurious correlations (intuitively, correlations that hold in the training data but do not hold in the test data) for good predictive performance. Such models cannot be trusted in deployment environments to provide accurate predictions. While viewing the problem from a causal lens is known to be useful, the seamless integration of causation techniques into machine learning pipelines remains cumbersome and expensive. In this work, we study and extend a causal pre-training debiasing technique called causal bootstrapping (CB) under five practical confounded-data generation-acquisition scenarios (with known and unknown confounding). Under these settings, we systematically investigate the effect of confounding bias on deep learning model performance, demonstrating their propensity to rely on shortcut biases when these biases are not properly accounted for. We demonstrate that such a causal pre-training technique can significantly outperform existing base practices to mitigate confounding bias on real-world domain generalization benchmarking tasks. This systematic investigation underlines the importance of accounting for the underlying data-generating mechanisms and fortifying data-preprocessing pipelines with a causal framework to develop methods robust to confounding biases.\n ', '\n Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset.
These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via techniques that implicitly or explicitly increase the effective sample size.\n ', '\n Background: In medical imaging, prior studies have demonstrated disparate AI performance by race, yet there is no known correlation for race on medical imaging that would be obvious to the human expert interpreting the images.\n Methods: Using private and public datasets we evaluate: A) performance quantification of deep learning models to detect race from medical images, including the ability of these models to generalize to external environments and across multiple imaging modalities, B) assessment of possible confounding anatomic and phenotype population features, such as disease distribution and body habitus as predictors of race, and C) investigation into the underlying mechanism by which AI models can recognize race.\n Findings: Standard deep learning models can be trained to predict race from medical images with high performance across multiple imaging modalities. Our findings hold under external validation conditions, as well as when models are optimized to perform clinically motivated tasks. We demonstrate this detection is not due to trivial proxies or imaging-related surrogate covariates for race, such as underlying disease distribution. Finally, we show that performance persists over all anatomical regions and frequency spectrum of the images suggesting that mitigation efforts will be challenging and demand further study.\n Interpretation: We emphasize that model ability to predict self-reported race is itself not the issue of importance. However, our findings that AI can trivially predict self-reported race -- even from corrupted, cropped, and noised medical images -- in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell using the same data the model has access to.\n ', '\n Deep Metric Learning (DML) aims to find representations suitable for zero-shot transfer to a priori unknown test distributions. However, common evaluation protocols only test a single, fixed data split in which train and test classes are assigned randomly. More realistic evaluations should consider a broad spectrum of distribution shifts with potentially varying degree and difficulty. In this work, we systematically construct train-test splits of increasing difficulty and present the ooDML benchmark to characterize generalization under out-of-distribution shifts in DML. ooDML is designed to probe the generalization performance on much more challenging, diverse train-to-test distribution shifts. Based on our new benchmark, we conduct a thorough empirical analysis of state-of-the-art DML methods. We find that while generalization tends to consistently degrade with difficulty, some methods are better at retaining performance as the distribution shift increases. Finally, we propose few-shot DML as an efficient way to consistently improve generalization in response to unknown test shifts presented in ooDML. 
Code available here: https://github.com/Confusezius/Characterizing_Generalization_in_DeepMetricLearning.\n ', '\n Clinical machine learning models experience significantly degraded performance in datasets not seen during training, e.g., new hospitals or populations. Recent developments in domain generalization offer a promising solution to this problem by creating models that learn invariances across environments. In this work, we benchmark the performance of eight domain generalization methods on multi-site clinical time series and medical imaging data. We introduce a framework to induce synthetic but realistic domain shifts and sampling bias to stress-test these methods over existing non-healthcare benchmarks. We find that current domain generalization methods do not consistently achieve significant gains in out-of-distribution performance over empirical risk minimization on real-world medical imaging data, in line with prior work on general imaging datasets. However, a subset of realistic induced-shift scenarios in clinical time series data do exhibit limited performance gains. We characterize these scenarios in detail, and recommend best practices for domain generalization in the clinical setting.\n ', '\n Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data. In practice, successful RL relies on informative latent states derived from sequential observations to develop optimal treatment strategies. To date, how best to construct such states in a healthcare setting is an open question. In this paper, we perform an empirical study of several information encoding architectures using data from septic patients in the MIMIC-III dataset to form representations of a patient state. We evaluate the impact of representation dimension, correlations with established acuity scores, and the treatment policies derived from them. We find that sequentially formed state representations facilitate effective policy learning in batch settings, validating a more thoughtful approach to representation learning that remains faithful to the sequential and partial nature of healthcare data.\n ', '\n Reliable treatment effect estimation from observational data depends on the availability of all confounding information. While much work has targeted treatment effect estimation from observational data, there is relatively little work in the setting of confounding variable missingness, where collecting more information on confounders is often costly or time-consuming. In this work, we frame this challenge as a problem of feature acquisition of confounding features for causal inference. Our goal is to prioritize acquiring values for a fixed and known subset of missing confounders in samples that lead to efficient average treatment effect estimation. We propose two acquisition strategies based on i) covariate balancing (CB), and ii) reducing statistical estimation error on observed factual outcome error (OE). We compare CB and OE on five common causal effect estimation methods, and demonstrate improved sample efficiency of OE over baseline methods under various settings. We also provide visualizations for further analysis on the difference between our proposed methods.\n ', '\n Building user trust in dialogue agents requires smooth and consistent dialogue exchanges. 
However, agents can easily lose conversational context and generate irrelevant utterances. These situations are called dialogue breakdown, where agent utterances prevent users from continuing the conversation. Building systems to detect dialogue breakdown allows agents to recover appropriately or avoid breakdown entirely. In this paper we investigate the use of semi-supervised learning methods to improve dialogue breakdown detection, including continued pre-training on the Reddit dataset and a manifold-based data augmentation method. We demonstrate the effectiveness of these methods on the Dialogue Breakdown Detection Challenge (DBDC) English shared task. Our submissions to the 2020 DBDC5 shared task place first, beating baselines and other submissions by over 12\\% accuracy. In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2\\% accuracy. These methods are applicable generally to any dialogue task and provide a simple way to improve model performance.\n ', '\n Multiple Sclerosis (MS) is a chronic, inflammatory and degenerative neurological disease, which is monitored by a specialist using the Expanded Disability Status Scale (EDSS) and recorded in unstructured text in the form of a neurology consult note. An EDSS measurement contains an overall "EDSS" score and several functional subscores. Typically, expert knowledge is required to interpret consult notes and generate these scores. Previous approaches used limited context length Word2Vec embeddings and keyword searches to predict scores given a consult note, but often failed when scores were not explicitly stated. In this work, we present MS-BERT, the first publicly available transformer model trained on real clinical data other than MIMIC. Next, we present MSBC, a classifier that applies MS-BERT to generate embeddings and predict EDSS and functional subscores. Lastly, we explore combining MSBC with other models through the use of Snorkel to generate scores for unlabelled consult notes. MSBC achieves state-of-the-art performance on all metrics and prediction tasks and outperforms the models generated from the Snorkel ensemble. We improve Macro-F1 by 0.12 (to 0.88) for predicting EDSS and on average by 0.29 (to 0.63) for predicting functional subscores over previous Word2Vec CNN and rule-based approaches.\n ', '\n Machine learning models in health care are often deployed in settings where it is important to protect patient privacy. In such settings, methods for differentially private (DP) learning provide a general-purpose approach to learn models with privacy guarantees. Modern methods for DP learning ensure privacy through mechanisms that censor information judged as too unique. The resulting privacy-preserving models, therefore, neglect information from the tails of a data distribution, resulting in a loss of accuracy that can disproportionately affect small groups. In this paper, we study the effects of DP learning in health care. We use state-of-the-art methods for DP learning to train privacy-preserving models in clinical prediction tasks, including x-ray classification of images and mortality prediction in time series data. We use these models to perform a comprehensive empirical investigation of the tradeoffs between privacy, utility, robustness to dataset shift, and fairness. 
Our results highlight lesser-known limitations of methods for DP learning in health care, models that exhibit steep tradeoffs between privacy and utility, and models whose predictions are disproportionately influenced by large demographic groups in the training data. We discuss the costs and benefits of differentially private learning in health care.\n ', '\n Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be beneficial including calibration and missing data. Beyond predictive models, we also investigate the utility of probabilistic machine learning models in phenotyping, in generative models for clinical use cases, and in reinforcement learning.\n ', '\n The use of machine learning (ML) in health care raises numerous ethical concerns, especially as models can amplify existing health inequities. Here, we outline ethical considerations for equitable ML in the advancement of health care. Specifically, we frame ethics of ML in health care through the lens of social justice. We describe ongoing efforts and outline challenges in a proposed pipeline of ethical ML in health, ranging from problem selection to post-deployment considerations. We close by summarizing recommendations to address these challenges.\n ', '\n Models that perform well on a training domain often fail to generalize to out-of-domain (OOD) examples. Data augmentation is a common method used to prevent overfitting and improve OOD generalization. However, in natural language, it is difficult to generate new examples that stay on the underlying data manifold. We introduce SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold. We investigate the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models. In experiments on robustness benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods and baseline models on both in-domain and OOD data, achieving gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 German-English.\n ', '\n Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives. However, generalization capacity is known to scale with the embedding space dimensionality. Unfortunately, high dimensional embeddings also create higher retrieval cost for downstream applications. To remedy this, we propose \emph{Simultaneous Similarity-based Self-distillation} (S2SD). S2SD extends DML with knowledge distillation from auxiliary, high-dimensional embedding and feature spaces to leverage complementary context during training while retaining test-time cost and with negligible changes to the training time. Experiments and ablations across different objectives and standard benchmarks show S2SD offers notable improvements of up to 7% in Recall@1, while also setting a new state-of-the-art.
Code available at https://github.com/MLforHealth/S2SD.\n ', '\n Multi-task learning (MTL) is a machine learning technique aiming to improve model performance by leveraging information across many tasks. It has been used extensively on various data modalities, including electronic health record (EHR) data. However, despite significant use on EHR data, there has been little systematic investigation of the utility of MTL across the diverse set of possible tasks and training schemes of interest in healthcare. In this work, we examine MTL across a battery of tasks on EHR time-series data. We find that while MTL does suffer from common negative transfer, we can realize significant gains via MTL pre-training combined with single-task fine-tuning. We demonstrate that these gains can be achieved in a task-independent manner and offer not only minor improvements under traditional learning, but also notable gains in a few-shot learning context, thereby suggesting this could be a scalable vehicle to offer improved performance in important healthcare contexts.\n ', '\n Deep Neural Networks have shown great promise on a variety of downstream applications; but their ability to adapt and generalize to new data and tasks remains a challenge. However, the ability to perform few or zero-shot adaptation to novel tasks is important for the scalability and deployment of machine learning models. It is therefore crucial to understand what makes for good, transfer-able features in deep networks that best allow for such adaptation. In this paper, we shed light on this by showing that features that are most transferable have high uniformity in the embedding space and propose a uniformity regularization scheme that encourages better transfer and feature reuse. We evaluate the regularization on its ability to facilitate adaptation to unseen tasks and data, for which we conduct a thorough experimental study covering four relevant, and distinct domains: few-shot Meta-Learning, Deep Metric Learning, Zero-Shot Domain Adaptation, as well as Out-of-Distribution classification. Across all experiments, we show that uniformity regularization consistently offers benefits over baseline methods and is able to achieve state-of-the-art performance in Deep Metric Learning and Meta-Learning.\n ', "\n It is often infeasible or impossible to obtain ground truth labels for medical data. To circumvent this, one may build rule-based or other expert-knowledge driven labelers to ingest data and yield silver labels absent any ground-truth training data. One popular such labeler is CheXpert, a labeler that produces diagnostic labels for chest X-ray radiology reports. CheXpert is very useful, but is relatively computationally slow, especially when integrated with end-to-end neural pipelines, is non-differentiable so can't be used in any applications that require gradients to flow through the labeler, and does not yield probabilistic outputs, which limits our ability to improve the quality of the silver labeler through techniques such as active learning.\n In this work, we solve all three of these problems with $\\texttt{CheXpert++}$, a BERT-based, high-fidelity approximation to CheXpert. $\\texttt{CheXpert++}$ achieves 99.81\\% parity with CheXpert, which means it can be reliably used as a drop-in replacement for CheXpert, all while being significantly faster, fully differentiable, and probabilistic in output. 
Error analysis of $\\texttt{CheXpert++}$ also demonstrates that $\\texttt{CheXpert++}$ has a tendency to actually correct errors in the CheXpert labels, with $\\texttt{CheXpert++}$ labels being more often preferred by a clinician over CheXpert labels (when they disagree) on all but one disease task. To further demonstrate the utility of these advantages in this model, we conduct a proof-of-concept active learning study, demonstrating we can improve accuracy on an expert labeled random subset of report sentences by approximately 8\\% over raw, unaltered CheXpert by using one-iteration of active-learning inspired re-training. These findings suggest that simple techniques in co-learning and active learning can yield high-quality labelers under minimal, and controllable human labeling demands.\n ", "\n Across the world's coronavirus disease 2019 (COVID-19) hot spots, the need to streamline patient diagnosis and management has become more pressing than ever. As one of the main imaging tools, chest X-rays (CXRs) are common, fast, non-invasive, relatively cheap, and potentially bedside to monitor the progression of the disease. This paper describes the first public COVID-19 image data collection as well as a preliminary exploration of possible use cases for the data. This dataset currently contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19. It was manually aggregated from publication figures as well as various web based repositories into a machine learning (ML) friendly format with accompanying dataloader code. We collected frontal and lateral view imagery and metadata such as the time since first symptoms, intensive care unit (ICU) status, survival status, intubation status, or hospital location. We present multiple possible use cases for the data such as predicting the need for the ICU, predicting patient survival, and understanding a patient's trajectory during treatment. Data can be accessed here: https://github.com/ieee8023/covid-chestxray-dataset\n ", '\n Domain shift creates significant challenges for sequential decision making in healthcare since the target domain may be data-scarce and confounded. In this paper, we propose a method for off-policy transfer by modeling the underlying generative process with a causal mechanism. We use informative priors from the source domain to augment counterfactual trajectories in the target in a principled manner. We demonstrate how this addresses data-scarcity in the presence of unobserved confounding. The causal parametrization of our sampling procedure guarantees that counterfactual quantities can be estimated from scarce observational target data, maintaining intuitive stability properties. Policy learning in the target domain is further regularized via the source policy through KL-divergence. Through evaluation on a simulated sepsis treatment task, our counterfactual policy transfer procedure significantly improves the performance of a learned treatment policy when assumptions of "no-unobserved confounding" are relaxed.\n ', "\n Purpose: The need to streamline patient management for COVID-19 has become more pressing than ever. Chest X-rays provide a non-invasive (potentially bedside) tool to monitor the progression of the disease. In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images. 
Such a tool can gauge severity of COVID-19 lung infections (and pneumonia in general) that can be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the ICU.\n Methods: Images from a public COVID-19 database were scored retrospectively by three blinded experts in terms of the extent of lung involvement as well as the degree of opacity. A neural network model that was pre-trained on large (non-COVID-19) chest X-ray datasets is used to construct features for COVID-19 images which are predictive for our task.\n Results: This study finds that training a regression model on a subset of the outputs from this pre-trained chest X-ray model predicts our geographic extent score (range 0-8) with 1.14 mean absolute error (MAE) and our lung opacity score (range 0-6) with 0.78 MAE.\n Conclusions: These results indicate that our model's ability to gauge severity of COVID-19 lung infections could be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the intensive care unit (ICU). A proper clinical trial is needed to evaluate efficacy. To enable this we make our code, labels, and data available online at https://github.com/mlmed/torchxrayvision/tree/master/scripts/covid-severity and https://github.com/ieee8023/covid-chestxray-dataset\n ", '\n In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset, and quantify potential disparities using two approaches. First, we identify dangerous latent relationships that are captured by the contextual word embeddings using a fill-in-the-blank method with text from real clinical notes and a log probability bias score quantification. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks that include detection of acute and chronic conditions. We find that classifiers trained from BERT representations exhibit statistically significant differences in performance, often favoring the majority group with regards to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.\n ', "\n Machine learning systems have received much attention recently for their ability to achieve expert-level performance on clinical tasks, particularly in medical imaging. Here, we examine the extent to which state-of-the-art deep learning classifiers trained to yield diagnostic labels from X-ray images are biased with respect to protected attributes. We train convolutional neural networks to predict 14 diagnostic labels in 3 prominent public chest X-ray datasets: MIMIC-CXR, Chest-Xray8, CheXpert, as well as a multi-site aggregation of all those datasets. We evaluate the TPR disparity -- the difference in true positive rates (TPR) -- among different protected attributes such as patient sex, age, race, and insurance type as a proxy for socioeconomic status. We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and all subgroups. A multi-source dataset corresponds to the smallest disparities, suggesting one way to reduce bias. 
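The TPR-disparity audit just described reduces to a per-subgroup computation once binarized predictions are available. The sketch below uses hypothetical column names and is not the CheXclusion code.

# Minimal sketch of a TPR-disparity audit (hypothetical columns, not the
# CheXclusion code): for one diagnostic label, compare true positive rates
# across subgroups of a protected attribute such as patient sex.
import pandas as pd

def tpr_by_group(df: pd.DataFrame, group_col: str) -> pd.Series:
    """TPR per subgroup, computed only over truly positive cases."""
    positives = df[df["y_true"] == 1]
    return positives.groupby(group_col)["y_pred"].mean()

# One row per image: binarized prediction y_pred, ground truth y_true,
# and protected attributes such as "sex".
df = pd.read_csv("predictions.csv")            # hypothetical file
tpr = tpr_by_group(df, "sex")
disparity = tpr.max() - tpr.min()              # gap between best- and worst-off group
print(tpr)
print(f"TPR disparity (sex): {disparity:.3f}")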
We find that TPR disparities are not significantly correlated with a subgroup's proportional disease burden. As clinical models move from papers to products, we encourage clinical decision makers to carefully audit for algorithmic disparities prior to deployment. Our code can be found at, https://github.com/LalehSeyyed/CheXclusion\n ", '\n Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. Robust transfer of linguistic features across languages could improve rates of early diagnosis and therapy for speakers of low-resource languages when detecting health conditions from speech. We utilize out-of-domain, unpaired, single-speaker, healthy speech data for training multiple Optimal Transport (OT) domain adaptation systems. We learn mappings from other languages to English and detect aphasia from linguistic characteristics of speech, and show that OT domain adaptation improves aphasia detection over unilingual baselines for French (6% increased F1) and Mandarin (5% increased F1). Further, we show that adding aphasic data to the domain adaptation system significantly increases performance for both French and Mandarin, increasing the F1 scores further (10% and 8% increase in F1 scores for French and Mandarin, respectively, over unilingual baselines).\n ', "\n When training clinical prediction models from electronic health records (EHRs), a key concern should be a model's ability to sustain performance over time when deployed, even as care practices, database systems, and population demographics evolve. Due to de-identification requirements, however, current experimental practices for public EHR benchmarks (such as the MIMIC-III critical care dataset) are time agnostic, assigning care records to train or test sets without regard for the actual dates of care. As a result, current benchmarks cannot assess how well models trained on one year generalise to another. In this work, we obtain a Limited Data Use Agreement to access year of care for each record in MIMIC and show that all tested state-of-the-art models decay in prediction quality when trained on historical data and tested on future data, particularly in response to a system-wide record-keeping change in 2008 (0.29 drop in AUROC for mortality prediction, 0.10 drop in AUROC for length-of-stay prediction with a random forest classifier). We further develop a simple yet effective mitigation strategy: by aggregating raw features into expert-defined clinical concepts, we see only a 0.06 drop in AUROC for mortality prediction and a 0.03 drop in AUROC for length-of-stay prediction. We demonstrate that this aggregation strategy outperforms other automatic feature preprocessing techniques aimed at increasing robustness to data drift. We release our aggregated representations and code to encourage more deployable clinical prediction models.\n ", '\n Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract, an open-source pipeline for transforming raw electronic health record (EHR) data for critical care patients contained in the publicly-available MIMIC-III database into dataframes that are directly usable in common machine learning pipelines. 
MIMIC-Extract addresses three primary challenges in making complex health records data accessible to the broader machine learning community. First, it provides standardized data processing functions, including unit conversion, outlier detection, and aggregating semantically equivalent features, thus accounting for duplication and reducing missingness. Second, it preserves the time series nature of clinical data and can be easily integrated into clinically actionable prediction tasks in machine learning for health. Finally, it is highly extensible so that other researchers with related questions can easily use the same pipeline. We demonstrate the utility of this pipeline by showcasing several benchmark tasks and baseline results.\n ', '\n Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision. This requirement warrants stricter attention to issues of reproducibility than other fields of machine learning.\n In this work, we conduct a systematic evaluation of over 100 recently published ML4H research papers along several dimensions related to reproducibility. We find that the field of ML4H compares poorly to more established machine learning fields, particularly concerning data and code accessibility. Finally, drawing from success in other fields of science, we propose recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward.\n ', '\n Understanding if classifiers generalize to out-of-sample datasets is a central problem in machine learning. Microscopy images provide a standardized way to measure the generalization capacity of image classifiers, as we can image the same classes of objects under increasingly divergent, but controlled factors of variation. We created a public dataset of 132,209 images of mouse cells, COOS-7 (Cells Out Of Sample 7-Class). COOS-7 provides a classification setting where four test datasets have increasing degrees of covariate shift: some images are random subsets of the training data, while others are from experiments reproduced months later and imaged by different instruments. We benchmarked a range of classification models using different representations, including transferred neural network features, end-to-end classification with a supervised deep CNN, and features from a self-supervised CNN. While most classifiers perform well on test datasets similar to the training dataset, all classifiers failed to generalize their performance to datasets with greater covariate shifts. These baselines highlight the challenges of covariate shifts in image data, and establish metrics for improving the generalization capacity of image classifiers.\n ', '\n The automatic generation of radiology reports given medical radiographs has significant potential to operationally and clinically improve patient care. A number of prior works have focused on this problem, employing advanced methods from computer vision and natural language generation to produce readable reports. However, these works often fail to account for the particular nuances of the radiology domain, and, in particular, the critical importance of clinical accuracy in the resulting generated reports. 
In this work, we present a domain-aware automatic chest X-ray radiology report generation system which first predicts what topics will be discussed in the report, then conditionally generates sentences corresponding to these topics. The resulting system is fine-tuned using reinforcement learning, considering both readability and clinical accuracy, as assessed by the proposed Clinically Coherent Reward. We verify this system on two datasets, Open-I and MIMIC-CXR, and demonstrate that our model offers marked improvements on both language generation metrics and CheXpert assessed accuracy over a variety of competitive baselines.\n ', '\n A crucial challenge in image-based modeling of biomedical data is to identify trends and features that separate normality and pathology. In many cases, the morphology of the imaged object exhibits continuous change as it deviates from normality, and thus a generative model can be trained to model this morphological continuum. Moreover, given side information that correlates to a certain trend in morphological change, a latent variable model can be regularized such that its latent representation reflects this side information. In this work, we use the Wasserstein Auto-encoder to model this pathology continuum, and apply the Hilbert-Schmidt Independence Criterion (HSIC) to enforce dependency between certain latent features and the provided side information. We experimentally show that the model can provide disentangled and interpretable latent representations and also generate a continuum of morphological changes that corresponds to change in the side information.\n ', '\n Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time. These changing practices induce definitive changes in observed data which confound evaluations which do not account for dates and limit the generalisability of date-agnostic models. In this work, we establish the magnitude of this problem on MIMIC, a public hospital dataset, and showcase a simple solution. We augment MIMIC with the year in which care was provided and show that a model trained using standard feature representations will significantly degrade in quality over time. We find a deterioration of 0.3 AUC when evaluating mortality prediction on data from 10 years later. We find a similar deterioration of 0.15 AUC for length-of-stay. In contrast, we demonstrate that clinically-oriented aggregates of raw features significantly mitigate future deterioration. Our suggested aggregated representations, when retrained yearly, have prediction quality comparable to year-agnostic models.\n ', "\n Speech datasets for identifying Alzheimer's disease (AD) are generally restricted to participants performing a single task, e.g. describing an image shown to them. As a result, models trained on linguistic features derived from such datasets may not be generalizable across tasks. Building on prior work demonstrating that same-task data of healthy participants helps improve AD detection on a single-task dataset of pathological speech, we augment an AD-specific dataset consisting of subjects describing a picture with multi-task healthy data. We demonstrate that normative data from multiple speech-based tasks helps improve AD detection by up to 9%. 
Visualization of decision boundaries reveals that models trained on a combination of structured picture descriptions and unstructured conversational speech have the least out-of-task error and show the most potential to generalize to multiple tasks. We analyze the impact of age of the added samples and if they affect fairness in classification. We also provide explanations for a possible inductive bias effect across tasks using model-agnostic feature anchors. This work highlights the need for heterogeneous datasets for encoding changes in multiple facets of cognition and for developing a task-independent AD detection model.\n ", '\n This volume represents the accepted submissions from the Machine Learning for Health (ML4H) workshop at the conference on Neural Information Processing Systems (NeurIPS) 2018, held on December 8, 2018 in Montreal, Canada.\n ', '\n Making decisions about what clinical tasks to prepare for is multi-factored, and especially challenging in intensive care environments where resources must be balanced with patient needs. Electronic health records (EHRs) are a rich data source, but are task-agnostic and can be difficult to use as summarizations of patient needs for a specific task, such as "could this patient need a ventilator tomorrow?" In this paper, we introduce ClinicalVis, an open-source EHR visualization-based prototype system for task-focused design evaluation of interactions between healthcare providers (HCPs) and EHRs. We situate ClinicalVis in a task-focused proof-of-concept design study targeting these interactions with real patient data. We conduct an empirical study of 14 HCPs, and discuss our findings on usability, accuracy, preference, and confidence in treatment decisions. We also present design implications that our findings suggest for future EHR interfaces, the presentation of clinical data for task-based planning, and evaluating task-focused HCP/EHR interactions in practice.\n ', '\n There are established racial disparities in healthcare, including during end-of-life care, when poor communication and trust can lead to suboptimal outcomes for patients and their families. In this work, we find that racial disparities which have been reported in existing literature are also present in the MIMIC-III database. We hypothesize that one underlying cause of this disparity is due to mistrust between patient and caregivers, and we develop multiple possible trust metric proxies (using coded interpersonal variables and clinical notes) to measure this phenomenon more directly. These metrics show even stronger disparities in end-of-life care than race does, and they also tend to demonstrate statistically significant higher levels of mistrust for black patients than white ones. Finally, we demonstrate that these metrics improve performance on three clinical tasks: in-hospital mortality, discharge against medical advice (AMA) and modified care status (e.g., DNR, DNI, etc.).\n ', '\n In this work, we characterize the doctor-patient relationship using a machine learning-derived trust score. We show that this score has statistically significant racial associations, and that by modeling trust directly we find stronger disparities in care than by stratifying on race. We further demonstrate that mistrust is indicative of worse outcomes, but is only weakly associated with physiologically-created severity scores. 
Finally, we describe sentiment analysis experiments indicating patients with higher levels of mistrust have worse experiences and interactions with their caregivers. This work is a step towards measuring fairer machine learning in the healthcare domain.\n ', '\n Modern electronic health records (EHRs) provide data to answer clinically meaningful questions. The growing data in EHRs makes healthcare ripe for the use of machine learning. However, learning in a clinical setting presents unique challenges that complicate the use of common machine learning methodologies. For example, diseases in EHRs are poorly labeled, conditions can encompass multiple underlying endotypes, and healthy individuals are underrepresented. This article serves as a primer to illuminate these challenges and highlights opportunities for members of the machine learning community to contribute to healthcare.\n ', '\n Risk prediction is central to both clinical medicine and public health. While many machine learning models have been developed to predict mortality, they are rarely applied in the clinical literature, where classification tasks typically rely on logistic regression. One reason for this is that existing machine learning models often seek to optimize predictions by incorporating features that are not present in the databases readily available to providers and policy makers, limiting generalizability and implementation. Here we tested a number of machine learning classifiers for prediction of six-month mortality in a population of elderly Medicare beneficiaries, using an administrative claims database of the kind available to the majority of health care payers and providers. We show that machine learning classifiers substantially outperform current widely-used methods of risk prediction but only when used with an improved feature set incorporating insights from clinical medicine, developed for this study. Our work has applications to supporting patient and provider decision making at the end of life, as well as population health-oriented efforts to identify patients at high risk of poor outcomes.\n ', '\n Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions annually. Treating a septic patient is highly challenging, because individual patients respond very differently to medical interventions and there is no universally agreed-upon treatment for sepsis. In this work, we propose an approach to deduce treatment policies for septic patients by using continuous state-space models and deep reinforcement learning. Our model learns clinically interpretable treatment policies, similar in important aspects to the treatment policies of physicians. The learned policies could be used to aid intensive care clinicians in medical decision making and improve the likelihood of patient survival.\n ', '\n Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs). This task is complicated by data sources that are noisy, sparse, heterogeneous and outcomes that are imbalanced. In this paper, we integrate data from all available ICU sources (vitals, labs, notes, demographics) and focus on learning rich representations of this data to predict onset and weaning of multiple invasive interventions. In particular, we compare both long short-term memory networks (LSTM) and convolutional neural networks (CNN) for prediction of five intervention tasks: invasive ventilation, non-invasive ventilation, vasopressors, colloid boluses, and crystalloid boluses. 
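A stripped-down version of the LSTM branch of such an intervention-prediction model might look like the sketch below; the feature count, window length, and toy tensors are illustrative stand-ins, not the authors' architecture.

# Minimal sketch (illustrative, not the authors' architecture): an LSTM mapping a
# window of featurized ICU time series to logits for several intervention tasks,
# trained with a multi-label binary cross-entropy loss.
import torch
import torch.nn as nn

TASKS = ["invasive_vent", "noninvasive_vent", "vasopressor",
         "colloid_bolus", "crystalloid_bolus"]

class InterventionLSTM(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(TASKS))

    def forward(self, x):                  # x: (batch, time, features)
        _, (h, _) = self.lstm(x)           # final hidden state summarizes the window
        return self.head(h[-1])            # one logit per intervention task

# Toy tensors standing in for featurized vitals/labs/notes windows and onset labels.
x = torch.randn(32, 24, 40)                # 32 stays, 24 hourly steps, 40 features
y = torch.randint(0, 2, (32, len(TASKS))).float()

model = InterventionLSTM(n_features=40)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()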
Our predictions are done in a forward-facing manner to enable "real-time" performance, and predictions are made with a six hour gap time to support clinically actionable planning. We achieve state-of-the-art results on our predictive tasks using deep architectures. We explore the use of feature occlusion to interpret LSTM models, and compare this to the interpretability gained from examining inputs that maximally activate CNN outputs. We show that our models are able to significantly outperform baselines in intervention prediction, and provide insight into model learning, which is crucial for the adoption of such models in practice.\n ', "\n Sepsis is a leading cause of mortality in intensive care units (ICUs) and costs hospitals billions annually. Treating a septic patient is highly challenging, because individual patients respond very differently to medical interventions and there is no universally agreed-upon treatment for sepsis. Understanding more about a patient's physiological state at a given time could hold the key to effective treatment policies. In this work, we propose a new approach to deduce optimal treatment policies for septic patients by using continuous state-space models and deep reinforcement learning. Learning treatment policies over continuous spaces is important, because we retain more of the patient's physiological information. Our model is able to learn clinically interpretable treatment policies, similar in important aspects to the treatment policies of physicians. Evaluating our algorithm on past ICU patient data, we find that our model could reduce patient mortality in the hospital by up to 3.6% over observed clinical policies, from a baseline mortality of 13.7%. The learned treatment policies could be used to aid intensive care clinicians in medical decision making and improve the likelihood of patient survival.\n ", '\n We use autoencoders to create low-dimensional embeddings of underlying patient phenotypes that we hypothesize are a governing factor in determining how different patients will react to different interventions. We compare the performance of autoencoders that take fixed length sequences of concatenated timesteps as input with a recurrent sequence-to-sequence autoencoder. We evaluate our methods on around 35,500 patients from the latest MIMIC III dataset from Beth Israel Deaconess Hospital.\n ', '\n Voice disorders affect an estimated 14 million working-aged Americans, and many more worldwide. We present the first large scale study of vocal misuse based on long-term ambulatory data collected by an accelerometer placed on the neck. We investigate an unsupervised data mining approach to uncovering latent information about voice misuse.\n We segment signals from over 253 days of data from 22 subjects into over a hundred million single glottal pulses (closures of the vocal folds), cluster segments into symbols, and use symbolic mismatch to uncover differences between patients and matched controls, and between patients pre- and post-treatment. Our results show significant behavioral differences between patients and controls, as well as between some pre- and post-treatment patients. Our proposed approach provides an objective basis for helping diagnose behavioral voice disorders, and is a first step towards a more data-driven understanding of the impact of voice therapy.\n ', '\n Balancing the needs of data privacy and predictive utility is a central challenge for machine learning in healthcare. 
In particular, privacy concerns have led to a dearth of public datasets, complicated the construction of multi-hospital cohorts and limited the utilization of external machine learning resources. To remedy this, new methods are required to enable data owners, such as hospitals, to share their datasets publicly, while preserving both patient privacy and modeling utility. We propose NeuraCrypt, a private encoding scheme based on random deep neural networks. NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data-owner, and publishes both the encoded data and associated labels publicly. From a theoretical perspective, we demonstrate that sampling from a sufficiently rich family of encoding functions offers a well-defined and meaningful notion of privacy against a computationally unbounded adversary with full knowledge of the underlying data-distribution. We propose to approximate this family of encoding functions through random deep neural networks. Empirically, we demonstrate the robustness of our encoding to a suite of adversarial attacks and show that NeuraCrypt achieves competitive accuracy to non-private baselines on a variety of x-ray tasks. Moreover, we demonstrate that multiple hospitals, using independent private encoders, can collaborate to train improved x-ray models. Finally, we release a challenge dataset to encourage the development of new attacks on NeuraCrypt.\n ', "\n Today, network devices share buffer across priority queues to avoid drops during transient congestion. While cost-effective most of the time, this sharing can cause undesired interference among seemingly independent traffic. As a result, low-priority traffic can cause increased packet loss to high-priority traffic. Similarly, long flows can prevent the buffer from absorbing incoming bursts even if they do not share the same queue. The cause of this perhaps unintuitive outcome is that today's buffer sharing techniques are unable to guarantee isolation across (priority) queues without statically allocating buffer space. To address this issue, we designed FB, a novel buffer sharing scheme that offers strict isolation guarantees to high-priority traffic without sacrificing link utilizations. Thus, FB outperforms conventional buffer sharing algorithms in absorbing bursts while achieving on-par throughput. We show that FB is practical and runs at line-rate on existing hardware (Barefoot Tofino). Significantly, FB's operations can be approximated in non-programmable devices.\n ", '\n This paper presents a performance analysis of the design space of optical datacenter networks, including both demand-oblivious (static or dynamic) and demand-aware networks. We formally show that the number of specific optical switch types which should be used in an optimized datacenter network, depends on the traffic pattern, and in particular, the flow size distribution.\n ', '\n This paper studies the structure of several real-world traces (including Facebook, High-Performance Computing, Machine Learning, and simulation generated traces) and presents a systematic approach to quantify and compare the structure of packet traces based on the entropy contained in the trace file. Insights into the structure of packet traces can lead to improved network algorithms that are optimized toward specific traffic patterns. 
We then present a methodology to quantify the temporal and non-temporal components of entropy contained in a packet trace, called the trace complexity, using randomization and compression. We show that trace complexity provides unique insights into the characteristics of various applications and argue that there is a need for traffic generation models that preserve the intrinsic structure of empirically measured application traces. We then propose a traffic generator model that is able to produce a synthetic trace that matches the complexity level of its corresponding real-world trace.\n ', '\n Neural network pruning is a popular technique used to reduce the inference costs of modern, potentially overparameterized, networks. Starting from a pre-trained network, the process is as follows: remove redundant parameters, retrain, and repeat while maintaining the same test accuracy. The result is a model that is a fraction of the size of the original with comparable predictive performance (test accuracy). Here, we reassess and evaluate whether the use of test accuracy alone in the terminating condition is sufficient to ensure that the resulting model performs well across a wide spectrum of "harder" metrics such as generalization to out-of-distribution data and resilience to noise. Across evaluations on varying architectures and data sets, we find that pruned networks effectively approximate the unpruned model, however, the prune ratio at which pruned networks achieve commensurate performance varies significantly across tasks. These results call into question the extent of \\emph{genuine} overparameterization in deep learning and raise concerns about the practicability of deploying pruned networks, specifically in the context of safety-critical systems, unless they are widely evaluated beyond test accuracy to reliably predict their performance. Our code is available at https://github.com/lucaslie/torchprune.\n ', "\n We introduce the maximum $n$-times coverage problem that selects $k$ overlays to maximize the summed coverage of weighted elements, where each element must be covered at least $n$ times. We also define the min-cost $n$-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least $n$ times is at least $τ$. Maximum $n$-times coverage is a generalization of the multi-set multi-cover problem, is NP-complete, and is not submodular. We introduce two new practical solutions for $n$-times coverage based on integer linear programming and sequential greedy optimization. We show that maximum $n$-times coverage is a natural way to frame peptide vaccine design, and find that it produces a pan-strain COVID-19 vaccine design that is superior to 29 other published designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.\n ", '\n Image classifiers are typically scored on their test set accuracy, but high accuracy can mask a subtle type of model failure. We find that high scoring convolutional neural networks (CNNs) on popular benchmarks exhibit troubling pathologies that allow them to display high accuracy even in the absence of semantically salient features. When a model provides a high-confidence decision without salient supporting input features, we say the classifier has overinterpreted its input, finding too much class-evidence in patterns that appear nonsensical to humans. 
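The sequential greedy heuristic mentioned in the n-times coverage abstract above takes only a few lines; since the objective is not submodular, the greedy choice carries no approximation guarantee. The toy overlays below are made up for illustration.

# Minimal sketch of sequential greedy selection for maximum n-times coverage.
# Each overlay is a set of elements; an element's weight counts toward the
# objective only once it is covered at least n times.
from collections import Counter

def covered_weight(counts, weights, n):
    return sum(w for e, w in weights.items() if counts[e] >= n)

def greedy_n_times_coverage(overlays, weights, n, k):
    """Pick k overlays greedily by marginal gain in n-times-covered weight."""
    counts, chosen = Counter(), []
    for _ in range(min(k, len(overlays))):
        base = covered_weight(counts, weights, n)
        gains = {i: covered_weight(counts + Counter(ov), weights, n) - base
                 for i, ov in enumerate(overlays) if i not in chosen}
        best = max(gains, key=gains.get)   # ties broken by index order
        chosen.append(best)
        counts.update(overlays[best])
    return chosen

# Toy instance: 4 overlays over elements a-d, unit weights, coverage requirement n=2.
overlays = [{"a", "b"}, {"b", "c"}, {"a", "c", "d"}, {"d"}]
weights = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
print(greedy_n_times_coverage(overlays, weights, n=2, k=2))   # [0, 1]: "b" covered twice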
Here, we demonstrate that neural networks trained on CIFAR-10 and ImageNet suffer from overinterpretation, and we find models on CIFAR-10 make confident predictions even when 95% of input images are masked and humans cannot discern salient features in the remaining pixel-subsets. We introduce Batched Gradient SIS, a new method for discovering sufficient input subsets for complex datasets, and use this method to show the sufficiency of border pixels in ImageNet for training and testing. Although these patterns portend potential model fragility in real-world deployment, they are in fact valid statistical patterns of the benchmark that alone suffice to attain high test accuracy. Unlike adversarial examples, overinterpretation relies upon unmodified image pixels. We find ensembling and input dropout can each help mitigate overinterpretation.\n ', '\n We introduce Information Condensing Active Learning (ICAL), a batch mode model agnostic Active Learning (AL) method targeted at Deep Bayesian Active Learning that focuses on acquiring labels for points which have as much information as possible about the still unacquired points. ICAL uses the Hilbert Schmidt Independence Criterion (HSIC) to measure the strength of the dependency between a candidate batch of points and the unlabeled set. We develop key optimizations that allow us to scale our method to large unlabeled sets. We show significant improvements in terms of model accuracy and negative log likelihood (NLL) on several image datasets compared to state of the art batch mode AL methods for deep learning.\n ', '\n The inaccuracy of neural network models on inputs that do not stem from the training data distribution is both problematic and at times unrecognized. Model uncertainty estimation can address this issue, where uncertainty estimates are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), a straightforward approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs that might be encountered in the future. When applied to various neural network ensembles, MOD significantly improves predictive performance for out-of-distribution test examples without sacrificing in-distribution performance on 38 Protein-DNA binding regression datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. Across many Bayesian optimization tasks, the performance of UCB acquisition is also greatly improved by leveraging MOD uncertainty estimates.\n ', "\n Local explanation frameworks aim to rationalize particular decisions made by a black-box prediction model. Existing techniques are often restricted to a specific type of predictor or based on input saliency, which may be undesirably sensitive to factors unrelated to the model's decision making process. We instead propose sufficient input subsets that identify minimal subsets of features whose observed values alone suffice for the same decision to be reached, even if all other input feature values are missing. General principles that globally govern a model's decision-making can also be revealed by searching for clusters of such input patterns across many data points. Our approach is conceptually straightforward, entirely model-agnostic, simply implemented using instance-wise backward selection, and able to produce more concise rationales than existing techniques. 
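The instance-wise backward selection behind sufficient input subsets can be prototyped against any black-box scoring function. In the sketch below, predict is a toy stand-in for the model's class-confidence function, and masking with zeros is an assumed convention for missing features.

# Minimal sketch of instance-wise backward selection for a sufficient input subset:
# repeatedly mask the feature whose removal hurts the target-class confidence least,
# and stop when any further removal would drop confidence below the threshold.
import numpy as np

def sufficient_input_subset(x, predict, threshold=0.9, mask_value=0.0):
    """Return indices of a small feature subset that keeps confidence >= threshold."""
    kept = set(range(len(x)))
    while len(kept) > 1:
        best_drop, best_conf = None, -1.0
        for i in kept:
            trial = np.array([x[j] if (j in kept and j != i) else mask_value
                              for j in range(len(x))])
            conf = predict(trial)
            if conf > best_conf:
                best_drop, best_conf = i, conf
        if best_conf < threshold:          # dropping anything more breaks the decision
            break
        kept.remove(best_drop)
    return sorted(kept)

# Toy black box: confidence driven mostly by features 0 and 3.
rng = np.random.default_rng(0)
w = np.array([3.0, 0.1, 0.1, 2.5, 0.1])
predict = lambda v: 1.0 / (1.0 + np.exp(-(v @ w - 2.0)))
x = rng.normal(1.0, 0.2, size=5)
print(sufficient_input_subset(x, predict))   # keeps the influential features, e.g. [0, 3]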
We demonstrate the utility of our interpretation method on various neural network models trained on text, image, and genomic data.\n ", '\n We present a nonparametric framework to model a short sequence of probability distributions that vary both due to underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential-progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a "trend" in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data.\n ', "\n Image synthesis via Generative Adversarial Networks (GANs) of three-dimensional (3D) medical images has great potential that can be extended to many medical applications, such as image enhancement and disease progression modeling. However, current GAN technologies for 3D medical image synthesis need to be significantly improved to be readily adapted to real-world medical problems. In this paper, we extend the state-of-the-art StyleGAN2 model, which natively works with two-dimensional images, to enable 3D image synthesis. In addition to the image synthesis, we investigate the controllability and interpretability of the 3D-StyleGAN via style vectors inherited from the original StyleGAN2 that are highly suitable for medical applications: (i) the latent space projection and reconstruction of unseen real images, and (ii) style mixing. We demonstrate the 3D-StyleGAN's performance and feasibility with ~12,000 three-dimensional full brain MR T1 images, although it can be applied to any 3D volumetric images. Furthermore, we explore different configurations of hyperparameters to investigate potential improvement of the image synthesis with larger networks. The codes and pre-trained networks are available online: https://github.com/sh4174/3DStyleGAN.\n ", '\n Fetal motion is unpredictable and rapid on the scale of conventional MR scan times. Therefore, dynamic fetal MRI, which aims at capturing fetal motion and dynamics of fetal function, is limited to fast imaging techniques with compromises in image quality and resolution. Super-resolution for dynamic fetal MRI is still a challenge, especially when multi-oriented stacks of image slices for oversampling are not available and high temporal resolution for recording the dynamics of the fetus or placenta is desired. Further, fetal motion makes it difficult to acquire high-resolution images for supervised learning methods. 
To address this problem, in this work, we propose STRESS (Spatio-Temporal Resolution Enhancement with Simulated Scans), a self-supervised super-resolution framework for dynamic fetal MRI with interleaved slice acquisitions. Our proposed method simulates an interleaved slice acquisition along the high-resolution axis on the originally acquired data to generate pairs of low- and high-resolution images. Then, it trains a super-resolution network by exploiting both spatial and temporal correlations in the MR time series, which is used to enhance the resolution of the original data. Evaluations on both simulated and in utero data show that our proposed method outperforms other self-supervised super-resolution methods and improves image quality, which is beneficial to other downstream tasks and evaluations.\n ', '\n Intravascular ultrasound and optical coherence tomography are widely available for characterizing coronary stenoses and provide critical vessel parameters to optimize percutaneous intervention. Intravascular polarization-sensitive optical coherence tomography (PS-OCT) simultaneously provides high-resolution cross-sectional images of vascular structures while also revealing preponderant tissue components such as collagen and smooth muscle and thereby enhances plaque characterization. Automated interpretation of these features promises to facilitate the objective clinical investigation of the natural history and significance of coronary atheromas. Here, we propose a convolutional neural network model, optimized using a new multi-term loss function, to classify the lumen, intima, and media layers in addition to the guidewire and plaque shadows. We demonstrate that our multi-class classification model outperforms state-of-the-art methods in detecting the coronary anatomical layers. Furthermore, the proposed model segments two classes of common imaging artifacts and detects the anatomical layers within the thickened vessel wall regions that were excluded from analysis by other studies. The source code and the trained model are publicly available at https://github.com/mhaft/OCTseg\n ', "\n BrainPainter is a software for the 3D visualization of human brain structures; it generates colored brain images using user-defined biomarker data for each brain region. However, BrainPainter is only able to generate human brain images. In this paper, we present updates to the existing BrainPainter software which enables the generation of mouse brain images. We load meshes for each mouse brain region, based on the Allen Mouse Brain Atlas, into Blender, a powerful 3D computer graphics engine. We then use Blender to color each region and generate images of subcortical, outer-cortical, inner-cortical, top and bottom view renders. In addition to those views, we add new render angles and separate visualization settings for the left and right hemispheres. While BrainPainter traditionally ran from the browser ( https://brainpainter.csail.mit.edu ), we also created a graphical user interface that launches image-generation requests in a user-friendly way, by connecting to the Blender backend via a Docker API. We illustrate a use case of BrainPainter for modeling the progression of tau protein accumulation in a mouse study. Our contributions can help neuroscientists visualize brains in mouse studies and show disease progression. 
In addition, integration into Blender can subsequently enable the generation of complex animations using a moving camera, generation of complex mesh deformations that simulate tumors and other pathologies, as well as visualization of toxic proteins using Blender's particle system.\n ", '\n We demonstrate an object tracking method for 3D images with fixed computational cost and state-of-the-art performance. Previous methods predicted transformation parameters from convolutional layers. We instead propose an architecture that does not include either flattening of convolutional features or fully connected layers, but instead relies on equivariant filters to preserve transformations between inputs and outputs (e.g. rot./trans. of inputs rotate/translate outputs). The transformation is then derived in closed form from the outputs of the filters. This method is useful for applications requiring low latency, such as real-time tracking. We demonstrate our model on synthetically augmented adult brain MRI, as well as fetal brain MRI, which is the intended use-case.\n ', '\n We propose and demonstrate a representation learning approach by maximizing the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results in the downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.\n ', '\n We show that for a wide class of harmonization/domain-invariance schemes several undesirable properties are unavoidable. If a predictive machine is made invariant to a set of domains, the accuracy of the output predictions (as measured by mutual information) is limited by the domain with the least amount of information to begin with. If a real label value is highly informative about the source domain, it cannot be accurately predicted by an invariant predictor. These results are simple and intuitive, but we believe that it is beneficial to state them for medical imaging harmonization.\n ', '\n Most existing algorithms for automatic 3D morphometry of human brain MRI scans are designed for data with near-isotropic voxels at approximately 1 mm resolution, and frequently have contrast constraints as well - typically requiring T1 scans (e.g., MP-RAGE). This limitation prevents the analysis of millions of MRI scans acquired with large inter-slice spacing ("thick slice") in clinical settings every year. The inability to quantitatively analyze these scans hinders the adoption of quantitative neuroimaging in healthcare, and precludes research studies that could attain huge sample sizes and hence greatly improve our understanding of the human brain. Recent advances in CNNs are producing outstanding results in super-resolution and contrast synthesis of MRI. However, these approaches are very sensitive to the contrast, resolution and orientation of the input images, and thus do not generalize to diverse clinical acquisition protocols - even within sites. 
Here we present SynthSR, a method to train a CNN that receives one or more thick-slice scans with different contrast, resolution and orientation, and produces an isotropic scan of canonical contrast (typically a 1 mm MP-RAGE). The presented method does not require any preprocessing, e.g., skull stripping or bias field correction. Crucially, SynthSR trains on synthetic input images generated from 3D segmentations, and can thus be used to train CNNs for any combination of contrasts, resolutions and orientations without high-resolution training data. We test the images generated with SynthSR in an array of common downstream analyses, and show that they can be reliably used for subcortical segmentation and volumetry, image registration (e.g., for tensor-based morphometry), and, if some image quality requirements are met, even cortical thickness morphometry. The source code is publicly available at github.com/BBillot/SynthSR.\n ", "\n Machine learning models are commonly trained end-to-end and in a supervised setting, using paired (input, output) data. Examples include recent super-resolution methods that train on pairs of (low-resolution, high-resolution) images. However, these end-to-end approaches require re-training every time there is a distribution shift in the inputs (e.g., night images vs daylight) or relevant latent variables (e.g., camera blur or hand motion). In this work, we leverage state-of-the-art (SOTA) generative models (here StyleGAN2) for building powerful image priors, which enable application of Bayes' theorem for many downstream reconstruction tasks. Our method, Bayesian Reconstruction through Generative Models (BRGM), uses a single pre-trained generator model to solve different image restoration tasks, i.e., super-resolution and in-painting, by combining it with different forward corruption models. We keep the weights of the generator model fixed, and reconstruct the image by estimating the Bayesian maximum a-posteriori (MAP) estimate over the input latent vector that generated the reconstructed image. We further use variational inference to approximate the posterior distribution over the latent vectors, from which we sample multiple solutions. We demonstrate BRGM on three large and diverse datasets: (i) 60,000 images from the Flickr Faces High Quality dataset, (ii) 240,000 chest X-rays from MIMIC III and (iii) a combined collection of 5 brain MRI datasets with 7,329 scans. Across all three datasets and without any dataset-specific hyperparameter tuning, our simple approach yields performance competitive with current task-specific state-of-the-art methods on super-resolution and in-painting, while being more generalisable and without requiring any training. Our source code and pre-trained models are available online: https://razvanmarinescu.github.io/brgm/.\n ", '\n Ensembling is now recognized as an effective approach for increasing the predictive performance and calibration of deep networks. We introduce a new approach, Parameter Ensembling by Perturbation (PEP), that constructs an ensemble of parameter values as random perturbations of the optimal parameter set from training by a Gaussian with a single variance parameter. The variance is chosen to maximize the log-likelihood of the ensemble average ($\\mathbb{L}$) on the validation data set. Empirically, and perhaps surprisingly, $\\mathbb{L}$ has a well-defined maximum as the variance grows from zero (which corresponds to the baseline model). 
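The PEP procedure just described (perturb the trained weights with a single-variance Gaussian, average the members' predictive probabilities, and keep the variance that maximizes validation log-likelihood) can be sketched generically in PyTorch. The toy model and one-batch validation loader below are placeholders, not the authors' code.

# Minimal sketch of Parameter Ensembling by Perturbation (PEP), assuming a trained
# classifier `model` and an iterable `val_loader` of (inputs, labels) batches.
import copy
import torch
import torch.nn.functional as F

def ensemble_log_likelihood(model, val_loader, sigma, n_members=5):
    """Mean validation log-likelihood of the perturbation-ensemble average."""
    total_ll, total_n = 0.0, 0
    for x, y in val_loader:
        with torch.no_grad():
            n_classes = model(x).shape[1]
            probs = torch.zeros(x.shape[0], n_classes)
            for _ in range(n_members):
                member = copy.deepcopy(model)
                for p in member.parameters():
                    p.add_(sigma * torch.randn_like(p))   # single shared variance
                probs += F.softmax(member(x), dim=1)
            probs /= n_members
            total_ll += torch.log(probs[torch.arange(len(y)), y] + 1e-12).sum().item()
            total_n += len(y)
    return total_ll / total_n

def pick_pep_sigma(model, val_loader, sigmas=(0.0, 0.01, 0.02, 0.05, 0.1)):
    """sigma = 0 recovers the baseline model; return the sigma with the best score."""
    scores = {s: ensemble_log_likelihood(model, val_loader, s) for s in sigmas}
    return max(scores, key=scores.get), scores

# Toy stand-ins for a trained network and a validation set.
model = torch.nn.Sequential(torch.nn.Linear(10, 3))
val_loader = [(torch.randn(64, 10), torch.randint(0, 3, (64,)))]
best_sigma, scores = pick_pep_sigma(model, val_loader)
print(best_sigma, scores)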
Conveniently, calibration level of predictions also tends to grow favorably until the peak of $\\mathbb{L}$ is reached. In most experiments, PEP provides a small improvement in performance, and, in some cases, a substantial improvement in empirical calibration. We show that this "PEP effect" (the gain in log-likelihood) is related to the mean curvature of the likelihood function and the empirical Fisher information. Experiments on ImageNet pre-trained networks including ResNet, DenseNet, and Inception showed improved calibration and likelihood. We further observed a mild improvement in classification accuracy on these networks. Experiments on classification benchmarks such as MNIST and CIFAR-10 showed improved calibration and likelihood, as well as the relationship between the PEP effect and overfitting; this demonstrates that PEP can be used to probe the level of overfitting that occurred during training. In general, no special training procedure or network architecture is needed, and in the case of pre-trained networks, no additional training is needed.\n ', "\n We present a semi-parametric generative model for predicting anatomy of a patient in subsequent scans following a single baseline image. Such predictive modeling promises to facilitate novel analyses in both voxel-level studies and longitudinal biomarker evaluation. We capture anatomical change through a combination of population-wide regression and a non-parametric model of the subject's health based on individual genetic and clinical indicators. In contrast to classical correlation and longitudinal analysis, we focus on predicting new observations from a single subject observation. We demonstrate prediction of follow-up anatomical scans in the ADNI cohort, and illustrate a novel analysis approach that compares a patient's scans to the predicted subject-specific healthy anatomical trajectory. The code is available at https://github.com/adalca/voxelorb.\n ", '\n Estimating mutual information between continuous random variables is often intractable and extremely challenging for high-dimensional data. Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information. Although showing promise for this difficult problem, the variational methods have been theoretically and empirically proven to have serious statistical limitations: 1) many methods struggle to produce accurate estimates when the underlying mutual information is either low or high; 2) the resulting estimators may suffer from high variance. Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution rather than from the product of its marginal distributions. Moreover, we establish a direct connection between mutual information and the average log odds estimate produced by the classifier on a test set, leading to a simple and accurate estimator of mutual information. We show theoretically that our method and other variational approaches are equivalent when they achieve their optimum, while our method sidesteps the variational bound. Empirical results demonstrate high accuracy of our approach and the advantages of our estimator in the context of representation learning. Our demo is available at https://github.com/RayRuizhiLiao/demi_mi_estimator.\n ', '\n We propose and demonstrate a novel machine learning algorithm that assesses pulmonary edema severity from chest radiographs. 
While large publicly available datasets of chest radiographs and free-text radiology reports exist, only limited numerical edema severity labels can be extracted from radiology reports. This is a significant challenge in learning such models for image classification. To take advantage of the rich information present in the radiology reports, we develop a neural network model that is trained on both images and free-text to assess pulmonary edema severity from chest radiographs at inference time. Our experimental results suggest that the joint image-text representation learning improves the performance of pulmonary edema assessment compared to a supervised model trained on images only. We also show the use of the text for explaining the image classification by the joint model. To the best of our knowledge, our approach is the first to leverage free-text radiology reports for improving the image model performance in this application. Our code is available at https://github.com/RayRuizhiLiao/joint_chestxray.\n ', '\n Purpose: To develop a machine learning model to classify the severity grades of pulmonary edema on chest radiographs.\n Materials and Methods: In this retrospective study, 369,071 chest radiographs and associated radiology reports from 64,581 (mean age, 51.71; 54.51% women) patients from the MIMIC-CXR chest radiograph dataset were included. This dataset was split into patients with and without congestive heart failure (CHF). Pulmonary edema severity labels from the associated radiology reports were extracted from patients with CHF as four different ordinal levels: 0, no edema; 1, vascular congestion; 2, interstitial edema; and 3, alveolar edema. Deep learning models were developed using two approaches: a semi-supervised model using a variational autoencoder and a pre-trained supervised learning model using a dense neural network. Receiver operating characteristic curve analysis was performed on both models.\n Results: The area under the receiver operating characteristic curve (AUC) for differentiating alveolar edema from no edema was 0.99 for the semi-supervised model and 0.87 for the pre-trained models. Performance of the algorithm was inversely related to the difficulty in categorizing milder states of pulmonary edema (shown as AUCs for semi-supervised model and pre-trained model, respectively): 2 versus 0, 0.88 and 0.81; 1 versus 0, 0.79 and 0.66; 3 versus 1, 0.93 and 0.82; 2 versus 1, 0.69 and 0.73; and, 3 versus 2, 0.88 and 0.63.\n Conclusion: Deep learning models were trained on a large chest radiograph dataset and could grade the severity of pulmonary edema on chest radiographs with high performance.\n ', '\n Fetal MRI is heavily constrained by unpredictable and substantial fetal motion that causes image artifacts and limits the set of viable diagnostic image contrasts. Current mitigation of motion artifacts is predominantly performed by fast, single-shot MRI and retrospective motion correction. Estimation of fetal pose in real time during MRI stands to benefit prospective methods to detect and mitigate fetal motion artifacts where inferred fetal motion is combined with online slice prescription with low-latency decision making. Current developments of deep reinforcement learning (DRL), offer a novel approach for fetal landmarks detection. In this task 15 agents are deployed to detect 15 landmarks simultaneously by DRL. The optimization is challenging, and here we propose an improved DRL that incorporates priors on physical structure of the fetal body. 
First, we use graph communication layers to improve the communication among agents based on a graph where each node represents a fetal-body landmark. Further, additional reward based on the distance between agents and physical structures such as the fetal limbs is used to fully exploit physical structure. Evaluation of this method on a repository of 3-mm resolution in vivo data demonstrates that 87.3% of landmark estimates fall within 10 mm of ground truth, with a mean error of 6.9 mm. The proposed DRL for fetal pose landmark search demonstrates potential clinical utility for online detection of fetal motion that guides real-time mitigation of motion artifacts as well as health diagnosis during MRI of the pregnant mother.\n ', '\n We demonstrate that neural network layers that explicitly combine frequency and image feature representations are a versatile building block for analysis of imaging data acquired in the frequency space. Our work is motivated by the challenges arising in MRI acquisition where the signal is a corrupted Fourier transform of the desired image. The joint learning schemes proposed and analyzed in this paper enable both correction of artifacts native to the frequency space and manipulation of image space representations to reconstruct coherent image structures. This is in contrast to most current deep learning approaches for image reconstruction that apply learned data manipulations solely in the frequency space or solely in the image space. We demonstrate the advantages of joint convolutional learning on three diverse tasks: image reconstruction from undersampled acquisitions, motion correction, and image denoising in brain and knee MRI. We further demonstrate advantages of the joint learning approaches across training schemes using a wide variety of loss functions. Unlike purely image based and purely frequency based architectures, the joint models produce consistently high quality output images across all tasks and datasets. Joint image and frequency space feature representations promise to significantly improve modeling and reconstruction of images acquired in the frequency space. Our code is available at https://github.com/nalinimsingh/interlacer.\n ', '\n Fetal brain MRI is useful for diagnosing brain abnormalities but is challenged by fetal motion. The current protocol for T2-weighted fetal brain MRI is not robust to motion so image volumes are degraded by inter- and intra-slice motion artifacts. In addition, manual annotation for fetal MR image quality assessment is usually time-consuming. Therefore, in this work, a semi-supervised deep learning method that detects slices with artifacts during the brain volume scan is proposed. Our method is based on the mean teacher model, where we not only enforce consistency between student and teacher models on the whole image, but also adopt an ROI consistency loss to guide the network to focus on the brain region. The proposed method is evaluated on a fetal brain MR dataset with 11,223 labeled images and more than 200,000 unlabeled images. Results show that compared with supervised learning, the proposed method can improve model accuracy by about 6\\% and outperform other state-of-the-art semi-supervised learning methods. The proposed method is also implemented and evaluated on an MR scanner, which demonstrates the feasibility of online image quality assessment and image reacquisition during fetal MR scans.\n ', '\n The histories of computer science and brain sciences are intertwined. 
In his unfinished manuscript "The Computer and the Brain," von Neumann debates whether or not the brain can be thought of as a computing machine and identifies some of the similarities and differences between natural and artificial computation. Turing, in his 1950 article in Mind, argues that computing devices could ultimately emulate intelligence, leading to his proposed Turing test. Herbert Simon predicted in 1957 that most psychological theories would take the form of a computer program. In 1976, David Marr proposed that the function of the visual system could be abstracted and studied at computational and algorithmic levels that did not depend on the underlying physical substrate.\n In December 2014, a two-day workshop supported by the Computing Community Consortium (CCC) and the National Science Foundation\'s Computer and Information Science and Engineering Directorate (NSF CISE) was convened in Washington, DC, with the goal of bringing together computer scientists and brain researchers to explore these new opportunities and connections, and develop a new, modern dialogue between the two research communities. Specifically, our objectives were: 1. To articulate a conceptual framework for research at the interface of brain sciences and computing and to identify key problems in this interface, presented in a way that will attract both CISE and brain researchers into this space. 2. To inform and excite researchers within the CISE research community about brain research opportunities and to identify and explain strategic roles they can play in advancing this initiative. 3. To develop new connections, conversations and collaborations between brain sciences and CISE researchers that will lead to highly relevant and competitive proposals, high-impact research, and influential publications.\n ', '\n We present the findings of "The Alzheimer\'s Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer\'s disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer\'s Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guessing. Two ensemble methods based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better than average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as patient-specific biomarker trends. The submission system remains open via the website https://tadpole.grand-challenge.org, while code for submissions is being collated by TADPOLE SHARE: https://tadpole-share.github.io/. 
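Illustrative aside: the ensemble result noted above, where simply taking the mean or median over all submitted forecasts scored near the top, is easy to sketch. The array below is made-up toy data, not actual TADPOLE submissions.

    import numpy as np

    # Toy sketch: combine K submitted forecasts for N subjects into mean- and
    # median-ensemble predictions. `forecasts` is hypothetical data of shape (K, N).
    rng = np.random.default_rng(0)
    forecasts = rng.normal(loc=1.0, scale=0.1, size=(5, 10))

    mean_ensemble = forecasts.mean(axis=0)          # average prediction per subject
    median_ensemble = np.median(forecasts, axis=0)  # median prediction per subject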
Our work suggests that current prediction algorithms are accurate for biomarkers related to clinical diagnosis and ventricle volume, opening up the possibility of cohort refinement in clinical trials for Alzheimer\'s disease.\n ', "\n The TADPOLE Challenge compares the performance of algorithms at predicting the future evolution of individuals at risk of Alzheimer's disease. TADPOLE Challenge participants train their models and algorithms on historical data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Participants are then required to make forecasts of three key outcomes for ADNI-3 rollover participants: clinical diagnosis, ADAS-Cog 13, and total volume of the ventricles -- which are then compared with future measurements. Strong points of the challenge are that the test data did not exist at the time of forecasting (it was acquired afterwards), and that it focuses on the challenging problem of cohort selection for clinical trials by identifying fast progressors. The submission phase of TADPOLE was open until 15 November 2017; since then data has been acquired until April 2019 from 219 subjects with 223 clinical visits and 150 Magnetic Resonance Imaging (MRI) scans, which was used for the evaluation of the participants' predictions. Thirty-three teams participated with a total of 92 submissions. No single submission was best at predicting all three outcomes. For diagnosis prediction, the best forecast (team Frog), which was based on gradient boosting, obtained a multiclass area under the receiver-operating curve (MAUC) of 0.931, while for ventricle prediction the best forecast (team EMC1), which was based on disease progression modelling and spline regression, obtained mean absolute error of 0.41% of total intracranial volume (ICV). For ADAS-Cog 13, no forecast was considerably better than the benchmark mixed effects model (BenchmarkME), provided to participants before the submission deadline. Further analysis can help understand which input features and algorithms are most suitable for Alzheimer's disease prediction and for aiding patient stratification in clinical trials.\n ", '\n We propose and demonstrate a joint model of anatomical shapes, image features and clinical indicators for statistical shape modeling and medical image analysis. The key idea is to employ a copula model to separate the joint dependency structure from the marginal distributions of variables of interest. This separation provides flexibility on the assumptions made during the modeling process. The proposed method can handle binary, discrete, ordinal and continuous variables. We demonstrate a simple and efficient way to include binary, discrete and ordinal variables into the modeling. We build Bayesian conditional models based on observed partial clinical indicators, features or shape based on Gaussian processes capturing the dependency structure. We apply the proposed method on a stroke dataset to jointly model the shape of the lateral ventricles, the spatial distribution of the white matter hyperintensity associated with periventricular white matter disease, and clinical indicators. The proposed method yields interpretable joint models for data exploration and patient-specific statistical shape models for medical image analysis.\n ', '\n The performance and diagnostic utility of magnetic resonance imaging (MRI) in pregnancy is fundamentally constrained by fetal motion. 
Motion of the fetus, which is unpredictable and rapid on the scale of conventional imaging times, limits the set of viable acquisition techniques to single-shot imaging with severe compromises in signal-to-noise ratio and diagnostic contrast, and frequently results in unacceptable image quality. Surprisingly little is known about the characteristics of fetal motion during MRI and here we propose and demonstrate methods that exploit a growing repository of MRI observations of the gravid abdomen that are acquired at low spatial resolution but relatively high temporal resolution and over long durations (10-30 minutes). We estimate fetal pose per frame in MRI volumes of the pregnant abdomen via deep learning algorithms that detect key fetal landmarks. Evaluation of the proposed method shows that our framework achieves quantitatively an average error of 4.47 mm and 96.4\\% accuracy (with error less than 10 mm). Fetal pose estimation in MRI time series yields novel means of quantifying fetal movements in health and disease, and enables the learning of kinematic models that may enhance prospective mitigation of fetal motion artifacts during MRI acquisition.\n ', "\n We present BrainPainter, a software that automatically generates images of highlighted brain structures given a list of numbers corresponding to the output colours of each region. Compared to existing visualisation software (i.e. Freesurfer, SPM, 3D Slicer), BrainPainter has three key advantages: (1) it does not require the input data to be in a specialised format, allowing BrainPainter to be used in combination with any neuroimaging analysis tools, (2) it can visualise both cortical and subcortical structures and (3) it can be used to generate movies showing dynamic processes, e.g. propagation of pathology on the brain. We highlight three use cases where BrainPainter was used in existing neuroimaging studies: (1) visualisation of the degree of atrophy through interpolation along a user-defined gradient of colours, (2) visualisation of the progression of pathology in Alzheimer's disease as well as (3) visualisation of pathology in subcortical regions in Huntington's disease. Moreover, through the design of BrainPainter we demonstrate the possibility of using a powerful 3D computer graphics engine such as Blender to generate brain visualisations for the neuroscience community. Blender's capabilities, e.g. particle simulations, motion graphics, UV unwrapping, raster graphics editing, raytracing and illumination effects, open a wealth of possibilities for brain visualisation not available in current neuroimaging software. BrainPainter is customisable, easy to use, and can run straight from the web browser: https://brainpainter.csail.mit.edu , as well as from source-code packaged in a docker container: https://github.com/mrazvan22/brain-coloring . It can be used to visualise biomarker data from any brain imaging modality, or simply to highlight a particular brain structure for e.g. anatomy courses.\n ", '\n Probabilistic atlas priors have been commonly used to derive adaptive and robust brain MRI segmentation algorithms. Widely-used neuroimage analysis pipelines rely heavily on these techniques, which are often computationally expensive. In contrast, there has been a recent surge of approaches that leverage deep learning to implement segmentation tools that are computationally efficient at test time. However, most of these strategies rely on learning from manually annotated images. 
These supervised deep learning methods are therefore sensitive to the intensity profiles in the training dataset. To develop a deep learning-based segmentation model for a new image dataset (e.g., of different contrast), one usually needs to create a new labeled training dataset, which can be prohibitively expensive, or rely on suboptimal ad hoc adaptation or augmentation approaches. In this paper, we propose an alternative strategy that combines a conventional probabilistic atlas-based segmentation with deep learning, enabling one to train a segmentation model for new MRI scans without the need for any manually segmented images. Our experiments include thousands of brain MRI scans and demonstrate that the proposed method achieves good accuracy for a brain MRI segmentation task for different MRI contrasts, requiring only approximately 15 seconds at test time on a GPU. The code is freely available at http://voxelmorph.mit.edu.\n ', '\n We present a volumetric mesh-based algorithm for flattening the placenta to a canonical template to enable effective visualization of local anatomy and function. Monitoring placental function in vivo promises to support pregnancy assessment and to improve care outcomes. We aim to alleviate visualization and interpretation challenges presented by the shape of the placenta when it is attached to the curved uterine wall. To do so, we flatten the volumetric mesh that captures placental shape to resemble the well-studied ex vivo shape. We formulate our method as a map from the in vivo shape to a flattened template that minimizes the symmetric Dirichlet energy to control distortion throughout the volume. Local injectivity is enforced via constrained line search during gradient descent. We evaluate the proposed method on 28 placenta shapes extracted from MRI images in a clinical study of placental function. We achieve sub-voxel accuracy in mapping the boundary of the placenta to the template while successfully controlling distortion throughout the volume. We illustrate how the resulting mapping of the placenta enhances visualization of placental anatomy and function. Our code is freely available at https://github.com/mabulnaga/placenta-flattening .\n ', '\n Segmentation of structural and diffusion MRI (sMRI/dMRI) is usually performed independently in neuroimaging pipelines. However, some brain structures (e.g., globus pallidus, thalamus and its nuclei) can be extracted more accurately by fusing the two modalities. Following the framework of Bayesian segmentation with probabilistic atlases and unsupervised appearance modeling, we present here a novel algorithm to jointly segment multi-modal sMRI/dMRI data. We propose a hierarchical likelihood term for the dMRI defined on the unit ball, which combines the Beta and Dimroth-Scheidegger-Watson distributions to model the data at each voxel. This term is integrated with a mixture of Gaussians for the sMRI data, such that the resulting joint unsupervised likelihood enables the analysis of multi-modal scans acquired with any type of MRI contrast, b-values, or number of directions, which enables wide applicability. We also propose an inference algorithm to estimate the maximum-a-posteriori model parameters from input images, and to compute the most likely segmentation. 
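Illustrative aside: the segmentation approach described above combines a probabilistic atlas prior with per-voxel likelihood terms and then computes the most likely label at each voxel. The toy sketch below uses only a Gaussian intensity likelihood with made-up parameters; the paper's actual model also includes the Beta and Dimroth-Scheidegger-Watson terms for the dMRI data.

    import numpy as np

    # Toy sketch of per-voxel MAP labeling: combine a probabilistic atlas prior
    # with a Gaussian intensity likelihood and take the argmax label per voxel.
    rng = np.random.default_rng(0)
    n_voxels, n_labels = 1000, 4
    intensities = rng.normal(size=n_voxels)                    # fake sMRI intensities
    atlas_prior = rng.dirichlet(np.ones(n_labels), n_voxels)   # p(label) at each voxel
    mu = np.array([-1.0, 0.0, 1.0, 2.0])                       # assumed per-label means
    sigma = np.full(n_labels, 0.5)                             # assumed per-label std devs

    # log p(intensity | label) for every voxel/label pair
    log_lik = (-0.5 * ((intensities[:, None] - mu) / sigma) ** 2
               - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    log_post = log_lik + np.log(atlas_prior + 1e-12)           # unnormalized log posterior
    segmentation = log_post.argmax(axis=1)                     # MAP label per voxel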
Using a recently published atlas derived from histology, we apply our method to thalamic nuclei segmentation on two datasets: HCP (state of the art) and ADNI (legacy) - producing lower sample sizes than Bayesian segmentation with sMRI alone.\n ', '\n We present a robust method to correct for motion in volumetric in-utero MRI time series. Time-course analysis for in-utero volumetric MRI time series often suffers from substantial and unpredictable fetal motion. Registration provides voxel correspondences between images and is commonly employed for motion correction. Current registration methods often fail when aligning images that are substantially different from a template (reference image). To achieve accurate and robust alignment, we make a Markov assumption on the nature of motion and take advantage of the temporal smoothness in the image data. Forward message passing in the corresponding hidden Markov model (HMM) yields an estimation algorithm that only has to account for relatively small motion between consecutive frames. We evaluate the utility of the temporal model in the context of in-utero MRI time series alignment by examining the accuracy of propagated segmentation label maps. Our results suggest that the proposed model captures accurately the temporal dynamics of transformations in in-utero MRI time series.\n ', '\n We propose and demonstrate machine learning algorithms to assess the severity of pulmonary edema in chest x-ray images of congestive heart failure patients. Accurate assessment of pulmonary edema in heart failure is critical when making treatment and disposition decisions. Our work is grounded in a large-scale clinical dataset of over 300,000 x-ray images with associated radiology reports. While edema severity labels can be extracted unambiguously from a small fraction of the radiology reports, accurate annotation is challenging in most cases. To take advantage of the unlabeled images, we develop a Bayesian model that includes a variational auto-encoder for learning a latent representation from the entire image set trained jointly with a regressor that employs this representation for predicting pulmonary edema severity. Our experimental results suggest that modeling the distribution of images jointly with the limited labels improves the accuracy of pulmonary edema scoring compared to a strictly supervised approach. To the best of our knowledge, this is the first attempt to employ machine learning algorithms to automatically and quantitatively assess the severity of pulmonary edema in chest x-ray images.\n ', "\n We introduce Disease Knowledge Transfer (DKT), a novel technique for transferring biomarker information between related neurodegenerative diseases. DKT infers robust multimodal biomarker trajectories in rare neurodegenerative diseases even when only limited, unimodal data is available, by transferring information from larger multimodal datasets from common neurodegenerative diseases. DKT is a joint-disease generative model of biomarker progressions, which exploits biomarker relationships that are shared across diseases. Our proposed method allows, for the first time, the estimation of plausible, multimodal biomarker trajectories in Posterior Cortical Atrophy (PCA), a rare neurodegenerative disease where only unimodal MRI data is available. 
For this we train DKT on a combined dataset containing subjects with two distinct diseases and sizes of data available: 1) a larger, multimodal typical AD (tAD) dataset from the TADPOLE Challenge, and 2) a smaller unimodal Posterior Cortical Atrophy (PCA) dataset from the Dementia Research Centre (DRC), for which only a limited number of Magnetic Resonance Imaging (MRI) scans are available. Although validation is challenging due to lack of data in PCA, we validate DKT on synthetic data and two patient datasets (TADPOLE and PCA cohorts), showing it can estimate the ground truth parameters in the simulation and predict unseen biomarkers on the two patient datasets. While we demonstrated DKT on Alzheimer's variants, we note DKT is generalisable to other forms of related neurodegenerative diseases. Source code for DKT is available online: https://github.com/mrazvan22/dkt.\n ", '\n We propose a new iterative segmentation model which can be accurately learned from a small dataset. A common approach is to train a model to directly segment an image, requiring a large collection of manually annotated images to capture the anatomical variability in a cohort. In contrast, we develop a segmentation model that recursively evolves a segmentation in several steps, and implement it as a recurrent neural network. We learn model parameters by optimizing the intermediate steps of the evolution in addition to the final segmentation. To this end, we train our segmentation propagation model by presenting incomplete and/or inaccurate input segmentations paired with a recommended next step. Our work aims to alleviate challenges in segmenting heart structures from cardiac MRI for patients with congenital heart disease (CHD), which encompasses a range of morphological deformations and topological changes. We demonstrate the advantages of this approach on a dataset of 20 images from CHD patients, learning a model that accurately segments individual heart chambers and great vessels. Compared to direct segmentation, the iterative method yields more accurate segmentation for patients with the most severe CHD malformations.\n ', '\n We present an algorithm for creating high resolution anatomically plausible images consistent with acquired clinical brain MRI scans with large inter-slice spacing. Although large data sets of clinical images contain a wealth of information, time constraints during acquisition result in sparse scans that fail to capture much of the anatomy. These characteristics often render computational analysis impractical as many image analysis algorithms tend to fail when applied to such images. Highly specialized algorithms that explicitly handle sparse slice spacing do not generalize well across problem domains. In contrast, we aim to enable application of existing algorithms that were originally developed for high resolution research scans to significantly undersampled scans. We introduce a generative model that captures fine-scale anatomical structure across subjects in clinical image collections and derive an algorithm for filling in the missing data in scans with large inter-slice spacing. Our experimental results demonstrate that the resulting method outperforms state-of-the-art upsampling super-resolution techniques, and promises to facilitate subsequent analysis not previously possible with scans of this quality. 
Our implementation is freely available at https://github.com/adalca/papago .\n ', '\n We introduce an approach for image segmentation based on sparse correspondences between keypoints in testing and training images. Keypoints represent automatically identified distinctive image locations, where each keypoint correspondence suggests a transformation between images. We use these correspondences to transfer label maps of entire organs from the training images to the test image. The keypoint transfer algorithm includes three steps: (i) keypoint matching, (ii) voting-based keypoint labeling, and (iii) keypoint-based probabilistic transfer of organ segmentations. We report segmentation results for abdominal organs in whole-body CT and MRI, as well as in contrast-enhanced CT and MRI. Our method offers a speed-up of about three orders of magnitude in comparison to common multi-atlas segmentation, while achieving an accuracy that compares favorably. Moreover, keypoint transfer does not require the registration to an atlas or a training phase. Finally, the method allows for the segmentation of scans with highly variable field-of-view.\n ', '\n A reliable Ultrasound (US)-to-US registration method to compensate for brain shift would substantially improve Image-Guided Neurological Surgery. Developing such a registration method is very challenging, due to factors such as missing correspondence in images, the complexity of brain pathology and the demand for fast computation. We propose a novel feature-driven active framework. Here, landmarks and their displacement are first estimated from a pair of US images using corresponding local image features. Subsequently, a Gaussian Process (GP) model is used to interpolate a dense deformation field from the sparse landmarks. Kernels of the GP are estimated by using variograms and a discrete grid search method. If necessary, the user can actively add new landmarks based on the image context and visualization of the uncertainty measure provided by the GP to further improve the result. We retrospectively demonstrate our registration framework as a robust and accurate brain shift compensation solution on clinical data acquired during neurosurgery.\n ', '\n We present a robust method to correct for motion and deformations for in-utero volumetric MRI time series. Spatio-temporal analysis of dynamic MRI requires robust alignment across time in the presence of substantial and unpredictable motion. We make a Markov assumption on the nature of deformations to take advantage of the temporal structure in the image data. Forward message passing in the corresponding hidden Markov model (HMM) yields an estimation algorithm that only has to account for relatively small motion between consecutive frames. We demonstrate the utility of the temporal model by showing that its use improves the accuracy of the segmentation propagation through temporal registration. Our results suggest that the proposed model captures accurately the temporal dynamics of deformations in in-utero MRI time series.\n ', '\n Despite the popularity and empirical success of patch-based nearest-neighbor and weighted majority voting approaches to medical image segmentation, there has been no theoretical development on when, why, and how well these nonparametric methods work. We bridge this gap by providing a theoretical performance guarantee for nearest-neighbor and weighted majority voting segmentation under a new probabilistic model for patch-based image segmentation. 
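Illustrative aside: a minimal sketch of patch-based weighted majority voting on toy data. Each training patch votes for its center label with a weight that decays with patch distance; this is a simplification for illustration, not the paper's iterative algorithm.

    import numpy as np

    # Label a test patch by letting every training patch vote for its center
    # label, weighted by similarity to the test patch.
    rng = np.random.default_rng(0)
    train_patches = rng.normal(size=(200, 25))   # 200 training patches (5x5, flattened)
    train_labels = rng.integers(0, 3, size=200)  # center label of each training patch
    test_patch = rng.normal(size=25)

    d2 = ((train_patches - test_patch) ** 2).sum(axis=1)  # squared patch distances
    weights = np.exp(-d2 / d2.mean())                      # similarity weights
    votes = np.bincount(train_labels, weights=weights, minlength=3)
    predicted_label = votes.argmax()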
Our analysis relies on a new local property for how similar nearby patches are, and fuses existing lines of work on modeling natural imagery patches and theory for nonparametric classification. We use the model to derive a new patch-based segmentation algorithm that iterates between inferring local label patches and merging these local segmentations to produce a globally consistent image segmentation. Many existing patch-based algorithms arise as special cases of the new algorithm.\n ', '\n High computational costs of manifold learning prohibit its application for large point sets. A common strategy to overcome this problem is to perform dimensionality reduction on selected landmarks and to successively embed the entire dataset with the Nyström method. The two main challenges that arise are: (i) the landmarks selected in non-Euclidean geometries must result in a low reconstruction error, (ii) the graph constructed from sparsely sampled landmarks must approximate the manifold well. We propose the sampling of landmarks from determinantal distributions on non-Euclidean spaces. Since current determinantal sampling algorithms have the same complexity as those for manifold learning, we present an efficient approximation running in linear time. Further, we recover the local geometry after the sparsification by assigning each landmark a local covariance matrix, estimated from the original point set. The resulting neighborhood selection based on the Bhattacharyya distance improves the embedding of sparsely sampled manifolds. Our experiments show a significant performance improvement compared to state-of-the-art landmark selection techniques.\n ', '\n We propose a novel diverse feature selection method based on determinantal point processes (DPPs). Our model enables one to flexibly define diversity based on the covariance of features (similar to orthogonal matching pursuit) or alternatively based on side information. We introduce our approach in the context of Bayesian sparse regression, employing a DPP as a variational approximation to the true spike and slab posterior distribution. We subsequently show how this variational DPP approximation generalizes and extends mean-field approximation, and can be learned efficiently by exploiting the fast sampling properties of DPPs. Our motivating application comes from bioinformatics, where we aim to identify a diverse set of genes whose expression profiles predict a tumor type where the diversity is defined with respect to a gene-gene interaction network. We also explore an application in spatial statistics. In both cases, we demonstrate that the proposed method yields significantly more diverse feature sets than classic sparse methods, without compromising accuracy.\n ', '\n Manifold learning has been successfully applied to a variety of medical imaging problems. Its use in real-time applications requires fast projection onto the low-dimensional space. To this end, out-of-sample extensions are applied by constructing an interpolation function that maps from the input space to the low-dimensional manifold. Commonly used approaches such as the Nyström extension and kernel ridge regression require using all training points. We propose an interpolation function that only depends on a small subset of the input training data. Consequently, in the testing phase each new point only needs to be compared against a small number of input training data in order to project the point onto the low-dimensional space. 
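Illustrative aside: the sketch below shows a landmark-based out-of-sample projection via kernel ridge regression, so that a new point is compared only against a small set of landmarks. The data and kernel are made up; the paper instead solves a convex problem with an explicit bound on the approximation error.

    import numpy as np

    # Fit a kernel ridge regressor from landmark points to their known embedding,
    # then project a new point using only its similarities to the landmarks.
    rng = np.random.default_rng(0)
    landmarks = rng.normal(size=(50, 10))     # 50 landmark points in input space
    embedding = rng.normal(size=(50, 2))      # their known 2-D embedding

    def rbf(a, b, gamma=0.1):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    K = rbf(landmarks, landmarks)
    alpha = np.linalg.solve(K + 1e-3 * np.eye(len(K)), embedding)  # ridge solution

    x_new = rng.normal(size=(1, 10))
    y_new = rbf(x_new, landmarks) @ alpha     # projection uses landmarks only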
We interpret our method as an out-of-sample extension that approximates kernel ridge regression. Our method involves solving a simple convex optimization problem and has the attractive property of guaranteeing an upper bound on the approximation error, which is crucial for medical applications. Tuning this error bound controls the sparsity of the resulting interpolation function. We illustrate our method in two clinical applications that require fast mapping of input images onto a low-dimensional space.\n ', '\n Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning. Consequently, there has been significant work on efficiently approximating matrix multiplies. We introduce a learning-based algorithm for this task that greatly outperforms existing methods. Experiments using hundreds of matrices from diverse domains show that it often runs $100\\times$ faster than exact matrix products and $10\\times$ faster than current approximate methods. In the common case that one matrix is known ahead of time, our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling$-$the core operations of our method$-$could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.\n ', '\n The impact of machine learning models on healthcare will depend on the degree of trust that healthcare professionals place in the predictions made by these models. In this paper, we present a method to provide people with clinical expertise with domain-relevant evidence about why a prediction should be trusted. We first design a probabilistic model that relates meaningful latent concepts to prediction targets and observed data. Inference of latent variables in this model corresponds to both making a prediction and providing supporting evidence for that prediction. We present a two-step process to efficiently approximate inference: (i) estimating model parameters using variational learning, and (ii) approximating maximum a posteriori estimation of latent variables in the model using a neural network, trained with an objective derived from the probabilistic model. We demonstrate the method on the task of predicting mortality risk for patients with cardiovascular disease. Specifically, using electrocardiogram and tabular data as input, we show that our approach provides appropriate domain-relevant supporting evidence for accurate predictions.\n ', '\n We present HyperMorph, a learning-based strategy for deformable image registration that removes the need to tune important registration hyperparameters during training. Classical registration methods solve an optimization problem to find a set of spatial correspondences between two images, while learning-based methods leverage a training dataset to learn a function that generates these correspondences. The quality of the results for both types of techniques depends greatly on the choice of hyperparameters. Unfortunately, hyperparameter tuning is time-consuming and typically involves training many separate models with various hyperparameter values, potentially leading to suboptimal results. To address this inefficiency, we introduce amortized hyperparameter learning for image registration, a novel strategy to learn the effects of hyperparameters on deformation fields. 
The proposed framework learns a hypernetwork that takes in an input hyperparameter and modulates a registration network to produce the optimal deformation field for that hyperparameter value. In effect, this strategy trains a single, rich model that enables rapid, fine-grained discovery of hyperparameter values from a continuous interval at test-time. We demonstrate that this approach can be used to optimize multiple hyperparameters considerably faster than existing search strategies, leading to a reduced computational and human burden as well as increased flexibility. We also show several important benefits, including increased robustness to initialization and the ability to rapidly identify optimal hyperparameter values specific to a registration task, dataset, or even a single anatomical region, all without retraining the HyperMorph model. Our code is publicly available at http://voxelmorph.mit.edu.\n ', '\n Test-time augmentation -- the aggregation of predictions across transformed versions of a test input -- is a common practice in image classification. Traditionally, predictions are combined using a simple average. In this paper, we present 1) experimental analyses that shed light on cases in which the simple average is suboptimal and 2) a method to address these shortcomings. A key finding is that even when test-time augmentation produces a net improvement in accuracy, it can change many correct predictions into incorrect predictions. We delve into when and why test-time augmentation changes a prediction from being correct to incorrect and vice versa. Building on these insights, we present a learning-based method for aggregating test-time augmentations. Experiments across a diverse set of models, datasets, and augmentations show that our method delivers consistent improvements over existing approaches.\n ', '\n Current unsupervised domain adaptation methods can address many types of distribution shift, but they assume data from the source domain is freely available. As the use of pre-trained models becomes more prevalent, it is reasonable to assume that source data is unavailable. We propose an unsupervised method for adapting a source classifier to a target domain that varies from the source domain along natural axes, such as brightness and contrast. Our method only requires access to unlabeled target instances and the source classifier. We validate our method in scenarios where the distribution shift involves brightness, contrast, and rotation and show that it outperforms fine-tuning baselines in scenarios with limited labeled data.\n ', '\n Changes over time in brain anatomy can provide important insight for treatment design or scientific analyses. We present a method that predicts how a brain MRI for an individual will change over time. We model changes using a diffeomorphic deformation field that we predict using convolutional neural networks. Given a predicted deformation field, a baseline scan can be warped to give a prediction of the brain scan at a future time. We demonstrate the method using the ADNI cohort, and analyze how performance is affected by model variants and the subject-specific information provided. We show that the model provides good predictions and that external clinical data can improve predictions.\n ', '\n Neural network pruning---the task of reducing the size of a network by removing parameters---has been the subject of a great deal of work in recent years. 
We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods. We use ShrinkBench to compare various pruning techniques and show that its comprehensive evaluation can prevent common pitfalls when comparing pruning methods.\n ', '\n Global eradication of malaria depends on the development of drugs effective against the silent, yet obligate liver stage of the disease. The gold standard in drug development remains microscopic imaging of liver stage parasites in in vitro cell culture models. Image analysis presents a major bottleneck in this pipeline since the parasite has significant variability in size, shape, and density in these models. As with other highly variable datasets, traditional segmentation models have poor generalizability as they rely on hand-crafted features; thus, manual annotation of liver stage malaria images remains standard. To address this need, we develop a convolutional neural network architecture that utilizes spatial dropout sampling for parasite segmentation and epistemic uncertainty estimation in images of liver stage malaria. Our pipeline produces high-precision segmentations nearly identical to expert annotations, generalizes well on a diverse dataset of liver stage malaria parasites, and promotes independence between learned feature maps to model the uncertainty of generated predictions.\n ', '\n Estimation of individual treatment effects is commonly used as the basis for contextual decision making in fields such as healthcare, education, and economics. However, it is often sufficient for the decision maker to have estimates of upper and lower bounds on the potential outcomes of decision alternatives to assess risks and benefits. We show that, in such cases, we can improve sample efficiency by estimating simple functions that bound these outcomes instead of estimating their conditional expectations, which may be complex and hard to estimate. Our analysis highlights a trade-off between the complexity of the learning task and the confidence with which the learned bounds hold. Guided by these findings, we develop an algorithm for learning upper and lower bounds on potential outcomes which optimize an objective function defined by the decision maker, subject to the probability that bounds are violated being small. Using a clinical dataset and a well-known causality benchmark, we demonstrate that our algorithm outperforms baselines, providing tighter, more reliable bounds.\n ', '\n We develop a learning framework for building deformable templates, which play a fundamental role in many image analysis and computational anatomy tasks. Conventional methods for template creation and image alignment to the template have undergone decades of rich technical development. In these frameworks, templates are constructed using an iterative process of template estimation and alignment, which is often computationally very expensive. 
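Illustrative aside: the classical iterative template construction referred to above alternates between aligning all images to the current template and re-estimating the template as their mean. The 1-D toy sketch below uses integer shifts chosen by correlation as a stand-in for real deformable alignment.

    import numpy as np

    # Alternate (1) aligning every image to the current template and
    # (2) re-estimating the template as the mean of the aligned images.
    rng = np.random.default_rng(0)
    base = np.sin(np.linspace(0, 2 * np.pi, 100))
    images = [np.roll(base, rng.integers(-10, 10)) + 0.05 * rng.normal(size=100)
              for _ in range(20)]

    template = np.mean(images, axis=0)
    for _ in range(5):  # a few estimation/alignment iterations
        aligned = []
        for img in images:
            shifts = [np.roll(img, s) for s in range(-15, 16)]
            best = max(shifts, key=lambda x: np.dot(x, template))  # best correlation
            aligned.append(best)
        template = np.mean(aligned, axis=0)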
Due in part to this shortcoming, most methods compute a single template for the entire population of images, or a few templates for specific sub-groups of the data. In this work, we present a probabilistic model and efficient learning strategy that yields either universal or conditional templates, jointly with a neural network that provides efficient alignment of the images to these templates. We demonstrate the usefulness of this method on a variety of domains, with a special focus on neuroimaging. This is particularly useful for clinical applications where a pre-existing template does not exist, or creating a new one with traditional methods can be prohibitively expensive. Our code and atlases are available online as part of the VoxelMorph library at http://voxelmorph.csail.mit.edu.\n ', '\n Classical deformable registration techniques achieve impressive results and offer a rigorous theoretical treatment, but are computationally intensive since they solve an optimization problem for each image pair. Recently, learning-based methods have facilitated fast registration by learning spatial deformation functions. However, these approaches use restricted deformation models, require supervised labels, or do not guarantee a diffeomorphic (topology-preserving) registration. Furthermore, learning-based registration tools have not been derived from a probabilistic framework that can offer uncertainty estimates.\n In this paper, we build a connection between classical and learning-based methods. We present a probabilistic generative model and derive an unsupervised learning-based inference algorithm that uses insights from classical registration methods and makes use of recent developments in convolutional neural networks (CNNs). We demonstrate our method on a 3D brain registration task for both images and anatomical surfaces, and provide extensive empirical analyses. Our principled approach results in state of the art accuracy and very fast runtimes, while providing diffeomorphic guarantees. Our implementation is available at http://voxelmorph.csail.mit.edu.\n ', '\n A wide range of systems exhibit high dimensional incomplete data. Accurate estimation of the missing data is often desired, and is crucial for many downstream analyses. Many state-of-the-art recovery methods involve supervised learning using datasets containing full observations. In contrast, we focus on unsupervised estimation of missing image data, where no full observations are available - a common situation in practice. Unsupervised imputation methods for images often employ a simple linear subspace to capture correlations between data dimensions, omitting more complex relationships. In this work, we introduce a general probabilistic model that describes sparse high dimensional imaging data as being generated by a deep non-linear embedding. We derive a learning algorithm using a variational approximation based on convolutional neural networks and discuss its relationship to linear imputation models, the variational auto encoder, and deep image priors. We introduce sparsity-aware network building blocks that explicitly model observed and missing data. We analyze proposed sparsity-aware network building blocks, evaluate our method on public domain imaging datasets, and conclude by showing that our method enables imputation in an important real-world problem involving medical images. 
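Illustrative aside: one simple ingredient of imputation models that distinguish observed from missing data is a reconstruction loss evaluated only where voxels were observed. The sketch below uses random stand-in arrays and is not the paper's architecture.

    import numpy as np

    # Masked (sparsity-aware) reconstruction: measure error only on observed
    # voxels, and fill the missing ones from the model's output.
    rng = np.random.default_rng(0)
    image = rng.normal(size=(32, 32))
    mask = rng.random((32, 32)) < 0.3           # True where voxels are observed
    reconstruction = rng.normal(size=(32, 32))  # stand-in for a model's output

    masked_mse = ((reconstruction - image) ** 2)[mask].mean()
    imputed = np.where(mask, image, reconstruction)  # keep observed, fill missing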
The code is freely available as part of the \\verb|neuron| library at http://github.com/adalca/neuron.\n ', '\n We consider the problem of segmenting a biomedical image into anatomical regions of interest. We specifically address the frequent scenario where we have no paired training data that contains images and their manual segmentations. Instead, we employ unpaired segmentation images to build an anatomical prior. Critically these segmentations can be derived from imaging data from a different dataset and imaging modality than the current task. We introduce a generative probabilistic model that employs the learned prior through a convolutional neural network to compute segmentations in an unsupervised setting. We conducted an empirical analysis of the proposed approach in the context of structural brain MRI segmentation, using a multi-study dataset of more than 14,000 scans. Our results show that an anatomical prior can enable fast unsupervised segmentation which is typically not possible using standard convolutional networks. The integration of anatomical priors can facilitate CNN-based anatomical segmentation in a range of novel clinical problems, where few or no annotations are available and thus standard networks are not trainable. The code is freely available at http://github.com/adalca/neuron.\n ', '\n We introduce SparseVM, a method that registers clinical-quality 3D MR scans both faster and more accurately than previously possible. Deformable alignment, or registration, of clinical scans is a fundamental task for many clinical neuroscience studies. However, most registration algorithms are designed for high-resolution research-quality scans. In contrast to research-quality scans, clinical scans are often sparse, missing up to 86% of the slices available in research-quality scans. Existing methods for registering these sparse images are either inaccurate or extremely slow. We present a learning-based registration method, SparseVM, that is more accurate and orders of magnitude faster than the most accurate clinical registration methods. To our knowledge, it is the first method to use deep learning specifically tailored to registering clinical images. We demonstrate our method on a clinically-acquired MRI dataset of stroke patients and on a simulated sparse MRI dataset. Our code is available as part of the VoxelMorph package at http://voxelmorph.mit.edu/.\n ', "\n Patients who suffer an acute coronary syndrome are at elevated risk for adverse cardiovascular events such as myocardial infarction and cardiovascular death. Accurate assessment of this risk is crucial to their course of care. We focus on estimating a patient's risk of cardiovascular death after an acute coronary syndrome based on a patient's raw electrocardiogram (ECG) signal. Learning from this signal is challenging for two reasons: 1) positive examples signifying a downstream cardiovascular event are scarce, causing drastic class imbalance, and 2) each patient's ECG signal consists of thousands of heartbeats, accompanied by a single label for the downstream outcome. Machine learning has been previously applied to this task, but most approaches rely on hand-crafted features and domain knowledge. We propose a method that learns a representation from the raw ECG signal by using a multiple instance learning framework. 
We present a learned risk score for cardiovascular death that outperforms existing risk metrics in predicting cardiovascular death within 30, 60, 90, and 365 days on a dataset of 5000 patients.\n ", "\n We present VoxelMorph, a fast learning-based framework for deformable, pairwise medical image registration. Traditional registration methods optimize an objective function for each pair of images, which can be time-consuming for large datasets or rich deformation models. In contrast to this approach, and building on recent learning-based methods, we formulate registration as a function that maps an input image pair to a deformation field that aligns these images. We parameterize the function via a convolutional neural network (CNN), and optimize the parameters of the neural network on a set of images. Given a new pair of scans, VoxelMorph rapidly computes a deformation field by directly evaluating the function. In this work, we explore two different training strategies. In the first (unsupervised) setting, we train the model to maximize standard image matching objective functions that are based on the image intensities. In the second setting, we leverage auxiliary segmentations available in the training data. We demonstrate that the unsupervised model's accuracy is comparable to state-of-the-art methods, while operating orders of magnitude faster. We also show that VoxelMorph trained with auxiliary data improves registration accuracy at test time, and evaluate the effect of training set size on registration. Our method promises to speed up medical image analysis and processing pipelines, while facilitating novel directions in learning-based registration and its applications. Our code is freely available at voxelmorph.csail.mit.edu.\n ", "\n Thanks to the rapid proliferation of connected devices, sensor-generated time series constitute a large and growing portion of the world's data. Often, this data is collected from distributed, resource-constrained devices and centralized at one or more servers. A key challenge in this setup is reducing the size of the transmitted data without sacrificing its quality. Lower quality reduces the data's utility, but smaller size enables both reduced network and storage costs at the servers and reduced power consumption in sensing devices. A natural solution is to compress the data at the sensing devices. Unfortunately, existing compression algorithms either violate the memory and latency constraints common for these devices or, as we show experimentally, perform poorly on sensor-generated time series.\n We introduce a time series compression algorithm that achieves state-of-the-art compression ratios while requiring less than 1KB of memory and adding virtually no latency. This method is suitable not only for low-power devices collecting data, but also for servers storing and querying data; in the latter context, it can decompress at over 3GB/s in a single thread, even faster than many algorithms with much lower compression ratios. A key component of our method is a high-speed forecasting algorithm that can be trained online and significantly outperforms alternatives such as delta coding.\n Extensive experiments on datasets from many domains show that these results hold not only for sensor data but also across a wide array of other time series.\n ", '\n Machine learning approaches have been effective in predicting adverse outcomes in different clinical settings. These models are often developed and evaluated on datasets with heterogeneous patient populations. 
However, good predictive performance on the aggregate population does not imply good performance for specific groups.\n In this work, we present a two-step framework to 1) learn relevant patient subgroups, and 2) predict an outcome for separate patient populations in a multi-task framework, where each population is a separate task. We demonstrate how to discover relevant groups in an unsupervised way with a sequence-to-sequence autoencoder. We show that using these groups in a multi-task framework leads to better predictive performance of in-hospital mortality both across groups and overall. We also highlight the need for more granular evaluation of performance when dealing with heterogeneous populations.\n ', '\n Traditional deformable registration techniques achieve impressive results and offer a rigorous theoretical treatment, but are computationally intensive since they solve an optimization problem for each image pair. Recently, learning-based methods have facilitated fast registration by learning spatial deformation functions. However, these approaches use restricted deformation models, require supervised labels, or do not guarantee a diffeomorphic (topology-preserving) registration. Furthermore, learning-based registration tools have not been derived from a probabilistic framework that can offer uncertainty estimates. In this paper, we present a probabilistic generative model and derive an unsupervised learning-based inference algorithm that makes use of recent developments in convolutional neural networks (CNNs). We demonstrate our method on a 3D brain registration task, and provide an empirical analysis of the algorithm. Our approach results in state of the art accuracy and very fast runtimes, while providing diffeomorphic guarantees and uncertainty estimates. Our implementation is available online at http://voxelmorph.csail.mit.edu .\n ', '\n We address the computational problem of novel human pose synthesis. Given an image of a person and a desired pose, we produce a depiction of that person in that pose, retaining the appearance of both the person and background. We present a modular generative neural network that synthesizes unseen poses using training pairs of images and poses taken from human action videos. Our network separates a scene into different body part and background layers, moves body parts to new locations and refines their appearances, and composites the new foreground with a hole-filled background. These subtasks, implemented with separate modules, are trained jointly using only a single target image as a supervised label. We use an adversarial discriminator to force our network to synthesize realistic details conditioned on pose. We demonstrate image synthesis results on three action classes: golf, yoga/workouts and tennis, and show that our method produces accurate results within action classes as well as across action classes. Given a sequence of desired poses, we also produce coherent videos of actions.\n ', '\n We present a fast learning-based algorithm for deformable, pairwise 3D medical image registration. Current registration methods optimize an objective function independently for each pair of images, which can be time-consuming for large data. We define registration as a parametric function, and optimize its parameters given a set of images from a collection of interest. Given a new pair of scans, we can quickly compute a registration field by directly evaluating the function using the learned parameters. 
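Illustrative aside: the inference pattern for learning-based registration, a single forward pass mapping an image pair to a displacement field followed by a warp, can be sketched as below. Here `predict_displacement` is a dummy placeholder for a trained network and the warp is nearest-neighbor only; real models such as VoxelMorph are far more involved.

    import numpy as np

    # One forward pass yields a displacement field; no per-pair optimization.
    def predict_displacement(moving, fixed):
        return np.zeros(moving.shape + (2,))   # placeholder: identity transform

    def warp_nearest(image, disp):
        h, w = image.shape
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        src_y = np.clip(np.round(ys + disp[..., 0]).astype(int), 0, h - 1)
        src_x = np.clip(np.round(xs + disp[..., 1]).astype(int), 0, w - 1)
        return image[src_y, src_x]

    rng = np.random.default_rng(0)
    moving, fixed = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
    disp = predict_displacement(moving, fixed)
    warped = warp_nearest(moving, disp)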
We model this function using a convolutional neural network (CNN), and use a spatial transform layer to reconstruct one image from another while imposing smoothness constraints on the registration field. The proposed method does not require supervised information such as ground truth registration fields or anatomical landmarks. We demonstrate registration accuracy comparable to state-of-the-art 3D image registration, while operating orders of magnitude faster in practice. Our method promises to significantly speed up medical image analysis and processing pipelines, while facilitating novel directions in learning-based registration and its applications. Our code is available at https://github.com/balakg/voxelmorph .\n ', "\n When an infection spreads in a community, an individual's probability of becoming infected depends on both her susceptibility and exposure to the contagion through contact with others. While one often has knowledge regarding an individual's susceptibility, in many cases, whether or not an individual's contacts are contagious is unknown. We study the problem of predicting if an individual will adopt a contagion in the presence of multiple modes of infection (exposure/susceptibility) and latent neighbor influence. We present a generative probabilistic model and a variational inference method to learn the parameters of our model. Through a series of experiments on synthetic data, we measure the ability of the proposed model to identify latent spreaders, and predict the risk of infection. Applied to a real dataset of 20,000 hospital patients, we demonstrate the utility of our model in predicting the onset of a healthcare associated infection using patient room-sharing and nurse-sharing networks. Our model outperforms existing benchmarks and provides actionable insights for the design and implementation of targeted interventions to curb the spread of infection.\n ", "\n For many movement disorders, such as Parkinson's disease and ataxia, disease progression is visually assessed by a clinician using a numerical disease rating scale. These tests are subjective, time-consuming, and must be administered by a professional. This can be problematic where specialists are not available, or when a patient is not consistently evaluated by the same clinician. We present an automated method for quantifying the severity of motion impairment in patients with ataxia, using only video recordings. We consider videos of the finger-to-nose test, a common movement task used as part of the assessment of ataxia progression during the course of routine clinical checkups.\n Our method uses neural network-based pose estimation and optical flow techniques to track the motion of the patient's hand in a video recording. We extract features that describe qualities of the motion such as speed and variation in performance. Using labels provided by an expert clinician, we train a supervised learning model that predicts severity according to the Brief Ataxia Rating Scale (BARS). The performance of our system is comparable to that of a group of ataxia specialists in terms of mean error and correlation, and our system's predictions were consistently within the range of inter-rater variability. 
This work demonstrates the feasibility of using computer vision and machine learning to produce consistent and clinically useful measures of motor impairment.\n ", '\n We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy. From this we identify the current practice of values engineering: the creation of classifiers from human-created data with value-based labels. This has worked in practice for a variety of issues, but problems are addressed one at a time, and users and other stakeholders have seldom been involved. Instead, we look to AI alignment work for approaches that could learn complex values directly from stakeholders, and identify four major directions: useful measures of alignment, participatory design and operation, interactive value learning, and informed deliberative judgments.\n ', '\n Ever since social activity on the Internet began migrating from the wilds of the open web to the walled gardens erected by so-called platforms, debates have raged about the responsibilities that these platforms ought to bear. And yet, despite intense scrutiny from the news media and grassroots movements of outraged users, platforms continue to operate, from a legal standpoint, on the friendliest terms. Under the current regulatory framework, platforms simultaneously benefit from: (1) broad discretion to organize (and censor) content however they choose; (2) powerful algorithms for curating a practically limitless supply of user-posted microcontent according to whatever ends they wish; and (3) absolution from the sorts of liability born by creators of the underlying content. In this paper, we contest the very validity of the platform-creator distinction, arguing that it is ill-adapted to the modern social media landscape where, in a real sense, platforms are creating derivative media products. We argue that any coherent regulatory framework must adapt to this reality, recognizing the subtle continuum of activities that span the curation-creation spectrum, providing a finer system of categorization and clearer guidance for precisely when platforms assume the responsibilities associated with content creation.\n ', "\n AI systems often rely on two key components: a specified goal or reward function and an optimization algorithm to compute the optimal behavior for that goal. This approach is intended to provide value for a principal: the user on whose behalf the agent acts. The objectives given to these agents often refer to a partial specification of the principal's goals. We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource constrained world where the $L$ attributes of the state correspond to different sources of utility for the principal. We assume that the reward function given to the agent only has support on $J < L$ attributes. The contributions of our paper are as follows: 1) we propose a novel model of an incomplete principal-agent problem from artificial intelligence; 2) we provide necessary and sufficient conditions under which indefinitely optimizing for any incomplete proxy objective leads to arbitrarily low overall utility; and 3) we show how modifying the setup to allow reward functions that reference the full state or allowing the principal to update the proxy objective over time can lead to higher utility solutions. 
The results in this paper argue that we should view the design of reward functions as an interactive and dynamic process and identifies a theoretical scenario where some degree of interactivity is desirable.\n ", "\n We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism. In an MPAG, a single agent assists N human principals who may have widely different preferences. MPAGs generalize assistance games, also known as cooperative inverse reinforcement learning games. We analyze in particular a generalization of apprenticeship learning in which the humans first perform some work to obtain utility and demonstrate their preferences, and then the robot acts to further maximize the sum of human payoffs. We show in this setting that if the game is sufficiently collegial, i.e. if the humans are responsible for obtaining a sufficient fraction of the rewards through their own actions, then their preferences are straightforwardly revealed through their work. This revelation mechanism is non-dictatorial, does not limit the possible outcomes to two alternatives, and is dominant-strategy incentive-compatible.\n ", '\n Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the humans payoff function. This paper studies multi-principal assistance games, which cover the more general case in which the robot acts on behalf of N humans who may have widely differing payoffs. Impossibility theorems in social choice theory and voting theory can be applied to such games, suggesting that strategic behavior by the human principals may complicate the robots task in learning their payoffs. We analyze in particular a bandit apprentice game in which the humans act first to demonstrate their individual preferences for the arms and then the robot acts to maximize the sum of human payoffs. We explore the extent to which the cost of choosing suboptimal arms reduces the incentive to mislead, a form of natural mechanism design. In this context we propose a social choice method that uses shared control of a system to combine preference inference with social welfare optimization.\n ', '\n How can societies learn to enforce and comply with social norms? Here we investigate the learning dynamics and emergence of compliance and enforcement of social norms in a foraging game, implemented in a multi-agent reinforcement learning setting. In this spatiotemporally extended game, individuals are incentivized to implement complex berry-foraging policies and punish transgressions against social taboos covering specific berry types. We show that agents benefit when eating poisonous berries is taboo, meaning the behavior is punished by other agents, as this helps overcome a credit-assignment problem in discovering delayed health effects. Critically, however, we also show that introducing an additional taboo, which results in punishment for eating a harmless berry, improves the rate and stability with which agents learn to punish taboo violations and comply with taboos. Counterintuitively, our results show that an arbitrary taboo (a "silly rule") can enhance social learning dynamics and achieve better outcomes in the middle stages of learning. 
We discuss the results in the context of studying normativity as a group-level emergent phenomenon.\n ', '\n In artificial intelligence, we often specify tasks through a reward function. While this works well in some settings, many tasks are hard to specify this way. In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging. Instead, we present an interface for specifying tasks interactively using demonstrations. Our approach defines a set of increasingly complex policies. The interface allows the user to switch between these policies at fixed intervals to generate demonstrations of novel, more complex, tasks. We train new policies based on these demonstrations and repeat the process. We present a case study of our approach in the Lunar Lander domain, and show that this simple approach can quickly learn a successful landing policy and outperforms an existing comparison-based deep RL method.\n ', '\n Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space there are many directions off the manifold in which an adversary could construct adversarial examples. Adversarial examples are a natural consequence of learning a decision boundary that classifies the low-dimensional data manifold well, but classifies points near the manifold incorrectly. Using our geometric framework we prove that adversarial training is sample inefficient, and show sufficient sampling conditions under which nearest neighbor classifiers and ball-based adversarial training are robust. Finally we introduce adversarial training with Voronoi constraints, which replaces the norm ball constraint with the Voronoi cell for each point in the training set. We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10.\n ', '\n Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.\n ', '\n Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science. However, most work makes the assumption that humans are acting (noisily) optimally with respect to their preferences. 
Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot assists a human playing a bandit task to maximize cumulative reward. In this problem, the human does not know the reward function but can learn it through the rewards received from arm pulls; the robot only observes which arms the human pulls but not the reward associated with each pull. We offer sufficient and necessary conditions for successfully assisting the human in this framework. Surprisingly, better human performance in isolation does not necessarily lead to better performance when assisted by the robot: a human policy can do better by effectively communicating its observed rewards to the robot. We conduct proof-of-concept experiments that support these results. We see this work as contributing towards a theory behind algorithms for human-robot interaction.\n ', '\n Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly? In the context of HRI, part of the world to be modeled is the human. One option is for the robot to treat the human as a black box and learn a policy for how they act directly. But it can also model the human as an agent, and rely on a "theory of mind" to guide or bias the learning (grey box). We contribute a characterization of the performance of these methods for an autonomous driving task under the optimistic case of having an ideal theory of mind, as well as under different scenarios in which the assumptions behind the robot\'s theory of mind for the human are wrong, as they inevitably will be in practice.\n ', "\n People frequently face challenging decision-making problems in which outcomes are uncertain or unknown. Artificial intelligence (AI) algorithms exist that can outperform humans at learning such tasks. Thus, there is an opportunity for AI agents to assist people in learning these tasks more effectively. In this work, we use a multi-armed bandit as a controlled setting in which to explore this direction. We pair humans with a selection of agents and observe how well each human-agent team performs. We find that team performance can beat both human and agent performance in isolation. Interestingly, we also find that an agent's performance in isolation does not necessarily correlate with the human-agent team's performance. A drop in agent performance can lead to a disproportionately large drop in team performance, or in some settings can even improve team performance. Pairing a human with an agent that performs slightly better than them can make them perform much better, while pairing them with an agent that performs the same can make them perform much worse. Further, our results suggest that people have different exploration strategies and might perform better with agents that match their strategy. Overall, optimizing human-agent team performance requires going beyond optimizing agent performance, to understanding how the agent's suggestions will influence human decision-making.\n ", '\n It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws. But human laws and norms are complex and culturally varied systems; in many cases agents will have to learn the rules. This requires autonomous agents to have models of how human rule systems work so that they can make reliable predictions about rules. 
In this paper we contribute to the building of such models by analyzing an overlooked distinction between important rules and what we call silly rules--rules with no discernible direct impact on welfare. We show that silly rules render a normative system both more robust and more adaptable in response to shocks to perceived stability. They make normativity more legible for humans, and can increase legibility for AI systems as well. For AI systems to integrate into human normative systems, we suggest, it may be important for them to have models that include representations of silly rules.\n ', '\n Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space there are many directions off the manifold in which to construct adversarial examples. Adversarial examples are a natural consequence of learning a decision boundary that classifies the low-dimensional data manifold well, but classifies points near the manifold incorrectly. Using our geometric framework we prove (1) a tradeoff between robustness under different norms, (2) that adversarial training in balls around the data is sample inefficient, and (3) sufficient sampling conditions under which nearest neighbor classifiers and ball-based adversarial training are robust.\n ', '\n Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true reward. In contrast to approaches asking the designer for optimal behavior, this allows us to gather additional information by eliciting preferences between suboptimal behaviors. After each query, we need to update the posterior over the true reward function from observing the proxy reward function chosen by the designer. The recently proposed Inverse Reward Design (IRD) enables this. Our approach substantially outperforms IRD in test environments. In particular, it can query the designer about interpretable, linear reward functions and still infer non-linear ones.\n ', "\n Our goal is for AI systems to correctly identify and act according to their human user's objectives. Cooperative Inverse Reinforcement Learning (CIRL) formalizes this value alignment problem as a two-player game between a human and robot, in which only the human knows the parameters of the reward function: the robot needs to learn them as the interaction unfolds. Previous work showed that CIRL can be solved as a POMDP, but with an action space size exponential in the size of the reward parameter space. In this work, we exploit a specific property of CIRL---the human is a full information agent---to derive an optimality-preserving modification to the standard Bellman update; this reduces the complexity of the problem by an exponential factor and allows us to relax CIRL's assumption of human rationality. 
We apply this update to a variety of POMDP solvers and find that it enables us to scale CIRL to non-trivial problems, with larger reward parameter spaces, and larger action spaces for both robot and human. In solutions to these larger problems, the human exhibits pedagogic (teaching) behavior, while the robot interprets it as such and attains higher value for the human.\n ", '\n Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward needs to work across multiple different environments, and that often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating these separate reward functions as observations about the underlying true reward, we derive an approach to infer a common reward across all environments. We conduct user studies in an abstract grid world domain and in a motion planning domain for a 7-DOF manipulator that measure user effort and solution quality. We show that our method is faster, easier to use, and produces a higher quality solution than the typical method of designing a reward jointly across all environments. We additionally conduct a series of experiments that measure the sensitivity of these results to different properties of the reward design task, such as the number of environments, the number of feasible solutions per environment, and the fraction of the total features that vary within each environment. We find that independent reward design outperforms the standard, joint, reward design process but works best when the design problem can be divided into simpler subproblems.\n ', '\n We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions. We first provide an overview of the incomplete contracting literature and explore parallels between this work and the problem of AI alignment. As we emphasize, misalignment between principal and agent is a core focus of economic analysis. We highlight some technical results from the economics literature on incomplete contracts that may provide insights for AI alignment researchers. Our core contribution, however, is to bring to bear an insight that economists have been urged to absorb from legal scholars and other behavioral scientists: the fact that human contracting is supported by substantial amounts of external structure, such as generally available institutions (culture, law) that can supply implied terms to fill the gaps in incomplete contracts. We propose a research agenda for AI alignment work that focuses on the problem of how to build AI that can replicate the human cognitive processes that connect individual incomplete contracts with this supporting external structure.\n ', '\n Our goal is to enable robots to \\emph{time} their motion in a way that is purposefully expressive of their internal states, making them more transparent to people. We start by investigating what types of states motion timing is capable of expressing, focusing on robot manipulation and keeping the path constant while systematically varying the timing. We find that users naturally pick up on certain properties of the robot (like confidence), of the motion (like naturalness), or of the task (like the weight of the object that the robot is carrying). 
We then conduct a hypothesis-driven experiment to tease out the directions and magnitudes of these effects, and use our findings to develop candidate mathematical models for how users make these inferences from the timing. We find a strong correlation between the models and real user data, suggesting that robots can leverage these models to autonomously optimize the timing of their motion to be expressive.\n ', "\n Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.\n ", "\n As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match those of their human users; this is known as the value-alignment problem. In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go. We argue that a meaningful solution to value alignment must combine multi-agent decision theory with rich mathematical models of human cognition, enabling robots to tap into people's natural collaborative capabilities. We present a solution to the cooperative inverse reinforcement learning (CIRL) dynamic game based on well-established cognitive models of decision making and theory of mind. The solution captures a key reciprocity relation: the human will not plan her actions in isolation, but rather reason pedagogically about how the robot might learn from them; the robot, in turn, can anticipate this and interpret the human's actions pragmatically. To our knowledge, this work constitutes the first formal analysis of value alignment grounded in empirically validated cognitive models.\n ", "\n Intuitively, obedience -- following the order that a human gives -- seems like a good property for a robot to have. But, we humans are not perfect and we may give orders that are not best aligned to our preferences. We show that when a human is not perfectly rational then a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal order. Thus, there is a tradeoff between the obedience of a robot and the value it can attain for its owner. We investigate how this tradeoff is impacted by the way the robot infers the human's preferences, showing that some methods err more on the side of obedience than others. We then analyze how performance degrades when the robot has a misspecified model of the features that the human cares about or the level of rationality of the human. 
Finally, we study how robots can start detecting such model misspecification. Overall, our work suggests that there might be a middle ground in which robots intelligently decide when to obey human orders, but err on the side of obedience.\n ", "\n It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.\n ", "\n For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is. In contrast to classical IRL, where the human is assumed to act optimally in isolation, optimal CIRL solutions produce behaviors such as active teaching, active learning, and communicative actions that are more effective in achieving value alignment. We show that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, prove that optimality in isolation is suboptimal in CIRL, and derive an approximate CIRL algorithm.\n ", '\n We recently put forth a new fundamental lattice Hamiltonian based on an underlying picture of electrons and deuterons as elementary Dirac particles. Within this model there appears a term in which lattice vibrations are coupled to internal nuclear transitions. This is interesting as it has the potential to provide a connection between experiment and models that describe coherent energy transfer between two-level systems and an oscillator. In this work we describe a calculation of the coupling matrix element in the case of the deuteron based on the old empirical Hamada-Johnston model for the nucleon-nucleon interaction. The triplet S and D states of the deuteron in the rest frame couple to a singlet P state through this new interaction. 
The singlet P state in this calculation is a virtual state with an energy of 125 MeV, and a coupling matrix element for $z$-directed motion given by $2.98 \\times 10^{-3} ~M_J c \\hat{P}_z$.\n ', '\n Motivated by many observations of anomalies in condensed matter systems, we consider a new fundamental Hamiltonian in which condensed matter and nuclear systems are described initially on the same footing. Since it may be possible that the lattice will respond to the mass change associated with an excited nuclear state, we adopt a relativistic description throughout based on a many-particle Dirac formalism. This approach has not been used in the past, perhaps due to the difficulty in separating the center of mass and relative degrees of freedom of the nuclear system, or perhaps due to an absence of applications for such a model. In response to some recent ideas about how to think about the center of mass and relative separation, we obtained from the Dirac model a new fundamental Hamiltonian in which the lattice couples to different states within the composite nuclei within the lattice. In this description the different nuclear states have different mass energies and kinetic energies, as we had expected. In addition there appear new terms which provide for nuclear excitation as a result of coupling to the composite momentum. This new effect comes about because of changes in the composite nuclear state as a result of the dynamical Lorentz boost in the lattice.\n ', "\n We are interested in the energy-momentum relation for a moving composite in relativistic quantum mechanics in many-particle Dirac models. For a manifestly covariant model one can apply the Lorentz transform to go from the rest frame to a moving frame to establish an energy-momentum relation of the form $\\sqrt{(M^*c^2)^2+c^2|{\\bf P}|^2}$ where $M^*$ is the kinematic mass. However, the many-particle Dirac model is not manifestly covariant, and some other approach is required. We have found a simple approach that allows for a separation of relative and center of mass contributions to the energy. We are able to define the associated kinematic energy and determine the energy-momentum relation. Our result can be expressed as a modified deBroglie relation of the form\n $$ \\hbar ω({\\bf P}) = \\langle Φ' | \\sum_j {m_j \\over M} β_j | Φ' \\rangle~ \\sqrt{[M^*({\\bf P}) c^2]^2 + c^2 |{\\bf P}|^2} $$\n where the kinematic mass $M^*$ will depend on the total momentum ${\\bf P}$ for a general noncovariant potential. The prefactor that occurs we associate with a time dilation effect, the existence of which has been discussed previously in the literature.\n ", '\n Phonon exchange with nuclei in the course of fusion reactions that occur in a solid has not been analyzed previously. This problem has become of interest in connection with claims of observations of anomalies in metal deuterides. If the strong force interaction were dependent only on position (and not spin or isospin), then the coupling with phonons can be developed directly. Since a nuclear interaction can change the lattice constituents, the initial and final state lattices can be different, and we must include this in the formulation. For more realistic strong force models with spin and isospin dependence, we can use correlated nuclear wavefunctions which are made up of products of space, spin and isospin components. In this case, the spin and isospin algebra can be done analytically, producing channel-dependent potentials that are only space dependent. 
The formulation that results can be used for quantitative estimates of phonon exchange.\n ', '\n Series of short contributions that are part of Nobel Symposium 162 - Microfluidics arXiv:1712.08369.\n ', '\n We develop the first theoretical model for the analytical description of ion concentration polarization (ICP)-based electrokinetic molecular concentration, which had not been possible due to the extraordinary complexity of the system. We define the two separate limits for the enrichment factor achievable in a given system and derive the scaling laws for critical parameters, which are validated by numerical simulations and experiments. This work provides clear theoretical explanations on the diverse experimental behaviors previously observed yet unexplainable, while setting solid foundation for the engineering of ICP-based concentrators and other fluid-coupled electrokinetic systems.\n ', '\n This paper studies the mechanism of preconcentration of charged particles in a straight micro-channel embedded with permselective membranes, by numerically solving coupled transport equations of ions, charged particles and solvent fluid without any simplifying assumptions. It is demonstrated that trapping and preconcentration of charged particles are determined by the interplay between drag force from the electroosmotic fluid flow and the electrophoretic force applied through the electric field. Several insightful characteristics are revealed, including the diverse dynamics of co-ions and counter ions, replacement of co-ions by focused particles, lowered ion concentrations in particle enriched zone, and enhanced electroosmotic pumping effect etc. Conditions for particles that may be concentrated are identified in terms of charges, sizes and electrophoretic mobilities of particles and co-ions. Dependences of enrichment factor on cross-membrane voltage, initial particle concentration and buffer ion concentrations are analyzed and the underlying reasons are elaborated. Finally, a condition for the validity of the decoupled simulation model is given a posteriori, based on the charges carried by focused charged particles and those carried by buffer co-ions. These results provide important guidance in the design and optimization of nanofluidic preconcentration and other related devices.\n ', '\n Magnetometers based on quantum mechanical processes enable high sensitivity and long-term stability without the need for re-calibration, but their integration into fieldable devices remains challenging. This paper presents a CMOS quantum vector-field magnetometer that miniaturizes the conventional quantum sensing platforms using nitrogen-vacancy (NV) centers in diamond. By integrating key components for spin control and readout, the chip performs magnetometry through optically detected magnetic resonance (ODMR) through a diamond slab attached to a custom CMOS chip. The ODMR control is highly uniform across the NV centers in the diamond, which is enabled by a CMOS-generated $\\sim$2.87 GHz magnetic field with <5% inhomogeneity across a large-area current-driven wire array. The magnetometer chip is 1.5 mm$^2$ in size, prototyped in 65-nm bulk CMOS technology, and attached to a 300$\\times$80 $μ$m$^2$ diamond slab. NV fluorescence is measured by CMOS-integrated photodetectors. 
This on-chip measurement is enabled by efficient rejection of the green pump light from the red fluorescence through a CMOS-integrated spectral filter based on a combination of spectrally dependent plasmonic losses and diffractive filtering in the CMOS back-end-of-line (BEOL). This filter achieves $\\sim$25 dB of green light rejection. We measure a sensitivity of 245 nT/Hz$^{1/2}$, marking a 130$\\times$ improvement over a previous CMOS-NV sensor prototype, largely thanks to the better spectral filtering and homogeneous microwave generation over larger area.\n ', '\n The nitrogen vacancy (NV) center in diamond has emerged as a leading solid-state quantum sensor for applications including magnetometry, electrometry, thermometry, and chemical sensing. However, an outstanding challenge for practical applications is that existing NV-based sensing techniques require bulky and discrete instruments for spin control and detection. Here, we address this challenge by integrating NV based quantum sensing with complementary metal-oxide-semiconductor (CMOS) technology. Through tailored CMOS-integrated microwave generation and photodetection, this work dramatically reduces the instrumentation footprint for quantum magnetometry and thermometry. This hybrid diamond-CMOS integration enables an ultra-compact and scalable platform for quantum sensing and quantum information processing.\n ', '\n Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.\n ', '\n Quantum Neural Network (QNN) is a promising application towards quantum advantage on near-term quantum hardware. However, due to the large quantum noises (errors), the performance of QNN models has a severe degradation on real quantum devices. For example, the accuracy gap between noise-free simulation and noisy results on IBMQ-Yorktown for MNIST-4 classification is over 60%. Existing noise mitigation methods are general ones without leveraging unique characteristics of QNN and are only applicable to inference; on the other hand, existing QNN work does not consider noise effect. 
To this end, we present RoQNN, a QNN-specific framework to perform noise-aware optimizations in both training and inference stages to improve robustness. We analytically deduce and experimentally observe that the effect of quantum noise on the QNN measurement outcome is a linear map from the noise-free outcome with a scaling and a shift factor. Motivated by that, we propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve the robustness against noise, we propose noise injection to the training process by inserting quantum error gates into the QNN according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving the denoising effect. Extensive experiments on 8 classification tasks using 6 quantum devices demonstrate that RoQNN improves accuracy by up to 43%, and achieves over 94% 2-class, 80% 4-class, and 34% 10-class MNIST classification accuracy measured on real quantum computers. We also open-source our PyTorch library for construction and noise-aware training of QNN at https://github.com/mit-han-lab/pytorch-quantum .\n ', '\n We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks. Existing regularization techniques (e.g., data augmentation, dropout) have shown much success on large neural networks (e.g., ResNet50) by adding noise to overcome over-fitting. However, we found these techniques hurt the performance of tiny neural networks. We argue that training tiny models is different from training large models: rather than augmenting the data, we should augment the model, since tiny models tend to suffer from under-fitting rather than over-fitting due to limited capacity. To alleviate this issue, NetAug augments the network (reverse dropout) instead of inserting noise into the dataset or the network. It puts the tiny model into larger models and encourages it to work as a sub-model of larger models to get extra supervision, in addition to functioning as an independent model. At test time, only the tiny model is used for inference, incurring zero inference overhead. We demonstrate the effectiveness of NetAug on image classification and object detection. NetAug consistently improves the performance of tiny models, achieving up to 2.1% accuracy improvement on ImageNet, and 4.3% on Cars. On Pascal VOC, NetAug provides 2.96% mAP improvement with the same computational cost.\n ', '\n Deep learning on point clouds plays a vital role in a wide range of applications such as autonomous driving and AR/VR. These applications interact with people in real-time on edge devices and thus require low latency and low energy. Compared to projecting the point cloud to 2D space, directly processing the 3D point cloud yields higher accuracy and lower #MACs. However, the extremely sparse nature of point cloud poses challenges to hardware acceleration. For example, we need to explicitly determine the nonzero outputs and search for the nonzero neighbors (mapping operation), which is unsupported in existing accelerators. Furthermore, explicit gather and scatter of sparse features are required, resulting in large data movement overhead. In this paper, we comprehensively analyze the performance bottleneck of modern point cloud networks on CPU/GPU/TPU. To address the challenges, we then present PointAcc, a novel point cloud deep learning accelerator. 
PointAcc maps diverse mapping operations onto one versatile ranking-based kernel, streams the sparse computation with configurable caching, and temporally fuses consecutive dense layers to reduce the memory footprint. Evaluated on 8 point cloud models across 4 applications, PointAcc achieves 3.7X speedup and 22X energy savings over RTX 2080Ti GPU. Co-designed with light-weight neural networks, PointAcc rivals the prior accelerator Mesorasi by 100X speedup with 9.1% higher accuracy running segmentation on the S3DIS dataset. PointAcc paves the way for efficient point cloud recognition.\n ', '\n The explosive growth in video streaming requires video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN-based methods can achieve good performance but are computationally intensive. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. The key idea of TSM is to shift part of the channels along the temporal dimension, thus facilitate information exchanged among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. TSM offers several unique advantages. Firstly, TSM has high performance; it ranks the first on the Something-Something leaderboard upon submission. Secondly, TSM has high efficiency; it achieves a high frame rate of 74fps and 29fps for online video recognition on Jetson Nano and Galaxy Note8. Thirdly, TSM has higher scalability compared to 3D networks, enabling large-scale Kinetics training on 1,536 GPUs in 15 minutes. Lastly, TSM enables action concepts learning, which 2D networks cannot model; we visualize the category attention map and find that spatial-temporal action detector emerges during the training of classification tasks. The code is publicly available at https://github.com/mit-han-lab/temporal-shift-module.\n ', '\n Real-time end-to-end task scheduling in networked control systems (NCSs) requires the joint consideration of both network and computing resources to guarantee the desired quality of service (QoS). This paper introduces a new model for composite resource scheduling (CRS) in real-time networked control systems, which considers a strict execution order of sensing, computing, and actuating segments based on the control loop of the target NCS. We prove that the general CRS problem is NP-hard and study two special cases of the CRS problem. The first case restricts the computing and actuating segments to have unit-size execution time while the second case assumes that both sensing and actuating segments have unit-size execution time. We propose an optimal algorithm to solve the first case by checking the intervals with 100% network resource utilization and modify the deadlines of the tasks within those intervals to prune the search. For the second case, we propose another optimal algorithm based on a novel backtracking strategy to check the time intervals with the network resource utilization larger than 100% and modify the timing parameters of tasks based on these intervals. For the general case, we design a greedy strategy to modify the timing parameters of both network segments and computing segments within the time intervals that have network and computing resource utilization larger than 100%, respectively. 
The correctness and effectiveness of the proposed algorithms are verified through extensive experiments.\n ', '\n Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex that takes advantage of the low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable or even improved performance on COCO instance segmentation. When provided with the same amount of annotations, LocTex achieves around 4% higher accuracy than the previous state-of-the-art "vision+language" pre-training approach on the task of PASCAL VOC image classification.\n ', '\n The Rashba effect, i.e., the splitting of electronic spin-polarized bands in the momentum space of a crystal with broken inversion symmetry, has enabled the realization of spin-orbitronic devices, in which spins are manipulated by spin-orbit coupling. In optics, where the helicity of light polarization represents the spin degree of freedom for spin-momentum coupling, the optical Rashba effect is manifested by the splitting of optical states with opposite chirality in the momentum space. Previous realizations of the optical Rashba effect relied on passive devices determining either the propagation direction of surface plasmons or circularly polarized light into nanostructures, or the directional emission of polarized luminescence from metamaterials hybridized with light-emitting media. Here we demonstrate an active device underpinned by the optical Rashba effect, in which a monolithic halide perovskite metasurface emits highly directional chiral photoluminescence. An all-dielectric metasurface design with broken in-plane inversion symmetry is directly embossed into the high refractive index, light-emitting perovskite film, yielding a degree of circular polarization of photoluminescence of 40% at room temperature - more than one order of magnitude greater than in state of art chiral perovskites.\n ', '\n Quantum noise is the key challenge in Noisy Intermediate-Scale Quantum (NISQ) computers. Previous work for mitigating noise has primarily focused on gate-level or pulse-level noise-adaptive compilation. However, limited research efforts have explored a higher level of optimization by making the quantum circuits themselves resilient to noise.\n We propose QuantumNAS, a comprehensive framework for noise-adaptive co-search of the variational circuit and qubit mapping. Variational quantum circuits are a promising approach for constructing QML and quantum simulation. However, finding the best variational circuit and its optimal parameters is challenging due to the large design space and parameter training cost. We propose to decouple the circuit search and parameter training by introducing a novel SuperCircuit. 
The SuperCircuit is constructed with multiple layers of pre-defined parameterized gates and trained by iteratively sampling and updating the parameter subsets (SubCircuits) of it. It provides an accurate estimation of the performance of SubCircuits trained from scratch. Then we perform an evolutionary co-search of SubCircuit and its qubit mapping. The SubCircuit performance is estimated with parameters inherited from SuperCircuit and simulated with real device noise models. Finally, we perform iterative gate pruning and finetuning to remove redundant gates.\n Extensively evaluated with 12 QML and VQE benchmarks on 10 quantum computers, QuantumNAS significantly outperforms baselines. For QML, QuantumNAS is the first to demonstrate over 95% 2-class, 85% 4-class, and 32% 10-class classification accuracy on real QC. It also achieves the lowest eigenvalue for VQE tasks on H2, H2O, LiH, CH4, BeH2 compared with UCCSD. We also open-source QuantumEngine (https://github.com/mit-han-lab/pytorch-quantum) for fast training of parameterized quantum circuits to facilitate future research.\n ', '\n The topological band theory predicts that bulk materials with nontrivial topological phases support topological edge states. This phenomenon is universal for various wave systems and has been widely observed for electromagnetic and acoustic waves. Here, we extend the notion of band topology from wave to diffusion dynamics. Unlike the wave systems that are usually Hermitian, the diffusion systems are anti-Hermitian with purely imaginary eigenvalues corresponding to decay rates. Via direct probe of the temperature diffusion, we experimentally retrieve the Hamiltonian of a thermal lattice, and observe the emergence of topological edge decays within the gap of bulk decays. Our results show that such edge states exhibit robust decay rates, which are topologically protected against disorders. This work constitutes a thermal analogue of topological insulators and paves the way to exploring defect-immune heat dissipation.\n ', '\n Data-driven, automatic design space exploration of neural accelerator architecture is desirable for specialization and productivity. Previous frameworks focus on sizing the numerical architectural hyper-parameters while neglecting to search the PE connectivities and compiler mappings. To tackle this challenge, we propose Neural Accelerator Architecture Search (NAAS) which holistically searches the neural network architecture, accelerator architecture, and compiler mapping in one optimization loop. NAAS composes highly matched architectures together with efficient mapping. As a data-driven approach, NAAS rivals the human design Eyeriss by 4.4x EDP reduction with 2.7% accuracy improvement on ImageNet under the same computation resource, and offers 1.4x to 3.5x EDP reduction compared with only sizing the architectural hyper-parameters.\n ', "\n Deep learning has been used to demonstrate end-to-end neural network learning for autonomous vehicle control from raw sensory input. While LiDAR sensors provide reliably accurate information, existing end-to-end driving solutions are mainly based on cameras since processing 3D data requires a large memory footprint and computation cost. On the other hand, increasing the robustness of these systems is also critical; however, even estimating the model's uncertainty is very challenging due to the cost of sampling-based methods. In this paper, we present an efficient and robust LiDAR-based end-to-end navigation framework. 
We first introduce Fast-LiDARNet that is based on sparse convolution kernel optimization and hardware-aware model design. We then propose Hybrid Evidential Fusion that directly estimates the uncertainty of the prediction from only a single forward pass and then fuses the control predictions intelligently. We evaluate our system on a full-scale vehicle and demonstrate lane-stable as well as navigation capabilities. In the presence of out-of-distribution events (e.g., sensor failures), our system significantly improves robustness and reduces the number of takeovers in the real world.\n ", '\n Object recognition is a fundamental problem in many video processing tasks, accurately locating seen objects at low computation cost paves the way for on-device video recognition. We propose PatchNet, an efficient convolutional neural network to match objects in adjacent video frames. It learns the patchwise correlation features instead of pixel features. PatchNet is very compact, running at just 58MFLOPs, $5\\times$ simpler than MobileNetV2. We demonstrate its application on two tasks, video object detection and visual object tracking. On ImageNet VID, PatchNet reduces the flops of R-FCN ResNet-101 by 5x and EfficientDet-D0 by 3.4x with less than 1% mAP loss. On OTB2015, PatchNet reduces SiamFC and SiamRPN by 2.5x with no accuracy loss. Experiments on Jetson Nano further demonstrate 2.8x to 4.3x speed-ups associated with flops reduction. Code is open sourced at https://github.com/RalphMao/PatchNet.\n ', '\n Generative adversarial networks (GANs) have enabled photorealistic image synthesis and editing. However, due to the high computational cost of large-scale generators (e.g., StyleGAN2), it usually takes seconds to see the results of a single edit on edge devices, prohibiting interactive user experience. In this paper, we take inspirations from modern rendering software and propose Anycost GAN for interactive natural image editing. We train the Anycost GAN to support elastic resolutions and channels for faster image generation at versatile speeds. Running subsets of the full generator produce outputs that are perceptually similar to the full generator, making them a good proxy for preview. By using sampling-based multi-resolution training, adaptive-channel training, and a generator-conditioned discriminator, the anycost generator can be evaluated at various configurations while achieving better image quality compared to separately trained models. Furthermore, we develop new encoder training and latent code optimization techniques to encourage consistency between the different sub-generators during image projection. Anycost GAN can be executed at various cost budgets (up to 10x computation reduction) and adapt to a wide range of hardware and latency requirements. When deployed on desktop CPUs and edge devices, our model can provide perceptually similar previews at 6-12x speedup, enabling interactive image editing. The code and demo are publicly available: https://github.com/mit-han-lab/anycost-gan.\n ', '\n The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance than convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. 
In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction.\n Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, 347x speedup, and 1.4x, 3.2x, 1193x, 4059x energy savings over A3 accelerator, MNNFast accelerator, TITAN Xp GPU, Xeon CPU, respectively.\n ', "\n To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization. However, a single operator can no longer fully utilize the available parallelism given the rapid advances in high-performance hardware, resulting in a large gap between the peak performance and the real performance. This performance gap is more severe under smaller batch sizes. In this work, we extensively study the parallelism between operators and propose Inter-Operator Scheduler (IOS) to automatically schedule multiple operators' parallel execution through a novel dynamic programming algorithm. IOS consistently outperforms state-of-the-art libraries (e.g., TensorRT) by 1.1 to 1.5x on modern CNN benchmarks. The code to reproduce each experiment is available at: https://github.com/mit-han-lab/inter-operator-scheduler.\n ", "\n Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework which leverages reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization. 
Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.\n ", '\n Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve the fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard. It also achieves 8x computation reduction and 3x measured speedup over MinkowskiNet with higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI.\n ', "\n On-device learning enables edge devices to continually adapt the AI models to new data, which requires a small memory footprint to fit the tight memory constraint of edge devices. Existing work solves this problem by reducing the number of trainable parameters. However, this doesn't directly translate to memory saving since the major bottleneck is the activations, not parameters. In this work, we present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning. TinyTL freezes the weights and learns only the bias modules, so there is no need to store the intermediate activations. To maintain the adaptation capacity, we introduce a new memory-efficient bias module, the lite residual module, to refine the feature extractor by learning small residual feature maps, adding only 3.8% memory overhead. Extensive experiments show that TinyTL significantly saves the memory (up to 6.5x) with little accuracy loss compared to fine-tuning the full network. Compared to fine-tuning the last layer, TinyTL provides significant accuracy improvements (up to 34.1%) with little memory overhead. Furthermore, combined with feature extractor adaptation, TinyTL provides 7.3-12.9x memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3.\n ", '\n Machine learning on tiny IoT devices based on microcontroller units (MCU) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than that of mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. 
TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e., device, latency, energy, memory) under low search costs. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library to expand the search space and fit a larger model. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 4.8x, and accelerating the inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.5x less SRAM and 5.7x less Flash compared to quantized MobileNetV2 and ResNet-18. On visual & audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than MobileNetV2 and ProxylessNAS-based solutions with 3.7-4.1x smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived. Code and models can be found here: https://tinyml.mit.edu.\n ', '\n The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128x128 and 2-4x reductions of FID given 1,000 images on FFHQ and LSUN. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available at https://github.com/mit-han-lab/data-efficient-gans.\n ', '\n We present APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner. To deal with the larger design space it brings, a promising approach is to train a quantization-aware accuracy predictor to quickly get the accuracy of the quantized model and feed it to the search engine to select the best fit. However, training this quantization-aware accuracy predictor requires collecting a large number of quantized <model, accuracy> pairs, which involves quantization-aware finetuning and thus is highly time-consuming. To tackle this challenge, we propose to transfer the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor, which greatly improves the sample efficiency.
Besides, collecting the dataset for the fp32 accuracy predictor only requires to evaluate neural networks without any training cost by sampling from a pretrained once-for-all network, which is highly efficient. Extensive experiments on ImageNet demonstrate the benefits of our joint optimization approach. With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ. Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ), APQ achieves 2.3% higher ImageNet accuracy while reducing orders of magnitude GPU hours and CO2 emission, pushing the frontier for green AI that is environmental-friendly. The code and video are publicly available.\n ', "\n Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to be deployed on hardware due to the intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with $\\textit{arbitrary encoder-decoder attention}$ and $\\textit{heterogeneous layers}$. Then we train a $\\textit{SuperTransformer}$ that covers all candidates in the design space, and efficiently produces many $\\textit{SubTransformers}$ with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized $\\textit{SubTransformer}$ dedicated to run fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). When running WMT'14 translation task on Raspberry Pi-4, HAT can achieve $\\textbf{3}\\times$ speedup, $\\textbf{3.7}\\times$ smaller size over baseline Transformer; $\\textbf{2.7}\\times$ speedup, $\\textbf{3.6}\\times$ smaller size over Evolved Transformer with $\\textbf{12,041}\\times$ less search cost and no performance loss. HAT code is https://github.com/mit-han-lab/hardware-aware-transformers.git\n ", '\n It is important to design compact language models for efficient deployment. We improve upon recent advances in both the language modeling domain and the model-compression domain to construct parameter and computation efficient language models. We use an efficient transformer-based architecture with adaptive embedding and softmax, differentiable non-parametric cache, Hebbian softmax, knowledge distillation, network pruning, and low-bit quantization. In this paper, we provide the winning solution to the NeurIPS 2019 MicroNet Challenge in the language modeling track. Compared to the baseline language model provided by the MicroNet Challenge, our model is 90 times more parameter-efficient and 36 times more computation-efficient while achieving the required test perplexity of 35 on the Wikitext-103 dataset. We hope that this work will aid future research into efficient language models, and we have released our full source code at https://github.com/mit-han-lab/neurips-micronet.\n ', '\n Automatic transistor sizing is a challenging problem in circuit design due to the large design space, complex performance trade-offs, and fast technological advancements. Although there has been plenty of work on transistor sizing targeting on one circuit, limited research has been done on transferring the knowledge from one circuit to another to reduce the re-design overhead. 
In this paper, we present GCN-RL Circuit Designer, leveraging reinforcement learning (RL) to transfer the knowledge between different technology nodes and topologies. Moreover, inspired by the simple fact that circuit is a graph, we learn on the circuit topology representation with graph convolutional neural networks (GCN). The GCN-RL agent extracts features of the topology graph whose vertices are transistors, edges are wires. Our learning-based optimization consistently achieves the highest Figures of Merit (FoM) on four different circuits compared with conventional black-box optimization methods (Bayesian Optimization, Evolutionary Algorithms), random search, and human expert designs. Experiments on transfer learning between five technology nodes and two circuit topologies demonstrate that RL with transfer learning can achieve much higher FoMs than methods without knowledge transfer. Our transferable optimization method makes transistor sizing and design porting more effective and efficient.\n ', "\n Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires enormous amount of computations to achieve high performance, which makes it not suitable for mobile applications that are tightly constrained by the hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.\n ", '\n In this paper, we study the problem of fast constructions of source-wise round-trip spanners in weighted directed graphs. For a source vertex set $S\\subseteq V$ in a graph $G(V,E)$, an $S$-sourcewise round-trip spanner of $G$ of stretch $k$ is a subgraph $H$ of $G$ such that for every pair of vertices $u,v\\in S\\times V$, their round-trip distance in $H$ is at most $k$ times of their round-trip distance in $G$. We show that for a graph $G(V,E)$ with $n$ vertices and $m$ edges, an $s$-sized source vertex set $S\\subseteq V$ and an integer $k>1$, there exists an algorithm that in time $O(ms^{1/k}\\log^5n)$ constructs an $S$-sourcewise round-trip spanner of stretch $O(k\\log n)$ and $O(ns^{1/k}\\log^2n)$ edges with high probability. 
Compared to the fast algorithms for constructing all-pairs round-trip spanners \\cite{PRS+18,CLR+20}, our algorithm improves the running time and the number of edges in the spanner when $k$ is super-constant. Compared with the existing algorithm for constructing source-wise round-trip spanners \\cite{ZL17}, our algorithm significantly improves their construction time $\\Omega(\\min\\{ms,n^\\omega\\})$ (where $\\omega\\in [2,2.373)$ and 2.373 is the matrix multiplication exponent) to nearly linear $O(ms^{1/k}\\log^5n)$, at the expense of paying an extra $O(\\log n)$ in the stretch. As an important building block of the algorithm, we develop a graph partitioning algorithm to partition $G$ into clusters of bounded radius and prove that for every $u,v\\in S\\times V$ at small round-trip distance, the probability of separating them in different clusters is small. The algorithm takes the size of $S$ as input and does not need the knowledge of $S$. With the algorithm and a reachability vertex size estimation algorithm, we show that the recursive algorithm for constructing standard round-trip spanners \\cite{PRS+18} can be adapted to the source-wise setting.\n ', '\n Conditional Generative Adversarial Networks (cGANs) have enabled controllable image synthesis for many computer vision and graphics applications. However, recent cGANs are 1-2 orders of magnitude more computationally-intensive than modern recognition CNNs. For example, GauGAN consumes 281G MACs per image, compared to 0.44G MACs for MobileNet-v3, making it difficult for interactive deployment. In this work, we propose a general-purpose compression framework for reducing the inference time and model size of the generator in cGANs. Directly applying existing CNN compression methods yields poor performance due to the difficulty of GAN training and the differences in generator architectures. We address these challenges in two ways. First, to stabilize the GAN training, we transfer knowledge of multiple intermediate representations of the original model to its compressed model, and unify unpaired and paired learning. Second, instead of reusing existing CNN designs, our method automatically finds efficient architectures via neural architecture search (NAS). To accelerate the search process, we decouple the model training and architecture search via weight sharing. Experiments demonstrate the effectiveness of our method across different supervision settings (paired and unpaired), model architectures, and learning methods (e.g., pix2pix, GauGAN, CycleGAN). Without losing image quality, we reduce the computation of CycleGAN by more than 20x and GauGAN by 9x, paving the way for interactive image synthesis. The code and demo are publicly available.\n ', '\n Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner product based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer product based approach suffers from poor output locality due to numerous partial product matrices. Inefficiency in the reuse of either input or output data leads to extensive and expensive DRAM access.\n To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices.
We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stage of partial matrices so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces the DRAM access by another 1.8x. We also resolve the increased input matrix read induced by the new representation using a row prefetcher with near-optimal buffer replacement policy, further reducing the DRAM access by 1.5x. Evaluated on 20 benchmarks, SpArch reduces the total DRAM access by 2.8x over previous state-of-the-art. On average, SpArch achieves 4x, 19x, 18x, 17x, 1285x speedup and 6x, 164x, 435x, 307x, 62x energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.\n ', '\n Deep video recognition is more computationally expensive than image recognition, especially on large-scale datasets like Kinetics [1]. Therefore, training scalability is essential to handle a large amount of videos. In this paper, we study the factors that impact the training scalability of video networks. We recognize three bottlenecks, including data loading (data movement from disk to GPU), communication (data movement over networking), and computation FLOPs. We propose three design guidelines to improve the scalability: (1) fewer FLOPs and hardware-friendly operator to increase the computation efficiency; (2) fewer input frames to reduce the data movement and increase the data loading efficiency; (3) smaller model size to reduce the networking traffic and increase the networking efficiency. With these guidelines, we designed a new operator Temporal Shift Module (TSM) that is efficient and scalable for distributed training. The TSM model can achieve 1.8x higher throughput compared to previous I3D models. We scale up the training of the TSM model to 1,536 GPUs, with a mini-batch of 12,288 video clips/98,304 images, without losing the accuracy. With such hardware-aware model design, we are able to scale up the training on Summit supercomputer and reduce the training time on Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6x and 2.9x faster than previous 3D video models with higher accuracy. The code and more details can be found here: http://tsm-hanlab.mit.edu.\n ', '\n The fast developing Industrial Internet of Things (IIoT) technologies provide a promising opportunity to build large-scale systems to connect numerous heterogeneous devices into the Internet. Most existing IIoT infrastructures are based on a centralized architecture, which is easier for management but cannot effectively support immutable and verifiable services among multiple parties. Blockchain technology provides many desired features for large-scale IIoT infrastructures, such as decentralization, trustworthiness, trackability, and immutability. This paper presents a blockchain-based IIoT architecture to support immutable and verifiable services. However, when applying blockchain technology to the IIoT infrastructure, the required storage space poses a grand challenge to resource-constrained IIoT infrastructures. To address the storage issue, this paper proposes a hierarchical blockchain storage structure, \\textit{ChainSplitter}.
Specifically, the proposed architecture features a hierarchical storage structure where the majority of the blockchain is stored in the clouds, while the most recent blocks are stored in the overlay network of the individual IIoT networks. The proposed architecture seamlessly binds local IIoT networks, the blockchain overlay network, and the cloud infrastructure together through two connectors, the \\textit{blockchain connector} and the \\textit{cloud connector}, to construct the hierarchical blockchain storage. The blockchain connector in the overlay network builds blocks in blockchain from data generated in IIoT networks, and the cloud connector resolves the blockchain synchronization issues between the overlay network and the clouds. We also provide a case study to show the efficiency of the proposed hierarchical blockchain storage in a practical Industrial IoT case.\n ', "\n We address the challenging problem of efficient inference across many devices and resource constraints, especially on edge devices. Conventional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive (causing $CO_2$ emission as much as 5 cars' lifetime) thus unscalable. In this work, we propose to train a once-for-all (OFA) network that supports diverse architectural settings by decoupling training and search, to reduce the cost. We can quickly get a specialized sub-network by selecting from the OFA network without additional training. To efficiently train OFA networks, we also propose a novel progressive shrinking algorithm, a generalized pruning method that reduces the model size across many more dimensions than pruning (depth, width, kernel size, and resolution). It can obtain a surprisingly large number of sub-networks ($> 10^{19}$) that can fit different hardware platforms and latency constraints while maintaining the same level of accuracy as training independently. On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top-1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t. measured latency) while reducing many orders of magnitude GPU hours and $CO_2$ emission. In particular, OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting ($<$600M MACs). OFA is the winning solution for the 3rd Low Power Computer Vision Challenge (LPCVC), DSP classification track and the 4th LPCVC, both classification track and detection track. Code and 50 pre-trained models (for many devices & many latency constraints) are released at https://github.com/mit-han-lab/once-for-all.\n ", '\n We present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprints of the voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on structuring the sparse data which have rather poor memory locality, not on the actual feature extraction. In this paper, we propose PVCNN that represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve the locality.
Our PVCNN model is both memory and computation efficient. Evaluated on semantic and part segmentation datasets, it achieves much higher accuracy than the voxel-based baseline with 10x GPU memory reduction; it also outperforms the state-of-the-art point-based models with 7x measured speedup on average. Remarkably, the narrower version of PVCNN achieves 2x speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by 2.4% mAP on average with 1.5x measured speedup and GPU memory reduction.\n ', "\n Exchanging gradients is a widely used method in modern multi-node machine learning system (e.g., distributed training, collaborative learning). For a long time, people believed that gradients are safe to share: i.e., the training data will not be leaked by gradient exchange. However, we show that it is possible to obtain the private training data from the publicly shared gradients. We name this leakage as Deep Leakage from Gradient and empirically validate the effectiveness on both computer vision and natural language processing tasks. Experimental results show that our attack is much stronger than previous approaches: the recovery is pixel-wise accurate for images and token-wise matching for texts. We want to raise people's awareness to rethink the gradient's safety. Finally, we discuss several possible strategies to prevent such deep leakage. The most effective defense method is gradient pruning.\n ", "\n Efficient deep learning computing requires algorithm and hardware co-design to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the algorithm makes the design space much larger: it's not only about designing the hardware but also about how to tweak the algorithm to best fit the hardware. Human engineers can hardly exhaust the design space by heuristics. It's labor consuming and sub-optimal. We propose design automation techniques for efficient neural networks. We investigate automatically designing specialized fast models, auto channel pruning, and auto mixed-precision quantization. We demonstrate such learning-based, automated design achieves superior performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200x than previous work, so that we can afford to design specialized neural network models for different hardware platforms.\n ", "\n Neural network quantization is becoming an industry standard to efficiently deploy deep learning models on hardware platforms, such as CPU, GPU, TPU, and FPGAs. However, we observe that the conventional quantization approaches are vulnerable to adversarial attacks. This paper aims to raise people's awareness about the security of the quantized models, and we designed a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models. We first conduct an empirical study to show that vanilla quantization suffers more from adversarial attacks. We observe that the inferior robustness comes from the error amplification effect, where the quantization operation further enlarges the distance caused by amplified noise.
Then we propose a novel Defensive Quantization (DQ) method by controlling the Lipschitz constant of the network during quantization, such that the magnitude of the adversarial noise remains non-expansive during inference. Extensive experiments on CIFAR-10 and SVHN datasets demonstrate that our new quantization method can defend neural networks against adversarial examples, and even achieves superior robustness than their full-precision counterparts while maintaining the same hardware efficiency as vanilla quantization approaches. As a by-product, DQ can also improve the accuracy of quantized models without adversarial attack.\n ", '\n Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.\n ', '\n Along with the rapid growth of Industrial Internet-of-Things (IIoT) applications and their penetration into many industry sectors, real-time wireless networks (RTWNs) have been playing a more critical role in providing real-time, reliable and secure communication services for such applications. A key challenge in RTWN management is how to ensure real-time Quality of Services (QoS) especially in the presence of unexpected disturbances and lossy wireless links. Most prior work takes centralized approaches for handling disturbances, which are slow and subject to single-point failure, and do not scale. To overcome these drawbacks, this paper presents a fully distributed packet scheduling framework called FD-PaS. FD-PaS aims to provide guaranteed fast response to unexpected disturbances while achieving minimum performance degradation for meeting the timing and reliability requirements of all critical tasks. To combat the scalability challenge, FD-PaS incorporates several key advances in both algorithm design and data link layer protocol design to enable individual nodes to make on-line decisions locally without any centralized control. Our extensive simulation and testbed results have validated the correctness of the FD-PaS design and demonstrated its effectiveness in providing fast response for handling disturbances while ensuring the designated QoS requirements.\n ', '\n Analog IC design relies on human experts to search for parameters that satisfy circuit specifications with their experience and intuitions, which is highly labor intensive, time consuming and suboptimal. Machine learning is a promising tool to automate this process. However, supervised learning is difficult for this task due to the low availability of training data: 1) Circuit simulation is slow, thus generating a large-scale dataset is time-consuming; 2) Most circuit designs are proprietary IPs within individual IC companies, making it expensive to collect large-scale datasets.
We propose Learning to Design Circuits (L2DC) to leverage reinforcement learning that learns to efficiently generate new circuits data and to optimize circuits. We fix the schematic, and optimize the parameters of the transistors automatically by training an RL agent with no prior knowledge about optimizing circuits. After iteratively getting observations, generating a new set of transistor parameters, getting a reward, and adjusting the model, L2DC is able to optimize circuits. We evaluate L2DC on two transimpedance amplifiers. Trained for a day, our RL agent can achieve comparable or better performance than human experts trained for a quarter. It first learns to meet hard constraints (e.g., gain, bandwidth), and then learns to optimize good-to-have targets (e.g., area, power). Compared with grid search-aided human design, L2DC can achieve $\\mathbf{250}\\boldsymbol{\\times}$ higher sample efficiency with comparable performance. Under the same runtime constraint, the performance of L2DC is also better than Bayesian Optimization.\n ', '\n Neural architecture search (NAS) has a great impact by automatically designing effective neural network architectures. However, the prohibitive computational demand of conventional NAS algorithms (e.g. $10^4$ GPU hours) makes it difficult to \\emph{directly} search the architectures on large-scale tasks (e.g. ImageNet). Differentiable NAS can reduce the cost of GPU hours via a continuous representation of network architecture but suffers from the high GPU memory consumption issue (grow linearly w.r.t. candidate set size). As a result, they need to utilize~\\emph{proxy} tasks, such as training on a smaller dataset, or learning with only a few blocks, or training just for a few epochs. These architectures optimized on proxy tasks are not guaranteed to be optimal on the target task. In this paper, we present \\emph{ProxylessNAS} that can \\emph{directly} learn the architectures for large-scale target tasks and target hardware platforms. We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level of regular training while still allowing a large candidate set. Experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of directness and specialization. On CIFAR-10, our model achieves 2.08\\% test error with only 5.7M parameters, better than the previous state-of-the-art architecture AmoebaNet-B, while using 6$\\times$ fewer parameters. On ImageNet, our model achieves 3.1\\% better top-1 accuracy than MobileNetV2, while being 1.2$\\times$ faster with measured GPU latency. We also apply ProxylessNAS to specialize neural architectures for hardware with direct hardware metrics (e.g. latency) and provide insights for efficient CNN architecture design.\n ', "\n The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making it expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNN but maintain 2D CNN's complexity. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extended TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranks first place on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves a low latency of 13ms and 35ms for online video recognition. The code is available at: https://github.com/mit-han-lab/temporal-shift-module.\n ", '\n We consider the problem of clustering graph nodes over large-scale dynamic graphs, such as citation networks, images and web networks, when graph updates such as node/edge insertions/deletions are observed distributively. We propose communication-efficient algorithms for two well-established communication models, namely the message passing and the blackboard models. Given a graph with $n$ nodes that is observed at $s$ remote sites over time $[1,t]$, the two proposed algorithms have communication costs $\\tilde{O}(ns)$ and $\\tilde{O}(n+s)$ ($\\tilde{O}$ hides a polylogarithmic factor), almost matching their lower bounds, $\\Omega(ns)$ and $\\Omega(n+s)$, respectively, in the message passing and the blackboard models. More importantly, we prove that at each time point in $[1,t]$ our algorithms generate clustering quality nearly as good as that of centralizing all updates up to that time and then applying a standard centralized clustering algorithm.
We conducted extensive experiments on both synthetic and real-life datasets which confirmed the communication efficiency of our approach over baseline algorithms while achieving comparable clustering results.\n ', '\n Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.\n ', '\n We introduce a new function-preserving transformation for efficient neural architecture search. This network transformation allows reusing previously trained networks and existing successful architectures that improves sample efficiency. We aim to address the limitation of current network transformation operations that can only perform layer-level architecture modifications, such as adding (pruning) filters or inserting (removing) a layer, which fails to change the topology of connection paths. Our proposed path-level transformation operations enable the meta-controller to modify the path topology of the given network while keeping the merits of reusing weights, and thus allow efficiently designing effective structures with complex path topologies like Inception models. We further propose a bidirectional tree-structured reinforcement learning meta-controller to explore a simple yet highly expressive tree-structured architecture space that can be viewed as a generalization of multi-branch architectures. 
We experimented on the image classification datasets with limited computational resources (about 200 GPU-hours), where we observed improved parameter efficiency and better test results (97.70% test accuracy on CIFAR-10 with 14.3M parameters and 74.6% top-1 accuracy on ImageNet in the mobile setting), demonstrating the effectiveness and transferability of our designed architectures.\n ', '\n Recent results at the Large Hadron Collider (LHC) have pointed to enhanced physics capabilities through the improvement of the real-time event processing techniques. Machine learning methods are ubiquitous and have proven to be very powerful in LHC physics, and particle physics as a whole. However, exploration of the use of such techniques in low-latency, low-power FPGA hardware has only just begun. FPGA-based trigger and data acquisition (DAQ) systems have extremely low, sub-microsecond latency requirements that are unique to particle physics. We present a case study for neural network inference in FPGAs focusing on a classifier for jet substructure which would enable, among many other physics scenarios, searches for new dark sector particles and novel measurements of the Higgs boson. While we focus on a specific example, the lessons are far-reaching. We develop a package based on High-Level Synthesis (HLS) called hls4ml to build machine learning models in FPGAs. The use of HLS increases accessibility across a broad user community and allows for a drastic decrease in firmware development time. We map out FPGA resource usage and latency versus neural network hyperparameters to identify the problems in particle physics that would benefit from performing neural network inference with FPGAs. For our example jet substructure model, we fit well within the available resources of modern FPGAs with a latency on the scale of 100 ns.\n ', '\n Recently emerged dielectric resonators and metasurfaces offer a low-loss platform for efficient manipulation of electromagnetic waves from microwave to visible. Such flat meta-optics can focus electromagnetic waves, generate structured beams and vortices, enhance local fields for sensing as well as provide additional functionalities for advanced MRI machinery. Recent advances are associated with exotic optical modes called bound states in the continuum, which can give rise to extremely large quality factors and supercavity lasing. Here, we experimentally demonstrate subwavelength active supercavities with extremely high-Q resonances that could be reconfigured at an ultrafast time scale. We reveal that such supercavities enable all-optical switching and modulation of extremely sharp resonances, and thus could have numerous applications in lasing, mode multiplexing, and biosensing.\n ', '\n In most process control systems nowadays, process measurements are periodically collected and archived in historians. Analytics applications process the data, and provide results offline or in a time period that is considerably slow in comparison to the performance of the manufacturing process. Along with the proliferation of Internet-of-Things (IoT) and the introduction of "pervasive sensors" technology in process industries, increasing number of sensors and actuators are installed in process plants for pervasive sensing and control, and the volume of produced process data is growing exponentially. 
To digest these data and meet the ever-growing requirements to increase production efficiency and improve product quality, there needs to be a way to both improve the performance of the analytics system and scale the system to closely monitor a much larger set of plant resources. In this paper, we present a real-time data analytics platform, called RT-DAP, to support large-scale continuous data analytics in process industries. RT-DAP is designed to be able to stream, store, process and visualize a large volume of realtime data flows collected from heterogeneous plant resources, and feedback to the control system and operators in a realtime manner. A prototype of the platform is implemented on Microsoft Azure. Our extensive experiments validate the design methodologies of RT-DAP and demonstrate its efficiency in both component and system levels.\n ', "\n Convolutional Neural Networks (CNNs) are computationally intensive, which limits their application on mobile devices. Their energy is dominated by the number of multiplies needed to perform the convolutions. Winograd's minimal filtering algorithm (Lavin, 2015) and network pruning (Han et al., 2015) can reduce the operation count, but these two methods cannot be directly combined $-$ applying the Winograd transform fills in the sparsity in both the weights and the activations. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations. Second, we prune the weights in the Winograd domain to exploit static weight sparsity. For models on CIFAR-10, CIFAR-100 and ImageNet datasets, our method reduces the number of multiplications by $10.4\\times$, $6.8\\times$ and $10.8\\times$ respectively with loss of accuracy less than $0.1\\%$, outperforming previous baselines by $2.0\\times$-$3.0\\times$. We also show that moving ReLU to the Winograd domain allows more aggressive pruning.\n ", '\n The interaction between microscopic particles has always been a fascinating and intriguing area of science. Direct interrogation of such interactions is often difficult or impossible. Structured electromagnetic systems offer a rich toolkit for mimicking and reproducing the key dynamics that governs the microscopic interactions, and thus provide an avenue to explore and interpret the microscopic phenomena. In particular, metamaterials offer the freedom to artificially tailor light-matter coupling and to control the interaction between unit cells in the metamaterial array. Here we demonstrate a terahertz metamaterial that mimics spin-related interactions of microscopic particles in a 2D lattice via complex electromagnetic multipole interactions within the metamaterial array. Fano resonances featured by distinct mode properties due to strong nearest-neighbor interactions are discussed that draw parallels with the 2D Ising model. Interestingly, a hyperfine Fano splitting spectrum is observed by manipulating the 2D interactions without applying external magnetic or electric fields, which provides a passive multispectral platform for applications in super-resolution imaging, biosensing, and selective thermal emission. 
The dynamic approach to reproduce the static interaction between microscopic particles would enable more profound significance in exploring the unknown physical world by the macroscopic analogues.\n ', '\n We aid in neurocognitive monitoring outside the hospital environment by enabling app-based measurements of visual reaction time (saccade latency) and error rate in a cohort of subjects spanning the adult age spectrum. Methods: We developed an iOS app to record subjects with the frontal camera during pro- and anti-saccade tasks. We further developed automated algorithms for measuring saccade latency and error rate that take into account the possibility that it might not always be possible to determine the eye movement from app-based recordings. Results: To measure saccade latency on a tablet, we ensured that the absolute timing error between on-screen task presentation and the camera recording is within 5 ms. We collected over 235,000 eye movements in 80 subjects ranging in age from 20 to 92 years, with 96% of recorded eye movements either declared good or directional errors. Our error detection code achieved a sensitivity of 0.97 and a specificity of 0.97. Confirming prior reports, we observed a positive correlation between saccade latency and age while the relationship between error rate and age was not significant. Finally, we observed significant intra- and inter-subject variations in saccade latency and error rate distributions, which highlights the importance of individualized tracking of these visual digital biomarkers. Conclusion and Significance: Our system and algorithms allow ubiquitous tracking of saccade latency and error rate, which opens up the possibility of quantifying patient state on a finer timescale in a broader population than previously possible.\n ', '\n A binary beat-by-beat classification algorithm for cerebral blood flow velocity (CBFV) recordings based on amplitude, spectral and morphological features is presented. The classification difference between 15 manually and algorithmically annotated CBFV records is around 5%.\n ', '\n The rowmotion operator acting on the set of order ideals of a finite poset has been the focus of a significant amount of recent research. One of the major goals has been to exhibit homomesies: statistics that have the same average along every orbit of the action. We systematize a technique for proving that various statistics of interest are homomesic by writing these statistics as linear combinations of "toggleability statistics" (originally introduced by Striker) plus a constant. We show that this technique recaptures most of the known homomesies for the posets on which rowmotion has been most studied. We also show that the technique continues to work in modified contexts. For instance, this technique also yields homomesies for the piecewise-linear and birational extensions of rowmotion; furthermore, we introduce a $q$-analogue of rowmotion and show that the technique yields homomesies for "$q$-rowmotion" as well.\n ', '\n Barely set-valued tableaux are a variant of Young tableaux in which one box contains two numbers as its entry. It has recently been discovered that there are product formulas enumerating certain classes of barely set-valued tableaux. We give some q-analogs of these product formulas by introducing a version of major index for these tableaux. We also give product formulas and q-analogs for barely set-valued plane partitions. 
The proofs use several probability distributions on the set of order ideals of a poset, depending on the real parameter q > 0, which we think could be of independent interest.\n ', '\n For a Weyl group $W$ of rank $r$, the $W$-Catalan number is the number of antichains of the poset of positive roots, and the $W$-Narayana numbers refine the $W$-Catalan number by keeping track of the cardinalities of these antichains. The $W$-Narayana numbers are symmetric, i.e., the number of antichains of cardinality $k$ is the same as the number of cardinality $r-k$. However, this symmetry is far from obvious. Panyushev posed the problem of defining an involution on root poset antichains that exhibits the symmetry of the $W$-Narayana numbers.\n Rowmotion and rowvacuation are two related operators, defined as compositions of "toggles," that give a dihedral action on the set of antichains of any ranked poset. Rowmotion acting on root posets has been the subject of a significant amount of research in the recent past. We prove that for the root posets of classical types, rowvacuation is Panyushev\'s desired involution.\n ', '\n The Lalanne-Kreweras involution is an involution on the set of Dyck paths which combinatorially exhibits the symmetry of the number of valleys and major index statistics. We define piecewise-linear and birational extensions of the Lalanne-Kreweras involution. Actually, we show that the Lalanne-Kreweras involution is a special case of a more general operator, called rowvacuation, which acts on the antichains of any graded poset. Rowvacuation, like the closely related and more studied rowmotion operator, is a composition of toggles. We obtain the piecewise-linear and birational lifts of the Lalanne-Kreweras involution by using the piecewise-linear and birational toggles of Einstein and Propp. We show that the symmetry properties of the Lalanne-Kreweras involution extend to these piecewise-linear and birational lifts.\n ', '\n We give a product formula for the number of shifted plane partitions of shifted double staircase shape with bounded entries. This is the first new example of a family of shapes with a plane partition product formula in many years. The proof is based on the theory of lozenge tilings; specifically, we apply the "free boundary" Kuo condensation due to Ciucu.\n ', '\n We survey all known examples of finite posets whose order polynomials have product formulas, and we put forward a heuristic which says that these are the same posets which have good dynamical behavior. Here the dynamics in question are the actions of promotion on the linear extensions of the poset, and rowmotion on the P-partitions of the poset.\n ', "\n Kreweras words are words consisting of n A's, n B's, and n C's in which every prefix has at least as many A's as B's and at least as many A's as C's. Equivalently, a Kreweras word is a linear extension of the poset ${\\sf V}\\times [n]$. Kreweras words were introduced in 1965 by Kreweras, who gave a remarkable product formula for their enumeration. Subsequently they became a fundamental example in the theory of lattice walks in the quarter plane. We study Schützenberger's promotion operator on the set of Kreweras words. In particular, we show that 3n applications of promotion on a Kreweras word merely swaps the B's and C's. Doing so, we provide the first answer to a question of Stanley from 2009, asking for posets with `good' behavior under promotion, other than the four families of shapes classified by Haiman in 1992. 
We also uncover a strikingly simple description of Kreweras words in terms of Kuperberg's $\\mathfrak{sl}_3$-webs, and Postnikov's trip permutation associated with any plabic graph. In this description, Schützenberger's promotion corresponds to rotation of the web.\n ", "\n The cyclic sieving phenomenon of Reiner, Stanton, and White says that we can often count the fixed points of elements of a cyclic group acting on a combinatorial set by plugging roots of unity into a polynomial related to this set. One of the most impressive instances of the cyclic sieving phenomenon is a theorem of Rhoades asserting that the set of plane partitions in a rectangular box under the action of promotion exhibits cyclic sieving. In Rhoades's result the sieving polynomial is the size generating function for these plane partitions, which has a well-known product formula due to MacMahon. We extend Rhoades's result by also considering symmetries of plane partitions: specifically, complementation and transposition. The relevant polynomial here is the size generating function for symmetric plane partitions, whose product formula was conjectured by MacMahon and proved by Andrews and Macdonald. Finally, we explain how these symmetry results also apply to the rowmotion operator on plane partitions, which is closely related to promotion.\n ", '\n We relate Reiner, Tenner, and Yong\'s coincidental down-degree expectations (CDE) property of posets to the minuscule doppelgänger pairs studied by Hamaker, Patrias, Pechenik, and Williams. Via this relation, we put forward a series of conjectures which suggest that the minuscule doppelgänger pairs behave "as if" they had isomorphic comparability graphs, even though they do not. We further explore the idea of minuscule doppelgänger pairs pretending to have isomorphic comparability graphs by considering the rowmotion operator on order ideals. We conjecture that the members of a minuscule doppelgänger pair behave the same way under rowmotion, as they would if they had isomorphic comparability graphs. Moreover, we conjecture that these pairs continue to behave the same way under the piecewise-linear and birational liftings of rowmotion introduced by Einstein and Propp. This conjecture motivates us to study the homomesies (in the sense of Propp and Roby) exhibited by birational rowmotion. We establish the birational analog of the antichain cardinality homomesy for the major examples of posets known or conjectured to have finite birational rowmotion order (namely: minuscule posets and root posets of coincidental type).\n ', '\n We prove a conjecture of Reiner, Tenner, and Yong which says that the initial weak order intervals corresponding to certain vexillary permutations have the coincidental down-degree expectations (CDE) property. Actually our theorem applies more generally to certain "skew vexillary" permutations (a notion we introduce), and shows that these posets are in fact "toggle CDE." As a corollary we obtain a homomesy result for rowmotion acting on semidistributive lattices in the sense of Barnard and of Thomas and Williams.\n ', '\n In earlier work in collaboration with Pavel Galashin and Thomas McConville we introduced a version of chip-firing for root systems. Our investigation of root system chip-firing led us to define certain polynomials analogous to Ehrhart polynomials of lattice polytopes, which we termed the symmetric and truncated Ehrhart-like polynomials. We conjectured that these polynomials have nonnegative integer coefficients. 
Here we affirm "half" of this positivity conjecture by providing a positive, combinatorial formula for the coefficients of the symmetric Ehrhart-like polynomials. This formula depends on a subtle integrality property of slices of permutohedra, and in turn a lemma concerning dilations of projections of root polytopes, which both may be of independent interest. We also discuss how our formula very naturally suggests a conjecture for the coefficients of the truncated Ehrhart-like polynomials that turns out to be false in general, but which may hold in some cases.\n ', '\n Jim Propp recently introduced a variant of chip-firing on a line where the chips are given distinct integer labels. Hopkins, McConville, and Propp showed that this process is confluent from some (but not all) initial configurations of chips. We recast their set-up in terms of root systems: labeled chip-firing can be seen as a root-firing process which allows the moves $λ\\to λ+ α$ for $α\\in Φ^{+}$ whenever $\\langleλ,α^\\vee\\rangle = 0$, where $Φ^{+}$ is the set of positive roots of a root system of Type A and $λ$ is a weight of this root system. We are thus motivated to study the exact same root-firing process for an arbitrary root system. Actually, this central root-firing process is the subject of a sequel to this paper. In the present paper, we instead study the interval root-firing processes determined by $λ\\to λ+ α$ for $α\\in Φ^{+}$ whenever $\\langleλ,α^\\vee\\rangle \\in [-k-1,k-1]$ or $\\langleλ,α^\\vee\\rangle \\in [-k,k-1]$, for any $k \\geq 0$. We prove that these interval-firing processes are always confluent, from any initial weight. We also show that there is a natural way to consistently label the stable points of these interval-firing processes across all values of $k$ so that the number of weights with given stabilization is a polynomial in $k$. We conjecture that these Ehrhart-like polynomials have nonnegative integer coefficients.\n ', '\n Jim Propp recently proposed a labeled version of chip-firing on a line and conjectured that this process is confluent from some initial configurations. This was proved by Hopkins-McConville-Propp. We reinterpret Propp\'s labeled chip-firing moves in terms of root systems: a "central-firing" move consists of replacing a weight $λ$ by $λ+α$ for any positive root $α$ that is orthogonal to $λ$. We show that central-firing is always confluent from any initial weight after modding out by the Weyl group, giving a generalization of unlabeled chip-firing on a line to other types. For simply-laced root systems we describe this unlabeled chip-firing as a number game on the Dynkin diagram. We also offer a conjectural classification of when central-firing is confluent from the origin or a fundamental weight.\n ', '\n We investigate a variant of the chip-firing process on the infinite path graph: rather than treating the chips as indistinguishable, we label them with positive integers. To fire an unstable vertex, i.e. a vertex with more than one chip, we choose any two chips at that vertex and move the lesser-labeled chip to the left and the greater-labeled chip to the right. This labeled version of the chip-firing process exhibits a remarkable confluence property, similar to but subtler than the confluence that prevails for unlabeled chip-firing: when all chips start at the origin and the number of chips is even, the chips always end up in sorted order. 
Our proof of sorting relies upon an independently interesting lemma concerning unlabeled chip-firing which says that stabilization preserves a natural partial order on configurations. We also discuss some extensions of this sorting phenomenon to other graphs (variants of the infinite path), to other initial configurations, and to other Cartan-Killing types.

Reiner, Tenner, and Yong recently introduced the coincidental down-degree expectations (CDE) property for finite posets and showed that many nice posets are CDE. In this paper we further explore the CDE property, resolving a number of conjectures about CDE posets put forth by Reiner-Tenner-Yong. A consequence of our work is the completion of a case-by-case proof that any minuscule lattice is CDE. We also explain two major applications of the study of CDE posets: formulas for certain classes of set-valued tableaux; and homomesy results for rowmotion and gyration acting on sets of order ideals.

We define a $q$-deformation of the classical ring of integer-valued polynomials which we call the ring of quantum integer-valued polynomials. We show that this ring has a remarkable combinatorial structure and enjoys many positivity properties: for instance, the structure constants for this ring with respect to its basis of $q$-binomial coefficient polynomials belong to $\mathbb{N}[q]$. We then classify all maps from this ring into a field, extending a known classification in the classical case where $q=1$.

A fourientation of a graph $G$ is a choice for each edge of the graph whether to orient that edge in either direction, leave it unoriented, or biorient it. We may naturally view fourientations as a mixture of subgraphs and graph orientations where unoriented and bioriented edges play the role of absent and present subgraph edges, respectively. Building on work of Backman and Hopkins (2015), we show that given a linear order and a reference orientation of the edge set, one can define activities for fourientations of $G$ which allow for a new 12 variable expansion of the Tutte polynomial $T_G$. Our formula specializes to both the Las Vergnas (1984) orientation activities expansion of the Tutte polynomial and the generalized activities expansion of Gordon and Traldi (1990).

The jaggedness of an order ideal I in a poset P is the number of maximal elements in I plus the number of minimal elements of P not in I. A probability distribution on the set of order ideals of P is toggle-symmetric if for every p in P, the probability that p is maximal in I equals the probability that p is minimal not in I. In this paper, we prove a formula for the expected jaggedness of an order ideal of P under any toggle-symmetric probability distribution when P is the poset of boxes in a skew Young diagram. Our result extends the main combinatorial theorem of Chan-López-Pflueger-Teixidor, who used an expected jaggedness computation as a key ingredient to prove an algebro-geometric formula; and it has applications to homomesies, in the sense of Propp-Roby, of the antichain cardinality statistic for order ideals in partially ordered sets.

Kreweras proved that the reversed sum enumerator for parking functions of length $n$ is equal to the inversion enumerator for labeled trees on $n+1$ vertices. Recently, Perkinson, Yang, and Yu gave a bijective proof of this equality that moreover generalizes to graphical parking functions.
Using a depth-first search variant of Dhar\'s burning algorithm they proved that the codegree enumerator for $G$-parking functions equals the $κ$-number enumerator for spanning trees of $G$. The $κ$-number is a kind of generalized tree inversion number originally defined by Gessel. We extend the work of Perkinson-Yang-Yu to what are referred to as "generalized parking functions" in the literature, but which we prefer to call vector parking functions because they depend on a choice of vector $\\mathbf{x} \\in \\mathbb{N}^n$. Specifically, we give an expression for the reversed sum enumerator for $\\mathbf{x}$-parking functions in terms of inversions in rooted plane trees with respect to certain admissible vertex orders. Along the way we clarify the relationship between graphical and vector parking functions.\n ', '\n A fourientation of a graph is a choice for each edge of the graph whether to orient that edge in either direction, leave it unoriented, or biorient it. Fixing a total order on the edges and a reference orientation of the graph, we investigate properties of cuts and cycles in fourientations which give trivariate generating functions that are generalized Tutte polynomial evaluations of the form \\[(k+m)^{n-1}(k+l)^gT\\left(\\frac{αk + βl + m}{k+m},\\frac{γk + l + δm}{k+l}\\right)\\] for $α,γ\\in \\{0,1,2\\}$ and $β, δ\\in \\{0,1\\}$. We introduce an intersection lattice of 64 cut-cycle fourientation classes enumerated by generalized Tutte polynomial evaluations of this form. We prove these enumerations using a single deletion-contraction argument and classify axiomatically the set of fourientation classes to which our deletion-contraction argument applies. This work unifies and extends earlier results for fourientations due to Gessel and Sagan, and results for partial orientations due to the first author, and the second author and David Perkinson, as well as results for total orientations due to many authors. We conclude by describing how these classes of fourientations relate to geometric, combinatorial, and algebraic objects including bigraphical arrangements, cycle-cocycle reversal systems, graphic Lawrence ideals, Riemann-Roch theory for graphs, zonotopal algebras, and the reliability polynomial.\n ', "\n We define a statistic called the weight of oscillating tableaux. Oscillating tableaux, a generalization of standard Young tableaux, are certain walks in Young's lattice of partitions. The weight of an oscillating tableau is the sum of the sizes of all the partitions that it visits. We show that the average weight of all oscillating tableaux of shape lambda and length 2n plus the size of lambda has a surprisingly simple formula: it is a quadratic polynomial in the size of lambda and n. Our proof via the theory of differential posets is largely computational. We suggest how the homomesy paradigm of Propp and Roby may lead to a more conceptual proof of this result and reveal a hidden symmetry in the set of perfect matchings.\n ", '\n Motivated by the problem of giving a bijective proof of the fact that the birational RSK correspondence satisfies the octahedron recurrence, we define interlacing networks, which are certain planar directed networks with a rigid structure of sources and sinks. We describe an involution that swaps paths in these networks and leads to Plücker-like three-term relations among path weights. 
We show that indeed these relations follow from the Plücker relations in the Grassmannian together with some simple rank properties of the matrices corresponding to our interlacing networks. The space of matrices obeying these rank properties forms the closure of a cell in the matroid stratification of the totally nonnegative Grassmannian. Not only does the octahedron recurrence for RSK follow immediately from the three-term relations for interlacing networks, but also these relations imply some interesting identities of Schur functions reminiscent of those obtained by Fulmek and Kleber. These Schur function identities lead to some results on Schur positivity for expressions of the form $s_νs_ρ - s_λs_μ$.\n ', "\n We present a new proof of the monomial case of Wilmes' conjecture, which gives a formula for the coarsely-graded Betti numbers of the G-parking function ideal in terms of maximal parking functions of contractions of G. Our proof is via poset topology and relies on a theorem of Gasharov, Peeva, and Welker that connects the Betti numbers of a monomial ideal to the topology of its lcm-lattice.\n ", '\n We define the bigraphical arrangement of a graph and show that the Pak-Stanley labels of its regions are the parking functions of a closely related graph, thus proving conjectures of Duval, Klivans, and Martin and of Hopkins and Perkinson. A consequence is a new proof of a bijection between labeled graphs and regions of the Shi arrangement first given by Stanley. We also give bounds on the number of regions of a bigraphical arrangement.\n ', '\n We extend the concept of pattern avoidance in permutations on a totally ordered set to pattern avoidance in permutations on partially ordered sets. The number of permutations on $P$ that avoid the pattern $π$ is denoted $Av_P(π)$. We extend a proof of Simion and Schmidt to show that $Av_P(132) \\leq Av_P(123)$ for any poset $P$, and we exactly classify the posets for which equality holds.\n ', '\n It is known that the Pak-Stanley labeling of the Shi hyperplane arrangement provides a bijection between the regions of the arrangement and parking functions. For any graph G, we define the G-semiorder arrangement and show that the Pak-Stanley labeling of its regions produces all G-parking functions.\n ', '\n Convolutions have long been regarded as fundamental to applied mathematics, physics and engineering. Their mathematical elegance allows for common tasks such as numerical differentiation to be computed efficiently on large data sets. Efficient computation of convolutions is critical to artificial intelligence in real-time applications, like machine vision, where convolutions must be continuously and efficiently computed on tens to hundreds of kilobytes per second. In this paper, we explore how convolutions are used in fundamental machine vision applications. We present an accelerated n-dimensional convolution package in the high performance computing language, Julia, and demonstrate its efficacy in solving the time to contact problem for machine vision. Results are measured against synthetically generated videos and quantitatively assessed according to their mean squared error from the ground truth. We achieve over an order of magnitude decrease in compute time and allocated memory for comparable machine vision applications. 
All code is packaged and integrated into the official Julia Package Manager to be used in various other scenarios.

Modern scattering-type scanning near-field optical microscopy (s-SNOM) has become an indispensable tool in materials research. However, as the s-SNOM technique marches into the far-infrared (IR) and terahertz (THz) regimes, emerging experiments sometimes produce puzzling results. For example, anomalies in the near-field optical contrast have been widely reported. In this Letter, we systematically investigate a series of extreme subwavelength metallic nanostructures via s-SNOM near-field imaging in the GHz to THz frequency range. We find that the near-field material contrast is greatly impacted by the lateral size of the nanostructure, while the spatial resolution is practically independent of it. The contrast is also strongly affected by the connectivity of the metallic structures to a larger metallic ground plane. The observed effect can be largely explained by a quasi-electrostatic analysis. We also compare the THz s-SNOM results to those of the mid-IR regime, where the size-dependence becomes significant only for smaller structures. Our results reveal that the quantitative analysis of the near-field optical material contrasts in the long-wavelength regime requires a careful assessment of the size and configuration of metallic (optically conductive) structures.

Matched Field Processing (MFP) locates underwater sources by matching the received data with replica vectors, and can be regarded as a generalized beamformer. In this paper, the MFP method is combined with a recently developed framework, the Graph Signal Processing (GSP) method. Following the paradigm of GSP, a spatial adjacency matrix is constructed for the arbitrarily distributed sensors based on the Green's function; the source is then located using the graph Fourier transform. The simulation results illustrate that the graph-based MFP outperforms the conventional MFP processors -- the Bartlett processor and the Minimum Variance processor -- in both accuracy and robustness.

Hyperspectral imaging is a technique that allows for the creation of multi-color images. At terahertz wavelengths, it has emerged as a prominent tool for a number of applications, ranging from non-ionizing cancer diagnosis and pharmaceutical characterization to non-destructive artifact testing. Contemporary terahertz imaging systems typically rely on non-linear optical down-conversion of a fiber-based near-infrared femtosecond laser, requiring complex optical systems. Here, we demonstrate hyperspectral imaging with chip-scale frequency combs based on terahertz quantum cascade lasers. The dual combs are free-running and emit coherent terahertz radiation that covers a bandwidth of 220 GHz at 3.4 THz with ~10 μW per line. The combination of the fast acquisition rate of dual-comb spectroscopy with the monolithic design, scalability, and chip-scale size of the combs is highly appealing for future imaging applications in biomedicine and in the pharmaceutical industry.

For many applications Optical Frequency Combs (OFCs) require a high degree of temporal coherence (narrow linewidth). Commonly, OFCs are generated in nonlinear media from a monochromatic narrow-linewidth laser source or from mode-locked laser pulses, but in the all-important mid-infrared (MIR) and terahertz (THz) regions of the spectrum OFCs can be generated intrinsically, and with high efficiency, by free-running quantum cascade lasers (QCLs). These combs do not look like conventional OFCs: the phases of the modes differ, and in the temporal domain the OFC is a seemingly random combination of amplitude- and phase-modulated signals rather than a short pulse. Despite this pseudo-randomness, the experimental evidence suggests that the linewidth of the QCL OFC is just as narrow as that of a QCL operating in a single mode. While universally acknowledged, this observation is not fully understood. In this work we rigorously prove this fact by deriving the expression for the Schawlow-Townes linewidth of a QCL OFC, and we offer a transparent physical interpretation based on the orthogonality of laser modes, indicating that despite their very different temporal profiles, MIR and THz QCL OFCs are just as good for most applications as any other OFC.

Topological materials bear gapped excitations in bulk yet protected gapless excitations at boundaries. Magnetoplasmons (MPs), as high-frequency density excitations of a two-dimensional electron gas (2DEG) in a perpendicular magnetic field, embody a prototype of band topology for bosons. The time-reversal-breaking magnetic field opens a topological gap for bulk MPs up to the cyclotron frequency; topologically-protected edge magnetoplasmons (EMPs) bridge the bulk gap and propagate unidirectionally along the system's boundaries. However, all the EMPs known to date adhere to physical edges where the electron density terminates abruptly. This restriction has made device application extremely difficult. Here we demonstrate a new class of topological edge plasmons -- domain-boundary magnetoplasmons (DBMPs) -- within a uniform edgeless 2DEG. Such DBMPs arise at the domain boundaries of an engineered sign-changing magnetic field and are protected by the difference of gap Chern numbers (+/-1) across the magnetic domains. They propagate unidirectionally along the domain boundaries and are immune to domain defects. Moreover, they exhibit wide tunability in the microwave frequency range under an applied magnetic field or gate voltage. Our study opens a new direction to realize high-speed reconfigurable topological devices.

Dual comb spectroscopy allows for high-resolution spectra to be measured over broad bandwidths, but an essential requirement for coherent integration is the availability of a phase reference. Usually, this means that the combs' phase and timing errors must be measured and either minimized by stabilization or removed by correction, limiting the technique's applicability. In this work, we demonstrate that it is possible to extract the phase and timing signals of a multiheterodyne spectrum completely computationally, without any extra measurements or optical elements. These techniques are viable even when the relative linewidth exceeds the repetition rate difference, and can tremendously simplify any dual comb system.
By reconceptualizing frequency combs in terms of the temporal structure of their phase noise, not their frequency stability, we are able to greatly expand the scope of multiheterodyne techniques.\n ", '\n Frequency combs based on terahertz quantum cascade lasers feature broadband coverage and high output powers in a compact package, making them an attractive option for broadband spectroscopy. Here, we demonstrate the first multi-heterodyne spectroscopy using two terahertz quantum cascade laser combs. With just 100 $μ$s of integration time, we achieve peak signal-to-noise ratios exceeding 60 dB and a spectral coverage greater than 250 GHz centered at 2.8 THz. Even with room-temperature detectors we are able to achieve peak signal-to-noise ratios of 50 dB, and as a proof-of-principle we use these combs to measure the broadband transmission spectrum of etalon samples. Finally, we show that with proper signal processing, it is possible to extend the multi-heterodyne spectroscopy to quantum cascade laser combs operating in pulsed mode, greatly expanding the range of quantum cascade lasers that could be suitable for these techniques.\n ', '\n Two-dimensional molecular aggregate (2DMA), a thin sheet of strongly interacting dipole molecules self-assembled at close distance on an ordered lattice, is a fascinating fluorescent material. It is distinctively different from the single or colloidal dye molecules or quantum dots in most previous research. In this paper, we verify for the first time that when a 2DMA is placed at a nanometric distance from a metallic substrate, the strong and coherent interaction between the dipoles inside the 2DMA dominates its fluorescent decay at picosecond timescale. Our streak-camera lifetime measurement and interacting lattice-dipole calculation reveal that the metal-mediated dipole-dipole interaction shortens the fluorescent lifetime to about one half and increases the energy dissipation rate by ten times than expected from the noninteracting single-dipole picture. Our finding can enrich our understanding of nanoscale energy transfer in molecular excitonic systems and may designate a new direction for developing fast and efficient optoelectronic devices.\n ', '\n We demonstrate an unexpectedly strong surface-plasmonic absorption at the interface of silver and high-index dielectrics based on electron and photon spectroscopy. The measured bandwidth and intensity of absorption deviate significantly from the classical theory. Our density-functional calculation well predicts the occurrence of this phenomenon. It reveals that due to the low metal-to-dielectric work function at such interfaces, conduction electrons can display a drastic quantum spillover, causing the interfacial electron-hole pair production to dominate the decay of surface plasmons. This finding can be of fundamental importance in understanding and designing quantum nano-plasmonic devices that utilize noble metals and high-index dielectrics.\n ', '\n On-chip nanophotonics serves as the foundation for the new generation of information technology, but it is challenged by the diffraction limit of light. With the capabilities of confining light into (deep) subwavelength volumes, plasmonics makes it possible to dramatically miniaturize optical devices so as to integrate them into silicon chips. 
Here we demonstrate that by cascading nano-corrugation gratings with different periodicities on silver nanowires atop silicon, different colors can be spatially separated and chronologically released at different grating junctions. The released light frequency depends on the grating arrangement and corrugation periodicities. Hence the nanowire acts as a spectral splitter for sorting/demultiplexing photons at different nano-scale positions with a ten-femtosecond-level interval. Such nanowires can be constructed further into compact 2D networks or circuits. We believe that this study provides a new and promising approach for realizing spatiotemporal-sensitive spectral splitting and optical signal processing on nanoscales, and for general integration of nanophotonics with microelectronics.\n ', '\n We report on a heterodyne receiver designed to observe the astrophysically important neutral atomic oxygen [OI] line at 4.7448 THz. The local oscillator is a third-order distributed feedback Quantum Cascade Laser operating in continuous wave mode at 4.741 THz. A quasi-optical, superconducting NbN hot electron bolometer is used as the mixer. We recorded a double sideband receiver noise temperature (T^DSB_rec) of 815 K, which is ~7 times the quantum noise limit (hν/2k_B) and an Allan variance time of 15 s at an effective noise fluctuation bandwidth of 18 MHz. Heterodyne performance was confirmed by measuring a methanol line spectrum.\n ', '\n We study a "strongly-coupled" (SC) polariton system formed between the atom-like intersubband transitions in a semiconductor nanostructure and the THz optical modes that are localised at the edges of a gold aperture. The polaritons can be excited optically, by incoherent excitation with bandgap radiation, and we find that they also coherently scatter the same input laser, to give strikingly sharp "sideband" (SB) spectral peaks in the backscattered spectrum. The SB intensity is a sensitive track of the polariton density and they can be detected down to a quantum noise floor that is more than 2500 times lower than the excitation thresholds of comparable quantum cascade laser diodes. Compared with other coherent scattering mechanisms, higher order SB scattering events are readily observable, and we speculate that the effect may find utility as a passive all-optical wavelength shifting mechanism in telecommunications systems.\n ', '\n We develop simple density-matrix models to describe the role of coherence in resonant-tunneling (RT) transport of quantum-cascade lasers (QCLs). Specifically, we investigate the effects of coherent coupling between the lasing levels with other levels on the transport properties and gain spectra. In the first part of the paper, we use a three-level density-matrix model to obtain useful analytical expressions for current transport through the injector barrier in a QCL. An expression for the slope discontinuity in the current-voltage characteristics at the lasing threshold is derived. This value is shown to be a direct measure of the population inversion at threshold, and contradicts the previously held belief of it being indicative of ratio of the laser level lifetimes. In the second part of the paper, we use density matrices to compute the gain spectrum for a resonant-phonon terahertz QCL design. The large anticrossing of the doublet of lower radiative levels is reflected in a broad gain linewidth due to a coherent RT assisted depopulation process. 
At certain bias conditions, the gain spectrum exhibits double peaks which is supported by experimental observations.\n ', "\n The Web has enabled one of the most visible recent developments in education---the deployment of massive open online courses. With their global reach and often staggering enrollments, MOOCs have the potential to become a major new mechanism for learning. Despite this early promise, however, MOOCs are still relatively unexplored and poorly understood.\n In a MOOC, each student's complete interaction with the course materials takes place on the Web, thus providing a record of learner activity of unprecedented scale and resolution. In this work, we use such trace data to develop a conceptual framework for understanding how users currently engage with MOOCs. We develop a taxonomy of individual behavior, examine the different behavioral patterns of high- and low-achieving students, and investigate how forum participation relates to other parts of the course.\n We also report on a large-scale deployment of badges as incentives for engagement in a MOOC, including randomized experiments in which the presentation of badges was varied across sub-populations. We find that making badges more salient produced increases in forum engagement.\n ", '\n Social media sites are often guided by a core group of committed users engaged in various forms of governance. A crucial aspect of this type of governance is deliberation, in which such a group reaches decisions on issues of importance to the site. Despite its crucial --- though subtle --- role in how a number of prominent social media sites function, there has been relatively little investigation of the deliberative aspects of social media governance. Here we explore this issue, investigating a particular deliberative process that is extensive, public, and recorded: the promotion of Wikipedia admins, which is determined by elections that engage committed members of the Wikipedia community. We find that the group decision-making at the heart of this process exhibits several fundamental forms of relative assessment. First we observe that the chance that a voter will support a candidate is strongly dependent on the relationship between characteristics of the voter and the candidate. Second we investigate how both individual voter decisions and overall election outcomes can be based on models that take into account the sequential, public nature of the voting.\n ', '\n We study online social networks in which relationships can be either positive (indicating relations such as friendship) or negative (indicating relations such as opposition or antagonism). Such a mix of positive and negative links arise in a variety of online settings; we study datasets from Epinions, Slashdot and Wikipedia. We find that the signs of links in the underlying social networks can be predicted with high accuracy, using models that generalize across this diverse range of sites. These models provide insight into some of the fundamental principles that drive the formation of signed links in networks, shedding light on theories of balance and status from social psychology; they also suggest social computing applications by which the attitude of one user toward another can be estimated from evidence provided by their relationships with other members of the surrounding social network.\n ', '\n Relations between users on social media sites often reflect a mixture of positive (friendly) and negative (antagonistic) interactions. 
In contrast to the bulk of research on social networks that has focused almost exclusively on positive interpretations of links between people, we study how the interplay between positive and negative relationships affects the structure of on-line social networks. We connect our analyses to theories of signed networks from social psychology. We find that the classical theory of structural balance tends to capture certain common patterns of interaction, but that it is also at odds with some of the fundamental phenomena we observe --- particularly related to the evolving, directed nature of these on-line networks. We then develop an alternate theory of status that better explains the observed edge signs and provides insights into the underlying social mechanisms. Our work provides one of the first large-scale evaluations of theories of signed networks using on-line datasets, as well as providing a perspective for reasoning about social media sites.

Frequency estimation is one of the most fundamental problems in streaming algorithms. Given a stream $S$ of elements from some universe $U=\{1 \ldots n\}$, the goal is to compute, in a single pass, a short sketch of $S$ so that for any element $i \in U$, one can estimate the number $x_i$ of times $i$ occurs in $S$ based on the sketch alone. Two state-of-the-art solutions to this problem are the Count-Min and Count-Sketch algorithms. The frequency estimator $\tilde{x}$ produced by Count-Min, using $O(1/\varepsilon \cdot \log n)$ dimensions, guarantees that $\|\tilde{x}-x\|_{\infty} \le \varepsilon \|x\|_1$ with high probability, and $\tilde{x} \ge x$ holds deterministically. Also, Count-Min works under the assumption that $x \ge 0$. On the other hand, Count-Sketch, using $O(1/\varepsilon^2 \cdot \log n)$ dimensions, guarantees that $\|\tilde{x}-x\|_{\infty} \le \varepsilon \|x\|_2$ with high probability. A natural question is whether it is possible to design a best-of-both-worlds sketching method, with error guarantees depending on the $\ell_2$ norm and space comparable to Count-Sketch, that (like Count-Min) also has the no-underestimation property.

Our main set of results shows that the answer to the above question is negative. We show this in two incomparable computational models: linear sketching and streaming algorithms. We also study the complementary problem, where the sketch is required to not over-estimate, i.e., $\tilde{x} \le x$ should hold always.

We propose a model for online graph problems where algorithms are given access to an oracle that predicts the degrees of nodes in the graph (e.g., based on past data). Within this model, we study the classic problem of online bipartite matching. An extensive empirical evaluation shows that a greedy algorithm called MinPredictedDegree compares favorably to state-of-the-art online algorithms for this problem. We also initiate the theoretical study of MinPredictedDegree on a natural random graph model with a power law degree distribution and show that it produces matchings almost as large as the maximum matching on such graphs.

We study the problem of representing all distances between $n$ points in $\mathbb R^d$, with arbitrarily small distortion, using as few bits as possible.
We give asymptotically tight bounds for this problem, for Euclidean metrics, for $\\ell_1$ (a.k.a.~Manhattan) metrics, and for general metrics.\n Our bounds for Euclidean metrics mark the first improvement over compression schemes based on discretizing the classical dimensionality reduction theorem of Johnson and Lindenstrauss (Contemp.~Math.~1984). Since it is known that no better dimension reduction is possible, our results establish that Euclidean metric compression is possible beyond dimension reduction.\n ', '\n Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset $X$ onto a random $d = O(d_X)$-dimensional subspace (where $d_X$ is the doubling dimension of $X$), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for minimum spanning tree, but with the dimension $d$ having an extra $\\log \\log n$ term and the approximation factor being arbitrarily close to $1$. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of $k$-means and $k$-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of $X$.\n ', "\n We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \\pm \\varepsilon n$ from a sample of size $O(\\log^2(1/\\varepsilon) \\cdot n/\\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \\[ \\ \\log (1/\\varepsilon) \\cdot n^{1-Θ(1/\\log(1/\\varepsilon))}. \\] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.\n ", '\n We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \\in \\mathbb{R}^{n \\times n}$ corresponding to $n$ points $x_1,\\ldots,x_n \\in \\mathbb{R}^d$. 
In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector.\n We show that the sum of matrix entries can be estimated to $1+ε$ relative error in time $sublinear$ in $n$ and linear in $d$ for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to $1+ε$ relative error in time $subquadratic$ in $n$ and linear in $d$.\n Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.\n ', '\n Contrastive learning is one of the fastest growing research areas in machine learning due to its ability to learn useful representations without labeled data. However, contrastive learning is susceptible to shortcuts - i.e., it may learn shortcut features irrelevant to the task of interest, and discard relevant information. Past work has addressed this limitation via handcrafted data augmentations that eliminate the shortcut. But, manually crafted augmentations do not work across all datasets and tasks. Further, data augmentations fail in addressing shortcuts in multi-attribute classification when one attribute acts as a shortcut around other attributes. In this paper, we analyze the objective function of contrastive learning and formally prove that it is vulnerable to shortcuts. We then present reconstructive contrastive learning (RCL), a framework for learning unsupervised representations that are robust to shortcuts. The key idea is to force the learned representation to reconstruct the input, which naturally counters potential shortcuts. Extensive experiments verify that RCL is highly robust to shortcuts and outperforms state-of-the-art contrastive learning methods on a variety of datasets and tasks.\n ', "\n We consider online algorithms for the {\\em page migration problem} that use predictions, potentially imperfect, to improve their performance. The best known online algorithms for this problem, due to Westbrook'94 and Bienkowski et al'17, have competitive ratios strictly bounded away from 1. In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve a competitive ratio that tends to $1$ as the prediction error rate tends to $0$. Specifically, the competitive ratio is equal to $1+O(q)$, where $q$ is the prediction error rate. We also design a ``fallback option'' that ensures that the competitive ratio of the algorithm for {\\em any} input sequence is at most $O(1/q)$. Our result adds to the recent body of work that uses machine learning to improve the performance of ``classic'' algorithms.\n ", '\n We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited. Our main contribution is an algorithm that requires $O\\left(\\frac{k \\log (1/\\varepsilon)^2}{\\varepsilon^3}\\right)$ samples and a constant $O(1)$ memory words of space and outputs a $\\pm\\varepsilon$ estimate of $H(p)$. 
Without space limitations, the sample complexity has been established as $S(k,\varepsilon)=Θ\left(\frac k{\varepsilon\log k}+\frac{\log^2 k}{\varepsilon^2}\right)$, which is sub-linear in the domain size $k$, and the current algorithms that achieve optimal sample complexity also require nearly-linear space in $k$.

Our algorithm partitions $[0,1]$ into intervals and estimates the entropy contribution of probability values in each interval. The intervals are designed to trade off the bias and variance of these estimates.

We introduce a "learning-based" algorithm for the low-rank decomposition problem: given an $n \times d$ matrix $A$, and a parameter $k$, compute a rank-$k$ matrix $A'$ that minimizes the approximation loss $\|A-A'\|_F$. The algorithm uses a training set of input matrices in order to optimize its performance. Specifically, some of the most efficient approximate algorithms for computing low-rank approximations proceed by computing a projection $SA$, where $S$ is a sparse random $m \times n$ "sketching matrix", and then performing the singular value decomposition of $SA$. We show how to replace the random matrix $S$ with a "learned" matrix of the same sparsity to reduce the error.

Our experiments show that, for multiple types of data sets, a learned sketch matrix can substantially reduce the approximation loss compared to a random matrix $S$, sometimes by one order of magnitude. We also study mixed matrices where only some of the rows are trained and the remaining ones are random, and show that such matrices still offer improved performance while retaining worst-case guarantees.

Finally, to understand the theoretical aspects of our approach, we study the special case of $m=1$. In particular, we give an approximation algorithm for minimizing the empirical loss, with approximation factor depending on the stable rank of matrices in the training set. We also show generalization bounds for the sketch matrix learning problem.

The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for rich data domains, such as images or text documents. This raises the necessity for fast nearest neighbor search algorithms according to this distance, which poses a substantial computational bottleneck on massive datasets. In this work we introduce Flowtree, a fast and accurate approximation algorithm for the Wasserstein-$1$ distance. We formally analyze its approximation factor and running time. We perform an extensive experimental evaluation of nearest neighbor search algorithms in the $W_1$ distance on real-world datasets. Our results show that compared to the previous state of the art, Flowtree achieves up to $7.4$ times faster running time.

The frequencies of the elements in a data stream are an important statistical measure and the task of estimating them arises in many applications within data analysis and machine learning. Two of the most popular algorithms for this problem, Count-Min and Count-Sketch, are widely used in practice.

In a recent work [Hsu et al., ICLR'19], it was shown empirically that augmenting Count-Min and Count-Sketch with a machine learning algorithm leads to a significant reduction of the estimation error. The experiments were complemented with an analysis of the expected error incurred by Count-Min (both the standard and the augmented version) when the input frequencies follow a Zipfian distribution.
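As background for the analysis discussed above, the standard (unlearned) Count-Min sketch can be written in a few lines. The Python sketch below is purely illustrative and is not the authors' code: the class name, the width and depth values, and the use of Python's built-in hash as a stand-in for proper pairwise-independent hash functions are all our own simplifications.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: depth rows of width counters, one hash per row.

    Estimates never underestimate, and with width about e/eps and depth about
    ln(1/delta) the additive error is at most eps * ||x||_1 with probability
    at least 1 - delta (for non-negative counts).
    """
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # One salt per row; Python's hash() is only an illustrative stand-in.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _bucket(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for row in range(len(self.table)):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Every row overestimates, so the minimum is the tightest estimate.
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(len(self.table)))

# Example: estimates are always >= the true counts (no underestimation).
stream = ["a"] * 1000 + ["b"] * 300 + ["c"] * 5
cms = CountMinSketch(width=50, depth=4)
for x in stream:
    cms.update(x)
print(cms.estimate("a"), cms.estimate("b"), cms.estimate("c"))
```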
Although the authors established that the learned version of Count-Min has lower estimation error than its standard counterpart, their analysis of the standard Count-Min algorithm was not tight. Moreover, they provided no similar analysis for Count-Sketch.\n In this paper we resolve these problems. First, we provide a simple tight analysis of the expected error incurred by Count-Min. Second, we provide the first error bounds for both the standard and the augmented version of Count-Sketch. These bounds are nearly tight and again demonstrate an improved performance of the learned version of Count-Sketch.\n In addition to demonstrating tight gaps between the aforementioned algorithms, we believe that our bounds for the standard versions of Count-Min and Count-Sketch are of independent interest. In particular, it is a typical practice to set the number of hash functions in those algorithms to $Θ(\\log n)$. In contrast, our results show that to minimize the \\emph{expected} error, the number of hash functions should be a constant, strictly greater than $1$.\n ", "\n ``Composable core-sets'' are an efficient framework for solving optimization problems in massive data models. In this work, we consider efficient construction of composable core-sets for the determinant maximization problem. This can also be cast as the MAP inference task for determinantal point processes, that have recently gained a lot of interest for modeling diversity and fairness. The problem was recently studied in [IMOR'18], where they designed composable core-sets with the optimal approximation bound of $\\tilde O(k)^k$. On the other hand, the more practical Greedy algorithm has been previously used in similar contexts. In this work, first we provide a theoretical approximation guarantee of $O(C^{k^2})$ for the Greedy algorithm in the context of composable core-sets; Further, we propose to use a Local Search based algorithm that while being still practical, achieves a nearly optimal approximation bound of $O(k)^{2k}$; Finally, we implement all three algorithms and show the effectiveness of our proposed algorithm on standard data sets.\n ", '\n A distance matrix $A \\in \\mathbb R^{n \\times m}$ represents all pairwise distances, $A_{ij}=\\mathrm{d}(x_i,y_j)$, between two point sets $x_1,...,x_n$ and $y_1,...,y_m$ in an arbitrary metric space $(\\mathcal Z, \\mathrm{d})$. Such matrices arise in various computational contexts such as learning image manifolds, handwriting recognition, and multi-dimensional unfolding.\n In this work we study algorithms for low-rank approximation of distance matrices. Recent work by Bakshi and Woodruff (NeurIPS 2018) showed it is possible to compute a rank-$k$ approximation of a distance matrix in time $O((n+m)^{1+γ}) \\cdot \\mathrm{poly}(k,1/ε)$, where $ε>0$ is an error parameter and $γ>0$ is an arbitrarily small constant. Notably, their bound is sublinear in the matrix size, which is unachievable for general matrices.\n We present an algorithm that is both simpler and more efficient. It reads only $O((n+m) k/ε)$ entries of the input matrix, and has a running time of $O(n+m) \\cdot \\mathrm{poly}(k,1/ε)$. We complement the sample complexity of our algorithm with a matching lower bound on the number of entries that must be read by any algorithm. We provide experimental results to validate the approximation quality and running time of our algorithm.\n ', '\n We study the classic set cover problem from the perspective of sub-linear algorithms. 
Given access to a collection of $m$ sets over $n$ elements in the query model, we show that sub-linear algorithms derived from existing techniques have almost tight query complexities.\n On one hand, first we show an adaptation of the streaming algorithm presented in Har-Peled et al. [2016] to the sub-linear query model, that returns an $α$-approximate cover using $\\tilde{O}(m(n/k)^{1/(α-1)} + nk)$ queries to the input, where $k$ denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of $k$, the required number of queries is $\\tildeΩ(m(n/k)^{1/(2α)})$, even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require $Ω(nk)$ queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter $k$.\n On the other hand, we show that this bound is not optimal for larger values of the parameter $k$, as there exists a $(1+\\varepsilon)$-approximation algorithm with $\\tilde{O}(mn/k\\varepsilon^2)$ queries. We show that this bound is essentially tight for sufficiently small constant $\\varepsilon$, by establishing a lower bound of $\\tildeΩ(mn/k)$ query complexity.\n ', '\n We study the fair variant of the classic $k$-median problem introduced by Chierichetti et al. [2017]. In the standard $k$-median problem, given an input pointset $P$, the goal is to find $k$ centers $C$ and assign each input point to one of the centers in $C$ such that the average distance of points to their cluster center is minimized.\n In the fair variant of $k$-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an "approximately equal" number of points of each color.\n Chierichetti et al. proposed a two-phase algorithm for fair $k$-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the $k$-median objective. In the second step, fairlets are merged into $k$ clusters by one of the existing $k$-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time.\n In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. Our algorithm additionally allows for finer control over the balance of resulting clusters than the original work. We complement our theoretical bounds with empirical evaluation.\n ', '\n Space partitions of $\\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces [Andoni, Naor, Nikolov, Razenshteyn, Waingarten STOC 2018, FOCS 2018], we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner [Sanders, Schulz SEA 2013] and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). 
On several standard benchmarks for NNS, our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.

We study a generalization of classical combinatorial graph spanners to the spectral setting. Given a set of vectors $V\subseteq \Re^d$, we say a set $U\subseteq V$ is an $α$-spectral spanner if for all $v\in V$ there is a probability distribution $μ_v$ supported on $U$ such that $$vv^\intercal \preceq α\cdot\mathbb{E}_{u\sim μ_v} uu^\intercal.$$ We show that any set $V$ has an $\tilde{O}(d)$-spectral spanner of size $\tilde{O}(d)$ and this bound is almost optimal in the worst case.

We use spectral spanners to study composable core-sets for spectral problems. We show that for many objective functions one can use a spectral spanner, independent of the underlying functions, as a core-set and obtain almost optimal composable core-sets. For example, for the determinant maximization problem we obtain an $\tilde{O}(k)^k$-composable core-set and we show that this is almost optimal in the worst case.

Our algorithm is a spectral analogue of the classical greedy algorithm for finding (combinatorial) spanners in graphs. We expect that our spanners will find many other applications in distributed or parallel models of computation. Our proof is spectral. As a side result of our techniques, we show that the rank of diagonally dominant lower-triangular matrices is robust under `small perturbations', which could be of independent interest.

We consider the $(1+ε)$-approximate nearest neighbor search problem: given a set $X$ of $n$ points in a $d$-dimensional space, build a data structure that, given any query point $y$, finds a point $x \in X$ whose distance to $y$ is at most $(1+ε) \min_{x \in X} \|x-y\|$ for an accuracy parameter $ε\in (0,1)$. Our main result is a data structure that occupies only $O(ε^{-2} n \log(n) \log(1/ε))$ bits of space, assuming all point coordinates are integers in the range $\{-n^{O(1)} \ldots n^{O(1)}\}$, i.e., the coordinates have $O(\log n)$ bits of precision. This improves over the best previously known space bound of $O(ε^{-2} n \log(n)^2)$, obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points $X$, and provide almost tight upper and lower bounds for the space complexity of this problem.

The nearest neighbor problem is defined as follows: Given a set $P$ of $n$ points in some metric space $(X,D)$, build a data structure that, given any point $q$, returns a point in $P$ that is closest to $q$ (its "nearest neighbor" in $P$). The data structure stores additional information about the set $P$, which is then used to find the nearest neighbor without computing all distances between $q$ and $P$. The problem has a wide range of applications in machine learning, computer vision, databases and other fields.

To reduce the time needed to find nearest neighbors and the amount of memory used by the data structure, one can formulate the {\em approximate} nearest neighbor problem, where the goal is to return any point $p' \in P$ such that the distance from $q$ to $p'$ is at most $c \cdot \min_{p \in P} D(q,p)$, for some $c \geq 1$. Over the last two decades, many efficient solutions to this problem were developed.
[Output truncated: the cell above prints the list of scraped faculty paper abstracts, covering topics such as metric compression, mmWave beam alignment, hardness of empirical risk minimization, sparse linear regression, locality-sensitive hashing, streaming set cover, sparse Fourier transforms, edit distance lower bounds, model-based compressive sensing, and approximate near neighbor search.]
Text Preprocessing¶
Pre-processing the dataset
In the original work we processed the data as raw documents, since the dataset was small. However, if you want to use Matthew Hoffman’s original SVI code instead, note that it takes a text file with a specific format. Once you have each abstract in a separate text file, you may find the following Python packages useful: io, collections, nltk. It is good practice to keep your dataset in its own folder, so that io can access it through a fixed relative path; a sketch of this reading step is shown below. Read each file and use nltk.tokenize to tokenize the text, then use collections to count the words of each abstract with a Counter/dictionary before writing each abstract’s word counts as a line in the output text file.
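Here is a minimal sketch of the file-reading step, assuming each abstract was saved as a separate .txt file in a folder named data/abstracts; the folder and file names here are illustrative, not from the original notebook:
import io
import os
ABSTRACT_DIR = 'data/abstracts'  # hypothetical folder with one .txt file per abstract
papers = []
for fname in sorted(os.listdir(ABSTRACT_DIR)):
    if fname.endswith('.txt'):
        # io.open lets us be explicit about the encoding of the scraped text
        with io.open(os.path.join(ABSTRACT_DIR, fname), 'r', encoding='utf-8') as f:
            papers.append(f.read())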
import collections
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# If you have not used nltk before, you may first need to run:
# nltk.download('punkt'); nltk.download('stopwords')

def word_cleaning_and_count(s):
    # Lowercase, tokenize, keep alphabetic tokens only, count them,
    # and drop English stopwords from the counts.
    s_lower = s.lower()
    cleaning_set = set(stopwords.words('english'))
    tokens = word_tokenize(s_lower)
    tokens = [token for token in tokens if token.isalpha()]
    word_dict = dict(collections.Counter(tokens))
    for key in cleaning_set:
        word_dict.pop(key, None)
    return word_dict
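As an illustrative check (not part of the original notebook), you can call the cleaner on a short string:
example = word_cleaning_and_count("Sparse recovery from Fourier measurements, and sparse recovery again.")
print(example)  # should give counts like {'sparse': 2, 'recovery': 2, 'fourier': 1, 'measurements': 1}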
# Clean and count every abstract, then build the overall vocabulary
papers_word_dict = [word_cleaning_and_count(paper) for paper in papers]
dup_keys = []
for i in range(len(papers_word_dict)):
    dup_keys = dup_keys + list(papers_word_dict[i].keys())
# vocab: unique words across all abstracts; lookup_table maps each word to an integer id
vocab = list(collections.Counter(dup_keys).keys())
lookup_table = dict(zip(vocab, range(len(vocab))))
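If you do want to feed the counts to Matthew Hoffman’s SVI code rather than the notebook below, here is a minimal sketch that writes one line per abstract as space-separated word_id:count pairs; this format is an assumption on our part, so check the README of the code you are using before relying on it:
# Hypothetical output format: one abstract per line, "word_id:count" pairs separated by spaces
with open('data/abstract_counts.txt', 'w') as fout:
    for word_dict in papers_word_dict:
        line = ' '.join('%d:%d' % (lookup_table[token], count) for token, count in word_dict.items())
        fout.write(line + '\n')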
Save data¶
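The cells below assume a folder named data exists in your working directory; if it does not, you can create it first (a small helper, not part of the original notebook):
import os
os.makedirs('data', exist_ok=True)  # create the data folder if it is missing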
import json
with open('data/names', 'w') as fout:
    json.dump(names, fout)
with open('data/papers', 'w') as fout:
    json.dump(papers, fout)
with open('data/papers_word_dict', 'w') as fout:
    json.dump(papers_word_dict, fout)
with open('data/vocab', 'w') as fout:
    json.dump(vocab, fout)
with open('data/lookup_table', 'w') as fout:
    json.dump(lookup_table, fout)
LDA¶
# Similar to K in K-means clustering: we want to divide the abstracts into 5 topics
no_topics = 5
Load data¶
Make sure the data folder created above (containing the saved files) is in your working directory.
import json
with open('data/names', 'r') as json_file:
    names = json.load(json_file)
with open('data/papers', 'r') as json_file:
    papers = json.load(json_file)
with open('data/papers_word_dict', 'r') as json_file:
    papers_word_dict = json.load(json_file)
with open('data/vocab', 'r') as json_file:
    vocab = json.load(json_file)
with open('data/lookup_table', 'r') as json_file:
    lookup_table = json.load(json_file)
vocab_size = len(vocab)
Using scikit-learn¶
# Build a dense bag-of-words vector (length = vocab_size) for each abstract
doc_vecs = []
for paper in papers_word_dict:
    doc_vec = [0 for _ in range(vocab_size)]
    for token, occurs in paper.items():
        doc_vec[lookup_table[token]] = occurs
    doc_vecs.append(doc_vec)
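A dense list of lists is fine for a few thousand abstracts, but for larger corpora a sparse matrix saves memory. Here is an optional sketch using scipy (scikit-learn's LDA also accepts sparse input); doc_matrix is our own name, not from the original notebook:
from scipy.sparse import csr_matrix
rows, cols, vals = [], [], []
for doc_idx, paper in enumerate(papers_word_dict):
    for token, occurs in paper.items():
        rows.append(doc_idx)
        cols.append(lookup_table[token])
        vals.append(occurs)
# doc_matrix can be passed to LatentDirichletAllocation.fit in place of doc_vecs
doc_matrix = csr_matrix((vals, (rows, cols)), shape=(len(papers_word_dict), vocab_size))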
from sklearn.decomposition import LatentDirichletAllocation
# Run the LDA
lda = LatentDirichletAllocation(n_components=no_topics, learning_method='online').fit(doc_vecs)
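Not in the original notebook, but if you also want the per-document topic proportions, scikit-learn's transform method returns them:
doc_topic = lda.transform(doc_vecs)  # shape: (number of abstracts, no_topics)
print(doc_topic[0])                  # topic mixture of the first abstract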
def display_topics(model, feature_names, no_top_words):
    # For each topic, print the no_top_words words with the largest weights
    for topic_idx, topic in enumerate(model.components_):
        print('Topic %d:' % (topic_idx))
        print(' '.join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

# Show the top 10 words of each topic
no_top_words = 10
display_topics(lda, vocab, no_top_words)
Topic 0: data network system systems design performance using algorithms paper memory
Topic 1: algorithm problem graph n problems time g gradient graphs show
Topic 2: learning data model models neural approach training methods tasks performance
Topic 3: n problem algorithms time algorithm show channel bounds also number
Topic 4: quantum optical materials magnetic spin superconducting energy properties using material
- Topic 0: It may contain abstracts from traditional computer science.
- Topic 1: It may contain abstracts from modern computer science and graph algorithms.
- Topic 2: It may contain abstracts from machine learning and deep learning.
- Topic 3: It may contain abstracts from modern computer science and machine learning.
- Topic 4: It may contain topics from quantum theory and materials science.
End-to-end Code (SVILDA algorithm)¶
# For SVILDA, represent each abstract as (list of word ids, list of counts)
doc_vecs = []
for paper in papers_word_dict:
    wordslist = []
    countslist = []
    for token, occurs in paper.items():
        wordslist.append(lookup_table[token])
        countslist.append(occurs)
    doc_vecs.append((wordslist, countslist))
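As a quick sanity check (not part of the original notebook), you can inspect the sparse representation of the first abstract:
word_ids, counts = doc_vecs[0]
print(list(zip(word_ids, counts))[:5])  # first few (word_id, count) pairs of abstract 0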
from svilda import SVILDA
iterations = 10000
# The numeric arguments correspond to alpha = 0.1, eta = 0.01, tau = 1, kappa = 0.75 (see the variable notation earlier), followed by the iteration cap
lda = SVILDA(vocab, no_topics, len(doc_vecs), 0.1, 0.01, 1, 0.75, iterations)
lda.runSVI(doc_vecs)
ITERATION 0 running document number 1429 ... ITERATION 9900 running document number 723
(Progress log truncated: one line is printed every 100 iterations, showing which sampled document the current update uses.)
def display_topics(model, feature_names, no_top_words):
    # SVILDA stores the topic-word variational parameters in model._lambda
    for topic_idx, topic in enumerate(model._lambda):
        print('Topic %d:' % (topic_idx))
        print(' '.join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
display_topics(lda, vocab, no_top_words)
Topic 0: problem algorithms paper performance neural present problems error task lower
Topic 1: model data show methods work propose training given tasks first
Topic 2: algorithm models new also network networks use k different design
Topic 3: learning n using graph method approach two number demonstrate based
Topic 4: time results one set information system study systems bounds large
- Topic 0: It may contain abstracts from machine learning and deep learning.
- Topic 1: It may contain abstracts from machine learning algorithms.
- Topic 2: It may contain abstracts from modern computer science and machine learning algorithms.
- Topic 3: It may contain abstracts from machine learning algorithms and graph algorithms.
- Topic 4: It may contain topics from computer science systems.
Conclusion¶
- The scikit-learn LDA gives better results than our SVILDA implementation.
- It groups the abstracts into more distinct and meaningful topic clusters.
- The research papers are mostly related to algorithms, core computer science, and machine/deep learning.
- Outside computer science, the remaining topics relate to quantum theory and materials science.