Predicting life outcomes is a challenging task even for advanced
machine learning (ML) algorithms. At the same time, accurately
predicting these outcomes has important implications for providing
targeted assistance and improving policy making. Recent studies based
on the Fragile Families and Child Wellbeing Study dataset
have shown that complex ML pipelines, even with thousands of variables
available, produce low-quality predictions. This research raises
several questions about the predictability of life outcomes: 1) What
factors influence the predictability of an outcome (e.g., quality of
data, pre-processing steps, model hyperparameters)? 2) How does the
predictability of outcomes vary by domain (e.g., are health outcomes
easier to predict than education outcomes)? To answer these questions,
we are building a cloud-based system to train and test hundreds of ML
pipelines on thousands of life outcomes. We use the results of this
large-scale exploration in a data-driven way to understand the
predictability of life outcomes.
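
As a rough illustration of what such an exploration involves, the sketch below scores every (pipeline, outcome) pair on held-out data. It is only a toy under stated assumptions, not the system described in this talk: the data are synthetic rather than drawn from the Fragile Families study, the two "pipelines" are deliberately trivial, and all names (`run_grid`, `PIPELINES`, `OUTCOMES`, `make_outcome`) are hypothetical.

```python
import random
from statistics import mean


def r_squared(y_true, y_pred):
    """Held-out R^2, measured against the mean of the evaluated targets."""
    baseline = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - baseline) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean(xs, ys):
    """Baseline pipeline: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """One-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def make_outcome(noise, n=200, seed=0):
    """Synthetic outcome (assumption): y = 2x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, noise) for x in xs]
    return xs, ys


# Hypothetical registries; a real study would hold hundreds of pipelines
# and thousands of outcomes.
PIPELINES = {"mean_baseline": fit_mean, "linear": fit_linear}
OUTCOMES = {
    "low_noise": make_outcome(0.1, seed=1),
    "high_noise": make_outcome(3.0, seed=2),
}


def run_grid(pipelines, outcomes, train_fraction=0.5):
    """Train each pipeline on each outcome and record its held-out R^2."""
    results = {}
    for outcome_name, (xs, ys) in outcomes.items():
        split = int(len(xs) * train_fraction)
        for pipeline_name, fit in pipelines.items():
            model = fit(xs[:split], ys[:split])
            preds = [model(x) for x in xs[split:]]
            results[(pipeline_name, outcome_name)] = r_squared(ys[split:], preds)
    return results


if __name__ == "__main__":
    for key, score in sorted(run_grid(PIPELINES, OUTCOMES).items()):
        print(key, round(score, 3))
```

The resulting table of scores, keyed by pipeline and outcome, is the kind of raw material a data-driven analysis of predictability could start from; each cell is independent of the others, which is what makes this style of exploration easy to distribute across machines.
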
In the first part of the talk, we discuss
the study design and describe the system we built to run such a
large-scale exploration. The system is general and offers easy-to-use
interfaces for running a wide range of studies. In the second part, we
present a meta-learning-inspired method to derive key insights related
to the problem of predictability by (A) comparing the relative predictive
power of different classes of models and (B) using the descriptive
statistics that best predict the predictability of ML pipelines.
Predictability of life outcomes is a multi-faceted problem. We conclude
the talk by briefly discussing some of our other studies that are
currently underway.
Bio:
Pranay Anchuri is a data scientist supported by the DataX fund at
CITP. His research interests include graph mining, large-scale data
analytics and blockchain technologies. Pranay graduated with a Ph.D. in
computer science from Rensselaer Polytechnic Institute in 2015. During
his graduate studies, he worked at various labs including IBM, Yahoo, and
QCRI. His thesis focused on developing algorithms for efficiently
extracting frequent patterns from noisy networks.
After graduation, Pranay started as a research scientist at NEC Labs,
Princeton, working on log modeling and analytics. Most recently, he
worked as a research scientist at Axoni, NY, where his research focused
on problems related to the implementation of high-performance
permissioned blockchains.