Jiwon's Alcove
Research Blog

Submodular coresets: Machine learning with less data

Training machine learning models is expensive for a variety of reasons. Data movement costs immense time and power, and backpropagation is...

A theoretical computer science problem that the SOTA LLMs get wrong (as of now)

A P-splinter is a language whose elements can be enumerated by a PTIME function. Although the relevant result is old, state-of-the-art LLMs still get this problem wrong. (For now, at least: by writing this, I am inadvertently helping the next generation of LLMs improve via memorization.)

CoolerSpace: A Language for Physically Correct and Computationally Efficient Color Programming

A type system for color programming that prevents physically meaningless computations and optimizes performance via equality saturation.

The PLUTUS pipeline consists of a cycle: model training, sliceline to identify problematic subgroups, and distribution tailoring to acquire additional data.

PLUTUS: Understanding Data Distribution Tailoring for Machine Learning

A human-in-the-loop pipeline that identifies problematic model slices and acquires targeted data from external sources to improve fairness.

The DT pipeline where data sources are combined with RatioColl or EpsilonGreedy to form a balanced unified dataset.

Data Distribution Tailoring

Cost-efficient algorithms for collecting data from multiple sources while ensuring adequate representation across demographic groups.

The Two Coupons, Generic Quota Coupon Collector's Problem

Suppose there are two distinct coupon types. In each iteration, we sample a type-1 coupon with probability $p$ and a type-2 coupon with probability $q = 1 - p$. Our goal is to collect at least $k$ coupons of type 1 and at least $r$ of type 2. How many iterations does it take, in expectation, to complete the collection?
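To make the question concrete, here is a minimal sketch of one way to compute the expectation numerically. It is not necessarily the derivation used in the post. It assumes $0 < p < 1$ and uses the one-step recurrence $E[a, b] = 1 + p\,E[a-1, b] + q\,E[a, b-1]$, where $a$ and $b$ are the numbers of type-1 and type-2 coupons still needed. The function names `expected_draws` and `simulate` are hypothetical, and the Monte Carlo routine is only a sanity check.

```python
import random


def expected_draws(k: int, r: int, p: float) -> float:
    """Exact expected number of draws to get >= k type-1 and >= r type-2 coupons.

    E[a][b] is the expected number of further draws when a type-1 and
    b type-2 coupons are still missing.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    E = [[0.0] * (r + 1) for _ in range(k + 1)]
    # Boundary cases: only one type is still missing, so we wait for
    # a (or b) successes of a Bernoulli trial.
    for a in range(1, k + 1):
        E[a][0] = a / p
    for b in range(1, r + 1):
        E[0][b] = b / q
    # Interior: one draw is spent, then we move to a smaller subproblem.
    for a in range(1, k + 1):
        for b in range(1, r + 1):
            E[a][b] = 1.0 + p * E[a - 1][b] + q * E[a][b - 1]
    return E[k][r]


def simulate(k: int, r: int, p: float, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same expectation (sanity check only)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        got1 = got2 = draws = 0
        while got1 < k or got2 < r:
            draws += 1
            if rng.random() < p:
                got1 += 1
            else:
                got2 += 1
        total += draws
    return total / trials


if __name__ == "__main__":
    # With k = r = 1 and p = 0.5 this is the classic two-coupon collector.
    print(expected_draws(1, 1, 0.5))  # 3.0
```

The table has $(k+1)(r+1)$ entries, so the exact value costs $O(kr)$ time. Surplus coupons of either type are simply ignored, which is why the boundary rows reduce to $a/p$ and $b/q$.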

Red and blue points on a plane, with some red points and some blue points having a circle around them.

Fair $k$-Cover Coresets

Efficiently obtain coresets that cover every point in the dataset while adequately representing groups of interest.

The pipeline. An SDR image is made brighter and darker, then denoising is applied. The resultant three images are combined with Mertens' fusion.

Single Exposure Fusion

Extend the perceptual dynamic range of a single photograph using classic denoising and exposure fusion, no machine learning required.