Jiwon's Alcove
Research Blog

Submodular coresets: Machine learning with less data

A theoretical computer science problem that the SOTA LLMs get wrong (as of now)

A P-splinter is a language whose elements can be enumerated by a PTIME function. Although the underlying result is old, state-of-the-art LLMs get this problem wrong. (For now: by writing this, I am inadvertently helping the next generation of LLMs improve via memorization.)

CoolerSpace: A Language for Physically Correct and Computationally Efficient Color Programming

The PLUTUS pipeline is a cycle: model training, SliceLine to identify problematic subgroups, and distribution tailoring to acquire additional data.

PLUTUS: Understanding Data Distribution Tailoring for Machine Learning

The Two Coupons, Generic Quota Coupon Collector's Problem

Suppose there are two distinct coupon types. In each iteration, the probability of sampling a type-1 coupon is $p$ and that of sampling a type-2 coupon is $q = 1 - p$. Our goal is to collect at least $k$ type-1 coupons and at least $r$ type-2 coupons. How many iterations does it take, in expectation, to complete the collection?
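As a rough numerical companion to the question above, here is a minimal sketch that computes the expectation with a small dynamic program. It is not necessarily the derivation used in the post. The function name `expected_draws` and the table layout are illustrative assumptions. The recurrence conditions on the next draw: with $a$ type-1 and $b$ type-2 coupons still needed, $E[a, b] = 1 + p\,E[a-1, b] + q\,E[a, b-1]$, and once one quota is met the remainder is a negative-binomial wait ($a/p$ or $b/q$).

```python
def expected_draws(k: int, r: int, p: float) -> float:
    """Expected number of draws to collect >= k type-1 and >= r type-2 coupons.

    Hypothetical helper for illustration; assumes 0 < p < 1 and k, r >= 0.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if k < 0 or r < 0:
        raise ValueError("quotas must be non-negative")
    q = 1.0 - p

    # table[a][b] = expected draws when a type-1 and b type-2 are still needed.
    table = [[0.0] * (r + 1) for _ in range(k + 1)]

    # Only one quota left: expected wait for a (or b) successes.
    for a in range(1, k + 1):
        table[a][0] = a / p
    for b in range(1, r + 1):
        table[0][b] = b / q

    # Both quotas open: every draw reduces exactly one of them.
    for a in range(1, k + 1):
        for b in range(1, r + 1):
            table[a][b] = 1.0 + p * table[a - 1][b] + q * table[a][b - 1]

    return table[k][r]
```

For example, `expected_draws(1, 1, 0.5)` recovers the classic two-coupon answer of 3 draws. The table costs $O(kr)$ time and space, which is fine for sanity-checking a closed form on small quotas.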

The DT pipeline where data sources are combined with RatioColl or EpsilonGreedy to form a balanced unified dataset.

Data Distribution Tailoring

Red and blue points on a plane, with some red points and some blue points having a circle around them.

Fair $k$-Cover Coresets

The pipeline. An SDR image is made brighter and darker, then denoising is applied. The resultant three images are combined with Mertens' fusion.

Single Exposure Fusion