What is an Agent Enrichment?

This post walks data scientists through how Populus enriches a synthetic population with new attributes, either by copying values directly or by sampling from statistical distributions.

What “enrichment” means in Populus

Enrichment adds one or more new attributes (e.g., BMI, income, program eligibility) to every agent in a synthetic population. You provide a CSV/JSON with the values or distribution parameters, tell Populus how to match rows to agents (the match columns), and Populus writes the resulting values back to the population.

The enrichment workflow at a glance

Upload your source data
Upload a CSV or JSON file for processing using
POST /populations/{populationId}/enrichments.
MVP: the file is ingested to a processing location. V1: it’s stored in S3 and tracked in the SynthPop DB.
Inspect the columns Populus parsed
Confirm the headers Populus detected with
GET /enrichment-jobs/{importAgentDataJobId}/columns.
Choose the assignment method
- Direct Assignment: copy a value from your file to each matching agent.
- Distributional Assignment: sample a value per agent from a distribution you specify (e.g., Normal).
Map your columns
- Specify the match columns (the keys that align rows in your file with agents, such as sex, race/ethnicity, state).
- For Distributional Assignment, provide:
  - distr_family: a SciPy distribution name (e.g., "norm").
  - param_cols: a mapping from SciPy parameter names (e.g., loc, scale) to the column names in your file (e.g., bmi_mean, bmi_std).
Preview & confirm
The UI shows a preview of the mapping and grouped values so you can validate coverage before running. (See page 4.)
Start processing
Send your mapping configuration with
PUT /enrichment-jobs/{importAgentDataJobId}/mappings, then monitor progress with
GET /populations/jobs/{jobId}/status. (Endpoints and status check shown on page 4; the repo link there illustrates how distr_family is evaluated.)

The two assignment methods (when to use which)

Direct Assignment:
Use when your file already contains the final value to assign per matched row (e.g., “coverage = true”, “plan = Bronze”). You only specify the match columns and the source column to copy.
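To make the idea concrete, here is a minimal Python sketch of what Direct Assignment amounts to. It is an illustration only, not Populus's implementation: the direct_assign helper, the agent dictionaries, and the plan column are all hypothetical.

```python
def direct_assign(agents, rows, match_columns, source_column, attribute):
    """Copy row[source_column] onto every agent whose match keys equal that row's keys.

    match_columns maps agent attribute name -> file column name (hypothetical shape).
    """
    # Index the file rows by their key tuple so each agent needs one lookup.
    lookup = {
        tuple(row[col] for col in match_columns.values()): row[source_column]
        for row in rows
    }
    for agent in agents:
        key = tuple(agent[attr] for attr in match_columns)
        if key in lookup:  # unmatched agents are left untouched in this sketch
            agent[attribute] = lookup[key]
    return agents


rows = [
    {"gender": "m", "state": "IL", "plan": "Bronze"},
    {"gender": "f", "state": "IL", "plan": "Silver"},
]
agents = [
    {"id": 1, "sex": "f", "state": "IL"},
    {"id": 2, "sex": "m", "state": "IL"},
    {"id": 3, "sex": "m", "state": "WI"},  # no matching row in the file
]
direct_assign(agents, rows, {"sex": "gender", "state": "state"}, "plan", "plan")
```

In this sketch, agents 1 and 2 receive the plan from their matching row, while agent 3 stays unchanged because no row covers (m, WI).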

Distributional Assignment:
Use when you want Populus to sample a value for each agent from a distribution conditioned on the match columns. You define a distribution family (the name of a distribution in scipy.stats, such as "norm") and tell Populus which CSV columns hold the distribution’s parameters (e.g., loc, scale). Populus then samples a value per agent from that group’s distribution and writes it to the new attribute.
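The sampling step can be sketched in a few lines of Python with SciPy. This is a rough illustration under stated assumptions, not the Populus reference implementation: the sample_attribute helper and the data shapes are hypothetical, and resolving distr_family with a plain getattr lookup on scipy.stats is an assumption about how the name might be evaluated.

```python
import numpy as np
from scipy import stats


def sample_attribute(agents, rows, match_columns, distr_family, param_cols,
                     attribute, seed=0):
    """Sample one value per matched agent from that agent's group distribution."""
    # Resolve the family by name, e.g. "norm" -> scipy.stats.norm.
    # (Assumption: a plain attribute lookup; the reference code may differ.)
    family = getattr(stats, distr_family)

    # Build one frozen distribution per group of match-column values.
    groups = {}
    for row in rows:
        key = tuple(row[col] for col in match_columns.values())
        params = {name: float(row[col]) for name, col in param_cols.items()}
        groups[key] = family(**params)

    rng = np.random.default_rng(seed)  # one generator shared by all draws
    for agent in agents:
        key = tuple(agent[attr] for attr in match_columns)
        dist = groups.get(key)
        if dist is not None:  # unmatched agents are skipped in this sketch
            agent[attribute] = float(dist.rvs(random_state=rng))
    return agents


rows = [
    {"gender": "m", "state": "IL", "bmi_mean": "25", "bmi_std": "3.2"},
    {"gender": "f", "state": "IL", "bmi_mean": "26", "bmi_std": "2.1"},
]
agents = [
    {"id": 1, "sex": "m", "state": "IL"},
    {"id": 2, "sex": "f", "state": "IL"},
    {"id": 3, "sex": "f", "state": "WI"},  # no matching row
]
sample_attribute(
    agents, rows,
    match_columns={"sex": "gender", "state": "state"},
    distr_family="norm",
    param_cols={"loc": "bmi_mean", "scale": "bmi_std"},
    attribute="bmi",
)
```

Because param_cols is passed through as keyword arguments, the same sketch would work for families with shape parameters (for example "lognorm" with an s entry), provided the file carries a column for each one.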

Note: The document explicitly states that distr_family must be the name of a SciPy distribution and that param_cols maps SciPy parameter names (e.g., loc, scale) to your column names (e.g., mean, std_dev). The reference implementation linked in the doc shows how distr_family gets evaluated in code.

A concrete example: assigning BMI by sex, ethnicity, and state

Let’s enrich your population with BMI using a Normal distribution whose parameters vary by gender, ethnicity, and state. Below is the example CSV you’d upload:

gender,ethnicity,state,bmi_mean,bmi_std
m,0,IL,25,3.2
m,1,IL,26,2.1
m,2,IL,23,4.1
m,3,IL,27,2.1
f,0,IL,25,3.2
f,1,IL,26,2.1
f,2,IL,23,4.1
f,3,IL,27,2.1

How Populus uses this file

Match columns
In the mapping step, you’ll tell Populus which agent attributes align with your CSV keys. For example:
- sex (Populus agent attribute) ← gender (your CSV)
- ethnicity (or race, depending on your population schema) ← ethnicity
- state ← state
Tip: Ensure the codes/enumerations in your file (e.g., gender = m/f, ethnicity = 0–3) match the coding used in your population; otherwise, your rows won’t match any agents.
Distribution family & parameters
Choose Distributional Assignment with:
- distr_family = "norm" (Normal distribution)
- param_cols = { "loc": "bmi_mean", "scale": "bmi_std" }
For every agent whose (sex, ethnicity, state) matches one of your CSV rows, Populus will sample BMI from Normal(loc=bmi_mean, scale=bmi_std) and write the result to the new attribute (e.g., bmi). This is precisely the pattern described for distributional assignments and SciPy parameters in the doc.
Units & metadata
Set the attribute name to bmi, description to something like “Body Mass Index,” and units to kg/m^2 in the mapping form (the UI includes Name/Description/Units fields).
Preview
The preview screen will summarize your groups and values so you can verify coverage before you run.
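If you want a rough offline analogue of that preview before uploading, the sketch below computes the central 95% range each CSV row implies. It uses Python's standard-library statistics.NormalDist rather than SciPy, purely to keep the example dependency-free; the implied_ranges helper is hypothetical and is not part of Populus.

```python
import csv
import io
from statistics import NormalDist

CSV_TEXT = """gender,ethnicity,state,bmi_mean,bmi_std
m,0,IL,25,3.2
m,1,IL,26,2.1
m,2,IL,23,4.1
m,3,IL,27,2.1
"""


def implied_ranges(csv_text, loc_col="bmi_mean", scale_col="bmi_std",
                   coverage=0.95):
    """Return (gender, ethnicity, state, low, high) per row, where [low, high]
    is the central `coverage` interval of Normal(loc, scale)."""
    tail = (1 - coverage) / 2
    ranges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        dist = NormalDist(mu=float(row[loc_col]), sigma=float(row[scale_col]))
        low, high = dist.inv_cdf(tail), dist.inv_cdf(1 - tail)
        ranges.append((row["gender"], row["ethnicity"], row["state"],
                       round(low, 1), round(high, 1)))
    return ranges


ranges = implied_ranges(CSV_TEXT)
for group in ranges:
    print(group)
```

For the first row (mean 25, standard deviation 3.2) this gives roughly 18.7 to 31.3, a quick way to spot a group whose parameters would generate implausible BMI values.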

Example API flow (MVP)

Below is a representative, high‑level sequence that mirrors the document’s endpoints. Exact payloads may vary with your deployment; the key idea is the same.

Upload your CSV

POST /populations/{populationId}/enrichments
# multipart/form-data with `file=@bmi_il.csv`
# → returns { "importAgentDataJobId": "...", ... }

Confirm parsed columns

GET /enrichment-jobs/{importAgentDataJobId}/columns
# → ["gender","ethnicity","state","bmi_mean","bmi_std"]

Send the mapping (Distributional Assignment)

PUT /enrichment-jobs/{importAgentDataJobId}/mappings
Content-Type: application/json

{
  "attribute": {
    "name": "bmi",
    "description": "Body Mass Index",
    "units": "kg/m^2"
  },
  "method": "distributional",
  "match_columns": [
    { "populus_attribute": "sex", "csv_column": "gender" },
    { "populus_attribute": "ethnicity", "csv_column": "ethnicity" },
    { "populus_attribute": "state", "csv_column": "state" }
  ],
  "distr_family": "norm",
  "param_cols": { "loc": "bmi_mean", "scale": "bmi_std" }
}

Monitor the job

GET /populations/jobs/{jobId}/status
# → { "status": "running" | "succeeded" | "failed", ... }

These endpoints and the preview/status UX are outlined in the enrichment document. The doc also points to a reference implementation that evaluates distr_family using SciPy.
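On the client side, two small helpers can make this flow less error-prone: one that checks the mapping against the parsed columns before you send it, and one that polls the status endpoint until the job finishes. The sketch below assumes the payload shape and status values shown above; build_mapping, wait_for_job, and the fetch_status callable (which would wrap your own GET request) are hypothetical names, not part of any Populus client.

```python
import time


def build_mapping(columns, attribute, match_columns, distr_family, param_cols):
    """Assemble the mappings body, first checking every referenced file column."""
    referenced = [m["csv_column"] for m in match_columns] + list(param_cols.values())
    missing = sorted(set(referenced) - set(columns))
    if missing:
        raise ValueError(f"columns not found in uploaded file: {missing}")
    return {
        "attribute": attribute,
        "method": "distributional",
        "match_columns": match_columns,
        "distr_family": distr_family,
        "param_cols": param_cols,
    }


def wait_for_job(fetch_status, interval_s=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the job is no longer "running" and return the final status.

    fetch_status is any callable returning the status dict from the API.
    """
    for _ in range(max_polls):
        status = fetch_status()["status"]
        if status != "running":
            return status
        sleep(interval_s)
    raise TimeoutError("job still running after max_polls checks")


columns = ["gender", "ethnicity", "state", "bmi_mean", "bmi_std"]
payload = build_mapping(
    columns,
    attribute={"name": "bmi", "description": "Body Mass Index", "units": "kg/m^2"},
    match_columns=[
        {"populus_attribute": "sex", "csv_column": "gender"},
        {"populus_attribute": "ethnicity", "csv_column": "ethnicity"},
        {"populus_attribute": "state", "csv_column": "state"},
    ],
    distr_family="norm",
    param_cols={"loc": "bmi_mean", "scale": "bmi_std"},
)
```

Validating against the columns returned by the columns endpoint catches typos such as bmi_sd before a job is started, and capping the number of polls keeps a stuck job from blocking a script indefinitely.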

Quality checks & tips

Coverage: After preview, check that every (sex, ethnicity, state) combination in your population is covered by exactly one row in your CSV. Missing keys leave agents unmatched and reduce match rates; duplicate keys make it ambiguous which row applies.
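Parameter sanity: Ensure bmi_std > 0 in every row. A Normal distribution is unbounded, so a standard deviation that is large relative to the mean can produce unrealistic (even negative) BMI values.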
Distribution choice: Normal works for many biometrics; if you use other families (e.g., lognorm, beta), include any required shape parameters in your CSV and extend param_cols accordingly. (The system uses SciPy distribution names and parameters.)
Attribute naming: Keep names consistent (bmi) and units explicit (kg/m^2) so downstream analyses are straightforward.
Multiple attributes: The MVP UI runs one attribute per execution, but the API can accept multiple attribute definitions in a single run, which becomes useful once your UI supports it.
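The coverage and parameter checks above are easy to run locally before uploading. The following is a minimal sketch using only the Python standard library; the check_enrichment_file helper and the agent dictionaries are hypothetical, and it assumes agent values compare equal to the CSV values once converted to strings.

```python
import csv
import io
from collections import Counter


def check_enrichment_file(csv_text, agents, match_columns, scale_col):
    """Report duplicate keys, agent groups with no row, and non-positive scales.

    match_columns maps agent attribute name -> file column name.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def row_key(row):
        return tuple(row[col] for col in match_columns.values())

    key_counts = Counter(row_key(row) for row in rows)
    # CSV values are strings, so agent values are stringified before comparing.
    agent_keys = {tuple(str(a[attr]) for attr in match_columns) for a in agents}
    return {
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        "uncovered_agent_keys": sorted(agent_keys - set(key_counts)),
        "nonpositive_scale": sorted(
            row_key(row) for row in rows if float(row[scale_col]) <= 0
        ),
    }


csv_text = """gender,ethnicity,state,bmi_mean,bmi_std
m,0,IL,25,3.2
m,0,IL,26,2.1
f,0,IL,25,0
"""
agents = [
    {"sex": "m", "ethnicity": 0, "state": "IL"},
    {"sex": "f", "ethnicity": 1, "state": "IL"},
]
report = check_enrichment_file(
    csv_text, agents,
    match_columns={"sex": "gender", "ethnicity": "ethnicity", "state": "state"},
    scale_col="bmi_std",
)
```

On this deliberately flawed sample, the report flags (m, 0, IL) as a duplicate key, (f, 1, IL) as an agent group with no row, and (f, 0, IL) as a row whose standard deviation is not positive.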

Summary

Direct Assignment: copy values by match keys.
Distributional Assignment: sample from a SciPy distribution per matched group using distr_family and param_cols (e.g., loc = mean, scale = std).
End-to-end: upload file → inspect columns → map method & parameters → preview → start job → poll status.

If you follow the BMI example, you’ll have a working pattern you can reuse for any continuous attribute you want to synthesize.