What is an Agent Enrichment?
Populus Population Enrichment, Explained (with a BMI example)
This post walks Data Scientists through how Populus enriches a synthetic population with new attributes—either by directly copying values or by sampling from statistical distributions.
What “enrichment” means in Populus
Enrichment adds one or more new attributes (e.g., BMI, income, program eligibility) to every agent in a synthetic population. You provide a CSV/JSON with the values or distribution parameters, tell Populus how to match rows to agents (the match columns), and Populus writes the resulting values back to the population.
The enrichment workflow at a glance
- Upload your source data
  Upload a CSV or JSON file for processing using POST /populations/{populationId}/enrichments.
  MVP: the file is ingested to a processing location. V1: it’s stored in S3 and tracked in the SynthPop DB.
- Inspect the columns Populus parsed
  Confirm the headers Populus detected with GET /enrichment-jobs/{importAgentDataJobId}/columns.
- Choose the assignment method
  - Direct Assignment: copy a value from your file to each matching agent.
  - Distributional Assignment: sample a value per agent from a distribution you specify (e.g., Normal).
- Map your columns
  - Specify the match columns (the keys that align rows in your file with agents, such as sex, race/ethnicity, state).
  - For Distributional Assignment, provide:
    - distr_family: a SciPy distribution name (e.g., "norm").
    - param_cols: a mapping from SciPy parameter names (e.g., loc, scale) to the column names in your file (e.g., bmi_mean, bmi_std).
- Preview & confirm
  The UI shows a preview of the mapping and grouped values so you can validate coverage before running. (See page 4.)
- Start processing
  Send your mapping configuration with PUT /enrichment-jobs/{importAgentDataJobId}/mappings, then monitor progress with GET /populations/jobs/{jobId}/status. (Endpoints and status check shown on page 4; the repo link there illustrates how distr_family is evaluated.)
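To make the "Map your columns" step concrete, here is a minimal sketch of the mapping as a plain Python structure. The helper name build_distributional_mapping is hypothetical, and the field names (method, match_columns, populus_attribute, csv_column) are assumptions taken from the example payload later in this post; your deployment's schema may differ.

```python
import json


def build_distributional_mapping(name, match_columns, distr_family, param_cols,
                                 description="", units=""):
    """Assemble a mapping configuration for a Distributional Assignment.

    Hypothetical helper: the field names mirror the example payload in this
    post and are assumptions about the real API schema.
    match_columns maps agent attribute name -> column name in the file.
    param_cols maps SciPy parameter name -> column name in the file.
    """
    if not match_columns:
        raise ValueError("at least one match column is required")
    if not param_cols:
        raise ValueError("param_cols must map at least one parameter")
    return {
        "attribute": {"name": name, "description": description, "units": units},
        "method": "distributional",
        "match_columns": [
            {"populus_attribute": agent_attr, "csv_column": csv_col}
            for agent_attr, csv_col in match_columns.items()
        ],
        "distr_family": distr_family,
        "param_cols": dict(param_cols),
    }


mapping = build_distributional_mapping(
    name="bmi",
    description="Body Mass Index",
    units="kg/m^2",
    match_columns={"sex": "gender", "ethnicity": "ethnicity", "state": "state"},
    distr_family="norm",
    param_cols={"loc": "bmi_mean", "scale": "bmi_std"},
)
print(json.dumps(mapping, indent=2))
```

Building the configuration in one place makes it easy to reject an empty set of match columns or parameters before anything is sent.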
The two assignment methods (when to use which)
Direct Assignment:
Use when your file already contains the final value to assign per matched row (e.g., “coverage = true”, “plan = Bronze”). You only specify the match columns and the source column to copy.
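As a rough illustration of what Direct Assignment amounts to, the sketch below copies a value onto each agent whose match key appears in the file. It is not Populus code: the function direct_assign, the in-memory agent dictionaries, and the sample rows (states and plans) are all invented for illustration.

```python
def direct_assign(agents, rows, match_columns, source_column, attribute):
    """Copy row[source_column] onto each agent whose key matches a row.

    match_columns maps agent attribute name -> column name in the file.
    Returns the number of agents that received a value.
    """
    agent_attrs = list(match_columns)
    file_cols = [match_columns[a] for a in agent_attrs]
    # Index the file rows by their match key for constant-time lookup.
    lookup = {tuple(row[c] for c in file_cols): row[source_column] for row in rows}
    matched = 0
    for agent in agents:
        key = tuple(agent[a] for a in agent_attrs)
        if key in lookup:
            agent[attribute] = lookup[key]
            matched += 1
    return matched


# Illustrative data only.
rows = [
    {"state": "IL", "plan": "Bronze"},
    {"state": "WI", "plan": "Silver"},
]
agents = [
    {"id": 1, "state": "IL"},
    {"id": 2, "state": "WI"},
    {"id": 3, "state": "MN"},  # no matching row, so no value is written
]
matched = direct_assign(agents, rows, {"state": "state"}, "plan", "plan")
print(matched, agents)
```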
Distributional Assignment:
Use when you want Populus to sample a value for each agent from a distribution conditioned on the match columns. You define a distribution family (the name of a distribution in SciPy's scipy.stats module) and tell Populus which CSV columns hold the distribution’s parameters (e.g., loc, scale). Populus then samples a value per agent from that group’s distribution and writes the new attribute.
Note: The document explicitly states that distr_family must be the name of a SciPy distribution and that param_cols maps SciPy parameter names (e.g., loc, scale) to your column names (e.g., mean, std_dev). The reference implementation linked in the doc shows how distr_family gets evaluated in code.
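One plausible way to turn distr_family and param_cols into samples is sketched below. This is an assumption about the general approach, not the reference implementation: sample_from_row is a hypothetical helper, it only accepts continuous scipy.stats distributions, and the lognorm column names (sigma, median) are made up.

```python
from scipy import stats


def sample_from_row(distr_family, param_cols, row, size=1, seed=None):
    """Draw samples from the SciPy distribution described by one file row.

    distr_family: name of a scipy.stats continuous distribution, e.g. "norm".
    param_cols: maps SciPy parameter names to column names in `row`.
    """
    dist = getattr(stats, distr_family, None)
    if not isinstance(dist, stats.rv_continuous):
        raise ValueError(f"unknown continuous distribution: {distr_family!r}")
    # Read each parameter value from the column it is mapped to.
    params = {param: float(row[col]) for param, col in param_cols.items()}
    # Freeze the distribution with those parameters, then sample.
    return dist(**params).rvs(size=size, random_state=seed)


# Normal: loc and scale come from the "mean" and "std_dev" columns.
row = {"mean": "25", "std_dev": "3.2"}
bmi = sample_from_row("norm", {"loc": "mean", "scale": "std_dev"}, row, size=5, seed=0)
print(bmi)

# lognorm also needs the shape parameter `s`, so it gets its own column.
ln_row = {"sigma": "0.12", "median": "25"}
ln = sample_from_row("lognorm", {"s": "sigma", "scale": "median"}, ln_row, size=5, seed=0)
print(ln)
```

Looking the name up with getattr and checking its type avoids evaluating arbitrary strings, and passing the parameters as keywords means any family works as long as param_cols names its required parameters.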
A concrete example: assigning BMI by sex, ethnicity, and state
Let’s enrich your population with BMI using a Normal distribution whose parameters vary by gender, ethnicity, and state. Below is the example CSV you’d upload:
gender,ethnicity,state,bmi_mean,bmi_std
m,0,IL,25,3.2
m,1,IL,26,2.1
m,2,IL,23,4.1
m,3,IL,27,2.1
f,0,IL,25,3.2
f,1,IL,26,2.1
f,2,IL,23,4.1
f,3,IL,27,2.1
How Populus uses this file
- Match columns
  In the mapping step, you’ll tell Populus which agent attributes align with your CSV keys. For example:
  - sex (Populus agent attribute) ← gender (your CSV)
  - ethnicity (or race, depending on your population schema) ← ethnicity
  - state ← state
  Tip: Ensure the codes/enumerations in your file (e.g., gender = m/f, ethnicity = 0–3) match the coding used in your population; otherwise, your rows won’t match any agents.
- Distribution family & parameters
  Choose Distributional Assignment with:
  - distr_family = "norm" (Normal distribution)
  - param_cols = { "loc": "bmi_mean", "scale": "bmi_std" }
  For every agent whose (sex, ethnicity, state) matches one of your CSV rows, Populus will sample BMI from Normal(loc=bmi_mean, scale=bmi_std) and write the result to the new attribute (e.g., bmi). This is precisely the pattern described for distributional assignments and SciPy parameters in the doc.
- Units & metadata
  Set the attribute name to bmi, description to something like “Body Mass Index,” and units to kg/m^2 in the mapping form (the UI includes Name/Description/Units fields).
- Preview
  The preview screen will summarize your groups and values so you can verify coverage before you run.
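The matching and sampling described above can be simulated locally to see what a coding mismatch does. This is a minimal sketch, not Populus internals: enrich_bmi and the agent records are hypothetical, and the standard library's random.Random.gauss stands in for scipy.stats.norm so the example has no dependencies.

```python
import csv
import io
import random

# Two rows taken from the example CSV above.
CSV_TEXT = """gender,ethnicity,state,bmi_mean,bmi_std
m,0,IL,25,3.2
f,1,IL,26,2.1
"""


def enrich_bmi(agents, csv_text, seed=0):
    """Sample a BMI for each agent whose (sex, ethnicity, state) has a row.

    Returns the ids of agents that matched no row.
    """
    rng = random.Random(seed)
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["gender"], row["ethnicity"], row["state"])
        groups[key] = (float(row["bmi_mean"]), float(row["bmi_std"]))

    unmatched = []
    for agent in agents:
        # CSV values are strings, so compare on strings.
        key = (agent["sex"], str(agent["ethnicity"]), agent["state"])
        if key in groups:
            loc, scale = groups[key]
            agent["bmi"] = rng.gauss(loc, scale)  # stand-in for Normal(loc, scale)
        else:
            unmatched.append(agent["id"])
    return unmatched


# Illustrative agents; agent 3 uses "male" where the file uses "m".
agents = [
    {"id": 1, "sex": "m", "ethnicity": 0, "state": "IL"},
    {"id": 2, "sex": "f", "ethnicity": 1, "state": "IL"},
    {"id": 3, "sex": "male", "ethnicity": 0, "state": "IL"},
]
unmatched = enrich_bmi(agents, CSV_TEXT)
print("unmatched agent ids:", unmatched)
```

Agent 3 receives no BMI because its sex code does not appear in the file, which is the failure mode the tip above warns about.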
Example API flow (MVP)
Below is a representative, high‑level sequence that mirrors the document’s endpoints. Exact payloads may vary with your deployment; the key idea is the same.
- Upload your CSV
  POST /populations/{populationId}/enrichments
  # multipart/form-data with `file=@bmi_il.csv`
  # → returns { "importAgentDataJobId": "...", ... }
- Confirm parsed columns
  GET /enrichment-jobs/{importAgentDataJobId}/columns
  # → ["gender","ethnicity","state","bmi_mean","bmi_std"]
- Send the mapping (Distributional Assignment)
  PUT /enrichment-jobs/{importAgentDataJobId}/mappings
  Content-Type: application/json
  {
    "attribute": {
      "name": "bmi",
      "description": "Body Mass Index",
      "units": "kg/m^2"
    },
    "method": "distributional",
    "match_columns": [
      { "populus_attribute": "sex", "csv_column": "gender" },
      { "populus_attribute": "ethnicity", "csv_column": "ethnicity" },
      { "populus_attribute": "state", "csv_column": "state" }
    ],
    "distr_family": "norm",
    "param_cols": { "loc": "bmi_mean", "scale": "bmi_std" }
  }
- Monitor the job
  GET /populations/jobs/{jobId}/status
  # → { "status": "running" | "succeeded" | "failed", ... }
These endpoints and the preview/status UX are outlined in the enrichment document. The doc also points to a reference implementation that evaluates distr_family using SciPy.
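For the monitoring step, a simple polling loop is usually enough. The sketch below assumes the status values shown above ("running", "succeeded", "failed"); wait_for_job is a hypothetical helper, and get_status is any callable you supply that performs the GET and returns the parsed JSON, so no particular HTTP client is assumed.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed"}


def wait_for_job(get_status, interval_s=2.0, max_polls=30, sleep=time.sleep):
    """Call get_status() until the job reaches a terminal status.

    get_status: callable returning a dict with a "status" key.
    Raises TimeoutError if the job is still running after max_polls calls.
    """
    for attempt in range(max_polls):
        status = get_status()["status"]
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"job still running after {max_polls} polls")


# Simulated responses stand in for real calls to the status endpoint.
responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "succeeded"},
])
final = wait_for_job(lambda: next(responses), interval_s=0, sleep=lambda s: None)
print(final)
```

Injecting get_status and sleep keeps the loop testable without a running service; in real use you would pass a function that requests GET /populations/jobs/{jobId}/status.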
Quality checks & tips
- Coverage: After preview, check that every (sex, ethnicity, state) combination in your population is covered by exactly one row in your CSV. Missing keys reduce match rates, and duplicate keys create ambiguity about which parameters apply.
- Parameter sanity: Ensure bmi_std > 0. Extreme standard deviations can produce unrealistic values.
- Distribution choice: Normal works for many biometrics; if you use other families (e.g., lognorm, beta), include any required shape parameters in your CSV and extend param_cols accordingly. (The system uses SciPy distribution names and parameters.)
- Attribute naming: Keep names consistent (bmi) and units explicit (kg/m^2) so downstream analyses are straightforward.
- Multiple attributes: The MVP UI runs one attribute per execution, but the API can accept multiple attribute definitions in a single run, which is useful once your UI supports it.
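The coverage and parameter checks can be run locally before uploading. This is a minimal sketch under stated assumptions: check_parameter_file is a hypothetical helper, the set of population keys is something you would derive from your own population, and the sample file deliberately contains a duplicate key and a zero standard deviation (values invented for illustration).

```python
import csv
import io
from collections import Counter


def check_parameter_file(csv_text, key_cols, scale_col, population_keys):
    """Report missing keys, duplicate keys, and non-positive scale values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
    return {
        # Population combinations with no row in the file.
        "missing": sorted(set(population_keys) - set(counts)),
        # Keys that appear on more than one row.
        "duplicates": sorted(k for k, n in counts.items() if n > 1),
        # Keys whose scale parameter is zero or negative.
        "bad_scale": sorted(
            {tuple(r[c] for c in key_cols) for r in rows if float(r[scale_col]) <= 0}
        ),
    }


# Deliberately flawed file for illustration.
FLAWED_CSV = """gender,ethnicity,state,bmi_mean,bmi_std
m,0,IL,25,3.2
m,0,IL,26,2.1
f,0,IL,25,0
"""
population_keys = {("m", "0", "IL"), ("f", "0", "IL"), ("f", "1", "IL")}

report = check_parameter_file(
    FLAWED_CSV, ["gender", "ethnicity", "state"], "bmi_std", population_keys
)
print(report)
```

An empty list under every heading means each population group has exactly one row with a usable standard deviation.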
Summary
- Direct Assignment: copy values by match keys.
- Distributional Assignment: sample from a SciPy distribution per matched group using distr_family and param_cols (e.g., loc = mean, scale = std).
- End-to-end: upload file → inspect columns → map method & parameters → preview → start job → poll status.
If you follow the BMI example, you’ll have a working pattern you can reuse for any continuous attribute you want to synthesize.