The Populus Population Data Format
What you get when you export a population from Populus
When you export a population from Epistemix’s Populus web service, you receive a tidy, analysis‑ready bundle of flat files that describe Agents, Places, and the Networks (links) that connect them. The bundle is designed to be easy to load in Python/R/SQL tools and to scale to large geographies. The spec below summarizes what to expect.
TL;DR (cheat sheet)
- Formats: CSV or Parquet. CSV files cap at 200 MB each; Parquet files cap at 500 MB each (files are automatically chunked).
- Tables you’ll get:
  - Agents (e.g., people and their attributes)
  - Places (e.g., households, schools, workplaces, group quarters)
  - Mappings (links): agent→place and place→place
- Schema style: wide tables, with each attribute as a column. Missing values are expected where an attribute doesn’t apply (e.g., a household_relationship for a person living in a dorm).
- Naming: standard patterns like agent-person_*.csv and agent-person_*.parquet with numeric chunk suffixes (e.g., _1, _2, …). CSVs may be county‑segmented; Parquet is size‑segmented only.
- Manifest: each export includes a manifest that describes what’s inside (schemas, generation date, etc.). Use it to confirm column names and dtypes for your load pipeline (a manifest-check sketch follows below).
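If you want your load pipeline to fail fast when a schema changes, read the manifest up front and compare it against the columns you expect. The snippet below is only a minimal sketch: the manifest’s file name (manifest.json) and its internal layout (a "tables" mapping with per-table "columns") are assumptions for illustration, not the documented format, so adapt it to whatever your export’s manifest actually contains.
import json
# Hypothetical manifest location and structure -- confirm against your export.
with open("/path/to/POPULATION_NAME/manifest.json") as f:
    manifest = json.load(f)
# Print the declared columns per table so you can sanity-check your loaders.
for table_name, table_info in manifest["tables"].items():
    print(table_name, list(table_info["columns"]))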
Entities & files
1) Agents
- What it is: one row per agent with all attributes as columns.
- File naming:
  - CSV: agent-[agent-type]_[county].csv (split by county and then chunked at 200 MB → …_[integer].csv). Example: agent-person_LA_County_1.csv, agent-person_LA_County_2.csv
  - Parquet: agent-[agent-type].parquet (chunked at 500 MB → …_[integer].parquet). Example: agent-person_1.parquet, agent-person_2.parquet
- Why wide? Easier to analyze directly in pandas/Polars/SQL, even if it introduces NULLs in columns that don’t apply to everyone. (Prior exports used “triples” like agent_id, attribute_name, attribute_value; a pivot sketch follows below.)
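For context on the “triples” remark: if you are migrating a pipeline that consumed the older long format, a wide table is just a pivot of those triples. A minimal pandas sketch, using made-up attribute names and values purely for illustration:
import pandas as pd
# Old-style "triples": one row per (agent, attribute) pair. Values are illustrative.
triples = pd.DataFrame({
    "agent_id": [1, 1, 2],
    "attribute_name": ["age", "sex", "age"],
    "attribute_value": [34, "F", 71],
})
# Wide form: one row per agent, one column per attribute.
# Agents that lack an attribute simply get a NULL (NaN) in that column.
wide = triples.pivot(index="agent_id", columns="attribute_name", values="attribute_value").reset_index()
print(wide)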
2) Places
- What it is: one row per place (household, school, workplace, etc.), with attributes as columns (lat/long/elevation, category‑specific fields, …). Missing values are fine where they don’t apply.
- Group quarters: by default, barracks, college dormitories, nursing homes, and prisons are merged into a single logical type group_quarters, and a place_type column distinguishes the specific subtype. This simplifies analysis and accelerates exports. (See the sketch after this list.)
- File naming:
  - CSV: place-[place-type]_[county].csv (200 MB chunking with _[integer] suffixes)
  - Parquet: place-[place-type].parquet (500 MB chunking with _[integer] suffixes)
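As a quick check on the group‑quarters convention, load the merged group_quarters Place table and look at its place_type column. A minimal sketch; the file name is inferred from the place-[place-type]_[county].csv pattern above rather than documented verbatim, so confirm it against your export:
import glob
import pandas as pd
# File name inferred from the naming pattern above -- confirm the exact spelling.
gq_files = glob.glob("/path/to/POPULATION_NAME/US/PA/Allegheny/place-group_quarters_Allegheny_*.csv")
group_quarters = pd.concat((pd.read_csv(f) for f in gq_files), ignore_index=True)
# place_type distinguishes barracks, college dormitories, nursing homes, and prisons.
print(group_quarters["place_type"].value_counts())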
3) Networks (links): agent→place & place→place
- What it is: lightweight link tables that let you join across entities.
  - Agent→Place: connects each person to, for example, their household, school, or workplace.
  - Place→Place: connects places to larger containers (e.g., block group to county).
- Group quarters links: in the default export, links associated with group quarters are grouped analogously to the place files. No place_type column appears in link files; you get that from the joined Place table.
- File naming:
  - CSV: agent-[agent-type]_to_place-[place-type]_[county].csv and place-[a]_to_place-[b]_[county].csv (200 MB chunking with _[integer] suffixes)
  - Parquet: agent-[agent-type]_to_place-[place-type].parquet and place-[a]_to_place-[b].parquet (500 MB chunking with _[integer] suffixes)
- Join keys: link tables contain the IDs needed to connect sources to targets (e.g., agent_id ↔ place_id). Use your manifest to confirm exact column names for your export. (A place→place join sketch follows below.)
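The agent→place join appears in the loading examples further down; place→place links work the same way. Here is a minimal sketch that counts block groups per containing county. The place-type spellings (block_group, county) and the ID column names (block_group_id, county_id) are assumptions based on the examples above, so check your manifest for the real ones.
import glob
import pandas as pd
base = "/path/to/POPULATION_NAME/US/PA/Allegheny"
# Hypothetical place-to-place link: block groups contained in counties.
# Exact place-type spellings and column names are assumptions -- check your manifest.
link_files = glob.glob(f"{base}/place-block_group_to_place-county_Allegheny_*.csv")
bg_to_county = pd.concat((pd.read_csv(f) for f in link_files), ignore_index=True)
# Count distinct block groups per containing county (assumed ID columns).
print(bg_to_county.groupby("county_id")["block_group_id"].nunique())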
Naming & partitioning rules
Topic | CSV | Parquet
---|---|---
Partitioning | Split by county, then chunked to ≤ 200 MB | Not county‑partitioned; chunked to ≤ 500 MB
Chunk suffix | Append _1, _2, … | Append _1, _2, …
Examples | agent-person_LA_County_1.csv, place-household_Allegheny_2.csv | agent-person_1.parquet, place-household_2.parquet
Why ≤500 MB Parquet files? Better parallel reads, predicate pushdown, and resilience in engines like Spark, Dask, DuckDB, and Athena—versus a few massive files. Smaller files also reduce reader memory pressure and play nicely with S3 concurrency.
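For example, a query engine can scan the chunked parts in parallel and push a filter down into the Parquet scan instead of reading everything. A minimal DuckDB sketch over the chunked parts; the age column in the WHERE clause is a made-up attribute for illustration:
import duckdb
# DuckDB reads all matching parts in parallel and pushes the WHERE clause into
# the Parquet scan. The [0-9] class keeps the pattern from also matching the
# agent-person_to_place-* link files, which share the prefix. "age" is illustrative.
con = duckdb.connect()
result = con.execute("""
    SELECT count(*) AS n
    FROM read_parquet('/path/to/POPULATION_NAME/US/PA/agent-person_[0-9]*.parquet')
    WHERE age >= 65
""").fetchdf()
print(result)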
Directory layout
Exports are delivered in a clear, hierarchical tree by population and geography. For CSV, you’ll see folders down to the county level; for Parquet, files are chunked by size and organized under the population/country/state structure (not by county).
POPULATION_NAME/
└── COUNTRY_ISO2/ (e.g., US)
└── STATE_CODE/ (e.g., PA)
└── COUNTY_NAME/ (CSV only; e.g., Allegheny)
├── agent-person_Allegheny_1.csv
├── place-household_Allegheny_1.csv
├── agent-person_to_place-household_Allegheny_1.csv
└── …
(Parquet exports appear at the relevant geography root with chunked files like agent-person_1.parquet, place-household_1.parquet, etc.)
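Because CSV exports are segmented by county, combining a whole state is just a matter of globbing one level higher in this tree. A minimal pandas sketch, assuming the US/PA layout shown above:
import glob
import pandas as pd
# Every county folder under PA contributes its own chunked agent files.
state_dir = "/path/to/POPULATION_NAME/US/PA"
files = sorted(
    f for f in glob.glob(f"{state_dir}/*/agent-person_*.csv")
    if "_to_place-" not in f  # skip link files that share the agent-person_ prefix
)
agents_pa = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(len(agents_pa), "agents across", len(files), "files")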
Loading the data (Python examples)
Replace column names with the ones in your manifest if they differ (e.g., agent_id, place_id).
CSV (county‑segmented & chunked)
import glob
import pandas as pd
base = "/path/to/POPULATION_NAME/US/PA/Allegheny"
# Agents (concatenate chunks)
agent_files = glob.glob(f"{base}/agent-person_Allegheny_*.csv")
agents = pd.concat((pd.read_csv(f) for f in agent_files), ignore_index=True)
# Places (households)
place_files = glob.glob(f"{base}/place-household_Allegheny_*.csv")
households = pd.concat((pd.read_csv(f) for f in place_files), ignore_index=True)
# Links: person -> household
link_files = glob.glob(f"{base}/agent-person_to_place-household_Allegheny_*.csv")
links = pd.concat((pd.read_csv(f) for f in link_files), ignore_index=True)
# Join example: attach household attributes to people
people_with_households = (
agents
.merge(links, on="agent_id", how="left")
.merge(households, on="place_id", how="left", suffixes=("", "_household"))
)
Parquet (size‑chunked)
import glob
import pyarrow.dataset as ds
# Parquet parts sit at the geography root (no county folders), so collect the
# chunked files for each table. The [0-9] class keeps the agent table pattern
# from also matching the agent-person_to_place-* link files that share the prefix.
base = "/path/to/POPULATION_NAME/US/PA"
agents_ds = ds.dataset(sorted(glob.glob(f"{base}/agent-person_[0-9]*.parquet")), format="parquet")
places_ds = ds.dataset(sorted(glob.glob(f"{base}/place-household_[0-9]*.parquet")), format="parquet")
links_ds = ds.dataset(sorted(glob.glob(f"{base}/agent-person_to_place-household_[0-9]*.parquet")), format="parquet")
agents = agents_ds.to_table().to_pandas()
households = places_ds.to_table().to_pandas()
links = links_ds.to_table().to_pandas()
people_with_households = (
agents
.merge(links, on="agent_id", how="left")
.merge(households, on="place_id", how="left", suffixes=("", "_household"))
)
Tip: Parquet chunking is intentional—engines will happily read a directory of parts in parallel for faster scans and more effective predicate pushdown.
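If you only need a slice of a large table, you can let pyarrow apply a filter during the scan instead of materializing everything first. A minimal sketch reusing agents_ds from the example above; the sex column and its value are made-up for illustration, so substitute a real column from your manifest:
import pyarrow.dataset as ds
# Only rows matching the filter are materialized; non-matching row groups can be
# skipped at scan time. "sex" is an illustrative column name -- use your manifest.
subset = agents_ds.to_table(filter=ds.field("sex") == "F").to_pandas()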
Group quarters: what to expect
- In Place files, group‑quarters subtypes (barracks, dorms, nursing homes, prisons) are combined under group_quarters with a place_type column for the specific subtype.
- In link files, there’s no place_type column; you join to the Place table to get that detail.
This reduces file count and round‑trips while keeping analysis straightforward.
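Concretely, recovering the subtype of each agent’s group‑quarters residence is one extra join. A minimal Parquet sketch; the group_quarters file names are inferred from the naming patterns above, so confirm them (and the join key columns) against your manifest:
import glob
import pandas as pd
import pyarrow.dataset as ds
base = "/path/to/POPULATION_NAME/US/PA"
# Inferred file names: place-group_quarters_*.parquet and
# agent-person_to_place-group_quarters_*.parquet -- check your export.
gq_places = ds.dataset(sorted(glob.glob(f"{base}/place-group_quarters_[0-9]*.parquet")), format="parquet").to_table().to_pandas()
gq_links = ds.dataset(sorted(glob.glob(f"{base}/agent-person_to_place-group_quarters_[0-9]*.parquet")), format="parquet").to_table().to_pandas()
# The link file carries no place_type; it arrives via the joined Place table.
agents_in_gq = gq_links.merge(gq_places, on="place_id", how="left")
print(agents_in_gq["place_type"].value_counts())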
Programmatic downloads (heads‑up)
Exports are generated asynchronously; when a job is done, Populus returns one or more download URLs (in the MVP, typically one URL). Teams often download directly from S3 in code (e.g., boto3) or via CLI. See the “Download Ready” panel in the UI for a copy‑paste snippet when your export finishes.
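When the export lands in S3, a direct download in code looks like the sketch below. The bucket name and key are placeholders; substitute the values from the download URL Populus gives you when the job finishes.
import boto3
# Placeholders -- take the real bucket/key from your export's download URL.
bucket = "example-populus-exports"
key = "POPULATION_NAME/US/PA/agent-person_1.parquet"
s3 = boto3.client("s3")
s3.download_file(bucket, key, "agent-person_1.parquet")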
FAQ
Why are there NULLs in some columns?
Because the export uses wide tables; attributes that don’t apply to a specific row (e.g., household_relationship for dorm residents) are simply left empty. That keeps the data easy to read and analyze directly.
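A quick way to see which attributes apply to which agents is to look at per‑column missingness after loading. A minimal sketch using the agents frame from the loading examples:
# Share of missing values per column, sorted; columns that only apply to a
# subset of agents (e.g., household_relationship) will show up near the top.
missing_share = agents.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))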
Are column names and types fixed?
Populus supports extensible schemas. Always check the included manifest for the precise set of fields and dtypes in your export.
Why not one big Parquet file?
Multiple parts of roughly 500 MB or less improve read parallelism, data skipping, memory use, and resilience, especially in object stores like S3.
Summary
Populus exports deliver exactly what most analytics stacks want: three families of flat tables (Agents, Places, Links) with clear naming, predictable partitioning, and schemas suited to pandas/Polars/Spark/DuckDB. Load the pieces you need, join via IDs from the link tables, and rely on the manifest to keep your pipeline robust across populations and releases.