The Populus Population Data Format
What you get when you export a population from Populus
When you export a population from Epistemix’s Populus web service, you receive a tidy, analysis‑ready bundle of flat files that describe Agents, Places, and the Networks (links) that connect them. The bundle is designed to be easy to load in Python/R/SQL tools and to scale to large geographies. The spec below summarizes what to expect.
TL;DR (cheat sheet)
- Formats: CSV or Parquet. CSV files cap at 200 MB each; Parquet files cap at 500 MB each (files are automatically chunked).
- Tables you’ll get:
  - Agents (e.g., people and their attributes)
  - Places (e.g., households, schools, workplaces, group quarters)
  - Mappings (links): agent→place and place→place
- Schema style: wide tables, with each attribute as a column. Missing values are expected where an attribute doesn’t apply (e.g., a household_relationship for a person living in a dorm).
- Naming: standard patterns like agent-person_*.csv and agent-person_*.parquet with numeric chunk suffixes (e.g., _1, _2, …). CSVs may be county‑segmented; Parquet is size‑segmented only.
- Manifest: each export includes a manifest that describes what’s inside (schemas, generation date, etc.). Use it to confirm column names and dtypes for your load pipeline (a manifest-check sketch follows below).
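If you want your load pipeline to fail fast when a schema changes, read the manifest up front and compare it against the columns you expect. The snippet below is only a minimal sketch: the manifest’s file name (manifest.json) and its internal layout (a "tables" mapping with per-table "columns") are assumptions for illustration, not the documented format, so adapt it to whatever your export’s manifest actually contains.
import json
# Hypothetical manifest location and structure -- confirm against your export.
with open("/path/to/POPULATION_NAME/manifest.json") as f:
    manifest = json.load(f)
# Print the declared columns per table so you can sanity-check your loaders.
for table_name, table_info in manifest["tables"].items():
    print(table_name, list(table_info["columns"]))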
Entities & files
1) Agents
- What it is: one row per agent with all attributes as columns.
- File naming:
  - CSV: agent-[agent-type]_[county].csv (split by county and then chunked at 200 MB → …_[integer].csv). Example: agent-person_LA_County_1.csv, agent-person_LA_County_2.csv
  - Parquet: agent-[agent-type].parquet (chunked at 500 MB → …_[integer].parquet). Example: agent-person_1.parquet, agent-person_2.parquet
- Why wide? Easier to analyze directly in pandas/Polars/SQL, even if it introduces NULLs in columns that don’t apply to everyone. (Prior exports used “triples” like agent_id, attribute_name, attribute_value; a pivot sketch follows below.)
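For context on the “triples” remark: if you are migrating a pipeline that consumed the older long format, a wide table is just a pivot of those triples. A minimal pandas sketch, using made-up attribute names and values purely for illustration:
import pandas as pd
# Old-style "triples": one row per (agent, attribute) pair. Values are illustrative.
triples = pd.DataFrame({
    "agent_id": [1, 1, 2],
    "attribute_name": ["age", "sex", "age"],
    "attribute_value": [34, "F", 71],
})
# Wide form: one row per agent, one column per attribute.
# Agents that lack an attribute simply get a NULL (NaN) in that column.
wide = triples.pivot(index="agent_id", columns="attribute_name", values="attribute_value").reset_index()
print(wide)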
2) Places
- What it is: one row per place (household, school, workplace, etc.), with attributes as columns (lat/long/elevation, category‑specific fields, …). Missing values are fine where they don’t apply.
- Group quarters: by default, barracks, college dormitories, nursing homes, and prisons are merged into a single logical type group_quarters, and a place_type column distinguishes the specific subtype. This simplifies analysis and accelerates exports. (See the sketch after this list.)
- File naming:
  - CSV: place-[place-type]_[county].csv (200 MB chunking with _[integer] suffixes)
  - Parquet: place-[place-type].parquet (500 MB chunking with _[integer] suffixes)
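As a quick check on the group‑quarters convention, load the merged group_quarters Place table and look at its place_type column. A minimal sketch; the file name is inferred from the place-[place-type]_[county].csv pattern above rather than documented verbatim, so confirm it against your export:
import glob
import pandas as pd
# File name inferred from the naming pattern above -- confirm the exact spelling.
gq_files = glob.glob("/path/to/POPULATION_NAME/US/PA/Allegheny/place-group_quarters_Allegheny_*.csv")
group_quarters = pd.concat((pd.read_csv(f) for f in gq_files), ignore_index=True)
# place_type distinguishes barracks, college dormitories, nursing homes, and prisons.
print(group_quarters["place_type"].value_counts())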
3) Networks (links): agent→place & place→place
- What it is: lightweight link tables that let you join across entities.
  - Agent→Place: connects each person to, for example, their household, school, or workplace.
  - Place→Place: connects places to larger containers (e.g., block group to county).
- Group quarters links: in the default export, links associated with group quarters are grouped analogously to the place files. No place_type column appears in link files; you get that from the joined Place table.
- File naming:
  - CSV: agent-[agent-type]_to_place-[place-type]_[county].csv and place-[a]_to_place-[b]_[county].csv (200 MB chunking with _[integer] suffixes)
  - Parquet: agent-[agent-type]_to_place-[place-type].parquet and place-[a]_to_place-[b].parquet (500 MB chunking with _[integer] suffixes)
- Join keys: link tables contain the IDs needed to connect sources to targets (e.g., agent_id ↔ place_id). Use your manifest to confirm exact column names for your export. (A place→place join sketch follows below.)
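The agent→place join appears in the loading examples further down; place→place links work the same way. Here is a minimal sketch that counts block groups per containing county. The place-type spellings (block_group, county) and the ID column names (block_group_id, county_id) are assumptions based on the examples above, so check your manifest for the real ones.
import glob
import pandas as pd
base = "/path/to/POPULATION_NAME/US/PA/Allegheny"
# Hypothetical place-to-place link: block groups contained in counties.
# Exact place-type spellings and column names are assumptions -- check your manifest.
link_files = glob.glob(f"{base}/place-block_group_to_place-county_Allegheny_*.csv")
bg_to_county = pd.concat((pd.read_csv(f) for f in link_files), ignore_index=True)
# Count distinct block groups per containing county (assumed ID columns).
print(bg_to_county.groupby("county_id")["block_group_id"].nunique())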
Naming & partitioning rules
Topic | CSV | Parquet
---|---|---
Partitioning | Split by county, then chunked to ≤ 200 MB | Not county‑partitioned; chunked to ≤ 500 MB
Chunk suffix | Append _1, _2, … | Append _1, _2, …
Examples | agent-person_LA_County_1.csv, place-household_Allegheny_2.csv | agent-person_1.parquet, place-household_2.parquet
Why ≤500 MB Parquet files? Better parallel reads, predicate pushdown, and resilience in engines like Spark, Dask, DuckDB, and Athena—versus a few massive files. Smaller files also reduce reader memory pressure and play nicely with S3 concurrency.
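For example, a query engine can scan the chunked parts in parallel and push a filter down into the Parquet scan instead of reading everything. A minimal DuckDB sketch over the chunked parts; the age column in the WHERE clause is a made-up attribute for illustration:
import duckdb
# DuckDB reads all matching parts in parallel and pushes the WHERE clause into
# the Parquet scan. The [0-9] class keeps the pattern from also matching the
# agent-person_to_place-* link files, which share the prefix. "age" is illustrative.
con = duckdb.connect()
result = con.execute("""
    SELECT count(*) AS n
    FROM read_parquet('/path/to/POPULATION_NAME/US/PA/agent-person_[0-9]*.parquet')
    WHERE age >= 65
""").fetchdf()
print(result)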
Directory layout
Exports are delivered in a clear, hierarchical tree by population and geography. For CSV, you’ll see folders down to the county level; for Parquet, files are chunked by size and organized under the population/country/state structure (not by county).
POPULATION_NAME/
└── COUNTRY_ISO2/ (e.g., US)
└── STATE_CODE/ (e.g., PA)
└── COUNTY_NAME/ (CSV only; e.g., Allegheny)
├── agent-person_Allegheny_1.csv
├── place-household_Allegheny_1.csv
├── agent-person_to_place-household_Allegheny_1.csv
└── …
(Parquet exports appear at the relevant geography root with chunked files like agent-person_1.parquet, place-household_1.parquet, etc.)
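Because CSV exports are segmented by county, combining a whole state is just a matter of globbing one level higher in this tree. A minimal pandas sketch, assuming the US/PA layout shown above:
import glob
import pandas as pd
# Every county folder under PA contributes its own chunked agent files.
state_dir = "/path/to/POPULATION_NAME/US/PA"
files = sorted(
    f for f in glob.glob(f"{state_dir}/*/agent-person_*.csv")
    if "_to_place-" not in f  # skip link files that share the agent-person_ prefix
)
agents_pa = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(len(agents_pa), "agents across", len(files), "files")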
Loading the data (Python examples)
Replace column names with the ones in your manifest if they differ (e.g., agent_id, place_id).
CSV (county‑segmented & chunked)
import glob
import pandas as pd
base = "/path/to/POPULATION_NAME/US/PA/Allegheny"
# Agents (concatenate chunks)
agent_files = glob.glob(f"{base}/agent-person_Allegheny_*.csv")
agents = pd.concat((pd.read_csv(f) for f in agent_files), ignore_index=True)
# Places (households)
place_files = glob.glob(f"{base}/place-household_Allegheny_*.csv")
households = pd.concat((pd.read_csv(f) for f in place_files), ignore_index=True)
# Links: person -> household
link_files = glob.glob(f"{base}/agent-person_to_place-household_Allegheny_*.csv")
links = pd.concat((pd.read_csv(f) for f in link_files), ignore_index=True)
# Join example: attach household attributes to people
people_with_households = (
agents
.merge(links, on="agent_id", how="left")
.merge(households, on="place_id", how="left", suffixes=("", "_household"))
)
Parquet (size‑chunked)
import glob
import pyarrow.dataset as ds
# Parquet parts sit at the geography root (no county folders), so collect the
# chunked files for each table. The [0-9] class keeps the agent table pattern
# from also matching the agent-person_to_place-* link files that share the prefix.
base = "/path/to/POPULATION_NAME/US/PA"
agents_ds = ds.dataset(sorted(glob.glob(f"{base}/agent-person_[0-9]*.parquet")), format="parquet")
places_ds = ds.dataset(sorted(glob.glob(f"{base}/place-household_[0-9]*.parquet")), format="parquet")
links_ds = ds.dataset(sorted(glob.glob(f"{base}/agent-person_to_place-household_[0-9]*.parquet")), format="parquet")
agents = agents_ds.to_table().to_pandas()
households = places_ds.to_table().to_pandas()
links = links_ds.to_table().to_pandas()
people_with_households = (
agents
.merge(links, on="agent_id", how="left")
.merge(households, on="place_id", how="left", suffixes=("", "_household"))
)
Tip: Parquet chunking is intentional—engines will happily read a directory of parts in parallel for faster scans and more effective predicate pushdown.
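If you only need a slice of a large table, you can let pyarrow apply a filter during the scan instead of materializing everything first. A minimal sketch reusing agents_ds from the example above; the sex column and its value are made-up for illustration, so substitute a real column from your manifest:
import pyarrow.dataset as ds
# Only rows matching the filter are materialized; non-matching row groups can be
# skipped at scan time. "sex" is an illustrative column name -- use your manifest.
subset = agents_ds.to_table(filter=ds.field("sex") == "F").to_pandas()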
Group quarters: what to expect
- In Place files, group‑quarters subtypes (barracks, dorms, nursing homes, prisons) are combined under group_quarters with a place_type column for the specific subtype.
- In link files, there’s no place_type column; you join to the Place table to get that detail.
This reduces file count and round‑trips while keeping analysis straightforward.
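Concretely, recovering the subtype of each agent’s group‑quarters residence is one extra join. A minimal Parquet sketch; the group_quarters file names are inferred from the naming patterns above, so confirm them (and the join key columns) against your manifest:
import glob
import pandas as pd
import pyarrow.dataset as ds
base = "/path/to/POPULATION_NAME/US/PA"
# Inferred file names: place-group_quarters_*.parquet and
# agent-person_to_place-group_quarters_*.parquet -- check your export.
gq_places = ds.dataset(sorted(glob.glob(f"{base}/place-group_quarters_[0-9]*.parquet")), format="parquet").to_table().to_pandas()
gq_links = ds.dataset(sorted(glob.glob(f"{base}/agent-person_to_place-group_quarters_[0-9]*.parquet")), format="parquet").to_table().to_pandas()
# The link file carries no place_type; it arrives via the joined Place table.
agents_in_gq = gq_links.merge(gq_places, on="place_id", how="left")
print(agents_in_gq["place_type"].value_counts())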
Programmatic downloads (heads‑up)
Exports are generated asynchronously; when a job is done, Populus returns one or more download URLs (in the MVP, typically one URL). Teams often download directly from S3 in code (e.g., boto3) or via CLI. See the “Download Ready” panel in the UI for a copy‑paste snippet when your export finishes.
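When the export lands in S3, a direct download in code looks like the sketch below. The bucket name and key are placeholders; substitute the values from the download URL Populus gives you when the job finishes.
import boto3
# Placeholders -- take the real bucket/key from your export's download URL.
bucket = "example-populus-exports"
key = "POPULATION_NAME/US/PA/agent-person_1.parquet"
s3 = boto3.client("s3")
s3.download_file(bucket, key, "agent-person_1.parquet")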
FAQ
Why are there NULLs in some columns?
Because the export uses wide tables; attributes that don’t apply to a specific row (e.g., household_relationship for dorm residents) are simply left empty. That keeps the data easy to read and analyze directly.
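A quick way to see which attributes apply to which agents is to look at per‑column missingness after loading. A minimal sketch using the agents frame from the loading examples:
# Share of missing values per column, sorted; columns that only apply to a
# subset of agents (e.g., household_relationship) will show up near the top.
missing_share = agents.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))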
Are column names and types fixed?
Populus supports extensible schemas. Always check the included manifest for the precise set of fields and dtypes in your export.
Why not one big Parquet file?
Multiple parts of roughly 500 MB or less improve read parallelism, data skipping, memory use, and resilience, especially in object stores like S3.
Summary
Populus exports deliver exactly what most analytics stacks want: three families of flat tables (Agents, Places, Links) with clear naming, predictable partitioning, and schemas suited to pandas/Polars/Spark/DuckDB. Load the pieces you need, join via IDs from the link tables, and rely on the manifest to keep your pipeline robust across populations and releases.