OpenDAPI Fundamentals
Every dataset needs a DAPI file. A dataset is loosely defined and can be anything: an entity, an event, or any other package of information accessible in a storage system.
DAPI files are language-agnostic and are usually named <dataset>.dapi.yaml or <dataset>.dapi.json. They are written in YAML or JSON, and are validated against the JSONSchema specification identified in the schema attribute. The specifications are available on opendapi.org.
Examples
DAPI
Let us take a look at a DAPI file for a dataset called user:
# /path/to/user.dapi.yaml
schema: https://opendapi.org/spec/0-0-1/dapi.json
# Identifier that is unique across the organization
urn: company.membership.user
type: entity
description: This dataset contains information about our users and is stored when they sign up
# The team that owns this dataset
owner_team_urn: my_company.teams.engineering.accounts
datastores:
  # The datastore where this dataset is created
  sources:
    - urn: my_company.datastores.dynamodb
      # what is this dataset called in the datastore
      data:
        identifier: User
        namespace: ''
      # approved business purposes for which this dataset is used in this datastore
      business_purposes: ['servicing', 'fraud_prevention']
      # infinite retention
      retention_days: ~
      # source datastore
      freshness_sla_minutes: 0
  # The datastore where this dataset is replicated to and consumed from
  sinks:
    - urn: my_company.datastores.snowflake
      data:
        identifier: user
        namespace: prod.accounts
      business_purposes: ['product_analytics', 'direct_marketing']
      # These business purposes require a 365 day retention
      retention_days: 365
      # SLA to replicate to this datastore
      freshness_sla_minutes: 60
fields:
  - name: email
    data_type: string
    description: The email of the user that is verified via OTP
    is_nullable: false
    is_pii: true
    # Does the producer guarantee the field's stability?
    access: public
  - name: name
    data_type: string
    description: The full legal name of the user
    is_nullable: false
    is_pii: true
    # the producer may change the field's semantics at some point
    access: private
  - name: created_at
    data_type: datetime
    description: Timestamp when they created an account
    is_nullable: false
    is_pii: false
    access: public
# unique key, often useful to manage data
primary_key:
  - email
# additional context about this DAPI
context:
  service: membership
  integration: pynamodb
  rel_model_path: ../../models/user.py
  rel_doc_path: ../../docs/user.md
Every DAPI has its own URN to uniquely identify it within an organization, and it also references three other kinds of URNs:
- owner_team_urn: the team that owns this dataset. It maps to a company.teams.yaml or company.teams.json file that adheres to the teams specification.
- datastores URNs: the datastores where this dataset is created and consumed from. They map to a company.datastores.yaml file, adhering to the datastores specification.
- business_purposes URNs: the business purposes for which this dataset is used. They map to a company.purposes.yaml file, adhering to the purposes specification.
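To make the cross-references concrete, here is an illustrative sketch (not part of any OpenDAPI tooling; the file names are taken from the examples in this document) that verifies the owner_team_urn resolves to a declared team:

# Illustrative URN cross-reference check; assumes pyyaml is installed.
import yaml

with open("user.dapi.yaml") as f:
    dapi = yaml.safe_load(f)
with open("company.teams.yaml") as f:
    teams_file = yaml.safe_load(f)

team_urns = {team["urn"] for team in teams_file["teams"]}
if dapi["owner_team_urn"] not in team_urns:
    raise ValueError(f"Unknown owner_team_urn: {dapi['owner_team_urn']}")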
Teams
Let us review the company.teams.yaml file.
# /path/to/company.teams.yaml
schema: https://opendapi.org/spec/0-0-1/teams.json
organization:
  name: Acme Company
  # Slack integration
  slack_teams:
    - T1234567890
teams:
  - urn: my_company.teams.engineering
    name: Engineering
    domain: Engineering
    email: lead@company.com
  - urn: my_company.teams.engineering.accounts
    name: Accounts Engineering
    domain: User Growth
    email: accounts@company.com
    manager_email: mgr@company.com
    slack_channel_id: C1234567890
    github_team: org/accounts
    # org hierarchy
    parent_team_urn: my_company.teams.engineering
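The parent_team_urn field encodes the org hierarchy, which tooling can walk to find escalation paths. An illustrative sketch, assuming pyyaml and the company.teams.yaml example above:

# Illustrative walk up the team hierarchy via parent_team_urn; assumes pyyaml.
import yaml

with open("company.teams.yaml") as f:
    teams = {t["urn"]: t for t in yaml.safe_load(f)["teams"]}

def ownership_chain(urn: str) -> list[str]:
    # Follow parent_team_urn links until a team has no parent
    chain = []
    while urn:
        chain.append(urn)
        urn = teams[urn].get("parent_team_urn")
    return chain

print(ownership_chain("my_company.teams.engineering.accounts"))
# -> ['my_company.teams.engineering.accounts', 'my_company.teams.engineering']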
Datastores
Here is a sample company.datastores.yaml. Adding datastore credentials is optional, but recommended: it enables DAPI servers to connect to the datastores and provide impact analysis as part of every change.
# /path/to/company.datastores.yaml
schema: https://opendapi.org/spec/0-0-1/datastores.json
datastores:
  - urn: my_company.datastores.dynamodb
    type: dynamodb
    host:
      # Supports multiple environments, but env_prod is required
      env_prod:
        location: arn:aws:dynamodb:us-east-1:12345567
        # Credentials must be shared with your DAPI server AWS accounts
        username: aws_secretsmanager:arn:aws:secretsmanager:us-east-1:1234567:secret:prod/dynamodb/username-123
        password: aws_secretsmanager:arn:aws:secretsmanager:us-east-1:1234567:secret:prod/dynamodb/pw-123
      env_dev:
        location: arn:aws:dynamodb:us-east-1:972019825782
        username: aws_secretsmanager:arn:aws:secretsmanager:us-east-1:1234567:secret:prod/dynamodb/username-123
        password: aws_secretsmanager:arn:aws:secretsmanager:us-east-1:1234567:secret:prod/dynamodb/pw-123
  - urn: my_company.datastores.snowflake
    type: snowflake
    host:
      env_prod:
        location: org-account.us-east-1.snowflakecomputing.com
        username: aws_secretsmanager:arn:aws:secretsmanager:us-east-1:1234567:secret:prod/dynamodb/username-123
        password: aws_secretsmanager:arn:aws:secretsmanager:us-east-1:1234567:secret:prod/dynamodb/pw-123
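Note the aws_secretsmanager: prefix on the credential values: it points at a secret rather than embedding one. A sketch of how a DAPI server might resolve such a reference (the prefix convention is taken from the example above; boto3 is assumed, and the helper itself is illustrative):

# Illustrative resolver for aws_secretsmanager:<arn> credential strings.
# Assumes boto3 is installed and the caller has permission to read the secret.
import boto3

def resolve_credential(value: str) -> str:
    prefix = "aws_secretsmanager:"
    if not value.startswith(prefix):
        return value  # plain value, use as-is
    arn = value[len(prefix):]
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=arn)["SecretString"]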
Purposes
Finally, here is a sample company.purposes.yaml:
# /path/to/company.purposes.yaml
schema: https://opendapi.org/spec/0-0-1/purposes.json
purposes:
  - urn: company.purposes.servicing
    description: Servicing our customers
  - urn: company.purposes.fraud_prevention
    description: Preventing fraud
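An illustrative consistency check ties purposes back to DAPIs. The DAPI example above lists short purpose names such as 'servicing' while this file declares full URNs, so the sketch assumes the short name is the last URN segment; run against these samples, it would flag product_analytics and direct_marketing, which the purposes file does not yet declare:

# Illustrative check that every business purpose used in a DAPI is declared.
# Assumes pyyaml, and that short names map to the last segment of the URN.
import yaml

with open("company.purposes.yaml") as f:
    declared = {p["urn"].split(".")[-1] for p in yaml.safe_load(f)["purposes"]}

with open("user.dapi.yaml") as f:
    dapi = yaml.safe_load(f)

for datastore in dapi["datastores"]["sources"] + dapi["datastores"]["sinks"]:
    for purpose in datastore["business_purposes"]:
        if purpose not in declared:
            raise ValueError(f"Undeclared business purpose: {purpose}")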
Putting DAPIs to use
There must be exactly one DAPI file per dataset. However, it is okay to have multiple files for teams, datastores and purposes. This ecosystem is fully extensible to support and enforce data policies as well.
As is evident, OpenDAPI files are merely representations of information. The real value comes from the tooling that creates and maintains DAPIs, and from the workflows powered by the DAPI server.
Here are some example scenarios:
- An engineer adds or updates a dataset. Their local development environment auto-generates and validates the OpenDAPI files. They push the changes to GitHub, and GitHub Actions interacts with the DAPI server to provide helpful AI-powered suggestions to improve the dataset's DAPI (semantics, data classification, etc.). It also identifies potential downstream impact of the changes and flags it in the same PR.
- A data consumer wants access to a Snowflake dataset for their business purpose. They make their request in the DAPI server, which in turn creates a PR against the appropriate DAPI file for the data owner's approval. Upon approval, the Snowflake role for the business purpose is granted access to the dataset.
- Want to create a Record of Processing Activities (ROPA)? The DAPI server can parse the DAPIs and extract a fully compliant ROPA for you to review and sign, as sketched below.
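As a flavor of that last scenario, here is an illustrative sketch (not the DAPI server's actual implementation) that pulls the raw material of a ROPA entry out of the user.dapi.yaml example: the PII fields, plus the purposes and retention per sink datastore.

# Illustrative ROPA-style extraction from a DAPI file; assumes pyyaml.
# Field names follow the user.dapi.yaml example in this document.
import yaml

with open("user.dapi.yaml") as f:
    dapi = yaml.safe_load(f)

pii_fields = [field["name"] for field in dapi["fields"] if field.get("is_pii")]

for sink in dapi["datastores"]["sinks"]:
    print({
        "dataset": dapi["urn"],
        "owner": dapi["owner_team_urn"],
        "datastore": sink["urn"],
        "purposes": sink["business_purposes"],
        "retention_days": sink["retention_days"],
        "pii_fields": pii_fields,
    })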