Skip to content

Add machine-readable constraint definitions (#22)#27

Open
Abhishek-Kumar-Rai5 wants to merge 1 commit intoPecanProject:mainfrom
Abhishek-Kumar-Rai5:feat/constraint-schema
Open

Add machine-readable constraint definitions (#22)#27
Abhishek-Kumar-Rai5 wants to merge 1 commit intoPecanProject:mainfrom
Abhishek-Kumar-Rai5:feat/constraint-schema

Conversation

@Abhishek-Kumar-Rai5
Copy link
Copy Markdown

Description

Implement machine-readable constraint definitions using Frictionless Data Table Schema (datapackage.json) and custom YAML for constraints that Frictionless cannot express natively.

This establishes the single source of truth for all BETYdb validation rules, enabling external tools to validate data before submission and eliminating the need to reverse-engineer rules from scattered Rails/SQL implementations.

Related Issue(s)

Closes #22 (Implement machine-readable source of truth for data constraints and model validation rules)
Related to #14 (Implement constraints and validation from postgres BETYdb)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Data update (changes to datasets)
  • Documentation update
  • Breaking change (fix or feature that would cause existing functionality to change)

Changes Made

Layer 1: Frictionless Data (datapackage.json)

  • Extended with native constraint metadata:
    • required: true on critical fields (sitename, trait, mean, id, name, units)
    • minimum/maximum on numeric fields (lat: -90/90, lon: -180/180, masl, n ≥ 1)
    • enum on restricted values (access_level: 1-4, statname, checked)
    • unique and primaryKey on uniqueness constraints
    • foreignKeys on referential integrity across all tables
  • No structural changes — field names, types, and order preserved

Layer 2: Custom YAML (data-raw/custom_constraints.yaml, inst/extdata/)

  • NEW file: 5 constraint definitions across 3 tables
  • sites: composite constraint (sand_pct + clay_pct ≤ 100)
  • traits: conditional constraint (stat requires statname), cross-table constraint (trait.mean within variable bounds)
  • cultivars: unique combination constraint (name + specie_id)
  • Each constraint references original source (Rails model, SQL migration)

Checklist

Data Changes (if applicable)

  • datapackage.json updated with constraint metadata
  • Data rebuilt from source (no rebuild needed — schema-only change)
  • Row counts verified (no data rows affected)

Additional Notes

What This Enables

  • External tools (R packages, Python clients, ingestion pipelines) can validate data before upload
  • Single maintained reference for all validation rules
  • Any language can consume constraints from datapackage.json and custom_constraints.yaml

What PR 2nd Will Add

  • R validation functions that read and enforce these definitions
  • Integration into build process (make-data.R)
  • Comprehensive test suite (37 tests)
  • Auto-filtering of invalid records

How to Review

  • Verify constraint definitions are complete and accurate
  • Check that Frictionless layer covers all standard constraints
  • Confirm custom YAML captures all complex rules
  • Suggest any additions or removals before this is merged

@Abhishek-Kumar-Rai5
Copy link
Copy Markdown
Author

Hi @dlebauer as you had instructed and keeping the suggestion in mind of using both the frictionless-data and defining the rest in a separate validation layer, I have made all the additions and changes. Please have a look whenever you have a second. Further along with it I have also addressed the parent issue #14 and its PR is ready from my side which will address all the issues you have given in the description and the comment there. Ready to push that Pr whenever you give the go signal to raise it.
Thank You.
Meanwhile I am regularly working on the project and giving you the updates the slack. Would be waiting for your review and suggestions.

Copy link
Copy Markdown
Member

@dlebauer dlebauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start, but it does not appear to cover the set of constraints defined in #14 (more specifically, documentation, spreadsheet, and sql schema definition therein?).

While it is not required to migrate all constraints, for any constraints that don't need to be migrated, please make a note why that don't need to be migrated (you can even copy the spreadsheet and keep notes there). And if there are constraints where it is not clear that they need to be migrated, please make a note in the spreadsheet or ask.

Note: 1. constraints on/from tables that haven't been migrated are excluded. 2. The set of constraints is comprehensive, and there may be some that do not need to be migrated, but we should discuss rather than silently dropping.

lat, lon, and masl must all be specified together or not at all.
Enforced indirectly via the geometry column in the CSV representation.
Flagged here for documentation; enforcement is via geometry parsing logic.
message: "lat, lon, and masl must all be present together (source: site.rb complete_geometry_specification)"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to keep the source of the message here (or below, sufficient to have this in file metadata header

# Cross-table lookup rules that require joining against another table.
# These were originally implemented as PostgreSQL triggers/functions.
#
# Sources:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from #14 see also

@Abhishek-Kumar-Rai5
Copy link
Copy Markdown
Author

@dlebauer sir, apologies for the confusion that got created.
Just to clarify scope on this PR: this was intentionally structured as a first-phase, schema-focused contribution tied to Issue #22, specifically to establish the machine-readable constraint layer before moving into broader validation implementation.

The scope here was limited to:

  • Frictionless constraints in datapackage.json
  • Custom machine-readable constraints in inst/extdata/custom_constraints.yaml
  • Establishing the two-layer constraint model (schema constraints + custom YAML constraints)

I just wanted to keep the PR separate so that they are clean and can be reviewed easily as the changes were large. And also based on this I wanted a go signal in case my current approach was correct and if it was not then I would have taken the suggestions and then raised that PR for an easy merger.
The full PR which properly addresses both these issues, I am raising it separately so as to keep the PR clean. Please have a look at it and provide your suggestions on it.
Thank you!

@Abhishek-Kumar-Rai5
Copy link
Copy Markdown
Author

I have created the PR as it was already ready with me. Please have a look at it whenever you have a second. The PR is #28

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement machine-readable source of truth for data constraints and model validation rules

2 participants