Data quality pipeline

How we process every GTFS feed before it reaches you

Raw GTFS data published by transit agencies is rarely production-ready. It contains duplicates, missing required fields, invalid coordinates, broken calendars, and format violations. Our three-stage automated pipeline processes 7,000+ feeds from transit operators worldwide: first downloading and normalizing each feed, then consolidating feeds from multiple operators into a single coherent dataset, then applying 80+ correction, enrichment, and validation rules — so you integrate clean, consistent data on the first attempt.

Active development

This pipeline is under continuous development — new rules, improved algorithms, and data quality fixes are shipped every week. The data you receive through our API is always the result of the latest validated pipeline version, deployed via a blue-green release strategy for zero-downtime updates and high availability.

Weekly
Pipeline releases
New processing version with rule improvements deployed every week
~7 days
Static GTFS refresh
All 7,000+ feeds fully reprocessed and published on a weekly cycle
15–60 sec
Real-time data latency
GTFS-RT vehicle positions and trip updates ingested live from 1,000+ feeds

Static GTFS processing pipeline

Input
7,000+ Raw GTFS
Transit operator feeds
Stage 0
Feed Acquisition
15 steps
Stage 1
Feed Consolidation
11 steps
Intermediate
Unified GTFS
1 feed
Stage 2 · Phase 1
Normalize & Correct
22 rules
Stage 2 · Phase 2
Enrich
7 rules
Stage 2 · Phase 3
Deduplicate
9 rules
Stage 2 · Phase 4
Validate & Hash
20 rules
Stage 2 · Phase 5
Generate derivatives
5 rules
Output
Improved GTFS
+ GeoJSON & stats

Stage 0: Feed Acquisition & Pre-processing

Before consolidation or quality rules run, raw GTFS data is downloaded from operator sources and passed through a pre-processing layer that makes each feed structurally parseable and safe for downstream processing. Feeds that have not changed since the previous cycle are detected via content hashing and skipped entirely, eliminating unnecessary reprocessing.

01
Download and duplicate detection
Downloads the feed from the operator's URL. Computes a content hash of all files; if the hash matches the previously stored value, the feed is marked unchanged and processing halts — preventing redundant reprocessing of identical data.
02
Archive structure normalization
Removes nested folders so GTFS files are accessible at the archive root. Extracts nested ZIP archives to the root level. Halts processing if more than one subfolder or ZIP is found at the same level, as this indicates an unsupported multi-archive structure.
03
File format normalization
Renames .csv files to .txt. Removes non-GTFS files, empty files, and files that contain only a header row with no data. Removes files where all data rows contain only empty values.
04
Required file presence check
Verifies that the minimum required GTFS files exist: routes.txt, trips.txt, stops.txt, and stop_times.txt. Also checks that at least one of calendar.txt or calendar_dates.txt is present. Missing required files halt processing.
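A minimal version of this check, assuming the archive's member names have already been collected into a set:

```python
# Required files per the GTFS reference; a feed also needs at least
# one of the two calendar files to define when services run.
REQUIRED = {"routes.txt", "trips.txt", "stops.txt", "stop_times.txt"}
CALENDAR_ANY = {"calendar.txt", "calendar_dates.txt"}

def missing_required_files(names: set[str]) -> list[str]:
    """Return the list of structural problems found; an empty list
    means the feed meets the minimum and processing may continue."""
    problems = [f"missing required file: {m}" for m in sorted(REQUIRED - names)]
    if not (CALENDAR_ANY & names):
        problems.append("need at least one of calendar.txt or calendar_dates.txt")
    return problems
```

In the pipeline, a non-empty result at this point halts the cycle, since later stages cannot produce meaningful output from a structurally incomplete feed.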
05
Line ending and encoding repair
Converts Windows CRLF line endings to Unix LF. Fixes 122 common character sequences that result from UTF-8 data being misinterpreted as latin1. Converts UTF-16 and other exotic encodings to UTF-8 using a two-method detection approach. Removes null bytes and binary artifacts from all files.
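The core of the latin1-mojibake repair is a reversal of the original mistake: re-encode the garbled text as latin1 bytes, then decode those bytes as UTF-8. The production rule targets 122 known sequences; the generic round-trip below is a simplified sketch that only applies the fix when the round trip succeeds:

```python
def repair_mojibake(text: str) -> str:
    """Reverse the classic UTF-8-read-as-latin1 corruption.
    If the text is not a latin1-mojibake pattern, the round trip
    raises and the original value is returned unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

For example, "ZÃ¼rich" round-trips back to "Zürich", while already-correct text passes through untouched because its latin1 bytes are not valid UTF-8 sequences.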
06
Text field cleanup
Strips leading and trailing whitespace from all string fields. Removes orphaned quotation marks (lines that contain exactly one double quote). Removes newline characters embedded inside quoted CSV fields. Replaces literal \n escape sequences that appear as raw text inside field values with a space.
07
Header and quoting normalization
Removes spaces from file header rows (column names). Removes spurious quotation marks that wrap entire CSV rows — a common export artifact that breaks standard CSV parsers.
08
CSV structure repair
Validates each GTFS file against the CSV specification. For files with mismatched quote characters, applies a multi-step structural repair. For files with inconsistent column counts, trims excess columns to match the header. For files with bare unescaped quotes in non-quoted fields, applies an additional repair pass. Each fix is re-validated; failed fixes roll back to the original.
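The column-count repair can be illustrated as follows. This is a simplified sketch of one fix from the step above; the production pass re-validates each repaired file and rolls back if the fix fails:

```python
import csv
import io

def trim_excess_columns(raw: str) -> str:
    """Trim data rows that carry more fields than the header row,
    so every row parses at the header's width."""
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return raw
    width = len(rows[0])          # header defines the expected width
    fixed = [row[:width] for row in rows]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(fixed)
    return out.getvalue()
```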
09
Statistics collection
Counts the number of rows in each GTFS file and records the detected character encoding. Extracts the earliest start date and latest end date from calendar.txt and calendar_dates.txt. All statistics are written to the database for operational monitoring and cycle management.
10
Stop times sort
Sorts stop_times.txt by trip_id and then by stop_sequence to ensure deterministic processing order for all downstream rules that iterate over stop sequences.
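The sort itself is short, but one detail matters: stop_sequence must be compared numerically, since a lexicographic sort would place sequence 10 before sequence 2. A sketch:

```python
import csv
import io

def sort_stop_times(raw: str) -> str:
    """Sort stop_times rows by trip_id, then numeric stop_sequence,
    so downstream rules see each trip's stops in travel order."""
    reader = csv.DictReader(io.StringIO(raw))
    rows = sorted(reader, key=lambda r: (r["trip_id"], int(r["stop_sequence"])))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```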
11
Operator-specific custom scripts
For feeds that require transformations beyond the generic rules, operator-specific scripts handle source-level quirks: encoding fixes, custom archive layouts, proprietary ID schemes, and integration with external reference datasets (NAPTAN, NOC register, Traveline, and others).
Error handling: Feeds that fail structure checks (missing required files, unsupported archive layout, unchanged hash) are halted immediately and logged to the database. Non-fatal issues (encoding conversion failures, CSV repair failures) are logged and processing continues with the best available version of the file.

Stage 1: Feed Consolidation

Before any quality rules run, we consolidate feeds from multiple transit operators into a single coherent GTFS file. This is the most complex stage of the pipeline: it handles timezone normalization, spatial stop deduplication, calendar merging, and translation inheritance across all source datasets.

01
Build auxiliary index maps
Before any transformation, the system builds four lookup tables: agency_id → timezone (to determine the dominant timezone), service_id → ServiceInfo (operating days and exceptions), trip_id → TripInfo (timezone, first departure, transport mode), and stop_id → StopInfo (serving routes, transport modes).
02
Detect feed languages
Reads feed_info.txt and agency.txt to determine the language of each source. If sources use different languages, feed_lang is set to mul (multilingual). All translations.txt files are loaded for later processing.
03
Merge calendars with DST-aware timezone conversion
The most technically challenging step. Every trip's service period is split into sub-periods at daylight saving time (DST) transition boundaries — separately for both the source timezone and the target (dominant) timezone. For each sub-period, time_shift (in seconds) and days_shift are computed and applied. Inconsistent DST boundaries produce an error and halt processing. Unique service IDs are generated for each sub-period to prevent collisions.
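The offset arithmetic behind time_shift can be sketched with Python's zoneinfo. The function below is illustrative only; the production step also computes days_shift and generates unique service IDs per sub-period:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def time_shift_seconds(service_day_noon: datetime,
                       source_tz: str, target_tz: str) -> int:
    """Seconds to add to a local time in source_tz so the same instant
    reads correctly as a local time in target_tz, on the given day.
    The shift changes at DST boundaries, which is exactly why service
    periods must first be split into DST-consistent sub-periods."""
    src = service_day_noon.replace(tzinfo=ZoneInfo(source_tz))
    tgt = service_day_noon.replace(tzinfo=ZoneInfo(target_tz))
    return int((tgt.utcoffset() - src.utcoffset()).total_seconds())
```

A concrete case: Phoenix does not observe DST while Denver does, so the shift between them is +3600 seconds in summer but 0 in winter. A single service period spanning both would get a wrong shift on one side of the boundary, hence the sub-period split.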
04
Deduplicate and merge stops (spatial + semantic)
Stop deduplication uses a spatial index to find candidates within transport-mode-specific distance thresholds (30–715 m for bus/metro/tram, up to 700 m for ferries, 1–3 m for cable cars). Name similarity is verified using a configurable string-similarity metric. Merge is blocked when stops have different platform codes, different transport modes, or both stops are served by the same trip (which would create a loop). A special case handles station (location_type=1) / platform (location_type=0) pairs: instead of merging, the platform is assigned as a child of the station. The stop closest to the geometric center of a duplicate group is kept as the canonical record.
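A stripped-down version of the candidate test, using haversine distance and difflib string similarity. The production metric and thresholds are configurable and mode-specific; the values below are illustrative defaults:

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def merge_candidates(a: tuple, b: tuple,
                     max_dist_m: float = 30.0,
                     min_name_sim: float = 0.85) -> bool:
    """a, b: (name, lat, lon). Two stops are merge candidates when they
    are close enough for the mode AND their names are similar enough."""
    if haversine_m(a[1], a[2], b[1], b[2]) > max_dist_m:
        return False
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio() >= min_name_sim
```

The production rule additionally checks the blocking conditions described above (platform codes, transport modes, shared trips) before any merge is committed.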
05
Merge trips with DST sub-trip splitting
Trips that span a DST transition boundary are split into one record per sub-period, each with a new unique trip_id and the corresponding service_id. Translations referencing the original trip are duplicated for all sub-trips and the original translation is removed.
06
Merge stop_times and frequencies with time shifts
Each stop_times record is updated: stop_id is replaced via the deduplication mapping. For DST-split trips, records are duplicated for each sub-trip and departure/arrival times are shifted by time_shift seconds. The same logic applies to frequencies.txt headway-based trips.
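Shifting a time value must preserve GTFS extended-time semantics, where hours run past 23 for trips that continue after midnight. A sketch:

```python
def shift_gtfs_time(hms: str, shift_seconds: int) -> str:
    """Shift a GTFS time value, keeping the extended format: hours may
    exceed 23 for service past midnight (e.g. '25:10:00')."""
    h, m, s = (int(part) for part in hms.split(":"))
    total = h * 3600 + m * 60 + s + shift_seconds
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
```

Collapsing "24:50:00" back to "00:50:00" would silently move the stop to the wrong service day, which is why the extended format must survive the shift.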
07
Merge remaining tables
All other tables are concatenated with targeted fixes: agency_timezone is normalized to the dominant timezone for all agencies; transfers.txt and pathways.txt have stop IDs remapped through the deduplication mapping; all other tables (routes, shapes, fare_attributes, fare_rules, levels, attributions) are merged without modification.
08
Finalize translations
All accumulated translations (copied, removed, updated across steps 5–7) are collected, deduplicated, sorted by (table_name, field_name, field_value, record_id, language), and written to the final translations.txt.
09
Create feed_info and provenance mapping
Writes feed_info.txt (publisher, URL, language, date range covering all active calendars, version timestamp). Creates feed_version.mapping — a traceability table linking each record in the output to its source dataset, enabling full provenance tracking.
10
Merge provenance mapping files
Combines .mapping files from all sources. stops.mapping applies the stop deduplication remapping; trips.mapping expands each original trip ID into all its DST-split sub-trips; agency.mapping and routes.mapping are merged without ID changes.
11
Duplicate route removal (optional)
When enabled for a region, performs a full preliminary merge, then runs a pairwise route similarity comparison. Routes with similarity ≥ threshold (default 99.99%, configurable per region) are considered duplicates. The winner is chosen by priority: real-time sources first, then by manual deduplication_priority score. A duplication report is generated for review.
Error handling: Fatal errors (trip referencing non-existent service, inconsistent DST boundaries, unknown stop in stop_times) halt the merge immediately. Non-fatal warnings (service without trips, missing translations) are logged and processing continues.

Stage 2: what each phase does

Error Correction
22 rules
Fixes format violations, fills missing required fields with safe defaults, corrects broken calendars, normalizes names and coordinates, and removes logically invalid records — without discarding valid data.
ID normalization Calendar repair Time consistency Name cleanup
Data Enrichment
7 rules
Adds data not present in the original feed: fills missing stops referenced in stop_times, adds entrances and exits to stations, links stops to cities, and populates headsigns from last-stop names.
Missing stops Station entrances Stop–city links Agency enrichment
Deduplication
9 rules
Detects and merges duplicate stops, trips, routes, and calendars both within a single feed and across overlapping regional feeds. Calendars are rebuilt to cover the union of working days without data loss.
Stop dedup Trip dedup Calendar merge Route dedup
Validation
11 rules
Validates URLs, emails, timezone identifiers, route colors, transfer rules, and arrival/departure time ordering. Invalid records are corrected where possible, or flagged and removed if they cannot be repaired.
URL/email check Timezone validation Color validation Transfer rules
Stable Hashing
9 rules
Computes content-based hashes for stops, trips, and routes that are minimally affected by provider data updates. Stable IDs enable reliable GTFS-RT matching and historical change tracking across update cycles.
Trip hash (hash1) Stop hash Route hash Sort order hash
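The exact parameter set behind hash1 is internal to the pipeline, but the idea can be illustrated: hash schedule content that survives provider updates (rounded coordinates, departure time, route name) instead of provider-assigned IDs. The field choice below is an assumption for illustration, not the production definition:

```python
import hashlib

def stable_trip_hash(route_short_name: str,
                     first_departure: str,
                     stop_coords: list[tuple[float, float]]) -> str:
    """Illustrative stable trip hash: derived from schedule content
    rather than the provider trip_id, so it survives ID churn between
    feed updates. Coordinates are rounded so sub-meter provider jitter
    does not change the hash."""
    coords = ";".join(f"{lat:.4f},{lon:.4f}" for lat, lon in stop_coords)
    payload = f"{route_short_name}|{first_departure}|{coords}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

The trade-off is deliberate: a hash built this way changes only when the schedule itself materially changes, which is what makes cross-cycle GTFS-RT matching and change tracking possible.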
Derivative Generation
5 rules
Generates feed-level statistics, per-stop route lists, GeoJSON shapes (geometrically simplified), trip bearing data, and MobilityData validation reports alongside the final improved GTFS ZIP.
Feed statistics GeoJSON shapes MobilityData report Trip bearing

Complete rule catalogue

Rules execute in order within each phase. Click any group to expand.

Feed Acquisition & Pre-processing 15 steps
Step What it does
archive_normalization Unwraps nested folders and ZIP packages inside archives, ensuring all GTFS files are accessible at the archive root before any other processing begins
change_detection Fingerprints the incoming feed and skips the full processing cycle when content is unchanged since the previous run, keeping pipeline throughput high across 7,000+ feeds
file_extension_normalization Renames .csv files to .txt to conform to GTFS naming conventions
unnecessary_file_removal Removes non-GTFS files, header-only files, and files consisting entirely of empty values
required_files_check Verifies that all required GTFS files are present before processing continues; halts the cycle early if the feed is structurally incomplete
text_normalization Eliminates text-formatting artifacts across all files: Windows line endings, newlines embedded inside field values, and literal escape sequences that break CSV parsing
csv_quote_repair Removes malformed quotation marks — orphaned mid-row quotes and spurious full-row wraps — a pervasive export artifact that breaks standard CSV parsers
whitespace_normalization Strips leading and trailing whitespace from all string fields, including column headers
encoding_artifact_repair Repairs UTF-8/latin1 double-encoding artifacts in name fields, covering patterns observed across thousands of real-world feeds
binary_artifact_removal Removes null bytes and binary artifacts from all text files
utf8_normalization Detects non-UTF-8 encodings and converts all files to UTF-8, covering the full range of legacy and regional character sets
feed_statistics_capture Records per-file row counts, character encodings, and calendar date ranges to the monitoring database at the start of each processing cycle
csv_structure_repair Validates and repairs CSV structure — mismatched quotes, inconsistent column counts, unescaped characters; rolls back to the original if repair fails
stop_times_sort Sorts stop_times.txt by trip and stop sequence to guarantee deterministic input order for all downstream rules
operator_normalization Applies operator-specific corrections for known non-standard exports from selected agencies
Error Correction 22 rules
Rule What it does
id_normalization Normalizes all primary and foreign key values across tables, resolving non-standard characters that break referential integrity
location_type_inference Infers location_type from stop usage context when the field is absent, defaulting to platform-level where appropriate
name_sanitization Sanitizes all name fields by removing control characters, non-standard Unicode sequences, and extraneous whitespace
stop_description_cleanup Clears stop_desc when its content duplicates stop_name, eliminating noise that misleads downstream consumers
default_stop_name Assigns a valid placeholder name when stop_name is absent, ensuring no stop enters the pipeline without an identifier
calendar_repair Repairs calendar integrity issues: expired service periods, reversed start/end dates, and contradictory day-of-week flags
calendar_canonicalization Canonicalizes calendar records to their minimal equivalent representation, reducing row count while preserving the exact service pattern
route_name_derivation Derives route_long_name from the first and last stop of the representative trip when the original value is missing or clearly incorrect
route_name_promotion Promotes content from route_long_name to route_short_name when the short name field is empty and the value is compact enough to serve that role
wheelchair_defaults Standardizes wheelchair_boarding to "unknown" on all stops lacking an explicit accessibility declaration, ensuring consistent output schema
default_route_type Assigns a contextually derived route_type when the field is absent or contains an unrecognized value
route_url_cleanup Removes route_url when it duplicates agency_url, eliminating redundant data
arrival_departure_correction Detects and corrects stop_times rows where arrival is recorded after departure — a systematic error in certain scheduling export tools
agency_name_recovery Assigns a canonical name to agencies missing one and flags the feed for manual review
cross_midnight_normalization Normalizes cross-midnight departure times into valid GTFS extended-time format without altering any actual schedule values
parent_station_relocation Relocates parent stations that are geographically distant from their child stops to a more accurate position
route_color_defaults Derives route_text_color from background luminance when the field is missing, ensuring readable contrast on all display surfaces
trip_headsign_derivation Populates trip_headsign from the terminal stop name when the provider has not specified it
coordinate_precision Rounds stop coordinates to six decimal places (roughly 0.1 m resolution), reducing data volume with no practical loss of geospatial accuracy
timepoint_defaults Fills missing timepoint fields with the GTFS default (exact timing), preventing ambiguity for downstream consumers
name_character_cleanup Final name-field pass: removes residual non-alphanumeric characters that survived earlier normalization stages
feed_language_correction Resolves mismatches between the declared feed_lang value and the language detected in feed content
Data Enrichment 7 rules
Rule What it does
missing_stop_reconstruction Reconstructs stop records referenced in stop_times but absent from stops.txt, preventing broken trip references and downstream failures
route_data_enrichment Applies a curated correction table maintained by our data team to override incorrect vehicle types, route names, and agency assignments
station_topology_enrichment Supplements feeds with station entrance and exit data (location_type=2) sourced from our global hub database
stop_data_correction Applies a curated stop correction table to fix coordinates and names — particularly effective for NAPTAN-derived and other official-source datasets with systematic errors
agency_contact_enrichment Fills missing agency contact details from a maintained cross-reference database
stop_mode_classification Derives the dominant transport mode for each stop from the set of routes serving it — used for map rendering and mode-specific filtering
stop_city_resolution Resolves each stop to its containing city using geographic analysis — the foundation for city-level data exports and API filtering
Deduplication 9 rules
Rule What it does
route_name_deduplication Removes route_desc and route_long_name when their content duplicates route_short_name, eliminating noise that misleads display logic
calendar_deduplication Identifies and merges structurally identical calendars within a feed, rebuilding service ID references across all dependent tables
duplicate_stop_removal Removes duplicate stop records within a feed, retaining the entry with the most complete field coverage
global_stop_deduplication Multi-signal stop deduplication: resolves stop identity across update cycles using a combination of ID, name, and coordinate signals, maintaining stable global stop references over time
directional_stop_deduplication Identifies duplicate stops positioned on opposite sides of the same road using directional analysis — a common artifact when stops are sourced from multiple operators
stop_times_deduplication Removes duplicate stop_times rows produced when the same trip visits the same stop at the same time — typically a scheduling export artifact
route_deduplication Merges routes sharing an identical stop-sequence fingerprint across different source feeds or agencies, producing a single canonical route record
trip_deduplication Multi-pass trip deduplication: resolves trips by route and schedule fingerprint, then by schedule alone across routes; merges calendars when service patterns overlap
shape_optimization Reduces shape point density using geometric simplification and removes redundant shape entries, cutting GeoJSON payload without visible quality loss
Validation & Cleanup 11 rules
Rule What it does
arrival_departure_validation Validates all stop_times for arrival/departure sequence violations; corrects where possible, removes irrecoverable rows
orphaned_trip_removal Removes trips that have no valid service period after calendar cleanup — trips that can never run
location_type_validation Removes trips that reference stops with incompatible location types (for example, referencing a station entrance inside stop_times)
url_email_validation Validates and sanitizes URL and email fields across agencies and routes; removes values that fail format checks
agency_url_defaults Assigns a placeholder URL to agencies missing a website, satisfying the GTFS required field constraint without breaking downstream consumers
timezone_validation Corrects invalid IANA timezone identifiers in stop_timezone and agency_timezone fields
speed_anomaly_removal Identifies and removes trips where the implied speed between consecutive stops is physically impossible for the declared vehicle type
route_color_validation Validates route_color and route_text_color hex values; removes or corrects invalid color codes
single_stop_trip_removal Removes trips containing only a single stop — logically invalid and unrepresentable as real transit service
transfer_validation Validates transfer rules and removes entries referencing stops or routes that do not exist in the feed
timezone_geographic_cross_check Cross-validates agency timezone declarations against stop geographic coordinates to detect and correct systematic timezone mismatches
Stable Hashing & Normalization 9 rules
Rule What it does
route_type_normalization Normalizes extended GTFS route type codes to the base specification values for maximum compatibility across consumers
referential_integrity_cleanup Full referential integrity pass: removes orphaned records across all tables — unused stops, expired calendars, routes without trips, and unlinked shape data
route_fingerprinting Calculates a content-based fingerprint for each route derived from the geographic coordinates of its stops — used for cross-source route matching
trip_fingerprinting Computes stable fingerprints for each trip — a primary and a secondary variant with different parameter sets — enabling reliable cross-cycle tracking even when provider IDs change between updates
stop_sequence_normalization Renumbers stop sequences to start at 1 and increment monotonically, eliminating gaps and non-standard ordering across all trips
trip_departure_ordering Assigns a stable daily sort order to trips within each route group based on first departure time, used for deterministic API response ordering
stop_fingerprinting Calculates a stable stop fingerprint based on name, coordinates, and transport type — the basis for cross-source stop identity matching
shape_dist_computation Calculates and inserts shape_dist_traveled values in stop_times when shapes are present but distances are missing
final_integrity_check Terminal consistency pass run after all hashing is complete, ensuring no orphaned references remain before the feed is packaged for export
Statistics & Derivative Generation 5 rules
Rule What it does
feed_statistics Computes per-feed statistics: route count, stop count, agency count, total shape distance, and geographic bounding box
stop_route_index Builds an inverted index of routes serving each stop — powers stop detail pages and API endpoints
trip_direction_computation Computes the compass bearing of each trip from its first to last stop — used for directional filtering in real-time vehicle matching
feed_packaging Packages and exports the final processed output: improved GTFS archive, GeoJSON shapes, and MobilityData validation report
artifact_publication Publishes all pipeline artifacts to the content delivery platform and API servers; triggers cache invalidation to ensure downstream consumers receive the latest data

Real-time processing pipeline

Input
1,000+ GTFS-RT
Vehicle positions, trip updates
Stage 1
Parse & Validate
Protobuf decode
Stage 2
Geographic Filter
Discard out-of-region
Stage 3
ID Normalization
Route & stop ID remapping
Stage 4
Trip Matching
Multi-signal matching
Stage 5
Arrival Prediction
Interpolation for gaps
Output
Real-time API
Matched positions & predictions

GTFS-RT synchronization is one of the hardest problems in transit data engineering. Every provider uses its own internal IDs, formats, and naming conventions — which must be matched against the static schedule in real time. Our pipeline includes dedicated real-time matching rules on top of the static processing:

  • Geographic sanity check on vehicle positions — coordinates outside the service region are automatically discarded (real-world example: a London feed reporting buses in the US)
  • Automatic replacement of incorrect route IDs and stop references based on historical movement data and manually curated mapping tables
  • Multi-signal trip matching across dozens of parameters: route ID, stop sequence, scheduled time, vehicle bearing, and geographic position
  • Arrival time prediction algorithms for stops where real-time data is delayed or missing
  • Match quality scoring using stop locations, route shapes, and current vehicle position
  • Stable trip hashes (hash1) in static data ensure GTFS-RT matching remains accurate even after full feed updates
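The geographic sanity check from the first bullet reduces to a cheap per-position test against the feed's service region. The production boundary may be a more precise polygon; the bounding box and London values below are illustrative:

```python
def in_service_region(lat: float, lon: float,
                      bbox: tuple[float, float, float, float]) -> bool:
    """bbox = (min_lat, min_lon, max_lat, max_lon) covering the region.
    Positions outside the box are discarded before trip matching."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

# Illustrative box around Greater London (not production values).
LONDON_BBOX = (51.2, -0.6, 51.8, 0.4)
```

Because the test is a handful of comparisons, it can run on every incoming vehicle position before any expensive matching work, so a London feed reporting buses in the US is rejected at the door.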

Country-specific processing

In addition to the universal pipeline, we develop custom processing rules for individual countries that account for local data formats, coding standards, and operator-specific quirks. Country rules run as an additional layer on top of the base pipeline and are tailored to each provider's actual data quality issues.

UK (NAPTAN integration): Enriches stops with names, coordinates, ATCO codes, and platform data from the national NAPTAN database. Adds parent stations and entrances. Normalizes operator names using the National Operator Code (NOC) register.
Ireland: Enriches stops with NAPTAN Ireland data. Adds manually collected train station entrances and exits for stations not covered by official datasets.
Stop ATCO code correction: Fixes incorrect ATCO codes when a provider uses non-standard identifiers, which would otherwise break real-time data matching against vehicle position streams.
Traveline enrichment: Applies regional Traveline reports to geographically delimit overlapping dataset boundaries and improve GTFS-RT match rates for UK bus data.
Custom rules per client: For enterprise customers requiring specific data transformations or output formats, we implement dedicated processing rules scoped to their feed group.

Access clean, production-ready GTFS data via API

Our API serves data that has passed this complete three-stage quality pipeline — not raw feeds, but fully validated, deduplicated, and enriched datasets. No broken references, no format errors, no missing required fields. The API is engineered for long-term stability: consistent schemas, high uptime, and zero tolerance for corrupt data. Every dataset we publish is something we stand behind.

View API documentation