Data quality pipeline

How we process every GTFS feed before it reaches you

Raw GTFS data published by transit agencies is rarely production-ready. It contains duplicates, missing required fields, invalid coordinates, broken calendars, and format violations. Our three-stage automated pipeline processes 7,000+ feeds from transit operators worldwide: first downloading and normalizing each feed, then consolidating feeds from multiple operators into a single coherent dataset, then applying 80+ correction, enrichment, and validation rules — so you integrate clean, consistent data on the first attempt.

Active development

This pipeline is under continuous development — new rules, improved algorithms, and data quality fixes are shipped every week. The data you receive through our API is always the result of the latest validated pipeline version, deployed via a blue-green release strategy for zero-downtime updates and high availability.

Weekly
Pipeline releases
New processing version with rule improvements deployed every week
~7 days
Static GTFS refresh
All 7,000+ feeds fully reprocessed and published on a weekly cycle
15–60 sec
Real-time data latency
GTFS-RT vehicle positions and trip updates ingested live from 1,000+ feeds

Static GTFS processing pipeline

Input
7,000+ Raw GTFS
Transit operator feeds
Stage 0
Feed Acquisition
15 steps
Stage 1
Feed Consolidation
11 steps
Intermediate
Unified GTFS
1 feed
Stage 2 · Phase 1
Normalize & Correct
22 rules
Stage 2 · Phase 2
Enrich
7 rules
Stage 2 · Phase 3
Deduplicate
9 rules
Stage 2 · Phase 4
Validate & Hash
20 rules
Stage 2 · Phase 5
Generate derivatives
5 rules
Output
Improved GTFS
+ GeoJSON & stats

Stage 0: Feed Acquisition & Pre-processing

Before consolidation or quality rules run, raw GTFS data is downloaded from operator sources and passed through a pre-processing layer that makes each feed structurally parseable and safe for downstream processing. Feeds that have not changed since the previous cycle are detected via content hashing and skipped entirely, eliminating unnecessary reprocessing.

01
Download and duplicate detection
Downloads the feed from the operator's URL. Computes a content hash of all files; if the hash matches the previously stored value, the feed is marked unchanged and processing halts — preventing redundant reprocessing of identical data.
02
Archive structure normalization
Removes nested folders so GTFS files are accessible at the archive root. Extracts nested ZIP archives to the root level. Halts processing if more than one subfolder or ZIP is found at the same level, as this indicates an unsupported multi-archive structure.
03
File format normalization
Renames .csv files to .txt. Removes non-GTFS files, empty files, and files that contain only a header row with no data. Removes files where all data rows contain only empty values.
04
Required file presence check
Verifies that the minimum required GTFS files exist: routes.txt, trips.txt, stops.txt, and stop_times.txt. Also checks that at least one of calendar.txt or calendar_dates.txt is present. Missing required files halt processing.
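A minimal version of this check, assuming the archive's member names have already been collected into a set:

```python
# Required files per the GTFS reference; a feed also needs at least
# one of the two calendar files to define when services run.
REQUIRED = {"routes.txt", "trips.txt", "stops.txt", "stop_times.txt"}
CALENDAR_ANY = {"calendar.txt", "calendar_dates.txt"}

def missing_required_files(names: set[str]) -> list[str]:
    """Return the list of structural problems found; an empty list
    means the feed meets the minimum and processing may continue."""
    problems = [f"missing required file: {m}" for m in sorted(REQUIRED - names)]
    if not (CALENDAR_ANY & names):
        problems.append("need at least one of calendar.txt or calendar_dates.txt")
    return problems
```

In the pipeline, a non-empty result at this point halts the cycle, since later stages cannot produce meaningful output from a structurally incomplete feed.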
05
Line ending and encoding repair
Converts Windows CRLF line endings to Unix LF. Fixes 122 common character sequences that result from UTF-8 data being misinterpreted as latin1. Converts UTF-16 and other exotic encodings to UTF-8 using a two-method detection approach. Removes null bytes and binary artifacts from all files.
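The core of the latin1-mojibake repair is a reversal of the original mistake: re-encode the garbled text as latin1 bytes, then decode those bytes as UTF-8. The production rule targets 122 known sequences; the generic round-trip below is a simplified sketch that only applies the fix when the round trip succeeds:

```python
def repair_mojibake(text: str) -> str:
    """Reverse the classic UTF-8-read-as-latin1 corruption.
    If the text is not a latin1-mojibake pattern, the round trip
    raises and the original value is returned unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

For example, "ZÃ¼rich" round-trips back to "Zürich", while already-correct text passes through untouched because its latin1 bytes are not valid UTF-8 sequences.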
06
Text field cleanup
Strips leading and trailing whitespace from all string fields. Removes orphaned quotation marks (lines that contain exactly one double quote). Removes newline characters embedded inside quoted CSV fields. Replaces literal \n escape sequences that appear as raw text inside field values with a space.
07
Header and quoting normalization
Removes spaces from file header rows (column names). Removes spurious quotation marks that wrap entire CSV rows — a common export artifact that breaks standard CSV parsers.
08
CSV structure repair
Validates each GTFS file against the CSV specification. For files with mismatched quote characters, applies a multi-step structural repair. For files with inconsistent column counts, trims excess columns to match the header. For files with bare unescaped quotes in non-quoted fields, applies an additional repair pass. Each fix is re-validated; failed fixes roll back to the original.
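The column-count repair can be illustrated as follows. This is a simplified sketch of one fix from the step above; the production pass re-validates each repaired file and rolls back if the fix fails:

```python
import csv
import io

def trim_excess_columns(raw: str) -> str:
    """Trim data rows that carry more fields than the header row,
    so every row parses at the header's width."""
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return raw
    width = len(rows[0])          # header defines the expected width
    fixed = [row[:width] for row in rows]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(fixed)
    return out.getvalue()
```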
09
Statistics collection
Counts the number of rows in each GTFS file and records the detected character encoding. Extracts the earliest start date and latest end date from calendar.txt and calendar_dates.txt. All statistics are written to the database for operational monitoring and cycle management.
10
Stop times sort
Sorts stop_times.txt by trip_id and then by stop_sequence to ensure deterministic processing order for all downstream rules that iterate over stop sequences.
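The sort itself is short, but one detail matters: stop_sequence must be compared numerically, since a lexicographic sort would place sequence 10 before sequence 2. A sketch:

```python
import csv
import io

def sort_stop_times(raw: str) -> str:
    """Sort stop_times rows by trip_id, then numeric stop_sequence,
    so downstream rules see each trip's stops in travel order."""
    reader = csv.DictReader(io.StringIO(raw))
    rows = sorted(reader, key=lambda r: (r["trip_id"], int(r["stop_sequence"])))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```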
11
Operator-specific custom scripts
For feeds that require transformations beyond the generic rules, operator-specific scripts handle source-level quirks: encoding fixes, custom archive layouts, proprietary ID schemes, and integration with external reference datasets (NAPTAN, NOC register, Traveline, and others).
Error handling: Feeds that fail structure checks (missing required files, unsupported archive layout, unchanged hash) are halted immediately and logged to the database. Non-fatal issues (encoding conversion failures, CSV repair failures) are logged and processing continues with the best available version of the file.

Stage 1: Feed Consolidation

Before any quality rules run, we consolidate feeds from multiple transit operators into a single coherent GTFS file. This is the most complex stage of the pipeline: it handles timezone normalization, spatial stop deduplication, calendar merging, and translation inheritance across all source datasets.

01
Build auxiliary index maps
Before any transformation, the system builds four lookup tables: agency_id → timezone (to determine the dominant timezone), service_id → ServiceInfo (operating days and exceptions), trip_id → TripInfo (timezone, first departure, transport mode), and stop_id → StopInfo (serving routes, transport modes).
02
Detect feed languages
Reads feed_info.txt and agency.txt to determine the language of each source. If sources use different languages, feed_lang is set to mul (multilingual). All translations.txt files are loaded for later processing.
03
Merge calendars with DST-aware timezone conversion
The most technically challenging step. Every trip's service period is split into sub-periods at daylight saving time (DST) transition boundaries — separately for both the source timezone and the target (dominant) timezone. For each sub-period, time_shift (in seconds) and days_shift are computed and applied. Inconsistent DST boundaries produce an error and halt processing. Unique service IDs are generated for each sub-period to prevent collisions.
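The offset arithmetic behind time_shift can be sketched with Python's zoneinfo. The function below is illustrative only; the production step also computes days_shift and generates unique service IDs per sub-period:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def time_shift_seconds(service_day_noon: datetime,
                       source_tz: str, target_tz: str) -> int:
    """Seconds to add to a local time in source_tz so the same instant
    reads correctly as a local time in target_tz, on the given day.
    The shift changes at DST boundaries, which is exactly why service
    periods must first be split into DST-consistent sub-periods."""
    src = service_day_noon.replace(tzinfo=ZoneInfo(source_tz))
    tgt = service_day_noon.replace(tzinfo=ZoneInfo(target_tz))
    return int((tgt.utcoffset() - src.utcoffset()).total_seconds())
```

A concrete case: Phoenix does not observe DST while Denver does, so the shift between them is +3600 seconds in summer but 0 in winter. A single service period spanning both would get a wrong shift on one side of the boundary, hence the sub-period split.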
04
Deduplicate and merge stops (spatial + semantic)
Stop deduplication uses a spatial index to find candidates within transport-mode-specific distance thresholds (30–715 m for bus/metro/tram, up to 700 m for ferries, 1–3 m for cable cars). Name similarity is verified using a configurable string-similarity metric. Merge is blocked when stops have different platform codes, different transport modes, or both stops are served by the same trip (which would create a loop). A special case handles station (location_type=1) / platform (location_type=0) pairs: instead of merging, the platform is assigned as a child of the station. The stop closest to the geometric center of a duplicate group is kept as the canonical record.
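A stripped-down version of the candidate test, using haversine distance and difflib string similarity. The production metric and thresholds are configurable and mode-specific; the values below are illustrative defaults:

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def merge_candidates(a: tuple, b: tuple,
                     max_dist_m: float = 30.0,
                     min_name_sim: float = 0.85) -> bool:
    """a, b: (name, lat, lon). Two stops are merge candidates when they
    are close enough for the mode AND their names are similar enough."""
    if haversine_m(a[1], a[2], b[1], b[2]) > max_dist_m:
        return False
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio() >= min_name_sim
```

The production rule additionally checks the blocking conditions described above (platform codes, transport modes, shared trips) before any merge is committed.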
05
Merge trips with DST sub-trip splitting
Trips that span a DST transition boundary are split into one record per sub-period, each with a new unique trip_id and the corresponding service_id. Translations referencing the original trip are duplicated for all sub-trips and the original translation is removed.
06
Merge stop_times and frequencies with time shifts
Each stop_times record is updated: stop_id is replaced via the deduplication mapping. For DST-split trips, records are duplicated for each sub-trip and departure/arrival times are shifted by time_shift seconds. The same logic applies to frequencies.txt headway-based trips.
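Shifting a time value must preserve GTFS extended-time semantics, where hours run past 23 for trips that continue after midnight. A sketch:

```python
def shift_gtfs_time(hms: str, shift_seconds: int) -> str:
    """Shift a GTFS time value, keeping the extended format: hours may
    exceed 23 for service past midnight (e.g. '25:10:00')."""
    h, m, s = (int(part) for part in hms.split(":"))
    total = h * 3600 + m * 60 + s + shift_seconds
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
```

Collapsing "24:50:00" back to "00:50:00" would silently move the stop to the wrong service day, which is why the extended format must survive the shift.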
07
Merge remaining tables
All other tables are concatenated with targeted fixes: agency_timezone is normalized to the dominant timezone for all agencies; transfers.txt and pathways.txt have stop IDs remapped through the deduplication mapping; all other tables (routes, shapes, fare_attributes, fare_rules, levels, attributions) are merged without modification.
08
Finalize translations
All accumulated translations (copied, removed, updated across steps 5–7) are collected, deduplicated, sorted by (table_name, field_name, field_value, record_id, language), and written to the final translations.txt.
09
Create feed_info and provenance mapping
Writes feed_info.txt (publisher, URL, language, date range covering all active calendars, version timestamp). Creates feed_version.mapping — a traceability table linking each record in the output to its source dataset, enabling full provenance tracking.
10
Merge provenance mapping files
Combines .mapping files from all sources. stops.mapping applies the stop deduplication remapping; trips.mapping expands each original trip ID into all its DST-split sub-trips; agency.mapping and routes.mapping are merged without ID changes.
11
Duplicate route removal (optional)
When enabled for a region, performs a full preliminary merge, then runs a pairwise route similarity comparison. Routes with similarity ≥ threshold (default 99.99%, configurable per region) are considered duplicates. The winner is chosen by priority: real-time sources first, then by manual deduplication_priority score. A duplication report is generated for review.
Error handling: Fatal errors (trip referencing non-existent service, inconsistent DST boundaries, unknown stop in stop_times) halt the merge immediately. Non-fatal warnings (service without trips, missing translations) are logged and processing continues.

Stage 2: what each phase does

Error Correction
22 rules
Fixes format violations, fills missing required fields with safe defaults, corrects broken calendars, normalizes names and coordinates, and removes logically invalid records — without discarding valid data.
ID normalization Calendar repair Time consistency Name cleanup
Data Enrichment
7 rules
Adds data not present in the original feed: fills missing stops referenced in stop_times, adds entrances and exits to stations, links stops to cities, and populates headsigns from last-stop names.
Missing stops Station entrances Stop–city links Agency enrichment
Deduplication
9 rules
Detects and merges duplicate stops, trips, routes, and calendars both within a single feed and across overlapping regional feeds. Calendars are rebuilt to cover the union of working days without data loss.
Stop dedup Trip dedup Calendar merge Route dedup
Validation
11 rules
Validates URLs, emails, timezone identifiers, route colors, transfer rules, and arrival/departure time ordering. Invalid records are corrected where possible, or flagged and removed if they cannot be repaired.
URL/email check Timezone validation Color validation Transfer rules
Stable Hashing
9 rules
Computes content-based hashes for stops, trips, and routes that are minimally affected by provider data updates. Stable IDs enable reliable GTFS-RT matching and historical change tracking across update cycles.
Trip hash (hash1) Stop hash Route hash Sort order hash
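The exact parameter set behind hash1 is internal to the pipeline, but the idea can be illustrated: hash schedule content that survives provider updates (rounded coordinates, departure time, route name) instead of provider-assigned IDs. The field choice below is an assumption for illustration, not the production definition:

```python
import hashlib

def stable_trip_hash(route_short_name: str,
                     first_departure: str,
                     stop_coords: list[tuple[float, float]]) -> str:
    """Illustrative stable trip hash: derived from schedule content
    rather than the provider trip_id, so it survives ID churn between
    feed updates. Coordinates are rounded so sub-meter provider jitter
    does not change the hash."""
    coords = ";".join(f"{lat:.4f},{lon:.4f}" for lat, lon in stop_coords)
    payload = f"{route_short_name}|{first_departure}|{coords}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

The trade-off is deliberate: a hash built this way changes only when the schedule itself materially changes, which is what makes cross-cycle GTFS-RT matching and change tracking possible.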
Derivative Generation
5 rules
Generates feed-level statistics, per-stop route lists, GeoJSON shapes (geometrically simplified), trip bearing data, and MobilityData validation reports alongside the final improved GTFS ZIP.
Feed statistics GeoJSON shapes MobilityData report Trip bearing

Complete rule catalogue

Rules execute in order within each phase. Click any group to expand.

Feed Acquisition & Pre-processing 15 steps
Step What it does
archive_normalization Unwraps nested folders and ZIP packages inside archives, ensuring all GTFS files are accessible at the archive root before any other processing begins
change_detection Fingerprints the incoming feed and skips the full processing cycle when content is unchanged since the previous run, keeping pipeline throughput high across 7,000+ feeds
file_extension_normalization Renames .csv files to .txt to conform to GTFS naming conventions
unnecessary_file_removal Removes non-GTFS files, header-only files, and files consisting entirely of empty values
required_files_check Verifies that all required GTFS files are present before processing continues; halts the cycle early if the feed is structurally incomplete
text_normalization Eliminates text-formatting artifacts across all files: Windows line endings, newlines embedded inside field values, and literal escape sequences that break CSV parsing
csv_quote_repair Removes malformed quotation marks — orphaned mid-row quotes and spurious full-row wraps — a pervasive export artifact that breaks standard CSV parsers
whitespace_normalization Strips leading and trailing whitespace from all string fields, including column headers
encoding_artifact_repair Repairs UTF-8/latin1 double-encoding artifacts in name fields, covering patterns observed across thousands of real-world feeds
binary_artifact_removal Removes null bytes and binary artifacts from all text files
utf8_normalization Detects non-UTF-8 encodings and converts all files to UTF-8, covering the full range of legacy and regional character sets
feed_statistics_capture Records per-file row counts, character encodings, and calendar date ranges to the monitoring database at the start of each processing cycle
csv_structure_repair Validates and repairs CSV structure — mismatched quotes, inconsistent column counts, unescaped characters; rolls back to the original if repair fails
stop_times_sort Sorts stop_times.txt by trip and stop sequence to guarantee deterministic input order for all downstream rules
operator_normalization Applies operator-specific corrections for known non-standard exports from selected agencies
Error Correction 22 rules
Rule What it does
id_normalization Normalizes all primary and foreign key values across tables, resolving non-standard characters that break referential integrity
location_type_inference Infers location_type from stop usage context when the field is absent, defaulting to platform-level where appropriate
name_sanitization Sanitizes all name fields by removing control characters, non-standard Unicode sequences, and extraneous whitespace
stop_description_cleanup Clears stop_desc when its content duplicates stop_name, eliminating noise that misleads downstream consumers
default_stop_name Assigns a valid placeholder name when stop_name is absent, ensuring no stop enters the pipeline without an identifier
calendar_repair Repairs calendar integrity issues: expired service periods, reversed start/end dates, and contradictory day-of-week flags
calendar_canonicalization Canonicalizes calendar records to their minimal equivalent representation, reducing row count while preserving the exact service pattern
route_name_derivation Derives route_long_name from the first and last stop of the representative trip when the original value is missing or clearly incorrect
route_name_promotion Promotes content from route_long_name to route_short_name when the short name field is empty and the value is compact enough to serve that role
wheelchair_defaults Standardizes wheelchair_boarding to "unknown" on all stops lacking an explicit accessibility declaration, ensuring consistent output schema
default_route_type Assigns a contextually derived route_type when the field is absent or contains an unrecognized value
route_url_cleanup Removes route_url when it duplicates agency_url, eliminating redundant data
arrival_departure_correction Detects and corrects stop_times rows where arrival is recorded after departure — a systematic error in certain scheduling export tools
agency_name_recovery Assigns a canonical name to agencies missing one and flags the feed for manual review
cross_midnight_normalization Normalizes cross-midnight departure times into valid GTFS extended-time format without altering any actual schedule values
parent_station_relocation Relocates parent stations that are geographically distant from their child stops to a more accurate position
route_color_defaults Derives route_text_color from background luminance when the field is missing, ensuring readable contrast on all display surfaces
trip_headsign_derivation Populates trip_headsign from the terminal stop name when the provider has not specified it
coordinate_precision Rounds stop coordinates to six decimal places (roughly 0.1 m resolution), reducing data volume with no practical loss of geospatial accuracy
timepoint_defaults Fills missing timepoint fields with the GTFS default (exact timing), preventing ambiguity for downstream consumers
name_character_cleanup Final name-field pass: removes residual non-alphanumeric characters that survived earlier normalization stages
feed_language_correction Resolves mismatches between the declared feed_lang value and the language detected in feed content
Data Enrichment 7 rules
Rule What it does
missing_stop_reconstruction Reconstructs stop records referenced in stop_times but absent from stops.txt, preventing broken trip references and downstream failures
route_data_enrichment Applies a curated correction table maintained by our data team to override incorrect vehicle types, route names, and agency assignments
station_topology_enrichment Supplements feeds with station entrance and exit data (location_type=2) sourced from our global hub database
stop_data_correction Applies a curated stop correction table to fix coordinates and names — particularly effective for NAPTAN-derived and other official-source datasets with systematic errors
agency_contact_enrichment Fills missing agency contact details from a maintained cross-reference database
stop_mode_classification Derives the dominant transport mode for each stop from the set of routes serving it — used for map rendering and mode-specific filtering
stop_city_resolution Resolves each stop to its containing city using geographic analysis — the foundation for city-level data exports and API filtering
Deduplication 9 rules
Rule What it does
route_name_deduplication Removes route_desc and route_long_name when their content duplicates route_short_name, eliminating noise that misleads display logic
calendar_deduplication Identifies and merges structurally identical calendars within a feed, rebuilding service ID references across all dependent tables
duplicate_stop_removal Removes duplicate stop records within a feed, retaining the entry with the most complete field coverage
global_stop_deduplication Multi-signal stop deduplication: resolves stop identity across update cycles using a combination of ID, name, and coordinate signals, maintaining stable global stop references over time
directional_stop_deduplication Identifies duplicate stops positioned on opposite sides of the same road using directional analysis — a common artifact when stops are sourced from multiple operators
stop_times_deduplication Removes duplicate stop_times rows produced when the same trip visits the same stop at the same time — typically a scheduling export artifact
route_deduplication Merges routes sharing an identical stop-sequence fingerprint across different source feeds or agencies, producing a single canonical route record
trip_deduplication Multi-pass trip deduplication: resolves trips by route and schedule fingerprint, then by schedule alone across routes; merges calendars when service patterns overlap
shape_optimization Reduces shape point density using geometric simplification and removes redundant shape entries, cutting GeoJSON payload without visible quality loss
Validation & Cleanup 11 rules
Rule What it does
arrival_departure_validation Validates all stop_times for arrival/departure sequence violations; corrects where possible, removes irrecoverable rows
orphaned_trip_removal Removes trips that have no valid service period after calendar cleanup — trips that can never run
location_type_validation Removes trips that reference stops with incompatible location types (for example, referencing a station entrance inside stop_times)
url_email_validation Validates and sanitizes URL and email fields across agencies and routes; removes values that fail format checks
agency_url_defaults Assigns a placeholder URL to agencies missing a website, satisfying the GTFS required field constraint without breaking downstream consumers
timezone_validation Corrects invalid IANA timezone identifiers in stop_timezone and agency_timezone fields
speed_anomaly_removal Identifies and removes trips where the implied speed between consecutive stops is physically impossible for the declared vehicle type
route_color_validation Validates route_color and route_text_color hex values; removes or corrects invalid color codes
single_stop_trip_removal Removes trips containing only a single stop — logically invalid and unrepresentable as real transit service
transfer_validation Validates transfer rules and removes entries referencing stops or routes that do not exist in the feed
timezone_geographic_cross_check Cross-validates agency timezone declarations against stop geographic coordinates to detect and correct systematic timezone mismatches
Stable Hashing & Normalization 9 rules
Rule What it does
route_type_normalization Normalizes extended GTFS route type codes to the base specification values for maximum compatibility across consumers
referential_integrity_cleanup Full referential integrity pass: removes orphaned records across all tables — unused stops, expired calendars, routes without trips, and unlinked shape data
route_fingerprinting Calculates a content-based fingerprint for each route derived from the geographic coordinates of its stops — used for cross-source route matching
trip_fingerprinting Computes stable fingerprints for each trip — a primary and a secondary variant with different parameter sets — enabling reliable cross-cycle tracking even when provider IDs change between updates
stop_sequence_normalization Renumbers stop sequences to start at 1 and increment monotonically, eliminating gaps and non-standard ordering across all trips
trip_departure_ordering Assigns a stable daily sort order to trips within each route group based on first departure time, used for deterministic API response ordering
stop_fingerprinting Calculates a stable stop fingerprint based on name, coordinates, and transport type — the basis for cross-source stop identity matching
shape_dist_computation Calculates and inserts shape_dist_traveled values in stop_times when shapes are present but distances are missing
final_integrity_check Terminal consistency pass run after all hashing is complete, ensuring no orphaned references remain before the feed is packaged for export
Statistics & Derivative Generation 5 rules
Rule What it does
feed_statistics Computes per-feed statistics: route count, stop count, agency count, total shape distance, and geographic bounding box
stop_route_index Builds an inverted index of routes serving each stop — powers stop detail pages and API endpoints
trip_direction_computation Computes the compass bearing of each trip from its first to last stop — used for directional filtering in real-time vehicle matching
feed_packaging Packages and exports the final processed output: improved GTFS archive, GeoJSON shapes, and MobilityData validation report
artifact_publication Publishes all pipeline artifacts to the content delivery platform and API servers; triggers cache invalidation to ensure downstream consumers receive the latest data

Real-time processing pipeline

Input
1,000+ GTFS-RT
Vehicle positions, trip updates
Stage 1
Parse & Validate
Protobuf decode
Stage 2
Geographic Filter
Discard out-of-region
Stage 3
ID Normalization
Route & stop ID remapping
Stage 4
Trip Matching
Multi-signal matching
Stage 5
Arrival Prediction
Interpolation for gaps
Output
Real-time API
Matched positions & predictions

GTFS-RT synchronization is one of the hardest problems in transit data engineering. Every provider uses its own internal IDs, formats, and naming conventions — which must be matched against the static schedule in real time. Our pipeline includes dedicated real-time matching rules on top of the static processing:

  • Geographic sanity check on vehicle positions — coordinates outside the service region are automatically discarded (real-world example: a London feed reporting buses in the US)
  • Automatic replacement of incorrect route IDs and stop references based on historical movement data and manually curated mapping tables
  • Multi-signal trip matching across dozens of parameters: route ID, stop sequence, scheduled time, vehicle bearing, and geographic position
  • Arrival time prediction algorithms for stops where real-time data is delayed or missing
  • Match quality scoring using stop locations, route shapes, and current vehicle position
  • Stable trip hashes (hash1) in static data ensure GTFS-RT matching remains accurate even after full feed updates
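The geographic sanity check from the first bullet reduces to a cheap per-position test against the feed's service region. The production boundary may be a more precise polygon; the bounding box and London values below are illustrative:

```python
def in_service_region(lat: float, lon: float,
                      bbox: tuple[float, float, float, float]) -> bool:
    """bbox = (min_lat, min_lon, max_lat, max_lon) covering the region.
    Positions outside the box are discarded before trip matching."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

# Illustrative box around Greater London (not production values).
LONDON_BBOX = (51.2, -0.6, 51.8, 0.4)
```

Because the test is a handful of comparisons, it can run on every incoming vehicle position before any expensive matching work, so a London feed reporting buses in the US is rejected at the door.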

Country-specific processing

In addition to the universal pipeline, we develop custom processing rules for individual countries that account for local data formats, coding standards, and operator-specific quirks. Country rules run as an additional layer on top of the base pipeline and are tailored to each provider's actual data quality issues.

UK (NAPTAN integration): Enriches stops with names, coordinates, ATCO codes, and platform data from the national NAPTAN database. Adds parent stations and entrances. Normalizes operator names using the National Operator Code (NOC) register.
Ireland: Enriches stops with NAPTAN Ireland data. Adds manually collected train station entrances and exits for stations not covered by official datasets.
Stop ATCO code correction: Fixes incorrect ATCO codes when a provider uses non-standard identifiers, which would otherwise break real-time data matching against vehicle position streams.
Traveline enrichment: Applies regional Traveline reports to geographically delimit overlapping dataset boundaries and improve GTFS-RT match rates for UK bus data.
Custom rules per client: For enterprise customers requiring specific data transformations or output formats, we implement dedicated processing rules scoped to their feed group.

Access clean, production-ready GTFS data via API

Our API serves data that has passed this complete three-stage quality pipeline — not raw feeds, but fully validated, deduplicated, and enriched datasets. No broken references, no format errors, no missing required fields. The API is engineered for long-term stability: consistent schemas, high uptime, and zero tolerance for corrupt data. Every dataset we publish is something we stand behind.

View API documentation