How we process every GTFS feed before it reaches you
Raw GTFS data published by transit agencies is rarely production-ready. It contains duplicates, missing required fields, invalid coordinates, broken calendars, and format violations. Our three-stage automated pipeline processes 7,000+ feeds from transit operators worldwide: first downloading and normalising each feed, then consolidating feeds from multiple operators into a single coherent dataset, then applying 80+ correction, enrichment, and validation rules — so you integrate clean, consistent data on the first attempt.
This pipeline is under continuous development — new rules, improved algorithms, and data quality fixes are shipped every week. The data you receive through our API is always the result of the latest validated pipeline version, deployed via a blue-green release strategy for zero-downtime updates and high availability.
Static GTFS processing pipeline
Stage 0: Feed Acquisition & Pre-processing
Before consolidation or quality rules run, raw GTFS data is downloaded from operator sources and passed through a pre-processing layer that makes each feed structurally parseable and safe for downstream processing. Feeds that have not changed since the previous cycle are detected via content hashing and skipped entirely, eliminating unnecessary reprocessing.
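As a rough illustration of change detection via content hashing, the sketch below fingerprints a GTFS archive by hashing its member names and contents. The function names and the choice of SHA-256 are assumptions made for this example, not a description of the production implementation.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Optional


def feed_fingerprint(archive_path: Path) -> str:
    """Return a SHA-256 digest over the member names and contents of a GTFS ZIP."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(archive_path) as archive:
        # Sort names so the fingerprint does not depend on member order.
        for name in sorted(archive.namelist()):
            digest.update(name.encode("utf-8"))
            digest.update(archive.read(name))
    return digest.hexdigest()


def has_changed(archive_path: Path, previous_fingerprint: Optional[str]) -> bool:
    """True when the feed differs from the fingerprint stored in the previous cycle."""
    return feed_fingerprint(archive_path) != previous_fingerprint
```

Hashing the decompressed members rather than the raw ZIP bytes means a re-zipped archive with new timestamps but identical files still counts as unchanged.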
- Renames .csv files to .txt. Removes non-GTFS files, empty files, and files that contain only a header row with no data. Removes files where all data rows contain only empty values.
- Verifies the presence of routes.txt, trips.txt, stops.txt, and stop_times.txt. Also checks that at least one of calendar.txt or calendar_dates.txt is present. Missing required files halt processing.
- Replaces literal \n escape sequences that appear as raw text inside field values with a space.
- Records feed statistics, including the date ranges covered by calendar.txt and calendar_dates.txt. All statistics are written to the database for operational monitoring and cycle management.
- Sorts stop_times.txt by trip_id and then by stop_sequence to ensure deterministic processing order for all downstream rules that iterate over stop sequences.
Stage 1: Feed Consolidation
Before any quality rules run, we consolidate feeds from multiple transit operators into a single coherent GTFS dataset. This is the most complex step in the pipeline: it handles timezone normalization, spatial stop deduplication, calendar merging, and translation inheritance across all source datasets.
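Part of timezone normalization in this stage is shifting the arrival and departure times of DST-split trips by time_shift seconds. The sketch below shows one way such a shift could be applied to GTFS time strings, which may exceed 24:00:00 for trips that run past midnight. The helper names are hypothetical, and how days_shift interacts with the shifted times is not covered here.

```python
def parse_gtfs_time(value: str) -> int:
    """Convert an HH:MM:SS GTFS time (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_gtfs_time(total_seconds: int) -> str:
    """Convert seconds back to HH:MM:SS, keeping hours above 23 as GTFS allows."""
    if total_seconds < 0:
        # Assumption: a negative result must be handled by moving the service day.
        raise ValueError("shifted time falls before the start of the service day")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def shift_time(value: str, time_shift: int) -> str:
    """Apply a time_shift (in seconds) to a single arrival or departure time."""
    return format_gtfs_time(parse_gtfs_time(value) + time_shift)
```

For example, shift_time("23:30:00", 3600) yields "24:30:00", which stays within the same service day in GTFS extended-time notation.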
- Reads feed_info.txt and agency.txt to determine the language of each source. If sources use different languages, feed_lang is set to mul (multilingual). All translations.txt files are loaded for later processing.
- [...] time_shift (in seconds) and days_shift are computed and applied. Inconsistent DST boundaries produce an error and halt processing. Unique service IDs are generated for each sub-period to prevent collisions.
- [...] trip_id and the corresponding service_id. Translations referencing the original trip are duplicated for all sub-trips, and the original translation is removed.
- Each stop_times record is updated: stop_id is replaced via the deduplication mapping. For DST-split trips, records are duplicated for each sub-trip and departure/arrival times are shifted by time_shift seconds. The same logic applies to the headway-based trips in frequencies.txt.
- agency_timezone is normalized to the dominant timezone for all agencies; transfers.txt and pathways.txt have stop IDs remapped through the deduplication mapping; all other tables (routes, shapes, fare_attributes, fare_rules, levels, attributions) are merged without modification.
- [...] (table_name, field_name, field_value, record_id, language), and written to the final translations.txt.
- [...] feed_info.txt (publisher, URL, language, date range covering all active calendars, version timestamp). Creates feed_version.mapping, a traceability table linking each record in the output to its source dataset, enabling full provenance tracking.
- [...] .mapping files from all sources. stops.mapping applies the stop deduplication remapping; trips.mapping expands each original trip ID into all its DST-split sub-trips; agency.mapping and routes.mapping are merged without ID changes.
- [...] deduplication_priority score. A duplication report is generated for review.
Stage 2: what each phase does
Complete rule catalogue
Rules execute in order within each phase.
| Step | What it does |
| --- | --- |
| archive_normalization | Unwraps nested folders and ZIP packages inside archives, ensuring all GTFS files are accessible at the archive root before any other processing begins |
| change_detection | Fingerprints the incoming feed and skips the full processing cycle when content is unchanged since the previous run, keeping pipeline throughput high across 7,000+ feeds |
| file_extension_normalization | Renames .csv files to .txt to conform to GTFS naming conventions |
| unnecessary_file_removal | Removes non-GTFS files, header-only files, and files consisting entirely of empty values |
| required_files_check | Verifies that all required GTFS files are present before processing continues; halts the cycle early if the feed is structurally incomplete |
| text_normalization | Eliminates text-formatting artifacts across all files: Windows line endings, newlines embedded inside field values, and literal escape sequences that break CSV parsing |
| csv_quote_repair | Removes malformed quotation marks (orphaned mid-row quotes and spurious full-row wraps), a pervasive export artifact that breaks standard CSV parsers |
| whitespace_normalization | Strips leading and trailing whitespace from all string fields, including column headers |
| encoding_artifact_repair | Repairs UTF-8/latin1 double-encoding artifacts in name fields, covering patterns observed across thousands of real-world feeds |
| binary_artifact_removal | Removes null bytes and binary artifacts from all text files |
| utf8_normalization | Detects non-UTF-8 encodings and converts all files to UTF-8, covering the full range of legacy and regional character sets |
| feed_statistics_capture | Records per-file row counts, character encodings, and calendar date ranges to the monitoring database at the start of each processing cycle |
| csv_structure_repair | Validates and repairs CSV structure (mismatched quotes, inconsistent column counts, unescaped characters); rolls back to the original if repair fails |
| stop_times_sort | Sorts stop_times.txt by trip and stop sequence to guarantee deterministic input order for all downstream rules |
| operator_normalization | Applies operator-specific corrections for known non-standard exports from selected agencies |

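Two of the Stage 0 steps, required_files_check and stop_times_sort, are simple enough to sketch directly. The code below is a minimal illustration under the assumption that the feed's file names are available as a set and that stop_times.txt fits in memory; the function names are made up for this example.

```python
import csv
import io

REQUIRED_FILES = {"routes.txt", "trips.txt", "stops.txt", "stop_times.txt"}
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}


def missing_required_files(present: set) -> list:
    """List what is missing; an empty list means processing can continue."""
    missing = sorted(REQUIRED_FILES - present)
    # At least one of the two calendar files must exist.
    if not (CALENDAR_FILES & present):
        missing.append("calendar.txt or calendar_dates.txt")
    return missing


def sort_stop_times(csv_text: str) -> list:
    """Return stop_times rows ordered by trip_id, then numeric stop_sequence."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # stop_sequence is compared as an integer so that "10" sorts after "2".
    return sorted(rows, key=lambda r: (r["trip_id"], int(r["stop_sequence"])))
```

Sorting on the integer value rather than the raw string is the detail that matters here, since a lexical sort would place sequence 10 before sequence 2.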
| Rule | What it does |
| --- | --- |
| id_normalization | Normalizes all primary and foreign key values across tables, resolving non-standard characters that break referential integrity |
| location_type_inference | Infers location_type from stop usage context when the field is absent, defaulting to platform-level where appropriate |
| name_sanitization | Sanitizes all name fields by removing control characters, non-standard Unicode sequences, and extraneous whitespace |
| stop_description_cleanup | Clears stop_desc when its content duplicates stop_name, eliminating noise that misleads downstream consumers |
| default_stop_name | Assigns a valid placeholder name when stop_name is absent, ensuring no stop enters the pipeline without a name |
| calendar_repair | Repairs calendar integrity issues: expired service periods, reversed start/end dates, and contradictory day-of-week flags |
| calendar_canonicalization | Canonicalizes calendar records to their minimal equivalent representation, reducing row count while preserving the exact service pattern |
| route_name_derivation | Derives route_long_name from the first and last stop of the representative trip when the original value is missing or clearly incorrect |
| route_name_promotion | Promotes content from route_long_name to route_short_name when the short name field is empty and the value is compact enough to serve that role |
| wheelchair_defaults | Standardizes wheelchair_boarding to "unknown" on all stops lacking an explicit accessibility declaration, ensuring consistent output schema |
| default_route_type | Assigns a contextually derived route_type when the field is absent or contains an unrecognized value |
| route_url_cleanup | Removes route_url when it duplicates agency_url, eliminating redundant data |
| arrival_departure_correction | Detects and corrects stop_times rows where arrival is recorded after departure, a systematic error in certain scheduling export tools |
| agency_name_recovery | Assigns a canonical name to agencies missing one and flags the feed for manual review |
| cross_midnight_normalization | Normalizes cross-midnight departure times into valid GTFS extended-time format without altering any actual schedule values |
| parent_station_relocation | Relocates parent stations that are geographically distant from their child stops to a more accurate position |
| route_color_defaults | Derives route_text_color from background luminance when the field is missing, ensuring readable contrast on all display surfaces |
| trip_headsign_derivation | Populates trip_headsign from the terminal stop name when the provider has not specified it |
| coordinate_precision | Rounds stop coordinates to six decimal places (roughly 0.11 m of latitude), reducing data volume without meaningful loss of geospatial accuracy |
| timepoint_defaults | Fills missing timepoint fields with the GTFS default (exact timing), preventing ambiguity for downstream consumers |
| name_character_cleanup | Final name-field pass: removes residual non-alphanumeric characters that survived earlier normalization stages |
| feed_language_correction | Resolves mismatches between the declared feed_lang value and the language detected in feed content |

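To make the route_color_defaults rule concrete, here is a sketch of deriving a text colour from background luminance. It assumes the WCAG relative-luminance and contrast-ratio formulas and a choice between pure black and pure white; the source does not say which formula or palette the pipeline actually uses.

```python
def relative_luminance(hex_color: str) -> float:
    """Relative luminance of an RRGGBB colour (0.0 = black, 1.0 = white)."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve before weighting the channels.
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4 for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def derive_text_color(route_color: str) -> str:
    """Pick black or white, whichever contrasts more with the background."""
    lum = relative_luminance(route_color)
    contrast_with_white = 1.05 / (lum + 0.05)
    contrast_with_black = (lum + 0.05) / 0.05
    return "FFFFFF" if contrast_with_white > contrast_with_black else "000000"
```

GTFS stores colours as six hex digits without a leading "#", which is why the helper indexes the string directly.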
| Rule | What it does |
| --- | --- |
| missing_stop_reconstruction | Reconstructs stop records referenced in stop_times but absent from stops.txt, preventing broken trip references and downstream failures |
| route_data_enrichment | Applies a curated correction table maintained by our data team to override incorrect vehicle types, route names, and agency assignments |
| station_topology_enrichment | Supplements feeds with station entrance/exit and generic node records (location_type 2 and 3) sourced from our global hub database |
| stop_data_correction | Applies a curated stop correction table to fix coordinates and names; particularly effective for NaPTAN-derived and other official-source datasets with systematic errors |
| agency_contact_enrichment | Fills missing agency contact details from a maintained cross-reference database |
| stop_mode_classification | Derives the dominant transport mode for each stop from the set of routes serving it; used for map rendering and mode-specific filtering |
| stop_city_resolution | Resolves each stop to its containing city using geographic analysis; the foundation for city-level data exports and API filtering |

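The stop_mode_classification rule can be pictured as a counting problem. The sketch below assumes "dominant" means the route_type shared by the most routes serving a stop, with ties broken by the lower route_type code; both the definition and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def dominant_modes(stop_routes, route_types):
    """Map each stop_id to the most common route_type among its routes.

    stop_routes: iterable of (stop_id, route_id) pairs.
    route_types: dict of route_id -> GTFS route_type code.
    """
    counts = defaultdict(Counter)
    for stop_id, route_id in stop_routes:
        counts[stop_id][route_types[route_id]] += 1
    # Highest count wins; ties go to the lower route_type code for determinism.
    return {
        stop_id: min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for stop_id, counter in counts.items()
    }
```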
| Rule | What it does |
| --- | --- |
| route_name_deduplication | Removes route_desc and route_long_name when their content duplicates route_short_name, eliminating noise that misleads display logic |
| calendar_deduplication | Identifies and merges structurally identical calendars within a feed, rebuilding service ID references across all dependent tables |
| duplicate_stop_removal | Removes duplicate stop records within a feed, retaining the entry with the most complete field coverage |
| global_stop_deduplication | Multi-signal stop deduplication: resolves stop identity across update cycles using a combination of ID, name, and coordinate signals, maintaining stable global stop references over time |
| directional_stop_deduplication | Identifies duplicate stops positioned on opposite sides of the same road using directional analysis, a common artifact when stops are sourced from multiple operators |
| stop_times_deduplication | Removes duplicate stop_times rows produced when the same trip visits the same stop at the same time, typically a scheduling export artifact |
| route_deduplication | Merges routes sharing an identical stop-sequence fingerprint across different source feeds or agencies, producing a single canonical route record |
| trip_deduplication | Multi-pass trip deduplication: resolves trips by route and schedule fingerprint, then by schedule alone across routes; merges calendars when service patterns overlap |
| shape_optimization | Reduces shape point density using geometric simplification and removes redundant shape entries, cutting GeoJSON payload without visible quality loss |

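As an illustration of calendar_deduplication, the sketch below treats two calendar.txt rows as identical when every field except service_id matches, keeps the first occurrence, and returns a remapping that dependent tables such as trips.txt could apply. The function name and the "first one wins" policy are assumptions.

```python
def deduplicate_calendars(calendars):
    """Merge structurally identical calendar rows.

    calendars: list of dicts, one per calendar.txt row.
    Returns (kept_rows, remap) where remap maps every original
    service_id to the service_id that survives.
    """
    canonical = {}  # service pattern -> surviving service_id
    remap = {}
    kept = []
    for row in calendars:
        # The pattern is every field except the ID itself.
        pattern = tuple(sorted((k, v) for k, v in row.items() if k != "service_id"))
        if pattern not in canonical:
            canonical[pattern] = row["service_id"]
            kept.append(row)
        remap[row["service_id"]] = canonical[pattern]
    return kept, remap
```

Rebuilding references is then a lookup: each trip's service_id is replaced by remap[service_id].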
| Rule | What it does |
| --- | --- |
| arrival_departure_validation | Validates all stop_times for arrival/departure sequence violations; corrects where possible, removes irrecoverable rows |
| orphaned_trip_removal | Removes trips that have no valid service period after calendar cleanup (trips that can never run) |
| location_type_validation | Removes trips that reference stops with incompatible location types (for example, referencing a station entrance inside stop_times) |
| url_email_validation | Validates and sanitizes URL and email fields across agencies and routes; removes values that fail format checks |
| agency_url_defaults | Assigns a placeholder URL to agencies missing a website, satisfying the GTFS required field constraint without breaking downstream consumers |
| timezone_validation | Corrects invalid IANA timezone identifiers in stop_timezone and agency_timezone fields |
| speed_anomaly_removal | Identifies and removes trips where the implied speed between consecutive stops is physically impossible for the declared vehicle type |
| route_color_validation | Validates route_color and route_text_color hex values; removes or corrects invalid color codes |
| single_stop_trip_removal | Removes trips containing only a single stop, which are logically invalid and unrepresentable as real transit service |
| transfer_validation | Validates transfer rules and removes entries referencing stops or routes that do not exist in the feed |
| timezone_geographic_cross_check | Cross-validates agency timezone declarations against stop geographic coordinates to detect and correct systematic timezone mismatches |

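The speed_anomaly_removal rule depends on the implied speed between consecutive stops. The sketch below computes that speed from great-circle (haversine) distance and compares it with a per-mode ceiling. The threshold values are hypothetical placeholders, not the limits the pipeline applies.

```python
import math

EARTH_RADIUS_M = 6_371_000

# Hypothetical speed ceilings in km/h per GTFS route_type, for illustration only.
MAX_SPEED_KMH = {0: 100, 1: 120, 2: 400, 3: 130}
DEFAULT_MAX_SPEED_KMH = 200


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def has_speed_anomaly(stops, route_type):
    """stops: list of (lat, lon, seconds) tuples in stop_sequence order."""
    limit = MAX_SPEED_KMH.get(route_type, DEFAULT_MAX_SPEED_KMH)
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(stops, stops[1:]):
        elapsed = t2 - t1
        if elapsed <= 0:
            continue  # zero or reversed intervals are left to the timing rules
        speed_kmh = haversine_m(lat1, lon1, lat2, lon2) / elapsed * 3.6
        if speed_kmh > limit:
            return True
    return False
```

Straight-line distance underestimates the distance actually travelled, so a trip flagged this way is moving even faster along its real path.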
| Rule | What it does |
| --- | --- |
| route_type_normalization | Normalizes extended GTFS route type codes to the base specification values for maximum compatibility across consumers |
| referential_integrity_cleanup | Full referential integrity pass: removes orphaned records across all tables, including unused stops, expired calendars, routes without trips, and unlinked shape data |
| route_fingerprinting | Calculates a content-based fingerprint for each route derived from the geographic coordinates of its stops; used for cross-source route matching |
| trip_fingerprinting | Computes stable fingerprints for each trip (a primary and a secondary variant with different parameter sets), enabling reliable cross-cycle tracking even when provider IDs change between updates |
| stop_sequence_normalization | Renumbers stop sequences to start at 1 and increment monotonically, eliminating gaps and non-standard ordering across all trips |
| trip_departure_ordering | Assigns a stable daily sort order to trips within each route group based on first departure time, used for deterministic API response ordering |
| stop_fingerprinting | Calculates a stable stop fingerprint based on name, coordinates, and transport type; the basis for cross-source stop identity matching |
| shape_dist_computation | Calculates and inserts shape_dist_traveled values in stop_times when shapes are present but distances are missing |
| final_integrity_check | Terminal consistency pass run after all hashing is complete, ensuring no orphaned references remain before the feed is packaged for export |

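A stop fingerprint built from name, coordinates, and transport type could look like the sketch below. The normalization steps, the four-decimal coordinate rounding, and the truncated SHA-256 digest are all assumptions chosen for the example; the source does not specify how the real fingerprint is computed.

```python
import hashlib
import unicodedata


def stop_fingerprint(name: str, lat: float, lon: float, route_type: int, precision: int = 4) -> str:
    """Hash a normalized name, rounded coordinates, and mode into a short ID."""
    # Fold case and Unicode variants, then collapse runs of whitespace.
    normalized = unicodedata.normalize("NFKC", name).casefold()
    normalized = " ".join(normalized.split())
    payload = f"{normalized}|{lat:.{precision}f}|{lon:.{precision}f}|{route_type}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

One limitation of rounding: two points a few metres apart can still land on opposite sides of a rounding boundary and hash differently, so a production matcher would need a tolerance-based comparison alongside a hash like this.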
| Rule | What it does |
| --- | --- |
| feed_statistics | Computes per-feed statistics: route count, stop count, agency count, total shape distance, and geographic bounding box |
| stop_route_index | Builds an inverted index of routes serving each stop; powers stop detail pages and API endpoints |
| trip_direction_computation | Computes the compass bearing of each trip from its first to last stop; used for directional filtering in real-time vehicle matching |
| feed_packaging | Packages and exports the final processed output: improved GTFS archive, GeoJSON shapes, and MobilityData validation report |
| artifact_publication | Publishes all pipeline artifacts to the content delivery platform and API servers; triggers cache invalidation to ensure downstream consumers receive the latest data |

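The compass bearing used by trip_direction_computation can be obtained with the standard initial-bearing formula for a great-circle path. The sketch below is a generic implementation of that formula, with a hypothetical function name; it takes the first and last stop coordinates of a trip.

```python
import math


def initial_bearing(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Bearing in degrees (0 = north, 90 = east) from point 1 towards point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    # atan2 returns -180..180; shift into the 0..360 compass range.
    return (math.degrees(math.atan2(x, y)) + 360) % 360
```

For loop routes whose first and last stops coincide, the bearing is not meaningful, so such trips would need separate handling.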
Real-time processing pipeline
GTFS-RT synchronization is one of the hardest problems in transit data engineering. Every provider uses its own internal IDs, formats, and naming conventions — which must be matched against the static schedule in real time. Our pipeline includes dedicated real-time matching rules on top of the static processing:
- Geographic sanity check on vehicle positions — coordinates outside the service region are automatically discarded (real-world example: a London feed reporting buses in the US)
- Automatic replacement of incorrect route IDs and stop references based on historical movement data and manually curated mapping tables
- Multi-signal trip matching across dozens of parameters, including route ID, stop sequence, scheduled time, vehicle bearing, and geographic position
- Arrival time prediction algorithms for stops where real-time data is delayed or missing
- Match quality scoring using stop locations, route shapes, and current vehicle position
- Stable trip hashes (hash1) in static data ensure GTFS-RT matching remains accurate even after full feed updates
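The geographic sanity check in the list above can be sketched as a bounding-box filter. The box coordinates, the field names on each position, and the class name are assumptions for this example; a real service region may be a polygon rather than a rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        # Simple rectangle test; does not handle regions crossing the antimeridian.
        return self.min_lat <= lat <= self.max_lat and self.min_lon <= lon <= self.max_lon


def filter_positions(positions, region: BoundingBox):
    """Keep only vehicle positions that fall inside the service region."""
    return [p for p in positions if region.contains(p["lat"], p["lon"])]


# Hypothetical box roughly covering Greater London.
LONDON = BoundingBox(min_lat=51.2, min_lon=-0.6, max_lat=51.8, max_lon=0.4)
```

With this box, a position reported at latitude 40.71, longitude -74.0 (New York) would be discarded, while one at 51.5, -0.12 would be kept.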
Country-specific processing
In addition to the universal pipeline, we develop custom processing rules for individual countries that account for local data formats, coding standards, and operator-specific quirks. Country rules run as an additional layer on top of the base pipeline and are tailored to each provider's actual data quality issues.
Access clean, production-ready GTFS data via API
Our API serves data that has passed this complete three-stage quality pipeline — not raw feeds, but fully validated, deduplicated, and enriched datasets. No broken references, no format errors, no missing required fields. The API is engineered for long-term stability: consistent schemas, high uptime, and zero tolerance for corrupt data. Every dataset we publish is something we stand behind.