How we process every GTFS feed before it reaches you
Raw GTFS data published by transit agencies is rarely production-ready. It contains duplicates, missing required fields, invalid coordinates, broken calendars, and format violations. Our three-stage automated pipeline processes 7,000+ feeds from transit operators worldwide: it first downloads and normalizes each feed, then consolidates feeds from multiple operators into a single coherent dataset, and finally applies 80+ correction, enrichment, and validation rules — so you integrate clean, consistent data on the first attempt.
This pipeline is under continuous development — new rules, improved algorithms, and data quality fixes are shipped every week. The data you receive through our API is always the result of the latest validated pipeline version, deployed via a blue-green release strategy for zero-downtime updates and high availability.
Static GTFS processing pipeline
Stage 0: Feed Acquisition & Pre-processing
Before consolidation or quality rules run, raw GTFS data is downloaded from operator sources and passed through a pre-processing layer that makes each feed structurally parseable and safe for downstream processing. Feeds that have not changed since the previous cycle are detected via content hashing and skipped entirely, eliminating unnecessary reprocessing.
- File extension normalization: Renames .csv files to .txt.
- Unnecessary file removal: Removes non-GTFS files, empty files, and files that contain only a header row with no data. Also removes files where all data rows contain only empty values.
- Required files check: Verifies that routes.txt, trips.txt, stops.txt, and stop_times.txt are present. Also checks that at least one of calendar.txt or calendar_dates.txt is present. Missing required files halt processing.
- Text normalization: Replaces literal \n escape sequences that appear as raw text inside field values with a space.
- Feed statistics capture: Captures per-file statistics, including the calendar date ranges from calendar.txt and calendar_dates.txt. All statistics are written to the database for operational monitoring and cycle management.
- Stop times sorting: Sorts stop_times.txt by trip_id and then by stop_sequence to ensure deterministic processing order for all downstream rules that iterate over stop sequences.

Stage 1: Feed Consolidation
Before any quality rules run, we consolidate feeds from multiple transit operators into a single coherent GTFS file. This is the most complex step in the pipeline: it handles timezone normalization, spatial stop deduplication, calendar merging, and translation inheritance across all source datasets.
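As an illustration of the timezone normalization problem, the sketch below computes a per-date offset between a source timezone and the consolidation target using Python's standard zoneinfo module. The function name is hypothetical, and because the offset changes at DST boundaries, it must be recomputed per service sub-period:

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def compute_time_shift(source_tz: str, target_tz: str, service_date: datetime) -> int:
    """Seconds to add to times expressed in source_tz so they are correct
    in target_tz on the given service date. The result changes across DST
    boundaries, which is why calendars are split into sub-periods that
    each have a constant shift."""
    src = service_date.replace(tzinfo=ZoneInfo(source_tz))
    target_offset = src.astimezone(ZoneInfo(target_tz)).utcoffset()
    source_offset = src.utcoffset()
    return int((target_offset - source_offset).total_seconds())
```

For example, a London-published feed consolidated into a Paris-based dataset needs a +3600-second shift in summer, while feeds sharing a timezone need no shift at all.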
- Language detection: Reads feed_info.txt and agency.txt to determine the language of each source. If sources use different languages, feed_lang is set to mul (multilingual). All translations.txt files are loaded for later processing.
- Timezone alignment and DST splitting: For each source, time_shift (in seconds) and days_shift are computed and applied. Inconsistent DST boundaries produce an error and halt processing. Unique service IDs are generated for each sub-period to prevent collisions.
- Trip splitting: Each DST sub-trip receives its own trip_id and the corresponding service_id. Translations referencing the original trip are duplicated for all sub-trips and the original translation is removed.
- Stop time rewriting: Every stop_times record is updated: stop_id is replaced via the deduplication mapping. For DST-split trips, records are duplicated for each sub-trip and departure/arrival times are shifted by time_shift seconds. The same logic applies to frequencies.txt headway-based trips.
- Table merging: agency_timezone is normalized to the dominant timezone for all agencies; transfers.txt and pathways.txt have stop IDs remapped through the deduplication mapping; all other tables (routes, shapes, fare_attributes, fare_rules, levels, attributions) are merged without modification.
- Translation consolidation: Translations from all sources are keyed by (table_name, field_name, field_value, record_id, language) and written to the final translations.txt.
- Feed metadata generation: Produces a consolidated feed_info.txt (publisher, URL, language, date range covering all active calendars, version timestamp). Creates feed_version.mapping — a traceability table linking each record in the output to its source dataset, enabling full provenance tracking.
- Mapping consolidation: Merges the .mapping files from all sources. stops.mapping applies the stop deduplication remapping; trips.mapping expands each original trip ID into all its DST-split sub-trips; agency.mapping and routes.mapping are merged without ID changes.
- Stop deduplication: Duplicate stops across sources are resolved using a deduplication_priority score. A duplication report is generated for review.

Stage 2: What each phase does
Complete rule catalogue
Rules execute in order within each phase.
Pre-processing steps

| Step | What it does |
|---|---|
| archive_normalization | Unwraps nested folders and ZIP packages inside archives, ensuring all GTFS files are accessible at the archive root before any other processing begins |
| change_detection | Fingerprints the incoming feed and skips the full processing cycle when content is unchanged since the previous run, keeping pipeline throughput high across 7,000+ feeds |
| file_extension_normalization | Renames .csv files to .txt to conform to GTFS naming conventions |
| unnecessary_file_removal | Removes non-GTFS files, header-only files, and files consisting entirely of empty values |
| required_files_check | Verifies that all required GTFS files are present before processing continues; halts the cycle early if the feed is structurally incomplete |
| text_normalization | Eliminates text-formatting artifacts across all files: Windows line endings, newlines embedded inside field values, and literal escape sequences that break CSV parsing |
| csv_quote_repair | Removes malformed quotation marks — orphaned mid-row quotes and spurious full-row wraps — a pervasive export artifact that breaks standard CSV parsers |
| whitespace_normalization | Strips leading and trailing whitespace from all string fields, including column headers |
| encoding_artifact_repair | Repairs UTF-8/latin1 double-encoding artifacts in name fields, covering patterns observed across thousands of real-world feeds |
| binary_artifact_removal | Removes null bytes and binary artifacts from all text files |
| utf8_normalization | Detects non-UTF-8 encodings and converts all files to UTF-8, covering the full range of legacy and regional character sets |
| feed_statistics_capture | Records per-file row counts, character encodings, and calendar date ranges to the monitoring database at the start of each processing cycle |
| csv_structure_repair | Validates and repairs CSV structure — mismatched quotes, inconsistent column counts, unescaped characters; rolls back to the original if repair fails |
| stop_times_sort | Sorts stop_times.txt by trip and stop sequence to guarantee deterministic input order for all downstream rules |
| operator_normalization | Applies operator-specific corrections for known non-standard exports from selected agencies |
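To make the text_normalization step concrete, here is a minimal sketch (hypothetical function name, simplified rules): it unifies line endings, strips a UTF-8 BOM, and replaces literal \n escape sequences embedded in field values with a space:

```python
import csv
import io


def normalize_text(raw: str) -> list[list[str]]:
    """Normalize line endings and BOM, parse as CSV, then replace literal
    backslash-n sequences inside field values and trim whitespace."""
    raw = raw.replace("\r\n", "\n").replace("\r", "\n").lstrip("\ufeff")
    rows = list(csv.reader(io.StringIO(raw)))
    # "\\n" here is the two-character sequence backslash + n appearing
    # as raw text in a field value, not a real newline.
    return [[field.replace("\\n", " ").strip() for field in row] for row in rows]
```

The production rules cover many more artifacts (quote repair, double-encoding, null bytes), but the shape is the same: repair, re-parse, and verify before the next step runs.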
Correction rules

| Rule | What it does |
|---|---|
| id_normalization | Normalizes all primary and foreign key values across tables, resolving non-standard characters that break referential integrity |
| location_type_inference | Infers location_type from stop usage context when the field is absent, defaulting to platform-level where appropriate |
| name_sanitization | Sanitizes all name fields by removing control characters, non-standard Unicode sequences, and extraneous whitespace |
| stop_description_cleanup | Clears stop_desc when its content duplicates stop_name, eliminating noise that misleads downstream consumers |
| default_stop_name | Assigns a valid placeholder name when stop_name is absent, ensuring no stop enters the pipeline without an identifier |
| calendar_repair | Repairs calendar integrity issues: expired service periods, reversed start/end dates, and contradictory day-of-week flags |
| calendar_canonicalization | Canonicalizes calendar records to their minimal equivalent representation, reducing row count while preserving the exact service pattern |
| route_name_derivation | Derives route_long_name from the first and last stop of the representative trip when the original value is missing or clearly incorrect |
| route_name_promotion | Promotes content from route_long_name to route_short_name when the short name field is empty and the value is compact enough to serve that role |
| wheelchair_defaults | Standardizes wheelchair_boarding to "unknown" on all stops lacking an explicit accessibility declaration, ensuring consistent output schema |
| default_route_type | Assigns a contextually derived route_type when the field is absent or contains an unrecognized value |
| route_url_cleanup | Removes route_url when it duplicates agency_url, eliminating redundant data |
| arrival_departure_correction | Detects and corrects stop_times rows where arrival is recorded after departure — a systematic error in certain scheduling export tools |
| agency_name_recovery | Assigns a canonical name to agencies missing one and flags the feed for manual review |
| cross_midnight_normalization | Normalizes cross-midnight departure times into valid GTFS extended-time format without altering any actual schedule values |
| parent_station_relocation | Relocates parent stations that are geographically distant from their child stops to a more accurate position |
| route_color_defaults | Derives route_text_color from background luminance when the field is missing, ensuring readable contrast on all display surfaces |
| trip_headsign_derivation | Populates trip_headsign from the terminal stop name when the provider has not specified it |
| coordinate_precision | Rounds stop coordinates to six decimal places — sub-meter precision — reducing data volume without any loss of geospatial accuracy |
| timepoint_defaults | Fills missing timepoint fields with the GTFS default (exact timing), preventing ambiguity for downstream consumers |
| name_character_cleanup | Final name-field pass: removes residual non-alphanumeric characters that survived earlier normalization stages |
| feed_language_correction | Resolves mismatches between the declared feed_lang value and the language detected in feed content |
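The route_color_defaults rule above picks a readable text color from the background. A common way to do this, shown here as an illustrative sketch with a hypothetical function name, is the WCAG relative-luminance formula with a black/white cutoff:

```python
def default_text_color(route_color: str) -> str:
    """Choose black or white text for a hex RRGGBB background based on
    its WCAG relative luminance."""
    r, g, b = (int(route_color[i:i + 2], 16) / 255 for i in (0, 2, 4))

    def linearize(c: float) -> float:
        # sRGB gamma expansion per the WCAG definition.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    luminance = 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)
    # 0.179 is a commonly used cutoff; the production threshold may differ.
    return "000000" if luminance > 0.179 else "FFFFFF"
```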
Enrichment rules

| Rule | What it does |
|---|---|
| missing_stop_reconstruction | Reconstructs stop records referenced in stop_times but absent from stops.txt, preventing broken trip references and downstream failures |
| route_data_enrichment | Applies a curated correction table maintained by our data team to override incorrect vehicle types, route names, and agency assignments |
| station_topology_enrichment | Supplements feeds with station entrance and exit data (location_type 2/3) sourced from our global hub database |
| stop_data_correction | Applies a curated stop correction table to fix coordinates and names — particularly effective for NAPTAN-derived and other official-source datasets with systematic errors |
| agency_contact_enrichment | Fills missing agency contact details from a maintained cross-reference database |
| stop_mode_classification | Derives the dominant transport mode for each stop from the set of routes serving it — used for map rendering and mode-specific filtering |
| stop_city_resolution | Resolves each stop to its containing city using geographic analysis — the foundation for city-level data exports and API filtering |
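The stop_mode_classification rule reduces to a frequency count over the route types serving a stop. A minimal sketch (hypothetical function name; the tie-breaking policy is an assumption):

```python
from collections import Counter


def dominant_mode(route_types_serving_stop: list[int]) -> int:
    """Most frequent GTFS route_type among the routes serving a stop.
    Ties are broken by the lower route_type code for determinism."""
    counts = Counter(route_types_serving_stop)
    return min(counts, key=lambda t: (-counts[t], t))
```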
Deduplication rules

| Rule | What it does |
|---|---|
| route_name_deduplication | Removes route_desc and route_long_name when their content duplicates route_short_name, eliminating noise that misleads display logic |
| calendar_deduplication | Identifies and merges structurally identical calendars within a feed, rebuilding service ID references across all dependent tables |
| duplicate_stop_removal | Removes duplicate stop records within a feed, retaining the entry with the most complete field coverage |
| global_stop_deduplication | Multi-signal stop deduplication: resolves stop identity across update cycles using a combination of ID, name, and coordinate signals, maintaining stable global stop references over time |
| directional_stop_deduplication | Identifies duplicate stops positioned on opposite sides of the same road using directional analysis — a common artifact when stops are sourced from multiple operators |
| stop_times_deduplication | Removes duplicate stop_times rows produced when the same trip visits the same stop at the same time — typically a scheduling export artifact |
| route_deduplication | Merges routes sharing an identical stop-sequence fingerprint across different source feeds or agencies, producing a single canonical route record |
| trip_deduplication | Multi-pass trip deduplication: resolves trips by route and schedule fingerprint, then by schedule alone across routes; merges calendars when service patterns overlap |
| shape_optimization | Reduces shape point density using geometric simplification and removes redundant shape entries, cutting GeoJSON payload without visible quality loss |
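The "stop-sequence fingerprint" used by route_deduplication can be sketched as a hash over the ordered stop IDs of a route's representative trip. This is an illustration only; the hypothetical production fingerprint uses remapped, deduplicated stop identities rather than raw provider IDs:

```python
import hashlib


def route_fingerprint(stop_ids: list[str]) -> str:
    """Content-based route identity: hash of the ordered stop sequence.
    Routes from different sources with equal fingerprints are merge
    candidates. The separator prevents ambiguous concatenations."""
    return hashlib.md5("|".join(stop_ids).encode("utf-8")).hexdigest()
```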
Validation rules

| Rule | What it does |
|---|---|
| arrival_departure_validation | Validates all stop_times for arrival/departure sequence violations; corrects where possible, removes irrecoverable rows |
| orphaned_trip_removal | Removes trips that have no valid service period after calendar cleanup — trips that can never run |
| location_type_validation | Removes trips that reference stops with incompatible location types (for example, referencing a station entrance inside stop_times) |
| url_email_validation | Validates and sanitizes URL and email fields across agencies and routes; removes values that fail format checks |
| agency_url_defaults | Assigns a placeholder URL to agencies missing a website, satisfying the GTFS required field constraint without breaking downstream consumers |
| timezone_validation | Corrects invalid IANA timezone identifiers in stop_timezone and agency_timezone fields |
| speed_anomaly_removal | Identifies and removes trips where the implied speed between consecutive stops is physically impossible for the declared vehicle type |
| route_color_validation | Validates route_color and route_text_color hex values; removes or corrects invalid color codes |
| single_stop_trip_removal | Removes trips containing only a single stop — logically invalid and unrepresentable as real transit service |
| transfer_validation | Validates transfer rules and removes entries referencing stops or routes that do not exist in the feed |
| timezone_geographic_cross_check | Cross-validates agency timezone declarations against stop geographic coordinates to detect and correct systematic timezone mismatches |
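The speed_anomaly_removal check boils down to a haversine distance over a time delta, compared against a per-mode cap. The sketch below is illustrative; the cap values are placeholders, not the production thresholds:

```python
import math


def implied_speed_kmh(lat1: float, lon1: float, t1: int,
                      lat2: float, lon2: float, t2: int) -> float:
    """Great-circle distance between consecutive stops divided by travel
    time. t1 and t2 are seconds since midnight (GTFS extended times allowed)."""
    earth_radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    dist_km = 2 * earth_radius_km * math.asin(math.sqrt(a))
    hours = max(t2 - t1, 1) / 3600.0  # guard against zero-second dwells
    return dist_km / hours


def is_speed_anomaly(speed_kmh: float, route_type: int) -> bool:
    # Placeholder per-mode caps (tram, metro, rail, bus, ferry);
    # real thresholds would be tuned per vehicle type and region.
    caps = {0: 100, 1: 110, 2: 360, 3: 130, 4: 80}
    return speed_kmh > caps.get(route_type, 150)
```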
Normalization & fingerprinting rules

| Rule | What it does |
|---|---|
| route_type_normalization | Normalizes extended GTFS route type codes to the base specification values for maximum compatibility across consumers |
| referential_integrity_cleanup | Full referential integrity pass: removes orphaned records across all tables — unused stops, expired calendars, routes without trips, and unlinked shape data |
| route_fingerprinting | Calculates a content-based fingerprint for each route derived from the geographic coordinates of its stops — used for cross-source route matching |
| trip_fingerprinting | Computes stable fingerprints for each trip — a primary and a secondary variant with different parameter sets — enabling reliable cross-cycle tracking even when provider IDs change between updates |
| stop_sequence_normalization | Renumbers stop sequences to start at 1 and increment monotonically, eliminating gaps and non-standard ordering across all trips |
| trip_departure_ordering | Assigns a stable daily sort order to trips within each route group based on first departure time, used for deterministic API response ordering |
| stop_fingerprinting | Calculates a stable stop fingerprint based on name, coordinates, and transport type — the basis for cross-source stop identity matching |
| shape_dist_computation | Calculates and inserts shape_dist_traveled values in stop_times when shapes are present but distances are missing |
| final_integrity_check | Terminal consistency pass run after all hashing is complete, ensuring no orphaned references remain before the feed is packaged for export |
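The stop_fingerprinting rule combines name, coordinates, and transport type into a stable identity. As a sketch (hypothetical function name; the normalization and rounding precision are assumptions), coordinates are rounded so that small provider-side jitter does not change the fingerprint:

```python
import hashlib


def stop_fingerprint(name: str, lat: float, lon: float, mode: int) -> str:
    """Stable stop identity from normalized name, coordinates rounded to
    4 decimals (roughly 11 m), and dominant transport mode. Survives
    provider ID churn between feed updates."""
    key = f"{name.strip().lower()}|{round(lat, 4)}|{round(lon, 4)}|{mode}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```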
Statistics & packaging steps

| Rule | What it does |
|---|---|
| feed_statistics | Computes per-feed statistics: route count, stop count, agency count, total shape distance, and geographic bounding box |
| stop_route_index | Builds an inverted index of routes serving each stop — powers stop detail pages and API endpoints |
| trip_direction_computation | Computes the compass bearing of each trip from its first to last stop — used for directional filtering in real-time vehicle matching |
| feed_packaging | Packages and exports the final processed output: improved GTFS archive, GeoJSON shapes, and MobilityData validation report |
| artifact_publication | Publishes all pipeline artifacts to the content delivery platform and API servers; triggers cache invalidation to ensure downstream consumers receive the latest data |
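The feed_statistics step above amounts to simple aggregation over the processed tables. A minimal sketch (hypothetical function name and output schema):

```python
def feed_statistics(stops: list[tuple[float, float]],
                    routes: list[str],
                    agencies: list[str]) -> dict:
    """Per-feed summary: entity counts plus the geographic bounding box
    (min_lat, min_lon, max_lat, max_lon) over all stop coordinates."""
    lats = [lat for lat, _ in stops]
    lons = [lon for _, lon in stops]
    return {
        "stop_count": len(stops),
        "route_count": len(routes),
        "agency_count": len(agencies),
        "bbox": (min(lats), min(lons), max(lats), max(lons)),
    }
```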
Real-time processing pipeline
GTFS-RT synchronization is one of the hardest problems in transit data engineering. Every provider uses its own internal IDs, formats, and naming conventions — which must be matched against the static schedule in real time. Our pipeline includes dedicated real-time matching rules on top of the static processing:
- Geographic sanity check on vehicle positions — coordinates outside the service region are automatically discarded (real-world example: a London feed reporting buses in the US)
- Automatic replacement of incorrect route IDs and stop references based on historical movement data and manually curated mapping tables
- Multi-signal trip matching across dozens of parameters: route ID, stop sequence, scheduled time, vehicle bearing, and geographic position
- Arrival time prediction algorithms for stops where real-time data is delayed or missing
- Match quality scoring using stop locations, route shapes, and current vehicle position
- Stable trip hashes (hash1) in static data ensure GTFS-RT matching remains accurate even after full feed updates
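To illustrate the multi-signal matching idea, here is a toy scoring function. The signals mirror the list above, but the weights, field names, and windows are illustrative placeholders, not the production tuning:

```python
def match_score(candidate: dict, vehicle: dict) -> float:
    """Score how well an observed vehicle matches a candidate scheduled trip.
    Combines route identity, schedule adherence, and heading agreement
    into a value in [0, 1]. All weights are illustrative."""
    score = 0.0
    # Route identity: exact match after ID normalization.
    if candidate["route_id"] == vehicle.get("route_id"):
        score += 0.4
    # Schedule adherence: linear decay over a 30-minute window.
    delay = abs(candidate["scheduled_time"] - vehicle["observed_time"])
    score += 0.3 * max(0.0, 1.0 - delay / 1800.0)
    # Heading agreement: linear decay over a 90-degree difference.
    diff = abs(candidate["bearing"] - vehicle["bearing"]) % 360
    diff = min(diff, 360 - diff)
    score += 0.3 * max(0.0, 1.0 - diff / 90.0)
    return score
```

A real matcher would evaluate this score for every plausible candidate trip and accept the best match only above a confidence threshold, falling back to the curated mapping tables when no candidate scores high enough.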
Country-specific processing
In addition to the universal pipeline, we develop custom processing rules for individual countries that account for local data formats, coding standards, and operator-specific quirks. Country rules run as an additional layer on top of the base pipeline and are tailored to each provider's actual data quality issues.
Access clean, production-ready GTFS data via API
Our API serves data that has passed this complete three-stage quality pipeline — not raw feeds, but fully validated, deduplicated, and enriched datasets. No broken references, no format errors, no missing required fields. The API is engineered for long-term stability: consistent schemas, high uptime, and zero tolerance for corrupt data. Every dataset we publish is something we stand behind.