Those who do analytics as a profession are almost certainly familiar with the following situation.
You’re working on building a report for an executive meeting showing how X has affected overall growth over the last quarter. This analysis is done every quarter, and fortunately the analyst who got stuck with it last time shared her Jupyter notebook with you, so you just need to make a few edits to the date ranges, exclude that week in December where there was a snowstorm, and hit run.
What she forgot to tell you is that you actually need to install this older version of pandas—but not too old, or else you’ll break the numpy dependency. Oh and you’ll need to get access to the growth database, and actually there might be a newer version of the notebook on her machine that fixes that one bug…
Some time later, you’ve gotten the report running, the charts look reasonable, and the whole notebook takes just under an hour to run on your machine. You export the notebook as a PDF, or maybe HTML, or maybe you just paste the charts and highlights into a PowerPoint. You send the report over to management for the presentation, and voila, another successful day as an analytics pro.
But two days later, after the presentation, you get a reply back. “Meeting with execs went great! They want to see what the numbers look like if you break these out by Z rather than Y but with Sundays removed since we don’t expect usage on those days, and can you send Bill the raw data in an Excel?”
You’re faced with a decision: do you edit XGrowthQuarterly.ipynb to make these changes? But what if they don’t want them next quarter? You could just add both breakouts, but that pushes the notebook runtime to an hour and a half. Maybe you just comment the old analysis out and let whoever is on call next quarter figure it out. Ugh, and what does Bill even need, exactly? Our code just reads a bunch of source DBs and kicks out a report; what raw data does he want? And are you on the hook for producing this every quarter? I guess we can just create another notebook on the shared drive, XGrowthQuarterlyByZ-BillXLSX.ipynb…
Now that we’re all done reliving that, one might ask, “How do we avoid it?” At the MTA, the answer we strive for is to Productionalize Analytics.
That is to say, we turn analytics and the accompanying data assets (think: tables) into production-grade products. Once the creative, interesting data science is done, workflows keep these curated datasets, reports, and insights at our fingertips (and always up to date). Just as importantly, these datasets and reports are discoverable by anyone who might want them, letting the analytics team get out of the way of self-serve analytics across the company.
To understand how we’ve tried to tackle building such an ecosystem, we’ll look at three flows:
- The Data Developer Flow
- The Data ELT Flow
- The Data User Flow
Data Developer Flow
How does our analytics team get new data assets onto the data lake?
What we’re optimizing for:
- An effective and enjoyable experience for our data scientists and data engineers
- The ability to do simple things quickly, while supporting custom use cases
- Reproducible builds that allow us to roll back changes easily and understand what’s live and what’s not
The first flow is the Data Developer Flow, or the process by which our analytics team puts data assets into production.
Starting at the bottom left of our diagram, we can imagine a Data Scientist who has built a useful, curated dataset from our source data, something that could be used in numerous downstream analyses, maybe a clean NYCT Elevator and Escalator Availability Dataset. We use Airflow to make sure datasets like these update at the desired frequency.
Because Airflow code is just Python code, our data scientists can build and test Airflow DAGs in their local environment, just like any other code they’re working on.
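To make that concrete, here’s a minimal sketch of the kind of DAG a data scientist might build and test locally. The DAG name, task, and schedule are illustrative placeholders rather than our actual pipeline code, and the decorator syntax assumes a recent Airflow 2.x install.

```python
# A minimal, illustrative Airflow DAG. The dataset, schedule, and logic are
# hypothetical placeholders, not the MTA's production code.
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    schedule="@daily",               # update frequency chosen per dataset
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["curated"],
)
def elevator_availability():
    @task
    def build_curated_dataset():
        # Stand-in for the real logic that reads RAW tables and writes a
        # curated elevator/escalator availability table.
        print("building curated availability dataset")

    build_curated_dataset()


elevator_availability()
```

Because this is ordinary Python, the DAG can be imported and exercised by unit tests before it ever reaches the git server.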
They then push these changes to our git server (we use GitLab and love it), where they can be reviewed, allowing for input from other data scientists and the engineering team.
After merge, our build pipeline incorporates these changes into custom Python packages and Airflow images and publishes these artifacts to our container and package registries. The data scientist’s work is now complete, and ownership is handed over to Engineering for release and subsequent pipeline monitoring.
Data ELT Flow
How does information from source systems become useful data assets in the lake?
What we’re optimizing for:
- Observability above almost all else
- Scalability for our ever-growing number of workflows and datasets
- Pipeline execution speed (but only when we need it)
The second flow we’ll look at is the Data ELT Flow, or the technical processes that take data from our source systems and turn it into downstream data products, as defined by our analytics team in the first flow.
Again, Airflow is at the center of this process as our orchestration engine, where it plays a number of important roles:
- Executing pipelines on schedule, in an environment suited to each pipeline
- Ensuring dependent datasets are only updated once their upstream datasets have been updated
- Adding reliability to these pipelines through intelligent retries
- Sending failure alerts when a human needs to unblock a failing pipeline
- Providing a single pane of glass for monitoring the status and logs of our workflows
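As a rough sketch of how several of these roles show up in DAG code, here’s an illustrative configuration; the retry counts, schedule, task names, and alerting callback are placeholders, not our production setup.

```python
# Illustrative defaults only: retries, delays, and the failure callback are
# placeholders standing in for real alerting and retry policy.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_on_failure(context):
    # In production this might page someone or post to a chat channel.
    print(f"Task {context['task_instance'].task_id} failed")


default_args = {
    "retries": 3,                          # retry before waking a human
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_on_failure,
}

with DAG(
    dag_id="curated_example",
    schedule="@hourly",                    # executed on schedule by Airflow
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    # Ordering ensures the curated table only updates after its upstream
    # ingest has finished.
    ingest = EmptyOperator(task_id="ingest_raw")
    transform = EmptyOperator(task_id="build_curated")
    ingest >> transform
```

The Airflow UI then provides that single pane of glass over every run of a DAG like this, along with its logs and retries.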
Our DAGs generally come in one of three flavors: Ingest, Transformation, and Reverse ETL.
Ingest DAGs are generally responsible for pulling data from disparate source systems into our data lake and dumping it into our RAW zone. Sometimes these tables are pulled almost unchanged; other times, more complicated queries join and filter a number of tables to create something more useful. As a rule, we try to keep our data as true to the source systems as possible in RAW, but our data lake is for analytics, not a data backup, so usability reigns.
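As an illustration, here’s a hedged sketch of a simple ingest step, assuming a pandas-readable source database and a Parquet-formatted RAW zone; the connection string, table, and storage path are placeholders.

```python
# Hypothetical ingest step: pull a source table close to as-is into RAW.
# Connection string, query, and lake path are illustrative placeholders.
import pandas as pd
import sqlalchemy


def ingest_equipment_status() -> None:
    engine = sqlalchemy.create_engine("postgresql://user:pass@source-db/ops")

    # Keep the data true to the source system; only light shaping here.
    df = pd.read_sql("SELECT * FROM equipment_status", engine)

    # Land it in the RAW zone (writing to an abfss:// path requires an
    # fsspec-compatible filesystem such as adlfs, plus credentials).
    df.to_parquet(
        "abfss://raw@examplelake.dfs.core.windows.net/equipment_status/latest.parquet"
    )
```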
Our transformation DAGs then take data from RAW and produce derived or CURATED datasets, based on algorithms designed by our data scientists. These generally take the form of Spark jobs, but sometimes they kick off more complicated workflows, like Synapse Pipelines or even Azure Batch jobs. Our datasets in the CURATED zone are almost always written in the Delta Table format. We like Delta Tables because they give us the ability to scale our lake without giving up SQL semantics.
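The Delta-writing pattern itself is simple; here’s a minimal sketch, assuming a Spark session configured with the Delta Lake extensions, with placeholder paths, columns, and aggregation logic.

```python
# Illustrative transformation: read RAW data, derive a curated dataset, and
# write it as a Delta table. Paths and column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curated_availability").getOrCreate()

raw = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/equipment_status/")

curated = (
    raw.filter(F.col("equipment_type").isin("elevator", "escalator"))
       .groupBy("station", "equipment_id", "status_date")
       .agg(F.avg(F.col("is_available").cast("double")).alias("availability"))
)

# Delta gives us ACID writes and SQL semantics without giving up lake scale.
(
    curated.write.format("delta")
           .mode("overwrite")
           .save("abfss://curated@examplelake.dfs.core.windows.net/elevator_availability/")
)
```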
Reverse ETL DAGs are fewer in number than the previous two, but they have an important purpose: delivering select datasets from our internal Data Lake to downstream public data repositories. This generally takes the form of running a Spark SQL query on one of our curated datasets and pushing the results over to the NYS Open Data Portal.
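A hedged sketch of that pattern, assuming the curated table is registered in the metastore and the portal is reached through its Socrata-style API via the sodapy client; the dataset ID, credentials, and table name are placeholders.

```python
# Hypothetical reverse-ETL step: query a curated table with Spark SQL and
# upsert the results into an open data portal dataset. All identifiers and
# credentials below are placeholders.
from pyspark.sql import SparkSession
from sodapy import Socrata

spark = SparkSession.builder.appName("reverse_etl_sketch").getOrCreate()

records = (
    spark.sql(
        "SELECT station, CAST(status_date AS STRING) AS status_date, availability "
        "FROM curated.elevator_availability"
    )
    .toPandas()
    .to_dict(orient="records")
)

client = Socrata("data.ny.gov", "APP_TOKEN", username="open-data-bot", password="...")
client.upsert("abcd-1234", records)  # placeholder dataset ID
```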
Data User Flow
How do users turn our data assets into insights at the MTA?
What we’re optimizing for:
- Varied access patterns to meet the needs and skillsets of our users
- Timely, consistent, and transparent delivery of data to end systems
The final flow we’ll look at is the Data User Flow, or the ways in which end users, both inside and outside of the MTA, interact with our data.
For ad hoc querying of data in our internal data lake, we use Azure Serverless SQL as a SQL-compatible endpoint provided to our users. This means that analysts across the MTA can pop open their favorite SQL Client (we love DBeaver) and query our analytical tables as they would any other database.
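For analysts who would rather script than click, the same endpoint can also be queried programmatically; here’s a hedged sketch using pyodbc, where the endpoint, database, and table names are placeholders.

```python
# Illustrative ad hoc query against a serverless SQL endpoint over the lake.
# The server, database, and table names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=example-ondemand.sql.azuresynapse.net;"
    "Database=curated;"
    "Authentication=ActiveDirectoryInteractive;"
)

cursor = conn.cursor()
cursor.execute(
    "SELECT TOP 10 station, availability FROM elevator_availability "
    "ORDER BY availability"
)
for row in cursor.fetchall():
    print(row.station, row.availability)
```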
For canned reports (think our quarterly growth report up above), our dashboards also point to our curated tables and execute queries through the Serverless SQL endpoints. These dashboards, along with the pipelines that keep their underlying tables up to date, ensure that nobody gets stuck producing the Quarterly Report from this or that notebook sitting on their hard drive.
With some data pushed into the Open Data Portal, external users can access MTA data as well. This allows interested parties to perform their own analytics and build downstream applications off of MTA data.
Our last user offering, and probably our coolest, is the MTA Metrics site. This public analytics dashboard, powered by Open Data, lets individuals both inside and outside the MTA quickly gain insight into what’s happening at the organization, and just as easily access the data driving those insights.
Final thoughts
Is the above plan perfect? No. Are all our analytics currently produced through this modernized process? We wish. Are “Data Science” and “Analytics” getting easier to do well at the MTA? We think so. Are these efforts improving the public’s access to the data they deserve? You tell us. :)
About the author
Mike Kutzma is a Data Engineering Manager on the Data & Analytics team.
Interested in being part of the fun? The Data Engineering team is hiring!