Everyone talks about being data-driven these days — it's a buzzword in almost every job description, culture doc, and company website. But for those of us on data and engineering teams, the real question is: How do you build infrastructure that makes ‘data-driven’ possible?
At AirGarage, our Data team spends a lot of time on this question. Business needs are always evolving, especially in startups, so we’ve found it grounding to focus on a few key principles when architecting our data stack. This is our blueprint.
Surprisingly or unsurprisingly, depending on your familiarity with parking, operating a parking facility well requires a lot of context. Traffic patterns vary, pricing varies, and there are hundreds, thousands, even millions of drivers interfacing with each location. Meanwhile, our relatively small team proactively manages the success of hundreds of different properties. Part of what makes that possible is that our Data & Engineering teams aim to automate and scale the collection of key data points that serve as important context for our teammates. That context is leverage for better decision-making.
For example, AirGarage’s dynamic pricing system captures a lot more data than just the price of a parking rental. Tracking, logging, and storing the right data points can be the difference between poor and exceptional data infrastructure, because business users need that contextual nuance when making decisions like setting prices. Having built our pricing engine iteratively over the years, we have a decent idea of what the journey from poor to exceptional data capture looks like.
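To make that contrast concrete, here is a simplified sketch. The tables and columns below are hypothetical, not our actual schema; the point is that the second table preserves the context the pricing engine saw at the moment it made a decision, so business users can later ask why a given price was set.

```sql
-- Hypothetical illustration only; not AirGarage's real schema.
-- "Poor" capture: log only the price charged.
CREATE TABLE rental_prices_minimal (
    rental_id  INTEGER,
    price_usd  NUMBER(10, 2)
);

-- Richer capture: log the context behind each pricing decision.
CREATE TABLE pricing_decisions (
    rental_id           INTEGER,
    price_usd           NUMBER(10, 2),
    decided_at          TIMESTAMP_NTZ,   -- when the price was set
    occupancy_pct       FLOAT,           -- how full the facility was at that moment
    nearby_event_flag   BOOLEAN,         -- demand signal observed at decision time
    pricing_rule        VARCHAR,         -- which rule or model produced the price
    previous_price_usd  NUMBER(10, 2)    -- what the same rental cost before this change
);
```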
Feeding our data infrastructure with exceptional raw data is THE first step to get right, because every downstream choice about how to move or transform that data is built on top of it.
As the example above illustrates, capturing data directly from your product is one great opportunity to add context, but there can also be a lot of value in capturing data in other ways: purchasing third-party datasets, web scraping, or even manual data entry. The goal of capturing context should be to minimize the number of browser tabs your business teams have to keep open.
Consider this concept: to maximize the value of company data, one can either increase the amount of data (Principle #1) or increase the value per unit of data. And one way to increase the value per unit of data is to allow it to be joined to other data. The easiest way to do this is to centralize data in one place.
In the early days of a startup, this might be as simple as one spreadsheet with multiple tabs of data joined in one place. But as a company grows, so does the number of systems. And if we gleaned anything from Principle #1, it’s that systems produce data we want to capture. Eventually your Sales team might move to Salesforce. Your Operations team might adopt Zendesk. And suddenly it’s not so easy to forecast revenue: you now have transaction data living in Postgres, sales pipeline data in Salesforce, and refunds happening in Zendesk. The data movement and matching required to join data across these systems burns valuable analyst time, not to mention raising the likelihood of errors.
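Centralization pays off at query time. As a hedged sketch (the table and column names are hypothetical), once all three sources land in one place, a revenue question becomes a single query instead of a cross-system reconciliation exercise:

```sql
-- Hypothetical replicated tables: pg.transactions, sfdc.properties, zd.refunds.
WITH refunds AS (
    SELECT transaction_id, SUM(amount_usd) AS refunded_usd
    FROM zd.refunds
    GROUP BY transaction_id
)
SELECT
    p.account_owner,
    DATE_TRUNC('month', t.charged_at)                     AS revenue_month,
    SUM(t.amount_usd)                                     AS gross_revenue,
    SUM(COALESCE(r.refunded_usd, 0))                      AS refunds,
    SUM(t.amount_usd) - SUM(COALESCE(r.refunded_usd, 0))  AS net_revenue
FROM pg.transactions t
JOIN sfdc.properties p ON p.property_id = t.property_id
LEFT JOIN refunds r    ON r.transaction_id = t.transaction_id
GROUP BY 1, 2;
```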
AirGarage built a data warehouse in Snowflake to centralize its data. Clicking ‘Get Started’ on snowflake.com is the easy part though. The art of data warehousing lies in HOW data is loaded and modeled in the warehouse.
Data loading
The way data is loaded into a data warehouse has implications for cost, processing speed, and data integrity. AirGarage processes millions of records at frequent intervals, so it is not practical to do a full refresh of our pipelines every time we want data updated. Instead, we load data incrementally to minimize the compute needed to run our pipelines. As the scale and complexity of data grows, so do the intricacies of loading it. Ultimately, companies face a spectrum of choices, from software vendors that automate loading tasks to writing and maintaining their own custom data movement scripts.
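For readers unfamiliar with incremental loading, the core idea can be sketched as a single Snowflake MERGE: pull only the rows that changed since the last run and upsert them, rather than rebuilding the whole table. The table names and the updated_at watermark column below are hypothetical.

```sql
-- Hypothetical incremental upsert keyed on transaction_id.
MERGE INTO analytics.transactions AS target
USING (
    SELECT transaction_id, amount_usd, status, updated_at
    FROM raw.transactions
    -- Only rows newer than what the warehouse already has.
    WHERE updated_at > (
        SELECT COALESCE(MAX(updated_at), '1970-01-01'::TIMESTAMP_NTZ)
        FROM analytics.transactions
    )
) AS source
ON target.transaction_id = source.transaction_id
WHEN MATCHED THEN UPDATE SET
    target.amount_usd = source.amount_usd,
    target.status     = source.status,
    target.updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (transaction_id, amount_usd, status, updated_at)
    VALUES (source.transaction_id, source.amount_usd, source.status, source.updated_at);
```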
As a grounding principle, we want to spend as much developer time as possible writing code that drives value for AirGarage. Ian Macomber, Head of Analytics Engineering & Data Science at Ramp, has shared a convincing opinion that code is a liability teams take on, because code needs to be maintained in perpetuity. We agree, which is why we adopted Polytomic to connect to source systems like Postgres and Salesforce, extract data from them, and automatically load it into Snowflake. Larger teams with more resources to build custom pipelines may prefer to outsource less of this workload to vendors.
Data modeling
Once data is centralized in Snowflake, it needs to be structured into data models that serve the variety of use cases the business has. How to do this is an important consideration. There are long-standing best practices for data modeling, à la Inmon’s normalized (3NF) approach or Kimball’s dimensional star schema. There are also hybrid approaches that borrow from each, and a rising belief in the ‘One Big Table’ (OBT) approach given improvements in storage and compute costs. We are still figuring out when and where to deploy which techniques, but we prioritize based on what will make the business more data-driven. On our Data team, that means weighing how different approaches affect the speed of development cycles, the flexibility of the model for future edits, and data integrity. For our business users, it means weighing the ease of data exploration, or ‘query-ability’, of the resulting models.
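As a simplified, hypothetical illustration of that trade-off (none of the names below are our real models), the same rental data can be kept as a narrow, Kimball-style fact table that joins out to dimensions, or pre-joined into one wide table that is easier to explore but more redundant:

```sql
-- Kimball-style: a narrow fact table keyed to dimension tables.
CREATE VIEW analytics.fct_rentals AS
SELECT
    r.rental_id,
    r.property_id,   -- joins to dim_properties
    r.driver_id,     -- joins to dim_drivers
    r.started_at,
    r.amount_usd
FROM staging.rentals r;

-- One Big Table (OBT): dimensions pre-joined for easy ad hoc querying.
CREATE VIEW analytics.rentals_obt AS
SELECT
    r.rental_id,
    r.started_at,
    r.amount_usd,
    p.property_name,
    p.city,
    d.first_rental_at
FROM staging.rentals r
LEFT JOIN analytics.dim_properties p ON p.property_id = r.property_id
LEFT JOIN analytics.dim_drivers d    ON d.driver_id   = r.driver_id;
```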
A data-driven culture requires two-way trust between data teams and business teams. Data teams must trust that business teams will query and analyze data accurately. And business teams must trust that the data they are accessing reflects current business logic, is refreshed reliably, and is not error-prone.
How can we maintain this two-way trust? One recommendation is to follow key software engineering best practices. For example, we use version control to track changes in our data transformations over time, implement automated testing to ensure data quality, and guarantee that data in one location will match data in another by tying both to the same source code. These practices let us catch errors early, maintain consistency across our data models, and iterate quickly on our pipelines.
How best to follow these practices depends on your specific circumstances, but as a small team in the early innings of data maturity, we’ve found it helpful to lean on the engineering framework built into a tool like dbt. It gives us a friendly environment to write and ship modular code, create tests, and see the lineage of our data. The specific tools can be swapped; what matters is finding a good rhythm of producing, testing, and shipping reliable code.
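For example, dbt lets you express a data quality check as plain SQL: a "singular test" is simply a query that selects rows violating an expectation, and the test fails if any rows come back. A hedged sketch, where the model and column names are hypothetical:

```sql
-- tests/assert_no_negative_rental_amounts.sql
-- dbt singular test: the test fails if this query returns any rows.
SELECT
    rental_id,
    amount_usd
FROM {{ ref('fct_rentals') }}
WHERE amount_usd < 0
```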
Teammates across our company have different data abilities. Some are eager to get their hands dirty writing their own SQL, some prefer to access data in no-code formats, and some don’t have time for either. We must build data infrastructure that empowers each of these personas. I’ll skip a discussion of dashboarding and visualization tools and instead highlight two newer techniques we’ve invested in to democratize data.
One new-ish technique we use is operationalizing data through Reverse ETL processes. At AirGarage, we build Reverse ETL pipelines to push data that has been curated in Snowflake back into Salesforce, placing it at the fingertips of our business users in the tools they already work with. This approach significantly reduces friction in accessing and utilizing data, encouraging more data-driven decision-making across the organization.
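In practice, the warehouse side of a Reverse ETL sync is just a curated query or model keyed by the ID of the destination record; the sync tool then maps its columns onto Salesforce fields. A hedged sketch with hypothetical names:

```sql
-- Hypothetical curated model that a Reverse ETL sync maps onto Salesforce.
-- Each row is keyed by the Salesforce Account it should update.
SELECT
    p.salesforce_account_id,
    SUM(t.amount_usd)           AS trailing_30d_revenue,
    COUNT(DISTINCT t.driver_id) AS trailing_30d_unique_drivers
FROM analytics.fct_transactions t
JOIN analytics.dim_properties p ON p.property_id = t.property_id
WHERE t.charged_at >= DATEADD('day', -30, CURRENT_DATE)
GROUP BY 1;
```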
Additionally, we're in the process of building a semantic layer on top of our data logic. This semantic layer acts as an abstraction between our raw data and business users, providing a consistent, user-friendly interface for querying and analyzing data. This not only simplifies data access for non-technical users but also ensures consistency in metrics and definitions across different tools and teams.
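The actual dbt Semantic Layer is configured in YAML metric specs rather than SQL, but the idea it enables ("define a metric once, let every tool query the same definition") can be sketched as a single governed definition that dashboards and ad hoc queries all point at. The names below are hypothetical:

```sql
-- Conceptual stand-in only: one governed definition of "net revenue"
-- that every downstream tool queries, instead of each dashboard
-- re-implementing the calculation.
CREATE VIEW metrics.monthly_net_revenue AS
SELECT
    property_id,
    DATE_TRUNC('month', charged_at)                  AS metric_month,
    SUM(amount_usd) - SUM(COALESCE(refunded_usd, 0)) AS net_revenue_usd
FROM analytics.fct_transactions
GROUP BY 1, 2;
```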
If you're excited about creating a data-driven environment like the one we've described, we'd love to hear from you — AirGarage might be the perfect place for you! Check out our careers page for current openings.