This is a repost of my feature on the Upside Engineering Blog.
At Upside, we’re in the business of providing the best possible experience to business travelers. In that objective, we take measures such as proactively detecting flight delays & cancellations, algorithmically recommending suitable travel packages, and integrating with numerous external services such as Uber, Hertz, and Starbucks. Not to mention, we need to get the word out about our company and advertise across several mediums. All of these efforts generate a lot of data — our flight and hotel streams alone pull in approximately 3TB of data daily.
Last month, our CTO Emily Dresner wrote an excellent blog post about Upside’s policy of using AWS tools unless it is absolutely necessary to move to an external service. With few exceptions, we abide by the rule and it has allowed us to streamline operations and reduce overhead. I wrote this post to explain one of our most notable exceptions — our transition to Snowflake Computing’s cloud data warehouse.
Laying the Foundation
Mention the phrase “data warehouse” at a nimble startup and you’ll usually see a mixture of fear and confusion. For good reason — the class of technology is often associated with large corporations that need to figure out how to report data across business units. But the truth is that data is the lifeblood of a modern company and understanding our data has been key to growing, iterating, and improving our products and services. Rather than siloed, departmentalized datasets, we aim to arm every employee with whatever information he or she needs to be effective. We embrace a testing culture that generates data for every decision and code change. Upside ingests multiple terabytes of raw flight/hotel inventory on a daily basis, in addition to website, mobile app, and marketing data streams. Our data engineering team integrates these disparate sources into a platform for analytics and data science applications.
Our first data warehouse was Amazon Redshift. Simply put, Redshift couldn’t keep up with our demands. It allowed us to get off the ground but concurrent loading and analytical queries degraded performance and the clusters were extremely expensive to maintain. Redshift’s capabilities and features have been relatively stagnant for several years while competing platforms continue to advance. Amazon Athena and Redshift Spectrum allowed for mild decoupling of storage and compute, as we were able to shift lesser-used datasets out of the cluster and into S3. Yet the setup required engineering resources to design AWS Elastic Map Reduce jobs for data pre-processing and entailed a substantial technical barrier for casual inquiries.
Envisioning an Ideal Platform
Upside’s testing culture and “move fast and break things” mantra afforded us the ability to step back and ask “what would an ideal platform for data analytics look like?” We were empowered to look beyond incremental improvements, such as provisioning more Redshift clusters, in favor of more transformative solutions. At the same time, we wanted to avoid the pitfall of mistaking “trendy” for “better”. And though we are data-driven and operate at terabyte scale, we don’t need a Google-scale solution for our startup.
Our selection criteria can be broken into four focus areas:
We’re a startup and need to be able to handle rapid user growth and concomitant data creation, yet we don’t want to have to pay for full capacity up front.
Upside generates data every minute of the day. The idea that we need to wait an hour or more for scheduled batch processes to update data is untenable, so an analytics platform needs to be able to handle constant writes/updates without impairing performance.
3. Support for Semi-structured Data
Upside operates an advanced logging infrastructure using a custom Node.js microservice which exports JSON records via Kinesis Firehose. Application engineers need the flexibility of being able to deploy new fields to JSON schemas without worrying if they will break ETL processes.
4. Integration Capabilities
An effective analytics platform should be accessible across all levels of the company and be useful to users ranging from marketing directors to Ph.D. statisticians and product managers. Accordingly, the platform should integrate with BI tools such as Looker, accept a wide range of programmatic interfaces such as Python, Spark, and Node.js, and easily operate in a cloud environment, specifically AWS.
As the team researched alternatives, we found Snowflake Computing’s cloud data warehouse to be superior in almost every category. Its feature set was advanced enough to necessitate a paradigm shift in how we approach our analytics platform.
The first shift is that it Snowflake functions as a data-warehouse-as-a-service. There are no servers to manage, software to install, or indexes to tune. This allows the data engineers to focus on tasks that will translate into growth for the company.
The second is the concept of a virtual warehouse. Snowflake’s managed architecture distributes customer data across a proprietary cluster (built on S3). You only pay for the underlying storage in S3 and the compute capacity provisioned via a virtual warehouse. Warehouses can be paused at any time for cost efficiency. For improved performance, separate warehouses can serve analytical and ETL workloads, and warehouse size can be altered in less than half a second compared to 12+ hours for Redshift. Snowflake even offers autoscaling, multi-cluster warehousing to seamlessly increase compute resources during peak load. In my mind, Snowflake has the only product on the market offering truly independent scaling of compute and storage services.
The third is how data is stored. Whereas traditional data warehouses require separate clusters for development, staging, and production environments, Snowflake allows users to make “copy-on-write” clones that do not require additional storage space to create. In a similar way to how Github only tracks “diffs” or differences between commits, Snowflake only tracks changes to the underlying data. Development and stage warehouses can be thought of as being a “fork” of the production warehouse. The obvious benefit is significant cost savings, but this architecture provides resiliency and audit capability as Snowflake users can query or revert the state of a table at any point in the last 90 days, down to the millisecond. And cloning a multi-terabyte table takes a fraction of a second, since no writes are required.
Finally, Snowflake implements a “schema-on-read” functionality allowing semi-structured data such as JSON, XML, and AVRO to be loaded directly into a traditional relational table. The semi-structured data can be queried using SQL without worrying about the order in which objects appear. Nested JSON, new attributes, and arrays are all accessible without rewriting ETL code.
We migrated our entire Redshift warehouse to Snowflake in less than two days and were fully production-live within two weeks (with a team of three). Let me say that again, for all the readers here familiar with multi-year IT modernization projects: a team a three data engineers transferred our Matillion-based ETL routines and data to Snowflake in less than two weeks. Yeah… we move fast here.
We followed a planned and measured approach despite moving at that rapid speed. For about two months, we operated Redshift and Snowflake data warehouses simultaneously, allowing us to benchmark performance, cost, reliability, etc. The redundancy was well worth the peace of mind and allowed the data engineering team to demonstrate Snowflake’s capabilities across the company.
At a high level, we’ve achieved substantial benefits:
- Routine queries and ETL processes take ~30-40% of the time of their Redshift counterparts, with ability to autoscale compute for larger datasets or improved performance
- No detectable performance impacts of concurrent analytics and ETL workloads
- Data warehousing costs are ~75% lower compared to Redshift
- The ability to query JSON has drastically improved our ETL methods (much more flexible and maintainable than CSV)
- We scale up the virtual warehouses to handle massive nightly data loads and scale down during normal operations for cost efficiency
- We utilize Spark and Python connectors for machine learning and other advanced data science applications
From an organizational perspective, Snowflake dramatically boosted the data engineering team’s productivity and broadened the scope of data sources that can be tracked. Whereas Redshift reported on core aspects of our business, Snowflake is being used to centralize just about every tool, data source, and external service for analysis and improvement. A centralized and widely-accessible platform for reporting and analytics allows us to iterate faster and more intelligently, and teams can operate independently while working towards the same goals using shared metrics and data sources.