What Is Starburst Data And Why You Should Use It – Data Engineering Consulting

March 6, 2021

Even small companies these days manage 47.81 terabytes of data on average. Whether you’re a small company or a trillion-dollar behemoth, data is driving decisions. But as data ecosystems become more complex, having the right tools for the job is important.

One modern data management tool that can help manage data of almost any size, from a wide array of data sources, is Presto. Presto has been the backbone for multiple big tech companies like Netflix and Lyft.

But Presto is not limited to helping large tech companies, or even just large companies in general. In the past few years, thanks to new technologies, Presto has become popular with companies of all sizes.

In this article, we will discuss what Presto is, why companies are using it, and how your company can implement it with the help of companies like Starburst Data so you can get the full benefit of it across your enterprise.

What Is Presto?

Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Unlike many other SQL engines that were often written for very specific databases, Presto can sit on top of a wide array of databases.

Specifically, Presto allows querying data where it lives, including Hive, Cassandra, relational databases, or even proprietary data stores. Presto also makes it easy to join data across different databases and data stores, meaning that you don’t have to centralize all your data to perform ad hoc queries.
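To make the idea of a cross-source join concrete, here is a minimal sketch. It does not use Presto. As a stand-in, it uses Python's built-in sqlite3 module with two separate in-memory databases, and the database, table, and column names (lake, customers, orders) are made up for illustration. In Presto the same pattern is written against tables addressed as catalog.schema.table.

```python
import sqlite3

def cross_source_totals():
    # "main" plays the role of one data source (say, a relational database).
    conn = sqlite3.connect(":memory:")
    # A second, independent database attached as "lake" plays the role of
    # another source (say, a Hive table). Both names are hypothetical.
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO lake.orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )

    # One SQL statement joins across both databases. Nothing is copied
    # into a central store first.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM main.customers AS c
        JOIN lake.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows

print(cross_source_totals())
```

The point of the sketch is the shape of the query: each table is qualified by the source it lives in, and the join happens at query time.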

This brings us to our next point: why use Presto?

Why use Presto?

Presto is very platform agnostic. Whether in the cloud or on-premises, the technology is versatile and has been used by large companies like Netflix and Lyft but is also popular among small companies and start-ups.

These companies have decided to use Presto because it offers the ability to easily create a self-service BI layer that is accessible to more than just data engineers. The fact that data can be queried across data sources is what allows Presto to give data consumers a more agile approach to data access.

But this is just one of many benefits that Presto offers users. In the modern world, querying and managing terabytes of data is normal. This leads to slower queries.

That’s a problem. Analysts still expect their queries to be fast (if anything they expect faster queries).

Presto provides fast queries by leveraging both well-known and novel techniques for distributed query processing. These techniques include in-memory parallel processing, pipelined execution across nodes in the cluster, a multithreaded execution model to keep all the CPU cores busy, efficient flat-memory data structures to minimize Java garbage collection, and Java bytecode generation. A detailed description of these complex Presto internals is beyond the scope of this article. For Presto users, these techniques translate into faster insights into your data at a fraction of the cost of other solutions.
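As a rough illustration of what pipelined execution means, the toy sketch below chains three operators with Python generators so that each row flows from scan to filter to aggregate without an intermediate result being fully materialized. This is not Presto code and says nothing about its actual internals. The operator names and the sample events data are assumptions made for the example.

```python
def scan(rows):
    # Source operator: hands rows downstream one at a time.
    for row in rows:
        yield row

def filter_rows(rows, predicate):
    # Filter operator: passes a row on as soon as it qualifies.
    for row in rows:
        if predicate(row):
            yield row

def sum_column(rows, column):
    # Aggregate operator: consumes rows as they arrive.
    total = 0
    for row in rows:
        total += row[column]
    return total

def run_pipeline(rows):
    # Roughly: SELECT SUM(bytes) FROM events WHERE status = 'ok'
    scanned = scan(rows)
    kept = filter_rows(scanned, lambda r: r["status"] == "ok")
    return sum_column(kept, "bytes")

events = [
    {"status": "ok", "bytes": 120},
    {"status": "error", "bytes": 999},
    {"status": "ok", "bytes": 80},
]
print(run_pipeline(events))
```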

Limitations Of Presto

For everything great about Presto, it was developed as a bare-bones SQL engine. What this means is that you’ll have to manage scaling, security, and monitoring, as well as create new connections, on your own.

This means your teams won’t be able to easily lock down data if it only is supposed to be managed by

This can often make Presto unapproachable: for all its benefits, it becomes difficult to manage. This is where companies like Starburst Data have come in and created infrastructure around Presto.

What Is Starburst Data And How Can It Help?

Starburst Data makes implementing Presto easy.

Starburst Data provides all the advantages of Presto like reducing the amount of time required for analysts to get access to data in almost any data source.

In addition, Starburst Data has developed several features to fill in the gaps.

Security for fine-grained Access Control

An important feature of any database or data warehouse is being able to manage who has access to what data. You don’t want all of your private data exposed to your entire company.

Starburst Enterprise has several features that help make up for Presto’s limited security features. For example, Starburst makes it easy for your team to set up access control.

The access control systems all follow role-based access control mechanisms with users, groups, roles, privileges and objects.

This is demonstrated in the image below.

This makes it easy for your security teams and data warehouse administrators to manage who has access to what data.
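To show how users, groups, roles, privileges, and objects fit together, here is a minimal sketch of a role-based access check. It is a generic illustration, not Starburst's implementation or configuration format. The AccessPolicy class and the names alice, analysts, sales_reader, and sales.orders are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AccessPolicy:
    user_groups: dict = field(default_factory=dict)      # user -> set of groups
    user_roles: dict = field(default_factory=dict)       # user -> set of roles
    group_roles: dict = field(default_factory=dict)      # group -> set of roles
    role_privileges: dict = field(default_factory=dict)  # role -> set of (privilege, object)

    def roles_for(self, user):
        # A user holds roles granted directly plus roles granted to their groups.
        roles = set(self.user_roles.get(user, ()))
        for group in self.user_groups.get(user, ()):
            roles |= set(self.group_roles.get(group, ()))
        return roles

    def is_allowed(self, user, privilege, obj):
        # Access is granted if any of the user's roles carries the privilege
        # on the requested object.
        return any(
            (privilege, obj) in self.role_privileges.get(role, ())
            for role in self.roles_for(user)
        )

policy = AccessPolicy(
    user_groups={"alice": {"analysts"}},
    group_roles={"analysts": {"sales_reader"}},
    role_privileges={"sales_reader": {("SELECT", "sales.orders")}},
)
print(policy.is_allowed("alice", "SELECT", "sales.orders"))
print(policy.is_allowed("alice", "INSERT", "sales.orders"))
```

Because privileges attach to roles rather than to individual users, granting a new analyst access is a matter of adding them to a group.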

Starburst also offers other helpful security features such as auditing and encryption.

This enables companies to implement a centralized security framework without having to code their own modules for Presto.

Making Presto Compute Scaling Easy

Traditional databases and SQL engines struggle when it comes to querying data lakes and large data sets in general. This is because many of these database systems were developed as operational databases for on-premises infrastructure.

Starburst Enterprise has worked to take advantage of Presto’s underlying ability to run queries at scale and coupled it with several other performance improvements, including embedded caching. As a quick reminder, Presto is a distributed system and uses an architecture similar to a classic massively parallel processing (MPP) database management system.
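The core MPP idea can be sketched in a few lines: split the data, let each worker aggregate its own split, then merge the partial results. The example below is an assumption-laden toy, not Presto's scheduler. Threads stand in for worker nodes, and the function names and city data are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_count(split):
    # Each "worker" aggregates only the rows in its own split.
    return Counter(row["city"] for row in split)

def distributed_count(splits, workers=4):
    # Threads are a stand-in for worker nodes in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_count, splits))
    # A final step merges the partial aggregates into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

splits = [
    [{"city": "NYC"}, {"city": "SF"}],
    [{"city": "NYC"}],
    [],
]
print(distributed_count(splits))
```

Adding workers lets more splits be processed at once, which is why this style of engine scales by adding nodes.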

Leverage the Data Consumption Layer

Building ETLs is hard, expensive, and sometimes prone to errors.

Starburst Enterprise works to eliminate the need for copying and moving data. This means no more cubes, extracts, or aggregation tables. Starburst delivers a data consumption layer to more than 40 enterprise data sources.

Individual Starburst connectors are enhanced with table statistics, aggregate pushdown, dynamic filtering, parallelism, and more. Together they provide a single point of access to all your data, and a data consumption layer for your data team. A single Trino (formerly Presto® SQL) query can return results from data in Hadoop, S3, Snowflake, ADLS, Delta Lake, BigQuery, Kafka, Redshift, and many others.
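Of the connector enhancements listed above, dynamic filtering is the least self-explanatory, so here is a simplified sketch of the idea: read the small side of a join first, collect its join keys, and use that set to skip non-matching rows while scanning the large side. This is a conceptual toy, not how Starburst or Trino implement it. The function name and the stores and sales data are hypothetical.

```python
def join_with_dynamic_filter(small_rows, large_rows, key):
    # Step 1: read the smaller (build) side and index it by join key.
    build = {}
    for row in small_rows:
        build.setdefault(row[key], []).append(row)
    dynamic_filter = set(build)

    # Step 2: while scanning the larger (probe) side, skip rows whose key
    # cannot match. In a real engine such a filter can be handed to the
    # data source so those rows are never read in the first place.
    kept = 0
    joined = []
    for row in large_rows:
        if row[key] not in dynamic_filter:
            continue
        kept += 1
        for match in build[row[key]]:
            joined.append({**match, **row})
    return joined, kept

stores = [
    {"store_id": 1, "region": "west"},
    {"store_id": 2, "region": "east"},
]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 3, "amount": 99},
    {"store_id": 2, "amount": 20},
    {"store_id": 3, "amount": 42},
    {"store_id": 1, "amount": 5},
]
joined, kept = join_with_dynamic_filter(stores, sales, "store_id")
print(kept, len(sales))
```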

Developing any of these features on your own is not only expensive, but it requires a whole extra team to manage.

So now we know the what, why, and how. But let’s answer a few wheres.

Where is Presto helpful?

Below we will discuss a few use cases where companies are already using Presto as well as a few broad examples of how your team can use Presto.

Presto And Starburst Use Cases

Enough about technical features. What can tools like Presto and Starburst do? Below are some use cases for companies already using these tools.

Netflix’s Big Data Platform

Netflix is one of many companies that have utilized Presto as part of their big data platform. Its use cases range from analyzing A/B test results to analyzing user streaming experience to training data models for their recommendation algorithms.

Standard data warehouse tools weren’t sufficient for Netflix.

This is why Netflix turned to Presto. Presto helped address the company’s ad hoc interactive use cases. They needed a query engine that could handle all of their various use cases at scale.

Presto was that query engine for Netflix. But Netflix is not the only company using Presto for the compute portion of their data warehouse.

Lyft For Their Data Lake And SQL Engine

Early in 2017, Lyft started exploring Presto for OLAP use cases and realized the potential of this query engine. Now, thousands of dashboards inside of Lyft are utilizing Presto and its ability to handle large numbers of users.

Before Presto, Lyft relied on AWS Redshift, which had data storage and compute coupled together.

This coupling of compute and storage caused many performance issues. For example, if the system required maintenance, an upgrade, downtime, or scaling of nodes, it would make querying very slow. They needed a system where data and compute were decoupled, and that’s where Presto fit nicely into their use case.

With these improvements, Lyft’s backend now supports about 1.5K weekly active users who are running a couple of million queries every month on their platform. All thanks to Presto.

Self Service BI

At the end of the day, the bottleneck for data analytics, machine learning, and data science remains getting access to the right data. Data engineers and data architects are constantly having to work on integrating, migrating, and designing new data warehouse tables.

With the larger influx of data coming from every direction and the increased demand for data-driven decisions, Presto can help alleviate the pressure both on data engineers and on data storage systems.

Presto allows analysts to join data across multiple data sources. This includes joining systems like Hadoop, S3, and Cassandra with other sources such as a traditional relational database.

“Query it where it lies” is what Starburst likes to say. With Presto, you can finally stop moving data around just to query it!

Starburst gives you both the performance advantages provided by Presto and the advantages that come with using a Presto instance developed for enterprise use.

Data Lakes

We did discuss data lakes in the Lyft example. But we wanted to call this out separately. What is great about Presto is it is not limited to querying structured data. In fact, Presto offers a decent amount of array and map functions.
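For readers who have not used array and map functions in SQL, the sketch below mimics in plain Python what three of Presto's functions compute: cardinality for array length, element_at for map lookup, and UNNEST for turning array elements into rows. The Python helpers and the sample rows are assumptions for illustration. Only the function names in the comments come from Presto.

```python
def cardinality(values):
    # Like Presto's cardinality(array): the number of elements.
    return len(values)

def element_at(mapping, key):
    # Like Presto's element_at(map, key): a missing key gives NULL,
    # represented here by None.
    return mapping.get(key)

def unnest_tags(rows):
    # Like CROSS JOIN UNNEST(tags) AS t(tag): one output row per array
    # element, so a row with an empty array produces no output.
    for row in rows:
        for tag in row["tags"]:
            yield {"user": row["user"], "tag": tag}

rows = [
    {"user": "a", "tags": ["mobile", "trial"], "attrs": {"plan": "pro"}},
    {"user": "b", "tags": [], "attrs": {}},
]
print([cardinality(r["tags"]) for r in rows])
print([element_at(r["attrs"], "plan") for r in rows])
print(list(unnest_tags(rows)))
```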

This means that teams can interact with less structured data and still use SQL to analyze the information. On top of this, Presto, especially when partnered with Starburst Data, can access data from almost any data storage system, whether it be Hadoop or S3. The ability to query data where it lives is what makes Presto a good compute layer for data lakes.

Presto Is Useful And Starburst Makes It Easy

Presto provides a lot of advantages for companies of all sizes. In particular, the ability to query data where it lives reduces the amount of time data engineers need to spend developing complex ETLs, meaning your teams can more quickly answer the questions that their business owners have. This is a huge benefit and, when partnered with Starburst Data, makes Presto relatively easy to utilize.

Instead of needing a large team to manage all your Presto clusters, you can easily have 1-2 engineers manage and develop your data infrastructure.

And if your team needs help implementing Presto and Starburst Data, reach out to us today. Our data science and data engineering consulting team can help you build everything from a big data platform to machine learning models.
