Reviewing Varada And How It Can Improve Trino’s Performance

February 11, 2022

Companies are continually setting up programs and initiatives to become more data-driven. Millions of dollars are being spent on new infrastructure, hiring employees, and creating processes to drive value with data.

The need for better tooling, faster data, and experienced data professionals will become even more important as data sources, size, and speed grow. Eventually, however, the number of data engineers you need to hire becomes unsustainable. With the average US salary for a data engineer around $90k-$150k, companies may struggle to afford large data teams.

Thus, in the future, companies will likely be forced to rely on better tooling to reduce the amount of manual intervention, and the number of such tools seems to be growing.

Whether companies are looking into Fivetran, Airbyte, Monte Carlo, or Bigeye, there are a lot of choices. I am constantly sorting through new tools to weigh their pros and cons and see how they could benefit a company’s data stack. For example, companies like Dremio and Starburst have been well funded in the last year and have been working on a lot of great features. However, some features are still missing. With Trino, for instance, there are limitations in how the underlying engine tries to optimize queries, and these can lead to increasing costs. That’s where another solution could come into play: Varada.

What Is Varada?

One of the many companies that I have come across in the past year is Varada. Varada has put together an indexing solution that can be plugged into various types of Trino and Presto systems. To put it simply, Varada offers a big data indexing technology that serves as a smart acceleration layer on your data lake. In particular, Varada is focused on improving the performance and cost of queries on Trino instances. It does this through several technologies it has developed to mitigate some of the costs associated with running expensive jobs on Trino.

There are many things Varada is set up to do well, and some places where I would love to see it improve. Here are my thoughts on Varada.

What Varada Does Well

Indexing Data

At the end of the day, what Varada does well is indexing queries and data. Its approach to managing big data indexes is based on what Varada has dubbed nano-block indexing. Nano-block indexing involves storing each index as multiple small chunks, where each chunk is a segment of the complete index. Put together, the individual index segments recreate the equivalent of a global index in a traditional database. Nano-block partitions are written independently and read in parallel at query time. Since big data rarely changes, nano-blocks don’t need to be optimized for regular updates the way transactional indexes are.

Compared with basic big data partitions, which are limited to a single primary segment column, users can create big data indexes on any column, adding and removing column indexes without updating the primary dataset.
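
To make the idea of chunked, per-column indexes more concrete, here is a minimal Python sketch. It is not Varada's implementation. The chunk size, the function names (`build_block_indexes`, `lookup`), and the sample table are all made up for illustration, and a plain dictionary stands in for whatever index structure a real system would use. It only shows the general pattern described above: each chunk is indexed on its own, chunks are probed in parallel at query time, and any column can have an index.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BLOCK_ROWS = 4  # hypothetical chunk size; tiny so the example is easy to follow


def build_block_indexes(values, block_rows=BLOCK_ROWS):
    """Split one column into fixed-size chunks and index each chunk on its own."""
    blocks = []
    for start in range(0, len(values), block_rows):
        index = defaultdict(list)
        for offset, value in enumerate(values[start:start + block_rows]):
            index[value].append(start + offset)  # store the global row number
        blocks.append(dict(index))
    return blocks


def lookup(blocks, wanted):
    """Probe every chunk index in parallel and merge the matching row numbers."""
    with ThreadPoolExecutor() as pool:
        hits = list(pool.map(lambda index: index.get(wanted, []), blocks))
    return sorted(row for block_hits in hits for row in block_hits)


# A made-up table with two columns; each column gets its own set of chunk indexes.
table = {
    "country": ["US", "DE", "US", "FR", "DE", "US", "FR", "US", "DE"],
    "status": ["ok", "ok", "err", "ok", "err", "ok", "ok", "err", "ok"],
}
indexes = {column: build_block_indexes(rows) for column, rows in table.items()}

us_rows = lookup(indexes["country"], "US")
err_rows = lookup(indexes["status"], "err")
both = sorted(set(us_rows) & set(err_rows))  # rows matching both filters
print(us_rows, err_rows, both)
```

In this sketch the indexes live in a separate dictionary from the table, so dropping one (for example `del indexes["status"]`) leaves the underlying rows untouched, which mirrors the point about adding and removing column indexes without updating the primary dataset.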

Reducing Trino Costs

Another selling point of Varada is that it helps reduce costs for Trino users. Here are recent results from Varada testing Trino with the Hive connector (the data lake connector) against Trino with the Varada connector. As you can see, the cost for these 4 queries is reduced.

Figure: Cost / Query ($, normalized for 1-hour 100% utilization, except for on-demand)

With Varada, data teams and users no longer need to compromise on performance to achieve agility and cost-effectiveness. They can leverage autonomous indexing and the caching of data and intermediate results to accelerate Presto queries by 10x-100x on their existing cluster. This is great for teams managing Trino.

Reducing the need for heavy data engineering

Another part of Varada’s vision is reducing data engineering work by allowing Trino to connect to a wide array of underlying databases. That reach is limited by the Trino instance you’re using, and because Trino has a limited number of connectors, this poses some issues. There are options here, such as using Starburst to increase the number of connectors.

This is one sticking point we will come back to when looking at what Trino and Varada can work on.

What Varada Can Work On

Varada is still early in developing many of its features, and I imagine it has a lot more on its roadmap. Here are some of my thoughts on what Varada can improve.

It’s Easier Said Than Done – Removing Data Engineering Work

Data engines like Trino have a limited set of connectors and don’t pull data directly from business applications. That means there is still a need for data engineers to do a lot of work, since a lot of data comes from business applications like Workday, Salesforce, and Zendesk.

Although one of Varada’s goals is to connect with Trino and reduce the need for redundant data engineering work, there is still a lot to be managed. Arguably the harder parts are the API, SFTP, and other more finicky connectors. Direct database connections are pretty easy by comparison, and one of the larger issues data engineers deal with there is the possibility of software engineers changing the underlying data tables (which is still a pain).

So it’s important to realize you will still have a lot of data engineering work ahead. Even if you use tools like Fivetran to manage the business application side, you will still need a data engineer to handle much of the setup.

Wider Range Of Data Engines

Sticking with one data engine can pose a lot of risk for a company. I have watched other plug-in-style companies struggle to keep up with changing data stacks because they picked the wrong underlying data warehouse. There are clear benefits to picking a strong underlying platform: if that platform succeeds, then so will the third-party add-on.

However, if that underlying data engine doesn’t succeed, then there goes the third-party add-on.

Varada and Trino’s Future

Big data will continue to be a big problem for companies that don’t start looking for tools like Trino to help them manage all their data. Companies want faster access to all their data sources, and finding a balance between speed of access and cost is a challenge. Of course, there are also a lot of people and processes that need to be put in place. Varada has a lot of promise, and I look forward to seeing where it will go in the next few years.

If you’re looking to improve your Trino setup, or just talk to other big data experts, then check out Varada’s Slack community and test out their community connector edition.