26 Data Catalogs – From Open Source To Managed

August 22, 2022

It can be easy to take certain tools for granted when you work for companies with mature infrastructure. One of my favorite tools at Facebook was iData.

iData was Facebook’s data discoverability tool. It provided a lot of functionality that I have started to miss. This included the baseline functions you would expect, such as the ability to find tables, trace lineage, and track down the owners of said tables.

But there were also other beneficial features like cost tracking, data quality assessments, and table certification. All of these features made it easy for a new data engineer to quickly orient themselves as they started on new projects.

My Favorite iData Feature

My favorite feature was being able to see how other users were using the data at the query level. This provided a lot more context than just commented fields. ERDs and data lineage are great, but seeing exactly how other users were using the data made it easy to understand (and those users were great people to ping if you had questions).

It was so easy to quickly understand how the data was already being used. This provided several benefits including:

  • Reducing the duplication of work
  • Providing context on how data could join together (even across multiple data sources)
  • Showing you who to ask questions about the data. Sure, the owner is a great place to start, but owners sometimes move away from datasets over time
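
To make the idea concrete, here is a minimal sketch of how query-level usage could be mined from a query log. This is not iData's implementation. The log format, the table names, and the helper functions `tables_in` and `summarize` are all hypothetical, and the regex is a naive stand-in for a real SQL parser.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical query log of (user, sql) pairs. All names are made up.
QUERY_LOG = [
    ("alice", "SELECT * FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id"),
    ("bob", "SELECT count(*) FROM sales.orders"),
    ("alice", "SELECT * FROM sales.orders o JOIN web.sessions s ON o.session_id = s.id"),
]

# Naive assumption: a table name is the identifier right after FROM or JOIN.
# Subqueries, CTEs, and quoted identifiers would need a proper SQL parser.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the sorted, de-duplicated table names referenced in one query."""
    return sorted(set(TABLE_PATTERN.findall(sql)))


def summarize(query_log):
    """Count who queries each table and which tables appear in the same query."""
    users_by_table = defaultdict(Counter)
    joined_pairs = Counter()
    for user, sql in query_log:
        tables = tables_in(sql)
        for table in tables:
            users_by_table[table][user] += 1
        # Tables that show up in one query are treated as "joined together".
        for pair in combinations(tables, 2):
            joined_pairs[pair] += 1
    return users_by_table, joined_pairs


users_by_table, joined_pairs = summarize(QUERY_LOG)
```

With a summary like this, `users_by_table["sales.orders"].most_common()` points to the people who query a table most often (good candidates to ask questions), and `joined_pairs` hints at which tables are commonly used together.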

Upon leaving the company formerly known as Facebook, I felt like I kept stumbling on a new data catalog or discoverability tool every week. At this point, I am sure I have come across at least 3-5 dozen data discovery tools, all of which add their own flair to helping teams manage their metadata.

With so many data discoverability tools out there, I wanted to take a moment and catalog all the data catalogs. Below you will find just some of the ever-growing list of data discovery tools.

Catalog Of Data Catalogs

A

Aggua

Alation

Amundsen (Lyft)

Atlan

...Honestly, probably a half dozen other “A” data catalogs.

B

Boomi Data Catalog

C

Castor

Cloudera Data Catalog

Collibra

D

Databook (Uber)

DataHub (LinkedIn)

data.world

E

erwin Data Catalog

G

Glue Data Catalog

I

Informatica Enterprise Data Catalog

L

Lumada Data Catalog (Hitachi)

M

Magda Data Catalog

Metacat (Netflix)

O

Octopai

Oracle Data Catalog

OvalEdge Data Catalog

R

Redgate Data Catalog

S

Secoda

Select Star

T

Talend Data Catalog

Trudat Data Catalog

Z

Zeenea Data Catalog Software

Do You Find Data Discovery Tools Helpful?

Data discovery tools can help reduce onboarding time for new hires as well as improve data workflows. But it is often difficult to gain buy-in for them, and there are so many options that many companies struggle to figure out which solution fits their needs best.

For me, having an easy-to-search tool at a company that probably had 30,000+ tables was a must. I did spend a lot of time in the UI. Sometimes it was because I needed to update a few pieces of information so my table could be certified. Other times it was because I found a table that looked like it had data I wanted to pull in, so I wanted to see if I could avoid building another pipeline.

Overall, as teams grow, data discovery tools become a must. But I would love to hear your thoughts.

If you are interested in reading more about data science or data engineering, then read/watch the articles/videos below.

5 Great Data Engineering Tools For 2021 – My Favorite Data Engineering Tools

4 SQL Tips For Data Scientists

What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate

5 Great Libraries To Manage Big Data With Python

Kafka Vs RabbitMQ

SQL Best Practices — Designing An ETL Video