One viable option for organizations looking to harness massive amounts of data is the data lake, a single repository for storing all the raw data, both structured and unstructured, that floods into the company. But that isn’t the end of the story. The key to making a data lake work is data governance, using metadata to provide valuable context through tagging and cataloging.
This practical report examines why metadata is essential for managing, migrating, accessing, and deploying any big data solution. Authors Federico Castanedo and Scott Gidley dive into the specifics of analyzing metadata to keep track of your data: where it comes from, where it's located, and how it's being used. With that context in hand, you can put safeguards in place and reduce risk. Along the way, you'll learn methods for automating metadata capture.
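To make the tagging-and-cataloging idea concrete, here is a minimal sketch of capturing metadata automatically at ingestion time. The `CatalogEntry` fields and `register` function are illustrative assumptions for this sketch, not the API of any product covered in the report:

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """Metadata recorded for one dataset landing in the lake (hypothetical schema)."""
    name: str
    source: str                 # where the data comes from
    location: str               # where it is stored in the lake
    tags: list = field(default_factory=list)
    registered_at: str = ""     # when it was cataloged
    fingerprint: str = ""       # content hash, useful for lineage and dedup

def register(catalog: dict, name: str, source: str, location: str,
             raw_bytes: bytes, tags=None) -> CatalogEntry:
    """Capture metadata at ingestion time instead of relying on manual entry."""
    entry = CatalogEntry(
        name=name,
        source=source,
        location=location,
        tags=list(tags or []),
        registered_at=datetime.now(timezone.utc).isoformat(),
        fingerprint=hashlib.sha256(raw_bytes).hexdigest(),
    )
    catalog[name] = entry
    return entry

# Example: register a raw clickstream drop and tag it for a privacy review.
catalog = {}
entry = register(catalog, "clickstream_2023_01", source="web-frontend",
                 location="s3://lake/raw/clickstream/2023/01/",
                 raw_bytes=b"user,ts,url\n", tags=["raw", "pii-review"])
print(entry.tags)  # ['raw', 'pii-review']
```

Even a simple catalog like this answers the three questions above (origin, location, usage) for every dataset, which is the foundation the report's governance discussion builds on.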
This report also explains the main features of a data lake architecture, and discusses the pros and cons of several data lake management solutions that support metadata. These solutions include:
- Traditional data integration/management vendors such as Informatica
- Tooling from open source projects, including Teradata Kylo
- Startups such as Trifacta and Zaloni that provide best-of-breed technology
- Research efforts such as the IBM Research Accelerated Discovery Lab