What is Azure Data Catalog?
Azure Data Catalog is a Microsoft cloud service that takes a crowdsourced approach to metadata. It provides an inventory of an organization's data sources so that users can discover and understand them. Azure Data Catalog is delivered as a software-as-a-service (SaaS) application.
Azure Data Catalog helps you get more value from existing investments by adding metadata and annotations around the data in your environment. It tells you which data sources you already have and which ones have been newly discovered, and it holds the documentation and schema description for each source. The catalog stores the data source's location together with a copy of its metadata, so users can reach the source easily when needed, and the indexed metadata makes each source discoverable through search.
Why do you need Azure Data Catalog?
Without a data catalog, both data consumers and data producers face several challenges.
Challenges as a Consumer
- Before Azure Data Catalog, users worked only with the datasets they already knew.
- Choosing among data sources was difficult.
- Data was not always available when it was needed.
- Without information about a data source, such as its location, the consumer cannot connect to the data.
- There was no central location where data sources were registered.
- Discovering data sources depended on tribal knowledge.
- Without documentation, it was hard to understand the data and its intended use.
- To ask questions about a dataset, the consumer had to locate the producer who created it.
- Even after discovering a data source and reading its documentation, the consumer may not know how to request access to it.
- Time spent understanding the data could have been spent analyzing it.
Challenges as a Producer
- Creating and maintaining the documentation for the data source is time-consuming and complex.
- Making documentation readily available to all the clients is also a challenge.
- Because data sources keep changing, the documentation must be updated regularly, which is an ongoing responsibility for the data producer.
- Annotating the source itself with metadata does not help much either, because clients mostly ignore the descriptions stored in the data source.
Azure Data Catalog to the Rescue
Azure Data Catalog provides a cloud-based service in which data sources can be registered, making them available to clients at all times. The data remains in its existing location, but the data source's location and a copy of its metadata are added to Azure Data Catalog, so clients can reach the data whenever they need it. This eliminates the need for tribal knowledge, because all the data source information is in the catalog. The metadata in the catalog is indexed, so each data source is easy to find through search and easy to understand for the user who finds it. Once a data source is registered, its metadata can be updated by the user who registered it or by another user in the enterprise, which addresses the problem of keeping documentation up to date that existed before the data catalog.
Because Azure Data Catalog provides a clear view of each data source, the time once spent understanding the data can be spent analyzing it. More analysis can therefore be done without hiring a new analyst. The catalog also describes the intended use of the data, so users can evaluate whether a data source suits their purpose and choose sources according to their requirements. Finally, it offers the convenience of opening the data source in the tool of choice.
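The register-then-search idea described above can be illustrated with a small, self-contained sketch. This is a toy model only, not how Azure Data Catalog is implemented: the `AssetEntry` and `ToyCatalog` names, the fields, and the token-based index are all hypothetical and chosen for illustration. It shows the catalog holding only a location and a metadata copy, any user enriching that metadata later, and an index answering searches.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AssetEntry:
    """Metadata copy for one registered data source (hypothetical shape).

    The data itself stays where it is; only its location and
    descriptive metadata are stored in the catalog.
    """
    name: str
    location: str
    description: str = ""
    tags: set = field(default_factory=set)


class ToyCatalog:
    """Toy in-memory catalog with a simple token index (illustration only)."""

    def __init__(self):
        self._entries = {}
        self._index = defaultdict(set)  # token -> names of matching assets

    @staticmethod
    def _tokens(entry):
        text = " ".join([entry.name, entry.description, *entry.tags])
        return set(text.lower().replace("_", " ").split())

    def _reindex(self, entry):
        # Remove stale postings for this asset, then add the current tokens.
        for names in self._index.values():
            names.discard(entry.name)
        for token in self._tokens(entry):
            self._index[token].add(entry.name)

    def register(self, entry):
        """Register a data source: store its metadata copy and index it."""
        self._entries[entry.name] = entry
        self._reindex(entry)

    def annotate(self, name, description=None, tags=()):
        """Let any user enrich the metadata after registration."""
        entry = self._entries[name]
        if description is not None:
            entry.description = description
        entry.tags.update(tags)
        self._reindex(entry)

    def search(self, query):
        """Return locations of assets whose metadata contains every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._index.get(t, set()) for t in terms))
        return sorted(self._entries[n].location for n in hits)
```

In this sketch, a consumer who searches for "orders" gets back the location of the matching source rather than the data itself, which mirrors the point that the data never moves into the catalog.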
Azure Data Catalog Roadmap
The steps below outline a typical sequence for getting started with Azure Data Catalog:
- Create a data catalog
  - Go to the Azure portal
  - Create a resource
  - Select Data Catalog
  - Name the catalog, select its location, and then create it
- Publish the data
  - Go to the Data Catalog home page
  - Click on Publish Data
Next, configure the catalog's edition and settings:
- Go to Settings
- Select the Catalog Edition (Free or Standard)
- Add users
  - Go to Catalog Users
  - Click on Add Users
- Set the portal title (expand Portal Title first, and then add the text to be displayed on the portal)
- Open the Publish page (go to the Settings page, navigate to the Publish page, and click on it)
To find the Data Catalog in the Azure portal:
- Navigate to the Azure portal and sign in to your account.
- Go to All Services.
- Select Data Catalog.
- You can now see the Data Catalog that you just created.
What are the Data Sources of Azure Data Catalog?
Metadata can be published through the public API or by manually entering information directly into Azure Data Catalog. Azure Data Catalog supports a wide range of data sources. The major supported data source objects include Azure Data Lake Store directory, Azure Data Lake Store file, Azure Blob storage, Azure Storage directory, HDFS directory and file, Hive table and view, MySQL table and view, Oracle Database table and view, Teradata table and view, SAP Business Warehouse, SAP HANA view, Cassandra table and view, and MongoDB table and view, among others.
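As a rough illustration of publishing metadata through the public API, the sketch below only builds the URL and JSON body for registering a MySQL table; it does not send anything. The endpoint root, the API version, the `build_register_request` helper, and every field name in the payload are assumptions made for illustration and should be checked against the official REST API reference before use. A real call would also need an authenticated request, which is left out here.

```python
import json
from urllib.parse import quote

# Assumed values; verify against the official REST API reference.
API_ROOT = "https://api.azuredatacatalog.com"
API_VERSION = "2016-03-30"


def build_register_request(catalog, view, name, source_type, object_type,
                           protocol, authentication, address, registered_by_upn):
    """Build (url, json_body) for registering one data asset.

    Hypothetical helper: the payload shape is an assumption, and no
    network request is made.
    """
    url = (
        f"{API_ROOT}/catalogs/{quote(catalog)}/views/{quote(view)}"
        f"?api-version={API_VERSION}"
    )
    body = {
        "properties": {
            # False marks metadata entered by a user rather than
            # extracted from the source system (assumed meaning).
            "fromSourceSystem": False,
            "name": name,
            "dataSource": {"sourceType": source_type, "objectType": object_type},
            # Where the data actually lives; the data itself is not uploaded.
            "dsl": {
                "protocol": protocol,
                "authentication": authentication,
                "address": address,
            },
            "lastRegisteredBy": {"upn": registered_by_upn},
        }
    }
    return url, json.dumps(body, indent=2)


if __name__ == "__main__":
    # All values below are placeholders.
    url, body = build_register_request(
        catalog="DefaultCatalog",
        view="tables",
        name="orders",
        source_type="MySQL",
        object_type="Table",
        protocol="mysql",
        authentication="protocol",
        address={"server": "db.example.com", "database": "sales", "object": "orders"},
        registered_by_upn="user@example.com",
    )
    print(url)
    print(body)
```

Keeping the request construction separate from sending makes it easy to inspect the payload first and to see that only the source's location and descriptive metadata are involved.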
What are the Azure Data Catalog Challenges?
Microsoft has acted on client feedback and fixed many of the problems and bugs in Azure Data Catalog. Still, there is always scope for improvement. The primary and secondary challenges of this approach are listed below.
Primary Challenges
- The UI is fairly simple, but complex tasks become involved in Azure Data Catalog. The UI should keep things easily accessible without too much clutter.
- The service is slow when uploading large datasets, and upload times are inconsistent. The server capacity and the connection between client and server should be improved.
- Getting everything cataloged is quite a task, so a better data cataloging approach is needed.
- Azure Data Catalog has not been updated in a while. Regular updates matter because they deliver bug fixes and new features, and they assure clients that a team is actively working on the product and continuing to develop it.
- Better integration between Power BI and Azure Data Catalog (ADC) is needed.
Secondary Challenges
- Data lineage is the data's life cycle, which helps an analyst or any other consumer understand how data flows from its origin to its destination. The catalog would benefit from supporting it.
- Metadata changes regularly, and updating it manually is a burden, so scheduled metadata updates are needed.
- Azure Data Catalog supports only a single catalog per organization. As an organization grows and works with more data sources, managing everything in a single catalog stops being feasible, so the number of catalogs per organization should be increased.
- Data Catalog supports Azure Data Lake Storage Gen1, which is deprecated, but does not support retrieving data from Azure Data Lake Storage Gen2.
- Snowflake is a widely used database with many clients, so Data Catalog should support Snowflake metadata.
- Data Catalog should provide a backup feature so that the catalog can be restored to a previous point if data is overwritten or removed.
- The Data Catalog REST API has limitations related to asset root size, the number of annotations, and the asset's overall size. In addition, deleting an asset deletes all associated annotations.
Addressing the points above could help Azure Data Catalog grow its business and its share of the data catalog market, since these are the areas where it is lacking. Microsoft should plan its features and updates accordingly to compete with rivals and solve its clients' problems. Cost and service are major deciding factors when choosing any product, so they deserve particular attention.
Conclusion
Azure Data Catalog is a cloud-based service that comes with its pros and cons. It has gained a lot of popularity in recent years because, as a cloud-based service, it is available around the clock. The setup is easy, and customer support is good. Azure Data Catalog is suitable for organizations with larger storage and security requirements. The drawbacks are its cost and the fact that it offers fewer extra features than some rivals in the field, such as Erwin, SAP, and Oracle. Consider all the pros and cons, and then choose your data catalog service wisely!