As numerous components in our pipeline process data, proper schema management becomes essential. The data needs to be understood the same way, whether we process it for our front-end apps or within our AI/ML platform. This is even more important when you consider that the structure, quality, and other characteristics of incoming data vary over time. In other words, an updated schema in a source system must not cause failures in downstream components. A schema registry allows for versioned schema management, schema evolution, additional tagging (like GDPR), etc.
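To illustrate what schema evolution looks like in practice, here is a minimal sketch using Avro-style schemas expressed as Python dicts. The record and field names are hypothetical; the point is that adding a field with a default keeps the change compatible, so consumers on the old version keep working.

```python
# Hypothetical example: two versions of an Avro-style schema for a "Customer" record.

schema_v1 = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
}

# Version 2 adds an optional field WITH a default value. A reader using v2 can
# still decode records written with v1 (the default fills the gap), and a v1
# reader simply ignores the extra field in v2 records, so downstream components
# are not broken by the change.
schema_v2 = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "marketing_consent", "type": ["null", "boolean"], "default": None},
    ],
}
```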
Note that in the reference architecture at the beginning of this article, both the message queue and the schema registry are fronted by a RESTful interface. The reason for using a REST API instead of native clients is a preference for protocols over specific tools.
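As a small sketch of that protocol-first approach, the snippet below registers a new schema version over plain HTTP with the `requests` library. The endpoint path and payload follow the convention used by Confluent-compatible schema registries; the host name, subject, and schema itself are assumptions for illustration.

```python
import json
import requests

# Assumed registry endpoint and subject name -- adjust to your deployment.
REGISTRY_URL = "http://schema-registry:8081"
SUBJECT = "customer-value"

avro_schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
}

# Register (or look up) this schema under the subject; the registry
# assigns it a version number and a globally unique schema id.
response = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(avro_schema)}),
    timeout=10,
)
response.raise_for_status()
print("Registered schema id:", response.json()["id"])
```

Because this is plain HTTP and JSON, any producer, regardless of language or framework, can talk to the registry without pulling in a vendor-specific client library.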
Batch sources are best served by batch ingestion, as these loads typically run on a regular schedule (like your daily DWH export) or on demand (e.g., a department loading a one-off dataset). A scalable and highly durable storage service is the preferred choice here, accommodating structured, semi-structured, and unstructured data in one place and avoiding the data silo problem outlined above. We prefer features like geofencing, storage classes (differentiation between hot and cold storage), object lifecycle management (automatically changing the storage class after a specified time period), versioning, support for serving static content (think assets for your web applications), encryption-at-rest, fine-grained security policies, etc.
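The sketch below shows how a few of these features can be wired up on an S3-compatible object store using `boto3`. The bucket name, prefix, and 30-day transition window are assumptions for illustration; other providers expose equivalent lifecycle, versioning, and encryption controls.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "company-data-lake"  # assumed bucket name

# Versioning: keep prior object versions so accidental overwrites are recoverable.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Encryption-at-rest: apply server-side encryption to every new object by default.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Object lifecycle management: move raw batch exports to a colder, cheaper
# storage class after 30 days instead of migrating them by hand.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-exports",
                "Filter": {"Prefix": "raw/dwh-export/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```

Setting these policies at the bucket level means every batch export lands with consistent durability, cost, and security characteristics, with no per-job configuration required.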