With Snowflake Document AI, information can be easily extracted from documents such as invoices or handwritten forms directly within the data platform. Document AI is straightforward to use: via a graphical user interface, via code in a pipeline, or integrated into a Streamlit application. In this article, we explain the feature, describe how the integration into the platform works, and present interesting application possibilities.
Snowflake, the AI Data Cloud
Snowflake is a cloud data platform with a wide range of possible applications, from classic data warehousing to generative artificial intelligence. As a cloud-native platform, i.e. one developed explicitly for the cloud, Snowflake offers great potential for optimizing resource usage.
Among other things, this includes:
demand-oriented scaling of separate compute and storage instances,
simple sharing of data via the Snowflake Marketplace,
and replication of data across different cloud providers.
Running on Google Cloud Platform, Amazon Web Services and Microsoft Azure, the Snowflake platform is a robust data solution that is not tied to any single cloud provider. While Snowflake started out as a modern option for data warehousing and reporting when the company was founded in 2012, the platform has developed into a comprehensive data platform with remarkable speed in recent years.
In particular, the analytical capabilities of the Snowflake AI Data Cloud go far beyond traditional reporting functions.
Snowflake ML
With Snowflake ML, the platform offers the possibility of implementing machine learning end to end. For example, the platform provides Snowflake Notebooks, an integrated development environment in which data can be conveniently analyzed, visualized and processed with both Python and SQL. Snowflake also offers a feature store for managing model inputs and a model registry for storing all artifacts belonging to a model.
Snowflake Cortex AI
Snowflake ML is supplemented by Snowflake Cortex AI, Snowflake’s offering for generative artificial intelligence (GenAI). Among other features, Cortex AI provides SQL functions that can be easily applied to tables stored in Snowflake using SQL or Python. For example, the Translate function effortlessly translates text into different languages, while the Sentiment function classifies the valence of a text. These functions are based on Large Language Models (LLMs), which are directly integrated into the platform. Snowflake currently supports a wide range of proprietary and open models.
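As a minimal sketch of how such functions can be applied, assuming an illustrative table customer_reviews with a column review_text, a single SQL statement is enough:

```sql
-- Translate review texts into German and score their sentiment.
-- Table and column names (customer_reviews, review_text) are illustrative.
SELECT
    review_text,
    SNOWFLAKE.CORTEX.TRANSLATE(review_text, 'en', 'de') AS review_text_de,
    SNOWFLAKE.CORTEX.SENTIMENT(review_text)             AS sentiment_score
FROM customer_reviews;
```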
Snowflake Document AI
LLMs also form the basis for Snowflake Document AI. Document AI is a feature that can be used to extract information from documents in order to obtain structured data from unstructured data. The unstructured data can consist of handwritten or machine-generated documents, but logos and checkboxes can also be processed by the AI. The feature has been “generally available”, i.e. fully available, in most regions of the cloud providers AWS and Azure since October 2024.
For Document AI, Snowflake uses Arctic-TILT, a proprietary LLM that was specially developed for processing documents. As the LLM is already pre-trained, it often achieves strong performance even with new document types that the model has not seen before (zero-shot extraction). If this performance is not sufficient, the LLM can be retrained for a specific document type in a dedicated fine-tuning process.
Data security is guaranteed, as the data used for training does not have to leave the Snowflake platform. Neither the newly trained model nor the data used for it is shared with other Snowflake customers. Another advantage of using Arctic-TILT for data extraction is the size of the model: with 0.8 billion parameters, it is comparatively small and therefore inexpensive to operate. Despite its small size, Arctic-TILT offers outstanding performance: on DocVQA, the standard benchmark for answering questions about visual documents, it beats large generic models such as GPT-4.
Figure 1: Modular process for providing a document AI model
Implementing a new use case is straightforward and can be done in just a few steps in the Snowflake User Interface, Snowsight:
First, a new Snowflake object, a so-called Model Build, is created and a sufficiently large computing instance is selected on which the model is executed.
Documents containing the data to be extracted are uploaded to the Snowflake platform via a graphical user interface.
You can now ask questions, known as prompts, about the document in natural language, which are answered by the LLM.
The accuracy of an answer can be checked on the one hand by a direct comparison with the content of the document (see Figure 1). On the other hand, the LLM also generates a confidence score for each answer: a numerical value between 0 and 1 that indicates how “certain” the LLM is about the given answer.
If an answer is incorrect, it can be corrected manually. The corrected answers then flow into the (optional) fine-tuning, which is started with a click before the model is published, i.e. made available for use. If the model already answers the questions with sufficient accuracy without fine-tuning, it can be published directly without further training.
Figure 2: Extraction of data from unstructured documents using natural language in a graphical user interface
Automated Data Extraction in Snowflake
A successfully trained model offers the greatest added value if it is embedded in a productive pipeline that processes documents automatically, extracts the unstructured data and makes it available as structured data for analysis and further processing. Such a pipeline could be set up as follows:
Create a stage, a Snowflake object for storing files, which is used to load new documents into the Snowflake platform. The ENCRYPTION parameter can be used to specify an encryption method for the documents. The DIRECTORY parameter makes it possible to record metadata for the loaded documents.
A stream on the stage can be used to recognize newly uploaded documents.
The extraction of data from the documents, and thus the movement of information from the stage to a table, can be orchestrated by another Snowflake object, a task. The task refers to the stream and selects the metadata and document content of newly arrived documents. A simple function call (PREDICT) invokes the LLM, retrained if necessary, to answer the questions relating to the document content and saves the extracted results together with the confidence scores in a table. As usual in Snowflake, the execution speed and frequency can be controlled via the virtual warehouse used for execution and the task schedule.
In order to be able to use the results of the model in a meaningful way, the data is processed further in the final step. As the PREDICT function returns semi-structured data, the results can be conveniently transformed further using Snowflake's native functions for semi-structured data and materialized in a table. This step can also be automated with an additional task.
In addition, a Streamlit application can be integrated into the pipeline for manual validation in a graphical user interface.
In just a few minutes, a productive pipeline can thus be deployed that uses the trained model to transform unstructured data into structured data.
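The following is a minimal SQL sketch of such a pipeline. All object names (doc_stage, doc_stream, invoice_model, doc_ai_wh) and the question keys (invoice_number, total_amount) are illustrative assumptions; the actual keys depend on the prompts defined in the model build, and a view is used here instead of an additional materialization task for brevity:

```sql
-- 1. Stage for incoming documents, with server-side encryption and a
--    directory table for file metadata.
CREATE OR REPLACE STAGE doc_stage
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY  = (ENABLE = TRUE);

-- 2. Stream on the stage's directory table to detect newly uploaded documents.
--    (New uploads become visible after ALTER STAGE doc_stage REFRESH.)
CREATE OR REPLACE STREAM doc_stream ON STAGE doc_stage;

-- 3. Target table for the raw, semi-structured extraction results.
CREATE OR REPLACE TABLE doc_raw_extractions (
  file_name  STRING,
  extraction VARIANT
);

-- 4. Task that calls the Document AI model build for every new document.
CREATE OR REPLACE TASK extract_documents
  WAREHOUSE = doc_ai_wh
  SCHEDULE  = '10 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('doc_stream')
AS
  INSERT INTO doc_raw_extractions
  SELECT
    RELATIVE_PATH,
    invoice_model!PREDICT(GET_PRESIGNED_URL(@doc_stage, RELATIVE_PATH), 1)
  FROM doc_stream;

ALTER TASK extract_documents RESUME;

-- 5. Transform the semi-structured results into tabular form; the question
--    keys correspond to the prompts defined in the model build.
CREATE OR REPLACE VIEW doc_extractions AS
SELECT
  file_name,
  extraction:invoice_number[0]:value::STRING AS invoice_number,
  extraction:invoice_number[0]:score::FLOAT  AS invoice_number_confidence,
  extraction:total_amount[0]:value::STRING   AS total_amount,
  extraction:total_amount[0]:score::FLOAT    AS total_amount_confidence
FROM doc_raw_extractions;
```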
Figure 3: Pipeline for extracting data from documents with Document AI in Snowflake
Applications of Snowflake Document AI
Document AI is easy to use and has no domain-specific focus. The range of possible applications for the LLM-supported service is correspondingly broad.
In most companies, invoices have to be entered into systems and checked. These two steps can be automated and simplified with Document AI. Employees can manage invoices in a single location that is accessible via a Snowflake stage. Uploading a document then triggers the pipeline described above, which extracts the required information from the documents. When checking the results, employees can use the confidence score to focus on a few borderline cases. An interface for this can be easily implemented using Streamlit.
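As a small sketch building on the illustrative doc_extractions view above, low-confidence invoices could be selected for manual review, for example as the input for such a Streamlit app:

```sql
-- Route invoices to manual review when any extracted field has low confidence.
-- The doc_extractions view, its columns and the 0.9 threshold are illustrative.
SELECT
    file_name,
    invoice_number,
    total_amount
FROM doc_extractions
WHERE LEAST(invoice_number_confidence, total_amount_confidence) < 0.9;
```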
In addition to such general scenarios, very specific use cases can also be addressed. For example, the evaluation of clinical studies in the pharmaceutical industry could benefit from Snowflake Document AI by using it to extract facts from scientific articles or to read out patient questionnaires. In the latter case in particular, machine evaluation can prevent transcription errors. In general, Snowflake Document AI is ideal for processing all kinds of handwritten questionnaires and forms.
Conclusion
These use cases show that Document AI offers great added value across all domains, which is reflected not only in the acceleration of processes and the reduction of costs, but also in the improvement of quality.
If you are also facing the challenge of transforming your documents into structured data in order to generate real added value for your company, we will be happy to support you. Contact us for an initial meeting to discuss your use case with no obligation.