Snowflake Document AI – Easily Extract Data From Unstructured Documents

With Snowflake Document AI, information can be extracted from documents such as invoices or handwritten forms directly within the data platform. Document AI is straightforward to use: via a graphical user interface, via code in a pipeline, or integrated into a Streamlit application. In this article, we explain the feature, describe how it integrates into the platform and present interesting possible applications.

Snowflake, the AI Data Cloud

Snowflake is a cloud data platform with a wide range of possible applications, from classic data warehousing to generative artificial intelligence. As a cloud-native platform, i.e. one developed explicitly for the cloud, Snowflake offers great potential for optimizing resource usage.

Among other things, this includes:

  • Demand-oriented scaling of separate computing and storage instances,
  • the simple sharing of data via the Snowflake Marketplace
  • and the replication of data across different cloud providers.

Available on Google Cloud Platform, Amazon Web Services and Microsoft Azure, the Snowflake platform is a robust data solution that is not tied to any single cloud provider. While Snowflake started out as a modern option for data warehousing and reporting when the company was founded in 2012, the platform has developed into a comprehensive data platform with remarkable speed in recent years.

In particular, the analytical capabilities of the Snowflake AI Data Cloud go far beyond traditional reporting functions.

Snowflake ML

With Snowflake ML, the platform offers the possibility of implementing machine learning end to end. For example, it supports Snowflake Notebooks, an integrated development environment in which data can be conveniently analyzed, visualized and processed with both Python and SQL. Snowflake also offers a feature store, in which model inputs can be managed, and a model registry, in which all artifacts belonging to a model can be stored.

Snowflake Cortex AI

Snowflake ML is complemented by Snowflake Cortex AI, Snowflake's offering for generative artificial intelligence (GenAI). Among other features, Cortex AI provides functions that can be easily applied to tables stored in Snowflake using SQL or Python. For example, the TRANSLATE function effortlessly translates text into different languages, while the SENTIMENT function scores the emotional valence of a text. These functions are based on Large Language Models (LLMs) that are integrated directly into the platform. Snowflake currently supports a wide range of proprietary and open models.
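As a minimal sketch of how these functions are called, the following queries apply them to a hypothetical reviews table with a review_text column; the table and column names are assumptions for illustration:

```sql
-- Translate a (hypothetical) German review column into English
SELECT SNOWFLAKE.CORTEX.TRANSLATE(review_text, 'de', 'en') AS review_en
FROM reviews;

-- Score the sentiment of each review; the function returns a value
-- between -1 (negative) and 1 (positive)
SELECT SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment_score
FROM reviews;
```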

Snowflake Document AI

LLMs also form the basis for Snowflake Document AI. Document AI is a feature for extracting information from documents, turning unstructured data into structured data. The unstructured data can consist of handwritten or machine-generated documents; logos and checkboxes can also be processed by the AI. The feature has been generally available, i.e. fully released, in most AWS and Azure regions since October 2024.

For Document AI, Snowflake uses Arctic-TILT, a proprietary LLM that was developed specifically for processing documents. As the LLM is pre-trained, it often achieves strong performance even on document types that were previously unknown to the model (zero-shot extraction). If this performance is not sufficient, the LLM can be fine-tuned for a specific document type in a dedicated training process.

Data security is ensured because the data used for training never has to leave the Snowflake platform. Neither the newly trained model nor the data used for it is shared with other Snowflake customers. Another advantage of using Arctic-TILT for data extraction is the size of the model: with 0.8 billion parameters, it is comparatively small and therefore inexpensive to operate. Despite its small size, Arctic-TILT offers outstanding performance: on DocVQA, the standard benchmark for answering questions about visual documents, it beats large generic models such as GPT-4.

Figure 1: Modular process for providing a Document AI model: Create Model, Define Variables, Validate Performance, Train the Model, and Publish the Model.

Implementing a new use case is straightforward and can be done in just a few steps in the Snowflake User Interface, Snowsight:

  • First, a new Snowflake object, a so-called model build, is created, and a sufficiently large computing instance on which the model will run is selected.
  • Documents containing the data to be extracted are uploaded to the Snowflake platform via a graphical user interface.
  • Questions about the document, known as prompts, can now be asked in natural language and are answered by the LLM.
  • The accuracy of an answer can be checked by comparing it directly with the content of the document (see Figure 2). In addition, the LLM generates a confidence score for each answer: a numerical value between 0 and 1 that indicates how “certain” the LLM is about the given answer.
  • If an answer is incorrect, it can be corrected manually. The corrected answer then flows into the (optional) fine-tuning, which is started with a click before the model is published, i.e. made available for use. If the model answers the questions with sufficient accuracy even without fine-tuning, it can be published directly.
Figure 2: Extraction of data from an unstructured document (here a USPTO patent) using natural language in a graphical user interface; extracted fields include invention_title, application_number and inventor.

Automated Data Extraction in Snowflake

A successfully trained model offers the greatest added value if it is embedded in a productive pipeline that processes documents automatically, extracts the unstructured data and makes it available as structured data for analysis and further processing. Such a pipeline could be set up as follows:

  • Create a stage, a Snowflake object for storing files, which is used to load new documents into the Snowflake platform. The ENCRYPTION parameter can be used to specify an encryption method for the documents, and the DIRECTORY parameter makes it possible to record metadata for the loaded documents (see the sketch after this list).
  • A stream on the stage can be used to recognize newly uploaded documents.
  • The extraction of data from the documents, and thus the movement of information from the stage to a table, can be orchestrated by another Snowflake object, a task. The task refers to the stream and selects the metadata and document content of newly arrived documents. With a simple function call (PREDICT), the LLM, which may have been fine-tuned, answers the questions about the document content, and the extracted results are saved together with the confidence scores in a table. As usual in Snowflake, execution speed and frequency can be controlled via the virtual warehouse used for execution and the task schedule.
  • In order to use the results of the model in a meaningful way, the data is processed further in the final step. As PREDICT returns semi-structured data, it can be conveniently transformed using Snowflake's native functions for semi-structured data and materialized in a table. This step can also be automated with an additional task.
  • In addition, a Streamlit application can be integrated into the pipeline for manual validation in a graphical user interface.
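The following sketch outlines such a pipeline in SQL. All object names (doc_stage, doc_stream, doc_results, invoice_model, the warehouse doc_wh) and the question key invoice_total are assumptions for illustration; the model is invoked with the documented <model build>!PREDICT syntax, where the second argument is the model build version.

```sql
-- 1. Stage with server-side encryption and a directory table for metadata
CREATE OR REPLACE STAGE doc_stage
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- 2. Stream that records newly uploaded documents
CREATE OR REPLACE STREAM doc_stream ON STAGE doc_stage;

-- 3. Table for the raw extraction results
CREATE OR REPLACE TABLE doc_results (
  file_name VARCHAR,
  extracted VARIANT  -- semi-structured PREDICT output incl. confidence scores
);

-- 4. Task that runs the (possibly fine-tuned) model on newly arrived documents
CREATE OR REPLACE TASK extract_docs
  WAREHOUSE = doc_wh
  SCHEDULE  = '1 minute'
WHEN SYSTEM$STREAM_HAS_DATA('doc_stream')
AS
INSERT INTO doc_results
SELECT
  RELATIVE_PATH,
  invoice_model!PREDICT(GET_PRESIGNED_URL(@doc_stage, RELATIVE_PATH), 1)
FROM doc_stream
WHERE METADATA$ACTION = 'INSERT';

ALTER TASK extract_docs RESUME;

-- 5. Flatten the semi-structured output: each question returns value/score
--    pairs, e.g. {"invoice_total": [{"value": "1250.00", "score": 0.97}]}
CREATE OR REPLACE TABLE invoice_results AS
SELECT
  file_name,
  extracted:invoice_total[0]:value::VARCHAR AS invoice_total,
  extracted:invoice_total[0]:score::FLOAT   AS confidence
FROM doc_results;
```

In a production setup, step 5 would typically also run as a scheduled task, as described in the list above.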

In just a few minutes, a production-ready pipeline can thus be set up that uses the trained model to transform unstructured data into structured data.

Figure 3: Pipeline for extracting data from documents with Document AI in Snowflake: Upload Document, Stream & Task, automatic storage of results in a Snowflake table, and manual validation.

Applications of Snowflake Document AI

Document AI is easy to use and has no domain-specific focus. The range of possible applications for the LLM-supported service is correspondingly broad.

In most companies, invoices have to be entered into systems and checked. Both steps can be automated and simplified with Document AI. Employees manage invoices in one place that is accessible via a Snowflake stage. Uploading a document then triggers the pipeline described above, which extracts the required information from the document. When checking the results, employees can use the confidence score to focus on a few borderline cases, as shown below. An interface for this can be easily implemented with Streamlit.
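Building on the hypothetical invoice_results table from the pipeline sketch above, such a review queue can be filled with a single query; the threshold of 0.9 is an arbitrary assumption:

```sql
-- Route only low-confidence extractions to manual review
SELECT file_name, invoice_total, confidence
FROM invoice_results
WHERE confidence < 0.9  -- assumed threshold for borderline cases
ORDER BY confidence;
```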

In addition to such general scenarios, very specific use cases are also conceivable. For example, the evaluation of clinical studies in the pharmaceutical industry could benefit from Snowflake Document AI by using it to extract facts from scientific articles or to digitize patient questionnaires. In the latter case in particular, machine evaluation can prevent transcription errors. In general, Snowflake Document AI is well suited to processing all kinds of handwritten questionnaires and forms.

Conclusion

These use cases show that Document AI offers great added value across domains, reflected not only in faster processes and lower costs, but also in improved quality.

If you are facing the challenge of transforming your documents into structured data in order to generate real added value for your company, we will be happy to support you. Contact us for a no-obligation initial meeting to discuss your use case.

Want To Learn More? Contact Us!

Your contact person:

Dr. Sebastian Petry
Domain Lead Data Science & AI
