Introduction
Artifact makes unstructured data ETL effortless. It transforms documents (e.g., HTML, PDF, CSV, PPTX, DOC), images (e.g., JPG, PNG, TIFF), audio (e.g., WAV, MP3), and video (e.g., MP4, MOV) into a Catalog, a unified AI-ready format. A Catalog serves as a knowledge base in which your data has been processed, curated, and prepared for your AI and Retrieval-Augmented Generation (RAG) needs.
Artifact can help with:
- Getting AI and RAG ready: Upload and process files into high-quality AI-ready data Catalogs.
- Simplicity: Operate via low-latency API calls, or at the click of a button with Console.
- Integration: Seamlessly integrate with Pipeline and Model to provide a complete, full-stack AI solution.
- Transparency: Manage and view your data at the Catalog, file, or chunk levels.
- Data Integrity: Ensure AI application reliability with a Markdown-based source of truth.
- Scalability: Automate unstructured data transformation and handle growing data volumes efficiently.
- Versatility: Support a wide range of unstructured, semi-structured, and structured file types and data sources.
How it Works
When you upload files, Artifact parses them into high-quality structured Markdown text, which is then chunked and embedded using preset Pipelines.
A Catalog stores the chunk embeddings in a vector database for efficient search and retrieval. The processed chunks and original files are also stored within the Catalog, which defines a standardised AI-ready JSON object. This means that all data, encompassing unstructured, semi-structured, and structured data types, can be effortlessly ingested by Pipeline to solve a wide array of downstream AI and data tasks.
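The parse, chunk, embed, and retrieve flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Artifact's API: the function names (`chunk_markdown`, `embed`, `build_catalog`, `search`), the catalog dictionary layout, and the bag-of-words "embedding" are all assumptions made for the example. A real Catalog would use an embedding model and a vector database in place of the toy versions here.

```python
import math
import re
from collections import Counter


def chunk_markdown(markdown: str, max_chars: int = 200) -> list[str]:
    """Split Markdown on blank lines and pack paragraphs into chunks of up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_catalog(files: dict[str, str], max_chars: int = 200) -> dict:
    """Hypothetical catalog: keeps the source Markdown plus chunks and their embeddings."""
    catalog = {"files": files, "chunks": []}
    for name, markdown in files.items():
        for index, text in enumerate(chunk_markdown(markdown, max_chars)):
            catalog["chunks"].append(
                {"file": name, "index": index, "text": text, "embedding": embed(text)}
            )
    return catalog


def search(catalog: dict, query: str, top_k: int = 1) -> list[dict]:
    """Return the top_k chunks ranked by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(
        catalog["chunks"],
        key=lambda chunk: cosine(query_vec, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    demo = build_catalog(
        {"guide.md": "# Setup\n\nInstall the package with pip.\n\n"
                     "# Usage\n\nUpload files to a catalog and query the chunks."},
        max_chars=60,
    )
    for hit in search(demo, "query the chunks"):
        print(hit["file"], hit["index"], hit["text"])
```

Keeping the original text alongside each chunk and its embedding mirrors the idea that the data can be inspected at the Catalog, file, or chunk level.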
Importantly, Artifact also allows you to view and inspect the data in your Catalogs at different levels of granularity. To obtain the parsed Markdown text and the processed chunks, please see the Get Single Source-of-Truth and View Chunks documentation.
Supported File Types
The current Catalog supports the following file types: .md, .txt, .pdf, .html, .ppt, .pptx, .doc, .docx, .xls, .xlsx, and .csv. If you need to see more details, please refer to the Supported File Types page.
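As a small client-side convenience, you could check file extensions before uploading. The sketch below is an assumption-laden example rather than part of Artifact: `SUPPORTED_EXTENSIONS` is copied from the list on this page and may lag behind the Supported File Types page, and `is_supported` and `partition_uploads` are hypothetical helper names.

```python
from pathlib import Path

# Copied from the list on this page; the Supported File Types page is authoritative.
SUPPORTED_EXTENSIONS = {
    ".md", ".txt", ".pdf", ".html", ".ppt", ".pptx",
    ".doc", ".docx", ".xls", ".xlsx", ".csv",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension (case-insensitive) is in the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_uploads(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Split filenames into (supported, rejected), preserving the input order."""
    supported = [name for name in filenames if is_supported(name)]
    rejected = [name for name in filenames if not is_supported(name)]
    return supported, rejected


if __name__ == "__main__":
    ok, skipped = partition_uploads(["Report.PDF", "notes.md", "clip.mp4"])
    print("upload:", ok)
    print("skip:", skipped)
```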