Get a File in a Catalog

This page shows how to retrieve detailed information about a specific file stored in a Catalog using the API.

Get File's Single Source-of-Truth

This API lets you retrieve the parsed Markdown text that has been extracted from the raw file and stored within a Catalog as the file's single source-of-truth.

export INSTILL_API_TOKEN=********

curl -X GET 'http://localhost:8080/v1alpha/namespaces/NAMESPACE_ID/catalogs/CATALOG_ID/files/FILE_UID/source' \
--header "Authorization: Bearer $INSTILL_API_TOKEN" \
--header "Content-Type: application/json"

from instill.clients import init_artifact_client

artifact = init_artifact_client(api_token="INSTILL_API_TOKEN", url="http://localhost:8080")
artifact.get_source_file(
  namespace_id="NAMESPACE_ID", catalog_id="CATALOG_ID", file_uid="FILE_UID"
)
artifact.close()

Replace the NAMESPACE_ID, CATALOG_ID, and FILE_UID path parameters with the Catalog owner's ID (namespace), the identifier of the Catalog that contains the file, and the unique identifier (UID) of the processed file whose source you want to retrieve, respectively.
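The path-parameter substitution described above can be sketched with the Python standard library alone. The helper name `build_source_url` and its default base URL are illustrative assumptions, not part of the Instill SDK; only the endpoint path is taken from the cURL example.

```python
from urllib.parse import quote


# Hypothetical helper (not part of the Instill SDK): fills in the path
# parameters of the source endpoint shown in the cURL example.
def build_source_url(namespace_id, catalog_id, file_uid,
                     base_url="http://localhost:8080"):
    # Percent-encode every segment so characters such as "/" or spaces
    # cannot change the shape of the path.
    ns, cat, uid = (quote(part, safe="") for part in (namespace_id, catalog_id, file_uid))
    return f"{base_url}/v1alpha/namespaces/{ns}/catalogs/{cat}/files/{uid}/source"


# Placeholder values; substitute your own namespace, Catalog, and file UID.
print(build_source_url("my-namespace", "my-catalog",
                       "9f1c8f09-52d6-4aca-8f61-58909d3adcde"))
```

Encoding each segment separately keeps an unexpected character in an ID from silently pointing the request at a different resource.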

Example Response

A successful response will return a JSON object containing the details of the single source-of-truth of the file:

{
  "sourceFile": {
    "originalFileUid": "file123",
    "content": "file_content", // single source of truth (markdown text)
    "createTime": "2024-07-01T12:00:00Z",
    "updateTime": "2024-07-01T12:00:00Z"
  }
}

Output Description

sourceFile: An object containing the details of the source file.
- originalFileUid (string): The unique identifier of the original file.
- content (string): The parsed Markdown text of the source file (single source-of-truth).
- createTime (string): The creation time of the source file.
- updateTime (string): The last update time of the source file.
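As a minimal sketch of consuming this response, the snippet below parses the example payload and converts the timestamps into timezone-aware `datetime` objects. The function name `parse_source_file` is hypothetical, and the payload is the sample from this page with its inline comment removed, since JSON does not allow comments.

```python
import json
from datetime import datetime

# Sample payload copied from the example response (comment stripped).
raw = """
{
  "sourceFile": {
    "originalFileUid": "file123",
    "content": "file_content",
    "createTime": "2024-07-01T12:00:00Z",
    "updateTime": "2024-07-01T12:00:00Z"
  }
}
"""


def _to_datetime(timestamp):
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat also accepts it before Python 3.11.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


# Hypothetical helper: returns the fields listed in the output description.
def parse_source_file(payload):
    source = json.loads(payload)["sourceFile"]
    return {
        "original_file_uid": source["originalFileUid"],
        "content": source["content"],
        "create_time": _to_datetime(source["createTime"]),
        "update_time": _to_datetime(source["updateTime"]),
    }


parsed = parse_source_file(raw)
print(parsed["original_file_uid"], parsed["create_time"].isoformat())
```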

Get File in a Catalog

This API provides metadata about a file, including its type, size, and processing status, as well as details on the content transformed through various pipelines. It also returns any chunks associated with the file, providing a comprehensive view of the file's data within the Catalog.

Example of Using `fileUid`

export INSTILL_API_TOKEN=********

curl --location 'http://localhost:8080/v1alpha/namespaces/NAMESPACE_ID/catalogs/CATALOG_ID?fileUid=9f1c8f09-52d6-4aca-8f61-58909d3adcde' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $INSTILL_API_TOKEN"

from instill.clients import init_artifact_client
artifact = init_artifact_client(api_token="INSTILL_API_TOKEN", url="http://localhost:8080")

artifact.get_file_catalog(
    namespace_id="NAMESPACE_ID",
    catalog_id="CATALOG_ID",
    file_uid="9f1c8f09-52d6-4aca-8f61-58909d3adcde",
)
artifact.close()

Example of Using `fileId`

Replace the NAMESPACE_ID and CATALOG_ID path parameters with the Catalog owner's ID (namespace) and the identifier of the Catalog you are querying. Use either the fileUid or the fileId query parameter to identify the specific file within the Catalog that you want to retrieve information about.

export INSTILL_API_TOKEN=********

curl --location 'http://localhost:8080/v1alpha/namespaces/NAMESPACE_ID/catalogs/CATALOG_ID?fileId=test.pdf' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $INSTILL_API_TOKEN"

from instill.clients import init_artifact_client
artifact = init_artifact_client(api_token="INSTILL_API_TOKEN", url="http://localhost:8080")

artifact.get_file_catalog(
    namespace_id="NAMESPACE_ID",
    catalog_id="CATALOG_ID",
    file_id="test.pdf",
)
artifact.close()
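The two request styles above differ only in the query parameter they send. A rough sketch of building either URL with the standard library follows; the helper `build_file_catalog_url` is hypothetical, and its rule of accepting exactly one identifier is an assumption of this sketch rather than documented API behavior.

```python
from urllib.parse import quote, urlencode


# Hypothetical helper (not part of the Instill SDK).
def build_file_catalog_url(namespace_id, catalog_id, *, file_uid=None,
                           file_id=None, base_url="http://localhost:8080"):
    # Assumption of this sketch: callers pass exactly one identifier.
    if (file_uid is None) == (file_id is None):
        raise ValueError("pass exactly one of file_uid or file_id")
    query = {"fileUid": file_uid} if file_uid is not None else {"fileId": file_id}
    path = (
        f"/v1alpha/namespaces/{quote(namespace_id, safe='')}"
        f"/catalogs/{quote(catalog_id, safe='')}"
    )
    # urlencode escapes characters that are not safe in a query string.
    return f"{base_url}{path}?{urlencode(query)}"


print(build_file_catalog_url("NAMESPACE_ID", "CATALOG_ID", file_id="test.pdf"))
print(build_file_catalog_url("NAMESPACE_ID", "CATALOG_ID",
                             file_uid="9f1c8f09-52d6-4aca-8f61-58909d3adcde"))
```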

Example Response

A successful response will return detailed metadata about the file, including the transformed content and any chunks associated with it:

{
    "originalData": "< original file content in base64 format >",
    "metadata": {
      "fileUid": "example-file-uid",
      "fileId": "example-file-id.pdf",
      "fileType": "FILE_TYPE_PDF",
      "fileSize": "12345",
      "fileUploadTime": "2024-07-23T14:35:00Z",
      "fileProcessStatus": "FILE_PROCESS_STATUS_COMPLETED"
    },
    "text": {
      "pipelineIds": ["pipeline1", "pipeline2"],
      "transformedContent": "Transformed content here...",
      "transformedContentChunkNum": 10,
      "transformedContentTokenNum": 1500,
      "transformedContentUpdateTime": "2024-07-23T15:00:00Z"
    },
    "chunks": [
      {
        "uid": "chunk1-uid",
        "type": "CHUNK_TYPE_TEXT",
        "startPos": 0,
        "endPos": 100,
        "content": "This is a chunk of text.",
        "tokensNum": 20,
        "embedding": [0.1, 0.2, 0.3, ...],
        "createTime": "2024-08-13T15:01:00Z",
        "retrievable": true
      },
      {
        "uid": "chunk2-uid",
        "type": "CHUNK_TYPE_TEXT",
        "startPos": 101,
        "endPos": 200,
        "content": "Another chunk of text.",
        "tokensNum": 25,
        "embedding": [0.4, 0.5, 0.6, ...],
        "createTime": "2024-08-13T15:02:00Z",
        "retrievable": true
      }
    ]
}

Output Description

originalData: The original file data encoded in base64.
metadata: An object containing metadata about the file.
- fileUid (string): The unique identifier of the file.
- fileId (string): The file's ID.
- fileType (string): The type of the file, e.g., FILE_TYPE_TEXT, FILE_TYPE_PDF, FILE_TYPE_HTML, FILE_TYPE_PPTX, FILE_TYPE_DOCX.
- fileSize (string): The size of the file in bytes.
- fileUploadTime (string): The time when the file was uploaded.
- fileProcessStatus (string): The processing status of the file, which could be FILE_PROCESS_STATUS_COMPLETED, FILE_PROCESS_STATUS_FAILED, etc.
text: An object containing the transformed text content.
- pipelineIds (array): The IDs of the pipelines that processed the file.
- transformedContent (string): The content transformed through the pipelines.
- transformedContentChunkNum (integer): The number of chunks in the transformed content.
- transformedContentTokenNum (integer): The number of tokens in the transformed content.
- transformedContentUpdateTime (string): The last update time of the transformed content.
chunks: An array of objects, each representing a chunk of the file's content.
- uid (string): The unique identifier of the chunk.
- type (string): The type of the chunk, e.g., CHUNK_TYPE_TEXT.
- startPos (integer): The start position of the chunk in the file.
- endPos (integer): The end position of the chunk in the file.
- content (string): The content of the chunk.
- tokensNum (integer): The number of tokens in the chunk.
- embedding (array): The embedding vector of the chunk.
- createTime (string): The time when the chunk was created.
- retrievable (boolean): Whether the chunk is retrievable.
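To illustrate how the fields described above might be consumed, here is a small sketch that summarizes a response that has already been parsed into a Python dictionary. The function `summarize_file` and the trimmed payload are hypothetical; the payload follows the documented field layout, but its values are made up for the example, with one chunk marked as not retrievable to show the filter.

```python
import base64

# Trimmed, made-up payload that follows the documented field layout.
response = {
    "originalData": base64.b64encode(b"hello catalog").decode("ascii"),
    "metadata": {
        "fileId": "example-file-id.pdf",
        "fileSize": "12345",
        "fileProcessStatus": "FILE_PROCESS_STATUS_COMPLETED",
    },
    "chunks": [
        {"uid": "chunk1-uid", "tokensNum": 20, "retrievable": True},
        {"uid": "chunk2-uid", "tokensNum": 25, "retrievable": False},
    ],
}


# Hypothetical helper: condenses the response into a few derived values.
def summarize_file(resp):
    meta = resp["metadata"]
    retrievable = [c for c in resp.get("chunks", []) if c.get("retrievable")]
    return {
        "file_id": meta["fileId"],
        # fileSize is documented as a string, so convert it explicitly.
        "size_bytes": int(meta["fileSize"]),
        "completed": meta["fileProcessStatus"] == "FILE_PROCESS_STATUS_COMPLETED",
        # originalData is documented as base64, so decode it to raw bytes.
        "original_bytes": base64.b64decode(resp["originalData"]),
        "retrievable_uids": [c["uid"] for c in retrievable],
        "retrievable_tokens": sum(c["tokensNum"] for c in retrievable),
    }


print(summarize_file(response))
```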

Use this API to inspect a file stored within a Catalog at a granular level, from its metadata and transformed text down to its individual chunks.
