Get a File in a Catalog
This page guides you through retrieving detailed information about a specific file stored in a Catalog using the API.
Get File's Single Source-of-Truth
This API lets you get the parsed Markdown text that has been extracted from the raw file and stored within a Catalog as the file's single source-of-truth.
export INSTILL_API_TOKEN=********
curl -X GET 'http://localhost:8080/v1alpha/namespaces/NAMESPACE_ID/catalogs/CATALOG_ID/files/FILE_UID/source' \
--header "Authorization: Bearer $INSTILL_API_TOKEN" \
--header "Content-Type: application/json"
from instill.clients import init_artifact_client
artifact = init_artifact_client(api_token="INSTILL_API_TOKEN", url="http://localhost:8080")
artifact.get_source_file(
namespace_id="NAMESPACE_ID", catalog_id="CATALOG_ID", file_uid="FILE_UID"
)
artifact.close()
Note that the NAMESPACE_ID, CATALOG_ID, and FILE_UID path parameters must be replaced by the Catalog owner's ID (namespace), the identifier of the Catalog whose file's source is to be retrieved, and the unique identifier (UID) of the processed file whose source is to be retrieved, respectively.
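For readers who prefer not to install the SDK, the curl call above can be approximated with the Python standard library. This is a minimal sketch, not part of the Instill SDK: the helper name build_source_request is hypothetical, and it only builds the request so that the URL and header can be inspected before anything is sent.

```python
import os
import urllib.request
from urllib.parse import quote


def build_source_request(base_url, namespace_id, catalog_id, file_uid, token):
    """Build (but do not send) the GET request for a file's source.

    Hypothetical helper for illustration; it mirrors the curl example.
    """
    # Percent-encode each path segment so unusual IDs cannot break the URL.
    path = "/v1alpha/namespaces/{}/catalogs/{}/files/{}/source".format(
        quote(namespace_id, safe=""),
        quote(catalog_id, safe=""),
        quote(file_uid, safe=""),
    )
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


request = build_source_request(
    "http://localhost:8080",
    "NAMESPACE_ID",
    "CATALOG_ID",
    "FILE_UID",
    os.environ.get("INSTILL_API_TOKEN", ""),
)
print(request.get_method(), request.full_url)
# To send it, pass `request` to urllib.request.urlopen and read the response.
```

Reading the token from the environment keeps it out of the source file, matching the export line in the curl example.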
Example Response
A successful response will return a JSON object containing the details of the single source-of-truth of the file:
{
"sourceFile": {
"originalFileUid": "file123",
"content": "file_content", // single source of truth (markdown text)
"createTime": "2024-07-01T12:00:00Z",
"updateTime": "2024-07-01T12:00:00Z"
}
}
Output Description
sourceFile: An object containing the details of the source file.
  originalFileUid (string): The unique identifier of the original file.
  content (string): The content of the source file (single source-of-truth), as Markdown text.
  createTime (string): The creation time of the source file.
  updateTime (string): The last update time of the source file.
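The fields described above can be read from the response with the standard json module. The sketch below uses the sample body from the example response (with the inline comment removed, since JSON does not allow comments) and assumes the timestamps always follow the UTC format shown there; parse_time is a hypothetical helper.

```python
import json
from datetime import datetime

# Sample body taken from the example response, without the inline comment.
body = """
{
  "sourceFile": {
    "originalFileUid": "file123",
    "content": "file_content",
    "createTime": "2024-07-01T12:00:00Z",
    "updateTime": "2024-07-01T12:00:00Z"
  }
}
"""


def parse_time(value):
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


source = json.loads(body)["sourceFile"]
created = parse_time(source["createTime"])
updated = parse_time(source["updateTime"])

print("original file:", source["originalFileUid"])
print("content length:", len(source["content"]))
print("modified since creation:", updated > created)
```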
Get File in a Catalog
This API provides metadata about a file, including its type, size, and processing status, as well as details on the content transformed through various pipelines. It also returns any chunks associated with the file, providing a comprehensive view of the file's data within the Catalog.
Example of Using fileUid
export INSTILL_API_TOKEN=********
curl --location 'http://localhost:8080/v1alpha/namespaces/NAMESPACE_ID/catalogs/CATALOG_ID?fileUid=9f1c8f09-52d6-4aca-8f61-58909d3adcde' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $INSTILL_API_TOKEN"
from instill.clients import init_artifact_client
artifact = init_artifact_client(api_token="INSTILL_API_TOKEN", url="http://localhost:8080")
artifact.get_file_catalog(
namespace_id="NAMESPACE_ID",
catalog_id="CATALOG_ID",
file_uid="9f1c8f09-52d6-4aca-8f61-58909d3adcde",
)
artifact.close()
Example of Using fileId
Note that the NAMESPACE_ID and CATALOG_ID path parameters must be replaced by the Catalog owner's ID (namespace) and the identifier of the Catalog you are querying. The fileUid and fileId fields identify the specific file within the Catalog that you want to retrieve information about.
export INSTILL_API_TOKEN=********
curl --location 'http://localhost:8080/v1alpha/namespaces/NAMESPACE_ID/catalogs/CATALOG_ID?fileId=test.pdf' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $INSTILL_API_TOKEN"
from instill.clients import init_artifact_client
artifact = init_artifact_client(api_token="INSTILL_API_TOKEN", url="http://localhost:8080")
artifact.get_file_catalog(
namespace_id="NAMESPACE_ID",
catalog_id="CATALOG_ID",
file_id="test.pdf",
)
artifact.close()
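Both examples hit the same path and differ only in the query parameter. The sketch below shows one way to build that URL with the Python standard library. The helper build_file_catalog_url is hypothetical, and requiring exactly one of the two identifiers is an assumption made here for clarity; the page does not say how the API behaves when both are supplied.

```python
from urllib.parse import quote, urlencode


def build_file_catalog_url(base_url, namespace_id, catalog_id,
                           file_uid=None, file_id=None):
    """Return the query URL for one file, selected by UID or by ID.

    Hypothetical helper; assumes exactly one identifier per request.
    """
    if (file_uid is None) == (file_id is None):
        raise ValueError("pass exactly one of file_uid or file_id")
    if file_uid is not None:
        query = {"fileUid": file_uid}
    else:
        query = {"fileId": file_id}
    path = "/v1alpha/namespaces/{}/catalogs/{}".format(
        quote(namespace_id, safe=""), quote(catalog_id, safe="")
    )
    # urlencode escapes characters that are not safe in a query string.
    return base_url.rstrip("/") + path + "?" + urlencode(query)


by_uid = build_file_catalog_url(
    "http://localhost:8080", "NAMESPACE_ID", "CATALOG_ID",
    file_uid="9f1c8f09-52d6-4aca-8f61-58909d3adcde",
)
by_id = build_file_catalog_url(
    "http://localhost:8080", "NAMESPACE_ID", "CATALOG_ID",
    file_id="test.pdf",
)
print(by_uid)
print(by_id)
```

Encoding the query through urlencode matters mostly for fileId, since file names can contain spaces or other reserved characters.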
Example Response
A successful response will return detailed metadata about the file, including the transformed content and any chunks associated with it:
{
"originalData": "< original file content in base64 format >",
"metadata": {
"fileUid": "example-file-uid",
"fileId": "example-file-id.pdf",
"fileType": "FILE_TYPE_PDF",
"fileSize": "12345",
"fileUploadTime": "2024-07-23T14:35:00Z",
"fileProcessStatus": "FILE_PROCESS_STATUS_COMPLETED"
},
"text": {
"pipelineIds": ["pipeline1", "pipeline2"],
"transformedContent": "Transformed content here...",
"transformedContentChunkNum": 10,
"transformedContentTokenNum": 1500,
"transformedContentUpdateTime": "2024-07-23T15:00:00Z"
},
"chunks": [
{
"uid": "chunk1-uid",
"type": "CHUNK_TYPE_TEXT",
"startPos": 0,
"endPos": 100,
"content": "This is a chunk of text.",
"tokensNum": 20,
"embedding": [0.1, 0.2, 0.3, ...],
"createTime": "2024-08-13T15:01:00Z",
"retrievable": true
},
{
"uid": "chunk2-uid",
"type": "CHUNK_TYPE_TEXT",
"startPos": 101,
"endPos": 200,
"content": "Another chunk of text.",
"tokensNum": 25,
"embedding": [0.4, 0.5, 0.6, ...],
"createTime": "2024-08-13T15:02:00Z",
"retrievable": true
}
]
}
Output Description
originalData: The original file data encoded in base64.
metadata: An object containing metadata about the file.
  fileUid (string): The unique identifier of the file.
  fileId (string): The file's ID.
  fileType (string): The type of the file, e.g., FILE_TYPE_TEXT, FILE_TYPE_PDF, FILE_TYPE_HTML, FILE_TYPE_PPTX, FILE_TYPE_DOCX.
  fileSize (string): The size of the file in bytes.
  fileUploadTime (string): The time when the file was uploaded.
  fileProcessStatus (string): The processing status of the file, which could be FILE_PROCESS_STATUS_COMPLETED, FILE_PROCESS_STATUS_FAILED, etc.
text: An object containing the transformed text content.
  pipelineIds (array): The IDs of the pipelines that processed the file.
  transformedContent (string): The content transformed through the pipelines.
  transformedContentChunkNum (integer): The number of chunks in the transformed content.
  transformedContentTokenNum (integer): The number of tokens in the transformed content.
  transformedContentUpdateTime (string): The last update time of the transformed content.
chunks: An array of objects, each representing a chunk of the file's content.
  uid (string): The unique identifier of the chunk.
  type (string): The type of the chunk, e.g., CHUNK_TYPE_TEXT.
  startPos (integer): The start position of the chunk in the file.
  endPos (integer): The end position of the chunk in the file.
  content (string): The content of the chunk.
  tokensNum (integer): The number of tokens in the chunk.
  embedding (array): The embedding vector of the chunk.
  createTime (string): The time when the chunk was created.
  retrievable (boolean): Whether the chunk is retrievable.
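As an illustration of how these fields fit together, the sketch below checks the processing status and then totals the tokens of the retrievable chunks. The response dictionary is a trimmed copy of the example response (originalData, content, and the elided embedding vectors are left out), and summarize_chunks is a hypothetical helper. Returning None for a file that has not completed processing is a design choice of this sketch, not documented API behavior.

```python
# Trimmed copy of the example response; only the fields used below are kept.
response = {
    "metadata": {
        "fileId": "example-file-id.pdf",
        "fileProcessStatus": "FILE_PROCESS_STATUS_COMPLETED",
    },
    "chunks": [
        {"uid": "chunk1-uid", "startPos": 0, "endPos": 100,
         "tokensNum": 20, "retrievable": True},
        {"uid": "chunk2-uid", "startPos": 101, "endPos": 200,
         "tokensNum": 25, "retrievable": True},
    ],
}


def summarize_chunks(resp):
    """Summarize retrievable chunks, or return None if processing is not done."""
    status = resp["metadata"]["fileProcessStatus"]
    if status != "FILE_PROCESS_STATUS_COMPLETED":
        return None
    retrievable = [c for c in resp.get("chunks", []) if c["retrievable"]]
    return {
        "file_id": resp["metadata"]["fileId"],
        "retrievable_chunks": len(retrievable),
        "retrievable_tokens": sum(c["tokensNum"] for c in retrievable),
        "chunk_uids": [c["uid"] for c in retrievable],
    }


summary = summarize_chunks(response)
print(summary)
```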
Together, these endpoints let you retrieve detailed information about a file stored within a Catalog and analyze its data at a granular level, from file metadata down to individual chunks.