Catalog
Catalog(
id_generator: AbstractIDGenerator,
metastore: AbstractMetastore,
warehouse: AbstractWarehouse,
cache_dir=None,
tmp_dir=None,
client_config: Optional[dict[str, Any]] = None,
warehouse_ready_callback: Optional[
Callable[[AbstractWarehouse], None]
] = None,
)
Source code in datachain/catalog/catalog.py
cleanup_temp_tables
Drop tables created temporarily when processing datasets. This should be implemented even if temporary tables are used, to ensure that they are cleaned up as soon as they are no longer needed. When running the same DatasetQuery multiple times, the same temporary table names may be reused.
Source code in datachain/catalog/catalog.py
clone
clone(
sources: list[str],
output: str,
force: bool = False,
update: bool = False,
recursive: bool = False,
no_glob: bool = False,
no_cp: bool = False,
edatachain: bool = False,
edatachain_file: Optional[str] = None,
ttl: int = TTL_INT,
*,
client_config=None
) -> None
Takes one or more cloud paths and copies the files and folders they contain into the dataset folder. It also adds those files to a dataset in the database, creating the dataset if it does not exist yet. Optionally, it creates a .edatachain file.
Source code in datachain/catalog/catalog.py
cp
cp(
sources: list[str],
output: str,
force: bool = False,
update: bool = False,
recursive: bool = False,
edatachain_file: Optional[str] = None,
edatachain_only: bool = False,
no_edatachain_file: bool = False,
no_glob: bool = False,
ttl: int = TTL_INT,
*,
client_config=None
) -> list[dict[str, Any]]
Copies files from cloud sources to a local destination directory. If a cloud source is not indexed, or its index has expired, indexing is run first. By default it also creates a .edatachain file, unless specified otherwise.
Source code in datachain/catalog/catalog.py
create_dataset
create_dataset(
name: str,
version: Optional[int] = None,
*,
columns: Sequence[Column],
feature_schema: Optional[dict] = None,
query_script: str = "",
create_rows: Optional[bool] = True,
validate_version: Optional[bool] = True,
listing: Optional[bool] = False
) -> DatasetRecord
Creates a new version of a dataset. If the dataset does not exist yet, it is created with version 1. If version is None, the next unused version number is used; if version is given, it must be an unused version number.
Source code in datachain/catalog/catalog.py
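The version-selection rule above can be sketched in plain Python. This is an illustration only, not the actual implementation, and it assumes that "next unused version" means one past the highest existing version:

```python
from typing import Optional

def pick_version(existing_versions: set[int], version: Optional[int] = None) -> int:
    """Illustrates the documented rule: no versions yet -> 1,
    version=None -> next unused, explicit version -> must be unused."""
    if version is None:
        # Next unused version: one past the highest existing version,
        # or 1 for a brand-new dataset.
        return max(existing_versions, default=0) + 1
    if version in existing_versions:
        raise ValueError(f"version {version} already exists")
    return version

print(pick_version(set()))      # → 1 (brand-new dataset)
print(pick_version({1, 2}))     # → 3 (next unused)
print(pick_version({1, 2}, 4))  # → 4 (explicit unused version)
```

Passing an already-used version (e.g. `pick_version({1, 2}, 2)`) raises a ValueError, matching the constraint that an explicit version must be unused.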
create_new_dataset_version
create_new_dataset_version(
dataset: DatasetRecord,
version: int,
*,
columns: Sequence[Column],
sources="",
feature_schema=None,
query_script="",
error_message="",
error_stack="",
script_output="",
create_rows_table=True,
job_id: Optional[str] = None,
is_job_result: bool = False
) -> DatasetRecord
Creates a dataset version if it does not exist. If create_rows_table is False, the dataset rows table will not be created.
Source code in datachain/catalog/catalog.py
dataset_stats
Returns a tuple with dataset stats: the total number of rows and the total dataset size.
Source code in datachain/catalog/catalog.py
get_client
get_client(uri: StorageURI, **config: Any) -> Client
Return the client corresponding to the given source uri.
Source code in datachain/catalog/catalog.py
get_file_signals
Returns file signals from a dataset row. Signal names are returned without their prefix, so a column such as 'laion__file__source' in the original row appears simply as 'source'. Example output:
{
    "source": "s3://ldb-public",
    "parent": "animals/dogs",
    "name": "dog.jpg",
    ...
}
Source code in datachain/catalog/catalog.py
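The prefix stripping described above can be illustrated with a standalone sketch (not the actual implementation; the row and the helper name below are made up for illustration):

```python
def strip_signal_prefix(row: dict, prefix: str = "laion__file__") -> dict:
    """Keep only the columns carrying the given file-signal prefix,
    returning them under their short names (e.g. 'source', 'name')."""
    return {
        key[len(prefix):]: value
        for key, value in row.items()
        if key.startswith(prefix)
    }

row = {
    "laion__file__source": "s3://ldb-public",
    "laion__file__parent": "animals/dogs",
    "laion__file__name": "dog.jpg",
    "unrelated_column": 42,
}
print(strip_signal_prefix(row))
# → {'source': 's3://ldb-public', 'parent': 'animals/dogs', 'name': 'dog.jpg'}
```

Columns without the prefix (like unrelated_column above) are simply dropped from the result.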
merge_datasets
merge_datasets(
src: DatasetRecord,
dst: DatasetRecord,
src_version: int,
dst_version: Optional[int] = None,
) -> DatasetRecord
Merges records from the source dataset into the destination dataset. It creates a new destination version containing the records from the previous version merged with the source, unless an existing destination version is specified; in that case the version must be in a non-final status, since dataset versions are immutable once finalized.
Source code in datachain/catalog/catalog.py
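Conceptually, the merge builds the new destination version from the destination's previous records plus the source's records. The sketch below is a simplified illustration only: the key-based union and the source-wins overlap rule are assumptions, and real merging also handles schemas and row data:

```python
def merge_records(dst_records: dict, src_records: dict) -> dict:
    """Build the record set for a new destination version:
    start from the destination's previous version, then add
    the records from the source version."""
    merged = dict(dst_records)   # records already in the destination
    merged.update(src_records)   # add source records (source wins on overlap here)
    return merged

dst_v1 = {"a.jpg": {"size": 100}, "b.jpg": {"size": 200}}
src_v3 = {"b.jpg": {"size": 250}, "c.jpg": {"size": 300}}
print(sorted(merge_records(dst_v1, src_v3)))
# → ['a.jpg', 'b.jpg', 'c.jpg']
```

The result becomes a new version rather than mutating an existing one, consistent with the immutability note above.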
query
query(
query_script: str,
envs: Optional[Mapping[str, str]] = None,
python_executable: Optional[str] = None,
save: bool = False,
save_as: Optional[str] = None,
preview_limit: int = 10,
preview_offset: int = 0,
preview_columns: Optional[list[str]] = None,
capture_output: bool = True,
output_hook: Callable[[str], None] = noop,
params: Optional[dict[str, str]] = None,
job_id: Optional[str] = None,
) -> QueryResult
Runs a custom user Python script that executes a query and creates a new dataset from the query results. Returns a tuple of the result dataset and the script output.
Constraints on the query script:
- datachain.query.DatasetQuery should be used to create the query for a dataset
- There should not be any .save() call on the DatasetQuery, since the script is expected to produce exactly one dataset
- The last statement must be an instance of DatasetQuery
If save is set to True, a new dataset is created from the query results. If it is set to False, the results are only printed and nothing is saved.
Example of a query script:
from datachain.query import DatasetQuery, C

DatasetQuery('s3://ldb-public/remote/datasets/mnist-tiny/').filter(
    C.size > 1000
)
Source code in datachain/catalog/catalog.py
register_dataset
register_dataset(
dataset: DatasetRecord,
version: int,
target_dataset: DatasetRecord,
target_version: Optional[int] = None,
) -> DatasetRecord
Registers a version of one dataset as a version of another dataset (possibly as a new version of an existing one). It also removes the original dataset version.
Source code in datachain/catalog/catalog.py
remove_dataset_version
remove_dataset_version(
dataset: DatasetRecord,
version: int,
drop_rows: Optional[bool] = True,
) -> None
Deletes a single dataset version. If it was the last version, the dataset is removed completely.
Source code in datachain/catalog/catalog.py
storage_stats
storage_stats(uri: StorageURI) -> Optional[DatasetStats]
Returns a tuple with storage stats: the total number of rows and the total dataset size.
Source code in datachain/catalog/catalog.py
update_dataset
Updates dataset fields.