Schemas

Data Schemas

High Level Overview

emb_opt is designed to run hill climbing algorithms in embedding spaces. In practice, this means searching either an explicit vector database or the implicit embedding space of a generative model. We refer to either one as a DataSource. We use "continuous space" to refer to the embeddings themselves, and "discrete space" to refer to the discrete things those embeddings represent.

The DataSource is queried with a Query. The Query contains a query embedding and, optionally, an item (some discrete thing represented by the embedding). The DataSource uses the Query to return a list of Item objects. An Item represents a discrete thing returned by the DataSource.

The Item results are optionally sent to a Filter, which removes results based on some True/False criteria.

The Item results are then sent to a Score which assigns some numeric score value to each Item.

The Query and scored Item results are sent to an Update, which uses the scored items to generate a new Query. Update methods are either discrete or continuous. Continuous updates generate new queries purely in embedding space (for example by averaging Item embeddings). Discrete updates create new queries directly from Item results, so each new query can have a specific item associated with it, which is not possible with continuous updates. Continuous updates generally converge faster, but certain types of DataSource may require a discrete item query and are therefore incompatible with continuous updates.
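To make the distinction concrete, the following sketch contrasts the two update styles on plain Python lists. The helper names `continuous_update` and `discrete_update`, and the `(item, embedding, score)` tuple layout, are illustrative assumptions and not part of emb_opt's API.

```python
from typing import Any, List, Tuple

# Hypothetical scored result: (discrete item, embedding, score).
ScoredResult = Tuple[Any, List[float], float]

def continuous_update(results: List[ScoredResult]) -> List[float]:
    """Build a new query embedding by averaging the result embeddings.

    The new query lives purely in embedding space, so no discrete
    item can be attached to it.
    """
    dim = len(results[0][1])
    return [sum(emb[d] for _, emb, _ in results) / len(results) for d in range(dim)]

def discrete_update(results: List[ScoredResult], k: int = 1) -> List[Tuple[Any, List[float]]]:
    """Build new queries from the top-k scored results themselves.

    Each new query keeps the (item, embedding) pair of a real result.
    """
    ranked = sorted(results, key=lambda r: r[2], reverse=True)
    return [(item, emb) for item, emb, _ in ranked[:k]]

results = [
    ("item_a", [0.0, 1.0], 0.2),
    ("item_b", [1.0, 0.0], 0.9),
    ("item_c", [1.0, 1.0], 0.5),
]

new_embedding = continuous_update(results)   # a point in embedding space only
new_queries = discrete_update(results, k=2)  # queries tied to specific items
```

Here the continuous update yields a single embedding with no item attached, while the discrete update yields two queries that each carry the item they came from.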

Some Update methods generate multiple new queries. To control the total number of queries, a Prune step is optionally added before the Update step.

The general flow is:

1. Start with a Batch of Query objects
2. Query the DataSource
3. (optional) Send results to the Filter
4. Send results to the Score
5. (optional) Prune queries
6. Use scored results to Update to a new set of queries

The schemas presented here define the required input/output structure for each step, so that each step can be implemented as a flexible plugin to the process.
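The general flow can be sketched end to end with toy stand-ins. Everything in this example is a hypothetical illustration: the one-dimensional `DATABASE`, the functions `data_source`, `filter_fn`, `score_fn`, `prune` and `update`, and the use of plain lists and tuples in place of emb_opt's Batch, Query and Item schemas.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Scored = Tuple[str, Vector, float]  # (item, embedding, score)

# Toy "vector database": eleven items with 1-D embeddings 0.0 .. 10.0.
DATABASE: Dict[str, Vector] = {f"item_{i}": [float(i)] for i in range(11)}

def data_source(query: Vector, k: int = 5) -> List[Tuple[str, Vector]]:
    """Return the k items nearest to the query embedding."""
    ranked = sorted(DATABASE.items(), key=lambda kv: abs(kv[1][0] - query[0]))
    return ranked[:k]

def filter_fn(item: str) -> bool:
    """True/False criterion: reject one blocked item."""
    return item != "item_5"

def score_fn(embedding: Vector) -> float:
    """Higher is better; the score peaks at embedding 8.0."""
    return -abs(embedding[0] - 8.0)

def prune(scored_queries: List[List[Scored]], max_queries: int) -> List[List[Scored]]:
    """Keep the queries whose best result scores highest."""
    ranked = sorted(scored_queries, key=lambda s: max(r[2] for r in s), reverse=True)
    return ranked[:max_queries]

def update(scored: List[Scored], k: int = 2) -> List[Vector]:
    """Discrete-style update: the top-k results become the new queries."""
    top = sorted(scored, key=lambda r: r[2], reverse=True)[:k]
    return [emb for _, emb, _ in top]

batch: List[Vector] = [[1.0]]  # start with a batch holding one query
best: Scored = ("", [], float("-inf"))
for _ in range(6):
    scored_queries = []
    for query in batch:
        results = data_source(query)                              # query the data source
        results = [(i, e) for i, e in results if filter_fn(i)]    # filter
        scored = [(i, e, score_fn(e)) for i, e in results]        # score
        best = max([best, *scored], key=lambda r: r[2])           # track the best item seen
        scored_queries.append(scored)
    scored_queries = prune(scored_queries, max_queries=2)         # prune queries
    batch = [q for scored in scored_queries for q in update(scored)]  # update
```

Because each update emits two new queries per surviving query, the prune step is what keeps the batch from growing without bound.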

Data Objects

Internal Data

InternalData tracks internal information as part of the embedding search. This data is managed internally, but may be useful for certain Prune or Update configurations.

InternalData.removed denotes whether the related Item or Query has been removed or invalidated by some step (see DataSourceResponse, FilterResponse, ScoreResponse, PruneResponse).

InternalData.removal_reason details the removal reason.

InternalData.parent_id is the ID string of the parent Query of the related Item or Query object. InternalData.parent_id always points to a Query, never an Item.

InternalData.collection_id groups Item and Query objects that come from the same initial Query. This is useful when an Update step generates multiple new queries from a single input.

InternalData.iteration denotes which iteration of the search created the related Item or Query.



InteralData

 InteralData (removed:bool, removal_reason:Optional[str],
              parent_id:Optional[str], collection_id:Optional[int],
              iteration:Optional[int])

Internal Data Tracking

Item

The Item schema is the basic "object" or "thing" we are looking for. The goal of emb_opt is to discover an Item with a high score.

Item.id is the index/ID of the item (for example the database index). If no ID is provided, one will be created as a UUID. emb_opt assumes Item.id is unique to the item.

Item.item is the discrete thing itself.

Item.score is the score of the item. emb_opt assumes a hill climbing scenario where higher scores are better than lower scores.

Item.data is a dictionary container for any other information associated with the item (e.g. other fields returned from a database query).



Item

 Item (id:Union[int,str,NoneType], item:Optional[Any],
       embedding:List[float], score:Optional[float], data:Optional[dict],
       **extra_data:Any)

Usage docs: https://docs.pydantic.dev/2.7/concepts/models/

A base class for creating Pydantic models.

Attributes: class_vars: The names of classvars defined on the model. private_attributes: Metadata about the private attributes of the model. signature: The signature for instantiating the model.

__pydantic_complete__: Whether model building is completed, or if there are still undefined fields.
__pydantic_core_schema__: The pydantic-core schema used to build the SchemaValidator and SchemaSerializer.
__pydantic_custom_init__: Whether the model has a custom `__init__` function.
__pydantic_decorators__: Metadata containing the decorators defined on the model.
    This replaces `Model.__validators__` and `Model.__root_validators__` from Pydantic V1.
__pydantic_generic_metadata__: Metadata for generic models; contains data used for a similar purpose to
    __args__, __origin__, __parameters__ in typing-module generics. May eventually be replaced by these.
__pydantic_parent_namespace__: Parent namespace of the model, used for automatic rebuilding of models.
__pydantic_post_init__: The name of the post-init method for the model, if defined.
__pydantic_root_model__: Whether the model is a `RootModel`.
__pydantic_serializer__: The pydantic-core SchemaSerializer used to dump instances of the model.
__pydantic_validator__: The pydantic-core SchemaValidator used to validate instances of the model.

__pydantic_extra__: An instance attribute with the values of extra fields from validation when
    `model_config['extra'] == 'allow'`.
__pydantic_fields_set__: An instance attribute with the names of fields explicitly set.
__pydantic_private__: Instance attribute with the values of private attributes set on the model instance.
item = Item(id=None, embedding=[0.1], item=None, score=None, data=None)
assert item.id
old_id = item.id
item = Item.model_validate(item)
assert item.id == old_id

Query

A Query is the basic object for searching a DataSource and holding Item results returned by the search.

Query.item is an (optional) discrete item associated with the Query. This is populated automatically when the query is created from an Item via Query.from_item.

Query.embedding is the embedding associated with the Query

Query.data is a dictionary container for any other information associated with the query

Query.query_results is a list of Item objects returned from a query



Query

 Query (item:Optional[Any], embedding:List[float], data:Optional[dict],
        query_results:Optional[list[__main__.Item]], **extra_data:Any)


A Query holds Items, tracks parent/child relationships, and allows for convenient iteration

query = Query.from_minimal(embedding=[0.1])
query.update_internal(collection_id=0) # add collection ID

query_results = [
    Item.from_minimal(item='item1', embedding=[0.1]),
    Item.from_minimal(item='item2', embedding=[0.1]),
]

query.add_query_results(query_results)

# iteration over query results
assert len([i for i in query]) == 2

# propagation of query parent data

for query_result in query:
    assert query_result.internal.parent_id == query.id
    assert query_result.internal.collection_id == query.internal.collection_id

Items may be removed by various steps. Removed items are kept within the Query for logging purposes. Query.valid_results and Query.enumerate_query_results automatically skip removed items during iteration:

assert len(list(query.valid_results())) == 2

query.query_results[0].update_internal(removed=True) # set first result to removed

assert len(list(query.valid_results())) == 1

assert len(list(query.enumerate_query_results())) == 1
assert len(list(query.enumerate_query_results(skip_removed=False))) == 2

query.query_results[1].update_internal(removed=True) # set second result to removed
query.update_internal() # update query internal
assert query.internal.removed # query sets itself to removed when all query results are removed

Queries can be created from another Query or from an Item, with automatic data propagation between them:

# create query from item
item = Item.from_minimal(item='test_item', embedding=[0.1])
query = Query.from_item(item)
assert query.item == item.item

# create query from query
query = Query.from_minimal(embedding=[0.1])
new_query = Query.from_parent_query(embedding=[0.2], parent_query=query)
assert new_query.internal.parent_id == query.id

Batch

The Batch object holds a list of Query objects and provides convenience functions for iterating over queries and query results



Batch

 Batch (queries:List[__main__.Query])


A Batch allows us to iterate over the queries and items in the batch in several ways

def build_test_batch(n_queries, n_items):
    queries = []
    
    for i in range(n_queries):
        query = Query.from_minimal(item=f'query_{i}', embedding=[0.1])
        for j in range(n_items):
            item = Item.from_minimal(item=f'item_{j}', embedding=[0.1])
            query.add_query_results([item])
        queries.append(query)
    return Batch(queries=queries)

n_queries = 3
n_items = 4
batch = build_test_batch(n_queries, n_items)

assert len(list(batch.valid_queries())) == n_queries

idxs, results = batch.flatten_query_results()
assert len(results) == n_queries*n_items
assert batch.get_item(*idxs[0]) == batch[idxs[0][0]][idxs[0][1]]

Removed items and queries are accounted for during iteration:

batch = build_test_batch(n_queries, n_items)

batch[1].update_internal(removed=True) # invalidate query
batch[0][0].update_internal(removed=True) # invalidate item
batch[0][1].update_internal(removed=True) # invalidate item

assert len(list(batch.valid_queries())) == n_queries-1 # 1 query removed

idxs, results = batch.flatten_query_results(skip_removed=False) # return all results, including removed ones
assert len(results) == n_queries*n_items

# skips results where `removed=True`, and all results under a query with `removed=True`
idxs, results = batch.flatten_query_results(skip_removed=True)

# n_items removed from invalid query 1, 2 items invalidated
assert len(results) == n_queries*n_items - n_items - 2

Data Source

The DataSourceFunction schema defines the interface for data source queries. The function takes a list of MinimalQuery objects and returns a list of DataSourceResponse objects.



DataSourceResponse

 DataSourceResponse (valid:bool, data:Optional[Dict],
                     query_results:List[__main__.Item])
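
As a rough illustration of this interface, the sketch below implements a brute-force nearest-neighbour data source over a tiny in-memory corpus. The `Toy*` dataclasses are stand-ins that only mirror the field names shown in the signature above; they are not emb_opt's Pydantic models. The corpus, `toy_data_source`, and the convention of returning `valid=False` for a query the source cannot serve are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class ToyItem:
    item: Any
    embedding: List[float]

@dataclass
class ToyQuery:
    embedding: List[float]

@dataclass
class ToyDataSourceResponse:
    valid: bool
    data: Optional[Dict]
    query_results: List[ToyItem] = field(default_factory=list)

# Hypothetical in-memory corpus standing in for a vector database.
CORPUS = [
    ToyItem("apple", [1.0, 0.0]),
    ToyItem("banana", [0.9, 0.1]),
    ToyItem("cherry", [0.0, 1.0]),
]

def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_data_source(queries: List[ToyQuery], k: int = 2) -> List[ToyDataSourceResponse]:
    """Return one response per query, holding its k nearest corpus items."""
    responses = []
    for query in queries:
        if len(query.embedding) != 2:
            # Assumed convention: a query the source cannot serve is marked invalid.
            responses.append(ToyDataSourceResponse(valid=False, data={"error": "bad dimension"}))
            continue
        ranked = sorted(CORPUS, key=lambda it: squared_distance(it.embedding, query.embedding))
        responses.append(ToyDataSourceResponse(valid=True, data=None, query_results=ranked[:k]))
    return responses

responses = toy_data_source([ToyQuery([1.0, 0.0]), ToyQuery([0.5])])
```

The function returns exactly one response per input query, in order, so results can be matched back to the queries that produced them.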


Filter

The FilterFunction schema defines the interface for filtering result items. The function takes a list of Item objects and returns a list of FilterResponse objects.



FilterResponse

 FilterResponse (valid:bool, data:Optional[Dict])
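
A filter in this style might look like the sketch below, which keeps items whose string form is short enough. The `Toy*` dataclasses only mirror the field names in the signature above and are not emb_opt's models; `toy_length_filter` and its `max_len` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class ToyItem:
    item: Any
    embedding: List[float]

@dataclass
class ToyFilterResponse:
    valid: bool
    data: Optional[Dict]

def toy_length_filter(items: List[ToyItem], max_len: int = 6) -> List[ToyFilterResponse]:
    """Return one True/False verdict per item, in input order."""
    responses = []
    for entry in items:
        length = len(str(entry.item))
        # Record the measured length so the decision can be inspected later.
        responses.append(ToyFilterResponse(valid=length <= max_len, data={"length": length}))
    return responses

items = [ToyItem("apple", [0.1]), ToyItem("pineapple", [0.2])]
filter_responses = toy_length_filter(items)
```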


Score

The ScoreFunction schema defines the interface for scoring result items. The function takes a list of Item objects and returns a list of ScoreResponse objects.



ScoreResponse

 ScoreResponse (valid:bool, score:Optional[float], data:Optional[Dict])
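
The sketch below shows one possible scoring function: a dot product against a fixed target direction, so that higher scores are better. The `Toy*` dataclasses are stand-ins mirroring the field names above, and `TARGET`, `toy_dot_score`, and the choice to return `valid=False` with `score=None` when scoring fails are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class ToyItem:
    item: Any
    embedding: List[float]

@dataclass
class ToyScoreResponse:
    valid: bool
    score: Optional[float]
    data: Optional[Dict]

# Hypothetical direction in embedding space that we want items to align with.
TARGET = [1.0, 0.0]

def toy_dot_score(items: List[ToyItem]) -> List[ToyScoreResponse]:
    """Return one score per item, in input order."""
    responses = []
    for entry in items:
        if len(entry.embedding) != len(TARGET):
            # Assumed convention: an item that cannot be scored is marked invalid.
            responses.append(ToyScoreResponse(valid=False, score=None, data={"error": "bad dimension"}))
            continue
        score = sum(x * t for x, t in zip(entry.embedding, TARGET))
        responses.append(ToyScoreResponse(valid=True, score=score, data=None))
    return responses

score_responses = toy_dot_score([
    ToyItem("banana", [0.9, 0.1]),
    ToyItem("cherry", [0.0, 1.0]),
    ToyItem("broken", [0.5]),
])
```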


Prune

The PruneFunction schema defines the interface for pruning queries. The function takes a list of Query objects and returns a list of PruneResponse objects.



PruneResponse

 PruneResponse (valid:bool, data:Optional[Dict])
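
One simple pruning strategy is to keep only the k queries with the highest mean result score. The sketch below assumes that approach. `ToyQuery` is a simplified stand-in that stores only the scores of a query's results, and `toy_top_k_prune` is a hypothetical name, not a function from emb_opt.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ToyQuery:
    scores: List[float]  # scores of this query's results

@dataclass
class ToyPruneResponse:
    valid: bool
    data: Optional[Dict]

def toy_top_k_prune(queries: List[ToyQuery], k: int = 2) -> List[ToyPruneResponse]:
    """Mark the k queries with the highest mean score as valid."""
    means = [
        sum(q.scores) / len(q.scores) if q.scores else float("-inf")
        for q in queries
    ]
    # Indices of the k best queries; ties keep their original order.
    keep = set(sorted(range(len(queries)), key=lambda i: means[i], reverse=True)[:k])
    return [
        ToyPruneResponse(valid=(i in keep), data={"mean_score": means[i]})
        for i in range(len(queries))
    ]

prune_responses = toy_top_k_prune([
    ToyQuery([0.1, 0.3]),
    ToyQuery([0.9]),
    ToyQuery([0.5, 0.5]),
])
```

Returning a verdict for every query, rather than a shortened list, keeps the responses aligned with the inputs so pruned queries can still be logged.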


Update

The UpdateFunction schema defines the interface for updating queries. The function takes a list of Query objects and returns a list of new Query objects.



UpdateResponse

 UpdateResponse (query:__main__.Query, parent_id:Optional[str])
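
As a sketch of a continuous update, the example below builds each new query embedding as a softmax-weighted average of the parent's scored results and records the parent's ID alongside it. This is an illustrative technique, not necessarily one that emb_opt ships. The `Toy*` dataclasses are stand-ins mirroring the field names above, and wrapping each new query in a response record is an assumption based on the UpdateResponse signature.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToyItem:
    embedding: List[float]
    score: float

@dataclass
class ToyQuery:
    id: str
    embedding: List[float]
    query_results: List[ToyItem] = field(default_factory=list)

@dataclass
class ToyUpdateResponse:
    query: ToyQuery
    parent_id: Optional[str]

def toy_softmax_update(queries: List[ToyQuery], temperature: float = 1.0) -> List[ToyUpdateResponse]:
    """Create one new query per parent from a score-weighted average."""
    responses = []
    for parent in queries:
        if not parent.query_results:
            continue  # assumed behaviour: nothing to update from
        top = max(it.score for it in parent.query_results)
        # Subtract the max score before exponentiating for numerical stability.
        weights = [math.exp((it.score - top) / temperature) for it in parent.query_results]
        total = sum(weights)
        dim = len(parent.embedding)
        new_embedding = [
            sum(w * it.embedding[d] for w, it in zip(weights, parent.query_results)) / total
            for d in range(dim)
        ]
        child = ToyQuery(id=f"{parent.id}/child", embedding=new_embedding)
        responses.append(ToyUpdateResponse(query=child, parent_id=parent.id))
    return responses

parent = ToyQuery("q0", [0.0, 0.0], [ToyItem([1.0, 0.0], 2.0), ToyItem([0.0, 1.0], 0.0)])
update_responses = toy_softmax_update([parent])
```

The new embedding is pulled toward the higher-scoring result, and lowering `temperature` makes that pull stronger.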
