Huggingface Plugins

Functions and classes built on the Huggingface datasets library

Dataset Executor

DatasetExecutor is an Executor that uses the Huggingface datasets library as a backend for parallel processing.


source

DatasetExecutor

 DatasetExecutor (function:Callable, batched:bool, batch_size:int=1,
                  concurrency:Optional[int]=1,
                  map_kwargs:Optional[dict]=None)

DatasetExecutor - executes the wrapped function in parallel using Dataset.map

             Type                   Default  Details
function     typing.Callable                 function to be wrapped
batched      bool                            if inputs should be batched
batch_size   int                    1        batch size (set batch_size=0 to pass all inputs)
concurrency  typing.Optional[int]   1        number of concurrent threads
map_kwargs   typing.Optional[dict]  None     kwargs for Dataset.map
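
Under the hood, the executor's job amounts to converting the inputs to a Dataset and mapping the wrapped function over it. A minimal sketch of that flow, not the library's implementation: the helper name run_with_dataset_map is hypothetical, and num_proc (worker processes) stands in here for the plugin's thread-based concurrency parameter.

from datasets import Dataset

def run_with_dataset_map(fn, records, batched=False, batch_size=1, num_proc=None):
    # build a Dataset from the input records, map fn over it
    # (optionally batched / in parallel), and return plain dicts
    ds = Dataset.from_list(records)
    mapped = ds.map(fn, batched=batched, batch_size=batch_size, num_proc=num_proc)
    return mapped.to_list()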
import numpy as np
from pydantic import BaseModel

class TestInput(BaseModel):
    value: float

class TestOutput(BaseModel):
    result: bool

def test_function_hf(input: dict) -> dict:
    # unbatched: receives a single row dict, adds a 'result' field
    return {'result': input['value'] > 0.5}

def test_function_hf_batched(input: dict) -> dict:
    # batched: receives columns as lists, returns a list of results
    return {'result': [i > 0.5 for i in input['value']]}

np.random.seed(42)
values = np.random.uniform(size=100).tolist()
inputs = [TestInput(value=i) for i in values]
expected_outputs = [TestOutput(result=i > 0.5) for i in values]

# sequential, one row at a time
executor = DatasetExecutor(test_function_hf, batched=False, concurrency=None, batch_size=1)
res11 = executor(inputs)
assert [TestOutput.model_validate(i) for i in res11] == expected_outputs

# two concurrent threads, one row at a time
executor = DatasetExecutor(test_function_hf, batched=False, concurrency=2, batch_size=1)
res12 = executor(inputs)
assert [TestOutput.model_validate(i) for i in res12] == expected_outputs

# two concurrent threads, batches of 5
executor = DatasetExecutor(test_function_hf_batched, batched=True, concurrency=2, batch_size=5)
res13 = executor(inputs)
assert [TestOutput.model_validate(i) for i in res13] == expected_outputs

# sequential, batches of 5
executor = DatasetExecutor(test_function_hf_batched, batched=True, concurrency=None, batch_size=5)
res14 = executor(inputs)
assert [TestOutput.model_validate(i) for i in res14] == expected_outputs
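
The parameter table above notes that batch_size=0 passes all inputs in a single batch. A hedged variant exercising that path (res15 is an illustrative name, not from the original tests):

# one batch containing every input (batch_size=0)
executor = DatasetExecutor(test_function_hf_batched, batched=True, concurrency=None, batch_size=0)
res15 = executor(inputs)
assert [TestOutput.model_validate(i) for i in res15] == expected_outputs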
                                                                                

Data Plugin

The HugggingfaceDataPlugin uses a Huggingface Dataset with a faiss embedding index as a data source.


source

HugggingfaceDataPlugin

 HugggingfaceDataPlugin (k:int, dataset:datasets.arrow_dataset.Dataset,
                         index_name:str, item_key:Optional[str]=None,
                         id_key:Optional[str]=None,
                         distance_cutoff:Optional[float]=None)

HugggingfaceDataPlugin - data plugin for working with the Huggingface datasets library.

The input dataset should have a faiss embedding index denoted by index_name.

The data query runs k nearest neighbors against the dataset index, using the metric the index was created with.

Optionally, item_key denotes the column in dataset defining a specific item, and id_key denotes the column defining an item's ID.

If distance_cutoff is specified, query results with a distance greater than distance_cutoff are ignored.

                 Type                    Default  Details
k                int                              k nearest neighbors to return
dataset          Dataset                          input dataset
index_name       str                              name of the faiss index in dataset
item_key         typing.Optional[str]    None     dataset column denoting item value
id_key           typing.Optional[str]    None     dataset column denoting item id
distance_cutoff  typing.Optional[float]  None     query to result distance cutoff
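
A minimal sketch of the query flow described above, assuming the standard Dataset.get_nearest_examples API; the helper name knn_query and the cutoff filtering shown here are illustrative rather than the plugin's actual code.

def knn_query(dataset, index_name, query_vector, k, distance_cutoff=None):
    # k nearest neighbors against the dataset's faiss index; scores are
    # distances under the metric the index was built with
    scores, examples = dataset.get_nearest_examples(index_name, query_vector, k=k)
    if distance_cutoff is not None:
        # drop neighbors farther than the cutoff
        keep = [i for i, s in enumerate(scores) if s <= distance_cutoff]
        scores = [scores[i] for i in keep]
        examples = {col: [vals[i] for i in keep] for col, vals in examples.items()}
    return scores, examples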
from datasets import Dataset

n_vectors = 256
d_vectors = 64
k = 10
n_queries = 5

vectors = np.random.randn(n_vectors, d_vectors)

# rows with an id column ('index'), an item column ('item'), an extra
# unused column ('other'), and the embedding vector itself
vector_data = [{'index': str(np.random.randint(0, 1e6)),
                'other': np.random.randint(0, 1e3),
                'item': str(np.random.randint(0, 1e4)),
                'embedding': vectors[i]
               }
               for i in range(vectors.shape[0])]

dataset = Dataset.from_list(vector_data)
dataset.add_faiss_index('embedding')

data_function = HugggingfaceDataPlugin(k, dataset, 'embedding', 'item', 'index')
data_module = DataSourceModule(data_function)

batch = build_batch_from_embeddings(np.random.randn(n_queries, d_vectors))
batch2 = data_module(batch)

for q in batch2:
    for r in q:
        assert r.internal.parent_id == q.id
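
Since neighbors beyond distance_cutoff are dropped, a cutoff can only shrink each query's result set. A hedged variant (the batch3 name is illustrative, and 150.0 is an arbitrary threshold for these random vectors):

data_function = HugggingfaceDataPlugin(k, dataset, 'embedding', 'item', 'index',
                                       distance_cutoff=150.)
data_module = DataSourceModule(data_function)
batch3 = data_module(build_batch_from_embeddings(np.random.randn(n_queries, d_vectors)))

for q in batch3:
    # the cutoff can only remove neighbors, never add them
    assert len(list(q)) <= k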