prompttools.harness package#

Submodules#

prompttools.harness.chat_history_harness module#

class prompttools.harness.chat_history_harness.ChatHistoryExperimentationHarness(model_name, chat_histories, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for compare multiple chat histories.

Parameters:

model_name (str) – The name of the model.
chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the model.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

prepare()#

Initializes and prepares the experiment.

Return type:: None

run()#: Runs the underlying experiment.

prompttools.harness.chat_model_comparison_harness module#

class prompttools.harness.chat_model_comparison_harness.ChatModelComparisonHarness(model_names, chat_histories, runs=1, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for comparing chat models. Multi-model version of ChatHistoryExperimentationHarness.

Parameters:

model_names (List[str]) – The names of the models that you would like to compare
chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the models.
runs (int) – Number of runs to execute. Defaults to 1.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['model', 'messages']#

compare()#

prepare()#

Initializes and prepares the experiment.

Return type:: None

run()#: Runs the underlying experiment.

prompttools.harness.document_retrieval_harness module#

prompttools.harness.function_call_harness module#

prompttools.harness.harness module#

class prompttools.harness.harness.ExperimentationHarness#

Bases: object

Base class for experimentation harnesses. This should not be used directly, please use the subclasses instead.

PIVOT_COLUMNS: list#

evaluate(metric_name, eval_fn, static_eval_fn_kwargs={}, **eval_fn_kwargs)#

Uses the given eval_fn to evaluate the results of the underlying experiment.

Parameters:

metric_name (str) –
eval_fn (Callable) –
static_eval_fn_kwargs (dict) –

Return type:

None

experiment: Experiment#

property full_df#

classmethod load_experiment(experiment_id)#

experiment_id (str): experiment ID of the experiment that you wish to load.

Parameters:: experiment_id (str) –

classmethod load_revision(revision_id)#

revision_id (str): revision ID of the experiment that you wish to load.

Parameters:: revision_id (str) –

property partial_df#

prepare()#

Prepares the underlying experiment.

Return type:: None

rank(metric_name, is_average=False)#

Scores and ranks the experiment inputs using the pivot columns, e.g. prompt templates or system prompts.

Parameters:

metric_name (str) –
is_average (bool) –

Return type:

dict[str, float]

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:: clear_previous_results (bool) –
Return type:: None

save_experiment(name=None)#

name (str, optional): Name of the experiment. This is optional if you have previously loaded an experiment: into this object.

Parameters:: name (Optional[str]) –

property score_df#

visualize(pivot=False)#

Displays a visualization of the experiment results.

Parameters:: pivot (bool) –
Return type:: None

prompttools.harness.multi_experiment_harness module#

class prompttools.harness.multi_experiment_harness.MultiExperimentHarness(experiments)#

Bases: object

This is designed to run experiments across multiple model providers. The underlying APIs for different models (e.g. LlamaCpp and OpenAI) are different, this provides a way to manage that complexity. This will run experiments for different providers, and combine the results into a single table.

The notebook “examples/notebooks/GPT4vsLlama2.ipynb” provides a good example how this can used to test prompts across different models.

Parameters:: experiments (list[Experiment]) – The list of experiments that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)

evaluate(metric_name, eval_fn)#

Parameters:

metric_name (str) –
eval_fn (Callable) –

Return type:

None

gather_feedback()#

Return type:: None

prepare()#

rank(metric_name, is_average=False)#

Parameters:

metric_name (str) –
is_average (bool) –

Return type:

Dict[str, float]

run()#

visualize(colname=None)#

Parameters:: colname (str) –
Return type:: None

prompttools.harness.prompt_template_harness module#

class prompttools.harness.prompt_template_harness.PromptTemplateExperimentationHarness(experiment, model_name, prompt_templates, user_inputs, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various prompt templates. We use jinja templates, e.g. “Answer the following question: {{input}}”.

Parameters:

experiment (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)
model_name (str) – The name of the model.
prompt_templates (List[str]) – A list of prompt jinja-styled templates.
user_inputs (List[Dict[str, str]]) – A list of dictionaries representing user inputs.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['prompt_template', 'user_input']#

prepare()#

Creates prompts from templates to use for the experiment, and then initializes and prepares the experiment.

Return type:: None

run()#: Runs the underlying experiment.

prompttools.harness.system_prompt_harness module#

class prompttools.harness.system_prompt_harness.SystemPromptExperimentationHarness(experiment, model_name, system_prompts, human_messages, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various system prompts.

Parameters:

experiment (Type[Experiment]) – The experiment that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)
model_name (str) – The name of the model.
system_prompts (List[str]) – A list of system prompts for the model
human_messages (List[str]) – A list of human (user) messages to pass into the model
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None. Note that the values are not lists.

PIVOT_COLUMNS: list = ['system_prompt', 'user_input']#

get_table(get_all_cols=False)#

Parameters:: get_all_cols (bool) –
Return type:: DataFrame

prepare()#

Creates messages to use for the experiment, and then initializes and prepares the experiment.

Return type:: None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:: clear_previous_results (bool) –

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:: get_all_cols (bool) –

Module contents#

class prompttools.harness.ChatHistoryExperimentationHarness(model_name, chat_histories, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for compare multiple chat histories.

Parameters:

model_name (str) – The name of the model.
chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the model.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

prepare()#

Initializes and prepares the experiment.

Return type:: None

run()#: Runs the underlying experiment.

class prompttools.harness.ChatModelComparisonHarness(model_names, chat_histories, runs=1, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for comparing chat models. Multi-model version of ChatHistoryExperimentationHarness.

Parameters:

model_names (List[str]) – The names of the models that you would like to compare
chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the models.
runs (int) – Number of runs to execute. Defaults to 1.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['model', 'messages']#

compare()#

experiment: Experiment#

prepare()#

Initializes and prepares the experiment.

Return type:: None

run()#: Runs the underlying experiment.

class prompttools.harness.ChatPromptTemplateExperimentationHarness(experiment, model_name, message_templates, user_inputs, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various prompt templates for chat models. We use jinja templates, e.g. “Answer the following question: {{input}}”.

Parameters:

experiment (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)
model_name (str) – The name of the model.
message_templates (List[str]) – A list of prompt jinja-styled templates. Each template should have two messages inside (first system prompt and second a user message).
user_inputs (List[Dict[str, str]]) – A list of dictionaries representing user inputs.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None. Note that the values are not lists.

PIVOT_COLUMNS: list = ['prompt_template', 'user_input']#

get_table(get_all_cols=False)#

Parameters:: get_all_cols (bool) –
Return type:: DataFrame

prepare()#

Creates prompts from templates to use for the experiment, and then initializes and prepares the experiment.

Return type:: None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:: clear_previous_results (bool) –

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:: get_all_cols (bool) –

class prompttools.harness.ExperimentationHarness#

Bases: object

Base class for experimentation harnesses. This should not be used directly, please use the subclasses instead.

PIVOT_COLUMNS: list#

evaluate(metric_name, eval_fn, static_eval_fn_kwargs={}, **eval_fn_kwargs)#

Uses the given eval_fn to evaluate the results of the underlying experiment.

Parameters:

metric_name (str) –
eval_fn (Callable) –
static_eval_fn_kwargs (dict) –

Return type:

None

experiment: Experiment#

property full_df#

classmethod load_experiment(experiment_id)#

experiment_id (str): experiment ID of the experiment that you wish to load.

Parameters:: experiment_id (str) –

classmethod load_revision(revision_id)#

revision_id (str): revision ID of the experiment that you wish to load.

Parameters:: revision_id (str) –

property partial_df#

prepare()#

Prepares the underlying experiment.

Return type:: None

rank(metric_name, is_average=False)#

Scores and ranks the experiment inputs using the pivot columns, e.g. prompt templates or system prompts.

Parameters:

metric_name (str) –
is_average (bool) –

Return type:

dict[str, float]

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:: clear_previous_results (bool) –
Return type:: None

save_experiment(name=None)#

name (str, optional): Name of the experiment. This is optional if you have previously loaded an experiment: into this object.

Parameters:: name (Optional[str]) –

property score_df#

visualize(pivot=False)#

Displays a visualization of the experiment results.

Parameters:: pivot (bool) –
Return type:: None

class prompttools.harness.ModelComparisonHarness(model_names, system_prompts, user_messages, model_arguments=[], runs=1)#

Bases: ExperimentationHarness

An experimentation harness used for comparing models.

Parameters:

model_names (List[str]) – The names of the models that you would like to compare
system_prompts (List[str]) – A list of system messages, one for each model.
model_arguments (List[Optional[Dict]]) – A list of model arguments, one for each model.
user_messages (List[str]) –
runs (int) – Number of runs to execute. Defaults to 1.

PIVOT_COLUMNS: list = ['model', 'messages']#

evaluate(metric_name, eval_fn, static_eval_fn_kwargs={}, **eval_fn_kwargs)#

Uses the given eval_fn to evaluate the results of the underlying experiment.

Parameters:

metric_name (str) –
eval_fn (Callable) –
static_eval_fn_kwargs (dict) –

Return type:

None

property full_df#

get_table(get_all_cols=False)#

Parameters:: get_all_cols (bool) –
Return type:: DataFrame

property partial_df#

prepare()#

Initializes and prepares the experiment.

Return type:: None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:: clear_previous_results (bool) –

property score_df#

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:: get_all_cols (bool) –

class prompttools.harness.MultiExperimentHarness(experiments)#

Bases: object

The notebook “examples/notebooks/GPT4vsLlama2.ipynb” provides a good example how this can used to test prompts across different models.

Parameters:: experiments (list[Experiment]) – The list of experiments that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)

evaluate(metric_name, eval_fn)#

Parameters:

metric_name (str) –
eval_fn (Callable) –

Return type:

None

gather_feedback()#

Return type:: None

prepare()#

rank(metric_name, is_average=False)#

Parameters:

metric_name (str) –
is_average (bool) –

Return type:

Dict[str, float]

run()#

visualize(colname=None)#

Parameters:: colname (str) –
Return type:: None

class prompttools.harness.PromptTemplateExperimentationHarness(experiment, model_name, prompt_templates, user_inputs, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various prompt templates. We use jinja templates, e.g. “Answer the following question: {{input}}”.

Parameters:

experiment (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)
model_name (str) – The name of the model.
prompt_templates (List[str]) – A list of prompt jinja-styled templates.
user_inputs (List[Dict[str, str]]) – A list of dictionaries representing user inputs.
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['prompt_template', 'user_input']#

experiment: Experiment#

prepare()#

Creates prompts from templates to use for the experiment, and then initializes and prepares the experiment.

Return type:: None

run()#: Runs the underlying experiment.

class prompttools.harness.RetrievalAugmentedGenerationExperimentationHarness(vector_db_experiment, llm_experiment_cls, llm_arguments, extract_document_fn, extract_query_metadata_fn, prompt_template='Given these documents:{{documents}}\n\n{{prompt}}\n')#

Bases: ExperimentationHarness

An experimentation harness used to test the Retrieval-Augmented Generation process, which involves a vector DB and a LLM at the same time.

Parameters:

vector_db_experiment (Experiment) – An initialized vector DB experiment.
llm_experiment_cls (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)
llm_arguments (dict[str, list]) – Dictionary of arguments for the LLM.
extract_document_fn (Callable) – A function, when given a row of results from the vector DB experiment, extract the relevant documents (list[str]) that will be inserted into the template.
extract_query_metadata_fn (Callable) – A function, when given a row of results from the vector DB experiment, extract the relevant metadata and return a str that will be shown for visualization in the final result table
prompt_template (str) – A jinja-styled templates, where documents and prompt will be inserted.

run()#

Runs the underlying experiment.

Return type:: None

visualize()#

Displays a visualization of the experiment results.

Return type:: None

class prompttools.harness.SystemPromptExperimentationHarness(experiment, model_name, system_prompts, human_messages, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various system prompts.

Parameters:

experiment (Type[Experiment]) – The experiment that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)
model_name (str) – The name of the model.
system_prompts (List[str]) – A list of system prompts for the model
human_messages (List[str]) – A list of human (user) messages to pass into the model
model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None. Note that the values are not lists.

PIVOT_COLUMNS: list = ['system_prompt', 'user_input']#

experiment: Experiment#

get_table(get_all_cols=False)#

Parameters:: get_all_cols (bool) –
Return type:: DataFrame

prepare()#

Creates messages to use for the experiment, and then initializes and prepares the experiment.

Return type:: None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:: clear_previous_results (bool) –

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:: get_all_cols (bool) –