prompttools.harness package#

Submodules#

prompttools.harness.chat_history_harness module#

class prompttools.harness.chat_history_harness.ChatHistoryExperimentationHarness(model_name, chat_histories, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for compare multiple chat histories.

Parameters:
  • model_name (str) – The name of the model.

  • chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the model.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

prepare()#

Initializes and prepares the experiment.

Return type:

None

run()#

Runs the underlying experiment.

prompttools.harness.chat_model_comparison_harness module#

class prompttools.harness.chat_model_comparison_harness.ChatModelComparisonHarness(model_names, chat_histories, runs=1, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for comparing chat models. Multi-model version of ChatHistoryExperimentationHarness.

Parameters:
  • model_names (List[str]) – The names of the models that you would like to compare

  • chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the models.

  • runs (int) – Number of runs to execute. Defaults to 1.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['model', 'messages']#
compare()#
prepare()#

Initializes and prepares the experiment.

Return type:

None

run()#

Runs the underlying experiment.

prompttools.harness.document_retrieval_harness module#

prompttools.harness.function_call_harness module#

prompttools.harness.harness module#

class prompttools.harness.harness.ExperimentationHarness#

Bases: object

Base class for experimentation harnesses. This should not be used directly, please use the subclasses instead.

PIVOT_COLUMNS: list#
evaluate(metric_name, eval_fn, static_eval_fn_kwargs={}, **eval_fn_kwargs)#

Uses the given eval_fn to evaluate the results of the underlying experiment.

Parameters:
  • metric_name (str) –

  • eval_fn (Callable) –

  • static_eval_fn_kwargs (dict) –

Return type:

None

experiment: Experiment#
property full_df#
classmethod load_experiment(experiment_id)#

experiment_id (str): experiment ID of the experiment that you wish to load.

Parameters:

experiment_id (str) –

classmethod load_revision(revision_id)#

revision_id (str): revision ID of the experiment that you wish to load.

Parameters:

revision_id (str) –

property partial_df#
prepare()#

Prepares the underlying experiment.

Return type:

None

rank(metric_name, is_average=False)#

Scores and ranks the experiment inputs using the pivot columns, e.g. prompt templates or system prompts.

Parameters:
  • metric_name (str) –

  • is_average (bool) –

Return type:

dict[str, float]

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:

clear_previous_results (bool) –

Return type:

None

save_experiment(name=None)#
name (str, optional): Name of the experiment. This is optional if you have previously loaded an experiment

into this object.

Parameters:

name (Optional[str]) –

property score_df#
visualize(pivot=False)#

Displays a visualization of the experiment results.

Parameters:

pivot (bool) –

Return type:

None

prompttools.harness.multi_experiment_harness module#

class prompttools.harness.multi_experiment_harness.MultiExperimentHarness(experiments)#

Bases: object

This is designed to run experiments across multiple model providers. The underlying APIs for different models (e.g. LlamaCpp and OpenAI) are different, this provides a way to manage that complexity. This will run experiments for different providers, and combine the results into a single table.

The notebook “examples/notebooks/GPT4vsLlama2.ipynb” provides a good example how this can used to test prompts across different models.

Parameters:

experiments (list[Experiment]) – The list of experiments that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)

evaluate(metric_name, eval_fn)#
Parameters:
Return type:

None

gather_feedback()#
Return type:

None

prepare()#
rank(metric_name, is_average=False)#
Parameters:
  • metric_name (str) –

  • is_average (bool) –

Return type:

Dict[str, float]

run()#
visualize(colname=None)#
Parameters:

colname (str) –

Return type:

None

prompttools.harness.prompt_template_harness module#

class prompttools.harness.prompt_template_harness.PromptTemplateExperimentationHarness(experiment, model_name, prompt_templates, user_inputs, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various prompt templates. We use jinja templates, e.g. “Answer the following question: {{input}}”.

Parameters:
  • experiment (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)

  • model_name (str) – The name of the model.

  • prompt_templates (List[str]) – A list of prompt jinja-styled templates.

  • user_inputs (List[Dict[str, str]]) – A list of dictionaries representing user inputs.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['prompt_template', 'user_input']#
prepare()#

Creates prompts from templates to use for the experiment, and then initializes and prepares the experiment.

Return type:

None

run()#

Runs the underlying experiment.

prompttools.harness.system_prompt_harness module#

class prompttools.harness.system_prompt_harness.SystemPromptExperimentationHarness(experiment, model_name, system_prompts, human_messages, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various system prompts.

Parameters:
  • experiment (Type[Experiment]) – The experiment that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)

  • model_name (str) – The name of the model.

  • system_prompts (List[str]) – A list of system prompts for the model

  • human_messages (List[str]) – A list of human (user) messages to pass into the model

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None. Note that the values are not lists.

PIVOT_COLUMNS: list = ['system_prompt', 'user_input']#
get_table(get_all_cols=False)#
Parameters:

get_all_cols (bool) –

Return type:

DataFrame

prepare()#

Creates messages to use for the experiment, and then initializes and prepares the experiment.

Return type:

None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:

clear_previous_results (bool) –

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:

get_all_cols (bool) –

Module contents#

class prompttools.harness.ChatHistoryExperimentationHarness(model_name, chat_histories, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for compare multiple chat histories.

Parameters:
  • model_name (str) – The name of the model.

  • chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the model.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

prepare()#

Initializes and prepares the experiment.

Return type:

None

run()#

Runs the underlying experiment.

class prompttools.harness.ChatModelComparisonHarness(model_names, chat_histories, runs=1, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used for comparing chat models. Multi-model version of ChatHistoryExperimentationHarness.

Parameters:
  • model_names (List[str]) – The names of the models that you would like to compare

  • chat_histories (List[List[Dict[str, str]]]) – A list of chat histories that will be fed into the models.

  • runs (int) – Number of runs to execute. Defaults to 1.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['model', 'messages']#
compare()#
experiment: Experiment#
prepare()#

Initializes and prepares the experiment.

Return type:

None

run()#

Runs the underlying experiment.

class prompttools.harness.ChatPromptTemplateExperimentationHarness(experiment, model_name, message_templates, user_inputs, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various prompt templates for chat models. We use jinja templates, e.g. “Answer the following question: {{input}}”.

Parameters:
  • experiment (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)

  • model_name (str) – The name of the model.

  • message_templates (List[str]) – A list of prompt jinja-styled templates. Each template should have two messages inside (first system prompt and second a user message).

  • user_inputs (List[Dict[str, str]]) – A list of dictionaries representing user inputs.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None. Note that the values are not lists.

PIVOT_COLUMNS: list = ['prompt_template', 'user_input']#
get_table(get_all_cols=False)#
Parameters:

get_all_cols (bool) –

Return type:

DataFrame

prepare()#

Creates prompts from templates to use for the experiment, and then initializes and prepares the experiment.

Return type:

None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:

clear_previous_results (bool) –

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:

get_all_cols (bool) –

class prompttools.harness.ExperimentationHarness#

Bases: object

Base class for experimentation harnesses. This should not be used directly, please use the subclasses instead.

PIVOT_COLUMNS: list#
evaluate(metric_name, eval_fn, static_eval_fn_kwargs={}, **eval_fn_kwargs)#

Uses the given eval_fn to evaluate the results of the underlying experiment.

Parameters:
  • metric_name (str) –

  • eval_fn (Callable) –

  • static_eval_fn_kwargs (dict) –

Return type:

None

experiment: Experiment#
property full_df#
classmethod load_experiment(experiment_id)#

experiment_id (str): experiment ID of the experiment that you wish to load.

Parameters:

experiment_id (str) –

classmethod load_revision(revision_id)#

revision_id (str): revision ID of the experiment that you wish to load.

Parameters:

revision_id (str) –

property partial_df#
prepare()#

Prepares the underlying experiment.

Return type:

None

rank(metric_name, is_average=False)#

Scores and ranks the experiment inputs using the pivot columns, e.g. prompt templates or system prompts.

Parameters:
  • metric_name (str) –

  • is_average (bool) –

Return type:

dict[str, float]

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:

clear_previous_results (bool) –

Return type:

None

save_experiment(name=None)#
name (str, optional): Name of the experiment. This is optional if you have previously loaded an experiment

into this object.

Parameters:

name (Optional[str]) –

property score_df#
visualize(pivot=False)#

Displays a visualization of the experiment results.

Parameters:

pivot (bool) –

Return type:

None

class prompttools.harness.ModelComparisonHarness(model_names, system_prompts, user_messages, model_arguments=[], runs=1)#

Bases: ExperimentationHarness

An experimentation harness used for comparing models.

Parameters:
  • model_names (List[str]) – The names of the models that you would like to compare

  • system_prompts (List[str]) – A list of system messages, one for each model.

  • model_arguments (List[Optional[Dict]]) – A list of model arguments, one for each model.

  • user_messages (List[str]) –

  • runs (int) – Number of runs to execute. Defaults to 1.

PIVOT_COLUMNS: list = ['model', 'messages']#
evaluate(metric_name, eval_fn, static_eval_fn_kwargs={}, **eval_fn_kwargs)#

Uses the given eval_fn to evaluate the results of the underlying experiment.

Parameters:
  • metric_name (str) –

  • eval_fn (Callable) –

  • static_eval_fn_kwargs (dict) –

Return type:

None

property full_df#
get_table(get_all_cols=False)#
Parameters:

get_all_cols (bool) –

Return type:

DataFrame

property partial_df#
prepare()#

Initializes and prepares the experiment.

Return type:

None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:

clear_previous_results (bool) –

property score_df#
visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:

get_all_cols (bool) –

class prompttools.harness.MultiExperimentHarness(experiments)#

Bases: object

This is designed to run experiments across multiple model providers. The underlying APIs for different models (e.g. LlamaCpp and OpenAI) are different, this provides a way to manage that complexity. This will run experiments for different providers, and combine the results into a single table.

The notebook “examples/notebooks/GPT4vsLlama2.ipynb” provides a good example how this can used to test prompts across different models.

Parameters:

experiments (list[Experiment]) – The list of experiments that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)

evaluate(metric_name, eval_fn)#
Parameters:
Return type:

None

gather_feedback()#
Return type:

None

prepare()#
rank(metric_name, is_average=False)#
Parameters:
  • metric_name (str) –

  • is_average (bool) –

Return type:

Dict[str, float]

run()#
visualize(colname=None)#
Parameters:

colname (str) –

Return type:

None

class prompttools.harness.PromptTemplateExperimentationHarness(experiment, model_name, prompt_templates, user_inputs, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various prompt templates. We use jinja templates, e.g. “Answer the following question: {{input}}”.

Parameters:
  • experiment (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)

  • model_name (str) – The name of the model.

  • prompt_templates (List[str]) – A list of prompt jinja-styled templates.

  • user_inputs (List[Dict[str, str]]) – A list of dictionaries representing user inputs.

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None.

PIVOT_COLUMNS: list = ['prompt_template', 'user_input']#
experiment: Experiment#
prepare()#

Creates prompts from templates to use for the experiment, and then initializes and prepares the experiment.

Return type:

None

run()#

Runs the underlying experiment.

class prompttools.harness.RetrievalAugmentedGenerationExperimentationHarness(vector_db_experiment, llm_experiment_cls, llm_arguments, extract_document_fn, extract_query_metadata_fn, prompt_template='Given these documents:{{documents}}\n\n{{prompt}}\n')#

Bases: ExperimentationHarness

An experimentation harness used to test the Retrieval-Augmented Generation process, which involves a vector DB and a LLM at the same time.

Parameters:
  • vector_db_experiment (Experiment) – An initialized vector DB experiment.

  • llm_experiment_cls (Type[Experiment]) – The experiment constructor that you would like to execute within the harness (e.g. prompttools.experiment.OpenAICompletionExperiment)

  • llm_arguments (dict[str, list]) – Dictionary of arguments for the LLM.

  • extract_document_fn (Callable) – A function, when given a row of results from the vector DB experiment, extract the relevant documents (list[str]) that will be inserted into the template.

  • extract_query_metadata_fn (Callable) – A function, when given a row of results from the vector DB experiment, extract the relevant metadata and return a str that will be shown for visualization in the final result table

  • prompt_template (str) – A jinja-styled templates, where documents and prompt will be inserted.

run()#

Runs the underlying experiment.

Return type:

None

visualize()#

Displays a visualization of the experiment results.

Return type:

None

class prompttools.harness.SystemPromptExperimentationHarness(experiment, model_name, system_prompts, human_messages, model_arguments=None)#

Bases: ExperimentationHarness

An experimentation harness used to test various system prompts.

Parameters:
  • experiment (Type[Experiment]) – The experiment that you would like to execute (e.g. prompttools.experiment.OpenAICompletionExperiment)

  • model_name (str) – The name of the model.

  • system_prompts (List[str]) – A list of system prompts for the model

  • human_messages (List[str]) – A list of human (user) messages to pass into the model

  • model_arguments (Optional[Dict[str, object]], optional) – Additional arguments for the model. Defaults to None. Note that the values are not lists.

PIVOT_COLUMNS: list = ['system_prompt', 'user_input']#
experiment: Experiment#
get_table(get_all_cols=False)#
Parameters:

get_all_cols (bool) –

Return type:

DataFrame

prepare()#

Creates messages to use for the experiment, and then initializes and prepares the experiment.

Return type:

None

run(clear_previous_results=False)#

Runs the underlying experiment.

Parameters:

clear_previous_results (bool) –

visualize(get_all_cols=False)#

Displays a visualization of the experiment results.

Parameters:

get_all_cols (bool) –