prompttools.utils package#

Submodules#

prompttools.utils.autoeval module#

prompttools.utils.autoeval.autoeval_binary_scoring(row, prompt_column_name, response_column_name='response')#

Uses auto-evaluation to score the model response with “gpt-4” as the judge, returning 0.0 or 1.0.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
prompt_column_name (str) – name of the column that contains the input prompt
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
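
The row-based signature shared by the evaluation functions on this page can be tried without calling an OpenAI model. The following is a minimal sketch under stated assumptions, not the library's implementation: `fake_judge` and `binary_score_row` are hypothetical names, the judge is a trivial stand-in for GPT-4, and a plain dict stands in for a pandas Series.

```python
from typing import Callable, Mapping


def fake_judge(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for an LLM judge: accepts any non-empty response."""
    return bool(response.strip())


def binary_score_row(
    row: Mapping[str, str],
    prompt_column_name: str,
    response_column_name: str = "response",
    judge: Callable[[str, str], bool] = fake_judge,
) -> float:
    """Look up the prompt and response in the row, then map the verdict to 0.0 or 1.0."""
    prompt = row[prompt_column_name]
    response = row[response_column_name]
    return 1.0 if judge(prompt, response) else 0.0


# A dict is used here in place of a DataFrame row (assumption for illustration).
row = {"prompt": "What is 2 + 2?", "response": "4"}
score = binary_score_row(row, prompt_column_name="prompt")
empty_score = binary_score_row({"prompt": "Hi", "response": "   "}, "prompt")
```

Because the function only indexes the row by column name, the same shape should also work when it is applied to each row of a DataFrame.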

prompttools.utils.autoeval.compute(prompt, response, model='gpt-4')#

Uses a high-quality chat model, like GPT-4, to automatically evaluate a given prompt/response pair. The output is either 0 or 1.

Parameters:

prompt (str) – The input prompt.
response (str) – The model response.
model (str) – The OpenAI chat model to use as the judge. Defaults to GPT-4.

Return type:

float

prompttools.utils.autoeval.evaluate(prompt, response, _metadata)#

Uses auto-evaluation to score the model response with “gpt-4” as the judge, returning 0.0 or 1.0.

Parameters:

prompt (str) – The input prompt.
response (str) – The model response.
_metadata (Dict) – Not used.

Return type:

float

prompttools.utils.error module#

exception prompttools.utils.error.PromptToolsUtilityError#

Bases: Exception

An exception to raise when something goes wrong with the prompttools utility.

prompttools.utils.expected module#

prompttools.utils.expected.compute(prompt, model='gpt-4')#

Computes the expected result of a given prompt by using a high-quality LLM, like GPT-4.

Parameters:

prompt (str) – The input prompt.
model (str) – The OpenAI chat model to use for generating an expected response. Defaults to GPT-4.

Return type:

str

prompttools.utils.expected.compute_similarity_against_model(row, prompt_column_name, model='gpt-4', response_column_name='response')#

Computes the similarity of a given response to the expected result generated by a high-quality LLM (by default GPT-4) using the same prompt.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
prompt_column_name (str) – name of the column that contains the input prompt
model (str) – name of the model that generates the expected result
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

str

prompttools.utils.expected.evaluate(prompt, response, model='gpt-4')#

Computes the similarity of a given response to the expected result generated by a high-quality LLM (by default GPT-4) using the same prompt.

Parameters:

prompt (str) – The input prompt.
response (str) – The model response.
model (str) – The OpenAI chat model to use for generating an expected response. Defaults to GPT-4.

Return type:

str

prompttools.utils.json module#

prompttools.utils.python module#

prompttools.utils.similarity module#

This module uses a list to optionally hold a reference to the embedding model and client, which allows for lazy initialization.

prompttools.utils.similarity.compute(doc1, doc2, use_chroma=False)#

Computes the semantic similarity between two documents, using either ChromaDB or HuggingFace sentence_transformers.

Parameters:

doc1 (str) – The first document.
doc2 (str) – The second document.
use_chroma (bool) – Indicates whether or not to use Chroma. If False, uses HuggingFace sentence_transformers.
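
Embedding-based semantic similarity is commonly computed as the cosine similarity between two embedding vectors. Whether this function uses exactly that measure is an assumption; the sketch below only illustrates the comparison step, with toy vectors standing in for document embeddings and a hypothetical helper name `cosine_similarity`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors in place of real embeddings (assumption for illustration).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, regardless of their magnitudes.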

prompttools.utils.similarity.evaluate(prompt, response, metadata, expected)#

A simple test that checks semantic similarity between the expected response (provided by the user) and the model’s text responses.

Parameters:

prompt (str) – Not used.
response (str) – the response string that will be compared against
metadata (dict) – Not used.
expected (str) – the expected response

Return type:

float

prompttools.utils.similarity.semantic_similarity(row, expected, response_column_name='response')#

A simple test that checks semantic similarity between the expected response (provided by the user) and the model’s text responses.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the expected responses for each row in the column
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float

prompttools.utils.similarity.structural_similarity(row, expected, response_column_name='response')#

Compute the structural similarity index measure (SSIM) between two images.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the column name of the expected image responses in each row
response_column_name (str) – the column name that contains the model’s response, defaults to "response"

Return type:

float
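
For readers unfamiliar with SSIM, the sketch below computes the standard SSIM formula over a whole image at once. It is a simplification under stated assumptions: practical implementations usually average the index over local windows, the helper name `global_ssim` is hypothetical, images are flat lists of grayscale values, and the constants use the conventional K1 = 0.01 and K2 = 0.03.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two equal-size grayscale images (flat pixel lists)."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance of the pixel intensities.
    var_x = sum((p - mu_x) ** 2 for p in x) / n
    var_y = sum((q - mu_y) ** 2 for q in y) / n
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    # Stabilizing constants derived from the dynamic range of the pixel values.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


image = [0.0, 255.0, 0.0, 255.0]
inverted = [255.0, 0.0, 255.0, 0.0]
identical_score = global_ssim(image, image)
inverted_score = global_ssim(image, inverted)
```

Identical images score 1 (up to floating-point rounding), while an image compared with its inverse yields a negative value because the pixel intensities are anti-correlated.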

Module contents#

prompttools.utils.autoeval_binary_scoring(row, prompt_column_name, response_column_name='response')#

Uses auto-evaluation to score the model response with “gpt-4” as the judge, returning 0.0 or 1.0.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
prompt_column_name (str) – name of the column that contains the input prompt
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float

prompttools.utils.autoeval_from_expected_response(row, expected, prompt_column_name, response_column_name='response')#

Parameters:

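row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the expected response
prompt_column_name (str) – name of the column that contains the input prompt
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"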

prompttools.utils.autoeval_scoring(row, expected, response_column_name='response')#

Uses auto-evaluation to score the model response.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the expected response
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float

prompttools.utils.autoeval_with_documents(row, documents, response_column_name='response')#

Given a list of documents, score whether the model response is accurate with “gpt-4” as the judge, returning an integer score from 0 to 10.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
documents (list[str]) – documents to provide relevant context for the model to judge
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float

prompttools.utils.chunk_text(text, max_chunk_length)#

Splits a long string of text into chunks, each shorter than the given maximum chunk length, without breaking up individual words (words are separated by spaces).

Parameters:

text (str) – source text to be chunked
max_chunk_length (int) – maximum length of a chunk

Return type:

list[str]
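
Word-preserving chunking of this kind can be sketched with a greedy loop. This is not the library's implementation: `chunk_words` is a hypothetical name, the limit is treated as inclusive, and a single word longer than the limit is kept whole as its own chunk, both of which are assumptions.

```python
from typing import List


def chunk_words(text: str, max_chunk_length: int) -> List[str]:
    """Greedily pack space-separated words into chunks of at most max_chunk_length characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Accept the word if it fits, or if the chunk is empty (an oversized word stays whole).
        if len(candidate) <= max_chunk_length or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_words("the quick brown fox jumps", 10)
# chunks == ["the quick", "brown fox", "jumps"]
```

Joining the chunks with single spaces reproduces the whitespace-normalized input, which is a convenient property to check when chunking text for a downstream model.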

prompttools.utils.compute_similarity_against_model(row, prompt_column_name, model='gpt-4', response_column_name='response')#

Computes the similarity of a given response to the expected result generated by a high-quality LLM (by default GPT-4) using the same prompt.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
prompt_column_name (str) – name of the column that contains the input prompt
model (str) – name of the model that generates the expected result
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

str

prompttools.utils.ranking_correlation(row, expected_ranking, ranking_column_name='top doc ids')#

A simple test that compares the expected ranking for a given query with the actual ranking produced by the embedding function being tested.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected_ranking (list) – the expected ranking to compare against
ranking_column_name (str) – the column name of the actual ranking produced by the model, defaults to "top doc ids"

Return type:

float

Example

>>> from prompttools.utils import ranking_correlation
>>> # `experiment` is a prompttools experiment object created earlier.
>>> EXPECTED_RANKING_LIST = [
...     ["id1", "id3", "id2"],
...     ["id2", "id3", "id1"],
...     ["id1", "id3", "id2"],
...     ["id2", "id3", "id1"],
... ]
>>> experiment.evaluate("ranking_correlation", ranking_correlation, expected_ranking=EXPECTED_RANKING_LIST)
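
One common way to compare two rankings of the same items is Spearman's rank correlation. Whether `ranking_correlation` computes exactly this statistic is an assumption; the sketch below uses a hypothetical helper, `spearman_for_rankings`, to show what such a score looks like for lists of document ids.

```python
from typing import Sequence


def spearman_for_rankings(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Spearman's rho between two orderings of the same distinct ids."""
    if (
        len(expected) < 2
        or len(set(expected)) != len(expected)
        or sorted(expected) != sorted(actual)
    ):
        raise ValueError("rankings must be permutations of the same distinct ids (at least two)")
    # Position of each id in the actual ranking.
    position = {doc_id: rank for rank, doc_id in enumerate(actual)}
    n = len(expected)
    d_squared = sum((rank - position[doc_id]) ** 2 for rank, doc_id in enumerate(expected))
    return 1.0 - (6.0 * d_squared) / (n * (n * n - 1))


same = spearman_for_rankings(["id1", "id3", "id2"], ["id1", "id3", "id2"])
reversed_order = spearman_for_rankings(["id1", "id2", "id3"], ["id3", "id2", "id1"])
one_swap = spearman_for_rankings(["id1", "id3", "id2"], ["id1", "id2", "id3"])
```

The score is 1.0 for identical rankings, -1.0 for fully reversed rankings, and 0.5 for the three-item case with one adjacent swap.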

prompttools.utils.semantic_similarity(row, expected, response_column_name='response')#

A simple test that checks semantic similarity between the expected response (provided by the user) and the model’s text responses.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the expected responses for each row in the column
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float

prompttools.utils.validate_json_response(row, response_column_name='response')#

Validates whether the response string is valid JSON.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
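
A float-valued JSON check of this shape can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: `is_valid_json` is a hypothetical name, and the mapping of valid to 1.0 and invalid to 0.0 is assumed rather than documented above.

```python
import json


def is_valid_json(response: str) -> float:
    """Return 1.0 if the string parses as JSON, otherwise 0.0."""
    try:
        json.loads(response)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed text; TypeError: input is not a str/bytes value.
        return 0.0
    return 1.0


valid_score = is_valid_json('{"name": "prompttools", "ok": true}')
invalid_score = is_valid_json("{name: prompttools}")
```

Returning a float rather than a bool keeps the result easy to average across rows when it is used as a metric column.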

prompttools.utils.validate_python_response(row, response_column_name='response')#

Validates whether the response string follows Python’s syntax.

Parameters:

row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
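
One way to check Python syntax without executing the code is to parse it with the standard `ast` module. The library may use a different checker; `is_valid_python` below is a hypothetical helper, and the 1.0/0.0 mapping is an assumption.

```python
import ast


def is_valid_python(response: str) -> float:
    """Return 1.0 if the string parses as Python source, otherwise 0.0."""
    try:
        ast.parse(response)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes on some versions.
        return 0.0
    return 1.0


valid_score = is_valid_python("x = 1\nprint(x)")
invalid_score = is_valid_python("def f(:")
```

Parsing only checks syntax: code that references undefined names or fails at run time still passes this check.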