Evaluation and Validation#

These built-in functions help you evaluate the outputs of your experiments. They can also be used with prompttest as part of your CI/CD system.

prompttools.utils.autoeval_binary_scoring(row, prompt_column_name, response_column_name='response')#

Uses auto-evaluation to score the model response with “gpt-4” as the judge, returning 0.0 or 1.0.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • prompt_column_name (str) – name of the column that contains the input prompt

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
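
Example

A minimal sketch of a direct, single-row call based on the signature above; the row contents ("prompt", "response") and values are hypothetical, and the “gpt-4” judge must be reachable (e.g. an OpenAI API key is configured):

>>> import pandas as pd
>>> from prompttools.utils import autoeval_binary_scoring
>>> row = pd.Series({"prompt": "What is 2 + 2?", "response": "4"})
>>> score = autoeval_binary_scoring(row, prompt_column_name="prompt")  # 1.0 or 0.0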

prompttools.utils.autoeval_scoring(row, expected, response_column_name='response')#

Uses auto-evaluation to score the model response.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • expected (str) – the expected response

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
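
Example

A minimal sketch that mirrors the ranking_correlation example further below; experiment is assumed to be an already-run prompttools experiment, EXPECTED_ANSWERS is a placeholder list with one expected response per row, the metric name "quality_score" is arbitrary, and the auto-evaluation judge model must be reachable (the relevant API key set):

>>> from prompttools.utils import autoeval_scoring
>>> EXPECTED_ANSWERS = ["Paris", "Berlin"]
>>> experiment.evaluate("quality_score", autoeval_scoring, expected=EXPECTED_ANSWERS)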

prompttools.utils.autoeval_with_documents(row, documents, response_column_name='response')#

Given a list of documents, scores whether the model response is accurate with “gpt-4” as the judge, returning an integer score from 0 to 10.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • documents (list[str]) – documents to provide relevant context for the model to judge

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
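
Example

A minimal sketch of a direct, single-row call based on the signature above; the documents, row contents, and values are hypothetical, and the “gpt-4” judge must be reachable:

>>> import pandas as pd
>>> from prompttools.utils import autoeval_with_documents
>>> docs = ["Paris is the capital of France."]
>>> row = pd.Series({"prompt": "What is the capital of France?", "response": "Paris"})
>>> score = autoeval_with_documents(row, documents=docs)  # integer score from 0 to 10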

prompttools.utils.chunk_text(text, max_chunk_length)#

Given a long paragraph of text and a maximum chunk length, returns chunks of text where each chunk’s length is smaller than the max length, without breaking up individual words (words are separated by spaces).

Parameters:
  • text (str) – source text to be chunked

  • max_chunk_length (int) – maximum length of a chunk

Return type:

list[str]
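
Example

A minimal sketch of a direct call; the input string and chunk length are arbitrary:

>>> from prompttools.utils import chunk_text
>>> chunks = chunk_text("The quick brown fox jumps over the lazy dog", 20)  # each chunk keeps whole words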

prompttools.utils.compute_similarity_against_model(row, prompt_column_name, model='gpt-4', response_column_name='response')#

Computes the similarity of a given response to the expected result generated from a high-quality LLM (by default GPT-4) using the same prompt.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • prompt_column_name (str) – name of the column that contains the input prompt

  • model (str) – name of the model that will serve as the judge

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

str
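
Example

A minimal sketch of a direct, single-row call based on the signature above; the row contents are hypothetical, and the reference model (GPT-4 by default) must be reachable:

>>> import pandas as pd
>>> from prompttools.utils import compute_similarity_against_model
>>> row = pd.Series({"prompt": "What is the capital of France?", "response": "Paris"})
>>> result = compute_similarity_against_model(row, prompt_column_name="prompt")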

prompttools.utils.ranking_correlation(row, expected_ranking, ranking_column_name='top doc ids')#

A simple test that compares the expected ranking for a given query with the actual ranking produced by the embedding function being tested.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • expected_ranking (list) – the expected ranking list to compare against

  • ranking_column_name (str) – the column name of the actual ranking produced by the model, defaults to "top doc ids"

Return type:

float

Example

>>> EXPECTED_RANKING_LIST = [
>>>     ["id1", "id3", "id2"],
>>>     ["id2", "id3", "id1"],
>>>     ["id1", "id3", "id2"],
>>>     ["id2", "id3", "id1"],
>>> ]
>>> experiment.evaluate("ranking_correlation", ranking_correlation, expected_ranking=EXPECTED_RANKING_LIST)

prompttools.utils.validate_json_response(row, response_column_name='response')#

Validates whether the response string is in valid JSON format.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
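
Example

A minimal sketch that mirrors the ranking_correlation example above; experiment is assumed to be an already-run prompttools experiment, and the metric name "is_valid_json" is arbitrary:

>>> from prompttools.utils import validate_json_response
>>> experiment.evaluate("is_valid_json", validate_json_response)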

prompttools.utils.validate_json.validate_keys(text, valid_keys)#

Checks that all keys in the generated JSON are valid (i.e., they appear in valid_keys).

Parameters:
  • text (str) – The generated text, which should be valid JSON.

  • valid_keys (List[str]) – A list of valid keys which may appear in the JSON.
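
Example

A minimal sketch of a direct call, assuming the validate_json submodule is importable as shown; the JSON string and key list are hypothetical:

>>> from prompttools.utils import validate_json
>>> result = validate_json.validate_keys('{"name": "Ada", "age": 36}', ["name", "age", "city"])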

prompttools.utils.validate_python_response(row, response_column_name='response')#

Validates whether the response string is syntactically valid Python.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
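
Example

A minimal sketch that mirrors the ranking_correlation example above; experiment is assumed to be an already-run prompttools experiment, and the metric name "is_valid_python" is arbitrary:

>>> from prompttools.utils import validate_python_response
>>> experiment.evaluate("is_valid_python", validate_python_response)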

prompttools.utils.semantic_similarity(row, expected, response_column_name='response')#

A simple test that checks semantic similarity between the expected response (provided by the user) and the model’s text responses.

Parameters:
  • row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).

  • expected (str) – the expected response for the given row

  • response_column_name (str) – name of the column that contains the model’s response, defaults to "response"

Return type:

float
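
Example

A minimal sketch that mirrors the ranking_correlation example above; experiment is assumed to be an already-run prompttools experiment with two result rows, and EXPECTED_RESPONSES is a placeholder list with one expected response per row:

>>> from prompttools.utils import semantic_similarity
>>> EXPECTED_RESPONSES = ["Paris", "Berlin"]
>>> experiment.evaluate("similar_to_expected", semantic_similarity, expected=EXPECTED_RESPONSES)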

prompttools.utils.similarity.compute(doc1, doc2, use_chroma=False)#

Computes the semantic similarity between two documents, using either ChromaDB or HuggingFace sentence_transformers.

Parameters:
  • doc1 (str) – The first document.

  • doc2 (str) – The second document.

  • use_chroma (bool) – Indicates whether or not to use Chroma. If False, uses HuggingFace sentence_transformers.
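
Example

A minimal sketch of a direct call, assuming the similarity submodule is importable as shown; with use_chroma left at its default (False), the HuggingFace sentence_transformers backend is used:

>>> from prompttools.utils import similarity
>>> score = similarity.compute("The cat sat on the mat.", "A cat was sitting on a mat.")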