Evaluation and Validation
These built-in functions help you evaluate the outputs of your experiments. They can also be used with prompttest as part of your CI/CD system.
- prompttools.utils.autoeval_binary_scoring(row, prompt_column_name, response_column_name='response')
Uses auto-evaluation to score the model response with “gpt-4” as the judge, returning 0.0 or 1.0.
- Parameters:
row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
prompt_column_name (str) – name of the column that contains the input prompt
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"
- Return type: float
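Example
A minimal sketch of attaching this scorer to an experiment via evaluate; the experiment object, the metric name, and the "prompt" column name are assumptions for illustration, not part of this API:
>>> from prompttools.utils import autoeval_binary_scoring
>>> experiment.evaluate("followed_directions", autoeval_binary_scoring, prompt_column_name="prompt")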
- prompttools.utils.autoeval_scoring(row, expected, response_column_name='response')
Uses auto-evaluation to score the model response.
- Parameters:
row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the expected response to judge the model’s response against
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"
- Return type: float
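Example
A sketch following the same evaluate pattern used elsewhere on this page; the metric name and the expected answer are placeholders:
>>> from prompttools.utils import autoeval_scoring
>>> experiment.evaluate("auto_score", autoeval_scoring, expected="Paris is the capital of France.")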
- prompttools.utils.autoeval_with_documents(row, documents, response_column_name='response')
Given a list of documents, score whether the model response is accurate with “gpt-4” as the judge, returning an integer score from 0 to 10.
- Parameters:
row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
documents (list[str]) – documents to provide relevant context for the model to judge
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"
- Return type: int
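Example
A sketch assuming an existing experiment; the metric name and the documents are illustrative:
>>> from prompttools.utils import autoeval_with_documents
>>> DOCUMENTS = ["Paris is the capital of France.", "France is a country in Europe."]
>>> experiment.evaluate("doc_accuracy", autoeval_with_documents, documents=DOCUMENTS)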
- prompttools.utils.chunk_text(text, max_chunk_length)
Given a long string of text and a maximum chunk length, returns chunks of text where each chunk is shorter than the max length, without breaking up individual words (separated by spaces).
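Example
A standalone sketch; the input text and the limit of 40 characters are illustrative:
>>> from prompttools.utils import chunk_text
>>> text = "A long paragraph that needs to be split into smaller pieces before embedding."
>>> chunks = chunk_text(text, 40)
>>> assert all(len(chunk) <= 40 for chunk in chunks)  # each chunk stays within the limit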
- prompttools.utils.compute_similarity_against_model(row, prompt_column_name, model='gpt-4', response_column_name='response')
Computes the similarity of a given response to an expected result generated by a high-quality LLM (by default GPT-4) using the same prompt.
- Parameters:
row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
prompt_column_name (str) – name of the column that contains the input prompt
model (str) – name of the model that will serve as the judge
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"
- Return type: float
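Example
A sketch following the evaluate pattern; the metric name and the "prompt" column name are assumptions:
>>> from prompttools.utils import compute_similarity_against_model
>>> experiment.evaluate("similarity_to_gpt4", compute_similarity_against_model, prompt_column_name="prompt")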
- prompttools.utils.ranking_correlation(row, expected_ranking, ranking_column_name='top doc ids')
A simple test that compares the expected ranking for a given query with the actual ranking produced by the embedding function being tested.
- Parameters:
row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected_ranking (list) – the expected ranking list to compare against
ranking_column_name (str) – the column name of the actual ranking produced by the model, defaults to "top doc ids"
- Return type: float
Example
>>> EXPECTED_RANKING_LIST = [
...     ["id1", "id3", "id2"],
...     ["id2", "id3", "id1"],
...     ["id1", "id3", "id2"],
...     ["id2", "id3", "id1"],
... ]
>>> experiment.evaluate("ranking_correlation", ranking_correlation, expected_ranking=EXPECTED_RANKING_LIST)
- prompttools.utils.validate_json_response(row, response_column_name='response')
Validate whether the response string is in a valid JSON format.
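Example
A sketch assuming an existing experiment; the metric name is a placeholder:
>>> from prompttools.utils import validate_json_response
>>> experiment.evaluate("is_valid_json", validate_json_response)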
- prompttools.utils.validate_json.validate_keys(text, valid_keys)
Guarantees that all keys in the generated JSON are valid.
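Example
A standalone sketch with an illustrative JSON string and key list:
>>> from prompttools.utils.validate_json import validate_keys
>>> result = validate_keys('{"name": "Ada", "age": 36}', ["name", "age"])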
- prompttools.utils.validate_python_response(row, response_column_name='response')
Validate whether the response string follows Python’s syntax.
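Example
A sketch assuming an existing experiment; the metric name is a placeholder:
>>> from prompttools.utils import validate_python_response
>>> experiment.evaluate("is_valid_python", validate_python_response)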
- prompttools.utils.semantic_similarity(row, expected, response_column_name='response')
A simple test that checks semantic similarity between the expected response (provided by the user) and the model’s text responses.
- Parameters:
row (pandas.core.series.Series) – A row of data from the full DataFrame (including input, model response, other metrics, etc).
expected (str) – the expected responses for each row in the column
response_column_name (str) – name of the column that contains the model’s response, defaults to "response"
- Return type: float
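Example
A sketch assuming an existing experiment; passing a single expected string for every row is an assumption for illustration:
>>> from prompttools.utils import semantic_similarity
>>> experiment.evaluate("similarity", semantic_similarity, expected="The capital of France is Paris.")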
- prompttools.utils.similarity.compute(doc1, doc2, use_chroma=False)
Computes the semantic similarity between two documents, using either ChromaDB or HuggingFace sentence_transformers.
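Example
A standalone sketch comparing two short documents with the default sentence_transformers backend; the strings are illustrative:
>>> from prompttools.utils import similarity
>>> score = similarity.compute("The cat sat on the mat.", "A cat was sitting on a mat.", use_chroma=False)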