Complete guide on "How to evaluate LangChain & other LLM Agents in Production with LangSmith?" - Code Included

Introduction

We have all heard about AI agents and how they can automate different kinds of workflows autonomously. But how do you monitor the performance of an LLM agent 🤔? Whenever we train a new LLM or fine-tune an existing model, we evaluate it against a dataset to measure its accuracy and other metrics.

Similarly, AI agents can behave unpredictably because they use LLMs under the hood, so before making them production ready it is necessary to evaluate them along several aspects, such as:

  • whether the agent is making the expected tool calls or not.
  • whether the agent's response is relevant to the user query or whether it is hallucinating.
  • whether the LLM we are using for our agent is performing well.

There can be more aspects than these, because evaluation depends entirely on your agent and the workflow you are trying to automate. In this blog we will evaluate our agent on these three aspects and see how it performs.

LangChain vs LangSmith

Some of you might get confused between LangChain and LangSmith, so let me clear that up. LangChain is a framework for creating LLM chains, AI agents and other applications on top of large language models. LangSmith, on the other hand, is built to evaluate what you create with LangChain, so you can monitor and test its performance and results.

In this blog, we are going to create an AI agent using LangChain and then evaluate it using LangSmith. I am assuming you have basic knowledge of AI agents and can build a basic agent using LangChain or any other agent framework, because I won't go too deep into agent creation here; I will focus on evaluation, which is the next step after creating an agent. If you want to know how to create AI agents, take a look at my other blogs where I have built several agents, like an AI SDR agent and an AI code review agent, from scratch.

What is LangSmith?

The first question that might come to your mind is: how do we evaluate our agent 🤔? There are several frameworks for this, such as LangSmith, Phoenix and Mosaic AI, but in this blog we are going to use LangSmith, an LLM evaluation framework created by the LangChain team. With LangSmith you can trace agent activity, create datasets, run experiments or evaluations on those datasets, add annotations and more, and the best part is that it is completely free for a single developer account 🤑!

In this blog, we are going to create a simple research agent using LangChain and then evaluate its responses, its tool calling behaviour and the performance of different LLMs, so let's get started 🚀!

Let's create our agent

Before going too deep into LangSmith, let's create our simple research agent using LangChain. This agent will be able to fetch information from the internet and do some basic math calculations (because I like math). We will also use one more tool to summarize the search results so the final response comes back in a specific format.

So basically we will have 3 main tools:

  • Calculator Tool (For basic math calculations)
  • Google Search Tool (To search for something on Google)
  • Summarization Tool (To summarize any text)

Prerequisites

Before building this agent, make sure you have:

  • A Python environment (I am using Google Colab in this blog)
  • An OpenAI API key
  • A LangSmith API key (free for a single developer account)
  • A Serper API key (for the Google search tool)
  • A Groq API key (only needed for the Llama models in the evaluation section)

Workflow

Let's take a look at the agent workflow to understand it better.

As we can see, the agent will have access to 3 tools and can decide which tool to use based on the user query. If the user wants to know something that requires internet access, it will use the Google search tool and then pass the search results to the summary tool to summarize them and respond back to the user. If the user wants to do some basic math, it will use the calculator tool (which can only do addition, subtraction, multiplication and division 💀).

Now that we know the full workflow of our agent, let's start coding 👨‍💻!

Creating Tools

First, let's install all the required dependencies. Don't worry, I will explain each dependency when we use it.


!pip install langchain openai langsmith tiktoken langchain_community requests langchain_openai langchainhub langchain_groq

We will be using langchain, openai, tiktoken, langchain_community and langchain_openai to create our agent, so let's import the required modules.


from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI as langchain_OpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
import os
from google.colab import userdata
from langchain.schema import SystemMessage
import openai

Now let's fetch all of our API keys and store them in global variables.


# Setup API Keys
openai_secret = userdata.get('OPENAI_KEY')
LangSmith_secret = userdata.get('LangSmith_API_KEY')
serper_secret= userdata.get('SERPER_API_KEY')
groq_secret = userdata.get('GROQ_API_KEY')

Now let's initialize our LLMs. We will be using "gpt-4" for now, but in the evaluation section we will try different models.


# Initializing objects

ChatOpenAI_LLM = ChatOpenAI(temperature=0, model="gpt-4",api_key=openai_secret)
OpenAI_LLM = openai.Client(api_key=openai_secret)

Now let's create our first tool function, the calculator tool. It takes 3 parameters: the first number, the second number and the operation to perform, so let's build it.


from langchain.pydantic_v1 import BaseModel, Field
from typing import Type, List
from langchain.tools import BaseTool

# Schema for calculator tool
class CalculatorToolInput(BaseModel):
    first_number: int = Field(...,description="The first number")
    second_number: int = Field(...,description="The second number")
    operation: str = Field(...,description="The operation to perform, it should be either 'add', 'sub', 'mul' or 'div'. ")

# Calculator Tool
def calculator_tool(first_number: int,second_number: int, operation: str):
  """
  Use this tool when you have 2 numbers and want to perform any mathematical operations on them
  """
  if operation == "add":
    return first_number + second_number
  if operation == "sub":
    return first_number - second_number
  if operation == "mul":
    return first_number * second_number
  if operation == "div":
    return first_number / second_number

  return "Invalid Operator!"

Now let's create our second tool, which performs a Google search using the Serper API and returns the search results as a string. It takes the user query as a parameter.


import json
import requests
# Schema for search tool
class SearchToolInput(BaseModel):
  query: str = Field(...,description = "The query to search")

# Google search tool
def search_tool(query: str):
  """
  Use this tool when you want to search about anything on google.
  """
  top_result_to_return = 3
  url = "https://google.serper.dev/search"
  payload = json.dumps({"q": query})
  headers = {
      'X-API-KEY': serper_secret,
      'content-type': 'application/json'
  }
  response = requests.request("POST", url, headers=headers, data=payload)
  # check if there is an organic key
  if 'organic' not in response.json():
    return "Sorry, I couldn't find anything about that, there could be an error with you serper api key."
  else:
    results = response.json()['organic']
    string = []
    for result in results[:top_result_to_return]:
      try:
        string.append('\n'.join([
            f"Title: {result['title']}", f"Link: {result['link']}",
            f"Snippet: {result['snippet']}", "\n-----------------"
        ]))
      except KeyError:
        continue  # skip results that are missing expected fields

    return '\n'.join(string)
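
You can test this tool on its own as well; note that it makes a live request to the Serper API, so it needs a valid SERPER_API_KEY:


# Quick manual test of the search tool (makes a live Serper API call)
print(search_tool("latest windows BSOD error caused by crowdstrike"))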

Now let's create our final tool, the summarization tool. It takes the search results and the user query as parameters and summarizes the results. We will use gpt-4 here, but you are free to use any other model.


# Input class for tool so that it can follow strict input parameter schema
class SummaryToolInput(BaseModel):
  search_result: str = Field(...,description = "search results to summarize")
  query: str = Field(...,description = "original search query")

# Summary Tool
def summary_tool(search_result: str, query: str):
  """
  Use this tool when you want to summarize any given text
  """

  prompt = f"""
    You are a text summarization expert. You will be given a search result and you have to summarize it in a very easy to read format.
    You will also be given the user query for the given search result. Come up with a good title for this search result based on the user query.

    Your final summary response should look like this:
    ---
    Title:
    title of the given search result summary

    Summary:
    summary of the given results in 200 words

    Result:
    Answer to user's original query

    Links:
    Reference links from the search result if present
    ---

    Here are the search results:
    ---
    {search_result}
    ---

    Here is the user's search query
    ---
    {query}
    ---

 """
  response = ChatOpenAI_LLM.invoke(prompt)
  return response.content

Creating Agent

Now it's time to create and run our agent!


from langchain.agents import initialize_agent, AgentType
# from langchain_core.utils.function_calling import convert_to_openai_function
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.tools import Tool
from langchain_core.tools import StructuredTool
# Creating agent
system_message = SystemMessage(
    content="""
    You are an AI research agent. You have access to several tools like calculator and search engine.

    Your job is to use the right tool to fulfill the user's query and give them detailed information from the internet.
    Make sure that whenever you use google_search_tool, you pass the search result string to summary_tool to summarize the search result.
    If the user query doesn't require a Google search then you don't need to use the summary tool.

    Your final response should look like this:

    Result:
    Result to user's query

    Source:
    Links for reference if there are any
    """
)
agent_kwargs = {
    "system_message": system_message,
}
memory = ConversationBufferWindowMemory(
    memory_key='memory',
    k=1,
    return_messages=True
)
# Converting the tool functions into langchain tools
calculator_structured_tool = StructuredTool.from_function(
    func=calculator_tool,
    name="calculator_tool",
    description="Use this tool when you have 2 numbers and want to perform any mathematical operations on them",
    args_schema=CalculatorToolInput
)

search_structured_tool = StructuredTool.from_function(
    func=search_tool,
    name="google_search_tool",
    description="Use this tool when you want to search about anything on google",
    args_schema=SearchToolInput
)

summary_structured_tool = StructuredTool.from_function(
    func=summary_tool,
    name="summary_tool",
    description="Use this tool when you want to summarize any given text",
    args_schema=SummaryToolInput
)
tools = [calculator_structured_tool, search_structured_tool, summary_structured_tool]

agent = initialize_agent(
    tools,
    llm=ChatOpenAI_LLM,
    verbose=True,
    agent_kwargs=agent_kwargs,
    agent=AgentType.OPENAI_FUNCTIONS,
    memory=memory,
    handle_parsing_errors=True
)

And it's time to invoke our agent. Let's try it by asking about the latest Windows BSOD error caused by CrowdStrike.


agent_prompt = """
What is the latest windows BSOD error caused by crowdstrike?
"""

response = agent({"input":agent_prompt})
print(response["output"])

As we can see, we are getting results related to the latest news!

LangSmith Overview

Now it's time to add LangSmith to this agent. First we will import the required dependencies and add a little code to trace our agent's activities, but before that we need to create a project in the LangSmith dashboard, so let's go there first 🏃.

Your LangSmith dashboard should look like this (you might not have any projects if you are using it for the first time).

On the dashboard, click the "New Project" button, create a new project and give it a name. Now we are ready to integrate LangSmith into our project.

Tracing Agent Activity

First we will trace the agent activity using LangSmith, which only requires a few small additions to our existing agent code.

First import the required dependencies


from langsmith.wrappers import wrap_openai
from langsmith import traceable

Now add some environment variables to integrate your LangSmith project with the current agent


os.environ['LANGCHAIN_TRACING_V2'] = "true"
os.environ['LANGCHAIN_API_KEY']=LangSmith_secret
os.environ['LANGCHAIN_PROJECT']="agent_evaluation"

First we need to wrap our OpenAI client so that LangSmith can trace every LLM call.


OpenAI_LLM = wrap_openai(openai.Client(api_key=openai_secret))

Secondly, add the "@traceable" decorator to every tool function to trace the function calls. This decorator accepts 2 main parameters:

  • run_type: The type of run, e.g. a tool call, a simple chain or an LLM call.
  • name: The name that will be shown in the trace.

So let's add this decorator to our functions:


# Calculator Tool
@traceable(run_type="tool",name="calculator_tool")
def calculator_tool(first_number: int,second_number: int, operation: str):
	# ... Other code here
	
# Google search tool
@traceable(run_type="tool",name="google_search_tool")
def search_tool(query: str):
	# ... Other code here
	
# Summary Tool
@traceable(run_type="tool",name="summary_tool")
def summary_tool(search_result: str, query: str):
	# ... Other code here

And now we are ready to trace our agent activity, so let's try it by running our agent!
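
If you are following along in a notebook, re-run the StructuredTool and initialize_agent cells after adding the decorators so the agent picks up the decorated functions; the invocation itself is unchanged:


# Same invocation as before -- this run will now appear in the "agent_evaluation" project
response = agent({"input": "What is the latest windows BSOD error caused by crowdstrike?"})
print(response["output"])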

After running the agent, you will see a new entry in your project's runs table, and after clicking that run you can see the detailed trace of the agent, including information like input, output, total tokens and tool calls.

In this example, I invoked the agent with the same Windows BSOD news input. As we can see in the image below, it first made an LLM call to decide which tool to use, then called the Google search tool, then asked the LLM again to decide the next step, and finally used the summary tool to summarize the results.

Tracing Every Single Agent Step

If we want to know exactly how our agent works under the hood and how it does the function calling, we can also do that with LangSmith. We just need to invoke our agent a bit differently, so let's take a look!

First we will need to import some dependencies


from langchain import hub
from langchain.agents import AgentExecutor, create_tool_calling_agent

To run the agent this way, we need to pass a prompt built for function calling; you can pull such prompts from the LangChain Hub. We will use the hub module to fetch the prompt, create a tool-calling agent with the "create_tool_calling_agent" method, and then invoke that agent using "AgentExecutor".


prompt = hub.pull("hwchase17/openai-functions-agent")
llm=langchain_OpenAI(model="gpt-4",temperature=0,api_key=openai_secret)
# Creating tool calling agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
# Invoking agent
response = agent_executor.invoke({"input": "What is the latest windows BSOD error caused by crowdstrike?"})

After running this agent, you will see one more entry in your LangSmith runs table, but this time with more information than before, which lets you monitor every single step the agent took to reach the final result.

Evaluation

Now it's time to evaluate our agent. Agent evaluation with LangSmith is very similar to evaluating an LLM against a dataset. We will follow the same procedure here: evaluate the agent against a dataset using an evaluation function that checks whether the agent performed well, so let's get started!

In this section, we will evaluate our agent on 3 main aspects:

  • Agent Step Evaluation
  • Agent Response Evaluation
  • LLM Evaluation

So let's get started 🚀!

Agent Step Evaluation

To make your agent production ready and automate a workflow reliably, you need to ensure it follows a fixed, structured path: LLMs can be unpredictable, and the agent may take a different path or make a different tool call, which can cause errors. To check this quality, you need to evaluate the tool calling behaviour of your agent.

This is where agent step evaluation comes in. Here we will create a dataset of inputs and expected tool names and compare it with the steps the agent actually takes. So let's start by creating our dataset 📝!

There are 2 main ways to create a dataset in LangSmith:

  • Create a dataset using code
  • Create a dataset from your existing runs

We will cover both ways in this blog. For agent step evaluation, we will create the dataset using code, so let's create one!


from langsmith import Client

client = Client()

# Create a dataset
examples = [
    ("What is the sum of 15 and 20", "calculator_tool"),
    ("I have 100 rupees and i purchased apples of 50 rupees so how many rupees i have now?", "calculator_tool"),
    ("What is the latest news about windows shutdown and BSOD error?","google_search_tool"),
    ("Summarize the given text: Use os.environ to set environment variables in Python, ensuring keys and values are strings for proper assignment.","summary_tool"),
    ("What is the current price of bitcoin", "google_search_tool"),
]

dataset_name = "function_calls_dataset"
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name)
    inputs, outputs = zip(
        *[({"input": text}, {"output": label}) for text, label in examples]
    )
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)

The above code first initializes a LangSmith client, then checks whether a dataset with that name already exists; if not, it converts the examples into input and output format and adds them to a new dataset. You are free to add your own examples to this dataset and play around with it.

After running the above code, you will see your dataset examples in the datasets section.

Now that our dataset is ready, let's create the evaluation functions. The first question that might come to your mind is "How do I get the name of the tool being called? 🤔", because generally we only get the final response. The "bind_tools" method lets you see the requested tool call along with the model output, so let's implement it.


llm=langchain_OpenAI(model="gpt-4",temperature=0,api_key=openai_secret)
assistant_runnable = llm.bind_tools(tools)

Now we just need to invoke this assistant using the invoke method and it will give us the response.
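
For example, a math question should come back with a tool call rather than a final answer (the printed structure below is indicative; the exact fields depend on your langchain version):


# The response is an AIMessage; its tool_calls attribute lists the requested tool calls
result = assistant_runnable.invoke([("user", "What is the sum of 15 and 20")])
print(result.tool_calls)  # e.g. [{'name': 'calculator_tool', 'args': {'first_number': 15, ...}, ...}]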

To evaluate any agent, we will need 2 functions:

  • Function to get the assistant response
  • Function to evaluate the assistant response

In the response function, we simply invoke the assistant and return the result. In the evaluator, we get the expected tool call from the dataset and compare it with the tool call in the assistant's response: if it matches, we give a score of 1, otherwise 0.


from langsmith.schemas import Example, Run

def predict_assistant(example: dict):
    """Invoke assistant for single tool call evaluation"""
    msg = [ ("user", example["input"]) ]
    result = assistant_runnable.invoke(msg)
    return {"response": result}


def check_specific_tool_call(root_run: Run, example: Example) -> dict:
    """
    Check if the first tool call in the response matches the expected tool call.
    """
    # Expected tool call
    expected_tool_call = example.outputs["output"]

    # Run
    response = root_run.outputs["response"]

    # Get tool call
    try:
        tool_call = getattr(response, 'tool_calls', [])[0]['name']
    except (IndexError, KeyError):
        tool_call = None

    # Give score
    score = 1 if tool_call == expected_tool_call else 0
    return {"score": score}

Finally run the evaluation!


from langsmith.evaluation import evaluate
experiment_prefix = "gpt-4"
metadata = "gpt-4 AI research agent"
experiment_results = evaluate(
    predict_assistant,
    data=dataset_name,
    evaluators=[check_specific_tool_call],
    experiment_prefix=experiment_prefix + "-single-tool",
    num_repetitions=1,
    metadata={"version": metadata},
)

After running the above code, you will see a new entry in the experiments tab of your dataset.

You can see the score, latency and other information in this row. Click on it to see more information about the evaluation.

As we can see, our agent passed this evaluation with a score of 1! You can also click on an individual evaluation item to trace the agent activity for that input.

Agent Response Evaluation

Now let's evaluate the agent's responses. Here we will have a dataset of inputs and expected outputs, and we will compare them against the agent's final responses.

This time we will create the dataset from the agent's previous runs, but you can also create it in code the same way we did above (a sketch follows below). In the LangSmith dashboard, after opening any run you will see a button called "add to dataset" which lets you add the input and output of that run to your dataset.

After clicking that button, you can edit the input and output and add them to a dataset. If you don't have a dataset yet, create one.

I ran my agent 5 times with different examples and added all the responses to my dataset. Here is how it looks:
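
If you prefer to build this dataset in code instead of through the UI, a minimal sketch could look like this (the question/answer pairs are placeholders; use the inputs and final outputs from your own runs):


# Illustrative only: create the "agent-response" dataset programmatically
# using the same LangSmith client as before. Replace the placeholder pairs.
response_examples = [
    ("What is the sum of 15 and 20", "Result:\n35"),
    ("What is the latest news about windows shutdown and BSOD error?", "Result:\n<expected summary of the news>"),
]

response_dataset_name = "agent-response"
if not client.has_dataset(dataset_name=response_dataset_name):
    response_dataset = client.create_dataset(dataset_name=response_dataset_name)
    inputs, outputs = zip(*[({"input": q}, {"output": a}) for q, a in response_examples])
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=response_dataset.id)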

Now we are ready to evaluate our agent, so let's get started!

First, create the evaluator and the response function. In the evaluator, I am using gpt-4 to grade the agent's response against the given input and expected output, and it returns a similarity score between 0 and 1.

It is a good practice to use different LLMs to evaluate your agent.

def predict_agent_answer(example: dict):
    """Invoke assistant to get response"""
    response = agent_executor.invoke({"input": example["input"]})
    return {"response": response["output"]}

def answer_evaluator(run, example) -> dict:
    """
    A simple evaluator for agent answer accuracy
    """
    # Prompt for grading, pulled from the LangChain Hub
    grade_prompt_answer_accuracy = hub.pull("langchain-ai/rag-answer-vs-reference")

    # Get question, ground truth answer, RAG chain answer
    input_question = example.inputs["input"]
    reference = example.outputs["output"]

    prediction = run.outputs["response"]

    # LLM grader

    # Structured prompt
    answer_grader = grade_prompt_answer_accuracy | llm

    # Run evaluator
    score = answer_grader.invoke({"question": input_question,
                                  "correct_answer": reference,
                                  "student_answer": prediction})
    score = score["Score"]

    return {"score": score}

And finally, let's evaluate this agent 🤖!


experiment_prefix = "gpt-4"
metadata = "gpt-4 AI research agent"
dataset_name = "agent-response" # replace with your dataset name
experiment_results = evaluate(
    predict_agent_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix=experiment_prefix + "-response-v-reference",
    num_repetitions=1,
    metadata={"version": metadata},
)

After running the evaluation, you can see the results in the experiments tab of your dataset.

As we can see, our agent successfully passed the test 🎉!

You can add more complex examples to test this agent for production, but for the sake of this blog I have added very basic ones.

LLM Evaluation

We all know that AI agents use large language models (LLMs) under the hood, so it is essential to test an LLM's function calling capabilities in order to choose the right model for your agent. The 2 methods above evaluate your tool implementation and your agent workflow.

In this section, we will test several popular LLMs with our agent (GPT-4, GPT-4o, GPT-3.5 Turbo, Llama 3 and Llama 3.1) and see how each one performs.

GPT-4

We already evaluated the gpt-4 model in the agent response section, so I am just going to include the evaluation results for gpt-4 here.

GPT-4o

Let's test our agent with the newer GPT-4 omni (gpt-4o) model. We just need to change the model name in our OpenAI wrapper and it will work.


llm=langchain_OpenAI(model="gpt-4o",temperature=0,api_key=openai_secret)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

The evaluation functions stay the same; you just need to change the experiment prefix and metadata to differentiate between the model evaluations.


experiment_prefix = "gpt-4o"
metadata = "GPT-4o AI research agent"
dataset_name = "agent-response"
experiment_results = evaluate(
    predict_agent_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix=experiment_prefix + "-response-v-reference",
    num_repetitions=1,
    metadata={"version": metadata},
)

Here are the evaluation results for gpt-4 Omni

GPT 3.5 Turbo

Let's test the function calling capabilities of the gpt-3.5-turbo model. This is a very basic agent and the dataset examples are also basic, so it should pass all the tests as well.


llm=langchain_OpenAI(model="gpt-3.5-turbo",temperature=0,api_key=openai_secret)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

Run the evaluation


experiment_prefix = "gpt-3.5-Turbo"
metadata = "GPT-3.5 Turbo AI research agent"
dataset_name = "agent-response"
experiment_results = evaluate(
    predict_agent_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix=experiment_prefix + "-response-v-reference",
    num_repetitions=1,
    metadata={"version": metadata},
)

Here are the results!

Now let's compare these 3 GPT models side by side. To do this, click the "+Add" button in the next column of your evaluation results table and select any other evaluation run on the same dataset.

Here is the side by side comparison between GPT-4, GPT-4 omni and GPT-3.5 Turbo

As we can see, every model passed these basic tests, but if you look at the latency, GPT-3.5 Turbo was surprisingly faster than the other 2 models. This is likely because the agent and the dataset examples were very basic, so GPT-3.5 Turbo can be the best option for simple agents.

Llama3

Now let's test the Llama 3 model with our agent. To use Llama 3 as the LLM for our agent, we first need to configure a few things. I am going to use Groq to access Llama 3, but you are free to use any other inference provider, or you can run it locally using Ollama (if you have a capable PC, of course).

I am going to use ChatGroq from the langchain_groq module to access Groq-hosted models.


from langchain_groq import ChatGroq
prompt = hub.pull("hwchase17/openai-functions-agent")
llm = ChatGroq(model_name="llama3-70b-8192", groq_api_key=groq_secret, temperature=0)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

The evaluation functions stay the same; you just need to change the experiment prefix and metadata.


from langsmith.evaluation import evaluate
experiment_prefix = "llama3"
metadata = "llama3 AI research agent"
dataset_name = "agent-response"
experiment_results = evaluate(
    predict_agent_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix=experiment_prefix + "-response-v-reference",
    num_repetitions=1,
    metadata={"version": metadata},
)

Here are the results after using Llama 3 as the LLM for our agent (ignore the null error; it happened because I hit the Groq rate limit 💀).

Llama 3.1

Meta recently released a new Llama version called Llama 3.1, claiming it is their most capable model to date. It is also an open source model, so it will be much cheaper than the GPT models. But I don't think it is properly fine-tuned for function calling yet 🤔, so we might need to wait for fine-tuned versions of Llama 3.1 that support function calling. There may be some instruct models on Hugging Face and other platforms, but for now I am going to use the "llama-3.1-70b-versatile" model from Groq.

You just need to change the model name in the ChatGroq call and you are good to go!


from langchain_groq import ChatGroq
prompt = hub.pull("hwchase17/openai-functions-agent")
llm = ChatGroq(model_name="llama-3.1-70b-versatile", groq_api_key=groq_secret, temperature=0)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

Let's evaluate the model!


from langsmith.evaluation import evaluate
experiment_prefix = "llama3.1"
metadata = "llama3.1 AI research agent"
dataset_name = "agent-response"
experiment_results = evaluate(
    predict_agent_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix=experiment_prefix + "-response-v-reference",
    num_repetitions=1,
    metadata={"version": metadata},
)

And here are the results!

It didn't perform well, likely because the model I was using was not capable or fine-tuned enough for function calling, but in the future we might see some impressive results. Let me know if you find a better or fine-tuned version of Llama 3.1 that works well.

Here is the side by side comparison of every model we evaluated so far 👇

Conclusion

As we saw in this blog, evaluating your AI agent's performance is an essential step before putting it into production, and you can do it easily using LangSmith. We also discussed 3 main aspects of agent evaluation, which let us test the tool implementation, the agent workflow and the LLM.

We also compared different large language models to test their function calling capabilities, and every model performed well because our agent was very simple and the dataset examples were simple and limited. I highly encourage you to test your agent with more complex examples tailored to your workflow, so that it performs well in production with minimal latency and errors.

Level up your organization with Ionio

Whether you are a small team looking to automate your company workflows, an individual who wants to automate a task using AI, or a larger organization looking to integrate autonomous AI agents, we at Ionio have experience building custom AI agents and will be more than happy to help you.

Since 2021, we have helped many organizations, from large enterprises to bootstrapped startups, automate their workflows with our AI solutions. We have also written many blogs documenting our process of creating various AI agents. If you are looking to build something with AI, book a call with us and we will be happy to turn your ideas into reality.

Thanks for reading 😄.

Behind the Blog 👀
Shivam Danawale
Writer

Shivam is an AI Researcher & Full Stack Engineer at Ionio.

Rohan Sawant
Editor

Rohan is the Founder & CEO of Ionio. I make everyone write all these nice articles... 🥵