
LLMs and MCP: "Evaluation Report on MCP Servers", Translation and Interpretation

Source: original, 2025/6/9 6:18:40

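Overview: The paper evaluates the effectiveness and efficiency of various MCP servers on web search and database search tasks. It finds that MCP server performance varies significantly and that MCP servers are not necessarily better than function calls. Optimizations such as a declarative interface can substantially improve MCP server accuracy. The study stresses optimizing the parameters the LLM must construct and improving the usability of tool interfaces, and it offers useful insight for future AI-driven search and data retrieval solutions.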

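>> Background and pain points

● The effectiveness and efficiency of MCP servers vary significantly: using an MCP server does not necessarily bring a clear improvement over direct function calls.

● Optimizing the parameters that the LLM has to construct can improve the effectiveness of MCP servers.

>> Proposed solutions

● The MCPBench evaluation framework: it evaluates MCP servers on accuracy, time, and token usage. The framework and datasets are released at https://github.com/modelscope/MCPBench.

● A declarative interface: natural language replaces structured parameters, relieving the LLM of the burden of writing SQL statements. It is implemented in the XiYan MCP server, which takes natural language as its interface and uses XiYanSQL to convert the natural language query into SQL.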
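
The three metrics that MCPBench reports (accuracy, time, and token usage) can be illustrated with a minimal sketch. This is not the MCPBench code: the names `CallResult`, `Report`, `evaluate`, and `fake_ask` are hypothetical, and the "pipeline" here is a canned stand-in for an LLM calling an MCP server.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CallResult:
    answer: str       # final answer produced by the LLM + tool pipeline
    tokens_used: int  # tokens consumed while answering this question


@dataclass
class Report:
    accuracy: float     # fraction of questions answered correctly
    avg_seconds: float  # mean wall-clock time per question
    avg_tokens: float   # mean token usage per question


def evaluate(
    ask: Callable[[str], CallResult],
    dataset: Iterable[tuple[str, str]],
    is_correct: Callable[[str, str], bool],
) -> Report:
    """Run every (question, expected) pair through `ask` and aggregate metrics."""
    correct = 0
    total_seconds = 0.0
    total_tokens = 0
    n = 0
    for question, expected in dataset:
        start = time.perf_counter()
        result = ask(question)
        total_seconds += time.perf_counter() - start
        total_tokens += result.tokens_used
        correct += is_correct(result.answer, expected)  # True counts as 1
        n += 1
    if n == 0:
        raise ValueError("dataset is empty")
    return Report(correct / n, total_seconds / n, total_tokens / n)


# Hypothetical stand-in for "LLM + MCP server"; one answer is deliberately wrong.
def fake_ask(question: str) -> CallResult:
    canned = {"capital of France?": "Paris", "2 + 2?": "5"}
    return CallResult(answer=canned[question], tokens_used=100)


dataset = [("capital of France?", "Paris"), ("2 + 2?", "4")]
report = evaluate(fake_ask, dataset, lambda got, want: got.strip() == want)
print(report)
```

Keeping `ask` as a plain callable is what allows the same LLM, prompt, and dataset to be reused across servers, so that only the server under test changes.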

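>> Core approach and steps

● Evaluation framework MCPBench: select widely used MCP servers, covering both web search and database search. Evaluate them in a controlled environment with the same LLM and prompt. The metrics are accuracy, time consumption, and token consumption.

● Declarative interface: replace MCP's structured parameters with a natural language interface. Use a text-to-SQL model (such as XiYanSQL) to convert the natural language query into SQL. Execute the SQL query and return the results.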
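
The declarative-interface steps (natural language in, text-to-SQL, execute, return rows) can be sketched as follows. This is an illustration under stated assumptions, not the XiYan MCP server: `stub_text_to_sql` is a hypothetical canned lookup standing in for a real text-to-SQL model such as XiYanSQL, and the `orders` table is invented demo data.

```python
import sqlite3
from typing import Callable


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (invented demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    )
    return conn


def stub_text_to_sql(question: str) -> str:
    """Hypothetical placeholder for a text-to-SQL model; knows one question."""
    canned = {
        "how much has alice spent in total?":
            "SELECT SUM(amount) FROM orders WHERE customer = 'alice'",
    }
    key = question.strip().lower()
    if key not in canned:
        raise ValueError(f"stub cannot translate: {question!r}")
    return canned[key]


def declarative_query(
    conn: sqlite3.Connection,
    question: str,
    text_to_sql: Callable[[str], str],
) -> list[tuple]:
    """Tool entry point: the caller (the LLM) supplies only natural language."""
    sql = text_to_sql(question)  # step 1: natural language -> SQL
    # Illustrative read-only check; not a real security boundary.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed in this sketch")
    return conn.execute(sql).fetchall()  # steps 2-3: execute and return rows


conn = make_demo_db()
rows = declarative_query(conn, "How much has Alice spent in total?", stub_text_to_sql)
print(rows)
```

The point of the design is that the calling LLM never sees or writes SQL; the only parameter it has to construct is the question string.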

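>> Advantages

● Advantages of the declarative interface:

●● Higher accuracy: moving the burden of writing SQL from the LLM to a dedicated text-to-SQL model can improve query accuracy.

●● Lower token consumption: a natural language interface can reduce prompt complexity and therefore token consumption.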

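>> Conclusions and takeaways

● The effectiveness and efficiency of MCP servers vary significantly: Bing Web Search performed best (64% accuracy) and DuckDuckGo worst (10%).

● Using MCP is not necessarily more accurate than function calls: function calls such as Qwen Web Search are competitive in accuracy.

● Optimizing the parameters the LLM must construct can improve MCP server performance: introducing a declarative interface, with natural language in place of SQL queries, raised database search accuracy by 22 percentage points.

● In web search tasks, how the search results are processed directly affects LLM accuracy: handing over raw search results demands stronger reasoning from the LLM, while analyzing and processing the results beforehand simplifies the LLM's task.

● Developers are advised to focus on optimizing MCP servers, in particular the parameters the LLM must construct, and on making tool interfaces easier to use.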
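
One way to read the takeaways on function calls and parameter construction: in both cases the LLM fills in arguments against a JSON Schema, so the difficulty lies in the parameter itself rather than in the transport. The sketch below assumes the MCP tool descriptor shape (`name`, `description`, `inputSchema`), the `tools/call` JSON-RPC method, and the OpenAI-style function tool format as I understand them; the tool names `execute_sql` and `ask_database` are hypothetical.

```python
import json
from typing import Any


def mcp_tool_to_function_tool(mcp_tool: dict[str, Any]) -> dict[str, Any]:
    """Wrap an MCP tool descriptor as an OpenAI-style function tool."""
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            # The same JSON Schema is shown to the LLM either way.
            "parameters": mcp_tool["inputSchema"],
        },
    }


def mcp_tools_call(request_id: int, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical structured tool: the LLM must write a full SQL statement.
sql_tool = {
    "name": "execute_sql",
    "description": "Run a SQL query against the database.",
    "inputSchema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

# Hypothetical declarative tool: the LLM only writes a question.
nl_tool = {
    "name": "ask_database",
    "description": "Answer a natural language question about the database.",
    "inputSchema": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}

function_tool = mcp_tool_to_function_tool(sql_tool)
request = mcp_tools_call(1, nl_tool["name"], {"question": "How many orders did alice place?"})
print(json.dumps(request, indent=2))
```

Under these assumptions, switching between MCP and function calling leaves the schema untouched, whereas switching from `sql` to `question` changes what the LLM has to get right.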


"Evaluation Report on MCP Servers": Translation and Interpretation

Link

Paper: [2504.11094] Evaluation Report on MCP Servers

Date

April 15, 2025

Latest version: April 18, 2025

Authors

Zhiling Luo∗, Xiaorong Shi, Xuanrui Lin, Jinyang Gao

Abstract

With the rise of LLMs, a large number of Model Context Protocol (MCP) services have emerged since the end of 2024. However, the effectiveness and efficiency of MCP servers have not been well studied. To study these questions, we propose an evaluation framework, called MCPBench. We selected several widely used MCP servers and conducted an experimental evaluation on their accuracy, time, and token usage. Our experiments showed that the most effective MCP, Bing Web Search, achieved an accuracy of 64%. Importantly, we found that the accuracy of MCP servers can be substantially enhanced by involving a declarative interface. This research paves the way for further investigations into optimized MCP implementations, ultimately leading to better AI-driven applications and data retrieval solutions.

In brief: MCP services have multiplied since the end of 2024, yet their effectiveness and efficiency remain understudied. The authors propose the MCPBench evaluation framework and use it to measure several widely used MCP servers on accuracy, time, and token usage. The best server, Bing Web Search, reaches 64% accuracy, and adding a declarative interface substantially raises accuracy.

1. Introduction

Model Context Protocol (MCP)[1] is an open protocol that enables AI models to securely interact with local and remote resources through standardized server implementations. Thousands of MCPs have been proposed in recent months. At the same time, several model platforms, e.g. OpenAI and Alibaba-cloud announced the support of MCP in their LLM products. The outbreak of the MCP protocol has become a reality. To study the effectiveness and efficiency of MCP servers, we selected several widely used MCP servers and conducted an experiment to evaluate them using MCPBench on their accuracy, time, and token usage. We focused on two tasks: web search and database search. The former involves searching the internet to answer questions, while the latter entails fetching data from a database. All MCP servers were compared using the same LLM and prompt in a controlled environment. We aimed to answer the following questions:

• Question 1: Are MCP servers effective and efficient in practice?

• Question 2: Does using MCP provide higher accuracy compared to function calls?

• Question 3: How to enhance the performance?

To study these questions, we propose an evaluation framework, called MCPBench, which is released at https://github.com/modelscope/MCPBench. Besides, we provide the dataset of web search and database search at the same time.

In brief: MCP is an open protocol that lets AI models interact securely with local and remote resources through standardized server implementations, and platforms such as OpenAI and Alibaba Cloud have announced support for it. The study compares widely used MCP servers on two tasks, web search (answering questions by searching the internet) and database search (fetching data from a database), using the same LLM and prompt in a controlled environment. It asks three questions: whether MCP servers are effective and efficient in practice, whether MCP is more accurate than function calls, and how performance can be improved. The MCPBench framework and both datasets are released at https://github.com/modelscope/MCPBench.

Conclusion

The evaluation of various Model Context Protocol (MCP) servers highlights significant differences in both effectiveness and efficiency. While MCPs offer distinct advantages in structuring tool usage, they do not consistently demonstrate marked improvements over non-MCP approaches, such as function calls. Our experiments showed that the most effective MCP, Bing Web Search, achieved an accuracy of 64%, whereas DuckDuckGo lagged at just 10%. Furthermore, performance varied widely in terms of time consumption, with top performers like Bing and Brave Search executing tasks in under 15 seconds, contrasted with significantly slower alternatives like Exa Search.

Importantly, we found that the accuracy of MCP servers can be substantially enhanced by optimizing the parameters that LLMs must construct. For instance, transitioning from SQL-based queries to natural language processing in the XiYan MCP server resulted in a noteworthy increase in accuracy, demonstrating that incorporating a text-to-SQL model can lead to a 22 percentage point improvement.

Overall, while MCPs provide a structured means for AI tools to interact with data, there remains considerable potential for optimization. By addressing the challenges LLMs encounter in parameter construction and enhancing the user-friendliness of tool interfaces, developers can significantly improve the performance and reliability of MCP servers. This research paves the way for further investigations into optimized MCP implementations, ultimately leading to better AI-driven search and data retrieval solutions.

In brief: MCP servers differ markedly in effectiveness and efficiency, and they do not consistently outperform non-MCP approaches such as function calls. Bing Web Search is the most accurate at 64%, while DuckDuckGo reaches only 10%; Bing and Brave Search finish tasks in under 15 seconds, whereas Exa Search is significantly slower. Optimizing the parameters the LLM must construct helps considerably: switching the XiYan MCP server from SQL-based queries to natural language, backed by a text-to-SQL model, improves accuracy by 22 percentage points. Easing parameter construction and making tool interfaces more user-friendly are the main routes the authors see to better performance and reliability.