AgentQL
AgentQL tools provide web interaction and structured data extraction from any web page using an AgentQL query or a Natural Language prompt. AgentQL can be used across multiple languages and web pages without breaking as pages change over time.
Overview​
AgentQL provides the following three tools:
- ExtractWebDataTool extracts structured data as JSON from a web page given a URL using either an AgentQL query or a Natural Language description of the data.

The following two tools are also bundled as AgentQLBrowserToolkit and must be used with a Playwright browser or a remote browser instance via Chrome DevTools Protocol (CDP):

- ExtractWebDataBrowserTool extracts structured data as JSON from the active web page in a browser using either an AgentQL query or a Natural Language description.
- GetWebElementBrowserTool finds a web element on the active web page in a browser using a Natural Language description and returns its CSS selector for further interaction.
Integration details​
Class | Package | Serializable | JS support | Package latest |
---|---|---|---|---|
AgentQL | langchain-agentql | ❌ | ❌ | 1.0.0 |
Tool features​
Tool | Web Data Extraction | Web Element Extraction | Use With Local Browser |
---|---|---|---|
ExtractWebDataTool | ✅ | ❌ | ❌ |
ExtractWebDataBrowserTool | ✅ | ❌ | ✅ |
GetWebElementBrowserTool | ❌ | ✅ | ✅ |
Setup​
%pip install --quiet -U langchain_agentql
To run this notebook, install the Playwright browsers and configure the Jupyter Notebook's asyncio event loop.
!playwright install
# This import is required only in Jupyter notebooks, since they run their own event loop
import nest_asyncio
nest_asyncio.apply()
Credentials​
To use the AgentQL tools, you will need to get your own API key from the AgentQL Dev Portal and set the AGENTQL_API_KEY environment variable.
import os
os.environ["AGENTQL_API_KEY"] = "YOUR_AGENTQL_API_KEY"
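If you would rather not hardcode the key in a notebook cell, one option is to prompt for it only when the variable is unset. The following is a minimal sketch using only the standard library; the helper name `ensure_agentql_api_key` is hypothetical and not part of langchain-agentql.

```python
import getpass
import os


def ensure_agentql_api_key(prompt_fn=getpass.getpass) -> str:
    """Return the AgentQL API key, prompting for it only if it is not set.

    Hypothetical notebook helper, not part of langchain-agentql.
    `prompt_fn` is injectable so the function can be exercised without a terminal.
    """
    key = os.environ.get("AGENTQL_API_KEY", "").strip()
    if not key:
        key = prompt_fn("Enter your AgentQL API key: ").strip()
        if not key:
            raise ValueError("AGENTQL_API_KEY is required to use the AgentQL tools")
        # Store the key so the AgentQL tools can read it from the environment.
        os.environ["AGENTQL_API_KEY"] = key
    return key
```

Calling `ensure_agentql_api_key()` before instantiating any tool leaves an existing value untouched and asks for input otherwise.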
Instantiation​
ExtractWebDataTool
You can instantiate ExtractWebDataTool with the following params:
- api_key: Your AgentQL API key from dev.agentql.com. Optional.
- timeout: The number of seconds to wait for a request before timing out. Increase if data extraction times out. Defaults to 900.
- is_stealth_mode_enabled: Whether to enable experimental anti-bot evasion strategies. This feature may not work for all websites at all times. Data extraction may take longer to complete with this mode enabled. Defaults to False.
- wait_for: The number of seconds to wait for the page to load before extracting data. Defaults to 0.
- is_scroll_to_bottom_enabled: Whether to scroll to the bottom of the page before extracting data. Defaults to False.
- mode: "standard" uses deep data analysis, while "fast" trades some depth of analysis for speed and is adequate for most use cases. Learn more about the modes in this guide. Defaults to "fast".
- is_screenshot_enabled: Whether to take a screenshot before extracting data. Returned in 'metadata' as a Base64 string. Defaults to False.
ExtractWebDataTool is implemented with AgentQL's REST API. You can view more details about the parameters in the API Reference docs.
from langchain_agentql.tools import ExtractWebDataTool
extract_web_data_tool = ExtractWebDataTool()
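The parameters listed above are passed as keyword arguments. As an illustration, settings for a slow, dynamically loaded page might look like the sketch below. The specific values are assumptions chosen for the example rather than recommendations, and `make_extract_tool` is a hypothetical wrapper, not part of the package.

```python
# Illustrative settings for a slow, dynamically loaded page.
# The values are assumptions for this example; the parameter names
# are the ones documented in the list above.
slow_page_settings = {
    "timeout": 1200,  # seconds; raised from the default of 900
    "wait_for": 5,  # seconds to wait for the page to load
    "is_scroll_to_bottom_enabled": True,  # scroll first to load lazy content
    "mode": "standard",  # deeper analysis than the default "fast"
    "is_screenshot_enabled": True,  # Base64 screenshot returned in 'metadata'
}


def make_extract_tool(**settings):
    """Hypothetical wrapper: build an ExtractWebDataTool from keyword settings."""
    # Imported lazily so the settings can be inspected without the package installed.
    from langchain_agentql.tools import ExtractWebDataTool

    return ExtractWebDataTool(**settings)


# In the notebook, with langchain_agentql installed and the API key set:
# extract_web_data_tool = make_extract_tool(**slow_page_settings)
```

Keeping the settings in a plain dictionary makes it easy to reuse or log the configuration alongside the extraction results.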
ExtractWebDataBrowserTool
To instantiate ExtractWebDataBrowserTool, you need to connect the tool with a browser instance.
You can set the following params:
- timeout: The number of seconds to wait for a request before timing out. Increase if data extraction times out. Defaults to 900.
- wait_for_network_idle: Whether to wait until the network reaches a full idle state before executing. Defaults to True.
- include_hidden: Whether to take into account visually hidden elements on the page. Defaults to True.
- mode: "standard" uses deep data analysis, while "fast" trades some depth of analysis for speed and is adequate for most use cases. Learn more about the modes in this guide. Defaults to "fast".
ExtractWebDataBrowserTool is implemented with AgentQL's SDK. You can find more details about the parameters and the functions in AgentQL's API References.
from langchain_agentql.tools import ExtractWebDataBrowserTool
from langchain_agentql.utils import create_async_playwright_browser
async_browser = await create_async_playwright_browser()
extract_web_data_browser_tool = ExtractWebDataBrowserTool(async_browser=async_browser)
GetWebElementBrowserTool
To instantiate GetWebElementBrowserTool, you need to connect the tool with a browser instance.
You can set the following params:
- timeout: The number of seconds to wait for a request before timing out. Increase if data extraction times out. Defaults to 900.
- wait_for_network_idle: Whether to wait until the network reaches a full idle state before executing. Defaults to True.
- include_hidden: Whether to take into account visually hidden elements on the page. Defaults to False.
- mode: "standard" uses deep data analysis, while "fast" trades some depth of analysis for speed and is adequate for most use cases. Learn more about the modes in this guide. Defaults to "fast".
GetWebElementBrowserTool is implemented with AgentQL's SDK. You can find more details about the parameters and the functions in AgentQL's API References.
from langchain_agentql.tools import GetWebElementBrowserTool
extract_web_element_tool = GetWebElementBrowserTool(async_browser=async_browser)
Invocation​
ExtractWebDataTool
This tool uses AgentQL's REST API under the hood, sending the publicly available web page's URL to AgentQL's endpoint. This will not work with private pages or logged-in sessions. Use ExtractWebDataBrowserTool for those use cases.
- url: The URL of the web page you want to extract data from.
- query: The AgentQL query to execute. Use an AgentQL query if you want to extract precisely structured data. Learn more about how to write an AgentQL query in the docs or test one out in the AgentQL Playground.
- prompt: A Natural Language description of the data to extract from the page. AgentQL will infer the data’s structure from your prompt. Use prompt if you want to extract data defined by free-form language without defining a particular structure.

Note: You must define either a query or a prompt to use AgentQL.
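The either-or rule in the note can be checked before the tool is called, so that a malformed input fails early with a clear message. The sketch below is a hypothetical helper, not part of langchain-agentql, and it assumes that exactly one of `query` or `prompt` should be sent.

```python
from typing import Optional


def build_extract_payload(
    url: str, query: Optional[str] = None, prompt: Optional[str] = None
) -> dict:
    """Build the input dict for ExtractWebDataTool.invoke.

    Hypothetical helper. Assumes exactly one of `query` or `prompt` is required.
    """
    # Reject both the "neither given" and the "both given" cases.
    if bool(query) == bool(prompt):
        raise ValueError("Provide exactly one of 'query' or 'prompt'")
    payload = {"url": url}
    if query:
        payload["query"] = query
    else:
        payload["prompt"] = prompt
    return payload
```

The returned dictionary has the same shape as the inputs passed to `extract_web_data_tool.invoke(...)` in this section.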
# You can invoke the tool with either a query or a prompt
# extract_web_data_tool.invoke(
# {
# "url": "https://www.agentql.com/blog",
# "prompt": "the blog posts with title, url, date of post and author",
# }
# )
extract_web_data_tool.invoke(
{
"url": "https://www.agentql.com/blog",
"query": "{ posts[] { title url date author } }",
},
)
{'data': {'posts': [{'title': 'Launch Week Recap—make the web AI-ready',
'url': 'https://www.agentql.com/blog/2024-launch-week-recap',
'date': 'Nov 18, 2024',
'author': 'Rachel-Lee Nabors'},
{'title': 'Accurate data extraction from PDFs and images with AgentQL',
'url': 'https://www.agentql.com/blog/accurate-data-extraction-pdfs-images',
'date': 'Feb 1, 2025',
'author': 'Rachel-Lee Nabors'},
{'title': 'Introducing Scheduled Scraping Workflows',
'url': 'https://www.agentql.com/blog/scheduling',
'date': 'Dec 2, 2024',
'author': 'Rachel-Lee Nabors'},
{'title': 'Updates to Our Pricing Model',
'url': 'https://www.agentql.com/blog/2024-pricing-update',
'date': 'Nov 19, 2024',
'author': 'Rachel-Lee Nabors'},
{'title': 'Get data from any page: AgentQL’s REST API Endpoint—Launch week day 5',
'url': 'https://www.agentql.com/blog/data-rest-api',
'date': 'Nov 15, 2024',
'author': 'Rachel-Lee Nabors'}]},
'metadata': {'request_id': '0dc1f89c-1b6a-46fe-8089-6cd0f082f094',
'generated_query': None,
'screenshot': None}}
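The response is a plain dictionary with the extracted records under 'data' and request details under 'metadata'. As a rough sketch of post-processing it, the helpers below sort the posts by date and decode the optional screenshot. Both function names are hypothetical, the sample is a trimmed copy of the output above, and the date format is an assumption based on how this particular page renders dates.

```python
import base64
from datetime import datetime
from typing import Optional

# Trimmed copy of the response shown above, used as sample input.
result = {
    "data": {
        "posts": [
            {"title": "Introducing Scheduled Scraping Workflows", "date": "Dec 2, 2024"},
            {"title": "Accurate data extraction from PDFs and images with AgentQL", "date": "Feb 1, 2025"},
            {"title": "Updates to Our Pricing Model", "date": "Nov 19, 2024"},
        ]
    },
    "metadata": {"request_id": "0dc1f89c-1b6a-46fe-8089-6cd0f082f094", "screenshot": None},
}


def posts_newest_first(response: dict) -> list:
    """Sort posts by date, newest first.

    Assumes dates look like 'Nov 18, 2024' and English month abbreviations.
    """
    posts = response.get("data", {}).get("posts", [])
    return sorted(
        posts,
        key=lambda post: datetime.strptime(post["date"], "%b %d, %Y"),
        reverse=True,
    )


def screenshot_bytes(response: dict) -> Optional[bytes]:
    """Decode the Base64 screenshot from 'metadata', or return None if absent."""
    encoded = response.get("metadata", {}).get("screenshot")
    return base64.b64decode(encoded) if encoded else None


for post in posts_newest_first(result):
    print(post["date"], "|", post["title"])
```

The screenshot is only populated when the tool is created with is_screenshot_enabled set to True, which is why `screenshot_bytes` tolerates a missing value.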
ExtractWebDataBrowserTool
- query: The AgentQL query to execute. Use an AgentQL query if you want to extract precisely structured data. Learn more about how to write an AgentQL query in the docs or test one out in the AgentQL Playground.
- prompt: A Natural Language description of the data to extract from the page. AgentQL will infer the data’s structure from your prompt. Use prompt if you want to extract data defined by free-form language without defining a particular structure.

Note: You must define either a query or a prompt to use AgentQL.
To extract data, first you must navigate to a web page using LangChain's Playwright tool.
from langchain_community.tools.playwright import NavigateTool
navigate_tool = NavigateTool(async_browser=async_browser)
await navigate_tool.ainvoke({"url": "https://www.agentql.com/blog"})
'Navigating to https://www.agentql.com/blog returned status code 200'
# You can invoke the tool with either a query or a prompt
# await extract_web_data_browser_tool.ainvoke(
# {'query': '{ blogs[] { title url date author } }'}
# )
await extract_web_data_browser_tool.ainvoke(
{"prompt": "the blog posts with title, url, date of post and author"}
)
/usr/local/lib/python3.11/dist-packages/agentql/_core/_utils.py:167: UserWarning: 🚨 The function get_data_by_prompt_experimental is experimental and may not work as expected 🚨
warnings.warn(
{'blog_posts': [{'title': 'Launch Week Recap—make the web AI-ready',
'url': 'https://www.agentql.com/blog/2024-launch-week-recap',
'date': 'Nov 18, 2024',
'author': 'Rachel-Lee Nabors'},
{'title': 'Accurate data extraction from PDFs and images with AgentQL',
'url': 'https://www.agentql.com/blog/accurate-data-extraction-pdfs-images',
'date': 'Feb 1, 2025',
'author': 'Rachel-Lee Nabors'},
{'title': 'Introducing Scheduled Scraping Workflows',
'url': 'https://www.agentql.com/blog/scheduling',
'date': 'Dec 2, 2024',
'author': 'Rachel-Lee Nabors'},
{'title': 'Updates to Our Pricing Model',
'url': 'https://www.agentql.com/blog/2024-pricing-update',
'date': 'Nov 19, 2024',
'author': 'Rachel-Lee Nabors'},
{'title': 'Get data from any page: AgentQL’s REST API Endpoint—Launch week day 5',
'url': 'https://www.agentql.com/blog/data-rest-api',
'date': 'Nov 15, 2024',
'author': 'Rachel-Lee Nabors'}]}
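The prompt-based result above is keyed by a name AgentQL inferred ('blog_posts'), whereas the REST tool's result earlier wraps its records in 'data'. If downstream code should not depend on either key, a small normalizing helper is one option. The function below is a hypothetical sketch that assumes the records of interest are the first list-valued entry in the result.

```python
def first_record_list(result) -> list:
    """Return the first list-valued entry of an extraction result.

    Hypothetical helper. Unwraps the REST tool's 'data' envelope when present,
    then returns the first value that is a list (for example 'posts' or 'blog_posts').
    """
    if not isinstance(result, dict):
        return []
    payload = result.get("data", result)
    if isinstance(payload, list):
        return payload
    if not isinstance(payload, dict):
        return []
    for value in payload.values():
        if isinstance(value, list):
            return value
    return []
```

With this, `first_record_list(...)` can be applied to the output of either extraction tool before iterating over the posts.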
GetWebElementBrowserTool
- prompt: A Natural Language description of the web element to find on the page.
selector = await extract_web_element_tool.ainvoke({"prompt": "Next page button"})
selector
"[tf623_id='194']"
from langchain_community.tools.playwright import ClickTool
# Disabling 'visible_only' will allow us to click on elements that are not visible on the page
await ClickTool(async_browser=async_browser, visible_only=False).ainvoke(
{"selector": selector}
)
"Clicked element '[tf623_id='194']'"
from langchain_community.tools.playwright import CurrentWebPageTool
await CurrentWebPageTool(async_browser=async_browser).ainvoke({})
'https://www.agentql.com/blog/page/2'
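The extract, find, and click steps above can be combined into a simple pagination loop. The sketch below is a hypothetical helper rather than part of either library: it only assumes that each tool exposes an async `ainvoke(dict)` method, as used in the cells above, and that a falsy selector means no next-page button was found.

```python
async def collect_pages(extract_tool, element_tool, click_tool, prompt, max_pages=3):
    """Extract data from up to `max_pages` pages, clicking "next" in between.

    Hypothetical helper. Each tool only needs an async `ainvoke(dict)` method.
    """
    pages = []
    for page_number in range(max_pages):
        # Extract from whatever page is currently active in the browser.
        pages.append(await extract_tool.ainvoke({"prompt": prompt}))
        if page_number == max_pages - 1:
            break  # page limit reached; no need to look for another page
        selector = await element_tool.ainvoke({"prompt": "Next page button"})
        if not selector:
            break  # assumption: a falsy result means there is no next page
        await click_tool.ainvoke({"selector": selector})
    return pages


# In the notebook, with the tools created earlier:
# pages = await collect_pages(
#     extract_web_data_browser_tool,
#     extract_web_element_tool,
#     ClickTool(async_browser=async_browser, visible_only=False),
#     "the blog posts with title, url, date of post and author",
# )
```

Capping the loop with `max_pages` keeps a misidentified button from causing an endless crawl. Depending on the site, you may also need to wait for the new page to finish loading after the click.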
Chaining​
You can use AgentQL tools in a chain by first binding one to a tool-calling model and then calling it: