When building a news-feed app, efficiently collecting news data from multiple web sources is a key problem. This article explains how to use Firecrawl, a powerful crawling tool, to scrape news data from different sites, and walks through deploying Firecrawl locally step by step.
Common approaches to collecting news data
A data-collection pipeline for a non-trivial app usually needs to scrape multiple web sources. The typical steps are:
- Crawl with a web scraper: use tools such as Python's Scrapy or BeautifulSoup to extract structured data from the target news sites, including title, body text, publication time, and author.
- Handle anti-bot measures: bypass a site's anti-scraping defenses by simulating real user behavior, rotating proxy IPs, and similar techniques.
- Clean and store the data: deduplicate, clean, and persist the scraped data so it can be displayed in the app.
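As a minimal sketch of the extraction step above, the snippet below parses a news page with BeautifulSoup. The HTML and CSS selectors here are hypothetical; in a real pipeline the HTML would come from an HTTP fetch (with proxy and anti-bot handling), and the selectors would have to match the target site's actual markup.

```python
from bs4 import BeautifulSoup

# Simulated HTML of a fetched news page; in practice this would come from
# something like requests.get(url).text, behind your proxy/anti-bot layer.
html = """
<article>
  <h1 class="title">Example headline</h1>
  <span class="author">Jane Doe</span>
  <time datetime="2024-05-01">May 1, 2024</time>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

def extract_article(page_html: str) -> dict:
    """Pull title, author, publish time, and body text out of one article page."""
    soup = BeautifulSoup(page_html, "html.parser")
    return {
        "title": soup.select_one("h1.title").get_text(strip=True),
        "author": soup.select_one("span.author").get_text(strip=True),
        "published": soup.select_one("time")["datetime"],
        "body": "\n".join(p.get_text(strip=True)
                          for p in soup.select("div.content p")),
    }

article = extract_article(html)
print(article["title"])
```

The cleaning/storage step would then normalize these fields (e.g. parse `published` into a timestamp) and deduplicate on URL or title before writing to the database.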
Firecrawl: a powerful crawling tool
Firecrawl is an innovative crawler that can scrape every accessible subpage of a site without needing a sitemap. Compared with traditional crawlers, Firecrawl is particularly good at sites whose content is generated dynamically with JavaScript, and it can convert the results into LLM-ready data.
Firecrawl's advantages
- No sitemap required: Firecrawl can crawl all accessible subpages of a site directly.
- Handles dynamic content: it is especially good at sites that render content with JavaScript.
- LLM-ready data: scraped pages can be converted into formats suitable for large language models (LLMs).
Deploying Firecrawl locally, step by step
1. Basic configuration
First, clone the Firecrawl repository to your local machine and make sure Docker is installed on the server.
Following the example in the official documentation, create and configure a .env file in the project root:
# .env
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379
#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false
# ===== Optional ENVS ======
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=
# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=
# Proxy Settings for Playwright (alternatively you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=
# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=
# Resend API Key for transactional emails
RESEND_API_KEY=
# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO
If you need Supabase, set USE_DB_AUTHENTICATION to true. For plain crawling, leaving it as false is fine.
2. Build and run the Docker containers
From the Firecrawl directory, start the containers:
docker compose up -d
3. Fetching data with Firecrawl
Firecrawl returns data in Markdown format along with metadata. Using an API client such as Apifox makes it easy to call the endpoints and inspect the results.
The response includes the page content as Markdown plus rich metadata.
Summary
Firecrawl is a powerful crawling tool, especially well suited to sites with dynamically generated content. Deploying it locally lets you collect news data efficiently and get it back as Markdown, which greatly simplifies downstream processing and display in your app.