When building a news app, efficiently fetching news data from multiple web sources is a key problem. This article walks through how to use Firecrawl, a powerful crawling tool, to scrape news data from different sites, and shares detailed steps for deploying Firecrawl locally.

Common approaches to fetching news data

For a complex application, data usually has to be scraped from multiple web sources. The common steps are:

  1. Use web-crawling techniques: extract structured data from target news sites with Python tools such as Scrapy or BeautifulSoup, including the title, body, publish time, author, and so on.
  2. Handle anti-crawling measures: bypass a site's anti-bot mechanisms by simulating user behavior, rotating proxy IPs, and similar techniques.
  3. Clean and store the data: deduplicate, clean, and store the fetched data so it can be displayed in the app.
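Steps 1 and 3 can be sketched with BeautifulSoup. Everything here is illustrative: the sample HTML and the CSS selectors are hypothetical, and each real news site needs selectors matched to its own markup (in practice the HTML would come from an HTTP request rather than an inline string):

```python
from bs4 import BeautifulSoup

# Sample page; in practice this would be requests.get(url).text,
# and the selectors below must be adapted to the target site's markup.
html = """
<article>
  <h1 class="title">Example headline</h1>
  <time datetime="2025-02-05">2025-02-05</time>
  <span class="author">Jane Doe</span>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

def parse_article(html: str) -> dict:
    """Step 1: extract title, publish time, author, and body from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one("h1.title").get_text(strip=True),
        "published": soup.select_one("time")["datetime"],
        "author": soup.select_one(".author").get_text(strip=True),
        "body": "\n".join(p.get_text(strip=True)
                          for p in soup.select(".content p")),
    }

def dedupe(articles: list[dict]) -> list[dict]:
    """Step 3: drop duplicates (same title + publish time) before storage."""
    seen, unique = set(), []
    for a in articles:
        key = (a["title"], a["published"])
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique

article = parse_article(html)
print(article["title"])                 # Example headline
print(len(dedupe([article, article])))  # 1
```

Step 2 (anti-bot handling) is site-specific and typically layers proxies, request throttling, and realistic headers on top of a parser like this.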

Firecrawl: a powerful crawling tool

Firecrawl is an innovative crawling tool that can scrape every accessible subpage of any website without needing a sitemap. Compared with traditional crawlers, Firecrawl is especially good at sites whose content is generated dynamically with JavaScript, and it can convert the scraped pages into LLM-ready data.

Firecrawl's advantages

  • No sitemap required: Firecrawl can directly crawl all accessible subpages of a site.
  • Handles dynamic content: particularly good at sites that render content with JavaScript.
  • LLM-ready data: the scraped data can be converted into formats suitable for large language models (LLMs).
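As a sketch, crawling an entire site through a locally deployed instance looks roughly like this. The `/v0/crawl` endpoints and response fields reflect the self-hosted API at the time of writing and should be checked against your instance's documentation; the target URL is a placeholder:

```python
import time
import requests

BASE = "http://localhost:3002"  # PORT=3002 from the .env below

def crawl_site(url: str, poll_seconds: float = 2.0) -> list[str]:
    """Submit a crawl job covering every accessible subpage of `url`,
    poll until it finishes, and return each page as markdown."""
    resp = requests.post(f"{BASE}/v0/crawl", json={"url": url})
    resp.raise_for_status()
    job_id = resp.json()["jobId"]
    while True:
        status = requests.get(f"{BASE}/v0/crawl/status/{job_id}").json()
        if status.get("status") == "completed":
            return extract_pages(status)
        time.sleep(poll_seconds)

def extract_pages(status_json: dict) -> list[str]:
    """Pull the LLM-ready markdown of each page out of a completed
    crawl-status response."""
    return [page["markdown"] for page in status_json.get("data", [])]

# With the containers running, something like:
#   pages = crawl_site("https://example-news-site.com")
# returns one markdown document per crawled subpage.
```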

Detailed steps for deploying Firecrawl locally

1. Basic configuration

First, clone the Firecrawl repository with Git and make sure Docker is installed on the server.

Following the example in the official documentation, create and configure a .env file in the project root:

# .env

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8 
PORT=3002
HOST=0.0.0.0

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379

#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN= 
SUPABASE_URL= 
SUPABASE_SERVICE_TOKEN=

# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=

# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=

# Proxy Settings for Playwright (Alternatively, you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=

# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=

# Resend API Key for transactional emails
RESEND_API_KEY=

# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO

If you need Supabase, set USE_DB_AUTHENTICATION to true. For ordinary crawling, false is fine.

2. Build and run the Docker containers

Enter the Firecrawl directory and start the Docker containers:

docker compose up -d

3. Fetching data with Firecrawl

Firecrawl returns data in Markdown format along with metadata. An API client such as Apifox makes it easy to call the interface and inspect the results.

The response includes the page content as Markdown plus rich metadata.
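The same request Apifox sends can also be issued from Python. This is a minimal sketch: the `/v0/scrape` path and the response shape follow the self-hosted API at the time of writing and should be verified against your instance, and the field names a news app stores are illustrative:

```python
import requests

def scrape_page(url: str, base: str = "http://localhost:3002") -> tuple[str, dict]:
    """Scrape one page through the local Firecrawl instance and
    return (markdown, metadata)."""
    resp = requests.post(f"{base}/v0/scrape", json={"url": url})
    resp.raise_for_status()
    data = resp.json()["data"]
    return data["markdown"], data.get("metadata", {})

def summarize(markdown: str, metadata: dict) -> dict:
    """Shape one scraped page into the fields a news app typically stores."""
    return {
        "title": metadata.get("title", ""),
        "source": metadata.get("sourceURL", ""),
        "body_markdown": markdown,
    }

# With the containers running:
#   md, meta = scrape_page("https://example-news-site.com/some-article")
#   record = summarize(md, meta)
```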

Summary

Firecrawl is a powerful crawling tool, particularly well suited to sites with dynamically generated content. By deploying Firecrawl locally, you can efficiently fetch news data and get it back as Markdown, which greatly simplifies downstream processing and display in the app.


Last modified: February 5, 2025