When building a news-feed app, efficiently collecting news data from multiple web sources is a key problem. This article explains how to use Firecrawl, a powerful crawling tool, to scrape news data from different sites, and walks through deploying Firecrawl locally step by step.
Common approaches to collecting news data
A data-collection pipeline for a non-trivial app usually needs to scrape multiple web sources. The typical steps are:
- Crawl with a web scraper: use tools such as Python's Scrapy or BeautifulSoup to extract structured data from the target news sites, including title, body text, publication time, and author.
- Handle anti-bot measures: bypass a site's anti-scraping defenses by simulating real user behavior, rotating proxy IPs, and similar techniques.
- Clean and store the data: deduplicate, clean, and persist the scraped data so it can be displayed in the app.
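As a minimal sketch of the extraction step above, the snippet below parses a news page with BeautifulSoup. The HTML and CSS selectors here are hypothetical; in a real pipeline the HTML would come from an HTTP fetch (with proxy and anti-bot handling), and the selectors would have to match the target site's actual markup.

```python
from bs4 import BeautifulSoup

# Simulated HTML of a fetched news page; in practice this would come from
# something like requests.get(url).text, behind your proxy/anti-bot layer.
html = """
<article>
  <h1 class="title">Example headline</h1>
  <span class="author">Jane Doe</span>
  <time datetime="2024-05-01">May 1, 2024</time>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

def extract_article(page_html: str) -> dict:
    """Pull title, author, publish time, and body text out of one article page."""
    soup = BeautifulSoup(page_html, "html.parser")
    return {
        "title": soup.select_one("h1.title").get_text(strip=True),
        "author": soup.select_one("span.author").get_text(strip=True),
        "published": soup.select_one("time")["datetime"],
        "body": "\n".join(p.get_text(strip=True)
                          for p in soup.select("div.content p")),
    }

article = extract_article(html)
print(article["title"])
```

The cleaning/storage step would then normalize these fields (e.g. parse `published` into a timestamp) and deduplicate on URL or title before writing to the database.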
Firecrawl: a powerful crawling tool
Firecrawl is an innovative crawler that can scrape every accessible subpage of a site without needing a sitemap. Compared with traditional crawlers, Firecrawl is particularly good at sites whose content is generated dynamically with JavaScript, and it can convert the results into LLM-ready data.
Firecrawl's advantages
- No sitemap required: Firecrawl can crawl all accessible subpages of a site directly.
- Handles dynamic content: it is especially good at sites that render content with JavaScript.
- LLM-ready data: scraped pages can be converted into formats suitable for large language models (LLMs).
Deploying Firecrawl locally, step by step
1. Basic configuration
First, clone the Firecrawl repository to your local machine and make sure Docker is installed on the server.
Following the example in the official documentation, create and configure a .env file in the project root:
# .env
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379
#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false
# ===== Optional ENVS ======
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=
# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=
# Proxy Settings for Playwright (alternatively you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=
# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=
# Resend API Key for transactional emails
RESEND_API_KEY=
# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO
If you need Supabase, set USE_DB_AUTHENTICATION to true. For plain crawling, leaving it as false is fine.
2. Build and run the Docker containers
From the Firecrawl directory, start the containers:
docker compose up -d
3. Fetching data with Firecrawl
Firecrawl returns data in Markdown format along with metadata. Using an API client such as Apifox makes it easy to call the endpoints and inspect the results.
The response includes the page content as Markdown plus rich metadata.
Summary
Firecrawl is a powerful crawling tool, especially well suited to sites with dynamically generated content. Deploying it locally lets you collect news data efficiently and get it back as Markdown, which greatly simplifies downstream processing and display in your app.