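Have you ever wondered what shoppers are actually saying in the reviews of the countless products on JD.com? For this write-up we built a systematic crawler and collected multi-dimensional review data for 975 best-selling products on JD, 8,546 valid reviews in total. The technical approach and the hands-on steps follow.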
III. Precise collection: code implementation
1. Parsing the comment list page
    import re
    import json
    from bs4 import BeautifulSoup

    def parse_comment_list(html):
        """Parse one comment list page; returns (comments, total_comments, has_next)."""
        soup = BeautifulSoup(html, "html.parser")
        script_tag = soup.find("script", id="J-product评论-列表")
        if not script_tag:
            # Always return the same three-value shape so callers can unpack safely
            return [], 0, False
        # Extract the JSON payload (JD stores the comment data in a JS variable)
        match = re.search(
            r'window\.__INITIAL_STATE__=(.*?);</script>', str(script_tag), re.S
        )
        if not match:
            return [], 0, False
        data = json.loads(match.group(1))
        comments = []
        for item in data["comments"]:
            comments.append({
                "comment_id": item["id"],
                "content": item["content"],
                "score": item["score"],
                "user_name": item["userName"],
                "creation_time": item["creationTime"],
                "useful_votes": item["usefulVoteCount"],
                "reply_count": item["replyCount"],
                "images": [img["imgUrl"] for img in item.get("images", [])],
                "user_level": item["userLevelName"],
                # Color and size are optional; strip the separator when one is missing
                "product_model": (
                    item.get("productColor", "") + " " + item.get("productSize", "")
                ).strip(),
            })
        total_comments = data["productCommentSummary"]["commentCount"]
        has_next = data["page"]["pageNo"] < data["page"]["pageTotal"]
        return comments, total_comments, has_next
2. Deep collection loop (with pagination)
Result Object (sample comment response, `api_type`: `jd`):

    {
      "items": {
        "totalpage": "100",
        "total_results": 20000,
        "page_size": 10,
        "page": "1",
        "item": [
          {
            "rate_id": "21992238159",
            "rate_content": "物流和产品都不错,性价比高,赞赞赞 质量非常好,客服态度非常非常赞,有问题及时给解决了购物体验很棒,商品物美价廉,质量优秀。物流迅速,商家服务贴心,售后无忧。高颜值,高品质,非常好,一分钱一分货,材质外观和质量一看就很上档次,非常喜欢",
            "rate_date": "2024-12-23 13:49:08",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/268223/27/1816/23111/6768f9d3F79259578/1f946da747fb3842.jpg.dpg",
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/267912/12/1941/21072/6768f9d3F8603a860/4778a28cfc8bb02d.jpg.dpg"
            ],
            "display_user_nick": "唐***月",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "21940879411",
            "rate_content": "这条充电线质量非常好,线材柔软,使用寿命长。充电速度快,兼容性强,适用于多种设备。外观设计简洁大方,白色外观显得干净整洁。而且价格合理,性价比很高。使用了一段时间,没有出现任何质量问题,非常满意。推荐给需要充电线的朋友们,绝对物超所值!",
            "rate_date": "2024-12-13 23:39:55",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/228791/35/34183/56412/675c553fFa70cd813/2e1022f9a25e945a.jpg.dpg",
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/195256/23/50656/47240/675c5541F79f5af5e/e4d60651ed0c626f.jpg.dpg"
            ],
            "display_user_nick": "j***j",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "21971245211",
            "rate_content": "快递很快,质量棒极了,建议购买强烈推荐!商品物超所值,质量可靠。物流快,商家服务热情,售后服务完善。物流很快, 产品很快就收到了,比想象中还好,不错不错!希望能耐用商品质量非常好,外观设计新颖,物流速度快,商家服务态度好,性价比高。",
            "rate_date": "2024-12-20 06:09:37",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/262487/32/530/222631/67649996F59b29dbb/41daef4d63774912.jpg.dpg",
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/254171/36/1720/29817/6764999eF1e3da91c/49f9e4e3c0f7b489.jpg.dpg"
            ],
            "display_user_nick": "j***b",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "22505131588",
            "rate_content": "这款充电线质量真心不错! 用了两年,依然如新,充电速度也很快,完全满足日常需求。非常满意的一次购物体验! \n",
            "rate_date": "2025-03-05 20:55:47",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/261336/2/28436/316471/67c849d2Fd7ca649e/834ef8563c36f823.jpg.dpg"
            ],
            "display_user_nick": "郑***c",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "21921806331",
            "rate_content": "真的超级喜欢,非常支持,质量非常好,与卖家描述的完全一致,非常满意,真的很喜欢,完全超出期望值,发货速度非常快,包装非常仔细、严实,物流公司服务态度很好,运送速度很快,很满意的一次购物",
            "rate_date": "2024-12-10 14:29:28",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/151256/8/50806/62364/6757dfc7Fb0e0d3d1/cfcfcea8f15baeb6.jpg.dpg"
            ],
            "display_user_nick": "j***6",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "22023511937",
            "rate_content": "这个商品的质量真是太好了,用起来非常顺手,效果也很满意。外观精美,不仅提升了使用体验,还为家居增添了美感。价格虽然高了一些,但相比其优良的品质和体验,绝对是物超所值。强烈推荐给追求品质生活的你!",
            "rate_date": "2024-12-29 08:22:00",
            "pics": [],
            "display_user_nick": "驰***生",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "23041323552",
            "rate_content": "冲电器大小适中,冲电非常的快并且不发热。非常不错!",
            "rate_date": "2025-04-21 17:06:36",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/283321/39/23600/2544182/68060a9bF35b67b24/1ccea6f930ef3986.jpg.dpg",
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/271623/13/24033/2449752/68060a9bFf4aa3b47/184ce7b5a88a25f4.jpg.dpg"
            ],
            "display_user_nick": "j***c",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "22823734129",
            "rate_content": "很好的充电套装,线足够长,充电也够快,非常满意。",
            "rate_date": "2025-04-04 17:55:47",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/281947/15/15093/51962/67efac76Ff463f982/a596114283fc4e3a.jpg.dpg"
            ],
            "display_user_nick": "雪***泳",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "22036250935",
            "rate_content": "东西质量非常好,与卖家描述的完全一致,非常满意\n做工质感:好\n充电速度:好\n便携性能:好\n安全性能:好\n其他特色:好",
            "rate_date": "2024-12-31 12:06:34",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/262528/8/5909/64887/67736dcaF7461323d/7565029f66830601.jpg.dpg"
            ],
            "display_user_nick": "j***a",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          },
          {
            "rate_id": "22666913189",
            "rate_content": "非常不错,质感很好,充电快",
            "rate_date": "2025-03-22 17:55:18",
            "pics": [
              "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/276088/1/7952/71045/67de8905Fecb7c861/5bf2b9f89b0b7897.jpg.dpg"
            ],
            "display_user_nick": "o***g",
            "videos": [],
            "auction_sku": null,
            "add_feedback": null
          }
        ],
        "_ddf": "fb"
      },
      "secache": "5dc2b1edf5008bcf6577411b1f5fbd16",
      "secache_time": 1749537314,
      "secache_date": "2025-06-10 14:35:14",
      "translate_status": "",
      "translate_time": 0,
      "language": {
        "default_lang": "cn",
        "current_lang": "cn"
      },
      "error": "",
      "reason": "",
      "error_code": "0000",
      "cache": 0,
      "api_info": "today:71 max:10000 all[374=71+49+254];expires:2030-10-30",
      "execution_time": "4.646",
      "server_time": "Beijing/2025-06-10 14:35:14",
      "client_ip": "106.6.46.187",
      "call_args": {
        "num_iid": "10114820943599",
        "data": "1"
      },
      "api_type": "jd",
      "translate_language": "zh-CN",
      "translate_engine": "google_new",
      "server_memory": "3.33MB",
      "request_id": "gw-3.6847d21e3ae75",
      "last_id": "4513851984"
    }
IV. Performance optimization suggestions
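- Distributed crawler architecture:

        ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
        │  Scheduler  │     │Crawler nodes│     │ Data store  │
        │   (Redis)   │←───→│  (Scrapy)   │←───→│  (MongoDB)  │
        └─────────────┘     └─────────────┘     └─────────────┘
               ↑                   ↑                   ↑
               └───────────────────┼───────────────────┘
                            ┌──────┴──────┐
                            │ Proxy pool  │
                            └──────┬──────┘
                            ┌──────┴──────┐
                            │  Cleaning   │
                            └─────────────┘

- Incremental collection:
  Use Redis to record the last collection time and comment ID, and collect only comments newer than that marker, which cuts down on repeated requests.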
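The incremental idea can be sketched roughly as follows. This is a minimal illustration, not the original implementation. The key name `jd:last_seen:<sku>`, the `filter_new_comments` helper and the `InMemoryStore` stand-in are all hypothetical. The sketch also assumes that `creation_time` uses the sortable `YYYY-MM-DD HH:MM:SS` format and that comment IDs are numeric and grow over time, so the ID can break ties between comments with the same timestamp.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryStore:
    """Dict-backed stand-in exposing get/set, so the sketch runs without a Redis server.

    Assumption: a redis-py client created with decode_responses=True could be passed
    instead, since it also offers get(key) and set(key, value) returning/accepting str.
    """

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def _marker(comment: Dict) -> Tuple[str, int]:
    # (time, id) pair: time orders the comments, id breaks ties within one second
    return comment["creation_time"], int(comment["comment_id"])


def filter_new_comments(store, sku: str, comments: List[Dict]) -> List[Dict]:
    """Return only comments newer than the stored marker, then advance the marker."""
    key = f"jd:last_seen:{sku}"  # hypothetical key layout
    raw = store.get(key)
    if raw:
        last_time, last_id = raw.split("|")
        last_seen = (last_time, int(last_id))
    else:
        last_seen = ("", 0)  # nothing collected yet for this product
    new_comments = [c for c in comments if _marker(c) > last_seen]
    if new_comments:
        newest_time, newest_id = max(_marker(c) for c in new_comments)
        store.set(key, f"{newest_time}|{newest_id}")
    return new_comments


store = InMemoryStore()
first_batch = [
    {"comment_id": "101", "creation_time": "2024-12-10 14:29:28"},
    {"comment_id": "102", "creation_time": "2024-12-13 23:39:55"},
]
second_batch = first_batch + [
    {"comment_id": "103", "creation_time": "2024-12-20 06:09:37"},
]
first_new = filter_new_comments(store, "demo-sku", first_batch)
second_new = filter_new_comments(store, "demo-sku", second_batch)
print(len(first_new), len(second_new))
```

Because the marker only moves forward, a crawl that is stopped and restarted picks up where the previous run ended instead of storing the same comments again.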
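V. Notes
- Responding to JD anti-crawling upgrades:
  - Regularly check for changes in page structure (for example, comment data moving from a JS variable to a JSON endpoint)
  - Use Selenium + undetected-chromedriver to get past the latest anti-crawling detection
- Code maintenance cost:
  Crawler code needs frequent adaptation to JD page updates; pairing it with an automation tool such as Playwright is suggested to improve robustness.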
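One way to notice structure changes early is to validate the parsed payload before using it. The sketch below is an assumption-laden illustration: `find_structure_problems` is a hypothetical helper, and the required keys are simply the ones `parse_comment_list` reads, not a documented JD schema.

```python
from typing import Dict, List

# Keys that parse_comment_list relies on (taken from that function, not from any JD spec)
REQUIRED_TOP_LEVEL = ("comments", "productCommentSummary", "page")
REQUIRED_COMMENT_FIELDS = ("id", "content", "score", "creationTime")


def find_structure_problems(data: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks as expected."""
    problems: List[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    for index, item in enumerate(data.get("comments", [])):
        missing = [field for field in REQUIRED_COMMENT_FIELDS if field not in item]
        if missing:
            problems.append(f"comment {index} missing fields: {', '.join(missing)}")
    return problems


good_payload = {
    "comments": [
        {"id": 1, "content": "ok", "score": 5, "creationTime": "2024-12-10 14:29:28"}
    ],
    "productCommentSummary": {"commentCount": 1},
    "page": {"pageNo": 1, "pageTotal": 1},
}
changed_payload = {"comments": [{"id": 2, "content": "ok"}], "page": {}}

print(find_structure_problems(good_payload))
print(find_structure_problems(changed_payload))
```

Running a check like this on the first page of every crawl turns a silent layout change into an explicit error message rather than a `KeyError` deep inside the parser.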
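With the approach above, JD reviews can be collected precisely while staying compliant. Use a crawler only when it is truly necessary, and strictly limit the scale of collection.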