I'm trying to crawl review data from Amazon in a Jupyter notebook.

But I get a 503 response from the server.

Does anyone know what's wrong?

Here is the URL: https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=

Here is my code.

import re, requests, csv 
from bs4 import BeautifulSoup 
from time import sleep

def reviews_info(div): 
    review_text = div.find("div", "a-row a-spacing-small review-data").get_text() 
    review_author = div.find("span", "a-profile-name").get_text()
    review_stars = div.find("span", "a-icon-alt").get_text() 
    on_review_date = div.find('span', 'a-size-base a-color-secondary review-date').get_text() 
    review_date = [x.strip() for x in re.sub("on ", "", on_review_date).split(",")] 

    return { "review_text" : review_text, 
            "review_author" : review_author, 
            "review_stars" : review_stars, 
            "review_date": review_date }

base_url = 'https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber='


reviews = [] 

NUM_PAGES = 8

for page_num in range(1, NUM_PAGES + 1): 
    print("souping page", page_num, ",", len(reviews), "data collected") 
    url = base_url + str(page_num) 
    soup = BeautifulSoup(requests.get(url).text, 'lxml') 

    for div in soup('div', 'a-section review'): 
        reviews.append(reviews_info(div)) 
    
    sleep(30)
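As a side check, the date parsing inside `reviews_info` can be run on its own; a minimal sketch, assuming the raw date text looks like "on January 7, 2021" (a made-up sample, the live page may phrase it differently):

```python
import re

# Hypothetical raw date text; the live Amazon page may differ
on_review_date = "on January 7, 2021"

# Same transformation as in reviews_info
review_date = [x.strip() for x in re.sub("on ", "", on_review_date).split(",")]
print(review_date)  # ['January 7', '2021']
```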

Finally I tried

requests.get(url)

The output is

<Response [503]>

And I also tried

requests.get(url).text()

The output is

TypeError: 'str' object is not callable
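(Editor's note: this TypeError is unrelated to the 503. `Response.text` is a property, not a method, so it should be accessed as `requests.get(url).text`, without parentheses. A minimal illustration with a stand-in class:)

```python
class FakeResponse:
    """Stand-in mimicking requests.Response, where text is a property."""
    @property
    def text(self):
        return "<html></html>"

r = FakeResponse()
print(r.text)    # attribute access works: <html></html>
try:
    r.text()     # calling the returned str raises the error
except TypeError as e:
    print(e)     # 'str' object is not callable
```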

Is Amazon blocking crawling?

I'd appreciate your answers!

0
kw Hyun, 7 January 2021, 17:03

2 answers

Best answer

Amazon blocks requests to its servers when you try to crawl it with the Python requests library. You can try Selenium with a Chromium browser instead, which may work. Here are the Python bindings for Selenium: https://selenium-python.readthedocs.io/.

0
Nikita Galibin, 8 January 2021, 00:13

I tried the webdriver.

Here is my code.

from selenium import webdriver
import re
import requests 
import csv 
from bs4 import BeautifulSoup 
from time import sleep

review_list = []
NUM_PAGE = 8

base_url = 'https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber='

# Create the driver once and reuse it for every page
chrome_driver = '/Users/chromedriver'
driver = webdriver.Chrome(chrome_driver)

for num_page in range(1, NUM_PAGE + 1):
    url = base_url + str(num_page)
    driver.get(url)

    # page_source is already a str, so from_encoding is not needed
    src = driver.page_source
    source = BeautifulSoup(src, 'lxml')

    divs = source.find_all('div', 'a-section celwidget')
    print("souping page", num_page, ",", len(divs), "reviews collected")

    for div in divs:
        review_text = div.find("div", "a-row a-spacing-small review-data").get_text()
        review_author = div.find("span", "a-profile-name").get_text()
        review_stars = div.find("span", "a-icon-alt").get_text()
        on_review_date = div.find('span', 'a-size-base a-color-secondary review-date').get_text()
        #review_date = [x.strip() for x in re.sub("on ", "", on_review_date).split(",")]

        review = { "review_text" : review_text,
                "review_author" : review_author,
                "review_stars" : review_stars,
                "review_date": on_review_date }

        review_list.append(review)

    sleep(10)

driver.quit()
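The `csv` import above is never used; a sketch of writing the collected reviews to a CSV file (the sample row below is hypothetical, shaped like the dicts built in the loop):

```python
import csv

# Hypothetical sample rows in the same shape the loop above builds
review_list = [
    {"review_text": "Great sound.", "review_author": "A. Buyer",
     "review_stars": "5.0 out of 5 stars", "review_date": "on January 7, 2021"},
]

with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["review_text", "review_author",
                                           "review_stars", "review_date"])
    writer.writeheader()
    writer.writerows(review_list)
```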
0
kw Hyun, 8 January 2021, 01:28