I'm a beginner at web scraping and I followed this tutorial to extract movie data from this link. I chose to extract movies released between 2016 and 2019 for testing. I only get 25 rows, but I want more than 30,000. Do you think that's possible?

Here's the code:

from requests import get
from bs4 import BeautifulSoup
import csv
import pandas as pd
from time import sleep, time
from random import randint
from warnings import warn
from IPython.core.display import clear_output

headers = {"Accept-Language": "en-US, en;q=0.5"}

pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2018)]

url = 'https://www.imdb.com/search/title?release_date=2016-01-01,2019-05-01'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

names = []
years = []
imdb_ratings = []
metascores = []
votes = []
start_time = time()
requests = 0

for year_url in years_url:
    # For every page in the interval 1-4
    for page in pages:
        # Make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url + '&sort=num_votes,desc&page=' + page, headers=headers)
        # Pause the loop
        sleep(randint(8, 15))
        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
        clear_output(wait=True)
        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))
        # Break the loop if the number of requests is greater than expected
        if requests > 72:
            warn('Number of requests was greater than expected.')
            break

# Parse the content of the request with BeautifulSoup
page_html = BeautifulSoup(response.text, 'html.parser')
# Select all the 50 movie containers from a single page
mv_containers = page_html.find_all('div', class_='lister-item mode-advanced')

# Extract data from individual movie container
for container in movie_containers:
    # If the movie has a Metascore, then extract:
    if container.find('div', class_='ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_='metascore').text
        metascores.append(int(m_score))
        # The number of votes
        vote = container.find('span', attrs={'name': 'nv'})['data-value']
        votes.append(int(vote))

movie_ratings = pd.DataFrame({'movie': names,
                              'year': years,
                              'imdb': imdb_ratings,
                              'metascore': metascores,
                              'votes': votes})

# Data cleansing
movie_ratings = movie_ratings[['movie', 'year', 'imdb', 'metascore', 'votes']]
movie_ratings.head()
movie_ratings['year'].unique()
movie_ratings.to_csv('movie_ratings.csv')
Aziza Sbai El Idrissi, 2 June 2019, 03:17

1 answer

It's hard to say exactly what the problem is here given the lack of functions, but from what I can see, you need to parse each page separately: after every request, you need to parse that response's text, otherwise you only ever collect the containers from a single page. I suspect the main issue is the ordering of your code, so I'd suggest restructuring it with functions.
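As a rough sketch of what I mean: pull the extraction logic into a function and call it once per response, inside the loop. The function name `parse_page` and the inline HTML fragment below are just illustrative (the fragment mimics one of IMDb's `lister-item mode-advanced` containers so the sketch is self-contained, it is not live data):

```python
from bs4 import BeautifulSoup

def parse_page(page_html, names, years, imdb_ratings, metascores, votes):
    """Extract every movie that has a Metascore from one page's soup."""
    for container in page_html.find_all('div', class_='lister-item mode-advanced'):
        if container.find('div', class_='ratings-metascore') is None:
            continue  # skip movies without a Metascore
        names.append(container.h3.a.text)
        years.append(container.h3.find('span', class_='lister-item-year').text)
        imdb_ratings.append(float(container.strong.text))
        metascores.append(int(container.find('span', class_='metascore').text))
        votes.append(int(container.find('span', attrs={'name': 'nv'})['data-value']))

# Illustrative fragment shaped like one IMDb result container:
sample = '''
<div class="lister-item mode-advanced">
  <h3><a>Some Movie</a> <span class="lister-item-year">(2017)</span></h3>
  <strong>7.8</strong>
  <div class="ratings-metascore"><span class="metascore">70</span></div>
  <span name="nv" data-value="12345">12,345</span>
</div>
'''
names, years, imdb_ratings, metascores, votes = [], [], [], [], []
parse_page(BeautifulSoup(sample, 'html.parser'),
           names, years, imdb_ratings, metascores, votes)
print(names, years, imdb_ratings, metascores, votes)
```

In your scraper, the call would then go right after each request, i.e. inside the `for page in pages:` loop: `parse_page(BeautifulSoup(response.text, 'html.parser'), names, years, imdb_ratings, metascores, votes)`. That way every one of the 4 pages per year contributes rows, instead of only the first response you parsed.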

Aero Blue, 2 June 2019, 00:45