Day 66 TIL

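Today, as part of the project, I crawled the web for each book's title, image, detail-page link, and description, and worked on saving that data to the database.

There are two reasons for this step. The book recommendation system will be built on cosine similarity, so we need to collect the data it will run on, and the frontend also needs data to display.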
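For context, cosine similarity is the dot product of two vectors divided by the product of their norms. The following is only a minimal sketch in plain Python: the word-count vectors, the helper names `to_vector` and `cosine_similarity`, and the sample descriptions are made up for illustration and are not the project's actual feature representation.

```python
import math
from collections import Counter


def to_vector(text: str) -> Counter:
    """Naive whitespace tokenizer producing a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty description as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical book descriptions, for demonstration only.
book_a = to_vector("a story about a young wizard")
book_b = to_vector("a story about a young detective")
book_c = to_vector("introduction to linear algebra")

print(cosine_similarity(book_a, book_b))  # shares most words with book_a
print(cosine_similarity(book_a, book_c))  # shares no words with book_a
```

A real recommender would probably weight words (for example with TF-IDF) rather than use raw counts, but the similarity formula stays the same.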

import requests
import bs4

BASE_URL = "http://www.yes24.com"

for page in range(1, 3):  # bestseller list pages 1 and 2
    list_url = f"{BASE_URL}/24/category/bestseller?CategoryNumber=001&sumgb=06&PageNumber={page}&FetchSize=80"
    list_html = requests.get(list_url).text
    soup = bs4.BeautifulSoup(list_html, 'html.parser')

    # The bestseller table; each book's link and cover image sit in a div.goodsImgW
    target = soup.find('table', {'id': 'category_layout', 'class': 'list'})
    T = target.find_all('div', {'class': 'goodsImgW'})

    # Split each div's HTML into lines. The indexing below assumes that
    # line 1 holds the <a href="..."> tag and line 2 the <img alt="..." src="..."> tag.
    text_list = [str(x).split('\n') for x in T]
    print(text_list[0][2])
    for i in range(0, 5):  # first five books on each page
        title = text_list[i][2].split('"')[1]      # img alt attribute
        print(title)
        print("---------------------------")
        image_url = text_list[i][2].split('"')[3]  # img src attribute
        print(image_url)
        print("---------------------------")
        href = text_list[i][1].split('"')[1]       # relative link to the detail page
        print(BASE_URL + href)
        print("---------------------------")

        # Request the detail page and pull the description out of its textarea
        each_raw = requests.get(BASE_URL + href,
                                headers={"User-Agent": "Mozilla/5.0"})

        each_html = bs4.BeautifulSoup(each_raw.text, 'html.parser')
        contents = each_html.select("textarea.txtContentText")[0]
        content = contents.get_text().replace('\r\n', "")

I used the beautifulsoup4 package for the web crawling.

We needed data on bestselling books, so we decided to use the site Yes24.

YES24 | Bestsellers (www.yes24.com)

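1. Identifying the HTML elements and drilling down to them

To get each book's image and title, I opened the browser's inspector and worked out how the page's elements are laid out. The img tag, with the src and alt attributes I needed, sits inside a div with the class goodsImgW, so I selected those divs like this:

target.find_all('div', {'class' : 'goodsImgW'})

I then converted each result to a string, split it into a list of lines, and pulled out just the text I needed. That part looks like this: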

text_list = [str(x).split('\n') for x in T]
print(text_list[0][2])
for i in range(0, 5):
    title = text_list[i][2].split('"')[1]      # img alt attribute
    print(title)
    print("---------------------------")
    image_url = text_list[i][2].split('"')[3]  # img src attribute
    print(image_url)
    print("---------------------------")
    href = text_list[i][1].split('"')[1]       # relative link to the detail page
    print("http://www.yes24.com" + href)
    print("---------------------------")

The output then shows the title, image URL, detail-page link, and content, in that order.
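Splitting the tag's string form depends on the exact line breaks and attribute order of the page, so it can break if the markup shifts. A sturdier variant, sketched here, reads the attributes straight from the parsed tags. The sample HTML is a simplified stand-in I wrote to mirror the structure described in this post (a `div.goodsImgW` wrapping an `a` and an `img`); the real page markup may differ, and `parse_books` is a hypothetical helper name.

```python
import bs4

BASE_URL = "http://www.yes24.com"

# Simplified, made-up markup that mimics the structure described above.
SAMPLE_HTML = """
<table id="category_layout" class="list">
  <tr><td>
    <div class="goodsImgW">
      <a href="/Product/Goods/1">
        <img alt="Sample Book One" src="http://example.com/1.jpg">
      </a>
    </div>
  </td></tr>
  <tr><td>
    <div class="goodsImgW">
      <a href="/Product/Goods/2">
        <img alt="Sample Book Two" src="http://example.com/2.jpg">
      </a>
    </div>
  </td></tr>
</table>
"""


def parse_books(html):
    """Return title, image URL, and detail URL for each div.goodsImgW."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    books = []
    for box in soup.select("table#category_layout div.goodsImgW"):
        link = box.find("a", href=True)
        img = box.find("img")
        if link is None or img is None:
            continue  # skip entries that lack a link or a cover image
        books.append({
            "title": img.get("alt", ""),
            "image_url": img.get("src", ""),
            "detail_url": BASE_URL + link["href"],
        })
    return books


books = parse_books(SAMPLE_HTML)
for book in books:
    print(book["title"], book["image_url"], book["detail_url"])
```

Because the values come from the tag attributes, this version does not care whether the `alt` or `src` attribute appears first.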

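One more thing: the description lives on each book's detail page, which means following a second URL. I sent a new request to the link obtained above and extracted the content the same way as before.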

# Request the detail page; href comes from the listing loop above
each_raw = requests.get("http://www.yes24.com" + href,
                        headers={"User-Agent": "Mozilla/5.0"})

each_html = bs4.BeautifulSoup(each_raw.text, 'html.parser')
# The description is stored in the first textarea.txtContentText element
contents = each_html.select("textarea.txtContentText")[0]
content = contents.get_text().replace('\r\n', "")

print("Result:", content)
print("---------------------------")
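The storage step itself is not shown in this post. As a rough sketch of how the scraped fields could be saved, here is a version that assumes SQLite and a hypothetical `books` table; the project's actual database and schema are not given here, and `save_books` and the sample record are made up for illustration.

```python
import sqlite3


def save_books(conn, books):
    """Insert scraped book records, keyed by their detail-page URL."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS books (
            detail_url TEXT PRIMARY KEY,
            title      TEXT NOT NULL,
            image_url  TEXT,
            content    TEXT
        )
        """
    )
    # INSERT OR REPLACE keeps repeated crawler runs from duplicating rows.
    conn.executemany(
        "INSERT OR REPLACE INTO books (detail_url, title, image_url, content) "
        "VALUES (:detail_url, :title, :image_url, :content)",
        books,
    )
    conn.commit()


# In-memory database and a made-up record, for demonstration only.
conn = sqlite3.connect(":memory:")
sample = [{
    "detail_url": "http://www.yes24.com/Product/Goods/1",
    "title": "Sample Book One",
    "image_url": "http://example.com/1.jpg",
    "content": "A placeholder description.",
}]
save_books(conn, sample)
save_books(conn, sample)  # second run replaces the row instead of adding one

row_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(row_count)
```

Using the detail-page URL as the primary key is one possible choice; any stable identifier for a book would serve the same purpose.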

Tomorrow I plan to work on cosine similarity.