파이썬 나도코딩 웹크롤링 요약

출처 : https://www.youtube.com/watch?v=yQ20jZwDjTE

웹 스크래핑 : 웹페이지에서 원하는 부분만 추출하는 것.

웹 크롤링 : 웹페이지 내에 있는 링크들을 따라가면서 모든 내용 가져오는 것.

xpath

xpath란? 절대 경로를 짧게 줄일 수 있는 것이다.

브라우저에서 자동으로 해준다.

전체경로 : /학교/학년/반/학생[2]

xpath : //*[@학번=”1-1-5”]

전체경로 : /html/body/div/div/…../a(id=”login”)

xpath : //*[@id=”login”]

// : ‘//’ 아래의 모든 태그

<학교 이름='데이터고등학교'>
    <학년 value='1학년'>
        <반 value='1반'>
            <학생 value='1번', 학번='1-1-1'>이지은</학생>
            <학생 value='2번', 학번='1-1-2'>유재석</학생>

        </반>
        <반 value='2반'/>
    </학년>
</학교>

requests

import requests
res = requests.get("http://naver.com")
print("응답코드", res.status_code)  # 200 정상 동작
res = requests.get("http://*.tistory.com")
print("응답코드", res.status_code)  # 403 error

requests 오류 검출 1

res = requests.get("http://*.tistory.com")
if res.status_code == requests.codes.ok:
	print('정상입니다')
else:
	print('error code :', res.status_code)

requests 오류 검출 1

res = requests.get("http://*.tistory.com")
res.raise_for_status()

문제가 생기면 raise_for_status에서 오류를 내고 프로그램 종료한다.

정규식

정규식 예제 코드

import re

def print_match(m):
    if m:
        print(m.group())  # care
        print(m.string)  # good care
        print(m.start())  # 5
        print(m.end())  # 9
        print(m.span())  # (5, 9)
    else:
        print("매칭되지 않음")

p = re.compile("ca.e")
m = p.match("case")        # match : 처음부터 일치 확인
print_match(m)

정규표현식 규칙 적용

re.compile(‘정규식’)

. : 하나의 문자 의미

^ : 문자열의 시작 de로 시작하는 문자

$ : 문자열의 끝. case, base

정규표현식 문자 매치

re.complie().(match, search, findall 같은)특정 메소드로 문자가 일치하는지 확인한다.

match : 처음부터 일치 확인

search : 문자열 중 일치하는게 있나 확인

findall : 일치하는 모든것 리스트로 반혼

user agent

user agent 란? 웹에 접속할 때 자신의 정체를 알리는 헤더. 웹은 user agent를 통해 봇 판별, 모바일, os 등 확인 가능하다.

user agent 확인 사이트 : http://m.avalon.co.kr/check.html

url = "http://nadocoding.tistory.com"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
res = requests.get(url, headers=headers)

headers 에 User-Agent 정보를 저장한다.
requests.get에 headers를 같이 보낸다.
웹은 접속자를 해당 User-Agent로 인식한다.

naver webtoon crawling

import requests
from bs4 import BeautifulSoup

url = "https://comic.naver.com/webtoon/weekday"
res = requests.get(url)
res.raise_for_status()

# print(res.text)
soup = BeautifulSoup(res.text, 'lxml')
print(soup)

res.text란 우리가 가져온 html 문서를
lxml parser를 통해서
BeautifulSoup 객체로 만들었다.

BeautifulSoup4 get_text() vs .text

출처 : https://stackoverflow.com/questions/35496332/differences-between-text-and-get-text

It looks like [.text is just a property that calls get_text](http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L899). Therefore, calling get_text without arguments is the same thing as .text. However, get_text can also support various keyword arguments to change how it behaves (separator, strip, types). If you need more control over the result, then you need the functional form.

요약

get_text()에 argument 없이 사용할 경우 .text와 동일하다.

하지만 get_text(separator, strip, types)와 같은 옵션으로 원하는 결과를 지정할 수 있다.

soup

soup.a → soup 내부에 처음 만나는 a태그 출력

soup = BeautifulSoup(res.text, 'lxml')
print(soup.title.text)
print(soup.a)

attrs

print(soup.a)
print(soup.a.attrs)
print(soup.a['href'])

결과값

<a href="#menu" onclick="document.getElementById('menu').tabIndex=-1;document.getElementById('menu').focus();return false;"><span>메인 메뉴로 바로가기</span></a>
{'href': '#menu', 'onclick': "document.getElementById('menu').tabIndex=-1;document.getElementById('menu').focus();return false;"}
#menu

attrs : dictionary 형태로 속성들을 반환해준다.

태그['속성']⇒ 값을 출력한다.

find

find argument에 해당하는 첫번째 태그를 반환한다.

soup.find("a", attrs={"class": "Nbtn_upload"})
soup.find("a", class_="Nbtn_upload")
soup.find("a", href="/mypage/myActivity")

ex) 처음으로 만나는 a태그 class=”Nbtn_upload”인 태그를 반환한다.

attrs를 사용하지 않고 attr을 바로 작성할 수 있다.

class의 경우에는 python의 class와 겹치기 때문에 class라고 작성한다.

sibling

설명 : to navigate between page elements that are on the same level of the parse tree

같은 level에 있는 앞, 뒤 element(태그)를 가리킨다.

next_sibling

print(rank1.next_sibling)
print(rank1.next_sibling.next_sibling)

next_sibling이 whitespace를 가리 킬 경우에는

next.sibling을 한번 더 써서 whitespace 다음 태그를 가리킨다.

previous_sibling

rank2 = rank1.next_sibling.next_sibling
rank3 = rank2.next_sibling.next_sibling

rank2 == rank3.previous_sibling.previous_sibling

###

parent

현재 level에서 상위 계층 태그 가져온다.

<ol>
	<li></li>
<ol>

tag = soup.find('li')
print(tag.parent) # <ol>
									#		<li></li>
									#	</ol>

find_next_sibling(), find_previous_sibling()

next_sibling을 사용할 때 공백이 있는 경우 next_sibling을 한번 더 써야된다.

따라서 공백과 상관 없이, find_next_sibling(“li”) 처럼 다음 li태그를 찾게 한다.

rank1.find_next_sibling("li")
rank1.find_previous_sibling("li")

find_next_siblings(), find_previous_siblings()

같은 계층(level)에서 현재 태그 전(previous) or 후(next)에 해당하는 해당 태그를 전부 가져온다.

text 값으로 태그 찾기

webtoon = soup.find("a", text="반드시 해피엔딩-27화")

같은 text를 가진 “a” 태그를 찾게 된다.

get, post방식

get

정보를 url에 포함해서 보낸다.

https://www.coupang.com/np/search?**rocketAll=false**&q=%EB%85%B8%ED%8A%B8%EB%B6%81&brand=&offerCondition=&filter=&availableDeliveryFilter=&filterType=free&isPriceRange=false&priceRange=&minPrice=&maxPrice=&page=1&trcid=&traid=&filterSetByUser=true&channel=user&backgroundColor=&component=&rating=0&sorter=scoreDesc&listSize=36

? 뒤에 있는 것부터 변수=값 형식이다.

post

body에 데이터를 숨겨서 보낸다. get방식보다 안전하다.

content

image 주소로 requests.get을 가져온 후 content속성으로 바이너리 원문을 받을 수 있다.

img url의 경우에는 이미지를 얻는다.

하단은 image를 저장하는 예제이다.

for idx, image in enumerate(images):
    image_url = image['src']

    image_res = requests.get(image_url)
    image_res.raise_for_status()

    with open("movie{}.jpg".format(idx+1), "wb") as f:
        f.write(image_res.content)

이전Git Fork, Fetch, Pull Request 간단 정리

다음BeautifulSoup parser (lxml, html.parse, html5lib) 차이점