JavaScript web scraping with Beautiful Soup and PhantomJS (PubMed)

python 2014. 9. 24. 01:48

Environment: Windows 8 64-bit, Python 2.7.5 64-bit, PyCharm 3.4

References:

http://kochi-coders.com/2014/05/06/scraping-a-javascript-enabled-web-page-using-beautiful-soup-and-phantomjs/

http://stackoverflow.com/questions/814757/headless-internet-browser

http://coreapython.hosting.paran.com/etc/beautifulsoup4.html --> Beautiful Soup 4 documentation (Korean)

http://www.crummy.com/software/BeautifulSoup/bs4/doc/# --> Beautiful Soup 4 documentation (English)

http://phantomjs.org/quick-start.html ---> PhantomJS quick start!

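<< Mission >>

Get the citation count of a paper found on PubMed.

< Analysis >

1. Search PubMed in Chrome for the paper you are interested in.

ex) http://www.ncbi.nlm.nih.gov/pubmed/17708774

---> The goal is to grab the "Cited by 62 PubMed Central articles" text at the bottom right of the page.

2. On the paper's page, press F12 to open the developer tools and inspect the HTML source.

---> An id is unique, so first find <div id="disc_col">,

-> a class can occur many times, so find every <div class="portlet_title">,

-> descend into the <h3> under each one, and then into the <span> under that,

-> and if the text contains 'Cited by', extract the 62 with a regular expression.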
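The last step of the analysis (pulling the number out of the 'Cited by' text) can be tried on its own before touching any HTML. This is only a minimal sketch: the helper name parse_cited_by is made up for illustration, and it is written for Python 3, unlike the Python 2.7 code in the rest of this post.

```python
import re


def parse_cited_by(text):
    """Return the number in a 'Cited by N ...' string, or None if absent.

    parse_cited_by is a hypothetical helper name, not part of any library.
    """
    match = re.search(r"Cited by\s+(\d+)", text)
    return int(match.group(1)) if match else None


# The first string is the label quoted in the analysis above.
print(parse_cited_by("Cited by 62 PubMed Central articles"))
# A label without 'Cited by' gives None instead of raising an error.
print(parse_cited_by("Some other portlet title"))
```

Anchoring the pattern on 'Cited by' rather than on a bare \d+ avoids picking up an unrelated number from some other title.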

* Preparation: install Beautiful Soup 4

< First attempt >

 

# -*- coding: utf-8 -*-
 
from bs4 import BeautifulSoup
import urllib
 
urlstr = 'http://www.ncbi.nlm.nih.gov/pubmed/23256168'

f = urllib.urlopen(urlstr)
tt = f.read()


pp = BeautifulSoup(tt)

# In pp, find the div elements whose id is 'disc_col' (find_all returns a list) and take the first one
tmp = pp.find_all('div', attrs={'id':'disc_col'})[0]

print tmp.prettify()

url2 =  tmp.a.get('href')

url2 = 'http://www.ncbi.nlm.nih.gov/' + url2

print '------------------------------------------- \n'
print url2
        
f2 = urllib.urlopen(url2)
tt2 = f2.read()

pp = BeautifulSoup(tt2)

print '------------------------------------------- \n'
print pp.prettify()

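-- Output --

*** The result was not what I expected!

I assumed that once <div id="disc_col"> was found, as seen in the developer tools, the rest would be easy.

Instead, the href held a new URL, and when I followed that URL it turned out to be wired to JavaScript, so the result I wanted was nowhere in the page!!!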
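A side note on following that href: the first attempt builds the new URL by plain string concatenation, which yields a double slash if the href already starts with '/'. A small sketch of a safer join, assuming Python 3 (urllib.parse; in Python 2.7 the same function lives in the urlparse module). The helper name absolute_url and the sample href values are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "http://www.ncbi.nlm.nih.gov/"


def absolute_url(href, base=BASE):
    """Resolve an href taken from the page against the site's base URL.

    absolute_url is a hypothetical helper; urljoin does the actual work.
    """
    return urljoin(base, href)


# Hypothetical href values, only to show the behaviour:
print(BASE + "/pubmed/17708774")          # naive concatenation: double slash
print(absolute_url("/pubmed/17708774"))   # root-relative href
print(absolute_url("pubmed/17708774"))    # relative href
```

This does not solve the JavaScript problem described above; it only keeps the followed URL well formed.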

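====> A new approach is needed!!!

We need something that actually runs the JavaScript and hands back the result we want....

So.. I found PhantomJS. (If you would rather use a different tool, see the reference links above!)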

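<< How to use PhantomJS >>

1. Download the zip file from http://phantomjs.org/download.html.

2. Unzip it. It contains phantomjs.exe, a standalone executable. (No further installation is needed.)

3. Render the target URL, which contains JavaScript, into HTML and save it locally.

-> Create a js file for this. (test.js)

-> Run that js file (test.js) with phantomjs.exe.

-> The JavaScript-driven page at the URL is rendered into plain HTML, which is saved locally.

-> After that, parse the saved HTML file and pull out the result you want.

* In Notepad, create a text file named test.js and type (or paste) the content below into it. (It is the script phantomjs will run.)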

var page = require('webpage').create();
var system = require('system');
var fs = require('fs'); // File System module

var args = system.args;
var url = args[1];    // URL of the target web page
var output = args[2]; // path of the local file to save to

page.open(url, function() { // load the URL; the callback runs once loading finishes
  fs.write(output, page.content, 'w'); // write the rendered page (page.content) to the local file
  phantom.exit(); // exit PhantomJS
});

* To make it easy to use from Python, I copied just the phantomjs.exe file into the Python installation folder.

** Running phantomjs

---> Render http://www.ncbi.nlm.nih.gov/pubmed/17708774, which contains JavaScript, into HTML and save it as test.html.
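Since test.js reads the URL from args[1] and the output file from args[2], the call takes the form: phantomjs test.js <url> <output file>. The same step can also be driven from Python instead of the command prompt. This is just a sketch, assuming Python 3 and that phantomjs is reachable on the PATH (or sits next to python.exe as described above); the helper names build_phantomjs_command and render_page are made up for illustration.

```python
import subprocess


def build_phantomjs_command(url, output, script="test.js", exe="phantomjs"):
    """Build the argument list: phantomjs test.js <url> <output file>."""
    return [exe, script, url, output]


def render_page(url, output, script="test.js", exe="phantomjs"):
    """Run PhantomJS so that test.js saves the rendered page to `output`.

    Raises subprocess.CalledProcessError if phantomjs exits with an error.
    """
    subprocess.check_call(build_phantomjs_command(url, output, script, exe))


# Show the command that would be run (nothing is executed here).
cmd = build_phantomjs_command(
    "http://www.ncbi.nlm.nih.gov/pubmed/17708774", "test.html")
print(" ".join(cmd))
```

Calling render_page with the same two arguments would then produce the test.html that the final code reads.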

*** Final code ***

 

# -*- coding: utf-8 -*-
 
from bs4 import BeautifulSoup
import re
 
html_file = 'test.html'     # HTML file of the target web page, produced by PhantomJS
 
f = open(html_file)
tt = f.read()
 
pp = BeautifulSoup(tt)
 
# In pp, find the div elements whose id is 'disc_col' (find_all returns a list) and take the first one
data = pp.find_all('div', attrs={'id': 'disc_col'})[0]
 
# print data.prettify()
 
# In data, get the list of div elements whose class is 'portlet_title'
d2 = data.find_all('div', attrs={'class': 'portlet_title'})
# print len(d2)

citation = '0'
 
for x in d2:
    tmp = x.h3.span.contents    # x -> h3 tag -> span tag -> contents
    if tmp and u'Cited by' in tmp[0]:
        # print tmp[0]
        citation = re.search(r'\d+', tmp[0]).group()    # string
        
print u'Citation count = ' , citation

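Success!!!

** Exception handling and any further automation are left for you to add as you need!!!!

--- I have not checked every case, so if you hit an error, inspect the HTML with the Chrome developer tools and extend the logic!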
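As a starting point for that exception handling, here is one possible sketch. It assumes Python 3 with Beautiful Soup 4 installed and names the 'html.parser' parser explicitly. The function name extract_citation_count and the SAMPLE_HTML snippet are made up for illustration. The snippet only imitates the structure described in the analysis and is not real PubMed markup.

```python
import re

from bs4 import BeautifulSoup


def extract_citation_count(html):
    """Return the 'Cited by N' count as an int, or 0 if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    column = soup.find("div", id="disc_col")
    if column is None:          # the page has no disc_col block at all
        return 0
    for title in column.find_all("div", class_="portlet_title"):
        # get_text() works even when the <h3> or <span> is missing,
        # so no AttributeError is raised on an unexpected layout.
        text = title.get_text(" ", strip=True)
        match = re.search(r"Cited by\s+(\d+)", text)
        if match:
            return int(match.group(1))
    return 0


# Hypothetical sample imitating the structure from the analysis.
SAMPLE_HTML = """
<div id="disc_col">
  <div class="portlet_title"><h3><span>Another portlet</span></h3></div>
  <div class="portlet_title"><h3><span>Cited by 62 PubMed Central articles</span></h3></div>
</div>
"""

print(extract_citation_count(SAMPLE_HTML))
print(extract_citation_count("<html><body></body></html>"))
```

Returning 0 when nothing is found mirrors the citation = '0' default in the final code, except that the result is an int rather than a string.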

Posted by 자유프로그램
