Foreign Literature Translation: Efficient URL Caching for Web Crawling
Original Text (English)

Efficient URL Caching for World Wide Web Crawling

Marc Najork

BMJ (International Edition) 2009

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a very large 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries achieves a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings only a small additional benefit. We conjecture that such critical points are inherent to the problem and venture an explanation for the phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is:

(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)–(c)
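
A minimal sketch of this loop (an illustration, not code from the paper) looks as follows; fetch_page and extract_urls are assumed placeholder callables, and a real crawler would add politeness delays, robots.txt handling, error recovery, and parallelism:

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
        """Toy crawler: (a) fetch, (b) parse, (c) enqueue unseen URLs."""
        seen = set(seed_urls)          # the membership test of step (c)
        frontier = deque(seed_urls)    # URLs waiting to be fetched
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            page = fetch_page(url)              # step (a)
            fetched += 1
            for link in extract_urls(page):     # step (b)
                if link not in seen:            # step (c)
                    seen.add(link)
                    frontier.append(link)
        return seen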

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and are taught in many introductory algorithms classes.

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
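
A back-of-the-envelope check of these rates (the ten-links-per-page average is an assumption used only to make the implied ratio concrete, not a figure taken from the paper):

    pages_per_day = 100_000_000            # "at least 100 million pages per day"
    seconds_per_day = 24 * 60 * 60
    fetch_rate = pages_per_day / seconds_per_day      # rate of step (a)
    links_per_page = 10                               # assumed average
    membership_rate = fetch_rate * links_per_page     # rate of the test in step (c)
    print(round(fetch_rate), round(membership_rate))  # -> 1157 11574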

A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries.
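
The following sketch illustrates the idea (it is not the paper's implementation): a bounded in-memory cache is consulted first, and only misses fall through to the full, slower store of seen URLs, which is assumed here to expose a set-like membership test:

    class CachedSeenTest:
        """Membership test for seen URLs: a small in-memory cache in front of a large, slow set."""

        def __init__(self, slow_seen_set, cache_size=50_000):
            self.slow = slow_seen_set      # full set of seen URLs, e.g. disk-resident
            self.cache = set()             # small in-memory subset of the seen URLs
            self.k = cache_size

        def seen(self, url):
            if url in self.cache:          # cache hit: answered from main memory
                return True
            found = url in self.slow       # cache miss: consult the slow set
            if found:
                if len(self.cache) >= self.k:
                    self.cache.pop()       # evict an arbitrary entry (random-replacement flavour)
                self.cache.add(url)
            return found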

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for caching URLs are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead.
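
The host-based assignment can be pictured in a few lines (a simplified illustration; the actual hash Mercator uses is not specified here, so a generic checksum over the lower-cased host name stands in for it):

    from urllib.parse import urlsplit
    import zlib

    def responsible_crawler(url, num_crawlers):
        """Map a URL to the crawler process that owns its host."""
        host = urlsplit(url).hostname or ""
        # Hash only the host component, so every URL on a given host
        # is routed to the same crawler process.
        return zlib.crc32(host.lower().encode("utf-8")) % num_crawlers

    print(responsible_crawler("http://example.com/a/b.html", 8))
    print(responsible_crawler("http://example.com/other.html", 8))   # same process as above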

Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts the links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk, and a cache of frequently seen URLs is kept in main memory.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina's crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing the number of URLs transmitted between peers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network-accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory; it has been studied in many settings, for instance in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal-sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item at all. The goal of a caching algorithm is to minimize the number of misses.
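
In code, this terminology maps onto a small simulator such as the following sketch, where the evict callback is the only place in which policies such as random replacement, LRU, or CLOCK would differ:

    import random

    def simulate_cache(requests, k, evict):
        """Replay requests against a cache of size k; evict(cache) chooses the victim."""
        cache, misses = set(), 0
        for item in requests:
            if item in cache:                 # hit: no further action
                continue
            misses += 1                       # miss (fault)
            if len(cache) >= k:               # cache is full: evict one item first
                cache.discard(evict(cache))
            cache.add(item)
        return misses / len(requests)         # miss ratio for this cache size

    # Tiny example, using random replacement as the eviction policy.
    trace = ["a", "b", "a", "c", "a", "b", "d", "a"]
    print(simulate_cache(trace, k=2, evict=lambda c: random.choice(list(c))))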

Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:

- Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to yahoo.com is a much more common occurrence than a link to the authors' home pages.

- Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page (we found that on average about 30% are duplicates, cf. Section 4.2), and second because pages on the same host tend to be crawled close together in time and tend to share many links.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways. We now describe some standard caching algorithms, whose performance we evaluate in Section 5.
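
As one concrete example of such an algorithm, here is a compact sketch of the CLOCK policy, one of the practical algorithms named above (the implementation details are illustrative, not taken from the paper):

    class ClockCache:
        """CLOCK replacement: a circular array of slots, each with a reference bit."""

        def __init__(self, k):
            self.k = k
            self.slots = [None] * k        # cached items
            self.ref = [False] * k         # "recently used" reference bits
            self.index = {}                # item -> slot position
            self.hand = 0                  # the clock hand

        def access(self, item):
            """Return True on a hit, False on a miss (after installing the item)."""
            pos = self.index.get(item)
            if pos is not None:
                self.ref[pos] = True       # hit: just mark the slot as recently used
                return True
            # Miss: advance the hand, clearing reference bits, until a clear slot is found.
            while self.ref[self.hand]:
                self.ref[self.hand] = False
                self.hand = (self.hand + 1) % self.k
            victim = self.slots[self.hand]
            if victim is not None:
                del self.index[victim]     # evict the victim
            self.slots[self.hand] = item
            self.index[item] = self.hand
            self.hand = (self.hand + 1) % self.k
            return False

On a hit only the slot's reference bit is set; the hand advances only on misses, so recently referenced items tend to survive eviction, which approximates LRU at much lower bookkeeping cost.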

Appendix B: Chinese Translation

Efficient URL Caching for Web Crawling

Marc Najork

BMJ (International Edition) 2009

Abstract:

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) repeat (a)-(c) for all URLs not seen before. However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) turn this plan from a trivial programming exercise into a very serious algorithmic and system design challenge. In fact, these two factors alone imply that for a timely and complete crawl of the web, step (a) must be executed about 1,000 times per second, and therefore the membership test (c) must be executed more than 10,000 times per second, against a set of URLs far too large to hold in main memory. This calls for a distributed architecture, which makes the membership test even more complicated.

A very important way to speed up this test is to use a cache, that is, to store a (dynamic) subset of the "seen" URLs in main memory. The main contribution of this paper is a careful study of several URL caching techniques for web crawling. We consider practical algorithms (random replacement, static cache, LRU, and CLOCK) as well as theoretical limits (clairvoyant caching and an infinite cache). We ran about 1,800 simulations of these algorithms with different cache sizes, over real log data taken from a very large 33-day web crawl that issued more than one billion HTTP requests.

Our main conclusion is that caching is very effective: in our setup, a cache of about 50,000 entries achieves a hit rate of close to 80%. Interestingly, this cache size sits at a critical point: a noticeably smaller cache is much less effective, while a noticeably larger cache brings only a small additional benefit. We conjecture that this critical point is inherent to the problem and venture an explanation of the phenomenon.

1. Introduction

A recent Pew Foundation study states that "search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly more than half of all Americans had used web search to find information. The technology behind web search is therefore of enormous practical interest. In this paper we concentrate on one aspect of that technology, namely the process of collecting the web pages that eventually make up a search engine's corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained through a process called crawling or SPIDERing, which recursively explores the web. The basic algorithm is:

(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)-(c)

Crawling generally starts from a set of seed URLs. Sometimes a crawl starts from a single well-connected page, or from a directory such as yahoo.com, but in that case a relatively large portion of the web (estimated at over 20%) is never reached.

If we view web pages as nodes of a graph and hyperlinks as directed edges between these nodes, then crawling becomes a process that mathematicians call graph traversal. Different traversal strategies differ in which of the not-yet-visited nodes they explore next. Two standard strategies are depth-first search and breadth-first search; they are easy to implement and are taught in many introductory algorithms courses.

However, crawling the web is not a trivial programming exercise but a very serious algorithmic and system design challenge, for the following two reasons:

1. The web is very large. Google currently needs to index more than 3 billion pages, and many studies have indicated that, historically, the web has doubled in size every 9-12 months.

2. Web pages change very frequently. If "change" means any change at all, then about 40% of all web pages change every week. Even if we only count pages that change by a third or more, about 7% of all pages change weekly.

These two factors imply that, to obtain a timely and complete snapshot of the web, a search engine must crawl 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test of step (c) more than 10,000 times per second, against a set of URLs too large to store in main memory. In addition, crawlers generally use a distributed architecture to crawl more pages in parallel, which makes the membership test even more complicated: it is possible that the membership question can only be answered by a peer node rather than locally.

A very important way to speed up this test is to use a cache, that is, to store a (dynamic) subset of the "seen" URLs in main memory. The main contribution of this paper is a careful study of several URL caching techniques for web crawling. We consider practical algorithms (random replacement, static cache, LRU, and CLOCK) as well as theoretical limits (clairvoyant caching and an infinite cache). We ran about 1,800 simulations of these algorithms with different cache sizes, over real log data taken from a very large 33-day web crawl that issued more than one billion HTTP requests.

The paper is organized as follows: Section 2 discusses the crawling solutions proposed in the literature and what kind of cache suits them best. Section 3 introduces caching techniques and describes several theoretical and practical caching algorithms. Section 4 describes the experimental setup under which we implemented these algorithms. Section 5 presents and discusses the simulation results. Section 6 gives our recommendations for practical algorithms and data structures for URL caching. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers appeared almost at the same time as the web itself, and many crawling systems have been described in the literature. In this section we give a brief survey of these crawlers and discuss why most of them would benefit from a URL cache.

The crawler used by the Internet Archive employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler implements the different crawler components as separate processes. A single URL server process maintains the set of URLs to be downloaded; crawler processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative URLs into absolute ones. These processes communicate through the file system.

For the experiments in this paper we used the Mercator web crawler. Mercator uses a set of independent, communicating crawler processes. Each crawler process is responsible for a subset of all web servers; URLs are assigned to crawler processes based on the URL's host component. A crawler that discovers a URL it is not responsible for sends that URL over TCP to the crawler that is responsible for it, batching URLs together to reduce TCP overhead. We describe Mercator in more detail in Section 4.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since the same URL must not be downloaded repeatedly, there must be a way to avoid adding a URL to the collection more than once. This is generally achieved by maintaining a set of discovered URLs. If this set is too large, it can be stored on disk, with the frequently accessed URLs cached in memory.

3. CACHING

In most computer systems, memory is hierarchical, meaning that there are two or more levels of memory with different trade-offs between size and speed. For example, a typical workstation has a very small but very fast on-chip memory, a large but slower RAM, and a very large but much slower disk. In a networked environment the hierarchy continues further. Caching is the idea of keeping frequently used items from a slower memory in a faster memory.

Caching terminology is as follows: the cache is memory used to store items of equal size. A cache of size k can store at most k items. In each time step the cache receives a request for an item. If the requested item is in the cache, this is a hit and no further action is needed; otherwise it is a miss, or fault. If the cache holds fewer than k items, the missing item is added to the cache; otherwise the algorithm must choose either to evict an item to make room for the missing one, or not to add the missing item at all. The goal of a caching algorithm is to minimize the number of misses.

Clearly, the larger the cache, the easier it is to avoid misses. The performance of a caching algorithm is therefore judged by its miss rate for a cache of a given size.

In general, caching succeeds for two reasons:

- Non-uniform requests: some requests are much more frequent than others.
