Foreign Literature Translation: Efficient URL Caching for Web Crawling
Original Text (English)

Efficient URL Caching for World Wide Web Crawling

Marc Najork

BMJ (International Edition) 2009

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a very large 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries achieves a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings only a small additional benefit. We conjecture that such critical points are inherent to the problem and venture an explanation for the phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is:

(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)–(c)
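
A minimal sketch of this loop (an illustration, not code from the paper) looks as follows; fetch_page and extract_urls are assumed placeholder callables, and a real crawler would add politeness delays, robots.txt handling, error recovery, and parallelism:

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_urls, max_pages=1000):
        """Toy crawler: (a) fetch, (b) parse, (c) enqueue unseen URLs."""
        seen = set(seed_urls)          # the membership test of step (c)
        frontier = deque(seed_urls)    # URLs waiting to be fetched
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            page = fetch_page(url)              # step (a)
            fetched += 1
            for link in extract_urls(page):     # step (b)
                if link not in seen:            # step (c)
                    seen.add(link)
                    frontier.append(link)
        return seen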

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and are taught in many introductory algorithms classes.

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
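
A back-of-the-envelope check of these rates (the ten-links-per-page average is an assumption used only to make the implied ratio concrete, not a figure taken from the paper):

    pages_per_day = 100_000_000            # "at least 100 million pages per day"
    seconds_per_day = 24 * 60 * 60
    fetch_rate = pages_per_day / seconds_per_day      # rate of step (a)
    links_per_page = 10                               # assumed average
    membership_rate = fetch_rate * links_per_page     # rate of the test in step (c)
    print(round(fetch_rate), round(membership_rate))  # -> 1157 11574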

A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries.
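
The following sketch illustrates the idea (it is not the paper's implementation): a bounded in-memory cache is consulted first, and only misses fall through to the full, slower store of seen URLs, which is assumed here to expose a set-like membership test:

    class CachedSeenTest:
        """Membership test for seen URLs: a small in-memory cache in front of a large, slow set."""

        def __init__(self, slow_seen_set, cache_size=50_000):
            self.slow = slow_seen_set      # full set of seen URLs, e.g. disk-resident
            self.cache = set()             # small in-memory subset of the seen URLs
            self.k = cache_size

        def seen(self, url):
            if url in self.cache:          # cache hit: answered from main memory
                return True
            found = url in self.slow       # cache miss: consult the slow set
            if found:
                if len(self.cache) >= self.k:
                    self.cache.pop()       # evict an arbitrary entry (random-replacement flavour)
                self.cache.add(url)
            return found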

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for caching URLs are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead.
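
The host-based assignment can be pictured in a few lines (a simplified illustration; the actual hash Mercator uses is not specified here, so a generic checksum over the lower-cased host name stands in for it):

    from urllib.parse import urlsplit
    import zlib

    def responsible_crawler(url, num_crawlers):
        """Map a URL to the crawler process that owns its host."""
        host = urlsplit(url).hostname or ""
        # Hash only the host component, so every URL on a given host
        # is routed to the same crawler process.
        return zlib.crc32(host.lower().encode("utf-8")) % num_crawlers

    print(responsible_crawler("http://example.com/a/b.html", 8))
    print(responsible_crawler("http://example.com/other.html", 8))   # same process as above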

Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts the links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk, and a cache of frequently seen URLs is kept in main memory.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina's crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing the number of URLs transmitted between peers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network-accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory; it has been studied in many settings, for instance in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal-sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item at all. The goal of a caching algorithm is to minimize the number of misses.
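
In code, this terminology maps onto a small simulator such as the following sketch, where the evict callback is the only place in which policies such as random replacement, LRU, or CLOCK would differ:

    import random

    def simulate_cache(requests, k, evict):
        """Replay requests against a cache of size k; evict(cache) chooses the victim."""
        cache, misses = set(), 0
        for item in requests:
            if item in cache:                 # hit: no further action
                continue
            misses += 1                       # miss (fault)
            if len(cache) >= k:               # cache is full: evict one item first
                cache.discard(evict(cache))
            cache.add(item)
        return misses / len(requests)         # miss ratio for this cache size

    # Tiny example, using random replacement as the eviction policy.
    trace = ["a", "b", "a", "c", "a", "b", "d", "a"]
    print(simulate_cache(trace, k=2, evict=lambda c: random.choice(list(c))))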

Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:

- Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to yahoo.com is a much more common occurrence than a link to the authors' home pages.

- Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page (we found that on average about 30% are duplicates, cf. Section 4.2), and second because pages on the same host tend to be crawled close together in time and tend to share many links.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways. We now describe some standard caching algorithms, whose performance we evaluate in Section 5.
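
As one concrete example of such an algorithm, here is a compact sketch of the CLOCK policy, one of the practical algorithms named above (the implementation details are illustrative, not taken from the paper):

    class ClockCache:
        """CLOCK replacement: a circular array of slots, each with a reference bit."""

        def __init__(self, k):
            self.k = k
            self.slots = [None] * k        # cached items
            self.ref = [False] * k         # "recently used" reference bits
            self.index = {}                # item -> slot position
            self.hand = 0                  # the clock hand

        def access(self, item):
            """Return True on a hit, False on a miss (after installing the item)."""
            pos = self.index.get(item)
            if pos is not None:
                self.ref[pos] = True       # hit: just mark the slot as recently used
                return True
            # Miss: advance the hand, clearing reference bits, until a clear slot is found.
            while self.ref[self.hand]:
                self.ref[self.hand] = False
                self.hand = (self.hand + 1) % self.k
            victim = self.slots[self.hand]
            if victim is not None:
                del self.index[victim]     # evict the victim
            self.slots[self.hand] = item
            self.index[item] = self.hand
            self.hand = (self.hand + 1) % self.k
            return False

On a hit only the slot's reference bit is set; the hand advances only on misses, so recently referenced items tend to survive eviction, which approximates LRU at much lower bookkeeping cost.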

Appendix B: Chinese Translation

Efficient URL Caching for Web Crawling

Marc Najork

BMJ (International Edition) 2009

Abstract:

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) repeat (a)-(c) for all URLs not seen before. However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) turn this plan from a trivial programming exercise into a very serious algorithmic and system design challenge. In fact, these two factors alone imply that for a timely and complete crawl of the web, step (a) must be executed about 1,000 times per second, and therefore the membership test (c) must be executed more than 10,000 times per second, against a set of URLs far too large to hold in main memory. This calls for a distributed architecture, which makes the membership test even more complicated.

A very important way to speed up this test is to use a cache, that is, to store a (dynamic) subset of the "seen" URLs in main memory. The main contribution of this paper is a careful study of several URL caching techniques for web crawling. We consider practical algorithms (random replacement, static cache, LRU, and CLOCK) as well as theoretical limits (clairvoyant caching and an infinite cache). We ran about 1,800 simulations of these algorithms with different cache sizes, over real log data taken from a very large 33-day web crawl that issued more than one billion HTTP requests.

Our main conclusion is that caching is very effective: in our setup, a cache of about 50,000 entries achieves a hit rate of close to 80%. Interestingly, this cache size sits at a critical point: a noticeably smaller cache is much less effective, while a noticeably larger cache brings only a small additional benefit. We conjecture that this critical point is inherent to the problem and venture an explanation of the phenomenon.

1. Introduction

A recent Pew Foundation study states that "search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly more than half of all Americans had used web search to find information. The technology behind web search is therefore of enormous practical interest. In this paper we concentrate on one aspect of that technology, namely the process of collecting the web pages that eventually make up a search engine's corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained through a process called crawling or SPIDERing, which recursively explores the web. The basic algorithm is:

(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)-(c)

Crawling generally starts from a set of seed URLs. Sometimes a crawl starts from a single well-connected page, or from a directory such as yahoo.com, but in that case a relatively large portion of the web (estimated at over 20%) is never reached.

If we view web pages as nodes of a graph and hyperlinks as directed edges between these nodes, then crawling becomes a process that mathematicians call graph traversal. Different traversal strategies differ in which of the not-yet-visited nodes they explore next. Two standard strategies are depth-first search and breadth-first search; they are easy to implement and are taught in many introductory algorithms courses.

However, crawling the web is not a trivial programming exercise but a very serious algorithmic and system design challenge, for the following two reasons:

1. The web is very large. Google currently needs to index more than 3 billion pages, and many studies have indicated that, historically, the web has doubled in size every 9-12 months.

2. Web pages change very frequently. If "change" means any change at all, then about 40% of all web pages change every week. Even if we only count pages that change by a third or more, about 7% of all pages change weekly.

These two factors imply that, to obtain a timely and complete snapshot of the web, a search engine must crawl 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test of step (c) more than 10,000 times per second, against a set of URLs too large to store in main memory. In addition, crawlers generally use a distributed architecture to crawl more pages in parallel, which makes the membership test even more complicated: it is possible that the membership question can only be answered by a peer node rather than locally.

A very important way to speed up this test is to use a cache, that is, to store a (dynamic) subset of the "seen" URLs in main memory. The main contribution of this paper is a careful study of several URL caching techniques for web crawling. We consider practical algorithms (random replacement, static cache, LRU, and CLOCK) as well as theoretical limits (clairvoyant caching and an infinite cache). We ran about 1,800 simulations of these algorithms with different cache sizes, over real log data taken from a very large 33-day web crawl that issued more than one billion HTTP requests.

The paper is organized as follows: Section 2 discusses the crawling solutions proposed in the literature and what kind of cache suits them best. Section 3 introduces caching techniques and describes several theoretical and practical caching algorithms. Section 4 describes the experimental setup under which we implemented these algorithms. Section 5 presents and discusses the simulation results. Section 6 gives our recommendations for practical algorithms and data structures for URL caching. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers appeared almost at the same time as the web itself, and many crawling systems have been described in the literature. In this section we give a brief survey of these crawlers and discuss why most of them would benefit from a URL cache.

The crawler used by the Internet Archive employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler implements the different crawler components as separate processes. A single URL server process maintains the set of URLs to be downloaded; crawler processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative URLs into absolute ones. These processes communicate through the file system.

For the experiments in this paper we used the Mercator web crawler. Mercator uses a set of independent, communicating crawler processes. Each crawler process is responsible for a subset of all web servers; URLs are assigned to crawler processes based on the URL's host component. A crawler that discovers a URL it is not responsible for sends that URL over TCP to the crawler that is responsible for it, batching URLs together to reduce TCP overhead. We describe Mercator in more detail in Section 4.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since the same URL must not be downloaded repeatedly, there must be a way to avoid adding a URL to the collection more than once. This is generally achieved by maintaining a set of discovered URLs. If this set is too large, it can be stored on disk, with the frequently accessed URLs cached in memory.

3. CACHING

In most computer systems, memory is hierarchical, meaning that there are two or more levels of memory with different trade-offs between size and speed. For example, a typical workstation has a very small but very fast on-chip memory, a large but slower RAM, and a very large but much slower disk. In a networked environment the hierarchy continues further. Caching is the idea of keeping frequently used items from a slower memory in a faster memory.

Caching terminology is as follows: the cache is memory used to store items of equal size. A cache of size k can store at most k items. In each time step the cache receives a request for an item. If the requested item is in the cache, this is a hit and no further action is needed; otherwise it is a miss, or fault. If the cache holds fewer than k items, the missing item is added to the cache; otherwise the algorithm must choose either to evict an item to make room for the missing one, or not to add the missing item at all. The goal of a caching algorithm is to minimize the number of misses.

Clearly, the larger the cache, the easier it is to avoid misses. The performance of a caching algorithm is therefore judged by its miss rate for a cache of a given size.

In general, caching succeeds for two reasons:

- Non-uniform requests: some requests are much more frequent than others.
