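Cloud Computing and Cloud Data Management
Jiaheng Lu, Renmin University of China, www.jiahenglu.net
"Advanced Data Management" Frontier Workshop

Outline
- Overview of cloud computing
- Google cloud computing techniques: GFS, Bigtable, and MapReduce
- Yahoo cloud computing techniques and Hadoop
- Challenges in cloud data management

New course at Renmin University: "Distributed Systems and Cloud Computing"
- Overview of distributed systems
- Survey of distributed cloud computing techniques
- Distributed cloud computing platforms
- Distributed cloud computing program development

Part 1: Overview of Distributed Systems
- Chapter 1: Introduction to distributed systems
- Chapter 2: Client-server architecture
- Chapter 3: Distributed objects
- Chapter 4: Common Object Request Broker Architecture (CORBA)

Part 2: Survey of Cloud Computing
- Chapter 5: Introduction to cloud computing
- Chapter 6: Cloud services
- Chapter 7: Comparison of cloud-related technologies
  - 7.1 Grid computing and cloud computing
  - 7.2 Utility computing and cloud computing
  - 7.3 Parallel and distributed computing and cloud computing
  - 7.4 Cluster computing and cloud computing

Part 3: Cloud Computing Platforms
- Chapter 8: The three key technologies of the Google cloud platform
- Chapter 9: Technologies of the Yahoo cloud platform
- Chapter 10: Technologies of the Aneka cloud platform
- Chapter 11: Technologies of the Greenplum cloud platform
- Chapter 12: Technologies of the Amazon Dynamo cloud platform

Part 4: Development on Cloud Computing Platforms
- Chapter 13: Development on Hadoop
- Chapter 14: Development on HBase
- Chapter 15: Development on Google Apps
- Chapter 16: Development on MS Azure
- Chapter 17: Development on Amazon EC2

Cloud computing

Why do we use cloud computing?

Case 1:
- Write a file and save it
- If the computer goes down, the file is lost
- Files stored in the cloud are always available and never lost

Case 2:
- Use IE: download, install, use
- Use QQ: download, install, use
- Use C++: download, install, use
- ……
- Instead, get the service from the cloud

What is cloud and cloud computing?
- Cloud: on-demand resources or services over the Internet, with the scale and reliability of a data center.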
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

Characteristics of cloud computing
- Virtual: software, databases, Web servers, operating systems, storage, and networking are provided as virtual servers.
- On demand: add and subtract processors, memory, network bandwidth, and storage as needed.

Types of cloud service
- IaaS: Infrastructure as a Service
- PaaS: Platform as a Service
- SaaS: Software as a Service

SaaS
Software delivery model
- No hardware or software to manage
- Service delivered through a browser
- Customers use the service on demand
- Instant scalability

SaaS: Examples
- Your current CRM package is not managing the load, or you simply don't want to host it in-house. Use a SaaS provider such as Salesforce.com.
- Your email is hosted on an Exchange server in your office and it is very slow. Outsource this using Hosted Exchange.

PaaS
Platform delivery model
- Platforms are built upon infrastructure, which is expensive
- Estimating demand is not a science!
- Platform management is not fun!

PaaS: Examples
- You need to host a large file (5Mb) on your website and make it available to 35,000 users for only two months. Use CloudFront from Amazon.
- You want to start storage services on your network for a large number of files and you do not have the storage capacity. Use Amazon S3.

IaaS
Computer infrastructure delivery model
- A platform virtualization environment
- Computing resources, such as storage and processing capacity
- Virtualization taken a step further

IaaS: Examples
- You want to run a batch job but you don't have the infrastructure necessary to run it in a timely manner. Use Amazon EC2.
- You want to host a website, but only for a few days. Use Flexiscale.

Cloud computing and other computing techniques

The 21st Century Vision of Computing
Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project, which seeded the Internet, said: “As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of ‘computer utilities’ which, like present electric and telephone utilities, will service individual homes and offices across the country.”

The 21st Century Vision of Computing
Sun Microsystems co-founder Bill Joy also indicated: “It would take time until these markets to mature to generate this kind of value.
Predicting now which companies will capture the value is impossible. Many of them have not even been created yet.”

Definitions
- Utility computing is the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility.
- A computer cluster is a group of linked computers working together so closely that in many respects they form a single computer.
- Grid computing is the application of several computers to a single problem at the same time, usually a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.

Grid Computing & Cloud Computing
- They share a lot in common: intention, architecture, and technology
- They differ in: programming model, business model, compute model, applications, and virtualization

Grid Computing & Cloud Computing
The problems are mostly the same:
- manage large facilities;
- define methods by which consumers discover, request, and use resources provided by the central facilities;
- implement the often highly parallel computations that execute on those resources.

Grid Computing & Cloud Computing: Virtualization
- Grid: does not rely on virtualization as much as clouds do; each individual organization maintains full control of its resources
- Cloud: virtualization is an indispensable ingredient for almost every cloud

2024/2/28
Any questions or comments?

Outline
- Overview of cloud computing
- Google cloud computing techniques: GFS, Bigtable, and MapReduce
- Yahoo cloud computing techniques and Hadoop
- Challenges in cloud data management

Google Cloud computing techniques

The Google File System (GFS)
- A scalable distributed file system for large, distributed, data-intensive applications
- Multiple GFS clusters are currently deployed. The largest ones have:
  - 1000+ storage nodes
  - 300+ terabytes of disk storage
  - heavy access by hundreds of clients on distinct machines

Introduction
- Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc.
- The GFS design has been driven by four key observations of Google's application workloads and technological environment

Intro: Observations 1
1. Component failures are the norm
   - Constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system
2. Huge files (by traditional standards)
   - Multi-GB files are common
   - I/O operations and block sizes must be revisited

Intro: Observations 2
3. Most files are mutated by appending new data
   - This is the focus of performance optimization and atomicity guarantees
4. Co-designing the applications and APIs benefits the overall system by increasing flexibility

The Design
- A cluster consists of a single master and multiple chunkservers and is accessed by multiple clients

The Master
- Maintains all file system metadata: namespace, access control information, file-to-chunk mappings, chunk locations (including replicas), etc.
- Periodically communicates with chunkservers through HeartBeat messages to give instructions and check state
- Helps make sophisticated chunk placement and replication decisions using global knowledge
- For reading and writing, the client contacts the master to get chunk locations, then deals directly with the chunkservers
  - The master is therefore not a bottleneck for reads and writes

Chunkservers
- Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle
  - The handle is assigned by the master at chunk creation
- Chunk size is 64 MB
- Each chunk is replicated on 3 servers (by default)

Clients
- Linked into applications using the file system API
- Communicate with the master and chunkservers for reading and writing
  - Master interactions only for metadata
  - Chunkserver interactions for data
- Cache only metadata
  - File data is too large to cache

Chunk Locations
- The master does not keep a persistent record of the locations of chunks and replicas
- It polls chunkservers for this information at startup and whenever chunkservers join or leave
- It stays up to date by controlling the placement of new chunks and through HeartBeat messages (while monitoring chunkservers)

Operation Log
- Record of all critical metadata changes
- Stored on the master and replicated on other machines
- Defines the order of concurrent operations
- Also used to recover the file system state

System Interactions: Leases and Mutation Order
- Leases maintain a consistent mutation order across all chunk replicas
- The master grants a lease to one replica, called the primary
- The primary chooses the serial mutation order, and all replicas follow this order
- Minimizes management overhead for the master

Atomic Record Append
- The client specifies only the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns the offset to the client
- Heavily used by Google's distributed applications
- No need for a distributed lock manager
- GFS chooses the offset, not the client

Atomic Record Append: How?
- Follows a control flow similar to other mutations
- The primary tells the secondary replicas to append at the same offset as the primary
- If the append fails at any replica, the client retries it
  - So replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record
- GFS does not guarantee that all replicas are bitwise identical
  - It only guarantees that the data is written at least once as an atomic unit
- Data must be written at the same offset on all chunk replicas for success to be reported

Detecting Stale Replicas
- The master keeps a chunk version number to distinguish up-to-date replicas from stale ones
- The version is increased when a lease is granted
- If a replica is not available, its version is not increased
- The master detects stale replicas when chunkservers report their chunks and versions
- Stale replicas are removed during garbage collection

Garbage Collection
- When a client deletes a file, the master logs the deletion like other changes and renames the file to a hidden name
- While scanning the file system namespace, the master removes files that have been hidden for longer than 3 days
  - Their metadata is also erased
- In HeartBeat messages, each chunkserver sends the master a subset of its chunks, and the master replies with the chunks that no longer appear in its metadata
  - The chunkserver removes these chunks on its own

Fault Tolerance: High Availability
- Fast recovery
  - Master and chunkservers can restart in seconds
- Chunk replication
- Master replication
  - "Shadow" masters provide read-only access when the primary master is down
  - Mutations are not considered done until recorded on all master replicas

Fault Tolerance: Data Integrity
- Chunkservers use checksums to detect corrupt data
  - Since replicas are not bitwise identical, each chunkserver maintains its own checksums
- For reads, the chunkserver verifies the checksum before sending the chunk
- Checksums are updated during writes

Introduction to MapReduce

MapReduce: Insight
- "Consider the problem of counting the number of occurrences of each word in a large collection of documents"
- How would you do it in parallel?

MapReduce Programming Model
- Inspired by the map and reduce operations commonly used in functional programming languages like Lisp
- Users implement an interface of two primary methods:
  1. Map: (key1, val1) → (key2, val2)
  2. Reduce: (key2, [val2]) → [val3]

Map operation
- Map, a pure function written by the user, takes an input key/value pair, e.g. (doc-id, doc-content), and produces a set of intermediate key/value pairs
- Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query

Reduce operation
- On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer
- It can be visualized as an aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute

Pseudo-code
    map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);
        Emit(AsString(result));

MapReduce: Execution overview

MapReduce: Example

MapReduce in Parallel: Example

MapReduce: Fault Tolerance
- Handled via re-execution of tasks
  - Task completion is committed through the master
- What happens if a mapper fails?
  - Re-execute completed and in-progress map tasks
- What happens if a reducer fails?
  - Re-execute in-progress reduce tasks
- What happens if the master fails?
  - Potential trouble!

MapReduce: Walk-through of One More Application

MapReduce: PageRank
- PageRank models the behavior of a "random surfer"
- C(t) is the out-degree of t, and (1-d) is a damping factor (random jump)
- The "random surfer" keeps clicking on successive links at random, not taking content into consideration
- Each page distributes its PageRank equally among all pages it links to
- The damping factor models the surfer "getting bored" and typing an arbitrary URL

PageRank: Key Insights
- The effects of each iteration are local: iteration i+1 depends only on iteration i
- At iteration i, the PageRank of individual nodes can be computed independently

PageRank using MapReduce
- Use a sparse matrix representation (M)
- Map each row of M to a list of PageRank "credit" to assign to out-link neighbours
- These prestige scores are reduced to a single PageRank value for a page by aggregating over them

PageRank using MapReduce
Source of image: Lin 2008

Phase 1: Process HTML
- The map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls))
  - PRinit is the "seed" PageRank for URL
  - list-of-urls contains all pages pointed to by URL
- The reduce task is just the identity function

Phase 2: PageRank Distribution
- The reduce task gets (URL, url_list) and many (URL, val) values
  - Sum the vals and fix up with d to get the new PR
  - Emit (URL, (new_rank, url_list))
- Check for convergence using a non-parallel component

MapReduce: Some More Apps
- Distributed grep
- Count of URL access frequency
- Clustering (K-means)
- Graph algorithms
- Indexing systems

MapReduce Programs in Google Source Tree

MapReduce: Extensions and similar apps
- PIG (Yahoo)
- Hadoop (Apache)
- DryadLinq (Microsoft)

Large Scale Systems Architecture using MapReduce

BigTable: A Distributed Storage System for Structured Data

Introduction
- BigTable is a distributed storage system for managing structured data
- Designed to scale to a very large size
  - Petabytes of data across thousands of servers
- Used for many Google projects
  - Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, …
- Flexible, high-performance solution for all of Google's products

Motivation
- Lots of (semi-)structured data at Google
  - URLs: contents, crawl metadata, links, anchors, pagerank, …
  - Per-user data: user preference settings, recent queries/search results, …
  - Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
- Scale is large
  - Billions of URLs, many versions/page (~20K/version)
  - Hundreds of millions of users, thousands of q/sec
  - 100TB+ of satellite image data

Why not just use commercial DB?
- Scale is too large for most commercial databases
- Even if it weren't, cost would be very high
  - Building internally means the system can be applied across many projects for low incremental cost
- Low-level
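
Returning to the word-count pseudo-code in the MapReduce slides: a minimal single-process sketch in Python may make the map, shuffle, and reduce steps easier to follow. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative assumptions, and the in-memory grouping only stands in for what a real framework does across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(docs: Dict[str, str]) -> Dict[str, int]:
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    # Shuffle: group intermediate values by key, as the framework would.
    groups: Dict[str, List[int]] = defaultdict(list)
    for name, text in docs.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    # Each key is reduced independently, which is what allows parallelism.
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = {"d1": "the cloud stores the file", "d2": "the file system"}
    print(run_mapreduce(docs))
```

Because each call to `map_fn` sees one document and each call to `reduce_fn` sees one key, both phases can be spread over many workers without coordination beyond the shuffle.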
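
The PageRank distribution phase from the MapReduce slides can be sketched as one map/reduce round in Python. The slides' formula image is not available, so this sketch assumes the common form PR(x) = (1 - d) + d * sum(PR(t) / C(t)) over the pages t that link to x; the function names and the default d = 0.85 are also assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple, Union

Graph = Dict[str, Tuple[float, List[str]]]  # url -> (rank, out-links)
Value = Union[float, List[str]]


def pr_map(url: str, rank: float, links: List[str]) -> Iterator[Tuple[str, Value]]:
    """Pass the link list through and send equal credit to each out-link."""
    yield url, links
    for target in links:
        yield target, rank / len(links)


def pr_reduce(url: str, values: List[Value], d: float) -> Tuple[str, Tuple[float, List[str]]]:
    """Sum the incoming credit and fix it up with d to get the new rank."""
    links: List[str] = []
    credit = 0.0
    for value in values:
        if isinstance(value, list):
            links = value  # the page's own link list, passed through by map
        else:
            credit += value
    return url, ((1 - d) + d * credit, links)


def iterate(graph: Graph, d: float = 0.85) -> Graph:
    """One PageRank iteration: map, group by URL, reduce."""
    groups: Dict[str, List[Value]] = defaultdict(list)
    for url, (rank, links) in graph.items():
        for key, value in pr_map(url, rank, links):
            groups[key].append(value)
    return dict(pr_reduce(url, values, d) for url, values in groups.items())


if __name__ == "__main__":
    graph: Graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
    for _ in range(20):  # the convergence check would replace this fixed loop
        graph = iterate(graph)
    print({url: round(rank, 3) for url, (rank, _) in graph.items()})
```

The loop in the `__main__` block plays the role of the non-parallel component: it decides whether to launch another round, while each round itself depends only on the previous one.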
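
For the GFS read path described in the Master, Chunkservers, and Clients slides, a toy in-memory model may help: the client turns a byte offset into a chunk index, asks the master only for metadata, caches the answer, and fetches data from a chunkserver. All class and method names here are hypothetical, and the sketch ignores reads that span two chunks.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the Chunkservers slide


class Master:
    """Holds metadata only: (path, chunk index) -> (handle, replica names)."""

    def __init__(self) -> None:
        self.chunks: Dict[Tuple[str, int], Tuple[int, List[str]]] = {}
        self.lookups = 0  # counts metadata requests, to show client caching

    def lookup(self, path: str, index: int) -> Tuple[int, List[str]]:
        self.lookups += 1
        return self.chunks[(path, index)]


class Chunkserver:
    """Holds chunk data keyed by chunk handle."""

    def __init__(self) -> None:
        self.data: Dict[int, bytes] = {}

    def read(self, handle: int, start: int, length: int) -> bytes:
        return self.data[handle][start:start + length]


class Client:
    def __init__(self, master: Master, servers: Dict[str, Chunkserver]) -> None:
        self.master = master
        self.servers = servers
        self.cache: Dict[Tuple[str, int], Tuple[int, List[str]]] = {}

    def read(self, path: str, offset: int, length: int) -> bytes:
        # Translate the file offset into a chunk index and in-chunk offset.
        index, start = divmod(offset, CHUNK_SIZE)
        key = (path, index)
        if key not in self.cache:  # metadata is cached, file data is not
            self.cache[key] = self.master.lookup(path, index)
        handle, locations = self.cache[key]
        # Data flows directly from a chunkserver, never through the master.
        return self.servers[locations[0]].read(handle, start, length)
```

Repeated reads of the same chunk hit the client's metadata cache, which is one reason the single master does not become a bottleneck.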
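
The version-number scheme from the Detecting Stale Replicas slide can be illustrated with a short Python sketch. It assumes a simplified model in which granting a lease bumps the version on the master and on every reachable replica; the `Master` and `Chunkserver` classes and their methods are hypothetical names, not GFS interfaces.

```python
from typing import Dict, Iterable, List


class Chunkserver:
    def __init__(self, name: str) -> None:
        self.name = name
        self.versions: Dict[int, int] = {}  # chunk handle -> version held


class Master:
    def __init__(self) -> None:
        self.versions: Dict[int, int] = {}  # chunk handle -> current version

    def grant_lease(self, handle: int, reachable: Iterable[Chunkserver]) -> int:
        """Increase the chunk version and inform only the reachable replicas."""
        version = self.versions.get(handle, 0) + 1
        self.versions[handle] = version
        for server in reachable:
            server.versions[handle] = version
        return version

    def stale_replicas(self, handle: int, servers: Iterable[Chunkserver]) -> List[str]:
        """Compare reported versions with the master's; older ones are stale."""
        current = self.versions[handle]
        return [s.name for s in servers if s.versions.get(handle, 0) < current]


if __name__ == "__main__":
    a, b, c = Chunkserver("a"), Chunkserver("b"), Chunkserver("c")
    master = Master()
    master.grant_lease(7, [a, b, c])  # all three replicas are up
    master.grant_lease(7, [a, b])     # c is unavailable and misses the bump
    print(master.stale_replicas(7, [a, b, c]))
```

A replica that was down during a lease grant keeps its old version, so the master can identify it later and leave it to garbage collection.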
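
The Data Integrity slide says each chunkserver keeps its own checksums and verifies them before returning data. A minimal Python sketch of that idea follows; the 64 KB block size comes from the GFS paper rather than these slides, CRC-32 from `zlib` is a stand-in for whatever checksum GFS actually uses, and the function names are assumptions.

```python
import zlib
from typing import List

BLOCK_SIZE = 64 * 1024  # assumption: block size taken from the GFS paper


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    """Compute one CRC-32 per fixed-size block of chunk data."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def verified_read(data: bytes, checksums: List[int],
                  block_size: int = BLOCK_SIZE) -> bytes:
    """Return the data only if every block still matches its stored checksum."""
    if block_checksums(data, block_size) != checksums:
        raise IOError("checksum mismatch: chunk data is corrupt")
    return data


if __name__ == "__main__":
    chunk = b"record-1;record-2;record-3;"
    stored = block_checksums(chunk, block_size=8)  # recorded at write time
    print(verified_read(chunk, stored, block_size=8) == chunk)
```

Checksums are local to a chunkserver because replicas need not be bitwise identical, so comparing data across replicas would not detect corruption reliably.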