Foreign Literature Translation: SAS Statistical Analysis Software and Logistic Regression

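<p>  SAS Statistical Analysis Software and Logistic Regression</p>
<p><b>  1. Overview:</b></p>
<p>  SAS stands for Statistical Analysis System. It was first written by two biostatistics graduate students at the University of North Carolina; in 1976 the SAS Institute was founded and the SAS software was formally released. SAS is a large integrated information system for decision support, but its earliest functions were limited to statistical analysis, and statistical analysis remains an important component and core function today. The current version of SAS is 9.0, about 1 GB in size. After many years of development, SAS has been adopted by nearly 30,000 organizations in more than 120 countries and regions, with over three million direct users across finance, medicine and health care, manufacturing, transportation, telecommunications, government, education, and scientific research. In the United Kingdom, the United States, and other countries, proficiency in statistical analysis with SAS is one of the hiring criteria at many companies and research institutions. In data processing and statistical analysis, SAS is regarded as the international standard software system, and in 1996-97 it was voted the product of choice for building databases. It can fairly be called the giant of statistical software. One example: in the famously strict new-drug approval process of the US FDA, the statistical analysis of drug trial results must be carried out in SAS, and results computed with other software are not accepted, not even a simple mean or standard deviation. This shows the authority SAS holds.</p>
<p>  SAS is a modular software system made up of multiple functional modules, the basic one being the BASE SAS module. BASE SAS is the core of the SAS system: it carries out the main data management tasks, manages the user environment, processes the user's language, and calls the other SAS modules and products. In other words, running SAS always starts with the BASE SAS module, which, besides its own data management, programming, and descriptive statistics functions, also acts as the central dispatcher of the SAS system. It can run on its own or combine with other products or modules to form a complete system. Each module can be installed and updated conveniently through its installer. SAS has flexible extension interfaces and powerful functional modules; on top of BASE SAS, the following modules can be added to provide further functions: SAS/STAT (statistical analysis), SAS/GRAPH (graphics), SAS/QC (quality control), SAS/ETS (econometrics and time series analysis), SAS/OR (operations research), SAS/IML (interactive matrix programming language), SAS/FSP (interactive menu system for fast data processing), SAS/AF (interactive full-screen application system), and others. SAS has an intelligent graphics system that can draw not only all kinds of statistical charts but also maps. SAS provides multiple</p>
<p><b>  2. Operation:</b></p>
<p>  SAS grew out of mainframe systems, and its core mode of operation is program-driven. After years of development it has become a complete programming language, and its user interface reflects this: it uses an MDI (multiple document interface) in which the user enters programs in the PGM window and the results are written as text to the OUTPUT window. Working through programs, users can do everything they need, including statistical analysis, forecasting, modeling, and simulated sampling. However, this means beginners must learn the SAS language before they can use SAS, so getting started is relatively hard. The Windows version of SAS offers several graphical interfaces aimed at different user groups; each has its own strengths and is very convenient to use. But because little has been written about them in China, and they are not the focus of SAS's promotion, most people are still unaware of them.</p>
<p>  3. Basic operation and basic concepts of the SAS system:</p>
<p>  3.1 Datasets and libraries:</p>
<p>  Statistical operations all act on data. In SAS, a file that holds data is called a dataset, and datasets are stored in libraries (think of these as databases for now). SAS libraries are either permanent or temporary. As the names suggest, datasets in a permanent library persist (as long as you do not delete them), while datasets in a temporary library are deleted automatically when you exit SAS. The simplest way to think of a SAS library is as a directory that holds datasets. A dataset has exactly the structure of an ordinary data table, made up of fields and records. In statistics, fields are customarily called variables, and the two terms are used interchangeably below. There are many ways to create a dataset. In programming mode there are dedicated data-input methods for this, but they require typing the data in on the spot, which is slow and laborious. For large amounts of data it is better to build the dataset by other means first; otherwise most of the program will be spent on data entry.</p>
<p>  3.2 Overview of SAS programs:</p>
<p>  Like other computer languages, the SAS language (called the SCL language, SAS Component Language) has its own vocabulary (keywords) and syntax. Keywords, names, special characters, and operators are arranged according to the syntax rules to form SAS statements, and a group of SAS statements that carries out a complete function forms a SAS program. A SAS program consists of several steps and some control statements. It generally includes DATA steps and PROC steps; any combination of one or more DATA or PROC steps can form a SAS program, as long as it performs a complete function. A SAS program usually also contains global statements, which control options, variables, or the runtime environment across the whole program. A SAS statement generally begins with a keyword and ends with a semicolon, and one statement may span several lines (whenever SAS sees a semicolon, it treats everything between it and the previous semicolon as one statement, regardless of how many lines that covers). SAS statements are not case-sensitive, so you may use upper or lower case as you prefer.</p>
<p>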
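4. Logistic Regression:</p>
<p>  Logistic regression belongs to a class of statistical models called generalized linear models. This broad class includes ordinary regression and analysis of variance, as well as multivariate statistics such as analysis of covariance and loglinear regression. Agresti gives an excellent treatment of generalized linear models.</p>
<p>  Logistic regression predicts a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or any mix of these. The dependent variable is usually dichotomous, such as presence/absence or success/failure. Discriminant analysis is also used to predict membership when there are only two groups, but it can only be used with continuous independent variables. Logistic regression is therefore preferred when the independent variables are categorical or a mix of continuous and categorical.</p>
<p><b>  4.1 The Model:</b></p>
<p>  The dependent variable in logistic regression is usually dichotomous: it takes the value 1 when the event occurs and 0 when it does not. A variable of this type is called a Bernoulli (or binary) variable. Although less common and not discussed here, logistic regression has also been extended to dependent variables with more than two categories, known as multinomial or polytomous [Tabachnick and Fidell (1996) use the term polychotomous].</p>
<p>  As noted above, the independent or predictor variables in logistic regression can take any form; logistic regression makes no assumption about their distribution. They do not have to be normally distributed, linearly related, or of equal variance within each group. The relationship between the predictors and the dependent variable is not a linear function; instead, the logistic regression function is used, which is the logit transformation of $\theta$: $$\theta = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i}}$$</p>
<p>  Here $\theta$ is the probability that the event occurs, $\alpha$ is the intercept term, and $\beta$ are the coefficients of the predictor variables.</p>
<p>  An alternative form of the logistic regression equation is: $$\operatorname{logit}[\theta(x)] = \log\left[\frac{\theta(x)}{1-\theta(x)}\right] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i$$</p>
<p>  The goal of logistic regression is to correctly predict the outcome category for individual cases using the most parsimonious model.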
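As a rough illustration of the model in section 4.1, the sketch below maps a linear predictor to a probability and back through the logit. The function names and the intercept, coefficient, and predictor values are made-up assumptions for illustration; they are not estimates from any data in this document.

```python
import math

def linear_predictor(alpha, betas, xs):
    """Intercept plus the sum of coefficient * predictor value."""
    return alpha + sum(b * x for b, x in zip(betas, xs))

def inverse_logit(eta):
    """Logistic function: turns a linear predictor into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical intercept, coefficients, and predictor values.
alpha = -1.5
betas = [0.8, 0.3]
xs = [2.0, 1.0]

eta = linear_predictor(alpha, betas, xs)  # value on the log-odds scale
p = inverse_logit(eta)                    # probability that the event occurs
print(round(eta, 3), round(p, 3))
```

Applying `logit` to the resulting probability recovers the linear predictor, which is what the two forms of the equation express.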

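To achieve this, a model is built that contains the dependent variable and the independent variables used to predict its outcome. Several options are available during model building. Variables can be entered into the model in an order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or removed, which is called stepwise regression.</p>
<p>  Stepwise regression is used in the exploratory phase of research, but it is not recommended for theory testing (Menard 1995). Theory testing tests a priori theories or hypotheses about the relationships between variables. Exploratory testing makes no a priori assumptions about those relationships, so the goal of stepwise regression is to discover the relationships between the dependent variable and the independent variables.</p>
<p>  Backward stepwise regression appears to be the preferred method for exploratory analysis. The analysis starts with the full or saturated model, and variables are removed from it in an iterative process. After each variable is removed, the fit of the model is tested to make sure the model still fits the data adequately. When no more variables can be removed from the model, the analysis is complete.</p>
<p>  Logistic regression has two main uses. The first is predicting group membership. Because logistic regression computes the probability of success over the probability of failure, the results are expressed as an odds ratio. For example, logistic regression is often used in epidemiological studies, where the result of the analysis is the probability of developing cancer after controlling for other risk factors. Logistic regression also provides knowledge of the relationships between variables (for example, that smoking 10 packs of cigarettes carries a higher cancer risk than working in an asbestos mine). The coefficients are tested with several different techniques, each discussed below.</p>
<p>  4.2 Wald Test:</p>
<p>  The Wald test is used to test the statistical significance of each independent variable's coefficient (B) in the model, that is, whether it equals 0. The Wald test computes a Z statistic from the coefficient and its standard error: $$Z = \frac{B}{SE(B)}$$</p>
<p>  The Z value is then squared, giving a Wald statistic with a chi-square distribution. However, several authors have identified problems with the Wald test. Menard (1995) warns that for large coefficients the standard error is inflated, which lowers the Wald statistic. Agresti states that the likelihood-ratio test is more reliable than the Wald test for small samples.</p>
<p>  4.3 Likelihood-Ratio Test:</p>
<p>  The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model (L1) to the maximized value of the likelihood function for the simpler model (L0). The likelihood-ratio test statistic equals: $$2\log\left(\frac{L_1}{L_0}\right) = 2\left[\log(L_1) - \log(L_0)\right]$$</p>
<p>  This log transformation of the likelihood functions yields a chi-square statistic. It is the recommended test statistic when building a model through backward stepwise elimination.</p>
<p>  4.4 Hosmer-Lemeshow Goodness-of-Fit Test:</p>
<p>  The Hosmer-Lemeshow statistic evaluates goodness of fit by creating 10 ordered groups of subjects and then comparing the number actually in each group (observed) with the number predicted by the logistic regression model (predicted). The test statistic is thus a chi-square statistic, and the desirable outcome is non-significance, which indicates that the model's predictions do not differ significantly from the observations.</p>
<p>  The 10 ordered groups are created from the subjects' estimated probabilities: those with estimated probability below 0.1 form one group, and so on, up to those with probability 0.9 to 1.0. Each category is further divided into two groups according to the observed outcome (success, failure). The expected frequency for each cell is obtained from the model. If the model is good, most subjects with success fall in the higher-risk groups and most of those with failure in the lower-risk groups.</p>
<p><b>  Original Foreign-Language Text</b></p>
<p>  SAS Statistical Analysis Software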
And Logistic Regression</p>
<p>  1. Overview: SAS stands for Statistical Analysis System. It was first written by two biostatistics graduate students at the University of North Carolina, and in 1976 the SAS Institute was founded and the SAS software was formally launched. SAS is a large integrated information system for decision support, but its earliest functions were limited to statistical analysis, and statistical analysis is still an important part of its core functionality. the curr</p>
<p>  SAS is a modular software system made up of multiple functional modules, the basic one being the BASE SAS module. BASE SAS is the core of the SAS system: it carries out the main data management tasks, manages the user environment, processes the user's language, and calls the other SAS modules and products. In other words, running SAS starts with the BASE SAS module, which, besides its own data management, programming, and descriptive stat</p>
<p>  2. Operation</p>
<p>  SAS was developed from mainframe systems, and its core mode of operation is program-driven. After many years of development, SAS has become a complete programming language, and its user interface reflects this: it uses an MDI (Multiple Document Interface) in which the user enters programs in the PGM window and the results of the analysis are written as text to the OUTPUT window. Using programs, users can complete all their work, including statistical analysis, forecasting</p>
<p>  3. Basic operation and basic concepts of SAS</p>
<p>  3.1 Datasets (dataset) and libraries</p>
<p>  Statistical operations act on data. In SAS, a file that holds data is called a dataset, and datasets are stored in libraries (think of these as databases for now). SAS libraries are of two types, permanent and temporary. As the names suggest, a dataset in a permanent library persists (as long as you do not delete it), while, once you exit SAS, datasets in a temporary library are automatically delete</p>
<p>  3.2 SAS language</p>
<p>  Like other computer languages, the SAS language (known as the SCL language, SAS Component Language) has its own vocabulary (i.e. keywords) and grammar. Keywords, names, special characters, and operators are combined according to the grammar rules to form SAS statements, and a number of SAS statements that together carry out a complete function constitute a SAS program. A SAS program includes a number of steps and a number of control statements; in the general case, in</p>

<p>  4. Logistic Regression</p>
<p>  Logistic regression is part of a category of statistical models called generalized linear models. This broad class of models includes ordinary regression and ANOVA, as well as multivariate statistics such as ANCOVA and loglinear regression. An excellent treatment of generalized linear models is presented in Agresti (1996).</p>
<p>  Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent or response variable is dichotomous, such as presence/absence or success/failure. Discriminant analysis is also used to predict group membership with only two groups. However, discriminant analysis can only be used with continuous independent variables. Thus, in instances where the indepen</p>
<p>  4.1 The Model:</p>
<p>  The dependent variable in logistic regression is usually dichotomous, that is, the dependent variable can take the value 1 with a probability of success $\theta$, or the value 0 with probability of failure $1-\theta$. This type of variable is called a Bernoulli (or binary) variable. Although not as common and not discussed in this treatment, applications of logistic regression have also been extended to cases where the dependent variable has more than two categories, known as multinomial or polytomous [Tabachnick </p>
<p>  As mentioned previously, the independent or predictor variables in logistic regression can take any form. That is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related or of equal variance within each group. The relationship between the predictor and response variables is not a linear function in logistic regression; instead, the logistic regression function is used, which is the logit transforma</p>
<p>  Where $\alpha$ = the constant of the equation and $\beta$ = the coefficient of the predictor variables. An alternative form of the logistic regression equation is: $$\operatorname{logit}[\theta(x)] = \log\left[\frac{\theta(x)}{1-\theta(x)}\right] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i$$</p>
<p>  The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable. Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher or logistic regression can test the fit of the model after each coefficient is added </p>
<p>  Stepwise regression is used in the exploratory phase of research but it is not recommended for theory testing (Menard 1995). Theory testing is the testing of a priori theories or hypotheses of the relationships between variables. Exploratory testing makes no a priori assumptions regarding the relationships between the variables, thus the goal is to discover relationships.</p>
<p>  Backward stepwise regression appears to be the preferred method of exploratory analyses, where the analysis begins with a full or saturated model and variables are eliminated from the model in an iterative process. The fit of the model is tested after the elimination of each variable to ensure that the model still adequately fits the data. When no more variables can be eliminated from the model, the analysis has been completed.</p>
<p>  There are two main uses of logistic regression. The first is the prediction of group membership. Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio. For example, logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks. Logistic regression also provides knowledge of the </p>
<p>  4.2 Wald Test:</p>
<p>  A Wald test is used to test the statistical significance of each coefficient ($\beta$) in the model. A Wald test calculates a Z statistic, which is: $$Z = \frac{\hat{\beta}}{SE(\hat{\beta})}$$</p>
<p>  This Z value is then squared, yielding a Wald statistic with a chi-square distribution. However, several authors have identified problems with the use of the Wald statistic. Menard (1995) warns that for large coefficients, standard error is inflated, lowering the Wald statistic (chi-square) value. Agresti (1996) states that the likelihood-ratio test is more reliable for small sample sizes than the Wald test.</p>
<p>  4.3 Likelihood-Ratio Test:</p>
<p>  The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model (L1) over the maximized value of the likelihood function for the simpler model (L0). The likelihood-ratio test statistic equals: $$2\log\left(\frac{L_1}{L_0}\right) = 2\left[\log(L_1) - \log(L_0)\right]$$</p>
<p>  This log transformation of the likelihood functions yields a chi-squared statistic. This is the recommended test statistic to use when building a model through backward stepwise elimination.
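The Wald and likelihood-ratio statistics of sections 4.2 and 4.3 can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular package: the function names and all numeric inputs are hypothetical, and the likelihood-ratio p-value assumes the two models differ by exactly one parameter (1 degree of freedom), as when a single variable is dropped during backward elimination.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def wald_test(beta_hat, se):
    """Return (z, Wald chi-square, p-value) for one coefficient."""
    z = beta_hat / se
    wald = z * z
    return z, wald, chi2_sf_1df(wald)

def likelihood_ratio_test(loglik_simple, loglik_full):
    """Return (statistic, p-value) from two maximized log-likelihoods.

    Assumes the full model has exactly one more parameter than the
    simpler model, so the statistic is referred to chi-square with 1 df.
    """
    g = 2.0 * (loglik_full - loglik_simple)
    return g, chi2_sf_1df(g)

# Hypothetical estimate, standard error, and log-likelihoods.
z, wald, p_wald = wald_test(beta_hat=0.9, se=0.4)
g, p_lr = likelihood_ratio_test(loglik_simple=-68.0, loglik_full=-64.5)
print(z, wald, p_wald)
print(g, p_lr)
```

Both functions work on the log scale, so the likelihood ratio itself never has to be formed; only the difference of the two log-likelihoods is needed.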

</p>
<p>  4.4 Hosmer-Lemeshow Goodness of Fit Test:</p>
<p>  The Hosmer-Lemeshow statistic evaluates the goodness-of-fit by creating 10 ordered groups of subjects and then comparing the number actually in each group (observed) to the number predicted by the logistic regression model (predicted). Thus, the test statistic is a chi-square statistic with a desirable outcome of non-significance, indicating that the model prediction does not significantly differ from the observed.</p>
<p>  The 10 ordered groups are created based on their estimated probability; those with estimated probability below 0.1 form one group, and so on, up to those with probability 0.9 to 1.0. Each of these categories is further divided into two groups based on the actual observed outcome variable (success, failure). The expect
