版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
1、<p><b> 第六章</b></p><p><b> 大尺寸問題</b></p><p> 在這一章節(jié)我們討論一些方法,為了解決LS-SVM的大數(shù)據(jù)設(shè)定中的方法設(shè)計和歸類問題. </p><p> We explain Nystom method as proposed in the context
2、 of Gaussian processes and incomplete Cholesky factorization for low rank approximation. Then a new technique of fixed size LS-SVM is presented. In this fixed size LS-SVM method one solves the primal problem instead of t
3、he dual, after estimating the map to the feature space B based upon the eigenfunctions obtained form kernel PCA, which is explained in more detail in the next Chapter. This method gives explicit links between function es
4、timation </p><p> 6.1 Low rank approximation methods</p><p> 6.1.1 Nystrom method</p><p> Suppose one takes a linear kernel. We already mentioned that one can in fact equally wel
5、l solve then the primal problem as the dual problem. In fact solving the primal problem is more advantageous for larger data sets wile solving the dual problem is more suitable for large dimensional input </p><
6、;p> Fig.6.1 For linear support vector machines the dual problem is suitable for solving problems with large dimensional input spaces while the primal problem is convenient twords large data sets. However, for nonline
7、ar SVMs one has no expression for B(x), as a result one can only solve the dual problem in terms of the related kernel function. In the method of Fixed Size LS-SVM the Nystrom method is used to estimate eigenfunctions. A
8、fter obtaining estimates for B(x) and linking primal-dual formulatio</p><p> spaces because the unknowns are ERn and ERN, respectively, where n denotes the dimension for the input space and N the number of
9、given training data points. for example in the linear function estimation case one has 公式(6.1) by elimination of the error variables EK, which one can immediately solve. In this case the mapping B becomes B(XK) = (XK) an
10、d there is no need to solve the dual problem in the support values A, certainly not for large data sets.</p><p> For the nonlinear case on the other hand the situation is much more complicated. For many cho
11、ices of the kernel, B (~) may become infinite dimensional and hence also the W vector. However, one may still try in this case to find meaningful estimates for B (XK).</p><p> A procedure to find such estim
12、ates is implicitly done by the Nystrom method, which is well known in the area of integral equations [14; 63] and has been successfully applied in the context of Gaussian processes by Williams & Seeger in [294]. The
13、method is related to finding a low rank approximation to the given kernel matrix by randomly choosing M rows/columns of the kernel matrix. Let us denote the big kernel matrix by (N, N) and the small kernel matrix based o
14、n the random subsample (M, M) with</p><p> These insights are used then for solving in an approximate sense the linear system (6.8)without bias term in the model, as considered in Gaussian process regressio
15、n problems. By applying the Sherman-Morrison-Woodbury formula [98] one obtains [294]: (6.9) where 2 are calculated from (6.6) base upon 2 from the small matrix. In LS-SVM classification and regression one usually consi
16、ders a bias term which leads to centering of the kernel matrix. For application of the Nystrom method the eigenvalue </p><p> The Nystrom method approach has been applied to the Bayesian LS-SVM framework at
17、 the second level of inference while solving the level 1 problems without Nystrom method approximation by the conjugate gradient method [273]. In Table 6.1 this is illustrated on three data sets cra( leptograpsus crab),
18、rsy(Ripley synthetic data), hea (heart disease) according to [273]. In [273] it has also been illustrated that larger data sets such as the UCI adult data set, a successful approximation by the Nystro</p><p>
19、; 6.1.2 Incomplete Cholesky factorization</p><p> The Sherman-Morrison-Woodbury formula which is used within the Mystrom method has also been widely used in the context of interior point algorithms for lin
20、ear programming. However, as pointed out by Fine & Scheinberg in [79] it may lead to numerical difficulties. In [79] it has been illustrated with a simple example how the computed solutions may not even be close to t
21、he correct solution when applying the Woodbury formula (when the limited machine precision is taken into account). Small numerica</p><p> Table 6.1 application of the Mystrom approximation in the Bayesian L
22、S-SVM framework at the second level of inference. Shown are performances and inferred hyperparameters for different proportions 10% … 100% of the training data N. the standard deviation on 10 randomizations of the Nystro
23、m sample is given between parentheses.</p><p> Solutions may not even be close to the correct solution when applying the Woodbury formula (when the limited machine precision is taken into account ). Small n
24、umerical perturbations due to limited machine precision may cause numerical instabilities resulting into computed solutions that are far from the true solution.</p><p> A numerically stable method for kerne
25、l based methods has been studied in [79], which is based on a low rank update of the Cholesky factorization [98] of the kernel matrix for solving kernel based learning problems of the form 2 with kernel matrix 1. It is
26、 often the case that the exact low-rank representation of the kernel matrix 3 is not given or even does not exist when the rank of the kernel matrix is large. The best that one may hope then is to find a good approximati
27、on together with keepin</p><p> Fig.6.2 Fixed size LS-SVM: the number of support vectors is pre-fixed beforehand and the support vectors are actively selected from the pool of training data. After estimatin
28、g eigenfunctions the model is computed in the primal space with calculation of 1. In the working set of support vectors a point is randomly selected and replaced by a randomly selected point from the training data if thi
29、s new point improves an entropy criterion which is related to density estimation and kernel PCA.</p><p> 6.2 Fixed Size LS-SVMs</p><p> In the Nystrom method an approximate solution to the lin
30、ear system is computed based upon a random subsample of the given training data set. However, an important open question at this point is whether the random points can be selected in another way than just trying out seve
31、ral random subsamples and comparing these solutions on a separate validation set. We explain now how active selection of such support vectors can be done in relation to the Nystrom method for a fixed size LS-SVM.</p&g
32、t;<p> Fig. 6.3 In the method of fixed size LS-SVM the Nystrom method, kernel PCA and density estimation are linked to estimation of eigenfunctions and active selection of support vectors. Regression in the prima
33、l space is done in a next step.</p><p> 6.2.1 Estimation in primal weight space</p><p> As we discussed in (6.11), for large data sets it would be advantageous if one could solve the problem i
34、n the primal space (Fig.6.1). However, in the nonlinear case we would then need a explicit expression for 1 which is in principle only implicitly determined by the kernel trick. On the other hand, if one succeeds in find
35、ing a meaningful estimate of 2 one could solve the following ridge regression problem in the primal weight space with unknowns 3(6.11). Such an estimate can be made after obtai</p><p> The model takes the f
36、orm (6.13). The support values corresponding to the number of M support vectors are then (6.14) if we represent the model as (6.15). This approach gives explicit links between the primal and the dual representation. Howe
37、ver, the approximation is based on a random selection of the support vectors, but does not provide an algorithm for making a good selection of the support vectors.</p><p> 6.2.2 Active selection of support
38、vectors</p><p> In order to make a more suitable selection of the support vectors instead of a random selection, one can relate the Nystrom method to kernel principal component analysis, density estimation
39、and entropy criteria, as discussed by Girolami in [94]. These links will be explained in more detail in the Chapter on unsupervised learning and support vector machine formulations to kernel PCA.</p><p> In
40、 [94] an analysis is done of the quadratic Renyi entropy (6.16) in relation to kernel PCA and density estimation with (6.17) where 1v = [1; 1; …1;] and a normalized kernel is assumed with respect to density estimation. O
41、ne chooses a fixed size < then and actively selects points form the pool of training data as candidate support vectors (Fig.6.2). in the working set of support vectors a point is randomly selected and replaced by a ra
42、ndomly selected point from the training data set if the new p</p><p> Fixed size LS-SVM algorithm:</p><p> Given normalized and standardized training data {Xk, Yk}Nk = 1 with inputs Xk E Rn, o
43、utputs Yk E R and N training data.</p><p> Choose a working set with size M and impose in this way a number of M support vectors (typically M << N).</p><p> Randomly select a support vec
44、tor X* form the working set of M support vectors.</p><p> Randomly select a point Ct* from the N training data and replace X* by Xt* in the working set. If the entropy increases by taking the point Xt* inst
45、ead of X* then this point Xt* is rejected (and returned to the training data pool) and the support vector X* stays in the working set.</p><p> Calcuate the entropy value for the present working set.</p&g
46、t;<p> Stop if the change in entropy value (6.17) is small or the number of iterations is exceeded, otherwise go to (3).</p><p> Estimate W, E in the primal space after estimating the eigenfunctions
47、 from the Nystrom approximation.</p><p> Illustrative examples are shown in Figs.6.4-6.8 on a noisy sinc function and a double spiral classification problem, both with RBF kernels,. A bad initial choice of
48、the support vectors is intentionally taken, after which a self organizing process is taking place. Corresponding plots for the evolution of the entropy criterion are shown. An important aspect of the algorithm is that th
49、e determination of the support vectors is done independently of the regression problem of W, B after estimating E. </p><p> The fixed size LS-SVM method is also suitable for adaptive signal processing appli
50、cations where on-line updating of W, B and recursive methods for the eigenvalue decomposition can be applied. Both for recursive least squares and singular value decomposition updating, various efficient algorithms have
51、been developed in the past [109; 138; 165; 166; 167; 190; 298], which can be used at this point. Also for transductive inference the use of fixed size LS-SVM is very natural due to the fact that the</p><p>
52、 In Figs.6.4-6.5 N = 200 training data were generated for a sinc function with zero mean white Gaussian noise with standard deviation 0.1. A fixed size M = 10 was taken and an RBF kernel with Q2 = 3. in Fig.6.6 N = 20000
53、 were generated for a noisy sinc, with the same model. In Fig.6.7 N = 400 training data were generated and a fixed size LS-SVM model with RBF kernel was taken for M = 20.</p><p> An application of Fixed Siz
54、e LS-SVM with RBF kernel to the UCI adult data set with a training set of 22000 data and a validation set of 11000 data gives the following validation set performance indication: M = 20. The value of R is determined by B
55、ayesian inference.</p><p> 6.3 Basis construction in the feature space</p><p> Another approach to large scale problems is to find a suitable set of basis vectors in the feature space in order
56、 to build and span a large subspace (Fig.6.10). The following method has been proposed by Cawley [37] for LS-SVMs.</p><p> Fig.6.4 Fixed Size LS-SVM with M = 10 and N = 200 on a noisy sinc function. The sup
57、port vectors are actively selected by means of entropy criterion (6.17). A bad initial choice of support vectors (grey dots) is intentionally taken, which after a number of iterations leads to a uniform distribution of s
58、upport vectors over the x-axis. The figures show different stages of this self- organizing process.</p><p> The complete LS-SVM model based on N training data for function estimation takes the form (6.18) w
59、here a, b are the solution (6.19). For the solution vector in the primal weight space one has (6.20). In other words, W is represented by means of a basis 公式 consisting of N vectors. In order to find a sparse representat
60、ion one expresses W in terms of fewer basis vectors 公式. These vectors Qk all belong to the set of training data 公式. One expresses (6.21) where typically M << N. Such an approach has </p><p> Fig.6.5 (
61、Continued)(Top) entropy with respect to iteration steps; (Button) Corresponding error on an independent test set for the noisy sinc function of the previous figure.</p><p> Fig.6.6 Fixed size LS-SVM for a n
62、oisy sinc function with M = 10 and N = 20000. (Top) given training set; (Bottom) true sinc function (solid line), estimate of Fixed Size LS-SVM (dashed line), and support vectors (grey dots).</p><p> Fig.6.
63、7 Fixed size LS-SVM with M = 20 and N = 400 for a double spiral classification problem. The support vectors are actively selected by means of a entropy criterion. The figures show different stages of this self-organizing
64、 process of support vector selection.</p><p> Fig.6.8 (Continued) corresponding plot of the entropy during the iteration process.</p><p> In [37] the following objective function is taken for
65、determination of B, b: (6.23).this is a parametric optimization problem in B, b which leads to solving a much smaller linear system of dimension (M + 1)*(M + 1) instead of (N + 1)*(N + 1). However, one needs to know the
66、vectors Er before one can actually compute B, b.</p><p> The following method was proposed by Baudat & Amouar in [17] for finding the reduced set of basis vectors 公式 and has been further applied in [37]
67、. For each given data point Xk one can consider the ratio (6.24) where 公式 is the basis constituted by a subset S of selected training data points. In this way one characterized the quality of the approximation in the fea
68、ture space for the reduced set. By application of the kernel trick one obtains (6.25) where SS denotes the M * M submatrix of the kerne</p><p> Fig.6.9 Illustration of the selection of support vectors for a
69、 multi-spiral problem with 1600 training data points and a fixed size LS-SVM (M = 80) with RBF kernel (O = 0.25). The entropy criterion is used to select the support vectors (depicted by the squares). No class labels are
70、 shown for the training data in this unsupervised learning process.</p><p> Fig.6.10 Expressing an estimate for the model G in terms of a reduced set of basis vectors 公式 instead of 公式 with M <<N. a su
71、itable selection of the vectors Wr is made as a subset of the given training data set.</p><p> Fig.6.11 Mining large data sets using a committee network of LS-SVMs where the individual LS-SVMs are trained o
72、n subsets of the entire training data set.</p><p> 6.4 Combining submodels</p><p> 6.4.1 Committee network approach</p><p> In the first chapter we discussed committee methods fo
73、r combining results form estimated models which is well-known in the area of neural networks. The models could be MLPs but also different kind of networks. The committee of models aims at realizing the principle of “the
74、whole is more than the sum of its parts”, in the sense that the committee as a whole can perform better than each of the individual models in terms of the bias-variance trade-off [12; 23; 179].</p><p> We c
75、an use this method also to decompose the problem of training LS-SVMs into smaller problems consisting of less training data points and train individual LS-SVMs on each of the subproblems. Because the committee is a liner
76、 combination of the submodels, the functional form of the LS-SVM committee stays the same as for each of the individual submodels, i.e. a weighted sum of kernel functions evaluated at given data points. In Fig.6.11 and F
77、ig.6.12 this approach is shown and illustrated on the tr</p><p> Fig.6.12 Illustration of a committee network approach to the estimation of a noisy sinc function with 4000 given data points. (Top) In this i
78、llustration the given training data are divided into four intervals. For each of the four parts a separate LS-SVM has been trained on 1000 data points. the results of the four LS-SVMs are combined into an overall model b
79、y means of a committee network. (Bottom) true sinc function (solid line), estimate of the committee of LS-SVMs (dashed line). Better result</p><p> Fig.6.13 Results of the individual LS-SVMs for the example
80、 of the previous Figure, trained on each of the subintervals: true sinc function (solid line), individual LS-SVM estimates trained on one of the four intervals (dashed line).</p><p> The committee network t
81、hat condidts of the M submodels takes the form (6.27) where 公式, H(x) is the true function to be estimated and 公式 where (6.28) is the i-th LS-SVM model trained on the data 公式 with resulting support values aki, bias term B
82、i and kernel Ki for the i-th submodel and I = 1,…,m with m the number of LS-SVM submodels.</p><p> As explained in [12; 23; 179] one considers the covariance matrix (6.29) where in practice one works with a
83、 finite-sample approximation (6.30) and the N data are a representative subset of the overall training data set ( or the whole training data set itself). The committee error equals (6.31). An optimal choice of B follows
84、then from (6.32). From the Lagrangian with Lagrange multiplier X (6.33) one obtains the conditions for optimality: (6.34) with optimal solution (6.35) and corresponding comm</p><p> Committees have been suc
85、cessfully applied to SVM and Gaussian processes problems e.g. by Tresp [257; 258] and Collebert el al. [44]. The committee method is also related to boosting [75; 86]. A method of coupled local minimizers proposed in [24
86、8] may be considered as an extension to the committee network method towards solving general differentiable optimization problems. Individual optimizers correspond then to individual LS-SVMs which are interacting through
87、 state synchronization constraints, </p><p> 6.4.2 Multilayer networks of LS-SVMs</p><p> In a committee network one linearly combines the individual LS-SVM models. A natural extension could b
88、e to combine the networks in a nonlinear fashion. This might be done in several ways. One potion is to take an MLP which has the outputs of the individual LS-SVMs as input to the network. The LS-SVMs are first trained on
89、 subsets of data D1, D2, …, Dm and then serve in fact as a first nonlinear preprocessing layer after which an output layer is trained that is represented by an MLP network. Inste</p><p> A parametric approa
90、ch to the training of kernel based models was also taken e.g. by Tipping [253] in the relevance vector machine. A multilayer network of LS-SVMs where the combination of the models is determined by solving a parametric op
91、timization problem (e.g. by Bayesian learning), can be considered as a nonlinear extension of this. The relevance of the LS-SVM submodels is determined by the estimation of the second layer.</p><p> When ta
92、king an MLP in the second layer, the model is described by (6.37) with (6.38) where M denotes the number of individual LS-SVM models whose outputs Zi are the input to a MLP with output weight vector WERnh, hidden layer
93、matrix VERn*m and bias vector dErnh where Nh denotes the number of hidden units. On can take multiple outputs as well. The coefficients aki, bki for i = 1, …, m are the solutions to a number of m linear systems for each
94、of the individual LS-SVMs trained on data sets Di(Fig</p><p> In Fig.6.15 and fig.6.16 the approach is illustrated on the estimation of a noisy sinc function, both tor m= 4 and m = 40 combined LS-SVMs train
95、ed with RBF kernel Q = 5 and regularization constant R = 1. the outputs of these LS-SVMs are taken as input to an MLP with one hidden layer and 6 hidden units. The MLP is trained by Bayesian learning. The results are com
96、pared with taking (6.39) which corresponds also to one single neuron with linear activation function. Both are trained by the function t</p><p> Fig.6.14 Nonlinear combinations of trained LS-SVM models. Fir
97、st LS-SVM models are trained on subsets of data D1,…, Dm. Then a nonlinear combination of the LS-SVM network outputs is taken. This leads to a multilayer network where the nonlinear function G(.) can for example be repre
98、sented by a multilayer perceptron or by a second LS-SVM layer which is trained then with the outcomes of the LS-SVM models as input data. Alternatively, one may also view the LS-SVM models as a linear or nonlinear prepr&
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 計算機專業(yè)畢業(yè)設(shè)計文獻翻譯--動作腳本
- 計算機專業(yè)畢業(yè)設(shè)計外文文獻翻譯部分
- 計算機專業(yè)畢業(yè)設(shè)計外文翻譯
- 計算機專業(yè)畢業(yè)設(shè)計-外文翻譯
- 計算機專業(yè)畢業(yè)設(shè)計外文翻譯27
- 計算機專業(yè)畢業(yè)設(shè)計外文翻譯--internet
- 計算機專業(yè)外文翻譯(文獻翻譯)
- 計算機相關(guān)專業(yè)畢業(yè)設(shè)計電阻式觸摸屏--文獻翻譯
- 計算機專業(yè)畢業(yè)設(shè)計外文資料翻譯3
- 計算機專業(yè)畢業(yè)設(shè)計-外文翻譯--matlab 介紹
- 計算機專業(yè)畢業(yè)設(shè)計外文文獻及譯文
- 計算機專業(yè)外文翻譯(文獻翻譯)
- 130計算機專業(yè)畢業(yè)設(shè)計外文文獻翻譯:介紹java web開發(fā)
- 計算機畢業(yè)設(shè)計外文翻譯
- 計算機專業(yè)畢業(yè)設(shè)計文獻翻譯--一切都是對象
- 計算機專業(yè)畢業(yè)設(shè)計外文翻譯--jsp內(nèi)置對象
- 計算機專業(yè)畢業(yè)設(shè)計外文翻譯--數(shù)據(jù)庫
- 計算機專業(yè)畢業(yè)設(shè)計(論文)外文翻譯2篇
- 計算機專業(yè)畢業(yè)設(shè)計外文翻譯--ds1820
- 計算機專業(yè)畢業(yè)設(shè)計題目大全
評論
0/150
提交評論