Distance Metric Learning, with Application to Clustering with Side-Information

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell
University of California, Berkeley
Berkeley, CA 94720
epxing,ang,jordan,russell@cs.berkeley.edu
Abstract
Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many "plausible" ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider "similar." For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝⁿ, learns a distance metric over ℝⁿ that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
1 Introduction
The performance of many learning and data mining algorithms depends critically on their being given a good metric over the input space. For instance, K-means, nearest-neighbors classifiers and kernel algorithms such as SVMs all need to be given good metrics that reflect reasonably well the important relationships between the data. This problem is particularly acute in unsupervised settings such as clustering, and is related to the perennial problem of there often being no "right" answer for clustering: If three algorithms are used to cluster a set of documents, and one clusters according to the authorship, another clusters according to topic, and a third clusters according to writing style, who is to say which is the "right" answer? Worse, if an algorithm were to have clustered by topic, and if we instead wanted it to cluster by writing style, there are relatively few systematic mechanisms for us to convey this to a clustering algorithm, and we are often left tweaking distance metrics by hand.

In this paper, we are interested in the following problem: Suppose a user indicates that certain points in an input space (say, ℝⁿ) are considered by them to be "similar." Can we automatically learn a distance metric over ℝⁿ that respects these relationships, i.e., one that assigns small distances between the similar pairs? For instance, in the documents example, we might hope that, by giving it pairs of documents judged to be written in similar styles, it would learn to recognize the critical features for determining style.
One important family of algorithms that (implicitly) learn metrics are the unsupervised ones that take an input dataset, and find an embedding of it in some space. This includes algorithms such as Multidimensional Scaling (MDS) [2], and Locally Linear Embedding (LLE) [9]. One feature distinguishing our work from these is that we will learn a full metric d : ℝⁿ × ℝⁿ → ℝ over the input space, rather than focusing only on (finding an embedding for) the points in the training set. Our learned metric thus generalizes more easily to previously unseen data. More importantly, methods such as LLE and MDS also suffer from the "no right answer" problem: For example, if MDS finds an embedding that fails to capture the structure important to a user, it is unclear what systematic corrective actions would be available. (Similar comments also apply to Principal Components Analysis (PCA) [7].) As in our motivating clustering example, the methods we propose can also be used in a pre-processing step to help any of these unsupervised algorithms to find better solutions.

In the supervised learning setting, for instance nearest neighbor classification, numerous attempts have been made to define or learn either local or global metrics for classification. In these problems, a clear-cut, supervised criterion—classification error—is available and can be optimized for. (See also [11], for a different way of supervising clustering.) This literature is too wide to survey here, but some relevant examples include [10, 5, 3, 6], and [1] also gives a good overview of some of this work. While these methods often learn good metrics for classification, it is less clear whether they can be used to learn good, general metrics for other algorithms such as K-means, particularly if the information available is less structured than the traditional, homogeneous training sets expected by them.
In the context of clustering, a promising approach was recently proposed by Wagstaff et al. [12] for clustering with similarity information. If told that certain pairs are "similar" or "dissimilar," they search for a clustering that puts the similar pairs into the same, and dissimilar pairs into different, clusters. This gives a way of using similarity side-information to find clusters that reflect a user's notion of meaningful clusters. But similar to MDS and LLE, the ("instance-level") constraints that they use do not generalize to previously unseen data whose similarity/dissimilarity to the training set is not known. We will later discuss this work in more detail, and also examine the effects of using the methods we propose in conjunction with these methods.
2 Learning Distance Metrics
Suppose we have some set of points {x_i}_{i=1}^m ⊆ ℝⁿ, and are given information that certain pairs of them are "similar":

    S: (x_i, x_j) ∈ S if x_i and x_j are similar.    (1)

How can we learn a distance metric d(x, y) between points x and y that respects this; specifically, so that "similar" points end up close to each other?

Consider learning a distance metric of the form

    d(x, y) = d_A(x, y) = ||x − y||_A = sqrt((x − y)^T A (x − y)).    (2)
To ensure that this be a metric (satisfying non-negativity and the triangle inequality), we require that A be positive semi-definite, A ⪰ 0.^1 Setting A = I gives Euclidean distance; if we restrict A to be diagonal, this corresponds to learning a metric in which the different axes are given different "weights"; more generally, A parameterizes a family of Mahalanobis distances over ℝⁿ.^2 Learning such a distance metric is also equivalent to finding a rescaling of the data that replaces each point x with A^{1/2}x, and applying the standard Euclidean metric to the rescaled data; this will later be useful in visualizing the learned metrics.

^1 Technically, this also allows pseudometrics, where d(x, y) = 0 does not imply x = y.
^2 Note that, by putting the original dataset through a non-linear basis function φ and considering d(x, y) = sqrt((φ(x) − φ(y))^T A (φ(x) − φ(y))), non-linear distance metrics can also be learned.
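To make the rescaling equivalence concrete, here is a minimal numpy sketch (our illustration, not code from the original development; the names mahalanobis and A_half are ours) that computes d_A and checks that it matches the Euclidean distance after mapping each point x to A^{1/2}x:

import numpy as np

def mahalanobis(x, y, A):
    # d_A(x, y) = sqrt((x - y)^T A (x - y))
    d = x - y
    return np.sqrt(d @ A @ d)

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B.T @ B                          # A = B^T B is positive semi-definite

# Matrix square root via the eigendecomposition A = X diag(lam) X^T.
lam, X = np.linalg.eigh(A)
A_half = X @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ X.T

x, y = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(mahalanobis(x, y, A),
                  np.linalg.norm(A_half @ x - A_half @ y))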
A simple way of defining a criterion for the desired metric is to demand that pairs of points (x_i, x_j) in S have, say, small squared distance between them: min_A Σ_{(x_i,x_j)∈S} ||x_i − x_j||²_A. This is trivially solved with A = 0, which is not useful, and we add the constraint Σ_{(x_i,x_j)∈D} ||x_i − x_j||_A ≥ 1 to ensure that A does not collapse the dataset into a single point. Here, D can be a set of pairs of points known to be "dissimilar" if such information is explicitly available; otherwise, we may take it to be all pairs not in S. This gives the optimization problem:

    min_A   Σ_{(x_i,x_j)∈S} ||x_i − x_j||²_A      (3)
    s.t.    Σ_{(x_i,x_j)∈D} ||x_i − x_j||_A ≥ 1   (4)
            A ⪰ 0.                                 (5)
The choice of the constant 1 on the right-hand side of (4) is arbitrary but not important, and changing it to any other positive constant c results only in A being replaced by c²A. Also, this problem has an objective that is linear in the parameters of A, and both of the constraints are easily verified to be convex. Thus, the optimization problem is convex, which enables us to derive efficient, local-minima-free algorithms to solve it. We also note that, while one might consider various alternatives to (4), the constraint "Σ_{(x_i,x_j)∈D} ||x_i − x_j||²_A ≥ 1" would not be a good choice despite its giving a simple linear constraint. It would result in A always being rank 1 (i.e., the data are always projected onto a line).^3
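In code, evaluating the objective (3) and the constraint (4) is straightforward; the sketch below is our own illustration, where X is the data matrix and S and D are lists of index pairs:

import numpy as np

def objective(X, S, A):
    # sum over (i, j) in S of ||x_i - x_j||_A^2
    return sum((X[i] - X[j]) @ A @ (X[i] - X[j]) for i, j in S)

def constraint(X, D, A):
    # sum over (i, j) in D of ||x_i - x_j||_A; feasibility requires >= 1
    return sum(np.sqrt((X[i] - X[j]) @ A @ (X[i] - X[j])) for i, j in D)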
2.1 The case of diagonal A
In the case that we want to learn a diagonal A = diag(A_11, A_22, …, A_nn), we can derive an efficient algorithm using the Newton-Raphson method. Define

    g(A) = g(A_11, …, A_nn) = Σ_{(x_i,x_j)∈S} ||x_i − x_j||²_A − log( Σ_{(x_i,x_j)∈D} ||x_i − x_j||_A ).

It is straightforward to show that minimizing g (subject to A ⪰ 0) is equivalent, up to a multiplication of A by a positive constant, to solving the original problem (3–5). We can thus use Newton-Raphson to efficiently optimize g.^4
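As an illustration of the diagonal case, the following sketch optimizes g directly; note that the paper's algorithm uses Newton-Raphson, whereas for brevity this hypothetical version uses simple projected gradient descent on the same objective (all names and step-size choices are ours):

import numpy as np

def learn_diag_metric(X, S, D, lr=1e-3, iters=1000, eps=1e-12):
    # Minimize g(w) = sum_S ||x_i - x_j||_w^2 - log(sum_D ||x_i - x_j||_w)
    # over the diagonal weights w = (A_11, ..., A_nn), keeping w >= 0.
    dS = np.array([X[i] - X[j] for i, j in S]) ** 2   # squared coordinate diffs
    dD = np.array([X[i] - X[j] for i, j in D]) ** 2
    w = np.ones(X.shape[1])
    for _ in range(iters):
        grad = dS.sum(axis=0)                         # gradient of the S term
        dists = np.sqrt(dD @ w + eps)                 # ||x_i - x_j||_w over D
        # gradient of -log(sum_D dists)
        grad -= (dD / (2 * dists[:, None])).sum(axis=0) / dists.sum()
        w = np.maximum(w - lr * grad, 0.0)            # project onto w >= 0
    return w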
2.2 The case of full A
In the case of learning a full matrix A, the constraint that A ⪰ 0 becomes slightly trickier to enforce, and Newton's method often becomes prohibitively expensive (requiring O(n⁶) time to invert the Hessian over the n² parameters). Using gradient descent and the idea of iterative projections (e.g., [8]) we derive a different algorithm for this setting.
    Iterate
        Iterate
            A := arg min_{A'} { ||A' − A||_F : A' ∈ C_1 }
            A := arg min_{A'} { ||A' − A||_F : A' ∈ C_2 }
        until A converges
        A := A + α ∇_A g(A)
    until convergence

Figure 1: Gradient ascent + Iterative projection algorithm. Here, ||·||_F is the Frobenius norm on matrices (||M||²_F = Σ_ij M²_ij).
We pose the equivalent problem:

    max_A   g(A) = Σ_{(x_i,x_j)∈D} ||x_i − x_j||_A       (6)
    s.t.    f(A) = Σ_{(x_i,x_j)∈S} ||x_i − x_j||²_A ≤ 1  (7)
            A ⪰ 0.                                        (8)
We will use a gradient ascent step on g(A) to optimize (6), followed by the method of iterative projections to ensure that the constraints (7) and (8) hold. Specifically, we will repeatedly take a gradient step A := A + α ∇_A g(A), and then repeatedly project A into the sets C_1 = {A : Σ_{(x_i,x_j)∈S} ||x_i − x_j||²_A ≤ 1} and C_2 = {A : A ⪰ 0}. This gives the algorithm shown in Figure 1.^5
The motivation for the specific choice of the problem formulation (6–8) is that projecting onto C_1 or C_2 can be done inexpensively. Specifically, the first projection step, A := arg min_{A'} { ||A' − A||_F : A' ∈ C_1 }, involves minimizing a quadratic objective subject to a single linear constraint; the solution to this is easily found by solving (in time O(n²)) a sparse system of linear equations. The second projection step, onto C_2, the space of all positive-semidefinite matrices, is done by first finding the diagonalization A = X Λ X^T, where Λ = diag(λ_1, …, λ_n) is a diagonal matrix of A's eigenvalues and the columns of X contain A's corresponding eigenvectors, and taking A' = X Λ' X^T, where Λ' = diag(max(0, λ_1), …, max(0, λ_n)). (E.g., see [4].)
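The following sketch (ours; it assumes a fixed step size α and fixed iteration counts rather than convergence tests) illustrates the full-matrix algorithm of Figure 1. Since f(A) = ⟨G_S, A⟩ is linear in A, where G_S = Σ_{(x_i,x_j)∈S} (x_i − x_j)(x_i − x_j)^T, the set C_1 is a half-space under the Frobenius inner product and projection onto it is a rank-one correction; projection onto C_2 clips negative eigenvalues as described above:

import numpy as np

def learn_full_metric(X, S, D, alpha=0.1, outer=100, inner=50, eps=1e-12):
    n = X.shape[1]
    # G_S collects the similar-pair outer products, so f(A) = <G_S, A>.
    G_S = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in S)
    A = np.eye(n)
    for _ in range(outer):
        for _ in range(inner):
            # Project onto C_1 = {A : <G_S, A> <= 1}, a half-space.
            viol = (G_S * A).sum() - 1.0
            if viol > 0:
                A = A - (viol / (G_S * G_S).sum()) * G_S
            # Project onto C_2 = {A : A >= 0} by clipping eigenvalues.
            lam, V = np.linalg.eigh((A + A.T) / 2)
            A = (V * np.maximum(lam, 0.0)) @ V.T
        # Gradient ascent on g(A) = sum_D ||x_i - x_j||_A, with
        # grad g = sum_D (x_i - x_j)(x_i - x_j)^T / (2 ||x_i - x_j||_A).
        grad = np.zeros((n, n))
        for i, j in D:
            d = X[i] - X[j]
            grad += np.outer(d, d) / (2.0 * np.sqrt(d @ A @ d) + eps)
        A = A + alpha * grad
    return A

A practical version would replace the fixed inner/outer iteration counts with convergence tests on A, as in Figure 1.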
3 Experiments and Examples
We begin by giving some examples of distance metrics learned on artificial data, and then show how our methods can be used to improve clustering performance.

3.1 Examples of learned distance metrics
Consider the data shown in Figure 2(a), which is divided into two classes (shown by the different symbols and, where available, colors). Suppose that points in each class are "similar" to each other, and we are given S reflecting this.^6 Depending on whether we learn a diagonal or a full A, we obtain:

    A_diagonal = diag(1.036, 1.007, …),

    A_full =
        [ 3.245  3.286  0.081 ]
        [ 3.286  3.327  0.082 ]
        [ 0.081  0.082  0.002 ]
To visualize this, we can use the fact discussed earlier that learning A is equivalent to finding a rescaling of the data x ↦ A^{1/2}x, that hopefully "moves" the similar pairs together. Figure 2(b,c) shows the result of plotting A^{1/2}x. As we see, the algorithm has successfully brought together the similar points, while keeping dissimilar ones apart. Figure 3 shows a similar result for a case of three clusters whose centroids differ only in the x and y directions. As we see in Figure 3(b), the learned diagonal metric correctly ignores the z direction. Interestingly, in the case of a full A, the algorithm finds a surprising projection of the data onto a line that still maintains the separation of the clusters well.

Figure 2: (a) Original data, with the different classes indicated by the different symbols (and colors, where available). (b) Rescaling of data corresponding to learned diagonal A. (c) Rescaling corresponding to full A. [Panels: "2-class data (original)", "2-class data projection (Newton)", "2-class data projection (IP)".]

Figure 3: (a) Original data. (b) Rescaling corresponding to learned diagonal A. (c) Rescaling corresponding to full A. [Panels: "3-class data (original)", "3-class data projection (Newton)", "3-class data projection (IP)".]

3.2 Application to clustering
One application of our methods is "clustering with side information," in which we learn a distance metric using similarity information, and cluster data using that metric. Specifically, suppose we are given S, and told that each pair (x_i, x_j) ∈ S means x_i and x_j belong to the same cluster. We will consider four algorithms for clustering:
1. K-means using the default Euclidean metric ||x_i − μ||² between points x_i and cluster centroids μ to define distortion (and ignoring S).
2. Constrained K-means: K-means but subject to points (x_i, x_j) ∈ S always being assigned to the same cluster [12].^7
3. K-means + metric: K-means but with distortion defined using the distance metric ||x_i − μ||²_A learned from S (see the sketch following this list).
4. Constrained K-means + metric: Constrained K-means using the distance metric learned from S.
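To illustrate how a learned A enters algorithms 3 and 4, here is a hypothetical sketch of the modified K-means assignment step; the centroid update is unchanged, since the mean still minimizes the within-cluster distortion Σ ||x_i − μ||²_A for any positive semi-definite A:

import numpy as np

def assign_with_metric(X, centroids, A):
    # Assign each point to the centroid minimizing ||x - mu||_A^2.
    labels = np.empty(len(X), dtype=int)
    for idx, x in enumerate(X):
        d = x - centroids                  # (k, n) row-wise differences
        labels[idx] = np.argmin(np.einsum('kn,nm,km->k', d, A, d))
    return labels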
1. K-means: Accuracy = 0.4975
2. Constrained K-means: Accuracy = 0.5060
3. K-means + metric: Accuracy = 1
4. Constrained K-means + metric: Accuracy = 1

Figure 4: (a) Original 2-class dataset. (b) Data scaled according to learned metric. (One learned metric's result is shown, but the other gave visually indistinguishable results.)
Let c_i be the cluster to which point x_i is assigned by an automatic clustering algorithm, and let ĉ_i be some "correct" or desired clustering of the data. Following [?], in the case of 2-cluster data, we will measure how well the c_i's match the ĉ_i's according to

    Accuracy = Σ_{i>j} 1{ 1{c_i = c_j} = 1{ĉ_i = ĉ_j} } / (0.5 m(m − 1)),

where 1{·} is the indicator function. In the case of many (> 2) clusters, this evaluation metric tends to give inflated scores, since almost any clustering will correctly predict that most pairs are in different clusters. In this setting, we therefore modified the measure, averaging not only over pairs (i, j) drawn uniformly at random, but over pairs drawn from the same cluster (as determined by ĉ) with chance 0.5 and from different clusters with chance 0.5, so that "matches" and "mis-matches" are given the same weight. All results reported here used K-means with multiple restarts, and are averages over at least 20 trials (except for wine, 10 trials).^9
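As a sketch of these two measures (our own implementation of the formula above, with rejection sampling for the balanced variant; it assumes ĉ contains at least two clusters):

import numpy as np

def pair_accuracy(c, c_true):
    # Fraction of pairs i > j on which the clusterings agree about
    # same-cluster vs. different-cluster membership.
    m = len(c)
    agree = sum((c[i] == c[j]) == (c_true[i] == c_true[j])
                for i in range(m) for j in range(i))
    return agree / (0.5 * m * (m - 1))

def balanced_pair_accuracy(c, c_true, n_pairs=10000, seed=0):
    # Variant for > 2 clusters: sample same-cluster pairs (under c_true)
    # with chance 0.5 and different-cluster pairs with chance 0.5.
    rng = np.random.default_rng(seed)
    c, c_true = np.asarray(c), np.asarray(c_true)
    hits = 0
    for _ in range(n_pairs):
        want_same = rng.random() < 0.5
        while True:                        # rejection-sample a suitable pair
            i, j = rng.integers(len(c), size=2)
            if i != j and (c_true[i] == c_true[j]) == want_same:
                break
        hits += (c[i] == c[j]) == want_same
    return hits / n_pairs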
S was generated by picking a random subset of all pairs of points sharing the same class ĉ_i.^8 In the case of "little" side-information, the size of the subset was chosen so that the number of resulting connected components (see footnote 7) would be very roughly 90% of the size of the original dataset. In the case of "much" side-information, this was changed to 70%.
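A minimal sketch (ours) of this sampling procedure follows; K_c is the number of connected components of the graph whose vertices are the points and whose edges are the pairs in S, computed here with a small union-find:

import numpy as np

def sample_side_info(labels, n_pairs, seed=0):
    # Pick a random subset of all same-class pairs to form S.
    rng = np.random.default_rng(seed)
    same = [(i, j) for i in range(len(labels)) for j in range(i)
            if labels[i] == labels[j]]
    idx = rng.choice(len(same), size=n_pairs, replace=False)
    return [same[k] for k in idx]

def num_connected_components(m, S):
    # Union-find over m points with the pairs in S as edges.
    parent = list(range(m))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    comps = m
    for i, j in S:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            comps -= 1
    return comps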
1. K-means: Accuracy = 0.4993
2. Constrained K-means: Accuracy = 0.5701
3. K-means + metric: Accuracy = 1
4. Constrained K-means + metric: Accuracy = 1

Figure 5: (a) Original dataset. (b) Data scaled according to learned metric. (One learned metric's result is shown, but the other gave visually indistinguishable results.)

Figure 6: Clustering accuracy on 9 UCI datasets. In each panel, the six bars on the left correspond to an experiment with "little" side-information, and the six on the right to "much" side-information. From left to right, the six bars in each set are respectively K-means, K-means + diagonal metric, K-means + full metric, Constrained K-means (C-Kmeans), C-Kmeans + diagonal metric, and C-Kmeans + full metric. Also shown are N: size of dataset; C: number of classes/clusters; d: dimensionality of data; K_c: mean number of connected components (see footnotes 7, 9). 1 s.e. bars are also shown. [Panels: Boston housing (N=506, C=3, d=13; K_c=447/354), ionosphere (N=351, C=2, d=34; K_c=269/187), Iris plants (N=150, C=3, d=4; K_c=133/116), wine (N=168, C=3, d=12; K_c=153/127), balance (N=625, C=3, d=4; K_c=548/400), breast cancer (N=569, C=2, d=30; K_c=482/358), soy bean (N=47, C=4, d=35; K_c=41/34), protein (N=116, C=6, d=20; K_c=92/61), diabetes (N=768, C=2, d=8; K_c=694/611).]
Figure 7: Plots of accuracy vs. amount of side-information on the protein and wine datasets. Here, the x-axis gives the fraction of all pairs of points in the same class that are randomly sampled to be included in S. The six curves in each panel are K-means, C-Kmeans, K-means + metric (diag A), C-Kmeans + metric (diag A), K-means + metric (full A), and C-Kmeans + metric (full A).
As Figure 6 shows, the methods using the learned metrics typically outperform the baselines, often by a significant margin. Not surprisingly, we also see that having more side-information in S typically leads to metrics giving better clusterings.
Figure 7 also shows two typical examples of how the quality of the clusterings found increases with the amount of side-information. For some problems (e.g., wine), our algorithm learns good diagonal and full metrics quickly with only a very small amount of side-information; for some others (e.g., protein), the distance metric, particularly the full metric, appears harder to learn and provides less benefit over constrained K-means.
4 Conclusions
We have presented an algorithm that, given examples of similar pairs of points in ℝⁿ, learns a distance metric that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allowed us to derive efficient, local-optima-free algorithms. We also showed examples of diagonal and full metrics learned from simple artificial examples, and demonstrated on artificial and on UCI datasets how our methods can be used to improve clustering performance.
References

[1] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 1996.
[2] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.
[3] C. Domeniconi and D. Gunopulos. Adaptive nearest neighbor classification using support vector machines. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[4] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Univ. Press, 1996.
[5] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:607–616, 1996.
[6] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999.
[7] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 19.
[8] R. Rockafellar. Convex Analysis. Princeton Univ. Press, 1970.
[9] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[10] B. Scholkopf and A. Smola. Learning with Kernels. In Press, 2001.
[11] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Allerton Conference on Communication, Control and Computing, 1999.
[12] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. In Proc. 18th International Conference on Machine Learning, 2001.