On detection methods and analysis of malware
On detection methods and analysis of malware
Jean-Yves Marion, LORIA
Thursday 1 November 2012

1. A quick tour of malware detection methods
2. Behavioral analysis using model-checking
3. Cryptographic function identification

What is a malware?
• A malware is a program which has malicious intentions
• A malware is a virus, a worm, a botnet ...
• Giving a mathematical definition is difficult
  - How to protect a system?
  - How to detect a malware?

Why trace? (1/3)
• Definition: binary analysis is program analysis where the program is unknown
  ⇒ we only have a binary blob
• Reasons:
  - indirect jumps ⇒ control flow is undecidable
  - indirect reads/writes ⇒ data flow is undecidable
  - self-modifying code ⇒ syntax is undecidable

Code protection
Detection is hard because malware is protected:
1. Obfuscation
2. Cryptography
3. Self-modification
4. Anti-analysis tricks
[Figure: Win32.Swizzor packer displayed by IDA]

A common protection scheme for malware
layer 1 → Decrypt → layer 2 → Decrypt → ... → Decrypt → payload

Packer protections
• Example (4/5): hostname packed with Themida
• Different code waves with their relations
[Figures: Themida packer; Yoda packer]

Malware detection by string scanning
• A signature is a regular expression denoting a sequence of bytes
Pros:
• Accuracy: low rate of false positives
  ➡ programs which are not malware are not detected
• Efficiency: fast string matching algorithms
  ➡ Karp & Rabin, Knuth, Morris & Pratt, Boyer & Moore
Cons:
• Signatures are quasi-manually constructed
• Signatures are not robust to malware protections
  ➡ Mutations, code obfuscations, ...
• Static analysis of binaries is very difficult

Detection by integrity check
• Identify a file using a hash function
[Figure: files a, b → hash function → hash numbers (numerical fingerprints)]
Cons:
• File systems are updated, so numerical fingerprints change
• Difficult to maintain in practice
• A file may change and still keep the same numerical fingerprint (hash collisions)
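The integrity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 is chosen here as the hash function (the slides do not name one), and the file names and contents are hypothetical in-memory stand-ins for real files.

```python
import hashlib
from typing import Dict, List


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of some file contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def changed_files(baseline: Dict[str, str], current: Dict[str, bytes]) -> List[str]:
    """List the names whose current fingerprint differs from the recorded baseline."""
    return sorted(
        name for name, data in current.items()
        if baseline.get(name) != fingerprint(data)
    )


# Hypothetical file contents standing in for files on disk.
files = {"a": b"original contents of a", "b": b"original contents of b"}
baseline = {name: fingerprint(data) for name, data in files.items()}

# Any modification of "b" changes its fingerprint, whether it is an
# infection or a legitimate update: the check cannot tell them apart.
files["b"] = files["b"] + b"\x90\x90 appended code"
print(changed_files(baseline, files))  # ['b']
```

The sketch also shows the first drawback listed on the slide: the baseline has to be recomputed after every legitimate update, which is what makes the approach hard to maintain.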
Behavioral detection
• Identification of a sequence of actions:
  - System calls or library calls, network interactions, ...
• Two approaches:
  - Anomaly detection from a set of normal behaviours
  - Detection from a set of potentially malicious behaviours
Cons:
• Difficult to have a set of normal or bad behaviours
• Difficult to maintain in practice
• Functional obfuscations: several possible implementations of a high-level action
Two ways of writing into a file:
    h=fopen(C:\windows\sys.dll);fwrite(«test»,h)
    h=createFile(C:\windows\sys.dll);writeFile(h,«test»)

Anti-virus tests against unknown threats
• Source: "A study of anti-virus response to unknown threats" by C. Devine and N. Richaud (EICAR 2009)

[Excerpts from the study shown on the slide]

Limitations:
• The tests only cover the evaluation versions of the aforementioned anti-virus products and may generate different results from the paying versions.
• This study focuses on HIPS-like (on-the-fly) detection of malicious behavior. Henceforth, it may not be relevant to the scanning capabilities of said products.
• This set of tests is limited and does not cover typical methods for malware to become persistent across reboots (such as the adding or modification of registry keys).
• Each test program was run for a limited amount of time (typically, one minute).

Keylogging tests:
This series of tests includes six keylogging techniques, three of which can be run from user space and do not require administration privileges. Others require the loading of a kernel driver, and were originally developed by Thomas Sabono [12] for the purpose of testing anti-rootkit programs.
• [U] testA01: The GetRawInputData() API was introduced in Windows XP to access input devices at a low level, mainly for DirectX-enabled games. This function was documented on the Firewall Leak Tester web site.
• [U] testA02 installs a WH_KEYBOARD_LL windows hook to capture all keyboard events (contrary to the WH_KEYBOARD hook, it does not inject a DLL into other processes).
• [U] testA03: The GetAsyncKeyState() API allows querying the state of the keyboard asynchronously.
• [A] testA11 hooks the keyboard driver's IRP_MJ_READ function.
• [A] testA12 hooks the keyboard driver's Interrupt Service Request.
• [A] testA13 installs a "chained" device driver which places itself between the keyboard driver and upper level input device drivers.
The tests were run for one minute, during which keys were entered. The output of each sample was then checked.

Results (rows legible on the slide: avast!, AVG, Avira, BitDefender, ESET): every legible cell, for tests 01 to 13, reads "No alert; keys logged."

Selection of anti-virus programs to be tested:
The choice of products to be tested was based on their popularity, in order to cover the largest possible installed base. Furthermore, time constraints would not have allowed testing the full range of all available anti-virus programs. Considering no recent and freely available study of anti-virus market share could be found, we relied on three denominators to make our decision:
• Download statistics for the Anti-virus section of the Softpedia website [7],
• Google number of results for the query "download [name of anti-virus X]",
• The anti-virus vendor had to provide a free evaluation version of his product.
An initial list of thirty-eight products was retrieved from Virustotal [8]. This list was then narrowed down to twelve products chosen for subsequent testing, shown as follows:

Product name                                       Version tested
avast! professional edition                        4.8.1296
AVG Internet Security                              8.0.200
Avira Premium Security Suite                       8.2.0.252
BitDefender Total Security 2009                    12.0.11.2
ESET Smart Security (NOD32)                        3.0.672.0
F-Secure Internet Security 2009                    9.00 build 149
Kaspersky Anti-Virus For Windows Workstations      6.0.3.837
McAfee Total Protection 2009                       13.0.218
Norton 360 Version 2.0                             2.5.0.5
Panda Internet Security 2009                       14.00.00
Sophos Anti-Virus & Client Firewall                7.6.2
Trend Micro Internet Security Pro                  17.0.1305

Table 1: List of evaluated anti-virus programs

A Windows XP operating system (English version) with SP3 integrated was installed inside a VMware virtual machine. Two accounts were created, one with administrative rights (named "localadmin"), and another without ("localuser"). No additional patches or configuration changes were applied. A snapshot of the virtual machine state was then made, which served as the install base as well as the control subject.

On detection methods and analysis of malware
1. A quick tour of malware detection methods
2. Behavioral analysis using model-checking
3. Cryptographic function identification
Joint work with Philippe Beaucamps and Isabelle Gnaedig, Esorics 2012
Low level traces

Allaple.a excerpt:

    void scan_dir(const char* dir) {
      HANDLE hFind;
      char szFilename[2048];
      WIN32_FIND_DATA findData;
      sprintf(szFilename, "%s\\%s", dir, "*.*");
      hFind = FindFirstFile(szFilename, &findData);
      if (hFind == INVALID_HANDLE_VALUE) return;
      do {
        sprintf(szFilename, "%s\\%s", dir, findData.cFileName);
        if (findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
          scan_dir(szFilename);
        else { ... }
      } while (FindNextFile(hFind, &findData));
      FindClose(hFind);
    }

    void main(int argc, char** argv) {
      /* Behavior pattern: ping of a remote host */
      HANDLE hIcmp;
      const char* icmpData = "Babcdef...";
      char reply[128];
      hIcmp = IcmpCreateFile();
      for (int i = 0; i < 2; ++i)
        IcmpSendEcho(hIcmp, ipaddr, icmpData, 10, NULL, reply, 128, 1000);
      IcmpCloseHandle(hIcmp);

      /* Behavior pattern: Netbios connection */
      SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
      struct sockaddr_in sin = { AF_INET, ipaddr, htons(139) /* Netbios */ };
      if (connect(s, (SOCKADDR*)&sin, sizeof(sin)) != SOCKET_ERROR) { ... }

      /* Behavior pattern: scanning of local drives */
      char buffer[1024];
      GetLogicalDriveStrings(sizeof(buffer), buffer);
      const char* szDrive = buffer;
      while (*szDrive) {
        if (GetDriveType(szDrive) == DRIVE_FIXED)
          scan_dir(szDrive);
        szDrive += strlen(szDrive) + 1;
      }
    }

Example execution trace of library calls:
    ...GetLogicalDriveStrings.GetDriveType.FindFirstFile.FindFirstFile.FindNextFile...

Traces are finite terms:
    FindFirstFile(x,y).FindNextFile(z,x).FindNextFile(z,x).FindClose(z).
    IcmpSendEcho(u,...).IcmpSendEcho(u,...).IcmpCloseHandle(u)....
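To make the idea of "traces as finite terms" concrete, here is a minimal Python sketch of one possible representation. The `Call` class and the `render` helper are hypothetical names introduced only for this illustration; they are not taken from the talk or from any tracing tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Call:
    """One library call of a trace: a function name applied to symbolic arguments."""
    name: str
    args: Tuple[str, ...] = ()

    def __str__(self) -> str:
        return f"{self.name}({','.join(self.args)})"


def render(trace: List[Call]) -> str:
    """Write a trace in the dotted notation used on the slide."""
    return ".".join(str(call) for call in trace)


def names_only(trace: List[Call]) -> str:
    """Forget the arguments, keeping only the sequence of called functions."""
    return ".".join(call.name for call in trace)


trace = [
    Call("FindFirstFile", ("x", "y")),
    Call("FindNextFile", ("z", "x")),
    Call("FindNextFile", ("z", "x")),
    Call("FindClose", ("z",)),
]
print(render(trace))      # FindFirstFile(x,y).FindNextFile(z,x).FindNextFile(z,x).FindClose(z)
print(names_only(trace))  # FindFirstFile.FindNextFile.FindNextFile.FindClose
```

Keeping the arguments, rather than only the function names, is what later lets a pattern require that two calls operate on the same handle or socket.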
Program behaviour
• The program behaviour is given by sequences of system calls
  - represented by a set L of terms
• How to collect traces?

Static analysis
• A good approximation of a set of execution traces
• Good detection coverage
• But static analysis is difficult to perform

Dynamic analysis
• Collect an execution trace (using Pin)
• Monitor program interactions (system calls, network calls, ...)
• What is the detection coverage? Partial behaviours ...

Trace abstraction
- Several ways to send a ping:
  1. Call the socket function with the parameter IPPROTO_ICMP and then call the sendto function with ICMP_ECHOREQ
  2. Call the IcmpSendEcho function
- Abstract the ping behaviour by a predicate PING(x) to represent a ping on socket x
- Define an abstraction relation R as a term rewrite system:
    socket(x,u).sendto(x,v,y) → socket(x,u).sendto(x,v,y).PING(x)
    IcmpSendEcho(x) → IcmpSendEcho(x).PING(x)
- We abstract/rewrite a pattern on a trace only once
- As a result, we have a terminating and rational abstraction system:
    IcmpSendEcho(u,...).IcmpSendEcho(u,...).IcmpCloseHandle(u)
    → IcmpSendEcho(u,...).PING(u).IcmpSendEcho(u,...).IcmpCloseHandle(u)
    → IcmpSendEcho(u,...).PING(u).IcmpSendEcho(u,...).PING(u).IcmpCloseHandle(u)
- We keep the LHS to deal with complex patterns

Computation of abstract trace language
Abstract a trace language L by reducing it w.r.t. an abstraction relation R:
    L → ... → L↓
Theorem: Let R be a rational abstraction relation and L be a trace language. If L is regular then so is L↓.
- Based on tree automata methods
Related work
- Martignoni et al. 2008: multi-layered abstraction on a single trace
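The two ping rules of the abstraction relation R can be illustrated on a single finite trace with the Python sketch below. It is a simplification under stated assumptions, not the tree-automata construction the talk relies on: calls are encoded as hypothetical `(name, args)` tuples, the `socket`/`sendto` rule is only matched on adjacent calls carrying the constants `IPPROTO_ICMP` and `ICMP_ECHOREQ`, and `abstract_ping` and `render` are names made up for the example.

```python
from typing import List, Tuple

Call = Tuple[str, Tuple[str, ...]]  # (function name, symbolic arguments)


def abstract_ping(trace: List[Call]) -> List[Call]:
    """Insert PING(x) after each occurrence of a ping pattern, keeping the pattern (LHS)."""
    out: List[Call] = []
    # The loop reads only the original trace, never the output, so every
    # occurrence is abstracted exactly once and the rewriting terminates.
    for i, (name, args) in enumerate(trace):
        out.append((name, args))
        if name == "IcmpSendEcho" and args:
            out.append(("PING", (args[0],)))
        elif name == "sendto" and i > 0 and len(args) >= 2:
            prev_name, prev_args = trace[i - 1]
            if (prev_name == "socket" and len(prev_args) >= 2
                    and prev_args[0] == args[0]
                    and prev_args[1] == "IPPROTO_ICMP"
                    and args[1] == "ICMP_ECHOREQ"):
                out.append(("PING", (args[0],)))
    return out


def render(trace: List[Call]) -> str:
    """Write a trace in dotted term notation."""
    return ".".join(f"{name}({','.join(args)})" for name, args in trace)


trace = [("IcmpSendEcho", ("u",)), ("IcmpSendEcho", ("u",)), ("IcmpCloseHandle", ("u",))]
print(render(abstract_ping(trace)))
# IcmpSendEcho(u).PING(u).IcmpSendEcho(u).PING(u).IcmpCloseHandle(u)
```

On the `IcmpSendEcho` trace from the slide, the result matches the fully abstracted trace shown there, with one `PING(u)` per original call.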
Behaviour patterns
• A behavior pattern is a first-order LTL (linear temporal logic, FOLTL) formula
• Traces satisfying a FOLTL formula ϕ: B = {t ∈ T_Trace(F_Σ) | t |= ϕ}
• Let L be the behaviour of the program P. If a trace t of L satisfies a behaviour pattern ϕ, then P has the behaviour described by ϕ

[Paper excerpt shown on the slide]

Example 1. Let us consider the functionality of sending a ping. One way of realizing it consists in calling the socket function with the parameter IPPROTO_ICMP describing the network protocol and, then, calling the sendto function with the parameter ICMP_ECHOREQ describing the data to be sent. Between these two calls, the socket should not be freed. This is described by the FOLTL formula:
    ϕ1 = ∃x, y. socket(x, α) ∧ (¬closesocket(x) U sendto(x, β, y)),
where the first parameter of socket is the created socket and the second parameter is the network protocol, the first parameter of sendto is the used socket, the second parameter is the sent data and the third one is the target, the unique parameter of closesocket is the freed socket, and constants α and β in F_d identify the above parameters IPPROTO_ICMP and ICMP_ECHOREQ.
A ping may also be realized using the function IcmpSendEcho, whose parameter represents the ping target. This corresponds to the FOLTL formula:
    ϕ2 = ∃x. IcmpSendEcho(x).
Hence, the ping functionality may be described by the FOLTL formula:
    ϕ_ping = ϕ1 ∨ ϕ2.
We then define a behavior pattern as the set of traces carrying out its functionality, i.e., as the set of traces validating the formula describing the functionality.

Definition 1. A behavior pattern is a set of traces B ⊆ T_Trace(F_Σ) validating a closed FOLTL formula ϕ on AP = T_Action(F_Σ, X):
    B = {t ∈ T_Trace(F_Σ) | t |= ϕ}.

[The excerpt ends at the heading "IV. Detection problem": "As said before, our goal is to be able to detect, in a given set of traces, some predefined behavior composed of combinations ..."]
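As an illustration of what it means for a finite trace to validate ϕ_ping = ϕ1 ∨ ϕ2, here is a small hand-written Python checker. It is a sketch under explicit assumptions and not the model-checking procedure of the paper: calls are hypothetical `(name, args)` tuples, the existential variables x and y are instantiated with the arguments found in the trace, and the formula is tried at every position of the trace rather than only at its first position.

```python
from typing import List, Tuple

Call = Tuple[str, Tuple[str, ...]]  # (function name, symbolic arguments)

ALPHA = "IPPROTO_ICMP"  # constant α of the formula
BETA = "ICMP_ECHOREQ"   # constant β of the formula


def phi1_at(trace: List[Call], i: int) -> bool:
    """socket(x, α) holds at position i and ¬closesocket(x) U sendto(x, β, y) holds from i."""
    name, args = trace[i]
    if name != "socket" or len(args) < 2 or args[1] != ALPHA:
        return False
    x = args[0]
    for later_name, later_args in trace[i + 1:]:
        if later_name == "sendto" and len(later_args) >= 2 \
                and later_args[0] == x and later_args[1] == BETA:
            return True   # the "until" is fulfilled: sendto reached, socket never freed
        if later_name == "closesocket" and later_args[:1] == (x,):
            return False  # socket x was freed before the ping was sent
    return False          # finite trace ended without any sendto on x


def phi2_at(trace: List[Call], i: int) -> bool:
    """∃x. IcmpSendEcho(x) at position i."""
    return trace[i][0] == "IcmpSendEcho"


def has_ping(trace: List[Call]) -> bool:
    """Assumption of this sketch: ϕ_ping may be validated starting at any position."""
    return any(phi1_at(trace, i) or phi2_at(trace, i) for i in range(len(trace)))


ok = [("socket", ("s", ALPHA)), ("bind", ("s",)), ("sendto", ("s", BETA, "host"))]
freed = [("socket", ("s", ALPHA)), ("closesocket", ("s",)), ("sendto", ("s", BETA, "host"))]
print(has_ping(ok), has_ping(freed))  # True False
```

The second trace is rejected because the socket is freed between the two calls, which is exactly the case the `¬closesocket(x) U ...` part of ϕ1 rules out.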
Malicious behavior detection
Theorem: Let L be a finite set of finite traces. Let L↓ be the set of traces correctly abstracted from L with respect to a rational abstraction relation R. Let ϕ be a FOLTL formula. Deciding whether L↓ is infected by ϕ is computable in linear time.
It also works when L is regular (and infinite); see the paper for details.
Related work
- Jacob et al., 2009: low-level functionalities, exponential-time detection

A C keylogger or an SMS message leaking app

In this section, we take the example of a program capturing the characters typed on the keyboard (in order to extract, in particular, passwords, banking information and other sensitive data). Analysing its compiled form with IDA makes it possible to determine the layout of the stack and the types of the variables it contains. For simplicity, we work on its source code:

    LRESULT WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam) {
      RAWINPUTDEVICE rid;
      RAWINPUT *buffer;
      UINT dwSize;
      USHORT uKey;

      switch (msg) {
      case WM_CREATE: /* Creation of the main window */
        /* Initialise keyboard capture */
        rid.usUsagePage = 0x01;
        rid.usUsage = 0x06;
        rid.dwFlags = RIDEV_INPUTSINK;
        rid.hwndTarget = hwnd;
        RegisterRawInputDevices(&rid, 1, sizeof(RAWINPUTDEVICE));
        break;

      case WM_INPUT: /* Keyboard, mouse, etc. event */
        /* What size for buffer? */
        GetRawInputData((HRAWINPUT) lParam, RID_INPUT, NULL, &dwSize,
                        sizeof(RAWINPUTHEADER));
        buffer = (RAWINPUT*) malloc(dwSize);
        /* Retrieve the captured data into buffer */
        if (!GetRawInputData((HRAWINPUT) lParam, RID_INPUT, buffer, &dwSize,
                             sizeof(RAWINPUTHEADER)))
          break;
        if (buffer->header.dwType == RIM_TYPEKEYBOARD
            && buffer->data.keyboard.Message == WM_KEYDOWN) {
          printf("%c\n", buffer->data.keyboard.VKey);
        }
        free(buffer);
        break;
      }
      /* ... */
    }

This code contains seven variables: hwnd, msg, wParam, lParam, rid, buffer and dwSize. The types HWND and HRAWINPUT represent integers. The types WPARAM and LPARAM are interpreted differently depending on the context: here, only the variable lParam is used, and it represents an integer in the context of its use. The types RAWINPUTDEVICE and RAWINPUT are structures described below. Only the fields of interest to us have been kept.

    struct RAWINPUTDEVICE {
      int usUsagePage;
      int usUsage;
      ...

SMS message leaking app (Java excerpt):

    public void onReceive(Context context, Intent intent) {
      Bundle bundle;
      Object pdus[];
      String from = null;
      String msg = "";
      String str = "";

      bundle = intent.getExtras();
      pdus = (Object[]) bundle.get("pdus");

      // For each message sent
      int pdus_len = pdus.length;
      for (int i = 0; i < pdus_len; i++) {
        Object pdu = pdus[i];
        SmsMessage smsmessage = SmsMessage.createFromPdu((byte[]) pdu);

        // from = "From:" + smsmessage.getDisplayOriginatingAddress() +
        StringBuilder sb1 = new StringBuilder("From:");
        String s1 = smsmessage.getDisplayOriginatingAddress();
        sb1.append(s1);
        sb1.append(":");
        from = sb1.toString();

        // msg = msg + smsmessage.getMessageBody();
        StringBuilder sb2 = new StringBuilder(msg);
        String s2 = smsmessage.getMessageBody();
        sb2.append(s2);
        ...

An information leaking behaviour pattern

[Paper excerpt shown on the slide]

The information leak abstract behavior is then defined by:
    M := ∃x. λsteal(x) ∧ ¬λinval(x) U λleak(x).
By looking at several malware samples, like keyloggers, SMS message leaking applications or personal information stealing mobile applications, we consider the following definitions of the three behavior patterns involved:

• λsteal(x) describes a keystroke capture functionality and, on Android mobile phones, the retrieving of the IMEI number (keystroke or IMEI capture):
    λsteal(x) := GetAsyncKeyState(x)
               ∨ (RegisterDev(KBD, SINK) ⊙ GetInputData(x, INPUT))
               ∨ (∃y. SetWindowsHookEx(y, WH_KEYBOARD_LL) ∧ ¬UnhookWindowsHookEx(y) U HookCalled(y, x))
               ∨ ∃y. TelephonyManager_getDeviceId(x, y).

• λleak(x) describes a network send functionality under Windows or Android:
    λleak(x) := ∃y, z. sendto(z, x, y)
              ∨ ∃y, z. (connect(z, y) ∧ ¬close(z) U send(z, x))
              ∨ ∃c, s. HttpURLConnection_getOutputStream(s, c) ∧ ¬OutputStream_close(s) U OutputStream_write(s, x).

[Further fragments of the paper legible on the slide]
... language, a fragment of the modal mu-calculus extended with data variables, of which the FOLTL logic used in this paper is a subset. We first represent the program set of traces as a CADP process. For this, we use a program control flow graph obtained by static analysis (see [21] and [23], [24]). Regularity of the set of traces is enforced by limiting recursion and inlining function calls, an approximation that can be deemed safe with respect to the abstract behaviors to detect. Note that there are two shortcomings to regular approximation. First, approximation of conditional branches by nondeterministic branches may result in false positives, especially when the program code is obfuscated. And second, failure to identify data correlations during dataflow analysis can result in false negatives. However, this does not significantly impact our detection results.
HttpU y). RLConnection_getOutputStream(s, c)∧ delegate verificatio λstealanager_getDeviceId(x, (x) :=∃c,GetAsyncKeyState(x)∨ ∃y.T elephonyM Now, as expressed in Theorem 3, detection of the the inresult ¬OutputStream_close(s)U OutputStream_write(s, x). • leak λinvalabstract (x) describes theM overwriting or of x:the this, wefreeing represent formation behavior can be broken down • λleak (x) describes a network send functionalitySINK)⊙ under (RegisterDev(KBD, GetInputData(x, INPUT))∨ aLsystem ofy)∨ communi into two steps: abstracting the set of traces by computing is obf � ≤2λinval � (x) := f ree(x) ∨ ∃y. sprintf (x, Windows or Android: • λinval (x) describes the overwriting or freeing of x: 0 � Rλ R (L) and then verifying whether abstracted withan a .particular gate o (∃y. SetW indowsHookEx(y, WH_KEYBOARD_LL)∧ GetInputData(x, INPUT) ∨ .. λleak (x) := ∃y, z. sendto(z, x, y)∨ during trace ∨ matches the abstract behavior formula.library calls. Then, co λinval (x) := f ree(x) ∃y. sprintf 0 (x, y)∨ ∃y, z. (connect¬U (z, y)nhookW ∧ ¬close(z) UindowsHookEx(y)U send(z, x))∨ HookCalled(y, x))∨ captured is usuallystep not leaked in itsand raw form So, Finally, weINPUT) canthe simulate abstraction in CADP GetInputData(x, ∨ . . . thedata synchronization withda this so the we take into account of this data ∃c, s. HttpU RLConnection_getOutputStream(s, c)∧ delegate verification step totransformations the evaluator4 module. For via the the transducer realizing a dependency Finally, the capturedx). data is usually not leaked in its raw(x, form, elephonyM anager_getDeviceId(x, ¬OutputStream_close(s)U∃y.T OutputStream_write(s, behavior pattern λdepends denotes this, we represent the set of traces Ly)y). ofwhich a given program by Now � R is rational λ in representation of xofoncommunicating y. 
For instance, x may be a string oa a system processes expressed LOTOS, so we take into account transformations of this data via the • λinval (x) describes the overwriting or freeing of x: with ay, particular gate communications correspond to in CA or x may beonanwhich encryption or an synchronization encoding of y: forma behavior pattern λ (x, y) which denotes a dependency depends • λ (x) describes a network send functionality under inval inval inval jeudi 1 novembre λ 2012(x) leak := f ree(x) ∨ ∃y. sprintf (x, y)∨ each bymalware library calls. Then, computation of R≤2 (L) isFor performed 7 Tool chains Information Leak Behaviors Related work program source Kinder & al, Detecting malicious code by model Abstractionorcan be applied to detectionchecking of generic threats, and in particular to detection Traces of sensitive information leak. Such a leak can be decomposed into two steps: capturing sensitive information and sending this information to an exogenous location. The captured data can be keystrokes, passwords or data read from a sensitive network location, while the exogenous location can be the network, a Abstraction removable device, etc. Thus, we define a behavior pattern λsteal (x), representing the capture of some sensitive data x, and a behavior pattern λleak (x), representing the transmission of x to an exogenous location. Moreover, since the captured data must not be invalidated before being leaked, we Checker define a behavior pattern Model λinval (x), which represents such an invalidation. FOLTL-Behavior Finally, the captured data is usually not leaked in its raw form, so we take patterns into account transformations of this data via the behavior pattern λdepends (x, y) FO-LTL which denotes a dependency of x on y. For instance, x may be a string representation of y, or x may be an encryption or an encoding of y. 
Then, in order to account for one such transformation of the stolen data, we define the information leak abstract behavior:

    M := ∃x, y. λsteal(x) ∧ ¬λinval(x) U λdepends(y, x) ∧ [...] U λleak(y).

Test on detection of keyloggers

Abstraction-based analysis of malware behaviours

Our work:
- Expressing sets of traces by regular term languages
- Computing a higher-level semantics of traces by term rewriting systems
- Keeping track of parameters
- Expressing behavior patterns by FOLTL formulas
- Testing whether abstract traces satisfy a FOLTL behavior pattern
- Efficient analysis (quasi-linear time under several restrictions)

A first conclusion
• Detection of malicious behaviors:
  • Our approach is difficult and time-consuming to implement in practice.
  • We made only a few experiments: Allaple, Rbot, Afcore, Mimail and a keylogger for Android.
• Detection of malware is a difficult subject. One reason is the absence of high-level abstractions to structure and understand obfuscated code.

Related work
- Preda, Christodorescu et al., 2007: A semantics-based approach to malware detection
- Christodorescu, Song et al., 2007: Semantics-aware malware detection

On detection methods and analysis of malware
1. A quick tour of malware detection methods
2. Behavioral analysis using model-checking
3. Cryptographic function identification

Cryptographic function identification in obfuscated binary programs
Joint work with Joan Calvet, José M. Fernandez
CCS 2012

Identification of cryptographic functions
Example: Win32.Sality.AA
Not far from the program entry point, in the first code layer… Decryption? No API calls and no function names.

Is the previous code by any chance an implementation of a known cryptographic algorithm?
Answering this question affirmatively would provide the analyst with a high-level description of this code, without studying it line by line!

The general questions are:
• How to determine the meaning of a piece of code?
• How to determine the meaning of an execution trace? What is computed?

The problem
• S: a set of known cryptographic functions (RC4, AES, TEA, custom ciphers)
• P: an unknown program
• Does P contain any known function of S?
• The answer must come just from an execution trace.
• Proving a general semantic equivalence between a function of P and one of the functions of S seems difficult.

Existing approaches
• A common way to locate cryptographic code is to calculate the ratio of arithmetic machine instructions (ADD, SUB, XOR...).
• When this ratio exceeds a certain threshold, it indicates cryptographic code.

In our Sality sample...
... every basic block looks like those.
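The ratio heuristic described under "Existing approaches" can be sketched in a few lines of Python. This is only an illustration: the slides name ADD, SUB and XOR, while the rest of the `ARITHMETIC` set, the function names and the 0.5 threshold are assumptions made for this example.

```python
# Mnemonics counted as "arithmetic". ADD, SUB and XOR come from the slide;
# the others are an assumption made for this sketch.
ARITHMETIC = {"add", "sub", "xor", "and", "or", "not", "neg",
              "shl", "shr", "rol", "ror", "mul", "imul", "inc", "dec"}


def arithmetic_ratio(block):
    """Fraction of the instructions in `block` with an arithmetic mnemonic."""
    if not block:
        return 0.0
    mnemonics = [line.split()[0].lower() for line in block]
    return sum(m in ARITHMETIC for m in mnemonics) / len(block)


def looks_cryptographic(block, threshold=0.5):
    """Apply the ratio heuristic; the threshold is an arbitrary placeholder."""
    return arithmetic_ratio(block) > threshold


block = ["add ebx, edi", "sub edx, ebx",
         "dec dword ptr [ebp+0xc]", "jnz 0x401325"]
print(arithmetic_ratio(block), looks_cryptographic(block))
```

When obfuscation fills every basic block with such instructions, a test like this flags almost everything, so it no longer singles out the cryptographic code.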
Our approach
1. Observe an execution of P.
2. Collect the input-output values used during this execution, that is, a set of pairs (x, y) such that P(x) = y.
3. Check whether one or more functions F of S satisfy F(x) = y.

If yes, we conclude that P behaves as an implementation of F (on the values (x, y)).

But, roughly (...), in the cryptographic case: with high probability, there is a unique cryptographic function K such that K(x) = y, where x is a ciphertext and y is the deciphered text. One point should be enough to interpolate a cryptographic function.

Obfuscation
Implementing this simple reasoning in obfuscated binary programs is non-trivial… and this is our focus in this project!
• Where are the I/O parameters?
• Where are the functions? ... there are no such things as function calls. Never returns!
➡ There is no high-level definition.

Implementation
1. Information gathering: we collect an execution trace of P. For each executed instruction, we gather
   a) its memory address
   b) its machine instruction
   c) its accesses to memory and registers, with the values
2. Extraction: delimit possible cryptographic code in the execution trace.
3. Identification: check whether the extracted code maintained, during the previous execution, the input-output relationship of a known cryptographic function.

Loop extraction
• Cryptographic algorithms usually apply the same treatment repeatedly to their input-output parameters.
• This makes loops a feature of cryptographic code and a possible criterion to extract it from execution traces.
• But there are loops everywhere, not only in crypto algorithms... What kind of loops are we looking for?
• Example: Win32.Mebroot (unrolling optimization)
A loop definition
• We look for the same operations applied repeatedly on a set of data.
• Our definition: "A loop is the repetition of the same sequence of machine instructions at least two times."

Execution trace:

    ...
    401325  add ebx, edi                  (iteration 1)
    401327  sub edx, ebx
    401329  dec dword ptr [ebp+0xc]
    40132c  jnz 0x401325
    401325  add ebx, edi                  (iteration 2)
    401327  sub edx, ebx
    401329  dec dword ptr [ebp+0xc]
    40132c  jnz 0x401325
    ...

The two iterations together form a loop. It corresponds to the language L = {ww}, which is not context-free...

What about nested loops?
• Execution trace (from a simplified CFG): A B B B C A B B C
• Loop B runs for 3 iterations, then for 2 iterations, so the body of the outer loop is different at each iteration!
• Trace rewriting: replacing each run of B by a loop symbol X gives A X C A X C. Ok!

Loop detection algorithm
1. Detect two repetitions of a loop body in the execution trace (non-trivial: non-context-free language).
2. Replace the detected loop in the trace by a symbol representing its body.
3. Go back to step 1 if new loops have been detected.

I/O identification (1)
We extracted possible cryptographic code from execution traces thanks to a particular loop definition. But our identification method needs the input-output values of this crypto code. How can we define such input-output parameters from the bytes read and written in execution traces?

I/O identification (2)
Distinction between input and output bytes in the execution trace:
• Input bytes have been read without having been previously written.
• Output bytes have been written.
A reasonable hypothesis.

I/O identification (3): loop parameters
Grouping of several bytes into the same parameter:
1. If they are adjacent in memory (too large on its own!)
2. And if they are manipulated by the same instruction in the loop body.

    Iteration 1          Iteration 2
    add [ecx], edi       add [ecx], edi
    mov eax, [ebx]       mov eax, [ebx]
    ...                  ...

Loop data flow
• A crypto implementation can contain several loops.
• We consider that two loops L1 and L2 are in the same crypto implementation:
  - if L1 started before L2 in the trace, and
  - if L2 uses as an input parameter an output parameter of L1.

Loop data flow graph (oriented, acyclic): L1, L2, L3
We consider each path in the graph as a possible cryptographic algorithm (in order to deal with algorithm combinations)!

Method recap
1. We collect an execution trace.
2. We extract possible cryptographic algorithms with their parameter values.
3. We compare the input-output relationship with known cryptographic algorithms.
We can demonstrate that a program behaves like a known crypto algorithm during one particular execution.

Let's illustrate this process on our Sality sample...
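The three steps of the loop detection algorithm (detect a repetition, replace it by a symbol, start again) can be sketched naively in Python. This is a brute-force illustration, not the algorithm of the tool or the paper: the names `find_repetition` and `rewrite_loops`, the symbols `X1`, `X2`, ... and the choice of the shortest, leftmost body first are assumptions made for this example.

```python
def find_repetition(trace):
    """Return (start, size, count) for a run body^count with count >= 2,
    preferring the shortest body and then the leftmost start; None if absent."""
    n = len(trace)
    for size in range(1, n // 2 + 1):
        for start in range(n - 2 * size + 1):
            body = trace[start:start + size]
            count = 1
            while trace[start + count * size:start + (count + 1) * size] == body:
                count += 1
            if count >= 2:
                return start, size, count
    return None


def rewrite_loops(trace):
    """Replace repeated bodies by loop symbols until no repetition is left.
    Returns the rewritten trace and a dict mapping each symbol to its body."""
    trace = list(trace)
    symbol_of = {}                       # body (tuple) -> loop symbol
    while True:
        found = find_repetition(trace)
        if found is None:
            break
        start, size, count = found
        body = tuple(trace[start:start + size])
        # The same body always gets the same symbol, whatever its iteration count.
        symbol = symbol_of.setdefault(body, f"X{len(symbol_of) + 1}")
        trace[start:start + size * count] = [symbol]
    return trace, {sym: body for body, sym in symbol_of.items()}


# Nested-loop example from the slides: A B B B C A B B C
nested, loops = rewrite_loops("ABBBCABBC")
print(nested, loops)

# Instruction-level example: the four-instruction body executed twice.
addresses = ["401325", "401327", "401329", "40132c"] * 2
print(rewrite_loops(addresses)[0])
```

On the nested example, both runs of B are first rewritten to the same symbol, which makes the two outer iterations identical so that the outer loop is found on the next pass.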
Step 1: gather the execution trace
    Sality sample → TRACER → execution trace (from first instruction to last instruction)

Step 2: recognize loops on the trace (L1, L2, L3)

Step 3: define loop I/O parameters (L1, L2, L3)

Step 4: connect loops with data flow (L1, L2, L3)

Unknown algorithm extracted

We still have the last mile to do...

Comparison algorithm
1. Build the set I of possible input values with all possible orderings of A's input parameters.
2. Build the set O of possible output values in the same manner with A's output parameters.
3. Evaluate each function of S on all values in I and check whether the result produced is in O. If yes, this is a success!

Input 1: unknown algorithm A with its parameter values
Input 2: finite set S of known cryptographic functions (RC4, AES, MD5, TEA, custom cipher X, custom cipher Y)

Question
• Some difficulties, for example:
  - Parameter division: the same cryptographic parameter can be divided into several loop parameters.
  - Parameter number: we collect more than the cryptographic parameters.

In practice, the complexity of this algorithm can be greatly reduced with some simple rules:
  - Do not consider memory addresses as valid parameters.
  - Do not consider common constant values (0, FF) as valid parameters.
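The comparison algorithm and the constant-filtering rule can be sketched as follows in Python. This is a toy illustration under several assumptions: the reference set holds only MD5 (from `hashlib`) and a repeating-key XOR standing in for a "custom cipher", the names `candidate_values` and `identify` are hypothetical, parameter division is handled by trying concatenations of parameters in every order, and the memory-address rule is not modelled.

```python
import hashlib
from itertools import permutations, product


def xor_cipher(key, data):
    """Toy stand-in for a 'custom cipher': repeating-key XOR."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# Reference set S: name -> (function, number of byte-string arguments).
REFERENCES = {
    "md5": (lambda data: hashlib.md5(data).digest(), 1),
    "xor_cipher": (xor_cipher, 2),
}

COMMON_CONSTANTS = {b"\x00", b"\xff"}    # the (0, FF) rule from the slides


def candidate_values(params):
    """All concatenations of distinct parameters, in every order."""
    params = [p for p in params if p and p not in COMMON_CONSTANTS]
    return {b"".join(combo)
            for r in range(1, len(params) + 1)
            for combo in permutations(params, r)}


def identify(inputs, outputs):
    """Return the (name, args) pairs whose reference output is a candidate output."""
    in_values = candidate_values(inputs)      # the set I
    out_values = candidate_values(outputs)    # the set O
    matches = []
    for name, (func, arity) in REFERENCES.items():
        for args in product(in_values, repeat=arity):
            if func(*args) in out_values:
                matches.append((name, args))
    return matches


key, data = b"k3y", b"attack at dawn"
ciphertext = xor_cipher(key, data)
# The output is split over two loop parameters (parameter division) and a
# common constant pollutes the inputs (parameter number).
found = identify([data, b"\x00", key], [ciphertext[:8], ciphertext[8:]])
print(found)
```

The number of candidates grows factorially with the number of parameters, which is why pruning rules such as the two listed above matter in practice.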
Let's recap the process
    Sample → TRACER → execution trace → EXTRACTION → unknown algorithms → IDENTIFICATION (with reference implementations)
The result describes whether the sample behaved as a known cryptographic function during the observed execution.

Result

More results (cf. paper)

Limitations
• As we analyze execution traces, we have to know how to exhibit interesting execution paths.
• The tool is only as good as the reference implementation database.
• It is easy to bypass, like any program analysis technique.

Future work
• Extract reference implementations directly from binary programs.
• Implement other extraction criteria than our loop model.
Code is available at https://code.google.com/p/aligot/

High Security Lab @ Nancy.fr
• Telescope & honeypots
• In vitro experiment clusters
Other subjects
• Morphological analysis
• Botnet counter-attacks
Thanks!

BONUS SLIDES

Morphological analysis in a nutshell
• Signatures are abstract flow graphs
• Detection of subgraphs in the program flow graph abstraction
• Automatic construction of signatures
• Reduction of signatures by graph rewriting

Morphological detection: results
• False negatives
• No experiment on unknown malware
• Signatures with < 18 nodes are potential false negatives
• Restricted signatures of 20 nodes are efficient
• Less than 3 sec. for signatures of 500 nodes

Conclusion about morphological detection
• Benchmarks are good
• Pros
  - More robust against local mutation and obfuscation
  - Easily detects variants of the same malware family
  - Tries to take program semantics into account
  - Quasi-automatic generation of signatures
• Cons
  - Difficult to determine statically the flow graph of self-modifying programs
  - Use of a combination of static and dynamic analysis

Reference
• Guillaume Bonfante, Matthieu Kaczmarek and Jean-Yves Marion, Architecture of a malware morphological detector, Journal in Computer Virology, Springer 2008.
• Recon 2012 and Malware 2012

Performances

                                        Sality 1    Sality 2
    Trace size (instructions)           ~1M         ~4M
    Time to trace                       5 min       10 min
    Time to extract crypto algorithm    4 h         10 h
    Time to identify                    3 min       4 min

• The tool is just a PoC, with no optimization at all.
• When the analyst knows where the algorithm is, the trace size can be reduced.

Existing tools for crypto identification

    Tool                            Answer on Sality sample
    Crypto Searcher                 Ø
    Draca v0.5.7b                   Ø
    Findcrypt v2                    Ø
    Hash & Crypto Detector v1.4     Ø
    PEiD KANAL v2.92                Ø
    Kerckhoffs                      Ø
    Signsrch 0.1.7                  Ø
    SnD Crypto Scanner v0.5b        Ø