Changes: GPL Tools/match.py

Latest revision as of 09:14, 17 November 2011

This is the script for matching functions and data addresses between a bunch of IDC databases.

Alternative tools which do (more or less) the same job:

Theory[]

Function signature[]

It's the sequence of mnemonics of each instruction, also with flags. To see it, you may use a Fun object:

In [1]: f = Fun(t2i,"setFiltreOff")
Unknown function: setFiltreOff. Using closest match: SetFilterOff.

In [2]: f.sig
Out[2]: push add mov mov bl pop mov b

Exact matches are easiest to find, since they can be indexed. If the function start and end addresses are known, that helps a lot. If not, the Aho-Corasick algorithm is *extremely* fast, but my code is a bit buggy.

Evaluating matches[]

Each match found is evaluated with a score. The bigger the score, the better the match. Ideally, matches with positive score should be OK, and matches with negative score should be bad.

Function pairs with identical signature[]

Scoring for those functions is computed in data_match_funcpair. Each line has one or two data references (first is a number, and then, if that number is a ROM address, the second data ref is the contents of that address).

We can divide references into small numbers (let's say < 0x1000), which are offsets, small constants and other uninteresting stuff, and large numbers, which can yield data matches. If the reference is actually a string, we'll consider it a separate case. Now let's compute:

$smallmatch = \dfrac{smallmatch_{OK}}{smallmatch_{total}}$

$bigmatch = \dfrac{bigmatch_{OK}}{bigmatch_{total}}$

$stringmatch = \dfrac{stringmatch_{OK}}{stringmatch_{total}}$

A string match is much more important than a simple number match, so let's weight them like this:

$score_m = (smallmatch - 0.5) \sqrt{smallmatch_{total}} + (bigmatch - 0.5) \sqrt{smallmatch_{total}} + 20 \cdot (stringmatch - 0.5) \sqrt{stringmatch_{total}}$

If the strings are almost identical, their match ratio would be also added to $stringmatch_{OK}$ .

Further hints may be given by how many times the function is called, and how many function it calls. These numbers don't have to match exactly, so instead let's compute a mismatch: $\mathrm{mismatch}(a,b) = \dfrac{|a-b|}{(a+b)/2}$

With this, the score of a function pair becomes:

$score = score_m - 2 \cdot \mathrm{mismatch}(refsto_1, refsto_2) - 2 \cdot \mathrm{mismatch}(refsfrom_1, refsfrom_2)$

Does it work? We'll see. Is there a better formula? Sure, it just waits for you to find it :)

Function pairs with different signature[]

That's more difficult.

Data address pairs[]

These pairs are obtaining by comparing two functions with identical signature, and matching their referenced addresses line by line.

For those addresses, we have these sources of information:

the quality of the pair(s) from which a given data pair was obtained
analyzing how is the address referenced throughout the firmware

Since the score of a function pair is additive, we can try to add the scores of all the pairs which yielded this match. If a data match was detected from two matching function pairs, or more, that's much better than being detected from only one function pair. Also, if the data pair was detected from 10 functions or more, it's almost certain it's a good match.

So let's try this formula:

$score_{datapair} = AVG(funcpairs) \cdot \sqrt{COUNT(funcpairs)-0.95}$

Usage[]

This script is included in GPL Tools/ARM console.

As a tutorial, let's try to generate a stub for 550D 1.0.9. Let's say we already have the stubs and firmware dumps for 5d2 2.0.4 and 550d 1.0.8. No IDC files are needed (only the stubs).

In [3]: D = load_dumps("(108|109|204)")

In [4]: t108, t109, mk2 = D

In [5]: M = match.CodeMatch(D)
Creating codesigs for 550d.108.0xff010000.bin...
Creating codesigs for 550d.109.0xff010000.bin...
Creating codesigs for 5d2.204.0xff810000.bin...
...
Remaining 14915 code matches between 550d.108.0xff010000.bin and 550d.109.0xff010000.bin.
Remaining 9752 code matches between 550d.109.0xff010000.bin and 5d2.204.0xff810000.bin.
Remaining 9906 code matches between 550d.108.0xff010000.bin and 5d2.204.0xff810000.bin.

In [6]: DM, SRC = match.DataMatch(M)
Found 20824 possible code matches between 550d.108.0xff010000.bin and 550d.109.0xff010000.bin.
Found 1053 data matches between 550d.108.0xff010000.bin and 550d.109.0xff010000.bin.
Found 37765 possible code matches between 550d.109.0xff010000.bin and 5d2.204.0xff810000.bin.
...
Found 4010 data matches between 550d.109.0xff010000.bin and 5d2.204.0xff810000.bin.
Found 37768 possible code matches between 550d.108.0xff010000.bin and 5d2.204.0xff810000.bin.
Found 4001 data matches between 550d.108.0xff010000.bin and 5d2.204.0xff810000.bin.

In [7]: match.sync(D, M, DM, idc=True, stub=False)
...
# this will choose the best match for each function and save the results. IDC format is recommended, since it also contains function definitions, not just stubs.

In [8]: match.MakeStubs(t109, D, M, DM, SRC)
NSTUB(0xFF0673EC, DebugMsg)                       // OK (dumps=2, score=8.2e+02)
NSTUB(0xFF1C69EC, FIO_CloseFile)                  // OK (dumps=2, score=79)
...
NSTUB(0xFF1D6638, vsnprintf)                      // OK (dumps=2, score=51)

The generated stub is displayed at the terminal (just copy/paste it to its final destination). You'll also want to check the logs: there's a big, colorful one called MakeStubs-diff.htm, which shows side-by-side diffs for matched functions. There are a few other text logs; see the reference section for details.

API reference[]

CodeMatch*[]

These functions implement some code (i.e. function) matching algorithms.

The first one search for matches only in the functions marked in the database (usually loaded from IDC). It's fast, but it relies on functions being correctly identified for all the firmware dumps, so it may miss many functions.

In [8]: M = match.CodeMatch_dbfuncs(D)

This one performs full-text signature search using an Aho-Corasick tree. It needs the functions to be correctly defined only in the source firmwares.

My interpretation of aho-corasick search results is very slow (I'm counting spaces in the signature to find the function offsets) and will be optimized.

In [9]: M = match.CodeMatch_corasick(D)

Not implemented yet: a frontend to gensig/finsig. Useful for comparison and for creating side-by-side diffs for gensig/finsig matches.

In [10]: M = match.CodeMatch_finsig(D)

The last one calls all the above algorithms and combines the results (it hopefully extracts the best from all algorithms):

In [11]: M = match.CodeMatch(D)

All those functions return a dictionary which looks like this:

In [12]: M.keys()
Out[12]: 
[(Dump of 550d.108.0xff010000.bin, Dump of 550d.109.0xff010000.bin),
 (Dump of 550d.109.0xff010000.bin, Dump of 5d2.204.0xff810000.bin),
 (Dump of 550d.108.0xff010000.bin, Dump of 5d2.204.0xff810000.bin)]

In [13]: len(M[t108,mk2])
Out[13]: 9906

In [14]: pair = M[t108,mk2][5]

In [15]: funcpair((t108,mk2), pair)
Out[15]:                                     0xff33cff8:sdcomSendData <-------> 0xffa1d5fc:AJ_sdcomSendData

In [16]: Score((t108,mk2), pair)
Out[16]: 61.9513542013

These functions create some logs: code-matches-*.txt and raw-match-log-*.txt.

DataMatch[]

Two matched function which identical code structure can yield matches for data addresses (like sounddev, additional_version etc.). They can also suggest matches for functions which are not identical, but they are called from the same place.

This is the low-level function, which analyze only a pair of functions (and also provides a score):

In [17]: data_match_funcpair((t108,mk2), pair)
...
big numbers match: 0.82 (212 / 260)
STRING MATCH: 1 (27 / 27)
score: 62
Out[17]: (61.951354201275649, [(4278638820L, 4287044188L), (4281582832L, 4288793332L), (133060, 827 ...

And this analyzes all the matches:

In [18]: DM, SRC = match.DataMatch(M)

Outputs: a dictionary of data matches (DM) and a dictionary of sources (SRC, i.e. matched functions from which the match was deducted). This function also creates data-matches.txt and possible-code-matches.txt (one for data addresses and other for functions).

Scoring[]

In [19]: Score(dumpair, addrpair)

In [20]: Score_refmatch(dumpair, addrpair, srcscores)

In [21]: combined_score([1,2,10,20])
Out[21]: 14.4080055872

In [22]: combined_score([5])
Out[22]: 1.11803398875

In [23]: combined_score([5,5])
Out[23]: 5.12347538298

MakeStubs[]

This is used for creating stubs for a new firmware version. See the Usage section.

SyncIDC[]

Not implemented yet. It will write some IDCs for name syncing purposes.

Logs[]

MakeStubs-diff.htm[]

It is created by match.MakeStubs and contains the following:

for each function pair: a side-by-side diff
for each data pair: side-by-side diffs of the functions from which that data pair was found.

raw-match-log-*.txt[]

It shows detailed info about the matching process, for each pair of functions.

code-matches-*.txt[]

This contains the matched function addresses for all the firmware pairs, sorted by score.

data-matches.txt[]

This contains the matched data addresses (i.e. not functions) for all the firmware pairs.

possible-code-matches.txt[]

This contains matched functions found with the data matching method. They are not verified (yet) and may contain many bad matches.

Happy Matching!

--Alexdu 18:12, December 1, 2010 (UTC)

@@ Line 1: / Line 1: @@
 This is the script for matching functions and data addresses between a bunch of IDC databases.
+Alternative tools which do (more or less) the same job:
+* [[Gensig_finsig|Gensig/Finsig from CHDK]]
+* [http://code.google.com/p/patchdiff2/ patchdiff2]
+* [http://corelabs.coresecurity.com/index.php?module=Wiki&action=view&type=tool&name=turbodiff turbodiff]
 ==Theory==
-[[IDAPython/Firmware matching]]
+Also check [[IDAPython/Firmware matching]]
-===Evaluating matches===
+===Function signature===
+It's the sequence of mnemonics of each instruction, also with flags. To see it, you may use a Fun object:
-Each match found is evaluated with a score. The bigger the score, the better the match. Ideally, matches with positive score should be OK, and matches with negative score should be bad.
+ In [1]: '''f = Fun(t2i,"setFiltreOff")'''
+ Unknown function: setFiltreOff. Using closest match: SetFilterOff.
+ In [2]: '''f.sig'''
+ Out[2]: push add mov mov bl pop mov b
+Exact matches are easiest to find, since they can be indexed. If the function start and end addresses are known, that helps a lot. If not, the Aho-Corasick algorithm is *extremely* fast, but my code is a bit buggy.
+===Evaluating matches===
+Each match found is evaluated with a score. The bigger the score, the better the match. Ideally, matches with positive score should be OK, and matches with negative score should be bad.
 ====Function pairs with identical signature====
@@ Line 23: / Line 37: @@
 A string match is much more important than a simple number match, so let's weight them like this:
-<math>score_m = (smallmatch - 0.5) \sqrt{smallmatch_{total}} + (bigmatch - 0.5) \sqrt{smallmatch_{total}} + 20 * (stringmatch - 0.5) \sqrt{stringmatch_{total}}</math>
+<math>score_m = (smallmatch - 0.5) \sqrt{smallmatch_{total}} + (bigmatch - 0.5) \sqrt{smallmatch_{total}} + 20 \cdot (stringmatch - 0.5) \sqrt{stringmatch_{total}}</math>
+If the strings are almost identical, their match ratio would be also added to <math>stringmatch_{OK}</math>.
 Further hints may be given by how many times the function is called, and how many function it calls. These numbers don't have to match exactly, so instead let's compute a mismatch:
-<math>\mathrm{mismatch(a,b) = \dfrac{|a-b|}{(a+b)/2}}
+<math>\mathrm{mismatch}(a,b) = \dfrac{|a-b|}{(a+b)/2}</math>
 With this, the score of a function pair becomes:
-<math>score = score_m - 2*mismatch(refsto_1, refsto_2)  - 2*mismatch(refsfrom_1, refsfrom_2) </math>
+<math> score = score_m - 2 \cdot \mathrm{mismatch}(refsto_1, refsto_2)  - 2 \cdot \mathrm{mismatch}(refsfrom_1, refsfrom_2) </math>
 Does it work? We'll see. Is there a better formula? Sure, it just waits for you to find it :)
+====Function pairs with different signature====
+That's more difficult.
 ====Data address pairs====
@@ Line 50: / Line 67: @@
 <math>score_{datapair} = AVG(funcpairs) \cdot \sqrt{COUNT(funcpairs)-0.95}</math>
 ==Usage==
 This script is included in [[GPL Tools/ARM console]].
+As a tutorial, let's try to generate a stub for 550D 1.0.9. Let's say we already have the stubs and firmware dumps for 5d2 2.0.4 and 550d 1.0.8. No IDC files are needed (only the stubs).
+ In [3]: '''D = load_dumps("(108|109|204)")'''
+ In [4]: '''t108, t109, mk2 = D'''
+ In [5]: '''M = match.CodeMatch(D)'''
+ Creating codesigs for 550d.108.0xff010000.bin...
+ Creating codesigs for 550d.109.0xff010000.bin...
+ Creating codesigs for 5d2.204.0xff810000.bin...
+ ...
+ Remaining 14915 code matches between 550d.108.0xff010000.bin and 550d.109.0xff010000.bin.
+ Remaining 9752 code matches between 550d.109.0xff010000.bin and 5d2.204.0xff810000.bin.
+ Remaining 9906 code matches between 550d.108.0xff010000.bin and 5d2.204.0xff810000.bin.
+ In [6]: '''DM, SRC = match.DataMatch(M)'''
+ Found 20824 possible code matches between 550d.108.0xff010000.bin and 550d.109.0xff010000.bin.
+ Found 1053 data matches between 550d.108.0xff010000.bin and 550d.109.0xff010000.bin.
+ Found 37765 possible code matches between 550d.109.0xff010000.bin and 5d2.204.0xff810000.bin.
+ ...
+ Found 4010 data matches between 550d.109.0xff010000.bin and 5d2.204.0xff810000.bin.
+ Found 37768 possible code matches between 550d.108.0xff010000.bin and 5d2.204.0xff810000.bin.
+ Found 4001 data matches between 550d.108.0xff010000.bin and 5d2.204.0xff810000.bin.
+ In [7]: '''match.sync(D, M, DM, idc=True, stub=False)'''
+ ...
+ # this will choose the best match for each function and save the results. IDC format is recommended, since it also contains function definitions, not just stubs.
+ In [8]: '''match.MakeStubs(t109, D, M, DM, SRC)'''
+ NSTUB(0xFF0673EC, DebugMsg)                       // OK (dumps=2, score=8.2e+02)
+ NSTUB(0xFF1C69EC, FIO_CloseFile)                  // OK (dumps=2, score=79)
+ ...
+ NSTUB(0xFF1D6638, vsnprintf)                      // OK (dumps=2, score=51)
+The generated stub is displayed at the terminal (just copy/paste it to its final destination).
+You'll also want to check the logs: there's a big, colorful one called ''MakeStubs-diff.htm'', which shows side-by-side diffs for matched functions. There are a few other text logs; see the reference section for details.
 ==API reference==
+===CodeMatch*===
-[under construction]
+These functions implement some code (i.e. function) matching algorithms.
+The first one search for matches only in the functions marked in the database (usually loaded from IDC). It's fast, but it relies on functions being correctly identified for all the firmware dumps, so it may miss many functions.
-Matches functions between different versions of firmware.
+ In [8]: '''M = match.CodeMatch_dbfuncs(D)'''
+This one performs full-text signature search using an Aho-Corasick tree. It needs the functions to be correctly defined only in the source firmwares.
- In [41]: '''M = CodeMatch(D)'''
- Creating codesigs for 550D_108_05_0xff010000.bin...
- Creating codesigs for 5D_204_06_0xff810000.bin...
- Found 25075 raw code matches between 550D_108_05_0xff010000.bin and 5D_204_06_0xff810000.bin.
- Saving raw match log...
- Removing duplicates...
- Remaining 8979 code matches between 550D_108_05_0xff010000.bin and 5D_204_06_0xff810000.bin.
+My interpretation of aho-corasick search results is very slow (I'm counting spaces in the signature to find the function offsets) and will be optimized.
- In [42]: '''M'''
- Out[42]:
+ In [9]: '''M = match.CodeMatch_corasick(D)'''
- {(Dump of 550D_108_05_0xff010000.bin, Dump of 5D_204_06_0xff810000.bin): [(4281969332L,
-                                                                            4290005892L),
+Not implemented yet: a frontend to gensig/finsig. Useful for comparison and for creating side-by-side diffs for gensig/finsig matches.
-                                                                           ...
+ In [10]: '''M = match.CodeMatch_finsig(D)'''
-                                                                           (4278835108L,
-                                                                            4287210464L)]}
+The last one calls all the above algorithms and combines the results (it hopefully extracts the best from all algorithms):
+ In [11]: '''M = match.CodeMatch(D)'''
+All those functions return a dictionary which looks like this:
+ In [12]: '''M.keys()'''
+ Out[12]:
+ [(Dump of 550d.108.0xff010000.bin, Dump of 550d.109.0xff010000.bin),
+  (Dump of 550d.109.0xff010000.bin, Dump of 5d2.204.0xff810000.bin),
+  (Dump of 550d.108.0xff010000.bin, Dump of 5d2.204.0xff810000.bin)]
- In [43]: '''M[t2i,mk2]'''
+ In [13]: '''len(M[t108,mk2])'''
- Out[43]:
+ Out[13]: 9906
- [(4281969332L, 4290005892L),
+ In [14]: '''pair = M[t108,mk2][5]'''
-  ...
-  (4278835108L, 4287210464L)]
+ In [15]: '''funcpair((t108,mk2), pair)'''
+ Out[15]:                                     0xff33cff8:sdcomSendData <-------> 0xffa1d5fc:AJ_sdcomSendData
+ In [16]: '''Score((t108,mk2), pair)'''
+ Out[16]: 61.9513542013
+These functions create some logs: code-matches-*.txt and raw-match-log-*.txt.
+===DataMatch===
+Two matched function which identical code structure can yield matches for data addresses (like sounddev, additional_version etc.). They can also suggest matches for functions which are not identical, but they are called from the same place.
+This is the low-level function, which analyze only a pair of functions (and also provides a score):
+ In [17]: '''data_match_funcpair((t108,mk2), pair)'''
+ ...
+ big numbers match: 0.82 (212 / 260)
+ STRING MATCH: 1 (27 / 27)
+ score: 62
+ Out[17]: (61.951354201275649, [(4278638820L, 4287044188L), (4281582832L, 4288793332L), (133060, 827 ...
+And this analyzes all the matches:
+ In [18]: '''DM, SRC = match.DataMatch(M)'''
+Outputs: a dictionary of data matches (DM) and a dictionary of sources (SRC, i.e. matched functions from which the match was deducted).
+This function also creates data-matches.txt and possible-code-matches.txt (one for data addresses and other for functions).
+===Scoring===
+ In [19]: '''Score(dumpair, addrpair)'''
+ In [20]: '''Score_refmatch(dumpair, addrpair, srcscores)'''
+ In [21]: '''combined_score([1,2,10,20])'''
+ Out[21]: 14.4080055872
+ In [22]: '''combined_score([5])'''
+ Out[22]: 1.11803398875
+ In [23]: '''combined_score([5,5])'''
+ Out[23]: 5.12347538298
+===MakeStubs===
+This is used for creating stubs for a new firmware version. See the Usage section.
+===SyncIDC===
+Not implemented yet. It will write some IDCs for name syncing purposes.
+==Logs==
- In [44]: '''pair = M[t2i,mk2][5]'''
+====MakeStubs-diff.htm====
+It is created by ''match.MakeStubs'' and contains the following:
+* for each function pair: a side-by-side diff
+* for each data pair: side-by-side diffs of the functions from which that data pair was found.
+====raw-match-log-*.txt====
+It shows detailed info about the matching process, for each pair of functions.
+====code-matches-*.txt====
+This contains the matched function addresses for all the firmware pairs, sorted by score.
+====data-matches.txt====
+This contains the matched data addresses (i.e. not functions) for all the firmware pairs.
+====possible-code-matches.txt====
+This contains matched functions found with the data matching method. They are not verified (yet) and may contain many bad matches.
- In [45]: '''Score((t2i,mk2), pair)'''
- Out[45]: 60.355478562035763
+Happy Matching!
- In [46]: '''data_match_funcpair((t2i,mk2), pair)'''
- <lots of details>
+--[[User:Alexdu|Alexdu]] 18:12, December 1, 2010 (UTC)
-To find the match results, just sort the working directory by modification date.
-* match-log.txt: shows detailed info about the matching process, for each pair of functions.