GPL Tools/match.py

This script matches functions and data addresses across several IDC databases.

Theory

Evaluating matches

Each match found is evaluated with a score. The bigger the score, the better the match. Ideally, matches with positive score should be OK, and matches with negative score should be bad.

Function pairs with identical signature

Scoring for those functions is computed in data_match_funcpair. Each line has one or two data references (first is a number, and then, if that number is a ROM address, the second data ref is the contents of that address).

We can divide references into small numbers (let's say < 0x1000), which are offsets, small constants and other uninteresting stuff, and large numbers, which can yield data matches. If the reference is actually a string, we'll consider it a separate case. Now let's compute:

$smallmatch = \dfrac{smallmatch_{OK}}{smallmatch_{total}}$

$bigmatch = \dfrac{bigmatch_{OK}}{bigmatch_{total}}$

$stringmatch = \dfrac{stringmatch_{OK}}{stringmatch_{total}}$

A string match is much more important than a simple number match, so let's weight them like this:

$score_{m}=(smallmatch-0.5){\sqrt {smallmatch_{total}}}+(bigmatch-0.5){\sqrt {bigmatch_{total}}}+20 \cdot (stringmatch-0.5){\sqrt {stringmatch_{total}}}$
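As a minimal sketch of this weighting (not the actual code in data_match_funcpair), the following assumes that each term is scaled by the square root of its own total and that a category with no references contributes nothing. The names match_term and score_m are hypothetical.

```python
from math import sqrt

def match_term(ok, total, weight=1.0):
    """One term of score_m: weight * (ok/total - 0.5) * sqrt(total).

    Assumption: a category with no references (total == 0) contributes 0.
    """
    if total == 0:
        return 0.0
    return weight * (float(ok) / total - 0.5) * sqrt(total)

def score_m(small_ok, small_total, big_ok, big_total, str_ok, str_total):
    """Sum the three terms; string matches are weighted 20x."""
    return (match_term(small_ok, small_total)
            + match_term(big_ok, big_total)
            + match_term(str_ok, str_total, weight=20.0))

# All 4 small references match, 1 of 1 strings matches, no large references:
# (1 - 0.5) * sqrt(4) + 0 + 20 * (1 - 0.5) * sqrt(1) = 1 + 10
print(score_m(4, 4, 0, 0, 1, 1))  # 11.0
```

When exactly half of the references in a category match, its term is zero, so the score only moves away from zero when a category does clearly better or worse than a coin flip.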

Further hints come from how many times the function is called and how many functions it calls. These numbers don't have to match exactly, so instead let's compute a mismatch:

$\mathrm{mismatch}(a,b) = \dfrac{|a-b|}{(a+b)/2}$

With this, the score of a function pair becomes:

$score = score_m - 2 \cdot \mathrm{mismatch}(refsto_1, refsto_2) - 2 \cdot \mathrm{mismatch}(refsfrom_1, refsfrom_2)$
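A small sketch of the mismatch penalty, under the assumption that two counts of zero are treated as a perfect match (the formula itself is undefined there). The function names and the tuple arguments are hypothetical, and score_m is passed in as an already computed number.

```python
def mismatch(a, b):
    """Relative difference |a - b| / ((a + b) / 2), ranging from 0 to 2.

    Assumption: mismatch(0, 0) is defined as 0 to avoid dividing by zero.
    """
    if a == 0 and b == 0:
        return 0.0
    return abs(a - b) / ((a + b) / 2.0)

def funcpair_score(score_m, refsto, refsfrom):
    """Subtract twice each mismatch from the data-reference score.

    refsto and refsfrom are (count_in_first_db, count_in_second_db) tuples.
    """
    return (score_m
            - 2 * mismatch(refsto[0], refsto[1])
            - 2 * mismatch(refsfrom[0], refsfrom[1]))

# Called 3 times in one firmware and 5 in the other: |3-5| / 4 = 0.5.
# Both versions call 10 functions: mismatch is 0.
print(funcpair_score(11.0, (3, 5), (10, 10)))  # 10.0
```

Since the mismatch is at most 2, each of the two penalties lowers the score by at most 4.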

Does it work? We'll see. Is there a better formula? Sure, it just waits for you to find it :)

Data address pairs

These pairs are obtained by comparing two functions with identical signature, and matching their referenced addresses line by line.

For those addresses, we have these sources of information:

the quality of the pair(s) from which a given data pair was obtained
an analysis of how the address is referenced throughout the firmware

Since the score of a function pair is additive, we can try to add the scores of all the pairs which yielded this match. If a data match was detected from two matching function pairs, or more, that's much better than being detected from only one function pair. Also, if the data pair was detected from 10 functions or more, it's almost certain it's a good match.

So let's try this formula:

$score_{datapair} = AVG(funcpairs) \cdot \sqrt{COUNT(funcpairs)-0.95}$
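To see how the count of supporting function pairs affects the result, here is a hedged sketch of the formula. The function name datapair_score and its list argument are hypothetical, not part of the script's API.

```python
from math import sqrt

def datapair_score(funcpair_scores):
    """AVG(funcpairs) * sqrt(COUNT(funcpairs) - 0.95).

    funcpair_scores: scores of every function pair that yielded this data pair.
    """
    count = len(funcpair_scores)
    if count == 0:
        raise ValueError("a data pair needs at least one function pair")
    average = sum(funcpair_scores) / float(count)
    return average * sqrt(count - 0.95)

# The same average score of 10, backed by 1, 2 and 10 function pairs.
for n in (1, 2, 10):
    print(n, round(datapair_score([10.0] * n), 3))
```

With a single supporting pair the factor is sqrt(0.05), about 0.22, so the score is heavily discounted. With two pairs the factor is already slightly above 1, and it keeps growing with the count.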

Usage

This script is included in GPL Tools/ARM console.

API reference

[under construction]

Matches functions between different versions of firmware.

In [41]: M = CodeMatch(D)
Creating codesigs for 550D_108_05_0xff010000.bin...
Creating codesigs for 5D_204_06_0xff810000.bin...
Found 25075 raw code matches between 550D_108_05_0xff010000.bin and 5D_204_06_0xff810000.bin.
Saving raw match log...
Removing duplicates...
Remaining 8979 code matches between 550D_108_05_0xff010000.bin and 5D_204_06_0xff810000.bin.

In [42]: M
Out[42]: 
{(Dump of 550D_108_05_0xff010000.bin, Dump of 5D_204_06_0xff810000.bin): [(4281969332L,
                                                                           4290005892L),
                                                                          ...  
                                                                          (4278835108L,
                                                                           4287210464L)]}

In [43]: M[t2i,mk2]
Out[43]: 
[(4281969332L, 4290005892L),
 ...
 (4278835108L, 4287210464L)]

In [44]: pair = M[t2i,mk2][5]

In [45]: Score((t2i,mk2), pair)
Out[45]: 60.355478562035763

In [46]: data_match_funcpair((t2i,mk2), pair)
<lots of details>

To find the match results, just sort the working directory by modification date.

match-log.txt: shows detailed info about the matching process, for each pair of functions.