We wish to approximate the movement of the feature points by an affine transform, because it can account for rotation, zooming, and panning, all of which are common in videos. The coordinates of a feature in the old frame are written as $(x_0, y_0)$ and in the new frame as $(x_1, y_1)$. Then an affine transform can be written as:
$$
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}
=
\begin{bmatrix} a & b \\ c & d \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
+
\begin{bmatrix} e \\ f \end{bmatrix}
\tag{1}
$$

However, this form needs some modification to deal with multiple point pairs at once, and needs rearranging to solve for $a$, $b$, $c$, $d$, $e$, and $f$. It can be easily verified that the form below is equivalent to the one just given:
$$
\begin{bmatrix}
x_0 & y_0 & 0 & 0 & 1 & 0 \\
0 & 0 & x_0 & y_0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \end{bmatrix}
=
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}
\tag{2}
$$

With this form, it is easy to add multiple feature points by stacking two additional rows on the left and on the right for each additional point pair. Denoting the pairs of points as $((x_0^{(1)}, y_0^{(1)}), (x_1^{(1)}, y_1^{(1)}))$, $((x_0^{(2)}, y_0^{(2)}), (x_1^{(2)}, y_1^{(2)}))$, $((x_0^{(3)}, y_0^{(3)}), (x_1^{(3)}, y_1^{(3)}))$, etc., the matrices will now look like:
$$
\begin{bmatrix}
x_0^{(1)} & y_0^{(1)} & 0 & 0 & 1 & 0 \\
0 & 0 & x_0^{(1)} & y_0^{(1)} & 0 & 1 \\
x_0^{(2)} & y_0^{(2)} & 0 & 0 & 1 & 0 \\
0 & 0 & x_0^{(2)} & y_0^{(2)} & 0 & 1 \\
x_0^{(3)} & y_0^{(3)} & 0 & 0 & 1 & 0 \\
0 & 0 & x_0^{(3)} & y_0^{(3)} & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \end{bmatrix}
=
\begin{bmatrix}
x_1^{(1)} \\ y_1^{(1)} \\ x_1^{(2)} \\ y_1^{(2)} \\ x_1^{(3)} \\ y_1^{(3)} \\ \vdots
\end{bmatrix}
\tag{3}
$$

Each point pair contributes two equations, and there are six unknowns. So long as there are more than three points, the system of equations will therefore be overdetermined, and the objective is to find the solution $[a, b, c, d, e, f]$ in the least squares sense. This is done using the pseudoinverse of the matrix on the left.
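As a minimal sketch of how the stacked system in equation (3) could be built and solved, the Python snippet below uses NumPy, which is an assumption and not something the text prescribes. The function names `build_system` and `fit_affine` and the synthetic point data are hypothetical and chosen only for illustration. The solve uses `np.linalg.lstsq`, which returns the least squares solution; when the stacked matrix has full column rank, this matches multiplying the right-hand side by the pseudoinverse from `np.linalg.pinv`.

```python
import numpy as np


def build_system(old_pts, new_pts):
    """Stack two rows per point pair, following the layout of equation (3).

    old_pts, new_pts: arrays of shape (n, 2) holding (x0, y0) and (x1, y1).
    Returns the (2n, 6) matrix on the left and the length-2n right-hand side.
    """
    old_pts = np.asarray(old_pts, dtype=float)
    new_pts = np.asarray(new_pts, dtype=float)
    n = old_pts.shape[0]

    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = old_pts  # x0, y0 multiply a, b in the x1 rows
    A[0::2, 4] = 1.0        # coefficient of e
    A[1::2, 2:4] = old_pts  # x0, y0 multiply c, d in the y1 rows
    A[1::2, 5] = 1.0        # coefficient of f

    # Interleave as x1(1), y1(1), x1(2), y1(2), ...
    rhs = new_pts.reshape(-1)
    return A, rhs


def fit_affine(old_pts, new_pts):
    """Return [a, b, c, d, e, f] in the least squares sense."""
    A, rhs = build_system(old_pts, new_pts)
    params, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return params


# Synthetic data (made up for this sketch): a known transform applied to
# five points that are not all on one line.
true_params = np.array([0.9, -0.1, 0.1, 0.9, 5.0, -3.0])
old_pts = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 2]], dtype=float)
M = true_params[:4].reshape(2, 2)   # [[a, b], [c, d]]
t = true_params[4:]                 # [e, f]
new_pts = old_pts @ M.T + t         # equation (1) applied to every point

A, rhs = build_system(old_pts, new_pts)
est = fit_affine(old_pts, new_pts)
est_pinv = np.linalg.pinv(A) @ rhs  # explicit pseudoinverse route
print(est)
```

With noise-free synthetic points, both routes should recover the parameters used to generate the data up to floating-point error. With real tracked features the fit is only approximate, and points lying on a single line leave the system rank deficient, so the parameters are not uniquely determined.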