Presentation

Transcription

Presentation
Total Cluster:
A person agnostic clustering
method for broadcast videos
Makarand Tapaswi, Omkar M. Parkhi, Esa Rahtu,
Eric Sommerlade, Rainer Stiefelhagen, Andrew Zisserman
Face track clustering
Input
Output
Video with face tracks
Face track clusters
2
Why cluster face tracks?
• Large number of videos
• Good basis for manual or automatic labelling
• Facilitates video retrieval or summarization
3
Overview
• Editing structure in videos
• Shots, threads and scenes
• Using editing structure for clustering
• Negative pairs
• Scene level clustering
• Episode level clustering
• Dataset and evaluation
• Scrubs, Buffy
• Weighted clustering purity
• Clustering results
T1
T2
T3
T4
Related work
Person Identification in TV series
(Everingham et al., 2006)
(Cour et al., 2009)
(Bojanowski et al., 2013)
(Cinbis et al., 2011)
(Khoury et al., 2013)
(Ramanathan et al., 2014)
Face Clustering
(Ramanan et al., 2007)
(Wu et al., 2013, 2013b)
Novelty of the current work
• Error free clustering
• Utilize video-editing structure
• Use state-of-the-art face track descriptor
5
Overview
• Editing structure in videos
• Shots, threads and scenes
• Using editing structure for clustering
• Negative pairs
• Scene level clustering
• Episode level clustering
• Dataset and evaluation
• Scrubs, Buffy
• Weighted clustering purity
• Clustering results
T1
T2
T3
T4
Video editing overview
Episode
Scenes
Threads
…
Shots
7
Shots and Threads
shot
thread
(n.)(n.)
A sequence
set of shots
oftaken
uninterrupted
from the frames
same camera
for a
short
with the
period
same
ofviewing
time angle and content
A
B
A
C
D
C
Scrubs S01 E01
D
…
8
Scenes
scene (n.)
A sequence of shots in a video with the same
characters, at the same location and time
Scene
4
Scene
5
Scene
6
Scene boundary detection (Tapaswi et al. 2014)
 Maximize within-scene visual similarity
 Don’t break shot threads
9
Overview
• Editing structure in videos
• Shots, threads and scenes
• Using editing structure for clustering
•
•
•
•
Face tracking
Negative pairs
Scene level clustering
Episode level clustering
• Dataset and evaluation
• Scrubs, Buffy
• Weighted clustering purity
• Clustering results
T1
T2
T3
T4
Face tracks
Face tracking
Face track descriptor
(Parkhi et al. 2014)
•
•
•
•
frontal+profile VJ detector
multi-pose head detector
KLT tracking
false positive removal
• dense SIFT
• Fisher encoding
• discriminative
dimensionality reduction
11
Negative constraints
Tracks in the same frame
(Cinbis et al. 2011, Wu et al. 2013)
do NOT merge
these pairs!
Tracks in a thread
A
B
A
C
B
12
Overview
• Editing structure in videos
• Shots, threads and scenes
• Using editing structure for clustering
• Negative pairs
• Scene level clustering
• Episode level clustering
• Dataset and evaluation
• Scrubs, Buffy
• Weighted clustering purity
• Clustering results
T1
T2
T3
T4
for tracks not in
negative pairs
Within scene clustering
Strong face similarity
In a thread,
• area overlap
• relaxed face similarity
T1
T2
T2
T1
128D
T4
Dr. Kelso
T3
Turk
14
Overview
• Editing structure in videos
• Shots, threads and scenes
• Using editing structure for clustering
• Negative pairs
• Scene level clustering
• Episode level clustering
• Dataset and evaluation
• Scrubs, Buffy
• Weighted clustering purity
• Clustering results
T1
T2
T3
T4
Cluster descriptor
compare frontal vs. frontal, profile vs. profile
Frontal
Frontal descriptor
Mixed
Both descriptors
Profile
Profile descriptor
16
Full episode clustering
Exemplar SVMs
Cluster across scenes
• one positive cluster
• against negative clusters
+ YouTube Faces
• score clusters with e-SVM
• merge high-scoring
17
Recap
A
A
B
B
• Video editing structure
• Do not merge track pairs
• Scene level clustering
T2
T1
Turk
• Episode level clustering
18
Overview
• Editing structure in videos
• Shots, threads and scenes
• Using editing structure for clustering
• Negative pairs
• Scene level clustering
• Episode level clustering
• Dataset and evaluation
• Scrubs, Buffy
• Weighted clustering purity
• Clustering results
T1
T2
T3
T4
Data set
• episodes 1..5, 23
• ~20 minutes each
• sitcom, interns at a hospital
average episode
# named characters
# tracks
# tracks in thread
in shots
# do not
merge pairs in threads
• episodes 1..6
• ~40 minutes each
• supernatural drama, action
SCRUBS
BUFFY
16
412
280
214
851
15
756
515
440
1891
20
Evaluation metrics
Goal: minimize #clusters, don’t make errors!
WCP: weighted clustering purity
• Cluster 1: Purity p1 = 3/3
• Cluster 2: Purity p2 = 4/5
1
• WCP:
𝑐 𝑛𝑐 𝑝𝑐 = 0.875
Cluster 1
Purity: 1
𝑁
Trade-off WCP vs. #clusters
Cluster 2
Purity: 0.8
Purity
1
# clusters
21
Within scene clustering
SCRUBS
Tracks
Best HAC
Ours
#Clusters
#Clusters
450
350
250
BUFFY
950
850
750
650
550
450
150
SCR-1
SCR-2
SCR-3
SCR-4
SCR-5
Lower number of clusters
@ purity = 1
BF-1
BF-2
BF-3
BF-5
BF-6
Prevent errors
22
Full episode clustering
SCRUBS
500
Best HAC
Ours
400
#Clusters
#Clusters
Tracks
1000
300
200
100
BUFFY
850
700
550
400
SCR-1 SCR-2 SCR-3 SCR-4 SCR-5
BF-1
BF-2
BF-3
BF-5
BF-6
Full episode clustering
• hard to choose threshold for HAC
• proposed method reduces #clusters @ purity → 1
23
Summary
Purity
1
• Minimize #clusters at very high purity
• Affords manual and automatic annotation
A
# clusters
B
A
• Leverage video editing structure
• Face tracks are not independent data points
• 4x number of do not merge track pairs
• Stage-wise clustering
• Small number of characters in one scene
• At purity 1, #clusters is about half #tracks
24
Thank you!
Questions?
Intro, Related work, Video-editing, Negative pairs
Scene clustering, Episode clustering, Dataset, Evaluation

Similar documents

Complexity-Adaptive Distance Metric for Object Proposals Generation

Complexity-Adaptive Distance Metric for Object Proposals Generation First we evaluate the performance on PASCAL VOC 2012 dataset. Figure 1 shows the recall versus IoU threshold curves with candidates number varying from 500-3000. The quantitative results of corresp...

More information