International Journal of Advanced Intelligence
Volume 2, Number 1, pp.37-55, July, 2010.
© AIA International Advanced Information Institute
A Method for Detecting Subtitle Regions in
Videos Using Video Text Candidate Images and
Color Segmentation Images
Yoshihide Matsumoto, Tadashi Uemiya, Masami Shishibori and Kenji Kita
Faculty of Engineering, The University of Tokushima
2-1 Minami-josanjima, Tokushima 770-8506, Japan
matsumoto@laboatec.com; uchikosi@helen.ocn.ne.jp;
{bori;kita}@is.tokushima-u.ac.jp
Received (January 2010)
Revised (May 2010)
In this paper, a method for detecting text regions in digital videos with telop, such as
drama, movie and news programming, is proposed. The typical characteristics of telop
are that it does not move, and that its edges are strong. This method takes advantage of
these characteristics to produce video text candidate images. Then, this method produces
the video text region images from both the video text candidate images and the color
segmentation images. The video text region images and the original image are used to
identify the color of the telop. Finally, text regions are detected by increasing neighboring
pixels of the identified color. The experiment results show that the precision of this method
was 80.36% and the recall was 77.55%, whereas the precision of the traditional method was
40.22% with the recall 75.48%. Higher accuracy was achieved by using this new method.
Keywords: Video text candidate image; Color segmentation image; Video text region
image; Multimedia information retrieval.
1. Introduction
In recent years, with the spread of the Internet, increased hardware specifications,
and the development of imaging devices such as digital cameras and digital video
cameras, there are more and more opportunities to accumulate large amounts of
video content on personal computers. It is difficult to search such content efficiently
for a required image or scene, so information that clearly describes the content is
needed. The required information usually includes cut points, camera
work, sound, and subtitles. Subtitles often describe the subject being photographed
or the topic. Subtitles also appear in sync with the video, making them noteworthy
as useful strings that reflect the semantic content.
One of the well-known image-handling technologies focusing on subtitles is the
Informedia project1,2, where large-size image data are processed using images from
cut scenes, subtitle-recognition characters, and speech-recognition data. A method
has been proposed for matching cooking instructions and cooking images using subtitles and closed captions.3 A method to index the semantic attributes corresponding
to scenes in news programs using the closed caption has been proposed.4 A method
has also been proposed for recognizing text residing within a subtitle region.5 To
implement applied methods like this, it is first necessary to detect temporal and
spatial ranges of the subtitles in the image. The establishment of a highly accurate
method for detecting subtitle regions is desired.
Sato et al.6 have proposed a traditional subtitle detection method, where macro
block coding information is used to detect subtitle regions in images compressed
as MPEGs. While this method allows for fast processing, the accuracy has not yet
reached a practical level. Arai et al.7 have focused on a feature of subtitles, called
edge pairs, to propose another method, where subtitle regions are detected from the
spatial distribution and temporal continuity of edge pairs. Although the detection
accuracy of this method has been developed to a practical level, the absence of a
learning function may decrease the accuracy as the text fonts change. Hori et al.8
have proposed yet another method, where text candidate images are obtained from
the logical products of low-dispersion images and immovable-edge images, followed
by learning-based detection of subtitle regions. While this method leads to high
recall, precision is low: it tends to detect excessive regions as subtitles, so that the strokes of the subtitle text merge and the characters are crushed. Additionally, there has been a proposal to
increase the detection accuracy of subtitle regions by first creating text candidate
images, and then using a learning-based classifier, the Support Vector Machine (SVM),9 and a feature point extraction operator, the Harris Interest
Operator (Harris operator).10,11 Although this method12 increases precision, it has
its own issues such as the fact that it needs data for learning, and that the recall
decreases.
This paper proposes a method for detecting subtitle regions with high accuracy
by first generating video text candidate images in the same way as in traditional
methods7,8, followed by checking color segmentation images against the original image. In this method, text candidate images are obtained first in the same way as in
the traditional method8, based on the regions where little brightness change occurs
between continuous frame images, and on the regions with no changes in edges. The
subtitles within the text candidate images obtained this way are detected almost
perfectly, but the background tends to be excessively detected at the same time.
In other words, the recall is high while the precision is low. As a workaround, the
text candidate images and the color segmentation images obtained this way are
combined, after which only the color segments that appear to be text are selected,
thereby generating text region images with low background noise. The text region
images thus obtained have few instances of the background falsely detected as text.
However, because subtitle regions are detected based on color segments, some characters in minute color segments of the subtitle text tend to escape detection. In
other words, the precision is high while the recall decreases. In an effort to improve
the recall, text color was used, assuming that the color information of the subtitles
does not change. Specifically, the color information of the subtitles is determined using
multiple text region images generated within continuous frames and the original
image. Recall is then improved by adding neighboring
pixels that have similar color information, thereby accurately detecting subtitle regions.
Chapter 2 introduces a traditional method for detecting subtitle regions using
video text candidate images. Chapter 3 proposes a method for generating text region images using video text candidate images and color segmentation images, as
well as a method for detecting subtitles by automatically setting the color of the
subtitle text using text region images and the original image. Chapter 4 provides experiments for assessing the validity of the proposed method, along with the results
and discussions. Finally, Chapter 5 presents the conclusion and describes future
issues.
2. Overview of Traditional Methods
This chapter introduces a traditional method for detecting subtitle regions using
text candidate images. Text candidate images are also used in our proposed method
as subtitle region images in their first phase.
2.1. A method for generating video text candidate images using
low distributed images and immovable edge images
Hori et al.8 have proposed a method for generating video text candidate images from
the logical products of low distributed images and immovable edge images. First, one
low distributed image is created from continuous frame images based on an arbitrary
number of brightness images. If the arbitrary number is N, brightness images
for N frames are used to obtain the distribution value (variance) of the brightness of
each pixel. In this method, we chose brightness images for 4 frames. Pixels whose
distribution values are lower than a specified threshold value are assigned a value
of 1, with other pixels assigned 0, in order to obtain low distributed images. The
threshold value is set using discriminant analysis. Static regions such as subtitles
have little change in brightness, thus their distribution values are low. More dynamic
regions have higher distribution values. Therefore, the resultant low distributed
images tend to have most of the subtitles intact.
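As an illustration, the construction of a low distributed image can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' implementation: the "distribution value" is read as the per-pixel variance, frames are plain nested lists of brightness values, the function name `low_variance_image` is hypothetical, and the threshold is passed in as a fixed number rather than chosen by discriminant analysis.

```python
def low_variance_image(frames, threshold):
    """Return a binary image: 1 where the per-pixel brightness variance over
    the given frames is below `threshold`, otherwise 0."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            mean = sum(values) / n
            variance = sum((v - mean) ** 2 for v in values) / n
            result[y][x] = 1 if variance < threshold else 0
    return result


# Four 2x3 brightness frames (N = 4): the left column is static, the rest flickers.
frames = [
    [[200, 10, 90], [200, 50, 20]],
    [[200, 80, 10], [201, 10, 90]],
    [[200, 20, 70], [200, 90, 30]],
    [[200, 60, 40], [199, 30, 80]],
]
mask = low_variance_image(frames, threshold=25.0)
print(mask)  # only the static left column is marked 1
```

In this toy input only the static column survives, which mirrors the observation that static regions such as subtitles have low distribution values.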
Similarly, one immovable edge image is created from continuous frame images based on
an arbitrary number of brightness images. First, binary edge images are
obtained from brightness images. A wavelet transform is used to detect edges. Then
the logical product of the edge images for N frames is obtained. In this method, we
chose brightness images for 4 frames. The images obtained by the logical product
are called immovable edge images, which have sharp edges on the boundaries with
the background. Static pixels are prone to remain here, and so the subtitles tend
to remain in a similar way to low distributed images. Low distributed images and
immovable edge images are obtained in the flow shown in Fig. 1. The logical product
of the low distributed images and the immovable edge images obtained in the above
manner generates the video text candidate images.
Fig. 1. An illustration of making a video text candidate image from each video frame.
2.2. A method for detecting subtitles using SVM and the Harris
operator
Hiramatsu et al.12 have proposed a method which suppresses erroneous detection
with the use of SVM and the Harris operator. In this method, video text candidate
images are first generated excluding as much as possible the background parts
except for the subtitles. Then the video text candidate images are divided into
blocks similar in size to the pre-determined text, as shown in Fig. 2. Brightness
histogram for each block is created from the white pixels remaining in that block.
Each block is assessed using SVM, labeling subtitle-bearing blocks as positive, and
those without subtitles as negative.
The Harris operator, which is high in recall for image enlargement, is then applied
to images determined to be subtitle regions by SVM, to increase the precision. The
interest points detected by the Harris operator are abundant in parts with
large color variation as well as on edges. Since in many cases subtitles are
rendered in colors complementary to those of the surrounding image, it is expected that
many interest points will be detected in the vicinity of subtitle regions. Therefore,
blocks with positive identification by SVM are detected as subtitle regions if they
have many interest points. Subtitle regions may not be recognized if edges of the
text reside within blocks. Subtitles are long text strings aligned horizontally.
Therefore, to avoid this non-recognition issue, the number of interest
points on the right and left side of the region in question is used to determine if
that region is a subtitle region.
Fig. 2. An example of histogram data generation.
Fig. 3. An example of detecting the interest points by the Harris operator.
2.3. Issues on traditional methods
Traditional methods eliminate the background using the characteristics of subtitles found in images, and then recognize subtitle regions using SVM with
manually prepared positive and negative data, and the interest points. However,
such traditional methods have the following issues:
(i) Positive and negative data must be prepared manually so that SVM can learn
them.
(ii) Subtitle texts not residing within divided blocks may make subsequent text recognition difficult.
To solve these issues, we focused on techniques for dividing image regions. In the
subsequent chapters, we will discuss a subtitle-region detection method based on a
technique for dividing image regions.
3. The Proposed Method
This paper proposes a method for detecting subtitle regions based on images that
have been processed with color segmentation and video text candidate images. We
will call subtitle images generated using video text candidate images and color
segmentation images “text region images.” We will first discuss a method for generating text region images using video text candidate images and color segmentation
images. After that, we will use the text region images and the original image to
automatically set the text color, and discuss the process flow for detecting final
subtitle regions.
3.1. Generating text region images using color segmentation
images
3.1.1. Introduction to the method
The process flow for detecting subtitles based on a color segmentation image is
shown in Fig. 4. First, a video text candidate image is obtained in the same way
as in the traditional method8. At the same time, an image processed with color
segmentation (“a color segmentation image”) and color segmentation image data
are obtained (Step 1 of Fig. 4). The color segmentation image data include the region
numbers, the size of each region (the total number of pixels), the central coordinate
(x, y), the color information of the regions (luv), and the coordinates that belong to
the regions. A video text candidate image is created from four continuous frames,
while a color segmentation image is created from the first frame that was used when
creating the video text candidate image.
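For reference, the way a video text candidate image is formed from continuous frames (Section 2.1) can be sketched as follows. This is a minimal sketch under assumptions: edges are detected with a simple neighbor-difference test as a stand-in for the wavelet-based edge detection, the low distributed image is supplied as an already computed binary image, and all function names and the `edge_threshold` parameter are hypothetical.

```python
def edge_image(frame, edge_threshold):
    """Binary edge map from horizontal/vertical brightness differences.
    (A simple stand-in; the source detects edges with a wavelet transform.)"""
    height, width = len(frame), len(frame[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx = abs(frame[y][x] - frame[y][x - 1]) if x > 0 else 0
            dy = abs(frame[y][x] - frame[y - 1][x]) if y > 0 else 0
            edges[y][x] = 1 if max(dx, dy) >= edge_threshold else 0
    return edges


def logical_product(images):
    """Pixel-wise AND of equally sized binary images."""
    height, width = len(images[0]), len(images[0][0])
    return [[int(all(img[y][x] for img in images)) for x in range(width)]
            for y in range(height)]


def candidate_image(frames, low_variance, edge_threshold):
    """AND of the immovable edge image and the low distributed image."""
    immovable_edges = logical_product(
        [edge_image(frame, edge_threshold) for frame in frames])
    return logical_product([low_variance, immovable_edges])


# Four 1x4 frames: a static bright stroke at x = 1..2 and a flickering pixel at x = 3.
frames = [
    [[0, 255, 255, 0]],
    [[0, 255, 255, 250]],
    [[0, 255, 255, 0]],
    [[0, 255, 255, 250]],
]
low_variance = [[1, 1, 1, 0]]  # assumed to be computed beforehand
candidate = candidate_image(frames, low_variance, edge_threshold=100)
print(candidate)
```

Only the pixel that is an edge in every frame and also lies in a static region survives, so the interior of the stroke is lost in this toy example, which is the kind of gap the later supplementation steps are meant to fill.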
Then, these two images are used to eliminate noise (Step 2 of Fig. 4). We will process
the elimination in two ways: (1) by horizontally scanning the video text candidate
image so that only the pixels within the subtitles remain, and (2) by checking the
video text candidate image against the color segmentation image data in order to
select only the color segments that appear to be subtitles. After this elimination
process, we will eliminate the edges of subtitles, because it is common for subtitles
to have edges added on (Step 3 of Fig. 4). Specifically, we take advantage of the
fact that the bodies and edges of subtitle characters use different colors. We will
use the k-means method to classify the colors of the regions that contain the white
pixels that are left after the noise elimination process. Finally, we will supplement
the text characters (Step 4 of Fig. 4) to improve recall. Specifically, we will search
each segmentation region around pixels that remain as part of the subtitles at the
end of Step 3, and add the regions that resemble the subtitle region in size and
typical color. Below are detailed discussions of each module.
Fig. 4. Outline of detecting text regions by using color segmentation images.
3.1.2. Generating color segmentation images
In this process, the region integration method is used to generate color segmentation
images. Region integration is a method for dividing an image into multiple sets
(regions) of pixels that have similar feature values and are spatially close,
based on such features as the pixel values and the texture. The reason we
chose this method is the characteristic of the subtitles. As discussed in the section
on low distributed images, the brightness of subtitles varies little, and their color
does not change much. In other words, all subtitles have more or less the same
characteristics, which led us to expect that the color segmentation process might
successfully extract subtitle regions. Below are the steps for integrating regions.
Figs. 5 and 6 show examples of color segmentation images generated using the region
integration method.
Step 1 Search for each pixel by raster scanning, flagging any unlabeled and unclassified pixels and labeling them.
Step 2 Check eight (8) pixels neighboring the flagged pixels, and assign them the
same label as that of the flagged pixels if the pixel value is the same.
Step 3 Repeat Step 2 with the newly labeled pixels as the flagged pixels.
Step 4 If no pixels are labeled in Step 2, repeat Step 1.
Step 5 The process is complete when all pixels have been labeled. Sets (regions) of
neighboring pixels with the same pixel value are obtained at this point. Proceed
to the next step using the labeled pixels.
Step 6 Obtain the average pixel values among the pixels bearing the same label.
Step 7 Of the neighboring sets of pixels, integrate the two that have the smallest
difference in the average pixel values obtained in Step 6.
Step 8 Repeat Steps 6 and 7. To avoid the eventuality of only one existing set
of pixels, the maximum average difference should be established for allowing
integration. [End of the steps of the region integration method]
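The steps above can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' implementation: pixel values are single numbers rather than colors, the image is a nested list, ties in Step 7 are broken by scan order, and the function names and the `max_diff` parameter (the maximum average difference that still allows integration) are hypothetical.

```python
from collections import deque


def label_equal_regions(image):
    """Steps 1-5: give one label to each 8-connected set of equal-valued pixels."""
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for sy in range(height):
        for sx in range(width):
            if labels[sy][sx] != -1:
                continue
            labels[sy][sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        if labels[ny][nx] == -1 and image[ny][nx] == image[sy][sx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            next_label += 1
    return labels


def integrate_regions(image, max_diff):
    """Steps 6-8: repeatedly merge the adjacent pair of regions with the
    closest mean value until no adjacent pair is within max_diff."""
    labels = label_equal_regions(image)
    height, width = len(image), len(image[0])
    while True:
        # Step 6: mean pixel value per label.
        sums, counts = {}, {}
        for y in range(height):
            for x in range(width):
                lab = labels[y][x]
                sums[lab] = sums.get(lab, 0) + image[y][x]
                counts[lab] = counts.get(lab, 0) + 1
        means = {lab: sums[lab] / counts[lab] for lab in sums}
        # Step 7: adjacent pair (8-neighborhood) with the smallest mean difference.
        best = None
        for y in range(height):
            for x in range(width):
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        a, b = labels[y][x], labels[ny][nx]
                        if a != b:
                            diff = abs(means[a] - means[b])
                            if best is None or diff < best[0]:
                                best = (diff, min(a, b), max(a, b))
        # Step 8: stop once the closest pair is too different to integrate.
        if best is None or best[0] > max_diff:
            return labels
        _, keep, drop = best
        labels = [[keep if lab == drop else lab for lab in row] for row in labels]


image = [
    [10, 10, 12, 200],
    [10, 11, 12, 200],
    [10, 10, 12, 201],
]
segmented = integrate_regions(image, max_diff=5)
print(segmented)  # the dark columns form one region, the bright column another
```

Recomputing the means after every merge keeps the sketch close to Steps 6 to 8 as written; a practical implementation would more likely maintain a region adjacency structure instead of rescanning the image.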
Fig. 5. Original image.
Fig. 6. Color segmentation image.
Fig. 7. Noise elimination by scanning of white pixels.
3.1.3. Noise elimination
The process of noise elimination is two-fold. The first phase starts with horizontal
scanning of a video text candidate image as shown in Fig. 7, creating a histogram
with a tally of white pixels. The scanning direction depends on the direction of
the subtitles. Because white pixels are packed into subtitles, the histogram shows
locally high numbers where subtitles are found. Based on this observation, locations
where the histogram numbers show sharp climbs and falls are identified, and only
these locations are kept, thus narrowing down subtitle-containing regions.
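The first phase can be illustrated with the following Python sketch. It is a simplification under stated assumptions: subtitles are horizontal, the candidate image is a nested list of 0/1 values, and the "sharp climbs and falls" of the histogram are approximated by keeping rows whose white-pixel count reaches a fraction of the peak count. The function name `keep_subtitle_rows` and the `min_fraction` parameter are hypothetical.

```python
def keep_subtitle_rows(candidate, min_fraction):
    """Keep only rows whose white-pixel count is at least min_fraction of the
    peak row count; all other rows are cleared to 0."""
    histogram = [sum(row) for row in candidate]  # white pixels per row
    peak = max(histogram)
    if peak == 0:
        return [row[:] for row in candidate]
    cutoff = peak * min_fraction
    return [row[:] if count >= cutoff else [0] * len(row)
            for row, count in zip(candidate, histogram)]


candidate = [
    [0, 1, 0, 0, 0, 0],  # isolated noise
    [1, 1, 1, 1, 1, 0],  # subtitle row
    [1, 1, 0, 1, 1, 1],  # subtitle row
    [0, 0, 0, 0, 1, 0],  # isolated noise
]
narrowed = keep_subtitle_rows(candidate, min_fraction=0.5)
print(narrowed)
```

For vertically oriented subtitles the same idea would be applied to columns instead of rows.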
In the second phase of noise elimination, we take advantage of the characteristics
of subtitles by using images processed with color segmentation based on color information. Because each character of the subtitle has the same color information,
we can predict that the background and the subtitles reside in different regions of
a color segmentation image. We can also predict that the subtitle regions are smaller than the background regions. The noise elimination process takes advantage of these
characteristics. First, the video text candidate image remaining after the noise elimination process in
Phase 1 is checked against the color segmentation image data, and the ratio of white pixels in each region is measured (Fig. 8). Next, regions whose
ratio of white pixels is higher than the threshold value are made all white,
and all other pixels are made black. Since subtitle regions are much smaller than background regions, the same number of remaining white pixels gives a high ratio in a subtitle region and a low ratio in a background region; whether even a single white pixel survives
this process therefore depends largely on whether its region is a subtitle region or
background. Fig. 9 shows an image after noise elimination.
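The second phase can be sketched in Python as follows. This is a minimal sketch under assumptions: the color segmentation result is given as a label image of the same size as the candidate image, the ratio is expressed in the range 0 to 1, and the function name and the `ratio_threshold` value used in the example are hypothetical.

```python
def filter_regions_by_white_ratio(candidate, labels, ratio_threshold):
    """Make a region all white if its ratio of white candidate pixels exceeds
    ratio_threshold; make every other pixel black."""
    white, total = {}, {}
    for cand_row, label_row in zip(candidate, labels):
        for value, lab in zip(cand_row, label_row):
            total[lab] = total.get(lab, 0) + 1
            white[lab] = white.get(lab, 0) + value
    keep = {lab for lab in total if white[lab] / total[lab] > ratio_threshold}
    return [[1 if lab in keep else 0 for lab in row] for row in labels]


# Region 0 is a large background region, region 1 a small character-like region.
labels = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
candidate = [
    [0, 0, 1, 0],  # one stray white pixel in the background (ratio 1/10)
    [0, 1, 0, 0],  # one white pixel in the small region (ratio 1/2)
    [0, 0, 0, 0],
]
cleaned = filter_regions_by_white_ratio(candidate, labels, ratio_threshold=0.4)
print(cleaned)
```

Each region contains exactly one white pixel here, yet only the small region is kept, because the ratio rather than the count decides the outcome.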
Fig. 8. Noise elimination by ratio of white pixels.
Fig. 9. Image with noise eliminated.
3.1.4. Classification by k-means
Generally, each character of subtitles consists of the edge part and the character
itself, each in its own color. After noise elimination, an image may still have both of
these parts left. If the edge part is still left, the entire character is crushed, making
it difficult to identify the character, especially if it is a complicated character such
as kanji. The k-means method enables the classification of each pixel in the subtitle
characters based on the color information, and it detects only the pixels that belong
to the characters. In a video text candidate image with noise eliminated, the colors
of the regions to which the remaining white pixels belong are classified using k-means, as shown in Fig. 10. After the classification, only the regions with colors
that belong to the class with the most elements are kept. Fig. 11 shows an example
of an image after classification by k-means.
3.1.5. Character complementation
As can be seen in Fig. 11, images that have been classified by k-means tend to be
high in precision and low in recall, leading to frequent non-detection. We will now
focus on the characteristics of each region, and supplement the subtitle region. The
parts of regions that fall within the 16 x 16 pixel square around the remaining white
pixels are searched as shown in Fig. 12. Then the Euclidean distance is calculated
using the size of the region, the central coordinate of the region, and the color of
the region as features. If the resultant Euclidean distance is
less than the threshold value, that region is added as a subtitle. Fig. 13 shows the
image after character complementation, i.e., the video text region image after the
application of the method based on color segmentation images.
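The complementation rule can be illustrated with the following Python sketch. It rests on several assumptions that the text does not fix: the features are used unscaled, the 16 x 16 search window is centered on each remaining (seed) pixel, a single pass over the seed regions is made, and the region dictionaries and all function names are hypothetical.

```python
import math


def make_region(pixels, color):
    """Build a region record (size, center, color) from its (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return {"pixels": pixels, "size": len(pixels),
            "center": (sum(xs) / len(xs), sum(ys) / len(ys)), "color": color}


def feature_distance(a, b):
    """Euclidean distance over (size, center x, center y, color channels)."""
    va = [a["size"], *a["center"], *a["color"]]
    vb = [b["size"], *b["center"], *b["color"]]
    return math.dist(va, vb)


def complement_regions(regions, seed_ids, distance_threshold, window=16):
    """Add regions that lie within the window around a seed region's pixels
    and whose features are closer to the seed than distance_threshold."""
    half = window // 2
    accepted = set(seed_ids)
    for seed_id in seed_ids:
        seed = regions[seed_id]
        for region_id, region in regions.items():
            if region_id in accepted:
                continue
            near = any(abs(px - sx) <= half and abs(py - sy) <= half
                       for (sx, sy) in seed["pixels"]
                       for (px, py) in region["pixels"])
            if near and feature_distance(seed, region) < distance_threshold:
                accepted.add(region_id)
    return accepted


regions = {
    0: make_region([(10, 5), (11, 5), (10, 6), (11, 6)], (90.0, 0.0, 0.0)),   # seed
    1: make_region([(14, 6), (15, 6), (14, 7)], (88.0, 1.0, 0.0)),            # similar, nearby
    2: make_region([(x, y) for x in range(12, 32) for y in range(8, 23)],
                   (20.0, 5.0, 5.0)),                                          # large background
    3: make_region([(80, 5), (81, 5), (80, 6), (81, 6)], (90.0, 0.0, 0.0)),   # similar, far away
}
print(complement_regions(regions, seed_ids=[0], distance_threshold=10))
```

Because size, position, and color enter the distance with equal weight in this sketch, the choice of threshold depends on the units of each feature; the experiments in Section 4 vary the threshold over 10, 20, and 30.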
Fig. 10. Classification of each pixel by k-means.
Fig. 11. Image after classification by k-means.
Fig. 12. An illustration of complementation of the text characters.
Fig. 13. An example of the video text region image after application of the proposed method.
3.2. Method for detecting subtitles by automatically setting the
text color
3.2.1. Overview of the method
Text region images generated following the method described in the preceding section tend to miss the minute segmentation regions residing within
the subtitle section, lowering the recall. We will now apply a recall-improving technique based on text color (Fig. 14). First, the color information of the subtitles is
specified using the multiple text region images generated among continuous frames,
and the original picture image (Step 1 of Fig. 14). Then the text characters are
supplemented (Step 2 of Fig. 14). The pixels remaining after supplementation are
labeled, and regions that are too large are removed (Step 3 of Fig. 14). We will
discuss the details of each module.
Fig. 14. Outline of detecting text regions by specifying the text color.
Fig. 15. An illustration of color histogram generation.
3.2.2. Automatically setting the text color
Multiple text region images generated from continuous images and the original
picture image of the top frame that was used to generate each text region image are
used to specify the range of the subtitle text color. In this experiment, we focused
on the pixels remaining in thirty (30) text region images. These pixels are checked
against the original picture image to extract the RGB value. Then the 256
gradation levels of RGB are compressed into 16 levels to generate a histogram (Fig. 15). The
gradation level with the most pixels is determined to be the range of the color of
the text.
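The color-setting step can be sketched in Python as follows. This is a minimal sketch under assumptions: each channel is quantized to 16 levels by integer division, the histogram is taken over the joint (R, G, B) bin (the text does not say whether channels are treated jointly or separately), and the function names are hypothetical.

```python
from collections import Counter


def dominant_color_bin(text_masks, originals, levels=16):
    """Quantize the RGB values under white mask pixels into `levels` levels per
    channel and return the most frequent (r, g, b) bin."""
    step = 256 // levels
    histogram = Counter()
    for mask, image in zip(text_masks, originals):
        for mask_row, image_row in zip(mask, image):
            for is_text, (r, g, b) in zip(mask_row, image_row):
                if is_text:
                    histogram[(r // step, g // step, b // step)] += 1
    if not histogram:
        raise ValueError("no text pixels found in the text region images")
    return histogram.most_common(1)[0][0]


def color_range(bin_index, levels=16):
    """Inclusive RGB bounds covered by one quantized bin."""
    step = 256 // levels
    return [(c * step, c * step + step - 1) for c in bin_index]


# One text region image and its original frame (the text uses thirty of them).
masks = [[[1, 1, 1, 0]]]
originals = [[[(250, 250, 250), (245, 248, 255), (120, 120, 120), (0, 0, 0)]]]
best_bin = dominant_color_bin(masks, originals)
print(best_bin, color_range(best_bin))
```

The returned bounds can then serve as the text color range used when supplementing characters.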
3.2.3. Character supplementation and labeling
Eight (8) square pixels around the white pixels remaining after Step 1 of Fig. 14
(pixels that have been detected as being within the subtitle region) are searched.
Pixels whose color resides within the range determined in Step 1 are made white. Eight square
pixels around these new white pixels are similarly searched until no more pixels
are added. After characters are supplemented, the white pixels are labeled. Labeled regions whose number of
connected pixels is at or above the threshold value are removed. Fig. 16 shows an example of the
final result after automatically setting the text color.
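Supplementation and labeling can be sketched in Python as follows. This is a minimal sketch under assumptions: the searched neighborhood is read as the eight pixels adjacent to each white pixel, the text color range is given as inclusive per-channel bounds, labeling uses 8-connectivity, and the function names are hypothetical.

```python
from collections import deque


def in_range(pixel, bounds):
    """True if every channel of the pixel lies within its inclusive bounds."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(pixel, bounds))


def grow_text_pixels(mask, image, bounds):
    """Whiten neighbors of white pixels whose color lies within bounds,
    repeating until no more pixels are added."""
    height, width = len(mask), len(mask[0])
    grown = [row[:] for row in mask]
    queue = deque((y, x) for y in range(height) for x in range(width) if grown[y][x])
    while queue:
        y, x = queue.popleft()
        for ny in range(max(0, y - 1), min(height, y + 2)):
            for nx in range(max(0, x - 1), min(width, x + 2)):
                if not grown[ny][nx] and in_range(image[ny][nx], bounds):
                    grown[ny][nx] = 1
                    queue.append((ny, nx))
    return grown


def remove_large_components(mask, max_pixels):
    """Label connected white components and erase those that contain
    max_pixels or more pixels."""
    height, width = len(mask), len(mask[0])
    result = [row[:] for row in mask]
    seen = [[False] * width for _ in range(height)]
    for sy in range(height):
        for sx in range(width):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            component, queue = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) >= max_pixels:
                for y, x in component:
                    result[y][x] = 0
    return result


W, B = (250, 250, 250), (0, 0, 0)
image = [[W, W, B, W, B],
         [B, W, B, B, B]]
mask = [[1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
grown = grow_text_pixels(mask, image, bounds=[(240, 255)] * 3)
final = remove_large_components(grown, max_pixels=128)
print(final)
```

The isolated white pixel at the top right is not added, since growth only spreads from pixels already detected as text; the size limit of 128 matches the experimental setting given in Section 4.1.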
Fig. 16. An example of a video text region image generated by the proposed method.
4. Assessment
4.1. Method of experiment
We conducted an experiment to confirm the validity of our proposed method. As
the experiment data, we used drama image data that includes subtitles with full
RGB colors, a resolution of 352 x 240, and a frame rate of 29.97 fps. As the correct
data, we only used the images of the subtitles in the overlay region of this drama.
The correct data, text candidate images, and the text region images were checked
against one another for each pixel to calculate the precision and recall. 30 images
were selected at random from the drama for assessment. The criteria for assessment
recall (r) and precision (p) can be represented in the following formulae (1), (2):
\[ \text{precision:}\quad p = \frac{N_d}{N_d + N_f} \tag{1} \]
\[ \text{recall:}\quad r = \frac{N_d}{N_d + N_m} \tag{2} \]
$N_d$: Number of correctly detected pixels
$N_m$: Number of pixels that escaped detection
$N_f$: Number of falsely detected pixels
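As an illustration of formulae (1) and (2), the pixel-wise evaluation can be sketched in Python as follows. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this sketch.

```python
def pixel_precision_recall(detected, truth):
    """Compare a detected binary image with the correct data pixel by pixel
    and return (precision, recall)."""
    n_d = n_f = n_m = 0
    for det_row, true_row in zip(detected, truth):
        for det, true in zip(det_row, true_row):
            if det and true:
                n_d += 1      # correctly detected
            elif det:
                n_f += 1      # falsely detected
            elif true:
                n_m += 1      # escaped detection
    precision = n_d / (n_d + n_f) if n_d + n_f else 0.0
    recall = n_d / (n_d + n_m) if n_d + n_m else 0.0
    return precision, recall


detected = [[1, 1, 0, 0],
            [1, 0, 0, 0]]
truth = [[1, 1, 1, 0],
         [0, 0, 1, 0]]
p, r = pixel_precision_recall(detected, truth)
print(p, r)  # N_d = 2, N_f = 1, N_m = 2
```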
The precision and recall of the traditional method are shown in Table 1.
The unit used in detection is the number of pixels. Detection is deemed correct
if white pixels exist where the subtitles within the frame are displayed. It is deemed
"escaped detection" if white pixels do not exist there. Detection is deemed false if white
pixels exist in frames or locations where there are no subtitles.
Table 1. Evaluation of the traditional method.
Precision    Recall
40.22%       75.48%
The experiment
criteria for each method are listed below:
• Method for detecting subtitles using segmentation region images
– The threshold value used for noise elimination in the second phase is 50.
– The number of classes for k-means is varied over 2, 3, 4, and 5.
– The threshold value of the Euclidean distance when characters are supplemented is varied over 10, 20, and 30.
• Method for detecting subtitles by automatically setting the text color
– Regions are removed by labeling if they include 128 or more connected pixels.
4.2. Experiment results
Fig. 17 shows the shift in the detection accuracy when the number of classes used by
k-means in the text detection method with region segmentation images changes, and
when the threshold value of the Euclidean distance changes in character supplementation. In Fig. 17, K=N Precision represents the precision value when the number
of classes under the k-means method is set to N (2, 3, 4, or 5), and K=N Recall
represents the recall value when the number of classes is set to N. In addition, eucM
represents the detection accuracy when the threshold for the Euclidean distance is
set to M (10, 20, or 30) in character supplementation.
The results shown in Fig. 17 indicate that precision is higher than recall in each
case and that the recall remains steady. Accuracy was not significantly
affected when the threshold for the Euclidean distance changed. On
the other hand, when the number of classes under k-means changed, both precision
and recall were affected. When the number of classes was 3, the precision was
highest, decrease in recall was at a minimum, and the balance between these two
elements was optimal, producing the best accuracy.
Each character in subtitles generally consists of three parts: the background, the
edges, and the character body. It is expected that setting the number of classes to
three (3) enabled appropriate classification and detection of the character bodies.
When the number of classes was set to two (2), the background mixed into the
selected class, resulting in lower precision. Higher numbers of classes such as 4 and
5 resulted in big drops in recall, with increased numbers of pixels that escaped
detection. This is because subtitle characters do not consist of exactly the same
color, but rather the color varies slightly from character to character. For example,
subtitle characters that are seemingly white were found to consist of four (4) smaller
parts: mostly white, light gray, gray, and dark gray. Although the mostly white
part has more pixels than the dark gray, larger numbers of classes ultimately lower
the probability of the mostly white part being detected. We can reason that, as a
result, there were more pixels escaping detection and edges were falsely detected,
significantly lowering recall.
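The class-selection rule discussed above (classify region colors by k-means, then keep the class with the most elements) can be sketched in Python as follows. This is a minimal sketch under assumptions: colors are plain tuples, the initial centers are chosen by a deterministic farthest-point rule (the initialization used in the experiments is not stated), and the function names are hypothetical.

```python
import math


def farthest_point_seeds(points, k):
    """Deterministic seeding: start from the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=20):
    """Plain k-means; returns the class index assigned to each point."""
    centers = farthest_point_seeds(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return assignment


def largest_class_members(points, k):
    """Indices of the points that fall in the class with the most elements."""
    assignment = kmeans(points, k)
    counts = [assignment.count(c) for c in range(k)]
    biggest = counts.index(max(counts))
    return [i for i, a in enumerate(assignment) if a == biggest]


colors = [
    (250, 250, 250), (245, 245, 245), (252, 250, 248),
    (240, 242, 241), (248, 249, 250),                    # character body
    (10, 10, 10), (20, 15, 12),                          # character edge
    (60, 90, 160), (70, 100, 170), (65, 95, 150),        # background
]
print(largest_class_members(colors, k=3))
```

With three classes the body, edge, and background colors in this toy data separate cleanly and the body class is the largest. As the discussion above notes, selecting by element count alone can pick the wrong class when many edge or background pixels remain.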
Fig. 17. Experiment results.
Fig. 18 shows the results of an experiment comparing one of traditional methods
(video text candidate images) and our proposed method (text region images and
final result images), using the parameters with which the accuracy was best in the
experiment shown in Fig. 17 (the number of classes = 3, and the threshold for the
Euclidean distance = 30).
Fig. 18. Experiment results.
Fig. 19. An example of a video text candidate image.
Fig. 20. An example of a video text region image.
Fig. 21. An example of a video text region image that uses a color segmentation image.
Fig. 22. An example of a video text region image with automatic setting of the telop color.
Fig. 18 shows that our proposed method brings about better results in both precision and recall compared to traditional methods. Noise elimination was a factor
in the improvement of precision. In rather static video scenes with such objects
as a building, many non-subtitle pixels remained in the video text candidate image, lowering precision. It may be argued that our proposed method eliminated the
non-subtitle pixels, improving precision. Additionally, in video text candidate images the edges of subtitle characters had a strong tendency to remain, as Fig. 19 indicates, and
in many cases only the edge of a character remained. The fact that our proposed
method enabled the supplementation of missing parts of a character as in Fig. 20
may have contributed to the improved precision.
Recall of the method based on region segmentation images did not significantly
improve over traditional methods. A likely reason is that color classification under
k-means in our proposed method selects the class simply based on the number of elements in each
class. In cases where many edge pixels remained, this may have disabled supplementation of subtitles almost completely as Fig. 21 shows, resulting in lower recall.
In the future, the location of each element in a class should be taken into consideration to develop an algorithm that enables more accurate selection of the class for
the character body. Furthermore, we found that accuracy was higher in a version
of our method where region segmentation images are used and then the text color
is automatically set, compared to a version that uses region segmentation images
alone. Figs. 21 and 22 are of the same scene. We can see that Fig. 22 has fewer
missing character pixels and more correctly detected pixels compared
to Fig. 21. Subtitles that do not benefit from a method using region segmentation
images alone can be improved if text color is considered. Use of text color may be
the factor in the improvement in recall.
5. Conclusion
This paper has proposed a method for detecting subtitle regions using region segmentation images. Assessment experiments confirmed that our proposed method
has higher detection accuracy than a traditional method using video text candidate
images. We consider that our proposed method has reached a practical level
because it can clearly detect the characters from the text regions as shown in Figs.
20 and 22. However, a problem remains in that many parameters and thresholds
must be set to make video text candidate images. Future issues include a
review of the ways to automatically set parameters under our proposed method,
and improvement of accuracy by eliminating minute non-subtitle regions.
Acknowledgments
This study was financed partly by the Basic Scientific Research Grant (B)
(17300036) and the Basic Scientific Research Grant (C) (17500644).
References
1. H.D. Wactlar, A.G. Hauptmann, and M.J. Witbrock, Informedia News-on-Demand: Using
Speech Recognition to Create a Digital Video Library, CMU Tech. Rep. CMU-CS-98-109,
Carnegie Mellon University, 1998.
2. H.D. Wactlar, M.G. Christel, Y. Gong, and A.G. Hauptmann, Lessons Learned from Building
a Terabyte Digital Video Library, IEEE Comput., 32(2), pp. 66-73, 1999.
3. H. Miura, K. Takano, S. Hamada, I. Iide, O. Sakai and H. Tanaka, Video Analysis of the
Structure of Food and Cooking Steps with the Corresponding, IEICE Journal, J86-D-II(11),
pp. 1647-1656, 2003.
4. I. Iide, S. Hamada, S. Sakai and E. Tanaka, TV News Subtitles for the Analysis of the Semantic
Dictionary Attributes, IEICE Journal, J85-D-II(7), pp. 1201-1210, 2002.
5. S. Mori, M. Kurakake, T. Sugimura, T. Shio and A. Suzuki, The Shape of Characters and
the Background Characteristics Distinguish Correction Function by Using Dynamic Visual
Character Recognition in the Subtitles, IEICE Journal, J83-D-II(7), pp. 1658-1666, 2000.
6. S. Sato, Y. Shinkura, Y. Taniguchi, A. Akutsu, Y. Sotomura and H. Hamada, Subtitles from
the MPEG High-speed Video Coding Region of the Detection Method, IEICE Journal, J81-D-II(8), pp. 1847-1855, 1998.
7. K. Arai, H. Kuwano, M. Kurakage, T. Sugimura, The Video Frame Subtitle Display Detection
Method, IEICE Journal, D-2, J83-D-2(6), pp. 1477-1486, 2000.
8. O. Hori, U. Mita, Subtitles for Recognition from the Video Division Robust Character Extraction Method, IEICE Journal, D-2, J84-D-2(8), pp. 1800-1808, 2001.
9. http://svmlight.joachims.org: “SVM-Light Support Vector Machine”
10. C. Harris and M. Stephens, A Combined Corner and Edge Detector, Proceedings of the 4th
Alvey Vision Conference, pp. 147-151, 1988.
11. V. Gouet and N. Boujemaa, Object-based Queries Using Color Points of Interest, Proceedings
of IEEE workshop CBAIVLICBPR, pp. 30-38, 2001.
12. D. Hiramatsu, M. Shishibori and K. Kita, Subtitled Subtitles from the Area of Video Data
Detection Method, IEICE Journal Information Systems and Information Industry Association
and the Joint Research, IP-07-24 IIS-07-48, 2007.
Yoshihide Matsumoto
He graduated from the Information System Technology
& Engineering course at Kochi University of
Technology in Mar. 2002. He joined Laboatec in Japan
Co., Ltd. in the same year, and became CTO and Applied IT Lab. Master in 2008. He entered the doctoral
program of the Graduate School of Advanced Technology & Science at the University of Tokushima in Oct.
2006. His publication on a multimedia IT system
received a Quasi-Selected Award at the 2003 Japan IBM user symposium.
Tadashi Uemiya
He graduated from Waseda University in Mar. 1968 and
worked at Kawasaki Heavy Industry Co., Ltd. from 1968
through 2000. In the same year he transferred to the IT Dept.
of Benesse Co., Ltd., from which he retired in 2006. He entered the
doctoral program of the Graduate School of
Advanced Technology & Science at the University of
Tokushima in Oct. 2006. His research interests include IE
& IT, innovative IT solutions, and information technology. His experience includes an aero jet engine development project involving five countries, the development and implementation of CAD/CAM/CAE/CG systems, PICS, and Web information infrastructure. He was a Senior Member of IEEE. He also took part in the first
implementation of an IP public network with MPLS technology in Japan,
a joint project of NTT and Cisco Japan, and
has extensive experience with security systems for ISMS, SRMS, and
personal information protection. He is currently an executive IT
consultant.
Masami Shishibori
He graduated from the University of Tokushima in
1991, completed the doctoral program in 1995, and joined
the faculty as a research associate, becoming a lecturer in
1997 and an associate professor in 2001. His research interests are multimedia data search and natural language
processing. He is a coauthor of Information Retrieval Algorithms (Kyoritsu Shuppan). He received the ISP 45th
Natl. Con. Incentive Award. He holds a D.Eng. degree,
and is a member of ICIER and NLP.
Kenji Kita
He graduated from Waseda University in 1981, joined
Oki Electric Industry Co., Ltd. in 1983, and transferred
to ATR Interpreting Telephony Research Laboratories
in 1987. He became a lecturer at the University of
Tokushima in 1992, an associate professor in 1993, and
a professor in 2000. His research interests include natural language processing and information retrieval. He
received a 1994 ASJ Technology Award. His publications
include Probabilistic Language Models (Tokyo University
Press) and Information Retrieval Algorithms (Kyoritsu
Shuppan). He holds a D.Eng. degree.